Medical multimodal entity linking under modality missingness

¹ Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), School of Computer Science and Engineering, Hefei University of Technology, Hefei, Anhui, China

² Department of Data Science and Big Data Technology, Faculty of Medical Instrumentation, Shanghai University of Medicine & Health Sciences, Shanghai, China

³ Department of Computer Science, School of Artificial Intelligence and Computer Science, Nantong University, Nantong, Jiangsu, China

AIH, 026070015 https://doi.org/10.36922/AIH026070015

Received: 11 February 2026 | Revised: 14 March 2026 | Accepted: 30 March 2026 | Published online: 19 May 2026

(This article belongs to the Special Issue Uncertain knowledge in cardiopulmonary medicine: discovery, applications, and analytical frameworks)

© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Abstract

Multimodal entity linking aligns real-world multimodal mentions with their corresponding entities in a knowledge base and is increasingly important for applications such as medical knowledge grounding and retrieval-augmented generation. However, existing methods struggle in medical settings, where privacy restrictions, acquisition failures, and legacy data frequently leave mentions without visual information. Furthermore, medical knowledge bases typically suffer from sparse visual coverage and long-tail entities, a modality asymmetry that weakens multimodal matching and reduces entity disambiguation performance. To address these challenges, we introduce a robust framework for medical multimodal entity linking under incomplete modality conditions. The framework leverages conditional generative models to reconstruct missing visual information with high semantic consistency and incorporates a dynamic loss balancing strategy that adaptively coordinates modality-specific and cross-modal learning during training. Experiments on a medical multimodal entity linking benchmark covering 11 medical domains demonstrate that the proposed method achieves consistent performance gains and more stable training under modality-missing conditions.
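The abstract does not state which rule the dynamic loss balancing strategy uses. As a minimal sketch of the general idea only, the Python below applies a Dynamic Weight Average style rule, a generic multi-task technique swapped in here for illustration and not the authors' method. The function names, the temperature value, and the three loss terms (text-only, image-only, cross-modal) with their numeric values are all hypothetical.

```python
import math


def dynamic_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per loss term; the weights sum to the number of terms.

    A term whose loss is falling slowly (ratio close to or above 1) receives a
    larger weight than a term whose loss is falling quickly.
    Assumes every entry of prev_prev_losses is non-zero.
    """
    # Relative descent rate of each loss over the last two epochs.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    # Softmax over the ratios, softened by the temperature.
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    k = len(ratios)
    return [k * e / total for e in exps]


def combined_loss(losses, weights):
    """Weighted sum of the current loss terms."""
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical per-epoch losses: [text-only, image-only, cross-modal].
prev_prev = [1.00, 1.00, 1.00]
prev = [0.50, 0.90, 1.00]

weights = dynamic_weights(prev, prev_prev)
total_loss = combined_loss([0.45, 0.85, 0.95], weights)
```

In this toy run the cross-modal term, whose loss has stalled, receives the largest weight, while the fast-improving text-only term receives the smallest. Any real training setup would need to tune the temperature and decide how often the weights are refreshed.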

Keywords

Knowledge base

Multimodal linking

Entity linking

Robust framework

Dynamic loss balancing

Funding

This work was partly supported by the Youth Talent Support Program of Anhui Association for Science and Technology (No. RCTJ202420), the Hefei Key Technology R&D Champion-Based Selection Project (No. 2024SGJ010), and the Open Project of Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China) (No. BigKEOpen2025-08).

Conflict of interest

Chenyang Bu, He Ren, and Weiping Ding are Guest Editors of this special issue, with Chenyang Bu also serving as an Editorial Board Member for the journal. Separately, all the authors declared that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.


Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing