AccScience Publishing / AIH / Online First / DOI: 10.36922/AIH026070015
ORIGINAL RESEARCH ARTICLE

Medical multimodal entity linking under modality missingness

Shiji Zang1, Chenyang Bu1*, Yunpeng Hong1, He Ren2, Weiping Ding3
1 Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), School of Computer Science and Engineering, Hefei University of Technology, Hefei, Anhui, China
2 Department of Data Science and Big Data Technology, Faculty of Medical Instrumentation, Shanghai University of Medicine & Health Sciences, Shanghai, China
3 Department of Computer Science, School of Artificial Intelligence and Computer Science, Nantong University, Nantong, Jiangsu, China
Received: 11 February 2026 | Revised: 14 March 2026 | Accepted: 30 March 2026 | Published online: 19 May 2026
© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Abstract

Multimodal entity linking aligns real-world multimodal mentions with their corresponding entities in a knowledge base and is increasingly important for applications such as medical knowledge grounding and retrieval-augmented generation. However, existing methods struggle in real-world medical settings, where mentions frequently lack visual information because of privacy restrictions, acquisition failures, and legacy data. Furthermore, medical knowledge bases typically suffer from sparse visual coverage and long-tail entities, a modality asymmetry that weakens multimodal matching and reduces entity disambiguation performance. To address these challenges, we introduce a robust framework for medical multimodal entity linking under incomplete modality conditions, which leverages conditional generative models to reconstruct missing visual information with high semantic consistency. The framework also incorporates a dynamic loss balancing strategy that adaptively coordinates modality-specific and cross-modal learning during training. Experiments on a medical multimodal entity linking benchmark covering 11 medical domains demonstrate that the proposed method achieves consistent performance gains and more stable training under modality-missing conditions.

Keywords
Knowledge base
Multimodal linking
Entity linking
Robust framework
Dynamic loss balancing
Funding
This work is partly supported by the Youth Talent Support Program of Anhui Association for Science and Technology (No. RCTJ202420), the Hefei Key Technology R & D Champion-Based Selection Project (No. 2024SGJ010), and the Open Project of Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), under grant number BigKEOpen2025-08.
Conflict of interest
Chenyang Bu, He Ren, and Weiping Ding are Guest Editors of this special issue, with Chenyang Bu also serving as an Editorial Board Member for the journal. Separately, all the authors declared that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing