From raw text to cross-framework training data: Building medical named entity recognition datasets with FIT4NER

¹ Chair of Multimedia and Internet Applications, Faculty of Mathematics and Computer Science, University of Hagen, Hagen, North Rhine-Westphalia, Germany

JCI 2026, 2(1), 025420035 https://doi.org/10.36922/JCI025420035

Received: 14 October 2025 | Revised: 30 January 2026 | Accepted: 17 March 2026 | Published online: 15 May 2026

© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)

Abstract

High-quality annotated medical text data are essential for training robust machine learning-based named entity recognition (NER) models, particularly for extracting structured evidence from large volumes of unstructured medical literature to support the development of clinical practice guidelines. This article introduces a system for collecting, annotating, and managing such training data. The system is designed to help medical professionals create and maintain extensive training and test datasets across multiple text formats and for different NER frameworks. It also supports the straightforward integration of new NER frameworks through customizable converters. Following the Nunamaker methodology for structured information system development, the article first introduces the topic, contextualizes the research, reviews the state of the art, and identifies challenges in text annotation by medical experts. It then describes the system’s modeling and implementation. The article concludes with an expert evaluation of the system, the resulting insights, and a summary of the main findings.
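To illustrate what a customizable converter could look like, the following is a minimal sketch and not the system's actual interface, which this abstract does not show. It assumes annotations are stored as character-offset spans and that each target framework format is produced by a function registered under a name. The names `Span`, `CONVERTERS`, `register`, `to_bio`, `to_offsets`, and `convert` are hypothetical, as are the labels in the usage note.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Span:
    """One annotated entity as character offsets into the text (assumed storage model)."""
    start: int  # inclusive
    end: int    # exclusive
    label: str


# Hypothetical registry: target-format name -> converter function.
CONVERTERS: Dict[str, Callable[[str, List[Span]], object]] = {}


def register(name: str):
    """Decorator that adds a converter to the registry under the given name."""
    def wrap(fn):
        CONVERTERS[name] = fn
        return fn
    return wrap


@register("bio")
def to_bio(text: str, spans: List[Span]):
    """Whitespace-tokenize the text and emit (token, tag) pairs in the BIO scheme.

    Simplification: a token is tagged only if it lies fully inside a span,
    so a token with attached punctuation that crosses a span boundary gets "O".
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            if match.start() >= span.start and match.end() <= span.end:
                prefix = "B-" if match.start() == span.start else "I-"
                tag = prefix + span.label
                break
        rows.append((match.group(), tag))
    return rows


@register("offsets")
def to_offsets(text: str, spans: List[Span]):
    """Emit an offset-based layout: the text plus (start, end, label) triples."""
    return (text, {"entities": [(s.start, s.end, s.label) for s in spans]})


def convert(target: str, text: str, spans: List[Span]):
    """Look up the converter registered for the target format and apply it."""
    return CONVERTERS[target](text, spans)
```

Under these assumptions, `convert("bio", "Aspirin reduces fever in adults", [Span(0, 7, "DRUG"), Span(16, 21, "SYMPTOM")])` tags "Aspirin" as `B-DRUG` and "fever" as `B-SYMPTOM`, with all other tokens tagged `O`. Supporting a further framework would then only require registering one more function, leaving the stored annotations unchanged.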

Keywords

Named entity recognition

Machine learning

Clinical practice guidelines

Information extraction

Cloud

Data preprocessing

Funding

None.

Conflict of interest

The authors declare they have no competing interests.

Journal of Clinical Informatics, Published by AccScience Publishing