From raw text to cross-framework training data: Building medical named entity recognition datasets with FIT4NER
High-quality annotated medical text data are essential for training robust machine learning-based named entity recognition (NER) models, particularly for extracting structured evidence from large volumes of unstructured medical literature to support the development of clinical practice guidelines. This article introduces a system for collecting, annotating, and managing such training data. The system is designed to help medical professionals create and maintain extensive training and test datasets across multiple text formats and for different NER frameworks. It also supports the straightforward integration of new NER frameworks through customizable converters. Following the Nunamaker methodology, a structured approach to information system development, the article introduces the topic, contextualizes the research, reviews the state of the art, and identifies challenges in text annotation by medical experts. This is followed by a description of the system’s modeling and implementation. The article concludes with an expert evaluation of the system, the resulting insights, and a summary of the main findings.
- Freund F, Tamla P, Hemmje M. Towards improving clinical practice guidelines through named entity recognition: Model development and evaluation. In: Proceedings of the 2023 31st Irish Conference on Artificial Intelligence and Cognitive Science (AICS). 2023:1-8. doi: 10.1109/AICS60730.2023.10470480
- Institute of Medicine (US) Committee on Standards for Developing Trustworthy Clinical Practice Guidelines; Graham R, Mancher M, Miller Wolman D, Greenfield S, Steinberg E, eds. Clinical Practice Guidelines We Can Trust. National Academies Press; 2011. doi: 10.17226/13058
- Byyny RL. The data deluge: the information explosion in medicine and science. Pharos Alpha Omega Alpha-Honor Med Soc Alpha Omega Alpha. 2012;75(2):2-5
- Klerings I, Weinhandl AS, Thaler KJ. Information overload in healthcare: too much of a good thing? Z Für Evidenz Fortbild Qual Im Gesundheitswesen. 2015;109(4):285-290. doi: 10.1016/j.zefq.2015.06.005
- Tamla P, Hartmann B, Nguyen N, Kramer C, Freund F, Hemmje M. CIE: a cloud-based information extraction system for named entity recognition in AWS, AZURE, and medical domain. In: Communications in Computer and Information Science. Springer Nature Switzerland; 2023:127-148. doi: 10.1007/978-3-031-43471-6_6
- Konkol IM. Named Entity Recognition. PhD thesis. University of West Bohemia; 2015.
- Bielefeld University. RATIO: Rationalizing Recommendations (RecomRatio). 2017. Available from: https://spp-ratio.de/projects/recomratio/ [Last accessed on August 6, 2024].
- Nawroth C. Supporting Information Retrieval of Emerging Knowledge and Argumentation. PhD thesis. FernUniversität in Hagen; 2020
- FTK. Artificial Intelligence for Hospitals, Healthcare & Humanity (AI4H3). FTK e.V. Research Institute for Telecommunications and Cooperation; Internal project proposal; 2020. Unpublished.
- Liu F, Chen J, Jagannatha A, Yu H. Learning for biomedical information extraction: Methodological review of recent advances. Published online 2016. doi: 10.48550/ARXIV.1606.07993
- Hemmje M. Chair of Multimedia and Internet Applications. 2023. Available from: http://www.lgmmia.fernuni-hagen.de/en.html [Last accessed on].
- FTK. FTK e.V. Research Institute for Telecommunications and Cooperation. 2023. Available from: https://www.ftk.de/en [Last accessed on February 25, 2023].
- Vu B, Wu Y, Afli H, et al. A metagenomic content and knowledge management ecosystem platform. In: Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2019:1-8
- Donovan R, Healy M, Zheng H, et al. SenseCare: Using Automatic Emotional Analysis to Provide Effective Tools for Supporting. In: Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2018:2682-2687. doi: 10.1109/BIBM.2018.8621250
- Tamla P, Freund F, Hemmje M. SNERC: Enhancing Knowledge Management with Named Entity Recognition and Document Classification for Apply Gaming. Artif Intell Appl. 2025;3(4):392-407. doi: 10.47852/bonviewAIA52023841
- Tamla P, Boehm T, Nawroth C, Hemmje M. Towards Semantic Web-Based Information Retrieval to solve Information Overload in an Applied Gaming Ecosystem. Bull IEEE Tech Comm Learn Technol. 2015;15(2):12
- Tamla P, Böhm T, Gaisbachgrabner K, Mertens J, Fuchs M. Survey: Software Search in Serious Games Development. 2019;2348:155-166
- Tamla P, Böhm T, Nawroth C, Hemmje M. What do serious games developers search online? A study of GameDev StackExchange. In: Proceedings of the 5th Collaborative European Research Conference (CERC 2019). CEUR workshop proceedings. CEUR-WS.org; 2019;2348:131-142
- Freund F, Tamla P, Reis T, Hemmje M, Kevitt PM. FIT4NER - Towards a Framework-Independent Toolkit for Named Entity Recognition. In: Proceedings of the CERC 2023. Hochschule Darmstadt; 2023:10. doi: 10.48444/h_docs-pub-518
- Frei J, Kramer F. GERNERMED: An open German medical NER model. Softw Impacts. 2022;11:100212. doi: 10.1016/j.simpa.2021.100212
- Ghiasvand O, Kate RJ. Learning for clinical named entity recognition without manual annotations. Inform Med Unlocked. 2018;13:122-127. doi: 10.1016/j.imu.2018.10.011
- Wen C, Chen T, Jia X, Zhu J. Medical Named Entity Recognition from Un-labelled Medical Records based on Pre-trained Language Models and Domain Dictionary. Data Intell. 2021;3(3):402-417. doi: 10.1162/dint_a_00105
- Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak. 2021;21(1):352. doi: 10.1186/s12911-021-01706-4
- Pinto A, Oliveira HG, Alves AO. Comparing the performance of different NLP toolkits in formal and social media text. In: Mernik M, Leal JP, Oliveira HG, eds. 5th Symposium on Languages, Applications and Technologies (SLATE’16). OpenAccess series in informatics (OASIcs). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik; 2016;51:16. doi: 10.4230/OASIcs.SLATE.2016.3
- Sang EFTK, De Meulder F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. arXiv. Published online 2003. doi: 10.48550/arXiv.cs/0306050
- Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. BRAT: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. 2012:102-107.
- Comeau DC, Islamaj Doğan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database. 2013;2013(0):bat064. doi: 10.1093/database/bat064
- Stieber S. Implementierung eines Systems für Vorverarbeitung von Daten für Named Entity Recognition in einem Wissensmanagement-System für den medizinischen Bereich. Bachelor’s thesis. FernUniversität in Hagen; 2023
- Nunamaker Jr JF, Chen M, Purdin TDM. Systems Development in Information Systems Research. J Manag Inf Syst. 1990;7(3):89-106. doi: 10.1080/07421222.1990.11517898
- Li J, Sun A, Han J, Li C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans Knowl Data Eng. 2022;34(1):50-70. doi: 10.1109/TKDE.2020.2981314
- Cohen KB, Verspoor K, Fort K, et al. The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation in the Biomedical Domain. In: Ide N, Pustejovsky J, eds. Handbook of Linguistic Annotation. Springer Netherlands; 2017:1379-1394. doi: 10.1007/978-94-024-0881-2_53
- Bada M, Vasilevsky N, Baumgartner WA, Haendel M, Hunter LE. Gold-standard ontology-based anatomical annotation in the CRAFT Corpus. Database. 2017;2017:bax087. doi: 10.1093/database/bax087
- Kittner M, Lamping M, Rieke DT, et al. Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open. 2021;4(2):ooab025. doi: 10.1093/jamiaopen/ooab025
- Ahmadi S, Shah A, Fox E. Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification. arXiv. Preprint posted online 2023:arXiv:2307.14899. doi: 10.48550/arXiv.2307.14899
- Newman-Griffis D, Divita G, Desmet B, Zirikly A, Rosé CP, Fosler-Lussier E. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc. 2021;28(3):516-532. doi: 10.1093/jamia/ocaa269
- Alhassan A, Schlegel V, Aloud M, Batista-Navarro R, Nenadic G. Discontinuous named entities in clinical text: A systematic literature review. J Biomed Inform. 2025;162:104783. doi: 10.1016/j.jbi.2025.104783
- Liang S, Profitlich HJ, Klass M, et al. Building A German Clinical Named Entity Recognition System without In-domain Training Data. In: Proceedings of the 6th Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2024:70-81. doi: 10.18653/v1/2024.clinicalnlp-1.7
- Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. Bioinforma Adv. 2024;4(1):vbae116. doi: 10.1093/bioadv/vbae116
- Freund F, Tamla P, Tran B, Hemmje M. Evaluating NERFlow: User-Centered Assessment of Automated LLM-Based Annotation for Medical Named-Entity Recognition. Procedia Comput Sci. In press.
- Liu J, Wong ZSY. Utilizing active learning strategies in machine-assisted annotation for clinical named entity recognition: a comprehensive analysis considering annotation costs and target effectiveness. J Am Med Inform Assoc. 2024;31(11):2632-2640. doi: 10.1093/jamia/ocae197
- Artstein R, Poesio M. Inter-Coder Agreement for Computational Linguistics. Comput Linguist. 2008;34(4):555-596. doi: 10.1162/coli.07-034-R2
- Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. Yearb Med Inform. 2008;17(01):128-144. doi: 10.1055/s-0038-1638592
- Raja U, Mitchell T, Day T, Hardin JM. Text mining in healthcare. Applications and opportunities. J Healthc Inf Manag JHIM. 2008;22(3):52-56
- Neves M, Ševa J. An extensive review of tools for manual annotation of documents. Brief Bioinform. 2021;22(1):146-163. doi: 10.1093/bib/bbz130
- Kim JD, Ohta T, Tateisi Y, Mima H, Tsujii J. XML-based linguistic annotation of corpus. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), 2001:47-53.
- Ogren P. Knowtator: A Protégé plug-in for annotated corpus construction. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion V. 2006:273-275.
- Krishnamoorthy S, Jiang Y, Buchanan W, Singh A, Ortega J. CLPT: a universal annotation scheme and toolkit for clinical language processing. In: Naumann T, Bethard S, Roberts K, Rumshisky A, eds. Proceedings of the 4th Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2022:1-9. doi: 10.18653/v1/2022.clinicalnlp-1.1
- Sharir O, Peleg B, Shoham Y. The cost of training NLP models: a concise overview. arXiv. Published online 2020. doi: 10.48550/ARXIV.2004.08900
- Putzier M, Khakzad T, Dreischarf M, Thun S, Trautwein F, Taheri N. Implementation of cloud computing in the German healthcare system. Npj Digit Med. 2024;7(1):12. doi: 10.1038/s41746-024-01000-3
- Wang H, Wang B, Wang S. Design and Implementation of a Primary Healthcare Cloud Platform. Front Comput Intell Syst. 2024;7(3):77-84. doi: 10.54097/01kn4y43
- Akerele JI, Uzoka A, Ojukwu PU, Olamijuwon OJ. Improving healthcare application scalability through microservices architecture in the cloud. Int J Sci Res Updat. 2024;8(2):100-109. doi: 10.53430/ijsru.2024.8.2.0064
- Norman DA, Draper SW. User Centered System Design; New Perspectives on Human-Computer Interaction. L. Erlbaum Associates Inc.; 1986
- Rumbaugh J, Jacobson I, Booch G. The Unified Modeling Language Reference Manual. 2nd ed. Addison-Wesley; 2005.
- Gamma E, Johnson R, Helm R, Vlissides J. Entwurfsmuster: Elemente wiederverwendbarer objektorientierter Software. Pearson Deutschland GmbH; 2011
- Mane D, Chitnis K, Ojha N. The spring framework: An open source java platform for developing robust java applications. Int J Innov Technol Explor Eng IJITEE. 2013;3(2):137-143
- Ramírez S. FastAPI. Published online 2023. Available from: https://fastapi.tiangolo.com/ [Last accessed on October 3, 2023].
- The Apache Software Foundation. Apache Tika: a content analysis toolkit. Published online 2023. Available from: https://tika.apache.org/ [Last accessed on July 24, 2023].
- pdfminer community. pdfminer.six: We fathom PDF. Published online 2022. Available from: https://github.com/pdfminer/pdfminer.six [Last accessed on July 22, 2023].
- Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in python. J Mach Learn Res. 2011;12:2825-2830
- Tamla P, Freund F, Hemmje M. Cloud-based medical named entity recognition: a FIT4NER-based approach. Information. 2025;16(5):395. doi: 10.3390/info16050395
- OpenAPI Specification. Available from: https://github.com/OAI/OpenAPI-Specification/tree/main [Last accessed on February 1, 2025].
- Federal Ministry for Economic Affairs and Climate Action. Guidelines on the protection of health data. Available from: https://www.bmwk.de/Redaktion/EN/Dossier/guidelines-on-the-protection-of-health-data.html [Last accessed on February 1, 2025].
- Irvine C, Balasubramaniam D, Henderson T. Short paper: Integrating the data protection impact assessment into the software development lifecycle. In: Lecture Notes in Computer Science. Springer International Publishing; 2020:219-228. doi: 10.1007/978-3-030-66172-4_13
- Docker Compose | Docker Docs. Available from: https://docs.docker.com/compose/ [Last accessed on January 29, 2025].
- Helm Authors. Helm - The package manager for Kubernetes. 2025. Available from: https://helm.sh/ [Last accessed on August 25, 2025].
- Kompose - Convert your Docker Compose file to Kubernetes or OpenShift. Available from: https://kompose.io/ [Last accessed on January 29, 2025].
- Amazon Elastic Kubernetes Service Documentation. Available from: https://docs.aws.amazon.com/eks/ [Last accessed on January 29, 2025].
- Google Kubernetes Engine (GKE) | Google Cloud. Available from: https://cloud.google.com/kubernetes-engine [Last accessed on January 29, 2025].
- Azure Kubernetes Service (AKS) documentation | Microsoft Learn. Available from: https://learn.microsoft.com/en-us/azure/aks/ [Last accessed on January 29, 2025].
- Terraform by HashiCorp. Available from: https://www.terraform.io/ [Last accessed on January 29, 2025].
- Nocentino AE, Weissman B. Storing persistent data in kubernetes. In: SQL Server on Kubernetes: Designing and Building a Modern Data Platform. Apress; 2021:111-137. doi: 10.1007/978-1-4842-7192-6_6
- Container attached storage (CAS). Available from: https://openebs.io/docs/2.12.x/concepts/cas [Last accessed on February 1, 2025].
- Rook - cloud-native storage orchestrator for Kubernetes. Available from: https://github.com/rook/rook [Last accessed on February 1, 2025].
- Polson PG, Lewis C, Rieman J, Wharton C. Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Man-Mach Stud. 1992;36(5):741-773. doi: 10.1016/0020-7373(92)90039-N
- Collofello JS. The Software Technical Review Process. Published online 1988. Available from: https://web.archive.org/web/20150724025200/http://www.sei.cmu.edu/reports/88cm003.pdf [Last accessed on May 12, 2020].
- IEEE. IEEE standard for software reviews and audits. IEEE Std 1028-2008. IEEE; 2008. doi: 10.1109/IEEESTD.2008.4601584
- Furrer L, Cornelius J, Rinaldi F. UZH@CRAFT-ST: a Sequence-labeling Approach to Concept Recognition. In: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. Association for Computational Linguistics; 2019:185-195. doi: 10.18653/v1/D19-5726
- Freund F, Tamla P, Tran B, Hemmje M. Open-Source Large Language Models for FIT4NER: Automatic Annotation for Medical Named Entity Recognition. Manuscript submitted for publication. 2025.
