ORIGINAL RESEARCH ARTICLE

From raw text to cross-framework training data: Building medical named entity recognition datasets with FIT4NER

Florian Freund1*, Philippe Tamla1, Sven Stieber1, Matthias Hemmje1
1 Chair of Multimedia and Internet Applications, Faculty of Mathematics and Computer Science, University of Hagen, Hagen, North Rhine-Westphalia, Germany
JCI 2026, 2(1), 025420035 https://doi.org/10.36922/JCI025420035
Received: 14 October 2025 | Revised: 30 January 2026 | Accepted: 17 March 2026 | Published online: 15 May 2026
© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/).
Abstract

High-quality annotated medical text data are essential for training robust machine learning-based named entity recognition (NER) models, particularly for extracting structured evidence from large volumes of unstructured medical literature to support the development of clinical practice guidelines. This article introduces a system for collecting, annotating, and managing such training data. The system is designed to help medical professionals create and maintain extensive training and test datasets across multiple text formats and for different NER frameworks. It also supports the straightforward integration of new NER frameworks through customizable converters. Following the Nunamaker methodology for structured information systems development, the article first introduces the topic, contextualizes the research, reviews the state of the art, and identifies challenges that medical experts face in text annotation. It then describes the system’s modeling and implementation. The article concludes with an expert evaluation of the system, the resulting insights, and a summary of the main findings.
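
To illustrate the idea of customizable converters, the following minimal Python sketch shows how a framework-neutral annotation could be exported to more than one target format through a small registry. It is not the FIT4NER implementation: the names `Span`, `AnnotatedText`, `register`, `convert`, and the two example targets (`"bio"` and `"offsets"`) are hypothetical and chosen only for illustration. Tokenization is naive whitespace splitting, so a token with attached punctuation that extends past a span boundary would be tagged `O`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


@dataclass(frozen=True)
class AnnotatedText:
    text: str
    spans: Tuple[Span, ...]


# Hypothetical registry mapping a target-format name to a converter function.
CONVERTERS: Dict[str, Callable[[AnnotatedText], object]] = {}


def register(name: str):
    """Decorator that adds a converter to the registry under `name`."""
    def decorator(fn):
        CONVERTERS[name] = fn
        return fn
    return decorator


@register("bio")
def to_bio(doc: AnnotatedText) -> List[Tuple[str, str]]:
    """Emit (token, tag) pairs in a BIO scheme using whitespace tokens."""
    rows = []
    pos = 0
    for token in doc.text.split():
        start = doc.text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for span in doc.spans:
            # A token is tagged only if it lies fully inside a span.
            if start >= span.start and end <= span.end:
                prefix = "B-" if start == span.start else "I-"
                tag = prefix + span.label
                break
        rows.append((token, tag))
    return rows


@register("offsets")
def to_offsets(doc: AnnotatedText):
    """Emit the text together with (start, end, label) character offsets."""
    entities = [(s.start, s.end, s.label) for s in doc.spans]
    return (doc.text, {"entities": entities})


def convert(doc: AnnotatedText, target: str):
    """Look up the converter for `target` and apply it to `doc`."""
    return CONVERTERS[target](doc)


doc = AnnotatedText(
    "Patient received 5 mg aspirin daily",
    (Span(22, 29, "DRUG"),),
)
print(convert(doc, "bio"))
print(convert(doc, "offsets"))
```

Under this design, supporting an additional framework would only require registering one more converter function; the stored annotations themselves would not change.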

Keywords
Named entity recognition
Machine learning
Clinical practice guidelines
Information extraction
Cloud
Data preprocessing
Funding
None.
Conflict of interest
The authors declare they have no competing interests.
Journal of Clinical Informatics, Published by AccScience Publishing