AccScience Publishing / AIH / Online First / DOI: 10.36922/AIH026080021
ORIGINAL RESEARCH ARTICLE

Enhancing medical data quality using hybrid machine learning models: A comparative study of isolation forest and support vector machine on numerically encoded clinical text

Raed Abdullah Althabeti1*, Ebeid Ali Ebeid2, Hany Maher Sayed Lala2, Kamal Abdelraouf Eldahshan2
1 Information Systems Department, Faculty of Information Technology and Computer Science, University of Saba Region, Marib, Marib, Yemen
2 Mathematics Department, Faculty of Science, Al-Azhar University, Cairo, Egypt
Received: 22 February 2026 | Revised: 31 March 2026 | Accepted: 1 April 2026 | Published online: 8 May 2026
© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Abstract

Ensuring high-quality medical text data is a persistent challenge in healthcare analytics: inconsistency, incompleteness, and hidden anomalies can substantially compromise the reliability of clinical decision-making and downstream research outcomes. This article presents a hybrid isolation forest (IF)–support vector machine (SVM) model (HIFSVM) that improves the quality of medical text data through anomaly detection, and assesses it against standalone IF and SVM. The study used three public health datasets: Alzheimer’s Disease and Healthy Aging Data (ADHAD), Breast Cancer Global Dataset, and 500 Cities: Local Data for Better Health (500LDFB). It applied 10 data quality metrics: accuracy, completeness, consistency, timeliness, validity, uniqueness, precision, change rate, error rate, and data density. The models’ anomaly detection performance was evaluated using synthetic ground truth, cross-validation, and expert validation. The HIFSVM model reported higher change rates (e.g., 0.96 for ADHAD’s data value and 1.00 for 500LDFB’s confidence limits), indicating considerable potential for complex or dynamic medical datasets that require sophisticated anomaly detection. It also achieved higher detection accuracy (F1-scores: 0.89–0.91 across datasets), significantly outperforming IF (0.83–0.87) and SVM (0.74–0.79) (both p < 0.05). Expert validation established that 79–91% of the identified anomalies were clinically significant, with considerable inter-rater agreement (κ = 0.71–0.84). These findings indicate that HIFSVM detects genuine anomalies while maintaining robust data quality evaluation capabilities. The study provides a structured methodological framework for data preprocessing, feature engineering, and anomaly detection that healthcare professionals can use to enhance the reliability of clinical decision-making, thereby improving the accuracy of clinical studies and supporting better patient outcomes.
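
The abstract does not state how the IF and SVM components are combined, so the following is only a minimal sketch of one possible hybrid, not the authors' HIFSVM implementation. It assumes a two-stage rule: an isolation forest proposes candidate anomalies, a one-class SVM is then fitted on the records the forest considers clean, and a record is flagged only when both stages reject it. The function names, the toy data generator, the one-class SVM variant, and all parameter values (`contamination`, `nu`, `gamma`) are illustrative assumptions. The F1-score is computed against injected anomalies, mirroring the idea of synthetic ground truth.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def make_synthetic_records(n_normal=950, n_anomaly=50, n_features=4, seed=0):
    """Toy numerically encoded records with injected anomalies (label 1).

    Hypothetical stand-in for the study's datasets, used only for illustration.
    """
    rng = np.random.default_rng(seed)
    normal = rng.normal(0.0, 1.0, size=(n_normal, n_features))
    # Injected anomalies sit far from the bulk, with a random sign per feature.
    magnitude = rng.uniform(6.0, 12.0, size=(n_anomaly, n_features))
    sign = rng.choice([-1.0, 1.0], size=(n_anomaly, n_features))
    X = np.vstack([normal, magnitude * sign])
    y = np.concatenate([np.zeros(n_normal, dtype=int),
                        np.ones(n_anomaly, dtype=int)])
    return X, y


def hybrid_if_svm_flags(X, contamination=0.05, seed=0):
    """Return 1 for records rejected by both stages, else 0.

    The AND combination rule is an assumption made for this sketch.
    """
    Xs = StandardScaler().fit_transform(X)

    # Stage 1: the isolation forest marks candidates (-1) and inliers (+1).
    forest = IsolationForest(n_estimators=100,
                             contamination=contamination,
                             random_state=seed)
    if_pred = forest.fit_predict(Xs)

    # Stage 2: a one-class SVM learns a boundary from presumed-clean records only.
    svm = OneClassSVM(kernel="rbf", gamma="scale", nu=contamination)
    svm.fit(Xs[if_pred == 1])
    svm_pred = svm.predict(Xs)

    # Final decision: keep only the candidates that the SVM also rejects.
    return ((if_pred == -1) & (svm_pred == -1)).astype(int)


X, y_true = make_synthetic_records()
flags = hybrid_if_svm_flags(X)
print(f"flagged={flags.sum()}  F1={f1_score(y_true, flags):.2f}")
```

Requiring agreement between the two stages favors precision over recall. A union of the two flag sets, or a weighted average of the two anomaly scores, would be equally plausible readings of "hybrid" and would trade in the opposite direction.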

Keywords
Anomaly detection
Data quality assessment
Hybrid machine learning
Isolation forest
Support vector machine
Medical datasets
Funding
None.
Conflict of interest
The authors declare that they have no conflict of interest.
Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing