Beyond SMOTE: Evaluating large language models and mixture of experts for prediction of surgical site infections

© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Abstract

Severe class imbalance remains one of the most persistent barriers to deploying reliable artificial intelligence (AI) in healthcare. Conventional remedies such as SMOTE and other resampling strategies often inflate training performance but degrade under real-world distribution shift. We evaluated alternative modeling strategies for surgical site infection (SSI) prediction from structured electronic medical records: classical machine learning, ensemble methods, imbalance-aware mixture of experts (MoE), and a fine-tuned large language model (LLM), ModernBERT-large. Across temporally shifted evaluation cohorts, the ModernBERT model consistently outperformed all baselines without synthetic oversampling or target ratio adjustments, achieving a Matthews correlation coefficient (MCC) of 0.71 versus 0.35 for the best SMOTE-resampled CatBoost model. In contrast, MoE architectures failed to deliver robustness gains, and resampled classical models deteriorated under distributional change. These results highlight a paradigm shift: pre-trained language models can serve as deployment-stable alternatives to synthetic imbalance correction in structured clinical prediction tasks. Beyond SSI, this finding underscores the potential of LLMs to improve the resilience of healthcare AI systems in which minority-class prediction is critical.
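For readers unfamiliar with the headline metric, the following is a minimal sketch of how the Matthews correlation coefficient is computed from a binary confusion matrix. It is illustrative only and is not taken from the study's code; the function name `mcc_from_counts`, the convention of returning 0.0 when the denominator vanishes, and the patient counts are all assumptions chosen for the example.

```python
import math


def mcc_from_counts(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here), since the coefficient is undefined in that case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical cohort: 1,000 patients, 20 of whom develop an infection.
# A model that always predicts "no infection" is 98% accurate ...
always_negative_accuracy = (0 + 980) / 1000
# ... but its MCC is 0.0, because it never identifies the minority class.
always_negative_mcc = mcc_from_counts(tp=0, fp=0, fn=20, tn=980)

# A perfect classifier on the same cohort scores an MCC of 1.
perfect_mcc = mcc_from_counts(tp=20, fp=0, fn=0, tn=980)
```

Because the numerator rewards agreement on both classes and penalizes both error types, a trivial majority-class predictor scores zero regardless of its accuracy, which is why the metric suits heavily imbalanced outcomes such as SSI.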

Keywords

Imbalanced datasets

Artificial intelligence

Health informatics

Large language models

Machine learning

Minority class prediction

Mixture of Experts

Surgical site infection

Funding

None.

Conflict of interest

The author declares no conflict of interest.


Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing