DOI: 10.36922/CP025450072
ORIGINAL RESEARCH ARTICLE

Development of a machine learning-based risk prediction model for lymphoma using data from the 2023 National Health Interview Survey

Yu-Ying Guo¹, Shu-Ling Hou¹*, Xue-Jing Yang¹*, Dong Song¹*
¹ Cancer Center, Shanxi Bethune Hospital, Shanxi Academy of Medical Sciences, Third Hospital of Shanxi Medical University, Tongji Shanxi Hospital, Taiyuan, China
Received: 7 November 2025 | Revised: 8 December 2025 | Accepted: 19 December 2025 | Published online: 24 December 2025
© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0/ )
Abstract

Lymphoma is a malignant tumor that originates in the lymphatic system. This study aimed to develop a risk prediction model for lymphoma using cases extracted from the 2023 National Health Interview Survey (NHIS) database. The χ² test was used to examine between-group differences in seven variables. Hyperparameters were then optimized by randomly testing parameter combinations, and four machine learning models were constructed with the optimal hyperparameters. Receiver operating characteristic (ROC) analysis was used to evaluate the four models and select the most effective one. The relative importance of the seven variables was ranked using the machine learning algorithm. The baseline characteristics table, comprising 3,670 participants, showed that six variables differed significantly between groups (p < 0.05). After hyperparameter tuning, the extreme gradient boosting (XGBoost) model reached its optimal parameter combination, with an ROC value of 0.8567. The XGBoost model also had the highest area under the curve (0.8161) of the four models, making it the best-performing model, with a sensitivity of 0.9916, a specificity of 0.1188, a precision of 0.9086, an F1 score of 0.9483, and an accuracy of 0.9028. Finally, the machine learning model found that obesity (body mass index ≥ 30 kg/m²) made the greatest predictive contribution to lymphoma, with a relative importance score of 100. This study identified seven variables associated with lymphoma occurrence and developed a risk prediction model for lymphoma, providing valuable insights into the treatment of this disease.
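The performance measures quoted in the abstract all derive from a binary confusion matrix, and the F1 score is the harmonic mean of precision and sensitivity. The following is a minimal sketch of those standard definitions, not the authors' code: the `ConfusionCounts` container, the function names, and the example counts are hypothetical and chosen only for illustration. The final step recomputes F1 from the precision (0.9086) and sensitivity (0.9916) stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container, not from the study)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def f1_from(precision: float, sensitivity: float) -> float:
    """F1 score: harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard definitions of the five metrics named in the abstract."""
    sensitivity = c.tp / (c.tp + c.fn)   # true positive rate
    specificity = c.tn / (c.tn + c.fp)   # true negative rate
    precision = c.tp / (c.tp + c.fp)     # positive predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1_from(precision, sensitivity),
        "accuracy": accuracy,
    }


# Made-up counts for illustration only; they are NOT the study's data.
example = ConfusionCounts(tp=90, fp=10, tn=40, fn=10)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Consistency check using only the values reported in the abstract.
reported_f1 = f1_from(precision=0.9086, sensitivity=0.9916)
print(f"F1 recomputed from reported precision and sensitivity: {reported_f1:.4f}")
```

Recomputing F1 this way gives a value that rounds to the reported 0.9483, so the stated precision, sensitivity, and F1 score are mutually consistent.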

Keywords
Lymphoma
Risk prediction model
National Health Interview Survey
Baseline characteristic
Machine learning
Funding
None.
Conflict of interest
The authors declare no conflict of interest.
Cancer Plus, Electronic ISSN: 2661-3840 Print ISSN: 2661-3832, Published by AccScience Publishing