Artificial Intelligence in Health (AIH), Online First. DOI: 10.36922/AIH025490111
ORIGINAL RESEARCH ARTICLE

Machine learning insights for cardiovascular risk prediction in diabetic patients: Emphasis on renal and cardiac markers using random forests

Julian Borges1*
1 Department of Computer Science, Boston University Metropolitan College, Boston, Massachusetts, United States of America
Received: 4 December 2025 | Revised: 6 February 2026 | Accepted: 27 March 2026 | Published online: 19 May 2026
© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0/ )
Abstract

Cardiovascular disease remains a leading cause of mortality among individuals with diabetes, yet many machine learning studies report inflated performance due to inadequate validation. This study evaluates whether standard, interpretable models can predict heart failure mortality using a rigorously validated analytic pipeline. Two publicly available datasets from the University of California, Irvine, Machine Learning Repository were analyzed independently: the Early Stage Diabetes Risk Prediction Dataset (n = 520) and the Heart Failure Clinical Records Dataset (n = 299). These datasets are not patient-linked; findings are framed as methodological feasibility rather than direct clinical prediction. The heart failure dataset was used for model development. Logistic regression and random forest classifiers were trained and evaluated using stratified five-fold cross-validation, with all metrics computed from pooled out-of-fold predictions. Preprocessing and class imbalance handling were confined to training folds to prevent information leakage. All analyses were conducted using R with the caret, randomForest, pROC, and ggplot2 packages. In a pooled out-of-fold evaluation, random forest achieved higher discriminative performance than logistic regression (area under the curve 0.91 versus 0.86). Random forest exhibited higher specificity, whereas logistic regression showed higher sensitivity, reflecting distinct error profiles. Feature importance analyses and Shapley additive explanations consistently identified serum creatinine, ejection fraction, age, and follow-up time as dominant predictors. Limitations include modest sample size, reliance on a single public dataset, and absence of external validation. These findings underscore the importance of conservative validation strategies and transparent baseline modeling for clinically responsible artificial intelligence research.
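The validation scheme described above (stratified five-fold cross-validation, preprocessing fitted on training folds only, and metrics computed once from pooled out-of-fold predictions) can be illustrated with a minimal sketch. This is not the study's code: the study reports using R with caret, randomForest, and pROC, whereas the sketch below is dependency-free Python, uses synthetic data in place of the Heart Failure Clinical Records Dataset, substitutes a small gradient-descent logistic regression for the fitted models, and omits class imbalance handling. All function names and the synthetic feature definitions are assumptions made for illustration.

```python
import math
import random


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds, class by class, so every fold
    keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of


def fit_scaler(rows):
    """Column means and standard deviations from the given rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return means, sds


def scale(rows, means, sds):
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def predict(row, w, b):
    z = sum(wj * xj for wj, xj in zip(w, row)) + b
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Batch gradient descent on the logistic loss (illustrative, untuned)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errs = [predict(xi, w, b) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b


def pooled_oof_predictions(X, y, k=5, seed=0):
    """Return one held-out prediction per row. The scaler and the model are
    fitted on the training folds only, so the held-out fold never leaks in."""
    fold_of = stratified_folds(y, k, seed)
    oof = [None] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if fold_of[i] != fold]
        test = [i for i in range(len(y)) if fold_of[i] == fold]
        means, sds = fit_scaler([X[i] for i in train])
        w, b = fit_logistic(scale([X[i] for i in train], means, sds),
                            [y[i] for i in train])
        for i, row in zip(test, scale([X[i] for i in test], means, sds)):
            oof[i] = predict(row, w, b)
    return oof


def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive case
    outranks a negative case (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n=300, seed=1):
    """Synthetic stand-in data: two hypothetical features, ~30% positives."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        X.append([rng.gauss(1.0 * label, 1.0),
                  rng.gauss(50.0 - 5.0 * label, 10.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    oof = pooled_oof_predictions(X, y, k=5)
    print("Pooled out-of-fold AUC:", round(auc(oof, y), 3))
```

Computing the AUC once over the pooled predictions, rather than averaging five per-fold AUCs, gives a single estimate in which every observation is scored by a model that never saw it during fitting or scaling.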

Keywords
Machine learning
Cardiovascular disease
Diabetes
Heart failure
Five-fold cross-validation
Logistic regression
Random forest
Reproducibility
Funding
None.
Conflict of interest
The author declares no conflict of interest.
Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing