Machine learning insights for cardiovascular risk prediction in diabetic patients: Emphasis on renal and cardiac markers using random forests

© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Abstract

Cardiovascular disease remains a leading cause of mortality among individuals with diabetes, yet many machine learning studies report inflated performance due to inadequate validation. This study evaluates whether standard, interpretable models can predict heart failure mortality using a rigorously validated analytic pipeline. Two publicly available datasets from the University of California, Irvine, Machine Learning Repository were analyzed independently: the Early Stage Diabetes Risk Prediction Dataset (n = 520) and the Heart Failure Clinical Records Dataset (n = 299). These datasets are not patient-linked; findings are framed as methodological feasibility rather than direct clinical prediction. The heart failure dataset was used for model development. Logistic regression and random forest classifiers were trained and evaluated using stratified five-fold cross-validation, with all metrics computed from pooled out-of-fold predictions. Preprocessing and class imbalance handling were confined to training folds to prevent information leakage. All analyses were conducted using R with the caret, randomForest, pROC, and ggplot2 packages. In a pooled out-of-fold evaluation, random forest achieved higher discriminative performance than logistic regression (area under the curve 0.91 versus 0.86). Random forest exhibited higher specificity, whereas logistic regression showed higher sensitivity, reflecting distinct error profiles. Feature importance analyses and Shapley additive explanations consistently identified serum creatinine, ejection fraction, age, and follow-up time as dominant predictors. Limitations include modest sample size, reliance on a single public dataset, and absence of external validation. These findings underscore the importance of conservative validation strategies and transparent baseline modeling for clinically responsible artificial intelligence research.
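The validation scheme described above (stratified folds, preprocessing fitted on training folds only, metrics computed once on pooled out-of-fold predictions) can be sketched as follows. This is a minimal illustration in dependency-free Python, not the study's R/caret pipeline: the helper names (`stratified_folds`, `fit_scaler`, `fit_logistic`, `pooled_oof_predictions`, `auc`), the gradient-descent logistic regression, and its learning rate and epoch count are all assumptions made for the example. Class imbalance handling and the random forest model are omitted.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each row to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            fold_of[i] = pos % k  # deal each class round-robin across folds
    return fold_of


def fit_scaler(rows):
    """Column means and standard deviations, estimated from training rows only."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
           for j in range(d)]
    return means, sds


def scale(rows, means, sds):
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent logistic regression (illustrative stand-in model)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def pooled_oof_predictions(X, y, k=5, seed=0):
    """Each row is scored by a model that never saw it; scores are pooled."""
    fold_of = stratified_folds(y, k, seed)
    oof = [None] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if fold_of[i] != fold]
        test = [i for i in range(len(y)) if fold_of[i] == fold]
        # Scaler parameters come from the training fold only (no leakage).
        means, sds = fit_scaler([X[i] for i in train])
        w, b = fit_logistic(scale([X[i] for i in train], means, sds),
                            [y[i] for i in train])
        for i, xi in zip(test, scale([X[i] for i in test], means, sds)):
            oof[i] = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
    return oof


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum statistic, midranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Given a feature matrix `X` (list of numeric rows) and binary labels `y`, calling `auc(y, pooled_oof_predictions(X, y))` yields a single pooled estimate rather than an average of per-fold values, which is the form of estimate the abstract reports.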

Keywords

Machine learning

Cardiovascular disease

Diabetes

Heart failure

Five-fold cross-validation

Logistic regression

Random forest

Reproducibility

Funding

None.

Conflict of interest

The author declares no conflict of interest.

Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing