Value of machine learning algorithms for predicting diabetes risk: A subset analysis from a real‐world retrospective cohort study

Mao, Yaqian; Zhu, Zheng; Pan, Shuyao; Lin, Wei; Liang, Jixing; Huang, Huibin; Li, Liantao; Wen, Junping; Chen, Gang

Published in

Wiley Open Access, Journal of Diabetes Investigation, 2(14), p. 309-320, 2022

DOI: 10.1111/jdi.13937


Journal article published in 2022 by Yaqian Mao, Zheng Zhu, Shuyao Pan, Wei Lin, Jixing Liang, Huibin Huang, Liantao Li, Junping Wen, Gang Chen

This paper is made freely available by the publisher.


Preprint: archiving allowed


Postprint: archiving allowed


Published version: archiving allowed


Abstract

Aims/Introduction
To compare the value of different machine learning (ML) algorithms for diabetes risk prediction.

Materials and Methods
This 3‐year retrospective cohort study included a total of 3,687 participants in the data analysis. Modeling variables were screened using logistic regression (LR) analysis, and predictive models were built with 10‐fold cross‐validation. Six ML algorithms were used for model construction: random forests, light gradient boosting machine, extreme gradient boosting, adaptive boosting (AdaBoost), multi‐layer perceptrons and Gaussian naive Bayes. Model performance was evaluated mainly by the area under the receiver operating characteristic curve (AUC). The best‐performing ML model was selected for comparison with the traditional LR model and visualized using Shapley additive explanations.

Results
Univariate and multivariate LR analysis identified the eight risk factors most strongly associated with the development of diabetes, and these were visualized in the form of a nomogram. Among the six ML models, the random forests model had the best predictive performance. After 10‐fold cross‐validation, its optimal model had an AUC of 0.855 (95% confidence interval [CI] 0.823–0.886) in the training set and 0.835 (95% CI 0.779–0.892) in the test set. The traditional LR model had an AUC of 0.840 (95% CI 0.814–0.866) in the training set and 0.834 (95% CI 0.785–0.884) in the test set.

Conclusions
In real‐world epidemiological research, combining traditional variable screening with ML algorithms to construct a diabetes risk prediction model has satisfactory clinical application value.
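The two evaluation ideas named in the abstract, 10‐fold cross‐validation and the area under the receiver operating characteristic curve, can be illustrated with a minimal standard-library Python sketch. This is not the authors' code: the function names `roc_auc` and `k_fold_indices`, the random seed and the toy scores are assumptions made only for illustration, and the study's actual models and data are not reproduced here.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices once, then yield (train, test) index lists,
    one pair per fold. Every sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    # Toy example (made-up scores): 1 = developed diabetes, 0 = did not.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print("AUC on toy data:", roc_auc(toy_labels, toy_scores))

    # Fold sizes for a cohort of the size reported in the abstract.
    sizes = [len(test) for _, test in k_fold_indices(3687, k=10)]
    print("Test-fold sizes:", sizes)
```

In a cross-validated workflow, a model would be fitted on each `train` split and `roc_auc` applied to its predictions on the matching `test` split, with the per-fold values then summarized.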
