Machine Learning-based Framework for Early Diabetes Risk Classification
Mirali Mammadzade *
Department of Computer Science, University of Lodz, Lodz, Poland.
*Author to whom correspondence should be addressed.
Abstract
Aims: This study evaluated the effectiveness of supervised machine learning algorithms for early diabetes risk classification using demographic, behavioural, cardiovascular, and general health-related indicators. It also examined which variables were most strongly associated with diabetes status.
Study Design: A quantitative experimental design was used, based on supervised multiclass classification and comparative machine learning evaluation.
Place and Duration of Study: The experimental analysis was conducted using the Diabetes Health Indicators BRFSS2015 dataset between March 2026 and mid-May 2026.
Methodology: The dataset contained demographic, lifestyle, cardiovascular, and general health-related variables associated with diabetes status. Before modelling, the data were inspected for duplicates, explored, standardised, and checked for variable consistency to stabilise the subsequent analysis. Diabetes status served as the target variable in a multiclass classification framework. Logistic Regression, K-Nearest Neighbours, Naïve Bayes, and AdaBoost classifiers were implemented using Python machine learning libraries. The dataset was split into training and testing subsets in an 80:20 ratio. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. Feature importance analysis and ROC curve comparison were also performed to assess classifier behaviour and the contribution of individual variables.
Results: Logistic Regression achieved the highest ROC-AUC (0.814) and discriminated stably across diabetes categories. AdaBoost achieved the highest accuracy (0.847) with competitive precision, recall, and F1-score values. K-Nearest Neighbours showed moderate classification capability, whereas Naïve Bayes was comparatively weaker and less consistent. Feature importance analysis identified HighBP, GenHlth, Age, BMI, CholCheck, and HighChol as influential variables.
Conclusion: The findings indicate that supervised machine learning methods can support early diabetes risk classification. Cardiovascular conditions, obesity-related indicators, and general health variables were important contributors to the classifiers' predictions.
Keywords: Diabetes risk classification, machine learning, supervised learning, healthcare analytics, multiclass classification, BRFSS2015, Logistic Regression, AdaBoost, feature importance, preventive healthcare
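The workflow summarised under Methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it assumes scikit-learn, uses a synthetic three-class dataset in place of the BRFSS2015 table, and picks a stratified split, Gaussian Naïve Bayes, macro-averaged metrics, one-vs-rest ROC-AUC, and all hyperparameters for illustration only, since the abstract does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic, imbalanced three-class data standing in for the
# real health-indicator table, which is not loaded here.
X, y = make_classification(n_samples=3000, n_features=12, n_informative=8,
                           n_classes=3, weights=[0.8, 0.05, 0.15],
                           random_state=0)

# 80:20 train/test split; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The four classifier families named in the abstract; hyperparameters are
# illustrative defaults, not the values used in the study.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Standardise features inside the pipeline so the scaler is fitted
    # on the training subset only.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="macro", zero_division=0)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # One-vs-rest, macro-averaged ROC-AUC for the multiclass target.
        "roc_auc": roc_auc_score(y_test, proba, multi_class="ovr",
                                 average="macro"),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Because the data here are synthetic, the printed scores bear no relation to the values reported in the Results. The sketch only shows how accuracy and ROC-AUC are computed side by side, which is why the two metrics can rank the classifiers differently.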