Machine Learning Analysis of Factors Contributing to Diabetes Development

Abstract: Diabetes is a chronic condition that affects how the body processes blood sugar. Early diagnosis and management are essential for preventing its complications. Machine Learning (ML) techniques offer an effective means of diagnosing diabetes by identifying key risk factors and developing predictive models. In this study, we assess the performance of 11 ML algorithms on four diabetes prediction datasets, considering the top 2, top 3, and all attributes. We use k-fold cross-validation to obtain robust and generalizable results. We use a set of standard evaluation metrics: accuracy, precision, recall, f1-Score, and the area under the Receiver Operating Characteristic curve (ROC_AUC). Our analysis aims to determine the optimal number of features and assess how performance changes as features are added. Notably, some ML classifiers achieve satisfactory classification and predictive abilities using only the top 2 or 3 features. Furthermore, performance varies across datasets and algorithms, which highlights the need to assess multiple models to identify the most suitable one. These findings enable the creation of dependable models that enhance patient outcomes by leveraging effective algorithms and pertinent features.


Introduction
Diabetes is a chronic disease affecting millions worldwide, with an increasing prevalence over the past few decades [1][2]. It is a complex metabolic disorder that occurs when the body cannot regulate blood glucose levels due to either a lack of insulin production or an inability to use insulin effectively. If left unmanaged, diabetes can lead to various complications, including cardiovascular disease, kidney failure, blindness, and nerve damage [3].
As such, predicting and preventing diabetes is a significant public health concern. Machine learning (ML) techniques have shown considerable promise in this area, as they allow for the development of accurate predictive models based on complex relationships between various data points. These models are useful in identifying patterns in extensive patient data that humans would probably miss. Recently, ML techniques have been applied to diabetes prediction, resulting in numerous studies exploring different algorithms and datasets [4][5][6][7][8].
The most commonly used ML techniques in diabetes prediction include Logistic Regression, Decision Trees, Support Vector Machines (SVM), Random Forest, and Artificial Neural Networks (ANN), all of which have shown varying degrees of success.
Various factors, including the size and quality of the dataset, the choice of ML algorithm, and the selection of relevant features, influence the accuracy of these ML models. Combining different dataset features as predictors can further improve the performance of these techniques. However, identifying the most important features for predicting diabetes remains a challenging problem, and different datasets may require different features to achieve the highest accuracy.
Hence, this study is motivated to assess the efficacy of diverse ML algorithms in predicting diabetes, employing four distinct datasets. Our focus includes evaluating various ML algorithms for diabetes prediction based on the top 2, top 3, and all dataset features. Additionally, we aim to analyze how different performance metrics, including accuracy, precision, recall, f1-Score, and ROC_AUC, influence model evaluation.
The four datasets we will use in this study are the Mendeley dataset, the Pima Indians Diabetes (PID) dataset, the Diabetes Early Stage (DES) dataset, and the Vanderbilt dataset (see Section 4.1). Each dataset has unique characteristics, including the number of instances, features, target variables, and differences in the populations they represent. By exploring the performance of various ML algorithms on different datasets, we aim to provide a comprehensive evaluation of the effectiveness of these techniques for predicting and managing diabetes.
This study makes the following important contributions:
a) It rigorously benchmarks predictive performance for diabetes risk across 11 diverse ML models and four real-world datasets, facilitating an accurate assessment.
b) It provides interpretable feature importance rankings, unveiling actionable risk factors.
c) It offers a more comprehensive and multi-faceted examination.
d) Its rigorous methodology corroborates the reliability and robustness of our findings, surpassing studies primarily concentrating on a few algorithms and requiring more extensive feature scrutiny.
The results of our study hold significant implications for advancing the application of precise and effective diabetes prediction models. These advancements can improve patient outcomes and mitigate complications associated with the disease. Moreover, our research contributes to the expanding knowledge of utilizing ML techniques for predicting and managing chronic diseases. This area of research has gained prominence in recent years, highlighting the relevance and impact of our study within this evolving field.
The remainder of this paper is organized as follows: Section 2 presents the literature related to our research. Section 3 details the proposed method. The results of applying the ML techniques to the different datasets are presented in Section 4. Finally, our conclusions are presented in Section 5.

Literature review
Diabetes is a chronic condition affecting many people globally, and its prediction and prevention are crucial for public health. This disease is characterized by high blood sugar levels, also known as hyperglycemia, and can cause various complications if not managed properly. Age, ethnicity, family history, low socioeconomic level, obesity, metabolic syndrome, cardiac complications, food intake, and some bad lifestyle choices are the main risk factors for diabetes [9][10]. The World Health Organization predicts that by 2040, the number of individuals living with diabetes will reach 642 million, which translates to one out of every ten adults [1]. This alarming statistic emphasizes the need for effective approaches to tackling the increasing incidence of diabetes.
One promising method for diabetes prediction is the use of machine learning methods [4]. Machine learning (ML) is a branch of artificial intelligence (AI) that enables computer systems to learn and improve from data without human intervention [11]. The accuracy of diabetes diagnosis and prediction has improved thanks to ML approaches, as demonstrated by encouraging results [12][13][14]. These methods can analyze large data sets and find patterns humans would miss. For example, ML algorithms can analyze data from electronic health records to predict a person's risk of acquiring diabetes. These algorithms can also be trained on past data, including medical history and lifestyle factors, which enables them to become more accurate predictors as they accumulate more knowledge [8,15].
In this section, we will explore recent research on ML techniques for diabetes prediction and compare the performance of various ML models using different dataset features.
Alkaragole and Kurnaz [5] studied the precision of different ML methods, including Decision Trees, Naive Bayes, SVM, and hybrid algorithms. They found that combining SVM and Decision Trees outperformed the other methods in terms of accuracy and runtime.
Similarly, Gill and Pathwat [23] analyzed diabetes symptoms to gather meaningful insights that help health experts make an early diagnosis. The researchers used feature selection techniques such as Analysis of Variance (ANOVA), mutual information, and a genetic algorithm to increase accuracy and reduce overhead and training time. They used Logistic Regression, Naive Bayes, Stochastic Gradient Descent (SGD) Classifier, K-NN, Random Forest, Decision Trees, and SVM algorithms to predict diabetes. Random Forest showed the best accuracy of 93.95%, with the genetic algorithm as the feature selection technique, selecting "Cholesterol", "Glucose", "Chol/HDL", "Systolic BP", "Weight", and "Hip ratio" as the most important features.
Zout et al. [1] used a Decision Tree, Random Forest, and Neural Network to predict diabetes mellitus using hospital physical examination data in Luzhou, China. They implemented five-fold cross-validation and independent test experiments to verify the models' universal applicability. Similarly, the researchers used Principal Component Analysis (PCA) and Minimum Redundancy Maximum Relevance (MRMR) to reduce the dimensionality. The results showed that Random Forest achieved the highest accuracy of 80.84% when all the attributes were used.
While the presented works demonstrate notable efforts in utilizing ML techniques for diabetes prediction, several weaknesses are evident across these studies. Firstly, there is a lack of consistency regarding dataset characteristics, which hinders the comparability of results. Varied dataset sizes, sources, and features make it challenging to draw conclusive insights or generalize findings. Secondly, the evaluation metrics employed in these studies vary, with some focusing on accuracy, specificity, sensitivity, and area under the curve (AUC), making direct comparisons cumbersome. Additionally, the absence of standardized metrics for assessing model performance across the studies introduces ambiguity and limits the robustness of the comparative analysis. Furthermore, most studies concentrate on a specific subset of ML algorithms, lacking a comprehensive exploration of various models, which could offer a more nuanced understanding of their strengths and weaknesses in diabetes prediction.
In conclusion, the literature reviewed in this paper shows that ML techniques can predict diabetes with high accuracy. Logistic Regression, Decision Trees, SVM, Random Forest, and ANN are the most commonly used ML techniques for diabetes prediction. Furthermore, using a combination of different dataset features as predictors can improve the performance of these techniques. However, more research is needed to investigate the impact of different dataset features on the performance of ML techniques for diabetes prediction.
Our study takes a comprehensive and methodologically rigorous approach to explore the effectiveness of various ML algorithms for predicting and managing diabetes. Unlike some previous works that focused on specific algorithms or lacked detailed feature analysis, we extend our investigation across 11 diverse ML models and employ four distinct real-world datasets, each with unique characteristics. By systematically evaluating predictive performance and feature importance across different datasets, our study aims to enhance the comparability and generalization of results, mitigating the issue of inconsistent dataset characteristics encountered in previous works. Furthermore, our research addresses the variability in evaluation metrics by examining the impact of performance measures such as accuracy, precision, and recall on model assessment. Our rigorous methodology, involving thorough feature scrutiny and validation across diverse datasets, is designed to strengthen the reliability and robustness of our findings, setting a precedent for more comprehensive and conclusive studies in the field.
The six phases in our methodology can be described as follows:
(1) Collect data from four distinct datasets related to diabetes prediction.
(2) Preprocess the data by handling missing values and resampling to mitigate class imbalance.
(3) Perform feature selection to identify and retain the most informative attributes.
(4) Train 11 different ML models to predict diabetes using the refined dataset.
(5) Use cross-validation to evaluate each model by quantifying performance metrics, such as accuracy, precision, recall, f1-Score, and ROC_AUC.
(6) Conduct a comparative analysis to find the best classifier given the most important attributes previously selected.
For our experiments, we use scikit-learn [https://scikit-learn.org/], an ML library for Python. Our experiments were executed inside Google Colab [https://colab.research.google.com], a platform for data science and machine learning. The next section provides a detailed description of the processes involved in each of the six phases.

Performance evaluation

Details of datasets
For our research, we identified four datasets used for diabetes prediction/classification. Each of these datasets contains instances with different attributes of patients and an attribute for the classes of interest. Table 1 shows a summary of these datasets.

Mendeley dataset
This is a publicly accessible dataset [https://data.mendeley.com/datasets/wj9rwkp9c2/1] published in July 2020 by the University of Information Technology [28]. To construct this dataset, the researchers utilized data from Iraqi patients receiving care at the Medical City Hospital laboratory and the Specialized Center for Endocrinology and Diabetes at Al-Kindy Teaching Hospital. The dataset on diabetes was created by systematically reviewing patient files and extracting

Pima Indians Diabetes (PID) dataset
This public dataset [https://data.world/data-society/pima-indians-diabetes-database], originally from the National Institute of Diabetes and Digestive and Kidney Diseases [29], is designed to predict whether a patient has diabetes based on diagnostic measurements. The dataset contains information on 768 patients (i.e., instances). It includes eight characteristics and diagnostic measurements: pregnancies, plasma glucose concentration, blood pressure, skin thickness, insulin levels, body mass index (BMI), diabetes pedigree function, and age.
It is important to note that the dataset is specifically selected to include only female patients of Pima Indian heritage who are at least 21 years old. This specific population was chosen due to the high incidence of diabetes in this group. The dataset also includes a class variable, which indicates whether the patient has diabetes or not (i.e., 0 or 1). There are 268 instances for the positive class and 500 for the negative.

Diabetes Early Stage (DES) dataset
This public dataset [https://www.kaggle.com/datasets/ishandutta/early-stage-diabetes-risk-prediction-dataset] comprises reports of diabetes-related symptoms from 520 individuals. It includes data on symptoms that may indicate the presence of diabetes and demographic information on the individuals surveyed. The dataset was created by conducting a direct questionnaire with individuals who have recently been diagnosed with diabetes or who are nondiabetic but present with one or more diabetes-related symptoms. The data was collected from patients at the Sylhet Diabetes Hospital in Bangladesh [30].
The dataset contains 16 attributes: Age, Sex, Polyuria, Polydipsia, sudden weight loss, weakness, Polyphagia, Genital thrush, visual blurring, Itching, Irritability, delayed healing, partial paresis, muscle stiffness, Alopecia, Obesity. All of these attributes have categorical values, with "Yes" indicating the presence of a symptom and "No" indicating the absence of a symptom. The dataset also includes a class variable with two values, used to determine whether the patient is at risk of developing diabetes (positive) or not (negative). There are 320 instances for the positive class and 200 for the negative.

Vanderbilt dataset
This public dataset [https://data.world/informatics-edu/diabetes-prediction] is based on a study of rural African Americans in Virginia [22]. There are 390 data samples with both male and female patients. It consists of 15 features that help predict diabetes, including Cholesterol, Glucose, HDL Chol, Chol/HDL ratio, Age, Gender, Height, Weight, BMI, Systolic BP, Diastolic BP, waist, hip, and Waist/hip ratio. Except for Gender, which is categorical (i.e., male and female), the other attributes are numerical. The dataset includes a class variable with two values, "Diabetes" and "No diabetes". There are 60 instances for the positive class and 330 for the negative. Figure 2(d) shows the distribution of the final classes.

Experimental setup
The datasets used in this study contain a class feature. This feature contains binary values that indicate if a patient (i.e., instance) has diabetes or not. Therefore, in our study, we are interested in different ML algorithms for classification.
For this study, we selected 11 of the most commonly used algorithms, which were grouped into six categories. These categories are not mutually exclusive; some algorithms can belong to multiple categories. Table 2 shows the categories and the classifiers used in this study.
In other words, we replaced each missing value with the mean of the observed values for that column.
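As an illustration of this imputation step, the following minimal sketch replaces each missing entry with the mean of the observed values in the same column. The function name and the toy values are hypothetical and are not taken from the datasets; scikit-learn offers comparable behavior through `SimpleImputer(strategy="mean")`.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy data with hypothetical values; None marks a missing measurement.
rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
filled = impute_column_means(rows)
print(filled)
```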
Feature engineering is crucial for enhancing the data used to train machine learning models. Removing highly correlated features is a key aspect of feature engineering. Diabetes datasets often contain multiple measurements that exhibit high correlation, such as fasting plasma glucose and HbA1c [42]. Retaining these redundant features can distort and dilute the importance scores during model training. Furthermore, eliminating these highly correlated features mitigates the risk of model overfitting. Therefore, another data preprocessing step is analyzing the correlation between variables. Figure 3 shows the correlation heat maps for the four datasets. A correlation heat map displays the correlation coefficients between multiple variables or features in a dataset. In this figure, red indicates a strong positive correlation, and purple indicates a strong negative correlation. The heat map makes it simple to determine which variables are strongly correlated with one another and which are not. Furthermore, the heat map can be very useful in detecting multicollinearity [43], a phenomenon in which two or more predictor variables are highly correlated with one another. Figure 3(d) shows that in the Vanderbilt dataset there are columns with high correlation (i.e., multicollinearity), of 85% or above, that can lead to unstable and unreliable estimates. The columns "Waist", "Hip", and "Weight" are highly correlated because they are all measures of body size and shape. Research [44] has found that the waist-to-hip ratio (WHR) effectively predicts if a person is at risk of death from heart disease, cancer, diabetes, or any other cause. Therefore, given that the dataset already has a column "Waist/hip ratio", we decided to drop the columns "Waist", "Hip", and "Weight" from this dataset.
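The correlation screening described above can be sketched as follows. This is an illustrative example only: it assumes the Pearson coefficient and a 0.85 threshold (matching the 85% level mentioned for the Vanderbilt dataset), and the column values are invented rather than taken from any of the four datasets.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def highly_correlated_pairs(columns, threshold=0.85):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical measurements: "Waist" and "Hip" move together, "Age" does not.
columns = {
    "Waist": [30, 34, 38, 42, 46],
    "Hip": [36, 39, 43, 46, 50],
    "Age": [50, 23, 61, 35, 44],
}
pairs = highly_correlated_pairs(columns)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```

Pairs flagged in this way are candidates for removal; which member of a pair to drop remains a judgment call, as in the decision above to keep "Waist/hip ratio".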
We used SMOTE [45] to address the class imbalance in the datasets. SMOTE is commonly used in machine learning, particularly in classification tasks, to improve the performance of models on imbalanced datasets. We used a value of 5 to define the neighborhood of samples and to resample all classes but the majority class.
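The core idea of SMOTE can be sketched in a few lines: each synthetic sample is an interpolation between a minority-class point and one of its k nearest minority-class neighbors. The sketch below is a simplified illustration with k = 5, not the implementation used in the experiments; the function name, seed, and data points are hypothetical. In the imbalanced-learn library, a comparable configuration would be `SMOTE(k_neighbors=5, sampling_strategy="not majority")`, although the library actually used is not stated here.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class with 6 samples; majority has 10.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.7), (2.3, 1.9)]
n_majority = 10
synthetic = smote_oversample(minority, n_new=n_majority - len(minority), k=5)
balanced_minority = minority + synthetic
print(len(balanced_minority))
```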
We used a grid search [46] approach to find each model's best hyperparameter settings. Grid search allows us to explore a range of possible hyperparameter settings systematically and then select the model that yields the best performance. This process can reduce the time and effort needed to optimize a model's parameters and improve the predictions' accuracy.
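The exhaustive search can be sketched as below. The grid, the parameter names, and the scoring function are hypothetical placeholders; in practice the score of each combination would be a cross-validated metric of a trained classifier, which scikit-learn automates through `GridSearchCV`.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a kernel SVM-like model.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at C=1, gamma=0.1."""
    return -((params["C"] - 1) ** 2) - (params["gamma"] - 0.1) ** 2


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The cost grows with the product of the grid sizes (nine evaluations in this toy case), so the candidate values for each hyperparameter are usually kept short.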
In summary, these techniques together help prevent overfitting by ensuring models are trained on representative, unbiased, and diverse data with relevant, non-redundant features. Cross-validation helps assess generalization performance, while techniques like SMOTE address class imbalance issues. Additionally, hyperparameter tuning through grid search optimizes models for unseen data. Together, these methods will support our results and the ability to apply the findings more widely.

Performance metrics
We used k-fold cross-validation [47] in our experiments to evaluate the performance of each ML model. The benefits of using k-fold cross-validation lie in its ability to provide a robust and comprehensive assessment of model performance, particularly when working with limited datasets, by iteratively partitioning the data into training and validation sets, thereby reducing the variance in performance estimation [48]. In our experiments, we adopted a value of k = 10, a commonly utilized parameter in similar studies [49]. Figure 4 shows the k-fold configuration strategy followed in this study. With this configuration, we systematically split the data into ten parts: nine for training the model and one for testing. The process is repeated ten times, with each part used for testing once.
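The partitioning described above can be sketched as follows for k = 10. This is a simplified illustration: it splits consecutive indices without shuffling or stratification, which is an assumption of the sketch rather than a statement about the experiments. The sample count of 768 corresponds to the PID dataset.

```python
def k_fold_indices(n_samples, k=10):
    """Return k (train_indices, test_indices) pairs covering all samples once."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(768, k=10)
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train)} training samples, {len(test)} test samples")
```

Each model is trained on the nine training parts and evaluated on the held-out part, and the reported metric is the average over the ten folds.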

Different performance metrics were computed after performing this cross-validation. Table 3 describes the metrics we used in this study. Each of these metrics provides a different perspective on the model's performance, and it is essential to consider all of them when evaluating an ML model. In some cases, good accuracy might not be enough to consider a model good, and precision, recall, and other metrics might be more important, especially with highly imbalanced data.

Precision
It is a measure of the proportion of positive predictions that are actually correct, calculated as the number of true positive predictions divided by the sum of the true positive and false positive predictions.

Recall
It is a measure of the proportion of actual positive cases that are correctly identified by the model, calculated as the number of true positive predictions divided by the sum of the true positive and false negative predictions.

f1-Score
It is a measure of the balance between precision and recall, calculated as the harmonic mean of Precision and Recall, with a higher score indicating a better balance between the two.

ROC_AUC
It is a measure of the model's ability to distinguish between positive and negative classes, with a higher score indicating a better model performance. It is calculated by plotting the true positive rate (recall) against the false positive rate at different classification thresholds and measuring the area under the curve.
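The metric definitions above can be made concrete with a short sketch. The labels, predictions, and scores are invented for illustration. For ROC_AUC, the sketch uses the pairwise formulation (the fraction of positive/negative pairs in which the positive instance receives the higher score, with ties counted as one half), which is equivalent to the area under the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and f1-Score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth, hard predictions, and predicted scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, f1, auc)
```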

Results and discussion
In this section, we present the findings of our study, including the performance of the different ML algorithms on the diabetes datasets. Table 4 shows the best parameters obtained by using grid search for each of the models that had the highest accuracy for the different datasets when using the top 2, top 3, and all features (refer to Sections 4.4.2, 4.4.3, and 4.4.4).

Most important attributes
We used the Random Forest (RF) [50] algorithm to get the most important attributes.We decided to use this technique, given that existing literature [51][52][53] provides substantial evidence through comparative assessments that RF represents an effective data-driven approach for identifying relevant input features across various use cases.
RF is a powerful method that can identify the most informative features in a dataset. The feature importance measure in RF is calculated based on the decrease in impurity that results from using the feature to split the data. RF creates multiple decision trees, each trained on a different subset of the data, and then averages the results from all the trees. The decrease in impurity is calculated as the weighted average of the decrease over all the decision trees in the forest. This measure allows RF to identify the features that are most informative for separating the data into different classes. The feature with the highest decrease in impurity is considered the most important feature. Figure 5 shows the most important features of each dataset. Upon reviewing the plots generated for each dataset, we observed that three features are particularly representative in the Mendeley dataset. To maintain consistency across all datasets, we chose three as the maximum number of features for comparison. Other works related to disease prediction have presented a similar process [54][55].
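To illustrate the impurity-decrease idea behind this ranking, the sketch below computes the decrease for a single split of a single tree. It assumes Gini impurity, which is the default criterion of scikit-learn's Random Forest but is not stated explicitly here; the feature values, labels, and thresholds are invented. A forest aggregates such decreases over all splits and all trees.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children


# Hypothetical data: the first feature separates the classes, the second does not.
labels = [0, 0, 0, 1, 1, 1]
glucose = [85, 90, 100, 150, 160, 170]
bmi = [22, 35, 24, 33, 23, 34]

decrease_glucose = impurity_decrease(glucose, labels, threshold=125)
decrease_bmi = impurity_decrease(bmi, labels, threshold=30)
print(decrease_glucose, decrease_bmi)
```

In this toy case the split on the first feature yields pure child nodes, so it receives the larger decrease and would rank as the more important feature.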
For Mendeley (Figure 5(a)), the top 3 features are "HbA1c", "BMI", and "TG", in that order of importance. HbA1c, also known as glycated hemoglobin, is a blood test used to measure the average blood sugar levels over the past 2-3 months in people with diabetes. A high HbA1c level indicates poor blood sugar control and an increased risk of diabetes-related complications [56]. Body Mass Index (BMI) is often used as an indicator of the risk of developing diabetes or as a way to monitor diabetes management [57]. Finally, high levels of triglycerides (TG) in the blood are associated with an increased risk of developing type 2 diabetes and heart disease [58].
For PID (Figure 5(b)), the top 3 features in order of importance are "Glucose", "BMI", and "Age". The "Glucose" feature measures the amount of sugar in the blood after an OGTT test [59]. After this test, high levels of sugar in the blood may show that a person has diabetes or is at risk of developing it. The "BMI" feature was described above. Lastly, the "Age" feature is relevant in this dataset, mainly because as people get older, their body's ability to use insulin decreases, which can lead to diabetes.
For DES (Figure 5(c)), the top 3 features in order of importance are "Polyuria", "Polydipsia", and "Age". "Polyuria" and "Polydipsia" are symptoms of diabetes that are related to high blood sugar levels [60]. In this dataset, the feature "Age" is also a factor for diabetes. We have mentioned the importance of this feature previously.
Finally, for Vanderbilt, we have "Glucose", "BMI", and "Age" as the top 3 features (Figure 5(d)), in that order of importance. "Glucose" represents the values obtained for a Fasting Blood Sugar (FBS) test [61]. This test measures the amount of glucose in a person's blood after fasting for at least 8 hours. The features "BMI" and "Age" were described previously.
After getting the most important features for the four datasets, we can see that "BMI" appears as an important factor in three of them (DES does not contain a "BMI" attribute). A similar situation happens with "Age". Interestingly, "Age" is not an important factor in Mendeley. This finding highlights the importance of these factors (i.e., "BMI" and "Age") in developing and managing diabetes. Similarly, these results emphasize the need for people to maintain a healthy BMI and to be aware of the risk of diabetes as they age.

Performance evaluation using the top 2 attributes
We evaluated the effectiveness of eleven machine learning algorithms on the top 2 attributes obtained in Section 4.4.1. We used a k-fold cross-validation method for the different datasets with k set to 10.
For the Mendeley dataset, the top 2 attributes are HbA1c and BMI. Table 5 shows that the Random Forest Classifier performed the best in terms of accuracy, recall, f1-Score, and ROC_AUC, with 98.83%, 98.04%, 98.84%, and 98.85%, respectively. The XGBoost model also performed well with 98.77% accuracy, 100% precision, 97.59% recall, 98.77% f1-Score, and 98.79% ROC_AUC. Its recall was only lower than the Random Forest and K-Nearest Neighbor models. The Kernel SVM, AdaBoost, K-Nearest Neighbor, and Multi-Layer Perceptron also performed well with accuracy scores above 98%. The Naive Bayes and Quadratic Discriminant Analysis models had lower accuracy scores but still performed reasonably well. The average accuracy considering all models is 97.59%, with a median of 98.27%. To sum up, the Random Forest Classifier and XGBoost models are the best performers among the models tested.
Glucose and BMI are the top 2 attributes used in the PID dataset. Table 6 shows that the Kernel SVM and the Random Forest Classifier models performed best. Kernel SVM presented the highest accuracy, recall, f1-Score, and ROC_AUC. Naive Bayes presented the best precision. Random Forest Classifier shows the second-best percentages after the Kernel SVM model. Overall, the accuracy obtained from this dataset using only two features is low. The average accuracy considering all models is 73.65% with a median of 73.6%.
Finally, we used "Glucose" and "BMI" as the top 2 features for the Vanderbilt dataset. Table 8 shows that most algorithms performed well, with accuracy scores ranging from 81.36% to 90.45%. The Kernel SVM model obtained the highest value for accuracy, recall, f1-Score, and ROC_AUC. The Naive Bayes model had the best precision with 94.03%, although it presented the worst recall with 66.82%. The models using this dataset obtained an average accuracy of 87.56% with a median of 88.79%. These results are better than those obtained for the PID and DES datasets.
The results showed that using the top two attributes produced a high accuracy of the models in the Mendeley dataset, with an average accuracy of 97.59%. However, the accuracy was lower for the PID dataset, with a score lower than 80%. The average accuracy achieved in both the DES and Vanderbilt datasets, while surpassing the accuracy of the PID dataset, did not attain the levels observed in the Mendeley dataset. Out of the 11 models tested, the Kernel SVM performed the best in terms of accuracy, recall, f1-Score, and ROC_AUC in the PID and Vanderbilt datasets. Although the Random Forest Classifier was the top performer in the Mendeley dataset, the overall performance of the Kernel SVM was still comparable in the same dataset.

Performance evaluation using the top 3 attributes
As in the previous section, we used a k-fold cross-validation strategy to evaluate the performance of the eleven classifiers using the top 3 attributes obtained in Section 4.4.1.
Table 9 shows the results for the Mendeley dataset. The top 3 attributes for this dataset are "HbA1c", "BMI", and "TG". The highest accuracy was achieved by XGBoost (98.89%), followed by Random Forest Classifier (98.83%) and AdaBoost Classifier (98.83%). The lowest accuracy was obtained by Naive Bayes (94.20%). Regarding precision, Kernel SVM performed the best (99.89%), followed by Random Forest Classifier (99.69%). The lowest precision was observed for Naive Bayes (97.18%). For recall, the AdaBoost Classifier performed the best (98.58%), whereas the lowest recall was observed for Quadratic Discriminant Analysis (90.39%). XGBoost (98.89%) performs the best for the f1-Score. The lowest f1-Score was observed for Naive Bayes (94%). Regarding ROC_AUC, XGBoost performed the best (98.91%), and Naive Bayes (94.16%) had the lowest score. Overall, the results suggest that XGBoost is the best-performing model based on the metrics used for this dataset.
In Table 10, we can find the results for the top 3 attributes in the PID dataset. These attributes are "Glucose", "BMI", and "Age". The results show that the best-performing algorithm in terms of accuracy is the Kernel SVM model, with a score of 79%. The Quadratic Discriminant Analysis had the lowest accuracy score, 71.7%. Regarding precision, the Naive Bayes algorithm presented the highest score with 77.08%. The Decision Tree Classifier had the lowest precision score, 72.37%. For recall, the K-Nearest Neighbor had the highest score with 86.16%. The Quadratic Discriminant Analysis had the lowest recall score, 66.67%. The K-Nearest Neighbor had the highest f1-Score of 80.1%. The Quadratic Discriminant Analysis had the lowest f1-Score of 69.79%. Finally, the Kernel SVM model had the highest ROC_AUC score. The Quadratic Discriminant Analysis had the lowest ROC_AUC score of 71.89%.
The top 3 features for the DES dataset are "Polyuria", "Polydipsia", and "Age". Table 11 shows the results for the algorithms when using these features. The table shows that the Kernel SVM model achieves the highest accuracy, 89.69%. K-Nearest Neighbor achieves the lowest accuracy, 56%. Regarding precision, the highest is achieved by the Kernel SVM with 98.47%, and the K-Nearest Neighbor achieves the lowest with 79.50%. Three models present the highest recall: Logistic Regression, SVC with a linear kernel, and Quadratic Discriminant Analysis, all with 85.93%. The Kernel SVM model achieves the lowest recall with 80.63%. The Random Forest Classifier achieves the highest f1-Score with 88.63%. K-Nearest Neighbor achieves the lowest f1-Score with 82.19%. Finally, the highest ROC_AUC is achieved by the Kernel SVM model with 89.73%. K-Nearest Neighbor achieves the lowest ROC_AUC with 81.52%. In general, it can be seen that the Kernel SVM model performs the best in terms of accuracy, precision, and ROC_AUC.
Table 12 shows the results for the algorithms when using the top 3 features (i.e., "Glucose", "BMI", and "Age") for the Vanderbilt dataset. The top model in terms of accuracy is Multi-Layer Perceptron, with an accuracy score of 93.64%. The model with the lowest accuracy is Naive Bayes, with 85.45%. In terms of precision, again, Multi-Layer Perceptron had the highest score with 92.52%, whereas K-Nearest Neighbor had the lowest with 87.76%. Multi-Layer Perceptron also presents the highest recall with 95.19%, and Naive Bayes the lowest with 76.04%. A similar situation is presented for f1-Score and ROC_AUC. Multi-Layer Perceptron presents the best results, with 93.76% and 93.55%, respectively. Naive Bayes presents the lowest values with 83.84% and 85.57% for f1-Score and ROC_AUC.
The results showed that using the top three attributes produced high model accuracy in the Mendeley dataset, with an average accuracy of 97.58%. The accuracy was lower for the PID dataset, with an average score of 75.7%. The average accuracies obtained for the DES (88%) and Vanderbilt (89.3%) datasets, while higher than for the PID dataset, did not reach the high values observed in the Mendeley dataset. In general, Multi-Layer Perceptron obtained the best results in all metrics for the Vanderbilt dataset.
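As a reminder of how the threshold-based metrics reported in these tables relate to one another, the following minimal sketch computes accuracy, precision, recall, and f1-Score from binary predictions. The function name binary_metrics and the toy labels are assumptions made for illustration only; they are not the code or data used in this study, and ROC_AUC is omitted because it depends on how the classifier scores are produced.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, purely illustrative (not taken from any of the four datasets).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

Because precision and recall weight false positives and false negatives differently, a model can lead on one metric and trail on another, which is the pattern seen in several of the tables above.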

Performance evaluation using all attributes
Analyzing the top 2 and top 3 attributes in a dataset can be helpful for classification because it reduces the dimensionality of the data and makes it easier to visualize and interpret. Focusing on this small number of attributes makes it easier to understand the relationships between the features and the target variable and to identify patterns and trends. However, it is essential to compare the results for the top 2 and top 3 attributes with those obtained using all attributes in the dataset for the same task. This comparison is critical since reducing the number of attributes may result in a loss of information and may not accurately represent the full complexity of the data. Similarly, using a subset of the data can result in suboptimal classification performance, as we could miss important relationships between features and the target variable.
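The three feature sets compared in this work (top 2, top 3, and all attributes) can be derived from any feature ranking. The sketch below assumes the ranking is available as a dictionary of importance scores; the helper names top_k_features and project, the score values, and the sample row are hypothetical placeholders and do not reproduce the importances or records from this study.

```python
def top_k_features(importances, k=None):
    """Return feature names sorted by descending importance; keep the first k if given."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked if k is None else ranked[:k]


def project(rows, features):
    """Keep only the selected features (in ranked order) for every row."""
    return [[row[name] for name in features] for row in rows]


# Placeholder importance scores (hypothetical values, for illustration only).
importances = {"Age": 0.10, "HbA1c": 0.40, "Chol": 0.05, "BMI": 0.30, "TG": 0.15}

# One hypothetical record; a real dataset would contain many rows.
rows = [{"Age": 50, "HbA1c": 4.9, "Chol": 4.2, "BMI": 24.0, "TG": 0.9}]

feature_sets = {
    "top2": top_k_features(importances, 2),
    "top3": top_k_features(importances, 3),
    "all": top_k_features(importances),
}
X_top2 = project(rows, feature_sets["top2"])
```

Each of the three projected datasets would then be passed to the same classifiers and the same cross-validation procedure, so that differences in the scores can be attributed to the feature set alone.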
In this section, we present the results obtained from the models when using all the attributes for each dataset. Table 13 shows the results for the Mendeley dataset. The Random Forest Classifier has the highest accuracy, with a score of 99.67%. The worst algorithm in terms of accuracy is Naive Bayes, with 92.75%. The Random Forest Classifier, K-Nearest Neighbor, and Multi-Layer Perceptron have the best precision, scoring 100%. The worst algorithm in terms of precision is Naive Bayes, with 95.49%. As in the previous metrics, the best algorithm in terms of recall is the Random Forest Classifier, with 99.36%. The worst algorithm in terms of recall is Naive Bayes, with 89.81%. Once again, the Random Forest Classifier has the best f1-Score, with a score of 99.67%. Similar to the other metrics, the worst algorithm in terms of f1-Score is Naive Bayes, with 92.53%. Finally, regarding ROC_AUC, the Random Forest Classifier and Logistic Regression are the winners with 99.68%. The worst algorithm in terms of ROC_AUC is Naive Bayes, with 92.68%. Overall, the Random Forest Classifier is the algorithm with the highest numbers for all five metrics.
Table 14 shows the results for all the models with the PID dataset. The Kernel SVM model has the highest
Table 15 displays the results of all the models tested on the DES dataset. The Random Forest Classifier emerged as the top performer, while the Naive Bayes model was the weakest. The accuracy scores were 98.12% and 88.44% for Random Forest Classifier and Naive Bayes, respectively. Similarly, the precision scores were 98.7% and 87.72%. Random Forest Classifier also had a higher recall score of 97.53%, compared to 89.35% for Naive Bayes. The f1-Score showed a similar pattern, with Random Forest Classifier scoring 98.07% and Naive Bayes 88.42%. Finally, regarding ROC_AUC, the two models again differed widely, with the Random Forest Classifier scoring 98.18% and Naive Bayes scoring 88.52%.
Finally, Table 16 displays the results of all the models tested on the Vanderbilt dataset. The Multi-Layer Perceptron had the best accuracy performance with 96.52%, while the Naive Bayes model had the lowest with 84.55%. For precision, Kernel SVM scored the best at 99.71%, and K-Nearest Neighbor presented the lowest score with 86.61%. However, the K-Nearest Neighbor showed the best recall with 98.91%. Naive Bayes had the weakest score for the same metric, with 78.84%. The Multi-Layer Perceptron model had the best performance for f1-Score and ROC_AUC, with 96.6% and 96.49%, respectively. Naive Bayes obtained the lowest f1-Score with 83.55%, and the Quadratic Discriminant Analysis model obtained the lowest score for ROC_AUC with 86.55%. Overall, the Multi-Layer Perceptron model presented the best overall performance in this dataset.
The results showed that using all attributes produced an average accuracy above 90% for the Mendeley, DES, and Vanderbilt datasets. The average accuracy for the PID dataset was only 78.06%. The highest average accuracy was obtained in the Mendeley dataset, with 97.6%. Overall, using all features, the best results were obtained using the Random Forest Classifier.

Overall comparison of classifiers based on the number of features
In this section, we compare the accuracy of the 11 algorithms when using the top 2, top 3, and all features for training and testing. By comparing the accuracy metric for different feature subsets, we can identify which features are most significant for accurate predictions and which algorithms perform best with a smaller or larger number of features. This information can help optimize the performance of the machine learning models, reduce the number of features needed for accurate predictions, and increase the models' efficiency.
Figure 6 compares the accuracy of the machine learning models for the Mendeley dataset. The top 3 attributes in this dataset are "HbA1c", "BMI", and "TG". We can see that 54.5% of the algorithms (n = 6) present a higher accuracy when using all the features. These algorithms are the Multi-Layer Perceptron, AdaBoost Classifier, Kernel SVM, SVC, XGBoost, and Random Forest Classifier. The average accuracy for these algorithms is 99.06%. The Random Forest Classifier presented the best accuracy with 99.67%. From this figure, we can also see that 45.5% (n = 5) of the algorithms had better accuracy when using the top 2 features. These algorithms are Quadratic Discriminant Analysis, K-Nearest Neighbor, Naive Bayes, Decision Tree Classifier, and Logistic Regression. The average accuracy for these algorithms is 96.58%. Interestingly, no algorithm achieved its best accuracy when using the top 3 features. These results suggest that using the top 3 features did not provide significant additional information beyond the top 2 features or that the additional features may have introduced noise or increased complexity to the model, leading to decreased performance.
Figure 7 compares the accuracy of the machine learning models for the PID dataset. The top 3 attributes for this dataset are "Glucose", "BMI", and "Age". We can see that the accuracy score for all the classifiers is lower than 84%. Of the different algorithms, 90.9% (n = 10) performed better when using all the features. The average accuracy for these algorithms was 79.61%. Kernel SVM obtained the best accuracy with 82.4%. Only Naive Bayes obtained a better accuracy (74.9%) when using the top 3 features. We see a consistent pattern in the results except for Quadratic Discriminant Analysis and Naive Bayes: the accuracy is worst when using only the top 2 features, slightly better when using the top 3 features, and best when using all the features.
Figure 8 shows the accuracy of the eleven algorithms for the DES dataset. The top 3 attributes for this dataset are "Polyuria", "Polydipsia", and "Age". Except for Naive Bayes, the rest of the algorithms, 90.9% (n = 10), obtained the best accuracy when using all the features. The average accuracy for these algorithms was 94.9%. In the case of Naive Bayes, the algorithm obtained the same accuracy (84.44%) for the top 2 features and all the features. Its best accuracy was also the lowest among the best accuracies of all the algorithms. As in the previous dataset, we can see a consistent pattern in seven (63.6%) of the algorithms (Quadratic Discriminant Analysis, K-Nearest Neighbor, AdaBoost Classifier, Kernel SVM, Decision Tree Classifier, XGBoost, and Random Forest Classifier). The K-Nearest Neighbor presents the worst accuracy with 50% when using only the top 2 attributes. This result is the lowest score obtained by an algorithm in all datasets.
Figure 9 shows the accuracy of the eleven algorithms for the Vanderbilt dataset. The top 3 attributes for this dataset are "Glucose", "BMI", and "Age". Most of the algorithms, 72.7% (n = 8), obtained the best accuracy when using all the features. Only XGBoost and AdaBoost Classifier obtained the best results when using three features. None of the algorithms performed the best with just two features. Overall, the Random Forest Classifier presented the best accuracy with 97.79%. Naive Bayes obtained the worst accuracy, 81.36%, when using only two attributes.
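The shares quoted in this section (for example, n = 10 of 11 algorithms, or 90.9%) come from counting, for each algorithm, which feature set gives its highest accuracy. A minimal sketch of that tally follows. The model names and accuracy values are placeholders invented for illustration, not results from this study, and resolving ties in favor of the smaller feature set is an assumption of this sketch.

```python
from collections import Counter

FEATURE_SETS = ("top2", "top3", "all")  # ordered from smallest to largest


def best_feature_set(acc):
    """Return the feature set with the highest accuracy.

    max() keeps the first maximal element, so ties go to the smaller
    feature set (an assumption made for this sketch).
    """
    return max(FEATURE_SETS, key=lambda name: acc[name])


def winner_shares(results):
    """Map each feature set to (count, percentage) of algorithms it wins."""
    counts = Counter(best_feature_set(acc) for acc in results.values())
    n = len(results)
    return {name: (counts[name], round(100 * counts[name] / n, 1))
            for name in FEATURE_SETS}


# Placeholder accuracies for four hypothetical models (illustrative only).
results = {
    "model_a": {"top2": 0.90, "top3": 0.91, "all": 0.95},
    "model_b": {"top2": 0.88, "top3": 0.86, "all": 0.87},
    "model_c": {"top2": 0.80, "top3": 0.84, "all": 0.84},
    "model_d": {"top2": 0.70, "top3": 0.75, "all": 0.79},
}
shares = winner_shares(results)
```

With eleven algorithms, the same computation gives 54.5% for n = 6, 45.5% for n = 5, 72.7% for n = 8, and 90.9% for n = 10.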

Discussion
The literature reviewed in this study highlights the efforts of developing effective methods for predicting and preventing diabetes, given its high prevalence and potential for severe complications if not managed properly. Machine learning (ML) techniques have shown promising results in accurately predicting diabetes, with Logistic Regression, Decision Trees, SVM, Random Forest, and ANN being the most commonly used ML techniques. Using a combination of different dataset features as predictors can further improve the performance of these techniques. In this work, we used four different diabetes datasets, as presented in Section 4.1. Our findings confirm that ML methods hold great potential for predicting and managing diabetes, thereby benefiting public health.
In diabetes prediction, accuracy can vary depending on the specific ML algorithm. Although we compare the different sets of features in terms of accuracy, it is important to note that this metric is just one performance measure.
For the Mendeley dataset, the most important attributes based on the analysis performed using Random Forest were "HbA1c", "BMI", and "TG", in that order. When using only the top 2 attributes, the best accuracy was obtained using the Random Forest Classifier with 98.83%. However, when using the top 3 features, the algorithm presenting the best accuracy was XGBoost, with 98.89%. The Random Forest Classifier was also the algorithm with the best accuracy, 99.67%, when all the attributes were used. Table 17 shows these results and presents the percentage of increment in the accuracy gained using the top 2, 3, and all features. The table shows that the increment in accuracy obtained by moving from the top 3 features to all features is small: less than a 1% gain in accuracy.
In the PID dataset, the most important attributes obtained by the Random Forest technique were "Glucose", "BMI", and "Age", in that order. For this dataset, the algorithm with the best performance, using the top 2, 3, and all features, was Kernel SVM. However, the accuracy obtained in this dataset was lower than in the others. Table 18 shows these results and presents the percentage of increment in the accuracy gained using the top 2, 3, and all features. We can see a 4.3% accuracy increment from using all the features compared to using only the top 3.
After applying the Random Forest algorithm, the most important attributes for the DES dataset were "Polyuria", "Polydipsia", and "Age", in this order of importance. Three algorithms presented the best performance. Logistic Regression had the best accuracy when using the top 2 attributes, Kernel SVM for the top 3 attributes, and Random Forest Classifier when using all attributes. Table 19 shows that using all features gave a 9.4% accuracy increment compared with only using the top 3 features. This gain is considerable given that the accuracy obtained using only the top 2 or 3 features is less than 90%.
Finally, for the Vanderbilt dataset, the most important attributes found by the Random Forest algorithm were "Glucose", "BMI", and "Age". Three algorithms presented the best performance for accuracy. Kernel SVM obtained the best results when using the top 2 features. The Multi-Layer Perceptron algorithm was the best when the top 3 features were used. Ultimately, the Random Forest Classifier obtained the best results using all the features. Table 20 compares these results and shows the increment in accuracy obtained using the different features. We can see a 4.43% accuracy increment when using all the features compared to using only the top 3.
From these results, we can conclude that using all the features produces an increment in accuracy. However, as the results for the Mendeley dataset show, even with the top 2 attributes, the accuracy of the predictions is high at 98.83%. Therefore, we can infer that feature selection is a crucial step in machine learning, as it reduces the problem's dimensionality and increases the model's interpretability. Our study aimed to evaluate 11 powerful machine learning algorithms on four datasets using three sets of features to predict patient outcomes. After thoroughly analyzing 132 results (i.e., 11 × 4 × 3), we found that Kernel SVM scored the highest in five, while Random Forest Classifier achieved the highest accuracy in four. These two algorithms outperformed the others and are highly suitable for accurately predicting patient outcomes.
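The increments quoted above are consistent with a relative change, that is, the gain in accuracy divided by the top 3 accuracy. The following sketch recomputes them from the best accuracies reported in the text for the top 3 and all-feature settings. The function name pct_increment is an assumption, and the relative-change formula is inferred from the reported numbers rather than stated by Tables 17 to 20.

```python
def pct_increment(baseline, improved):
    """Relative increment of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Best accuracies (%) reported in the text for each dataset.
best_accuracy = {
    "Mendeley":   {"top3": 98.89, "all": 99.67},  # XGBoost -> Random Forest
    "PID":        {"top3": 79.0,  "all": 82.4},   # Kernel SVM -> Kernel SVM
    "DES":        {"top3": 89.69, "all": 98.12},  # Kernel SVM -> Random Forest
    "Vanderbilt": {"top3": 93.64, "all": 97.79},  # MLP -> Random Forest
}

gains = {
    name: round(pct_increment(acc["top3"], acc["all"]), 2)
    for name, acc in best_accuracy.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% increment from top 3 to all features")
```

Under this reading, the Mendeley gain stays below 1%, while the PID, DES, and Vanderbilt gains come out at roughly 4.3%, 9.4%, and 4.43%, matching the values discussed above.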
While this study makes valuable contributions in evaluating machine learning techniques for diabetes prediction, certain limitations remain to be acknowledged. Though comprising real patient data, the datasets have relatively small sample sizes, between 390 and 1,000 instances. Testing on larger and more diverse datasets could yield additional insights. Furthermore, our work focused solely on structured tabular data features; incorporating unstructured inputs like clinical notes or medical images using deep learning approaches presents a promising opportunity.

Conclusions and future work
Early diagnosis of diabetes can help reduce the mortality rate due to the complications and risks to the patient from this disease. In this paper, we study the performance of 11 different ML algorithms for predicting diabetes in individuals using four different datasets: the Mendeley, PID, DES, and Vanderbilt datasets. We evaluated and presented the performance analysis of these ML models, selecting the top 2 and 3 attributes obtained from feature selection methods. We used k-fold cross-validation to analyze the performance on different metrics: accuracy, precision, recall, f1-Score, and ROC_AUC. Our experiments have shown that different algorithms perform differently on various datasets, highlighting the importance of evaluating multiple models to choose the best-performing one. The findings of this study could have significant implications for the medical community in developing accurate predictive models for diagnosing and treating patients. By utilizing the most effective algorithms and incorporating the right features, we can develop reliable models to improve patient outcomes.
Since a larger dataset can provide more information for a model to learn from, our first suggestion for future work is to collect and use more data to train more accurate and robust models. We also recommend using different datasets and combining their features as predictors to further improve the performance of the ML techniques. Similarly, an extension of this work would explore ensembles of different ML algorithms to achieve higher accuracy, as well as a variety of deep learning methods that have proved instrumental in other healthcare domains. We are also interested in applying this study in real-life settings, such as wearable devices and web applications, for real-time diabetes prediction and forecasting. Finally, an innovative proposal would be to use interpretable models that can provide insights into the factors driving diabetes prediction and management.

Figure 1. Methodology for this study

such as medical history, laboratory analysis results, and patient characteristics. The dataset includes 1,000 instances with 12 attributes: No. of Patient, Age, Gender, Creatinine ratio (Cr), Body Mass Index (BMI), Urea, Cholesterol (Chol), Fasting lipid profile (LDL, VLDL), Triglycerides (TG), HDL Cholesterol, and HbA1c. Similarly, it contains the classes for diabetic (Y), non-diabetic (N), and pre-diabetic (P). For this study, we consider the pre-diabetic (P) class as part of the diabetic (Y) class. For the positive class, we have 897 instances and 103 for the negative. Figure 2(a) shows the distribution of the final classes.

Figure 2. Class distribution for the different datasets

Figure 2(b) shows the distribution of the final classes.

Figure 2(c) shows the distribution of the final classes.

Figure 3. Correlation heatmaps for the different datasets

Figure 5. Most important features for each dataset

Figure 6. Comparing the accuracy using the top 2, top 3, and all features for the Mendeley dataset

Figure 7. Comparing the accuracy using the top 2, top 3, and all features for the PID dataset

Figure 8. Comparing the accuracy using the top 2, top 3, and all features for the DES dataset

Figure 9. Comparing the accuracy using the top 2, top 3, and all features for the Vanderbilt dataset

Table 1. Summary of diabetes datasets

Table 3. Metrics used for evaluating the models
Metric | Description
Accuracy | It is a measure of the proportion of correct predictions made by the model, calculated as the number of correct predictions divided by the total number of predictions.

Table 4. Best parameters

Table 5. Performance analysis for the classifiers on top 2 attributes for the Mendeley dataset

Table 7. Performance analysis for the classifiers on top 2 attributes for the DES dataset

Table 8. Performance analysis for the classifiers on top 2 attributes for the Vanderbilt dataset

Table 9. Performance analysis for the classifiers on top 3 attributes for the Mendeley dataset

Table 12. Performance analysis for the classifiers on top 3 attributes for the Vanderbilt dataset

Table 13. Performance analysis for the classifiers on all attributes for the Mendeley dataset

Table 15. Performance analysis for the classifiers on all attributes for the DES dataset

Table 16. Performance analysis for the classifiers on all attributes for the Vanderbilt dataset

Cloud Computing and Data Science Volume 5 Issue 1|2024| 177

Table 17. Percentage of increment based on the accuracy for the top 2, 3, and all features for the Mendeley dataset

Table 18. Percentage of increment based on the accuracy for the top 2, 3, and all features for the PID dataset

Table 19. Percentage of increment based on the accuracy for the top 2, 3, and all features for the DES dataset

Table 20. Percentage of increment based on the accuracy for the top 2, 3, and all features for the Vanderbilt dataset