Predicting Type 2 Diabetes Mellitus using Machine Learning Algorithms

Joyece Jane


Multiple, 2023

Purpose: To build an effective prediction model, based on machine learning (ML) algorithms, for the risk of type 2 (non-insulin-dependent) diabetes mellitus (T2DM). Methods: I developed two ML prediction models based on extreme gradient boosting (XGBoost) and logistic regression (LR). To evaluate the models I used the Pima Indian Diabetes Dataset (PIDD). The dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases and consists of 500 non-diabetic patients and 268 diabetic patients. Results: Model performance was evaluated using six performance criteria. The XGBoost model outperformed logistic regression, achieving an area under the receiver operating characteristic curve (AUROC) of 85%, sensitivity of 71%, specificity of 81%, accuracy of 77%, precision of 67%, and F1-score of 69%. Conclusion: This study showed that the XGBoost ML algorithm can be applied to identify individuals at high risk of T2DM at an early stage, which has strong potential to help control diabetes mellitus.


Predicting Type 2 Diabetes Mellitus using Machine Learning Algorithms
Nisreen Sulayman*
* Biomedical Engineering Department, Mechanical and Electrical Engineering Faculty, Damascus University, Damascus, Syria. sulayman.nisreen@gmail.com, nisreen.sulayman@damascusuniversity.edu.sy

Abstract
Purpose: To build an effective prediction model, based on machine learning (ML) algorithms, for the risk of type 2 (non-insulin-dependent) diabetes mellitus (T2DM).
Methods: I developed two ML prediction models based on extreme gradient boosting (XGBoost) and logistic regression (LR). To evaluate the models I used the Pima Indian Diabetes Dataset (PIDD). The dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases and consists of 500 non-diabetic patients and 268 diabetic patients.
Results: Model performance was evaluated using six performance criteria. The XGBoost model outperformed logistic regression, achieving an area under the receiver operating characteristic curve (AUROC) of 85%, sensitivity of 71%, specificity of 81%, accuracy of 77%, precision of 67%, and F1-score of 69%.
Conclusion: This study showed that the XGBoost ML algorithm can be applied to identify individuals at high risk of T2DM at an early stage, which has strong potential to help control diabetes mellitus.
Keywords: Type 2 Diabetes Mellitus, Machine Learning, XGBoost Model, Logistic Regression
Citation: Sulayman, N. (2022). Predicting Type 2 Diabetes Mellitus using Machine Learning Algorithms. Tishreen University Journal - Engineering Sciences Series, 44(5), 89-100.
Abstract (translated from the Arabic)
Predicting Type 2 Diabetes Mellitus Using Machine Learning Algorithms
Dr. Eng. Nisreen Sulayman*
* Department of Biomedical Engineering, Faculty of Mechanical and Electrical Engineering, Damascus University, Damascus, Syria. sulayman.nisreen@gmail.com, nisreen.sulayman@damascusuniversity.edu.sy
Purpose: To build an effective model for predicting type 2 (non-insulin-dependent) diabetes mellitus using machine learning algorithms.
Methods and materials: Two models for predicting type 2 diabetes were developed using two machine learning algorithms: extreme gradient boosting and logistic regression. The two models were tested on the Pima Indian Diabetes dataset (PIDD) from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset consists of 500 people without diabetes and 268 patients with type 2 diabetes.
Results and discussion: Six parameters were used to evaluate the performance of the two models. The extreme gradient boosting model outperformed the logistic regression model, with the following performance: area under the curve 85%, sensitivity 71%, specificity 81%, accuracy 77%, precision 67%, and F1 score 69%. The study showed that the extreme gradient boosting model can be used to predict the risk of type 2 diabetes.
Keywords: type 2 diabetes mellitus, machine learning, extreme gradient boosting model, logistic regression

1- Introduction
Diabetes mellitus is a chronic metabolic disease characterized by elevated levels of blood glucose. The most common form is type 2 (non-insulin-dependent) diabetes mellitus (T2DM), usually seen in adults, which occurs when the body becomes resistant to insulin or does not make enough insulin. There is a globally agreed target to halt the rise in diabetes by 2025. About 422 million people have diabetes, the majority living in low- and middle-income countries, and 1.5 million deaths are attributed to diabetes each year. Both the number of cases and the prevalence of diabetes have been increasing over the past few decades [1].
The number of people with diabetes rose from 108 million in 1980 to 422 million in 2014. Prevalence has been rising more rapidly in low- and middle-income countries than in high-income countries. Diabetes is a major cause of kidney failure, heart attacks, blindness, stroke, and lower limb amputation [2]. According to the International Diabetes Federation, 537 million adults are living with diabetes, a number projected to rise to 643 million by 2030 and 783 million by 2045. Over 3 in 4 adults with diabetes live in low- and middle-income countries [3]. The rising incidence of diabetes imposes a significant burden on individuals, health systems, and society as a whole [4, 5]. T2DM is irreversible but preventable [6]. It is therefore essential to have an effective model that predicts the onset of T2DM and so helps identify people at high risk early. The sharp increase in the number of people with diabetes mellitus creates demand for systems built on the most effective available technology, such as machine learning algorithms, that provide accurate diabetes predictions and thereby help avoid or reduce the common comorbidities and complications of diabetes. Although plenty of research has been conducted on T2DM prediction, obstacles remain because study populations, dataset sources, and features differ between studies. Further studies are therefore still required in this area.
The rest of this paper is arranged as follows. Section 2 discusses related studies. Section 3 describes the materials and methods in detail. Section 4 presents the results, discusses them, and compares them with those previously reported in the literature. Section 5 concludes the paper.

2- Related work
In recent years, machine learning (ML) algorithms have been applied in the medical field.
They have proven to be efficient in disease diagnosis [7,8], treatment [9,10], and prognosis [11,12]. Predictive models based on ML algorithms can be useful in identifying and predicting the risk of T2DM in individuals [13].
Ghosh et al. (2021) compared different ML algorithms for detecting diabetes. They used four ML algorithms: Gradient Boosting (GB), Support Vector Machine (SVM), AdaBoost (AB), and Random Forest (RF). The algorithms were evaluated using seven performance metrics with 10-fold cross-validation. The best results were obtained with RF after the features were selected with the minimal redundancy maximal relevance feature selection approach [14].
Chen et al. (2017) proposed a hybrid prediction model to help diagnose type 2 diabetes. In the proposed model, the K-means clustering algorithm is used for data reduction and a J48 decision tree is used as the classifier. For their experiments they used the Pima Indians Diabetes Dataset (PIDD) from the UCI machine learning repository. The proposed model reached 90% accuracy, which compares favorably with other studies [15].
Sisodia and Sisodia (2018) designed a model to predict the likelihood of diabetes in patients with maximum accuracy. They used three machine learning classification algorithms, namely Decision Tree (DT), SVM, and Naïve Bayes, to detect diabetes at an early stage. The assessment was performed on the Pima Indians Diabetes Database (PIDD). The performance of all three algorithms was evaluated on measures such as precision, accuracy, F-measure, and sensitivity. The results show that Naïve Bayes outperformed the other algorithms, achieving the highest accuracy [16].
Karthikeyani and Begum (2013) compared the results of ten supervised data mining algorithms using five performance criteria.
They used partial least squares (PLS) to extract features from PIDD and Linear Discriminant Analysis (LDA) to build a model for predicting T2DM. PLS-LDA was the best of the ten algorithms, with an accuracy of 74.40%. The best results were achieved with the Tanagra tool (a data mining software suite) [17].

3- Materials and Methods
The proposed procedure is summarized in figure 1. It shows the flow of the study conducted in constructing the machine learning predictive model.
[Figure 1. Diabetes Prediction Model: Diabetes Dataset -> Data Preprocessing -> Machine Learning Algorithms -> Performance Evaluation.]

3-1 Dataset
The dataset used in this study is the Pima Indian Diabetes Dataset (PIDD). It is from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim of the PIDD is to diagnostically predict whether or not an individual has diabetes. Several constraints were placed on the selection of the instances from a larger database. In particular, all patients in the dataset are females at least 21 years old of Pima Indian heritage. The dataset is available at the Kaggle repository [18]. The dataset consists of 500 non-diabetic patients and 268 diabetic patients. Each patient has eight medical predictor features and one target variable. The predictor features are the number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT), diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (µU/ml), body mass index (BMI) (weight in kg / (height in m)^2), diabetes pedigree function, and age (years). The target variable is binary: zero indicates a non-diabetic patient and one indicates a diabetic patient. Table 1 presents descriptive statistics of the medical predictor features of the PIDD participants. Figure 2 shows a pair plot matrix of the medical predictor features of the PIDD participants. It gives a preliminary view of the pair-wise relationships between the medical features.
Table 1. Descriptive statistics of the medical predictor features of the PIDD participants.

Feature                       Mean      Std       Min      Max
Pregnancies                   3.84      3.36      0.00     17.00
Glucose                       120.89    31.97     0.00     199.00
Blood Pressure                69.10     19.35     0.00     122.00
Skin Thickness                20.53     15.95     0.00     99.00
Insulin                       79.79     115.24    0.00     846.00
BMI                           31.99     7.88      0.00     67.10
Diabetes Pedigree Function    0.47      0.33      0.078    2.42
Age                           33.24     11.76     21.00    81.00

Std: standard deviation; Min: minimum value of the feature; Max: maximum value of the feature.

3-2 Data preprocessing
Data preprocessing was applied before training the ML algorithms. As table 1 shows, glucose, blood pressure, skin thickness, insulin, and BMI have a minimum value of zero, which is not physiologically valid and indicates a missing value. These values were handled in three steps. First, the zero values were replaced with Not a Number (NaN). Second, the distribution of each feature was examined by drawing its histogram: glucose concentration and diastolic blood pressure had a left-skewed distribution, while skin thickness, insulin, and body mass index had a right-skewed distribution. Finally, NaN values were imputed with the mean for glucose concentration and diastolic blood pressure, and with the median for skin thickness, insulin, and body mass index. The input features were then standardized in Python to mean 0 and variance 1 using the StandardScaler class from the scikit-learn preprocessing module.

3-3 Machine learning algorithms
This section briefly discusses the ML algorithms used in this study.

3-3-1 Extreme Gradient Boosting (XGBoost)
XGBoost is a scalable and efficient implementation of the gradient boosting ensemble algorithm. It is widely used for regression, classification, and ranking problems and is regarded as one of the strongest machine learning algorithms for supervised learning.
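As an illustration of the preprocessing steps in section 3-2, the following minimal Python sketch applies them to a small made-up table. It is not the code used in the study. The column names are assumed to follow the header of the Kaggle CSV file, and the six rows are hypothetical values chosen only so that each affected column contains at least one zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows for illustration only; these are not PIDD records.
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 0, 40, 74],
    "SkinThickness": [35, 29, 0, 23, 35, 0],
    "Insulin":       [0, 0, 0, 94, 168, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 0.0, 25.6],
    "Age":           [50, 31, 32, 21, 33, 30],
}).astype(float)

mean_cols = ["Glucose", "BloodPressure"]           # imputed with the mean
median_cols = ["SkinThickness", "Insulin", "BMI"]  # imputed with the median

# Step 1: treat zeros in these columns as missing values.
zero_as_missing = mean_cols + median_cols
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Step 2 (inspecting histograms) is visual and omitted here.

# Step 3: impute with the column mean or median, as described in the text.
df[mean_cols] = df[mean_cols].fillna(df[mean_cols].mean())
df[median_cols] = df[median_cols].fillna(df[median_cols].median())

# Standardize every feature to mean 0 and variance 1.
X_scaled = StandardScaler().fit_transform(df)
print(df.round(2))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

For brevity the sketch computes the imputation and scaling statistics on the whole table. The text does not say whether the study computed them before or after the train/test split, so that ordering is left open here.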
XGBoost is popular among data scientists because of its high performance and computational speed. A detailed description of how XGBoost works is available in [19].
Figure 2. Pair plot matrix of medical predictor features.

3-3-2 Logistic Regression
Logistic regression (LR) is a traditional classification algorithm that models the relationship between a categorical dependent variable (the outcome) and one or more independent variables (the input features) using the sigmoid function [20]. LR is a simple prediction method that provides baseline accuracy values against which nonparametric machine learning algorithms can be compared [21].

3-4 Evaluation metrics
The dataset was randomly divided into two parts: a training set with 80% of the records (n = 614) and a test set with the remaining 20% (n = 154). The training set was used to train the logistic regression and XGBoost models, and the test set was used to evaluate them. The training set is independent of the test set. The hyperparameters for XGBoost were as follows: learning_rate = 0.1, max_depth = 5, n_estimators = 10, seed = 42. Six performance metrics were used to evaluate the prediction performance of the logistic regression and XGBoost models: accuracy, sensitivity, precision, specificity, the area under the receiver operating characteristic curve (AUROC), and F1-score. Sensitivity is the percentage of diabetic patients who are correctly predicted as having diabetes. Specificity is the percentage of non-diabetic patients who are correctly predicted as having no diabetes. Equations (1)-(5) define each metric other than AUROC.
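The 80/20 split and the model configuration described in section 3-4 could be set up roughly as follows. This is a sketch under stated assumptions, not the study's code: synthetic data from make_classification stands in for PIDD, the random_state of the split is an arbitrary choice, random_state=42 plays the role of the reported seed, and scikit-learn's GradientBoostingClassifier is substituted when the xgboost package is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hyperparameters as reported in the text (seed expressed as random_state).
params = dict(learning_rate=0.1, max_depth=5, n_estimators=10, random_state=42)
try:
    from xgboost import XGBClassifier
    boosted = XGBClassifier(**params)
except ImportError:
    # Fallback so the sketch still runs; this is NOT XGBoost.
    from sklearn.ensemble import GradientBoostingClassifier
    boosted = GradientBoostingClassifier(**params)

# Synthetic stand-in with the same shape as PIDD: 768 rows, 8 features,
# roughly 500 negatives and 268 positives.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[500 / 768], random_state=0)

# Random 80/20 split; for 768 rows this gives 614 training and 154 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
boosted.fit(X_train, y_train)

print("train/test sizes:", len(X_train), len(X_test))
print("LR test accuracy:", logreg.score(X_test, y_test))
print("Boosted test accuracy:", boosted.score(X_test, y_test))
```

Because the data are synthetic, the printed accuracies say nothing about the results reported for PIDD; only the split sizes correspond to the text.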
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{2} \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \times 100\% \tag{3} \]
\[ \text{F1-score} = \frac{2TP}{2TP + FN + FP} \times 100\% \tag{4} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \tag{5} \]
where TP (true positive) is the number of diabetic patients who are correctly predicted as having diabetes, FN (false negative) is the number of diabetic patients who are misclassified as having no diabetes, and (TP+FN) is the total number of diabetic patients. TN (true negative) is the number of non-diabetic patients who are correctly predicted as having no diabetes, FP (false positive) is the number of non-diabetic patients who are misclassified as having diabetes, and (TN+FP) is the total number of non-diabetic patients. Accuracy is the percentage of correct predictions, and the F1-score is the harmonic mean of precision and sensitivity.

4- Results and discussion
4-1 Results
Table 2 presents the performance metric values of the XGBoost and logistic regression ML algorithms. It shows that XGBoost and logistic regression have the same precision, but XGBoost has higher accuracy, sensitivity, specificity, and F1-score than logistic regression.

Table 2. Prediction results using XGBoost and Logistic Regression machine learning algorithms.

Model                  Accuracy (%)   Sensitivity (%)   Specificity (%)   Precision (%)   F1 score (%)
XGBoost model          77             71                81                67              69
Logistic Regression    75             62                80                67              64

Figure 3 and figure 4 show the confusion matrix and receiver operating characteristic (ROC) curve of XGBoost and logistic regression, respectively. The area under the ROC curve (AUROC) is an important performance measure for classification models and represents the degree of separability of the classes. The AUROC of the XGBoost model is 85%, compared with 82% for logistic regression. An advantage of the XGBoost ML algorithm is that an importance score can be obtained for each feature.
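Equations (1)-(5) translate directly into code. The sketch below is illustrative only: the confusion-matrix counts are hypothetical (chosen to sum to the test-set size of 154) because the per-cell counts of figures 3 and 4 are not given in the text. The second part checks that the F1-scores in Table 2 agree with the reported precision and sensitivity, using the fact that equation (4) is algebraically the harmonic mean of precision and sensitivity.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return the metrics of equations (1)-(5) as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),                # eq. (1)
        "precision":   100 * tp / (tp + fp),                # eq. (2)
        "specificity": 100 * tn / (tn + fp),                # eq. (3)
        "f1":          100 * 2 * tp / (2 * tp + fn + fp),   # eq. (4)
        "accuracy":    100 * (tp + tn) / (tp + fn + tn + fp),  # eq. (5)
    }


def f1_from(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (same units as inputs)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts for illustration; not taken from the study.
m = classification_metrics(tp=40, fn=10, tn=90, fp=14)
for name, value in m.items():
    print(f"{name}: {value:.1f}%")

# Consistency check against the rounded values reported in Table 2.
print(round(f1_from(67, 71)))  # XGBoost row: prints 69
print(round(f1_from(67, 62)))  # logistic regression row: prints 64
```

Both rounded F1 values match Table 2, which suggests the reported precision, sensitivity, and F1-score are mutually consistent.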
The importance score measures the value of a feature in the construction of the model. Figure 5 shows the contributions of the eight features to the XGBoost ML model output, ranked by average absolute SHAP value. Glucose, body mass index, diabetes pedigree function, and age were the four most important features.

4-2 Discussion
In this study, I applied two machine learning algorithms to build a prediction model for the risk of T2DM among PIDD participants. The XGBoost ML model with eight features demonstrated good performance in predicting T2DM. This suggests that the prediction model derived in this study could be applied to identify individuals at high risk of T2DM, which could benefit the control, and hence the prevention, of type 2 diabetes mellitus. Table 3 presents the performance of the XGBoost and LR machine learning models compared to other studies in the field on the same dataset.
Figure 3. The confusion matrix and receiver operating characteristic (ROC) curve of the XGBoost machine learning model, with an AUROC of 85%.
Figure 4. The confusion matrix and receiver operating characteristic (ROC) curve of the logistic regression machine learning model, with an AUROC of 82%.
Figure 5. Interpretation of the XGBoost model. (A) The feature importance ranking by SHAP value; (B) SHAP summary plot of the XGBoost ML model. Each dot represents an instance, with blue indicating a low feature value and red indicating a high feature value. The higher the value of a feature, the higher the risk of incident T2DM.
The prediction results confirmed that the XGBoost ML model performed best, with the highest AUROC value (85%) on the test set in predicting the probability that an individual develops T2DM. It is a good example of success in diabetes risk prediction research.
This finding is consistent with earlier studies [16,22], which identified the good prediction power of the XGBoost ML model, with AUROC values of 82% and 83%, respectively.

Table 3. Performance metrics of XGBoost and logistic regression predictive models compared to other studies.

Reference     Prediction model          Accuracy (%)   Sensitivity (%)   AUROC (%)
[16]          Naïve Bayes               76             76                82
[16]          Support Vector Machine    65             65                50
[16]          Decision Tree             74             74                75
[23]          Random Forest             76             76                -
[23]          J48                       73             72                -
[23]          Neural Network            76             78                -
[24]          Logistic Regression       76             63                76
[24]          Random Forest             75             64                75
[24]          XGBoost                   75             65                75
This study    XGBoost                   77             71                85
This study    Logistic Regression       75             62                82

This study has several limitations. The dataset used is PIDD, and there are believed to be race/ethnic differences in type 2 diabetes mellitus [25], which might limit the extrapolation of the results. The World Health Organization (WHO) has confirmed that a healthy diet, avoiding tobacco use, and regular physical activity are also important for preventing or delaying the onset of T2DM [2]. However, PIDD does not contain these features for its participants.

5- Conclusions and Recommendations
The current study developed predictive models for the risk of incident T2DM using the XGBoost and logistic regression ML algorithms. Glucose, age, diabetes pedigree function, and body mass index were the strongest medical predictors in the T2DM prediction model, which could benefit clinical practice in developing targeted T2DM prevention and control interventions. In the future, this work can be extended by taking into consideration additional predictor features such as education, healthy diet, smoking, and exercise, to estimate how likely non-diabetic people are to develop diabetes in the next few years.

Conflicts of Interest
No conflict of interest to declare.

References
[1] World Health Organization. Diabetes. https://www.who.int/health-topics/diabetes#tab=tab_1 (Accessed on 12 July 2022).
[2] World Health Organization. Diabetes. Fact sheets.
https://www.who.int/news-room/fact-sheets/detail/diabetes (Accessed on 12 July 2022).
[3] International Diabetes Federation. IDF Atlas. Diabetes around the world in 2021. https://diabetesatlas.org/ (Accessed on 12 July 2022).
[4] Ma RC, Tsoi KY, Tam WH, Wong CK. Developmental origins of type 2 diabetes: a perspective from China. European journal of clinical nutrition. 2017 Jul;71(7):870-80. https://doi.org/10.1038/ejcn.2017.48
[5] Huang Y, Vemer P, Zhu J, Postma MJ, Chen W. Economic burden in Chinese patients with diabetes mellitus using electronic insurance claims data. PLoS One. 2016 Aug 29;11(8):e0159297. http://dx.doi.org/10.1371/journal.pone.0159297
[6] Li Y, Wang DD, Ley SH, Vasanti M, Howard AG, He Y, Hu FB. Time trends of dietary and lifestyle factors and their potential impact on diabetes burden in China. Diabetes care. 2017 Dec 1;40(12):1685-94. https://doi.org/10.2337/dc17-0571
[7] Fatima M, Pasha M. Survey of machine learning algorithms for disease diagnostic. Journal of Intelligent Learning Systems and Applications. 2017;9(01):1. http://dx.doi.org/10.4236/jilsa.2017.91001
[8] Jain R, Chotani A, Anuradha G. Disease diagnosis using machine learning: A comparative study. In: Data Analytics in Biomedical Engineering and Healthcare 2021 Jan 1 (pp. 145-161). Academic Press. http://dx.doi.org/10.1016/B978-0-12-819314-3.00010-0
[9] McConnell KJ, Lindner S. Estimating treatment effects with machine learning. Health services research. 2019 Dec;54(6):1273-82. http://dx.doi.org/10.1111/1475-6773.13212
[10] McIntosh C, Conroy L, Tjong MC, Craig T, Bayley A, Catton C, Gospodarowicz M, Helou J, Isfahanian N, Kong V, Lam T. Clinical integration of machine learning for curative-intent radiation treatment of patients with prostate cancer. Nature medicine. 2021 Jun;27(6):999-1005. http://dx.doi.org/10.1038/s41591-021-01359-w
[11] Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal. 2015 Jan 1;13:8-17. http://dx.doi.org/10.1016/j.csbj.2014.11.005
[12] Diller GP, Kempny A, Babu-Narayan SV, Henrichs M, Brida M, Uebing A, Lammers AE, Baumgartner H, Li W, Wort SJ, Dimopoulos K. Machine learning algorithms estimating prognosis and guiding therapy in adult congenital heart disease: data from a single tertiary centre including 10 019 patients. European heart journal. 2019 Apr 1;40(13):1069-77. http://dx.doi.org/10.1093/eurheartj/ehy915
[13] Mujumdar A, Vaidehi V. Diabetes prediction using machine learning algorithms. Procedia Computer Science. 2019 Jan 1;165:292-9. http://dx.doi.org/10.1016/j.procs.2020.01.047
[14] Ghosh P, Azam S, Karim A, Hassan M, Roy K, Jonkman M. A comparative study of different machine learning tools in detecting diabetes. Procedia Computer Science. 2021;192:467-77. http://dx.doi.org/10.1016/j.procs.2021.08.048
[15] Chen W, Chen S, Zhang H, Wu T. A hybrid prediction model for type 2 diabetes using K-means and decision tree. In: 2017 8th IEEE International conference on software engineering and service science (ICSESS) 2017 Nov 24 (pp. 386-390). IEEE. http://dx.doi.org/10.1109/ICSESS.2017.8342938
[16] Sisodia D, Sisodia DS. Prediction of diabetes using classification algorithms. Procedia computer science. 2018 Jan 1;132:1578-85. http://dx.doi.org/10.1016/j.procs.2018.05.122
[17] Karthikeyani V, Begum IP. Comparison a performance of data mining algorithms (CPDMA) in prediction of diabetes disease. International journal on computer science and engineering. 2013 Mar 1;5(3):205.
[18] https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
[19] Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining 2016 Aug 13 (pp. 785-794). http://dx.doi.org/10.1145/2939672.2939785
[20] Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to machine learning, neural networks, and deep learning. Translational Vision Science & Technology. 2020 Jan 28;9(2):14. https://doi.org/10.1167/tvst.9.2.14
[21] Cox DR. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological). 1958 Jul;20(2):215-32. http://dx.doi.org/10.1111/j.2517-6161.1958.tb00292.x
[22] Dinh A, Miertschin S, Young A, Mohanty SD. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC medical informatics and decision making. 2019 Dec;19(1):1-5. http://dx.doi.org/10.1186/s12911-019-0918-5
[23] Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting diabetes mellitus with machine learning techniques. Frontiers in genetics. 2018 Nov 6;9:515. http://dx.doi.org/10.3389/fgene.2018.00515
[24] Liu Q, Zhang M, He Y, Zhang L, Zou J, Yan Y, Guo Y. Predicting the Risk of Incident Type 2 Diabetes Mellitus in Chinese Elderly Using Machine Learning Techniques. Journal of Personalized Medicine. 2022 Jun;12(6):905. http://dx.doi.org/10.3390/jpm12060905
[25] Spanakis EK, Golden SH. Race/ethnic difference in diabetes and diabetic complications. Current diabetes reports. 2013 Dec;13(6):814-23. http://dx.doi.org/10.1007/s11892-013-0421-9
