PM For Diabetes
https://www.emerald.com/insight/2210-8327.htm
ACI 18,1/2
Predictive modelling and analytics for diabetes using a machine learning approach
Harleen Kaur and Vinita Kumari
Department of Computer Science and Engineering,
School of Engineering Sciences and Technology, Jamia Hamdard, New Delhi, India
Received 2 November 2018
Revised 4 December 2018
Accepted 19 December 2018
Abstract
Diabetes is a major metabolic disorder that can adversely affect the entire body. Undiagnosed diabetes
can increase the risk of cardiac stroke, diabetic nephropathy and other disorders. Millions of people all
over the world are affected by this disease, and early detection is very important for maintaining a healthy
life. Diabetes is a matter of global concern because the number of cases is rising rapidly. Machine learning
(ML) is a computational method that learns automatically from experience and improves its performance to
make more accurate predictions. In the current research we applied machine learning techniques to the Pima
Indian diabetes dataset to identify trends and detect patterns with risk factors, using the R data
manipulation tool. To classify patients as diabetic or non-diabetic, we developed and analyzed five
different predictive models in R. For this purpose we used the supervised machine learning algorithms
linear kernel support vector machine (SVM-linear), radial basis function (RBF) kernel support vector
machine, k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality
reduction (MDR).
Keywords Machine learning, Support vector machine (SVM), k-Nearest neighbour (k-NN), Artificial neural
network (ANN), Multifactor dimensionality reduction (MDR)
Paper type Original Article
1. Introduction
Diabetes is a very common metabolic disease. The onset of type 2 diabetes usually occurs in
middle age and sometimes in old age, but nowadays cases of this disease are also reported in
children. Several factors contribute to developing diabetes, such as genetic susceptibility,
body weight, food habits and a sedentary lifestyle. Undiagnosed diabetes may result in a very
high blood sugar level, referred to as hyperglycemia, which can lead to complications such as diabetic
© Harleen Kaur and Vinita Kumari. Published in Applied Computing and Informatics. Published by
Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY
4.0) license. Anyone may reproduce, distribute, translate and create derivative works of this article (for
both commercial and non-commercial purposes), subject to full attribution to the original publication
and authors. The full terms of this license may be seen at
http://creativecommons.org/licences/by/4.0/legalcode.
This research work was catalysed and supported by the National Council for Science and Technology
Communications (NCSTC), Department of Science and Technology (DST), Ministry of Science and
Technology (Govt. of India) [grant recipient: Dr. Harleen Kaur]. The authors
gratefully acknowledge financial support from the Ministry of Science and Technology (Govt. of India), India.
Publisher's note: The publisher wishes to inform readers that the article “Predictive modelling and
analytics for diabetes using a machine learning approach” was originally published by the previous
publisher of Applied Computing and Informatics and the pagination of this article has been subsequently
changed. There has been no change to the content of the article. This change was necessary for the journal to
transition from the previous publisher to the new one. The publisher sincerely apologises for any
inconvenience caused. To access and cite this article, please use Kaur, H., Kumari, V. (2022), “Predictive
modelling and analytics for diabetes using a machine learning approach”, Applied Computing and
Informatics, Vol. 18 No. 1/2, pp. 90-100. The original publication date for this paper was 22/06/2019.
Applied Computing and Informatics
Vol. 18 No. 1/2, 2022
pp. 90-100
Emerald Publishing Limited
e-ISSN: 2210-8327
p-ISSN: 2634-1964
DOI 10.1016/j.aci.2018.12.004
retinopathy, nephropathy, neuropathy, cardiac stroke and foot ulcer. Early detection of
diabetes is therefore very important for improving patients' quality of life and life
expectancy [1–4,22].
Machine learning is concerned with the development of algorithms and techniques that
allow computers to learn and gain intelligence from past experience. It is a branch
of Artificial Intelligence (AI) and is closely related to statistics. Learning here means that the
system is able to identify and understand the input data, so that it can make decisions and
predictions based on it [5,23,24].
The learning process starts with gathering data by different means from various
sources. The next step is to prepare the data, that is, to pre-process it in order to fix
data-related issues and to reduce the dimensionality of the space by removing irrelevant
data (or selecting the data of interest). Since the amount of data used for learning
is large, it is difficult for the system to make decisions directly, so algorithms are designed using
logic, probability, statistics, control theory, etc. to analyze the data and retrieve knowledge
from past experience. The next step is testing the model to calculate the accuracy and
performance of the system. The final step is optimization of the system, i.e. improving the model
by using new rules or data sets. Machine learning techniques are used for classification,
prediction and pattern recognition. They can be applied in areas such as
search engines, web page ranking, email filtering, face tagging and recognition, related
advertisements, character recognition, gaming, robotics, disease prediction and traffic
management [6,7,25]. The essential learning process to develop a predictive model is given in
Figure 1.
Nowadays, machine learning algorithms are used for automatic analysis of
high-dimensional biomedical data. Diagnosis of liver disease and skin lesions, cancer classification,
risk assessment for cardiovascular disease and analysis of genetic and genomic data are
some examples of biomedical applications of ML [8,9]. For liver disease diagnosis,
Hashem and Mabrouk (2014) successfully implemented the SVM algorithm [10]. To diagnose
major depressive disorder (MDD) from an EEG dataset, Mumtaz et al. (2017) used
classification models such as support vector machine (SVM), logistic regression (LR) and
Naïve Bayesian (NB) [11].
Our novel model is implemented using supervised machine learning techniques in R for the
Pima Indian diabetes dataset to understand patterns for the knowledge discovery process in
diabetes. This dataset contains the medical records of the Pima Indian population regarding the
onset of diabetes. It includes several independent variables and one dependent variable, i.e. the
class value of diabetes, coded as 0 or 1. In this work, we have studied the performance of five
different models based on linear kernel support vector machine (SVM-linear), radial basis
function (RBF) kernel support vector machine (SVM-RBF), k-nearest neighbour (k-NN), artificial neural
network (ANN) and multifactor dimensionality reduction (MDR) algorithms to detect
diabetes in female patients.
Figure 1. Essential learning process to develop a predictive model.
where $x_1$ is a real vector and $y_1$ can be $1$ or $-1$, representing the class to which $x_1$ belongs.
A hyperplane constructed so as to maximize the distance between the two classes
$y = 1$ and $y = -1$ is defined as:
$$\vec{w} \cdot \vec{x} - b = 0 \qquad (2)$$
where $\vec{w}$ is the normal vector and $\frac{b}{\|\vec{w}\|}$ is the offset of the hyperplane along $\vec{w}$.
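As an illustration of equation (2), the following minimal Python sketch classifies a point by the side of the hyperplane on which it falls and computes the offset b/||w||. The weight vector, offset and test points are hypothetical values chosen for the example; they are not taken from the paper, whose models were built in R.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm ||u||."""
    return math.sqrt(dot(u, u))


def classify(w, b, x):
    """Return +1 or -1 according to the sign of w.x - b (equation 2)."""
    return 1 if dot(w, x) - b >= 0 else -1


def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane w.x - b = 0."""
    return (dot(w, x) - b) / norm(w)


# Hypothetical hyperplane parameters (illustration only).
w = [3.0, 4.0]
b = 10.0

hyperplane_offset = b / norm(w)        # offset of the hyperplane along w
label_far = classify(w, b, [4.0, 4.0])  # point on the positive side
label_origin = classify(w, b, [0.0, 0.0])  # origin lies on the negative side
```

In a trained SVM the values of w and b are chosen so that the margin between the two classes is as large as possible; here they are simply fixed by hand.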
Figure 2. Representation of Support vector machine.
2.2 Radial basis function (RBF) kernel support vector machine
Support vector machines have proven efficient on both linear and non-linear data. The radial
basis function has been implemented with this algorithm to classify non-linear data. The kernel
function plays a very important role in mapping data into the feature space.
Mathematically, the kernel trick (K) is defined as:
$$K(x_1, x_2) = \exp\left(-\frac{|x_1 - x_2|^2}{2\sigma^2}\right) \qquad (3)$$
This Gaussian function is also known as the radial basis function (RBF) kernel. In Figure 3, the
input space is separated by the feature map (Φ). By applying equations (1) and (2) we get:
$$f(X) = \sum_{i=1}^{N} \alpha_i y_i \, K(X_i, X) + b \qquad (4)$$
Substituting equation (3) into equation (4), with $x_1 = X_i$ and $x_2 = X$, gives the new function,
where N is the number of training samples:
$$f(X) = \sum_{i=1}^{N} \alpha_i y_i \exp\left(-\frac{|X_i - X|^2}{2\sigma^2}\right) + b \qquad (5)$$
Figure 3. Representation of Radial basis function (RBF) kernel Support Vector Machine.
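The RBF kernel of equation (3) and the kernel decision function of equations (4) and (5) can be sketched in a few lines of Python. This is only an illustration: the support vectors, the coefficients alpha, the labels, the bias and sigma below are hypothetical values, whereas in practice they are obtained by training the SVM.

```python
import math


def rbf_kernel(x1, x2, sigma):
    """Gaussian (RBF) kernel: exp(-|x1 - x2|^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_function(x, vectors, alphas, labels, b, sigma):
    """f(X) = sum_i alpha_i * y_i * K(X_i, X) + b, as in equation (5)."""
    total = 0.0
    for x_i, alpha_i, y_i in zip(vectors, alphas, labels):
        total += alpha_i * y_i * rbf_kernel(x_i, x, sigma)
    return total + b


# Hypothetical trained parameters (illustration only).
vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [1, -1]
bias = 0.0
sigma = 1.0

score_near_positive = decision_function([0.0, 0.0], vectors, alphas, labels, bias, sigma)
score_near_negative = decision_function([2.0, 2.0], vectors, alphas, labels, bias, sigma)
```

The sign of the returned score gives the predicted class. A smaller sigma makes the kernel more local, so each training point influences only its immediate neighbourhood.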
another. Each neuron can be represented by a state (0 or 1), and each node may also have
a weight assigned to it that defines its strength or importance in the system. The
structure of an ANN is divided into layers of multiple nodes; data travels from the first layer
(input layer) through the middle layers (hidden layers) to the output
layer. Every layer transforms the data into some relevant information, and the network finally gives the
desired output [17].
Transfer and activation functions play an important role in the functioning of neurons. The
transfer function sums up all the weighted inputs as:
$$z = \sum_{i=1}^{n} w_i x_i + w_b b \qquad (6)$$
Since this function does not bound its output, the sigmoid function is used as the activation, which can
be expressed as:
$$a = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (8)$$
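A single neuron built from the transfer function (6) and the sigmoid activation (8) can be sketched as follows. The weights and inputs are hypothetical and serve only to show how the two formulas combine; this is not the network used by the authors.

```python
import math


def transfer(weights, inputs, bias_weight, bias=1.0):
    """Weighted sum z = sum_i w_i * x_i + w_b * b (equation 6)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias_weight * bias


def sigmoid(z):
    """Sigmoid activation a = 1 / (1 + exp(-z)) (equation 8)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias_weight, bias=1.0):
    """Output of one neuron: activation applied to the transfer function."""
    return sigmoid(transfer(weights, inputs, bias_weight, bias))


# Hypothetical weights and inputs (illustration only).
weights = [0.5, -0.25]
inputs = [2.0, 4.0]
bias_weight = 0.1

z = transfer(weights, inputs, bias_weight)
a = neuron_output(weights, inputs, bias_weight)
```

Whatever the size of z, the sigmoid maps it into the open interval (0, 1), which is why it is used to bound the unbounded weighted sum.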
3. Predictive model
In our proposed predictive model (Figure 4), we pre-processed the raw data and applied
different feature engineering techniques to get better results. Pre-processing involved
removal of outliers and k-NN imputation to predict the missing values. The Boruta wrapper
algorithm is used for feature selection, as it provides unbiased selection of important
and unimportant features from an information system. Training on the data after feature
engineering plays a significant role in supervised learning. We have used highly correlated
variables for better outcomes. Input data here refers to the test data used for prediction and for the
confusion matrix.
Figure 4. Framework for evaluating Predictive Model.
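The k-NN imputation step mentioned in this section can be illustrated with a small Python sketch. It is an assumption-laden simplification, not the authors' R implementation: missing values are marked with None, neighbours are searched only among complete rows, distances use the observed features of the incomplete row, and the missing entry is filled with the neighbours' mean. The function name knn_impute and the toy data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("k-NN imputation needs at least one complete row")
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        # Indices of the features that are observed in this row.
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(candidate, row=row, observed=observed):
            # Euclidean distance over the observed features only.
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                new_row[j] = sum(n[j] for n in neighbours) / len(neighbours)
        filled.append(new_row)
    return filled


# Toy data (hypothetical): the last row is missing its second feature.
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.1, None],
]
imputed = knn_impute(data, k=2)
```

With real clinical variables the features would normally be scaled before computing distances, since otherwise variables with large numeric ranges dominate the neighbour search.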
All classification techniques were run in the “R” programming studio. The data set
was partitioned into two parts (training and testing). We trained our models with 70% of the
data and tested them with the remaining 30%. Five different models have been
developed using supervised learning to detect whether the patient is diabetic or non-diabetic.
For this purpose linear kernel support vector machine (SVM-linear), radial basis function
(RBF) kernel support vector machine, k-NN, ANN and MDR algorithms are used.
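The 70%/30% partition described above, together with a ten-fold split of the kind used for cross-validation, can be sketched in plain Python. The helper names, the random seed and the choice to build the folds from the training indices are assumptions made for this illustration; the paper does not give these details. The figure of 768 records follows from the 500 negative and 268 positive instances reported for the dataset.

```python
import random


def train_test_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# 768 records: 500 negative + 268 positive instances.
train_idx, test_idx = train_test_split(768, train_fraction=0.7, seed=0)
folds = list(k_fold_indices(train_idx, k=10))
```

Each record of the training portion is used for validation exactly once across the ten folds, while the held-out 30% is never seen during model fitting.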
To diagnose diabetes in the Pima Indian population, the performance of all five
models is evaluated on parameters such as precision, recall, area under curve (AUC) and F1
score (Table 3). To avoid over-fitting and under-fitting, tenfold
cross-validation is performed. Accuracy indicates how often the classifier is correct in diagnosing
whether a patient is diabetic or not. Precision measures the classifier's ability
to provide correct positive predictions of diabetes. Recall, or sensitivity, is
the proportion of actual positive cases of diabetes correctly identified by the
classifier. Specificity measures the classifier's capability to identify negative
cases of diabetes. The F1 score is the harmonic mean of precision and recall, so it
takes both into account. Classifiers with an F1 score near 1 are regarded as the best [18].
The receiver operating characteristic (ROC) curve is a well-known tool for visualizing the performance of
a binary classifier [19]. It is a plot of the true positive rate against the false positive rate as
the threshold for assigning observations to a particular class is varied. The area under the curve
(AUC) of a useful classifier lies between 0.5 and 1. An optimal classifier has
an AUC near 1.0; a value near 0.5 is comparable to
random guessing [20], and values below 0.5 indicate a classifier that fails to distinguish between the
positive and negative classes.
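The evaluation parameters described above can be computed from a confusion matrix, and the AUC can be obtained as the fraction of positive/negative pairs that the classifier ranks correctly. The sketch below uses made-up labels, predictions and scores purely for illustration; none of these numbers come from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity and F1 from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels, hard predictions and scores (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

metrics = evaluation_metrics(*confusion_counts(y_true, y_pred))
area = auc(y_true, scores)
```

The pairwise definition of AUC used here is equivalent to the area under the ROC curve, which makes the interpretation of 0.5 as random guessing direct: a random scorer ranks half of the pairs correctly.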
Table 3 lists the parameters used to evaluate all the models. The
accuracy of the linear kernel SVM model is 0.89. For the radial basis function kernel SVM,
accuracy is 0.84. For the k-NN model accuracy is 0.88, while for ANN it is 0.86. The accuracy
of the MDR based model is 0.83.
Recall, or sensitivity, which indicates the correctly identified proportion of actual positive
diabetic cases, is 0.87 for the SVM-linear model and 0.83 for SVM-RBF. For the k-NN, ANN and
MDR based models the recall values are 0.90, 0.88 and 0.87 respectively. The precision of the
SVM-linear, SVM-RBF, k-NN, ANN and MDR models is 0.88, 0.85, 0.87, 0.85 and
0.82 respectively. The F1 score of the SVM-linear, SVM-RBF, k-NN, ANN and MDR models is
0.87, 0.83, 0.88, 0.86 and 0.84 respectively. We have calculated the area under the curve (AUC)
to measure the performance of our models. The AUC of the SVM-linear model is 0.90, while
for the SVM-RBF, k-NN, ANN and MDR models the values are 0.85, 0.92, 0.88 and 0.89 respectively.
Table 3. Evaluation parameters
S.No.  Predictive model  Accuracy  Recall  Precision  F1 score  AUC
1      SVM-linear        0.89      0.87    0.88       0.87      0.90
2      SVM-RBF           0.84      0.83    0.85       0.83      0.85
3      k-NN              0.88      0.90    0.87       0.88      0.92
4      ANN               0.86      0.88    0.85       0.86      0.88
5      MDR               0.83      0.87    0.82       0.84      0.89
Figure 5. ROC curve for Linear kernel Support Vector Machine (SVM-linear) model.
Figure 6. ROC curve for kNN model.
So, from the above studies, it can be said that on the basis of all the parameters SVM-linear
and k-NN are the two best models for determining whether a patient is diabetic or not. Further, it can be
seen that the accuracy and precision of the SVM-linear model are higher than those of the k-NN
model, but the recall and F1 score of the k-NN model are higher than those of the SVM-linear model. If we
examine our diabetic dataset carefully, it is found to be an example of class imbalance, with
500 negative instances and 268 positive instances, giving an imbalance ratio of 1.87. Accuracy
alone may not provide a very good indication of the performance of a binary classifier in the case of
class imbalance. The F1 score provides better insight into classifier performance in the case of
uneven class distribution, as it balances precision and recall [21,25]. So in this
case the F1 score should also be taken into account. Further, it can be seen that the AUC values of the
SVM-linear and k-NN models are 0.90 and 0.92 respectively (Figures 5 and 6). Such high
AUC values indicate that both SVM-linear and k-NN are optimal classifiers for the diabetic dataset.
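The imbalance ratio and the relation between precision, recall and F1 score quoted above can be checked with a few lines of Python. The precision, recall and F1 values below are the ones reported in the text; because they are rounded to two decimals, the recomputed F1 can differ from the reported one by about 0.01, which is treated here as an acceptable tolerance (an assumption of this sketch).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class counts reported for the dataset.
negative_instances = 500
positive_instances = 268
imbalance_ratio = negative_instances / positive_instances

# (precision, recall, reported F1) as stated in the text, rounded to 2 decimals.
reported = {
    "SVM-linear": (0.88, 0.87, 0.87),
    "SVM-RBF": (0.85, 0.83, 0.83),
    "k-NN": (0.87, 0.90, 0.88),
    "ANN": (0.85, 0.88, 0.86),
    "MDR": (0.82, 0.87, 0.84),
}

# Recompute F1 from the rounded precision and recall of each model.
recomputed = {
    name: f1_score(precision, recall)
    for name, (precision, recall, _) in reported.items()
}

for name, (_, _, f1_reported) in reported.items():
    print(f"{name}: reported F1 {f1_reported:.2f}, recomputed {recomputed[name]:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The recomputed values agree with the reported F1 scores to within rounding, and the ratio of negative to positive instances rounds to the stated 1.87.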
5. Conclusion
We have developed five different models to detect diabetes using linear kernel support vector
machine (SVM-linear), radial basis function kernel support vector machine (SVM-RBF), k-NN, ANN
and MDR algorithms. Feature selection was performed with the Boruta wrapper
algorithm, which provides unbiased selection of important features.
All the models are evaluated on the basis of different parameters: accuracy, recall,
precision, F1 score and AUC. The experimental results suggest that all the models
achieved good results; the SVM-linear model provides the best accuracy (0.89) and precision (0.88)
for prediction of diabetes compared to the other models. On the other hand, the k-NN model
provided the best recall (0.90) and F1 score (0.88). As our dataset is an example of class
imbalance, the F1 score may provide better insight into the performance of our models, since it
balances precision and recall. Further, the AUC values of the SVM-linear
and k-NN models are 0.90 and 0.92 respectively. Such high AUC values indicate that both
SVM-linear and k-NN are optimal classifiers for the diabetic dataset. So, from the above studies, it
can be said that on the basis of all the parameters, linear kernel support vector machine
(SVM-linear) and k-NN are the two best models for determining whether a patient is diabetic or not.
This work also suggests that the Boruta wrapper algorithm can be used for feature selection.
The experimental results indicated that using the Boruta wrapper feature selection
algorithm is better than choosing the attributes manually with less medical domain
knowledge. Thus, with a limited number of parameters selected through the Boruta feature selection
algorithm, we have achieved higher accuracy and precision.
References
[1] D. Soumya, B. Srilatha, Late stage complications of diabetes and insulin resistance, J. Diabetes
Metab. 2 (167) (2011) 2–7.
[2] K. Papatheodorou, M. Banach, M. Edmonds, N. Papanas, D. Papazoglou, Complications of
diabetes, J. Diabetes Res. 2015 (2015) 1–5.
[3] L. Mamykina et al., Personal discovery in diabetes self-management: discovering cause and
effect using self-monitoring data, J. Biomed. Informat. 76 (2017) 1–8.
[4] A. Nather, C.S. Bee, C.Y. Huak, J.L.L. Chew, C.B. Lin, S. Neo, E.Y. Sim, Epidemiology of diabetic
foot problems and predictive factors for limb loss, J. Diab. Complic. 22 (2) (2008) 77–82.
[5] Shiliang Sun, A survey of multi-view machine learning, Neural Comput. Applic. 23 (7–8) (2013)
2031–2038.
[6] M.I. Jordan, M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (6245)
(2015) 255–260.
[7] P. Sattigeri, J.J. Thiagarajan, M. Shah, K.N. Ramamurthy, A. Spanias, A scalable feature learning
and tag prediction framework for natural environment sounds, Signals Syst. and Computers 48th
Asilomar Conference on Signals, Systems and Computers, 2014, 1779–1783.
[8] M.W. Libbrecht, W.S. Noble, Machine learning applications in genetics and genomics, Nat. Rev.
Genet. 16 (6) (2015) 321–332.
[9] K. Kourou, T.P. Exarchos, K.P. Exarchos, M.V. Karamouzis, D.I. Fotiadis, Machine learning
applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. 13 (2015) 8–17.
[10] E.M. Hashem, M.S. Mabrouk, A study of support vector machine algorithm for liver disease
diagnosis, Am. J. Intell. Sys. 4 (1) (2014) 9–14.
[11] W. Mumtaz, S. Saad Azhar Ali, M. Azhar, M. Yasin, A. Saeed Malik, A machine learning
framework involving EEG-based functional connectivity to diagnose major depressive disorder
(MDD), Med. Biol. Eng. Comput. (2017) 1–14.
[12] D.K. Chaturvedi, Soft computing techniques and their applications, in: Mathematical Models,
Methods and Applications, 31–40. Springer Singapore, 2015.
[13] A. Tettamanzi, M. Tomassini, Soft computing: integrating evolutionary, neural, and fuzzy
systems, Springer Science & Business Media (2013).
[14] M.A. Hearst, S.T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE Intell.
Syst. Appl. 13 (4) (1998) 18–28.
[15] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications,
Neurocomputing 70 (1) (2006) 489–501.
[16] S.A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Trans. Syst. Man Cybernet.
SMC-6 (4) (1976) 325–327.
[17] T. Kohonen, An introduction to neural computing, Neural Networks 1 (1) (1988) 3–16.
[18] Z.C. Lipton, C. Elkan, B. Naryanaswamy, Optimal thresholding of classifiers to maximize F1
measure, in: Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, Springer, Berlin, Heidelberg, 2014, pp. 225–239.
[19] L.B. Ware et al., Biomarkers of lung epithelial injury and inflammation distinguish severe sepsis
patients with acute respiratory distress syndrome, Crit. Care 17 (5) (2013) 1–7.
[20] M.E. Rice, G.T. Harris, Comparing effect sizes in follow-up studies: ROC area, Cohen’s d, and r,
Law Hum. Behav. 29 (5) (2005) 615–620.
[21] A. Ali, S.M. Shamsuddin, A.L. Ralescu, Classification with class imbalance problem: a review, Int.
J. Adv. Soft Comput. Appl. 5 (3) (2013) 176–204.
[22] S. Park, D. Choi, M. Kim, W. Cha, C. Kim, I.C. Moon, Identifying prescription patterns with a topic
model of diseases and medications, J. Biomed. Informat. 75 (2017) 35–47.
[23] H. Kaur, E. Lechman, A. Marszk, Catalyzing Development through ICT Adoption: The
Developing World Experience, Springer Publishers, Switzerland, 2017.
[24] H. Kaur, R. Chauhan, Z. Ahmed, Role of data mining in establishing strategic policies for the
efficient management of healthcare system–a case study from Washington DC area using
retrospective discharge data, BMC Health Services Res. 12 (S1) (2012) P12.
[25] J. Li, O. Arandjelovic, Glycaemic index prediction: a pilot study of data linkage challenges and the
application of machine learning, in: IEEE EMBS Int. Conf. on Biomed. & Health Informat. (BHI),
Orlando, FL, (2017) 357–360.
Corresponding author
Vinita Kumari can be contacted at: vkumari@jamiahamdard.ac.in