International Journal of Grid and Distributed Computing
Vol. 10, No. 9 (2017), pp.47-56
http://dx.doi.org/10.14257/ijgdc.2017.10.9.04
Predicting Diabetics Accuracy Using Rough Set Clusters
1
Shantan Sawa, H. Balaji1, N. Ch. S. N. Iyengar1 and Ronnie D. Caytiles2
SCOPE, VIT University, Vellore-632014
Sreenidhi Institute of Science and Technology, Ghatkesar, Hyderabad, India
2
Multimedia Engineering Department, Hannam University, Daejeon, Korea
shantansawa@gmail.com, rdcaytiles@gmail.com,
srimannarayanach@sreenidhi.edu.in
1
Abstract
The primary objective of this paper is to develop and propose a model that uses concepts
from Rough Set Theory to cluster the patients in a diabetic dataset. The model applies
Rough Clustering to the dataset and, from the clusters formed, computes the accuracy on
the testing data. Rough Clustering helps split the data into clusters of patients who suffer
from Diabetes Mellitus and those who do not. As a result, the patients suffering from
Diabetes Mellitus are clustered together, which provides the average values of the
features used in the model for data clustering. The results obtained add depth to the field
of rough clustering for diabetes, as studies on diabetes using rough set theory are few to
none.
Keywords: Rough Set Theory, Diabetes Mellitus, Rough Clustering
1. Introduction
Diabetes mellitus, commonly known as diabetes, is a group of metabolic diseases marked
by high blood sugar levels over a prolonged period. According to a survey by the
International Diabetes Federation, an estimated 415 million individuals around the world
were affected by diabetes in 2015, and the number is estimated to rise to 642 million by
the year 2040. Type 1 diabetes mellitus (T1DM) alone comprises roughly 10% of the
reported cases of diabetes. T1DM usually affects children; the causes of the disease are
unknown and there are no known ways to prevent it. The study [9] published by Jane L.
Chiang et al. estimated that about 80000 children show symptoms and develop the disease
each year. The motivation behind pursuing a working model in the field of T1DM is to
guide doctors and the guardians of affected children in catching the disease in its nascent
stage and, based on the severity of the diabetes, providing apt treatment to the affected
children.
Type 1 diabetes is a disorder in which the pancreas can no longer produce insulin. It is
also called juvenile diabetes: although it is present in both children and adults, statistics
show that only 5% of all type 1 cases are diagnosed in adulthood. It is sometimes referred
to as insulin-dependent diabetes mellitus and has no cure; those who have it must take
insulin to survive. According to the WHO (World Health Organization), the number of
children with type 1 diabetes is very high, as mentioned in the motivation. Hence,
diabetes is a serious chronic disease. Doctors usually take patients' blood samples and
check the sugar concentration in their blood in order to diagnose diabetes. This is a highly
time-consuming process, and there are many other features which need to be reviewed
when attempting to detect whether a patient is diabetic or not. These other factors are
family history, body mass index and
Received (June 20, 2017), Review Result (August 30, 2017), Accepted (September 15, 2017)
ISSN: 2005-4262 IJGDC
Copyright ⓒ 2017 SERSC Australia
age. If a patient's ancestors had diabetes, then there is a high chance of the patient
exhibiting symptoms of diabetes as well. Presently, there is no tool to detect whether a
person has type 1 diabetes and how severely it will affect them. Hence, the need arises for
an efficient application that predicts the onset of type 1 diabetes in patients, enabling
quicker diagnosis and better safety.
In the real world, data is rarely crisp; there are gray areas, some roughness among the
objects in the data set. Thus, it becomes impossible to create crisp clusters from
real-world data. To tackle this lack of crispness, or presence of roughness, in the data set,
the concepts of Rough Set Theory play a significant part in developing models that take
the rough nature of the data set into account and form clusters of the data accordingly.
2. Literature Survey/Related Work
[1] In 2009, Asma Shaheen Khan and Waqas Ahmed proposed an intelligent decision
support system for diabetic eHealth care from the perspective of elders. The proposed
system stores the patients' information and gives them advice suited to the condition
they enter. It also provides adequate and detailed information about the patient to the
health-care providers, which helps them take an optimal decision about the patient.
[2] In 2012, Tawfik Saeed Zeki, Mohammad V. Malakooti, Yousef Ataeipoor and
Talayeh Tabibi proposed an expert system for diabetes diagnosis. After data acquisition
and the design of a rule-based expert system, the system was coded with the VP_Expert
Shell and tested in Shahid Hasheminezhad Teaching Hospital, affiliated to Tehran
University of Medical Sciences. The final expert system could diagnose all kinds of
diabetes.
[15] In 2014, Gaganjot Kaur and Amit Chhabra proposed a classification approach that
uses a decision tree algorithm to predict whether a patient is diabetic or not. Labelled data
is fed as input, and the leaves of the J48 tree act as the class labels. For each attribute, the
information gain that would result from a test on that attribute is calculated.
[12] In 2013, Margret Anouncia S., Clara Madonna L. J., Jeevitha P. and Nandhini R. T.
proposed a design for a Diabetic Diagnosis System using Rough Sets. The authors created
a knowledge base from the existing data set using upper and lower approximations. When
new incoming data is collected, it is evaluated to compute the equivalence classes, which
are then compared against the knowledge base to help the user discern the type of
diabetes.
[4] In 2015, Rahman Ali, Jamil Hussain, Muhammad Hameed Siddiqi, Maqbool Hussain
and Sungyoung Lee proposed a Hybrid Rough Set Reasoning Model for Prediction and
Management of Diabetes Mellitus. New incoming data is evaluated and compared against
the knowledge base derived from the rough classification to classify the type of diabetes
of the patient.
[6] In 2015, Aishwarya Iyer, S. Jeyalatha and Ronak Sumbaly conducted a comparative
study in which multiple data mining techniques were used to classify the PIMA Indians
Diabetic dataset. The techniques used for feature extraction and classification were the
decision tree and the Naive Bayesian classifier.
[8] In 2015, Áurea Celeste Ribeiro, Allan Kardec Barros, Ewaldo Santana and José Carlos
Príncipe proposed a redundancy reduction preprocessor that eliminates the redundant
attributes from the data set. The authors used the PIMA Indians Diabetic dataset available
on UCI's online repository for their model.
In 1988, Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S.
used the same data set to develop a model implementing the ADAP learning algorithm to
predict the onset of diabetes mellitus.
3. Motivation
Type 1 diabetes mellitus (T1DM) alone comprises roughly 10% of the reported cases of
diabetes. T1DM usually affects children; the causes of the disease are unknown and there
are no known ways to prevent it. The study [9] published by Jane L. Chiang et al.
estimated that about 80000 children show symptoms and develop the disease each year.
The motivation behind pursuing a working model in the field of T1DM is to guide doctors
and the guardians of affected children in catching the disease in its nascent stage and,
based on the severity of the diabetes, providing treatment to the affected children.
4. Experimental Setup
RStudio is an integrated development environment (IDE) for R. It includes a console and
a syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging and workspace management. RStudio is written in C++ and
uses the Qt framework for its graphical user interface. RStudio is available in open source
and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a
browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu,
RedHat/CentOS, and SUSE Linux).
The following are the IDE features offered by RStudio:
RStudio has integrated support for Git and Subversion.
RStudio supports authoring HTML, PDF, Word documents and slide shows.
RStudio supports interactive graphics with Shiny and ggvis.
RStudio integrates the tools used in R into a single environment.
The following libraries have been used for the model development phase:
RoughSets: Library of functions for data analysis using Rough Set Theory (RST) and
Fuzzy Rough Set Theory (FRST). The library is used to convert the data set into decision
tables and to compute the discernibility and indiscernibility matrices from the decision
tables.
RoughSetKnowledgeReduction: Simplification of decision tables using RST.
Incorporates the functions for reduct computation and feature selection.
SoftClustering: The library contains various soft clustering algorithms to be
implemented on the dataset. The function RoughKMeans_LW() is used to implement
Lingras and West's rough clustering algorithm to discern clusters from the dataset, and
the plotRoughKMeans() function is used to plot the rough clusters.
caret: Classification and Regression Training. The library is used for training on the
dataset and testing the model for accuracy and prediction.
discretization: The library provides various functions, such as chi2(), for data
processing and discretization of numeric data tables for classification.
The data set considered for the development and analysis of the model is the Pima Indians
Diabetes data set, which is available online at the University of California, Irvine,
Machine Learning Repository. The data set has been provided by the National Institute of
Diabetes and Digestive and Kidney Diseases and consists of relevant information on
patients of Pima Indian heritage. There are 768 instances, with 500 testing negative for
diabetes and the remaining 268 testing positive for diabetes.
5. Methodology
The proposed system follows the steps shown in Figure 1, which depicts the system
design for the model.
Figure 1. System Design Depicting the Processes Involved in the Rough Set
Theory Model
Once the data set is loaded on the platform, we must determine the features to be selected
for the clustering process. First, we compute the correlation matrix to identify attributes
that follow similar trends.
Pearson's correlation coefficient has been adopted for the computation of the correlation
matrix. The method computes the correlation between two variables and yields a value in
the range -1 to +1. It is chosen because it is the traditional method of computing
correlation and its accuracy increases with the sample size.
For two variables X and Y, with standard deviations $\sigma_X$ and $\sigma_Y$ respectively and means
$\mu_X$ and $\mu_Y$ respectively, the correlation coefficient $\rho_{X,Y}$ is given by the formula
$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{1}$$
where E is the expectation.
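As an illustration of the correlation formula, the following minimal Python sketch computes Pearson's coefficient directly from its definition and builds a pairwise correlation matrix. The function names and the toy values are assumptions made for this example; they are not taken from the Pima data set or from the R code used in this work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance E[(X - mu_X)(Y - mu_Y)] and population standard deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, chosen only to exercise the functions.
toy = {
    "glucose": [85, 90, 120, 150, 170],
    "bmi": [22.0, 24.5, 27.0, 31.0, 33.5],
    "age": [50, 21, 33, 45, 29],
}
matrix = correlation_matrix(toy)
```

The resulting matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be inspected when looking for strongly correlated attribute pairs.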
In order to interpret the values, the following cut-off ranges for the absolute value of the
coefficient have been established:
Exactly 0      - No correlation
(0, 0.35]      - Weakly correlated
(0.35, 0.7]    - Moderately correlated
(0.7, 1)       - Strongly correlated
Exactly 1      - Perfectly correlated
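The cut-off ranges can be expressed as a small lookup function. The sketch below is only an illustration; the function name correlation_strength is an assumption of this example.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the qualitative label for |r|."""
    magnitude = abs(r)
    if magnitude > 1:
        # A valid coefficient always lies in [-1, +1].
        raise ValueError("correlation coefficient must lie in [-1, 1]")
    if magnitude == 0:
        return "No correlation"
    if magnitude <= 0.35:
        return "Weakly correlated"
    if magnitude <= 0.7:
        return "Moderately correlated"
    if magnitude < 1:
        return "Strongly correlated"
    return "Perfectly correlated"


labels = [correlation_strength(r) for r in (0.0, 0.2, -0.5, 0.9, -1.0)]
```

Because the ranges are defined on the absolute value, a coefficient of -0.5 receives the same label as +0.5.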
Table 1. Correlation Matrix on the PIMA Indians Data-set
From the above table, it is evident that no two attributes are strongly correlated. Hence,
no attribute is dropped at this stage. We continue by computing the reducts of the data set
for feature extraction, which reduces the complexity of computing the clusters while
maintaining the integrity of the data set.
In 1998, Jan Komorowski et al. [16] documented the various concepts of Rough Set
Theory; their paper has a well-documented section dedicated to reducts. The authors
explain that a reduct is a set of attributes, used in the feature selection process, that
reduces the dimensionality of the information system while preserving the integrity of the
data and helping to eliminate redundant attributes from the data set.
Unfortunately, computing a minimal reduct is an NP-hard problem. However, there are
good heuristic algorithms ([17][19]) based on a modified genetic algorithm that provide
multiple reducts for the information system.
To find the reducts, we first compute the discernibility matrix. The discernibility matrix
is an n x n matrix, where n is the number of objects in the data set. The entries of the
matrix, $c_{ij}$, are computed using the formula
$$c_{ij} = \{ a \in A \mid a(x_i) \neq a(x_j) \} \quad \text{for } i, j = 1, \ldots, n$$
where A is the set of attributes, a is an attribute in A, and $a(x_i)$ is the value of the
attribute a for the object $x_i$.
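To make the definition concrete, the following Python sketch builds the discernibility matrix for a tiny table. The attribute names and the discretised values are hypothetical and serve only as an illustration; the experiments in this work rely on the RoughSets library in R instead.

```python
def discernibility_matrix(objects, attributes):
    """Entry [i][j] is the set of attributes whose values differ
    between object i and object j."""
    n = len(objects)
    return [
        [
            frozenset(a for a in attributes if objects[i][a] != objects[j][a])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical discretised toy table with three objects.
patients = [
    {"glucose": "high", "bmi": "high", "age": "old"},
    {"glucose": "high", "bmi": "low", "age": "old"},
    {"glucose": "low", "bmi": "low", "age": "young"},
]
attrs = ["glucose", "bmi", "age"]
matrix = discernibility_matrix(patients, attrs)
```

The matrix is symmetric and its diagonal entries are empty sets, so in practice only the entries below the diagonal need to be stored.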
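After the discernibility matrix is computed, we proceed to construct the discernibility
function $f_A$, which is given by
$$f_A(a_1^*, \ldots, a_m^*) = \bigwedge \left\{ \bigvee c_{ij}^* \;\middle|\; 1 \le j \le i \le n,\ c_{ij} \neq \emptyset \right\}$$
where $a_1^*, \ldots, a_m^*$ are Boolean variables corresponding to the attributes
$a_1, \ldots, a_m$, and $c_{ij}^* = \{ a^* \mid a \in c_{ij} \}$.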
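Since the discernibility function is a conjunction of disjunctions of attributes, a set of attributes satisfies it exactly when it shares at least one attribute with every non-empty matrix entry, and the minimal such sets are the reducts. The Python sketch below finds them by exhaustive search over a hypothetical toy table. This is an assumption-laden illustration that is only feasible for a handful of attributes; it is not the genetic-algorithm heuristic of [17][19] and it ignores the decision attribute.

```python
from itertools import combinations


def reducts(objects, attributes):
    """Minimal attribute subsets that intersect every non-empty
    discernibility-matrix entry (exhaustive search)."""
    n = len(objects)
    clauses = set()
    for i in range(n):
        for j in range(i):
            entry = frozenset(
                a for a in attributes if objects[i][a] != objects[j][a]
            )
            if entry:
                clauses.add(entry)
    found = []
    # Subsets are visited by increasing size, so any superset of an
    # already accepted subset cannot be minimal and is skipped.
    for size in range(0, len(attributes) + 1):
        for subset in combinations(attributes, size):
            candidate = set(subset)
            if any(r <= candidate for r in found):
                continue
            if all(candidate & clause for clause in clauses):
                found.append(frozenset(subset))
    return found


# Hypothetical discretised toy table with three objects.
patients = [
    {"glucose": "high", "bmi": "high", "age": "old"},
    {"glucose": "high", "bmi": "low", "age": "old"},
    {"glucose": "low", "bmi": "low", "age": "young"},
]
attrs = ["glucose", "bmi", "age"]
found_reducts = reducts(patients, attrs)
```

For this toy table the search yields two reducts, {glucose, bmi} and {bmi, age}: bmi is needed to tell the first two objects apart, and either glucose or age separates the third object from the second.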
The set of all prime implicants of the discernibility function $f_A$ determines the reducts
of the data set. Once the reducts are computed and the data set is reduced, we deploy the
rough clustering algorithm as proposed by P. Lingras and Georg Peters [20]. The rough
clustering algorithm uses the lower and upper approximations of each cluster to
incorporate the roughness in the data, and accordingly clusters similar data points to
determine the centroids in the model.
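The idea behind rough clustering can be sketched in a few lines of Python. The version below follows the general structure of Lingras and West's rough k-means: a point that is clearly closest to one centre joins that cluster's lower approximation, while a point that is almost equally close to several centres is placed in the boundary of each of them. The ratio test for the boundary, the weight of 0.7 for the lower approximation, the threshold of 1.5 and the toy points are all assumptions of this example; it is not the RoughKMeans_LW() implementation used in the experiments.

```python
import math


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def rough_k_means(points, centroids, w_lower=0.7, threshold=1.5, max_iter=100):
    """Return (centroids, lower approximations, boundary regions)."""
    centroids = [list(c) for c in centroids]
    k = len(centroids)
    w_boundary = 1.0 - w_lower
    for _ in range(max_iter):
        lower = [[] for _ in range(k)]
        boundary = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            nearest = min(range(k), key=dists.__getitem__)
            # Other centres that are almost as close as the nearest one.
            close = [
                j for j in range(k)
                if j != nearest and dists[j] <= threshold * dists[nearest]
            ]
            if close:
                for j in [nearest] + close:
                    boundary[j].append(p)
            else:
                lower[nearest].append(p)
        new_centroids = []
        for j in range(k):
            if lower[j] and boundary[j]:
                lo, bo = mean_point(lower[j]), mean_point(boundary[j])
                new_centroids.append(
                    [w_lower * a + w_boundary * b for a, b in zip(lo, bo)]
                )
            elif lower[j]:
                new_centroids.append(mean_point(lower[j]))
            elif boundary[j]:
                new_centroids.append(mean_point(boundary[j]))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep centre
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, lower, boundary


# Hypothetical toy points: two tight groups and one ambiguous point.
toy_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0), (5.5, 5.5)]
centres, lower, boundary = rough_k_means(toy_points, [(1.0, 1.0), (9.0, 9.0)])
```

In this toy run the point (5.5, 5.5) ends up in the boundary of both clusters, so it pulls both centres slightly towards the middle without being assigned definitely to either cluster.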
The centroids are computed using the algorithm given by P. Lingras. For the clustering,
we split the data set into 70% for training the model and 30% for validating and verifying
the results of the trained model. The testing set is validated on the basis of distance from
the centroids of the clusters.
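The split and the distance-based validation can be sketched as follows in Python. The centre values are the rough k-means training centres reported in Table 3 for V2, V6 and V8; the use of unscaled Euclidean distance, the function names, the random seed and the three test samples are assumptions made purely for illustration.

```python
import math
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def nearest_centroid(point, centroids):
    """Label of the centroid with the smallest Euclidean distance to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


def accuracy(samples, centroids):
    """Fraction of (features, label) samples whose nearest centroid matches the label."""
    correct = sum(
        1 for features, label in samples
        if nearest_centroid(features, centroids) == label
    )
    return correct / len(samples)


# Centres for (V2, V6, V8) taken from the rough k-means training row of Table 3.
centres = {
    "Diabetic": (147, 33.5, 36.3),
    "Non-Diabetic": (105, 30.7, 31.9),
}

# Hypothetical labelled samples, not rows of the Pima data set.
samples = [
    ((150, 34.0, 40), "Diabetic"),
    ((100, 29.0, 30), "Non-Diabetic"),
    ((110, 31.0, 33), "Diabetic"),
]
acc = accuracy(samples, centres)

# A 70/30 split of 768 row indices gives 538 training and 230 testing rows.
train_rows, test_rows = split_train_test(range(768))
```

Because glucose has a much larger numeric range than BMI or age, it dominates an unscaled distance; standardising the attributes first would be a reasonable variation of this sketch.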
6. Simulation and Results
Figure 2. Ranking the Attributes for Feature Selection
Table 2. Attributes Arranged by Priority with Assigned Weights
Attribute                     Assigned Weight
Glucose                       0.8
BMI                           0.7
Age                           0.7
Pregnancies                   0.615
Diabetes Pedigree Function    0.61
Blood Pressure                0.59
Skin Thickness                0.55
Insulin                       0.47
On the basis of the correlation matrix and the feature selection process, the following
attributes take precedence over the other features:
Plasma Glucose Concentration
Body Mass Index
Age
Computing all the reducts, in order to reduce the complexity of the execution, yields 28
unique reducts for the data set, of which 19 are three-attribute reducts and the remaining
9 are four-attribute reducts. After comparing them with the correlation matrix and the
ranking of the attributes by importance, reduct 17 is selected for the clustering of the
data set.
Figure 3. Decision Reduct 17
After selecting the reduct and dropping the non-significant features, we proceed with
rough clustering of the data set. The data set is split into 70% for training and 30% for
testing and validation.
For comparison purposes, we have also computed the centres using the classic Hard
K-means clustering algorithm. The centroids for the Hard K-means algorithm were
obtained after 11 iterations over the data set.
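For reference, the hard (crisp) k-means procedure alternates between assigning every point to its nearest centre and recomputing each centre as the mean of its points, until the centres stop changing. The Python sketch below illustrates this on hypothetical toy points; the function name, the initial centres and the data are assumptions of this example and do not reproduce the 11 iterations reported for the Pima data.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Hard k-means; returns (centroids, clusters, iterations used)."""
    centroids = [list(c) for c in centroids]
    clusters = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to exactly one cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]),
            )
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its cluster.
        new_centroids = [
            [sum(column) / len(cluster) for column in zip(*cluster)]
            if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            return centroids, clusters, iteration
        centroids = new_centroids
    return centroids, clusters, max_iter


# Hypothetical toy points forming two well-separated groups.
toy_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centres, clusters, iterations = k_means(toy_points, [(1.0, 1.0), (9.0, 9.0)])
```

Unlike the rough variant, every point here belongs to exactly one cluster, so ambiguous points influence only the centre they happen to be closest to.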
The instances are split into "Diabetic" and "Non-Diabetic" clusters. The first row in
ClusterMeans gives the value of the "Diabetic" centre for each attribute, and the second
row gives the value of the "Non-Diabetic" centre. The values of the centroids for the
individual attributes in the reduct are listed in the figure below.
Figure 4. Cluster Means for the Attributes using Hard K-Means Algorithm
Meanwhile, the rough clustering algorithm as proposed by P. Lingras is applied to 70%
of the data to compute the centres for the attributes. The same algorithm is then run on
the remaining 30% of the data to test and validate the computed centres and to determine
the accuracy achieved by the rough clustering algorithm on the data set.
Figure 5. Cluster Means for the Attributes using Rough K-Means Algorithm
on the Training Set
Figure 6. Cluster Means for the Attributes using Rough K-Means Algorithm
on the Testing Set
The rough clusters from the data set are plotted using the plotRoughKMeans() function,
which maps the data points onto a 2D plane. The centre for patients with no onset of
diabetes is marked with a green circle, whereas the centre for diabetic patients is marked
with a red box.
Figure 7. Rough Clusters Plot on a 2D Plane with the Centres for Training
Data
Figure 8. Rough Clusters Plot on a 2D Plane with the Centres for Testing
Data
Table 3. Tabulating the Centre Values for Each Individual Attribute for the
Various Methods
Method                      V2            V6              V8
Hard K-means                (156, 102)    (34.5, 30.7)    (38.1, 30.7)
Rough K-means (Training)    (147, 105)    (33.5, 30.7)    (36.3, 31.9)
Rough K-means (Testing)     (149, 108)    (34.2, 31.1)    (38.6, 31.7)
The errors compare the Rough K-means centres obtained on the testing set with those
obtained on the training set. The centre value for the Plasma Glucose Concentration (V2)
has an error of < 3% for non-diabetic patients and < 1.5% for diabetic patients. The centre
value for the Body Mass Index (V6) has an error of < 1.5% for non-diabetic patients and
< 2.1% for diabetic patients. The centre value for Age (V8) has an error of < 0.7% for
non-diabetic patients and < 6.5% for diabetic patients.
Table 4. Tabulating the Error Percentage for Centre Values for Each
Individual Attribute
Error %         V2      V6      V8
Non-diabetic    2.86    1.3     0.63
Diabetic        1.36    2.09    6.34
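The error percentages can be reproduced from the Rough K-means rows of Table 3 if the error is taken as the absolute difference between the testing and training centres, relative to the training centre. This reading is an assumption of the sketch below, adopted because it yields the tabulated values; each pair lists the diabetic centre first and the non-diabetic centre second.

```python
# Rough k-means centres from Table 3 as (diabetic, non-diabetic) pairs.
training = {"V2": (147, 105), "V6": (33.5, 30.7), "V8": (36.3, 31.9)}
testing = {"V2": (149, 108), "V6": (34.2, 31.1), "V8": (38.6, 31.7)}


def percent_error(reference, observed):
    """Absolute difference relative to the reference value, in percent."""
    return abs(observed - reference) / reference * 100


errors = {
    attribute: tuple(
        round(percent_error(train_value, test_value), 2)
        for train_value, test_value in zip(training[attribute], testing[attribute])
    )
    for attribute in training
}
```

For example, the non-diabetic V2 centre moves from 105 to 108, which gives 3 / 105 = 2.86%.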
7. Conclusion
The Rough Clustering model was successfully trained, tested and implemented on the data
set [5]. The results obtained were satisfactory, with the testing-set centres differing from
the training-set centres by less than 6.5% for every attribute. They can be further
improved by increasing the size of the data set with information gathered from incoming
patients in a hospital or in a network of hospitals. Furthermore, the prediction of diabetes
can also depend on other factors which are not present in the data set. Taking these
factors into consideration will further help to improve the accuracy of the proposed
system.
References
[1] A. Shaheen Khan and W. Ahmad, “Intelligent Decision Support System in Diabetic eHealth Care-From
the perspective of Elders”, Blekinge Institute of Technology, (2009), pp. 88.
[2] Z. Tawfik Saeed, M. V. Malakooti, Y. Ataeipoor and S. Talayeh Tabibi, “An expert system for diabetes
diagnosis”, American Academic & Scholarly Research Journal, vol. 4, no. 5, (2012), pp. 1.
[3] Margret Anouncia S, Clara Madonna L. J., Jeevitha P. and Nandhini R. T., “Design of a Diabetic
Diagnosis System Using Rough Sets”, Cybernetics and Information Technologies, vol. 13, no. 3,
(2013), pp. 124-139.
[4] R. Ali, J. Hussain, M. Hameed Siddiqi, M. Hussain and S. Lee, “H2RM: A Hybrid Rough Set Reasoning
Model for Prediction and Management of Diabetes Mellitus”, Sensors, vol. 15, no. 7, (2015),
pp. 15921-15951.
[5] A. Frank, “UCI machine learning repository”, http://archive.ics.uci.edu/ml, (2010).
[6] Á. Celeste Ribeiro, A. Kardec Barros, E. Santana and J. Carlos Príncipe, “Diabetes classification using a
redundancy reduction preprocessor”, Research on Biomedical Engineering, vol. 31, no. 2, (2015), pp.
97-106.
[7] J. L. Chiang, M. Sue Kirkman, L. MB Laffel and A. L. Peters, “Type 1 diabetes through the life span: a
position statement of the American Diabetes Association”, Diabetes Care, vol. 37, no. 7, (2014), pp.
2034-2054.
[8] Margret Anouncia S, Clara Madonna L. J., Jeevitha P. and Nandhini R. T., “Design of a Diabetic
Diagnosis System Using Rough Sets”, Cybernetics and Information Technologies, vol. 13, no. 3, (2013),
pp. 124-139.
[9] R. Zolfaghari, “Diagnosis of Diabetes in Female Population of Pima Indian Heritage with Ensemble of
BP Neural Network and SVM”, International Journal of Computational Engineering & Management,
vol. 15, no. 4, (2012), pp. 2230-7893.
[10] G. Kaur, “Improved J48 Classification Algorithm for the Prediction of Diabetes”, International Journal
of Computer Applications, vol. 98, no. 22, (2014).
[11] J. Lamb, “MATH 3220 Final Project”.
[12] M. Narasingarao, R. Manda, G. Sridhar, K. Madhu and A. Rao, “A clinical decision support system
using multilayer perceptron neural network to assess well being in diabetes”, Journal Assoc. Phys. India,
vol. 57, (2009), pp. 127-133.
[13] M. Thirugnanam, P. Kumar, S. Vignesh Srivatsan and C. R. Nerlesh, “Improving the prediction rate of
diabetes diagnosis using fuzzy, neural network, case based (FNC) approach”, Procedia Engineering, vol.
38, (2012), pp. 1709-1718.
[14] H. Chen and C. Tan, “Prediction of type-2 diabetes based on several element levels in blood and
chemometrics”, Biological Trace Element Research, vol. 147, no. 1-3, (2012), pp. 67-74.
[15] A. Sood, S. Diamond and S. Wang, “Type 2 Diabetes Mellitus Classification”, Department of Computer
Science, Stanford University: Stanford, CA, USA, (2012).
[16] J. Komorowski, L. Polkowski and A. Skowron, “Rough Set: A tutorial”.
[17] W. Jakub, “Genetic algorithms in decomposition and classification problems”, Rough Sets in
Knowledge Discovery 2, Physica-Verlag HD, (1998), pp. 471-487.
[18] S. Hoa Nguyen, A. Skowron and P. Synak, “Discovery of data patterns with applications to
decomposition and classification problems”, Rough Sets in Knowledge Discovery 2, Physica-Verlag
HD, (1998), pp. 55-97.
[19] U. Wybraniec-Skardowska, “On a generalization of approximation space”, Bulletin of the Polish
Academy of Sciences, Mathematics, vol. 37, no. 1-6, (1989), pp. 51-62.
[20] P. Lingras and G. Peters, “Applying Rough Set Concepts To Clustering”, Rough Sets: Selected Methods
and Applications in Management and Engineering, (2012), pp. 23-37.
Authors
Shantan Sawa is a graduate of the Bachelor of Technology programme in
Computer Science and Engineering at VIT University, Vellore, Tamil
Nadu, India. He has high enthusiasm and ambition for the
projects he works on. His academic interests include Artificial
Intelligence (primarily Fuzzy Logic and Artificial Neural Networks)
and Embedded Systems.
H. Bajaj is currently working as an Associate Professor in the Computer
Science and Engineering Department at Sreenidhi Institute of Science
and Technology, Hyderabad, Telangana, India. His areas of research
include Data Warehousing and Mining and Network Security.
Ronnie D. Caytiles received his Bachelor of Science in Computer
Engineering from Western Institute of Technology, Iloilo City,
Philippines, and his Master of Science in Computer Science from Central
Philippine University, Iloilo City, Philippines. He completed his Ph.D.
in Multimedia Engineering at Hannam University, Daejeon, Korea.
Currently, he serves as an Assistant Professor in the Multimedia
Engineering Department, Hannam University, Daejeon, Korea. His
research interests include Mobile Computing, Multimedia
Communication, Information Technology Security, Ubiquitous
Computing, and Control and Automation.
N. Ch. S. N. Iyengar (b. 1961) is currently a Professor at the
Sreenidhi Institute of Science and Technology (SNIST), Yamnapet,
Ghatkesar, Hyderabad, Telangana, India. His research interests include
Agent-Based Distributed Computing, Intelligent Computing,
Network Security, Secured Cloud Computing and Fluid Mechanics.
He has 32+ years of experience in teaching and research, has guided
many scholars, has authored several textbooks and has nearly 200
research publications in reputed peer-reviewed international journals.
He has served as PCM/reviewer/keynote speaker/invited speaker.