
International Journal of Scientific & Engineering Research, Volume 8, Issue 5, May 2017, ISSN 2229-5518

Predicting Diabetes in Medical Datasets Using Machine Learning Techniques
Uswa Ali Zia, Dr. Naeem Khan

Abstract- The healthcare industry contains very large and sensitive data that needs to be handled carefully. Diabetes Mellitus is one of the most rapidly growing and fatal diseases all over the world, and medical professionals need a reliable prediction system to diagnose it. Different machine learning techniques are useful for examining data from diverse perspectives and summarizing it into valuable information. The accessibility and availability of huge amounts of data can provide useful knowledge if suitable data mining techniques are applied to it. The main goal is to discover new patterns and then interpret them to deliver significant and useful information to users. Diabetes contributes to heart disease, kidney disease, nerve damage and blindness, so mining diabetes data in an efficient way is a crucial concern. Data mining techniques and methods are explored to find appropriate approaches for efficient classification of the diabetes dataset and for extracting valuable patterns. In this study a medical bioinformatics analysis has been carried out to predict diabetes. The WEKA software was employed as the mining tool for diagnosing diabetes, and the Pima Indian diabetes dataset acquired from the UCI repository was used for the analysis. The dataset was studied and analyzed to build an effective model that predicts and diagnoses the disease. In this study we apply the bootstrapping resampling technique to enhance the accuracy and then apply Naïve Bayes, Decision Trees and k-Nearest Neighbors (kNN) and compare their performance.

Index Terms- Healthcare, Diabetes, Classification, K-nearest neighbours, Decision Trees, Naive Bayes.

1. INTRODUCTION

Computers have brought substantial improvements to technology that have led to the production of massive volumes of data. Additionally, advancements and innovations in healthcare database management systems generate a huge number of medical databases. The healthcare industry contains very large and sensitive data, and this data needs to be treated very carefully in order to benefit from it. There is a need to develop more accurate and efficient predictive models that help in diagnosing a disease, and diabetes mellitus in particular has become a global hazard. Diabetes Mellitus is a set of associated diseases in which the human body is unable to control the quantity of sugar in the blood. It is a group of metabolic diseases that results in high blood sugar levels, either because the body does not produce sufficient insulin or because cells do not react to the insulin that is produced. The disease is increasing rapidly, and it is estimated that almost sixty million people all over the world will be affected by diabetes by 2025. Hence there is a need to analyse the already available huge diabetic datasets to discover facts which may help in producing a prediction model.

The focus of this work is to develop prediction models by using certain machine learning algorithms. Machine learning is a branch of artificial intelligence that enables computers to learn without being explicitly programmed; it emphasizes the development of computer programs that can teach themselves to change and grow when exposed to new or unseen data. Machine learning algorithms are mostly categorized as being supervised or unsupervised. A supervised learning algorithm uses past experience to make predictions on new or unseen data, while unsupervised algorithms draw inferences from unlabeled datasets. Supervised learning is also called classification. This study uses classification to produce a more accurate predictive model, as it is one of the most commonly applied machine learning techniques: it examines the training data and creates an inferred function which can be used for mapping new or unseen examples. The major goal of classification is to forecast the target class accurately for each case in the data. Classification algorithms generally require that the classes be defined based on the data attribute values; they often define these classes by looking at the characteristics of data already known to belong to a class. This process of finding useful information and patterns in data is also called Knowledge Discovery in Databases (KDD), which involves phases such as data selection, pre-processing, transformation, classification and evaluation.

Before applying any classification algorithm it is necessary to prepare or preprocess the acquired original dataset to enhance the performance of the classifier. Besides managing noise and dealing with missing values, there is a common issue in real-world datasets that the target class values are not balanced. Several real-world applications, for example medical diagnosis, fraud detection, network intrusion detection, fault monitoring, pollution detection, biomedical and bioinformatics analysis and remote sensing, suffer from this phenomenon. This disorder is known as class imbalance. The class imbalance problem has recently become a hot issue examined by machine learning and data mining researchers, and it is one of the major challenges faced by these fields. Imbalanced datasets reduce the performance of data mining and machine learning techniques and also affect the overall accuracy and decision making, since classifiers become inclined towards the majority class, which leads to misclassifying the minority class samples or handling them as noise. This affects the prediction accuracy of the classifier. Prediction accuracy on medical datasets is generally low when conventional classification techniques are used without additional preprocessing or data preparation. One of the solutions for dealing with the class imbalance problem is resampling. It is a preprocessing method that handles the imbalance by creating an almost balanced training dataset and adjusting the prior distribution of both the minority and the majority class. Sampling methods comprise under-sampling, over-sampling and sometimes hybrid techniques. The under-sampling approach balances the data by eliminating samples from the majority class, whereas the over-sampling method balances the data by creating duplicates of the present samples or by adding new samples to the minority class. Resample is one such technique which ensures the selection of the same number of instances for each class label; therefore we consider resampling as one approach to enhance classification accuracy.

In this study we have applied the bootstrapping method, a statistical resampling technique that randomly draws data points with replacement from a dataset and hence can result in higher accuracy. Resampling methods use the computer to produce a huge number of simulated samples; patterns in these samples are then summarized and evaluated. The strengths of bootstrap resampling are that each sample has an equal probability of being selected and that the simulated samples take full advantage of the information in the original sample. Resampling is suggested to be done with replacement. The technique is simple and accurate, needs fewer assumptions, and has better generalizability. Resampling gives particularly rich advantages where the expectations of traditional parametric tests are not met, as with small samples from non-normal distributions. This technique therefore helps in equalizing the minority classes, as it aims at obtaining the same number of data points for each class. The efficiency of the different classification techniques is then evaluated to suggest the most suitable choice. The classification algorithms have been applied to the PIMA Indians Diabetes Dataset of the National Institute of Diabetes and Digestive and Kidney Diseases, which contains the data of female diabetic patients.
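The resampling described above is performed inside WEKA in this study; purely as an illustration of the underlying idea, the hedged Python sketch below balances a two-class dataset by drawing bootstrap samples with replacement from each class until every class reaches the size of the largest one. The DataFrame `df` and the column name `class` are assumed placeholders, not names from the original experiments.

```python
# Illustrative sketch only: bootstrap resampling with replacement to balance classes.
# (The study itself applies WEKA's supervised Resample filter; names here are assumed.)
import pandas as pd

def bootstrap_balance(df: pd.DataFrame, label_col: str = "class",
                      seed: int = 1) -> pd.DataFrame:
    """Resample every class with replacement up to the size of the largest class."""
    target = df[label_col].value_counts().max()              # majority-class size
    parts = [grp.sample(n=target, replace=True, random_state=seed)
             for _, grp in df.groupby(label_col)]            # bootstrap each class
    # Shuffle so the class blocks are not contiguous in the balanced dataset.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```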

2. LITERATURE REVIEW
Yasodha et al. [1] use classification on diverse types of datasets to decide whether a person is diabetic or not. The diabetic patients' dataset was established by gathering data from a hospital warehouse and contains two hundred and forty nine instances with seven attributes. The instances of this dataset refer to two groups, i.e. blood tests and urine tests. The implementation is done using WEKA to classify the data, the data is assessed by means of the 10-fold cross-validation approach, as it performs very well on small datasets, and the outcomes are compared. Naïve Bayes, J48, REP Tree and Random Tree are used, and it was concluded that J48 works best, showing an accuracy of 60.2%.

Aiswarya et al. [2] aim to discover solutions to detect diabetes by investigating and examining the patterns found in the data via classification analysis, using Decision Tree and Naïve Bayes algorithms. The research hopes to propose a faster and more efficient method of identifying the disease that will help in well-timed treatment of the patients. Using the PIMA dataset with both cross-validation and a 70:30 split, the study concluded that the J48 algorithm gives an accuracy rate of 74.8% while naïve Bayes gives an accuracy of 79.5% with the 70:30 split.

Gupta et al. [3] aim to find and calculate the accuracy, sensitivity and specificity percentages of numerous classification methods, and also compare and analyse the results of several classification methods in WEKA. The study compares the performance of the same classifiers when implemented in other tools, including RapidMiner and Matlab, using the same parameters (accuracy, sensitivity and specificity). They applied the JRIP, J48graft and BayesNet algorithms. The results show that J48graft gives the highest accuracy, i.e. 81.3%, with sensitivity of 59.7% and specificity of 81.4%. It was also concluded that WEKA works better than Matlab and RapidMiner.

Lee et al. [4] focus on applying a decision tree algorithm named CART to a diabetes dataset after applying a resample filter over the data. The authors emphasise the class imbalance problem and the need to handle it before applying any algorithm in order to achieve better accuracy rates. Class imbalance mostly occurs in datasets having dichotomous values, which means that the class variable has two possible outcomes; it can be handled easily if observed early in the data preprocessing stage, which helps in boosting the accuracy of the predictive model. The study illustrates the effect of resampling in the medical field. The dataset used was acquired from the National Health and Nutrition Examination Survey (NHANES) 2009-2010, and its attributes include glucose (fasting and non-fasting) and body mass index. On this data the researchers built decision tree models to forecast undiagnosed diabetes among adults. The Centers for Disease Control and Prevention declared that the occurrence of diagnosed and undiagnosed diabetes is about 6.0% and 2.3%, respectively, resulting in a large burden on society, so efforts were dedicated to identifying undiagnosed diabetes for improved decision making by health care providers. Classification and Regression Tree (CART), being a recursive partitioning method, aims at splitting the data into different parts based on the most significant exposure variables. The tool used for experimentation is the R software, the data was split into a 70:30 ratio, and the maximum accuracy achieved by this study is 67%.

Chikh et al. [5] used an enhanced AIRS2 called MAIRS2 to increase the diagnostic accuracy of diabetes. The k-nearest neighbors algorithm is swapped with the fuzzy k-nearest neighbors to enhance the diagnostic accuracy. The diabetes dataset was acquired from the UCI machine learning repository. The authors attained a good tradeoff between classification accuracy and data reduction, the proposed system (MAIRS2) performed better than classical AIRS2, and the highest classification accuracy achieved by MAIRS2 is 89.10%.

Sharmila et al. [6] aim to analyse data for predicting diabetes from the medical records of patients. The study states that approximately 40 million Indians suffer from diabetes. It analyses diabetes from huge medical records by using decision trees with statistical implication in the R tool. R is a programming language and environment for analysis, graphics and software development activities for data mining in various fields. The datasets were collected from Chennai and have ten attributes (i.e. pregnant, LDL, post-prandial HDL, BMI, HbA1c, age, creatinine, family) and a class variable. There are four possible outcomes, i.e. the patient is positive for diabetes, pre-diabetes, gestational diabetes or non-diabetic. The CSV files are loaded into R, and after preprocessing the decision tree algorithm is applied to predict the four possible outcomes defined above and produce the results; the R tool analyses the datasets in 748.54 seconds. The study uses R because it is effective, extensible and provides a comprehensive environment for statistical computing and graphics; another important feature of R is that it supports a variety of file formats (XML, binary files, CSV) as well as user-created R packages. The study also uses decision trees because they are easy to understand, economical to construct, easy to integrate with database systems and relatively accurate in several applications. A thorough analysis of the diabetic datasets was done efficiently with the help of R, and the information discovered from the study can also be used to build efficient prediction models.

Sadhana et al. [7] emphasise the need to analyse the already available huge diabetic datasets to discover vital facts which may help in producing a prediction model. Besides the data mining techniques used previously, this study uses Hadoop, Hive and R for analysing the datasets. The datasets were taken from the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases. Eight attributes (number of pregnancies, plasma glucose concentration, blood pressure, serum insulin, body mass index, age, diabetes pedigree and skin fold depth) were used to produce the result: a patient is reported as affected by diabetes when the output is 1 and as unaffected when it is 0. The raw CSV file is injected into Hive as input, where the datasets are analysed on the basis of these attributes; the output of Hive is then given to R, which performs statistical analyses and produces graphs. The basic benefit of Hive is that it acts as a data warehouse solution constructed on top of Hadoop. The results produced are highly efficient, as Hive analysed the seven hundred and sixty eight records in just 19 seconds.
The graphs generated by R help in understanding the outcomes in a simpler manner. The study claims that a prediction model should be developed using such graphs and information.

Gowsalya et al. [8] aim to propose a system able to forecast the risk of readmission of diabetic patients within the coming 30 days, and accomplish this task with the help of the MapReduce technique. The risk factor obtained aids physicians in suggesting suitable care for the patients. The study presents a solution which uses Hadoop MapReduce to analyse huge datasets and mine valuable observations that aid in assigning resources effectively. For new patients, the system makes use of the information of prior patients with similar illnesses and reuses those suggestions. The system collects data directly from the patients (body sensors) and their associated doctors; this data is then stored on the Hadoop Distributed File System (HDFS) and the MapReduce technique is applied to it. Analysis is performed on datasets containing information on hospital admission, diabetic encounters, laboratory tests, medications and length of stay in the hospital. The rate of readmission is calculated from features like age, HbA1c result and modification in prescription. Haemoglobin A1c (HbA1c) is considered an important factor as a measure of glucose control, which mostly serves as a measure of diabetes; the likelihood of getting readmitted is high if its value is greater than 8%. The use of a distributed file system for the development of the proposed system allows inexpensive existing hardware and stores data across nodes. This predictive system helps hospitals and other health care organizations to assign clinicians, nurses, machinery and other resources in a better way.

Eswari et al. [9] focus on a prediction model that analyses the algorithm in a Hadoop/MapReduce environment to predict the widespread types of diabetes, the related problems and also the treatment. The suggested design of the predictive analysis system is constructed on numerous levels, e.g. data collection, warehousing, predictive analysis and processing of analysed reports. The system works in a Hadoop/MapReduce setting to categorize the type of diabetes, its problems and the type of treatment suggested for such patients. The suggested system uses Hadoop as the open-source distributed data processing platform; Hadoop has the ability to play the roles of both a data manager and an analytics tool. Big data analytics with Hadoop gives an organized approach for attaining improved results such as availability and affordability of healthcare facilities to the population, as this research aims to deal with the study of curing diabetes in the medical industry via big data analytics.

Salian et al. [10] explain that analysing big data will help in predicting the risk of diabetic patients' readmission efficiently by determining the risk predictors that can be a reason for readmission. The study suggests a predictive model that can find the patients with chronic diabetes who are most likely to be admitted again and again. The suggested system works by first loading the raw data into the Hadoop Distributed File System (HDFS), and then, by using Hive queries, all the nominated predictive variables are retrieved into a comprehensible dataset to use for modeling.
The model then works by selecting and applying various classification and prediction methods using Hadoop. The accuracy of the results was checked with a confusion matrix. The top five readmission predictors in the diabetic dataset of the proposed model are body mass index, plasma glucose, age, number of pregnancies and pedigree function. The study shows that the risk of readmission for diabetes patients can be evaluated by big data analytics; predictive modeling was carried out by applying the decision tree classification method, and the chance of readmission of a diabetic patient is successfully predicted by the proposed model.

Raghupathi et al. [11] define the potential and possibilities of big data analytics in healthcare. Along with the potential of big data analytics, the study also highlights several challenges to address. The analysis of big data in the health care sector results in cost reduction and quality treatment for patients; further benefits include identifying those individuals who would benefit from anticipatory care or from changing their routine in a proactive manner, outlining broad-scale disease trends to support prevention initiatives, gathering and issuing data on medical actions, and identifying, predicting and reducing fraud by applying advanced analytic systems for fraud recognition and by checking the correctness and stability of claims. Several challenges are also highlighted, including governance issues such as ownership, security and privacy, which have yet to be addressed. Overcoming these limitations will help in faster progress in analysing big data in healthcare.

Hay et al. [12] try to map the geographical areas where there is a greater chance of an infectious disease occurring and those areas where the chances are relatively low. The analysis is based upon environmental factors such as temperature and rainfall. The source data is gathered from various sources and in various formats, is required to be treated in real time, and thus makes use of big data techniques to map the surveillance of disease in real time. Using data mining techniques such as machine learning, together with crowd sourcing, provides an opportunity to create a continually or frequently updated atlas of infectious diseases; through big data analytics techniques it is possible to provide the risk map in real time.

Weber et al. [13] emphasise identifying all the diverse but useful data sources, like social media, census records and numerous other types of data, and then linking them together while taking care of privacy and security, so as to fully benefit from big data. Biomedical data is distributed across different isolated areas, so it is necessary to link it all in order to get better insights by analysing it; before linking data from all sources for analysis it is also necessary to distinguish between useful sources and irrelevant data sources. The study applies a probabilistic linkage algorithm for linking the diverse sources. The main advantage of this algorithm is that the same technique used to match patients across different electronic health records can be extended to data sources outside health care.

Barrett et al. [14] define the importance of big data in the prevention of certain diseases by continually measuring and analysing data in real time from different sources and suggesting precautions to particular individuals about their disease while lowering the cost. Big data can assist action on risk factors such as physical activity, nutrition, use of tobacco and exposure to pollution. The study describes two case studies to show how big data is helpful in disease prevention; disease prevention is based upon identifying modifiable risk factors such as exercise, diet, alcohol consumption, smoking and pollution, yielding insights that then lead to interventions to improve the risk factors and improve health.

Rao et al. [15] highlight the security challenges related to big data with particular reference to the healthcare sector. The study aims to propose feasible security solutions to fully benefit from big data relating to healthcare. It explains the necessity of big data analysis in the healthcare sector for proactive and reactive analysis of information, which provides opportunities for forecasting, realizing uncertain needs, and decreasing risks, along with providing tailored services.
The study also proposes four security models: the data de-identification model, the data-centric approach to security, the walled garden model and jujutsu security. Security solutions should be implemented in such a way that they guarantee safe analytics and secure big data frameworks.

Augustine et al. [16] focus on the benefits of using Hadoop as a more flexible, scalable and economical solution for the analysis of the big medical data (images) produced in healthcare sectors. Hadoop provides a solution for analysing medical images by combining images from numerous sources and extracting the important data for accurate diagnosis. The study emphasises the use of an interface called the Hadoop Image Processing Interface (HIPI), which supports image processing in Hadoop.

3. PROPOSED FRAMEWORK

In view of the problem statement described in the introduction, we propose a classification model with boosted accuracy to predict diabetic patients. In this model we have employed different classifiers, namely Decision Trees, KNN and Naïve Bayes. The major focus is to increase the accuracy by using a resample technique on a well-renowned benchmark diabetes dataset, the PIMA Indian Diabetes Dataset acquired from the UCI machine learning repository, which consists of eight attributes. The proposed framework is shown in Figure 1 and is composed of the following important phases:

• Dataset Selection (PIMA Indian Diabetes Dataset)
• Data Preprocessing
• Feature extraction through principal component analysis (PCA)
• Applying the Resample filter
• Learning by classifier (training), i.e. Naïve Bayes, KNN and Decision Trees
• Achieving the trained model with the highest accuracy
• Using the trained model for prediction

Figure 1. The proposed Classification Model for Diabetes Datasets.

The detailed description of the components and the activities performed for each component is given below.

3.1 Dataset Selection (Diabetes Dataset)


In data mining and machine learning, data selection is a process in which the most relevant data is selected from a specific domain to derive values that are informative and facilitate learning within that domain. In this study we have used a diabetes dataset with eight attributes that are used to predict gestational diabetes in a female patient. This dataset was obtained from the UCI repository and is a benchmark dataset. On the basis of the historical information stored in the dataset, such as age, body mass index, blood pressure and number of times pregnant, the classifiers are trained to decide whether the diabetes test for an individual is positive or negative. The PIMA diabetes dataset only represents Pima Indian females who are at least 21 years old. All of the attributes are of numeric-valued continuous data type, and the class label is a dichotomous (binary response) variable that follows each tuple of the dataset. The PIMA Indian Diabetes Dataset from the UCI repository contains 768 instances. The dataset was converted from CSV to the ".ARFF" format accepted by WEKA 3.6.13. The complete details of all the attributes are listed in Table 1 below.
Table 1. PIMA Dataset Description.

Sno  Attribute                           Type
1    Number of times pregnant            Numeric
2    Plasma glucose concentration        Numeric
3    Blood pressure (diastolic)          Numeric
4    Triceps skin fold thickness (mm)    Numeric
5    2-Hour serum insulin                Numeric
6    Body mass index (kg/m^2)            Numeric
7    Diabetes pedigree function          Numeric
8    Age (years)                         Numeric
9    Class variable (True or False)      Nominal
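For readers who want to reproduce the setup outside WEKA, the following minimal sketch loads the PIMA CSV file with the attribute names of Table 1 into a pandas DataFrame. The file name and the short column labels are assumptions for illustration; the study itself converts the CSV file to ARFF and opens it in WEKA 3.6.13.

```python
# Assumed file name and column labels; the original work uses the ARFF version in WEKA.
import pandas as pd

COLUMNS = ["pregnancies", "plasma_glucose", "diastolic_bp", "skin_fold_thickness",
           "serum_insulin", "bmi", "pedigree_function", "age", "class"]

df = pd.read_csv("pima-indians-diabetes.csv", header=None, names=COLUMNS)
print(df.shape)                       # expected (768, 9) for the benchmark dataset
print(df["class"].value_counts())     # distribution of the binary class variable
```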
3.2 WEKA Tool
WEKA 3.6.13 is used in this study. WEKA stands for the Waikato Environment for Knowledge Analysis; the tool is developed and distributed freely by the University of Waikato, New Zealand. WEKA is one of the most famous tools for data processing and data analysis. Since the WEKA software has been written in the Java language, it runs on almost every platform. It contains a variety of machine learning algorithms and is capable of solving a multitude of data mining and machine learning problems. WEKA supports many machine learning and data mining tasks such as regression, classification, prediction, feature selection and visualization. WEKA provides a database connection to access and manipulate data, and it allows us to create, run, modify and analyze experiments in whatever way is most suitable. The most prominent advantages of WEKA include its free availability, its portability, a broad collection of data preprocessing and modeling techniques, and a friendly graphical user interface that makes it easy to use. WEKA's performance is comparatively better than that of other data mining tools such as TANAGRA and MATLAB, and different classification techniques show much better results on WEKA than on other tools [18].

3.3 Data Preprocessing
Data preprocessing is a machine learning technique that consists of converting raw data into a logical and comprehensible format. Real-world data is mostly incomplete, inconsistent, unreliable, redundant and full of missing values. Data preprocessing is a conventional technique for eliminating such problems, which are also known as noise. Preprocessing involves activities like data cleaning, data integration, data transformation, data reduction and data discretization. Here the dataset is checked for duplicate values, missing values and type mismatches, and all these inconsistencies are eliminated from the dataset in the data preprocessing phase. It is important to clean the dataset before training a classifier on it in order to better learn the hidden patterns in the data. The set of pertinent feature vectors fed to the classifier helps it learn more accurately in a shorter span of time.
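A rough, hedged sketch of the checks described in Section 3.3 (duplicates, missing values and type mismatches) is given below. It assumes the DataFrame `df` from the loading sketch above and only approximates what the WEKA preprocessing panel does internally.

```python
# Approximate equivalent of the Section 3.3 checks; `df` is the assumed DataFrame
# from the loading sketch, and WEKA performs comparable steps through its GUI.
import pandas as pd

def basic_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.drop_duplicates()                          # remove duplicate records
    print("missing values per attribute:\n", cleaned.isnull().sum())
    print("attribute types:\n", cleaned.dtypes)             # spot type mismatches
    return cleaned.dropna()                                 # drop rows with missing values
```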

IJSER © 2017
http://www.ijser.org
International Journal of Scientific & Engineering Research Volume 8, Issue 5, May-2017 1545
ISSN 2229-5518

3.4 Feature Extraction through Principal Component Analysis (PCA)
After setting the classification objectives, we apply principal component analysis (PCA) on the dataset to determine the most suitable set of attributes that can help achieve better classification. The set of attributes suggested by the PCA is termed the feature vector in this study. Feature reduction or dimensionality reduction benefits us by reducing the computation and space complexity; simpler and more robust models can be developed, which are easier to understand and also save cost. Therefore we applied PCA on the entire PIMA dataset within the WEKA tool. A threshold value of 0.21 is selected, and all the attributes with a value greater than or equal to 0.21 are retained for further experimentation.
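A hedged scikit-learn sketch of PCA-based feature reduction is shown below. It is not identical to WEKA's principal-components attribute evaluation with the 0.21 threshold used in this study, but it conveys the same idea of retaining only the most informative directions; `df` is the assumed DataFrame from the earlier sketches, and keeping six components mirrors the six attributes reported in Section 4.

```python
# Illustrative PCA reduction (scikit-learn), standing in for WEKA's PCA filter.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["class"]).values        # attribute matrix from the assumed DataFrame
y = df["class"].values                       # binary class labels

X_scaled = StandardScaler().fit_transform(X) # PCA is sensitive to attribute scale
pca = PCA(n_components=6)                    # six retained components, as reported later
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)         # variance captured by each component
```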
3.5 Resample Filter
The supervised Resample filter is applied to the preprocessed dataset. As the class attribute is of nominal data type, we use the supervised resample filter in WEKA, which produces a random subsample of a dataset using either sampling with replacement or sampling without replacement. Resampling is a family of methods used to reconstruct sample data sets, including training sets and validation sets. The original dataset must fit completely in memory, and the number of instances in the generated dataset may be specified. This filter makes it possible to preserve the class distribution in the subsample, or to bias the class distribution towards a near-balanced distribution, and it can provide more "useful" sample sets for the learning process. The approach is very easy to implement and fast to run. Unbalanced classes do not have the same number of instances, and this is true for the experimental database; when the distribution of instances is not uniform, resampling of the experimental database is necessary. In this study we adopted the bootstrap method of resampling, which draws a random sample with replacement from the original sample. In order to achieve balanced classes, WEKA can use resampling with replacement, which replicates some instances within classes whenever a class has only a few instances. The parameters are set according to our requirements. This approach helps in balancing the imbalanced dataset and also gives an enhancement in our preferred accuracy measures.

3.6 Classifiers
A classifier is a machine learning tool that takes a group of data representing the objects we need to classify and tries to forecast which class new data belongs to. The classification objective set for this study is to achieve enhanced accuracy by using the Naïve Bayes, Decision Tree and KNN classifiers and to determine which one suits diabetes classification best. The classifiers selected for this study are ranked among the top ten best classifiers, especially k-nearest neighbour and decision trees. The techniques used are Naïve Bayes, J48, J48graft and IBk. These classifiers are selected on the basis of their strengths, described below, and also due to their frequent use in previous research studies.

3.6.1 Naïve Bayes
Naïve Bayes is a data mining classification technique used as a classifier. It predicts the probability that a sample belongs to a particular class. The qualities of Naïve Bayes are high accuracy and fast training, and it is usually used on very large datasets. The Naïve Bayes algorithm is a probabilistic algorithm that is sequential, following the steps of execution, classification, estimation and prediction. There are various data mining solutions for finding relations between diseases, symptoms and medications, but these algorithms have their own limitations, such as numerous iterations, high computational time and binning of continuous arguments. Naïve Bayes overcomes several of these limitations and can be applied to a large dataset in real time.
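As a hedged stand-in for WEKA's NaiveBayes classifier, a Gaussian Naïve Bayes model can be trained on the reduced feature matrix as sketched below; `X_reduced` and `y` are the assumed arrays from the PCA sketch, and the 70:30 hold-out split is only for demonstration (the study reports 10-fold cross-validation).

```python
# Gaussian Naïve Bayes sketch (scikit-learn), not the WEKA classifier used in the study.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=1, stratify=y)

nb = GaussianNB()
nb.fit(X_train, y_train)                                      # estimates per-class means/variances
print("hold-out accuracy:", nb.score(X_test, y_test))
print("class probabilities:", nb.predict_proba(X_test[:3]))   # probabilistic predictions
```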
3.6.2 Decision Trees
A decision tree is a classification technique mostly used for prediction and classification. A tree comprises paths, branches and leaf nodes. A collection of branches is called a path and represents attribute values, while leaves represent class values. Each path in a decision tree symbolizes a rule which is used for classification or prediction. A decision tree divides the data into subsets or nodes, with the root node representing the complete dataset. Tree pruning is performed after the tree is built completely, and pruning starts from the leaf nodes.
Looking specifically at the J48 decision tree classifier, it works on the following simple algorithm. To classify a new item, it first generates a decision tree based on the attribute values of the existing training data. Each time it encounters a set of items (the training set), it identifies the attribute that distinguishes the various instances most clearly. If, among the possible values of this attribute, there is a value for which all data instances falling in its category have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
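The sketch below uses scikit-learn's CART-style decision tree as a hedged stand-in for WEKA's J48/J48graft (C4.5); entropy-based splits and cost-complexity pruning approximate, but do not reproduce, the J48 behaviour described above. `X_train`, `y_train`, `X_test` and `y_test` are the assumed arrays from the previous sketch.

```python
# Decision tree sketch (CART in scikit-learn, standing in for WEKA's J48).
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(criterion="entropy",   # information-gain style splits
                              ccp_alpha=0.01,        # prune after the tree is fully grown
                              random_state=1)
tree.fit(X_train, y_train)
print(export_text(tree))                             # each root-to-leaf path reads as a rule
print("hold-out accuracy:", tree.score(X_test, y_test))
```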
3.6.3 k Nearest Neighbors (k-NN)
k-NN is a very simple data mining technique used for classification. k-NN is a type of instance-based learning, also referred to as lazy learning, which estimates the function locally and postpones all computation until classification. It can be beneficial to assign weights to the contributions of the neighbors, so that the closer neighbors contribute more to the average than those which are farther away. The distance is mostly measured using the Euclidean distance formula. Here k is a fixed value and mostly takes an odd value such as 1, 3 or 5.
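A hedged k-NN sketch follows, standing in for WEKA's IBk classifier used in the experiments; distance-weighted voting lets closer neighbours contribute more, as described above, and k is kept at small odd values. The arrays are again assumed from the earlier sketches.

```python
# k-NN sketch (scikit-learn), illustrating the lazy, instance-based classifier.
from sklearn.neighbors import KNeighborsClassifier

for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k,
                               weights="distance",    # closer neighbours weigh more
                               metric="euclidean")    # Euclidean distance, as in the text
    knn.fit(X_train, y_train)                         # "training" only stores the instances
    print(f"k={k}: hold-out accuracy = {knn.score(X_test, y_test):.3f}")
```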

The k-fold cross-validation technique is used for the training data. This technique is mostly used in circumstances where the aim is prediction and we wish to evaluate how a predictive model will perform in practice, especially in terms of accuracy. In a prediction problem, a model is generally fed with a dataset of known data instances on which training is done (the training dataset), as well as a dataset of unseen data against which the model is tested (the testing dataset). Cross-validation assesses predictive models by dividing the original sample dataset into a training set used to train the model and a test set used to evaluate it. In k-fold cross-validation, the original sample is divided at random into k equally sized subsamples. Of these k subsamples, one subsample is reserved as the validation data for testing the model, while the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data, working in a loop. One benefit of this technique is that every observation is used for both training and validation, and every single observation is used for validation exactly once. In this study we set the value of k = 10.
The WEKA tool is used for training and testing the learning model. The accuracy of the learning model is checked through the MSE (Mean Squared Error). While training the classifier, it is important to determine whether the classifier has learnt efficiently from the dataset; for this purpose the mean squared error technique is widely used, and the aim is to train the classifier until the mean squared error becomes negligible. If the desired accuracy is met then the trained model is saved, otherwise the preprocessing step is performed again.
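The 10-fold cross-validation protocol described above can be sketched with scikit-learn as follows; stratified folds are a common refinement of the plain random split described in the text, and the estimator, `X_reduced` and `y` are assumed from the earlier sketches rather than taken from the WEKA experiments.

```python
# 10-fold cross-validation sketch; every instance is used exactly once for validation.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1),
                         X_reduced, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean accuracy over the 10 folds:", round(scores.mean(), 3))
```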


4. EXPERIMENTATION AND RESULTS
The PIMA Indian diabetes dataset is used for experimentation in this study. In the PIMA dataset we have a two-class problem: the diabetes test of an individual patient is either positive or negative. The dataset has been acquired from the UCI machine learning repository and consists of 768 instances and nine attributes, namely Diastolic blood pressure (mm Hg), Plasma glucose concentration, Number of times pregnant, Body mass index, 2-Hour serum insulin, Triceps skin fold thickness (mm), Age (years), Diabetes pedigree function, and the class variable. After preprocessing, the number of data instances is reduced. We also applied PCA to reduce the dimensionality of the dataset; applied to all the attributes, PCA returned six attributes to be used for training the classifiers. The resample filter is then applied with no replacement, which prevents the data from being replicated. The classifiers are then applied: naïve Bayes, the decision trees and the lazy (k-NN) classifiers are applied one by one on the same data. The classification results are evaluated by comparing them in terms of correctly classified and incorrectly classified instances.
There are certain performance measures produced by WEKA other than accuracy, precision and recall; they include the F measure and the ROC area. The F measure is the weighted average of Precision and Recall, hence this measure takes both false positives and false negatives into account. The F measure is usually more useful than accuracy when the class distribution is uneven, and it works very well when false positives and false negatives have almost the same cost; if the costs of false positives and false negatives are very dissimilar, then a better choice is to consider Precision and Recall rather than Accuracy. The formula for the F1 score is 2*(Recall*Precision)/(Recall + Precision). Similarly, the ROC (receiver operating characteristic) curve is a graphical plot which shows the performance of a binary classifier system as its discrimination threshold is varied; the ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at several threshold values. The terms sensitivity, recall and probability of detection also indicate the true positive rate in machine learning.
This study is limited to three performance measures: accuracy, precision and recall.
Accuracy is the most intuitive performance measure; it is basically the ratio of correctly predicted observations. It is best to measure the accuracy when the classes are balanced, and therefore our focus is to enhance the accuracy. The formula used to calculate the accuracy is given below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)     ..... (1)

Precision is the number of True Positives divided by the sum of True Positives and False Positives. Hence, it is the number of correct positive predictions divided by the total number of positive class values predicted. Precision is also termed the Positive Predictive Value (PPV). The formula is given below:

Precision = TP / (TP + FP)     ..... (2)

Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Hence it is the number of correct positive predictions divided by the number of positive class values in the test data. Recall is also sometimes titled Sensitivity or the True Positive Rate. The formula is given below:

Recall = TP / (TP + FN)     ..... (3)

The performance of each classifier is measured in these terms using equations (1), (2) and (3).
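To make equations (1)-(3) concrete, the short sketch below computes the three measures, plus the F1 score mentioned earlier, from raw confusion-matrix counts. The function name is illustrative only; the example counts are the J48 values reported in Table 3 below and reproduce its percentages up to rounding.

```python
# Computing the performance measures of equations (1)-(3) from confusion-matrix counts.
def evaluate(tp: int, fn: int, fp: int, tn: int) -> dict:
    accuracy  = (tp + tn) / (tp + tn + fp + fn)            # equation (1)
    precision = tp / (tp + fp)                             # equation (2)
    recall    = tp / (tp + fn)                             # equation (3)
    f1 = 2 * (recall * precision) / (recall + precision)   # F1 score, as defined above
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# J48 counts from Table 3: accuracy ~0.944, precision ~0.971, recall ~0.910.
print(evaluate(tp=132, fn=13, fp=4, tn=157))
```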
The final results are shown below.

Table 2. Confusion Matrix for Naïve Bayes

                      Predicted Negative            Predicted Positive
Actual Negative       True positive (TP) = 107      False negative (FN) = 38
Actual Positive       False positive (FP) = 39      True negative (TN) = 112

Accuracy = 74.8%
Precision = TP/(TP+FP)*100 = 73.28%
Recall = TP/(TP+FN)*100 = 73.79%

Table 3. Confusion Matrix for Decision Tree (J48)

                      Predicted Negative            Predicted Positive
Actual Negative       True positive (TP) = 132      False negative (FN) = 13
Actual Positive       False positive (FP) = 4       True negative (TN) = 157

Accuracy = 94.44%
Precision = TP/(TP+FP)*100 = 97.05%
Recall = TP/(TP+FN)*100 = 91.03%

Table 4. Confusion Matrix for Decision Tree (J48graft)

                      Predicted Negative            Predicted Positive
Actual Negative       True positive (TP) = 132      False negative (FN) = 13
Actual Positive       False positive (FP) = 4       True negative (TN) = 157

Accuracy = 94.4%, Precision = 97%, Recall = 91.3%

Table 5. Confusion Matrix for KNN (k=1)

                      Predicted Negative            Predicted Positive
Actual Negative       True positive (TP) = 132      False negative (FN) = 13
Actual Positive       False positive (FP) = 6       True negative (TN) = 155

Accuracy = 93.79%, Precision = 95.65%, Recall = 91.03%

Table 6. Confusion Matrix for KNN (k=3)

                      Predicted Negative            Predicted Positive
Actual Negative       True positive (TP) = 113      False negative (FN) = 32
Actual Positive       False positive (FP) = 39      True negative (TN) = 122

Accuracy = 76.79%, Precision = 74.34%, Recall = 77.93%

Table 7. Comparison of all classifiers' performance

Classifier     TP    FN   FP   TN    Accuracy %   Precision   Recall   Mean Absolute Error
Naïve Bayes    107   38   39   112   74.84        73.28       73.79    0.249
J48            132   13   4    157   94.44        97.05       91.03    0.045
J48graft       132   13   4    157   94.44        97.05       91.03    0.044
k-NN (k=1)     132   13   6    155   93.79        95.65       91.03    0.016
k-NN (k=3)     113   32   39   122   76.79        74.34       77.93    0.098

The comparison of the performance of the different classifiers is also shown in the graphs below.

Figure 2. Accuracy comparison graph (accuracy of Naïve Bayes, J48, J48graft, kNN (k=1) and kNN (k=3)).


Figure 3. Comparison of all performance measures (accuracy, precision and recall for Naïve Bayes, J48, J48graft and kNN).

5. COMPARISON OF RESULTS
We compared the results achieved in this study with the results reported by other researchers in the existing literature. We mainly focused on the method used and the accuracy achieved by the other studies. A comparison of our framework with other studies is provided in Table 8.

Table 8. Results Comparison Table

Reference                        Proposed model / Method   Dataset Used                                        Purpose                          Accuracy Achieved (%)
N. Gupta et al. (2013)           Decision Tree             PIMA Indian Diabetes Dataset                        To predict diabetes              81.33
P. Yasodha, M. Kannan (2011)     Bayes Net                 A hospital repository                               To predict diabetes              66.2
A. Iyer et al. (2015)            Decision Tree             PIMA Indian Diabetes Dataset                        To predict diabetes              74.8
K. Rajesh, V. Sangeetha (2012)   Decision Tree             PIMA Indian Diabetes Dataset                        To predict diabetes              87
Lee (2014)                       Decision Tree             National Health and Nutrition Examination Survey    To predict diabetes              67
Chikh et al. (2012)              k-NN                      PIMA Indian Diabetes Dataset                        To predict diabetes              89.10
Our proposed framework           Decision Trees            PIMA Indian Diabetes Dataset                        To improve diabetes prediction   94.44
                                 Naïve Bayes                                                                                                    74.89
                                 kNN (k=1)                                                                                                      93.79
                                 kNN (k=3)                                                                                                      76.79

In this study we used the classification algorithms Naïve Bayes, Decision Trees and kNN for predicting diabetes. The results obtained from this study are compared with similar studies by other authors. From the comparison table we notice that the decision trees work better than the others: the decision tree algorithms, i.e. J48 and J48graft, outperform the other classifiers and the previous studies, achieving the highest accuracy rate of 94.44%. The decision tree is a simple and good classifier for predicting diabetes. A comparison of the accuracy produced by all the classifiers before applying resampling and the accuracy produced after applying resampling is given below.

Table 9. Comparison before and after applying Resampling

Classifier                 Without Bootstrapping (Accuracy %)   After Bootstrapping (Accuracy %)
Naïve Bayes                71.45                                74.89
Decision Tree (J48)        78.43                                94.44
Decision Tree (J48graft)   78.43                                94.44
k-NN (k=1)                 69.93                                93.79
k-NN (k=3)                 72.22                                76.79

6. CONCLUSION AND FUTURE WORK
Data mining plays an important role in various fields such as artificial intelligence (AI) and machine learning (ML), statistics and database systems. The core objective of this study is to enhance the accuracy of the predictive model. The accuracy can be increased by improving the data, by improving the algorithms, or even by algorithm tuning. We enhance the accuracy by improving the data in the preprocessing phase, which works very well. Applying the bootstrapping resampling technique on the PIMA dataset increases the accuracy of almost all classifiers, but the decision trees lead over the others. It is also concluded that the accuracy of a model is highly dependent on the dataset; this technique works very well on the PIMA diabetic dataset but may not guarantee the same results on a different dataset.


In future work it is planned to use more advanced classifiers such as artificial neural networks (ANN), genetic algorithms (GA) and evolutionary algorithms (EA).

The diabetes dataset considered in this study might not cover some other important factors that are related to gestational diabetes, like metabolic syndrome, family history, smoking habits, sedentary routines and some dietary patterns. An appropriate prediction model would need additional relevant data to make it more accurate; this could be accomplished by gathering diabetic patients' datasets from various sources to generate a more relevant model. This is a limitation of this research.

ACKNOWLEDGMENT
I would like to thank Dr. Naeem Khan for his guidance and cooperation in completing my research work. I would also like to thank my family for their support throughout my life.

• Uswa Ali Zia is a student of MSCS at SZABIST Islamabad. E-mail: malka_09@yahoo.com
• Dr. Naeem Khan is working as assistant professor (Department of Computer Science) at SZABIST Islamabad. E-mail: dr.naeem@szabist-isb.pk

References

[1] P. Yasodha and M. Kannan, "Analysis of a Population of Diabetic Patients Databases in Weka Tool", International Journal of Scientific & Engineering Research, vol. 2, no. 5, 2011.
[2] A. Iyer, J. S and R. Sumbaly, "Diagnosis of Diabetes Using Classification Mining Techniques", IJDKP, vol. 5, no. 1, pp. 01-14, 2015.
[3] N. Gupta, A. Rawal and V. Narasimhan, "Accuracy, Sensitivity and Specificity Measurement of Various Classification Techniques on Healthcare Data", IOSR Journal of Computer Engineering, vol. 11, no. 5, pp. 70-73, 2013.
[4] P. Lee, "Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets", International Journal of Environmental Research and Public Health, vol. 11, no. 9, pp. 9776-9789, 2014.
[5] M. Chikh, M. Saidi and N. Settouti, "Diagnosis of diabetes diseases using an Artificial Immune Recognition System2 (AIRS2) with fuzzy K-nearest neighbor", Journal of Medical Systems, vol. 36, no. 5, pp. 2721-2729, 2012.
[6] K. Sharmila and S. Manickam, "Efficient Prediction and Classification of Diabetic Patients from Big Data Using R", International Journal of Advanced Engineering Research and Science, vol. 2, Sep 2015.
[7] S. Sadhana and S. Savitha, "Analysis of Diabetic Data Set Using Hive and R", International Journal of Emerging Technology and Advanced Engineering, vol. 4, July 2014.
[8] M. Gowsalya, K. Krushitha and C. Valliyammai, "Predicting the risk of readmission of diabetic patients using MapReduce", pp. 297-301, 2014.
[9] N. M. S. Kumar, T. Eswari, P. Sampath and S. Lavanya, "Predictive methodology for diabetic data analysis in big data", Procedia Computer Science, vol. 50, pp. 203-208, 2015.
[10] S. Salian and G. Harisekaran, "Big Data Analytics Predicting Risk of Readmissions of Diabetic Patients", International Journal of Science and Research, vol. 4, April 2015.
[11] W. Raghupathi and V. Raghupathi, "Big data analytics in healthcare: promise and potential", Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
[12] S. Hay, D. George, C. Moyes and J. Brownstein, "Big Data Opportunities for Global Infectious Disease Surveillance", PLoS Med, vol. 10, no. 4, p. e1001413, 2013.


[13] G. Weber, K. Mandl and I. Kohane, "Finding the Missing Link for Big Biomedical Data", JAMA, 2014.
[14] M. Barrett, O. Humblet, R. Hiatt and N. Adler, "Big Data and Disease Prevention: From Quantified Self to Quantified Communities", Big Data, vol. 1, no. 3, pp. 168-175, 2013.
[15] S. Rao, S. Suma and M. Sunitha, "Security Solutions for Big Data Analytics in Healthcare", 2015 Second International Conference on Advances in Computing and Communication Engineering, 2015.
[16] D. Peter Augustine, "Leveraging big data analytics and Hadoop in developing India's healthcare services", International Journal of Computer Applications, vol. 89, no. 16, pp. 44-50, 2014.
[17] S. David and A. Saeb, "Comparative Analysis of Data Mining Tools and Classification Techniques using WEKA in Medical Bioinformatics", Computer Engineering and Intelligent Systems, vol. 4, no. 13, 2013.
