
COMPARATIVE ANALYSIS OF MACHINE LEARNING

ALGORITHMS USING DIABETES DATASET

REPORT SUBMITTED IN PARTIAL FULFILLMENT OF


THE REQUIREMENT FOR THE DEGREE OF

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

BY
PRATIKSHA DUTTA
Roll Number: 160103011

MR. KISHOR KASHYAP

ASSISTANT PROFESSOR

DEPARTMENT OF INFORMATION TECHNOLOGY


GAUHATI UNIVERSITY
GUWAHATI, INDIA, JULY 2020
DECLARATION

I, PRATIKSHA DUTTA, Roll No 160103011, B.Tech. student of the Department of Information Technology, Gauhati University, hereby declare that I have compiled this report reflecting all my work during the semester-long full-time project undertaken as part of my B.Tech curriculum.
I declare that I have included the descriptions of my own project work and that nothing has been copied or replicated from others' work. The facts, figures, analysis, results and claims presented in this thesis all relate to my full-time project work.
I also declare that neither this report nor any substantial portion of it has been submitted anywhere else as part of the requirements for any other degree or diploma.

(Pratiksha Dutta)
Branch: CSE
Date:
GAUHATI UNIVERSITY

DEPARTMENT OF INFORMATION TECHNOLOGY

Gopinath Bordoloi Nagar, Jalukbari Guwahati-781014

Date:

CERTIFICATE

This is to certify that Pratiksha Dutta, bearing Roll No 160103011, has carried out the project work "Comparative Analysis of Machine Learning Algorithms Using Diabetes Dataset" under my supervision and has compiled this report reflecting the candidate's work in the semester-long project. The candidate worked on this project full time throughout the semester under my supervision, and the analysis, results and claims are all related to her studies and work during the semester.

I recommend submission of this project report in partial fulfillment of the requirements for the degree of Bachelor of Technology in Information Technology/Computer Science & Engineering of Gauhati University.
(MR. KISHOR KASHYAP)
(Assistant Professor)
GAUHATI UNIVERSITY
DEPARTMENT OF INFORMATION TECHNOLOGY
Gopinath Bordoloi Nagar, Jalukbari Guwahati-781014

External Examiner's Certificate

This is to certify that Pratiksha Dutta, bearing Roll No 160103011, has delivered her project presentation. I have examined her report entitled "Comparative Analysis of Machine Learning Algorithms Using Diabetes Dataset" and recommend this project report in partial fulfillment of the requirements for the degree of Bachelor of Technology in Information Technology/Computer Science & Engineering of Gauhati University.

_ _
ACKNOWLEDGEMENT

A project is an opportunity for learning and self-development. It has been a pleasure to have so many wonderful people guide me through the completion of this project work.

I convey my sincere gratitude to Mr. Kishor Kashyap, who guided me throughout the project. Without his spirit of accommodation, willing disposition, timely clarification, frankness and, above all, faith in me, this study could not have been completed, let alone be fruitful. His readiness to discuss all important matters at work deserves special attention. I am indebted to him for his constant support, encouragement and interest at every step of this project.

I sincerely thank Dr. Vaskar Deka, Head of the Department of Information Technology, Gauhati University, for giving me the opportunity to carry out my project. I am highly obliged and grateful. I would also like to thank all the faculty members for their valuable suggestions and help.

Pratiksha Dutta

Branch: CSE
Abstract

Machine learning centres on designing systems that can learn and make predictions based on experience, which in the case of machines is data. It enables a system to act and make data-driven decisions rather than being explicitly programmed to perform a specific task. Such systems learn and improve over time as more and more new data is fed to them. Advances in this technology have brought a remarkable change to medical science: they help improve patient care, now and in the future, by processing huge datasets beyond human capability and by equipping doctors with more valuable information, such as early detection of diseases, for better diagnosis and treatment options. This project presents a comparative analysis of classification algorithms, predicting on and evaluating their performance against a medical dataset, the Pima Indian diabetes dataset. Since diabetes is a chronic lifestyle disease, its early detection is essential for efficient and accurate treatment, which makes it a natural candidate for machine learning tasks such as classification.

Keywords: Machine Learning, Classification, Diabetes Dataset.


Contents

Chapter 1: Introduction
1.1 Overview
1.2 Motivation and Objective
1.3 Problem Statement

Chapter 2: Background Study and Related Works

Chapter 3: Methods and Methodology

Chapter 4: Results and Discussion

Chapter 5: Conclusion and Future Scope

Chapter 6: References
List of Figures
Figure 1: Bar graph
Figure 2: Histogram
Figure 3: Heatmap
Figure 4: ROC-AUC curve of Naive Bayes classifier
Figure 5: ROC-AUC curve of Linear Discriminant classifier
Figure 6: ROC-AUC curve of KNN classifier
Figure 7: ROC-AUC curve of Random Forest classifier
Figure 8: ROC-AUC curve of Support Vector Machine classifier
Figure 9: ROC-AUC curve of Logistic Regression classifier
Figure 10: ROC-AUC curve of Decision Tree classifier

List of Tables
Table 1: Performance Evaluation of Machine Learning Algorithms
CHAPTER 1: INTRODUCTION

Machine learning is a subset of artificial intelligence that enables a machine to learn in a way loosely resembling the human brain. Through machine learning, the machine develops patterns from the information that is fed to it. The astounding development of machine learning as a tool to identify, describe, recognise, classify or generate complex information, and its rapid application in a wide range of fields including image recognition, face detection, traffic prediction, disease prediction, data analysis and even quantum systems, has demonstrated the tremendous potential of this technology, which is now gaining fresh momentum. Especially in the field of medical science, machine learning is one of the most effective techniques for modelling the broad range of variables associated with a disease.

Examining physiological information, environmental influences and hereditary factors allows medical experts to diagnose diseases earlier and more effectively. Machine learning also acts as a double check for findings a medical expert might miss. This project aims to study and compare the outcomes of machine learning algorithms on the Pima Indian diabetes dataset.

1.1 Overview

Machine learning is helping clinical experts reach conclusions more easily by bridging the gap between gigantic datasets and human knowledge. It enables us to study the underlying biological mechanisms associated with a disease, determine the impact of risk factors on its development, and help detect the group of people who are at risk of developing the disease.

1.2 Motivation and Objective

Diabetes is a chronic disease that affects the normal metabolism of the human body, resulting in increased blood sugar levels. Given the clinical information we can gather about individuals, we should be able to improve prediction of how likely an individual is to suffer the onset of diabetes, so that early treatment can be provided. Left untreated, diabetes can have many fatal complications. We can begin analysing the data and trying different machine learning algorithms that help us study the onset of diabetes in the Pima Indians, and then compare them to identify the most effective algorithm for diagnosing this ailment.
1.3 Problem Statement

The application of machine learning in the healthcare sector is benefitting the field with earlier diagnosis of diseases as well as smarter diagnostic techniques. This powerful subset of AI has proved to have life-impacting potential in health care, especially in the area of medical diagnosis.
The focus of this project is to identify the group of people who are highly likely to be diagnosed with diabetes using machine learning classification algorithms, and further to evaluate the various algorithms by comparing their performance on the given dataset.
CHAPTER 2: BACKGROUND STUDY AND RELATED WORKS

2.1 Classification in Machine Learning

Machine learning is one of the fastest developing areas of computer science, with far-reaching applications, and the field is expanding day by day. As a system gains more and more experience, its performance on a particular task is expected to improve. Based on the type of feedback the machine receives in order to learn, machine learning is broadly categorised into three types: supervised learning, unsupervised learning and reinforcement learning.

• In supervised learning, the learner is trained on labelled examples or instances; the ideal outputs for a problem are known ahead of time. The basic objective of the machine is to learn a function that maps the input to the output. The most common form of supervised learning is classification. In classification, a machine predicts the category of a set of instances, that is, it assigns features or attributes to their classes, having been given the class labels during training. Classification can be binary or multi-class.

• Unsupervised learning works on unlabelled instances. Here the system cannot explicitly learn from any correct answers. It attempts to discover intrinsic patterns, similarities and differences that can then be used to determine groups for the given instances. Clustering is a form of unsupervised learning, where the learner must explore underlying structures or correlations in the data to learn relationships rather than rules.

• In reinforcement learning, desired outputs are not directly provided. Each action of the learner has a different effect on the environment, and the environment provides feedback on the action in the form of rewards and punishments. The learner learns based on the rewards and punishments it receives from the environment.

• Classification is one of the significant tasks in machine learning: the process of assigning given data, in the form of instances or examples, to a class or category. The learning algorithm uses a set of examples to learn a classifier that is expected to correctly predict the class label of unseen (future) instances. The learnt classifier takes the values of the features or attributes of an object as input and produces one of the predefined class labels as output. The set of class labels is defined as part of the problem (by users).
A typical classification example is an email spam-catching system, which is important and necessary in real-world applications. Given a set of emails marked as "spam" and "non-spam", the learner will learn the characteristics of the spam emails, and the learnt classifier is then able to process future email messages and mark them as "spam" or "non-spam".

2.2 K-Nearest Neighbors

The most straightforward classifier in the collection of machine learning techniques is the nearest neighbour classifier. In this algorithm, the class of a query point is determined by its feature similarity to its nearest neighbours: the query point is assigned a class based on how closely it matches those neighbours.
An integer value of K, the number of nearest neighbours to consider, is chosen first. Then, for every test data point, a distance is calculated between the test point and each row of the training data; the training points are sorted by this distance, and the first K rows are selected from the sorted array. The test data point is assigned the most frequent class among these K rows.
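A minimal from-scratch sketch of this procedure is given below; the Euclidean distance and the default value of k are illustrative assumptions, not settings taken from the report.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Distance from the query point to every training row
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
```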

2.3 Support Vector Machine

Support vector machines (SVMs) are a more recent supervised machine learning technique, related to classical multilayer perceptron neural networks. An SVM is based on the idea of a margin: the region on either side of a hyperplane that separates two data classes. Maximising the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been shown to reduce an upper bound on the expected generalisation error.

2.4 Decision Tree Classifier

Decision trees (DT) are trees that classify instances by sorting them according to feature values. Each node of a decision tree represents a feature of the instance to be classified, and each branch represents a value that the node can take. Classification of an instance starts at the root node and proceeds by sorting the instance according to its feature values. A decision tree, as used in decision tree learning, data mining and machine learning, is a predictive model that maps observations about an item to conclusions about the item's target value. Such tree models are also known as classification trees or regression trees. Decision tree classifiers mostly use post-pruning techniques that evaluate the performance of the decision tree as it is pruned using a validation set: any node can be removed and assigned the most common class of the training instances sorted into it.

2.5 Naive Bayes

A naive Bayes classifier is a classification algorithm based on Bayes' theorem, with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Even if the features depend on each other, all of these properties contribute to the probability independently. A naive Bayes model is easy to build and is particularly useful for relatively large datasets. Despite its simplicity, naive Bayes is known to outperform many other classification methods in machine learning. The Bayes theorem used to implement naive Bayes is as follows:

P(class | features) = P(features | class) * P(class) / P(features)

2.6 Logistic Regression

Logistic regression is the most commonly used machine learning algorithm for predicting binary classes, e.g. churn or not churn. The algorithm applies the sigmoid function to each input and outputs a probability between 0 and 1. A threshold is set at 0.5, above which an instance is categorised as 1 (churn) and below which it is categorised as 0 (not churn).
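A minimal sketch of this decision rule in NumPy; the weight vector and bias are placeholders assumed for illustration rather than learned parameters from the report.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, weights, bias, threshold=0.5):
    """Label an instance 1 if its predicted probability exceeds the threshold, else 0."""
    probabilities = sigmoid(X @ weights + bias)
    return (probabilities >= threshold).astype(int)
```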

2.7 Random Forest

A forest, as the name suggests, is a group of trees. Random forest is a simple and versatile supervised classification algorithm that combines the predictions of many decision trees, each built on a different data sample, and selects the final class by voting. It is an ensemble method that effectively averages the results of the individual decision trees and therefore helps avoid overfitting.
2.8 Linear Discriminant Analysis

The limitations of logistic regression, such as handling multi-class classification, instability with well-separated classes and instability with few instances, are overcome by another linear classification algorithm, linear discriminant analysis, which computes summary statistics for each class. For a single feature these are the class mean and variance; for multiple features they are the class means and the covariance matrix, under a multivariate Gaussian assumption.
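A rough from-scratch sketch of these per-class statistics and the resulting linear discriminant rule, assuming a shared covariance matrix estimated on mean-centred data (a simplification for illustration, not the report's implementation):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate per-class means, class priors and the inverse of a pooled covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    centered = np.vstack([X[y == c] - means[c] for c in classes])  # shared covariance
    cov_inv = np.linalg.inv(np.cov(centered, rowvar=False))
    return classes, means, priors, cov_inv

def predict_lda(x, classes, means, priors, cov_inv):
    """Assign x to the class with the largest linear discriminant score."""
    scores = [x @ cov_inv @ means[c] - 0.5 * means[c] @ cov_inv @ means[c] + np.log(priors[c])
              for c in classes]
    return classes[int(np.argmax(scores))]
```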

2.9 Related Works

● Ramana et al. [1] carried out an analysis of classification algorithms, namely Bagging, IBK, J48, JRip, Multilayer Perceptron (MP) and Naive Bayes (NB) classifiers, on several medical datasets: Breast Cancer Data, Chronic Kidney Disease, Cryotherapy, Hepatitis, Immunotherapy, Indian Liver Patient Dataset (ILPD) and Liver Disorders.
● Nindrea et al. [2] worked to predict the diagnostic accuracy of different machine learning algorithms for breast cancer risk calculation. They showed that SVM had better accuracy than the other machine learning algorithms.
● Radhika P R et al. [3] detected lung cancer using machine learning algorithms and compared them to determine their efficiency for early detection of the disease. They showed that SVM was the better algorithm in terms of accuracy for the lung cancer dataset.
● Orabi et al. [4] proposed a classification model for diabetes that can anticipate the likelihood of diabetes at a particular age. The proposed model is organised around the application of decision trees. The outcomes obtained were acceptable for predicting the disease at a particular age, with the highest accuracy achieved using the decision tree framework [11][3].
● Jakka et al. [5] carried out an investigation to assess the performance of classifiers in predicting the probability of diabetes in patients. The classifiers were evaluated and Logistic Regression (LR) was found to have the best accuracy, 77.6%, compared to the other algorithms.
● Mujumdar et al. [6] proposed a similar comparison of classifiers on two different diabetes datasets and found that Logistic Regression achieved the highest accuracy, 97.2%. The application of a pipeline resulted in AdaBoost being the best classification model, with an accuracy of 98.8%.
● Pradhan P. [7] trained and tested the Pima Indian Diabetes (PID) dataset from the UCI repository applying the Genetic Programming (GP) method.
● Yuvaraj et al. [8] introduced an application for diabetes prediction utilising three different ML classifiers: Random Forest, Decision Tree and Naive Bayes. They extracted the relevant features, choosing only eight features out of 13, and found that the random forest algorithm has an accuracy of 94%, higher than the other algorithms.
● Nongyao et al. [9] used the DT, SVM, LR and NB classifiers to determine the risk of diabetes mellitus. The experiment demonstrated that LR gives the best result among them.
CHAPTER 3: METHODS AND METHODOLOGY

3.1 Importing the data


The dataset is fetched in CSV format. The dataset used in this project is the publicly available Pima Indian diabetes database, obtainable from Kaggle. The dataset comprises several clinical indicator (independent) variables and one target (dependent) variable, Outcome. The number of pregnancies the patient has had, blood pressure, insulin level and glucose level are some of the independent variables. Some important open-source Python libraries are imported: NumPy, the universal standard for working with scientific numerical data; pandas, which provides high-performance, easy-to-handle data structures and data-analysis tools; seaborn for data visualisation; and matplotlib for static, animated and interactive visualisation.
After this, exploratory analysis is carried out on the data to learn its potential features and to check whether the data needs cleaning.
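A minimal sketch of this step; the file name diabetes.csv is assumed to be the Kaggle download placed in the working directory.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian diabetes data from a local CSV download (assumed file name)
df = pd.read_csv("diabetes.csv")
print(df.head())
```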

3.2 Application of EDA

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

1. maximise insight into a dataset;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.

The specific graphical techniques employed in EDA are usually quite simple, consisting of various ways of:
● Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots and Youden plots).
● Plotting simple statistics, such as mean plots, standard deviation plots, box plots and main effect plots of the raw data.
● Positioning such plots so as to maximise our natural pattern-recognition abilities, for example by using multiple plots per page.

Using head() and shape, the diabetes dataset was found to have 768 rows and 9 columns. The Outcome column is used to predict whether the patient has diabetes: if the outcome is zero (0) the patient does not have diabetes, and if it is one (1) the patient has diabetes.

After this, the dataset is checked for null values with the pandas isnull() function in preparation for data cleaning; however, it was observed that there were no null data points in the diabetes dataset obtained.
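A short sketch of these checks, continuing from the `df` loaded in the sketch above:

```python
# Dimensions of the dataset: expected (768, 9)
print(df.shape)

# Class balance of the target column (0 = no diabetes, 1 = diabetes)
print(df["Outcome"].value_counts())

# Count missing values per column; the report finds none
print(df.isnull().sum())
```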
In this project, the diabetes dataset is visualised through three kinds of plot: a bar plot, which is essentially used for count data or data showing accumulation or proportion; a histogram, a classical exploratory plot that shows the range and density of the data and the distribution of each variable, usually applied to numerical data; and a heatmap, which presents the correlation of every pair of variables in the dataset in both numerical and visual form for better understanding, with the darkest red indicating high correlation and the darkest blue indicating no or negative correlation.
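The plots described above can be produced with seaborn and matplotlib roughly as follows (a sketch, continuing from the `df` loaded earlier):

```python
# Bar plot of the two outcome classes
sns.countplot(x="Outcome", data=df)
plt.show()

# Histograms showing the distribution of every attribute
df.hist(figsize=(10, 8))
plt.show()

# Heatmap of pairwise correlations (red = high, blue = none or negative)
sns.heatmap(df.corr(), annot=True, cmap="RdBu_r")
plt.show()
```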

3.3 Splitting the data into test and train data

After the data is scaled and organised using exploratory data analysis, it is divided into a training set and a test set. The training set is the data on which the algorithm, or training procedure, is applied to build the model, while the test data is used to assess the model.
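A minimal sketch of this split using scikit-learn; the 75/25 ratio, the stratification and the random seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Hold out 25% of the rows for testing; stratify to keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale features on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```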

3.4 Application of Machine Learning Algorithms


The procedure by which a learning algorithm (classifier inducer) uses observations to learn a new classifier is known as the training process, and the procedure by which the learnt classifier is tried on unseen observations is known as the testing process. Machine learning algorithms are programs (mathematics and logic) that adjust themselves to perform better as they are exposed to more data. Just as humans change as they learn from the data they process, the "learning" in machine learning means that these algorithms change as they process data over time. So an algorithm is a program with a particular method for changing its own parameters, given feedback on its past performance in making predictions about a dataset. The algorithms most commonly used for classification on almost any dataset are:
● Linear Regression
● Logistic Regression
● Decision Tree
● SVM
● Naive Bayes
● kNN
● K-Means
● Random Forest
● Dimensionality Reduction Algorithms
● Gradient Boosting algorithms
○ GBM
○ XGBoost
○ LightGBM
○ CatBoost

In this project, an evaluation is drawn among the algorithms Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, Random Forest and Linear Discriminant Analysis on the diabetes dataset, comparing their performance on the same data.
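A sketch of how this comparison might be run with scikit-learn, continuing from the split above; the report does not state its hyperparameters, so library defaults are assumed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "LDA": LinearDiscriminantAnalysis(),
}

# Train each classifier on the same training split and score it on the same test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.4f}")
```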

3.5 Evaluation of Performance Measures

To determine which algorithm performs best on the dataset, the following performance measures are evaluated and compared.
ROC curve: the ROC curve is judged good or bad by its AUC (area under the curve), alongside other parameters derived from the confusion matrix. A confusion matrix is a table that is typically used to describe the performance of a classification model on a set of test data for which the true values are known.

True positives and true negatives are the desired, correct predictions, whereas false positives and false negatives occur when the actual class disagrees with the predicted class.
True Positives (TP) - the actual class is yes and the predicted class is also yes; correctly predicted positive values.
True Negatives (TN) - the actual class is no and the predicted class is also no; correctly predicted negative values.
False Positives (FP) - the actual class is no but the predicted class is yes.
False Negatives (FN) - the actual class is yes but the predicted class is no.

Accuracy - Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One might believe that high accuracy means the model is ideal. Accuracy is indeed a great measure, but only for symmetric datasets where the counts of false positives and false negatives are roughly the same. Therefore, other parameters are equally important when assessing the performance of a model.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It measures how accurate the model's positive predictions are; high precision corresponds to a low false positive rate.

Precision = TP / (TP + FP)

Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations that are actually positive.

Recall = TP / (TP + FN)

F1 score - The F1 score is the harmonic mean of precision and recall; it takes both false positives and false negatives into account. An F1 score of 1 is best, while a score of 0 means the model is a total failure.

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
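These measures can be computed with scikit-learn roughly as follows (a sketch, continuing from the fitted `models` above; the weighted averaging is an assumption made to match the style of the table in Chapter 4):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = models["LDA"].predict(X_test)
y_prob = models["LDA"].predict_proba(X_test)[:, 1]   # probability of the positive class

print(confusion_matrix(y_test, y_pred))              # [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```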


CHAPTER 4: RESULTS AND DISCUSSION

● EDA

Figure 1: This bar graph shows the positive and negative classes of the dataset. The blue bar with value 0 depicts the negative class, i.e. the number of patients without diabetes, and the orange bar with value 1 depicts the positive class, i.e. the number of patients with diabetes.
Figure 2: The histogram shows the distribution pattern of all the attributes in the dataset. These plots can also be interpreted as the probability distribution function (PDF) of each of the features.
Figure 3: A heatmap is used to collectively plot the correlation patterns of each of the features present in the dataset. This gives a clear idea of how each of the features is related to the others, and any redundant or duplicate features in the dataset can be spotted here. The denser the colour, the more correlated the data.
● EXPERIMENTAL RESULT

Various ML classifier models are assessed for the diagnosis of diabetes. The accuracy and F1-score of the classifiers are evaluated based on the number of correctly and incorrectly classified instances out of the total number of instances.

ALGORITHM                      ACCURACY   PRECISION   RECALL   F1-SCORE
K-Nearest Neighbors            0.7552     0.7506      0.7552   0.7524
Support Vector Machine         0.7812     0.7742      0.7812   0.7729
Logistic Regression            0.7917     0.7859      0.7917   0.7864
Naive Bayes                    0.7656     0.7572      0.7656   0.7575
Decision Tree                  0.7344     0.7432      0.7344   0.7378
Random Forest                  0.7865     0.7816      0.7865   0.7829
Linear Discriminant Analysis   0.8021     0.797       0.8021   0.7945
Table 1: Performance evaluation of the machine learning algorithms in terms of accuracy, precision, recall and F1 score.
As Table 1 shows, linear discriminant analysis has the best accuracy among all the algorithms, at about 80%. The F1-scores of the algorithms also show that LDA performs better than the others.

Figure 4: ROC-AUC curve of Naive Bayes classifier

Fig 5: ROC-AUC curve of Linear Discriminant Analysis classifier


Fig 6: ROC-AUC curve of KNN classifier

Fig 7: ROC-AUC curve of Random Forest classifier

Fig 8: ROC-AUC curve of SVC classifier


Fig 9: ROC-AUC curve of Logistic Regression classifier

Fig 10: ROC-AUC curve of Decision Tree classifier

The above curves give the following results:

Decision Tree: ROC AUC = 0.711
Logistic Regression: ROC AUC = 0.861
SVC: ROC AUC = 0.842
Naive Bayes: ROC AUC = 0.818
LDA: ROC AUC = 0.860
KNN: ROC AUC = 0.760
Random Forest: ROC AUC = 0.848

The higher the AUC value, the better the classifier. Logistic Regression and LDA have comparable AUC scores and are the best classifiers for this dataset.
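Curves like Figures 4 to 10 can be drawn with scikit-learn roughly as follows (a sketch reusing the fitted `models` dictionary and the held-out split assumed in Chapter 3):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:, 1]      # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, y_prob):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="No-skill baseline")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```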

CHAPTER 5: CONCLUSION AND FUTURE SCOPE

In the medical field, early detection and proper diagnosis play a very important part in the treatment of a disease. For early detection, the doctor has to put the patient through numerous examinations, which is cumbersome. To shorten the time needed for early detection, machine learning is the need of the hour. In this project, several classification algorithms were evaluated on their classification performance in terms of accuracy, sensitivity, precision, specificity and ROC area. In terms of accuracy, Linear Discriminant Analysis achieves the highest value, while Logistic Regression and Linear Discriminant Analysis have comparable ROC areas.
CHAPTER 6: REFERENCES

[1] Bendi Venkata Ramana, Raja Sarath Kumar Boddu, "Performance Comparison of Classification Algorithms on Medical Datasets".
[2] Ricvan Dana Nindrea, Teguh Aryandono, Lutfan Lazuardi, Iwan Dwiprahasto, "Diagnostic Accuracy of Different Machine Learning Algorithms for Breast Cancer Risk Calculation: a Meta-Analysis".
[3] Radhika P R, Rakhi A S Nair, Veena G, "A Comparative Study of Lung Cancer Detection using Machine Learning Algorithms".
[4] Orabi, "Early Predictive System for Diabetes Mellitus Disease," in Industrial Conference on Data Mining, Springer, 2016, pp. 420–427.
[5] Aishwarya Jakka, Vakula Rani J, "Performance Evaluation of Machine Learning Models for Diabetes Prediction".
[6] Aishwarya Mujumdar, Dr. Vaidehi V., "Diabetes Prediction using Machine Learning Algorithms".
[7] Pradhan P, "Design of Classifier for Detection of Diabetes Mellitus Using Genetic Programming," in Advances in Intelligent Systems and Computing (AISC), 2014, vol. 1, pp. 763–770.
[8] Yuvaraj N, Sri Preetha K R, "Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster," Cluster Computing, 2017, 22, pp. 1–9.
[9] Nongyao Nai-arun, Punnee Sittidech, "Ensemble Learning Model for Diabetes Classification".
[10] Souad Larabi-Marie-Sainte, Linah Aburahmah, Rana Almohaini, Tanzila Saba, "Current Techniques for Diabetes Prediction: Review and Case Study".
[11] Nai-Arun, "Comparison of Classifiers for the Risk of Diabetes Prediction," in Procedia Computer Science, 2015, vol. 69, pp. 132–142.
[12] Dr. Saravana Kumar N M, Eswari T, Sampath P, Lavanya S, "Predictive Methodology for Diabetic Data Analysis in Big Data," 2nd International Symposium on Big Data and Cloud Computing, 2015.
[13] https://www.edureka.co/blog/classification-in-machine-learning/#:~:text=In%20machine%20learning%2C%20classification%20is,recognition%2C%20document%20classification%2C%20etc.
[14] https://machinelearningmastery.com/case-study-predicting-the-onset-of-diabetes-within-five-years-part-1-of-3/
[15] https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
[16] https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
[17] https://towardsdatascience.com/machine-learning-workflow-on-diabetes-data-part-01-573864fcc6b8
