Performance Analysis of Machine Learning Classifier for

Predicting Chronic Kidney Disease

A Project Report

submitted to Swarnandhra College of Engineering and Technology

in partial fulfilment of the requirements for the award of the degree





Under the guidance of


Professor, Head of the Department of MCA



(Approved by AICTE & Permanently Affiliated to JNTUK)

(Accredited by NAAC with “A” Grade)


Seetharampuram, Narsapur – 534 280.


This is to certify that this project work entitled “Performance Analysis of Machine
Learning Classifier for Predicting Chronic Kidney Disease” is the bonafide work of
MS. CHERUKURI JALAJAKSHI who carried out the work under my supervision,
and submitted in partial fulfilment of the requirements for the award of the degree,
Master of Computer Applications, during the academic year 2018-2021.

Project Supervisor Head of Department

A.N.L. Kumar A.N.L.Kumar

Head of the Department of MCA Head of the Department of MCA

Submitted for the project Thesis/Dissertation Viva-voce held on……………..

External Examiner

I certify that

The work contained in the project work is original and has been done by me under the
guidance of my supervisor
The work has not been submitted to any other University for the award of any degree or
The guidelines of the University are followed in writing this report.


I extend my heartfelt gratitude to the Almighty for giving me strength in proceeding with
this project titled “Performance Analysis of Machine Learning Classifier for Predicting
Chronic Kidney Disease”.

I express my heartfelt gratitude to my parents for supporting me in all the ways in every
walk of my life.

I express my sincere thanks to the Honorable Dr. S. Ramesh Babu, Secretary & Correspondent of
our college, for making the necessary arrangements for doing the project.

I wish to express my gratitude to Dr. S. Suresh Kumar, Principal of our college, for giving us
permission to carry out this project.

I express my sincere thanks to Mr. A.N.L. Kumar, Head of the Department of M.C.A. for
his learned suggestions and encouragement which made this project a successful one.

I convey my sincere thanks to my project guide Mr. A.N.L. Kumar, Associate Professor,
Department of MCA for his learned suggestions and encouragement which made this project
a success.

I express my sincere thanks to all the faculty members of our Department of MCA for their
valuable support throughout this project work.

I wish to express my thanks to my friends for their enthusiasm, support and encouragement for the
completion of this project.


Regd. No.18A21F0002





Chronic Kidney Disease (CKD) is a chronic disease, which means it develops slowly
over a period of time and persists for a long time thereafter. It is deadly at its end stage and can
be treated only by kidney replacement or regular dialysis, which is an artificial filtering
mechanism. It is important to identify CKD at an early stage so that the necessary treatment can
be provided to prevent or cure the disease. The main focus of this paper is on classification
techniques: decision tree, random forest, and logistic regression are
analyzed. Different measures are used to compare the algorithms on the dataset
collected from the standard UCI repository.
Chronic Kidney Disease (CKD) is a critical health condition worldwide and a major cause
of adverse health outcomes, particularly in low-to-middle-income countries, where millions die
for lack of affordable treatment. As with any chronic disease, fatality is related to the stage
the disease has reached without being treated. The high-risk factors for CKD include diabetes
mellitus, hypertension, heart disease, and a family history of kidney failure. If CKD is left
undetected and therefore untreated, it can lead to hypertension and in severe cases to kidney
failure. We procured a standard dataset for chronic kidney disease from the UCI Machine
Learning Repository. CKD, if predicted early and accurately, can benefit patients in many ways.
Early prediction increases the probability of successful treatment while also adding years to
the person’s life. This work aims to predict kidney disease using selected machine learning
algorithms and feature selection methods. The objective is to select combinations of features
and use them as input to the machine learning algorithms. The algorithms are implemented on
the basis of the selected features, and their performances are then compared.


1) Prediction performance of individual and ensemble learners for chronic kidney disease

Authors: Dili Singh Sisodia; Akanksha Verma

Automating the process of predicting diseases proves assistive and time-saving for a practitioner
in the field of medical diagnosis. The accurate prediction of any disease not only helps the
patients know about their health but also helps the doctors suggest medication well in
advance. In today's lifestyle, advance knowledge about health and proper care can add a
number of living days to a patient's life. In this paper, the prediction of chronic kidney disease
(CKD) is performed using individual and ensemble learners. The experiments are performed
on the CKD dataset taken from the UCI repository. Three individual classifiers, namely
Naive Bayes (NB), sequential minimal optimization (SMO), and J48, and three
ensemble classifiers, namely Random Forest (RF), bagging, and AdaBoost, are used
for prediction. We have used the open-source WEKA tool for all the experiments. The results
are evaluated using accuracy, precision, recall, F-measure and ROC performance measures.
The results suggested that the decision tree based individual learner (J48) and random forest
from the ensemble classifiers perform better than the other classifiers.

2) Diagnosis of Chronic Kidney Disease using effective classification and feature selection technique
Authors: Nusrat Tazin, Shahed Anzarus Sabab

The massive amount of data collected by the healthcare sector can be effective for analysis,
diagnosis and decision making if it is mined properly. Hidden information extracted from the
voluminous data can provide help and remedy to handle critical healthcare situations. Chronic
kidney disease is a fatal illness of the kidney which can be prevented with early correct predictions
and proper precautions. Data mining of the information collected from previously diagnosed
patients opened up a new phase of medical advancement. However, specific techniques must
be executed to achieve better results. In this manuscript the capability of the
Support Vector Machine, Decision tree, Naive Bayes and K-Nearest Neighbor classification
algorithms in analysing the chronic kidney disease dataset collected from the UCI repository was
investigated to predict the presence of kidney disease. The dataset has been analyzed in terms of
accuracy, Root Mean Squared Error, Mean Absolute Error and Receiver Operating
Characteristic curve. In the present study, Decision tree shows promising results when
implemented through the WEKA data mining tool. A ranking algorithm provides vital
improvements in classification with the proper number of attributes. 15 proves to be the magic
number for selecting attributes for the given dataset, resulting in the highest percentage of improvement
in accuracy.

3) Prediction of Chronic Kidney Disease Using Machine Learning

Authors: Mrs Prasuna Kotturu, Mr VVS Sasank, G Supriya, Ch Sai Manoj, M V

Chronic kidney disease (CKD) is a hazardous disease affecting many people worldwide.
Individuals with chronic kidney disease (CKD) are often unaware that the medical tests they
undergo for other purposes may provide useful information about CKD, and this information
may not be used effectively to address disease diagnosis. The major problem of this disease is
that it is hard to recognize until it reaches an advanced stage. In this paper we predict chronic
kidney disease (CKD) using machine learning techniques. We use machine
learning algorithms such as decision tree, naïve Bayes classification, logistic regression (LR),
support vector machine (SVM) and random forest. We detect chronic kidney
disease (CKD) using the best suited method and obtained 99.3% as the most accurate result using
the random forest method.

4) Prediction of kidney disease stages using data mining algorithms

Authors: El-Houssainy, A.Radya, Ayman S.Anwa

Early detection and characterization are considered to be critical factors in the management
and control of chronic kidney disease. Herein, use of efficient data mining techniques is shown
to reveal and extract hidden information from clinical and laboratory patient data, which can
be helpful to assist physicians in maximizing accuracy for identification of disease
severity stage. The results of applying Probabilistic Neural Networks (PNN), Multilayer
Perceptron (MLP), Support Vector Machine (SVM) and Radial Basis Function (RBF)
algorithms have been compared, and our findings show that the PNN algorithm provides better
classification and prediction performance for determining severity stage in chronic kidney
disease.

5) Chronic Kidney Disease analysis using data mining classification


Authors: Veenita Kunwar; Khushboo Chandel; A. Sai Sabitha; Abhay Bansal.

Data mining has been a current trend for attaining diagnostic results. A huge amount of unmined
data is collected by the healthcare industry in order to discover hidden information for effective
diagnosis and decision making. Data mining is the process of extracting hidden information
from massive datasets, categorizing valid and unique patterns in data. There are many data
mining techniques like clustering, classification, association analysis, regression etc. The
objective of our paper is to predict Chronic Kidney Disease (CKD) using classification
techniques like Naive Bayes and Artificial Neural Network (ANN). The experimental results
implemented in the RapidMiner tool show that Naive Bayes produces more accurate results than
Artificial Neural Network.


Existing System:

Existing work applies ontology and machine learning to chronic kidney disease, using WEKA as a
versatile tool. Ontology and machine learning are the techniques that have been utilized in the
existing methodology. The result is a decision-support instrument for chronic kidney disease that
helps handle mistakes and helps clinicians adequately distinguish acute kidney pain patients
from those with other causes of kidney pain. Another machine learning procedure, proposed for
coronary artery disease, is the N2 Genetic optimizer agent (a new genetic
training method), which has been presented in this methodology. Its results are competitive with and
comparable to the best results in the field.

Disadvantages Of Existing System:

• For machine learning-based coronary artery disease, the examined datasets, sample sizes,
features, locations of data collection, performance metrics, and applied
ML methods are the basic aspects that have been analysed in this methodology.
• Chronic kidney failure detection has been attempted from heart sounds
using a stack of machine learning classifiers. The
strategies used for prediction comprise filtering, segmentation and feature extraction to the

Algorithm: K-Nearest Neighbor, Naïve Bayes.

Proposed System:

I worked on the chronic kidney disease dataset obtained from the UCI (University of California at
Irvine) repository. The dataset contains attributes such as age, blood pressure, specific gravity,
albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, blood glucose random, blood
urea, serum creatinine, sodium, potassium, haemoglobin, packed cell volume, white blood cell
count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite,
pedal edema, anemia and the class label. A total of 400 instances were taken. At the first level, the dataset
is cleansed and processed using preprocessing techniques such as data integration, data
transformation, data reduction, and data cleaning with the pandas tool. In the proposed framework,
a total of 400 patient records were visualized. Data visualization techniques help the data
scientist to understand the feasibility of the dataset.
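
The data cleaning step described above can be sketched as follows. This is only a minimal illustration and not the project's actual code: it uses the Python standard library instead of pandas so that it runs without extra packages, the three sample rows and the column names `age`, `bp` and `sc` are hypothetical, and it assumes that missing values are marked with `?` and filled with the column mean.

```python
from statistics import mean

# Hypothetical sample rows; "?" marks a missing value (an assumption).
rows = [
    {"age": "48", "bp": "80", "sc": "1.2"},
    {"age": "?",  "bp": "70", "sc": "0.8"},
    {"age": "62", "bp": "?",  "sc": "1.8"},
]

def clean(rows, missing="?"):
    """Convert every field to float and fill missing values with the column mean."""
    columns = list(rows[0].keys())
    # Column means are computed only from the values that are present.
    means = {
        col: mean(float(r[col]) for r in rows if r[col] != missing)
        for col in columns
    }
    return [
        {col: means[col] if r[col] == missing else float(r[col]) for col in columns}
        for r in rows
    ]

cleaned = clean(rows)
print(cleaned[1]["age"])  # mean of the two known ages, 48 and 62
```

Mean imputation is one simple choice among several; the median or the most frequent value could be used in the same place.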

Advantages of Proposed System:

• The accuracy of the classifiers was calculated using the confusion matrix.
• The classifier which achieves the highest accuracy could be determined as the best

Algorithm: Logistic Regression, Random Forest (RF), Decision Tree.
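
As a small illustration of how accuracy follows from a confusion matrix, the sketch below counts the four cells for a binary problem and derives the accuracy from them. The labels and predictions are hypothetical and the function names are chosen for this example only.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 = CKD and 0 = not CKD."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Accuracy = correctly classified records / all records."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(tp, fp, fn, tn, accuracy(tp, fp, fn, tn))
```

Running the same two functions on the predictions of each classifier gives directly comparable accuracy values.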

3.1 Feasibility Study:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with
a very general plan for the project and some cost estimates. During system analysis the
feasibility study of the proposed system is carried out. This is to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are,

• Economical Feasibility
• Technical Feasibility
• Social Feasibility

Economical Feasibility:

This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development
of the system is limited. The expenditures must be justified. The developed system is well
within the budget, and this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.

Technical Feasibility:

This study is carried out to check the technical feasibility, that is, the technical requirements of
the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or null changes are required for implementing
this system.

Social Feasibility:

This aspect of the study checks the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened
by the system, but must instead accept it as a necessity. The level of acceptance by the users solely
depends on the methods that are employed to educate the user about the system and to make
him familiar with it. His level of confidence must be raised so that he is also able to make
some constructive criticism, which is welcomed, as he is the final user of the system.

3.2 Software Requirement Specification:

The software requirements specification is a complete description of the behaviour of the
system to be developed. It contains a set of use cases that describe the
interactions users have with the server. These use cases are also called functional
requirements. Non-functional requirements are constraints such as performance and quality
standards.

Functional Requirements:

A functional requirement defines a function of a system or its component, where a function is
described as a specification of behaviour between outputs and inputs. Functional requirements
specify particular results of a system, for example which output file
should be produced from a given input file. That describes the input and output of the

Patient Registration: It makes the patient as the authorized user.

View previous data: The system allows patients to see their previous health data.

Predicted result: The system takes the input from the patient and predicts whether the
patient has chronic kidney disease or not.

Non Functional Requirements:

Non-functional requirements describe requirements that relate to the modules as a whole.
They include security, availability, usability,
accuracy etc.


The system provides security to its users. Only authorized users can access the system. The
authorization is given to the users by the admin. The system plays a priority role in
keeping the user's data secret. It helps in storing important files or documents, and nobody can
directly access the documents.

Availability describes how likely the system is to be accessible to a user at a given point in time.
The information the user is interested in will be available instantly.

The system should be easy to use and easy to learn. The user should be able to operate it and
prepare inputs and outputs through interaction with the system, and so achieve the required goals
effectively and efficiently.


The level of accuracy is very high. All operations should be carried out correctly. The system can
perform accurately when labeled and unlabeled data are shared.

System Specification:

Hardware Requirements:

• System : intel core i5 10th generation.

• Hard Disk : 1TB

• Monitor : 14 Color Monitor.

• RAM : 8 GB.

Software Requirements:

• Operating system : Windows 10.

• Coding Language : python

• IDE : python

• Database : sqlite.

3.3 Functional model of the system:

Use case diagram:

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram
defined by and created from a Use-case analysis. Its purpose is to present a graphical overview
of the functionality provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case diagram
is to show what system functions are performed for which actor. Roles of the actors in the
system can be depicted.

Use case diagram

Use Case description:

1. Patient Registration:
Use case name: Patient Registration

Participating Actor: Patient

Entry condition: Patient can register with valid details.

Flow of Events: Enter all the required details.

Validate the given details.

Enter into home page.

Exit condition: Patient is successfully registered.

2. Login:

Use case name: Login

Participating Actor: Patient, Hospital superintendent

Entry condition: Patient and Hospital superintendent must have login id and password
to enter into the system.

Flow of events:

• Verify the loginid

• Verify the password

• Enter into the home page.

Post Condition: If the username and password are valid, the user successfully enters into the

3. Activate users:

Use case name: Activate users

Participating Actor: Hospital superintendent

Entry condition: The hospital superintendent enters the webpage and gives the correct
email and password; the hospital superintendent page then opens.

Flow of Events:

• The hospital superintendent can activate patients.

• If the hospital superintendent does not activate a user, that user's pages can’t be

Exit condition: The hospital superintendent successfully activates patient accounts.

4. Add data

Use case name: Add data

Participating Actor: Patient

Entry condition: Patient can add data.

Flow of Events:

• Patient can access home page, data view, logout.

• The patient can upload all the details in the pages.

Exceptional Flow of Events: If the patient enters invalid data, the data is
not added.

Exit condition: The patient data is successfully added and the patient gets the test result.

5. Predicted Results:

Use case name: Predicted Results

Participating Actor: Patient, Hospital superintendent

Entry condition: The patient and hospital superintendent enter the webpage and give the
correct email and password; the patient and hospital superintendent pages then open.

Flow of Events: The statement given by the patient is checked using an algorithm. The result is
shown on a page with the training accuracy.

Exit condition: The patient successfully learns the percentage for the given statement.

4.1 System Architecture:

A system architecture is the conceptual model that defines the
structure, behaviour, and other views of a system. An architecture description is a formal
description that shows the behaviour of a system.
A system architecture contains the system components and the sub-systems

4.2 Data Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to
represent a system in terms of the input data to the system, the various processing carried out on this
data, and the output data generated by the system.
The data flow diagram (DFD) is one of the most important modeling tools. It is used to model
the system components. These components are the system process, the data used by the
process, an external entity that interacts with the system and the information flows in the
system.

A DFD shows how the information moves through the system and how it is modified by a series
of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
A DFD may be used to represent a system at any level of
abstraction. A DFD may be partitioned into levels that represent increasing information flow
and functional detail.

4.3 UML diagrams:
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form UML comprises two major components: a meta-model
and a notation. In the future, some form of method or process may also be added to, or
associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of software systems, as well as for business
modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in
the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software
development process. The UML uses mostly graphical notations to express the design of
software projects.

The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.

Class diagram:

In software engineering, a class diagram in the Unified Modeling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among the classes. It
explains which class contains information.

Class diagram

Sequence diagram:

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram
that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event
scenarios, and timing diagrams.

Sequence diagram

Activity diagram:

Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

Activity Diagram

4.4 Database Design


Database normalization is a technique for organizing the data in the database. Normalization is
a systematic approach of decomposing tables to eliminate data redundancy (repetition) and
undesirable characteristics like insertion, update and deletion anomalies. It is a multi-step
process that puts data into tabular form, removing duplicated data from the relation tables.
Normalization is used mainly for two purposes:

1. Eliminating redundant (useless) data.

2. Ensuring data dependencies make sense, i.e. data is logically stored.

Normalization Rules:

Normalization rules are divided into the following normal forms:

• First Normal Form

• Second Normal Form

• Third Normal Form

• Boyce-Codd Normal Form

• Fourth Normal Form

First Normal Form (1NF):

If a relation contains a composite or multi-valued attribute, it violates first normal form.
Equivalently, a relation is in first normal form if every attribute in that relation is a
single-valued attribute.

Second Normal Form (2NF):

To be in second normal form, a relation must be in first normal form and relation must not
contain any partial dependency. A relation is in 2NF if it has No Partial Dependency, i.e., no
non-prime attribute (attributes which are not part of any candidate key) is dependent on any
proper subset of any candidate key of the table. Partial Dependency – If the proper subset of
candidate key determines non-prime attribute, it is called partial dependency.

Third Normal Form (3NF):

A relation is in third normal form if it is in second normal form and there is no transitive
dependency for non-prime attributes. A relation is in 3NF if at least one of the following
conditions holds for every non-trivial functional dependency X –> Y:

1. X is a super key.

2. Y is a prime attribute (each element of Y is part of some candidate key).

Boyce-Codd Normal Form (BCNF):

A relation R is in BCNF if R is in Third Normal Form and, for every FD, the LHS is a super key.

A relation is in BCNF if, in every non-trivial functional dependency X –> Y, X is a super key.

Fourth Normal Form (4NF):

A table is said to be in the Fourth Normal Form when,

• It is in the Boyce-Codd Normal Form.

• And, it doesn't have Multi-Valued Dependency.


Patient Registration table:

S.No  Field Name  Data Type  Size  Constraint
1     Id          Integer    100   Not Null
2     Name        Varchar    100   Not Null
3     Loginid     Varchar    100   Not Null
4     Password    Varchar    100   Not Null
5     Email       Varchar    100   Not Null
6     Locality    Varchar    100   Not Null
7     Address     Varchar    1000  Not Null
8     City        Varchar    100   Not Null
9     State       Varchar    100   Not Null
10    Status      Varchar    100   Not Null
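
The table above could be created in SQLite roughly as follows. This is a hedged sketch rather than the project's schema: the table name `patient_registration`, the primary key on `id`, the sample row and the status value `waiting` are assumptions made for illustration. Note that SQLite accepts the `VARCHAR(n)` lengths but does not enforce them.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient_registration (
        id       INTEGER PRIMARY KEY,
        name     VARCHAR(100)  NOT NULL,
        loginid  VARCHAR(100)  NOT NULL,
        password VARCHAR(100)  NOT NULL,
        email    VARCHAR(100)  NOT NULL,
        locality VARCHAR(100)  NOT NULL,
        address  VARCHAR(1000) NOT NULL,
        city     VARCHAR(100)  NOT NULL,
        state    VARCHAR(100)  NOT NULL,
        status   VARCHAR(100)  NOT NULL
    )
    """
)

# Hypothetical row; a real application should store a password hash, not plain text.
conn.execute(
    "INSERT INTO patient_registration "
    "(name, loginid, password, email, locality, address, city, state, status) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("Test Patient", "patient1", "<hashed>", "patient1@example.com",
     "Locality", "Street 1", "City", "State", "waiting"),
)
status = conn.execute(
    "SELECT status FROM patient_registration WHERE loginid = ?", ("patient1",)
).fetchone()[0]
print(status)
```

The `?` placeholders keep user-supplied values out of the SQL text, which matters for a registration form.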

4.5 Life Cycle of Machine Learning:

1. Understanding objective

2. Data collection

3. Data preprocessing

4. Data splitting into training and testing

5. Model selection

6. Import model

7. Model implementation

8. Data visualization
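
Step 4 of the life cycle, splitting the data into training and testing sets, can be sketched as below. The 80/20 ratio, the fixed seed and the helper name `split_records` are assumptions for illustration; the report does not state which split was used.

```python
import random

def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 400 placeholder record ids stand in for the 400 patient records.
records = list(range(400))
train, test = split_records(records)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is kept aside so that the reported accuracy reflects records the model has not seen.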


Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.

In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches. The decisions or tests are
performed on the basis of the features of the given dataset. A decision tree is a graphical
representation for getting all the possible solutions to a problem/decision based on given conditions.

It is called a decision tree because, similar to a tree, it starts with the root node, which expands
into further branches and constructs a tree-like structure. In order to build a tree, we use
the CART algorithm, which stands for Classification and Regression Tree algorithm. A
decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree
into subtrees.

The diagram below explains the general structure of a decision tree:

Why Use Decision Trees:

There are various algorithms in machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the decision tree:

Decision trees usually mimic human thinking ability while making a decision, so they are easy to
understand. The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

Advantages of the Decision Tree:

• It is simple to understand as it follows the same process that a human follows while
making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.

Steps to implement the Decision tree algorithm:

• Data Pre-processing step

• Fitting a Decision-Tree algorithm to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.

Working of Decision Tree:

In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding attribute
of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.

At the next node, the algorithm again compares the attribute value with the sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:

Step-1: For implementing any algorithm, we need a dataset. So during the first step of the decision
tree, we must load the training as well as the test data.

Step-2: Begin the tree with the root node, say S, and find the best attribute in the dataset using an
Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain the possible values of the best attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
the final node is then called a leaf node.
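
The attribute selection in Step-2 can be illustrated with the Gini index, the measure commonly used by CART. The sketch below is an assumption-laden illustration, not the project's code: it handles a single numeric feature, the serum creatinine values and labels are hypothetical, and the function names are chosen for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical serum creatinine values and labels (1 = CKD, 0 = not CKD).
creatinine = [0.8, 0.9, 1.1, 1.9, 2.4, 3.2]
label = [0, 0, 0, 1, 1, 1]
print(best_threshold(creatinine, label))
```

A full tree repeats this search over every feature at every node and recurses on the two resulting subsets, as in Step-5.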

Random Forest Algorithm:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based
on the concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

The diagram below explains the working of the Random Forest algorithm:

Why Use Random Forest:

• It takes less training time as compared to other algorithms.

• It predicts output with high accuracy, even for the large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of data is missing.

Advantages of Random Forest
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.

Steps to implement the Random Forest algorithm:

• Data Pre-processing step

• Fitting the Random Forest algorithm to the Training set
• Predicting the test result
• Test accuracy of the result (Creation of Confusion matrix)
• Visualizing the test set result.

Working of Random Forest Algorithm:

• First, we randomly select ‘p’ features from the total ‘q’ features (where p << q).
• From the selected ‘p’ features, we calculate a node ‘d’ using the best split point.
• Using the best split, we split the node into daughter nodes.
• We repeat the above steps until ‘l’ nodes have been reached.
• A forest of ‘n’ trees is built by repeating the above steps ‘n’ times.
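
The steps above can be sketched in plain Python. This is a deliberately simplified illustration, not the project's implementation: every tree is a one-level tree (a single split, so ‘l’ is effectively 1), the bootstrap sampling of rows that full random forests also use is omitted so that the sketch follows only the listed steps, and the names `train_stump`, `train_forest` and `predict` as well as the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, features):
    """Find the best split point among the given features (one-level tree)."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]


def train_forest(rows, labels, n_trees, p, seed=0):
    """Build n_trees stumps, each from p randomly selected features (p of q)."""
    rng = random.Random(seed)
    q = len(rows[0])
    return [train_stump(rows, labels, rng.sample(range(q), p)) for _ in range(n_trees)]


def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)


# Toy data, invented for illustration: two features, two classes
rows = [(1, 10), (2, 20), (8, 80), (9, 90)]
labels = [0, 0, 1, 1]
forest = train_forest(rows, labels, n_trees=5, p=1)
print(predict(forest, (0, 0)), predict(forest, (10, 100)))
```

In practice a library implementation would normally be preferred; the sketch is only meant to show how random feature selection, best-split search and majority voting fit together.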

Logistic Regression Algorithm:

Logistic regression is a supervised learning classification algorithm used to predict the
probability of a target variable. The target or dependent variable is dichotomous,
which means there are only two possible classes.

In simple words, the dependent variable is binary in nature, with data coded as either 1
(stands for success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of
the simplest ML algorithms and can be used for various classification problems such as spam
detection, diabetes prediction, cancer detection, etc.
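
As a minimal sketch of how P(Y=1) is obtained from X, the model passes a weighted sum of the features through the sigmoid function. The coefficients below are hypothetical values chosen only for illustration; in a real model they would be learned from the training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """P(Y=1 | x) for one sample under a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict(x, weights, bias, threshold=0.5):
    """Class label: 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0


# Hypothetical coefficients, chosen only for illustration
weights, bias = [0.8, -1.2], 0.1
print(predict_proba([2.0, 0.5], weights, bias), predict([2.0, 0.5], weights, bias))
```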


5.1 Modules:
• Patient
• Hospital superintendent
• Data Preprocessing
• Machine Learning
Modules Description:
Patient:
The patient registers first. Registration requires a valid email address and mobile number
for further communication. Once the patient has registered, the hospital superintendent can
activate the account, and only then can the patient log in to the system. The patient can
upload a dataset whose columns match those of our dataset. For algorithm execution the data
must be in int or float format. We used the UCI repository dataset for testing. The patient
can also add new records to the existing dataset through the Django application. When the
patient clicks Data Preparations on the web page, the data cleaning process starts, and the
cleaned data and the corresponding graph are displayed.
Hospital superintendent:
The hospital superintendent logs in with his login details and can activate the registered
users. Only after activation can a patient log in to the system. The hospital superintendent
can view the overall data in the browser. He can also check the ROC curve, confusion matrix
and accuracy of each algorithm. A bar graph comparing the accuracies is also displayed here.
Once all the algorithms have been executed, the hospital superintendent can see the overall
accuracy on the web page.
Data Preprocessing:
A dataset can be viewed as a collection of data objects, which are often also called records,
points, vectors, patterns, events, cases, samples, observations, or entities. Data objects are
described by a number of features that capture the basic characteristics of an object, such as
the mass of a physical object or the time at which an event occurred. Features are often
called variables, characteristics, fields, attributes, or dimensions. The data preprocessing in
this project uses techniques such as removing noise from the data, removing records with missing
information, modifying default values where relevant, and grouping attributes for prediction at
various levels.
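
A minimal sketch of two of these steps, dropping records with missing information and encoding categorical values as numbers, is given below. The function name `clean_rows`, the set of missing-value markers and the sample rows are assumptions made for illustration; the yes/no and ckd/notckd encodings follow the mappings used in the project's knn view.

```python
MISSING = ("", "?", None)  # assumed markers for a missing value


def clean_rows(rows, mappings):
    """Drop rows with missing values and encode categorical columns as numbers."""
    cleaned = []
    for row in rows:
        if any(value in MISSING for value in row.values()):
            continue  # discard records with missing information
        cleaned.append({col: mappings.get(col, {}).get(val, val) for col, val in row.items()})
    return cleaned


mappings = {"htn": {"yes": 1, "no": 0}, "class": {"ckd": 1, "notckd": 0}}

# Sample rows, invented for illustration only
rows = [
    {"age": 48, "htn": "yes", "class": "ckd"},
    {"age": 53, "htn": "?", "class": "ckd"},
    {"age": 40, "htn": "no", "class": "notckd"},
]
print(clean_rows(rows, mappings))
```

Dropping incomplete rows is the simplest option; imputing the missing values instead would keep more records at the cost of introducing estimated data.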

Machine learning:
Based on the split criterion, the cleansed data is split into 60% training and 40% test data. The
dataset is then subjected to three machine learning classifiers: Logistic Regression (LR) with
pipeline, Decision Tree (DT) and Random Forest (RF). The accuracy of the classifiers is
calculated using the confusion matrix. The classifier that achieves the highest accuracy is
taken to be the best classifier. For each algorithm, the confusion matrix, ROC curve and
accuracy have been calculated and are displayed in the results.
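
The 60/40 split and the final selection step can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: `train_test_split_rows` and `best_classifier` are hypothetical helper names, and the accuracy values passed to `best_classifier` are placeholders, not measured results.

```python
import random


def train_test_split_rows(rows, test_fraction=0.4, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def best_classifier(accuracies):
    """Name of the classifier with the highest accuracy."""
    return max(accuracies, key=accuracies.get)


train, test = train_test_split_rows(range(10))
print(len(train), len(test))

# Placeholder accuracies, for illustration only
print(best_classifier({"LR": 0.90, "DT": 0.80, "RF": 0.85}))
```

Fixing the seed keeps the split reproducible, so that all three classifiers are compared on exactly the same test records.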
Python is a general-purpose interpreted, interactive, object-oriented, and high-level
programming language. An interpreted language, Python has a design philosophy that
emphasizes code readability (notably using whitespace indentation to delimit code blocks
rather than curly brackets or keywords), and a syntax that allows programmers to express
concepts in fewer lines of code than might be used in languages such as C++ or Java. It provides
constructs that enable clear programming on both small and large scales. Python interpreters
are available for many operating systems. CPython, the reference implementation of Python,
is open source software and has a community-based development model, as do nearly all of its
variant implementations. CPython is managed by the non-profit Python Software Foundation.
Python features a dynamic type system and automatic memory management. It supports
multiple programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library.
Django is a high-level Python Web framework that encourages rapid development and
clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle
of Web development, so you can focus on writing your app without needing to reinvent the
wheel. It’s free and open source.

Django's primary goal is to ease the creation of complex, database-driven websites. Django
emphasizes reusability and "pluggability" of components, rapid development, and the principle
of don't repeat yourself. Python is used throughout, even for settings files and data models.

Django also provides an optional administrative create, read, update and delete interface that
is generated dynamically through introspection and configured via admin models.

Create a Project

Whether you are on Windows or Linux, just get a terminal or a cmd prompt and navigate to
the place you want your project to be created, then use this code −

$ django-admin startproject myproject

This will create a "myproject" folder with the following structure −

myproject/
   manage.py
   myproject/
      __init__.py
      settings.py
      urls.py
      wsgi.py
The Project Structure

The “myproject” folder is just your project container. It contains two elements −

manage.py − This file acts as your project-local django-admin for interacting with your
project via the command line (start the development server, sync the database...). To get a
full list of the commands accessible via manage.py, you can use −

$ python manage.py help

The “myproject” subfolder − This folder is the actual Python package of your project. It
contains four files −

__init__.py − Just for Python, to treat this folder as a package.

settings.py − As the name indicates, your project settings.

urls.py − All the links of your project and the functions to call. A kind of ToC of your project.

wsgi.py − Needed if you want to deploy your project over WSGI.

Setting Up Your Project

Your project is configured in myproject/settings.py. Following are some important
options you might need to set −

DEBUG = True

This option sets whether your project is in debug mode. Debug mode gives you more
information about your project's errors. Never set it to ‘True’ for a live project. However, it
has to be set to ‘True’ if you want the Django development server to serve static files. Do this
only in development mode.


DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'database.sql',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': '',
    }
}

The database is configured in the DATABASES dictionary. The example above is for the SQLite
engine. As stated earlier, Django also supports −

MySQL (django.db.backends.mysql)

PostGreSQL (django.db.backends.postgresql_psycopg2)

Oracle (django.db.backends.oracle) and NoSQL DB

MongoDB (django_mongodb_engine)

Before setting any new engine, make sure you have the correct db driver installed.

You can also set other options such as TIME_ZONE, LANGUAGE_CODE, TEMPLATE…

Now that your project is created and configured make sure it's working −

$ python manage.py runserver

You will get something like the following on running the above code −

Validating models...

0 errors found

September 03, 2015 - 11:41:50

Django version 1.6.11, using settings 'myproject.settings'

Starting development server at

Quit the server with CONTROL-C.

A project is a sum of many applications. Every application has an objective and can be reused
into another project, like the contact form on a website can be an application, and can be reused
for others. See it as a module of your project.


from django.conf.urls import url
from django.contrib import admin
from django.urls import path
from fstapp import views as fstapp
from user import views as user

# URL routes of the project (the enclosing urlpatterns list was lost in the listing)
urlpatterns = [
    url(r'^admins/', admin.site.urls),
    # url(r'^$', index, name="index"),
    url(r'^index/', fstapp.index, name="index"),
    url(r'^adminhome/', fstapp.adminhome, name="adminhome"),
    url(r'^adminbase/', fstapp.adminbase, name="adminbase"),

    path('UserRegister/', user.UserRegister, name='UserRegister'),
    path('UserRegisterAction/', user.UserRegisterAction, name='UserRegisterAction'),
    path('UserLogin/', user.UserLogin, name='UserLogin'),
    path('UserLoginCheck', user.UserLoginCheck, name='UserLoginCheck'),
]



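from django.shortcuts import render
from django.contrib import messages
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from user.models import *
from user.forms import *
import csv
import io

# Libraries used by the knn view further down in this file
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split


def UserRegister(request):
    # Show an empty registration form
    form = UserRegistrationForm()
    return render(request, 'user/Register.html', {'form': form})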

def UserRegisterAction(request):
    if request.method == 'POST':
        form = UserRegistrationForm(request.POST)
        if form.is_valid():
            print('Data is Valid')
            # Assumed: the line lost at this point in the listing saved the form
            form.save()
            messages.success(request, 'You have been successfully registered')
            # return HttpResponseRedirect('./CustLogin')
            form = UserRegistrationForm()
            return render(request, 'user/register.html', {'form': form})
        else:
            print("Invalid form")
    else:
        form = UserRegistrationForm()
    return render(request, 'user/register.html', {'form': form})

def UserLogin(request):
    return render(request, 'user/userlogin.html', {})


def UserLoginCheck(request):
    if request.method == "POST":
        loginid = request.POST.get('loginname')
        pswd = request.POST.get('pswd')
        print("Login ID = ", loginid)  # the password is no longer written to the log
        try:
            check = UserRegistrationModel.objects.get(loginid=loginid, password=pswd)
            status = check.status
            print('Status is = ', status)
            if status == "activated":
                # Keep the user's details in the session
                request.session['id'] = check.id
                request.session['loggeduser'] = check.name
                request.session['loginid'] = loginid
                request.session['email'] = check.email
                print("User id At", check.id, status)
                return render(request, 'user/userhome.html', {})
            else:
                messages.success(request, 'Your account is not yet activated')
                return render(request, 'user/userlogin.html')
            # return render(request, 'user/userpage.html',{})
        except Exception as e:
            print('Exception is ', str(e))
        messages.success(request, 'Invalid Login id and password')
    return render(request, 'user/userlogin.html', {})

def UserDataView(request):
    data_list = livearDataModel.objects.all()
    page = request.GET.get('page', 1)
    paginator = Paginator(data_list, 10)  # 10 records per page
    try:
        users = paginator.page(page)
    except PageNotAnInteger:
        users = paginator.page(1)
    except EmptyPage:
        users = paginator.page(paginator.num_pages)
    return render(request, 'user/DataView_list.html', {'users': users})


def BrowseCSV(request):
    return render(request, 'user/BrowseCsv.html', {})

def UploadCSVToDataBase(request):
    # declaring template
    template = "users/UserHomePage.html"
    data = HearDataModel.objects.all()
    # prompt is a context variable that can have different values depending on the context
    prompt = {
        'order': 'Order of the CSV should be name, email, address, phone, profile',
        'profiles': data
    }
    # A GET request only renders the upload page
    if request.method == "GET":
        return render(request, template, prompt)
    csv_file = request.FILES['file']
    # let's check if it is a csv file
    if not csv_file.name.endswith('.csv'):
        messages.error(request, 'THIS IS NOT A CSV FILE')
        return render(request, template, prompt)  # stop here instead of parsing the file
    data_set = csv_file.read().decode('UTF-8')
    # set up a stream so that each line can be handled as we loop through the data
    io_string = io.StringIO(data_set)
    for column in csv.reader(io_string, delimiter=',', quotechar="|"):
        _, created = livearDataModel.objects.update_or_create(
            # The field assignments are missing from the extracted listing.
        )
    return render(request, 'user/userhome.html')

def UserAddData(request):
    if request.method == 'POST':
        form = livearDataModelForm(request.POST)
        if form.is_valid():
            print('Data is Valid')
            # Assumed: the line lost at this point in the listing saved the form
            form.save()
            messages.success(request, 'Data added successfully')
            # return HttpResponseRedirect('./CustLogin')
            form = livearDataModelForm()
            return render(request, 'user/UserAddData.html', {'form': form})
        else:
            print("Invalid form")
    else:
        form = livearDataModelForm()
    return render(request, 'user/UserAddData.html', {'form': form})

def knn(request):
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 mean_absolute_error, mean_squared_error,
                                 precision_score, recall_score)

    df = pd.read_csv('./chronic_kidney_disease_full.csv')
    data = df

    # Encode the categorical columns as 0/1
    data['class'] = data['class'].map({'ckd': 1, 'notckd': 0})
    data['htn'] = data['htn'].map({'yes': 1, 'no': 0})
    data['dm'] = data['dm'].map({'yes': 1, 'no': 0})
    data['cad'] = data['cad'].map({'yes': 1, 'no': 0})
    data['appet'] = data['appet'].map({'good': 1, 'poor': 0})
    data['ane'] = data['ane'].map({'yes': 1, 'no': 0})
    data['pe'] = data['pe'].map({'yes': 1, 'no': 0})
    data['ba'] = data['ba'].map({'present': 1, 'notpresent': 0})
    data['pcc'] = data['pcc'].map({'present': 1, 'notpresent': 0})
    data['pc'] = data['pc'].map({'abnormal': 1, 'normal': 0})
    data['rbc'] = data['rbc'].map({'abnormal': 1, 'normal': 0})

    # plt.figure(figsize=(19, 19))
    # sns.heatmap(data.corr(), annot=True, cmap='coolwarm')

    # Total rows versus rows without any missing value.
    # Note: KNeighborsClassifier does not accept missing values, so incomplete rows
    # must be dropped or imputed before fitting (that step is not in the listing).
    print(data.shape[0], data.dropna().shape[0])

    classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    X = data.iloc[:, :-1]
    y = data['class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print("prediction", y_pred)
    # return render(request,'decision.html',{"acc":acc})

    # Bug fix: train accuracy is now measured on the training split
    train_accuracy = accuracy_score(y_train, classifier.predict(X_train))
    print('Train Accuracy: ', train_accuracy)
    print('Test Accuracy: ', accuracy_score(y_test, y_pred))

    matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(matrix, annot=True, fmt="d")

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    f1 = f1_score(y_test, y_pred)  # renamed so the imported function is not shadowed
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    return render(request, 'knn.html',
                  {"pred": y_pred, 'train_accuracy': train_accuracy, 'mae': mae, 'mse': mse,
                   'rmse': rmse, 'f1_score': f1, 'precision': precision, 'recall': recall})


from django.db import models

# Create your models here.
class UserRegistrationModel(models.Model):
    name = models.CharField(max_length=100)
    loginid = models.CharField(unique=True, max_length=100)
    password = models.CharField(max_length=100)
    mobile = models.CharField(max_length=100)
    email = models.CharField(max_length=100)
    locality = models.CharField(max_length=100)
    address = models.CharField(max_length=1000)
    city = models.CharField(max_length=100)
    state = models.CharField(max_length=100)
    status = models.CharField(max_length=100)

    def __str__(self):
        return self.loginid

    class Meta:
        pass  # the Meta options are missing from the extracted listing


class HearDataModel(models.Model):
    age = models.IntegerField()
    sex = models.IntegerField()
    cp = models.IntegerField()
    trestbps = models.IntegerField()
    chol = models.IntegerField()
    fbs = models.IntegerField()
    restecg = models.IntegerField()
    thalach = models.IntegerField()
    exang = models.IntegerField()
    oldpeak = models.FloatField()
    slope = models.IntegerField()
    ca = models.IntegerField()
    thal = models.IntegerField()
    target = models.IntegerField()

    def __str__(self):
        return str(self.id)  # __str__ must return a string

    class Meta:
        db_table = 'HeartDatabase'


class livearDataModel(models.Model):
    # The field definitions are missing from the extracted listing.

    def __str__(self):
        return str(self.id)  # __str__ must return a string

    class Meta:
        db_table = 'livearDatabase'


from user.models import *
from django import forms


class UserRegistrationForm(forms.ModelForm):
    # Several declarations were cut off in the extracted listing; where the trailing
    # arguments are lost, only the surviving arguments are kept.
    name = forms.CharField(widget=forms.TextInput(attrs={'pattern': '[a-zA-Z]+'}))
    loginid = forms.CharField(widget=forms.TextInput(attrs={'pattern': '[a-zA-Z]+'}))
    password = forms.CharField(
        widget=forms.PasswordInput(attrs={
            'pattern': r'(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}',
            'title': 'Must contain at least one number, one uppercase and one lowercase '
                     'letter, and at least 8 characters'}),
        required=True, max_length=100)
    mobile = forms.CharField(widget=forms.TextInput(attrs={'pattern': '[56789][0-9]{9}'}))
    email = forms.CharField(
        widget=forms.TextInput(attrs={'pattern': r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$'}),
        required=True, max_length=100)
    locality = forms.CharField(widget=forms.TextInput(), required=True, max_length=100)
    address = forms.CharField(widget=forms.Textarea(attrs={'rows': 4, 'cols': 22}))
    city = forms.CharField(widget=forms.TextInput(attrs={
        'class': 'form-control', 'autocomplete': 'off',
        'pattern': '[A-Za-z ]+', 'title': 'Enter Characters Only '}))
    state = forms.CharField(widget=forms.TextInput(attrs={
        'class': 'form-control', 'autocomplete': 'off',
        'pattern': '[A-Za-z ]+', 'title': 'Enter Characters Only '}))
    status = forms.CharField(widget=forms.HiddenInput(), initial='waiting', max_length=100)

    class Meta():
        model = UserRegistrationModel
        fields = '__all__'  # assumed, mirroring livearDataModelForm; lost in the listing


class livearDataModelForm(forms.ModelForm):
    # Form fields must come from django.forms, not django.db.models
    age = forms.CharField(max_length=50)
    bp = forms.CharField(max_length=50)
    sg = forms.CharField(max_length=50)
    al = forms.CharField(max_length=50)
    su = forms.CharField(max_length=50)
    rbc = forms.CharField(max_length=50)
    pc = forms.CharField(max_length=50)
    pcc = forms.CharField(max_length=50)
    ba = forms.CharField(max_length=50)
    bgr = forms.CharField(max_length=50)
    bu = forms.CharField(max_length=50)
    sc = forms.CharField(max_length=50)
    sod = forms.CharField(max_length=50)
    pot = forms.CharField(max_length=50)
    hemo = forms.CharField(max_length=50)
    pcv = forms.CharField(max_length=50)
    wbcc = forms.CharField(max_length=50)
    rbcc = forms.CharField(max_length=50)
    htn = forms.CharField(max_length=50)
    dm = forms.CharField(max_length=50)
    cad = forms.CharField(max_length=50)
    appet = forms.CharField(max_length=50)
    pe = forms.CharField(max_length=50)
    ane = forms.CharField(max_length=50)
    class1 = forms.CharField(max_length=50)

    class Meta():
        model = livearDataModel
        fields = '__all__'


The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the software system meets its requirements
and user expectations and does not fail in an unacceptable manner. There are various types of
tests, and each type addresses a specific testing requirement.

7.1 Types of Tests:
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of the individual software units of the
application, and it is done after the completion of an individual unit, before integration. This
is structural testing that relies on knowledge of the unit's construction and is invasive. Unit
tests perform basic tests at component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a business process
performs accurately to the documented specifications and contains clearly defined inputs and
expected results.
Integration testing
Integration tests are designed to test integrated software components to determine whether they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that, although the components were
individually satisfactory, as shown by successful unit testing, the combination of components
is correct and consistent. Integration testing is specifically aimed at exposing the problems
that arise from the combination of components.

Functional test
Functional tests provide systematic demonstrations that the functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or
special test cases. In addition, systematic coverage of identified business process flows, data
fields, predefined processes, and successive processes must be considered for testing. Before
functional testing is complete, additional tests are identified and the effective value of the
current tests is determined.

System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions
and flows, emphasizing pre-driven process links and integration points.

White Box Testing

White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used
to test areas that cannot be reached from a black box level.

Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, like most other kinds of
tests, must be written from a definitive source document, such as a specification or
requirements document. The software under test is treated as a black box: you cannot “see”
into it. The test provides inputs and responds to outputs without considering how the software
works.
Unit Testing

Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two
distinct phases.

Test strategy and approach

Field testing will be performed manually and functional tests will be written in detail.

Test objectives

• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.

Features to be tested

• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
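
A format check of this kind can also be automated. The sketch below reuses the mobile-number and password patterns from the registration form and checks them with Python's `unittest` module; the helper names `is_valid_mobile` and `is_valid_password` and the sample values are assumptions made for illustration, not part of the project's test suite.

```python
import re
import unittest

# Patterns taken from the registration form's HTML `pattern` attributes
MOBILE_PATTERN = r"[56789][0-9]{9}"
PASSWORD_PATTERN = r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}"


def is_valid_mobile(value):
    """True if the whole value is a 10-digit number starting with 5-9."""
    return re.fullmatch(MOBILE_PATTERN, value) is not None


def is_valid_password(value):
    """True if the value has a digit, a lowercase and an uppercase letter, and 8+ characters."""
    return re.fullmatch(PASSWORD_PATTERN, value) is not None


class FieldFormatTests(unittest.TestCase):
    def test_mobile(self):
        self.assertTrue(is_valid_mobile("9876543210"))
        self.assertFalse(is_valid_mobile("1234567890"))  # wrong leading digit
        self.assertFalse(is_valid_mobile("98765"))       # too short

    def test_password(self):
        self.assertTrue(is_valid_password("Abcdef12"))
        self.assertFalse(is_valid_password("abcdef12"))  # no uppercase letter
        self.assertFalse(is_valid_password("Ab1"))       # too short


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FieldFormatTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

`re.fullmatch` is used because the HTML `pattern` attribute must match the entire field value, not just part of it.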
Integration Testing

Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to expose failures caused by interface defects.

The task of the integration test is to check that components or software applications (for
example, components in a software system or, one step up, software applications at the company
level) interact without error.

Test Results:

All the test cases mentioned above passed successfully. No defects encountered.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.

Test Results:

All the test cases mentioned above passed successfully. No defects encountered.

Test Cases

| S.no | Test Case | Expected Result | Result | Remarks (if fails) |
|---|---|---|---|---|
| 1 | User registration | The user registers successfully. | Pass | Fails if the user's email already exists. |
| 2 | User login | If the user name and password are correct, the user gets a valid … | Pass | Unregistered users will not be logged in. |
| 3 | User adds to the dataset | A new record will be added to our … | Pass | According to the UCI repository the data must be integer fields, otherwise it … |
| 4 | Data cleaning | Data will be … | Pass | The data must be in int or float format, otherwise the algorithm will not … |
| 5 | Box plot of age and target attribute | A plot-based graph is generated. | Pass | The target class is a positive or negative label class; if not, it will … |
| 6 | User can add extra records for … | User-added data will be considered for testing. | Pass | Data is added to the test data for the model. |
| 7 | Calculate confusion matrix | For all our models the confusion matrix is … | Pass | Data is considered for testing. |
| 8 | Accuracy calculation | For our five models the accuracy will be calculated. | Pass | Accuracy will be considered; the failed case is data in binary format. |
| 9 | Admin login | Admin can log in with his login credentials. On success he gets his home page. | Pass | Invalid login details are not allowed. |
| 10 | Admin can activate the registered users | Admin can activate the registered user id. | Pass | If the user id is not found, the user won't be able to log in. |

7.2 Confusion Matrix:

A confusion matrix is a table used to describe the performance of a classification
model (or "classifier") on a set of test data. It can only be determined if the true values
for the test data are known. The matrix itself is easy to understand, but the related
terminology may be confusing. Since it shows the errors in the model's performance in the form
of a matrix, it is also known as an error matrix. Some features of the confusion matrix are
given below:

• For a classifier with 2 prediction classes the matrix is a 2×2 table, for 3 classes it is a
3×3 table, and so on; for n classes it is n×n.

• The matrix has two dimensions, predicted values and actual values, along
with the total number of predictions.

• Predicted values are the values predicted by the model, and actual values are
the true values for the given observations.

The above table has the following cases:

True Negative: The model has predicted No, and the actual value was also No.

True Positive: The model has predicted Yes, and the actual value was also Yes.

False Negative: The model has predicted No, but the actual value was Yes. It is also called a
Type-II error.

False Positive: The model has predicted Yes, but the actual value was No. It is also called a
Type-I error.
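
The four cases can be counted directly from the actual and predicted labels. The following is a minimal sketch in which the function name `confusion_counts` and the six sample labels are invented for illustration; label 1 stands for Yes (positive) and 0 for No (negative).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted Yes, actually Yes
            else:
                fp += 1  # predicted Yes, actually No (Type-I error)
        else:
            if actual == positive:
                fn += 1  # predicted No, actually Yes (Type-II error)
            else:
                tn += 1  # predicted No, actually No
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Sample labels, invented for illustration only
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
# → {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```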

Calculations using Confusion Matrix:

We can perform various calculations for the model, such as the model's accuracy, using this
matrix. These calculations are given below:

Classification Accuracy: It is one of the important parameters for evaluating
classification problems. It measures how often the model predicts the correct output, and is
calculated as the ratio of the number of correct predictions to the total number of
predictions made by the classifier. The formula is given below:

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} = \frac{10 + 30}{10 + 30 + 0 + 0} = 1$$


Mean Absolute Error:

The mean absolute error is the average of the absolute differences between the actual
and predicted values in the dataset. It measures the average magnitude of the residuals.

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

MAE = mean absolute error

$y_i$ = true value

$\hat{y}_i$ = prediction

N = total number of data points

Mean Squared Error:

The mean squared error is the average of the squared differences between the actual and
predicted values in the dataset. It measures the variance of the residuals.

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$

MSE = mean squared error

$y_i$ = true value

$\hat{y}_i$ = prediction

N = total number of data points

Root Mean Squared Error:

The root mean squared error is the square root of the mean squared error. It measures the
standard deviation of the residuals.

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

RMSE = root mean squared error

$y_i$ = true value

$\hat{y}_i$ = prediction

N = total number of data points
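
The three error measures can be computed with a few lines of Python. This is a minimal sketch in which the function names and the four sample labels are invented for illustration; `y_true` holds the true values and `y_pred` the predictions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Sample labels, invented for illustration only
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

For 0/1 class labels every error is either 0 or 1, so MAE and MSE are equal and both coincide with the fraction of misclassified samples (0.5 in this example).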


F1 Score:

If one model has low precision and high recall and another the reverse, it is difficult to
compare them. For this purpose we can use the F-score, which evaluates recall and
precision at the same time. For a given sum of precision and recall, the F-score is highest when
the two are equal. It can be calculated using the formula below:

$$F1\_score = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \times 1 \times 1}{1 + 1} = 1$$



Precision:

Precision is the fraction of the samples predicted as positive by the model that are actually
positive. It can be calculated using the formula below:

$$\text{Precision} = \frac{tp}{tp + fp}$$



Recall:

Recall is the fraction of all actual positive samples that the model predicted correctly. The
recall should be as high as possible.

$$\text{Recall} = \frac{tp}{tp + fn} = \frac{10}{10 + 0} = 1$$
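
The calculations of this section can be combined in one small function. The sketch below uses the counts that appear in the worked figures (tp = 10, tn = 30, fp = 0, fn = 0); the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(classification_metrics(tp=10, tn=30, fp=0, fn=0))
# → {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
```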


Patient Registration:

Login form for Hospital superintendent:

Hospital superintendent homepage:

Hospital superintendent Activates Patients:

Patient Login page:

Patient adds data to check whether he has CKD or NOTCKD:

Patient having CKD:

Patient View data:

Confusion matrix for Logistic Regression Algorithm:

From the Logistic Regression confusion matrix we can calculate accuracy, F1_Score, Precision
and Recall for the patient:

Confusion matrix for Decision Tree Algorithm:

From the Decision Tree confusion matrix we can calculate accuracy, F1_Score, Precision
and Recall for the patient:

Confusion matrix for Random Forest Algorithm:

From the Random Forest confusion matrix we can calculate accuracy, F1_Score, Precision
and Recall for the patient:

We evaluated the performance of different ML algorithms on the chronic kidney disease dataset
taken from the UCI Machine Learning Repository. We preprocessed the dataset and then used
filter methods of feature selection, namely univariate selection and the correlation matrix,
along with feature importance to find the best features in the dataset. The two feature
selection techniques are combined to leverage the strengths of each. The proposed algorithms,
Decision Tree, Random Forest and Logistic Regression, achieved accuracies of 98.48%, 94.16% and
99.24%, precision of 100%, 95.12% and 98.82%, and recall of 97.61%, 96.29% and 100%,
respectively. On comparison, Logistic Regression has the highest accuracy and recall, while
Decision Tree has the highest precision.


