Prediction of Risk in Cardiovascular Disease Using Machine Learning Algorithms
Yuvasree R
UG Scholar
Department of Electronics and Communication Engineering
Rajalakshmi Institute of Technology
Chennai, India

Swetha G
UG Scholar
Department of Electronics and Communication Engineering
Rajalakshmi Institute of Technology
Chennai, India

Kathu Sara Renji
Assistant Professor
Department of Biomedical Engineering
V.S.B Engineering College
Karur, India
Abstract—Prediction of heart disease is one of the most complex tasks in the medical field, and early detection of heart disease has become an area of research to save patients' lives. During the pandemic period, the number of cardiac arrest cases at home drastically increased due to inaccurate predictions and delay in seeking medical attention. The health care industry works on processing huge volumes of data, and machine learning offers a solution: data science processes large amounts of data to make intelligent health care decisions, thereby avoiding risk and alerting patients. This paper presents a comparative analysis of different machine learning classifiers for predicting the chance of heart disease with a minimal set of attributes. The ML algorithms used for HD prediction are K-Nearest Neighbour, Gradient Boosting Classifier, Support Vector Machine, Naive Bayes, Logistic Regression and Random Forest. The paper also finds the correlation between the different attributes and uses them efficiently for prediction of heart attack.

Keywords—Decision Tree, Support Vector Machine, Random Forest, Naïve Bayes, K-means, Artificial Neural Network

… the death risk by using prediction techniques of Machine Learning.

Machine Learning has been a buzzword for the past few years; the reason might be its applications, the increase in computation power, and the design and implementation of better algorithms. Machine Learning [1] plays a vital role in today's market. Its applications and techniques range from self-driving cars to predicting severe and deadly diseases. It involves building and implementing a predictive model that is used to find a solution for a problem statement. The heart maintains continuous blood circulation in our body, and there are many cases in the world related to heart diseases. Heart attack is one of the most common diseases in recent days. It shows up as various symptoms such as irregular heartbeats, chest pain, and so on. People from all over the world suffer from cardiovascular diseases, which can even cost them their lives.
dimensional spaces. So, the best boundary is found by identifying the hyperplane of the SVM. The data points nearest to the hyperplane are called support vectors. SVM is commonly used for image classification and face detection. The most challenging issues in SVM are kernel selection, and choosing a method that avoids overfitting and underfitting of the dataset. The training data is mapped with a function called a kernel.

C. Random Forest Algorithm
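The kernel-based decision boundary described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the tiny dataset and the RBF kernel choice are assumptions for demonstration, not details from the paper.

```python
# Sketch of an SVM classifier with an RBF kernel, assuming scikit-learn.
from sklearn.svm import SVC

# Two clearly separated clusters of 2-D points (illustrative data).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]]
y = [0, 0, 0, 1, 1, 1]

# The kernel maps the training data into a higher-dimensional space;
# kernel choice (linear, rbf, poly) is the tuning issue noted above.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.1, 0.2], [3.1, 3.0]]))  # one query point near each cluster
```

The points selected as `clf.support_vectors_` are exactly the data points nearest to the hyperplane mentioned in the text.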
Automated decision-making is supported by supervised learning in the Random Forest algorithm. This technique is employed in machine learning to solve classification and regression problems, which is possible because of the idea of "ensemble learning": multiple classifiers are combined in order to solve a complicated problem and improve the overall performance. The accuracy of the prediction can be improved by using a random forest, which averages many decision trees built from the given dataset. Random forest takes less time, predicts highly accurate outputs even for a large dataset, runs efficiently, and maintains accuracy even when a large amount of data is missing. The steps for random forest are:

Step-1: X data points are randomly selected from the original input training set
Step-2: Develop the decision trees as per the data points selected in step one
Step-3: Decide on how many decision trees you want to create
Step-4: Repeat Steps 1 & 2
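The steps above can be sketched with scikit-learn's ensemble module (an assumed implementation choice; the data below is illustrative, not from the heart dataset):

```python
# Sketch of the random-forest steps: n_estimators is Step-3 (how many
# trees to create); each tree is grown on a random bootstrap sample of
# the training set (Step-1), assuming scikit-learn.
from sklearn.ensemble import RandomForestClassifier

X = [[1, 0], [2, 1], [1, 1], [8, 9], [9, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

# 50 decision trees; the final class is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X, y)

print(forest.predict([[1, 2], [9, 9]]))
```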
D. KNN Algorithm
The supervised machine learning technique of k-nearest neighbours may be used to address problems involving classification and regression analysis. Naive Bayesian classification has been used in the development of a Decision Support in the Prediction of Heart Disease system. A database of past heart disease cases is mined by the algorithm to unearth previously unknown information, so heart disease patients may be accurately predicted using this approach. This algorithm is non-parametric and does not rely on any assumptions or previous values. To put it another way, it is a lazy learning algorithm because it does no work until the classification step itself. The main steps involved are:

Step-1: Choose K neighbours
Step-2: Find the distance to the K nearest neighbours by using the Euclidean distance formula
Step-3: Determine the Euclidean distance of the K nearest neighbours
Step-4: Compute the number of data points of each class among the neighbours
Step-5: Allocate the newly attained data point to the class with the greatest neighbour count
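The five steps above can be sketched in pure Python (the training points and class labels are illustrative assumptions):

```python
# Pure-Python sketch of the five KNN steps above.
from collections import Counter
from math import dist  # Euclidean distance (Step-2)

def knn_predict(train, labels, point, k=3):
    # Steps 1-3: Euclidean distance from the query point to every
    # training point, sorted so the K nearest come first.
    distances = sorted(zip((dist(p, point) for p in train), labels))
    # Step-4: count the class labels among the K nearest neighbours.
    nearest = [label for _, label in distances[:k]]
    # Step-5: assign the class with the greatest neighbour count.
    return Counter(nearest).most_common(1)[0][0]

train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["healthy", "healthy", "healthy", "risk", "risk", "risk"]

print(knn_predict(train, labels, (2, 2)))  # near the first cluster
print(knn_predict(train, labels, (9, 9)))  # near the second cluster
```

The "lazy learning" property is visible here: nothing is precomputed, and all the distance work happens at prediction time.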
E. Gradient Boosting Classifier
It is a machine learning algorithm for evaluating regression and classification problems, and it is one of the most influential algorithms. The prediction model takes the form of decision trees, and it can outperform random forest. The algorithm is called gradient boosted trees when the decision tree is a weak learner. The most vital step in the gradient boosting method is regularization by shrinkage. The base estimator is fixed in this case, and it is a decision stump. This algorithm is used for prediction of both continuous and categorical target variables; the cost function varies based on whether it is used for regression or classification: Mean Square Error (MSE) for the former and Log loss for the latter.

F. Naive Bayes Algorithm
The algorithm follows Bayes' rule for calculating probabilities and conditional probabilities, with the assumption that attributes are statistically independent of each other. It is a classifier that provides better accuracy and is widely used in computer vision applications. It is used for very large dataset analysis, where it can outperform other classification methods in terms of accuracy.

P(h|d) = P(d|h) * P(h) / P(d)

P(h|d) - class posterior probability
P(h) - class prior probability
P(d|h) - likelihood probability
P(d) - prior probability of predictor

IV. METHODOLOGY

A. Data Collection
The Cleveland HD dataset downloaded from the Kaggle UCI repository is used in this research work. The database consists of 76 attributes, of which 14 are taken into consideration for the research purpose. The 14 attributes are explained in Table 1.

TABLE 1. DATASET FOR HEART DISEASE PREDICTION

S. No  Attribute  Description
1      age        Age in years
2      sex        Male or female
3      cp         Type of chest pain
4      trestbps   Resting blood pressure
5      chol       Serum cholesterol (mg/dl)
6      fbs        Fasting blood sugar
7      restecg    Resting ECG results
8      thalach    Maximum heart rate achieved
9      exang      Exercise-induced angina
10     oldpeak    ST depression induced by exercise
11     slope      Slope of the ST segment
12     thal       Thalassemia status
13     ca         Number of major vessels
14     target     Output class

The total number of instances in the dataset is 1025, and all are used for the data analysis in this work. The dataset undergoes pre-processing to handle missing values through statistical techniques. The dataset is separated into two parts: training data (80 percent) and testing data (20 percent).
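The 80/20 split of the 1025 instances described above can be sketched as follows; scikit-learn's `train_test_split` is an assumed implementation choice, and the feature rows here are dummies standing in for the real dataset:

```python
# Sketch of the 80/20 split on a stand-in for the 1025-instance dataset,
# assuming scikit-learn.
from sklearn.model_selection import train_test_split

n_instances = 1025                       # as reported for the dataset
X = [[i] for i in range(n_instances)]    # placeholder feature rows
y = [i % 2 for i in range(n_instances)]  # placeholder target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

print(len(X_train), len(X_test))  # 820 training rows, 205 testing rows
```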
B. Data Pre-processing
… of 200. The only attribute that lies outside the outlier range is chol, with values above 400. A closer look at each of the attributes is shown in Fig. 5 and Fig. 6.
C. Building Model

(Figure: block diagram of the model - input from the patient is fed to the trained classifier, which outputs the prediction.)
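A minimal sketch of the model-building stage, assuming scikit-learn and a synthetic stand-in for the heart dataset; the six classifiers match those compared in the paper, but the hyperparameters and data are assumptions:

```python
# Sketch of the comparison stage: train the six classifiers from the
# paper on synthetic data and score each on a held-out 20% test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in for the heart dataset: 1025 rows, 13 numeric features.
X, y = make_classification(n_samples=1025, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RFC": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "GBA": GradientBoostingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and score it on the test split.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

On the real dataset the same loop would be run after the pre-processing step, with the 14 selected attributes as features and `target` as the label.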
Accuracy = (TP + TN) / (TP + TN + FN + FP)

TPR = No. of True Positives (TP) / (No. of True Positives (TP) + False Negatives (FN))

FPR = No. of False Positives (FP) / (No. of False Positives (FP) + True Negatives (TN))

Fig. 5. Closer look of each attribute outlier - scatter plot
Precision or Positive Predictive Value: if the model predicts the positive class correctly, it is called a True Positive, and if the model predicts the negative class correctly, it is called a True Negative.

Precision (p) = No. of True Positives (TP) / (No. of True Positives (TP) + False Positives (FP))

(Figure: bar chart comparing LR, SVM, RFC, KNN, GBA and Naive Bayes; y-axis from 0.5 to 0.8.)
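The evaluation measures above can be computed in a few lines of pure Python; the confusion-matrix counts here are made up for illustration, not results from the paper:

```python
# Sketch of the evaluation formulas, applied to illustrative counts.
TP, FN, FP, TN = 90, 10, 15, 85

accuracy = (TP + TN) / (TP + TN + FP + FN)
tpr = TP / (TP + FN)         # true positive rate (sensitivity / recall)
fpr = FP / (FP + TN)         # false positive rate
precision = TP / (TP + FP)   # positive predictive value

print(accuracy, tpr, fpr, precision)
```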