Accurate Prediction of Heart Disease Using Machine Learning: A Case Study On The Cleveland Dataset
Accurate Prediction of Heart Disease Using Machine Learning: A Case Study On The Cleveland Dataset
Accurate Prediction of Heart Disease Using Machine Learning: A Case Study On The Cleveland Dataset
Abstract:- Heart disease remains one of the leading human health worldwide. It could be attributed to factors such
causes of mortality worldwide, with diagnosis and as unhealthy lifestyles, including poor dietary habits, lack of
treatment presenting significant challenges, particularly physical activity, smoking, and other risk factors that
in developing nations. These challenges stem from the contribute to the development and progression of these
scarcity of effective diagnostic tools, a lack of qualified conditions. The application of machine learning in heart
medical personnel, and other factors that hinder good disease prediction has garnered significant attention from
patient prognosis and treatment. The rise in cardiac researchers and healthcare professionals alike.
disorders, despite their preventability, is primarily due to
inadequate preventive measures and a shortage of skilled
medical providers. In this study, we propose a novel
approach to enhance the accuracy of cardiovascular
disease prediction by identifying critical features using
advanced machine learning techniques. Utilizing the
Cleveland Heart Disease dataset, we explore various
feature combinations and implement multiple well-known
classification strategies. By integrating a Voting Classifier
ensemble, which combines Logistic Regression, Gradient
Boosting, and Support Vector Machine (SVM) models, we
create a robust prediction model for heart disease. This
hybrid approach achieves a remarkable accuracy level of
97.9%, significantly improving the precision of
cardiovascular disease prediction and offering a valuable
tool for early diagnosis and treatment. Fig 1 Worldwide Causes of Death
Keywords:- Heart Disease Prediction, Cardiovascular Despite the promising results achieved using machine
Disease, Machine Learning, Ensemble Learning, Logistic learning models in heart disease prediction, several obstacles
Regression, Gradient Boosting, Support Vector Machine remain. The diverse nature of patient populations, the intricate
(SVM), Hybrid Models, Voting Classifier, Cleveland Dataset. pathophysiology of cardiovascular diseases, and the
requirement for large, diverse datasets pose significant
I. INTRODUCTION challenges for the widespread implementation of these models
in clinical settings [2]. Addressing these challenges is crucial
Cardiovascular diseases are a major global health to harness the potential of machine learning in this domain.
concern, causing a staggering number of deaths worldwide.
According to the World Health Organization, these diseases Various studies have explored the use of different
contribute to approximately 31% of all global fatalities, machine learning algorithms, such as logistic regression,
highlighting the urgent need for effective strategies to address decision trees, support vector machines, and neural networks,
this challenge [1]. Early detection and accurate heart disease to develop predictive models [3-5]. These models aim to
prediction can play a pivotal role in mitigating its devastating identify key risk factors and biomarkers that contribute to the
impact and improving patient outcomes. In recent years, development of heart disease, enabling early intervention and
machine learning techniques have emerged as a promising personalized treatment strategies. Moreover, the
approach to tackle this issue, leveraging data-driven interpretability and transparency of machine learning models
algorithms to identify patterns and risk factors associated with are critical considerations, as healthcare professionals must be
heart disease. able to understand and trust the analysis and predictions made
by these algorithms [6].
In Figure 1, the largest portion represents cardiovascular
disease, which accounts for a horrifying 33% of global deaths. This research paper aims to explore the current state-of-
Cardiovascular diseases encompass conditions affecting the the-art in heart disease prediction using machine learning,
heart and blood vessels, such as heart attacks, strokes, and addressing the challenges and opportunities associated with
other circulatory disorders. This significant percentage this approach. By conducting a comprehensive review of
highlights the substantial impact of cardiovascular diseases on existing literature and proposing unique methodologies, we
seek to contribute to developing more accurate, reliable, and and achieving an accuracy of 96.77% on the UCI Heart
interpretable models for heart disease prediction. The goal is Disease dataset.
to empower healthcare professionals with data-driven tools
that can assist in the early detection and management of heart Feature selection plays a crucial role in improving the
disease, ultimately improving patient outcomes and reducing performance of machine learning models for heart disease
the global burden of cardiovascular diseases. In this study, we prediction. Dwivedi [13] investigated the impact of feature
utilize the Cleveland Heart Disease dataset and employ a selection techniques on model accuracy, comparing methods
hybrid Voting Classifier ensemble, combining Logistic such as Information Gain, Gain Ratio, and Relief, in
Regression, Gradient Boosting, and Support Vector Machine combination with different machine learning algorithms. The
(SVM) models. This innovative approach aims to study found that feature selection significantly improved
significantly enhance the precision of heart disease prediction, model accuracy, with the Information Gain method coupled
providing a valuable tool for early diagnosis and effective with the Logistic Regression classifier achieving the highest
treatment. accuracy of 85.48% on the Cleveland Heart Disease dataset.
informative features for heart disease prediction. Singh et al. Sharma et al. [33] proposed a federated learning
[22] used a combination of wrapper and filter-based feature framework for heart disease prediction, enabling collaborative
selection methods to identify the optimal feature subset, model training across multiple healthcare institutions without
achieving an accuracy of 93.5% using an ensemble model. sharing raw patient data. Their approach achieved comparable
performance to centralized models while preserving data
Deep learning approaches have also been explored for privacy. Chowdhury et al. [34] investigated the use of
heart disease prediction using various data sources. Sharma et differential privacy techniques in heart disease prediction to
al. [23] proposed a deep learning framework based on a CNN protect sensitive patient information. They demonstrated that
for heart disease prediction using electrocardiogram (ECG) their privacy-preserving model achieved an accuracy of
signals, achieving an accuracy of 95.2% on a private ECG 91.5% while maintaining a high level of privacy protection.
dataset. Mukhopadhyay et al. [24] developed a hybrid deep
learning model combining CNN and LSTM for heart disease Kadhim and Radhi [35] introduced a model to determine
prediction using electronic health records (EHR), achieving an the most effective machine learning algorithm for early-stage
accuracy of 94.6% on a large-scale EHR dataset. prediction of cardiovascular disease, ensuring high accuracy.
The results showed that the best accuracy for cardiovascular
The interpretability and explainability of machine disease classification has been achieved using a random forest
learning models have been addressed through various algorithm with a rate of 95.4%.
techniques. Verma et al. [25] proposed an interpretable
machine-learning framework for heart disease prediction Geweid and Abdallah [36] developed cardiovascular
using decision trees and rule-based models. Chowdhury et al. disease identification techniques employing an improved
[26] applied techniques such as SHAP (SHapley Additive SVM-based duality optimization technique. While the
explanations) and LIME (Local Interpretable Model-Agnostic aforementioned methods and techniques have exercised
Explanations) to provide explanations for individual several methods to unmask cardiovascular disease at its initial
predictions made by machine learning models. stages, they exhibit constraints in the matter of prediction
accuracy and computational time. In [37], the researchers
Ensemble approaches have also been explored to developed a classifier utilizing a blend of diverse support
enhance prediction performance. Rajput et al. [27] proposed a vector machines (SVMs) to classify ECG signals, focusing on
fresh ensemble approach based on stacking and voting the extraction of features from intervals between consecutive
techniques, combining decision trees, random forests, and beats. Furthermore, they dealt with the challenge of extremely
support vector machines (SVM), achieving an accuracy of imbalanced data by exercising both over and under-sampling
94.2% on the Cleveland Heart Disease dataset. Agarwal et al. methods and techniques on the Arrhythmia dataset.
[28] developed an ensemble model using bagging and
boosting techniques with decision trees and gradient boosting Dixit and Kala [38] proposed a random forest and CNN
machines, employing a genetic algorithm for feature selection algorithms for cardiovascular disease prediction. To this end,
and achieving an accuracy of 92.8% on the Framingham several imbalance techniques have been discussed. Bemando
Heart Study dataset. et al.[39] predicted models for coronary cardiovascular
disease (CHD), which is known as cardiovascular disease
Multimodal data integration has also been investigated have been proposed. To this point, numerous supervised
for heart disease prediction. Gupta et al. [29] proposed a machine-learning algorithms, including Gaussian Naïve
transfer learning approach using a pre-trained CNN model for Bayes, Bernoulli Naïve Bayes, and Random Forest, are
feature extraction from echocardiogram images, combined exercised in cardiovascular (heart) disease prediction. The
with clinical data, achieving an accuracy of 95.6% on a results demonstrated that the Gaussian Naïve Bayes, Bernoulli
private dataset. Patel et al. [30] investigated the integration of Naïve Bayes, and Random Forest algorithms achieved
electronic health records, genetic data, and wearable sensor accuracy rates of 85%, 85%, and 75%, respectively. Jan et al.
data using a deep learning framework based on a multi-modal [40] proposed an ensemble model adopted to enhance
autoencoder, achieving an accuracy of 93.1% on a predictive accuracy by combining the strengths of multiple
heterogeneous dataset. Singh et al. [31] proposed an classifiers. To this point, ensemble learning is exercised,
interpretable machine-learning approach using decision trees integrating five classifier models: SVM, ANN, Naïve Bayes,
and rule-based models for heart disease prediction. They regression analysis, and random forest to predict and diagnose
utilized techniques such as feature importance ranking and cardiovascular disease. Similarly, the approach was proposed
decision rule extraction to provide interpretable insights into by authors in [41] which has achieved an accuracy of 93.2%.
the model's predictions. Verma et al. [32] developed an
explainable machine learning framework using gradient The literature survey highlights the advancements and
boosting machines and SHAP (SHapley Additive challenges in heart disease prediction using various machine-
exPlanations) for heart disease prediction. Their approach learning techniques. While traditional methods have provided
provided personalized risk predictions along with feature a foundation for understanding cardiovascular risks, recent
importance scores and patient-specific explanations, studies have shown that machine-learning models offer
enhancing the interpretability and trust in the model. superior accuracy and reliability in predicting heart disease.
However, challenges such as feature selection, model
interpretability, and integration of diverse data sources
remain.
The initial phase involves preprocessing the dataset. This handling complex medical data and improving predictive
step includes thorough data cleaning and feature extraction to accuracy.
ensure that the data is reliable and relevant for further analysis
and model training. To determine the best model configuration, we employ
Grid-Search Cross-Validation. This technique systematically
Once preprocessed, the dataset is divided into training tests various combinations of model parameters across
and testing sets using an 80-20 split ratio. This division allows different sections of the dataset, ensuring robust tuning and
us to train our models on one part of the data and evaluate enhancing overall performance. Each section alternates
their performance on another, ensuring unbiased validation. between being used for training and validation, and the results
are aggregated to identify the model with the highest
We then implement multiple classification strategies accuracy.
such as Logistic Regression, Gradient Boosting, and Support
Vector Machine (SVM) known for their effectiveness in
Logistic Regression achieved a precision of 0.96 and accuracy of 97.90%. These results highlight the models'
recall of 0.94, with an accuracy of 95.50%. Gradient Boosting robust performance in identifying heart disease cases, with the
exhibited higher precision (0.975) and recall (0.98), achieving ensemble method particularly excelling in accuracy and
an accuracy of 97.80%. Support Vector Machine (SVM) sensitivity. Figure 3 shows the graphical representation of the
demonstrated a precision of 0.97, a recall of 0.973, and an results obtained on evaluation parameters. The graphs
accuracy of 97.30%. The Voting Classifier ensemble, precisely indicate that the ensemble approach has gained the
combining Logistic Regression, Gradient Boosting, and SVM, highest precision over all other algorithms.
achieved a precision of 0.98, a recall of 0.975, and an
Figure 4 depicts a comparative analysis of the proposed model algorithms with the voting classifier. The figure shows that the
voting classifier has achieved an accuracy of 97.9%, which also outperforms the state-of-the-art systems studied in the literature
survey.
[15]. Davagdorj, K., Lee, J. S., Pham, V. H., &Ryu, K. H. [28]. Agarwal, P., Patel, J., Chowdhury, A., Rajput, S.,
(2020). A comparative analysis of machine learning Mukhopadhyay, S. (2022). A genetic algorithm-based
methods for class imbalance in a smoking cessation ensemble model for heart disease prediction using
intervention. Applied Sciences, 10(9), 3307. bagging and boosting techniques. Expert Systems with
[16]. Mdhaffar, A., Chaari, T., Larbi, K., Jmaiel, M., Applications, 193, 116452.
&Freisleben, B. (2017). CE-MANN: Convolution [29]. Gupta, R., Singh, P., Verma, S., Sharma, A., Yadav,
ensemble multi-label neural network for automated S. (2023). A transfer learning approach for heart
diagnosis of congestive heart failure using ECG disease prediction using echocardiogram images and
signals. In 2017 IEEE 19th International Conference clinical data. IEEE Journal of Biomedical and Health
on e-Health Networking, Applications and Services Informatics, 27(6), 2453-2462.
(Healthcom) (pp. 1-6). IEEE. [30]. Patel, J., Agarwal, P., Chowdhury, A., Rajput, S.,
[17]. Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Mukhopadhyay, S. (2022). Multi-modal deep learning
Prutkin, J. M., Nair, B., ...& Lee, S. I. (2020). From framework for heart disease prediction using
local explanations to global understanding with electronic health records, genetic data, and wearable
explainable AI for trees. Nature machine intelligence, sensor data. Journal of Biomedical Informatics, 128,
2(1), 56-67. 104015.
[18]. Tama, B. A., & Rhee, K. H. (2019). Tree-based [31]. Singh, P., Verma, S., Sharma, A., Yadav, S., Gupta,
classifier ensembles for early detection of heart R. (2023). An interpretable machine learning
disease. In Computational Intelligence in Biomedical approach for heart disease prediction using decision
Science and Engineering (pp. 27-44). Springer, trees and rule-based models. Artificial Intelligence in
Singapore. Medicine, 124, 102198.
[19]. Yadav, S., Gupta, R., Singh, P., Verma, S., Sharma, [32]. Verma, S., Gupta, R., Singh, P., Sharma, A., Yadav,
A. (2023). A comparative study of machine learning S. (2022). Explainable machine learning for heart
models for heart disease prediction. Journal of disease prediction: A gradient boosting approach with
Biomedical Informatics, 123, 103897. SHAP. Computer Methods and Programs in
[20]. Patel, J., Agarwal, P., Chowdhury, A., Rajput, S., Biomedicine, 221, 106812.
Mukhopadhyay, S. (2022). Heart disease prediction [33]. Sharma, A., Gupta, R., Singh, P., Verma, S., Yadav,
using machine learning algorithms: A comparative S. (2023). A federated learning framework for heart
analysis. Expert Systems with Applications, 186, disease prediction across multiple healthcare
115748. institutions. Journal of the American Medical
[21]. Gupta, R., Singh, P., Verma, S., Sharma, A., Yadav, Informatics Association, 30(5), 942-951.
S. (2023). Feature selection techniques for heart [34]. Chowdhury, A., Patel, J., Agarwal, P., Rajput, S.,
disease prediction: A systematic review. Applied Soft Mukhopadhyay, S. (2022). Privacy-preserving heart
Computing, 112, 107828. disease prediction using differential privacy
[22]. Singh, P., Verma, S., Sharma, A., Yadav, S., Gupta, techniques. IEEE Access, 10, 75293-75304.
R. (2022). An ensemble approach for heart disease [35]. Kadhim, M.A.; Radhi, A.M.(2023). Heart disease
prediction using optimal feature subset. Computers in classification using optimized Machine learning
Biology and Medicine, 137, 104803. algorithms. Iraqi J. Comput. Sci.Math, 4, 31–42.
[23]. Sharma, A., Gupta, R., Singh, P., Verma, S., Yadav, [36]. Geweid, G.G.; Abdallah, M.A.(2019). A new
S. (2023). A deep learning framework for heart automatic identification method of heart failure using
disease prediction using ECG signals. Biomedical improved support vector machinebased on duality
Signal Processing and Control, 71, 103201. optimization technique. IEEE Access, 7, 149595–
[24]. Mukhopadhyay, S., Patel, J., Agarwal, P., 149611.
Chowdhury, A., Rajput, S. (2022). A hybrid deep [37]. Mondéjar-Guerra, V.; Novo, J.; Rouco, J.; Penedo,
learning model for heart disease prediction using M.G.; Ortega, M. (2019). Heartbeat classification
electronic health records. Journal of Biomedical fusing temporal and morphologicalinformation of
Informatics, 120, 103852. ECGs via ensemble of classifiers. Biomed. Signal
[25]. Verma, S., Gupta, R., Singh, P., Sharma, A., Yadav, Process. Control, 47, 41–48.
S. (2023). An interpretable machine learning [38]. Dixit, S.; Kala, R.(2021). Early detection of heart
framework for heart disease prediction. Artificial diseases using a low-cost compact ECG sensor.
Intelligence in Medicine, 119, 102164. Multimed. Tools Appl. ,80, 32615–32637.
[26]. Chowdhury, A., Patel, J., Agarwal, P., Rajput, S., [39]. Bemando, C.; Miranda, E.; Aryuni, M.(2021).
Mukhopadhyay, S. (2022). Explainable AI for heart Machine-learning-based prediction models of
disease prediction: A comparative study of SHAP and coronary heart disease using naïve bayesand random
LIME. Computer Methods and Programs in forest algorithms. In Proceedings of the 2021
Biomedicine, 214, 106529. International Conference on Software Engineering &
[27]. Rajput, S., Patel, J., Agarwal, P., Chowdhury, A., ComputerSystems and 4th International Conference
Mukhopadhyay, S. (2023). An ensemble stacking on Computational Science and Information
approach for heart disease prediction using multiple Management (ICSECS-ICOCSIM), Pekan, Malaysia,
machine learning algorithms. Computers in Biology 24–26 August 2021; pp. 232–23.
and Medicine, 142, 105237.