3.1. Dataset
The dataset used in this study was obtained from the UCI Heart Disease Data Repository [34]. It comprises a total of 14 features, with the target variable serving as the dependent variable. The independent variables are age, sex, cp (chest pain type), trestbps (resting blood pressure), chol (serum cholesterol), fbs (fasting blood sugar), restecg (resting electrocardiographic results), thalach (maximum heart rate achieved), exang (exercise-induced angina, with zero representing absence and one representing presence), oldpeak (ST depression induced by exercise relative to rest), slope (the slope of the peak exercise ST segment), ca (number of major vessels colored by fluoroscopy), and thal (thalassemia).
Before constructing the model, a comprehensive analysis and visualization of the dataset were conducted to gain insight into the distribution of values. This initial exploratory analysis facilitated a deeper understanding of the data and informed decisions during the subsequent modeling phase. Note that exang (exercise-induced angina) indicates whether the patient experienced angina during the stress test, with zero indicating no angina and one indicating its presence.
Table 1 presents the distribution of different chest pain values in relation to the target variable.
Table 2 displays the sample distribution of high and low blood sugar values against the target variable.
Table 3 shows the sample distribution of exang (exercise-induced angina) values with respect to the target variable.
The distribution of continuous variables, including age, cholesterol, oldpeak, thalach, and trestbps, is visualized through the scatter plots in Figure 3. This figure helps us identify potential data patterns and outliers, which can influence the choice of appropriate modeling techniques and preprocessing steps.
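As an illustration of this exploratory step, the following is a minimal sketch assuming the Cleveland subset is available locally as processed.cleveland.data with the standard UCI column order; the file name, column names, and plotting choices are assumptions, not the authors' exact procedure:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name; the standard UCI Cleveland layout has these 14 columns.
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

# Scatter plots of the continuous variables against age, colored by target,
# mirroring the kind of visual inspection described for Figure 3.
continuous = ["chol", "oldpeak", "thalach", "trestbps"]
fig, axes = plt.subplots(1, len(continuous), figsize=(16, 4))
for ax, col in zip(axes, continuous):
    ax.scatter(df["age"], df[col], c=df["target"], cmap="coolwarm", s=12)
    ax.set_xlabel("age")
    ax.set_ylabel(col)
plt.tight_layout()
plt.show()
```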
3.5. Feature Selection
Feature selection plays a critical role in ML models, as not all features may positively contribute to decision-making. To address this, the Extra Trees Classifier has been employed to select the most relevant features from the dataset. The Extra Trees Classifier, also known as Extremely Randomized Trees, is an ensemble learning method based on decision trees. Mathematically, the decision function of the Extra Trees Classifier can be represented as in Equation (2):

$$D(x) = \operatorname{mode}\{T_1(x), T_2(x), \ldots, T_K(x)\} \qquad (2)$$

where $D(x)$ represents the decision function and $T_k(x)$ denotes the class assigned to the input feature vector $x$ by the $k$-th decision tree.
The strategic feature selection process using the Extra Trees Classifier enhances the efficiency, interpretability, and generalization of our heart disease prediction model. Before feature selection, the dataset comprised a comprehensive set of features, including age, sex, chest pain type (cp), resting blood pressure (trestbps), serum cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved during exercise (thalach), exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels colored by fluoroscopy (ca), and thalassemia type (thal), as depicted in Table 4.
After careful consideration, the feature selection process retained the most informative features, including age, chest pain type (cp), maximum heart rate achieved during exercise (thalach), exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels colored by fluoroscopy (ca), and thalassemia type (thal). This selection aligns with existing medical knowledge about factors that influence heart disease, ensuring that our model focuses on the most relevant aspects to achieve accurate predictions.
The selected features contribute significantly to the model’s predictive capabilities while eliminating redundancy and reducing the risk of overfitting. This strategic feature selection process not only improves the computational efficiency of our model, but also enhances its interpretability and generalization to new data. The remaining features, namely sex, trestbps, chol, fbs, and restecg, were deemed less clinically relevant to heart disease prediction for this dataset and were discarded during the feature selection process.
To visualize the significance of each feature, we apply the Extra Trees Classifier, as shown in Figure 4. Notably, slope, age, oldpeak, thalach, thal, exang, ca, and cp were identified as the most influential features in our model.
These insights into feature selection aim to provide a deeper understanding of the variables influencing our model’s predictions.
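A minimal sketch of this selection step with scikit-learn's ExtraTreesClassifier follows; the file name, the binarization of the target, the hyperparameters, and the mean-importance threshold are illustrative assumptions rather than the authors' exact settings:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?").dropna()

X = df.drop(columns=["target"])
y = (df["target"] > 0).astype(int)  # binarize: 0 = no disease, 1 = disease (assumption)

# Fit the extremely randomized trees and rank features by
# their impurity-based importance scores (plotted in Figure 4).
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Retain the most informative features (threshold is an illustrative assumption).
selected = importances[importances >= importances.mean()].index.tolist()
X_selected = X[selected]
```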
3.7. Ensemble Learning
The proposed approach introduces an ensemble learning technique, which combines the predictions of two hybrid ensemble classifiers:
1. Hybrid Ensemble 1: This ensemble consists of SVM, Decision Tree, and KNN classifiers. Each base classifier is trained independently on the preprocessed dataset.
2. Hybrid Ensemble 2: This ensemble includes Logistic Regression, AdaBoost, and Naive Bayes classifiers, each trained on the same dataset as in Hybrid Ensemble 1.
The base learners were selected on the basis of their robust performance in previous studies and their relevance to the specific characteristics of the dataset. While the ensemble method employs a majority voting scheme, the fusion of diverse classifiers with distinct decision boundaries enables the exploration of complementary aspects of the data, thereby enhancing the model’s predictive capabilities. The base classifiers in Hybrid Ensembles 1 and 2 were chosen to leverage their individual strengths and to ensure a diverse range of learning strategies within the hybrid ensemble framework.
Specifically, in designing Hybrid Ensemble 1, we aimed to integrate classifiers with diverse capabilities to enhance the model’s overall performance. Three base classifiers were selected based on their individual strengths:
1. SVM: A linear kernel was chosen for its simplicity and robustness on linearly separable data. SVMs are known for producing efficient decision boundaries.
2. Decision Tree: Selected for its ability to represent nonlinear relationships in the data and for its interpretability.
3. KNN: Employed for recognizing local patterns and adapting to the structure of the data.
The integration of these three classifiers in Hybrid Ensemble 1 provides the model with the ability to handle linear and nonlinear patterns, contributing to its generalization and robustness.
For Hybrid Ensemble 2, three distinct base classifiers were chosen:
1. Logistic Regression: A straightforward yet powerful linear classifier suitable for binary classification tasks.
2. AdaBoost: An ensemble technique known for building a strong classifier by combining weak ones, adapting well to complex data.
3. Naive Bayes: A probabilistic classifier frequently used in various domains, particularly in text categorization.
This ensemble design ensures a combination of classifiers with diverse attributes, enhancing the model’s adaptability to different data features and improving the overall prediction accuracy. The rationale behind the selection aligns with established practices in the literature, promoting transparency and reproducibility of the proposed approach.
The predictions from both hybrid ensembles are then concatenated and used as input to a Voting Classifier. This final step aggregates the predictions from all base classifiers, employing a majority voting scheme to make the final prediction. The main reason behind this concatenated hybrid ensemble approach is to exploit the diversity of base classifiers, each with its strengths and weaknesses. By combining two hybrid ensembles, we aim to enhance the model’s overall predictive performance.
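A minimal sketch of this architecture using scikit-learn's VotingClassifier follows; the hyperparameters are illustrative assumptions, X_train and y_train are assumed to be a training split of the selected features, and the concatenation step is approximated here by a single hard vote over all six base learners:

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

# Base classifiers of Hybrid Ensemble 1: SVM (linear kernel), Decision Tree, KNN.
ensemble_1 = [("svm", SVC(kernel="linear")),
              ("dt", DecisionTreeClassifier()),
              ("knn", KNeighborsClassifier(n_neighbors=5))]

# Base classifiers of Hybrid Ensemble 2: Logistic Regression, AdaBoost, Naive Bayes.
ensemble_2 = [("lr", LogisticRegression(max_iter=1000)),
              ("ada", AdaBoostClassifier()),
              ("nb", GaussianNB())]

# Final step: a majority vote over the concatenated set of base
# classifiers from both hybrid ensembles.
final_model = VotingClassifier(estimators=ensemble_1 + ensemble_2,
                               voting="hard")
final_model.fit(X_train, y_train)
print(final_model.predict(X_test[:5]))
```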
The proposed hybrid ensemble classifier can be represented using Equation (3):

$$H(x) = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\} \qquad (3)$$

where $H(x)$ represents the decision function of the ensemble voting classifier, $\operatorname{mode}$ returns the most frequent class among the predictions, and $h_i(x)$ represents the class predicted by the $i$-th base classifier for the input feature vector $x$.
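As a concrete illustration of Equation (3), a majority vote over three hypothetical base predictions can be computed as follows:

```python
from collections import Counter

# Hypothetical predictions of three base classifiers for one patient.
predictions = [1, 0, 1]  # 1 = disease, 0 = no disease

# mode(...) returns the most frequent class among the predictions.
majority_class = Counter(predictions).most_common(1)[0][0]
print(majority_class)  # -> 1
```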
For the first ensemble classifier, $h_1(x)$ represents the Support Vector Machine, $h_2(x)$ is the Decision Tree classifier, and $h_3(x)$ is the K-Nearest Neighbor classifier.
The decision function of the Support Vector Machine ($h_1$) can be expressed as follows:

$$h_1(x) = \operatorname{sign}\left(\sum_{i} \alpha_i\, y_i\, K(x_i, x) + b\right)$$

where $h_1(x)$ is the decision function, $\operatorname{sign}$ is the sign function returning $+1$ for positive values and $-1$ for negative values, $\sum_i$ represents the summation over all support vectors, $\alpha_i$ are the Lagrange multipliers (coefficients obtained during training), $y_i$ is the class label of the $i$-th support vector ($+1$ or $-1$), $K(x_i, x)$ is the kernel function calculating the similarity between the $i$-th support vector $x_i$ and the input feature vector $x$, and $b$ is the bias term.
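A minimal NumPy sketch of this decision function, using a linear kernel (as chosen for Hybrid Ensemble 1) and toy values for the support vectors, multipliers, and bias:

```python
import numpy as np

def linear_kernel(x_i, x):
    """K(x_i, x): similarity between a support vector and the input (dot product)."""
    return np.dot(x_i, x)

def svm_decision(support_vectors, alphas, labels, b, x):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    s = sum(a * y * linear_kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)

# Toy values: two support vectors with labels +1 and -1 (illustrative only).
support_vectors = [np.array([1.0, 2.0]), np.array([3.0, 1.0])]
alphas, labels, b = [0.8, 0.6], [+1, -1], -0.1
print(svm_decision(support_vectors, alphas, labels, b, np.array([1.5, 1.8])))  # -> 1.0
```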
The Decision Tree classifier ($h_2$) for the first ensemble classifier can be expressed as follows:

$$h_2(x) = c_j \quad \text{if } r_j(x) \text{ is satisfied}$$

where $h_2(x)$ is the decision function of the decision tree, $c_j$ are the class labels associated with the terminal nodes (leaves) of the decision tree, and $r_j$ are the decision conditions or rules based on the input feature vector $x$ that guide the traversal of the decision tree.
The K-Nearest Neighbor classifier ($h_3$) for the first ensemble classifier can be represented as follows:

$$h_3(x) = \operatorname{mode}\{y_1, y_2, \ldots, y_k\}$$

where $h_3(x)$ is the decision function of the KNN classifier, $\operatorname{mode}$ returns the most frequent class among the $k$ nearest neighbors, and $y_i$ represents the class label of the $i$-th nearest neighbor to the new data point $x$.
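A minimal sketch of this rule; the toy training points and the choice of Euclidean distance are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the most frequent class among the k nearest neighbors of x."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(distances)[:k]               # indices of the k nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[55, 150], [60, 120], [42, 172], [48, 160]])  # e.g., [age, thalach]
y_train = np.array([1, 1, 0, 0])
print(knn_predict(X_train, y_train, np.array([50, 155])))  # -> 0
```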
For the second ensemble classifier, $h_4(x)$ represents Logistic Regression, $h_5(x)$ is the AdaBoost classifier, and $h_6(x)$ is the Naive Bayes classifier.
The decision function of Logistic Regression ($h_4$) can be described as:

$$h_4(x) = \sigma(w \cdot x + b)$$

where $h_4(x)$ is the decision function of the Logistic Regression classifier, $\sigma$ is the logistic (sigmoid) function, $w$ is the weight vector, $x$ is the input feature vector, $\cdot$ denotes the dot product, and $b$ is the bias term.
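A minimal sketch of this function with toy weights and bias; the 0.5 decision threshold is an assumption:

```python
import numpy as np

def logistic_decision(w, b, x):
    """sigma(w · x + b): probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w = np.array([0.4, -0.2, 0.7])  # toy weight vector
b = -0.5                        # toy bias term
x = np.array([1.0, 2.0, 0.5])   # input feature vector
p = logistic_decision(w, b, x)
print(p, int(p >= 0.5))         # probability and thresholded class label
```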
The decision function of the AdaBoost classifier ($h_5$) can be expressed as:

$$h_5(x) = \operatorname{sign}\left(\sum_{i} \alpha_i\, g_i(x)\right)$$

where $h_5(x)$ is the decision function of the AdaBoost classifier, $\sum_i$ represents the summation over all weak classifiers, $\alpha_i$ are the weights assigned to each weak classifier, and $g_i(x)$ represents the prediction of the $i$-th weak classifier for the input feature vector $x$.
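A minimal sketch of this weighted vote, with hypothetical weak-classifier outputs in {−1, +1}:

```python
import numpy as np

def adaboost_decision(weights, weak_predictions):
    """sign( sum_i alpha_i * g_i(x) ) over the weak classifiers."""
    return np.sign(np.dot(weights, weak_predictions))

alphas = np.array([0.9, 0.5, 0.3])            # weights learned during boosting
weak_preds = np.array([+1, -1, +1])           # hypothetical weak-classifier votes
print(adaboost_decision(alphas, weak_preds))  # 0.9 - 0.5 + 0.3 > 0 -> 1.0
```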
The decision function of the Naive Bayes classifier ($h_6$) can be defined as:

$$h_6(x) = \arg\max_{c}\; P(c) \prod_{j} P(x_j \mid c)$$

where $h_6(x)$ is the decision function of the Naive Bayes classifier, $\arg\max_c$ returns the class $c$ that maximizes the expression, $P(c)$ is the prior probability of class $c$, $P(x_j \mid c)$ is the conditional probability of feature $x_j$ given class $c$, and $\prod$ represents the product operator, which calculates the product of the conditional probabilities over all features.
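A minimal sketch of this argmax, using hypothetical priors and discrete likelihood tables for two features:

```python
import numpy as np

# Hypothetical priors and per-feature conditional probability tables.
priors = {0: 0.55, 1: 0.45}
likelihoods = {
    0: [{"low": 0.7, "high": 0.3},   # P(feature_1 | class 0)
        {"no": 0.8, "yes": 0.2}],    # P(feature_2 | class 0)
    1: [{"low": 0.3, "high": 0.7},   # P(feature_1 | class 1)
        {"no": 0.4, "yes": 0.6}],    # P(feature_2 | class 1)
}

def naive_bayes_decision(x):
    """argmax_c P(c) * prod_j P(x_j | c)."""
    scores = {c: priors[c] * np.prod([tbl[v] for tbl, v in zip(likelihoods[c], x)])
              for c in priors}
    return max(scores, key=scores.get)

print(naive_bayes_decision(["high", "yes"]))  # 0.189 > 0.033 -> 1
```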
In the final step, these two ensemble classifiers are merged to create the concatenated ensemble classifier:

$$C(x) = F\big(E_1(x),\, E_2(x)\big)$$

where $C(x)$ is the decision function of the concatenated classifier, $E_1(x)$ and $E_2(x)$ represent the individual predictions of the ensemble classifiers for the input feature vector $x$, and $F$ is the final classifier that takes the concatenated predictions as input and makes the final prediction. In the proposed model, $E_1$ represents the first ensemble classifier, and $E_2$ represents the second ensemble classifier.