1. Introduction
Heart diseases, also known as cardiovascular diseases (CVDs), are the leading cause of mortality worldwide. In 2019, approximately 18 million deaths were attributed to CVDs, accounting for nearly one-third (32%) of all deaths globally [
1]. Diagnosing heart disease in its early stages can save many lives. Advancements in technology, especially artificial intelligence (AI), have transformed many domains, including healthcare, where AI assists human experts in diagnosis, treatment, and medical care [
2]. In heart disease diagnostics or cardiology, electrocardiograms (ECGs) play a pivotal role in providing necessary information about heart conditions. The manual analysis of ECGs is not only time-consuming but also requires a high level of human expertise that only comes after years of experience. Thus, an accurate mechanism for the automated analysis of ECGs is crucial for improving and expanding quality healthcare services for patients with CVDs.
An ECG is the main source of information about heart rhythm and helps medical experts or cardiologists quickly screen patients for heart disease. An ECG represents the electrical activity of the heart as a waveform recorded through electrodes placed on different parts of the body (arms, chest, and legs). A specific combination of these electrodes is known as a lead. The simplest lead is formed by two electrodes, one acting as the positive and the other as the negative pole, although in general two or three electrodes in combination form a lead. The most common standard is the 12-lead ECG, which is recorded using 10 electrodes. The placement of the electrodes on the body for a 12-lead ECG is shown in
Figure 1, and the output of a 12-lead ECG of a healthy person is shown in
Figure 2 on graph paper. A heartbeat, also called a cardiac cycle, has different components in an ECG plot [
3], as shown in
Figure 3. In a complete heartbeat, the P wave represents atrial depolarization, the electrical activation that triggers atrial contraction; the QRS complex represents ventricular depolarization; and the T wave represents ventricular repolarization, the recovery (relaxation) phase of the ventricles. These heartbeat components provide valuable information about the heart’s condition and functioning, assisting healthcare providers in diagnosing various cardiac conditions, such as arrhythmias and cardiac diseases [
4,
5,
6].
ECG classification is an important task in cardiology and biomedical research [
10]. It involves classifying an ECG recording, or individual heartbeats, into different heart conditions based on the morphology of the heartbeats from one or multiple leads. The simplest classification task performed on an ECG is to classify its output into binary classes, i.e., normal vs. abnormal [
11]. However, there are multiple arrhythmia conditions and heart diseases that need to be individually identified for accurate diagnosis [
3]. ECG classification plays a significant role in arrhythmia detection, assessing the risk of heart diseases, and analyzing fitness and performance in sports [
12]. The main focus of this work, however, is arrhythmia detection and the classification of heartbeats into different cardiac conditions. Many ECG datasets have been created for research and development purposes, and these have enabled the exploration of various machine learning (ML) models for ECG classification.
Machine learning (ML) has made significant contributions to healthcare and medical diagnostics, for example in cancer detection methods that use image segmentation of CT scans, MRIs, and X-rays [
2]. Similarly, in the classification of cardiac abnormalities, ML has shown promising results [
3,
13,
14]. Early applications of machine learning for ECG classification employed various traditional machine learning techniques, including support vector machines (SVM) [
15,
16] and k-nearest neighbors (kNN) [
17]. As the field of AI has developed, more advanced deep learning-based techniques, such as convolutional neural networks (CNNs) [
3], long short-term memory (LSTM) networks [
18], and transformer neural networks [
19,
20] have been explored for the analysis of ECG signals. The classification of ECG data has many important applications, including but not limited to diagnosis in the absence of a cardiologist, the automatic verification of ECG reports, and support for medical education and training [
21].
In this work, ECG recordings from a dataset collected from various healthcare units in Pakistan are utilized for the classification of different arrhythmias and cardiac diseases. The dataset, known as the CPEIC cardiac dataset [
22], contains images of ECG records, which, in this work, are converted to numerical form using an open-source digitization tool [
23] by segmenting the Lead II heartbeats. The details of the digitization tool, known as
ecg_digitize, are discussed in
Section 3.2. The digitized signals are then pre-processed and fed to the deep learning models for the classification of four cardiac classes. Deep learning models are capable of learning complex patterns in data; since ECG signals contain intricate patterns, deep learning can capture them with state-of-the-art performance. Deep learning models also generalize well across different patients’ data, thus providing accurate predictions on new records. Novel deep learning architectures are developed, which include a proposed CNN, an LSTM, and an SSL-based method using autoencoders. Training the models on digitized signals allows them to be applied directly to the ECG signals generated by an ECG machine through its electrodes/sensors, enabling direct monitoring of the ECG. The work presented thus contributes to an important application of deep learning to sensor data. The main contributions of this work are:
The use of a digitized time-series version of the CPEIC cardiac dataset, which has not previously been used in this form for ECG classification.
The fast inference and classification of ECGs in real time using only heartbeat data from lead II.
The proposed models can be deployed on an ECG machine for the near-real-time monitoring and analysis of the ECG signals.
The proposed CNN model achieves competitive results on the CPEIC dataset with a high accuracy of ∼92% on four classes.
The paper is organized as follows.
Section 2 summarizes the latest techniques in the literature for ECG classification tasks.
Section 3 explains the proposed methodology, along with the datasets and pre-processing in detail. The results are discussed in
Section 4, and the paper is concluded with future recommendations in
Section 5.
2. Related Work
Many machine learning and deep learning techniques have been reported in the literature for classifying ECG data or heartbeats into different cardiac arrhythmia classes. The machine learning methods include a decision tree classifier, random forest, and k-nearest neighbors. The deep learning methods can be classified into convolutional neural networks, sequence models, attention mechanisms, zero-shot learning, and self-supervised learning methods.
2.1. Machine Learning Approaches
Among traditional machine learning approaches, decision tree classifiers have been used in many studies. An optimized decision tree classification model was proposed by Kumari et al. [
24] to classify six different classes of heartbeats, where one class was normal and the other five represented different arrhythmias. After pre-processing the data, the researchers extracted 17 morphological and 7 temporal features to feed to the classification model. An optimized decision tree classifier with an adaptive boosting mechanism was then used for classification. The decision tree was optimized to handle imprecise, uncertain, and incomplete data. The authors reported a best accuracy of 98.77% and a 93.85% F1 score on the MIT-BIH arrhythmia dataset with six selected classes.
Bhattacharyya et al. [
25] proposed an ensemble of a support vector machine (SVM) and random forest for heartbeat classification, which were combined using a weighted majority algorithm. Five classes from the MIT-BIH arrhythmia dataset were used in this study, including one normal class and four abnormal classes. A time series feature extraction library (TSFEL) [
26] was used to extract 61 features from the data. Feature scaling was applied, and feature selection techniques were also employed, such as the removal of low-variance and highly correlated features and recursive feature elimination. The synthetic minority oversampling technique (SMOTE) [
27] was used for oversampling to handle class imbalances. The proposed model demonstrated an accuracy of 98.21% and an F1 score of 96.4% on the five classes from the MIT-BIH arrhythmia dataset, which were normal (N), ventricular ectopic beat (VEB), supra-ventricular ectopic beat (SVEB), unclassifiable beat (Q), and fusion of a ventricular and normal beat (F).
Zou et al. [
28] proposed a novel method for feature extraction for improved heartbeat classification. A feature called
segment label was learned from a convolutional neural network, which was then combined with other traditional features and input to a random forest classifier. The model was trained on the MIT-BIH arrhythmia dataset for classifying three classes of heartbeats: N, VEB, and SVEB. For these three classes, this approach achieved an accuracy of 96% and an F1 score of 88.3%. This study showed the contribution of
segment label in accurately classifying heartbeats, particularly in cases where rhythm information serves as essential contextual information.
Ahmed et al. [
29] proposed an ensemble of different machine learning models, including a kNN, decision tree, artificial neural network (ANN), support vector machine (SVM), and LSTM. The hard voting method was used for ensembling. Class imbalance was handled using class weights, which place greater emphasis on the minority class by assigning it higher weights while allocating lower weights to the majority class. The performance of the ensemble was measured on two datasets: the Physikalisch-Technische Bundesanstalt (PTB) Diagnostic ECG dataset and the MIT-BIH arrhythmia dataset, with two and five classes, respectively. The proposed approach demonstrated accuracies of 98.06% and 97.664% on the MIT-BIH and PTB datasets, respectively.
A summary of different machine learning approaches is shown in
Table 1.
2.2. Deep Learning Approaches
Deep learning approaches perform an end-to-end classification of ECGs. This is in contrast to traditional machine learning approaches, where feature extraction must be carried out before training the model [
30]. Cutting-edge deep learning techniques for ECG classification are discussed below to gain insight into the ongoing research and development in the field.
CNNs are among the most important and widely used deep learning models. CNN-based models contain many layers that transform their inputs into valuable representations with convolution filters. They are particularly effective in capturing local patterns and spatial dependencies within the data [
31]. CNNs have the capacity for hierarchical representation learning, capturing complex patterns from the input data [
32]. CNNs have been recently used in the classification of ECG signals [
31], giving improved performance.
Recently, Qureshi et al. [
3] proposed a deep learning approach using CNNs to classify multiple classes of heartbeats and achieved robustness by incorporating dropout and early stopping. The issue of class imbalance was addressed through the utilization of SMOTE. The model outperformed baseline models, achieving approximately 96% accuracy and very fast inference of less than one second. The proposed method demonstrated generality by classifying ten classes accurately.
A two-phase approach was proposed by Bruoth et al. [
21] for ECG classification on the PhysioNet 2021 challenge dataset. In phase 1, a base model was trained using data from different sources. The 1D version of ResNet-50 was used as the base model. In phase 2, the base model was fine-tuned on the challenge metrics and conditions. The authors also modified the labels in this study into three labels and used a flow-mixup layer to add random convex combinations to the input data. They used different lead configurations, including 2 leads, 3 leads, 4 leads, 6 leads, and 12 leads, and applied various techniques, such as data augmentation and wavelet transformation, to pre-process the ECG signals. In the training phase, 10-fold cross-validation was used, and the best model was selected based on the micro-F2 score obtained on the validation set. The final model was an ensemble of 10 neural networks. During the training, the authors employed the AdamW optimizer with a weight decay rate of 0.0005, utilizing a batch size of 128. The proposed method was computationally efficient and achieved 52% on the challenge metric, which is a generalized form of accuracy. The authors claimed that there is no advantage of using data sampled at 500 Hz compared to 100 Hz and no benefit of extending the input size beyond 3 s.
Ahmed et al. [
33] proposed a 1D CNN for the classification of cardiac arrhythmias. The proposed method incorporated noise into the MIT-BIH arrhythmia dataset during training to make the model perform better on noise-attenuated signals. Lead II heartbeats were extracted to train the model to classify them into four arrhythmia classes. The model achieved outstanding results, with the authors reporting an accuracy of 99% on the test set. The recall and specificity were 94% and 99%, respectively.
An ensemble approach was presented by Mahmud et al. [
34] that utilized a 1D CNN on ECG signals and a 2D CNN on ECG images using transfer learning. The predictions were combined using an ensemble technique, providing commendable results. On the ECG signals, the proposed approach achieved 94% accuracy, and on the ECG images, it achieved 93% accuracy on 5 classes. Data augmentation was employed on the images to improve the results.
A well-known sequence model is the long short-term memory (LSTM) network, a type of recurrent neural network (RNN) designed to handle patterns in sequential and time series data. LSTM networks consist of memory cells that can store and retrieve information over extended sequences, enabling them to retain important information for later use [
35]. Karri et al. [
36] proposed an LSTM network for ECG classification on the MIT-BIH arrhythmia database. The features were extracted and hybridized to feed to the LSTM network for classification. A method of detecting the QRS complex was devised using discrete wavelet transform and delta sigma modulation. The proposed method showed an accuracy of 99.64% and an F1 score of 98.18% on five classes from the MIT-BIH dataset.
An LSTM-based ECG classification model was used on compressed features of ECG signals by Yildirim et al. [
37]. The ECG signals were passed through a convolutional autoencoder to obtain a compressed representation of the data in order to reduce the computation time during classification. The features extracted using the encoder part of the autoencoder were then fed to the LSTM network for classification. The LSTM network trained on the encoded features demonstrated high accuracy on five classes of the MIT-BIH dataset. The accuracy and F1 score both were ∼99%.
Transformer networks extensively use the attention mechanism as a fundamental component of their architecture. The attention mechanism in the transformer is called self-attention or scaled dot-product attention [
38]. A transformer neural network was developed for the classification of 12-lead ECG signals in [
20]. The model classified the ECG signals from the PhysioNet-2020 challenge into 27 classes using 12-lead data. Hand-crafted features using random forests were combined with a transformer network’s discriminative features. The model was trained using 10-fold nested cross-validation. The model performed best on the challenge metric and achieved a score of 0.53 on the test data. The challenge metric is a generalized form of accuracy. During pre-processing, ECG records were upsampled or downsampled to obtain a fixed sampling rate of 500 Hz for consistency. A finite impulse response bandpass filter with a bandwidth of 3–45 Hz was also applied to remove any unwanted noise or artifacts from the ECG signal. Normalization was carried out to ensure each record had zero mean and unit variance.
Yamaç et al. [
39] developed a personalized arrhythmia detection system using one-shot learning for wearable devices. The proposed method used null space analysis on the normal heartbeats, which made it computationally efficient. It also used domain adaptation using sparse representation to map existing users’ information onto new users, enabling the classification of normal vs. abnormal heartbeats without the need for abnormal heartbeats from new users. Combining the null space analysis and domain adaptation, the authors proposed an ensemble classifier that yielded 98.2% accuracy and an F1 score of 92.8% on the MIT-BIH dataset in normal vs. abnormal scenarios.
Phan et al. [
14] proposed an SSL-based multi-modality method for ECG classification that used self-distillation without labels. There were two tasks in this method: a pre-stream task and a downstream task. In the pre-stream task, a 1D CNN for time series data and a 2D CNN for time–frequency spectral signals were used and trained on unlabeled data. In the downstream task, the 1D and 2D CNNs were fine-tuned on labeled data. The features of each CNN were fused using a proposed gated fusion mechanism. The proposed method demonstrated an accuracy of 48.9% and an F1 score of 62.1 on the test set of the PhysioNet 2020 challenge dataset. A summary of all the deep learning-based models that are reviewed in this study is presented in
Table 2.
Our main focus in this study is to train machine learning and deep learning models on a digitized form of the CPEIC cardiac dataset. All of the existing approaches implemented on this dataset apply models directly to the images, which prevents the use of techniques developed for time series datasets such as MIT-BIH. Moreover, the CPEIC cardiac dataset contains several important cardiac arrhythmia classes but is available only in image format, so converting it to a time series format yields a numeric dataset covering those cardiac disease classes. In the following section, the proposed approach for classifying cardiac diseases on the digitized form of the CPEIC cardiac dataset using both machine learning and deep learning techniques is discussed in detail.
3. Proposed Methodology
The accurate and timely classification of cardiac abnormalities and arrhythmias is a crucial first step in cardiac healthcare. Accurately classifying these abnormalities using machine learning is challenging, and becomes even more so when the dataset is small [
44]. In this work, multiple architectures using novel deep learning techniques are proposed for the classification of cardiac diseases on the basis of heartbeat data extracted from ECG records. The overall pipeline of the proposed methodology is shown in
Figure 4. This section discusses in detail the major building blocks of the pipeline, which include:
The datasets used for ECG classification;
The digitization tool used to digitize the ECG images;
The pre-processing method used to transform the data into a standardized form;
The architecture and the models used to obtain accurate results on the CPEIC cardiac dataset;
Lastly, the classes that the models are trained to classify.
3.1. Datasets
3.1.1. MIT-BIH Arrhythmia Database
The MIT-BIH arrhythmia database [
45] is a benchmark dataset in ECG classification that was released in 1980 by a collaboration of the laboratories at the Massachusetts Institute of Technology (MIT) and the Beth Israel Hospital (BIH) in Boston. The dataset was collected from 47 subjects. The dataset comprises 48 ECG recordings, each spanning half an hour. The recordings were obtained at a sampling rate of 360 Hz, capturing data with 11-bit resolution within a range of 10 mV. There are 19 beat annotations that represent different classes of heartbeats. The labels and descriptions of these heartbeat types are shown in
Table 3.
In this work, the MIT-BIH dataset is used only to augment the CPEIC cardiac dataset in the SSL-based approach. Lead II heartbeats from the MIT-BIH dataset are used without labels to train the pre-stream task in our SSL-based approach. The details are discussed in
Section 3.5.3.
3.1.2. ECG Image Dataset of Cardiac Patients
This dataset [
22] contains ECG images collected from various healthcare institutes in Pakistan under the supervision of the Chaudhry Pervaiz Elahi Institute of Cardiology (CPEIC), Multan. The ECG images are 12-lead ECG records, which are recorded at a sampling rate of 500 Hz. Each ECG record belongs to a different patient, which helps to increase the deep learning models’ generalization. The dataset is divided into four different classes, of which one is the normal class and the other three are related to cardiac diseases. The dataset is open-source and publicly available (
https://data.mendeley.com/datasets/gwbz3fsgp8/2, accessed on 1 August 2023). A summary of the dataset is listed in
Table 4.
In this work, instead of directly using images for classification, ECG images were first converted to numerical time-series format using an open-source digitization tool. The reason for using digitized data instead of images and the details of the digitization process are discussed in the following section.
3.2. Digitization
ECG images can be used as they are for training a classification model [
40,
41,
42], but these models cannot be used directly on ECG machines for real-time monitoring and prediction. The temporal correlation of ECG signals is also lost in image-based models, and these models require a larger number of parameters to train, thus increasing their computational complexity. Therefore, the ECG images were converted to numerical/time series signals by leveraging open-source ECG digitization tools.
Table 5 compares the benefits and limitations of using ECG images and time series signals for ECG classification.
For the digitization of the ECG images from the CPEIC cardiac dataset, an open-source tool,
ecg_digitize (
https://github.com/Tereshchenkolab/ecg-digitize, accessed on 1 August 2023), was used [
23]. On the backend,
ecg_digitize employs image processing techniques to extract signals from images. It first detects and extracts the grid from the ECG image, and then the ECG signal is detected and extracted. In this tool, one can select the region that is to be digitized. For this study, only lead II was used, with one heartbeat selected from each image, as the heartbeats within a record are consistent. The digitized signals were saved as CSV files. This tool was chosen because of its high correlation of 0.977 and its accurate mapping of signals from images. Because the region is selected manually, it performs better than other open-source tools, which work end-to-end and can produce undesirable noise and incorrect signals [
46]. For every image, one CSV file was saved for the digitized heartbeat of lead II, and this yielded 928 CSV files at the end of the digitization process.
The user interface of ecg_digitize is shown in
Figure 5. The digitizing of ECG images using the ecg_digitize tool involved the following steps:
In the first step, the image was opened in the digitization app, and lead II was selected from the option “leads” from the top left of the panel. Next, the bounding box was placed on the lead II heartbeat. A complete heartbeat spanned 0.25 s before the R-peak to 0.4 s after the R-peak [
3]. The size of the bounding box could be adjusted based on the start and end of the heartbeat (a sketch of how this time window maps to sample indices is given after these steps).
The second step comprised the selection of the time scale and voltage scale of the ECG plot. The image could be adjusted if the ECG plot was rotated or tilted.
The processed data were shown in a small window, and the digitized data were then saved in the CSV file format.
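In the tool, the heartbeat window is selected manually with the bounding box. Purely as an illustration of the window described in the first step, the following sketch converts the 0.25 s/0.4 s interval around the R-peak into sample indices for a signal sampled at 500 Hz; the signal array and R-peak index are hypothetical placeholders, not part of the tool itself.

```python
import numpy as np

SAMPLING_RATE_HZ = 500          # CPEIC ECG records are sampled at 500 Hz
PRE_R_SEC, POST_R_SEC = 0.25, 0.40

def extract_heartbeat(signal: np.ndarray, r_peak_index: int) -> np.ndarray:
    """Cut one heartbeat spanning 0.25 s before to 0.4 s after the R-peak."""
    start = max(0, r_peak_index - int(PRE_R_SEC * SAMPLING_RATE_HZ))
    end = min(len(signal), r_peak_index + int(POST_R_SEC * SAMPLING_RATE_HZ))
    return signal[start:end]

lead_ii = np.random.rand(5000)                 # placeholder for a digitized lead II trace
beat = extract_heartbeat(lead_ii, r_peak_index=2500)
print(beat.shape)                              # roughly 325 samples at 500 Hz
```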
Figure 6 shows a few examples of original ECG signals and signals recreated from the digitized lead II heartbeat data. The digitized data clearly reproduce the original heartbeats with high fidelity.
3.3. Pre-Processing
Pre-processing is a crucial step in the ECG classification workflow. The signals may be of different sampling rates, can have different amplitudes, and may contain noise. The digitization step produced 928 records in the CSV file format, and these digitized ECG signals were pre-processed to make the data consistent, concise, and suitable for training the proposed models. The following pre-processing steps were performed on the CPEIC cardiac dataset.
3.3.1. Interpolation
The digitized signals were of different lengths because of the different shapes of heartbeats and minor differences in the sizes of the selected regions. To give all the digitized signals a feature vector of the same length, linear interpolation was applied.
Following the same feature length of the segmented heartbeats as in [
3], the digitized heartbeat signals were interpolated to a fixed length of 234 samples, so that every heartbeat forms a feature vector of equal length.
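A minimal sketch of this resampling step, assuming each digitized heartbeat is a one-dimensional NumPy array read from its CSV file (function and variable names are illustrative):

```python
import numpy as np

TARGET_LENGTH = 234  # fixed feature-vector length used in this work

def resample_heartbeat(beat: np.ndarray, target_length: int = TARGET_LENGTH) -> np.ndarray:
    """Linearly interpolate a digitized heartbeat to a fixed number of samples."""
    original_positions = np.linspace(0.0, 1.0, num=len(beat))
    target_positions = np.linspace(0.0, 1.0, num=target_length)
    return np.interp(target_positions, original_positions, beat)

# Example: a heartbeat digitized to 310 points becomes a 234-dimensional vector.
beat = np.random.rand(310)
print(resample_heartbeat(beat).shape)  # (234,)
```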
3.3.2. Noise Removal
Some of the digitized signals contained noise or jitter at certain points, possibly due to blurry images or noisy pixels. A Savitzky–Golay smoothing filter [
47] was applied to make the signals smoother. There are many other methods for data smoothing, such as Gaussian smoothing and exponential smoothing, but the Savitzky–Golay filter performs local polynomial regression on the data, allowing it to capture more complex trends and patterns. It also preserves the shape of the signal better than other smoothing operations, especially when the data contain sharp edges or sudden changes.
Figure 7 shows the comparison of a digitized heartbeat before and after applying the smoothing filter.
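A sketch of this smoothing step using SciPy’s Savitzky–Golay filter; the window length and polynomial order shown here are illustrative choices, since the exact filter parameters are not specified above:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_heartbeat(beat: np.ndarray, window_length: int = 11, polyorder: int = 3) -> np.ndarray:
    """Smooth a digitized heartbeat with a Savitzky-Golay filter.

    The filter fits a local polynomial of degree `polyorder` within a sliding
    window of `window_length` samples, suppressing jitter while preserving
    sharp features such as the QRS complex.
    """
    return savgol_filter(beat, window_length=window_length, polyorder=polyorder)

smoothed = smooth_heartbeat(np.random.rand(234))
```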
3.3.3. Normalization/Standardization
The amplitude of the digitized ECG heartbeats can be on an arbitrary scale. Therefore, it was necessary to scale the heartbeats to a standard scale so that the model learns the features correctly. For this purpose, the beats could either be normalized between 0 and 1 or standardized. The formula for 0–1 normalization is given below:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_{\min}$ is the minimum value and $x_{\max}$ is the maximum value. Standard scaling was used in this paper, so the amplitudes of the heartbeats were centered around the mean and scaled by the standard deviation of the data samples:

$$x_{\text{std}} = \frac{x - \mu}{\sigma},$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the data.
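The two scaling options can be sketched as follows; applying the statistics per heartbeat (rather than over the whole training set) is an illustrative assumption, as the text above does not specify which was used:

```python
import numpy as np

def normalize_01(beat: np.ndarray) -> np.ndarray:
    """Min-max normalization of a heartbeat to the [0, 1] range."""
    return (beat - beat.min()) / (beat.max() - beat.min())

def standardize(beat: np.ndarray) -> np.ndarray:
    """Standard scaling: zero mean and unit standard deviation."""
    return (beat - beat.mean()) / beat.std()

beat = np.random.rand(234)
print(standardize(beat).mean(), standardize(beat).std())  # approximately 0.0 and 1.0
```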
3.3.4. Data Augmentation/Oversampling
Since one heartbeat was extracted per image, we only had 928 samples in the data. This number was too small to train an accurate and generalized deep learning model. Moreover, there were fewer data samples for some classes than others, which could hamper model accuracy. The issue of class imbalance was addressed by employing SMOTE, which increases the samples of the minority classes using synthetic oversampling. This increased the number of samples in the minority class so that each class had an equal number of samples.
In the present work, we utilized SMOTE to increase all training data instead of only the minority class. Since the CPEIC cardiac dataset has one normal class and the other three classes belong to abnormal classes, we divided the data into binary classes. Making the data binary increased the samples in the abnormal class because it contained the data of all three abnormal classes. This made the normal class a minority, so SMOTE was applied to balance the dataset by synthetically oversampling the normal class examples. Then, separating the abnormal class into its individual classes turned the normal class into a majority class, and the individual abnormal classes became a minority. Applying SMOTE again to oversample the minority classes resulted in a larger dataset. This dataset was then used for training our deep-learning models. Since data augmentation was applied on the training set only, the test set did not have any synthetic data. The test set contained samples from the original dataset separate from the training set; therefore, the data augmentation process did not affect the model’s generalization.
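The oversampling step can be sketched with the imbalanced-learn implementation of SMOTE. This simplified version applies SMOTE once to the four-class training labels rather than reproducing the two-stage binary-then-multiclass procedure described above, so it should be read as an illustration only; the toy arrays stand in for the real training split.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# X_train: (n_samples, 234) digitized heartbeats; y_train: integer class labels (0-3).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((600, 234))
y_train = rng.integers(0, 4, size=600)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# After resampling, every class holds as many samples as the largest class.
print(np.bincount(y_resampled))
```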
In the following subsections, we evaluate popular machine learning models on the CPEIC cardiac dataset and then discuss the proposed deep learning models.
3.4. Machine Learning Methods
Numerous machine learning models were used for training on the CPEIC cardiac dataset, including the random forest classifier [
48], k-nearest neighbors (kNN) classifier [
49], and decision tree classifier [
50]. To train all these models, Python’s library
was used, which provides ready-to-use models that only require their hyper-parameters to be set.
The decision tree is a well-known machine learning technique for classification. It works by recursively splitting the feature space based on a splitting criterion until a stopping criterion is met. In this work, the Gini index was used as the splitting criterion for the decision tree classifier. The accuracy and F1 score achieved on the CPEIC cardiac dataset using this method were 67.3% and 67.7%, respectively.
Random forest is an ensemble machine learning technique consisting of numerous decision trees. The individual predictions of the decision trees are aggregated or combined using an ensemble algorithm to obtain the final prediction, and each decision tree is trained on a random subset of the data. For the random forest classifier, the hyper-parameters were set as follows: the number of trees was set to 100, the maximum depth was also set to 100, and the splitting criterion used was the Gini index. The random forest classifier achieved an accuracy of 80.9% and an F1 score of 80.8% on the CPEIC cardiac dataset.
The k-nearest neighbors (kNN) classifier assigns an input data point a class label based on the labels of the k data points nearest to the query point, using a majority vote. The choice of k depends on the dataset and can be optimized by validation. Three nearest neighbors were used for classification on the CPEIC cardiac dataset; using two neighbors, or more than three, resulted in lower accuracy. With three nearest neighbors, the accuracy and F1 score were 79.3% and 79.5%, respectively.
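A sketch of these baselines with scikit-learn (assumed to be the library referenced above), using the hyper-parameters reported in this subsection; the train/test split and toy data are illustrative, and the F1 averaging scheme is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the 928 digitized 234-sample heartbeats and their 4 class labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((928, 234))
y = rng.integers(0, 4, size=928)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(criterion="gini"),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=100, criterion="gini"),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Weighted F1 averaging is an assumption; the paper does not state the scheme used.
    print(name, accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average="weighted"))
```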
3.5. Proposed Deep Learning Methods
Various deep learning architectures were implemented and evaluated on the CPEIC cardiac dataset. In signal processing, the convolution operation is used for feature extraction and signal analysis [
51]. Since ECG heartbeat signals are highly correlated along the time axis, applying convolution filters captures temporal correlations and extracts meaningful features for classification. One-dimensional convolutions are therefore used in all of the proposed DL methods, as elaborated in the following subsections.
3.5.1. Convolutional Neural Network (CNN)
The proposed CNN contained a stack of 1D convolutional layers and fully connected (FC) layers at the end. The overall architecture is shown in
Figure 8. The input to the model was a 234-dimensional vector that was passed through a 1D convolutional layer with Leaky ReLU as the activation function, whose slope coefficient $\alpha$ was 0.001. Leaky ReLU is expressed mathematically as:

$$f(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases}$$
To avoid overfitting and make the model more generalizable, batch normalization layers were used after the convolutional layers. To reduce the vector’s spatial dimension, one-dimensional max-pooling layers were used. A skip connection was applied as well, inspired by ResNet [
52]. For this purpose, the features from the two convolutional layers were concatenated. Following the flattening of the output from the convolutional layers to obtain a one-dimensional feature vector, a fully connected layer was added, and subsequently, a softmax layer was added with four heads for the classification of cardiac diseases.
For training, Adam [
53] was used as the optimizer function with a learning rate of 0.001. Adam merges the advantages of AdaGrad and RMSProp, enabling it to dynamically adapt the learning rate for individual parameters by considering the estimated first and second moments of the gradients. It is also computationally efficient and requires little memory. Sparse categorical cross-entropy loss was used to monitor the learning of the network. The proposed CNN was trained using a batch size of 16 for 100 epochs.
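A sketch of an architecture along these lines in Keras. The number of filters, kernel sizes, and layer counts are illustrative assumptions; only the overall structure (1D convolutions with Leaky ReLU, batch normalization, max pooling, a concatenation-based skip connection, and a four-way softmax head) is described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_length: int = 234, num_classes: int = 4) -> tf.keras.Model:
    inputs = layers.Input(shape=(input_length, 1))

    # First 1D convolutional block: Leaky ReLU (slope 0.001), batch norm, pooling.
    x1 = layers.Conv1D(32, kernel_size=5, padding="same")(inputs)
    x1 = layers.LeakyReLU(alpha=0.001)(x1)
    x1 = layers.BatchNormalization()(x1)
    x1 = layers.MaxPooling1D(pool_size=2)(x1)

    # Second convolutional block.
    x2 = layers.Conv1D(64, kernel_size=5, padding="same")(x1)
    x2 = layers.LeakyReLU(alpha=0.001)(x2)
    x2 = layers.BatchNormalization()(x2)

    # Skip connection: concatenate features from the two convolutional blocks.
    skip = layers.Concatenate()([x1, x2])
    skip = layers.MaxPooling1D(pool_size=2)(skip)

    x = layers.Flatten()(skip)
    x = layers.Dense(128)(x)
    x = layers.LeakyReLU(alpha=0.001)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_cnn()
# model.fit(X_train[..., None], y_train, batch_size=16, epochs=100, validation_split=0.1)
```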
3.5.2. Long Short-Term Memory (LSTM) Network
Long short-term memory (LSTM) networks [
54] are usually used for sequential data or data that have a dependency in the time domain. For example, in natural language processing (NLP), LSTM networks are used for text generation [
55], translation [
56], etc. In time series data, such as stock prices and weather prediction, LSTMs can be used for analysis and forecasting [
57,
58]. Similarly, LSTMs are highly effective for the detection of anomalies in time series data [
59]. ECG data are also time-series data, so an LSTM network was trained to predict cardiac diseases based on heartbeats.
In the proposed LSTM network, the 234-dimensional input was passed to two LSTM layers, each containing 32 units. The activation function in each LSTM unit was the hyperbolic tangent (tanh). After the LSTM layers, two fully connected (dense) layers were added, with 256 and 512 units, respectively. Leaky ReLU with a slope coefficient of 0.001 was used as the activation function in the FC layers. The output layer contained four nodes with a softmax activation function for classification. To train the network, the Adam optimizer was used with a learning rate of 0.001, and the loss function was sparse categorical cross-entropy. The LSTM network was trained for 100 epochs with a batch size of 16. The proposed LSTM network is shown in
Figure 9.
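A sketch of the described LSTM network in Keras; treating each 234-sample heartbeat as a single-feature sequence of 234 time steps is an assumption about the input reshaping.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm(input_length: int = 234, num_classes: int = 4) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),          # each heartbeat as a 234-step sequence
        layers.LSTM(32, activation="tanh", return_sequences=True),
        layers.LSTM(32, activation="tanh"),
        layers.Dense(256),
        layers.LeakyReLU(alpha=0.001),
        layers.Dense(512),
        layers.LeakyReLU(alpha=0.001),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_lstm()
# model.fit(X_train[..., None], y_train, batch_size=16, epochs=100)
```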
3.5.3. Self-Supervised Learning Technique
Self-supervised learning techniques learn representations from data without requiring labels. These techniques are helpful when a large amount of unlabeled data is available in addition to a small amount of labeled data [
60]. Since the CPEIC cardiac dataset is small compared to benchmark datasets such as the MIT-BIH dataset, an SSL technique can be used to learn meaningful features from the MIT-BIH data without labels, and these features can then be used in a labeled classification model on our smaller dataset. The self-supervised technique contained two tasks: a pre-stream task and a downstream task. In the pre-stream task, the SSL model was trained on data without labels so that it learned representations from the data. Once the features were extracted from the pre-stream task, a classifier was trained with labels on the learned features. The overall flow of our SSL method is shown in
Figure 10.
For the pre-stream task, an autoencoder [
61,
62] was used. An autoencoder is a self-supervised technique that learns efficient representations by training a neural network for the reconstruction of the original input from a compressed representation. An encoder and a decoder are the two components of an autoencoder.
The encoder takes the input and learns a compressed representation through a series of layers whose dimensions decrease at each step. The final layer of the encoder is called the bottleneck, and it captures the most important features of the data [
63].
The proposed encoder consisted of two dense layers; the first layer had 468 neurons, and the second had 234 neurons. Subsequent to each layer, a batch normalization layer was incorporated. The activation function employed after each layer was Leaky ReLU, except for the bottleneck layer, where it was linear. The bottleneck consisted of half of the neurons present in the input dimension, i.e., 117, as our input size was 234.
The learned representations from the bottleneck layer were used by the decoder to reconstruct the original input. The decoder also used a series of layers, but it was the mirror of the encoder layers in reverse. The final layer of the decoder needed to learn the representations that should match the input layer. Therefore, a reconstruction loss such as the mean squared error was used at the output layer [
62,
63].
The decoder of the proposed autoencoder was likewise made up of two dense layers, with 234 and 468 neurons, respectively. After each layer, a batch normalization layer was added, and Leaky ReLU was used as the activation function. The output layer had 234 neurons, since the aim was to reconstruct the input, and linear activation was used in this layer.
The autoencoder was trained on a combination of the MIT-BIH arrhythmia and CPEIC cardiac datasets without labels. Both datasets were concatenated to create an augmented dataset. These data were then divided into training and test sets. The model was trained for 10 epochs with the Adam optimizer. After training, we saved the trained encoder to extract the learned features for use in the next step.
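A sketch of the described dense autoencoder in Keras. The layer sizes, linear bottleneck and output activations, mean-squared-error loss, Adam optimizer, and 10 training epochs follow the text above; the Leaky ReLU slope and the batch size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

INPUT_DIM = 234

def dense_block(x, units):
    """Dense layer followed by batch normalization and Leaky ReLU."""
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.001)(x)

inputs = layers.Input(shape=(INPUT_DIM,))
x = dense_block(inputs, 468)
x = dense_block(x, 234)
bottleneck = layers.Dense(INPUT_DIM // 2, activation="linear", name="bottleneck")(x)  # 117 units

x = dense_block(bottleneck, 234)
x = dense_block(x, 468)
outputs = layers.Dense(INPUT_DIM, activation="linear")(x)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, bottleneck)   # saved after training to extract learned features

autoencoder.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
# X_unlabeled: concatenated MIT-BIH and CPEIC heartbeats without labels.
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10, batch_size=16)
```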
After training the autoencoder on the augmented dataset, a novel model was created for the classification of the encoded features. This model can be referred to as a ‘classifier’. The structure of the classifier is shown in
Figure 10 under the downstream task. It contained an FC layer with 128 neurons and ReLU activations. Attention layers were then added in the form of squeeze-and-excitation layers. The attention layer’s output and the preceding dense layer’s output were concatenated and passed through another dense layer with 128 neurons. After flattening, a final output layer with softmax activation and four neurons was added for classification on the CPEIC dataset.
First, encoded features were extracted from the saved encoder from the pre-stream task. The encoded features were then passed as input to the classifier. The classifier was trained for 75 epochs with a batch size of 16. The Adam optimizer with a learning rate of 0.001 was used for training, and categorical cross-entropy loss was used to monitor the classifier’s learning.
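A sketch of the downstream classifier in Keras, operating on the 117-dimensional encoded features. The internal sizing of the squeeze-and-excitation block and the exact wiring of the concatenation are assumptions, since only the overall structure is described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

ENCODED_DIM = 117  # bottleneck size of the pre-stream autoencoder
NUM_CLASSES = 4

inputs = layers.Input(shape=(ENCODED_DIM,))
dense = layers.Dense(128, activation="relu")(inputs)

# Squeeze-and-excitation style attention: a small bottleneck learns per-feature
# gates (0-1) that rescale the dense features.
se = layers.Dense(128 // 8, activation="relu")(dense)
se = layers.Dense(128, activation="sigmoid")(se)
attended = layers.Multiply()([dense, se])

x = layers.Concatenate()([dense, attended])
x = layers.Dense(128, activation="relu")(x)
x = layers.Flatten()(x)  # no-op for a flat feature vector, kept to mirror the description
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

classifier = models.Model(inputs, outputs)
classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",   # labels one-hot encoded
    metrics=["accuracy"],
)
# encoded_train = encoder.predict(X_train)                  # features from the saved encoder
# classifier.fit(encoded_train, y_train_onehot, epochs=75, batch_size=16)
```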
5. Conclusions and Future Directions
In this work, an efficient and accurate model is developed for the detection of cardiac diseases using digitized ECG signals from images. First, ECG images of cardiac diseases are digitized using an open-source digitization tool to obtain a numerical time-series format. Different deep learning models are proposed and trained on the CPEIC cardiac dataset, including a CNN, an LSTM, and an SSL-based model using autoencoders. Among the proposed models, the CNN performs best, giving an accuracy of ∼91.8% and a fast inference time of one-tenth of a second, making our model suitable for the direct and real-time monitoring and classification of ECG signals on an ECG machine. Summarizing the main contributions of this research, the time-series format of the CPEIC cardiac dataset is employed, deep learning models are proposed for the efficient and real-time prediction of cardiac diseases, and the proposed CNN model can be deployed on an ECG machine for direct monitoring with high accuracy.
The dataset used in this study for the classification of different cardiac diseases is small. Creating a generalizable and accurate classification model requires a large amount of data, as well as experimentation with different network configurations. For future work, the dataset can be enlarged by collecting more data and by exploring data augmentation techniques for time series. Additionally, all the heartbeats in each image could be digitized to obtain more data, instead of only one heartbeat from lead II. In the self-supervised technique, a dense autoencoder is employed, in which only dense layers are used. To extend the current work, convolutional autoencoders could be explored to compare autoencoder performance. Moreover, experiments can be performed on the SSL-based model using different bottleneck sizes in the encoder part of the autoencoder.