1 Introduction

Heart disease (HD) affects over 64 million individuals globally, and both its incidence and prevalence are consistently rising. HD patients often suffer a diminished quality of life and premature death. The global economic toll of HD has already surpassed the $100 billion mark. Given these circumstances, international guidelines strongly recommend interventions aimed at preventing the progression of HD. Early detection of these diseases is paramount for taking proactive measures to prevent major incidents (Tuli et al. 2020).

However, it is important to note that there exists significant variation in risk levels among individuals within the different stages of HD. Certain groups, while not explicitly included in the current staging system for HD, are still at a higher risk of developing symptomatic HD. To ensure that the intensity of prevention efforts aligns with an individual's absolute risk, it is crucial to accurately assess their future risk of HD. Fortunately, incident HD can be predicted using risk prediction models. This recommendation emphasizes the importance of personalized risk assessment to optimize prevention strategies and improve outcomes for patients with HD (Ayon et al. 2022).

Over the past decade, most cardiovascular prediction studies have employed cutting-edge technologies such as artificial intelligence (AI) (Revathi and Anjuaravind 2021), machine learning (ML) (Uddin and Halder 2021), and deep learning (DL) (Krishnan et al. 2021). ML methods have proven useful for early disease detection; in recent years, most cardiovascular prediction studies have employed techniques such as support vector machines (SVM), decision trees (DT), and Naive Bayes (NB) (Rani et al. 2021; Almazroi 2022; Gupta et al. 2022). However, these strategies have yet to effectively harness the vast amount of data generated by medical institutions (Li et al. 2021; Akella and Akella 2021). Implementing ML models in clinical settings remains challenging, often resulting in less accurate predictions (Akella and Akella 2021; Bharti et al. 2021). DL is a subset of ML that excels at comprehending vast amounts of data at remarkable speeds without compromising accuracy (Krishnan et al. 2021; Bhushan et al. 2023). In recent years, researchers have been actively exploring the application of DL technology in medicine, having achieved promising results in different fields. For example, researchers have successfully utilized DL models such as convolutional neural networks (CNN), long short-term memory (LSTM), and CNN-LSTM to predict heart failure by integrating multiple datasets (Hussain et al. 2021; Pan et al. 2020). Deep learning-based methods for image-level tasks have also been proposed recently (You et al. 2022b, 2023, 2022a). In addition, recent DL methods are being integrated with the internet of things (IoT) to enhance the predictive analysis of HD, leading to improved accuracy (Sarmah 2020). This article therefore focuses on summarizing the DL methods, their extended DL (ETDL) methods, and integrated methods in the research of heart disease prediction (HDP). While deep learning with neural networks has emerged as a front-runner in HDP, some studies have fixated on accuracy as the sole metric, potentially overlooking its limitations (Mehmood et al. 2021; Sajja and Kalluri 2020). Accuracy alone does not fully reveal how a model distinguishes between different classes or handles false positives and negatives. Therefore, this article comprehensively compares and analyzes multiple measurement standards, such as accuracy, precision, sensitivity, and specificity.
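To make this distinction concrete, the short Python sketch below (illustrative only; the cohort and predictions are hypothetical, not taken from any reviewed study) computes the four metrics compared throughout this review and shows how, on an imbalanced cohort, a weak classifier can post high accuracy while missing nearly every diseased patient.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    """Compute the four metrics this review compares from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # recall / true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Hypothetical imbalanced cohort: 90 healthy (0) and 10 diseased (1) patients.
y_true = np.array([0] * 90 + [1] * 10)
# A weak model: one false alarm among the healthy, only one diseased patient caught.
y_pred = np.array([0] * 89 + [1] + [1] + [0] * 9)
print(evaluate(y_true, y_pred))
# -> accuracy 0.90, yet sensitivity is only 0.10: accuracy alone hides the missed cases.
```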

In addition, diverse datasets have been developed by researchers to predict HDs (Golovenkin et al. 2020). Significant datasets such as the Cleveland, Framingham, heart disease, and cardiovascular disease datasets play a pivotal role in HDP. These datasets encompass a broad array of attributes, enabling accurate predictions of HDs (Golovenkin et al. 2020). Both modifiable and non-modifiable risk factors contribute to the occurrence of HD. Non-modifiable risk factors include characteristics such as gender, ethnic background, and family history. Conversely, modifiable risk factors such as cholesterol levels, blood pressure, unhealthy lifestyle habits, and smoking can be altered and controlled through specific measures and medical interventions (Biswas et al. 2021). These characteristics have influenced the creation of numerous databases, and academics have made considerable efforts to refine and enhance these datasets (Biswas et al. 2021).

Table 1 presents a comprehensive summary of the related surveys. As shown in Table 1, the research on predicting HD has a long history, but most existing studies use traditional ML methods. Although some studies integrate DL and ML techniques, many methods have their own limitations, advantages, and disadvantages. In addition, existing work does not summarize the publicly available and own-created datasets in the research on predicting HD, so it is necessary to conduct an overall review of DL-based methods for predicting HD.

Table 1 Summary of heart disease surveys

This review encompasses 64 primary studies that leverage various DL techniques, ETDL methods, and integrated DL methods for the prediction of HDs, as shown in Fig. 1. Such advancements will not only assist doctors in making informed decisions but also provide patients with a deeper understanding of their condition. The organization of this paper is structured as follows: Sect. 2 details the methodological approach employed in conducting this research. Section 3 summarizes the current research on HDs, organized into three distinct subsections that concentrate on DL, ETDL, and integrated DL techniques. Section 4 provides a detailed overview of the datasets utilized by various researchers in their studies. Section 5 presents the outcomes and deliberations stemming from the analysis of the primary studies, including a discussion of the unresolved challenges and questions encountered by many researchers. Finally, Sect. 6 concludes the paper and outlines potential future directions of the research.

Fig. 1 The organization of the survey

2 Research methodology

To conduct this systematic literature review, we comprehensively analyzed a broad set of candidate articles. Informed by the methodological frameworks established by previous works (Bhushan et al. 2023; Ayano et al. 2023), this review employs a systematic and well-structured approach as follows:

2.1 Research questions

To provide a thorough understanding of the current state of knowledge, this subsection outlines the key questions tackled by this review, as shown in Table 2, encompassing diverse aspects of the field.

  1. What are the methods related to DL employed for HD prediction in healthcare, and what are their relative strengths and weaknesses?

  2. What are the various data types, sizes, and sources that have been used to study HDP in healthcare?

  3. What are the main findings (journal quality, the leading publishers, significant keywords, prevalent usage of DL, ETDL, and integrated DL techniques, preferred programming languages, persisting open challenges, and the top cited articles)? What problems have been found through this review?

Table 2 Research questions and main motivations

2.2 Source material

The following eight main electronic database sources were used to search for research articles related to DL-based approaches for HDP.

  • Google Scholar (www.scholar.google.co.in), a free academic search engine for finding scholarly research papers, books, and other research materials;

  • Elsevier (https://www.elsevier.com/en-in), a global information analytic resource that provides scientific, technical, and medical information products;

  • ACM Digital Library (www.acm.org/dl), the world’s largest collection of full-text articles and bibliographic literature covering computing and information technology;

  • IEEE Xplore Digital Library (www.ieeexplore.ieee.org), the world’s leading source of technical literature in electrical engineering, computer science, electronics, and other related fields;

  • PubMed (https://pubmed.ncbi.nlm.nih.gov/), a free search engine providing access to a large body of biomedical literature, primarily from the MEDLINE database;

  • Wiley Interscience (www.Interscience.wiley.com), a subscription-based online library that provides access to scientific, technical, and medical journals, books, and reference works;

  • Springer (www.springerlink.com), a leading global publisher of scientific, technical, and medical content, providing researchers with quality content through innovative information, products, and services;

  • ScienceDirect (www.sciencedirect.com), a leading source of scholarly research that provides access to scientific, technical, and medical journals, books, and reference works published by Elsevier.

Following a rigorous methodology, our systematic review aims to achieve three objectives: (1) to serve as a reference for existing DL-related techniques for HDP; (2) to help researchers avoid redundant work; (3) to assist researchers in the field to identify research gaps in HDP using DL-related techniques.

To achieve these objectives, the review covers the following: (1) a detailed discussion of DL-related techniques, including their definition, classification, contributions, and limitations; (2) the identification and characterization of HD datasets that are readily available for DL-based research; (3) an assessment of the progress that has been made in HDP using DL-related techniques, as evidenced by various performance measurement techniques; (4) a discussion of the limitations and challenges associated with DL-related techniques in HDP.

2.3 Search criteria

This review employed a multifaceted search strategy to identify all pertinent literature. Leveraging keywords and subject headings related to HDP models and DL, as informed by the prior literature, the search was conducted in the English language (see online supplementary material). To maximize comprehensiveness, both forward and backward citation searching were performed for included studies and existing systematic reviews. Duplicate removal followed a stringent protocol, combining automated identification through EndNote and subsequent manual verification. To ensure a comprehensive and targeted literature search, a keyword-based approach was implemented. Table 3 shows the detailed keywords utilized in this review.

Table 3 Most often used keywords

As depicted in Fig. 2, the initial set of 3727 results underwent a screening process based on title relevance and the specified time span of 2018–2023, reducing the article count. Subsequent filtering by article classification (review, technical, or survey) narrowed the pool to 2836 articles. A title-based screening then eliminated 530 articles, and a further abstract-based filtration reduced the count to 286. Finally, based on the full text of the articles, a total of 64 articles were selected as the primary studies.

Fig. 2 Flow diagram of paper selection

Table 4 outlines the inclusion and exclusion criteria applied to the identified literature. The detailed steps of this process, along with the number of articles evaluated at each stage, are illustrated in Fig. 2.

Table 4 Literature inclusion and exclusion criteria

2.4 Data extraction and quality assessment

Several obstacles arose when acquiring suitable data for this review. To supplement information not readily identifiable from the reviewed articles, we engaged directly with a select group of researchers. The approach utilized for data extraction in this study encompassed:

  • A meticulous survey of 64 articles for data collection.

  • Author-driven cross-verification of the extracted data to maintain outcome consistency.

After the noteworthy articles were identified using the inclusion and exclusion criteria, the remaining articles were further assessed for quality to ensure a comprehensive and accurate quantitative evaluation of predictive performance. These articles originated from various journals and conferences. Using the CRD criteria given in the literature (Kitchenham and Charters 2007), we examined the bias and internal validity of each publication, as well as the external validity of the results. High-quality research publications using DL, ETDL, and integrated DL methods for predicting HDs were included in this review.

3 Literature review

To address Research Question 1 (outlined in Sect. 2.1), a comprehensive review of the existing literature on HDP has been conducted, and the relevant works were divided into three different subsections based on the techniques used, as follows:

3.1 DL used for HDP

This subsection systematically summarizes existing research on using DL technology for HDP, as shown in Table 5. It comprehensively introduces the publication year, specific techniques/tools, associated contributions and limitations, and performance metrics used.

Table 5 Summary of the existing work related to DL models for HDP

In the realm of HDP, researchers have deployed a diverse array of techniques with great success. Hussain et al. embarked on a comparative study that exhibited the prowess of CNNs in categorizing individuals as fit or unfit within the Cleveland dataset (Hussain et al. 2021). Their model attained an impressive test accuracy of 96% and a training accuracy of 97%. Notably, they integrated clinical parameters to delineate patient risk contours, enabling early disease identification. Furthermore, they emphasized the benefits of balanced datasets in overcoming the limitations posed by traditional machine learning approaches. Revathi and Anjuaravind harnessed the potential of CNNs to tackle early-stage HDP (Revathi and Anjuaravind 2021). Leveraging the Cleveland dataset, their study clearly demonstrated the superiority of CNNs over traditional methods, achieving a commendable accuracy of 94.78%. Notably, their model excelled not only in pre-processing and feature extraction but also in prognosis, highlighting its comprehensive capabilities. Sajja and Kalluri pioneered a CNN-based prediction model that outperformed traditional CVD prediction techniques such as logistic regression (LR), K-nearest neighbors (KNN), NB, and SVM (Sajja 2021). Their model achieved a remarkable accuracy of 94.78% on the Cleveland dataset, demonstrating its proficiency in handling intricate data pipelines encompassing pre-processing, feature extraction, and prediction, all within a unified framework.
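As a rough illustration of how a CNN can be applied to tabular Cleveland-style records, the sketch below treats the 13 standardized attributes as a length-13 one-dimensional signal. The layer sizes and hyperparameters are assumptions chosen for illustration, not the architectures reported by the studies above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(n_features: int = 13) -> tf.keras.Model:
    """A small 1D CNN over a standardized Cleveland-style attribute vector."""
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # fit vs. unfit
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Synthetic stand-in data with the Cleveland shape: 303 samples, 13 features.
X = np.random.randn(303, 13, 1).astype("float32")
y = np.random.randint(0, 2, size=303)
build_cnn().fit(X, y, epochs=2, batch_size=32, verbose=0)
```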

Tomov showcased the potential of flexible 5-layer deep neural network (DNN) models, achieving high accuracy even with a limited dataset (Tomov and Tomov 2021). Mehmood et al. introduced CardioHelp, a CNN-powered approach that proved adept at early heart failure prediction through temporal data modeling (Mehmood et al. 2021). Their method surpassed previous techniques in performance assessments. Pan et al. enhanced accuracy and sensitivity by seamlessly integrating DL with CNNs, leveraging insights from the Cleveland dataset (Pan et al. 2023). Subhadra and Vikas approached diagnosis from a distinct perspective, proposing a multi-layer perceptron (MLP) neural network meticulously trained on the Cleveland dataset (Subhadra and Vikas 2019). This approach offers a novel perspective in the field of HDP, demonstrating the versatility and effectiveness of different DL techniques in this domain. In the domain of early HD detection, Singhal et al. showcased the transformative power of CNNs (Singhal et al. 2018). Utilizing a refined backpropagation approach and leveraging the Cleveland dataset, they crafted a system capable of not only identifying the presence or absence of HD but also discerning its severity. This innovation outperformed existing methods by a remarkable 9% in relative prediction accuracy. In another noteworthy study, Al-Makhadmeh and Tolba demonstrated a DL-based system that addressed data gaps (Al-Makhadmeh and Tolba 2019). The collected data were meticulously analyzed for feature extraction, highlighting the system's ability to extract meaningful insights from diverse data sources. Given the global burden of HD, this paper delves into techniques that can further enhance patient-specific prediction accuracy.

Researchers have capitalized on the immense potential of DL, pushing the boundaries of HDP to new heights. Hauptmann et al. exhibited the remarkable prowess of CNNs in reconstructing highly accelerated radial data from HD patients, laying the foundation for precise HDP through a seamless integration of 3D CNN architecture and compressed sensing (Hauptmann et al. 2019). Their innovative U-Net approach stood out in terms of both speed and accuracy, delivering superior image quality and ventricular volume measurements. Poplin et al. shifted the focus to the intricate landscape of the retina, demonstrating how DL can mine valuable insights from retinal fundus images (Poplin et al. 2018). Their model, validated across diverse datasets such as UK BioBank and EyePACS, successfully predicted HD likelihood by meticulously analyzing vascular features within the retina. While promising, these studies acknowledged limitations like dataset size constraints, missing risk factors, and wide confidence intervals, pinpointing areas for further exploration and refinement.

3.2 ETDL techniques for HDP

This subsection discusses existing work using ETDL technologies to predict HD, as shown in Table 6. It also includes publication year, technology/tool, contributions, limitations, and various performance metrics.

Table 6 Summary of the existing work related to ETDL models for HDP

In a groundbreaking study, Arroyo and Delima demonstrated the remarkable synergy achieved by integrating genetic algorithms (GAs) with artificial neural networks (ANN) for enhanced HD forecasting (Arroyo and Delima 2022). Their hybrid model, termed GA-ANN, significantly outperformed individual ANN, LR, DT, random forest (RF), SVM, and KNN algorithms when tested on an HD dataset. This achievement sets a new benchmark for prediction accuracy in the field. Echoing this success, Verma et al. developed a distinct hybrid approach that combines GA-based feature selection with an ensemble DNN, specifically tailored for HDP (Verma et al. 2021). Their model, which was applied to a comprehensive HD dataset encompassing 900 patient reports with 54 features, achieved a remarkable accuracy of 98%, surpassing previous endeavors in the field. Notably, the researchers employed Kalman filtering techniques to meticulously cleanse the data, effectively purging noise, inconsistencies, and duplicate records to optimize model performance. In the realm of HD detection, researchers have further explored innovative hybrid DL architectures. Krishnan et al. crafted a potent model that seamlessly integrates gated recurrent units (GRUs) with recurrent neural networks (RNN) (Krishnan et al. 2021). The model is further enhanced by the integration of LSTM and the Adam optimizer. Upon rigorous testing with the Cleveland dataset, this hybrid model achieved an exceptional accuracy of 98.68%, surpassing the performance of existing RNN models. This accomplishment highlights the potential of innovative hybrid deep learning approaches in enhancing the accuracy of HDP. Ashraf et al. ventured into the realm of automated heart attack prediction, developing a DNN strategy that was rigorously tested on the Cleveland dataset (Ashraf et al. 2019). Their approach effectively addressed common accuracy shortcomings associated with traditional prediction methods and overcame the limitations of manual preprocessing techniques. This strategy not only improved the accuracy of heart attack predictions but also paved the way for future advancements in this crucial field of medical research.
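The GA component of such hybrids can be sketched as a search over binary feature masks. The toy version below is a stand-in under stated assumptions, not the exact pipelines of Arroyo and Delima or Verma et al.: fitness is the cross-validated accuracy of a lightweight logistic-regression classifier, whereas the cited works would place an ANN or ensemble DNN in that role.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def ga_feature_selection(X, y, pop=20, gens=10, p_mut=0.1):
    """Toy GA over binary feature masks; fitness = 5-fold CV accuracy."""
    n = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n))

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[-pop // 2:]]  # keep the fittest half
        children = []
        while len(children) < pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child[rng.random(n) < p_mut] ^= 1                 # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    return max(population, key=fitness).astype(bool)

# Synthetic usage: 200 hypothetical patients with 13 candidate features.
X = rng.normal(size=(200, 13))
y = rng.integers(0, 2, size=200)
selected = ga_feature_selection(X, y)  # boolean mask over the 13 features
```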

In a complementary study, Hamad and Jasim delved into the application of DNNs for cardiac disease prognosis (Hamad and Jasim 2021). They meticulously designed a novel DNN classifier, ensuring that the database was appropriately divided into testing and training sets, with each set undergoing rigorous feature extraction preprocessing. When tested on the Cleveland dataset, their model achieved an accuracy of 84.67%, demonstrating the immense potential of DNNs in this domain. Rao and Satya Prasad orchestrated a sophisticated ensemble of algorithms aimed at enhancing the precision of HDP (Rao and Satya 2021). They capitalized on the ensemble deep dynamic algorithm (EDDA) on the Cleveland dataset, meticulously training it with a linear regression model to prime the deep Boltzmann machine (DBM) method for peak performance. This innovative approach overcame numerous prediction challenges, significantly boosting accuracy and recalibrating other crucial parameters. Oliver et al. devised a technique for early disease type prediction and classification utilizing the PHYSIONET dataset (Oliver et al. 2021). This approach incorporated signal processing, wavelet transformation-based segmentation, and a regressive learning-based neural network (RLBNN) classifier for feature identification. Their technique surpassed traditional methods in terms of efficiency, paving new avenues for disease prediction and classification. Dami and Yahaghizadeh embarked on a journey to predict arterial events across specific time spans, leveraging ECG recordings and time-frequency analysis (Dami and Yahaghizadeh 2021). Their LSTM-DBN model, trained on four datasets, achieved remarkable heart disease prediction with high specificity (85.54%), sensitivity (85.13%), and accuracy (88.42%). In a complementary approach, Ali et al. capitalized on the synergistic effects of ensemble DL and feature fusion to enhance the predictive capabilities of their model (Ali et al. 2020). They meticulously crafted a prediction model nourished by patient data extracted from electronic medical tests and wearable sensors. By seamlessly incorporating frequency response functions (FRFs) into the primary data, they achieved a remarkable elevation in data quality and accuracy, thus enhancing the reliability of their predictions. Sarmah revolutionized the application of wearable technology by leveraging it to assess crucial heart health parameters and creating sensors tailored to detect heart aging, stress hormones, and cholesterol levels (Sarmah 2020). This innovative system could provide real-time insights derived from both simulated and real-world data, offering clinicians unprecedented access to patient health information. To ensure the utmost protection of patient privacy, they integrated advanced IoT measures for secure data transmission, bolstered by robust authentication, encryption, and classification techniques.
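The recurrent component of sequence-based approaches such as the LSTM-DBN above can be illustrated with a compact Keras model over fixed-length ECG windows. This is a minimal sketch: the window length (5 s at an assumed 360 Hz sampling rate) and layer sizes are illustrative, and the DBN stage of Dami and Yahaghizadeh's pipeline is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Classify fixed-length ECG windows: 5 s at 360 Hz -> 1800 timesteps, 1 channel.
model = tf.keras.Sequential([
    layers.Input(shape=(1800, 1)),
    layers.LSTM(64, return_sequences=True),  # per-timestep features
    layers.LSTM(32),                         # final state summarizes the window
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # arterial event vs. no event
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Recall(name="sensitivity")])
model.summary()
```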

Ali et al. introduced an innovative χ²-DNN model, acknowledging the potential challenges of overfitting and underfitting (Ali et al. 2019). Their model demonstrated superior performance, surpassing existing ANN and DNN models and achieving a remarkable prediction accuracy of 93.33%. Wang et al. unveiled a neural network architecture known as the multitask deep and wide neural network (MT-DWNN) (Wang et al. 2019). This sophisticated architecture was designed to predict critical complications during hospitalization, pinpointing the most influential factors that impact patient outcomes. Their results demonstrated the superior forecasting performance of MT-DWNN for heart failure patients, significantly surpassing traditional techniques. This innovative approach offers clinicians a powerful tool for improving patient care and outcomes.

3.3 Integrated DL techniques for HDP

This subsection discusses existing works on integrated methods that combine DL with other techniques for HDP (as shown in Table 7). It also includes the year of publication, techniques/tools, contributions, limitations, and various performance measures.

Table 7 Summary of the existing work related to integrated DL models for HDP

Certain researchers have conducted extensive investigations on the integration of frameworks with IoT/IoMT (internet of medical things) technologies to enhance HDP. For example, Khan introduced a modular prediction system that seamlessly integrates hardware devices, microcontrollers, and LoRa communication hardware, facilitating the efficient transmission of data to a cloud-based system (Khan 2020). In a separate study, Shafi et al. employed CNN and wearable devices to forecast HD, leveraging the capabilities of these technologies in data analysis and monitoring (Shafi et al. 2022). Beyond neural networks, researchers have also delved into nature-inspired algorithms for HDP. Deepika and Balaji, for example, harnessed the dragonfly algorithm, which is inspired by the swarming behavior of dragonflies, demonstrating the potential of bio-inspired approaches in this domain (Deepika and Balaji 2022). In a similar vein, the integration of deep learning neural networks with IoT was explored for predicting HDs (Sarmah 2020), highlighting the synergistic potential of these two technologies. Furthermore, an intelligent HDP system based on swarm-ANN was proposed to capitalize on the collective intelligence and adaptability of swarm-based algorithms (Nandy et al. 2023). Additionally, a multi-label learning prediction model was introduced to leverage expert knowledge of disease duration to enhance the accuracy of HDP (Huang et al. 2023). Dhaka and Nagpal presented a smart disease prediction model utilizing a WoM-based deep BiLSTM classifier, emphasizing the importance of incorporating domain-specific knowledge into machine learning models (Dhaka and Nagpal 2023). Lastly, the development of a scalable and real-time system for disease prediction, leveraging deep learning and big data processing, was reported in Sharma et al. (2023). This system demonstrates the potential of harnessing large-scale data and advanced machine learning techniques for accurate and timely disease predictions. Similarly, another deep learning model was applied in Zhou et al. (2023) to predict HDs, further extending the application of DL in the field of healthcare.

Recent research has delved into a diverse range of prediction algorithms for HD analysis, with various studies highlighting the strengths and potential of different approaches. Almazroi exhibited the superiority of DT in HDP, surpassing other algorithms such as LR and SVM by a noteworthy margin of 14% on a dataset comprising real patient records (Almazroi 2022). This study underscores the potential for further enhancing the robustness of datasets and ML algorithms in this domain. Gupta et al. took a more comprehensive approach by exploring a diverse array of ML algorithms, with a particular focus on maximizing model accuracy for improved prediction analysis (Gupta et al. 2022). This study emphasizes the importance of selecting and fine-tuning algorithms that are most suitable for the specific task and dataset. Bharti et al. conducted a comparative analysis of three distinct approaches (ML, DL, and a hybrid combination) using the Cleveland dataset (Bharti et al. 2021). Their objective was to identify the most effective strategy in terms of accuracy, reliability, and sensitivity. The findings revealed that the approach integrating both feature selection and outlier detection was the most effective, achieving the highest average accuracy. This suggests that a comprehensive and well-rounded approach, incorporating multiple techniques, can lead to superior prediction performance. In the realm of multi-modal HDP, Li et al. introduced a groundbreaking approach that capitalized on the synergistic power of electrocardiogram (ECG) and phonocardiogram (PCG) data (Li et al. 2021). Using the Cleveland dataset, they skillfully employed CNNs to extract deep-rooted features from both modalities. This was followed by a rigorous GA-based feature selection process to identify the most informative feature subset. Finally, SVMs were employed for the classification task. This multifaceted approach outperformed its single-modal counterparts and alternatives, demonstrating its superior predictive capabilities and the potential of multi-modal data fusion in HDP. Akella and Akella leveraged a diverse suite of ML algorithms to predict the occurrence of HDs in patients (Akella and Akella 2021). Through a meticulous evaluation of DT, RF, SVM, ANN, and KNN on the Cleveland dataset, they discovered that ANN emerged as the unequivocal leader, boasting an impressive accuracy of 93.03%. In a demonstration of their commitment to open science, the researchers shared their codebase on GitHub, inviting peers to scrutinize and build upon their findings and fostering a collaborative and transparent research environment. A phonocardiography-based valvular heart disease detection framework was proposed in Jamil and Roy (2023) and Jamil et al. (2023). Bhattacharyya et al. delved into the challenging domain of chronic kidney disease prediction, deploying a hybrid approach that seamlessly integrated ML and DL techniques to ensure timely and accurate diagnoses (Bhattacharyya et al. 2021). Their methodology addressed the complexities of imbalanced data, exhibiting robustness and diagnostic precision. Notably, the modular design of their model offers ample opportunities for future enhancements through the integration of novel algorithms and optimization strategies, hinting at a promising trajectory for continued refinement in the field of chronic kidney disease prediction.
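The fusion stage of such multi-modal pipelines can be sketched as follows. The encoder shapes and window lengths are assumptions, the encoders are left untrained for brevity (in practice they would be trained on labeled signals first), and the GA-based selection step between fusion and classification is omitted; see the GA sketch in Sect. 3.2.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.svm import SVC

def encoder(length: int) -> tf.keras.Model:
    """A small 1D-CNN feature extractor for one signal modality."""
    inp = layers.Input(shape=(length, 1))
    x = layers.Conv1D(16, 5, activation="relu")(inp)
    x = layers.MaxPooling1D(4)(x)
    x = layers.Conv1D(32, 5, activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)  # 32 deep features per window
    return tf.keras.Model(inp, x)

ecg_enc, pcg_enc = encoder(1000), encoder(2000)  # window lengths are assumptions

# Synthetic stand-in windows for 64 hypothetical patients.
ecg = np.random.randn(64, 1000, 1).astype("float32")
pcg = np.random.randn(64, 2000, 1).astype("float32")
y = np.random.randint(0, 2, size=64)

# Concatenate the two modalities' deep features, then classify with an SVM.
fused = np.hstack([ecg_enc.predict(ecg, verbose=0),
                   pcg_enc.predict(pcg, verbose=0)])  # shape (64, 64)
clf = SVC(kernel="rbf").fit(fused, y)
```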

In the field of HDP, Biswas et al. conducted a meticulous analysis of various ML algorithms and neural networks, leveraging open-source datasets and dividing them into distinct training and testing cohorts (Biswas et al. 2021). Their investigation revealed that the Cleveland dataset has served as a common benchmark for numerous studies, including the comparative analysis conducted by Sujatha and Mahalakshmi (2020). Notably, RF emerged as the supreme contender in terms of precision and overall performance metrics, achieving a remarkable accuracy of 95.60%. Meanwhile, Sharma and Parmar redirected their attention to Talos optimization, leveraging its capabilities to enhance a DNN model trained on real-world patient records (Sharma and Parmar 2020). Their findings underscored the superiority of Talos, which surpassed other optimizers with an accuracy of 90.78%, highlighting its potential to unlock even greater performance in the realm of HDP. By incorporating rigorous analytical techniques and utilizing diverse datasets, these studies have contributed significantly to the understanding and improvement of HDP models, paving the way for future advancements in this crucial field. Chicco and Jurman introduced a multifaceted system that integrates various techniques, including fuzzy logic (FL), DT, SVM, ANN, and AdaBoost (Chicco and Jurman 2020). The effectiveness of this system was attributed to feature reduction, which was evaluated using LASSO, as well as feature selection techniques such as mRMR. Furthermore, Zeleznik and Eslami ingeniously combined a hybrid random forest with a linear model (HRFLM) feature selection algorithm and a RF model (Zeleznik and Eslami 2021). This optimized algorithm is adept at identifying features that are crucial for predicting HD, achieving a remarkable prediction accuracy of 88.7%. Lastly, Hassani et al. took a unique approach by merging the capabilities of a neural network with a decision tree (Hassani et al. 2020). This hybrid system exhibited enhanced accuracy and performance in classifying heart disease, surpassing existing methodologies. Collectively, these studies represent significant advancements in the field of HDP, contributing to the development of more accurate and effective risk prediction systems. Straw and Wu emphasized the pivotal role of common supervised learning algorithms, such as RF, DT, and ensemble models, in the intricate process of data analysis (Straw and Wu 2022). These algorithms play a crucial part in facilitating the diagnosis of diverse HDs. Samuel et al. introduced a groundbreaking approach that seamlessly integrated multilayer networks with a hierarchical component-based learning model, thereby enhancing the prediction of HDs (Samuel et al. 2020). This innovative methodology excelled in deciphering the intricate and complex interactions among various risk factors. Notably, it surpassed standard methods in terms of prediction accuracy, demonstrating its superiority in this domain. Furthermore, Das et al. employed time series analysis to extract pertinent features related to a patient's risk of HD (Das et al. 2022). Subsequently, they utilized rough set techniques to identify and analyze the intricate relationships between these extracted features. The results of their study revealed that their proposed methodology was capable of effectively predicting HD, thus contributing significantly to the field of healthcare analytics.

4 Dataset description

To address Research Question 2 (outlined in Sect. 2.1), a meticulous and comprehensive analysis was undertaken, encompassing a thorough examination of the 64 primary studies. This analysis aimed to document, in fine detail, the diverse datasets utilized in the domain of heart disease analysis. Table 8 offers a comprehensive overview of these datasets, providing a unique identifier for each, along with their descriptive nomenclature, accessibility status (publicly available or own-created), and provenance (own-created, hospital-collected, or sourced from Kaggle or the UCI Repository). For added granularity, the table also includes the year of public release, sample size, attribute characteristics, and URL (where applicable) for each dataset. In cases where the dataset URL was unavailable or inaccessible, a "-" symbol is used as an indicator. This meticulous documentation enables a deeper understanding of the datasets utilized in heart disease analysis and facilitates the comparison and reproducibility of research findings.

Table 8 Various datasets for predicting heart diseases

After conducting a thorough analysis of the datasets utilized in heart disease prediction models, it was observed that the most commonly used dataset was the Cleveland dataset, obtained from the UCI Machine Learning Repository. Although it contains 76 attributes, a subset of 14 attributes is consistently chosen for model construction. These attributes include age, gender (sex), the type of chest pain (cp), resting blood pressure (trestbps), serum cholesterol (chol), fasting blood sugar level (fbs), resting ECG results (restecg), maximum heart rate achieved during testing (thalach), the presence of exercise-induced angina (exang), the degree of ST depression induced by exercise compared to rest (oldpeak), the slope of the peak exercise ST segment (slope), the number of major vessels (ranging from 0 to 3) colored by fluoroscopy (ca), thalassemia status (thal), coded as normal (3), fixed defect (6), or reversible defect (7), and the predicted target attribute (num). In terms of popularity, the Statlog (Heart) dataset and the Hungarian dataset ranked second and third, respectively. Notably, among the 40 datasets referenced in Table 8, a proportion were author-generated, while the remainder were publicly accessible, highlighting the diversity and accessibility of data resources utilized in this field of research.
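For reference, the sketch below loads the processed Cleveland file directly from the UCI repository and names the 14 conventional attributes; the URL reflects the repository layout at the time of writing, and the binarization of the "num" target follows the common convention in the reviewed studies (any disease versus none).

```python
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
COLS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# '?' marks missing values (a handful in the ca and thal columns).
df = pd.read_csv(URL, header=None, names=COLS, na_values="?")
df["target"] = (df["num"] > 0).astype(int)  # common binarization: any disease vs. none
print(df.shape)  # (303, 15): 303 samples, 13 predictors, num, and the derived target
```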

As depicted in Fig. 3, a statistical analysis was conducted on the existing datasets, stratified by both publication year and type, encompassing publicly available datasets and those created by individual researchers. The timeline of publication is depicted on the X-axis, while the Y-axis represents the cumulative count of datasets. Publicly released datasets are denoted in blue, while those created by individual researchers are indicated in orange. Additionally, the gray line signifies the total number of datasets across both categories. The analysis reveals that public datasets tend to have been released earlier than those created by individual researchers. Notably, six public datasets were released in 1988, highlighting the early availability of such resources. Conversely, the release of datasets created by individual researchers has been concentrated in recent years, with five such datasets emerging in 2021. From an overall perspective, dataset releases were sparse between 1989 and 2016, indicating a relative lull in activity during this interval. However, in the past three years, the frequency of dataset releases has increased, demonstrating a resurgence of interest. This trend suggests that research on predicting HD has undergone a cyclical pattern, transitioning from a period of intense activity to a lull and subsequently returning to a state of renewed interest.

Fig. 3 Statistics of existing datasets by year of public release

To further explore the characteristics of the existing datasets, a comparative analysis was conducted to elucidate the interplay between the instance size and the number of features within each dataset. The outcomes of this analysis are visually represented in Fig. 4, where the X-axis indexes the dataset ID, while the Y-axis concurrently charts both the instance count and the feature count. A clear visual distinction is achieved by utilizing blue bars to represent the instance count and an orange line to trace the varying trajectory of the feature count. This visualization offers a comprehensive overview of the datasets' dimensionality and the relationship between the number of instances and features, thereby providing valuable insights for future research endeavors.

Fig. 4 Comparison of existing datasets

5 Results and discussion

To address Research Question 3 (outlined in Sect. 2.1), this section presents a comprehensive analysis of the existing literature from diverse angles. It enumerates the range of publishers, the quality of journals, key words, the various programming languages, and the techniques employed by various researchers. Additionally, it highlights the ten most influential papers in this domain. Moreover, this section also explores open challenges, offering valuable insights for future researchers.

5.1 Range of publishers

Figure 5 illustrates the percentage of articles sourced from various publishers in this review. IEEE tops the list with 16 articles, accounting for 23% of the total, indicating its dominance in publishing primary studies. Google Scholar, Springer, and Elsevier contribute 19%, 17%, and 15% of the articles, respectively. ScienceDirect, Wiley, ACM, and PubMed contribute 10%, 8%, 5%, and 3%, respectively, to the overall count.

Fig. 5 Percentage of articles from various publishers

5.2 Quality of journals

Although IEEE Access contributed more articles than any other single journal, it accounts for only five of the primary studies in this review. This suggests that journal publications in this area are highly diverse, and no single journal dominates in terms of the number of articles published on heart disease prediction. According to Table 9, among the journals for primary studies, Nature Biomedical Engineering and Information Fusion rank first and second, respectively, with impact factors of 29.3 and 18.6, indicating that these journals are highly influential in the field.

Table 9 Top ten journals with impact factor

5.3 Key words

A word cloud offers a straightforward approach to identifying common key words employed in the referenced articles. This visualization technique allows us to quickly determine the most prevalent terms. As illustrated in Fig. 6, the most frequently used key words are emphasized with bolder and larger fonts, while those used less frequently are highlighted with smaller and more standard fonts.

Fig. 6 Word cloud for the most frequently used key words
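A figure of this kind is straightforward to regenerate. The sketch below uses the third-party wordcloud package; the first three frequencies are the counts reported for Fig. 7, while the remaining entries are hypothetical placeholders.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud

# The first three counts come from Fig. 7; the rest are hypothetical placeholders.
keyword_counts = {
    "Deep Learning": 37, "Heart Disease Prediction": 25,
    "Convolutional Neural Network": 18, "Machine Learning": 12,
    "IoT": 7, "LSTM": 6, "Feature Selection": 5,
}
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(keyword_counts)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```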

Figure 7 showcases the terms that are most frequently employed in the keyword sections of the articles. Notably, "Deep Learning" stands out as the most popular keyword, occurring 37 times. Following closely are "Heart Disease Prediction" and "Convolutional Neural Network," chosen as the second and third most preferred keywords by the researchers, with 25 and 18 occurrences, respectively.

Fig. 7 The most used key words

5.4 Various programming languages

Researchers have utilized diverse languages in developing HDP. As depicted in Fig. 8, Python stands out as the most preferred programming language, chosen by 75% of researchers. Following closely are MATLAB with 9%, R with 8%, Java with 7%, and Semantic Web Rule Language (SWRL) with 1%.

Fig. 8 The proportion of languages used by various researchers

5.5 Techniques employed by various authors

DL, ETDL, and integrated DL algorithms frequently consume and process data to gain insights into underlying processes, patterns, events, and other relevant information. The following subsections delve into the diverse types of DL, ETDL, and integrated DL techniques that researchers have utilized in their work.

5.5.1 DL techniques

After analyzing 64 articles, Fig. 9 illustrates the most commonly employed DL algorithms. The plot reveals that CNN stands out as the most frequently used algorithm by researchers and practitioners, accounting for 31%. ANN follows closely with 23%, while backpropagation contributes 12%.

Fig. 9 Various DL techniques used by different authors

5.5.2 ETDL techniques

Figure 10 presents the diverse ETDL strategies employed for implementing HDP. According to the plot, hybrid DL-based and CNN-based ETDL are the most widely used algorithms among researchers and practitioners, accounting for 32% and 20% of the total, respectively. DNN-based and LSTM-based ETDL follow with 16% and 12%, while the remaining techniques account for 8% each.

Fig. 10 Various ETDL techniques used by different authors

5.5.3 Integrated DL techniques

Figure 11 illustrates the various integrated DL techniques utilized for the deployment of HDP. The graph indicates that hybrid DL methods are the predominant algorithms of choice among researchers and practitioners, comprising 26% of the total usage. The ML+DL and IoMT+DL methodologies follow, making up 18% and 15% of the overall strategies, respectively. Wearable devices+DL and IoT+DL approaches are also popular, each contributing 11%. All other strategies are employed to a lesser extent, accounting for 6%, 6%, and 4% of the total.

Fig. 11 Various integrated DL techniques used by different authors

5.6 Various techniques used in different years

This article provides a comprehensive summary of the DL, ETDL, and integrated DL techniques used in HDP research. As shown in Fig. 12, the number of papers using different techniques in different years is displayed. As can be seen, researchers are increasingly inclined to adopt a combination of DL and other techniques in HDP research.

Fig. 12 Various techniques used in different years

5.7 Top 10 cited articles

Table 10 compiles a list of the top ten most-cited articles on DL techniques for HDP. This compilation encompasses the year of publication, the type of publication (journal or conference), the name of the respective journal or conference, and the number of citations received by each article.

Table 10 Top ten cited articles

5.8 Open challenges

Nearly every primary study has inherent limitations. This subsection serves to collectively emphasize the challenges and shortcomings encountered by various researchers, specifically including:

(1) To obtain better prediction results, larger and more diverse datasets are urgently needed (Al-Makhadmeh and Tolba 2019; Hauptmann et al. 2019; Poplin et al. 2018). It is necessary to comprehensively consider the various risk characteristics of a large population to validate algorithms (Almazroi 2022; Gupta et al. 2022). Datasets generated through APIs or cloud services can be selected for research purposes (Bharti et al. 2021), as cloud computing technology can be used to manage large amounts of patient data (Dhaka and Nagpal 2023). The integration of IoT devices for real-time capture of clinical parameters offers an opportunity to enhance the functionality of existing systems (Hamad and Jasim 2021). Another key aspect is to collaborate with doctors to obtain more valuable data that can be used to enhance models (Li et al. 2021). Models can be further optimized through training on diverse hospital datasets, enabling them to achieve exceptional prediction results (Biswas et al. 2021). In addition, model validation remains a challenge, and laboratory test data serve as a valuable resource for assessing the precision and accuracy of predictions (Al-Makhadmeh and Tolba 2019). Data expansion analysis using medical records could potentially produce more refined prediction models using cardiac CT scans (Tomov and Tomov 2021). Finally, it is recommended to rely on real-world datasets rather than theoretical methods for simulation (Shafi et al. 2022).

(2) To better handle datasets with substantial missing data, it is recommended to optimize and improve existing models by incorporating diverse feature selection methods. Several studies (Sarmah 2020; Huang et al. 2023; Chicco and Jurman 2020) suggested that using ensemble classifiers and additional attributes can yield new models that accurately determine the severity and grade of disease, thereby improving overall performance. To address the challenges posed by managing a large number of features and extensive medical information, it is important to develop a unique feature reduction method. In addition, exploring advanced strategies to eliminate redundant features and to manage missing data and noise is crucial for achieving accurate prediction results. Other studies (Dami and Yahaghizadeh 2021; Ali et al. 2019; Singhal et al. 2018) suggested that developing a novel feature selection technique is imperative for selecting the optimal combination of key features in a dataset that maximizes predictive performance.
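One hedged way to operationalize these suggestions in a single pipeline, offered as an illustrative sketch rather than a method proposed by the cited studies, is to chain imputation for missing values, scaling, mutual-information feature selection, and an ensemble classifier:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),       # fill gaps such as '?' in ca/thal
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=8)),  # keep the 8 most informative features
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# Synthetic Cleveland-shaped data with ~2% of values missing.
X = np.random.randn(303, 13)
X[np.random.rand(303, 13) < 0.02] = np.nan
y = np.random.randint(0, 2, size=303)
print(cross_val_score(pipe, X, y, cv=5).mean())  # ~0.5 on random labels, as expected
```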

(3) The "black box" nature of various HDP models has made it difficult for them to be integrated into clinical diagnostic workflows. Therefore, providing open source solutions that enable the wider use of models would be very beneficial. However, there are still some models that have not been fully opened to the public yet (Bhattacharyya et al. 2021). The lack of open source solutions may hinder the ability of medical professionals to predict HD (Gupta et al. 2022). Another challenge lies in the design of new methods suitable for specific applications, which requires cooperation from interdisciplinary experts from different fields. How to deploy the system without doctor supervision to test its performance using real-time data (Rani et al. 2021). Currently, there is a lack of practical medical prediction and early warning tools in large-scale clinical settings (Sajja 2021).

(4) To improve prediction accuracy, it is necessary to extend traditional DL methods while keeping their behavior distinctive and easy to understand (Krishnan et al. 2021). In addition, combining DL methods with other related technologies can effectively improve accuracy, optimize attribute selection, and enhance data classification (Wang et al. 2019). To further improve HDP, probabilistic method evaluation can be conducted (Zeleznik and Eslami 2021). It is crucial to acknowledge that the absence of hybrid frameworks could contribute to a certain decrement in accuracy (Vaishali et al. 2023). In addition, the lack of recognized metrics for measuring the performance of prediction methods poses a challenge to evaluating the quality of proposed technologies. Therefore, it is crucial to improve the accuracy of prediction models and to design new metrics, validated through rigorous testing, to measure the performance of prediction methods.

(5) The existing technology can be extended to diagnose a series of other chronic health issues, including chronic renal insufficiency, diabetes, cancer, and various brain-associated diseases (Rani et al. 2021; Mehmood et al. 2021; Sharma et al. 2023).

6 Conclusion and future work

This article summarizes 64 primary studies that use DL, ETDL, and integrated DL methods to predict HDs. Through rigorous analysis of these studies, CNN emerges as the most prominent DL technique, hybrid DL-based ETDL as the leading ETDL method, and hybrid DL as the most commonly used integrated DL method. Moreover, an increasing amount of research in recent years combines DL with other technologies. In addition, we discuss various existing datasets, among which UCI's Cleveland heart disease dataset is used by 62% of researchers and is the most widely used dataset. It is noteworthy that Python is the preferred programming language for implementing these technologies. The primary studies come mainly from journals published by well-known publishers such as IEEE, Springer, and Elsevier. Although some progress has been made in this field, researchers still face numerous challenges, the most significant of which is the lack of larger and more diverse datasets. This severely limits the use of DL technology to improve the accuracy and reliability of HDP.

In this review, we have explored various issues that will help future researchers identify suitable research questions. To ensure reliable performance, real-time prediction of HD is a pivotal aspect, involving the use of various datasets combined with innovative DL-related comprehensive techniques. Adopting novel and advanced techniques can significantly improve the accuracy of disease prediction. Moreover, detection methods can be transformed into web or mobile applications, enabling individuals to detect diseases in a timely manner. In addition, complementary methodologies enable the study and prediction of other chronic illnesses, for instance, brain disease and chronic kidney disease. Data plays a vital role in DL-based HDP research. Faced with the limited availability of public datasets, it is imperative for researchers to prioritize the collection of large amounts of data for future evaluation. Additionally, exploring multimodal technologies can enhance HD prediction and produce efficient and reliable results.

Based on the research questions listed in Sect. 2.1, the primary findings from this review are outlined below:

RQ1:

What are the methods related to DL employed for HD prediction in healthcare, and what are their relative strengths and weaknesses?

As described in Sect. 3, a comprehensive review of existing literature related to HDP was conducted, and the related work was divided into three categories based on the technologies used: DL techniques, as shown in Table 5; ETDL techniques, as shown in Table 6; and integrated methods that combine DL with other technologies, as shown in Table 7. Information about publication year, specific technology/tool, related contributions and limitations, and performance metrics used was comprehensively introduced.

RQ2:

Are there any freely available datasets for HDP? What are their characteristics?

As described in Sect. 4, the accessibility status of each dataset is detailed, including whether it is publicly available or author-created, its provenance (own-created, hospital-collected, or sourced from Kaggle or the UCI Repository), year of public release, sample size, attribute characteristics, and URL. A comprehensive analysis of the datasets used in HDP models found that their sample sizes and attribute characteristics vary. Among them, the most commonly referenced dataset is the Cleveland dataset from UCI, with the Statlog (Heart) dataset and the Hungarian dataset ranking second and third in terms of utilization frequency. Among the 40 datasets listed in Table 8, own-created datasets account for a small portion, with most of the datasets publicly accessible, and own-created datasets are almost all concentrated in recent years.

RQ3:

What are the major findings and observations after conducting this review?

As described in Sect. 5, this article comprehensively analyzes the existing literature from different perspectives, summarizing 64 primary studies using DL, ETDL, and integrated DL methods to predict HDs. The analysis found that CNN is the most commonly used DL technique, hybrid DL-based ETDL is the most commonly used ETDL method, and hybrid DL is the most commonly used integrated DL method. In recent years, more and more research has attempted to combine DL with other technologies to predict HD. The analysis also found that Python is the preferred programming language for implementing these technologies. The primary studies come mainly from journals published by well-known publishers such as IEEE, Springer, and Elsevier.

RQ4:

Are there any limitations and challenges in DL-based HDP?

While there have been some advancements in this field, researchers still face numerous challenges. In this review, we explore various open challenges, providing valuable insights for future researchers. Among them, the most significant is the lack of larger and more diverse datasets, which severely limits the use of DL-related technologies to improve the accuracy and reliability of HDP. The adoption of novel and advanced comprehensive DL-related technologies for real-time prediction of HD can significantly enhance the accuracy of disease prediction. Detection methods can be transformed into mobile or web applications, which can facilitate individuals to detect diseases promptly. Additionally, complementary methodologies enable the study and prediction of other chronic illnesses, for instance, brain disease and chronic kidney disease. Exploring multimodal technologies can enhance HDP and produce efficient and reliable outcomes.