1 Introduction
As the digital healthcare ecosystem expands, healthcare data is increasingly being recorded within electronic health records (EHRs) and Administrative Claims (AC) systems [1, 2]. These information systems have been widely adopted by government agencies, hospitals, and insurance companies [3, 4], capturing data from millions of individuals over many years [5, 6]. As a result, physicians and other medical practitioners are increasingly overwhelmed by the massive amounts of recorded patient data, especially given these professionals' relatively limited time, tools, and experience for wielding this data on a daily basis [7, 8]. This problem has brought machine learning (ML) methods to the attention of the medical domain, since ML methods effectively use an abundance of available data to extract actionable knowledge, thereby both predicting medical outcomes and enhancing medical decision making [3, 9]. Specifically, ML has been utilized in the assessment of early triage, the prediction of physiologic decompensation, the identification of high-cost patients, and the characterization of complex, multi-system diseases [10, 11], to name a few. Some of these problems, such as early triage assessment, are not new and date back to at least World War I, but the success of ML methods and the concomitant, growing deployment of EHR and AC information systems have sparked broad research interest [4, 12].
Despite the swift success of traditional ML in the medical domain, developing effective predictive models remains difficult. Due to the high-dimensional nature of healthcare data, typically only a limited set of appropriate features from among thousands of candidates is selected for each new prediction task, necessitating a labor-intensive and time-consuming process. This often requires the involvement of medical experts to extract, preprocess, and clean data from different sources [13, 14]. For example, a recent systematic literature review found that risk prediction models built from EHR data use a median of 27 features from among many thousands of potential variables [15]. Moreover, to handle the irregularity and incompleteness prevalent in patient data, traditional ML models are trained using coarse-grain aggregation measures, such as the mean and standard deviation, of their input features. These models depend heavily on manually crafted features, and they cannot adequately leverage the sequential nature of medical events and their temporal dependencies [16, 17]. Another crucial observation is that patient data evolves over time. The sequential nature of medical events, their associated long-term dependencies, and confounding interactions (e.g., disease progression and intervention) offer useful but highly complex information for predicting future medical events [18, 19]. Aside from limiting the scalability of traditional predictive models, these complicating factors unavoidably result in imprecise predictions, which can overwhelm practitioners with false alarms [20, 21]. Effective modeling of high-dimensional, temporal medical data can help to improve predictive accuracy and thus increase the adoption of state-of-the-art models in clinical settings [22, 23].
Compared with their traditional ML counterparts, deep learning (DL) methods have shown superior performance on various healthcare prediction tasks by addressing the aforementioned high dimensionality and temporality of medical data [12, 16]. These enhanced neural network techniques can learn useful representations of key factors, such as esoteric medical concepts and their interactions, from high-dimensional raw or minimally processed healthcare data [5, 20]. DL models achieve this through repeated sequences of training layers, each employing a large number of simple linear and nonlinear transformations that map inputs to meaningful representations of distinguishable temporal patterns [5, 24]. Released from the reliance on experts to specify manually crafted features, these end-to-end neural learners can model data with rich temporal patterns and can encode high-level feature representations as nonlinear combinations of network parameters [25, 26].
Not surprisingly, the recent popularity of DL methods has correspondingly increased the number of associated publications in the healthcare domain [27]. Several studies have reviewed such works from different perspectives. Pandey and Janghel [28] and Xiao et al. [29] describe a wide variety of DL models and highlight the challenges of applying them in a healthcare context. Yazhini and Loganathan [30], Srivastava et al. [31], and Shamshirband et al. [32] summarize various applications in which DL models have been successful. Unlike the aforementioned studies, which broadly review DL in various health applications ranging from genomic analysis to medical imaging, Shickel et al. [27] focus exclusively on research involving EHR data. They categorize deep EHR learning applications into five categories: information extraction, representation learning, outcome prediction, computational phenotyping, and clinical data de-identification, while describing a theme for each category. Finally, Si et al. [33] focus on EHR representation learning and investigate their surveyed studies in terms of publication characteristics, including input data and preprocessing, patient representation, learning approach, and evaluative outcome attributes.
In this article, we review studies focusing on DL prediction models that leverage structured patient time series data for healthcare prediction tasks from a technical perspective. We do not focus on unstructured patient data, such as images or clinical notes, since DL methods that include natural language processing and unsupervised learning tend to ask research questions that are quite different due to the unstructured nature of the data types. Rather, we summarize the findings of DL researchers for leveraging structured healthcare time series data, of numeric and categorical types, for a target prediction task in terms of the network architecture and learning strategy. Furthermore, we methodically organize how previous researchers have handled the challenging characteristics of healthcare time series data. These characteristics notably include incompleteness, multimodality, irregularity, visit representation, the incorporation of attention mechanisms or medical domain knowledge, outcome interpretation, and scalability. To the best of our knowledge, this is the first review study to investigate these technical characteristics of deep time series prediction in the healthcare literature.
3 Results
Our literature search initially resulted in 1,524 studies, with 511 of them being duplicates (i.e., indexed in multiple databases). The remaining 1,014 works underwent a title and abstract screening. Following our exclusion criteria, 621 studies were excluded: 74 did not use EHR or AC data, 81 did not use multivariate temporal data, 171 did not use DL methods for their prediction tasks, and 295 were based on unstructured data, such as images, clinical notes, or sensor data. The remaining 393 papers were then selected for a full-text review, and we subsequently removed 316 additional papers because they lacked one or more of the core study characteristics listed in Table 1. Specifically, 64 of the removed papers did not provide distinctive input features (e.g., medical code types), 99 did not have a patient representation (e.g., embedding vector creation), 129 did not sufficiently describe their DL network architectures (e.g., RNN network type), and 24 did not specify their output temporality (i.e., static or dynamic) designs. Figure 1 summarizes the article extraction procedure, and Figure 2 shows the distribution of the 77 included studies by publication year. A majority of the studies (77%) were published after 2018, signaling a recent surge of interest among researchers in DL models applied to healthcare prediction tasks.
Table 2 lists the included studies by prediction task. Note that mortality, heart failure, readmission, and patient next-visit diagnosis predictions are the most studied prediction tasks, and a publicly available online dataset, the Medical Information Mart for Intensive Care (MIMIC) [35], is the most popular data source for the studies. A complete list of the included studies and their characteristics as delineated in Table 1 is available in the online supplement (Tables S2 and S3).
After reviewing the included studies, we found that the asserted contributions of researchers within the deep time series prediction literature can be distinguished and classified under the following 10 categories: (1) patient representation, (2) missing value handling, (3) DL models, (4) addressing temporal irregularity, (5) attention mechanisms, (6) incorporation of medical ontologies, (7) static data inclusion, (8) learning strategies, (9) interpretation strategies, and (10) scalability. The rest of Section 3 devotes one subsection to each of these categories, describing the associated findings. Figure 3 gives a general overview of the focal approaches adopted by the included studies.
3.1 Patient Representation
Patient representations employed for deep time series prediction in healthcare can broadly be classified into one of two categories: sequence representation and matrix representation [1]. In the former approach, each patient is represented as a sequence of medical event codes (e.g., diagnosis, procedure, or medication codes), and the input may additionally include the time interval between events (Section 3.3). Since a complete list of medical codes is generally quite long, various embedding techniques are commonly used to compress it by mapping similar medical codes to comparable vector values. In the latter approach, each patient is represented as a longitudinal matrix, where columns correspond to different medical events and rows correspond to regular time intervals. As a result, a cell in a patient matrix records the patient's medical or claims event code at a particular time point. Zhang et al. [57] followed a hybrid approach that splits the overall patient sequence of visits into multiple subsequences of equal length, then embeds the medical codes in each subsequence as a multi-hot vector. Both representations are illustrated in the sketch below.
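To make the distinction concrete, the following minimal sketch builds both representations for a toy patient; the vocabulary, events, and bin size are hypothetical and purely illustrative.

```python
import numpy as np

vocab = {"I10": 0, "E11": 1, "N18": 2, "Z95": 3}  # toy diagnosis-code vocabulary

# One patient's history as (day-offset, code) events.
events = [(0, "I10"), (0, "E11"), (35, "E11"), (35, "N18"), (70, "Z95")]

# Sequence representation: an ordered list of code indices, one entry per event.
sequence = [vocab[code] for _, code in events]          # [0, 1, 1, 2, 3]

# Matrix representation: rows are regular time bins (e.g., monthly), columns
# are medical codes; a cell marks whether the code occurred in that bin.
bin_days = 30
n_bins = max(day for day, _ in events) // bin_days + 1
matrix = np.zeros((n_bins, len(vocab)), dtype=np.float32)
for day, code in events:
    matrix[day // bin_days, vocab[code]] = 1.0          # multi-hot row per bin
```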
As seen in Table S3, sequence representation is the slightly more prevalent approach among researchers (57%). Generally, for prediction tasks with numeric inputs, such as lab tests or vital signs, sequence representation is more commonly used, while for those with categorical inputs, like diagnosis or procedure codes, matrix representation is the trend. Nevertheless, there are some exceptions. Rajkomar et al. [13] converted patient lab test results from numeric values to categories by assigning a unique token to each lab test name, value, and unit (e.g., "Hemoglobin 12 g/dL") for predicting mortality, length-of-stay, and readmission in intensive care units (ICUs). Ashfaq et al. [61] included the lab test code with a value if the value was designated as abnormal (determined according to medical domain knowledge), in addition to the typical inclusion of diagnosis and procedure codes. Several research groups [72, 80, 89] converted numerical lab test results into predesigned categories by encoding them as missing, low, normal, or high when predicting hypertension and the associated onset of high-risk cardiovascular states. Similarly, Barbieri et al. [60] transformed vital signs into OASIS severity scores, then discretized these scores into low, normal, and high categories. Of note, a single study observed the superiority of matrix representation over sequence representation for readmission prediction of chronic obstructive pulmonary disease (COPD) patients using a large AC database [1]. This study and other matrix-representation works [44, 57, 96] found that using coarser time granularities, such as weekly or monthly bins, rather than finer ones can improve performance. The same study also compared various embedding techniques and found no significant differences among them. Finally, Qiao et al. [78] summarized each numerical time series in terms of temporal measures such as its self-correlation structure, data distribution, entropy, and stationarity. They found that these measures can improve the interpretability of the extracted temporal features without degrading prediction performance.
For embedding medical events in the sequence representation, a commonly observed technique is to augment the neural network with an embedding layer that learns effective medical code representations. This technique has benefited the prediction of hospital readmission [58], patient next-visit diagnosis [66], and the onset of vascular diseases [82]. Another event embedding technique is to use a pretrained embedding layer built via probabilistic methods, especially word2vec [101] and Skip-gram [102], which have shown promising results for predicting an assortment of healthcare outcomes, such as patient next-visit diagnosis [7], heart failure [46, 51], and hospital readmission [57]. Choi et al. [7] demonstrated that pretrained embedding layers can outperform trainable layers by a 2% margin in recall for the next-visit diagnosis prediction problem. Instead of relying on individual medical codes for the next-visit diagnosis problem, several studies grouped medical codes using the first three digits of each diagnosis code, and other works implemented Clinical Classification Software (CCS) [103] to obtain groupings of medical codes [68, 73]. However, Maragatham and Devi [51] observed that pretrained embedding layers can outperform medical group coding methods by a 1.5% margin in area under the curve (AUC) for heart failure prediction. Finally, Min et al. [1] showed that, independent of the embedding approach, patient matrix representation generally outperformed sequence representation.
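A hedged sketch of the two embedding options in PyTorch: a trainable embedding layer versus one initialized from pretrained vectors. The random tensor below stands in for real word2vec/Skip-gram vectors, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

n_codes, emb_dim = 5000, 128  # hypothetical vocabulary and embedding sizes

# Option 1: a trainable embedding layer learned jointly with the network.
emb = nn.Embedding(n_codes, emb_dim, padding_idx=0)

# Option 2: a pretrained embedding (e.g., word2vec/Skip-gram vectors learned
# over code sequences) loaded as fine-tunable weights.
pretrained = torch.randn(n_codes, emb_dim)  # stand-in for real pretrained vectors
emb_pre = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=0)

codes = torch.tensor([[12, 7, 7, 430, 0]])  # one padded code sequence
visit_vec = emb_pre(codes).sum(dim=1)       # sum-pool codes into a visit vector
```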
3.2 Missing Value Handling
Missing value imputation using measures such as zero [3, 40], the median [58], forward-backward filling [64, 66], and expert domain knowledge [12, 38] has been the most common approach for handling missing values in patient time series data. The work of Lipton et al. [74] was the first study to use a masking vector, exposing the availability of values as a separate input, to predict discharge diagnosis. Other studies adopted the same approach for predicting readmission [59], acute kidney injury [93], ICU mortality [37], and length-of-stay [12]. Last, Che et al. [36] utilized missing patterns as input for predicting mortality, length-of-stay, surgery recovery, and cardiac condition. Their approach outperformed the masking vector technique by an approximately 2% margin in AUC.
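A minimal masking-vector sketch on toy data: the binary mask records which entries were actually observed and is passed to the model alongside the imputed values, making informative missingness an explicit input signal.

```python
import numpy as np

# Toy lab-test series with gaps (np.nan marks a value never recorded).
x = np.array([[7.1, np.nan, 7.4],
              [np.nan, np.nan, 140.0]])    # features x timesteps

mask = (~np.isnan(x)).astype(np.float32)   # 1 = observed, 0 = missing
x_imputed = np.nan_to_num(x, nan=0.0)      # simple zero imputation

# The model receives both the imputed values and the mask as inputs.
model_input = np.concatenate([x_imputed, mask], axis=0)
```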
3.3 DL Models
Table 3 summarizes the model architectures adopted to learn a deep patient time series prediction model in each included study. Recurrent neural networks (RNNs) and their modern variants, including long short-term memory (LSTM) and gated recurrent units (GRU), were by far the most frequently used models (84%). A few studies compared the GRU variant against the LSTM architecture. Overall, GRU achieved around a 1% advantage in AUC over LSTM for predicting heart failure [47], kidney transplantation endpoint [3], mortality in the ICU [36], and readmission of chronic disease patients [1]. However, for predicting the diagnosis code group of a patient's next admission to the ICU [68], septic shock [83], and hypertension [89], researchers did not find significant differences between these two advanced RNN model types. Additionally, bidirectional variants of GRU and LSTM—so-called Bi-GRU and Bi-LSTM—consistently outperformed their unidirectional counterparts for predicting hospital readmission [57], diagnosis at hospital discharge [66], patient next-visit diagnosis [67, 69, 75], adverse cardiac events [81], readmission after ICU discharge [59, 60], in-hospital mortality [2, 45], length-of-stay in hospital [12], sepsis [85], and heart failure [54]. Although most studies (63%) employed single-layered RNNs, many other works used multi-layered RNN models with GRU [7, 48], LSTM [40, 64, 68, 74], and Bi-GRU [2, 67, 82] architectures. However, despite the numerous studies employing these methods and their variants, multi-layered GRU is the only architecture that has been experimentally compared to its single-layered counterpart, for the patient next-visit diagnosis [7] and heart failure [48] prediction tasks. Alternatively, researchers have extensively explored training a separate LSTM [12, 38], Bi-LSTM [77], or GRU [17] layer for each feature. These channel-like, per-feature architectures were reported as being more successful than simpler RNN models. Finally, for tasks such as predicting in-hospital mortality or hospital discharge diagnosis code, some RNN models were supervised to make assessments at each timestep [12, 64, 74], a procedure known as target replication (sketched below). Their successes provided evidence that repeatedly making a prediction at multiple time points can be more effective than merely performing supervised learning on the last time-stamped entry.
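A minimal, hedged sketch of target replication: the static label is copied to every timestep so that each step contributes a local error signal during backpropagation. The architecture and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TargetReplicationLSTM(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, time, features)
        h, _ = self.rnn(x)
        return self.head(h).squeeze(-1)        # one logit per timestep

model = TargetReplicationLSTM(n_features=10)
x = torch.randn(4, 24, 10)                     # 4 stays, 24 hourly steps
y = torch.randint(0, 2, (4,)).float()          # one static label per stay

logits = model(x)                              # (4, 24)
y_rep = y.unsqueeze(1).expand_as(logits)       # replicate the label in time
loss = nn.functional.binary_cross_entropy_with_logits(logits, y_rep)
# At inference time, typically only the final timestep's prediction is used.
```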
Several studies, particularly those from when deep time series prediction within the healthcare domain was in its infancy, utilized convolutional neural network (CNN) models for prediction tasks without benchmarking against other types of DL models [18, 39, 58]. These early CNN models have been consistently outperformed by more recently developed RNN models for predicting heart failure [49, 52], readmission of patients diagnosed with chronic disease [1], in-hospital mortality [40], diabetes [49], readmission after ICU discharge [40, 59], and joint replacement surgery risk [94]. Nevertheless, Cheng et al. [18] showed that temporal slow fusion can enhance CNN performance, and Ju et al. [49] suggested using 3D-CNN and spatial pyramid pooling to outperform RNN models on heart failure and diabetes prediction tasks. Alternatively, hybrid deployments of CNN/RNN models have successfully outperformed pure CNN or RNN models for predicting readmission after ICU discharge [59], patient next-visit diagnosis [73], mortality [44], and heart failure [54].
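The hybrid pattern can be sketched minimally as a 1-D convolution feeding a GRU; this illustrates the general idea under assumed sizes, not any specific reviewed architecture.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """1-D convolutions extract local motifs; a GRU models global structure."""
    def __init__(self, n_features, n_filters=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(n_features, n_filters, kernel_size=3, padding=1)
        self.rnn = nn.GRU(n_filters, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                             # x: (batch, time, features)
        z = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, filters, time)
        h, _ = self.rnn(z.transpose(1, 2))            # back to (batch, time, filters)
        return self.head(h[:, -1])                    # predict from the last state

out = ConvGRU(n_features=10)(torch.randn(4, 48, 10))  # (4, 1) logits
```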
3.4 Addressing Temporal Irregularity
Two types of temporal irregularities, visit and feature, generally exist in patient data. Visit irregularity indicates that the time interval between visits can vary for the same patient over time. Feature irregularity occurs when different features belonging to the same patient for the same visit are recorded at various time points and frequencies.
The work of Choi et al. [7] was the first study to use the time interval between patient visits as a separate input to a DL model, for the patient next-visit diagnosis prediction task. This approach also proved efficacious in predicting heart failure [46], vascular diseases [82], hospital mortality [13], and hospital readmission [13]. Yin et al. [47] used a sinusoidal transformation of the time interval for assessing heart failure. In addition, Pham et al. [65] and Wang et al. [20] modified the internal mechanisms of the LSTM architecture to handle visit irregularity by giving higher weights to recent visits. Their proposed modifications outperformed traditional LSTM architectures by 3% in AUC for the frequently benchmarked task of predicting the diagnosis code group of a patient's next visit.
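As a hedged sketch of the interval-as-input idea, the gap since the previous visit is appended to each visit's embedding before the recurrent layer. The log1p scaling and all sizes are assumptions for illustration, not choices taken from the reviewed studies.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5000, 128, padding_idx=0)
rnn = nn.GRU(128 + 1, 64, batch_first=True)   # +1 input slot for the interval

codes = torch.tensor([[12, 430, 77]])         # one patient, three visits
days_since_prev = torch.tensor([[[0.0], [35.0], [210.0]]])  # (batch, time, 1)

x = torch.cat([emb(codes), torch.log1p(days_since_prev)], dim=-1)
h, _ = rnn(x)                                 # intervals now inform the hidden state
```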
Certain studies hypothesized that handling feature irregularity is more effective than handling visit irregularity [60, 91]. Zheng et al. [91] also modified GRU memory cell learning processes to extract a different decay pattern for each input feature when predicting the Alzheimer's severity score six months ahead. Their results demonstrated that capturing both feature and visit irregularity decreases the mean squared error (MSE) by up to 5% compared to models that capture visit irregularity only. Barbieri et al. [60] and Liu et al. [76] used a similar approach when predicting readmission to the ICU and when generating relevant medications from billing codes.
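The feature-level decay idea can be sketched as follows, in the spirit of the modified memory cells described above; the module name, parameterization, and defaults are illustrative assumptions rather than a reproduction of any cited model.

```python
import torch
import torch.nn as nn

class FeatureDecay(nn.Module):
    """Learned per-feature decay toward the empirical mean: stale observations
    fade at feature-specific rates before entering the recurrent cell."""
    def __init__(self, n_features):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_features))  # one decay rate per feature

    def forward(self, x_obs, x_last, x_mean, delta, mask):
        # delta: time since each feature was last observed; mask: 1 if observed now
        gamma = torch.exp(-torch.relu(self.w) * delta)  # decay factor in (0, 1]
        x_hat = gamma * x_last + (1 - gamma) * x_mean   # faded last value
        return mask * x_obs + (1 - mask) * x_hat        # use real value if present

fd = FeatureDecay(n_features=3)
x = torch.rand(2, 3)
mask = (torch.rand(2, 3) > 0.3).float()
filled = fd(x, x, x.mean(0).expand(2, 3), torch.rand(2, 3), mask)
```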
3.5 Attention Mechanisms
Attention mechanisms, originally inspired by the visual attention system found in human physiology, have recently become quite popular across many domains, including deep time series prediction for healthcare [57]. The core underlying idea is that patient visits and their associated medical events should not carry identical weight during the inference process. Rather, their weights are contingent on their relative importance for the prediction task at hand.
Most commonly, attention mechanisms initially assign a unique weight to each visit or each medical event, and subsequently optimize these weight parameters during network backpropagation [2, 13, 22, 37]. Also called location-based attention [69], this strategy has been incorporated into a variety of RNNs and learning tasks, such as GRU for heart failure [22] and Bi-GRU for mortality [51], as well as LSTM for hospital readmission, diagnosis, length-of-stay [13], and asthma exacerbation [99]. Other commonly used attention mechanisms include concatenation-based attention, which has been employed for hospital readmission [60] as well as next-visit diagnosis prediction [69], and general attention models, which are used primarily for hospital readmission [57] and mortality prediction [41]. Ma et al. [69] benchmarked these three attention mechanisms for predicting medical codes using a large AC database, and Suo et al. [92] performed a similar benchmarking procedure for illness severity score prediction on EHR data. Both studies reported location-based attention as optimal.
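A minimal location-based attention sketch over GRU hidden states, assuming a scalar score per timestep computed from that timestep's hidden state alone; names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LocationAttentionGRU(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)     # scalar score per timestep
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, time, features)
        h, _ = self.rnn(x)                    # (batch, time, hidden)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights
        context = (alpha * h).sum(dim=1)      # weighted sum of hidden states
        return self.head(context), alpha      # alpha doubles as an explanation

logits, alpha = LocationAttentionGRU(10)(torch.randn(4, 24, 10))
```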
With few exceptions, studies employing an attention mechanism tended not to report any differential prediction performance improvement enabled by attention. Those few studies that did distinguish a particular performance improvement reported that location-based attention mechanisms improved patient next-visit diagnosis by 4% in AUC [65], increased hospital readmission F1-score by 2.4%, and boosted mortality prediction F1-score by 13% [2]. Zhang et al. [57] was the sole work reporting the contributions of visit-level attention and medical code attention separately for hospital readmission, observing that each technique provided an approximate 4% increase in F2-score. An innovative study by Guo et al. [67] argued that not all medical codes should go through the same weight allocation path during attention calculation. Instead, they proposed a crossover attention model with distinct bidirectional GRUs and attention weights for diagnosis and medication codes. On the whole, we found that most studies utilized attention mechanisms to improve the interpretability of their proposed DL models by highlighting important visits or medical codes at either a patient or population level. Section 3.9 further elaborates on patient- and population-level interpretation.
3.6 Incorporation of Medical Ontologies
Another facet of these research streams is the incorporation of medical domain knowledge into DL models to enhance their prediction performance. Standard CCS can establish a hierarchy of various medical concepts in the form of successive parent-child relationships. Based on this concept, Choi et al. [53] employed CCS to create a medical ontology tree for use in a network embedding layer. These encoded medical ontologies were better able to represent abstract medical concepts when predicting heart failure. Zhang et al. [77] later enhanced this initial ontological strategy by considering more than one parent for each node and by providing an ordered set of ancestors for each medical concept. Separately, Ma et al. [70] showed that medical ontology trees can be leveraged when calculating attention weights in GRU models, achieving a 3% accuracy increase over Choi et al. [53] on the same prediction task. Following this, Yin et al. [47] demonstrated that causal medical knowledge graphs like KnowLife [104], which contain both "cause" and "is-caused-by" relationships between diseases, outperform both Choi et al. [53] and Ma et al. [70] by an approximate 2% AUC margin for heart failure prediction. Taking a different approach, Wang et al. [20] enhanced Skip-gram embeddings by adding n-gram tokens from medical concept information, such as disease or drug names, to EHR data. These embedded tokens captured ancestral information for a medical concept, similar to ontology trees, and were applied to the patient next-visit diagnosis task.
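To illustrate how ancestral information can enter an embedding layer, the sketch below simply mean-pools a code's embedding with its ancestors' embeddings. Published models such as those cited above instead learn attention weights over ancestors; the hierarchy, ids, and pooling here are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical CCS-style hierarchy: each leaf code lists its ancestor node ids.
ancestors = {0: [100, 110], 1: [100, 111], 2: [101, 112]}

emb = nn.Embedding(200, 64)  # embeds both leaf codes and ontology nodes

def ontology_embedding(code: int) -> torch.Tensor:
    """Represent a code as the mean of its own and its ancestors' embeddings,
    so rare leaf codes inherit information from well-observed parents."""
    ids = torch.tensor([code] + ancestors[code])
    return emb(ids).mean(dim=0)

vec = ontology_embedding(2)  # (64,) representation of a rare code
```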
3.7 Static Data Inclusion
RNNs are particularly adept at learning from sequential data, but incorporating static data into these models has been challenging. Combining static with temporal input is particularly important in a healthcare context, since static features like patient demographics and prior history can be essential for accurate predictions. Appending patient static data to the input of a final fully connected layer has been the most common approach for integrating these features; it has been applied to hospital readmission [57, 58], length-of-stay [40], and mortality [38, 40] tasks. Alternatively, Esteban et al. [3] fed 342 static features into an entirely independent feedforward neural network before combining its output with temporal data in a typical GRU layer for learning kidney transplant endpoints. Other studies adopted this approach for predicting mortality [42], phenotyping [42], length-of-stay [42], and the risk of cardiovascular diseases [80]. Moreover, Pham et al. [65] modified the internal processes of LSTM networks to specifically incorporate the effects of unplanned hospital admissions, which involve higher risks than planned admissions. They employed this approach for predicting patient next-visit diagnosis codes in mental health and diabetes cohorts. Finally, Maragatham and Devi [51] converted static data into a temporal format by repeating it as input at every time point; they used static demographic data, vascular risk factors, and a scored assessment of nursing levels for heart failure prediction. We found no study comparing the aforementioned static data inclusion methods against solid benchmarks.
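A minimal sketch of the most common inclusion strategy, under hypothetical feature counts: static features are concatenated to the last hidden state just before the final fully connected layer.

```python
import torch
import torch.nn as nn

class StaticPlusTemporal(nn.Module):
    def __init__(self, n_temporal, n_static, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_temporal, hidden, batch_first=True)
        self.head = nn.Linear(hidden + n_static, 1)  # dense layer sees both

    def forward(self, x_t, x_s):   # x_t: (batch, time, feat); x_s: (batch, static)
        h, _ = self.rnn(x_t)
        return self.head(torch.cat([h[:, -1], x_s], dim=-1))

out = StaticPlusTemporal(10, 5)(torch.randn(4, 24, 10), torch.randn(4, 5))
```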
3.8 Learning Strategies
We identified three principal learning strategies that differ from the basic supervised learning scenario: (1) cost-sensitive learning, (2) multi-task learning, and (3) transfer learning. When handling imbalanced datasets, cost-sensitive learning has frequently been implemented by modifying the cross-entropy loss function [58, 61, 93, 100] (see the sketch after this paragraph). In particular, two studies convincingly demonstrated the performance improvement achieved by cost-sensitive learning: Gao et al. [100] found a 3.7% AUC increase for neonatal encephalopathy prediction, and Ashfaq et al. [61] observed a 4% increase for the hospital readmission task. The latter study further calculated cost-saving outcomes by estimating the potential annual cost savings if an intervention were selectively offered to patients at high risk for readmission. Meanwhile, multi-task learning was implemented to jointly predict mortality, length-of-stay, and phenotyping with LSTM [13, 40], Bi-LSTM [12], and GRU [42] architectures. Harutyunyan et al. [12] was a seminal study that reported a significant contribution of multi-task learning over state-of-the-art traditional learning, with a solid 2% increase in AUC. Last, transfer learning, originally used as a benchmark evaluation by Che et al. [36], was recently adopted by Gupta et al. [43] to study both task adaptation and domain adaptation utilizing a non-healthcare model, TimeNet. They found that domain adaptation outperforms task adaptation when the data size is small, but otherwise task adaptation is superior. Moreover, they found that for task adaptation on medium-sized data, fine-tuning is a better approach than learning from scratch with feature extraction.
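A hedged sketch of the loss modification: in PyTorch, up-weighting the minority class can be done with the pos_weight argument of BCEWithLogitsLoss. The class ratio and weight below are illustrative, not values from the cited studies.

```python
import torch
import torch.nn as nn

# With ~5% positives, up-weight the minority class in the loss. The ratio
# negatives/positives is one common heuristic for choosing the weight.
pos_weight = torch.tensor([19.0])            # e.g., 95% negative / 5% positive
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 0], dtype=torch.float32)
loss = loss_fn(logits, labels)               # misses on positives cost 19x more
```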
3.9 Interpretation
By far, the most common DL interpretation method is to show visualized examples of selected patient records, highlighting which visits and medical codes most influence the prediction task [2, 13, 22, 41, 47, 49, 54, 57, 60, 66, 67, 69, 75, 82, 95, 97]. Specific contributions by feature are extracted from the calculated weight parameters of an attention mechanism (Section 3.5). Visualizations can also be implemented through a global average pooling layer [65, 82] or a one-sided convolution layer within the neural network [57]. Another interpretation approach is to report the top medical codes with the highest attention weights, for all patients together [2] or for different patient groups by disease [47, 57, 69, 80]. Specifically, Nguyen et al. [63] extracted the most frequent patterns in medical codes by disease type, and Caicedo-Torres et al. [39] identified important temporal features for mortality prediction using both DeepLIFT [105] and Shapley [106] values. The technique of using Shapley values for interpretation was also employed for continuous mortality prediction in the ICU setting [90]. Finally, Choi et al. [46] performed error analysis on false-positive and false-negative predictions to differentiate the contexts in which their DL models are more or less accurate.
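As one concrete route to such feature importance scores, the sketch below computes DeepLIFT attributions with the Captum library for a toy model; the model, baseline choice, and aggregation are assumptions for illustration, not a reproduction of the cited studies.

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift  # pip install captum

model = nn.Sequential(nn.Flatten(), nn.Linear(24 * 10, 32), nn.ReLU(),
                      nn.Linear(32, 1))
model.eval()

x = torch.randn(4, 24, 10, requires_grad=True)     # 4 patients to explain
baseline = torch.zeros_like(x)                     # all-zero reference input

attr = DeepLift(model).attribute(x, baselines=baseline)  # (4, 24, 10) scores
per_feature = attr.abs().mean(dim=(0, 1))          # rank features by |contribution|
```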
3.10 Scalability
Although most reviewed studies evaluated their proposed models on a single dataset—usually a publicly available resource such as MIMIC and its updates [35]—certain studies focused on assessing the scalability of their models to a wider variety of data. Rasmy et al. [50] evaluated one of the most popular deep time series prediction models, a two-GRU-layer architecture called RETAIN first proposed by Choi et al. [22], on a collection of 10 hospital EHR datasets for heart failure prediction. Overall, they achieved a similar AUC to the original study, although higher dimensionality did further improve prediction performance. Using the same RETAIN model, Solares et al. [56] conducted a scalability study on approximately 4 million patients in the UK National Health Service, and they reported an identical observation to that of Ju et al. [49]. Another large dataset was explored by Rajkomar et al. [13], who demonstrated the power of LSTM models on a variety of healthcare prediction tasks over 216,000 hospitalizations involving 114,000 unique patients. Finally, we found a single study [1] investigating the scalability of deep time series prediction methods for AC data, as opposed to EHR sequences. Min et al. [1] observed that DL models are effective for readmission prediction with patient EHR data, but they tend not to be superior to traditional ML models using AC data.

Studies on the MIMIC database have consistently used the same 17 features of the dataset, which have a low missing rate [107]. To address dimensional scalability, Purushotham et al. [42] used as many as 136 features for mortality, length-of-stay, and phenotype prediction with a standard GRU architecture. Compared to an ensemble constructed from several traditional ML models, they found that for lower-dimensional data, traditional ML performance is comparable to DL performance, whereas for high-dimensional data, DL's advantage is more pronounced. On a similar note, Min et al. [1] evaluated a GRU architecture against traditional supervised learning methods on around 103 million medical claims and 17 million pharmacy claims for 111,000 patients. Again, they found that strong traditional supervised ML techniques perform comparably to their DL competitors.
4 Discussion
4.1 Patient Representation
Out of the commonly used sequence and matrix patient representations, prediction tasks with predominantly numeric inputs, such as lab tests and vital signs, often rely on sequence representations, whereas studies utilizing mainly categorical inputs, like diagnosis or procedure codes, commonly adopt a matrix representation. Other than a lone study [1] that documented the superiority of the matrix approach on AC data, we found no consistent comparison between these two approaches in our systematic review. In addition, while no particular coarse-grain abstraction has been suggested for either approach, we highly recommend tuning the granularity level to find the optimum, to further ascertain their respective efficacy. The rationale is that the sparsity of temporal patient data is typically high, and considering every individual visit for an embedded patient representation may not be optimal once the corresponding increase in computational complexity is factored in.
To combine numeric and categorical input features, researchers have generally employed three distinct methods. The first converts patient numeric quantities to categorical ones by assigning a unique token to each measure; thus, each specific lab test code, value, and unit has its own identifying marker. In the second, researchers encode numeric measures with clinically meaningful names, such as missing, low, high, normal, and abnormal. The third converts numeric measures to severity scores, which are then discretized into low, normal, and high categories. The second approach was quite common in our selected studies, likely due to its implementation simplicity and effectiveness across a wide variety of clinical healthcare applications. We therefore report it to be the dominant strategy for combining numeric and categorical inputs in deep time series prediction tasks.
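A minimal sketch of the second (category-token) strategy; the function name and reference ranges are hypothetical.

```python
def lab_to_token(name: str, value: "float | None", low: float, high: float) -> str:
    """Encode a numeric lab result as a clinically meaningful category token.
    Reference ranges are illustrative, not clinical guidance."""
    if value is None:
        return f"{name}_missing"
    if value < low:
        return f"{name}_low"
    if value > high:
        return f"{name}_high"
    return f"{name}_normal"

print(lab_to_token("hemoglobin", 10.5, low=12.0, high=17.5))  # hemoglobin_low
```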
When embedding medical events into a sequence representation, we again found three prevalent techniques. In the first, researchers commonly added a separate embedding layer, prefacing the body of the recurrent network, to optimize medical code representation. Alternatively, pretrained embedding layers built with established methods such as word2vec were adopted in lieu of learning embeddings from scratch. Last, researchers often utilized medical code groups instead of atomized medical codes. Among the three practices, pretrained embedding layers have consistently outperformed naive embedding layers and medical code groupings for EHR data, whereas no significant difference in model performance has been observed for AC data. In addition, researchers have shown that temporal matrix representation is the most effective approach for AC data. The rationale is that the temporal granularity of EHR data is usually at the level of an hour or even a minute, whereas the granularity of AC data is at the day level; as a result, the order of medical codes within a day is ordinarily lost to embedding algorithms such as word2vec. Combining our findings, a sequence representation with a pretrained embedding layer is highly recommended for learning tasks on EHR data, whereas a matrix representation appears more effective for AC data.
Several important gaps exist regarding the specific representation of longitudinal patient data. Sequence and matrix methodologies should be compared in a sufficient variety of healthcare settings for EHR data. If extensive comparisons could confirm the relative performance of matrix representation, then it would further enhance its desirability, as it is easier to implement and has a faster runtime than sequences of EHR codes. Moreover, to improve patient similarity measures, researchers should analyze the effect of different representation approaches under various DL model architectures. Last, we found that few reviewed studies included both numerical and categorical measures as feature input. A superior approach that synergistically combines their relative strengths has not yet been sufficiently studied and thus requires the attention of future research. Further investigation of novel DL architectures with a variety of possible input measures is therefore recommended.
4.2 Missing Value Handling
The most common missing value handling approach found in the deep time series prediction literature was imputation by predetermined measures, such as zero or the median—also a common practice in non-healthcare domains [108]. However, missing values in healthcare data typically do not occur at random, as they can reflect specific decisions by caregivers [74]. These missing values thus represent informative missingness, providing rich information about target labels [36]. To capture this correspondence, researchers have implemented two primary approaches. The first creates a binary (masking) vector for each temporal variable, indicating the availability of data at each time point. This approach has been evaluated in various applications and seems to be an effective way of handling missing values. Second, missing patterns can be learned by directly training the imputation value as a function of either the latest observation or the empirical mean prior to variable observation. This latter approach is more effective when there is a high missing rate and a high correlation between missing values and the target variable. For instance, Che et al. [36] found that learning missing values was more effective when the average Pearson correlation between lab tests with a high rate of missingness and the dependent variable, mortality, was above 0.5. Despite this, since masking vectors have been evaluated on a wider variety of healthcare applications, and with different degrees of missingness, they remain the suggested missing value handling strategy for deep time series prediction.
Interestingly, there was no study assessing the differential impact of missingness for individual features on a given learning task. The identification of features whose exclusion or missingness most harms the prediction process informs practitioners about how to focus their data collection and imputation strategies. Furthermore, although informative missingness applies to many temporal features, missing-at-random can still be the case for other feature types. As a direction for future study, we recommend a comprehensive analysis of potential sources of missingness, for each feature and its type, along with assistance from domain experts. This would better inform a missing value handling approach within the healthcare domain and, as a consequence, enhance prediction performance accordingly.
4.3 DL Models
Rooted in their ability to efficiently represent sequential data and extract its temporal patterns [64], RNN-based DL models and their variants were found to be the most prevalent architecture for deep time series prediction on healthcare data. Patient data naturally has a sequential nature, where hospital visits or medical events occur chronologically. Lab test orders or vital sign records, for example, take place at specific timestamps during a hospital visit. However, vanilla RNN architectures are not sophisticated enough to sufficiently capture temporal dependencies when EHR sequences are relatively long, due to the vanishing gradient problem [109]. To address this issue, LSTM and GRU recurrent networks, with their memory cells and elaborate gating mechanisms, have been habitually employed by researchers, with improved outcomes on a variety of healthcare prediction tasks. Although some studies display a slight superiority of GRU architectures over LSTM networks (around a 1% increase in AUC), other studies did not find significant differences between them. Overall, LSTM and GRU have similar memory-retention mechanisms, although GRU implementations are less complex and have faster runtimes [89]. Due to this similarity, most works have used one without benchmarking it against the other. In addition, for very long EHR sequences, such as ICU admissions with a high rate of recorded medical events, bidirectional GRU and LSTM networks consistently outperformed their unidirectional counterparts. This is likely because bidirectional recurrent networks simultaneously learn from both past and future values in a temporal sequence, so they retain additional trend information [69]. This is particularly important in the healthcare context, since patient health status patterns change rapidly or gradually over time [12]. For example, an ICU patient with a rapidly fluctuating health status over the past week may eventually die, even if the patient is currently in good condition. Another patient, initially admitted to the ICU within the past week in a very bad condition, may gradually improve and survive. Therefore, bidirectional recurrent networks are the current state-of-the-art DL models for time series prediction in healthcare. GRU, which has lower complexity and comparable performance to LSTM, is the preferred model variant, although this review recommends additional comparative studies to affirm this conclusion.
Most RNN studies employed single-layer architectures; however, some studies chose increased complexity with multi-layered GRU [7, 48], LSTM [40, 64, 68, 74], and Bi-GRU [2, 67, 82] networks. Other than two earlier works [7, 48], multi-layered architectures were not consistently tested against their single-layered counterparts. Consequently, it is difficult to determine whether adding additional RNN layers, bidirectional or not, improves learning performance. However, channel-wise learning, a technique that trains a separate RNN layer per feature or feature type, successfully enhanced traditional RNN models whose network layers learn all feature parameters simultaneously. There are two underlying ideas behind this development. First, it helps identify unique patterns within each individual time series (e.g., body organ system status) [17] prior to integration with patterns found in the multivariate data. Second, channel-wise learning facilitates the identification of patterns related to informative missingness, by discovering which of the masked variables correlates strongly with other variables, target or otherwise [12]. Nevertheless, channel-wise learning needs further benchmarking against vanilla RNN models to learn the conditions under which it is most beneficial. Additionally, certain works enhanced the supervised learning process of RNN models. For prediction tasks with a static target, such as in-hospital mortality, RNN models were supervised at multiple timesteps instead of merely the final time point. This so-called target replication has been shown to be quite efficient during backpropagation [64]. Specifically, instead of passing patient target information across many timesteps, the prediction targets are replicated at each time point within the sequence, providing additional local error signals that can be individually optimized. Moreover, target replication can improve model predictions even when the temporal sequence is perturbed by small, yet significant, truncations.
As noted in Section 3.3, convolutional network models were more commonly used in the early stages of deep time series prediction for healthcare, and they were eventually shown to be consistently outperformed by recurrent models. However, recent architectural trends use convolutional layers as a complement to GRU and LSTM [44, 54, 59, 73]. The underlying idea is that RNN layers capture the global structure of the data by modeling interactions between events, whereas CNN layers, using their temporal convolution operators [54], capture local structures of the data occurring at various abstraction levels. Therefore, our systematic review suggests using CNNs to enhance RNN prediction performance rather than employing either in a stand-alone setting. Another recent trend in the literature is splitting entire temporal sequences into subsequences for various time periods—before applying convolutions of different filter sizes—to capture temporal patterns within each time period [49]. For optimal local pattern (motif) detection, slow-fusion CNN, which considers both the individual patterns of the time periods and their interactions, has been shown to be the most effective convolutional approach [18].
Several important research gaps were identified in the models used for deep time series prediction in healthcare. First, there is no systematic comparison among state-of-the-art models in different healthcare settings, such as rare versus common diseases, chronic versus nonchronic maladies, and inpatient versus outpatient visits. These different healthcare settings have identifiably heterogeneous temporal data characteristics. For instance, outpatient EHR data contains large numbers of visits with few medical events recorded per visit, whereas inpatient data contains relatively few visit records but long documented sequences of events for each visit. Therefore, the effectiveness of a given DL architecture will vary over these different clinical settings. Second, it is not clear whether adding multiple RNN or CNN layers within a given architecture can further improve model performance. The maximum number of layers observed within the reviewed studies was two. Given enough training samples, the addition of more layers may further improve performance by allowing increasingly sophisticated temporal patterns to be learned. Third, most of the reviewed studies (92%) targeted a prediction task on EHR data, whereas the generalizability of the models to AC data needs more investigation. For example, although many studies reported promising outcomes for EHR-based hospital readmission predictions using GRU models, Min et al. [1] found that similar DL architectures are ineffective for claims data. Finding novel models that can extract temporal patterns from EHR data—which are simultaneously applicable to claims data—can be an interesting future direction for transfer learning projects. Fourth, although channel-wise learning seems to be a promising new trend, researchers need to further investigate the precise temporal patterns detected by this approach; DL methods focused on interpretability would be ideal for such an application. Fifth, many studies compared their DL methods against expert domain knowledge, but a hybrid approach that leverages expert domain knowledge within the embeddings should help improve representation performance. Last, the prediction of medications, either by code or group, has been a well-targeted task. However, a more ambitious approach, such as predicting medications along with their appropriate dosage and frequency, would be a more realistic and useful target for clinical decision making in practice.
4.4 Addressing Temporal Irregularity
The most common approach for handling visit irregularity is to treat the time interval between adjacent events as an independent variable and concatenate it to the input embedding vectors. Although this technique is easy to implement, it does not consider contextual differences between recent and earlier visits. Addressing this limitation, researchers modified the internal memory cells of RNN networks to give higher weights to recent visits [20, 65]. However, a systematic comparison between the two approaches has not been explored. Therefore, the time interval approach, which has been shown to be effective in various applications, remains the most efficient, well-tested strategy for handling visit irregularity. It is noteworthy that tokenizing time intervals is also considered the most effective method of capturing duration in natural language processing [110, 111], a field of study that inspires many of the deep time series prediction methods in healthcare.
Although most works addressing irregularity focus on visit irregularity, a few studies concentrated on feature irregularity [60, 91]. A fundamental concept underpinning the difference between the two is that fine-grained temporal information is more complex, yet more important, to learn at the feature level than at the visit level. Specifically, different features expose different temporal patterns, such as when certain features decay faster than others. Paralleling the work on visit irregularity and time intervals, these studies [60, 91] modified the internal processes of RNN networks to learn a unique decay pattern for each individual input feature. Again, this research direction is relatively new and boasts few published works, so it is difficult to make a general suggestion for unilaterally handling feature irregularity in deep time series learning tasks.
Overall, approaches that adjust the memory mechanisms of recurrent networks to address either visit or feature irregularity need additional benchmarking experiments to make their case robust; currently, each has been evaluated in a single hospital setting. Therefore, optimal synergies among patient types (inpatient vs. outpatient), sequence lengths (long vs. short), and irregularity approaches (time interval vs. modifying RNN memory cells) are not entirely conclusive, but time interval approaches have been most commonly published.
4.5 Attention Mechanisms
Attention mechanisms have been employed by researchers with the premise that neither patient visits nor medical codes should contribute equally when performing a target prediction task. As such, learning attention weights for visits and codes has been the subject of many deep time series prediction studies. The three most commonly used attention mechanisms are (1) location-based, (2) general, and (3) concatenation-based frameworks. The methods differ primarily in how the learned weight parameters are connected to the model's hidden states [69]. Location-based attention schemes calculate weights from the most current hidden state alone. Alternatively, general attention calculations are based on a linear combination connecting the current hidden state to the previous hidden states, with weight parameters as the linear coefficients. Most complex is the concatenation-based attention framework, which trains a multi-layer perceptron to learn the relationship between parameter weights and hidden states. Location-based attention has been the most commonly used mechanism for deep time series prediction in healthcare. The three scoring schemes are contrasted in the sketch below.
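The following sketch contrasts the three scoring schemes on toy tensors; layer shapes and names are illustrative assumptions rather than the exact formulations of the cited works.

```python
import torch
import torch.nn as nn

hidden = 64
h_t = torch.randn(1, hidden)        # current hidden state
H = torch.randn(24, hidden)         # previous hidden states

# (1) Location-based: weights computed from the current state alone.
loc = nn.Linear(hidden, 24)
alpha_loc = torch.softmax(loc(h_t), dim=-1)

# (2) General: a bilinear form linking current and previous states.
W = nn.Linear(hidden, hidden, bias=False)
alpha_gen = torch.softmax(h_t @ W(H).T, dim=-1)

# (3) Concatenation-based: an MLP scores each (current, previous) pair.
mlp = nn.Sequential(nn.Linear(2 * hidden, 32), nn.Tanh(), nn.Linear(32, 1))
pairs = torch.cat([h_t.expand(24, -1), H], dim=-1)
alpha_cat = torch.softmax(mlp(pairs).T, dim=-1)
```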
We found several research gaps regarding attention. Most studies relied on attention mechanisms to improve the interpretability of their proposed DL model by highlighting important visits or medical codes, without evaluating the differential effect of attention on prediction performance. This is an important issue: incorporating attention into a model may improve interpretability, but its effect on performance is not yet established in the healthcare time series domain. Furthermore, with only a single exception [57], we did not find studies reporting the separate contributions of visit-level and medical-code-level attention. Last, and again with only a single exception [69], no study compared the performance or interpretability of different attention mechanisms. All of these research gaps should be investigated in a comprehensive manner in future studies, particularly for EHR data, as most prior attention studies have focused on the clinical histories of individual patients.
4.6 Incorporation of Medical Ontologies
When incorporating medical domain knowledge into deep time series prediction models, researchers have mainly utilized medical ontology trees and knowledge graphs within the embedding layers of recurrent networks. Some of the success of these approaches is due to the enhancement they provide when addressing rare diseases. Being less frequent in the data, rare diseases pose a challenge for proper representation and pattern extraction in simple RNN models. Medical domain knowledge graphs provide rare disease information to the model through ancestral node embeddings that contain hierarchical information about the disease. However, this advantage is not as pronounced when sufficient data is available for all patients over a long record history [53, 70]. Continuing research is needed to expand the innovative architectures that incorporate medical ontologies across a broad variety of prediction tasks and case studies.
4.7 Static Data Inclusion
There are four published approaches for integrating static patient data with their temporal data. By far, the most common approach is to feed a vector of static features as additional input to the final fully connected layer of a DL network. Another strategy trains a separate feedforward neural network on the static features, then adds the encoded output of this separate network to the final dense layer in the principal neural network for target prediction. Researchers have also injected static data vectors as input to each time point of the recurrent network, effectively treating the patient demographic and historical data as quasi-dynamic. Last, similar to those strategies that handle visit and feature irregularities, researchers have modified the internal memory processes of recurrent networks to incorporate specific static features as input.
The most important research gap regarding static data inclusion is that we found no study evaluating the differential effects of static data on prediction performance. Moreover, comparing these four approaches in a meaningful benchmarking setting, with the express goal of finding the optimal technique, could be an interesting future research direction. Finally, since DL models may not learn the same representation for every subpopulation of patients (e.g., male vs. female, chronic vs. nonchronic, or young vs. old), significant research gaps exist in the post hoc analysis of static feature performance as input. Such analyses could give decision makers crucial insights into model fairness and would also stimulate future research on predictive models that better balance fairness with accuracy.
4.8 Learning Strategies
Recent literature has investigated three new DL strategies: (1) cost-sensitive learning, (2) multi-task learning, and (3) transfer learning. Although many reviewed studies used an imbalanced dataset for their experiments, a select few embedded cost information as a learning strategy that incorporates an additional cost-sensitive loss. Specifically, each of these studies changed the loss function of the DL model to increasingly penalize misclassification of the minority class. In the healthcare domain, imbalanced datasets are very common, as patients with a given disease are less common than healthy patients. Moreover, most prediction tasks on the minority class lead to critical care decisions, such as identifying patients who are likely to die in the next 48 hours or those who will become diabetic in the relatively near future. Building cost-sensitive learning components into DL networks thus needs further attention and remains a wide-open research gap for future inquiry. As an example, exploring cost-sensitive methods in tandem with the traditional ML techniques of oversampling or undersampling could lead to significant increases in model prediction rates for the minority class. In addition, calculating the precise cost savings from correctly identifying the minority class of patients, similar to Ashfaq et al. [61], can further underline the importance of the cost-sensitive learning strategy.
Researchers have reported the benefit of multi-task learning by documenting its performance in a significant variety of healthcare outcome prediction tasks. However, the cited works do not distinguish the model components that exemplify why learning a single, multi-task deep model is preferable to simultaneously training multiple DL models for respective individualized prediction tasks. More specifically, we ask which layers, components, or learned temporal patterns in a DL network should be shared among different tasks, and in which healthcare applications might this strategy be most efficient? These research questions are straightforward and could be fruitfully studied in the near future with explainable DL models.
Among the three noted strategies, transfer learning was the least studied within our systematic review of the literature, with just a single study [43] displaying the effectiveness of the method for both task and domain adaptation. It is commonly assumed that, with sufficient data, trained DL models can be effective for a wider variety of prediction tasks and domains. However, in many healthcare settings, such as those serving rural patients, sufficient data is difficult to collect [112]. Transfer learning methods have the potential to make a huge impact on deep time series prediction in healthcare by making pretrained models applicable to essentially any healthcare setting. Still, further research is recommended to ascertain which pathological prediction tasks are most transferable, which network architectures are most flexible, and which model parameters require the least tuning when transferring to different domains.
4.9 Interpretation
One of the most common critiques of DL models is the difficulty of their interpretation, and researchers have attempted to alleviate this issue with five different approaches. The first approach uses feature importance measures such as Shapley values and DeepLIFT. A feature's Shapley value is the average of its contribution across all possible coalitions with other features, whereas DeepLIFT compares the activation of each neuron on the model's inputs to its default reference activation value and assigns contribution scores according to the difference [113]. Although neither of these measures can illuminate the internal procedure of DL models, they can identify which features have been used most frequently to make final predictions. A second approach visualizes what input data the model focused on for each individual patient [13] through the implementation of interpretable attention mechanisms. In particular, some studies investigated which medical visits and features contributed most to prediction performance using a network attention layer. As a clinical decision support tool, this raises clinician awareness of which medical visits deserve careful human examination. In addition to individual patient visualization, a third interpretation tactic aggregates model attention weights to calculate the most important medical features for specific diseases or patient groups. Additionally, error analysis of final prediction results allows for consideration of the medical conditions or patient groups for which a DL model might be more accurate; this fourth interpretation approach is also popular in non-healthcare domains [114]. Finally, considering each set of medical events as a basket of items and each target disease as the label, researchers extracted frequent patterns of medical events most predictive of the target disease.
Overall, this review found explainable attention to be the most commonly used strategy for interpreting deep time series prediction models evaluated on healthcare applications. Indeed, individual patient exploration can help make DL models more trustworthy to clinicians and facilitate subsequent clinical actions. Nevertheless, because implementing feature importance measures is much less complex, this study recommends consistently reporting them in most healthcare deep time series prediction studies, providing useful clinical implications with little added effort. Although individual-level interpretation is important, extracting general patterns and medical events associated with target healthcare outcomes is also beneficial for clinical decision makers, thereby contributing to clinical practice guidelines. We found just one study implementing a population-level interpretation [63], extracting frequent CNN motifs of medical codes associated with different diseases. Otherwise, researchers have broadly reported the top medical codes with the highest attention weights for all patients [2] or for different patient groups, to provide a population-level interpretation. This current limitation can be an essential direction for future research involving network interpretability.
4.10 Scalability
We identified two main findings regarding the scalability of deep time series prediction methods in healthcare. First, although DL models are usually evaluated on a single dataset with a limited number of features, some studies confirmed their scalability to large hospital EHR datasets with high dimensionality. The fundamental observation is that higher dimensionality and larger amounts of data can further enhance model performance by raising representational learning power [42]. Such studies have typically used single-layered GRU or LSTM architectures, so analyzing more advanced neural network schemas, such as those proposed in recent studies (Section 3.1), is an avenue for future research. In addition, one scalability study observed that models primarily purposed for EHR data may not be as effective with AC data [1], mainly because potent predictive features available in EHR data, such as lab test results, tend to be missing from AC datasets. Therefore, scalability studies on AC data merit further inquiry. Second, DL models are typically compared against only a single traditional supervised ML method (Table S3). However, two studies [1, 42] compared DL methods against ensembles of traditional supervised learning models, on both EHR and AC data, and found their performances comparable. This reveals an important research gap: properly comparing DL and traditional supervised learning models to identify the data settings, such as feature types, dimensionality, and missingness, in which DL models either perform comparably or excel against their traditional ML counterparts.