3.1. RNN-Based Models
Recurrent neural network (RNN)-based models were the pioneers of deep learning in the TSF realm. Because of their recursive structure, RNN-based models are well suited to time series data and to tasks that require capturing temporal dependencies. However, they remain limited in handling long-term dependencies and intricate patterns, particularly in LSTF, where vanishing or exploding gradients are likely to occur.
LSTNet [
37], proposed by Lai et al. for multivariate TSF, introduced a convolutional layer to capture short-term features of time series data, combined with an LSTM or GRU for long-term feature extraction. Because RNNs cannot capture very long-term dependencies, a recurrent-skip component is designed to record ultra-long-term patterns, with a period hyperparameter introduced to exploit the periodicity of real data. Finally, because neural networks are not robust to changes in input scale (changes in data size and dimension), which leads to fluctuating performance, an AR model is added as a linear component, making the non-linear deep learning part more robust to scale changes so that the output can respond to scale changes in the input. The DA-RNN model [
38] used a dual-stage attention mechanism to adaptively assign higher weights to highly correlated feature variables in the input stage and to find the encoder hidden states with the strongest temporal correlation across all time steps in the decoder stage. This design ensures that DA-RNN not only adaptively selects the most relevant input features but also captures the long-term temporal dependence of the time series. MQ-RNN [
32], which also adopts a Seq2Seq architecture, argues that the single value produced by point forecasting cannot reflect the model's uncertainty about the prediction, whereas probabilistic forecasts suit this situation and provide rich information for many decision-making scenarios. A novel forking-sequences approach is applied during training (a decoder is attached after every time point in the encoder), a DMS strategy is adopted, and the encoder–decoder structure is further optimized by replacing the RNN with an MLP for prediction in the decoder.
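To make this recurring design of pairing a non-linear recurrent backbone with a parallel linear AR head concrete, below is a minimal PyTorch-style sketch; the class name `LSTNetLite`, the layer sizes, and the AR window are illustrative assumptions rather than the original LSTNet implementation.

```python
import torch
import torch.nn as nn

class LSTNetLite(nn.Module):
    """Minimal sketch: a GRU captures non-linear dynamics, while a linear AR head
    over the last `ar_window` lags keeps the output sensitive to the input scale."""
    def __init__(self, n_vars: int, hidden: int = 64, ar_window: int = 8):
        super().__init__()
        self.gru = nn.GRU(n_vars, hidden, batch_first=True)
        self.nonlinear_head = nn.Linear(hidden, n_vars)
        self.ar_head = nn.Linear(ar_window, 1)   # shared across variables
        self.ar_window = ar_window

    def forward(self, x):                        # x: (batch, seq_len, n_vars)
        _, h = self.gru(x)                       # h: (1, batch, hidden)
        nonlinear = self.nonlinear_head(h[-1])   # (batch, n_vars)
        recent = x[:, -self.ar_window:, :]       # last lags of every variable
        ar = self.ar_head(recent.transpose(1, 2)).squeeze(-1)  # (batch, n_vars)
        return nonlinear + ar                    # sum of linear and non-linear parts

# usage: predict the next step of a 7-variable series from 96 past steps
y_hat = LSTNetLite(n_vars=7)(torch.randn(32, 96, 7))   # shape (32, 7)
```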
Further, mWDN [
39] introduced wavelet decomposition as part of the network with trainable parameters, rather than as a preprocessing step before modeling. The subseries obtained by mWDN are sent to corresponding LSTMs, and the results are then ensembled. MTNet [
40] introduced a memory network and redesigned the attention weights. As with LSTNet, a traditional autoregressive linear model is placed in parallel with the non-linear neural network to address the output's insensitivity to input scale. Hybrid ES-LSTM [
41] combines the exponential smoothing (ES) method with LSTM and trains the smoothing parameters jointly with the neural network. The ES method effectively captures components such as seasonality, while the neural network iteratively removes seasonal factors. By hierarchically leveraging both local and global components, sequence information is extracted and integrated simultaneously; this approach has inspired numerous subsequent models. Moreover, Fan et al. [
42] focus on multi-modal fusion, treating different historical periods as different modalities and fusing them with multi-modal attention weights for better prediction.
C2FAR [
43] is a univariate probabilistic TSF model based on DeepAR [
1], introducing the binning approach already used in CV and NLP. It first classifies the coarse bin in which the predicted value lies and then uses that as a condition to determine which finer bin the value falls into. It is worth mentioning that C2FAR addresses a practical concern regarding the diverse nature of real-world time series, which often exhibit a combination of discrete, continuous, and semi-continuous characteristics. C2FAR handles prediction through the discretization of sequence values and is suitable for forecasts that do not require high precision. Moreover, Neural ODEs are frequently used to describe the change in the hidden state over time, with the derivative of the hidden state modeling this change. Building on this, RNN-ODE-Adap [
44] incorporates the data itself into the hidden-state derivative for joint modeling and selects time steps adaptively according to local variations in the time series, capturing the underlying trend with fewer steps and achieving higher prediction accuracy at lower time complexity.
With the emergence of Transformer-based models, the Transformer's superior performance meant that RNN-based models were no longer the preferred choice for LSTF problems, and their performance has even been surpassed by MLP-based models such as DLinear [
26] and TiDE [
24]. In the LSTF setting, as the prediction horizon grows, the cumulative error of RNN-based models increases rapidly, and so does the inference time. Although many earlier efforts made great improvements, they are dwarfed by the reality of ever-longer inputs and forecast horizons. SegRNN [
45] emerged at a time when Transformer-based and MLP-based models were alternating in prominence, influenced by the patching technique used to preprocess data in both of these model families. SegRNN attributes the failure of RNNs on LSTF problems mainly to their high recurrent iteration counts. It introduces a segmentation technique into the RNN together with a parallel multi-step forecasting (PMF) strategy, replacing point-wise iterations with segment-wise iterations to reduce the number of iterations (see the sketch below). These methods greatly improve the prediction accuracy of RNN-based models. Following this, WITRAN [
46] continued to emphasize capturing long- and short-term repeating patterns. Inspired by TimesNet's reconstruction of 1D time series in 2D space to model both intra- and inter-period variations, the authors further rearrange the sequence, and multiple adaptive cycles of the input are learned through hyperparameters instead of the fast Fourier transform (FFT). Gated selection cells are designed in both the horizontal and vertical directions to merge and select information, and computational efficiency is further improved by processing the two directions of information transmission in parallel through a recurrent acceleration network. Theoretically, both WITRAN and SegRNN alleviate the vanishing and exploding gradients of RNNs by reducing the length of the transmission path, which aligns with the main direction of reviving RNN-based models for LSTF problems.
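As a rough illustration of the segment-wise recurrence behind SegRNN, the sketch below feeds segments rather than individual points through a GRU, reducing the number of recurrent steps from L to L/w and emitting all horizon steps at once in a PMF-like manner; the class name, segment length, and head design are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SegmentRNN(nn.Module):
    """Sketch of segment-wise recurrence: each GRU step consumes a whole segment
    of w points, so a length-L input needs only L // w recurrent iterations."""
    def __init__(self, seg_len: int = 12, hidden: int = 128, horizon: int = 24):
        super().__init__()
        self.seg_len = seg_len
        self.seg_embed = nn.Linear(seg_len, hidden)     # segment -> token
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        # parallel multi-step head: all horizon steps are predicted at once
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                  # x: (batch, L), univariate, L % seg_len == 0
        b, L = x.shape
        segs = x.view(b, L // self.seg_len, self.seg_len)   # (batch, n_seg, w)
        tokens = self.seg_embed(segs)                        # (batch, n_seg, hidden)
        _, h = self.gru(tokens)                              # only n_seg recurrent steps
        return self.head(h[-1])                              # (batch, horizon)

y_hat = SegmentRNN()(torch.randn(32, 96))   # 96 / 12 = 8 iterations instead of 96
```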
While RNN-based methods have gone through periods of underwhelming accuracy and forecasting capability on LSTF problems, they have lower complexity than Transformer-based models. Moreover, the structure of RNN-based models is naturally suited to processing time series data and allows long- and short-term patterns to be captured simultaneously. By adopting strategies that have proven successful in other network families, such as patching, RNN-based models may shine again. We summarize the RNN-based models in
Table 1.
3.2. CNN-Based Models
Convolutional neural networks (CNNs) were crucial to the success of deep learning in the CV realm. In the field of TSF, CNNs can leverage their efficiency in local feature extraction. Meanwhile, temporal convolutional networks (TCNs), by introducing causal convolution and dilated convolution, are well suited to sequence data and improve the handling of long sequences in some scenarios. They were designed to address problems that RNNs encounter on long sequences, such as vanishing gradients and high computational cost, and they benefit from faster modeling of long-term dependencies and the ability to compute in parallel. Even so, other network types may still outperform TCNs on long sequences.
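For reference, a dilated causal 1D convolution of the kind used in TCNs can be sketched as follows: left-only padding keeps each output dependent solely on past inputs, and doubling the dilation per layer grows the receptive field exponentially (the module name and sizes are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal convolution: the output at time t sees only inputs <= t."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, length)
        x = F.pad(x, (self.pad, 0))                    # no access to future steps
        return self.conv(x)

# stacking with dilations 1, 2, 4, 8 grows the receptive field exponentially
x = torch.randn(32, 16, 96)
for d in (1, 2, 4, 8):
    x = torch.relu(CausalConv1d(16, dilation=d)(x))   # length stays 96
```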
DSANet [
47] focuses on time series with dynamic periodic or non-periodic patterns. It feeds each univariate series independently into two parallel convolutional branches, each operating at a different scale, to model global and local complex patterns, respectively, and each branch is combined with its own self-attention module to learn dependencies between different series. DSANet also employs an autoregressive component to address the insensitivity of the neural network's output scale to its input scale and finally takes the mixture of the linear and non-linear components as the result. MLCNN [
48] places construal-level theory from psychology at the core of its design: mimicking human cognition, it uses abstract features to predict distant-future values and more concrete features to predict near-future values, improving prediction by fusing information from different future moments. To achieve this, MLCNN builds a multi-task architecture in which one main task and four auxiliary tasks learn the target future moment and its near- and distant-future values. The raw time series is turned into five intermediate feature maps at different levels through a 10-layer one-dimensional CNN; these are used as input to a subsequent LSTM-based fusion-encoder and main-decoder to obtain the non-linear part of the result, which is then merged with the linear part obtained by AR to produce the final prediction.
SCINet [
22] argues that what distinguishes time series from other sequence data (e.g., text) is that, after being downsampled into subsequences, they still retain the relevant information and temporal relations of the original series. It also argues that the causal convolution in TCNs is not only unnecessary but also self-restrictive for extracting temporal information. SCINet is therefore designed as a recursive downsample–convolve–interact architecture. It adopts a multi-layer binary-tree structure that iteratively captures information at different temporal resolutions; as the depth of the tree increases, finer-grained information is extracted. This hierarchical structure allows both short- and long-term dependencies to be extracted, and more complex sequences can be processed by stacking multiple SCINets. MICN [
49] uses a moving average similar to Autoformer [
33] and FEDformer [
34] to decompose the original sequence into trend and seasonal terms, predicts them separately, and then integrates them to obtain the result. The innovation lies in a multi-scale branch structure designed for the seasonal part, where down-sampled convolution extracts local features of the time series. Then, instead of masked self-attention, an isometric convolution module models the correlation among all local features to obtain global features: the original sequence of length S is padded with S-1 zeros at the beginning and then convolved with a kernel of size S. Integrating both CNN and Transformer modeling perspectives enhances the performance of CNN models in LSTF. From the perspective of multi-periodicity, TimesNet [
50] breaks through the limitation that the original one-dimensional structure of a time series can only represent changes between adjacent time points. Complex temporal variations are instead divided into multiple intra-period and inter-period variations. By using the fast Fourier transform (FFT) to convert 1D sequences into 2D tensors, the processing of these two kinds of variations is extended to 2D space, which enables temporal patterns to be captured with various CV backbones such as Inception. Notably, TimesNet serves as a versatile framework for time series analysis, capable of handling various tasks including TSF, classification, and anomaly detection. Rather than advancing the neural network itself, TLNet [
51] utilized a transform-based network architecture for interpretability and better performance in LSTF tasks. The input features are transformed into a domain defined by large receptive fields, and representations are then built within that domain, leaving latent space for interpretability. In the new domain, global features and structure may be easier to learn.
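As a minimal illustration of the FFT-based period detection and 1D-to-2D folding described for TimesNet, consider the sketch below; the function name, the top-k choice, and the averaging over the batch are illustrative simplifications rather than the official implementation.

```python
import torch

def fold_by_dominant_periods(x: torch.Tensor, k: int = 2):
    """x: (batch, length). Pick the k strongest frequencies via FFT amplitude,
    then reshape the series into (n_periods, period) grids so intra-period
    variation lies along rows and inter-period variation along columns."""
    b, L = x.shape
    amp = torch.fft.rfft(x, dim=-1).abs().mean(0)       # average spectrum over batch
    amp[0] = 0                                           # drop the DC component
    top_freqs = torch.topk(amp, k).indices               # dominant frequencies
    grids = []
    for f in top_freqs.tolist():
        period = max(L // max(f, 1), 1)
        n = L // period                                   # whole periods that fit
        grids.append(x[:, : n * period].reshape(b, n, period))
    return grids                                          # list of 2D foldings

series = torch.sin(torch.arange(96).float() / 24 * 6.283).repeat(8, 1)  # period ~24
print([g.shape for g in fold_by_dominant_periods(series)])
```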
MPPN [
52] focuses on automatically mining multi-resolution and multi-periodic patterns in time series and proposes an entropy-based method to evaluate the predictability of a time series, preventing the negative effects of introducing unforeseeable noise during training. For resolution, convolutions with different kernel sizes extract features from the original sequences to form representations at different resolutions. For periodic patterns, the Fourier transform maps the time series to the frequency domain, and the frequencies with the top-k amplitudes are selected as pre-defined periods. Dilated convolution is then adopted to mine multi-periodic and multi-resolution patterns, and finally the resolution and period information is concatenated to form the encoding result. FDNet [
23], with a simple structure consisting of basic linear projection layers and a CNN, uses a focal input-sequence decomposition method that divides the input into multiple contiguous subsequences according to temporal distance: the length of a subsequence decreases as its temporal distance to the predicted element shortens, with the number of subsequences determined by a hyperparameter. This design effectively addresses the long sequence time series input (LSTI) problem. PatchMixer [
7] integrates the patching data-organization strategy and the CI strategy, both of which have proven highly effective in TSF, with a CNN architecture. It still employs trend–seasonal decomposition, applies linear and non-linear methods to the two branches, respectively, and finally sums their predictions. Two convolutional branches are creatively employed to model intra- and inter-patch data, facilitating the extraction of local and global information and further enhancing performance on LSTF tasks.
ModernTCN [
53] drew inspiration from CNNs in CV that rival Transformer-based models, further exploring how to improve CNN performance in the TSF realm. Specifically, it increases the kernel size to expand the effective receptive field (ERF) and borrows architectural advantages from Transformers. In addition, ModernTCN adopts the patching strategy, splitting each variable into patches for embedding so that variable-mixing embeddings do not disrupt the subsequent modeling of relationships between variables. By cleverly decoupling the temporal part, the channel part (via depth-wise convolution), and the variable part through three types of convolution, it improves prediction performance on LSTF problems. Moreover, UniRepLKNet [
54] essentially serves as a feature-extraction network that further explores the potential of large-kernel CNNs. It keeps the ERF at the center of its design and decouples the elements of large-kernel CNN design: ensuring a large receptive field with a few large kernels, efficiently capturing local features with flexible small kernels, enhancing the abstraction of spatial patterns, and improving depth and representational capacity with efficient structures such as SE blocks. For time series data, it draws on the embedding layer of CorrFormer, transforming the data into tensors in a latent space and then simply reshaping them into a single-channel embedding map. This approach shows promising results in prediction tasks and may even replace Transformer architectures.
We have systematically reviewed the latest developments of CNN-based networks in the field of TSF over time, as summarized in
Table 2. We can observe that the advantage of CNN-based models for TSF lies in feature extraction, especially for multivariate TSF tasks. Although CNNs went through a period of silence, the effectiveness of causal convolution, the core structure of the TCN and an important advance in the TSF realm, has been questioned and challenged. In addition, large-kernel CNNs have shown promising results in CV tasks such as semantic segmentation and object detection. Researchers have taken notice and adapted such convolutions for TSF tasks, and numerous models based on this idea have been proposed, particularly in recent months, effectively addressing the LSTF issue. This architecture promises a bright future.
3.3. Transformer-Based Models
Transformer [
11] was initially proposed by Vaswani et al. in the paper “Attention is All You Need”. The original model is an encoder–decoder structure that achieved state-of-the-art performance across a wide range of NLP tasks, and models such as GPT build upon it. By treating each position in the sequence as a vector and exploiting the strength of multi-head self-attention and feedforward networks in capturing long-term dependencies, the Transformer has great potential for modeling time series. The significant breakthroughs that have reshaped the TSF field over the last three years build on this foundation by exploring properties specific to time series (as opposed to textual sequences), such as periodicity.
Early applications used the Transformer largely as-is; for example, Wu et al. applied it directly to influenza-like illness forecasting. LogTrans [
55] emphasizes the specificity of time series data: a point-to-point attention mechanism like that used in the NLP realm is insufficient, and the contextual information around each data point must be considered. Sparse attention is also used to improve efficiency: the proposed LogSparse self-attention selects only elements at time steps 1, 2, 4, 8, and so on, i.e., at exponentially growing intervals, while a local attention mechanism enhances the influence of nearby elements on the current point. In addition, LogTrans integrates convolution with the Transformer: 1D convolution extracts the surrounding information of each node in the input sequence, and multi-head attention then learns the relationships between nodes, establishing connections between parts with similar shapes. Similarly, Informer [
14] also optimizes the Transformer from the perspective of efficiency to address LSTF. Observing that attention scores follow a long-tailed distribution, it focuses on modeling the significant relationships, forming sparse query–key pairs between critical queries and keys to reduce computational overhead. A distilling operation is added between attention blocks, shortening the input length of each layer so that the model can accept longer input sequences, and a generative decoder is designed to prevent the propagation of cumulative errors at inference. TFT [
12] comprehensively considers the ingredients of a high-quality TSF model, such as static covariate information, robustness to complex and noisy series, a DMS strategy, and uncertainty intervals for predicted values. The model combines Transformer and LSTM components. From the bottom up, a variable selection network (VSN) weights variables for feature screening to reduce unnecessary noise; an LSTM replaces position embeddings to capture long- and short-term information simultaneously; a gating mechanism identifies the importance of different features and simplifies the network at the same time; a temporal self-attention layer learns long-term dependencies and provides additional insight into feature importance to aid interpretation; and, finally, quantile regression produces interval forecasts. Another interval prediction method, AST [
56], embeds the Transformer into a GAN, with the α-entmax function used to compute sparse attention weights. This sparse Transformer serves as the generator, with a quantile loss measuring the disparity between the predicted output and the ground truth as the generator's loss function. The discriminator is attached to the Transformer's decoder and classifies inputs as predicted values or ground truth, with a cross-entropy loss used to compute the adversarial loss. These two losses update the sparse Transformer and the discriminator, respectively, and the adversarial training shapes the generator's output distribution through backpropagation, mitigating error accumulation. Furthermore, Autoformer [
33] innovates on the modular architecture of Informer by embedding decomposition modules, a preprocessing method commonly used in traditional statistics, to split the series into trend and seasonal terms. To break the information-loss bottleneck caused by the sparse point-wise attention of previous models, a series-wise mechanism is proposed that uses subsequences correlated with the current series representation to inform the decision at the current time point, enhancing the capture of feature information at lower computational complexity. Motivated by Autoformer [
33], FEDformer [
34] further exploits the property that retaining a small number of frequency components can almost fully reconstruct the time-domain signal without significant information loss. The raw series is converted from the time domain to the frequency domain via the Fourier transform, and attention is performed in the frequency domain, improving computational efficiency while better capturing a global view of the series. Aliformer [
57] systematically explores the influence of known future information on prediction in real-world settings, especially in sales forecasting. Aliformer utilizes a bidirectional self-attention mechanism that lets known future information flow into the prediction. A knowledge-guided branch is designed to revise the attention map and minimize the noise introduced when future statistics are injected via learnable tokens, and the role of future knowledge is further emphasized by adding span masking in the middle of the sequence.
Pyraformer [
35] proposed a Transformer with C-ary tree-based attention, in which information is transferred along the paths of a pyramidal graph. From bottom to top, the nodes range from the fine-grained original input series to progressively coarser-grained series obtained through convolution. The tree structure allows any two nodes to interact directly, effectively balancing computational complexity and shortening the maximum signal traveling path. Preformer [
58] follows the modular structure and focuses on multi-scale construction by choosing segments of different lengths, leveraging the continuous-variation characteristics of time series. Segment-wise attention is proposed to cope with strong local patterns, computing attention scores with matrix dot products instead of vector dot products, and a new computing paradigm shifts the query and value from the perspective of periodicity to make precise predictions. For time series decomposition, ETSformer [
59] was inspired by the traditional Holt–Winters method and utilized exponential smoothing to extract the trend while incorporating a damping coefficient for stable trend generation. In other words, ETSformer improves upon Autoformer’s simple detrending operation, which only relies on moving averages of sequences. The decomposed components offer interpretable insights into the prediction results. Furthermore, ETSformer employs residual learning to construct a deep architecture for modeling complex dependencies. Triformer [
60] designed a model combining patch attention with a triangular hierarchical structure. From bottom to top, the layer input shrinks exponentially, replacing stacked attention layers that would otherwise need extra pooling layers to keep dimensions consistent, thereby preserving accuracy while achieving linear complexity. Unlike approaches that treat the patch as the input unit, complexity is reduced here only by treating a pseudo timestamp as the query within each patch, without considering the semantic features behind it. Meanwhile, since series of different variables often exhibit different temporal dynamics, a variable-specific modeling method is adopted: drawing on the idea of matrix factorization, only the most prominent characteristics that distinguish each variable from the others are learned, enabling lightweight generation of variable-specific parameters. TDformer [
61] analyzed the applicability of time-domain attention and two forms of frequency-domain attention and proposed a TSF model based on a Transformer combined with an MLP. Decomposition is still used as a preprocessing step: the trend component is modeled directly by the MLP, while the seasonal component is modeled with Fourier attention, and the two parts are combined to produce the result. Non-stationary Transformer [
62] aims to tackle the over-stationarization problem caused by conventional normalization, whereby Transformer-based models produce converging attention matrices when modeling different time series after normalization. It restores statistical characteristics through normalization of the inputs and de-normalization of the outputs, and it designs De-stationary Attention, a theoretically derived form of attention in which the attention matrix of the stationarized sequence is used to approximate that of the original non-stationary sequence. Scaleformer [
63] developed a multi-scale model-agnostic framework to sample time series at different sampling frequencies by average pooling. The framework employs a hierarchical forecasting strategy, progressing from coarse to fine-grained levels. The forecasting results at lower scales serve as inputs for the decoder at higher scales, with each scale equipped with its own forecasting module. Cross-scale normalization is incorporated to mitigate errors caused by distributions at different scales and in the data itself. Furthermore, adaptive loss, rather than Mean Squared Error (MSE) loss, is utilized for model training to alleviate error accumulation due to iterative processes. Quatformer [
64] focuses on modeling complex cyclical patterns while mitigating the quadratic complexity of LSTF tasks. To address these two challenges, learning-to-rotate attention (LRA) based on quaternions, a mathematical tool for representing rotation and orientation, is proposed, and learnable periods and phase information are introduced to describe complex periodic patterns. Meanwhile, linear complexity is achieved by decoupling global information and storing it as global memory in an additional fixed-length latent series.
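The normalize-then-denormalize idea behind such stationarization schemes (and related instance normalization such as RevIN) can be sketched generically as follows; the wrapper and the trivial linear backbone are illustrative assumptions, not the paper's exact formulation, which additionally feeds the removed statistics back into the attention computation.

```python
import torch
import torch.nn as nn

class StationarizeWrapper(nn.Module):
    """Sketch: normalize each input window per instance, forecast on the
    stationarized series, then restore the original mean/std on the output."""
    def __init__(self, backbone: nn.Module, eps: float = 1e-5):
        super().__init__()
        self.backbone = backbone    # any model mapping (B, L, C) -> (B, H, C)
        self.eps = eps

    def forward(self, x):                               # x: (batch, L, channels)
        mu = x.mean(dim=1, keepdim=True)                # per-instance statistics
        sigma = x.std(dim=1, keepdim=True) + self.eps
        y = self.backbone((x - mu) / sigma)             # forecast stationarized series
        return y * sigma + mu                           # de-normalize the outputs

# usage with a trivial linear backbone mapping 96 past steps to 24 future steps
backbone = nn.Sequential(nn.Flatten(1), nn.Linear(96 * 7, 24 * 7), nn.Unflatten(1, (24, 7)))
model = StationarizeWrapper(backbone)
print(model(torch.randn(32, 96, 7)).shape)              # torch.Size([32, 24, 7])
```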
Persistence Initialization [
65] designed a model-agnostic framework whose overall structure amounts to wrapping a ReZero-style residual connection around a Transformer-based model, consisting of a residual skip connection and a learnable scalar gating mechanism. Rotary positional encoding is used in the Transformer, and replacing the usual Pre- or Post-Layer Normalization with ReZero yields better results. W-Transformer [
66] is also a model-agnostic framework for univariate TSF. Its strategy is to decompose the series first with the MODWT algorithm, forecast the decomposed components separately, and then aggregate them via the inverse MODWT, thoroughly leveraging the MODWT's shift-invariance property.
Crossformer [
67] focuses on modeling relationships among multiple variables, aiming to capture dependencies along the time dimension and across variables simultaneously. Inspired by how the Vision Transformer (ViT) [
68] in CV splits images into patches, the input sequence is embedded into a 2D vector array through patching (unlike Triformer) with the proposed Dimension-Segment-Wise (DSW) embedding. To retain both temporal and variable-dimension information, a two-stage attention (TSA) layer captures the two kinds of dependencies, respectively, with an efficient routing attention mechanism proposed for the variable dimension. Patch merging yields hierarchical representations at different scales, and the outputs of each layer are linearly projected and summed to obtain the result. PatchTST [
6] also adopts the patching technique but, unlike Crossformer, follows the CI strategy. Each channel contains a single dimension of the multivariate series, is processed separately, and is then fed into the Transformer backbone, which is equivalent to processing a univariate series. All series share the same embedding and Transformer weights, and the predicted results are concatenated along the variable dimension. This design improves performance on LSTF tasks and suggests that simpler models may outperform complex ones. Continuing the trend of simplifying models, the innovation of TVT [
69] lies in replacing the previous Time Point Tokenization (TPT) strategy with a Time Variable Tokenization (TVT) strategy: each variable of the multivariate series is treated as a token instead of the simultaneous data points, which effectively reduces over-smoothing and strengthens the correlation between different variables. TVT also abandons sinusoidal and cosinusoidal positional embeddings, questions the effectiveness of the Transformer decoder, and replaces it with a simple linear layer. Further, Conformer [
70] integrates the previously used Fourier transform, multi-frequency sequence sampling, and recursive trend–seasonal decomposition to improve performance on LSTF tasks with complex periodic patterns. In addition, an RNN-based module is designed to model global information and is combined with the local information extracted by sliding-window attention, compensating for the loss of fitting ability caused by linear complexity. Finally, a normalizing-flow module uses the latent states generated above to directly generate the distribution of future sequences, yielding more stable predictions. CARD (Channel Aligned Dual Transformer) [
71] concentrates on a Channel-Dependent (CD) design with the patching technique for processing the input sequence, capturing temporal correlations as well as the dynamic relationships of different variables over time. Rethinking the widely used MSE loss, which weights errors at all time steps equally and thus fails to reflect the stronger correlation between near-future observations and historical ones, CARD develops a signal-decay-based loss with superior performance that weights predictions according to their importance within a finite horizon. JTFT [
72] proposed a joint time-frequency domain Transformer. Inspired by the frequency sparsity that FEDformer leverages to extract temporal dependencies, a customized discrete cosine transform (CDCT) with strong energy-compaction properties is developed to compute custom frequency-domain components, which are combined with time-domain patches to form a joint time-frequency domain representation (JTFR) as input to subsequent modules, mitigating the loss of predictability caused by non-stationarity and extracting the latest local relationships. JTFT then utilizes a two-stage approach: a Transformer encoder and low-rank attention (LRA) layers, inspired by the router mechanism in Crossformer, extract time-frequency and cross-variable dependencies, respectively. Client [
73] continues this stage's strategy of separating temporal modeling from multivariate relationship modeling into two modules, carried out by a linear model and a Transformer, respectively. Taylorformer [
74], an autoregressive probabilistic model, focuses on better capturing the target data distribution rather than efficiency, so only autoregressive modeling of the target variable is implemented. By integrating concepts from Taylor series and Gaussian processes, Taylorformer can consistently approximate continuous processes, improving model performance. GCformer [
75] combines the advantages of convolution and self-attention, adopting a two-branch design to extract local and global information. For long input sequences, a global convolutional branch operates on all elements of the input simultaneously. Three parameterization methods are introduced, including weight-decaying sub-kernels, kernel generation in the frequency domain, and a state-space-model perspective, enabling sublinear parameter growth while capturing long-term dependencies. Integrating the local attention-based module further enhances prediction accuracy. SageFormer [
76] designed a general framework applicable to various Transformer-based models, aiming to introduce inter-variable relationships effectively, bringing information gain while preventing redundant information from interfering with training. The input data are still processed with the patching technique; global information of each variable's sequence is extracted by adding global tokens to each sequence, and multi-variable relationships are then extracted through graph learning. DifFormer [
77] is a multi-task model for time series analysis that adopts a two-stream structure to extract information from the time domain and the frequency domain, respectively. Breaking with the traditional statistical view of differencing as a preprocessing technique, DifFormer uses neural differencing as a built-in module to capture temporal variances progressively and flexibly, exhibiting the desired properties across various temporal modes. DSformer [
78] incorporates a parallel structure throughout the model. The designed double sampling (DS) block performs down-sampling with larger sampling intervals to avoid the influence of local noise and obtain more global information, while in parallel the original data are segmented by piecewise sampling to obtain local information. The subsequent temporal variable attention (TVA) block also performs attention along the time and variable dimensions in parallel. Finally, a TVA block fuses the information of the two branches, and an appended MLP-based generative decoder produces the results. SBT [
79] is also a multi-task model; for TSF, it targets only multivariate single-step forecasting. It applies a sparse, binary-weighted Transformer with attention masks that focus only on the current time step's attention to reduce computational complexity, allowing the entire input sample to propagate through multiple attention layers, which preserves relevant historical information for downstream layers and thus improves performance and accuracy with a lightweight implementation. PETformer [
80] explored a novel Transformer structure and effective modeling of multivariable relationships. After processing the input data with the patching technique, it integrates the historical and future segments as Transformer inputs in a placeholder-enhanced manner, where each placeholder represents a portion of the future data to be predicted. The distinction between encoder and decoder is eliminated, allowing the predicted components to access historical sequence information more naturally. To incorporate relationships between variables, an Inter-Channel Interaction module is also included.
TACTiS [
81] combines the Transformer with a probabilistic copula model for complex real-world time series with irregular sampling and missing values. It primarily uses a copula-based decoder to mimic the properties of non-parametric copulas, which are flexible and can adapt to data with varying features and structures. Different from TACTiS, TACTiS-2 [
82] designed a dual encoder and a decoder that generate the distributional parameters for the marginal CDFs and the copula, respectively, with a two-stage training curriculum. This approach makes the number of distributional parameters grow linearly with the number of variables while avoiding the problems of mismatched training dynamics and suboptimal prediction. PrACTiS [
83] improves and extends the encoder of TACTiS by integrating the Perceiver model as the encoder to enhance the expression of covariate dependencies. Midpoint inference and local attention mechanisms are combined to address the high computational complexity associated with self-attention.
iTransformer [
84] presents an inverted perspective that rethinks the appropriate roles of Transformer components in modeling time series. Specifically, token embeddings are constructed along the variate dimension instead of the time dimension; the attention mechanism is used to learn relationships between variable sequences, while an inverted feedforward network extracts complex temporal features. BasisFormer [
85] moves beyond earlier, simpler basis-learning methods (such as generating only the corresponding basis coefficients): it obtains the basis through adaptive self-supervised learning and uses bidirectional cross-attention to compute similarity coefficients that select and consolidate the basis in the future view, achieving accurate predictions. MTST [
86] learns temporal patterns at different frequencies by controlling the patch size. Each layer of the stacked network uses a multi-branch structure in which independent Transformers process patches of different sizes, and the results of the branches are fused as the input to the next layer, learning representations of the time series at different resolutions.
From the recent volume of work, it is evident that Transformer-based models undergo rapid iterations and updates, which are compiled in detail in
Table 3 along with the relevant papers. These models have explored the TSF field extensively, and clear mainstream trends can be extracted. We examine these trends from the following perspectives:
First,
design for attention.
In the initial phase, Transformer-based models focused on improving efficiency by building sparse attention. Longformer designed three sparse attention mechanisms that reduce the Transformer's time complexity to scale linearly with sequence length, and later LogTrans [
55], AST [
56] and Informer [
14] adopted this design concept. This greatly improved the efficiency of TSF, but attention at this stage was still point-wise and did not properly account for the local continuous variations characteristic of time series.
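As a small illustration of the exponentially spaced selection behind LogSparse-style attention, the helper below lists, for a query at step t, the past positions at distances 1, 2, 4, 8, and so on; it is a simplified sketch and omits the local window that LogTrans also uses.

```python
def logsparse_indices(t: int) -> list[int]:
    """Positions a query at time step t attends to: itself plus steps at
    exponentially growing distances into the past (1, 2, 4, 8, ...)."""
    idx, step = [t], 1
    while t - step >= 0:
        idx.append(t - step)
        step *= 2
    return sorted(idx)

# each row of an attention mask would keep only O(log t) positions
print(logsparse_indices(12))   # [4, 8, 10, 11, 12]
```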
In the second stage, researchers believed that sparse point-wise attention would sacrifice information utilization, so models in this stage often adopted series/segment/patch-wise attention to reduce the number of elements participating in attention calculation, including Autoformer [
33] and Preformer [
58]. Lower complexity was realized, and the calculation formula of attention was redesigned according to the periodicity of time series. On the basis of segment-wise attention, the model
in the third stage fully redesigned the hierarchical structure of stacked attention layers, for example as a triangular or tree structure. With this approach, each node only needs to compute attention with adjacent nodes rather than with all positions of the sequence, further reducing the number of operations and thereby the complexity, as in Triformer [
60] and Pyraformer [
35]. The design of the attention mechanism
in the fourth stage is based on decoupling the dependencies among variables from the temporal dependencies in multivariate time series. Several models model temporal and cross-variable dependencies separately, so temporal attention and cross-variable attention are designed in tandem, as in DSformer [
78], or only one of them is modeled by the attention mechanism, while another one is modeled by other models such as Client [
73] and PETformer [
80], etc.
Second,
input format of time series data. NLP, centered on textual data, is widely recognized as the area in which Transformers were first developed. Typically, a segment of text is selected as the model input, each token is mapped to a vector, and the vector is combined with position embeddings, type embeddings, etc., before being fed into the model. When applying the Transformer to time series, the earliest approach was direct input: each time step is represented as a vector and combined with position embeddings, time-related embeddings, and holiday embeddings related to human activities to form the input of the Transformer-based model. Work at this stage includes InfluenzaTransformer [
4] and Informer [
14], etc.
In the second stage, the autocorrelation of the time series and its local continuous variations are fully considered by incorporating contextual information rather than treating individual points in isolation as input. As the articles above show, attaching an LSTM or a convolutional layer after the input layer helps establish the contextual information of each point in the series, cleverly using the strengths of these two networks in capturing long- and short-term dependencies to improve prediction, as in LogTrans and TFT.
In the third stage, virtually all models discussed in surveys published over the past two years are influenced by ViT, which has achieved excellent performance in CV, and adopt the patching technique to process input data. This stems from the similarities between image data and time series data: researchers in TSF regard a single pixel in an image as analogous to a single time point in a series, and studying a single point in isolation is of little use, whereas aggregation contributes to training. Concretely, the input window is divided into patches, the subsequence within each patch is treated as a whole and mapped through MLP layers, and the resulting vector serves as subsequent input. This clearly improves model efficiency by shortening the input sequence length, as in PatchTST [
6], SageFormer [
76], and PETformer [
80]. However, an increasing amount of research now concentrates on more flexible patch generation, considering multi-scale patching instead of the fixed windows used previously. The mainstream approach is to design a stacked architecture in which different layers, or different branches within the same layer, handle patches of different sizes to model different scales, as in MTST [
86]. The combination of this multi-scale patching technique with the Transformer is more flexible and has become one of the most important directions of current innovation.
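A minimal sketch of the patch-and-embed step shared by these models is shown below; the patch length, stride, and channel-independent reshaping are illustrative defaults rather than any particular paper's settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split each (channel-independent) univariate series into overlapping
    patches and map every patch to a d_model-dimensional token."""
    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x):                       # x: (batch, length, n_vars)
        b, L, c = x.shape
        x = x.permute(0, 2, 1).reshape(b * c, L)              # CI: channels become samples
        patches = x.unfold(-1, self.patch_len, self.stride)   # (b*c, n_patches, patch_len)
        return self.proj(patches)               # tokens: (b*c, n_patches, d_model)

tokens = PatchEmbed()(torch.randn(32, 96, 7))
print(tokens.shape)   # torch.Size([224, 11, 128]) -> 11 patches per channel
```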
Third,
modular design of network architecture. Starting from Informer [
14], Autoformer [
33], FEDformer [
34], mainstream models have continued the modular design. Building upon this foundation, methods originally used for preprocessing have gradually been turned into built-in modules, leading to further advances in TSF, especially for forecasting data with periodicity.
Fourth,
the modeling of relationships between multiple variables in multivariate TSF. In the initial TSF methods, the conventional approach was similar to that of univariate TSF, such as mapping multi-dimensional values into multi-dimensional vectors [
14,
33,
34,
35]. Later on, models like PatchTST found that better results can be achieved with CI, a strategy that completely disregards the relationships between variables and models each variable individually, likely because introducing multiple variables increases noise and redundancy. In recent work, some researchers maintain that capturing dependencies between variables provides useful information for prediction, but that temporal modeling must be decoupled from multivariate relationship modeling to avoid problems such as overfitting or excessive computation. For example, PETformer introduced an Inter-Channel Interaction module to retain the relationships between channels.
However, it is essential to recognize that despite the gradual improvement in predictive performance, certain models still have limitations. For instance, some models struggle to capture temporal patterns in poorly structured time series data [
35,
70]. Some may overly rely on identifying periodic features in time series data, making them less effective for training on datasets with weak periodicity [
33]. Additionally, some of them overemphasize extracting future information, thereby restricting the scope of tasks to those entailing easily obtainable or predictable future information [
57]. Therefore, when applying them to tasks in specific domains, we need to further analyze the models themselves.
3.4. MLP-Based Models
With Transformer-based models propelling the TSF realm into a period of rapid advancement, Zeng et al. questioned the effectiveness of Transformers in LSTF tasks and proposed DLinear [
26], in which the authors argued that the ordering of the time series itself is very important. Although position embeddings retain some ordering information, the self-attention mechanism is permutation-invariant and inevitably loses some temporal information. DLinear therefore simply decomposes the time series first, fits the resulting trend and seasonal terms with two linear layers, respectively, and finally merges them. On data with obvious trends it outperformed all Transformer-based models of the time, sparking a period of competition for optimal performance between MLP-based and Transformer-based models.
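Because the DLinear design is so compact, it can be sketched almost in full; the moving-average kernel size and dimensions below are illustrative choices.

```python
import torch
import torch.nn as nn

class DLinearSketch(nn.Module):
    """Sketch of the DLinear idea: moving-average decomposition, then one
    linear map per component from the input window to the forecast horizon."""
    def __init__(self, seq_len: int = 96, horizon: int = 24, kernel: int = 25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.linear_trend = nn.Linear(seq_len, horizon)
        self.linear_seasonal = nn.Linear(seq_len, horizon)

    def forward(self, x):                        # x: (batch, seq_len, n_vars)
        x = x.permute(0, 2, 1)                   # (batch, n_vars, seq_len)
        trend = self.avg(x)                      # moving-average trend component
        seasonal = x - trend                     # remainder as the seasonal component
        y = self.linear_trend(trend) + self.linear_seasonal(seasonal)
        return y.permute(0, 2, 1)                # (batch, horizon, n_vars)

print(DLinearSketch()(torch.randn(32, 96, 7)).shape)   # torch.Size([32, 24, 7])
```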
N-BEATS [
36] is a relatively early work that uses an MLP architecture for TSF, building a univariate forecasting method that is as simple and effective as possible while retaining a degree of interpretability; it serves as the foundation for many subsequent models. N-BEATS adopts an overall “stack-block” structure, and time series decomposition is achieved through multi-layer fully connected networks. With the designed doubly residual stacking, each layer fits part of the time series information, namely the residual left by previous layers, iteratively refining the prediction (see the sketch below). The interpretability of N-BEATS comes from the notion of a basis: when projecting onto the basis, the coefficients of each time series are learned flexibly, and by weighting and combining the basis functions according to these coefficients it is easier to observe which basis matters most for the current output. N-BEATSx [
87] further extends N-BEATS to accept and process covariates while keeping the influence of the covariates on the prediction interpretable. Furthermore, NHITS [
88] improves N-BEATS from the perspective of long-horizon forecasting (distinct from LSTF, this simply refers to a relatively long forecast horizon) in order to alleviate the error growth that comes with longer horizons. In brief, it incorporates input downsampling and output upsampling: the time series is divided into sequences of multiple granularities through downsampling, which reduces the number of parameters and the complexity and yields higher efficiency. For periodic sequences, DEPTS [
89] applies the Fourier transform to extract periods on top of N-BEATS, using MLPs to model periodic dependencies and thereby address the challenges of complex dependencies and multiple periodicities in periodic time series. Work at this stage benefited first from the residual learning introduced by N-BEATS, which showed the potential of building deep architectures with better expressiveness and generalization, while also exploring the periodicity of time series thoroughly; these ideas serve as references for follow-up work.
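The doubly residual idea can be illustrated with the compact sketch below, in which every block emits a backcast (subtracted from its input) and a forecast (accumulated into the output); the generic MLP blocks stand in for N-BEATS's basis-expansion blocks, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One N-BEATS-style block: an MLP that outputs a backcast and a forecast."""
    def __init__(self, seq_len: int, horizon: int, width: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(seq_len, width), nn.ReLU(),
                                 nn.Linear(width, seq_len + horizon))
        self.seq_len = seq_len

    def forward(self, x):                         # x: (batch, seq_len)
        out = self.mlp(x)
        return out[:, :self.seq_len], out[:, self.seq_len:]   # backcast, forecast

class DoublyResidualStack(nn.Module):
    """Each block explains part of the signal; the residual goes to the next block,
    and the per-block forecasts are summed into the final prediction."""
    def __init__(self, seq_len: int = 96, horizon: int = 24, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([Block(seq_len, horizon) for _ in range(n_blocks)])

    def forward(self, x):
        forecast = 0
        for block in self.blocks:
            backcast, block_forecast = block(x)
            x = x - backcast                      # remove what this block explained
            forecast = forecast + block_forecast  # accumulate partial forecasts
        return forecast

print(DoublyResidualStack()(torch.randn(32, 96)).shape)   # torch.Size([32, 24])
```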
The second stage of development has been influenced by some advanced Transformer-based models, which focused research on the LSTF problem. The emergence of DLinear [
26] has prompted researchers to question the necessity of complex models. At the same time, the Mixer structure from CV has gradually been applied to the TSF realm, promoting the trend toward structural simplification. For example, FreDO [
90] proposed a frequency-domain-based LSTF model; experimental comparison showed that multiple periodicities are easier to capture in the frequency domain and that periodicity-based mining is conducive to long-term forecasting. Like TimesNet [
50], LightTS [
27] also restructures the 1D time series into 2D and captures short-term local patterns and long-term dependencies through interval sampling and continuous sampling, respectively. An MLP then extracts the features, reducing complexity and running time while preserving accuracy. MTS-Mixers [
29] further applies the Mixer structure to TSF by designing a factorized temporal and channel mixing structure. The time series is divided into multiple subsequences, each undergoing temporal information learning independently; the subsequences are then concatenated in their original order, and redundancy in the channel dimension is addressed through matrix decomposition. TSMixer [
91] studied why multivariate TSF models sometimes fail to outperform univariate ones. To let cross-variate information assist forecasting, it introduces cross-variate feed-forward layers, extending the capabilities of linear models and modeling the time dimension and the feature dimension, respectively. TiDE [
24] designed an MLP-based encoder–decoder model with a CI strategy that utilizes both static and dynamic covariate information. Both the encoder and decoder are stacks of MLP-based residual blocks, and a global residual connection consisting of a single linear layer guarantees that TiDE can, in theory, match the performance of DLinear. Koopa [
92] innovatively combines Koopman theory with TSF, designing the model from a dynamical-systems perspective. The model itself is realized with a very simple MLP, achieving faster forecasting and more accurate results than other models. TSMixer [
25], designed by IBM, adopts the patching technique for the input data and considers intra-patch, inter-patch, and cross-variable information interaction. By using a CI backbone with cross-channel forecast reconciliation heads, it achieves better performance than other mixing methods. FITS [
93] follows an idea similar to DLinear, except that it operates in the frequency domain: the time series is converted to the frequency domain with the discrete Fourier transform (DFT), passed through a complex-valued linear transformation, and finally brought back to the time domain with the inverse Fourier transform to obtain the result. TFDNet [
94] adopts a branching structure overall, capturing long-term latent patterns and temporal periodicity from the time domain and the frequency domain, respectively, while combining Channel-Dependent (CD) strategy and trend-seasonal decomposition method. FreTS [
95] likewise applies MLPs in the frequency domain, handling the real and imaginary parts of the complex values separately before stacking the outputs to obtain the result. This is performed on both the channel dimension and the temporal dimension in the frequency domain.
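A minimal sketch of this frequency-domain recipe (transform, apply a complex-valued linear layer, transform back) is given below; it assumes PyTorch's support for complex linear layers, and the interpolation to a longer spectrum follows the general idea rather than any single paper's exact code.

```python
import torch
import torch.nn as nn

class FreqLinearSketch(nn.Module):
    """Sketch: rFFT -> complex linear map to a longer spectrum -> inverse rFFT,
    so the reconstructed series spans the input window plus the forecast horizon."""
    def __init__(self, seq_len: int = 96, horizon: int = 24):
        super().__init__()
        self.horizon = horizon
        self.out_len = seq_len + horizon
        in_freqs, out_freqs = seq_len // 2 + 1, self.out_len // 2 + 1
        self.freq_map = nn.Linear(in_freqs, out_freqs, dtype=torch.cfloat)
        self.scale = self.out_len / seq_len      # compensate energy for the longer output

    def forward(self, x):                        # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)         # complex spectrum of the input
        spec = self.freq_map(spec) * self.scale  # complex-valued linear transformation
        y = torch.fft.irfft(spec, n=self.out_len, dim=-1)
        return y[:, -self.horizon:]              # keep only the forecast horizon

print(FreqLinearSketch()(torch.randn(32, 96)).shape)   # torch.Size([32, 24])
```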
So far, we have summarized the achievements of the four network architectures in recent years, as shown in
Table 4. The development of TSF architectures, from RNN/CNN-based models to Transformer-based models and then to the popularity of MLP-based models, presents a process of structural simplification. However, this kind of simplification is not unique to TSF. We can observe that after N-BEATS [
36] proposed performing TSF with a fully connected architecture, models were, for almost two years, extended mainly toward longer forecasting horizons or additional functionality. No markedly more advanced models appeared until MLP-Mixer achieved performance comparable to CNNs and ViT [
68] on ImageNet in CV with a pure MLP architecture and then achieved competitive results on NLP tasks with a small number of parameters; research on MLP-based models has since deepened. We observe that recent models carefully incorporate innovations from other architectures, especially Transformer-based ones, such as processing data in the frequency domain and reconstructing time series in 2D. All of these developments go hand in hand with innovations in other deep learning fields.