
Is Mamba Effective for Time Series Forecasting?

Zihan Wang 2310744@stu.neu.edu.cn    Fanheng Kong kongfanheng@stumail.neu.edu.cn    Shi Feng fengshi@cse.neu.edu.cn    Ming Wang sci.m.wang@gmail.com    Xiaocui Yang yangxiaocui@stumail.neu.edu.cn    Han Zhao 2272065@stu.neu.edu.cn    Daling Wang wangdaling@cse.neu.edu.cn    Yifei Zhang zhangyifei@cse.neu.edu.cn Department of Computer Science and Engineering, Northeastern University, Shenyang, China
Abstract

In the realm of time series forecasting (TSF), it is imperative for models to adeptly discern and distill hidden patterns within historical time series data to forecast future states. Transformer-based models exhibit formidable efficacy in TSF, primarily attributed to their advantage in apprehending these patterns. However, the quadratic complexity of the Transformer leads to low computational efficiency and high costs, which somewhat hinders the deployment of TSF models in real-world scenarios. Recently, Mamba, a selective state space model, has gained traction due to its ability to process dependencies in sequences while maintaining near-linear complexity. For TSF tasks, these characteristics enable Mamba to comprehend hidden patterns as the Transformer does while reducing computational overhead. Therefore, we propose a Mamba-based model named Simple-Mamba (S-Mamba) for TSF. Specifically, we tokenize the time points of each variate autonomously via a linear layer. A bidirectional Mamba layer is utilized to extract inter-variate correlations, and a Feed-Forward Network is set to learn temporal dependencies. Finally, the forecast results are generated through a linear mapping layer. Experiments on thirteen public datasets show that S-Mamba maintains low computational overhead and achieves leading performance. Furthermore, we conduct extensive experiments to explore Mamba’s potential in TSF tasks. Our code is available at https://github.com/wzhwzhwzh0921/S-D-Mamba.

keywords:
Time Series Forecasting, State Space Model, Mamba, Transformer

1 Introduction

Time series forecasting (TSF) involves leveraging historical information from a lookback sequence to forecast future states [15], as shown in Fig. 1. These data often contain built-in patterns, including temporal dependencies (TD), e.g., morning and evening peak patterns in traffic forecasting tasks, and inter-variate correlations (VC), e.g., correlation patterns between temperature and humidity in weather forecasting tasks. Discerning and distilling these patterns from time series data leads to better forecasts [5].

Refer to caption
Figure 1: An example of Time Series Forecasting. Lines of different colors represent different variates, with solid lines indicating the historical changes of variates, and dotted lines indicating the future changes that need to be forecasted.

The Transformer [52] exhibits formidable efficacy in TSF, primarily attributed to its inherent advantage in apprehending TD and VC. Numerous Transformer-based models with impressive capabilities have been introduced [59, 72], yet the Transformer architecture faces distinct challenges. Foremost is its quadratic computational complexity, which leads to a dramatic increase in computational overhead as the number of variates or the lookback length increases. This hinders the deployment of Transformer-based models in real-world TSF scenarios that require processing large amounts of data simultaneously or have strict real-time requirements. Many models attempt to reduce the computational complexity of the Transformer in TSF by modifying its structure, for example by focusing only on a portion of the sequence [28, 71, 31], but the resulting loss of information may also lead to performance degradation. A more promising approach uses linear models instead of the Transformer [32, 66], which possess linear computational complexity. However, linear models rely solely on linear numerical calculations, do not incorporate in-context information, and are suboptimal compared to state-of-the-art Transformer models; they achieve accurate forecasts only when sufficient input information is available [66].

State Space Models (SSMs) [24, 51] demonstrate potential for simultaneously optimizing performance and computational complexity. SSMs employ convolutional computation to capture sequence information and avoid the sequential hidden-state recursion, allowing them to benefit from parallel computing and achieve near-linear complexity in processing speed. Rangapuram et al. [46] attempt to employ SSMs for TSF, but the SSM architecture they use cannot identify and filter content effectively, and the captured dependencies are based solely on distance, resulting in unsatisfactory performance. Mamba [22] introduces a selective mechanism into the SSM, enabling it to discern valuable information like the attention mechanism. Numerous researchers have developed models based on Mamba [73, 61], demonstrating its considerable potential across both text and image domains. These Mamba-based models achieve a synergistic balance between enhanced performance and computational efficiency. Consequently, we are motivated to further explore the potential of Mamba in TSF.

We propose a Mamba-based model, Simple-Mamba (S-Mamba), for TSF tasks. In S-Mamba, the time points of each variate are tokenized by a linear layer. Subsequently, a Mamba VC (inter-variate correlation) Encoding Layer encodes the VC by utilizing a bidirectional Mamba to leverage global inter-variate mutual information. A Feed-Forward Network (FFN) TD (temporal dependency) Encoding Layer containing a simple FFN follows to extract the TD. Ultimately, a mapping layer outputs the forecast results. Experimental results on thirteen public datasets from the traffic, electricity, weather, finance, and energy domains demonstrate that S-Mamba not only has low GPU memory and training time requirements but also maintains superior performance compared to state-of-the-art TSF models. Concurrently, extensive experiments are conducted to assess the efficacy and potential of Mamba in TSF tasks; for instance, we evaluate whether Mamba demonstrates generalization capabilities comparable to those of the Transformer when handling TSF data. Our contributions are summarized as follows:

  • We propose S-Mamba, a Mamba-based model for TSF, which delegates the extraction of inter-variate correlations and temporal dependencies to a bidirectional Mamba block and a Feed-Forward Network.

  • We compare the performance of the S-Mamba against representative and state-of-the-art models in TSF. The results confirm that S-Mamba not only delivers superior forecast performance but also requires less computational resources.

  • We conduct extensive experiments mainly focusing on exploring the characteristics of Mamba when facing TSF data to further discuss the potential of Mamba in TSF tasks.

2 Related Work

In conjunction with our work, two main areas of related work are investigated: (1) time series forecasting, and (2) applications of Mamba.

2.1 Time Series Forecasting

There have been two main architectures for TSF approaches, which are Transformer-based models [34, 43, 67] and linear models [5, 40, 48].

2.1.1 Transformer-based Models

Transformers are primarily designed for tasks that involve processing and generating sequences of tokens [52]. The excellent performance of Transformer-based models has also attracted numerous researchers to time series forecasting tasks [3]. The Transformer is utilized by Duong-Trung et al. [18] to solve the persistent challenge of long multi-horizon time series forecasting. Time Absolute Position Encoding (tAPE) and an Efficient implementation of Relative Position Encoding (eRPE) are proposed in [20] to solve the position encoding problem encountered by the Transformer in multivariate time series classification (MTSC). Wang et al. [53] replace the standard convolutional layer with a dilated convolutional layer and propose Graphformer to efficiently learn complex temporal patterns and dependencies between multiple variates. Some researchers have also considered the application of Transformer-based time series forecasting models in specific domains, such as piezometric level prediction [42], forecasting crude oil returns [1], and predicting the power generation of solar panels [49].

While Transformers excel at capturing long-range dependencies in text, they may not be as effective in modeling sequential patterns. The content-based attention used in Transformers is not effective at detecting essential temporal dependencies, especially for time series data with dependencies that weaken over time and strong seasonality patterns [56]. In particular, the predictive capability and robustness of Transformer-based models may decrease rapidly when the input sequence is too long [55]. Moreover, the $O(N^{2})$ time complexity makes Transformer-based models cost more computation and GPU memory resources. In addition, the previously mentioned issue of position encoding is also a challenge that deserves attention.

2.1.2 Linear Models

In addition to Transformer-based models, many researchers are keen to perform time series forecasting with linear models [5]. Chen et al. [11] propose TSMixer, an all-MLP architecture that efficiently utilizes cross-variate and auxiliary information to improve time series forecasting performance. LightTS [68] is dedicated to solving multivariate time series forecasting problems and can efficiently handle very long input series. Wang et al. [54] propose Time Series MLP to improve the efficiency and performance of multivariate time series forecasting. Yi et al. [64] explore MLPs in the frequency domain for time series forecasting and propose FreTS, a novel architecture that includes two phases: domain conversion and frequency learning.

Compared to Transformer-based models, MLP-based models are simpler in structure, less complex, and more efficient. However, MLP-based models also suffer from a number of shortcomings. In the presence of high volatility and non-periodic, non-stationary patterns, MLPs that rely only on past observed temporal patterns perform unsatisfactorily [11]. In addition, MLPs are worse at capturing global dependencies than Transformers [64] and need longer inputs than Transformer-based models.

2.2 Applications of Mamba

As a new architecture, Mamba [22] swiftly attracted the attention of a large number of researchers in Natural Language Processing (NLP), Computer Vision (CV), and other Artificial Intelligence communities.

2.2.1 Mamba in Natural Language Processing

Pióro et al. [45] and Anthony et al. [4] replace the Transformer in Mixture-of-Experts (MoE) architectures with Mamba, surpassing the performance of both vanilla Mamba and Transformer-MoE. Mamba has demonstrated strong performance in clinical note generation [62]. Jiang et al. [27] replace Transformers with Mamba and demonstrate that Mamba can match or outperform Transformers on speech separation tasks with fewer parameters. Empirical evidence from simple NLP tasks (such as translation) shows that Mamba can be an efficient alternative to the Transformer for in-context learning tasks with long input sequences [21].

2.2.2 Mamba in Computer Vision

Mamba has been used to solve the long-range dependency problem in biomedical image segmentation tasks [39]. Cao et al. [8] propose a local-enhanced vision Mamba block named LEVM to improve local information perception, achieving state-of-the-art results on multispectral pansharpening and multispectral and hyperspectral image fusion tasks. The Fusion-Mamba block [17] is designed to map features from images of different types (such as RGB and IR) into a hidden state space for interaction and to enhance the representation consistency of features. Liu et al. [38] utilize the proposed HSIDMamba and a Bidirectional Continuous Scanning Mechanism to improve the capture of long-range and local spatial-spectral information and improve denoising performance. In addition, Mamba has also been used in small target detection [12], medical image reconstruction [26] and classification [65], hyperspectral image classification [63], etc.

2.2.3 Mamba in Others

In addition to the two single modalities described, the application of Mamba to multimodal tasks has received a lot of attention. VideoMamba [30] achieves efficient long-term modeling using Mamba’s linear complexity operator, showing advantages on long video understanding tasks. Zhao et al. [70] extend Mamba to a multi-modal large language model to improve the efficiency of inference, achieving comparable performance to LLaVA [35] with only about 43% of the number of parameters.

Furthermore, Mamba’s sequence modeling capabilities have also received attention from researchers. Schiff et al. [47] extend long-range Mamba to a BiMamba component that supports bi-directionality, and to a MambaDNA block as the basis of long-range DNA language models. Mamba has also been shown to be effective at predicting sequences of sensor data [6] and stock prices [50]. A Sequence Reordering Mamba [60] is proposed to exploit the valuable information inherently embedded within long sequences. Ahamed and Cheng [2] propose the Mamba-based TimeMachine to capture long-term dependencies in multivariate time series data.

As can be seen from these applications, Mamba can effectively reduce parameter counts and improve inference efficiency while achieving similar or better performance. It captures global dependencies well within a lightweight structure and has a better sense of positional relationships. In addition, the Mamba architecture is more robust. Furthermore, Mamba's performance on sequence modeling tasks inspires us to explore whether it can effectively mitigate the issues faced by Transformer-based and linear models on TSF tasks.

3 Preliminaries

3.1 Problem Statement

In time series forecasting tasks, the model receives a history sequence $U_{in}=[u_1,u_2,\ldots,u_L]\in\mathbb{R}^{L\times V}$ as input, where each time point $u_n=[p_1,p_2,\ldots,p_V]$, and then uses this information to predict a future sequence $U_{out}=[u_{L+1},u_{L+2},\ldots,u_{L+T}]\in\mathbb{R}^{T\times V}$. Here $L$ and $T$ are referred to as the lookback window and prediction horizon, respectively, representing the lengths of the past and future time windows, while $p$ denotes a variate and $V$ is the total number of variates.
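As a concrete illustration of this setup (a sketch we add for clarity, not code from the paper), the snippet below slices a multivariate series into $(U_{in}, U_{out})$ pairs with lookback length $L$ and horizon $T$:

```python
import numpy as np

def make_windows(series: np.ndarray, L: int, T: int):
    """Slice a (timesteps, V) multivariate series into (U_in, U_out) pairs.

    Returns arrays of shape (num_samples, L, V) and (num_samples, T, V).
    """
    inputs, targets = [], []
    for start in range(len(series) - L - T + 1):
        inputs.append(series[start:start + L])           # lookback window U_in
        targets.append(series[start + L:start + L + T])  # prediction horizon U_out
    return np.stack(inputs), np.stack(targets)

# Toy example: 1000 timesteps, V = 7 variates, L = 96, T = 96.
series = np.random.randn(1000, 7)
U_in, U_out = make_windows(series, L=96, T=96)
print(U_in.shape, U_out.shape)  # (809, 96, 7) (809, 96, 7)
```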

3.2 State Space Models

State Space Models can represent any cyclical process with latent states. By using one set of first-order differential equations to represent the evolution of the system's internal state and another to describe the relationship between latent states and output sequences, an input sequence $x(t)\in\mathbb{R}^{D}$ can be mapped to an output sequence $y(t)\in\mathbb{R}^{N}$ through latent states $h(t)\in\mathbb{R}^{N}$ as in (1):

\begin{align}
h'(t) &= \mathbf{A}h(t) + \mathbf{B}x(t), \tag{1}\\
y(t) &= \mathbf{C}h(t),
\end{align}

where $\mathbf{A}\in\mathbb{R}^{N\times N}$ and $\mathbf{B},\mathbf{C}\in\mathbb{R}^{N\times D}$ are learnable matrices. The continuous system is then discretized with a step size $\Delta$, and the discretized SSM is represented as (2):

\begin{align}
h_t &= \overline{\mathbf{A}}h_{t-1} + \overline{\mathbf{B}}x_t, \tag{2}\\
y_t &= \mathbf{C}h_t,
\end{align}

where $\overline{\mathbf{A}}=\exp(\Delta\mathbf{A})$ and $\overline{\mathbf{B}}=(\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-I)\cdot\Delta\mathbf{B}$. After transitioning from the continuous form $(\Delta,\mathbf{A},\mathbf{B},\mathbf{C})$ to the discrete form $(\overline{\mathbf{A}},\overline{\mathbf{B}},\mathbf{C})$, the model can be computed efficiently using a linear recursive approach [25]. The structured state space model (S4) [24], originating from the vanilla SSM, utilizes HiPPO [23] for initialization to add structure to the state matrix $\mathbf{A}$, thereby improving long-range dependency modeling.
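To make the discretization concrete, the following sketch (our own toy illustration with arbitrarily chosen shapes and a stable random system, not code from the paper) applies the zero-order-hold formulas above and runs the recurrence of Eq. (2); here $\mathbf{C}$ is given shape $(D, N)$ so that the products are well-defined and $y_t$ has $D$ outputs:

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1}(exp(dA) - I) . dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

rng = np.random.default_rng(0)
N, D = 8, 3                        # state size N, input dimension D
A = -np.eye(N)                     # stable toy state matrix
B = rng.standard_normal((N, D))
C = rng.standard_normal((D, N))    # maps the hidden state back to D outputs
A_bar, B_bar = discretize(A, B, delta=0.1)

# Run the discrete recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
x = rng.standard_normal((50, D))   # a toy input sequence of length 50
h = np.zeros(N)
ys = []
for x_t in x:
    h = A_bar @ h + B_bar @ x_t
    ys.append(C @ h)
print(np.stack(ys).shape)          # (50, 3)
```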

Algorithm 1 The process of the Mamba Block

Input: $\bm{X}:(B,V,D)$
Output: $\bm{Y}:(B,V,D)$

1:  $x,z:(B,V,ED)\leftarrow\mathrm{Linear}(\bm{X})$  {Linear projection}
2:  $x':(B,V,ED)\leftarrow\mathrm{SiLU}(\mathrm{Conv1D}(x))$
3:  $\mathbf{A}:(D,N)\leftarrow Parameter$  {Structured state matrix}
4:  $\mathbf{B},\mathbf{C}:(B,V,N)\leftarrow\mathrm{Linear}(x'),\mathrm{Linear}(x')$
5:  $\Delta:(B,V,D)\leftarrow\mathrm{Softplus}(Parameter+\mathrm{Broadcast}(\mathrm{Linear}(x')))$
6:  $\overline{\mathbf{A}},\overline{\mathbf{B}}:(B,V,D,N)\leftarrow\mathrm{discretize}(\Delta,\mathbf{A},\mathbf{B})$  {Input-dependent parameters and discretization}
7:  $y:(B,V,ED)\leftarrow\mathrm{SelectiveSSM}(\overline{\mathbf{A}},\overline{\mathbf{B}},\mathbf{C})(x')$
8:  $y':(B,V,ED)\leftarrow y\otimes\mathrm{SiLU}(z)$
9:  $\bm{Y}:(B,V,D)\leftarrow\mathrm{Linear}(y')$  {Linear projection}
Refer to caption
Figure 2: The structure of selective SSM (Mamba).

3.3 Mamba Block

Mamba [22] introduces a data-dependent selection mechanism into S4 and incorporates hardware-aware parallel algorithms in its recurrent mode. This mechanism enables Mamba to capture contextual information in long sequences while maintaining computational efficiency. As a sequence model with approximately linear complexity, Mamba demonstrates potential on long-sequence tasks, compared to Transformers, in terms of both efficiency and performance. The details are presented in Alg. 1 and Fig. 2: the former illustrates the complete data processing procedure, while the latter depicts how the output at sequence position $t$ is formed. The Mamba layer takes a sequence $\bm{X}\in\mathbb{R}^{B\times V\times D}$ as input, where $B$ denotes the batch size, $V$ the number of variates, and $D$ the hidden dimension.

The block first expands the hidden dimension to $ED$ through linear projection, obtaining $x$ and $z$. It then processes $x$ with a convolution and a SiLU [19] activation function to obtain $x'$. The discretized SSM with input-dependent parameters, which forms the core of the Mamba block, then generates the state representation $y$ from $x'$. Finally, $y$ is gated by $z$ passed through an activation in a residual-style connection, and the final output at each time step $t$ is obtained through a linear transformation. In summary, the Mamba block effectively handles sequential information by leveraging selective state space models and input-dependent adaptation. The parameters of the Mamba block include an SSM state expansion factor $N$, a convolutional kernel size $k$, and a block expansion factor $E$ for the input-output linear projections; larger values of $N$ and $E$ incur higher computational cost. The final output of the Mamba block is $\bm{Y}\in\mathbb{R}^{B\times V\times D}$.
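To ground Alg. 1, the sketch below is a deliberately simplified, didactic PyTorch re-implementation of a Mamba-style block rather than the official hardware-aware kernel: the selective scan is an explicit Python loop, $\Delta$ is shared across channels, and $\overline{\mathbf{B}}$ uses a simplified Euler-style update instead of the exact zero-order hold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMambaBlock(nn.Module):
    """Didactic re-implementation of Alg. 1 with an explicit sequential scan.
    Input/output shapes follow the algorithm: (B, V, D) -> (B, V, D)."""

    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        d_inner = expand * d_model
        self.d_inner, self.d_state = d_inner, d_state
        self.in_proj = nn.Linear(d_model, 2 * d_inner)                 # step 1: x, z
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                padding=d_conv - 1, groups=d_inner)    # step 2: causal conv
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1, dtype=torch.float))
                 .repeat(d_inner, 1))                                  # step 3: structured A
        self.x_proj = nn.Linear(d_inner, 2 * d_state + 1)              # steps 4-5: B, C, Delta
        self.out_proj = nn.Linear(d_inner, d_model)                    # step 9: back to D

    def forward(self, u):                                    # u: (B, V, D)
        batch, seq_len, _ = u.shape
        x, z = self.in_proj(u).chunk(2, dim=-1)              # (B, V, ED) each
        x = self.conv1d(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        x = F.silu(x)                                        # x': (B, V, ED)
        A = -torch.exp(self.A_log)                           # negative for a stable scan
        B_ssm, C_ssm, dt = torch.split(
            self.x_proj(x), [self.d_state, self.d_state, 1], dim=-1)
        delta = F.softplus(dt)                               # Delta: (B, V, 1), shared over channels
        A_bar = torch.exp(delta.unsqueeze(-1) * A)           # (B, V, ED, N)
        B_bar = delta.unsqueeze(-1) * B_ssm.unsqueeze(2)     # simplified (Euler-style) B_bar
        h = x.new_zeros(batch, self.d_inner, self.d_state)
        outputs = []
        for t in range(seq_len):                             # steps 6-7: selective scan
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            outputs.append((h * C_ssm[:, t].unsqueeze(1)).sum(-1))   # y_t: (B, ED)
        y = torch.stack(outputs, dim=1) * F.silu(z)          # step 8: gate with SiLU(z)
        return self.out_proj(y)                              # step 9: (B, V, D)
```

In S-Mamba the "sequence" axis fed to this block is the variate axis, so $V$ here plays the role of the scanned sequence length.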

Refer to caption
Figure 3: Overall framework of S-Mamba. The left side of the figure presents the overall architecture of our model; the right side details the components of the S-Mamba block.

4 Methodology

In this section, we provide a detailed introduction to S-Mamba. Fig. 3 illustrates its overall structure, which is primarily composed of four layers. The first, the Linear Tokenization Layer, tokenizes the time series with a linear layer. The second, the Mamba inter-variate correlation (VC) Encoding Layer, employs a bidirectional Mamba block to capture mutual information among variates. The third, the FFN temporal dependency (TD) Encoding Layer, further learns temporal sequence information and generates future series representations with a Feed-Forward Network. The final layer, the Projection Layer, maps the information processed by the preceding layers to the model's forecast. Alg. 2 demonstrates the operation of S-Mamba.

Algorithm 2 The Forecasting Procedure of S-Mamba

Input: $Batch(U_{in})=[u_1,u_2,\ldots,u_L]:(B,L,V)$
Output: $Batch(U_{out})=[u_{L+1},u_{L+2},\ldots,u_{L+T}]:(B,T,V)$

1:  Linear Tokenization Layer:
2:  $Batch(U_{in}^{\top}):(B,V,L)\leftarrow\mathrm{Transpose}(Batch(U_{in}))$
3:  $\bm{U}^{tok}:(B,V,D)\leftarrow\mathrm{LinearTokenize}(Batch(U_{in}^{\top}))$  {Tokenization}
4:  for $l$ in Mamba Layers do
5:     Mamba VC Encoding Layer:
6:     $\overrightarrow{\bm{Y}}:(B,V,D)\leftarrow\overrightarrow{\mathrm{Mamba\ Block}}(\bm{U})$
7:     $\overleftarrow{\bm{Y}}:(B,V,D)\leftarrow\overleftarrow{\mathrm{Mamba\ Block}}(\bm{U})$
8:     $\bm{Y}:(B,V,D)\leftarrow\overrightarrow{\bm{Y}}+\overleftarrow{\bm{Y}}$  {Fusion of bidirectional information}
9:     $\bm{U}':(B,V,D)\leftarrow\bm{Y}+\bm{U}$  {Residual connection}
10:    FFN TD Encoding Layer:
11:    $\bm{U}':(B,V,D)\leftarrow\mathrm{LayerNorm}(\bm{U}')$
12:    $\bm{U}':(B,V,D)\leftarrow\mathrm{FeedForward}(\bm{U}')$
13:    $\bm{U}':(B,V,D)\leftarrow\mathrm{LayerNorm}(\bm{U}')$
14:  end for
15:  Projection:
16:  $\bm{U}':(B,V,T)\leftarrow\mathrm{Projection}(\bm{U}')$
17:  $Batch(U_{out}):(B,T,V)\leftarrow\mathrm{Transpose}(\bm{U}')$

4.1 Linear Tokenization Layer

The input for the Linear Tokenization Layer is $U_{in}$. Similar to iTransformer [37], we commence by tokenizing the time series, a method analogous to the tokenization of sequential text in natural language processing, to standardize the temporal series format. This pivotal task is executed by a single linear layer in Eq. (3):

$$\bm{U}=\mathrm{Linear}(Batch(U_{in})),\tag{3}$$

where $\bm{U}$ is the output of this layer.
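Concretely, this amounts to one linear map, shared across variates, from the $L$ time points of each variate to a $D$-dimensional token; the shapes in the short sketch below are toy values of our own choosing:

```python
import torch
import torch.nn as nn

B, L, V, D = 32, 96, 7, 128                # toy batch, lookback, variates, hidden size
tokenize = nn.Linear(L, D)                 # one linear layer shared by all variates
U_in = torch.randn(B, L, V)                # a batch of lookback windows
U = tokenize(U_in.transpose(1, 2))         # (B, V, D): one token per variate
print(U.shape)                             # torch.Size([32, 7, 128])
```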

4.2 Mamba VC Encoding Layer

Within this layer, the primary objective is to extract the VC by linking variates that exhibit analogous trends, aiming to learn the mutual information therein. The Transformer architecture confers the capacity for global attention [52], enabling the computation of the impact of all other variates upon a given variate, which facilitates the learning of precise information. However, the computational load of global attention escalates quadratically as the number of variates increases, potentially rendering it impractical and restricting the application of Transformer-based algorithms in real-world scenarios. In contrast, Mamba's selective mechanism can discern the significance of different variates akin to an attention mechanism, and its computational overhead grows in a near-linear fashion with the number of variates. Yet the unidirectional nature of Mamba precludes it from attending to global variates in the manner of the Transformer; its selection mechanism can only incorporate antecedent variates. To surmount this limitation, we combine two Mamba blocks into a bidirectional Mamba layer as in Eq. (4), which facilitates the acquisition of correlations among all variates.

\begin{align}
\overrightarrow{\bm{Y}} &= \overrightarrow{\mathrm{Mamba\ Block}}(\bm{U}), \tag{4}\\
\overleftarrow{\bm{Y}} &= \overleftarrow{\mathrm{Mamba\ Block}}(\bm{U}).
\end{align}

The VC encoded by the bidirectional Mamba is aggregated as $\bm{Y}=\overrightarrow{\bm{Y}}+\overleftarrow{\bm{Y}}$ and combined with a residual connection to form the output of this layer, $\bm{U}'=\bm{Y}+\bm{U}$.
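A minimal sketch of this bidirectional wiring is given below; it is our own illustration, and any module mapping $(B,V,D)\to(B,V,D)$, such as the simplified block sketched earlier or the `Mamba` module from the `mamba_ssm` package, can be plugged in as the two directional blocks.

```python
import torch
import torch.nn as nn

class BiMambaVCEncoder(nn.Module):
    """Bidirectional Mamba over the variate axis (Eq. 4) plus the residual connection."""

    def __init__(self, mamba_fwd: nn.Module, mamba_bwd: nn.Module):
        super().__init__()
        self.mamba_fwd = mamba_fwd      # scans variates in their given order
        self.mamba_bwd = mamba_bwd      # scans variates in reversed order

    def forward(self, u: torch.Tensor) -> torch.Tensor:      # u: (B, V, D)
        y_fwd = self.mamba_fwd(u)                             # forward direction
        y_bwd = self.mamba_bwd(u.flip(1)).flip(1)             # backward direction, flipped back
        return y_fwd + y_bwd + u                              # Y = Y_fwd + Y_bwd, then U' = Y + U

# Example with the simplified block sketched earlier:
# layer = BiMambaVCEncoder(SimplifiedMambaBlock(128), SimplifiedMambaBlock(128))
# out = layer(torch.randn(8, 321, 128))    # (B, V, D) -> (B, V, D)
```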

4.3 FFN TD Encoding Layer

At this layer, we further process the output of the Mamba VC Encoding Layer. First, we employ a normalization layer [37] to enhance convergence and training stability in deep networks by standardizing all variates toward a Gaussian distribution, thereby minimizing disparities resulting from inconsistent measurements. Then, a feed-forward network (FFN) is applied to the series representation of each variate. The FFN layer encodes the observed time series and decodes future series representations using dense non-linear connections; during this procedure, the FFN implicitly encodes TD by preserving the sequential relationships. Finally, another normalization layer adjusts the future series representations.

4.4 Projection Layer

Based on the output of the FFN TD Encoding Layer, the tokenized temporal information is reconstructed into the time series to be predicted via a mapping layer, and then transposed to yield the final forecast.
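Putting the four layers together, a condensed sketch of the forecasting procedure of Alg. 2 could look as follows. This is our own compact illustration that follows the wiring of Alg. 2 literally; the hyperparameters, the GELU activation, and the `make_vc_layer` factory are placeholders rather than values from the paper, and the VC layer can be the `BiMambaVCEncoder` sketched above.

```python
import torch
import torch.nn as nn

class SMambaSketch(nn.Module):
    """Condensed S-Mamba forward pass following Alg. 2:
    tokenization -> n x [Mamba VC encoding, LayerNorm, FFN, LayerNorm] -> projection."""

    def __init__(self, lookback, horizon, make_vc_layer, d_model=128, d_ff=256, n_layers=2):
        super().__init__()
        self.tokenize = nn.Linear(lookback, d_model)              # L time points -> one token
        self.vc_layers = nn.ModuleList([make_vc_layer() for _ in range(n_layers)])
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.ffns = nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                                 nn.Linear(d_ff, d_model))
                                   for _ in range(n_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.project = nn.Linear(d_model, horizon)                # token -> T future points

    def forward(self, u_in):                                      # u_in: (B, L, V)
        u = self.tokenize(u_in.transpose(1, 2))                   # (B, V, D)
        for vc, n1, ffn, n2 in zip(self.vc_layers, self.norms1, self.ffns, self.norms2):
            u = n1(vc(u))              # Mamba VC encoding (with its residual) + LayerNorm
            u = n2(ffn(u))             # FFN TD encoding + LayerNorm
        return self.project(u).transpose(1, 2)                    # (B, T, V)

# model = SMambaSketch(96, 96, lambda: BiMambaVCEncoder(SimplifiedMambaBlock(128),
#                                                       SimplifiedMambaBlock(128)))
# forecast = model(torch.randn(8, 96, 7))    # -> (8, 96, 7)
```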

5 Experiments

5.1 Datasets and Baselines

We conduct experiments on thirteen real-world datasets. For convenience of comparison, we divide them into three types. (1) Traffic-related datasets: Traffic [59] and PEMS [10]. Traffic is a collection of hourly road occupancy rates from the California Department of Transportation, capturing data from 862 sensors across San Francisco Bay area freeways from January 2015 to December 2016. PEMS is a complicated spatial-temporal time series for public traffic networks in California including four public subsets (PEMS03, PEMS04, PEMS07, PEMS08), the same as those used by SCINet [36]. Traffic-related datasets are characterized by a large number of variates, most of which are periodic. (2) ETT datasets: ETT [71] (Electricity Transformer Temperature) comprises data on load and oil temperature, collected from electricity transformers over the period from July 2016 to July 2018. It contains four subsets: ETTm1, ETTm2, ETTh1 and ETTh2. The ETT datasets have few variates and weak regularity. (3) Other datasets: Electricity [59], Exchange [59], Weather [59], and Solar-Energy [29]. Electricity records the hourly electricity consumption of 321 customers from 2012 to 2014. Exchange collects daily exchange rates of eight countries from 1990 to 2016. Weather contains 21 meteorological indicators collected every 10 minutes from the Weather Station of the Max Planck Institute for Biogeochemistry in 2020. Solar-Energy contains solar power records in 2006 from 137 PV plants in Alabama, sampled every 10 minutes. Among them, the Electricity and Solar-Energy datasets contain many variates, most of which are periodic, while the Exchange and Weather datasets contain fewer variates, most of which are aperiodic. Tab. 1 shows the statistics of these datasets.

Table 1: The statistics of the thirteen public datasets.
Datasets Variates Timesteps Granularity
Traffic 862 17,544 1hour
PEMS03 358 26,209 5min
PEMS04 307 16,992 5min
PEMS07 883 28,224 5min
PEMS08 170 17,856 5min
ETTm1 & ETTm2 7 17,420 15min
ETTh1 & ETTh2 7 69,680 1hour
Electricity 321 26,304 1hour
Exchange 8 7,588 1day
Weather 21 52,696 10min
Solar-Energy 137 52,560 10min
Table 2: Full results of S-Mamba and baselines on traffic-related datasets. The lookback length $L$ is set to 96 and the forecast length $T$ is set to 12, 24, 48, 96 for PEMS and 96, 192, 336, 720 for Traffic.
Models S-Mamba iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear FEDformer Autoformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Traffic 96 0.382 0.261 0.395 0.268 0.649 0.389 0.462 0.295 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.587 0.366 0.613 0.388
192 0.396 0.267 0.417 0.276 0.601 0.366 0.466 0.296 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.604 0.373 0.616 0.382
336 0.417 0.276 0.433 0.283 0.609 0.369 0.482 0.304 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.621 0.383 0.622 0.337
720 0.460 0.300 0.467 0.302 0.647 0.387 0.514 0.322 0.589 0.328 0.719 0.449 0.640 0.350 0.645 0.394 0.626 0.382 0.660 0.408
Avg 0.414 0.276 0.428 0.282 0.626 0.378 0.481 0.304 0.550 0.304 0.760 0.473 0.620 0.336 0.625 0.383 0.610 0.376 0.628 0.379
PEMS03 12 0.065 0.169 0.071 0.174 0.126 0.236 0.099 0.216 0.090 0.203 0.178 0.305 0.085 0.192 0.122 0.243 0.126 0.251 0.272 0.385
24 0.087 0.196 0.093 0.201 0.246 0.334 0.142 0.259 0.121 0.240 0.257 0.371 0.118 0.223 0.201 0.317 0.149 0.275 0.334 0.440
48 0.133 0.243 0.125 0.236 0.551 0.529 0.211 0.319 0.202 0.317 0.379 0.463 0.155 0.260 0.333 0.425 0.227 0.348 1.032 0.782
96 0.201 0.305 0.164 0.275 1.057 0.787 0.269 0.370 0.262 0.367 0.490 0.539 0.228 0.317 0.457 0.515 0.348 0.434 1.031 0.796
Avg 0.122 0.228 0.113 0.221 0.495 0.472 0.180 0.291 0.169 0.281 0.326 0.419 0.147 0.248 0.278 0.375 0.213 0.327 0.667 0.601
PEMS04 12 0.076 0.180 0.078 0.183 0.138 0.252 0.105 0.224 0.098 0.218 0.219 0.340 0.087 0.195 0.148 0.272 0.138 0.262 0.424 0.491
24 0.084 0.193 0.095 0.205 0.258 0.348 0.153 0.275 0.131 0.256 0.292 0.398 0.103 0.215 0.224 0.340 0.177 0.293 0.459 0.509
48 0.115 0.224 0.120 0.233 0.572 0.544 0.229 0.339 0.205 0.326 0.409 0.478 0.136 0.250 0.355 0.437 0.270 0.368 0.646 0.610
96 0.137 0.248 0.150 0.262 1.137 0.820 0.291 0.389 0.402 0.457 0.492 0.532 0.190 0.303 0.452 0.504 0.341 0.427 0.912 0.748
Avg 0.103 0.211 0.111 0.221 0.526 0.491 0.195 0.307 0.209 0.314 0.353 0.437 0.129 0.241 0.295 0.388 0.231 0.337 0.610 0.590
PEMS07 12 0.063 0.159 0.067 0.165 0.118 0.235 0.095 0.207 0.094 0.200 0.173 0.304 0.082 0.181 0.115 0.242 0.109 0.225 0.199 0.336
24 0.081 0.183 0.088 0.190 0.242 0.341 0.150 0.262 0.139 0.247 0.271 0.383 0.101 0.204 0.210 0.329 0.125 0.244 0.323 0.420
48 0.093 0.192 0.110 0.215 0.562 0.541 0.253 0.340 0.311 0.369 0.446 0.495 0.134 0.238 0.398 0.458 0.165 0.288 0.390 0.470
96 0.117 0.217 0.139 0.245 1.096 0.795 0.346 0.404 0.396 0.442 0.628 0.577 0.181 0.279 0.594 0.553 0.262 0.376 0.554 0.578
Avg 0.089 0.188 0.101 0.204 0.504 0.478 0.211 0.303 0.235 0.315 0.380 0.440 0.124 0.225 0.329 0.395 0.165 0.283 0.367 0.451
PEMS08 12 0.076 0.178 0.079 0.182 0.133 0.247 0.168 0.232 0.165 0.214 0.227 0.343 0.112 0.212 0.154 0.276 0.173 0.273 0.436 0.485
24 0.104 0.209 0.115 0.219 0.249 0.343 0.224 0.281 0.215 0.260 0.318 0.409 0.141 0.238 0.248 0.353 0.210 0.301 0.467 0.502
48 0.167 0.228 0.186 0.235 0.569 0.544 0.321 0.354 0.315 0.355 0.497 0.510 0.198 0.283 0.440 0.470 0.320 0.394 0.966 0.733
96 0.245 0.280 0.221 0.267 1.166 0.814 0.408 0.417 0.377 0.397 0.721 0.592 0.320 0.351 0.674 0.565 0.442 0.465 1.385 0.915
Avg 0.148 0.224 0.150 0.226 0.529 0.487 0.280 0.321 0.268 0.307 0.441 0.464 0.193 0.271 0.379 0.416 0.286 0.358 0.814 0.659

Our model is fairly compared with 9 representative and state-of-the-art (SOTA) forecasting models, including (1) Transformer-based methods: iTransformer [37], PatchTST [44], Crossformer [69], FEDformer [72], Autoformer [59]; (2) Linear-based methods: RLinear [33], TiDE [14], DLinear [66]; and (3) Temporal Convolutional Network-based methods: TimesNet [57]. Brief introductions of these models are as follows:

  • iTransformer reverses the order of information processing, which first analyzes the time series information of each individual variate and then fuses the information of all variates. This unique approach has positioned iTransformer as the current SOTA model in TSF.

  • PatchTST segments time series into subseries patches as input tokens and uses channel-independent shared embeddings and weights for efficient representation learning.

  • Crossformer introduces a cross-attention mechanism that allows the model to interact with information between different time steps to help the model capture long-term dependencies in time series.

  • FEDformer is a frequency-enhanced Transformer that exploits the fact that most time series tend to have a sparse representation in a well-known basis, such as the Fourier basis, to improve performance.

  • Autoformer takes a decomposition architecture that incorporates an auto-correlation mechanism and updates traditional sequence decomposition into the basic inner blocks of the depth model.

  • RLinear is the SOTA linear model, which employs reversible normalization and channel independence into pure linear structure.

  • TiDE is a Multi-layer Perceptron (MLP) based encoder-decoder model.

  • DLinear is the first linear model in TSF and a simple one-layer linear model with decomposition architecture.

  • TimesNet uses TimesBlock as a task-general backbone, transforms 1D time series into 2D tensors, and captures intraperiod and interperiod variations using 2D kernels.

Table 3: Full results of S-Mamba and baselines on ETT datasets. The lookback length $L$ is set to 96 and the forecast length $T$ is set to 96, 192, 336, 720.
Models S-Mamba iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear FEDformer Autoformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm1 96 0.333 0.368 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.379 0.419 0.505 0.475
192 0.376 0.390 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.426 0.441 0.553 0.496
336 0.408 0.413 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.445 0.459 0.621 0.537
720 0.475 0.448 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.543 0.490 0.671 0.561
Avg 0.398 0.405 0.407 0.410 0.414 0.407 0.387 0.400 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.448 0.452 0.588 0.517
ETTm2 96 0.179 0.263 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.203 0.287 0.255 0.339
192 0.250 0.309 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.269 0.328 0.281 0.340
336 0.312 0.349 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.325 0.366 0.339 0.372
720 0.411 0.406 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.421 0.415 0.433 0.432
Avg 0.288 0.332 0.288 0.332 0.286 0.327 0.281 0.326 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.305 0.349 0.327 0.371
ETTh1 96 0.386 0.405 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.376 0.419 0.449 0.459
192 0.443 0.437 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.420 0.448 0.500 0.482
336 0.489 0.468 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.459 0.465 0.521 0.496
720 0.502 0.489 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.506 0.507 0.514 0.512
Avg 0.455 0.450 0.454 0.447 0.446 0.434 0.469 0.454 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.440 0.460 0.496 0.487
ETTh2 96 0.296 0.348 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.358 0.397 0.346 0.388
192 0.376 0.396 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.429 0.439 0.456 0.452
336 0.424 0.431 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 0.496 0.487 0.482 0.486
720 0.426 0.444 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 0.463 0.474 0.515 0.511
Avg 0.381 0.405 0.383 0.407 0.374 0.398 0.387 0.407 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.437 0.449 0.450 0.459
Table 4: Full results of S-Mamba and baselines on Electricity, Exchange, Weather and Solar-Energy datasets. The lookback length $L$ is set to 96 and the forecast length $T$ is set to 96, 192, 336, 720.
Models S-Mamba iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear FEDformer Autoformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Electricity 96 0.139 0.235 0.148 0.240 0.201 0.281 0.181 0.270 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.193 0.308 0.201 0.317
192 0.159 0.255 0.162 0.253 0.201 0.283 0.188 0.274 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.201 0.315 0.222 0.334
336 0.176 0.272 0.178 0.269 0.215 0.298 0.204 0.293 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.214 0.329 0.231 0.338
720 0.204 0.298 0.225 0.317 0.257 0.331 0.246 0.324 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.246 0.355 0.254 0.361
Avg 0.170 0.265 0.178 0.270 0.219 0.298 0.205 0.290 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.214 0.327 0.227 0.338
Exchange 96 0.086 0.207 0.086 0.206 0.093 0.217 0.088 0.205 0.256 0.367 0.094 0.218 0.107 0.234 0.088 0.218 0.148 0.278 0.197 0.323
192 0.182 0.304 0.177 0.299 0.184 0.307 0.176 0.299 0.470 0.509 0.184 0.307 0.226 0.344 0.176 0.315 0.271 0.315 0.300 0.369
336 0.332 0.418 0.331 0.417 0.351 0.432 0.301 0.397 1.268 0.883 0.349 0.431 0.367 0.448 0.313 0.427 0.460 0.427 0.509 0.524
720 0.867 0.703 0.847 0.691 0.886 0.714 0.901 0.714 1.767 1.068 0.852 0.698 0.964 0.746 0.839 0.695 1.195 0.695 1.447 0.941
Avg 0.367 0.408 0.360 0.403 0.378 0.417 0.367 0.404 0.940 0.707 0.370 0.413 0.416 0.443 0.354 0.414 0.519 0.429 0.613 0.539
Weather 96 0.165 0.210 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.217 0.296 0.266 0.336
192 0.214 0.252 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.276 0.336 0.307 0.367
336 0.274 0.297 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.339 0.380 0.359 0.395
720 0.350 0.345 0.358 0.347 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.403 0.428 0.419 0.428
Avg 0.251 0.276 0.258 0.278 0.272 0.291 0.259 0.281 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.309 0.360 0.338 0.382
Solar-Energy 96 0.205 0.244 0.203 0.237 0.322 0.339 0.234 0.286 0.310 0.331 0.312 0.399 0.250 0.292 0.290 0.378 0.242 0.342 0.884 0.711
192 0.237 0.270 0.233 0.261 0.359 0.356 0.267 0.310 0.734 0.725 0.339 0.416 0.296 0.318 0.320 0.398 0.285 0.380 0.834 0.692
336 0.258 0.288 0.248 0.273 0.397 0.369 0.290 0.315 0.750 0.735 0.368 0.430 0.319 0.330 0.353 0.415 0.282 0.376 0.941 0.723
720 0.260 0.288 0.249 0.275 0.397 0.356 0.289 0.317 0.769 0.765 0.370 0.425 0.338 0.337 0.356 0.413 0.357 0.427 0.882 0.717
Avg 0.240 0.273 0.233 0.262 0.369 0.356 0.270 0.307 0.641 0.639 0.347 0.417 0.301 0.319 0.330 0.401 0.291 0.381 0.885 0.711

5.2 Overall Performance

Tab. 2, Tab. 3, and Tab. 4 present a comparative analysis of the overall performance of our model and the baseline models across all datasets. The best results are highlighted in bold red font, while the second-best results are presented in underlined purple font. From the data presented in these tables, we summarize three observations and attach our analysis: (1) S-Mamba attains commendable outcomes on the traffic-related, Electricity, and Solar-Energy datasets. These datasets are distinguished by their numerous variates, most of which are periodic. It is worth noting that periodic variates are more likely to contain learnable VC; the Mamba VC Encoding Layer benefits from this characteristic and improves S-Mamba's performance. (2) On the ETT and Exchange datasets, S-Mamba does not demonstrate a pronounced superiority in performance and indeed exhibits suboptimal outcomes. This can be attributed to the fact that these datasets contain a small number of variates, predominantly of an aperiodic nature. Consequently, the VC between these variates is weak, and the Mamba VC Encoding Layer cannot bring useful information and may even inadvertently introduce noise into the predictive model, impeding its accuracy. (3) The Weather dataset is special in that it has fewer variates and most of them are aperiodic, yet S-Mamba still achieves the best performance on it. We attribute this to the tendency of variates in the Weather dataset to exhibit simultaneous falling or rising trends despite the absence of periodic patterns, so the Mamba VC Encoding Layer of S-Mamba can still benefit from these data. Moreover, the Weather dataset exhibits long sections of rising or falling trends; the Feed-Forward Network (FFN) layer accurately records these relationships, which also aids S-Mamba's comprehension.

Refer to caption
(a) S-Mamba
(b) iTransformer
Figure 4: Comparison of forecasts between S-Mamba and iTransformer on five datasets when the input length is 96 and the forecast length is 96. The blue line represents the ground truth and the red line represents the forecast.

Furthermore, to provide a more intuitive assessment of S-Mamba's forecasting capability, we visually compare the predictions of S-Mamba and the leading baseline, iTransformer, on five datasets: Electricity, Weather, Traffic, Exchange, and ETTh1. Specifically, we randomly select a variate and input its lookback sequence; the true subsequent sequence is depicted as a blue line and the model's forecast as a red line in Fig. 4. It is evident that on the Electricity, Weather, and Traffic datasets, S-Mamba's predictions closely approximate the actual values, with nearly perfect alignment on Electricity and Traffic, and are better than iTransformer's. On Exchange and ETTh1, the two models perform similarly because these datasets contain few variates, so there is no evident gap between using a bidirectional Mamba and using a Transformer for information fusion between variates.

Refer to caption
Figure 5: Comparison of S-Mamba and six baselines on MSE, training time, and GPU memory. The lookback length $L=96$, and the forecast length $T=12$ for PEMS07 and $T=96$ for the other datasets.

5.3 Model Efficiency

To evaluate the computational efficiency of the models, we compare the memory usage and computing time of S-Mamba with several baselines on PEMS07, Electricity, Traffic, and ETTm1. Independent runs are conducted on a single NVIDIA RTX 3090 GPU with the batch size set to 16, and the results are documented in Fig. 5. In our analysis, bubble charts depict the measurement outcomes: the vertical axis denotes the Mean Squared Error (MSE), the horizontal axis the training duration, and the bubble size the allocated GPU memory. The visualization reveals that S-Mamba attains the most favorable MSE on the PEMS07, Electricity, and Traffic datasets. Benchmarked against Transformer-based models, S-Mamba typically requires shorter training time and less GPU memory. While the RLinear model does utilize minimal GPU memory and curtail training time, it does not confer a competitive edge in forecast precision. Overall, S-Mamba manifests exemplary predictive accuracy with a low computational resource footprint.
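For reference, a rough version of this kind of measurement (our own instrumentation sketch, not the paper's benchmarking script) can be obtained with PyTorch's CUDA memory statistics and wall-clock timing:

```python
import time
import torch

def profile_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    """Measure the training time (seconds) and peak allocated GPU memory (MiB)
    of one epoch; a simple sketch, not the paper's benchmarking code."""
    model.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for u_in, u_out in loader:
        u_in, u_out = u_in.to(device), u_out.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(u_in), u_out)   # e.g. nn.MSELoss() on the forecast
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(device)
    seconds = time.time() - start
    peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return seconds, peak_mib
```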

Table 5: Ablation study on Electricity, Traffic, Weather, Solar-Energy, and ETTh2. The lookback length $L=96$, while the forecast length $T\in\{96,192,336,720\}$.
Design VC Encoding TD Encoding Forecast Electricity Traffic Weather Solar-Energy ETTh2
Lengths MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
S-Mamba bi-Mamba FFN 96 0.139 0.235 0.382 0.261 0.165 0.210 0.205 0.244 0.296 0.348
192 0.159 0.255 0.396 0.267 0.214 0.252 0.237 0.270 0.376 0.396
336 0.176 0.272 0.417 0.276 0.274 0.297 0.258 0.288 0.424 0.431
720 0.204 0.298 0.460 0.300 0.350 0.345 0.260 0.288 0.426 0.444
Replace bi-Mamba uni-Mamba 96 0.155 0.260 0.488 0.329 0.161 0.204 0.213 0.255 0.297 0.349
192 0.173 0.271 0.511 0.341 0.208 0.249 0.247 0.280 0.378 0.399
336 0.188 0.281 0.531 0.347 0.265 0.280 0.267 0.298 0.428 0.437
720 0.210 0.308 0.621 0.352 0.343 0.339 0.272 0.295 0.436 0.451
bi-Mamba bi-Mamba 96 0.154 0.259 0.512 0.348 0.162 0.205 0.221 0.261 0.297 0.349
192 0.175 0.273 0.505 0.344 0.210 0.250 0.271 0.291 0.377 0.398
336 0.184 0.276 0.527 0.369 0.266 0.288 0.271 0.291 0.428 0.437
720 0.216 0.315 0.661 0.423 0.344 0.339 0.278 0.296 0.436 0.451
bi-Mamba Attention 96 0.153 0.259 0.514 0.351 0.163 0.207 0.230 0.268 0.299 0.350
192 0.167 0.266 0.512 0.348 0.211 0.252 0.255 0.287 0.382 0.401
336 0.183 0.277 0.534 0.377 0.266 0.288 0.275 0.295 0.430 0.438
720 0.213 0.311 0.685 0.441 0.346 0.340 0.284 0.301 0.433 0.449
Attention FFN 96 0.148 0.240 0.395 0.268 0.174 0.214 0.203 0.237 0.297 0.349
192 0.162 0.253 0.417 0.276 0.221 0.254 0.233 0.261 0.380 0.400
336 0.178 0.269 0.433 0.283 0.278 0.296 0.248 0.273 0.428 0.432
720 0.225 0.317 0.467 0.302 0.358 0.349 0.249 0.275 0.427 0.445
w/o bi-Mamba w/o 96 0.141 0.238 0.380 0.259 0.167 0.214 0.210 0.250 0.298 0.349
192 0.160 0.256 0.400 0.270 0.217 0.255 0.245 0.276 0.381 0.400
336 0.181 0.279 0.426 0.283 0.276 0.300 0.263 0.291 0.430 0.437
720 0.214 0.304 0.466 0.299 0.353 0.348 0.268 0.296 0.433 0.446
w/o FFN 96 0.169 0.253 0.437 0.283 0.183 0.220 0.228 0.263 0.299 0.350
192 0.177 0.261 0.449 0.287 0.231 0.262 0.261 0.283 0.380 0.399
336 0.194 0.278 0.464 0.294 0.285 0.300 0.279 0.294 0.427 0.435
720 0.233 0.311 0.496 0.313 0.362 0.350 0.276 0.291 0.431 0.449

5.4 Ablation Study

To evaluate the efficacy of the components within S-Mamba, we conduct ablation studies by substituting or eliminating the VC and TD encoding layers. Specifically, the TD encoding layer is replaced with Attention, bidirectional Mamba, or unidirectional Mamba, or omitted altogether (w/o). Bidirectional Mamba (bi-Mamba) is chosen as a counterpart to Attention because it facilitates global temporal information extraction. The rationale for unidirectional Mamba is its resemblance to RNN models: it inherently preserves sequential relationships, making it a suitable candidate for evaluating the impact of sequential encoding on TD. The VC encoding layer is replaced with an Attention mechanism or removed entirely. This choice is predicated on empirical evidence from the iTransformer experiments [37], which demonstrate that Attention is the optimal alternative encoder for VC. We do not use a unidirectional Mamba as the VC encoding layer because Mamba, like an RNN, can only observe information from one direction; a unidirectional setting would lose half of the information, making it less effective than bidirectional Mamba or Attention at capturing global information.

Our experiments are conducted on five datasets: Electricity, Traffic, Weather, Solar-Energy, and ETTh2. The results in Tab. 5 indicate that Mamba is the superior choice for VC encoding, whereas the Feed-Forward Network (FFN) remains dominant for TD encoding. Together, these findings confirm that the current S-Mamba design is the most effective configuration.

Figure 6: Adjusting the distribution of periodic and aperiodic variates in the Electricity dataset. The left side shows the variate distribution, and the right side shows the two models' performance with lookback length L = 96 and forecast length T = 96.

5.5 Can Variate Order Affect the Performance of S-Mamba?

In S-Mamba, each variate is treated as an independent channel, so the variates themselves carry no inherent order. However, in the Mamba VC Encoding Layer, the Mamba block processes the input like an RNN, which means it apprehends the variates as a sequence with an implicit order. Mamba's selective mechanism is closely linked to the HiPPO matrix [23], which at initialization causes Mamba to prioritize nearby variates in the sequence and to discount more distant ones. This initial bias towards neighboring variates may impede the acquisition of global inter-variate correlations, inspiring us to investigate the impact of variate order on the performance of S-Mamba.

We first use the Fourier transform [7] to categorize the variates into periodic and aperiodic groups, treating periodic variates as carriers of reliable information and aperiodic variates as potential noise. This distinction rests on the assumption that periodic variates are more likely to exhibit consistent, learnable patterns, while aperiodic variates may carry unreliable information due to irregular fluctuations. We then alter the variate order by repositioning these noisy variates, since they have the greatest impact on performance through the VC encoding. Rather than randomly shuffling the overall variate order, it is more informative to adjust the distribution of the noisy variates: in subsequent trials we move the aperiodic variates towards the middle or the end of the variate sequence and evaluate the predictive capability of models trained on these modified datasets. For comparative analysis, we also include experiments with iTransformer.
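As a rough illustration of this procedure, the numpy sketch below scores each variate's periodicity by the fraction of spectral energy concentrated in its dominant non-DC frequency and moves the low-scoring (aperiodic) variates to the end of the channel order. The scoring rule, the 0.2 threshold, and the function name reorder_by_periodicity are illustrative assumptions rather than the exact criterion we used.

import numpy as np

def reorder_by_periodicity(series: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """series: (time_steps, n_variates). Returns a column order with aperiodic variates last."""
    centered = series - series.mean(axis=0)                   # remove the mean so the DC bin is negligible
    energy = np.abs(np.fft.rfft(centered, axis=0))[1:] ** 2   # spectral energy per frequency, DC bin dropped
    # fraction of total energy carried by the single strongest frequency of each variate
    periodicity = energy.max(axis=0) / (energy.sum(axis=0) + 1e-12)
    periodic = np.where(periodicity >= threshold)[0]
    aperiodic = np.where(periodicity < threshold)[0]
    return np.concatenate([periodic, aperiodic])

# usage: reordered = data[:, reorder_by_periodicity(data)]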

The variate distributions and the corresponding model performance are illustrated in Fig. 6. We conduct this experiment only on the Electricity dataset because the setup requires a dataset with a large number of both periodic and aperiodic variates, and Electricity is the only one that satisfies this condition. Our findings suggest that S-Mamba's performance remains largely unaffected by the perturbation of variate order, implying that with adequate training S-Mamba can effectively overcome the initial bias of the HiPPO matrix and capture accurate inter-variate correlations.

Figure 7: Allocated GPU memory, training time, and MSE of Autoformer, Flashformer, and Flowformer and their Mamba counterparts Auto-M, Flash-M, and Flow-M on four datasets. The lookback length L = 96; the forecast length T = 12 for PEMS07 and T = 96 for the other datasets. The solid purple horizontal line marks the average performance of the Transformer models, and the dotted purple line marks the average performance of the Mamba models.

5.6 Can Mamba Outperform Advanced Transformers?

Beyond the foundational Transformer architecture, several advanced Transformers have been introduced, predominantly focusing on augmenting the self-attention mechanism. We aim to determine whether Mamba can still maintain an advantage over these advanced Transformers. To this end, we conduct a comparative experiment in which we directly replace the advanced self-attention in the encoder layers of three Transformers, Autoformer [59], Flashformer [13], and Flowformer [58], with a unidirectional Mamba (uni-Mamba), yielding Auto-M, Flash-M, and Flow-M for comparison on TSF tasks. We use a uni-Mamba because the encoder layers of these three Transformers handle temporal dependency (TD), which is inherently ordered; a uni-Mamba is therefore better suited than a bidirectional Mamba to apprehend the sequential nature of TD.
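The substitution itself is mechanical: wherever the original encoder layer calls its self-attention module, a unidirectional Mamba block is called instead, while the surrounding residual connections and feed-forward sub-layer are kept. The sketch below, again assuming the mamba_ssm package and illustrative layer sizes, omits the model-specific components (e.g. Autoformer's series decomposition) of the real encoders.

import torch.nn as nn
from mamba_ssm import Mamba  # assumed interface

class MambaEncoderLayer(nn.Module):
    """Encoder layer whose self-attention is replaced by a unidirectional Mamba block."""
    def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)  # replaces self-attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, seq_len, d_model), tokens ordered in time
        x = self.norm1(x + self.dropout(self.mixer(x)))    # temporal token mixing via uni-Mamba
        return self.norm2(x + self.dropout(self.ffn(x)))   # position-wise feed-forward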

We compare the GPU memory, training time, and MSE of the three advanced Transformer models and their Mamba-encoder counterparts on Electricity, Traffic, PEMS07, and ETTm1, as shown in Fig. 7. The results indicate that employing Mamba as the encoder directly reduces GPU usage and training time while achieving slightly better overall performance. This means Mamba still maintains its advantage over these advanced self-attention mechanisms, or in other words, over these advanced Transformers.

Figure 8: Forecasting performance on four datasets with the lookback length L ∈ {96, 192, 336, 720}, while the forecast length T = 12 for PEMS04 and T = 96 for the other datasets.

5.7 Can Mamba Help Benefit from Increasing Lookback Length?

Prior research shows that the performance of Transformer-based models does not consistently improve as the lookback length L increases, which is somewhat unexpected. A plausible explanation is that the temporal sequence relationship is overlooked by the self-attention mechanism, which disregards the sequential order and in some instances even inverts it. Mamba, resembling a recurrent neural network [41], concentrates on the preceding window when extracting information and thereby preserves certain sequential attributes. This prompts an exploration of Mamba's potential effectiveness in fusing temporal sequence information, aiming to address the issue of diminishing or stagnant performance with increasing lookback length. Consequently, we add an additional Mamba block between the encoder and decoder layers of Transformer-based models. The role of this Mamba block is to re-impose a layer of temporal dependence on the encoder output, injecting information similar to a position embedding before the decoder processes it. We apply this modification to Reformer [28], Informer [71], and Transformer [52] to obtain Refor-M, Infor-M, and Trans-M, and evaluate their performance with varying lookback lengths. We also test how S-Mamba and iTransformer perform as the lookback length increases. The experiment is conducted on four datasets: Electricity, Traffic, PEMS04, and ETTm1. The results are shown in Fig. 8, from which we make four observations. (1) S-Mamba and iTransformer improve as the input lengthens, but we believe this is not solely due to the Mamba block or Transformer block, but rather to the FFN TD encoding layer that both models possess. (2) S-Mamba consistently outperforms iTransformer, primarily because S-Mamba's Mamba VC encoding layer outperforms iTransformer's Transformer VC encoding layer. (3) After incorporating the Mamba block between the encoder and decoder layers, the original models typically show performance gains across the four datasets. (4) Despite these occasional gains, the variants still do not improve monotonically with longer lookback lengths. This is consistent with the findings of Zeng et al. [66] and suggests that encoding temporal sequence information into the model beforehand does not resolve the issue.
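A minimal sketch of this insertion, under the same mamba_ssm assumption as above, wraps an unmodified encoder-decoder forecaster and passes the encoder output through one Mamba block before decoding. The residual connection and the class name MambaBridgedForecaster are illustrative choices, and the real Reformer/Informer forward signatures take additional arguments that are omitted here.

import torch.nn as nn
from mamba_ssm import Mamba  # assumed interface

class MambaBridgedForecaster(nn.Module):
    """Re-injects a sequential bias into the encoder output with one Mamba block before decoding."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, d_model: int):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.bridge = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x_enc, x_dec):
        memory = self.encoder(x_enc)            # (batch, seq_len, d_model)
        memory = memory + self.bridge(memory)   # residual pass adds position-embedding-like order information
        return self.decoder(x_dec, memory)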

Figure 9: Forecasting performance of S-Mamba and iTransformer trained on 100% of the variates compared with training on only 40% of the variates. The lookback length L = 96 for all datasets. For the PEMS datasets the forecast length T = 12, while for the other datasets T = 96.

5.8 Is Mamba Generalizable in TSF?

The emergence of pretrained models [16] and large language models [9] built on the Transformer architecture has underscored the Transformer's ability to discern similar patterns across diverse data, highlighting its generalization capability. In TSF, some variates exhibit similar patterns of variation, so the Transformer's generalization potential for sequence data may also carry over to TSF tasks. In this vein, iTransformer [37] conducts a pivotal experiment: a majority of the variates in a dataset are masked, the model is trained on the remaining limited subset, and it is then tasked with forecasting all variates, including those never seen, based on what it learned from the few visible ones. The results show that the Transformer can exploit its generalization ability to make accurate predictions for unseen variates in TSF tasks. Building on this, we evaluate the generalization capability of Mamba in TSF. We train S-Mamba on merely 40% of the variates of the PEMS03, PEMS04, Electricity, Weather, Traffic, and Solar-Energy datasets; these datasets are selected because they contain a large number of variates, which makes the evaluation of generalization ability fair. The trained models are then employed to predict 100% of the variates, and the results are analyzed statistically. The outcomes of this investigation, shown in Fig. 9, reveal that S-Mamba exhibits generalization potential on all six datasets, which demonstrates its generalizability in TSF tasks.
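Because the model processes every variate as an independent token with shared weights, the protocol reduces to choosing which channels are visible during training and then evaluating on all of them. The sketch below, with an illustrative random sampling rule and function name, summarizes this setup.

import numpy as np

def split_variates(n_variates: int, train_fraction: float = 0.4, seed: int = 0) -> np.ndarray:
    """Randomly select the subset of variates visible during training."""
    rng = np.random.default_rng(seed)
    train_ids = rng.choice(n_variates, size=int(train_fraction * n_variates), replace=False)
    return np.sort(train_ids)

# training: feed only data[:, train_ids] to the model (variate-as-token, shared weights)
# evaluation: feed data[:, :] and measure the error on every variate,
#             including the channels never seen during training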

6 Conclusion

Transformer-based models have consistently exhibited outstanding performance in time series forecasting (TSF), while Mamba has recently gained popularity and has been shown to surpass the Transformer in various domains, delivering superior performance while reducing memory and computational overhead. Motivated by these advancements, we investigate the potential of Mamba-based models in the TSF domain in order to uncover new research avenues for the field. To this end, we introduce a Mamba-based model for TSF, Simple-Mamba (S-Mamba). It transfers the task of inter-variate correlation (VC) encoding from the Transformer architecture to a bidirectional Mamba block and uses a Feed-Forward Network to extract temporal dependencies (TD). We compare S-Mamba with nine representative and state-of-the-art models on thirteen public datasets covering traffic, weather, electricity, and energy forecasting tasks. The results indicate that S-Mamba requires low computational overhead and achieves leading performance. This advantage is primarily attributed to the bidirectional Mamba (bi-Mamba) block within the Mamba VC encoding layer, which offers an enhanced understanding of VC at lower overhead than the Transformer. Furthermore, we conduct extensive experiments showing that Mamba possesses robust capabilities in TSF tasks: it matches the Transformer's stability in extracting VC, retains advantages over advanced Transformer architectures, and Transformer architectures themselves can gain performance simply by integrating or substituting Mamba blocks. Additionally, Mamba exhibits generalization capabilities comparable to the Transformer. In summary, Mamba shows remarkable potential to outperform the Transformer in TSF tasks.

7 Future Work

As the number of variates grows, global inter-variate correlations (VC) become increasingly valuable, yet extracting them becomes more difficult and consumes more computational resources. Mamba excels at detecting long-range dependencies while controlling the escalation of computational demands, equipping it to meet these challenges. In real-world scenarios where resources are limited, Mamba can process information from more variates simultaneously than the Transformer and deliver more accurate predictions. For example, in traffic forecasting Mamba can rapidly assess traffic flows at more intersections, and in hydrological forecasting it can provide insights into conditions across more tributaries. Looking forward, Mamba-based models are expected to be applicable to a broader spectrum of time series prediction tasks that involve processing extensive variate data.

Pretrained models built on the Transformer architecture capitalize on its robust generalization capabilities and have achieved notable success in TSF, demonstrating effectiveness across various tasks through fine-tuning. Our experimental results indicate that Mamba matches the Transformer in both generalization and stability, suggesting that developing a Mamba-based pretrained model for TSF could be a fruitful direction to explore.

References

  • Abdollah Pour et al. [2022] Abdollah Pour, M.M., Hajizadeh, E., Farineya, P., 2022. A New Transformer-Based Hybrid Model for Forecasting Crude Oil Returns. AUT Journal of Modeling and Simulation 54, 19–30. URL: https://miscj.aut.ac.ir/article_4853.html, doi:10.22060/miscj.2022.20734.5263. publisher: Amirkabir University of Technology.
  • Ahamed and Cheng [2024] Ahamed, M.A., Cheng, Q., 2024. Timemachine: A time series is worth 4 mambas for long-term forecasting. arXiv:2403.09898.
  • Ahmed et al. [2023] Ahmed, S., Nielsen, I.E., Tripathi, A., Siddiqui, S., Rasool, G., Ramachandran, R.P., 2023. Transformers in Time-series Analysis: A Tutorial. Circuits, Systems, and Signal Processing 42, 7433--7466. URL: http://arxiv.org/abs/2205.01138, doi:10.1007/s00034-023-02454-8. arXiv:2205.01138 [cs].
  • Anthony et al. [2024] Anthony, Q., Tokpanov, Y., Glorioso, P., Millidge, B., 2024. Blackmamba: Mixture of experts for state-space models. ArXiv abs/2402.01771. URL: https://api.semanticscholar.org/CorpusID:267413070.
  • Benidis et al. [2023] Benidis, K., Rangapuram, S.S., Flunkert, V., Wang, Y., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L., Aubet, F.X., Callot, L., Januschowski, T., 2023. Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. ACM Computing Surveys 55, 1--36. URL: http://arxiv.org/abs/2004.10240, doi:10.1145/3533382. arXiv:2004.10240 [cs, stat].
  • Bhirangi et al. [2024] Bhirangi, R., Wang, C., Pattabiraman, V., Majidi, C., Gupta, A., Hellebrekers, T., Pinto, L., 2024. Hierarchical state space models for continuous sequence-to-sequence modeling. arXiv:2402.10211.
  • Bracewell [1989] Bracewell, R.N., 1989. The fourier transform. Scientific American 260, 86--95.
  • Cao et al. [2024] Cao, Z., Wu, X., Deng, L.J., Zhong, Y., 2024. A novel state space model with local enhancement and state sharing for image fusion. arXiv:2404.09293.
  • Chang et al. [2023] Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al., 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology .
  • Chen et al. [2001] Chen, C., Petty, K., Skabardonis, A., Varaiya, P., Jia, Z., 2001. Freeway performance measurement system: mining loop detector data. Transportation research record 1748, 96--102.
  • Chen et al. [2023] Chen, S.A., Li, C.L., Yoder, N., Arik, S.O., Pfister, T., 2023. TSMixer: An All-MLP Architecture for Time Series Forecasting. URL: http://arxiv.org/abs/2303.06053, doi:10.48550/arXiv.2303.06053. arXiv:2303.06053 [cs].
  • Chen et al. [2024] Chen, T., Tan, Z., Gong, T., Chu, Q., Wu, Y., Liu, B., Ye, J., Yu, N., 2024. MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection. URL: http://arxiv.org/abs/2403.02148, doi:10.48550/arXiv.2403.02148. arXiv:2403.02148 [cs].
  • Dao et al. [2022] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C., 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35, 16344--16359.
  • Das et al. [2023] Das, A., Kong, W., Leach, A., Mathur, S.K., Sen, R., Yu, R., 2023. Long-term forecasting with tide: Time-series dense encoder. Transactions on Machine Learning Research .
  • De Gooijer and Hyndman [2006] De Gooijer, J.G., Hyndman, R.J., 2006. 25 years of time series forecasting. International journal of forecasting 22, 443--473.
  • Devlin et al. [2018] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 .
  • Dong et al. [2024] Dong, W., Zhu, H., Lin, S., Luo, X., Shen, Y., Liu, X., Zhang, J., Guo, G., Zhang, B., 2024. Fusion-mamba for cross-modality object detection. arXiv:2404.09146.
  • Duong-Trung et al. [2023] Duong-Trung, N., Nguyen, D.M., Le-Phuoc, D., 2023. Temporal Saliency Detection Towards Explainable Transformer-based Timeseries Forecasting. URL: http://arxiv.org/abs/2212.07771, doi:10.48550/arXiv.2212.07771. arXiv:2212.07771 [cs] version: 3.
  • Elfwing et al. [2017] Elfwing, S., Uchibe, E., Doya, K., 2017. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks : the official journal of the International Neural Network Society 107, 3--11.
  • Foumani et al. [2024] Foumani, N.M., Tan, C.W., Webb, G.I., Salehi, M., 2024. Improving position encoding of transformers for multivariate time series classification. Data Mining and Knowledge Discovery 38, 22--48. URL: https://doi.org/10.1007/s10618-023-00948-2, doi:10.1007/s10618-023-00948-2.
  • Grazzi et al. [2024] Grazzi, R., Siems, J., Schrodi, S., Brox, T., Hutter, F., 2024. Is mamba capable of in-context learning? arXiv:2402.03170.
  • Gu and Dao [2023] Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 .
  • Gu et al. [2020] Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C., 2020. Hippo: Recurrent memory with optimal polynomial projections. ArXiv abs/2008.07669.
  • Gu et al. [2021a] Gu, A., Goel, K., Ré, C., 2021a. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 .
  • Gu et al. [2021b] Gu, A., Johnson, I., Goel, K., Saab, K.K., Dao, T., Rudra, A., Ré, C., 2021b. Combining recurrent, convolutional, and continuous-time models with linear state-space layers, in: Neural Information Processing Systems.
  • Huang et al. [2024] Huang, J., Yang, L., Wang, F., Wu, Y., Nan, Y., Aviles-Rivero, A.I., Schönlieb, C.B., Zhang, D., Yang, G., 2024. Mambamir: An arbitrary-masked mamba for joint medical image reconstruction and uncertainty estimation. arXiv:2402.18451.
  • Jiang et al. [2024] Jiang, X., Han, C., Mesgarani, N., 2024. Dual-path mamba: Short and long-term bidirectional selective structured state space models for speech separation. arXiv:2403.18257.
  • Kitaev et al. [2020] Kitaev, N., Kaiser, Ł., Levskaya, A., 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 .
  • Lai et al. [2018] Lai, G., Chang, W.C., Yang, Y., Liu, H., 2018. Modeling long-and short-term temporal patterns with deep neural networks, in: The 41st international ACM SIGIR conference on research & development in information retrieval, pp. 95--104.
  • Li et al. [2024] Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., Qiao, Y., 2024. VideoMamba: State Space Model for Efficient Video Understanding. URL: http://arxiv.org/abs/2403.06977, doi:10.48550/arXiv.2403.06977. arXiv:2403.06977 [cs].
  • Li et al. [2019] Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.X., Yan, X., 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32.
  • Li et al. [2023a] Li, Z., Qi, S., Li, Y., Xu, Z., 2023a. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721 .
  • Li et al. [2023b] Li, Z., Qi, S., Li, Y., Xu, Z., 2023b. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721 .
  • Lim and Zohren [2021] Lim, B., Zohren, S., 2021. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A 379, 20200209.
  • Liu et al. [2023a] Liu, H., Li, C., Wu, Q., Lee, Y.J., 2023a. Visual instruction tuning.
  • Liu et al. [2022] Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., Xu, Q., 2022. Scinet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems 35, 5816--5828.
  • Liu et al. [2023b] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M., 2023b. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 .
  • Liu et al. [2024] Liu, Y., Xiao, J., Guo, Y., Jiang, P., Yang, H., Wang, F., 2024. Hsidmamba: Exploring bidirectional state-space models for hyperspectral denoising. arXiv:2404.09697.
  • Ma et al. [2024] Ma, J., Li, F., Wang, B., 2024. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. URL: http://arxiv.org/abs/2401.04722, doi:10.48550/arXiv.2401.04722. arXiv:2401.04722 [cs, eess].
  • Mahmoud and Mohammed [2021] Mahmoud, A., Mohammed, A., 2021. A Survey on Deep Learning for Time-Series Forecasting. Springer International Publishing, Cham. pp. 365--392. URL: https://doi.org/10.1007/978-3-030-59338-4_19, doi:10.1007/978-3-030-59338-4_19.
  • Medsker et al. [2001] Medsker, L.R., Jain, L., et al., 2001. Recurrent neural networks. Design and Applications 5, 2.
  • Mellouli et al. [2022] Mellouli, N., Rabah, M.L., Farah, I.R., 2022. Transformers-based time series forecasting for piezometric level prediction, in: 2022 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS), pp. 1--6. URL: https://ieeexplore.ieee.org/abstract/document/9787530, doi:10.1109/EAIS51927.2022.9787530. iSSN: 2473-4691.
  • Midilli and Parshutin [2023] Midilli, Y.E., Parshutin, S., 2023. A review for pre-trained transformer-based time series forecasting models. 2023 IEEE 64th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), 1--8. URL: https://api.semanticscholar.org/CorpusID:265256497.
  • Nie et al. [2022] Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J., 2022. A time series is worth 64 words: Long-term forecasting with transformers, in: The Eleventh International Conference on Learning Representations.
  • Pióro et al. [2024] Pióro, M., Ciebiera, K., Król, K., Ludziejewski, J., Krutul, M., Krajewski, J., Antoniak, S., Miłoś, P., Cygan, M., Jaszczur, S., 2024. MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts. URL: http://arxiv.org/abs/2401.04081, doi:10.48550/arXiv.2401.04081. arXiv:2401.04081 [cs].
  • Rangapuram et al. [2018] Rangapuram, S.S., Seeger, M.W., Gasthaus, J., Stella, L., Wang, Y., Januschowski, T., 2018. Deep state space models for time series forecasting. Advances in neural information processing systems 31.
  • Schiff et al. [2024] Schiff, Y., Kao, C.H., Gokaslan, A., Dao, T., Gu, A., Kuleshov, V., 2024. Caduceus: Bi-directional equivariant long-range dna sequence modeling. ArXiv abs/2403.03234. URL: https://api.semanticscholar.org/CorpusID:268253280.
  • Sezer et al. [2020] Sezer, O.B., Gudelek, M.U., Ozbayoglu, A.M., 2020. Financial time series forecasting with deep learning : A systematic literature review: 2005–2019. Applied Soft Computing 90, 106181. URL: https://www.sciencedirect.com/science/article/pii/S1568494620301216, doi:https://doi.org/10.1016/j.asoc.2020.106181.
  • Sherozbek et al. [2023] Sherozbek, J., Park, J., Akhtar, M.S., Yang, O.B., 2023. Transformers-Based Encoder Model for Forecasting Hourly Power Output of Transparent Photovoltaic Module Systems. Energies 16, 1353. URL: https://www.mdpi.com/1996-1073/16/3/1353, doi:10.3390/en16031353. number: 3 Publisher: Multidisciplinary Digital Publishing Institute.
  • Shi [2024] Shi, Z., 2024. Mambastock: Selective state space model for stock prediction. arXiv:2402.18959.
  • Smith et al. [2022] Smith, J.T., Warrington, A., Linderman, S.W., 2022. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933 .
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems 30.
  • Wang et al. [2024a] Wang, Y., Long, H., Zheng, L., Shang, J., 2024a. Graphformer: Adaptive graph correlation transformer for multivariate long sequence time series forecasting. Knowledge-Based Systems 285, 111321. URL: https://www.sciencedirect.com/science/article/pii/S0950705123010699, doi:https://doi.org/10.1016/j.knosys.2023.111321.
  • Wang et al. [2024b] Wang, Z., Ruan, S., Huang, T., Zhou, H., Zhang, S., Wang, Y., Wang, L., Huang, Z., Liu, Y., 2024b. A lightweight multi-layer perceptron for efficient multivariate time series forecasting. Knowledge-Based Systems 288, 111463. URL: https://www.sciencedirect.com/science/article/pii/S0950705124000984, doi:https://doi.org/10.1016/j.knosys.2024.111463.
  • Wen et al. [2023] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., Sun, L., 2023. Transformers in Time Series: A Survey. URL: http://arxiv.org/abs/2202.07125, doi:10.48550/arXiv.2202.07125. arXiv:2202.07125 [cs, eess, stat].
  • Woo et al. [2022] Woo, G., Liu, C., Sahoo, D., Kumar, A., Hoi, S., 2022. ETSformer: Exponential Smoothing Transformers for Time-series Forecasting. URL: http://arxiv.org/abs/2202.01381, doi:10.48550/arXiv.2202.01381. arXiv:2202.01381 [cs].
  • Wu et al. [2022a] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M., 2022a. Timesnet: Temporal 2d-variation modeling for general time series analysis, in: The eleventh international conference on learning representations.
  • Wu et al. [2022b] Wu, H., Wu, J., Xu, J., Wang, J., Long, M., 2022b. Flowformer: Linearizing transformers with conservation flows. arXiv preprint arXiv:2202.06258 .
  • Wu et al. [2021] Wu, H., Xu, J., Wang, J., Long, M., 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems 34, 22419--22430.
  • Yang et al. [2024a] Yang, S., Wang, Y., Chen, H., 2024a. Mambamil: Enhancing long sequence modeling with sequence reordering in computational pathology. arXiv:2403.06800.
  • Yang et al. [2024b] Yang, Y., Xing, Z., Zhu, L., 2024b. Vivim: a video vision mamba for medical video object segmentation. arXiv preprint arXiv:2401.14168 .
  • Yang et al. [2024c] Yang, Z., Mitra, A., Kwon, S., Yu, H., 2024c. ClinicalMamba: A Generative Clinical Language Model on Longitudinal Clinical Notes. URL: http://arxiv.org/abs/2403.05795. arXiv:2403.05795 [cs].
  • Yao et al. [2024] Yao, J., Hong, D., Li, C., Chanussot, J., 2024. Spectralmamba: Efficient mamba for hyperspectral image classification. arXiv:2404.08489.
  • Yi et al. [2023] Yi, K., Zhang, Q., Fan, W., Wang, S., Wang, P., He, H., An, N., Lian, D., Cao, L., Niu, Z., 2023. Frequency-domain MLPs are More Effective Learners in Time Series Forecasting. Advances in Neural Information Processing Systems 36, 76656--76679. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/f1d16af76939f476b5f040fd1398c0a3-Abstract-Conference.html.
  • Yue and Li [2024] Yue, Y., Li, Z., 2024. Medmamba: Vision mamba for medical image classification. arXiv:2403.03849.
  • Zeng et al. [2023] Zeng, A., Chen, M., Zhang, L., Xu, Q., 2023. Are transformers effective for time series forecasting?, in: Proceedings of the AAAI conference on artificial intelligence, pp. 11121--11128.
  • Zeng et al. [2022] Zeng, A., Chen, M.H., Zhang, L., Xu, Q., 2022. Are transformers effective for time series forecasting?, in: AAAI Conference on Artificial Intelligence. URL: https://api.semanticscholar.org/CorpusID:249097444.
  • Zhang et al. [2022] Zhang, T., Zhang, Y., Cao, W., Bian, J., Yi, X., Zheng, S., Li, J., 2022. Less Is More: Fast Multivariate Time Series Forecasting with Light Sampling-oriented MLP Structures. URL: http://arxiv.org/abs/2207.01186, doi:10.48550/arXiv.2207.01186. arXiv:2207.01186 [cs] version: 1.
  • Zhang and Yan [2022] Zhang, Y., Yan, J., 2022. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: The eleventh international conference on learning representations.
  • Zhao et al. [2024] Zhao, H., Zhang, M., Zhao, W., Ding, P., Huang, S., Wang, D., 2024. Cobra: Extending mamba to multi-modal large language model for efficient inference. arXiv:2403.14520.
  • Zhou et al. [2021] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W., 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI conference on artificial intelligence, pp. 11106--11115.
  • Zhou et al. [2022] Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., Jin, R., 2022. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International conference on machine learning, PMLR. pp. 27268--27286.
  • Zhu et al. [2024] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X., 2024. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 .