iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
ABSTRACT
The recent boom of linear forecasting models questions the ongoing passion for
architectural modifications of Transformer-based forecasters. These forecasters
leverage Transformers to model the global dependencies over temporal tokens of
time series, with each token formed by multiple variates of the same timestamp.
However, Transformers are challenged in forecasting series with larger lookback
windows due to performance degradation and computation explosion. Besides, the
embedding for each temporal token fuses multiple variates that represent potential
delayed events and distinct physical measurements, which may fail in learning
variate-centric representations and result in meaningless attention maps. In this
work, we reflect on the competent duties of Transformer components and repurpose
the Transformer architecture without any modification to the basic components. We
propose iTransformer that simply applies the attention and feed-forward network
on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture
multivariate correlations; meanwhile, the feed-forward network is applied for each
variate token to learn nonlinear representations. The iTransformer model achieves
state-of-the-art on challenging real-world datasets, which further empowers the
Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting. Code is available at this repository: https://github.com/thuml/iTransformer.
1 INTRODUCTION
Transformer (Vaswani et al., 2017) has achieved tremendous success in natural language processing and computer vision (Dosovitskiy et al., 2021), growing into the foundation model that follows the scaling law (Kaplan et al., 2020). Inspired by the immense success in extensive fields, Transformer
[Figure 1: overall forecasting performance comparison on benchmark datasets, including ETT, ECL, PEMS, and Solar-Energy.]
Figure 2: Comparison between the vanilla Transformer (top) and the proposed iTransformer (bottom).
Transformer embeds the temporal token, which contains the multivariate representation of each time
step. iTransformer embeds each series independently to the variate token, such that the attention mod-
ule depicts the multivariate correlations and the feed-forward network encodes series representations.
information is ever more highlighted by recent research that explicitly models multivariate correlations to achieve accurate forecasting (Zhang & Yan, 2023; Ekambaram et al., 2023), but this goal can hardly be achieved without subverting the vanilla Transformer architecture.
Considering the disputes over Transformer-based forecasters, we reflect on why Transformers perform even worse than linear models in time series forecasting while dominating many other fields. We notice that the existing structure of Transformer-based forecasters may not be suitable for multivariate time series forecasting. As shown at the top of Figure 2, the points of the same time step, which essentially represent completely different physical meanings recorded by inconsistent measurements, are embedded into one token with the multivariate correlations wiped out. Moreover, the token formed by a single time step can struggle to reveal beneficial information due to its excessively local receptive field and the time-unaligned events represented by simultaneous time points. Besides, while series variations can be greatly influenced by the sequence order, permutation-invariant attention mechanisms are improperly adopted on the temporal dimension (Zeng et al., 2023). Consequently, Transformer is weakened in capturing essential series representations and portraying multivariate correlations, limiting its capacity and generalization ability on diverse time series data.
Concerning the potential risks of embedding the multivariate points of a timestamp as a (temporal) token, we take an inverted view of time series and embed the whole series of each variate independently into a (variate) token, the extreme case of Patching (Nie et al., 2023) that enlarges the local receptive field. By inverting, the embedded token aggregates global representations of the series that can be more variate-centric and better leveraged by booming attention mechanisms for multivariate correlating. Meanwhile, the feed-forward network can be proficient enough to learn generalizable representations for distinct variates, encoded from arbitrary lookback series and decoded to predict future series.
Based on the above motivations, we believe it is not that Transformer is ineffective for time series forecasting, but rather that it is improperly used. In this paper, we revisit the structure of Transformer and advocate iTransformer as a fundamental backbone for time series forecasting. Technically, we embed each time series as a variate token, adopt attention for multivariate correlations, and employ the feed-forward network for series representations. Experimentally, the proposed iTransformer achieves state-of-the-art performance on real-world forecasting benchmarks (shown in Figure 1) and surprisingly tackles the pain points of Transformer-based forecasters. Our contributions lie in three aspects:
• We reflect on the architecture of Transformer and find that the competent capability of native Transformer components on multivariate time series has been underexplored.
• We propose iTransformer, which regards independent time series as tokens to capture multivariate correlations by self-attention, and utilizes layer normalization and feed-forward network modules to learn better series-global representations for time series forecasting.
• Experimentally, iTransformer achieves comprehensive state-of-the-art on real-world bench-
marks. We extensively analyze the inverted modules and architecture choices, indicating a
promising direction for the future improvement of Transformer-based forecasters.
2 RELATED WORK
With the progressive breakthroughs made in natural language processing and computer vision, elaborately designed Transformer variants have been proposed to tackle ubiquitous time series forecasting applications. Going beyond contemporaneous TCNs (Bai et al., 2018; Liu et al., 2022a) and RNN-based forecasters (Zhao et al., 2017; Rangapuram et al., 2018; Salinas et al., 2020), Transformer has exhibited powerful sequence modeling capability and promising model scalability, leading to a trend of passionate modifications adapted for time series forecasting.
Through a systematic review of Transformer-based forecasters, we conclude that existing modifications can be divided into four categories according to whether the component or the architecture is modified. As shown in Figure 3, the first category (Wu et al., 2021; Li et al., 2021; Zhou et al., 2022), which is the most common practice, mainly concerns component adaptation, especially of the attention module for temporal dependency modeling and its complexity optimization on long sequences. Nevertheless, with the rapid emergence of linear forecasters (Oreshkin et al., 2019; Zeng et al., 2023; Das et al., 2023; Liu et al., 2023), their impressive performance and efficiency continuously challenge this direction. Soon afterward, the second category attempts to fully utilize Transformer. It pays more attention to the inherent processing of time series, such as Stationarization (Liu et al., 2022b), Channel Independence, and Patching (Nie et al., 2023), which bring about consistently improved performance. Moreover, faced with the increasing significance of the independence and mutual interactions of multiple variates, the third category refurbishes Transformer in both the component and the architecture. As a representative, Crossformer (Zhang & Yan, 2023) explicitly captures the cross-time and cross-variate dependencies with a renovated attention mechanism and architecture.
Unlike previous works, iTransformer modifies none of the native components of Transformer. Instead, we adopt the components on the inverted dimensions with an altered architecture, and to the best of our knowledge it is the only method that belongs to the fourth category. We believe the capabilities of the native components have stood extensive tests; the real issue is that the Transformer architecture is improperly adopted.
[Figure 3: categorization of Transformer-based forecasters by whether the component and the architecture are modified, with category (II) containing PatchTST, NSTransformer, etc., and category (IV) containing iTransformer (ours), which modifies no component.]
3 ITRANSFORMER
Our proposed iTransformer, illustrated in Figure 4, adopts the encoder-only architecture of Transformer (Vaswani et al., 2017), including the embedding, projection, and Transformer blocks.

Figure 4: Overall structure of iTransformer, which shares the same modular arrangement with the encoder of Transformer. (a) Raw series of different variates are independently embedded as tokens. (b) Self-attention is applied to embedded variate tokens with enhanced interpretability revealing multivariate correlations. (c) Series representations of each token are extracted by the shared feed-forward network. (d) Layer normalization is adopted to reduce the discrepancies among variates.

Embedding the whole series as the token    Most Transformer-based forecasters typically regard multiple variates of the same timestamp as the (temporal) token and follow the generative formulation of forecasting tasks. However, we find this approach on the numerical modality can be less instructive for learning attention maps, which is supported by the increasing adoption of Patching (Dosovitskiy et al., 2021; Nie et al., 2023) that broadens the receptive field. Meanwhile, the triumph of linear forecasters also challenges the necessity of adopting a heavy encoder-decoder Transformer for generating tokens. Instead, our proposed encoder-only iTransformer focuses on representation learning and adaptive correlating of multivariate series. Each time series, driven by an underlying complicated process, is first tokenized to describe the properties of the variate, then processed by self-attention for mutual interactions, and individually processed by feed-forward networks for series representations. Notably, the task of generating the predicted series is essentially delegated to linear layers, which has been proven competent by previous work (Das et al., 2023), and we provide a detailed analysis in the next section.
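As a minimal sketch of this idea (our illustration with assumed tensor shapes, not the released implementation), the inverted embedding can be realized by transposing the input and applying one shared linear layer along the temporal dimension:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: lookback length T, N variates, model dimension D.
T, N, D = 96, 7, 512
embed = nn.Linear(T, D)                     # one shared projection maps a whole series to a token

x = torch.randn(32, T, N)                   # (batch, time, variate)
variate_tokens = embed(x.transpose(1, 2))   # (batch, N, D): one token per variate
```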
Based on the above considerations, in iTransformer, the process of predicting the future series $\hat{\mathbf{Y}}_{:,n}$ of each specific variate based on the lookback series $\mathbf{X}_{:,n}$ is simply formulated as follows:

$$\mathbf{h}_n^{0} = \operatorname{Embedding}(\mathbf{X}_{:,n}), \qquad \mathbf{H}^{l+1} = \operatorname{TrmBlock}(\mathbf{H}^{l}), \ \ l = 0, \ldots, L-1, \qquad \hat{\mathbf{Y}}_{:,n} = \operatorname{Projection}(\mathbf{h}_n^{L}). \tag{1}$$
We organize a stack of L blocks composed of layer normalization, feed-forward network, and self-attention modules, but their duties on the inverted dimension are carefully reconsidered.
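For concreteness, a minimal end-to-end sketch of Equation 1 built from standard PyTorch modules is given below; all hyperparameters are illustrative assumptions and this is not the official implementation, which may differ in block details.

```python
import torch
import torch.nn as nn

class ITransformerSketch(nn.Module):
    """Illustrative encoder-only variant operating on variate tokens (not the official code)."""
    def __init__(self, lookback_len, pred_len, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(lookback_len, d_model)          # series -> variate token (Embedding)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)   # TrmBlocks over N variate tokens
        self.project = nn.Linear(d_model, pred_len)            # token -> future series (Projection)

    def forward(self, x):                                      # x: (batch, T, N)
        tokens = self.embed(x.transpose(1, 2))                 # (batch, N, d_model)
        tokens = self.blocks(tokens)                           # multivariate correlating + FFN
        return self.project(tokens).transpose(1, 2)            # (batch, S, N)

y_hat = ITransformerSketch(96, 720)(torch.randn(8, 96, 7))     # -> (8, 720, 7)
```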
Layer normalization    Layer normalization (Ba et al., 2016) was originally proposed to improve the convergence and training stability of deep networks. In typical Transformer-based forecasters, the module normalizes the multivariate representation of the same timestamp, gradually fusing the variates with each other. Once the collected time points do not represent the same event, the operation also introduces interaction noise between noncausal or delayed processes. In our inverted version, the normalization is applied to the series representation of each individual variate as in Equation 2, which has been studied and proven effective in tackling non-stationarity (Kim et al., 2021; Liu et al., 2022b). Besides, since all series as (variate) tokens are normalized to a Gaussian distribution, the discrepancies caused by inconsistent measurements are diminished. By contrast, in the previous architecture, the different tokens of time steps are normalized, leading to over-smoothed time series.
$$\operatorname{LayerNorm}(\mathbf{H}) = \left\{ \left. \frac{\mathbf{h}_n - \operatorname{Mean}(\mathbf{h}_n)}{\sqrt{\operatorname{Var}(\mathbf{h}_n)}} \,\right|\, n = 1, \ldots, N \right\} \tag{2}$$
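A small sketch of the inverted normalization (assumed dimensions): LayerNorm is applied over the feature dimension of each variate token, so every variate's series representation is standardized independently as in Equation 2.

```python
import torch
import torch.nn as nn

B, N, D = 8, 7, 512                    # batch, variate tokens, model dimension (assumed)
H = torch.randn(B, N, D)               # series representations, one D-dim token per variate

norm = nn.LayerNorm(D)                 # statistics computed over the last (feature) dimension
H_norm = norm(H)                       # each variate token standardized independently

# In the conventional (temporal-token) usage, normalization would instead mix the
# variates recorded at one timestamp, which Equation 2 deliberately avoids.
```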
Feed-forward network    Transformer adopts the feed-forward network (FFN) as the basic building block for encoding token representations, and it is identically applied to each token. As aforementioned, in the vanilla Transformer, the multiple variates of the same timestamp that form a token can be malpositioned and too localized to reveal enough information for prediction. In the inverted version, the FFN is leveraged on the series representation of each variate token. By the universal approximation theorem (Hornik, 1991), it can extract complicated representations to describe a time series. With the stacking of inverted blocks, the FFNs are devoted to encoding the observed time series and decoding the representations of future series using dense non-linear connections, working as effectively as recent works built entirely on MLPs (Tolstikhin et al., 2021; Das et al., 2023).
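The following sketch (assumed dimensions and activation; not the official code) shows how one shared FFN is applied position-wise, i.e., identically to every variate token:

```python
import torch
import torch.nn as nn

D, d_ff = 512, 2048                    # model and hidden widths (assumed)
ffn = nn.Sequential(                   # a single FFN shared by all variate tokens
    nn.Linear(D, d_ff),
    nn.GELU(),                         # activation choice is an assumption here
    nn.Dropout(0.1),
    nn.Linear(d_ff, D),
)

H = torch.randn(8, 7, D)               # (batch, N variate tokens, D)
H_out = ffn(H)                         # applied identically to each of the N tokens
```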
More interestingly, the identical linear operation on independent time series, which serves as a combination of the recent linear forecasters (Zeng et al., 2023) and Channel Independence (Nie et al., 2023), can be instructive for understanding the series representations. A recent revisiting of linear forecasters (Li et al., 2023) highlights that temporal features extracted by MLPs are supposed to be shared across distinct time series. We propose a rational explanation: the neurons of the MLP are taught to portray the intrinsic properties of any time series, such as the amplitude, periodicity, and even frequency spectrum (neuron as a filter), serving as a more advantageous predictive representation learner than self-attention applied on time points. Experimentally, we validate in Section 4.3 that this division of labor helps enjoy the benefits of linear layers, such as promoted performance when providing enlarged lookback series, and generalization ability on unseen variates.
Self-attention    While the attention mechanism is generally adopted to facilitate temporal dependency modeling in previous forecasters, the inverted model regards the whole series of one variate as an independent process. Concretely, with the comprehensively extracted representations of each time series $\mathbf{H} = \{\mathbf{h}_1, \ldots, \mathbf{h}_N\} \in \mathbb{R}^{N \times D}$, the self-attention module adopts linear projections to obtain queries, keys, and values $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times d_k}$, where $d_k$ is the projected dimension. Denoting by $\mathbf{q}_i, \mathbf{k}_j \in \mathbb{R}^{d_k}$ the query and key of specific (variate) tokens, we notice that each entry of the pre-Softmax score map is formulated as $\mathbf{A}_{i,j} = (\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d_k})_{i,j} \propto \mathbf{q}_i^{\top}\mathbf{k}_j$. Since each token is previously normalized on its feature dimension, the entries can somewhat reveal the variate-wise correlation, and the whole score map $\mathbf{A} \in \mathbb{R}^{N \times N}$ exhibits the multivariate correlations between paired variate tokens. Consequently, highly correlated variates are weighted more in the subsequent representation interaction with the values $\mathbf{V}$. Based on this intuition, the proposed mechanism is believed to be more natural and interpretable for multivariate series forecasting. We further provide a visualization analysis of the score map in Section 4.3 and Appendix E.1.
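To make the variate-wise score map concrete, a short sketch (assumed dimensions, single head, no masking) computing A and the attention output is given below:

```python
import torch
import torch.nn as nn

B, N, D, d_k = 8, 7, 512, 64                     # assumed sizes; single attention head
H = torch.randn(B, N, D)                         # variate-token representations

W_q, W_k, W_v = (nn.Linear(D, d_k, bias=False) for _ in range(3))
Q, K, V = W_q(H), W_k(H), W_v(H)                 # (B, N, d_k) each

A = Q @ K.transpose(-2, -1) / d_k ** 0.5         # (B, N, N) pre-Softmax multivariate score map
out = torch.softmax(A, dim=-1) @ V               # highly correlated variates get larger weights
```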
4 EXPERIMENTS
We thoroughly evaluate the proposed iTransformer on various time series forecasting applications, validate the generality of the proposed framework, and dive further into the effectiveness of applying Transformer components on the inverted dimensions of time series.
Datasets    We extensively include 7 real-world datasets in our experiments: ECL, ETT (4 subsets), Exchange, Traffic, and Weather used by Autoformer (Wu et al., 2021), the Solar-Energy dataset
proposed in LSTNet (Lai et al., 2018), and PEMS (4 subsets) evaluated in SCINet (Liu et al., 2022a).
We also provide experiments on Market (6 subsets) in Appendix F.4, which records the minute-sampled server load of the Alipay online transaction application with hundreds of variates; there we consistently outperform other baselines. Detailed dataset descriptions are provided in Appendix A.1.
In this section, we conduct extensive experiments to evaluate the forecasting performance of our
proposed model together with advanced deep forecasters.
Main results    Comprehensive forecasting results are listed in Table 1, with the best in red and the second best underlined. Lower MSE/MAE indicates a more accurate prediction. Compared with other forecasters, iTransformer is particularly good at forecasting high-dimensional time series. Besides, PatchTST, the previous state-of-the-art, fails in many cases on PEMS, which can stem from the extremely fluctuating series of this dataset: the patching mechanism of PatchTST may lose focus on the specific locality needed to handle rapid fluctuations. By contrast, the proposed model, which aggregates whole-series variations for series representations, can better cope with this situation. Notably, as the representative that explicitly captures multivariate correlations, Crossformer still performs worse than iTransformer, indicating that the interaction of time-unaligned patches from different variates brings unnecessary noise into forecasting. Therefore, the native Transformer components are competent for temporal modeling and multivariate correlating, and the proposed inverted architecture can effectively tackle real-world time series forecasting scenarios.
Table 1: Multivariate forecasting results with prediction lengths S ∈ {12, 24, 36, 48} for PEMS and
S ∈ {96, 192, 336, 720} for others, and a fixed lookback length T = 96. Results are averaged over all prediction lengths. Avg means further averaged over subsets. Full results are listed in Appendix F.4.
Models:       iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021)
Metric:       MSE MAE per model, in the order above

ECL           0.178 0.270 | 0.219 0.298 | 0.205 0.290 | 0.244 0.334 | 0.251 0.344 | 0.192 0.295 | 0.212 0.300 | 0.268 0.365 | 0.214 0.327 | 0.193 0.296 | 0.227 0.338
ETT (Avg)     0.383 0.399 | 0.380 0.392 | 0.381 0.397 | 0.685 0.578 | 0.482 0.470 | 0.391 0.404 | 0.442 0.444 | 0.689 0.597 | 0.408 0.428 | 0.471 0.464 | 0.465 0.459
Exchange      0.360 0.403 | 0.378 0.417 | 0.367 0.404 | 0.940 0.707 | 0.370 0.413 | 0.416 0.443 | 0.354 0.414 | 0.750 0.626 | 0.519 0.429 | 0.461 0.454 | 0.613 0.539
Traffic       0.428 0.282 | 0.626 0.378 | 0.481 0.304 | 0.550 0.304 | 0.760 0.473 | 0.620 0.336 | 0.625 0.383 | 0.804 0.509 | 0.610 0.376 | 0.624 0.340 | 0.628 0.379
Weather       0.258 0.278 | 0.272 0.291 | 0.259 0.281 | 0.259 0.315 | 0.271 0.320 | 0.259 0.287 | 0.265 0.317 | 0.292 0.363 | 0.309 0.360 | 0.288 0.314 | 0.338 0.382
Solar-Energy  0.233 0.262 | 0.369 0.356 | 0.270 0.307 | 0.641 0.639 | 0.347 0.417 | 0.301 0.319 | 0.330 0.401 | 0.282 0.375 | 0.291 0.381 | 0.261 0.381 | 0.885 0.711
PEMS (Avg)    0.119 0.218 | 0.514 0.482 | 0.217 0.305 | 0.220 0.304 | 0.375 0.440 | 0.148 0.246 | 0.320 0.394 | 0.121 0.222 | 0.224 0.327 | 0.151 0.249 | 0.614 0.575
In this section, we evaluate iTransformers by applying our framework to Transformer and its variants, which generally address the quadratic complexity of the self-attention mechanism, including Reformer (Kitaev et al., 2020), Informer (Li et al., 2021), Flowformer (Wu et al., 2022), and FlashAttention (Dao et al., 2022). Surprising and promising findings emerge, indicating that the simple inverted perspective can enhance Transformer-based forecasters with improved performance and efficiency, generalization on unseen variates, and better utilization of historical observations.
Performance promotion    We evaluate Transformers and the corresponding iTransformers, with the performance promotions reported in Table 2. It is notable that the framework consistently improves various Transformers. Overall, it achieves an average promotion of 38.9% on Transformer, 36.1% on Reformer, 28.5% on Informer, 16.8% on Flowformer, and 32.2% on Flashformer, revealing the previously improper usage of the Transformer architecture in time series forecasting. Moreover, since the attention mechanism is adopted on the variate dimension in our inverted structure, the introduction of efficient attention with linear complexity essentially addresses the computational problem caused by
numerous variates, which is prevalent in real-world applications but can be resource-consuming for
Channel Independence (Nie et al., 2023). Therefore, the idea of iTransformer can be widely practiced
on Transformer-based forecasters to take advantage of booming efficient attention mechanisms.
Table 2: Performance promotion obtained by our inverted framework. Flashformer means Transformer
equipped with hardware-accelerated FlashAttention (Dao et al., 2022). We report the average
performance and the relative MSE reduction (Promotion). Full results can be found in Appendix F.2.
Models:            Transformer (2017) | Reformer (2020) | Informer (2021) | Flowformer (2022) | Flashformer (2022)
Metric:            MSE MAE per model, in the order above

ECL      Original    0.277 0.372 | 0.338 0.422 | 0.311 0.397 | 0.267 0.359 | 0.285 0.377
         +Inverted   0.178 0.270 | 0.208 0.301 | 0.216 0.311 | 0.210 0.293 | 0.206 0.291
         Promotion   35.6% 27.4% | 38.4% 28.7% | 30.5% 21.6% | 21.3% 18.6% | 27.8% 22.9%

Traffic  Original    0.665 0.363 | 0.741 0.422 | 0.764 0.416 | 0.750 0.421 | 0.658 0.356
         +Inverted   0.428 0.282 | 0.647 0.370 | 0.662 0.380 | 0.524 0.355 | 0.492 0.333
         Promotion   35.6% 22.3% | 12.7% 12.3% | 13.3% 8.6%  | 30.1% 15.6% | 25.2% 6.4%

Weather  Original    0.657 0.572 | 0.803 0.656 | 0.634 0.548 | 0.286 0.308 | 0.659 0.574
         +Inverted   0.258 0.279 | 0.248 0.292 | 0.271 0.330 | 0.266 0.285 | 0.262 0.282
         Promotion   60.2% 50.8% | 69.2% 55.5% | 57.3% 39.8% | 7.2% 7.7%   | 60.2% 50.8%
Variate generalization    By inverting vanilla Transformers, the models are notably empowered with generalization capability on unseen variates. First, benefiting from the flexibility in the number of input tokens, the number of variate channels is no longer restricted and can thus differ between training and inference. Besides, the feed-forward network is identically applied to independent variate tokens in iTransformer. As aforementioned, its neurons, acting as filters, learn the intrinsic patterns of any time series, which are inclined to be shared and transferable among distinct variates.
To verify this hypothesis, we compare inverting with another generalization strategy, Channel Independence, which trains a shared backbone to forecast all variates. We partition the variates of each dataset into five folds, train models with only the 20% of variates in one fold, and directly forecast all variates without fine-tuning. We compare the performance in Figure 5, where each bar presents the results averaged over all folds to reduce the randomness of partitioning. CI-Transformers take a long time to predict each variate one by one during inference, while iTransformers directly predict all variates and generally exhibit smaller error increases, indicating that the FFN is competent to learn transferable time series representations. This leaves a promising direction to build a foundation model upon iTransformer, where diverse multivariate time series with different numbers of variates can feasibly be trained together.
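As a toy illustration of why this is possible (assumed sizes; not the experimental setup above), no parameter of an inverted model depends on the number of variates, since variates occupy the token dimension:

```python
import torch
import torch.nn as nn

T, S, D = 96, 192, 64                                # lookback, horizon, model dim (assumed)
embed, project = nn.Linear(T, D), nn.Linear(D, S)
block = nn.TransformerEncoderLayer(D, nhead=4, dim_feedforward=4 * D, batch_first=True)

def forecast(x):                                     # x: (batch, T, N)
    tokens = block(embed(x.transpose(1, 2)))         # (batch, N, D) variate tokens
    return project(tokens).transpose(1, 2)           # (batch, S, N)

print(forecast(torch.randn(4, T, 20)).shape)         # e.g. trained with 20 variates
print(forecast(torch.randn(4, T, 137)).shape)        # directly applied to 137 variates
```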