
Dominant Shuffle: A Simple Yet Powerful Data Augmentation for Time-series Prediction

Kai Zhao1  Zuojie He2  Alex Hung1  Dan Zeng2
1UCLA  2 Shanghai University
kz@kaizhao.net
Abstract

Recent studies have suggested that frequency-domain data augmentation (DA) is effective for time series prediction. Existing frequency-domain augmentations disturb the original data with various full-spectrum noises, leading to an excessive domain gap between augmented and original data. Although impressive performance has been achieved in certain cases, frequency-domain DA has yet to be generalized to time series prediction datasets. In this paper, we find that frequency-domain augmentations can be significantly improved by two modifications that limit the perturbations. First, limiting the perturbation to only the dominant frequencies significantly outperforms full-spectrum perturbations. Dominant frequencies represent the main periodicity and trends of the signal and are more important than other frequencies. Second, simply shuffling the dominant frequency components is superior to sophisticatedly designed random perturbations. Shuffling rearranges the original components (magnitudes and phases) and limits the external noise. With these two modifications, we propose dominant shuffle, a simple yet effective data augmentation for time series prediction that can be implemented with just a few lines of code. Extensive experiments with eight datasets and six popular time series models demonstrate that our method consistently improves the baseline performance under various settings and significantly outperforms other DA methods. Code can be accessed at https://kaizhao.net/time-series.

1 Introduction

Time-series prediction aims to forecast multivariate future values based on historical observations. It is a long-standing problem with various applications such as electricity pricing, weather forecasting, and traffic prediction [10, 35]. Recently, impressive results have been achieved by using various deep learning architectures, e.g. recurrent neural networks (RNNs) [21, 22, 14], Transformers [35, 31, 37, 13], and temporal convolutional networks (TCNs) [27, 12, 30]. Neural networks require a large volume of training data to effectively fit their numerous parameters. Unfortunately, time-series data acquired from real-world sensors are limited in many applications. The patterns of a time series heavily depend on the specific dynamic system that generates the data, so data from other sources are not applicable [3, 23].

To mitigate the impact of insufficient data in time series analysis, several data augmentation techniques have been explored  [28, 7, 1, 23, 3, 33, 4, 8, 34, 20, 26, 9, 24, 11, 16, 10]. Most of these data augmentation techniques in time series analysis focus on classification  [20, 26, 9, 24, 16, 10, 33, 4] and anomaly detection [11, 10, 8]. These augmentations alter the time series sequences while preserving the class labels. However, the prediction task requires more fine-grained temporal information to accurately estimate future dynamics [34, 3]. These perturbations designed for classification can disrupt the data-label coherence and lead to performance degradation [34, 3].

Coherence is a key factor in effective data augmentation [28, 34, 25]. It measures the semantic connection between the augmented data and the label. Augmentations designed for classification often struggle with prediction tasks because their unilateral perturbations disrupt the data-label coherence. Recently, to mitigate this loss of data-label coherence, Chen et al. [3] proposed to simultaneously perturb the data (historical sequence) and labels (future sequences) in the frequency domain. Unlike common data augmentations that introduce slight perturbations only to the data while keeping the labels unchanged, this approach enables more radical perturbations, such as frequency mix (FreqMix) and frequency mask (FreqMask), to be applied without severely disrupting the data-label coherence. The method indeed generates new data-label pairs that are significantly different from the originals.

However, the full-spectrum perturbations in FreqMask and FreqMix introduce external randomization and enlarge the domain gap between the augmented and original data. This can lead to unstable and suboptimal results on some benchmarks, especially with a larger number of augmented samples. As shown in Fig. 5, the performance of FrAug [3] degrades significantly as the number of augmented samples rises, which suggests that the augmented samples are out-of-distribution relative to the original samples.

In this paper, to reduce the domain gap between the augmented and original data, we propose to limit the perturbation and randomization in data augmentation. First, we limit the perturbation to specific frequencies instead of perturbing the full spectrum. Several recent studies have pointed out that a few frequency components dominate the periodicity and main trends of the time series, while the other frequencies correspond to minor trends or noise [30, 37, 36]. Following [30], we perturb the top-$k$ frequencies with the highest magnitudes. Second, to avoid excess external noise, we use random shuffle for perturbation. Shuffle rearranges existing components without introducing any external randomness.

Extensive comparisons were made among nine different data augmentation methods on eight public datasets using six state-of-the-art time-series prediction network architectures. These comparisons demonstrate that, despite its simplicity, our method outperforms its competitors by a substantial margin. As shown in Fig. 1, our method consistently improves the performance across various datasets, and outperforms other augmentations in most cases.

Comprehensive ablation studies demonstrate that perturbing dominant frequencies yields significantly better performance than various full-spectrum perturbations, and that shuffle is superior to other randomization techniques. In addition, our augmentation exhibits a smaller augmented-original gap than other augmentations, as indicated by higher performance with an increased number of augmented samples (Fig. 5).

Figure 1: Relative improvements (%) of various data augmentations over the baseline on eight datasets using the state-of-the-art iTransformer [13] model. Zero corresponds to the original model without any data augmentation. Our method consistently improves the baseline on all the datasets and outperforms other augmentations in most cases. The improvements are based on the average performance of four prediction lengths: 96, 192, 336, and 720.

2 Related Work

In the last decade, deep learning has emerged as a powerful tool in time-series prediction and has shown superior performance over traditional statistical methods such as ARIMA and Exponential Smoothing [15]. A rich line of studies has introduced various deep-learning architectures, including recurrent neural networks (RNNs)  [21, 22, 14], temporal convolution neural networks (TCNs) [27, 12, 30], and Transformers [31, 17, 18, 13, 37]. These models learn to predict the future from large volumes of historical data.

Various data augmentations have been proposed for time series data, and many of these techniques were proposed for classification tasks [28, 20, 26, 9, 24, 16, 10, 33, 4]. Many of these methods regard time series data as one-dimensional images and borrow data augmentations from computer vision, e.g. cropping [9, 5], flipping [28], and noise injection [29]. Window warping [28] is a time series-specific data augmentation that upsamples (or downsamples) a random range of the time series while keeping other time ranges unchanged.

In addition to time-domain augmentations, there are also methods that perturb the original data in the frequency domain. Gao [8] proposed to add noise on both magnitude and phase in the frequency domain. Zhang [33] proposed to add single or multiple frequency components in the first half of the frequency spectrum. Chen [4] proposed to perform pooling or smoothing operations in the frequency domain.

While most of the augmentations focus on classification tasks, a few methods for the forecasting task have also been explored. Bandara [1] introduces two DA methods for forecasting: (i) average selected with distance (ASD), which generates augmented time series using the weighted sum of multiple time series, where the weights are determined by the dynamic time warping (DTW) distance [7]; (ii) moving block bootstrapping (MBB), which generates augmented data by manipulating the residual part of the time series after STL decomposition [23] and recombining it with the other series. Zhang [34] proposed to simultaneously augment in the frequency and time domains. Recently, Chen et al. [3] proposed to augment both the data (historical sequence) and the label (future sequence) in the frequency domain to improve the data-label coherence. Although this method generally achieves decent results, full-spectrum randomization imposes a large domain gap between the augmented and the original data, sometimes leading to degraded performance.

3 Dominant Frequency Shuffle for Time-series

3.1 Time-series Prediction and Frequency Domain Augmentation

Time-series prediction is a sequence-to-sequence problem where the model estimates a future multivariate sequence based on a sequence of historical measurements. Let $x=\{x^{1},x^{2},\dots,x^{L}\}_{t=1}^{L}\in\mathbb{R}^{L\times D}$ be the historical sequence, and $y=\{x^{L+1},x^{L+2},\dots,x^{L+T}\}_{t=L+1}^{L+T}\in\mathbb{R}^{T\times D}$ is the future sequence to be estimated. $x^{t}$ is the measurement at timestep $t$ and $D$ is the number of variates. Next, we will use $x\in\mathbb{R}^{L\times D}$ and $y\in\mathbb{R}^{T\times D}$ to denote the historical and future sequences. $x$ and $y$ are the input and output of deep learning models, respectively.

3.2 Dominant Frequency Shuffle

Deep neural networks learn the $x\rightarrow y$ mapping from a large volume of $(x,y)$ pairs, and data augmentation is an efficient way of expanding the training data. Frequency-domain augmentation is a family of augmentation methods that perturb time series in the frequency domain. These methods first convert time series to the frequency domain, apply perturbations there, and then convert the modified data back to the time domain.

Following FrAug [3], we augment the concatenation of data and label to preserve the data-label consistency. Let $F(\omega)=\mathcal{F}([x,y])$ be the discrete Fourier transform (DFT) of the time series, where $[x,y]$ denotes the concatenation of data and label. (We used torch.fft.rfft() and torch.fft.irfft() for the time-to-frequency and inverse conversions.) We shuffle only the dominant frequencies, i.e. those with the highest magnitudes $\lvert F(\omega)\rvert$. Let $\hat{F}(\omega)$ be the frequency-domain data with the dominant frequencies shuffled. $\hat{F}(\omega)$ is then converted back to the time domain using the inverse DFT (iDFT): $[\hat{x},\hat{y}]=\text{iDFT}\big(\hat{F}(\omega)\big)$, where $\hat{x},\hat{y}$ is the augmented data-label pair. Fig. 2 illustrates an example of the process of dominant shuffle with $k=3$. The prediction models were trained on a combined training set with both augmented and original data.
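The procedure described above can be sketched in a few lines. The following is a minimal illustration rather than the authors' released code: it uses NumPy's `np.fft.rfft` and `np.fft.irfft` in place of the torch.fft calls, and the function name `dominant_shuffle`, the per-variate selection of the top-$k$ bins, and the inclusion of the zero-frequency bin among the candidates are all assumptions made for the example.

```python
import numpy as np

def dominant_shuffle(x, y, k=3, rng=None):
    """Illustrative sketch: shuffle the k largest-magnitude frequency bins.

    x: historical sequence, shape (L, D); y: future sequence, shape (T, D).
    Returns an augmented pair (x_hat, y_hat) with the same shapes.
    """
    rng = np.random.default_rng() if rng is None else rng
    xy = np.concatenate([x, y], axis=0)          # [x, y], shape (L+T, D)
    n = xy.shape[0]
    spec = np.fft.rfft(xy, axis=0)               # F(w), shape (n//2 + 1, D)
    k = min(k, spec.shape[0])

    # Assumption: dominant bins are chosen independently for each variate.
    top = np.argsort(-np.abs(spec), axis=0)[:k]  # (k, D) bin indices

    out = spec.copy()
    for d in range(spec.shape[1]):
        src = top[:, d]
        dst = rng.permutation(src)               # random rearrangement of the same bins
        out[dst, d] = spec[src, d]               # move complex values (magnitude and phase)

    xy_hat = np.fft.irfft(out, n=n, axis=0)      # back to the time domain
    L = x.shape[0]
    return xy_hat[:L], xy_hat[L:]
```

With `k=1` the permutation is trivial and the pair is returned unchanged up to floating-point error. All bins outside the selected top-$k$ keep their original values, which is the sense in which no external noise is injected.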

Figure 2: Illustration of shuffling three dominant frequencies. (a) The original time series $x^{t}$. (b) and (c) Frequency-domain representations before and after dominant shuffle. Colored dots represent the shuffle of dominant frequencies. (d) Augmented time series with the original time series as reference.

4 Experiments

In this section, we first introduce the implementation details in Sec. 4.1, and then compare the performance of various SOTA models with and without dominant shuffle in Sec. 4.2. In Sec. 4.3, we thoroughly compare dominant shuffle with various data augmentation methods. Finally, we conduct ablation studies to verify hyperparameter sensitivity and justify design choices in Sec. 4.4.

4.1 Experimental Setups

Implementation details   All the experiments were conducted with the PyTorch [19] framework on a single NVIDIA RTX 3090 GPU. Some of the experimental results were taken from the respective original papers, and some were reproduced using official code with default configurations. We only changed the data augmentation for fair comparisons. Please refer to Sec. A.2 for the details about our reimplementations. Following the practice of [3], we performed data augmentations to double the size of the original training dataset unless otherwise specified.

Evaluation protocols   We tested our method with short-term and long-term prediction protocols. In the long-term protocol, the prediction period $T$ takes the values 96, 192, 336, and 720. In contrast, the short-term protocol uses prediction periods of 12, 24, 36, and 48. Following the common practice of previous works [35, 31, 37, 13, 27, 30], we quantified the performance of the prediction using the mean-squared error (MSE) between the ground-truth and the prediction.
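For reference, with $\tilde{y}\in\mathbb{R}^{T\times D}$ denoting a model's prediction (a symbol introduced here only for illustration, chosen so that it does not clash with the augmented label $\hat{y}$), the MSE is commonly computed by averaging over all timesteps and variates. The exact reduction used in each baseline's code is assumed here rather than stated in the text:

$$\mathrm{MSE}(y,\tilde{y})=\frac{1}{TD}\sum_{t=1}^{T}\sum_{d=1}^{D}\left(y_{t,d}-\tilde{y}_{t,d}\right)^{2}$$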

Datasets   For long-term prediction, we experimented on eight well-established benchmarks: the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [35], and the Weather, Electricity, Exchange, and Traffic datasets [31]. For short-term prediction, following iTransformer [13], we used four public traffic network datasets (PEMS03, PEMS04, PEMS07, PEMS08) from PEMS [2].

Each dataset is divided into training, testing, and evaluation subsets in specific ratios. The training, testing, and evaluation ratio is 6:2:2 for ETT and PEMS datasets, and the ratio is 7:1:2 for Electricity, Traffic, Weather, and Exchange-rate datasets. Detailed statistics of these datasets are summarized in Sec. A.1. For each setting (dataset + prediction length $T$), we tuned the optimal number of dominant frequencies $k$ on the evaluation set. The optimal $k$ on various datasets can be found in Sec. B.3.
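To make the ratios concrete, the sketch below cuts a series of `n_samples` timesteps into three consecutive segments. The function name and the choice to split in chronological order without shuffling are assumptions for illustration; the text above specifies only the ratios.

```python
def chronological_split(n_samples, ratios):
    """Split indices 0..n_samples-1 into three consecutive ranges.

    ratios is a 3-tuple such as (6, 2, 2) or (7, 1, 2); the ranges are
    returned in the same order as the ratios are given.
    """
    total = sum(ratios)
    first = n_samples * ratios[0] // total
    second = n_samples * ratios[1] // total
    return (
        range(0, first),
        range(first, first + second),
        range(first + second, n_samples),  # remainder goes to the last subset
    )

# Example: 100 timesteps under the two ratios quoted above.
ett_like = chronological_split(100, (6, 2, 2))
weather_like = chronological_split(100, (7, 1, 2))
```

Any rounding remainder is assigned to the last segment so that every timestep belongs to exactly one subset.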

Baseline Models   We selected diverse models as the baselines in our experiments, including two Transformer-based methods (iTransformer [13], Autoformer [31]), two MLP-based methods (TiDE [6], LightTS [32]), and two temporal convolutional network (TCN) based methods (MICN [27], SCINet [12]). iTransformer [13] is the state-of-the-art Transformer-based model, TiDE [6] is the state-of-the-art MLP-based model, and MICN [27] is the state-of-the-art TCN-based model. For short-term prediction, we used the SOTA iTransformer [13] model on the PEMS [2] datasets as the baseline model.

Other data augmentation methods   We compared the proposed method with nine existing data augmentation methods, including three time-domain augmentations (ASD [7], MSB [1], Upsample [23]), five frequency-domain methods (FreqMix [3], FreqMask [3], FreqAdd [33], FreqPool [4], Robusttad [8]), and a temporal-frequency method, STAug [34].

4.2 Comparison With State-of-the-arts

We first evaluated our method on state-of-the-art time series prediction models published in top-tier venues. We compared the performance of recent models (iTransformer [13] (ICLR2024), MICN [27] (ICLR2023), SCINet [12] (NeurIPS2022), and AutoFormer [31] (NeurIPS2021)) with and without dominant shuffle. The mean squared error (MSE) averaged across the prediction lengths (96, 192, 336, 720) is calculated for each dataset.

[Figure 3 image: bar chart with model groups AutoFormer [31] (NeurIPS2021), SCINet [12] (NeurIPS2022), MICN [27] (ICLR2023), and iTransformer [13] (ICLR2024).]
Figure 3: Performance of different models with (right striped bars) and without (left color bars) dominant shuffle. The horizontal dotted lines show how dominant shuffle helps one model outperform a more advanced model.

The results in Fig. 3 demonstrate that our method consistently reduces the prediction error in all cases. In some cases, a model trained with dominant shuffle surpasses even a more sophisticated model. For example, on the ETTh1 dataset, our approach significantly improves the performance of AutoFormer [31] and MICN [27], and helps them outperform the latest iTransformer [13] model. On the Exchange and Weather datasets, our approach enables AutoFormer to outperform SCINet [12] and assists MICN [27] in surpassing iTransformer [13].

4.3 Comparisons With Other Data Augmentations

We compared different data augmentation methods on various datasets and baseline models under short-term and long-term protocols. Fig. 1 shows the relative improvements (%) of various augmentation methods over the baseline. Tab. 1, 2 and 3 summarize the average performance of 5 runs with distinct random seeds, and the standard deviations of different runs can be found in Sec. B.4. The best values in each column are highlighted with color. Example predictions can be found in Fig. 6 in appendix B.
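As a worked example of a relative improvement, the sketch below averages the MSE over the four prediction lengths and reports the percentage reduction with respect to the baseline. The exact formula behind Fig. 1 is not spelled out in the text, so this definition (and the helper name `relative_improvement`) is an assumption; the numbers are the iTransformer rows for ETTh1 in Table 1.

```python
def relative_improvement(baseline_mse, augmented_mse):
    """Percentage reduction of the horizon-averaged MSE.

    Positive values mean the augmentation lowers the error.
    """
    base = sum(baseline_mse) / len(baseline_mse)
    aug = sum(augmented_mse) / len(augmented_mse)
    return 100.0 * (base - aug) / base

# iTransformer on ETTh1, prediction lengths 96, 192, 336, 720 (Table 1).
baseline = [0.392, 0.447, 0.483, 0.516]
ours = [0.383, 0.438, 0.473, 0.492]
gain = relative_improvement(baseline, ours)  # roughly 2.8 (percent)
```

Under this definition, zero corresponds to a model trained without augmentation and negative values indicate that an augmentation hurts.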

We first compared different data augmentation methods for long-term prediction. Tab. 1 summarizes the mean squared errors (MSE) on ETT datasets and Tab. 2 summarizes the MSE on Weather, Electricity, and Exchange-rate datasets. Due to limited space, we only report the results of six datasets (ETTh1, ETTh2, ETTm1, Electricity, Weather, and Exchange rate) in Tab. 1 and 2, and the results of the other two datasets (ETTm2 and Traffic) can be found in appendix B. We also merged the results of FreqMix and FreqMask by selecting the superior one in each case. The merged results are denoted as ‘MixMask’.
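The merging rule amounts to an element-wise minimum of the two MSE rows. The sketch below illustrates it on the separate Mask and Mix rows reported for PEMS03 in Table 3; the helper name `merge_best` is hypothetical, and reading "each case" as each dataset and prediction-length cell is an assumption.

```python
def merge_best(mse_a, mse_b):
    """Keep the lower (better) MSE of two methods for every setting."""
    if len(mse_a) != len(mse_b):
        raise ValueError("both rows must cover the same settings")
    return [min(a, b) for a, b in zip(mse_a, mse_b)]

# PEMS03, prediction lengths 12, 24, 36, 48 (Table 3).
freq_mask = [0.443, 1.205, 0.233, 1.510]
freq_mix = [1.018, 0.097, 0.877, 1.501]
mix_mask = merge_best(freq_mask, freq_mix)
```

Because the better of the two is picked per cell, the merged row is an optimistic summary of FreqMix and FreqMask rather than the result of a single method.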

| Model | Method | ETTh1-96 | ETTh1-192 | ETTh1-336 | ETTh1-720 | ETTh2-96 | ETTh2-192 | ETTh2-336 | ETTh2-720 | ETTm1-96 | ETTm1-192 | ETTm1-336 | ETTm1-720 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iTransformer [13] | Baseline | 0.392 | 0.447 | 0.483 | 0.516 | 0.303 | 0.381 | 0.412 | 0.434 | 0.344 | 0.383 | 0.421 | 0.494 |
| | ASD [7] | 0.398 | 0.456 | 0.483 | 0.512 | 0.310 | 0.388 | 0.432 | 0.452 | 0.340 | 0.382 | 0.454 | 0.492 |
| | MSB [1] | 0.387 | 0.460 | 0.494 | 0.531 | 0.309 | 0.382 | 0.447 | 0.433 | 0.339 | 0.386 | 0.467 | 0.510 |
| | Upsample [23] | 0.391 | 0.445 | 0.481 | 0.519 | 0.305 | 0.381 | 0.419 | 0.430 | 0.351 | 0.381 | 0.432 | 0.489 |
| | FreqAdd [33] | 0.389 | 0.446 | 0.475 | 0.510 | 0.300 | 0.384 | 0.416 | 0.438 | 0.350 | 0.385 | 0.422 | 0.490 |
| | FreqPool [4] | 0.433 | 0.456 | 0.497 | 0.532 | 0.313 | 0.392 | 0.415 | 0.450 | 0.347 | 0.392 | 0.430 | 0.499 |
| | Robusttad [8] | 0.390 | 0.445 | 0.497 | 0.510 | 0.312 | 0.388 | 0.412 | 0.439 | 0.353 | 0.382 | 0.421 | 0.498 |
| | STAug [34] | 0.390 | 0.445 | 0.489 | 0.511 | 0.323 | 0.428 | 0.486 | 0.483 | 0.339 | 0.383 | 0.417 | 0.485 |
| | MixMask [3] | 0.388 | 0.440 | 0.477 | 0.504 | 0.301 | 0.380 | 0.414 | 0.434 | 0.334 | 0.375 | 0.421 | 0.485 |
| | Ours | 0.383 | 0.438 | 0.473 | 0.492 | 0.298 | 0.382 | 0.411 | 0.428 | 0.332 | 0.374 | 0.424 | 0.492 |
| AutoFormer [31] | Baseline | 0.429 | 0.440 | 0.495 | 0.498 | 0.381 | 0.443 | 0.471 | 0.475 | 0.467 | 0.610 | 0.529 | 0.773 |
| | ASD | 0.450 | 0.485 | 0.523 | 0.556 | 0.370 | 0.465 | 0.476 | 0.503 | 0.480 | 0.620 | 0.502 | 0.633 |
| | MSB | 0.462 | 0.517 | 0.612 | 0.579 | 0.434 | 0.523 | 0.556 | 0.462 | 0.499 | 0.645 | 0.553 | 0.721 |
| | Upsample | 0.416 | 0.523 | 0.480 | 0.482 | 0.353 | 0.460 | 0.455 | 0.509 | 0.498 | 0.630 | 0.512 | 0.667 |
| | FreqAdd | 0.460 | 0.487 | 0.497 | 0.525 | 0.367 | 0.439 | 0.480 | 0.504 | 0.419 | 0.554 | 0.546 | 0.569 |
| | FreqPool | 0.446 | 0.457 | 0.523 | 0.512 | 0.392 | 0.442 | 0.470 | 0.493 | 0.479 | 0.623 | 0.510 | 0.754 |
| | Robusttad | 0.437 | 0.452 | 0.492 | 0.477 | 0.367 | 0.497 | 0.502 | 0.527 | 0.432 | 0.510 | 0.553 | 0.623 |
| | STAug | 0.429 | 0.478 | 0.505 | 0.506 | 0.354 | 0.443 | 0.496 | 0.495 | 0.415 | 0.581 | 0.588 | 0.693 |
| | MixMask | 0.420 | 0.445 | 0.467 | 0.474 | 0.358 | 0.421 | 0.470 | 0.467 | 0.415 | 0.510 | 0.491 | 0.588 |
| | Ours | 0.409 | 0.436 | 0.458 | 0.486 | 0.335 | 0.419 | 0.453 | 0.452 | 0.392 | 0.506 | 0.491 | 0.559 |
| MICN [27] | Baseline | 0.384 | 0.425 | 0.464 | 0.574 | 0.358 | 0.518 | 0.566 | 0.827 | 0.313 | 0.360 | 0.389 | 0.461 |
| | ASD | 0.380 | 0.430 | 0.472 | 0.523 | 0.377 | 0.539 | 0.620 | 0.843 | 0.315 | 0.362 | 0.399 | 0.457 |
| | MSB | 0.423 | 0.423 | 0.501 | 0.559 | 0.402 | 0.623 | 0.790 | 1.126 | 0.330 | 0.358 | 0.402 | 0.459 |
| | Upsample | 0.396 | 0.435 | 0.463 | 0.550 | 0.366 | 0.500 | 0.831 | 0.752 | 0.339 | 0.377 | 0.402 | 0.475 |
| | FreqAdd | 0.390 | 0.430 | 0.477 | 0.643 | 0.370 | 0.521 | 0.626 | 0.975 | 0.316 | 0.360 | 0.407 | 0.478 |
| | FreqPool | 0.399 | 0.465 | 0.473 | 0.572 | 0.365 | 0.553 | 0.550 | 0.812 | 0.336 | 0.372 | 0.397 | 0.466 |
| | Robusttad | 0.392 | 0.436 | 0.491 | 0.556 | 0.339 | 0.529 | 0.553 | 0.998 | 0.339 | 0.359 | 0.396 | 0.472 |
| | STAug | 0.374 | 0.429 | 0.489 | 0.608 | 0.413 | 0.760 | 1.330 | 2.608 | 0.313 | 0.360 | 0.418 | 0.483 |
| | MixMask | 0.378 | 0.423 | 0.461 | 0.521 | 0.339 | 0.488 | 0.544 | 0.735 | 0.301 | 0.352 | 0.401 | 0.454 |
| | Ours | 0.373 | 0.421 | 0.452 | 0.510 | 0.310 | 0.427 | 0.507 | 0.731 | 0.314 | 0.360 | 0.387 | 0.470 |
| SCINet [12] | Baseline | 0.485 | 0.506 | 0.519 | 0.552 | 0.372 | 0.416 | 0.429 | 0.470 | 0.316 | 0.353 | 0.387 | 0.431 |
| | ASD | 0.494 | 0.480 | 0.491 | 0.559 | 0.362 | 0.402 | 0.432 | 0.499 | 0.331 | 0.367 | 0.389 | 0.453 |
| | MSB | 0.489 | 0.466 | 0.502 | 0.547 | 0.359 | 0.396 | 0.458 | 0.476 | 0.320 | 0.351 | 0.396 | 0.478 |
| | Upsample | 0.471 | 0.457 | 0.479 | 0.541 | 0.379 | 0.407 | 0.403 | 0.482 | 0.342 | 0.386 | 0.399 | 0.442 |
| | FreqAdd | 0.428 | 0.452 | 0.469 | 0.532 | 0.335 | 0.385 | 0.403 | 0.447 | 0.304 | 0.338 | 0.373 | 0.421 |
| | FreqPool | 0.499 | 0.510 | 0.557 | 0.549 | 0.410 | 0.453 | 0.432 | 0.475 | 0.331 | 0.362 | 0.379 | 0.432 |
| | Robusttad | 0.462 | 0.501 | 0.498 | 0.559 | 0.362 | 0.431 | 0.419 | 0.496 | 0.331 | 0.351 | 0.394 | 0.438 |
| | STAug | 0.457 | 0.500 | 0.524 | 0.534 | 0.538 | 0.636 | 0.681 | 0.648 | 0.319 | 0.357 | 0.389 | 0.445 |
| | MixMask | 0.427 | 0.452 | 0.465 | 0.548 | 0.335 | 0.377 | 0.400 | 0.438 | 0.302 | 0.341 | 0.376 | 0.423 |
| | Ours | 0.417 | 0.443 | 0.461 | 0.527 | 0.335 | 0.375 | 0.392 | 0.421 | 0.302 | 0.338 | 0.372 | 0.420 |
| TiDE [6] | Baseline | 0.401 | 0.434 | 0.521 | 0.558 | 0.304 | 0.350 | 0.331 | 0.399 | 0.311 | 0.340 | 0.366 | 0.420 |
| | ASD | 0.417 | 0.441 | 0.513 | 0.556 | 0.320 | 0.351 | 0.367 | 0.422 | 0.319 | 0.341 | 0.399 | 0.432 |
| | MSB | 0.422 | 0.476 | 0.529 | 0.579 | 0.331 | 0.379 | 0.334 | 0.401 | 0.302 | 0.356 | 0.382 | 0.451 |
| | Upsample | 0.431 | 0.452 | 0.533 | 0.604 | 0.346 | 0.372 | 0.350 | 0.456 | 0.324 | 0.339 | 0.378 | 0.463 |
| | FreqAdd | 0.385 | 0.420 | 0.477 | 0.505 | 0.289 | 0.336 | 0.330 | 0.390 | 0.309 | 0.339 | 0.365 | 0.417 |
| | FreqPool | 0.423 | 0.455 | 0.510 | 0.592 | 0.312 | 0.376 | 0.339 | 0.397 | 0.319 | 0.352 | 0.397 | 0.453 |
| | Robusttad | 0.396 | 0.432 | 0.521 | 0.537 | 0.331 | 0.352 | 0.337 | 0.398 | 0.321 | 0.346 | 0.382 | 0.437 |
| | STAug | 0.515 | 0.535 | 0.521 | 0.558 | 0.390 | 0.437 | 0.403 | 0.508 | 0.310 | 0.337 | 0.364 | 0.417 |
| | MixMask | 0.385 | 0.420 | 0.478 | 0.507 | 0.289 | 0.339 | 0.330 | 0.391 | 0.299 | 0.332 | 0.367 | 0.416 |
| | Ours | 0.385 | 0.414 | 0.467 | 0.498 | 0.283 | 0.332 | 0.324 | 0.388 | 0.297 | 0.328 | 0.365 | 0.412 |
| LightTS [32] | Baseline | 0.448 | 0.444 | 0.663 | 0.706 | 0.369 | 0.476 | 0.738 | 1.165 | 0.323 | 0.347 | 0.428 | 0.476 |
| | ASD | 0.451 | 0.476 | 0.633 | 0.681 | 0.392 | 0.469 | 0.701 | 0.998 | 0.356 | 0.352 | 0.441 | 0.478 |
| | MSB | 0.467 | 0.463 | 0.627 | 0.652 | 0.378 | 0.472 | 0.652 | 1.123 | 0.371 | 0.349 | 0.430 | 0.479 |
| | Upsample | 0.449 | 0.472 | 0.610 | 0.637 | 0.401 | 0.487 | 0.714 | 1.245 | 0.329 | 0.366 | 0.453 | 0.492 |
| | FreqAdd | 0.417 | 0.430 | 0.578 | 0.622 | 0.351 | 0.453 | 0.689 | 1.125 | 0.322 | 0.352 | 0.400 | 0.450 |
| | FreqPool | 0.463 | 0.471 | 0.652 | 0.690 | 0.369 | 0.512 | 0.723 | 1.264 | 0.336 | 0.351 | 0.442 | 0.497 |
| | Robusttad | 0.445 | 0.442 | 0.590 | 0.654 | 0.372 | 0.468 | 0.699 | 0.982 | 0.331 | 0.352 | 0.441 | 0.462 |
| | STAug | 0.445 | 0.441 | 0.669 | 0.714 | 0.520 | 0.807 | 2.101 | 2.467 | 0.320 | 0.343 | 0.427 | 0.476 |
| | MixMask | 0.417 | 0.429 | 0.575 | 0.620 | 0.337 | 0.426 | 0.643 | 0.993 | 0.316 | 0.340 | 0.398 | 0.447 |
| | Ours | 0.405 | 0.423 | 0.565 | 0.603 | 0.335 | 0.395 | 0.575 | 0.827 | 0.322 | 0.340 | 0.391 | |

Table 1: MSE of the long-term prediction on the ETT [35] datasets. The best values are marked with colors.

As shown in Tab. 1 and 2, our method improves the baseline in 96% of the cases, while other augmentation methods, e.g. FreqMix, outperform the baseline in around 87% of the cases.

| Model | Method | Electricity-96 | Electricity-192 | Electricity-336 | Electricity-720 | Weather-96 | Weather-192 | Weather-336 | Weather-720 | Exchange-96 | Exchange-192 | Exchange-336 | Exchange-720 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iTransformer [13] | Baseline | 0.152 | 0.159 | 0.179 | 0.230 | 0.175 | 0.224 | 0.281 | 0.362 | 0.086 | 0.180 | 0.335 | 0.856 |
| | ASD [7] | 0.173 | 0.179 | 0.201 | 0.234 | 0.191 | 0.223 | 0.280 | 0.364 | 0.088 | 0.183 | 0.343 | 0.872 |
| | MSB [1] | 0.182 | 0.182 | 0.194 | 0.267 | 0.185 | 0.235 | 0.284 | 0.359 | 0.089 | 0.189 | 0.359 | 0.907 |
| | Upsample [23] | 0.166 | 0.188 | 0.216 | 0.221 | 0.204 | 0.257 | 0.291 | 0.373 | 0.086 | 0.180 | 0.338 | 0.834 |
| | FreqAdd [33] | 0.150 | 0.157 | 0.172 | 0.204 | 0.181 | 0.230 | 0.285 | 0.362 | 0.087 | 0.181 | 0.333 | 0.837 |
| | FreqPool [4] | 0.169 | 0.170 | 0.194 | 0.237 | 0.184 | 0.223 | 0.279 | 0.378 | 0.088 | 0.183 | 0.330 | 0.832 |
| | Robusttad [8] | 0.150 | 0.157 | 0.176 | 0.210 | 0.172 | 0.225 | 0.281 | 0.357 | 0.087 | 0.179 | 0.329 | 0.833 |
| | STAug [34] | 0.160 | 0.173 | 0.218 | 0.372 | 0.206 | 0.264 | 0.319 | 0.385 | 0.086 | 0.178 | 0.335 | 0.866 |
| | MixMask [3] | 0.151 | 0.158 | 0.173 | 0.205 | 0.175 | 0.224 | 0.279 | 0.354 | 0.089 | 0.178 | 0.328 | 0.845 |
| | Ours | 0.150 | 0.156 | 0.171 | 0.199 | 0.171 | 0.221 | 0.276 | 0.351 | 0.086 | 0.176 | 0.313 | 0.821 |
| AutoFormer [31] | Baseline | 0.203 | 0.208 | 0.231 | 0.239 | 0.241 | 0.314 | 0.341 | 0.425 | 0.143 | 0.305 | 0.470 | 1.056 |
| | ASD | 0.247 | 0.216 | 0.221 | 0.235 | 0.652 | 0.392 | 0.416 | 0.513 | 0.141 | 0.280 | 0.579 | 1.240 |
| | MSB | 0.237 | 0.256 | 0.295 | 0.236 | 0.256 | 0.379 | 0.402 | 0.468 | 0.156 | 0.254 | 0.513 | 1.339 |
| | Upsample | 0.201 | 0.209 | 0.232 | 0.268 | 0.281 | 0.294 | 0.329 | 0.385 | 0.141 | 0.292 | 0.553 | 1.295 |
| | FreqAdd | 0.193 | 0.197 | 0.212 | 0.225 | 0.255 | 0.323 | 0.370 | 0.419 | 0.143 | 0.369 | 0.716 | 1.173 |
| | FreqPool | 0.213 | 0.224 | 0.234 | 0.257 | 0.237 | 0.339 | 0.372 | 0.446 | 0.142 | 0.336 | 0.532 | 1.014 |
| | Robusttad | 0.230 | 0.242 | 0.261 | 0.231 | 0.27 | 0.334 | 0.351 | 0.429 | 0.142 | 0.309 | 0.462 | 1.123 |
| | STAug | 0.191 | 0.206 | 0.217 | 0.234 | 0.250 | 0.300 | 0.347 | 0.418 | 0.140 | 0.326 | 0.594 | 1.176 |
| | MixMask | 0.177 | 0.194 | 0.206 | 0.224 | 0.240 | 0.302 | 0.330 | 0.422 | 0.141 | 0.284 | 0.453 | 0.778 |
| | Ours | 0.171 | 0.191 | 0.203 | 0.219 | 0.214 | 0.273 | 0.327 | 0.383 | 0.136 | 0.243 | 0.418 | 0.695 |
| MICN [27] | Baseline | 0.171 | 0.183 | 0.198 | 0.224 | 0.188 | 0.241 | 0.278 | 0.350 | 0.091 | 0.185 | 0.355 | 0.941 |
| | ASD | 0.165 | 0.174 | 0.190 | 0.237 | 0.189 | 0.242 | 0.276 | 0.354 | 0.087 | 0.175 | 0.337 | 1.203 |
| | MSB | 0.179 | 0.182 | 0.201 | 0.225 | 0.201 | 0.250 | 0.291 | 0.365 | 0.088 | 0.176 | 0.360 | 0.995 |
| | Upsample | 0.182 | 0.180 | 0.203 | 0.220 | 0.193 | 0.249 | 0.279 | 0.372 | 0.084 | 0.171 | 0.313 | 0.702 |
| | FreqAdd | 0.160 | 0.169 | 0.182 | 0.199 | 0.180 | 0.234 | 0.282 | 0.350 | 0.087 | 0.174 | 0.349 | 0.923 |
| | FreqPool | 0.182 | 0.203 | 0.241 | 0.256 | 0.192 | 0.257 | 0.278 | 0.351 | 0.089 | 0.179 | 0.394 | 0.923 |
| | Robusttad | 0.179 | 0.220 | 0.234 | 0.227 | 0.192 | 0.239 | 0.292 | 0.343 | 0.085 | 0.179 | 0.336 | 0.932 |
| | STAug | 0.180 | 0.195 | 0.210 | 0.224 | 0.272 | 0.356 | 0.433 | 0.559 | 0.092 | 0.183 | 0.313 | 0.790 |
| | MixMask | 0.159 | 0.165 | 0.178 | 0.195 | 0.185 | 0.239 | 0.281 | 0.344 | 0.086 | 0.174 | 0.337 | 0.796 |
| | Ours | 0.157 | 0.168 | 0.178 | 0.211 | 0.179 | 0.232 | 0.275 | 0.342 | 0.084 | 0.169 | 0.303 | 0.750 |
| SCINet [12] | Baseline | 0.212 | 0.237 | 0.255 | 0.286 | 0.229 | 0.282 | 0.334 | 0.402 | 0.099 | 0.191 | 0.356 | 0.916 |
| | ASD | 0.229 | 0.241 | 0.239 | 0.282 | 0.254 | 0.276 | 0.356 | 0.462 | 0.095 | 0.204 | 0.379 | 1.230 |
| | MSB | 0.232 | 0.237 | 0.228 | 0.274 | 0.279 | 0.265 | 0.374 | 0.454 | 0.093 | 0.267 | 0.402 | 0.965 |
| | Upsample | 0.250 | 0.232 | 0.271 | 0.309 | 0.243 | 0.299 | 0.361 | 0.431 | 0.092 | 0.196 | 0.311 | 0.932 |
| | FreqAdd | 0.176 | 0.195 | 0.212 | 0.237 | 0.208 | 0.258 | 0.309 | 0.385 | 0.092 | 0.186 | 0.343 | 0.920 |
| | FreqPool | 0.230 | 0.221 | 0.242 | 0.339 | 0.261 | 0.290 | 0.337 | 0.456 | 0.096 | 0.183 | 0.551 | 0.938 |
| | Robusttad | 0.189 | 0.202 | 0.210 | 0.243 | 0.229 | 0.281 | 0.331 | 0.410 | 0.093 | 0.186 | 0.334 | 0.957 |
| | STAug | 0.210 | 0.239 | 0.282 | 0.411 | 0.277 | 0.329 | 0.372 | 0.435 | 0.098 | 0.191 | 0.342 | 0.931 |
| | MixMask | 0.171 | 0.188 | 0.204 | 0.230 | 0.205 | 0.250 | 0.310 | 0.374 | 0.093 | 0.179 | 0.336 | 0.928 |
| | Ours | 0.172 | 0.188 | 0.200 | 0.225 | 0.197 | 0.246 | 0.299 | 0.379 | 0.091 | 0.175 | 0.342 | 0.890 |
| TiDE [6] | Baseline | 0.207 | 0.197 | 0.211 | 0.238 | 0.177 | 0.220 | 0.265 | 0.323 | 0.093 | 0.184 | 0.330 | 0.860 |
| | ASD | 0.232 | 0.220 | 0.231 | 0.265 | 0.189 | 0.221 | 0.297 | 0.332 | 0.095 | 0.206 | 0.351 | 0.962 |
| | MSB | 0.210 | 0.219 | 0.253 | 0.261 | 0.199 | 0.254 | 0.273 | 0.339 | 0.092 | 0.179 | 0.358 | 0.941 |
| | Upsample | 0.206 | 0.199 | 0.223 | 0.274 | 0.203 | 0.267 | 0.331 | 0.355 | 0.091 | 0.182 | 0.331 | 0.852 |
| | FreqAdd | 0.150 | 0.163 | 0.177 | 0.209 | 0.173 | 0.216 | 0.263 | 0.322 | 0.088 | 0.180 | 0.330 | 0.848 |
| | FreqPool | 0.224 | 0.238 | 0.233 | 0.270 | 0.189 | 0.224 | 0.292 | 0.334 | 0.092 | 0.334 | 0.521 | 1.124 |
| | Robusttad | 0.176 | 0.166 | 0.182 | 0.229 | 0.182 | 0.231 | 0.279 | 0.330 | 0.099 | 0.232 | 0.331 | 0.924 |
| | STAug | 0.230 | 0.210 | 0.192 | 0.225 | 0.205 | 0.247 | 0.292 | 0.364 | 0.092 | 0.184 | 0.330 | 0.859 |
| | MixMask | 0.143 | 0.155 | 0.164 | 0.210 | 0.173 | 0.216 | 0.263 | 0.323 | 0.089 | 0.180 | 0.329 | 0.861 |
| | Ours | 0.143 | 0.150 | 0.165 | 0.202 | 0.177 | 0.219 | 0.261 | 0.322 | 0.088 | 0.179 | 0.324 | 0.847 |
| LightTS [32] | Baseline | 0.210 | 0.169 | 0.182 | 0.212 | 0.168 | 0.210 | 0.260 | 0.320 | 0.139 | 0.252 | 0.412 | 0.840 |
| | ASD | 0.225 | 0.179 | 0.198 | 0.232 | 0.179 | 0.210 | 0.271 | 0.321 | 0.132 | 0.320 | 0.436 | 1.036 |
| | MSB | 0.233 | 0.182 | 0.204 | 0.228 | 0.170 | 0.214 | 0.259 | 0.332 | 0.117 | 0.294 | 0.502 | 0.964 |
| | Upsample | 0.246 | 0.179 | 0.211 | 0.254 | 0.182 | 0.223 | 0.257 | 0.336 | 0.099 | 0.251 | 0.369 | 0.702 |
| | FreqAdd | 0.213 | 0.159 | 0.177 | 0.210 | 0.164 | 0.207 | 0.258 | 0.317 | 0.098 | 0.522 | 0.565 | 1.583 |
| | FreqPool | 0.219 | 0.174 | 0.197 | 0.236 | 0.193 | 0.254 | 0.267 | 0.339 | 0.099 | 0.275 | 0.394 | 0.793 |
| | Robusttad | 0.212 | 0.169 | 0.181 | 0.223 | 0.172 | 0.223 | 0.259 | 0.324 | 0.092 | 0.279 | 0.451 | 0.796 |
| | STAug | 0.224 | 0.267 | 0.294 | 0.351 | 0.214 | 0.263 | 0.382 | 0.371 | 0.096 | 0.212 | 0.380 | 0.690 |
| | MixMask | 0.192 | 0.158 | 0.175 | 0.211 | 0.163 | 0.206 | 0.257 | 0.318 | 0.099 | 0.384 | 0.518 | 0.774 |
| | Ours | 0.210 | 0.156 | 0.173 | 0.206 | 0.165 | 0.205 | 0.249 | 0.312 | 0.088 | 0.243 | 0.361 | |

Table 2: MSE of the long-term prediction on the Weather, Electricity, and Exchange Rate [31] datasets. The best values are marked with colors.

Our method also outperforms other augmentation methods in more than 77% of the cases. Moreover, our method achieves larger relative improvements as the prediction length $T$ increases, highlighting its strong capacity in long-term predictions. Tab. 3 summarizes the MSE of short-term prediction using the iTransformer [13] model on the PEMS datasets [2]. The prediction errors are generally lower than the errors in long-term prediction. Our method outperforms other augmentations in most cases, although the improvements are marginal compared to long-term prediction. This is because short-term prediction is relatively easy, and the performance has already reached saturation.

| Method | PEMS03-12 | PEMS03-24 | PEMS03-36 | PEMS03-48 | PEMS04-12 | PEMS04-24 | PEMS04-36 | PEMS04-48 | PEMS07-12 | PEMS07-24 | PEMS07-36 | PEMS07-48 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.070 | 0.097 | 0.134 | 0.164 | 0.088 | 0.124 | 0.160 | 0.196 | 0.067 | 0.097 | 0.128 | 0.156 |
| ASD [7] | 0.072 | 0.096 | 0.152 | 0.239 | 0.098 | 0.132 | 0.156 | 0.190 | 0.069 | 0.099 | 0.154 | 0.181 |
| MSB [1] | 0.096 | 0.131 | 0.129 | 0.214 | 0.087 | 0.134 | 0.167 | 0.219 | 0.098 | 0.096 | 0.137 | 0.165 |
| Upsample [23] | 0.069 | 0.096 | 0.128 | 0.179 | 0.087 | 0.124 | 0.158 | 0.199 | 0.072 | 0.099 | 0.127 | 0.155 |
| FreqAdd [33] | 1.036 | 0.104 | 0.251 | 0.362 | 0.088 | 0.125 | 0.159 | 0.201 | 0.067 | 0.097 | 0.127 | 0.155 |
| FreqPool [4] | 1.234 | 0.178 | 0.296 | 0.451 | 0.099 | 0.145 | 0.178 | 0.226 | 0.079 | 0.104 | 0.152 | 0.172 |
| Robusttad [8] | 0.082 | 0.098 | 0.132 | 1.520 | 0.089 | 0.123 | 0.161 | 0.195 | 0.067 | 0.097 | 0.129 | 0.157 |
| STAug [34] | 0.079 | 0.112 | 0.195 | 0.456 | 0.087 | 0.120 | 0.162 | 0.304 | 0.066 | 0.096 | 0.132 | 0.165 |
| Mask [3] | 0.443 | 1.205 | 0.233 | 1.510 | 0.086 | 0.119 | 0.158 | 0.346 | 0.065 | 0.095 | 0.125 | 0.156 |
| Mix [3] | 1.018 | 0.097 | 0.877 | 1.501 | 0.085 | 0.119 | 0.154 | 0.205 | 0.065 | 0.094 | 0.134 | 0.152 |
| Ours | 0.067 | 0.095 | 0.126 | 0.235 | 0.085 | 0.118 | 0.149 | 0.182 | 0.065 | 0.094 | 0.123 | |

Table 3: Short-term prediction using the iTransformer [13] on the PEMS datasets [2].
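To make the comparison in Tab. 3 concrete, the following minimal sketch counts how often one method attains the lowest MSE across table columns. It uses only three rows copied from Tab. 3 (Baseline, Upsample, and Ours on PEMS03 and PEMS04), so the resulting share is illustrative and is not the percentage reported in the text. The function name `win_rate` and the convention that ties count as wins are assumptions made for this example.

```python
def win_rate(results, method):
    """Share of columns in which `method` attains the lowest MSE (ties count as wins)."""
    n_cols = len(results[method])
    wins = 0
    for col in range(n_cols):
        best = min(row[col] for row in results.values())
        if results[method][col] <= best:
            wins += 1
    return wins / n_cols


# Columns: PEMS03 (12, 24, 36, 48) followed by PEMS04 (12, 24, 36, 48), from Tab. 3.
results = {
    "Baseline": [0.070, 0.097, 0.134, 0.164, 0.088, 0.124, 0.160, 0.196],
    "Upsample": [0.069, 0.096, 0.128, 0.179, 0.087, 0.124, 0.158, 0.199],
    "Ours":     [0.067, 0.095, 0.126, 0.235, 0.085, 0.118, 0.149, 0.182],
}
print(f"Ours is best in {win_rate(results, 'Ours'):.1%} of these columns")
```

On this subset, the only column in which another row wins is PEMS03 at prediction length 48, where the baseline has the lowest error.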

4.4 Ablation Study

Our method has one hyperparameter, k, and two distinctive design choices: 1) perturbing only the dominant frequencies and 2) shuffling the dominant frequency components. We conducted ablation studies to investigate the impact of the hyperparameter and to justify these design choices.
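The two design choices can be illustrated with a minimal NumPy sketch. This is not the released implementation: it assumes a univariate series, uses a hypothetical function name `dominant_shuffle`, and, as a further assumption, leaves the DC and Nyquist bins untouched so that the reconstructed signal stays real-valued with an unchanged mean.

```python
import numpy as np


def dominant_shuffle(series, k=3, rng=None):
    """Shuffle the k largest-magnitude frequency components of a 1-D series."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    n = series.size
    spectrum = np.fft.rfft(series)
    # Assumption: skip the DC bin (and the Nyquist bin when n is even),
    # because both must remain real-valued for a real signal.
    last = spectrum.size - 1 if n % 2 == 0 else spectrum.size
    candidates = np.arange(1, last)
    k = min(k, candidates.size)
    if k < 2:
        return series.copy()  # nothing to shuffle
    # Indices of the k candidate bins with the largest magnitudes.
    dominant = candidates[np.argsort(np.abs(spectrum[candidates]))[-k:]]
    shuffled = spectrum.copy()
    # Move the complex coefficients (magnitude and phase) among the dominant bins.
    shuffled[dominant] = spectrum[rng.permutation(dominant)]
    return np.fft.irfft(shuffled, n=n)


t = np.arange(96)
x = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 12)
x_aug = dominant_shuffle(x, k=3, rng=np.random.default_rng(0))
```

Because the coefficients are only rearranged, no external noise is injected: under the assumptions above, the augmented series keeps the mean and total energy of the original. In the forecasting setting, the input and target windows would presumably be transformed together as one sequence; that step is omitted here.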

4.4.1 Number of Dominant Frequencies

Figure 4: Mean-squared errors with various k values on four datasets under the predict-96 setting. Our method is stable against k, and the performance varies only slightly.

The only hyperparameter in our method is the number of dominant frequencies k. We evaluated the performance using various k values with iTransformer [13]. The results in Fig. 4 show that our method is stable across different k values.

4.4.2 Shuffle the Dominant Frequencies

In this experiment, we compared the combination of different perturbation strategies and operations.

We first compared perturbing different portions of the spectrum: the dominant frequencies, the minor frequencies, and the full spectrum. The results in Tab. 4 indicate that perturbing the dominant frequencies significantly outperforms the other options, while perturbing the minor frequencies yields the worst performance. Tab. 5 compares different perturbation operations, including masking [3], adding noise [8, 10], randomization, and shuffling (ours). Shuffle surpasses the other operations in most cases.

  ETTh1 ETTm2 Weather
96 192 336 720 96 192 336 720 96 192 336 720
iTrans [13] Shuffle full 0.391 0.447 0.486 0.509 0.182 0.247 0.311 0.403 0.175 0.223 0.278 0.355
min 0.389 0.445 0.494 0.505 0.181 0.251 0.310 0.413 0.174 0.225 0.282 0.355
dom 0.383 0.438 0.473 0.492 0.178 0.246 0.309 0.409 0.171 0.221 0.276 0.351
Mask full 0.390 0.442 0.475 0.503 0.179 0.251 0.311 0.411 0.178 0.228 0.284 0.359
min 0.389 0.444 0.487 0.499 0.183 0.252 0.311 0.412 0.180 0.226 0.282 0.361
dom 0.388 0.442 0.486 0.505 0.180 0.251 0.309 0.410 0.173 0.224 0.280 0.356
MICN [27] Shuffle full 0.385 0.427 0.466 0.604 0.184 0.293 0.375 0.594 0.182 0.239 0.280 0.348
min 0.390 0.430 0.480 0.565 0.191 0.281 0.365 0.580 0.197 0.236 0.283 0.349
dom 0.373 0.421 0.452 0.510 0.174 0.263 0.348 0.502 0.179 0.232 0.275 0.342
Mask full 0.381 0.424 0.460 0.543 0.184 0.265 0.353 0.510 0.190 0.236 0.281 0.345
min 0.385 0.426 0.472 0.553 0.187 0.276 0.359 0.542 0.179 0.240 0.281 0.344
dom 0.377 0.421 0.454 0.543 0.175 0.268 0.337 0.505 0.178 0.239 0.283 0.342
LightTS [32] Shuffle full 0.415 0.426 0.577 0.621 0.202 0.235 0.325 0.445 0.163 0.205 0.251 0.317
min 0.418 0.432 0.577 0.619 0.206 0.239 0.326 0.444 0.164 0.212 0.259 0.317
dom 0.405 0.423 0.565 0.603 0.195 0.245 0.312 0.422 0.165 0.205 0.249 0.312
Mask full 0.418 0.432 0.573 0.621 0.204 0.238 0.321 0.435 0.163 0.206 0.258 0.317
min 0.419 0.433 0.578 0.621 0.205 0.233 0.324 0.452 0.163 0.208 0.260 0.317
dom 0.418 0.424 0.579 0.618 0.198 0.240 0.312 0.430 0.162 0.201 0.250 0.317

 

Table 4: Comparison of perturbing different parts of the spectrum (full, minor, and dominant) using shuffle and random mask. Perturbing the dominant frequencies performs significantly better than perturbing other frequencies, and shuffle is more effective than random mask.
  ETTh1 ETTm2 Weather
96 192 336 720 96 192 336 720 96 192 336 720
iTrans [13] Mask 0.388 0.442 0.486 0.505 0.180 0.251 0.309 0.410 0.173 0.224 0.280 0.356
Noise 0.387 0.445 0.482 0.510 0.180 0.256 0.312 0.409 0.177 0.222 0.281 0.359
Random 0.386 0.440 0.479 0.499 0.183 0.254 0.311 0.407 0.171 0.222 0.280 0.358
Shuffle 0.383 0.438 0.473 0.492 0.178 0.246 0.309 0.409 0.171 0.221 0.276 0.351
MICN [27] Mask 0.377 0.421 0.454 0.543 0.175 0.268 0.337 0.505 0.178 0.239 0.283 0.342
Noise 0.393 0.430 0.479 0.531 0.201 0.331 0.366 0.561 0.201 0.236 0.281 0.351
Random 0.381 0.423 0.476 0.670 0.183 0.284 0.367 0.614 0.182 0.233 0.282 0.349
Shuffle 0.373 0.421 0.452 0.510 0.174 0.263 0.348 0.502 0.179 0.232 0.275 0.342
LightTS [32] Mask 0.418 0.424 0.579 0.618 0.198 0.240 0.312 0.430 0.162 0.201 0.250 0.317
Noise 0.432 0.451 0.566 0.636 0.221 0.236 0.351 0.433 0.169 0.219 0.259 0.321
Random 0.414 0.431 0.570 0.610 0.206 0.244 0.324 0.442 0.171 0.213 0.263 0.323
Shuffle 0.405 0.423 0.565 0.603 0.195 0.245 0.312 0.422 0.165 0.205 0.249 0.312

 

Table 5: Comparison of different dominant frequency perturbations. Shuffle outperforms other alternatives with clear margins.

The results in Tab. 4 and 5 justify the design decisions in dominant shuffle and confirm that both perturbing the dominant frequencies and the shuffle operation are superior to the alternatives. More details about the experiments, including how we defined minor frequencies and how we implemented the mask, noise, and randomization perturbations, can be found in Sec. A.2.

4.4.3 Different Augmentation Sizes

In prior experiments, we used data augmentation that doubled the size of the original datasets. In this experiment, we assessed the performance with various augmentation sizes. A larger augmentation size means more augmented samples in the training set, so the performance at larger sizes reflects the domain gap between augmented and original data. If the augmented samples are out of distribution relative to the original data, larger augmentation sizes can degrade performance due to a training/test gap.
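A minimal sketch of how a training set could be assembled for a given augmentation size is shown below. It assumes that an augmentation size of s means the training set holds the original samples plus s - 1 augmented copies of each (so a size of two doubles the data); this reading, the helper name `build_training_set`, and the placeholder `jitter` augmentation are all assumptions for illustration.

```python
import numpy as np


def build_training_set(windows, augment, size=2, rng=None):
    """Return the original windows plus (size - 1) augmented copies of each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    rng = np.random.default_rng() if rng is None else rng
    windows = np.asarray(windows, dtype=float)
    parts = [windows]  # the original samples always come first
    for _ in range(size - 1):
        parts.append(np.stack([augment(w, rng) for w in windows]))
    return np.concatenate(parts, axis=0)


# Placeholder augmentation for illustration only: small additive jitter.
def jitter(window, rng):
    return window + 0.01 * rng.standard_normal(window.shape)


windows = np.random.default_rng(0).standard_normal((100, 96))
train = build_training_set(windows, jitter, size=3)
print(train.shape)
```

With this construction the share of augmented samples grows as (s - 1)/s, which is why a large gap between augmented and original data would hurt more at larger sizes.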

Figure 5: MSE with different augmentation sizes (horizontal axis) using iTransformer [13]. An augmentation size of two, which was used in previous experiments, achieves the best results in most cases. Our method is more robust to larger augmentation sizes, indicating a smaller gap between augmented and original data.

As shown in Fig. 5, the performance of FreqMix and FreqMask declines significantly beyond an augmentation size of two. This is due to the domain gap between augmented and original data. Our method is only slightly affected by the augmentation size and even benefits from larger augmentation sizes on the Weather dataset. These results indicate a smaller augmented-original gap for our method.

5 Conclusion

We proposed dominant shuffle, a simple yet highly effective data augmentation technique for time series prediction. Our method mitigates the domain gap between augmented and original data by limiting the perturbation to dominant frequencies, and uses shuffling to avoid external noise. Although simple and effective, our method is based primarily on heuristics and lacks theoretical justification. In place of such justification, we conducted extensive experiments using a wide range of datasets, baseline models, and augmentation methods to validate its consistent improvements across various configurations. Because dominant shuffle introduces significant perturbation to the original data and therefore disrupts sample-wise class labels, our method is limited to prediction tasks and cannot be extended to classification tasks. Exploring the theoretical principles behind the proposed method is a promising future direction that would help to better understand it.

References

  • [1] Kasun Bandara, Hansika Hewamalage, Yuan-Hao Liu, Yanfei Kang, and Christoph Bergmeir. Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognition, 120:108148, 2021.
  • [2] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data. Transportation research record, 1748(1):96–102, 2001.
  • [3] Muxi Chen, Zhijian Xu, Ailing Zeng, and Qiang Xu. Fraug: Frequency domain augmentation for time series forecasting. arXiv preprint arXiv:2302.09292, 2023.
  • [4] Xi Chen, Cheng Ge, Ming Wang, and Jin Wang. Supervised contrastive few-shot learning for high-frequency time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7069–7077, 2023.
  • [5] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
  • [6] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with tiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023.
  • [7] Germain Forestier, François Petitjean, Hoang Anh Dau, Geoffrey I Webb, and Eamonn Keogh. Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM), pages 865–870. IEEE, 2017.
  • [8] Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. Robusttad: Robust time series anomaly detection via decomposition and convolutional neural networks. In MileTS’20: 6th KDD Workshop on Mining and Learning from Time Series, pages 1–6, 2020.
  • [9] Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD workshop on advanced analytics and learning on temporal data, 2016.
  • [10] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.
  • [11] Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, and Yuval Elovici. Doping: Generative data augmentation for unsupervised anomaly detection with gan. In 2018 IEEE international conference on data mining (ICDM), pages 1122–1127. IEEE, 2018.
  • [12] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. Scinet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828, 2022.
  • [13] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.
  • [14] Yunshan Ma, Yujuan Ding, Xun Yang, Lizi Liao, Wai Keung Wong, and Tat-Seng Chua. Knowledge enhanced neural fashion trend forecasting. In Proceedings of the 2020 international conference on multimedia retrieval, pages 82–90, 2020.
  • [15] ED McKenzie. General exponential smoothing and the equivalent arma process. Journal of Forecasting, 3(3):333–344, 1984.
  • [16] Gue-Hwan Nam, Seok-Jun Bu, Na-Mu Park, Jae-Yong Seo, Hyeon-Cheol Jo, and Won-Tae Jeong. Data augmentation using empirical mode decomposition on neural networks to classify impact noise in vehicle. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 731–735. IEEE, 2020.
  • [17] Zelin Ni, Hang Yu, Shizhan Liu, Jianguo Li, and Weiyao Lin. Basisformer: Attention-based time series forecasting with learnable and interpretable basis. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [18] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023.
  • [19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [20] Hangwei Qian, Tian Tian, and Chunyan Miao. What makes good contrastive learning on small-scale wearable-based tasks? In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3761–3771, 2022.
  • [21] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018.
  • [22] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International journal of forecasting, 36(3):1181–1191, 2020.
  • [23] Artemios-Anargyros Semenoglou, Evangelos Spiliotis, and Vassilios Assimakopoulos. Data augmentation for univariate time series forecasting with neural networks. Pattern Recognition, 134:109132, 2023.
  • [24] Odongo Steven Eyobu and Dong Seog Han. Feature representation and data augmentation for human activity classification based on wearable imu sensor data using a deep lstm neural network. Sensors, 18(9):2892, 2018.
  • [25] Jianhua Sun, Hao-Shu Fang, Yuxuan Li, Runzhong Wang, Minghao Gou, and Cewu Lu. Instaboost++: Visual coherence principles for unified 2d/3d instance level data augmentation. International Journal of Computer Vision, 131(10):2665–2681, 2023.
  • [26] Terry T Um, Franz MJ Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM international conference on multimodal interaction, pages 216–220, 2017.
  • [27] Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. MICN: Multi-scale local and global context modeling for long-term series forecasting. In The Eleventh International Conference on Learning Representations, 2023.
  • [28] Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4653–4660. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Survey Track.
  • [29] Tailai Wen and Roy Keyes. Time series anomaly detection using convolutional neural networks and transfer learning. In IJCAI Workshop on AI4IoT, 2019.
  • [30] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2023.
  • [31] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021.
  • [32] Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian Li. Less is more: Fast multivariate time series forecasting with light sampling-oriented mlp structures. arXiv preprint arXiv:2207.01186, 2022.
  • [33] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems, 35:3988–4003, 2022.
  • [34] Xiyuan Zhang, Ranak Roy Chowdhury, Jingbo Shang, Rajesh Gupta, and Dezhi Hong. Towards diverse and coherent augmentation for time-series forecasting. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • [35] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.
  • [36] Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. Film: Frequency improved legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems, 35:12677–12690, 2022.
  • [37] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International conference on machine learning, pages 27268–27286. PMLR, 2022.

Appendix A More Details

A.1 Datasets

We evaluate different models and augmentations for long-term forecasting on 8 well-established datasets: Weather, Traffic, Electricity, Exchange Rate [31], and the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [35]. Furthermore, we adopt the PEMS [2] datasets for short-term forecasting. The datasets are described in Tab. 6.

  Dataset Variates Prediction length (T) Dataset size (Train, Validation, Test) Frequency Information
ETTh1, ETTh2 7 {96, 192, 336, 720} (8545, 2881, 2881) Hourly Temperature
ETTm1, ETTm2 7 {96, 192, 336, 720} (34465, 11521, 11521) 15min Temperature
Exchange 8 {96, 192, 336, 720} (5120, 665, 1422) Daily Economy
Weather 21 {96, 192, 336, 720} (36792, 5271, 10540) 10min Weather
ECL 321 {96, 192, 336, 720} (18317, 2633, 5261) Hourly Electricity
Traffic 862 {96, 192, 336, 720} (12185, 1757, 3509) Hourly Transportation
PEMS03 358 {12, 24, 36, 48} (15617, 5135, 5135) 5min Traffic network
PEMS04 307 {12, 24, 36, 48} (10172, 3375, 3375) 5min Traffic network
PEMS07 883 {12, 24, 36, 48} (16911, 5622, 5622) 5min Traffic network
PEMS08 170 {12, 24, 36, 48} (10690, 3548, 3548) 5min Traffic network
 
Table 6: Statistics of the datasets used in our experiments: the eight long-term forecasting datasets and the PEMS short-term forecasting datasets.

A.2 Implementation Details

A.2.1 Reimplementation of other methods

For ASD, MSB, and Upsample, we reproduce them based on the descriptions in their original papers [1, 7, 23]. For STAug [34] and MixMask [3], we use their official code. For Robusttad [8], we reproduce it by adding Gaussian noise to the frequency components of a time series. For FreqAdd [33], we perturb a single low-frequency component by setting its magnitude to half of the maximum magnitude. For FreqPool [4], we apply max pooling with size 4 over the entire spectrum. For a fair comparison, all frequency-domain methods are applied to both the data and the label of each data-label pair.
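As an illustration of the FreqAdd variant described above, the following NumPy sketch sets the magnitude of one low-frequency component to half of the maximum spectral magnitude. The description leaves several details open, so the choices here are assumptions: the component is drawn uniformly from the lowest `low_bins` non-DC bins (a hypothetical parameter), its phase is kept, and the maximum is taken over the whole spectrum including the DC bin.

```python
import numpy as np


def freq_add(series, low_bins=5, rng=None):
    """Set one low-frequency component's magnitude to half the maximum magnitude."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    spectrum = np.fft.rfft(series)
    if spectrum.size <= low_bins + 1:
        raise ValueError("series is too short for the requested low_bins")
    magnitudes = np.abs(spectrum)
    # Assumption: pick one of the bins 1..low_bins uniformly at random.
    target = rng.integers(1, low_bins + 1)
    phase = np.angle(spectrum[target])
    # Keep the phase and overwrite only the magnitude.
    spectrum[target] = 0.5 * magnitudes.max() * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=series.size)


t = np.arange(96)
x = np.sin(2 * np.pi * t / 24)
x_aug = freq_add(x, low_bins=5, rng=np.random.default_rng(0))
```

Unlike shuffling, this operation changes the total energy of the signal, since the new magnitude is imposed rather than taken from another component of the same spectrum.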

A.2.2 Different perturbations

In our ablation study, we define minor frequencies as all frequency components other than those with the top 10 magnitudes. In Tab. 4, Mask on the full spectrum is similar to FrAug [3]. Mask on dominant frequencies masks within the frequency components with the top 10 magnitudes, and Mask on minor frequencies masks within the remaining components. In Tab. 5, Noise means adding Gaussian noise to the selected frequency components. For Random, we first obtain the maximum and minimum magnitudes of the selected frequency components and then randomly assign magnitudes within this range.
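The selections and operations above can be sketched in NumPy as follows. This is an illustrative reading rather than the code used for the ablation: the helper names `select_bins` and `perturb`, the masking probability `mask_rate`, the noise scale `sigma`, the uniform sampling of magnitudes, and keeping the original phase in the Random operation are all assumptions. Special handling of the DC and Nyquist bins is omitted for brevity.

```python
import numpy as np


def select_bins(spectrum, part, top=10):
    """Return bin indices for the 'dom', 'min', or 'full' part of the spectrum."""
    order = np.argsort(np.abs(spectrum))[::-1]  # bins by magnitude, descending
    if part == "dom":
        return order[:top]
    if part == "min":
        return order[top:]
    if part == "full":
        return order
    raise ValueError(f"unknown part: {part}")


def perturb(spectrum, bins, op, rng, mask_rate=0.5, sigma=0.1):
    """Apply one perturbation operation to the selected bins of a spectrum."""
    out = spectrum.copy()
    if bins.size == 0:
        return out
    if op == "mask":
        # Zero each selected component with probability mask_rate.
        drop = rng.random(bins.size) < mask_rate
        out[bins[drop]] = 0
    elif op == "noise":
        # Add complex Gaussian noise to the selected components.
        noise = rng.standard_normal(bins.size) + 1j * rng.standard_normal(bins.size)
        out[bins] += sigma * noise
    elif op == "random":
        # Draw new magnitudes within the min-max range, keeping the phases.
        mag = np.abs(spectrum[bins])
        new_mag = rng.uniform(mag.min(), mag.max(), size=bins.size)
        out[bins] = new_mag * np.exp(1j * np.angle(spectrum[bins]))
    elif op == "shuffle":
        # Rearrange the existing coefficients among the selected bins.
        out[bins] = spectrum[rng.permutation(bins)]
    else:
        raise ValueError(f"unknown op: {op}")
    return out


rng = np.random.default_rng(0)
spectrum = np.fft.rfft(rng.standard_normal(96))
dominant = select_bins(spectrum, "dom")
shuffled = perturb(spectrum, dominant, "shuffle", rng)
```

Only the shuffle operation reuses the original coefficients unchanged; mask removes components, while noise and random introduce values that were not present in the original spectrum.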

Appendix B More Results

B.1 Full forecasting results

Tab. 7, 8 and 9 show the full results of the forecasting task. Specifically, our method improves the performance of iTransformer by 13% on Electricity when the prediction length is 720, and it improves the performance of Autoformer by 28% on ETTm1 when the prediction length is 720. Our method also improves the performance of MICN by 18% on ETTh2 when the prediction length is 192 and the performance of SCINet by 21% on Electricity when the prediction length is 720. Similarly, our method improves the performance of LightTS by 29% on ETTh2 when the prediction length is 720 and the performance of TiDE by 24% on Electricity when the prediction length is 192. It is worth noting that the strong baseline MixMask falls short on Exchange Rate, where the main goal is to predict trends. In contrast, our method improves the performance of Autoformer by 34% on Exchange Rate when the prediction length is 720, and it improves the performance of LightTS by 37% on Exchange Rate when the prediction length is 96. These results demonstrate the effectiveness of our method for long-term prediction, as it consistently improves the performance of SOTA methods on different datasets.
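The percentages quoted above are consistent with the relative reduction in MSE with respect to the baseline. The short sketch below assumes that definition (the text does not state the formula explicitly) and applies it to three (baseline, ours) pairs read from the full-result tables; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_mse, augmented_mse):
    """Percentage reduction in MSE relative to the baseline."""
    return 100.0 * (baseline_mse - augmented_mse) / baseline_mse


# (baseline, ours) MSE pairs taken from the full-result tables.
cases = {
    "iTransformer / Electricity / 720": (0.230, 0.199),
    "Autoformer / ETTm1 / 720": (0.773, 0.559),
    "LightTS / Exchange Rate / 96": (0.139, 0.088),
}
for name, (base, ours) in cases.items():
    print(f"{name}: {relative_improvement(base, ours):.1f}%")
```

Rounded to whole percentages, these three cases give 13%, 28%, and 37%, matching the figures in the text.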

 Method ETTh1 ETTh2 ETTm1 ETTm2
96 192 336 720 96 192 336 720 96 192 336 720 96 192 336 720
iTransformer [13] Baseline 0.392 0.447 0.483 0.516 0.303 0.381 0.412 0.434 0.344 0.383 0.421 0.494 0.183 0.251 0.311 0.412
ASD [7] 0.398 0.456 0.483 0.512 0.310 0.388 0.432 0.452 0.340 0.382 0.454 0.492 0.199 0.254 0.341 0.423
MSB [1] 0.387 0.460 0.494 0.531 0.309 0.382 0.447 0.433 0.339 0.386 0.467 0.510 0.187 0.267 0.332 0.452
Upsample [23] 0.391 0.445 0.481 0.519 0.305 0.381 0.419 0.430 0.351 0.381 0.432 0.489 0.196 0.279 0.320 0.411
FreqAdd [33] 0.389 0.446 0.475 0.510 0.300 0.384 0.416 0.438 0.350 0.385 0.422 0.490 0.187 0.253 0.311 0.415
FreqPool [4] 0.433 0.456 0.497 0.532 0.313 0.392 0.415 0.450 0.347 0.392 0.430 0.499 0.187 0.256 0.324 0.449
Robusttad [8] 0.390 0.445 0.497 0.510 0.312 0.388 0.412 0.439 0.353 0.382 0.421 0.498 0.189 0.255 0.309 0.428
STAug [34] 0.390 0.445 0.489 0.511 0.323 0.428 0.486 0.483 0.339 0.383 0.417 0.485 0.196 0.267 0.339 0.449
MixMask [3] 0.388 0.440 0.477 0.504 0.301 0.380 0.414 0.434 0.334 0.375 0.421 0.485 0.178 0.248 0.311 0.407
Ours 0.383 0.438 0.473 0.492 0.298 0.382 0.411 0.428 0.332 0.374 0.424 0.492 0.178 0.246 0.309 0.409
AutoFormer [31] Baseline 0.429 0.440 0.495 0.498 0.381 0.443 0.471 0.475 0.467 0.610 0.529 0.773 0.233 0.278 0.383 0.488
ASD 0.450 0.485 0.523 0.556 0.370 0.465 0.476 0.503 0.480 0.620 0.502 0.633 0.231 0.282 0.379 0.499
MSB 0.462 0.517 0.612 0.579 0.434 0.523 0.556 0.462 0.499 0.645 0.553 0.721 0.232 0.285 0.389 0.487
Upsample 0.416 0.523 0.480 0.482 0.353 0.460 0.455 0.509 0.498 0.630 0.512 0.667 0.234 0.291 0.382 0.521
FreqAdd 0.460 0.487 0.497 0.525 0.367 0.439 0.480 0.504 0.419 0.554 0.546 0.569 0.223 0.268 0.330 0.458
FreqPool 0.446 0.457 0.523 0.512 0.392 0.442 0.470 0.493 0.479 0.623 0.510 0.754 0.250 0.291 0.394 0.482
Robusttad 0.437 0.452 0.492 0.477 0.367 0.497 0.502 0.527 0.432 0.510 0.553 0.623 0.235 0.291 0.375 0.478
STAug 0.429 0.478 0.505 0.506 0.354 0.443 0.496 0.495 0.415 0.581 0.588 0.693 0.224 0.291 0.338 0.431
MixMask 0.420 0.445 0.467 0.474 0.358 0.421 0.470 0.467 0.415 0.510 0.491 0.588 0.211 0.267 0.340 0.451
Ours 0.409 0.436 0.458 0.486 0.335 0.419 0.453 0.452 0.392 0.506 0.491 0.559 0.210 0.266 0.329 0.429
MICN [27] Baseline 0.384 0.425 0.464 0.574 0.358 0.518 0.566 0.827 0.313 0.360 0.389 0.461 0.200 0.282 0.375 0.606
ASD 0.380 0.430 0.472 0.523 0.377 0.539 0.620 0.843 0.315 0.362 0.399 0.457 0.189 0.331 0.399 0.617
MSB 0.423 0.423 0.501 0.559 0.402 0.623 0.790 1.126 0.330 0.358 0.402 0.459 0.192 0.279 0.376 0.651
Upsample 0.396 0.435 0.463 0.550 0.366 0.500 0.831 0.752 0.339 0.377 0.402 0.475 0.203 0.291 0.372 0.595
FreqAdd 0.390 0.430 0.477 0.643 0.370 0.521 0.626 0.975 0.316 0.360 0.407 0.478 0.176 0.273 0.378 0.614
FreqPool 0.399 0.465 0.473 0.572 0.365 0.553 0.550 0.812 0.336 0.372 0.397 0.466 0.212 0.287 0.390 0.623
Robusttad 0.392 0.436 0.491 0.556 0.339 0.529 0.553 0.998 0.339 0.359 0.396 0.472 0.200 0.296 0.356 0.617
STAug 0.374 0.429 0.489 0.608 0.413 0.760 1.330 2.608 0.313 0.360 0.418 0.483 0.180 0.264 0.323 0.670
MixMask 0.378 0.423 0.461 0.521 0.339 0.488 0.544 0.735 0.301 0.352 0.401 0.454 0.183 0.278 0.356 0.528
Ours 0.373 0.421 0.452 0.510 0.310 0.427 0.507 0.731 0.314 0.360 0.387 0.470 0.174 0.263 0.346 0.502
SCINet [12] Baseline 0.485 0.506 0.519 0.552 0.372 0.416 0.429 0.470 0.316 0.353 0.387 0.431 0.184 0.240 0.295 0.385
ASD 0.494 0.480 0.491 0.559 0.362 0.402 0.432 0.499 0.331 0.367 0.389 0.453 0.197 0.238 0.296 0.432
MSB 0.489 0.466 0.502 0.547 0.359 0.396 0.458 0.476 0.320 0.351 0.396 0.478 0.182 0.237 0.289 0.449
Upsample 0.471 0.457 0.479 0.541 0.379 0.407 0.403 0.482 0.342 0.386 0.399 0.442 0.179 0.254 0.292 0.401
FreqAdd 0.428 0.452 0.469 0.532 0.335 0.385 0.403 0.447 0.304 0.338 0.373 0.421 0.174 0.228 0.286 0.380
FreqPool 0.499 0.510 0.557 0.549 0.410 0.453 0.432 0.475 0.331 0.362 0.379 0.432 0.185 0.239 0.302 0.399
Robusttad 0.462 0.501 0.498 0.559 0.362 0.431 0.419 0.496 0.331 0.351 0.394 0.438 0.182 0.247 0.299 0.402
STAug 0.457 0.500 0.524 0.534 0.538 0.636 0.681 0.648 0.319 0.357 0.389 0.445 0.323 0.407 0.514 0.668
MixMask 0.427 0.452 0.465 0.548 0.335 0.377 0.400 0.438 0.302 0.341 0.376 0.423 0.174 0.230 0.289 0.368
Ours 0.417 0.443 0.461 0.527 0.335 0.375 0.392 0.421 0.302 0.338 0.372 0.420 0.174 0.228 0.283 0.372
TiDE [6] Baseline 0.401 0.434 0.521 0.558 0.304 0.350 0.331 0.399 0.311 0.340 0.366 0.420 0.166 0.220 0.273 0.356
ASD 0.417 0.441 0.513 0.556 0.320 0.351 0.367 0.422 0.319 0.341 0.399 0.432 0.177 0.241 0.291 0.371
MSB 0.422 0.476 0.529 0.579 0.331 0.379 0.334 0.401 0.302 0.356 0.382 0.451 0.182 0.232 0.287 0.359
Upsample 0.431 0.452 0.533 0.604 0.346 0.372 0.350 0.456 0.324 0.339 0.378 0.463 0.203 0.246 0.306 0.366
FreqAdd 0.385 0.420 0.477 0.505 0.289 0.336 0.330 0.390 0.309 0.339 0.365 0.417 0.164 0.219 0.273 0.355
FreqPool 0.423 0.455 0.510 0.592 0.312 0.376 0.339 0.397 0.319 0.352 0.397 0.453 0.179 0.231 0.299 0.371
Robusttad 0.396 0.432 0.521 0.537 0.331 0.352 0.337 0.398 0.321 0.346 0.382 0.437 0.180 0.225 0.282 0.371
STAug 0.515 0.535 0.521 0.558 0.390 0.437 0.403 0.508 0.310 0.337 0.364 0.417 0.222 0.343 0.515 0.847
MixMask 0.385 0.420 0.478 0.507 0.289 0.339 0.330 0.391 0.299 0.332 0.367 0.416 0.165 0.219 0.271 0.347
Ours 0.385 0.414 0.467 0.498 0.283 0.332 0.324 0.388 0.297 0.328 0.365 0.412 0.165 0.218 0.271 0.350
LightTS [32] Baseline 0.448 0.444 0.663 0.706 0.369 0.476 0.738 1.165 0.323 0.347 0.428 0.476 0.212 0.237 0.350 0.473
ASD 0.451 0.476 0.633 0.681 0.392 0.469 0.701 0.998 0.356 0.352 0.441 0.478 0.258 0.251 0.351 0.483
MSB 0.467 0.463 0.627 0.652 0.378 0.472 0.652 1.123 0.371 0.349 0.430 0.479 0.236 0.242 0.359 0.471
Upsample 0.449 0.472 0.610 0.637 0.401 0.487 0.714 1.245 0.329 0.366 0.453 0.492 0.241 0.255 0.366 0.492
FreqAdd 0.417 0.430 0.578 0.622 0.351 0.453 0.689 1.125 0.322 0.352 0.400 0.450 0.206 0.237 0.327 0.455
FreqPool 0.463 0.471 0.652 0.690 0.369 0.512 0.723 1.264 0.336 0.351 0.442 0.497 0.233 0.259 0.372 0.453
Robusttad 0.445 0.442 0.590 0.654 0.372 0.468 0.699 0.982 0.331 0.352 0.441 0.462 0.232 0.227 0.342 0.446
STAug 0.445 0.441 0.669 0.714 0.520 0.807 2.101 2.467 0.320 0.343 0.427 0.476 0.230 0.266 0.372 0.475
MixMask 0.417 0.429 0.575 0.620 0.337 0.426 0.643 0.993 0.316 0.340 0.398 0.447 0.199 0.233 0.322 0.440
Ours 0.405 0.423 0.565 0.603 0.335 0.395 0.575 0.827 0.322 0.340 0.391 0.440 0.195 0.245 0.312 0.422

 

Table 7: MSE of the long-term prediction on the ETT [35] datasets.
 Method Electricity Weather Exchange Rate Traffic
96 192 336 720 96 192 336 720 96 192 336 720 96 192 336 720
iTransformer [13] Baseline 0.152 0.159 0.179 0.230 0.175 0.224 0.281 0.362 0.086 0.180 0.335 0.856 0.399 0.418 0.428 0.463
ASD [7] 0.173 0.179 0.201 0.234 0.191 0.223 0.280 0.364 0.088 0.183 0.343 0.872 0.431 0.428 0.430 0.478
MSB [1] 0.182 0.182 0.194 0.267 0.185 0.235 0.284 0.359 0.089 0.189 0.359 0.907 0.417 0.416 0.422 0.471
Upsample [23] 0.166 0.188 0.216 0.221 0.204 0.257 0.291 0.373 0.086 0.180 0.338 0.834 0.433 0.419 0.433 0.476
FreqAdd [33] 0.150 0.157 0.172 0.204 0.181 0.230 0.285 0.362 0.087 0.181 0.333 0.837 0.480 0.441 0.450 0.501
FreqPool [4] 0.169 0.170 0.194 0.237 0.184 0.223 0.279 0.378 0.088 0.183 0.330 0.832 0.410 0.429 0.433 0.476
Robusttad [8] 0.150 0.157 0.176 0.210 0.172 0.225 0.281 0.357 0.087 0.179 0.329 0.833 0.406 0.417 0.429 0.458
STAug [34] 0.160 0.173 0.218 0.372 0.206 0.264 0.319 0.385 0.086 0.178 0.335 0.866 0.413 0.432 0.449 0.481
MixMask [3] 0.151 0.158 0.173 0.205 0.175 0.224 0.279 0.354 0.089 0.178 0.328 0.845 0.395 0.401 0.418 0.450
Ours 0.150 0.156 0.171 0.199 0.171 0.221 0.276 0.351 0.086 0.176 0.313 0.821 0.394 0.412 0.423 0.448
AutoFormer [31] Baseline 0.203 0.208 0.231 0.239 0.241 0.314 0.341 0.425 0.143 0.305 0.470 1.056 0.640 0.645 0.611 0.658
ASD 0.247 0.216 0.221 0.235 0.652 0.392 0.416 0.513 0.141 0.280 0.579 1.240 0.631 0.602 0.607 0.643
MSB 0.237 0.256 0.295 0.236 0.256 0.379 0.402 0.468 0.156 0.254 0.513 1.339 0.652 0.665 0.643 0.650
Upsample 0.201 0.209 0.232 0.268 0.281 0.294 0.329 0.385 0.141 0.292 0.553 1.295 0.653 0.676 0.702 0.694
FreqAdd 0.193 0.197 0.212 0.225 0.255 0.323 0.370 0.419 0.143 0.369 0.716 1.173 0.613 0.598 0.617 0.639
FreqPool 0.213 0.224 0.234 0.257 0.237 0.339 0.372 0.446 0.142 0.336 0.532 1.014 0.630 0.598 0.603 0.639
Robusttad 0.230 0.242 0.261 0.231 0.270 0.334 0.351 0.429 0.142 0.309 0.462 1.123 0.621 0.614 0.612 0.646
STAug 0.191 0.206 0.217 0.234 0.250 0.300 0.347 0.418 0.140 0.326 0.594 1.176 0.632 0.619 0.632 0.640
MixMask 0.177 0.194 0.206 0.224 0.240 0.302 0.330 0.422 0.141 0.284 0.453 0.778 0.560 0.584 0.594 0.635
Ours 0.171 0.191 0.203 0.219 0.214 0.273 0.327 0.383 0.136 0.243 0.418 0.695 0.577 0.581 0.592 0.638
MICN [27] Baseline 0.171 0.183 0.198 0.224 0.188 0.241 0.278 0.350 0.091 0.185 0.355 0.941 0.522 0.540 0.553 0.573
ASD 0.165 0.174 0.190 0.237 0.189 0.242 0.276 0.354 0.087 0.175 0.337 1.203 0.505 0.534 0.541 0.539
MSB 0.179 0.182 0.201 0.225 0.201 0.250 0.291 0.365 0.088 0.176 0.360 0.995 0.513 0.532 0.528 0.556
Upsample 0.182 0.180 0.203 0.220 0.193 0.249 0.279 0.372 0.084 0.171 0.313 0.702 0.533 0.559 0.556 0.590
FreqAdd 0.160 0.169 0.182 0.199 0.180 0.234 0.282 0.350 0.087 0.174 0.349 0.923 0.503 0.527 0.520 0.571
FreqPool 0.182 0.203 0.241 0.256 0.192 0.257 0.278 0.351 0.089 0.179 0.394 0.923 0.531 0.539 0.556 0.592
Robusttad 0.179 0.220 0.234 0.227 0.192 0.239 0.292 0.343 0.085 0.179 0.336 0.932 0.510 0.532 0.547 0.597
STAug 0.180 0.195 0.210 0.224 0.272 0.356 0.433 0.559 0.092 0.183 0.313 0.790 0.512 0.533 0.529 0.585
MixMask 0.159 0.165 0.178 0.195 0.185 0.239 0.281 0.344 0.086 0.174 0.337 0.796 0.490 0.512 0.519 0.538
Ours 0.157 0.168 0.178 0.211 0.179 0.232 0.275 0.342 0.084 0.169 0.303 0.750 0.501 0.507 0.518 0.556
SCINet [12] Baseline 0.212 0.237 0.255 0.286 0.229 0.282 0.334 0.402 0.099 0.191 0.356 0.916 0.550 0.526 0.545 0.596
ASD 0.229 0.241 0.239 0.282 0.254 0.276 0.356 0.462 0.095 0.204 0.379 1.230 0.537 0.521 0.541 0.570
MSB 0.232 0.237 0.228 0.274 0.279 0.265 0.374 0.454 0.093 0.267 0.402 0.965 0.520 0.510 0.537 0.565
Upsample 0.250 0.232 0.271 0.309 0.243 0.299 0.361 0.431 0.092 0.196 0.311 0.932 0.519 0.536 0.528 0.576
FreqAdd 0.176 0.195 0.212 0.237 0.208 0.258 0.309 0.385 0.092 0.186 0.343 0.920 0.492 0.497 0.512 0.550
FreqPool 0.230 0.221 0.242 0.339 0.261 0.290 0.337 0.456 0.096 0.183 0.551 0.938 0.557 0.519 0.533 0.562
Robusttad 0.189 0.202 0.210 0.243 0.229 0.281 0.331 0.410 0.093 0.186 0.334 0.957 0.523 0.519 0.522 0.569
STAug 0.210 0.239 0.282 0.411 0.277 0.329 0.372 0.435 0.098 0.191 0.342 0.931 0.560 0.517 0.521 0.566
MixMask 0.171 0.188 0.204 0.230 0.205 0.250 0.310 0.374 0.093 0.179 0.336 0.928 0.495 0.492 0.511 0.551
Ours 0.172 0.188 0.200 0.225 0.197 0.246 0.299 0.379 0.091 0.175 0.342 0.890 0.500 0.495 0.509 0.544
TiDE [6] Baseline 0.207 0.197 0.211 0.238 0.177 0.220 0.265 0.323 0.093 0.184 0.330 0.860 0.452 0.450 0.451 0.479
ASD 0.232 0.220 0.231 0.265 0.189 0.221 0.297 0.332 0.095 0.206 0.351 0.962 0.477 0.462 0.450 0.506
MSB 0.210 0.219 0.253 0.261 0.199 0.254 0.273 0.339 0.092 0.179 0.358 0.941 0.461 0.451 0.455 0.510
Upsample 0.206 0.199 0.223 0.274 0.203 0.267 0.331 0.355 0.091 0.182 0.331 0.852 0.490 0.466 0.472 0.493
FreqAdd 0.150 0.163 0.177 0.209 0.173 0.216 0.263 0.322 0.088 0.180 0.330 0.848 0.429 0.441 0.440 0.471
FreqPool 0.224 0.238 0.233 0.270 0.189 0.224 0.292 0.334 0.092 0.334 0.521 1.124 0.453 0.466 0.479 0.503
Robusttad 0.176 0.166 0.182 0.229 0.182 0.231 0.279 0.330 0.099 0.232 0.331 0.924 0.449 0.430 0.438 0.482
STAug 0.230 0.210 0.192 0.225 0.205 0.247 0.292 0.364 0.092 0.184 0.330 0.859 0.466 0.455 0.471 0.480
MixMask 0.143 0.155 0.164 0.210 0.173 0.216 0.263 0.323 0.089 0.180 0.329 0.861 0.421 0.427 0.434 0.466
Ours 0.143 0.150 0.165 0.202 0.177 0.219 0.261 0.322 0.088 0.179 0.324 0.847 0.423 0.426 0.433 0.466
LightTS [32] Baseline 0.210 0.169 0.182 0.212 0.168 0.210 0.260 0.320 0.139 0.252 0.412 0.840 0.505 0.515 0.539 0.587
ASD 0.225 0.179 0.198 0.232 0.179 0.21 0.271 0.321 0.132 0.320 0.436 1.036 0.510 0.514 0.534 0.579
MSB 0.233 0.182 0.204 0.228 0.170 0.214 0.259 0.332 0.117 0.294 0.502 0.964 0.532 0.510 0.539 0.584
Upsample 0.246 0.179 0.211 0.254 0.182 0.223 0.257 0.336 0.099 0.251 0.369 0.702 0.522 0.547 0.532 0.597
FreqAdd 0.213 0.159 0.177 0.210 0.164 0.207 0.258 0.317 0.098 0.522 0.565 1.583 0.492 0.500 0.530 0.572
FreqPool 0.219 0.174 0.197 0.236 0.193 0.254 0.267 0.339 0.099 0.275 0.394 0.793 0.501 0.519 0.533 0.592
Robusttad 0.212 0.169 0.181 0.223 0.172 0.223 0.259 0.324 0.092 0.279 0.451 0.796 0.499 0.502 0.521 0.572
STAug 0.224 0.267 0.294 0.351 0.214 0.263 0.382 0.371 0.096 0.212 0.380 0.690 0.520 0.534 0.520 0.596
MixMask 0.192 0.158 0.175 0.211 0.163 0.206 0.257 0.318 0.099 0.384 0.518 0.774 0.486 0.499 0.517 0.555
Ours 0.210 0.156 0.173 0.206 0.165 0.205 0.249 0.312 0.088 0.243 0.361 0.676 0.483 0.497 0.515

 

Table 8: MSE of long-term prediction on the Electricity, Traffic, Weather, and Exchange Rate [31] datasets.
  Methods PEMS03 PEMS04 PEMS07 PEMS08
12 24 36 48 12 24 36 48 12 24 36 48 12 24 36 48
Baseline 0.070 0.097 0.134 0.164 0.088 0.124 0.160 0.196 0.067 0.097 0.128 0.156 0.088 0.136 0.191 0.248
ASD [7] 0.072 0.096 0.152 0.239 0.098 0.132 0.156 0.190 0.069 0.099 0.154 0.181 0.089 0.138 0.196 0.247
MSB [1] 0.096 0.131 0.129 0.214 0.087 0.134 0.167 0.219 0.098 0.096 0.137 0.165 0.096 0.137 0.210 0.256
Upsample [23] 0.069 0.096 0.128 0.179 0.087 0.124 0.158 0.199 0.072 0.099 0.127 0.155 0.088 0.140 0.192 0.245
FreqAdd [33] 1.036 0.104 0.251 0.362 0.088 0.125 0.159 0.201 0.067 0.097 0.127 0.155 0.089 0.135 0.192 0.253
FreqPool [4] 1.234 0.178 0.296 0.451 0.099 0.145 0.178 0.226 0.079 0.104 0.152 0.172 0.099 0.155 0.203 0.264
Robusttad [8] 0.082 0.098 0.132 1.520 0.089 0.123 0.161 0.195 0.067 0.097 0.129 0.157 0.092 0.135 0.189 0.26
STAug [34] 0.079 0.112 0.195 0.456 0.087 0.120 0.162 0.304 0.066 0.096 0.132 0.165 0.092 0.147 0.192 0.276
Mask [3] 0.443 1.205 0.233 1.510 0.086 0.119 0.158 0.346 0.065 0.095 0.125 0.156 0.089 0.131 0.186 0.239
Mix [3] 1.018 0.097 0.877 1.501 0.085 0.119 0.154 0.205 0.065 0.094 0.134 0.152 0.089 0.131 0.184 0.234
Ours 0.067 0.095 0.126 0.235 0.085 0.118 0.149 0.182 0.065 0.094 0.123 0.148 0.087 0.134 0.184

 

Table 9: MSE of short-term prediction using iTransformer [13] on the PEMS datasets [2].

B.2 Example predictions

We provide example prediction results on different datasets in Fig. 6.

Figure 6: Example predictions of different methods under long-term (top) and short-term (bottom) protocols.

B.3 Optimal k

We provide the optimal k for all long-term prediction datasets using iTransformer [13] in Tab. 10 and 11. As the tables show, the optimal k lies between 2 and 4 in all but two settings (Exchange Rate at prediction lengths 336 and 720, where it is 8), so finding a good value requires little tuning effort.

  Hyperparameter ETTh1 ETTh2 ETTm1 ETTm2
96 192 336 720 96 192 336 720 96 192 336 720 96 192 336 720
Optimal k 4 4 4 4 2 2 2 4 3 3 2 2 4 4 2 4
Table 10: The optimal k on the ETT datasets using the iTransformer [13] model.
  Hyperparameter Electricity Traffic Weather Exchange Rate
96 192 336 720 96 192 336 720 96 192 336 720 96 192 336 720
Optimal k 2 3 2 2 2 2 2 2 3 3 2 4 2 2 8 8
Table 11: The optimal k on the Electricity, Traffic, Weather, and Exchange Rate datasets using the iTransformer [13] model.
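
To make the role of k concrete, the following is a minimal NumPy sketch of shuffling the k dominant frequency components of a single univariate series, assuming k denotes the number of dominant components that are rearranged. It is an illustration under stated assumptions, not the released implementation: the function name `dominant_shuffle_1d`, the choice to leave the zero-frequency (DC) bin untouched, and the toy input signal are all hypothetical.

```python
import numpy as np


def dominant_shuffle_1d(x, k, rng=None):
    """Shuffle the k largest-magnitude non-DC frequency components of a 1-D series.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    if k < 1:
        return x.copy()
    spec = np.fft.rfft(x)              # complex spectrum (magnitude and phase)
    mag = np.abs(spec)
    mag[0] = -np.inf                   # assumption: never select the DC bin
    top = np.argsort(mag)[-k:]         # indices of the k dominant frequencies
    shuffled = rng.permutation(top)    # random rearrangement of those indices
    out = spec.copy()
    out[top] = spec[shuffled]          # move existing components, add no noise
    return np.fft.irfft(out, n=x.size)


# Toy signal (hypothetical): an offset plus three sinusoids of decreasing amplitude.
t = np.arange(96)
x = (5.0
     + 3.0 * np.sin(2 * np.pi * 4 * t / 96)
     + 2.0 * np.sin(2 * np.pi * 8 * t / 96)
     + 1.0 * np.sin(2 * np.pi * 12 * t / 96))
x_aug = dominant_shuffle_1d(x, k=3, rng=np.random.default_rng(0))
```

In this sketch a larger k rearranges more of the spectrum and therefore moves the augmented series further from the original, while k = 1 leaves the series unchanged; this is consistent with small values of k being sufficient in Tab. 10 and 11.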

B.4 Standard deviations

Tab. 12, 13, 14 and 15 show the standard deviations over different runs. The standard deviation of our method is at most 0.006 in every setting, indicating that its performance is stable.

  Model ETTh1 ETTh2
96 192 336 720 96 192 336 720
iTransformer Baseline 0.392±0.001 0.447±0.002 0.483±0.003 0.516±0.003 0.303±0.001 0.381±0.000 0.412±0.001 0.434±0.002
Mask [3] 0.390±0.001 0.442±0.002 0.475±0.001 0.503±0.003 0.301±0.001 0.385±0.003 0.414±0.001 0.438±0.005
Mix [3] 0.388±0.002 0.440±0.002 0.477±0.000 0.504±0.004 0.301±0.001 0.380±0.001 0.414±0.001 0.434±0.003
Ours 0.383±0.001 0.438±0.001 0.473±0.002 0.492±0.002 0.298±0.002 0.382±0.003 0.411±0.004 0.428±0.001
 
Table 12: Error bars on ETTh1 and ETTh2 datasets.
  Model ETTm1 ETTm2
96 192 336 720 96 192 336 720
iTransformer Baseline 0.344±0.002 0.383±0.003 0.421±0.001 0.494±0.003 0.183±0.001 0.251±0.002 0.311±0.001 0.412±0.001
Mask [3] 0.347±0.002 0.383±0.005 0.420±0.001 0.494±0.004 0.179±0.003 0.251±0.001 0.311±0.001 0.411±0.002
Mix [3] 0.334±0.005 0.375±0.002 0.421±0.000 0.485±0.002 0.178±0.002 0.248±0.001 0.311±0.000 0.407±0.002
Ours 0.332±0.001 0.374±0.001 0.424±0.001 0.492±0.002 0.178±0.002 0.246±0.001 0.309±0.001 0.409±0.000
 
Table 13: Error bars on ETTm1 and ETTm2 datasets.
  Model Electricity Traffic
96 192 336 720 96 192 336 720
iTransformer Baseline 0.152±0.000 0.159±0.001 0.179±0.003 0.230±0.013 0.399±0.001 0.418±0.000 0.428±0.000 0.463±0.000
Mask [3] 0.153±0.001 0.157±0.001 0.173±0.001 0.208±0.005 0.395±0.001 0.401±0.005 0.418±0.001 0.450±0.002
Mix [3] 0.151±0.000 0.158±0.001 0.173±0.000 0.205±0.003 0.400±0.003 0.414±0.004 0.424±0.002 0.453±0.003
Ours 0.150±0.000 0.156±0.001 0.171±0.000 0.199±0.002 0.394±0.000 0.412±0.002 0.423±0.002 0.448±0.001
 
Table 14: Error bars on Electricity and Traffic datasets.
  Model Weather Exchange Rate
96 192 336 720 96 192 336 720
iTransformer Baseline 0.175±0.001 0.224±0.001 0.281±0.000 0.362±0.003 0.086±0.000 0.180±0.000 0.335±0.002 0.856±0.004
Mask [3] 0.178±0.001 0.228±0.002 0.284±0.002 0.359±0.001 0.090±0.002 0.178±0.001 0.329±0.006 0.845±0.008
Mix [3] 0.175±0.001 0.224±0.000 0.279±0.000 0.354±0.000 0.089±0.001 0.178±0.001 0.328±0.006 0.868±0.008
Ours 0.171±0.001 0.221±0.000 0.276±0.000 0.351±0.002 0.086±0.001 0.176±0.001 0.313±0.006 0.821±0.003
 
Table 15: Error bars on Weather and Exchange Rate datasets.
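
Entries of the form mean±std, as in Tab. 12 to 15, can be produced from per-run test MSEs along the following lines. This is a minimal sketch: the helper name `summarize_runs` and the three MSE values are hypothetical and are not taken from our experiments, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text above does not state which estimator is used.

```python
import numpy as np


def summarize_runs(mse_per_run, ddof=0):
    """Format per-run MSEs as 'mean±std' with three decimals.

    Hypothetical helper; ddof=0 (population standard deviation) is an assumption.
    """
    runs = np.asarray(mse_per_run, dtype=float)
    return f"{runs.mean():.3f}±{runs.std(ddof=ddof):.3f}"


# Hypothetical MSEs from three runs of one model on one setting.
example_runs = [0.500, 0.502, 0.504]
summary = summarize_runs(example_runs)
print(summary)  # prints 0.502±0.002
```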