Dominant Shuffle: A Simple Yet Powerful Data Augmentation for Time-series Prediction

Kai Zhao¹ Zuojie He² Alex Hung¹ Dan Zeng²
¹UCLA ²Shanghai University
kz@kaizhao.net

Abstract

Recent studies have suggested that frequency-domain data augmentation (DA) is effective for time series prediction. Existing frequency-domain augmentations disturb the original data with various full-spectrum noises, leading to an excessive domain gap between augmented and original data. Although impressive performance has been achieved in certain cases, frequency-domain DA has yet to generalize across time series prediction datasets. In this paper, we find that frequency-domain augmentations can be significantly improved by two modifications that limit the perturbations. First, we find that limiting the perturbation to only the dominant frequencies significantly outperforms full-spectrum perturbations. Dominant frequencies represent the main periodicity and trends of the signal and are more important than other frequencies. Second, we find that simply shuffling the dominant frequency components is superior to sophisticatedly designed random perturbations. Shuffling rearranges the original components (magnitudes and phases) and limits the external noise. With these two modifications, we propose dominant shuffle, a simple yet effective data augmentation for time series prediction that can be implemented with just a few lines of code. Extensive experiments with eight datasets and six popular time series models demonstrate that our method consistently improves the baseline performance under various settings and significantly outperforms other DA methods. Code can be accessed at https://kaizhao.net/time-series.

1 Introduction

Time-series prediction aims to forecast multivariate future values based on historical observations. It is a long-standing problem with various applications such as electricity pricing, weather forecasting, and traffic prediction [10, 35]. Recently, impressive results have been achieved with various deep learning architectures, e.g. recurrent neural networks (RNNs) [21, 22, 14], Transformers [35, 31, 37, 13], and temporal convolutional networks (TCNs) [27, 12, 30]. Neural networks require a large volume of training data to effectively fit their numerous parameters. Unfortunately, time-series data acquired from real-world sensors are often limited in many applications. Moreover, the patterns of a time series depend heavily on the specific dynamic system that generates the data, so data from other sources are not applicable [3, 23].

To mitigate the impact of insufficient data in time series analysis, several data augmentation techniques have been explored [28, 7, 1, 23, 3, 33, 4, 8, 34, 20, 26, 9, 24, 11, 16, 10]. Most of these data augmentation techniques in time series analysis focus on classification [20, 26, 9, 24, 16, 10, 33, 4] and anomaly detection [11, 10, 8]. These augmentations alter the time series sequences while preserving the class labels. However, the prediction task requires more fine-grained temporal information to accurately estimate future dynamics [34, 3]. These perturbations designed for classification can disrupt the data-label coherence and lead to performance degradation [34, 3].

Coherence is a key factor in effective data augmentation [28, 34, 25]. It measures the semantic connection between the augmented data and the label. Augmentations designed for classification often struggle with prediction tasks because their unilateral perturbations disrupt the data-label coherence. Recently, to preserve data-label coherence, Chen et al. [3] proposed to simultaneously perturb the data (historical sequence) and the labels (future sequence) in the frequency domain. Unlike common data augmentations that introduce slight perturbations only to the data while keeping the labels unchanged, this approach enables more radical perturbations, such as frequency mix (FreqMix) and frequency mask (FreqMask), to be applied without severely disrupting the data-label coherence. The method indeed generates new data-label pairs that are significantly different from the originals.

However, the full-spectrum perturbations in FreqMask and FreqMix introduce external randomization and enlarge the domain gap between the augmented and original data. This can lead to unstable and suboptimal results on some benchmarks, especially with a larger number of augmented samples. As shown in Fig. 5, the performance of FrAug [3] degrades significantly as the number of augmented samples rises, which suggests that the augmented samples are out-of-distribution relative to the original samples.

In this paper, to reduce the domain gap between the augmented and original data, we propose to limit the perturbation and randomization in data augmentation. First, we limit the perturbation to specific frequencies instead of perturbing the full spectrum. Several recent studies have pointed out that a few frequency components dominate the periodicity and main trends of a time series, while the other frequencies correspond to minor trends or noise [30, 37, 36]. Following [30], we perturb the top-$k$ frequencies with the highest magnitudes. Second, to avoid excess external noise, we use random shuffling for the perturbation. Shuffling rearranges existing components without introducing any external randomness.

Extensive comparisons were made among nine different data augmentation methods on eight public datasets using six state-of-the-art time-series prediction network architectures. These comparisons demonstrate that, despite its simplicity, our method outperforms the competing augmentations by a substantial margin. As shown in Fig. 1, our method consistently improves the performance across various datasets and outperforms other augmentations in most cases.

Comprehensive ablation studies demonstrate that perturbing the dominant frequencies yields significantly better performance than various full-spectrum perturbations, and that shuffling is superior to other randomization techniques. Moreover, our augmentation exhibits a smaller gap between augmented and original data than other augmentations, as indicated by its higher performance with an increased number of augmented samples (Fig. 5).


Figure 1: Relative improvements (%) of various data augmentations over the baseline on eight datasets using the state-of-the-art iTransformer [13] model. Zero corresponds to the original model without any data augmentation. Our method consistently improves the baseline on all the datasets and outperforms other augmentations in most cases. The improvements are based on the average performance of four prediction lengths: 96, 192, 336, and 720.

2 Related Work

In the last decade, deep learning has emerged as a powerful tool in time-series prediction and has shown superior performance over traditional statistical methods such as ARIMA and Exponential Smoothing [15]. A rich line of studies has introduced various deep-learning architectures, including recurrent neural networks (RNNs) [21, 22, 14], temporal convolution neural networks (TCNs) [27, 12, 30], and Transformers [31, 17, 18, 13, 37]. These models learn to predict the future from large volumes of historical data.

Various data augmentations have been proposed for time series data, and many of them target classification tasks [28, 20, 26, 9, 24, 16, 10, 33, 4]. Many of these methods regard time series data as one-dimensional images and borrow data augmentations from computer vision, e.g. cropping [9, 5], flipping [28], and noise injection [29]. Window warping [28] is a time series-specific data augmentation that upsamples (or downsamples) a random range of the time series while keeping other time ranges unchanged.

In addition to time-domain augmentations, there are also methods that perturb the original data in the frequency domain. Gao et al. [8] proposed to add noise to both the magnitude and the phase in the frequency domain. Zhang et al. [33] proposed to add single or multiple frequency components in the first half of the frequency spectrum. Chen et al. [4] proposed to perform pooling or smoothing operations in the frequency domain.

While most augmentations focus on classification tasks, a few methods for forecasting have also been explored. Bandara et al. [1] introduced two DA methods for forecasting: (i) average selected with distance (ASD), which generates augmented time series as the weighted sum of multiple time series, with weights determined by the dynamic time warping (DTW) distance [7]; and (ii) moving block bootstrapping (MBB), which generates augmented data by manipulating the residual part of the time series after STL decomposition [23] and recombining it with the other series. Zhang et al. [34] proposed to augment simultaneously in the frequency and time domains. Recently, Chen et al. [3] proposed to augment both the data (historical sequence) and the label (future sequence) in the frequency domain to improve the data-label coherence. Although this method generally achieves decent results, full-spectrum randomization imposes a large domain gap between the augmented and the original data, sometimes leading to degraded performance.

3 Dominant Frequency Shuffle for Time-series

3.1 Time-series Prediction and Frequency Domain Augmentation

Time-series prediction is a sequence-to-sequence problem where the model estimates a future multivariate sequence based on a sequence of historical measurements. Let $x=\{x^{1},x^{2},\ldots,x^{L}\}\in\mathbb{R}^{L\times D}$ be the historical sequence and $y=\{x^{L+1},x^{L+2},\ldots,x^{L+T}\}\in\mathbb{R}^{T\times D}$ be the future sequence to be estimated, where $x^{t}\in\mathbb{R}^{D}$ is the measurement at timestep $t$ and $D$ is the number of variates. In the following, $x\in\mathbb{R}^{L\times D}$ and $y\in\mathbb{R}^{T\times D}$ denote the historical and future sequences, which are the input and output of deep learning models, respectively.

3.2 Dominant Frequency Shuffle

Deep neural networks learn the $x\rightarrow y$ mapping from a large volume of $(x,y)$ pairs, and data augmentation is an efficient way of expanding the training data. Frequency-domain augmentation is a family of augmentation methods that perturb time series in the frequency domain. These methods first convert the time series to the frequency domain, apply perturbations there, and then convert the modified data back to the time domain.

Following FrAug [3], we augment the concatenation of data and label to preserve the data-label consistency. Let $F(\omega)=\mathcal{F}([x,y])$ be the discrete Fourier transform (DFT) of the time series, where $[x,y]$ denotes the concatenation of data and label (we used torch.fft.rfft() and torch.fft.irfft() for the time-to-frequency and inverse conversions). We shuffle only the $k$ dominant frequencies, i.e. those with the highest magnitudes $\lvert F(\omega)\rvert$. Let $\hat{F}(\omega)$ be the frequency-domain data with the dominant frequencies shuffled. $\hat{F}(\omega)$ is then converted back to the time domain using the inverse DFT (iDFT): $[\hat{x},\hat{y}]=\text{iDFT}\big(\hat{F}(\omega)\big)$, where $(\hat{x},\hat{y})$ is the augmented data-label pair. Fig. 2 illustrates an example of the process of dominant shuffle with $k=3$. The prediction models were trained on a combined training set with both augmented and original data.
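The procedure above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' released code: NumPy's `np.fft.rfft`/`np.fft.irfft` stand in for the `torch.fft` calls mentioned above, the top-$k$ bins are selected independently for each variate, and the DC bin (plus the Nyquist bin for even lengths) is excluded from the shuffle so that the inverse transform stays exactly real. The function name `dominant_shuffle` and the `rng` argument are hypothetical.

```python
import numpy as np


def dominant_shuffle(x, y, k, rng=None):
    """Shuffle the k largest-magnitude frequency bins of [x, y].

    x: historical sequence, shape (L, D); y: future sequence, shape (T, D).
    Returns the augmented pair (x_hat, y_hat) with the same shapes.
    """
    rng = np.random.default_rng() if rng is None else rng
    L = x.shape[0]
    xy = np.concatenate([x, y], axis=0)        # concatenate data and label
    n = xy.shape[0]
    spec = np.fft.rfft(xy, axis=0)             # complex spectrum, one column per variate
    mag = np.abs(spec)

    # Assumption: keep the DC bin (index 0) and, for even n, the Nyquist bin fixed.
    last = spec.shape[0] - 1 if n % 2 == 0 else spec.shape[0]
    candidates = np.arange(1, last)
    if not 1 <= k <= candidates.size:
        raise ValueError("k must be between 1 and the number of shufflable bins")

    for d in range(xy.shape[1]):
        # Indices of the k dominant bins of variate d.
        top = candidates[np.argsort(mag[candidates, d])[-k:]]
        # Permute the complex coefficients (magnitude and phase) among those bins.
        spec[top, d] = spec[rng.permutation(top), d]

    xy_hat = np.fft.irfft(spec, n=n, axis=0)   # back to the time domain
    return xy_hat[:L], xy_hat[L:]
```

Because this sketch only permutes existing coefficients, the augmented series keeps the per-variate mean and energy of the original, and $k=1$ returns the input unchanged.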


Figure 2: Illustration of shuffling three dominant frequencies. (a) The original time series $x^{t}$. (b) and (c) Frequency-domain representations before and after dominant shuffle. Colored dots mark the shuffled dominant frequencies. (d) Augmented time series with the original time series as reference.

4 Experiments

In this section, we first introduce the implementation details in Sec. 4.1 and then compare the performance of various SOTA models with and without dominant shuffle in Sec. 4.2. In Sec. 4.3, we thoroughly compare dominant shuffle with various data augmentation methods. Finally, we conduct ablation studies to verify hyperparameter sensitivity and justify design choices in Sec. 4.4.

4.1 Experimental Setups

Implementation details All the experiments were conducted with the PyTorch [19] framework on a single NVIDIA RTX 3090 GPU. Some of the experimental results were taken from the respective original papers, and the others were reproduced using the official code with default configurations. We changed only the data augmentation for fair comparison. Please refer to Sec. A.2 for the details of our reimplementations. Following the practice of [3], we performed data augmentation to double the size of the original training dataset unless otherwise specified.

Evaluation protocols We tested our method under short-term and long-term prediction protocols. In the long-term protocol, the prediction length $T$ takes the values 96, 192, 336, and 720. In the short-term protocol, $T$ takes the values 12, 24, 36, and 48. Following the common practice of previous works [35, 31, 37, 13, 27, 30], we quantified the prediction performance using the mean-squared error (MSE) between the ground truth and the prediction.
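For reference, the metric can be written out as follows. The normalization is an assumption, since the text does not spell it out: the usual convention averages the squared error over all $T$ future timesteps and $D$ variates of a sample. The symbol $\tilde{y}$ for the model prediction is introduced only for this illustration.

$$\text{MSE}(y,\tilde{y})=\frac{1}{TD}\sum_{t=1}^{T}\sum_{d=1}^{D}\big(y_{t,d}-\tilde{y}_{t,d}\big)^{2}$$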

Datasets For long-term prediction, we experimented on eight well-established benchmarks: the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [35], and the Weather, Electricity, Exchange, and Traffic datasets [31]. For short-term prediction, following iTransformer [13], we used four public traffic network datasets (PEMS03, PEMS04, PEMS07, PEMS08) from PEMS [2].

Each dataset is divided into training, testing, and evaluation subsets in specific ratios: 6:2:2 for the ETT and PEMS datasets, and 7:1:2 for the Electricity, Traffic, Weather, and Exchange-rate datasets. Detailed statistics of these datasets are summarized in Sec. A.1. For each setting (dataset and prediction length $T$), we tuned the number of dominant frequencies $k$ on the evaluation set. The optimal $k$ on various datasets can be found in Sec. B.3.
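The split and the selection of $k$ can be sketched as below. This is an illustration under assumptions: the subsets are taken as contiguous, chronologically ordered blocks in the order the ratios are stated, and $k$ is chosen as the candidate with the lowest MSE on the evaluation subset. The helper names `split_by_ratio` and `select_k`, and the series length used in the example call, are hypothetical.

```python
def split_by_ratio(n_steps, ratios):
    """Split indices 0..n_steps-1 into three contiguous blocks by integer ratios."""
    total = sum(ratios)
    n_first = n_steps * ratios[0] // total
    n_second = n_steps * ratios[1] // total
    first = range(0, n_first)
    second = range(n_first, n_first + n_second)
    third = range(n_first + n_second, n_steps)   # remainder goes to the last block
    return first, second, third


def select_k(mse_by_k):
    """Return the k whose evaluation-set MSE is lowest; mse_by_k maps k -> MSE."""
    return min(mse_by_k, key=mse_by_k.get)


# Example call with a hypothetical series length of 1000 timesteps.
blocks_622 = split_by_ratio(1000, (6, 2, 2))
blocks_712 = split_by_ratio(1000, (7, 1, 2))
```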

Baseline Models We selected diverse models as baselines in our experiments, including two Transformer-based models (iTransformer [13], Autoformer [31]), two MLP-based models (TiDE [6], LightTS [32]), and two temporal convolutional network (TCN) based models (MICN [27], SCINet [12]). iTransformer [13] is the state-of-the-art Transformer-based model, TiDE [6] is the state-of-the-art MLP-based model, and MICN [27] is the state-of-the-art TCN-based model. For short-term prediction, we used the SOTA iTransformer [13] model on the PEMS [2] datasets as the baseline.

Other data augmentation methods We compared the proposed method with nine existing data augmentation methods, including three time-domain augmentations (ASD [7], MSB [1], Upsample [23]), five frequency-domain methods (FreqMix [3], FreqMask [3], FreqAdd [33], FreqPool [4], Robusttad [8]), and one temporal-frequency method, STAug [34].

4.2 Comparison With State-of-the-arts

We first examined how dominant shuffle affects state-of-the-art time series prediction models published in top-tier venues. We compared the performance of recent models (iTransformer [13] (ICLR 2024), MICN [27] (ICLR 2023), SCINet [12] (NeurIPS 2022), and AutoFormer [31] (NeurIPS 2021)) with and without dominant shuffle. The MSE averaged across the prediction lengths (96, 192, 336, 720) is calculated for each dataset.
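As a worked example of this averaging, the sketch below computes the average MSE over the four prediction lengths and a relative improvement of the kind plotted in Fig. 1, using the iTransformer ETTh1 rows (Baseline and Ours) of Tab. 1. The formula, the baseline average minus the augmented average divided by the baseline average, is an assumption about how the relative improvement is defined. The function names are hypothetical.

```python
def average(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def relative_improvement(baseline_mse, augmented_mse):
    """Relative MSE reduction (%) of the augmented model, averaged over lengths."""
    base = average(baseline_mse)
    aug = average(augmented_mse)
    return 100.0 * (base - aug) / base


# iTransformer on ETTh1, prediction lengths 96, 192, 336, 720 (values from Tab. 1).
baseline = [0.392, 0.447, 0.483, 0.516]
ours = [0.383, 0.438, 0.473, 0.492]

avg_baseline = average(baseline)                   # about 0.4595
avg_ours = average(ours)                           # about 0.4465
improvement = relative_improvement(baseline, ours) # about 2.83 (%)
```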


Figure 3: Performance of different models with (right striped bars) and without (left color bars) dominant shuffle. The horizontal dotted lines show how dominant shuffle helps one model outperform a more advanced model.

The results in Fig. 3 demonstrate that our method reduces the prediction error in all cases. In some cases, dominant shuffle even helps a model surpass a more sophisticated one. For example, on the ETTh1 dataset, our approach significantly improves the performance of AutoFormer [31] and MICN [27] and helps them outperform the latest iTransformer [13] model. On the Exchange and Weather datasets, our approach enables AutoFormer to outperform SCINet [12] and assists MICN [27] in surpassing iTransformer [13].

4.3 Comparisons With Other Data Augmentations

We compared different data augmentation methods on various datasets and baseline models under short-term and long-term protocols. Fig. 1 shows the relative improvements (%) of various augmentation methods over the baseline. Tab. 1, 2, and 3 summarize the average performance of 5 runs with distinct random seeds, and the standard deviations across runs can be found in Sec. B.4. The best values in each column are highlighted with color. Example predictions can be found in Fig. 6 in Appendix B.

We first compared different data augmentation methods for long-term prediction. Tab. 1 summarizes the mean squared errors (MSE) on the ETT datasets, and Tab. 2 summarizes the MSE on the Weather, Electricity, and Exchange-rate datasets. Due to limited space, we report only the results on six datasets (ETTh1, ETTh2, ETTm1, Electricity, Weather, and Exchange rate) in Tab. 1 and 2; the results on the other two datasets (ETTm2 and Traffic) can be found in Appendix B. We also merged the results of FreqMix and FreqMask by selecting the superior one in each case. The merged results are denoted as ‘MixMask’.
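The merge into ‘MixMask’ amounts to taking, for each case, the lower of the two MSE values. A minimal sketch follows, with the hypothetical helper name `merge_mixmask`. The inputs are the PEMS04 rows that Tab. 3 reports separately for Mask and Mix, used here purely as sample numbers to show the mechanics.

```python
def merge_mixmask(freqmix_mse, freqmask_mse):
    """Per-case merge: keep the better (lower) MSE of FreqMix and FreqMask."""
    if len(freqmix_mse) != len(freqmask_mse):
        raise ValueError("both result lists must cover the same cases")
    return [min(mix, mask) for mix, mask in zip(freqmix_mse, freqmask_mse)]


# PEMS04, prediction lengths 12, 24, 36, 48 (Mask and Mix rows of Tab. 3).
mask = [0.086, 0.119, 0.158, 0.346]
mix = [0.085, 0.119, 0.154, 0.205]

mixmask = merge_mixmask(mix, mask)
```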

Model	Augmentation	ETTh1				ETTh2				ETTm1
		96	192	336	720	96	192	336	720	96	192	336	720
iTransformer [13]	Baseline	0.392	0.447	0.483	0.516	0.303	0.381	0.412	0.434	0.344	0.383	0.421	0.494
	ASD [7]	0.398	0.456	0.483	0.512	0.310	0.388	0.432	0.452	0.340	0.382	0.454	0.492
	MSB [1]	0.387	0.460	0.494	0.531	0.309	0.382	0.447	0.433	0.339	0.386	0.467	0.510
	Upsample [23]	0.391	0.445	0.481	0.519	0.305	0.381	0.419	0.430	0.351	0.381	0.432	0.489
	FreqAdd [33]	0.389	0.446	0.475	0.510	0.300	0.384	0.416	0.438	0.350	0.385	0.422	0.490
	FreqPool [4]	0.433	0.456	0.497	0.532	0.313	0.392	0.415	0.450	0.347	0.392	0.430	0.499
	Robusttad [8]	0.390	0.445	0.497	0.510	0.312	0.388	0.412	0.439	0.353	0.382	0.421	0.498
	STAug [34]	0.390	0.445	0.489	0.511	0.323	0.428	0.486	0.483	0.339	0.383	0.417	0.485
	MixMask [3]	0.388	0.440	0.477	0.504	0.301	0.380	0.414	0.434	0.334	0.375	0.421	0.485
	Ours	0.383	0.438	0.473	0.492	0.298	0.382	0.411	0.428	0.332	0.374	0.424	0.492
AutoFormer [31]	Baseline	0.429	0.440	0.495	0.498	0.381	0.443	0.471	0.475	0.467	0.610	0.529	0.773
	ASD	0.450	0.485	0.523	0.556	0.370	0.465	0.476	0.503	0.480	0.620	0.502	0.633
	MSB	0.462	0.517	0.612	0.579	0.434	0.523	0.556	0.462	0.499	0.645	0.553	0.721
	Upsample	0.416	0.523	0.480	0.482	0.353	0.460	0.455	0.509	0.498	0.630	0.512	0.667
	FreqAdd	0.460	0.487	0.497	0.525	0.367	0.439	0.480	0.504	0.419	0.554	0.546	0.569
	FreqPool	0.446	0.457	0.523	0.512	0.392	0.442	0.470	0.493	0.479	0.623	0.510	0.754
	Robusttad	0.437	0.452	0.492	0.477	0.367	0.497	0.502	0.527	0.432	0.510	0.553	0.623
	STAug	0.429	0.478	0.505	0.506	0.354	0.443	0.496	0.495	0.415	0.581	0.588	0.693
	MixMask	0.420	0.445	0.467	0.474	0.358	0.421	0.470	0.467	0.415	0.510	0.491	0.588
	Ours	0.409	0.436	0.458	0.486	0.335	0.419	0.453	0.452	0.392	0.506	0.491	0.559
MICN [27]	Baseline	0.384	0.425	0.464	0.574	0.358	0.518	0.566	0.827	0.313	0.360	0.389	0.461
	ASD	0.380	0.430	0.472	0.523	0.377	0.539	0.620	0.843	0.315	0.362	0.399	0.457
	MSB	0.423	0.423	0.501	0.559	0.402	0.623	0.790	1.126	0.330	0.358	0.402	0.459
	Upsample	0.396	0.435	0.463	0.550	0.366	0.500	0.831	0.752	0.339	0.377	0.402	0.475
	FreqAdd	0.390	0.430	0.477	0.643	0.370	0.521	0.626	0.975	0.316	0.360	0.407	0.478
	FreqPool	0.399	0.465	0.473	0.572	0.365	0.553	0.550	0.812	0.336	0.372	0.397	0.466
	Robusttad	0.392	0.436	0.491	0.556	0.339	0.529	0.553	0.998	0.339	0.359	0.396	0.472
	STAug	0.374	0.429	0.489	0.608	0.413	0.760	1.330	2.608	0.313	0.360	0.418	0.483
	MixMask	0.378	0.423	0.461	0.521	0.339	0.488	0.544	0.735	0.301	0.352	0.401	0.454
	Ours	0.373	0.421	0.452	0.510	0.310	0.427	0.507	0.731	0.314	0.360	0.387	0.470
SCINet [12]	Baseline	0.485	0.506	0.519	0.552	0.372	0.416	0.429	0.470	0.316	0.353	0.387	0.431
	ASD	0.494	0.480	0.491	0.559	0.362	0.402	0.432	0.499	0.331	0.367	0.389	0.453
	MSB	0.489	0.466	0.502	0.547	0.359	0.396	0.458	0.476	0.320	0.351	0.396	0.478
	Upsample	0.471	0.457	0.479	0.541	0.379	0.407	0.403	0.482	0.342	0.386	0.399	0.442
	FreqAdd	0.428	0.452	0.469	0.532	0.335	0.385	0.403	0.447	0.304	0.338	0.373	0.421
	FreqPool	0.499	0.510	0.557	0.549	0.410	0.453	0.432	0.475	0.331	0.362	0.379	0.432
	Robusttad	0.462	0.501	0.498	0.559	0.362	0.431	0.419	0.496	0.331	0.351	0.394	0.438
	STAug	0.457	0.500	0.524	0.534	0.538	0.636	0.681	0.648	0.319	0.357	0.389	0.445
	MixMask	0.427	0.452	0.465	0.548	0.335	0.377	0.400	0.438	0.302	0.341	0.376	0.423
	Ours	0.417	0.443	0.461	0.527	0.335	0.375	0.392	0.421	0.302	0.338	0.372	0.420
TiDE [6]	Baseline	0.401	0.434	0.521	0.558	0.304	0.350	0.331	0.399	0.311	0.340	0.366	0.420
	ASD	0.417	0.441	0.513	0.556	0.320	0.351	0.367	0.422	0.319	0.341	0.399	0.432
	MSB	0.422	0.476	0.529	0.579	0.331	0.379	0.334	0.401	0.302	0.356	0.382	0.451
	Upsample	0.431	0.452	0.533	0.604	0.346	0.372	0.350	0.456	0.324	0.339	0.378	0.463
	FreqAdd	0.385	0.420	0.477	0.505	0.289	0.336	0.330	0.390	0.309	0.339	0.365	0.417
	FreqPool	0.423	0.455	0.510	0.592	0.312	0.376	0.339	0.397	0.319	0.352	0.397	0.453
	Robusttad	0.396	0.432	0.521	0.537	0.331	0.352	0.337	0.398	0.321	0.346	0.382	0.437
	STAug	0.515	0.535	0.521	0.558	0.390	0.437	0.403	0.508	0.310	0.337	0.364	0.417
	MixMask	0.385	0.420	0.478	0.507	0.289	0.339	0.330	0.391	0.299	0.332	0.367	0.416
	Ours	0.385	0.414	0.467	0.498	0.283	0.332	0.324	0.388	0.297	0.328	0.365	0.412
LightTS [32]	Baseline	0.448	0.444	0.663	0.706	0.369	0.476	0.738	1.165	0.323	0.347	0.428	0.476
	ASD	0.451	0.476	0.633	0.681	0.392	0.469	0.701	0.998	0.356	0.352	0.441	0.478
	MSB	0.467	0.463	0.627	0.652	0.378	0.472	0.652	1.123	0.371	0.349	0.430	0.479
	Upsample	0.449	0.472	0.610	0.637	0.401	0.487	0.714	1.245	0.329	0.366	0.453	0.492
	FreqAdd	0.417	0.430	0.578	0.622	0.351	0.453	0.689	1.125	0.322	0.352	0.400	0.450
	FreqPool	0.463	0.471	0.652	0.690	0.369	0.512	0.723	1.264	0.336	0.351	0.442	0.497
	Robusttad	0.445	0.442	0.590	0.654	0.372	0.468	0.699	0.982	0.331	0.352	0.441	0.462
	STAug	0.445	0.441	0.669	0.714	0.520	0.807	2.101	2.467	0.320	0.343	0.427	0.476
	MixMask	0.417	0.429	0.575	0.620	0.337	0.426	0.643	0.993	0.316	0.340	0.398	0.447
	Ours	0.405	0.423	0.565	0.603	0.335	0.395	0.575	0.827	0.322	0.340	0.391

Table 1: MSE of the long-term prediction on the ETT [35] datasets. The best values are marked with colors.

As demonstrated in Tab. 1 and 2, our method improves the baseline in 96% of the cases, while other augmentation methods, e.g. FreqMix, outperform the baseline in around 87% of the cases.

Model	Augmentation	Electricity				Weather				Exchange Rate
		96	192	336	720	96	192	336	720	96	192	336	720
iTransformer [13]	Baseline	0.152	0.159	0.179	0.230	0.175	0.224	0.281	0.362	0.086	0.180	0.335	0.856
	ASD [7]	0.173	0.179	0.201	0.234	0.191	0.223	0.280	0.364	0.088	0.183	0.343	0.872
	MSB [1]	0.182	0.182	0.194	0.267	0.185	0.235	0.284	0.359	0.089	0.189	0.359	0.907
	Upsample [23]	0.166	0.188	0.216	0.221	0.204	0.257	0.291	0.373	0.086	0.180	0.338	0.834
	FreqAdd [33]	0.150	0.157	0.172	0.204	0.181	0.230	0.285	0.362	0.087	0.181	0.333	0.837
	FreqPool [4]	0.169	0.170	0.194	0.237	0.184	0.223	0.279	0.378	0.088	0.183	0.330	0.832
	Robusttad [8]	0.150	0.157	0.176	0.210	0.172	0.225	0.281	0.357	0.087	0.179	0.329	0.833
	STAug [34]	0.160	0.173	0.218	0.372	0.206	0.264	0.319	0.385	0.086	0.178	0.335	0.866
	MixMask [3]	0.151	0.158	0.173	0.205	0.175	0.224	0.279	0.354	0.089	0.178	0.328	0.845
	Ours	0.150	0.156	0.171	0.199	0.171	0.221	0.276	0.351	0.086	0.176	0.313	0.821
AutoFormer [31]	Baseline	0.203	0.208	0.231	0.239	0.241	0.314	0.341	0.425	0.143	0.305	0.470	1.056
	ASD	0.247	0.216	0.221	0.235	0.652	0.392	0.416	0.513	0.141	0.280	0.579	1.240
	MSB	0.237	0.256	0.295	0.236	0.256	0.379	0.402	0.468	0.156	0.254	0.513	1.339
	Upsample	0.201	0.209	0.232	0.268	0.281	0.294	0.329	0.385	0.141	0.292	0.553	1.295
	FreqAdd	0.193	0.197	0.212	0.225	0.255	0.323	0.370	0.419	0.143	0.369	0.716	1.173
	FreqPool	0.213	0.224	0.234	0.257	0.237	0.339	0.372	0.446	0.142	0.336	0.532	1.014
	Robusttad	0.230	0.242	0.261	0.231	0.270	0.334	0.351	0.429	0.142	0.309	0.462	1.123
	STAug	0.191	0.206	0.217	0.234	0.250	0.300	0.347	0.418	0.140	0.326	0.594	1.176
	MixMask	0.177	0.194	0.206	0.224	0.240	0.302	0.330	0.422	0.141	0.284	0.453	0.778
	Ours	0.171	0.191	0.203	0.219	0.214	0.273	0.327	0.383	0.136	0.243	0.418	0.695
MICN [27]	Baseline	0.171	0.183	0.198	0.224	0.188	0.241	0.278	0.350	0.091	0.185	0.355	0.941
	ASD	0.165	0.174	0.190	0.237	0.189	0.242	0.276	0.354	0.087	0.175	0.337	1.203
	MSB	0.179	0.182	0.201	0.225	0.201	0.250	0.291	0.365	0.088	0.176	0.360	0.995
	Upsample	0.182	0.180	0.203	0.220	0.193	0.249	0.279	0.372	0.084	0.171	0.313	0.702
	FreqAdd	0.160	0.169	0.182	0.199	0.180	0.234	0.282	0.350	0.087	0.174	0.349	0.923
	FreqPool	0.182	0.203	0.241	0.256	0.192	0.257	0.278	0.351	0.089	0.179	0.394	0.923
	Robusttad	0.179	0.220	0.234	0.227	0.192	0.239	0.292	0.343	0.085	0.179	0.336	0.932
	STAug	0.180	0.195	0.210	0.224	0.272	0.356	0.433	0.559	0.092	0.183	0.313	0.790
	MixMask	0.159	0.165	0.178	0.195	0.185	0.239	0.281	0.344	0.086	0.174	0.337	0.796
	Ours	0.157	0.168	0.178	0.211	0.179	0.232	0.275	0.342	0.084	0.169	0.303	0.750
SCINet [12]	Baseline	0.212	0.237	0.255	0.286	0.229	0.282	0.334	0.402	0.099	0.191	0.356	0.916
	ASD	0.229	0.241	0.239	0.282	0.254	0.276	0.356	0.462	0.095	0.204	0.379	1.230
	MSB	0.232	0.237	0.228	0.274	0.279	0.265	0.374	0.454	0.093	0.267	0.402	0.965
	Upsample	0.250	0.232	0.271	0.309	0.243	0.299	0.361	0.431	0.092	0.196	0.311	0.932
	FreqAdd	0.176	0.195	0.212	0.237	0.208	0.258	0.309	0.385	0.092	0.186	0.343	0.920
	FreqPool	0.230	0.221	0.242	0.339	0.261	0.290	0.337	0.456	0.096	0.183	0.551	0.938
	Robusttad	0.189	0.202	0.210	0.243	0.229	0.281	0.331	0.410	0.093	0.186	0.334	0.957
	STAug	0.210	0.239	0.282	0.411	0.277	0.329	0.372	0.435	0.098	0.191	0.342	0.931
	MixMask	0.171	0.188	0.204	0.230	0.205	0.250	0.310	0.374	0.093	0.179	0.336	0.928
	Ours	0.172	0.188	0.200	0.225	0.197	0.246	0.299	0.379	0.091	0.175	0.342	0.890
TiDE [6]	Baseline	0.207	0.197	0.211	0.238	0.177	0.220	0.265	0.323	0.093	0.184	0.330	0.860
	ASD	0.232	0.220	0.231	0.265	0.189	0.221	0.297	0.332	0.095	0.206	0.351	0.962
	MSB	0.210	0.219	0.253	0.261	0.199	0.254	0.273	0.339	0.092	0.179	0.358	0.941
	Upsample	0.206	0.199	0.223	0.274	0.203	0.267	0.331	0.355	0.091	0.182	0.331	0.852
	FreqAdd	0.150	0.163	0.177	0.209	0.173	0.216	0.263	0.322	0.088	0.180	0.330	0.848
	FreqPool	0.224	0.238	0.233	0.270	0.189	0.224	0.292	0.334	0.092	0.334	0.521	1.124
	Robusttad	0.176	0.166	0.182	0.229	0.182	0.231	0.279	0.330	0.099	0.232	0.331	0.924
	STAug	0.230	0.210	0.192	0.225	0.205	0.247	0.292	0.364	0.092	0.184	0.330	0.859
	MixMask	0.143	0.155	0.164	0.210	0.173	0.216	0.263	0.323	0.089	0.180	0.329	0.861
	Ours	0.143	0.150	0.165	0.202	0.177	0.219	0.261	0.322	0.088	0.179	0.324	0.847
LightTS [32]	Baseline	0.210	0.169	0.182	0.212	0.168	0.210	0.260	0.320	0.139	0.252	0.412	0.840
	ASD	0.225	0.179	0.198	0.232	0.179	0.210	0.271	0.321	0.132	0.320	0.436	1.036
	MSB	0.233	0.182	0.204	0.228	0.170	0.214	0.259	0.332	0.117	0.294	0.502	0.964
	Upsample	0.246	0.179	0.211	0.254	0.182	0.223	0.257	0.336	0.099	0.251	0.369	0.702
	FreqAdd	0.213	0.159	0.177	0.210	0.164	0.207	0.258	0.317	0.098	0.522	0.565	1.583
	FreqPool	0.219	0.174	0.197	0.236	0.193	0.254	0.267	0.339	0.099	0.275	0.394	0.793
	Robusttad	0.212	0.169	0.181	0.223	0.172	0.223	0.259	0.324	0.092	0.279	0.451	0.796
	STAug	0.224	0.267	0.294	0.351	0.214	0.263	0.382	0.371	0.096	0.212	0.380	0.690
	MixMask	0.192	0.158	0.175	0.211	0.163	0.206	0.257	0.318	0.099	0.384	0.518	0.774
	Ours	0.210	0.156	0.173	0.206	0.165	0.205	0.249	0.312	0.088	0.243	0.361

Table 2: MSE of the long-term prediction on the Electricity, Weather, and Exchange Rate [31] datasets. The best values are marked with colors.

Our method also outperforms other augmentation methods in more than 77% of the cases. Moreover, our method achieves larger relative improvements as the prediction length $T$ increases, highlighting its strong capacity in long-term prediction. Tab. 3 summarizes the MSE of short-term prediction using the iTransformer [13] model on the PEMS datasets [2]. The prediction errors are generally lower than those in long-term prediction. Our method outperforms other augmentations in most cases, although the improvements are marginal compared to long-term prediction. This is because short-term prediction is relatively easy, and the performance has already reached saturation.

Method	PEMS03				PEMS04				PEMS07
	12	24	36	48	12	24	36	48	12	24	36	48
Baseline	0.070	0.097	0.134	0.164	0.088	0.124	0.160	0.196	0.067	0.097	0.128	0.156
ASD [7]	0.072	0.096	0.152	0.239	0.098	0.132	0.156	0.190	0.069	0.099	0.154	0.181
MSB [1]	0.096	0.131	0.129	0.214	0.087	0.134	0.167	0.219	0.098	0.096	0.137	0.165
Upsample [23]	0.069	0.096	0.128	0.179	0.087	0.124	0.158	0.199	0.072	0.099	0.127	0.155
FreqAdd [33]	1.036	0.104	0.251	0.362	0.088	0.125	0.159	0.201	0.067	0.097	0.127	0.155
FreqPool [4]	1.234	0.178	0.296	0.451	0.099	0.145	0.178	0.226	0.079	0.104	0.152	0.172
Robusttad [8]	0.082	0.098	0.132	1.520	0.089	0.123	0.161	0.195	0.067	0.097	0.129	0.157
STAug [34]	0.079	0.112	0.195	0.456	0.087	0.120	0.162	0.304	0.066	0.096	0.132	0.165
Mask [3]	0.443	1.205	0.233	1.510	0.086	0.119	0.158	0.346	0.065	0.095	0.125	0.156
Mix [3]	1.018	0.097	0.877	1.501	0.085	0.119	0.154	0.205	0.065	0.094	0.134	0.152
Ours	0.067	0.095	0.126	0.235	0.085	0.118	0.149	0.182	0.065	0.094	0.123

Table 3: Short-term prediction using the iTransformer [13] on the PEMS datasets [2].

4.4 Ablation Study

Our method includes one hyperparameter $k$ and two unique designs: 1) perturbing only the dominant frequencies and 2) shuffling the dominant frequency components. We conducted ablation studies to investigate the impact of the hyperparameter and to justify our design choices.

4.4.1 Number of Dominant Frequencies


Figure 4: Mean-squared errors with various $k$ values on four datasets under the predict-96 setting. Our method is stable against $k$, and the performance varies only slightly.

The only hyperparameter in our method is the number of dominant frequencies $k$. We evaluated the performance using various $k$ values with iTransformer [13]. The results in Fig. 4 reveal that our method is stable against different $k$ values.

4.4.2 Shuffle the Dominant Frequencies

In this experiment, we compared combinations of different perturbed frequency ranges and perturbation operations.

We first compared perturbing different portions of the spectrum, including the dominant frequencies, the minor frequencies, and the full spectrum. The results in Tab. 4 indicate that perturbing the dominant frequencies significantly outperforms the other options, while perturbing the minor frequencies yields the worst performance. Tab. 5 compares different perturbation operations including masking [3], adding noise [8, 10], randomization, and shuffling (ours). Shuffling surpasses the other operations in most cases.
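One way to read the three spectrum choices compared in Tab. 4 is as three index sets over the frequency bins, to which the same operation (shuffle or mask) is then applied. The sketch below illustrates that reading under assumptions: "dom" is taken as the $k$ bins with the largest magnitudes, "min" as the $k$ bins with the smallest magnitudes (the text does not define the minor set precisely), and "full" as every bin. The helper name `select_bins` is hypothetical.

```python
import numpy as np


def select_bins(magnitude, k, mode):
    """Return the sorted bin indices to perturb for one variate.

    magnitude: 1-D array of |F(w)| per frequency bin.
    mode: "dom" (k largest), "min" (k smallest, an assumed definition), or "full".
    """
    magnitude = np.asarray(magnitude, dtype=float)
    order = np.argsort(magnitude)              # ascending by magnitude
    if mode == "dom":
        return np.sort(order[-k:])
    if mode == "min":
        return np.sort(order[:k])
    if mode == "full":
        return np.arange(magnitude.shape[0])
    raise ValueError("mode must be 'dom', 'min', or 'full'")


# Toy magnitudes for six bins, chosen only to show the three index sets.
toy_mag = np.array([0.5, 9.0, 0.1, 4.0, 7.0, 0.2])
dom_bins = select_bins(toy_mag, 2, "dom")
min_bins = select_bins(toy_mag, 2, "min")
full_bins = select_bins(toy_mag, 2, "full")
```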

			ETTh1				ETTm2				Weather
			96	192	336	720	96	192	336	720	96	192	336	720
iTrans [13]	Shuffle	full	0.391	0.447	0.486	0.509	0.182	0.247	0.311	0.403	0.175	0.223	0.278	0.355
		min	0.389	0.445	0.494	0.505	0.181	0.251	0.310	0.413	0.174	0.225	0.282	0.355
		dom	0.383	0.438	0.473	0.492	0.178	0.246	0.309	0.409	0.171	0.221	0.276	0.351
	Mask	full	0.390	0.442	0.475	0.503	0.179	0.251	0.311	0.411	0.178	0.228	0.284	0.359
		min	0.389	0.444	0.487	0.499	0.183	0.252	0.311	0.412	0.180	0.226	0.282	0.361
		dom	0.388	0.442	0.486	0.505	0.180	0.251	0.309	0.410	0.173	0.224	0.280	0.356
MICN [27]	Shuffle	full	0.385	0.427	0.466	0.604	0.184	0.293	0.375	0.594	0.182	0.239	0.280	0.348
		min	0.390	0.430	0.480	0.565	0.191	0.281	0.365	0.580	0.197	0.236	0.283	0.349
		dom	0.373	0.421	0.452	0.510	0.174	0.263	0.348	0.502	0.179	0.232	0.275	0.342
	Mask	full	0.381	0.424	0.460	0.543	0.184	0.265	0.353	0.510	0.190	0.236	0.281	0.345
		min	0.385	0.426	0.472	0.553	0.187	0.276	0.359	0.542	0.179	0.240	0.281	0.344
		dom	0.377	0.421	0.454	0.543	0.175	0.268	0.337	0.505	0.178	0.239	0.283	0.342
Lightts [32]	Shuffle	full	0.415	0.426	0.577	0.621	0.202	0.235	0.325	0.445	0.163	0.205	0.251	0.317
		min	0.418	0.432	0.577	0.619	0.206	0.239	0.326	0.444	0.164	0.212	0.259	0.317
		dom	0.405	0.423	0.565	0.603	0.195	0.245	0.312	0.422	0.165	0.205	0.249	0.312
	Mask	full	0.418	0.432	0.573	0.621	0.204	0.238	0.321	0.435	0.163	0.206	0.258	0.317
		min	0.419	0.433	0.578	0.621	0.205	0.233	0.324	0.452	0.163	0.208	0.260	0.317
		dom	0.418	0.424	0.579	0.618	0.198	0.240	0.312	0.430	0.162	0.201	0.250	0.317

Table 4: Comparison of perturbing different parts of the spectrum (full, minor, and dominant) using shuffle and random mask. Perturbing the dominant frequencies performs better than perturbing other frequencies in most settings, and shuffle is generally more effective than random mask.

		ETTh1				ETTm2				Weather
		96	192	336	720	96	192	336	720	96	192	336	720
iTrans [13]	Mask	0.388	0.442	0.486	0.505	0.180	0.251	0.309	0.410	0.173	0.224	0.280	0.356
	Noise	0.387	0.445	0.482	0.510	0.180	0.256	0.312	0.409	0.177	0.222	0.281	0.359
	Random	0.386	0.440	0.479	0.499	0.183	0.254	0.311	0.407	0.171	0.222	0.280	0.358
	Shuffle	0.383	0.438	0.473	0.492	0.178	0.246	0.309	0.409	0.171	0.221	0.276	0.351
MICN [27]	Mask	0.377	0.421	0.454	0.543	0.175	0.268	0.337	0.505	0.178	0.239	0.283	0.342
	Noise	0.393	0.430	0.479	0.531	0.201	0.331	0.366	0.561	0.201	0.236	0.281	0.351
	Random	0.381	0.423	0.476	0.670	0.183	0.284	0.367	0.614	0.182	0.233	0.282	0.349
	Shuffle	0.373	0.421	0.452	0.510	0.174	0.263	0.348	0.502	0.179	0.232	0.275	0.342
Lightts [32]	Mask	0.418	0.424	0.579	0.618	0.198	0.240	0.312	0.430	0.162	0.201	0.250	0.317
	Noise	0.432	0.451	0.566	0.636	0.221	0.236	0.351	0.433	0.169	0.219	0.259	0.321
	Random	0.414	0.431	0.570	0.610	0.206	0.244	0.324	0.442	0.171	0.213	0.263	0.323
	Shuffle	0.405	0.423	0.565	0.603	0.195	0.245	0.312	0.422	0.165	0.205	0.249	0.312

Table 5: Comparison of different perturbations applied to the dominant frequencies. Shuffle outperforms the alternatives in most settings.

The results in Tab. 4 and 5 justify the design decisions in dominant shuffle and confirm that both perturbing the dominant frequencies and the shuffle operation are superior to the alternatives. More details about the experiments, including how we defined minor frequencies and how we implemented the mask, noise, and randomization perturbations, can be found in Sec. A.2.

4.4.3 Different Augmentation Sizes

In prior experiments, we explored data augmentation that doubled the original datasets. In this experiment, we assessed the performance of various augmentation sizes. The performance with a larger augmentation size reflects the domain gap between augmented and original data. A larger augmentation size indicates more augmented samples in the training set. If these augmented samples are out of distribution compared to the original data, larger augmentation sizes could lead to degraded performance due to a training/test gap.

\begin{overpic}[width=433.62pt]{figures/number-of-augmentation-crop} \put(45.0,-1.6){\footnotesize{{\color[rgb]{.5,.5,.5}{augmentation size}}}} \end{overpic}

Figure 5: MSE with different augmentation sizes using iTransformer [13]. An augmentation size of two, which was used in previous experiments, achieves the best results in most cases. Our method is more resistant to larger augmentation sizes, indicating a smaller augmented-original gap.

As shown in Fig. 5, the performance of FreqMix and FreqMask declines significantly beyond an augmentation size of two, which we attribute to the domain gap between augmented and original data. Our method is only slightly affected by the augmentation size and even benefits from larger augmentation sizes on the Weather dataset. These results indicate a smaller augmented-original gap for our method.
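The notion of augmentation size can be sketched as follows, under the assumption (consistent with the statement that a size of two doubles the dataset) that a size of $s$ keeps every original sample and adds $s-1$ augmented copies of it. The helper `build_training_set` and the placeholder `toy_augment` are hypothetical names used only for this illustration; in practice `augment` would be a frequency-domain augmentation.

```python
import random

def build_training_set(samples, augment, size, seed=0):
    """Return the originals plus (size - 1) augmented copies of each sample."""
    if size < 1:
        raise ValueError("size must be >= 1")
    rng = random.Random(seed)
    augmented = [augment(s, rng) for _ in range(size - 1) for s in samples]
    return list(samples) + augmented

def toy_augment(sample, rng):
    # Placeholder only: a random reordering stands in for a real augmentation.
    return rng.sample(sample, len(sample))

originals = [[1, 2, 3], [4, 5, 6]]
train = build_training_set(originals, toy_augment, size=3)
print(len(train))  # 6: two originals plus four augmented samples
```

Under this reading, the fraction of augmented samples in the training set is $(s-1)/s$, so any distribution shift introduced by the augmentation weighs more heavily as $s$ grows.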

5 Conclusion

We proposed dominant shuffle, a simple yet highly effective data augmentation technique for time series prediction. Our method mitigates the domain gap between augmented and original data by limiting the perturbation to dominant frequencies, and it uses shuffling to avoid introducing external noise. Although simple and effective, our method is based primarily on heuristics and lacks a theoretical explanation. In place of theoretical justification, we conducted extensive experiments using a wide range of datasets, baseline models, and augmentation methods to validate its consistent improvements across various configurations. Because dominant shuffle introduces significant perturbation to the original data and therefore disrupts sample-wise class labels, our method is limited to prediction tasks and cannot be extended to classification tasks. Exploring theoretical justifications and principles for the proposed method is a promising future direction that would help to better understand it.

References

[1] Kasun Bandara, Hansika Hewamalage, Yuan-Hao Liu, Yanfei Kang, and Christoph Bergmeir. Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognition, 120:108148, 2021.
[2] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data. Transportation research record, 1748(1):96–102, 2001.
[3] Muxi Chen, Zhijian Xu, Ailing Zeng, and Qiang Xu. Fraug: Frequency domain augmentation for time series forecasting. arXiv preprint arXiv:2302.09292, 2023.
[4] Xi Chen, Cheng Ge, Ming Wang, and Jin Wang. Supervised contrastive few-shot learning for high-frequency time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7069–7077, 2023.
[5] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
[6] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with tiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023.
[7] Germain Forestier, François Petitjean, Hoang Anh Dau, Geoffrey I Webb, and Eamonn Keogh. Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM), pages 865–870. IEEE, 2017.
[8] Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. Robusttad: Robust time series anomaly detection via decomposition and convolutional neural networks. In MileTS’20: 6th KDD Workshop on Mining and Learning from Time Series, pages 1–6, 2020.
[9] Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD workshop on advanced analytics and learning on temporal data, 2016.
[10] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.
[11] Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, and Yuval Elovici. Doping: Generative data augmentation for unsupervised anomaly detection with gan. In 2018 IEEE international conference on data mining (ICDM), pages 1122–1127. IEEE, 2018.
[12] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. Scinet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828, 2022.
[13] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.
[14] Yunshan Ma, Yujuan Ding, Xun Yang, Lizi Liao, Wai Keung Wong, and Tat-Seng Chua. Knowledge enhanced neural fashion trend forecasting. In Proceedings of the 2020 international conference on multimedia retrieval, pages 82–90, 2020.
[15] ED McKenzie. General exponential smoothing and the equivalent arma process. Journal of Forecasting, 3(3):333–344, 1984.
[16] Gue-Hwan Nam, Seok-Jun Bu, Na-Mu Park, Jae-Yong Seo, Hyeon-Cheol Jo, and Won-Tae Jeong. Data augmentation using empirical mode decomposition on neural networks to classify impact noise in vehicle. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 731–735. IEEE, 2020.
[17] Zelin Ni, Hang Yu, Shizhan Liu, Jianguo Li, and Weiyao Lin. Basisformer: Attention-based time series forecasting with learnable and interpretable basis. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[18] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023.
[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[20] Hangwei Qian, Tian Tian, and Chunyan Miao. What makes good contrastive learning on small-scale wearable-based tasks? In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3761–3771, 2022.
[21] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018.
[22] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International journal of forecasting, 36(3):1181–1191, 2020.
[23] Artemios-Anargyros Semenoglou, Evangelos Spiliotis, and Vassilios Assimakopoulos. Data augmentation for univariate time series forecasting with neural networks. Pattern Recognition, 134:109132, 2023.
[24] Odongo Steven Eyobu and Dong Seog Han. Feature representation and data augmentation for human activity classification based on wearable imu sensor data using a deep lstm neural network. Sensors, 18(9):2892, 2018.
[25] Jianhua Sun, Hao-Shu Fang, Yuxuan Li, Runzhong Wang, Minghao Gou, and Cewu Lu. Instaboost++: Visual coherence principles for unified 2d/3d instance level data augmentation. International Journal of Computer Vision, 131(10):2665–2681, 2023.
[26] Terry T Um, Franz MJ Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM international conference on multimodal interaction, pages 216–220, 2017.
[27] Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. MICN: Multi-scale local and global context modeling for long-term series forecasting. In The Eleventh International Conference on Learning Representations, 2023.
[28] Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4653–4660. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Survey Track.
[29] Tailai Wen and Roy Keyes. Time series anomaly detection using convolutional neural networks and transfer learning. In IJCAI Workshop on AI4IoT, 2019.
[30] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2023.
[31] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021.
[32] Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian Li. Less is more: Fast multivariate time series forecasting with light sampling-oriented mlp structures. arXiv preprint arXiv:2207.01186, 2022.
[33] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems, 35:3988–4003, 2022.
[34] Xiyuan Zhang, Ranak Roy Chowdhury, Jingbo Shang, Rajesh Gupta, and Dezhi Hong. Towards diverse and coherent augmentation for time-series forecasting. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[35] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.
[36] Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. Film: Frequency improved legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems, 35:12677–12690, 2022.
[37] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International conference on machine learning, pages 27268–27286. PMLR, 2022.

Appendix A More Details

A.1 Datasets

We evaluate the performance of different models and different augmentations for long-term forecasting on 8 well-established datasets, including Weather, Traffic, Electricity, Exchange Rate [31], and the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [35]. Furthermore, we adopt the PEMS [2] datasets for short-term forecasting. The datasets are described in detail in Tab. 6.

Dataset	Variates	Prediction length ($T$)	Dataset size (Train, Validation, Test)	Frequency	Information
ETTh1, ETTh2	7	{96, 192, 336, 720}	(8545, 2881, 2881)	Hourly	Temperature
ETTm1, ETTm2	7	{96, 192, 336, 720}	(34465, 11521, 11521)	15min	Temperature
Exchange	8	{96, 192, 336, 720}	(5120, 665, 1422)	Daily	Economy
Weather	21	{96, 192, 336, 720}	(36792, 5271, 10540)	10min	Weather
ECL	321	{96, 192, 336, 720}	(18317, 2633, 5261)	Hourly	Electricity
Traffic	862	{96, 192, 336, 720}	(12185, 1757, 3509)	Hourly	Transportation
PEMS03	358	{12, 24, 36, 48}	(15617, 5135, 5135)	5min	Traffic network
PEMS04	307	{12, 24, 36, 48}	(10172, 3375, 3375)	5min	Traffic network
PEMS07	883	{12, 24, 36, 48}	(16911, 5622, 5622)	5min	Traffic network
PEMS08	170	{12, 24, 36, 48}	(10690, 3548, 3548)	5min	Traffic network

Table 6: Statistics of the datasets used in our experiments: eight long-term forecasting datasets and four PEMS datasets for short-term forecasting.

A.2 Implementation Details

A.2.1 Reimplementation of other methods

For ASD, MSB, and Upsample, we reproduce them based on the descriptions in their original papers [1, 7, 23]. For STAug [34] and MixMask [3], we use their official code. For Robusttad [8], we reproduce it by adding Gaussian noise to the frequency components of a time series. For FreqAdd [33], we perturb a single low-frequency component by setting its magnitude to half of the maximum magnitude. For FreqPool [4], we apply max pooling over the entire spectrum with size=4. For a fair comparison, all frequency-domain methods are applied to the data-label pair.
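Two of the reimplemented baselines described above can be sketched in NumPy as follows. This is an illustrative reading of the descriptions, not the code used in our experiments: the function names, the noise scale `sigma`, the definition of "low frequency" as bins 1 to `low_band`, and keeping the original phase are all assumptions. FreqPool is left out because the description does not fix how the pooled spectrum is mapped back to the original length.

```python
import numpy as np

def gaussian_noise_spectrum(series, sigma, rng):
    """Robusttad-style perturbation as described: Gaussian noise on every frequency component."""
    spectrum = np.fft.rfft(series)
    noise = rng.normal(0.0, sigma, spectrum.shape) + 1j * rng.normal(0.0, sigma, spectrum.shape)
    return np.fft.irfft(spectrum + noise, n=len(series))

def freq_add(series, rng, low_band=5):
    """FreqAdd-style perturbation: set one low-frequency magnitude to half the maximum magnitude."""
    spectrum = np.fft.rfft(series)
    magnitudes = np.abs(spectrum)
    # Assumption: "low frequency" means bins 1..low_band (the DC bin is skipped).
    idx = rng.integers(1, low_band + 1)
    # Assumption: the phase of the chosen component is kept.
    phase = np.angle(spectrum[idx])
    spectrum[idx] = 0.5 * magnitudes.max() * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=len(series))

t = np.arange(64)
x = np.sin(2 * np.pi * 3 * t / 64)
rng = np.random.default_rng(0)
x_noise = gaussian_noise_spectrum(x, sigma=0.1, rng=rng)
x_add = freq_add(x, rng)
```

To follow the data-label convention stated above, `series` would be the look-back window concatenated with the prediction horizon along the time axis, which is again an assumption about the exact layout.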

A.2.2 Different perturbations

In our ablation study, we define minor frequencies as all frequency components other than those with the top 10 magnitudes. In Tab. 4, Mask on the full spectrum is similar to FrAug [3]. Mask on dominant frequencies means masking within the frequency components with the top 10 magnitudes, and Mask on minor frequencies masks within the remaining components. In Tab. 5, Noise means adding Gaussian noise to the selected frequency components. For Random, we first obtain the maximum and minimum magnitudes of the selected frequency components and then randomly assign each component a magnitude within this range.
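The frequency subsets and the Mask, Noise, and Random operations described above can be sketched as follows. This is a hedged illustration: the names `select_bins` and `perturb`, the masking rate, the noise scale, uniform sampling for Random, keeping the original phase, and skipping the DC bin are assumptions that the text does not specify.

```python
import numpy as np

def select_bins(magnitudes, which, top=10):
    """Indices of the dominant ('dom'), minor ('min'), or all ('full') non-DC bins."""
    order = 1 + np.argsort(magnitudes[1:])[::-1]  # non-DC bins, largest magnitude first
    if which == "dom":
        return order[:top]
    if which == "min":
        return order[top:]
    if which == "full":
        return order
    raise ValueError(f"unknown subset: {which}")

def perturb(series, which, op, rng, top=10, mask_rate=0.5, sigma=0.1):
    """Apply one perturbation operation to the selected frequency subset."""
    spectrum = np.fft.rfft(series)
    magnitudes = np.abs(spectrum)
    idx = select_bins(magnitudes, which, top)
    if op == "mask":
        # Zero out a random part of the selected components.
        spectrum[idx[rng.random(len(idx)) < mask_rate]] = 0
    elif op == "noise":
        # Add complex Gaussian noise to the selected components.
        spectrum[idx] += rng.normal(0, sigma, len(idx)) + 1j * rng.normal(0, sigma, len(idx))
    elif op == "random":
        # Draw new magnitudes within the min-max range of the selection.
        low, high = magnitudes[idx].min(), magnitudes[idx].max()
        new_magnitudes = rng.uniform(low, high, len(idx))
        spectrum[idx] = new_magnitudes * np.exp(1j * np.angle(spectrum[idx]))
    else:
        raise ValueError(f"unknown operation: {op}")
    return np.fft.irfft(spectrum, n=len(series))

t = np.arange(64)
x = np.sin(2 * np.pi * 3 * t / 64) + 0.5 * np.sin(2 * np.pi * 7 * t / 64)
x_random = perturb(x, "dom", "random", np.random.default_rng(0), top=2)
```

Unlike shuffle, each of these operations changes the set of magnitudes present in the spectrum: Mask removes energy, Noise adds external values, and Random replaces magnitudes with newly drawn ones.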

Appendix B More Results

B.1 Full forecasting results

Tab. 7, 8 and 9 show the full results of the forecasting task. Specifically, our method improves the performance of iTransformer by 13% on Electricity when the prediction length is 720, and it improves the performance of Autoformer by 28% on ETTm1 when the prediction length is 720. Our method also improves MICN by 18% on ETTh2 at a prediction length of 192 and SCINet by 21% on Electricity at a prediction length of 720. Similarly, it improves LightTS by 29% on ETTh2 at a prediction length of 720 and TiDE by 24% on Electricity at a prediction length of 192. It is worth noting that the strong baseline MixMask falls short on Exchange Rate, where the main goal is to predict trends. In contrast, our method improves Autoformer by 34% on Exchange Rate at a prediction length of 720 and LightTS by 37% at a prediction length of 96. These results demonstrate the effectiveness of our method for long-term prediction, as it consistently improves the performance of SOTA methods across different datasets.
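The percentages quoted above are relative MSE reductions with respect to the un-augmented baseline. The short sketch below recomputes a few of them from the baseline and "Ours" entries in Tab. 7 and Tab. 8; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(baseline_mse, augmented_mse):
    """Relative MSE reduction over the baseline, in percent."""
    return 100.0 * (baseline_mse - augmented_mse) / baseline_mse

# (baseline MSE, MSE with our augmentation), copied from Tab. 7 and Tab. 8.
pairs = {
    "iTransformer / Electricity / 720": (0.230, 0.199),
    "Autoformer / ETTm1 / 720": (0.773, 0.559),
    "MICN / ETTh2 / 192": (0.518, 0.427),
    "LightTS / Exchange Rate / 96": (0.139, 0.088),
}

for name, (baseline, ours) in pairs.items():
    print(f"{name}: {relative_improvement(baseline, ours):.1f}%")
```

Rounded to the nearest integer, these four cases give 13%, 28%, 18%, and 37%, matching the figures reported in the text.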

Method		ETTh1				ETTh2				ETTm1				ETTm2
Method		96	192	336	720	96	192	336	720	96	192	336	720	96	192	336	720
iTransformer [13]	Baseline	0.392	0.447	0.483	0.516	0.303	0.381	0.412	0.434	0.344	0.383	0.421	0.494	0.183	0.251	0.311	0.412
	ASD [7]	0.398	0.456	0.483	0.512	0.310	0.388	0.432	0.452	0.340	0.382	0.454	0.492	0.199	0.254	0.341	0.423
	MSB [1]	0.387	0.460	0.494	0.531	0.309	0.382	0.447	0.433	0.339	0.386	0.467	0.510	0.187	0.267	0.332	0.452
	Upsample [23]	0.391	0.445	0.481	0.519	0.305	0.381	0.419	0.430	0.351	0.381	0.432	0.489	0.196	0.279	0.320	0.411
	FreqAdd [33]	0.389	0.446	0.475	0.510	0.300	0.384	0.416	0.438	0.350	0.385	0.422	0.490	0.187	0.253	0.311	0.415
	FreqPool [4]	0.433	0.456	0.497	0.532	0.313	0.392	0.415	0.450	0.347	0.392	0.430	0.499	0.187	0.256	0.324	0.449
	Robusttad [8]	0.390	0.445	0.497	0.510	0.312	0.388	0.412	0.439	0.353	0.382	0.421	0.498	0.189	0.255	0.309	0.428
	STAug [34]	0.390	0.445	0.489	0.511	0.323	0.428	0.486	0.483	0.339	0.383	0.417	0.485	0.196	0.267	0.339	0.449
	MixMask [3]	0.388	0.440	0.477	0.504	0.301	0.380	0.414	0.434	0.334	0.375	0.421	0.485	0.178	0.248	0.311	0.407
	Ours	0.383	0.438	0.473	0.492	0.298	0.382	0.411	0.428	0.332	0.374	0.424	0.492	0.178	0.246	0.309	0.409
AutoFormer [31]	Baseline	0.429	0.440	0.495	0.498	0.381	0.443	0.471	0.475	0.467	0.610	0.529	0.773	0.233	0.278	0.383	0.488
	ASD	0.450	0.485	0.523	0.556	0.370	0.465	0.476	0.503	0.480	0.620	0.502	0.633	0.231	0.282	0.379	0.499
	MSB	0.462	0.517	0.612	0.579	0.434	0.523	0.556	0.462	0.499	0.645	0.553	0.721	0.232	0.285	0.389	0.487
	Upsample	0.416	0.523	0.480	0.482	0.353	0.460	0.455	0.509	0.498	0.630	0.512	0.667	0.234	0.291	0.382	0.521
	FreqAdd	0.460	0.487	0.497	0.525	0.367	0.439	0.480	0.504	0.419	0.554	0.546	0.569	0.223	0.268	0.330	0.458
	FreqPool	0.446	0.457	0.523	0.512	0.392	0.442	0.470	0.493	0.479	0.623	0.510	0.754	0.250	0.291	0.394	0.482
	Robusttad	0.437	0.452	0.492	0.477	0.367	0.497	0.502	0.527	0.432	0.510	0.553	0.623	0.235	0.291	0.375	0.478
	STAug	0.429	0.478	0.505	0.506	0.354	0.443	0.496	0.495	0.415	0.581	0.588	0.693	0.224	0.291	0.338	0.431
	MixMask	0.420	0.445	0.467	0.474	0.358	0.421	0.470	0.467	0.415	0.510	0.491	0.588	0.211	0.267	0.340	0.451
	Ours	0.409	0.436	0.458	0.486	0.335	0.419	0.453	0.452	0.392	0.506	0.491	0.559	0.210	0.266	0.329	0.429
MICN [27]	Baseline	0.384	0.425	0.464	0.574	0.358	0.518	0.566	0.827	0.313	0.360	0.389	0.461	0.200	0.282	0.375	0.606
	ASD	0.380	0.430	0.472	0.523	0.377	0.539	0.620	0.843	0.315	0.362	0.399	0.457	0.189	0.331	0.399	0.617
	MSB	0.423	0.423	0.501	0.559	0.402	0.623	0.790	1.126	0.330	0.358	0.402	0.459	0.192	0.279	0.376	0.651
	Upsample	0.396	0.435	0.463	0.550	0.366	0.500	0.831	0.752	0.339	0.377	0.402	0.475	0.203	0.291	0.372	0.595
	FreqAdd	0.390	0.430	0.477	0.643	0.370	0.521	0.626	0.975	0.316	0.360	0.407	0.478	0.176	0.273	0.378	0.614
	FreqPool	0.399	0.465	0.473	0.572	0.365	0.553	0.550	0.812	0.336	0.372	0.397	0.466	0.212	0.287	0.390	0.623
	Robusttad	0.392	0.436	0.491	0.556	0.339	0.529	0.553	0.998	0.339	0.359	0.396	0.472	0.200	0.296	0.356	0.617
	STAug	0.374	0.429	0.489	0.608	0.413	0.760	1.330	2.608	0.313	0.360	0.418	0.483	0.180	0.264	0.323	0.670
	MixMask	0.378	0.423	0.461	0.521	0.339	0.488	0.544	0.735	0.301	0.352	0.401	0.454	0.183	0.278	0.356	0.528
	Ours	0.373	0.421	0.452	0.510	0.310	0.427	0.507	0.731	0.314	0.360	0.387	0.470	0.174	0.263	0.346	0.502
SCINet [12]	Baseline	0.485	0.506	0.519	0.552	0.372	0.416	0.429	0.470	0.316	0.353	0.387	0.431	0.184	0.240	0.295	0.385
	ASD	0.494	0.480	0.491	0.559	0.362	0.402	0.432	0.499	0.331	0.367	0.389	0.453	0.197	0.238	0.296	0.432
	MSB	0.489	0.466	0.502	0.547	0.359	0.396	0.458	0.476	0.320	0.351	0.396	0.478	0.182	0.237	0.289	0.449
	Upsample	0.471	0.457	0.479	0.541	0.379	0.407	0.403	0.482	0.342	0.386	0.399	0.442	0.179	0.254	0.292	0.401
	FreqAdd	0.428	0.452	0.469	0.532	0.335	0.385	0.403	0.447	0.304	0.338	0.373	0.421	0.174	0.228	0.286	0.380
	FreqPool	0.499	0.510	0.557	0.549	0.410	0.453	0.432	0.475	0.331	0.362	0.379	0.432	0.185	0.239	0.302	0.399
	Robusttad	0.462	0.501	0.498	0.559	0.362	0.431	0.419	0.496	0.331	0.351	0.394	0.438	0.182	0.247	0.299	0.402
	STAug	0.457	0.500	0.524	0.534	0.538	0.636	0.681	0.648	0.319	0.357	0.389	0.445	0.323	0.407	0.514	0.668
	MixMask	0.427	0.452	0.465	0.548	0.335	0.377	0.400	0.438	0.302	0.341	0.376	0.423	0.174	0.230	0.289	0.368
	Ours	0.417	0.443	0.461	0.527	0.335	0.375	0.392	0.421	0.302	0.338	0.372	0.420	0.174	0.228	0.283	0.372
TiDE [6]	Baseline	0.401	0.434	0.521	0.558	0.304	0.350	0.331	0.399	0.311	0.340	0.366	0.420	0.166	0.220	0.273	0.356
	ASD	0.417	0.441	0.513	0.556	0.320	0.351	0.367	0.422	0.319	0.341	0.399	0.432	0.177	0.241	0.291	0.371
	MSB	0.422	0.476	0.529	0.579	0.331	0.379	0.334	0.401	0.302	0.356	0.382	0.451	0.182	0.232	0.287	0.359
	Upsample	0.431	0.452	0.533	0.604	0.346	0.372	0.350	0.456	0.324	0.339	0.378	0.463	0.203	0.246	0.306	0.366
	FreqAdd	0.385	0.420	0.477	0.505	0.289	0.336	0.330	0.390	0.309	0.339	0.365	0.417	0.164	0.219	0.273	0.355
	FreqPool	0.423	0.455	0.510	0.592	0.312	0.376	0.339	0.397	0.319	0.352	0.397	0.453	0.179	0.231	0.299	0.371
	Robusttad	0.396	0.432	0.521	0.537	0.331	0.352	0.337	0.398	0.321	0.346	0.382	0.437	0.180	0.225	0.282	0.371
	STAug	0.515	0.535	0.521	0.558	0.390	0.437	0.403	0.508	0.310	0.337	0.364	0.417	0.222	0.343	0.515	0.847
	MixMask	0.385	0.420	0.478	0.507	0.289	0.339	0.330	0.391	0.299	0.332	0.367	0.416	0.165	0.219	0.271	0.347
	Ours	0.385	0.414	0.467	0.498	0.283	0.332	0.324	0.388	0.297	0.328	0.365	0.412	0.165	0.218	0.271	0.350
LightTS [32]	Baseline	0.448	0.444	0.663	0.706	0.369	0.476	0.738	1.165	0.323	0.347	0.428	0.476	0.212	0.237	0.350	0.473
	ASD	0.451	0.476	0.633	0.681	0.392	0.469	0.701	0.998	0.356	0.352	0.441	0.478	0.258	0.251	0.351	0.483
	MSB	0.467	0.463	0.627	0.652	0.378	0.472	0.652	1.123	0.371	0.349	0.430	0.479	0.236	0.242	0.359	0.471
	Upsample	0.449	0.472	0.610	0.637	0.401	0.487	0.714	1.245	0.329	0.366	0.453	0.492	0.241	0.255	0.366	0.492
	FreqAdd	0.417	0.430	0.578	0.622	0.351	0.453	0.689	1.125	0.322	0.352	0.400	0.450	0.206	0.237	0.327	0.455
	FreqPool	0.463	0.471	0.652	0.690	0.369	0.512	0.723	1.264	0.336	0.351	0.442	0.497	0.233	0.259	0.372	0.453
	Robusttad	0.445	0.442	0.590	0.654	0.372	0.468	0.699	0.982	0.331	0.352	0.441	0.462	0.232	0.227	0.342	0.446
	STAug	0.445	0.441	0.669	0.714	0.520	0.807	2.101	2.467	0.320	0.343	0.427	0.476	0.230	0.266	0.372	0.475
	MixMask	0.417	0.429	0.575	0.620	0.337	0.426	0.643	0.993	0.316	0.340	0.398	0.447	0.199	0.233	0.322	0.440
	Ours	0.405	0.423	0.565	0.603	0.335	0.395	0.575	0.827	0.322	0.340	0.391	0.440	0.195	0.245	0.312	0.422

Table 7: MSE of the long-term prediction on the ETT [35] datasets.

Method		Electricity				Weather				Exchange Rate				Traffic
Method		96	192	336	720	96	192	336	720	96	192	336	720	96	192	336	720
iTransformer [13]	Baseline	0.152	0.159	0.179	0.230	0.175	0.224	0.281	0.362	0.086	0.180	0.335	0.856	0.399	0.418	0.428	0.463
	ASD [7]	0.173	0.179	0.201	0.234	0.191	0.223	0.280	0.364	0.088	0.183	0.343	0.872	0.431	0.428	0.430	0.478
	MSB [1]	0.182	0.182	0.194	0.267	0.185	0.235	0.284	0.359	0.089	0.189	0.359	0.907	0.417	0.416	0.422	0.471
	Upsample [23]	0.166	0.188	0.216	0.221	0.204	0.257	0.291	0.373	0.086	0.180	0.338	0.834	0.433	0.419	0.433	0.476
	FreqAdd [33]	0.150	0.157	0.172	0.204	0.181	0.230	0.285	0.362	0.087	0.181	0.333	0.837	0.480	0.441	0.450	0.501
	FreqPool [4]	0.169	0.170	0.194	0.237	0.184	0.223	0.279	0.378	0.088	0.183	0.330	0.832	0.410	0.429	0.433	0.476
	Robusttad [8]	0.150	0.157	0.176	0.210	0.172	0.225	0.281	0.357	0.087	0.179	0.329	0.833	0.406	0.417	0.429	0.458
	STAug [34]	0.160	0.173	0.218	0.372	0.206	0.264	0.319	0.385	0.086	0.178	0.335	0.866	0.413	0.432	0.449	0.481
	MixMask [3]	0.151	0.158	0.173	0.205	0.175	0.224	0.279	0.354	0.089	0.178	0.328	0.845	0.395	0.401	0.418	0.450
	Ours	0.150	0.156	0.171	0.199	0.171	0.221	0.276	0.351	0.086	0.176	0.313	0.821	0.394	0.412	0.423	0.448
AutoFormer [31]	Baseline	0.203	0.208	0.231	0.239	0.241	0.314	0.341	0.425	0.143	0.305	0.470	1.056	0.640	0.645	0.611	0.658
	ASD	0.247	0.216	0.221	0.235	0.652	0.392	0.416	0.513	0.141	0.280	0.579	1.240	0.631	0.602	0.607	0.643
	MSB	0.237	0.256	0.295	0.236	0.256	0.379	0.402	0.468	0.156	0.254	0.513	1.339	0.652	0.665	0.643	0.650
	Upsample	0.201	0.209	0.232	0.268	0.281	0.294	0.329	0.385	0.141	0.292	0.553	1.295	0.653	0.676	0.702	0.694
	FreqAdd	0.193	0.197	0.212	0.225	0.255	0.323	0.370	0.419	0.143	0.369	0.716	1.173	0.613	0.598	0.617	0.639
	FreqPool	0.213	0.224	0.234	0.257	0.237	0.339	0.372	0.446	0.142	0.336	0.532	1.014	0.630	0.598	0.603	0.639
	Robusttad	0.230	0.242	0.261	0.231	0.270	0.334	0.351	0.429	0.142	0.309	0.462	1.123	0.621	0.614	0.612	0.646
	STAug	0.191	0.206	0.217	0.234	0.250	0.300	0.347	0.418	0.140	0.326	0.594	1.176	0.632	0.619	0.632	0.640
	MixMask	0.177	0.194	0.206	0.224	0.240	0.302	0.330	0.422	0.141	0.284	0.453	0.778	0.560	0.584	0.594	0.635
	Ours	0.171	0.191	0.203	0.219	0.214	0.273	0.327	0.383	0.136	0.243	0.418	0.695	0.577	0.581	0.592	0.638
MICN [27]	Baseline	0.171	0.183	0.198	0.224	0.188	0.241	0.278	0.350	0.091	0.185	0.355	0.941	0.522	0.540	0.553	0.573
	ASD	0.165	0.174	0.190	0.237	0.189	0.242	0.276	0.354	0.087	0.175	0.337	1.203	0.505	0.534	0.541	0.539
	MSB	0.179	0.182	0.201	0.225	0.201	0.250	0.291	0.365	0.088	0.176	0.360	0.995	0.513	0.532	0.528	0.556
	Upsample	0.182	0.180	0.203	0.220	0.193	0.249	0.279	0.372	0.084	0.171	0.313	0.702	0.533	0.559	0.556	0.590
	FreqAdd	0.160	0.169	0.182	0.199	0.180	0.234	0.282	0.350	0.087	0.174	0.349	0.923	0.503	0.527	0.520	0.571
	FreqPool	0.182	0.203	0.241	0.256	0.192	0.257	0.278	0.351	0.089	0.179	0.394	0.923	0.531	0.539	0.556	0.592
	Robusttad	0.179	0.220	0.234	0.227	0.192	0.239	0.292	0.343	0.085	0.179	0.336	0.932	0.510	0.532	0.547	0.597
	STAug	0.180	0.195	0.210	0.224	0.272	0.356	0.433	0.559	0.092	0.183	0.313	0.790	0.512	0.533	0.529	0.585
	MixMask	0.159	0.165	0.178	0.195	0.185	0.239	0.281	0.344	0.086	0.174	0.337	0.796	0.490	0.512	0.519	0.538
	Ours	0.157	0.168	0.178	0.211	0.179	0.232	0.275	0.342	0.084	0.169	0.303	0.750	0.501	0.507	0.518	0.556
SCINet [12]	Baseline	0.212	0.237	0.255	0.286	0.229	0.282	0.334	0.402	0.099	0.191	0.356	0.916	0.550	0.526	0.545	0.596
	ASD	0.229	0.241	0.239	0.282	0.254	0.276	0.356	0.462	0.095	0.204	0.379	1.230	0.537	0.521	0.541	0.570
	MSB	0.232	0.237	0.228	0.274	0.279	0.265	0.374	0.454	0.093	0.267	0.402	0.965	0.520	0.510	0.537	0.565
	Upsample	0.250	0.232	0.271	0.309	0.243	0.299	0.361	0.431	0.092	0.196	0.311	0.932	0.519	0.536	0.528	0.576
	FreqAdd	0.176	0.195	0.212	0.237	0.208	0.258	0.309	0.385	0.092	0.186	0.343	0.920	0.492	0.497	0.512	0.550
	FreqPool	0.230	0.221	0.242	0.339	0.261	0.290	0.337	0.456	0.096	0.183	0.551	0.938	0.557	0.519	0.533	0.562
	Robusttad	0.189	0.202	0.210	0.243	0.229	0.281	0.331	0.410	0.093	0.186	0.334	0.957	0.523	0.519	0.522	0.569
	STAug	0.210	0.239	0.282	0.411	0.277	0.329	0.372	0.435	0.098	0.191	0.342	0.931	0.560	0.517	0.521	0.566
	MixMask	0.171	0.188	0.204	0.230	0.205	0.250	0.310	0.374	0.093	0.179	0.336	0.928	0.495	0.492	0.511	0.551
	Ours	0.172	0.188	0.200	0.225	0.197	0.246	0.299	0.379	0.091	0.175	0.342	0.890	0.500	0.495	0.509	0.544
TiDE [6]	Baseline	0.207	0.197	0.211	0.238	0.177	0.220	0.265	0.323	0.093	0.184	0.330	0.860	0.452	0.450	0.451	0.479
	ASD	0.232	0.220	0.231	0.265	0.189	0.221	0.297	0.332	0.095	0.206	0.351	0.962	0.477	0.462	0.450	0.506
	MSB	0.210	0.219	0.253	0.261	0.199	0.254	0.273	0.339	0.092	0.179	0.358	0.941	0.461	0.451	0.455	0.510
	Upsample	0.206	0.199	0.223	0.274	0.203	0.267	0.331	0.355	0.091	0.182	0.331	0.852	0.490	0.466	0.472	0.493
	FreqAdd	0.150	0.163	0.177	0.209	0.173	0.216	0.263	0.322	0.088	0.180	0.330	0.848	0.429	0.441	0.440	0.471
	FreqPool	0.224	0.238	0.233	0.270	0.189	0.224	0.292	0.334	0.092	0.334	0.521	1.124	0.453	0.466	0.479	0.503
	Robusttad	0.176	0.166	0.182	0.229	0.182	0.231	0.279	0.330	0.099	0.232	0.331	0.924	0.449	0.430	0.438	0.482
	STAug	0.230	0.210	0.192	0.225	0.205	0.247	0.292	0.364	0.092	0.184	0.330	0.859	0.466	0.455	0.471	0.480
	MixMask	0.143	0.155	0.164	0.210	0.173	0.216	0.263	0.323	0.089	0.180	0.329	0.861	0.421	0.427	0.434	0.466
	Ours	0.143	0.150	0.165	0.202	0.177	0.219	0.261	0.322	0.088	0.179	0.324	0.847	0.423	0.426	0.433	0.466
LightTS [32]	Baseline	0.210	0.169	0.182	0.212	0.168	0.210	0.260	0.320	0.139	0.252	0.412	0.840	0.505	0.515	0.539	0.587
	ASD	0.225	0.179	0.198	0.232	0.179	0.210	0.271	0.321	0.132	0.320	0.436	1.036	0.510	0.514	0.534	0.579
	MSB	0.233	0.182	0.204	0.228	0.170	0.214	0.259	0.332	0.117	0.294	0.502	0.964	0.532	0.510	0.539	0.584
	Upsample	0.246	0.179	0.211	0.254	0.182	0.223	0.257	0.336	0.099	0.251	0.369	0.702	0.522	0.547	0.532	0.597
	FreqAdd	0.213	0.159	0.177	0.210	0.164	0.207	0.258	0.317	0.098	0.522	0.565	1.583	0.492	0.500	0.530	0.572
	FreqPool	0.219	0.174	0.197	0.236	0.193	0.254	0.267	0.339	0.099	0.275	0.394	0.793	0.501	0.519	0.533	0.592
	Robusttad	0.212	0.169	0.181	0.223	0.172	0.223	0.259	0.324	0.092	0.279	0.451	0.796	0.499	0.502	0.521	0.572
	STAug	0.224	0.267	0.294	0.351	0.214	0.263	0.382	0.371	0.096	0.212	0.380	0.690	0.520	0.534	0.520	0.596
	MixMask	0.192	0.158	0.175	0.211	0.163	0.206	0.257	0.318	0.099	0.384	0.518	0.774	0.486	0.499	0.517	0.555
	Ours	0.210	0.156	0.173	0.206	0.165	0.205	0.249	0.312	0.088	0.243	0.361	0.676	0.483	0.497	0.515

Table 8: MSE of the long-term prediction on the Electricity, Weather, Exchange Rate, and Traffic [31] datasets.

Methods	PEMS03				PEMS04				PEMS07				PEMS08
Methods	12	24	36	48	12	24	36	48	12	24	36	48	12	24	36	48
Baseline	0.070	0.097	0.134	0.164	0.088	0.124	0.160	0.196	0.067	0.097	0.128	0.156	0.088	0.136	0.191	0.248
ASD [7]	0.072	0.096	0.152	0.239	0.098	0.132	0.156	0.190	0.069	0.099	0.154	0.181	0.089	0.138	0.196	0.247
MSB [1]	0.096	0.131	0.129	0.214	0.087	0.134	0.167	0.219	0.098	0.096	0.137	0.165	0.096	0.137	0.210	0.256
Upsample [23]	0.069	0.096	0.128	0.179	0.087	0.124	0.158	0.199	0.072	0.099	0.127	0.155	0.088	0.140	0.192	0.245
FreqAdd [33]	1.036	0.104	0.251	0.362	0.088	0.125	0.159	0.201	0.067	0.097	0.127	0.155	0.089	0.135	0.192	0.253
FreqPool [4]	1.234	0.178	0.296	0.451	0.099	0.145	0.178	0.226	0.079	0.104	0.152	0.172	0.099	0.155	0.203	0.264
Robusttad [8]	0.082	0.098	0.132	1.520	0.089	0.123	0.161	0.195	0.067	0.097	0.129	0.157	0.092	0.135	0.189	0.26
STAug [34]	0.079	0.112	0.195	0.456	0.087	0.120	0.162	0.304	0.066	0.096	0.132	0.165	0.092	0.147	0.192	0.276
Mask [3]	0.443	1.205	0.233	1.510	0.086	0.119	0.158	0.346	0.065	0.095	0.125	0.156	0.089	0.131	0.186	0.239
Mix [3]	1.018	0.097	0.877	1.501	0.085	0.119	0.154	0.205	0.065	0.094	0.134	0.152	0.089	0.131	0.184	0.234
Ours	0.067	0.095	0.126	0.235	0.085	0.118	0.149	0.182	0.065	0.094	0.123	0.148	0.087	0.134	0.184

Table 9: MSE of the short-term prediction using the iTransformer [13] on the PEMS datasets [2].

B.2 Example predictions

We provide example predictions on different datasets in Fig. 6.

Figure 6: Example predictions of different methods under long-term (top) and short-term (bottom) protocols.

B.3 Optimal $k$

We provide the optimal $k$ for all long-term prediction datasets using iTransformer [13] in Tabs. 10 and 11. The optimal $k$ lies between 2 and 4 in all but two settings (Exchange Rate at horizons 336 and 720, where it is 8), so finding it requires little tuning effort.

Hyperparameter	ETTh1				ETTh2				ETTm1				ETTm2
Hyperparameter	96	192	336	720	96	192	336	720	96	192	336	720	96	192	336	720
Optimal $k$	4	4	4	4	2	2	2	4	3	3	2	2	4	4	2	4

Table 10: The optimal $k$ on the ETT datasets using the iTransformer [13] model.

Hyperparameter	Electricity				Traffic				Weather				Exchange Rate
Hyperparameter	96	192	336	720	96	192	336	720	96	192	336	720	96	192	336	720
Optimal $k$	2	3	2	2	2	2	2	2	3	3	2	4	2	2	8	8

Table 11: The optimal $k$ on the Electricity, Traffic, Weather, and Exchange Rate datasets using the iTransformer [13] model.
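
As a rough illustration of how such a $k$ could be selected, the sketch below runs a plain grid search and keeps the candidate with the lowest validation MSE. The names `select_k` and `toy_validation_mse`, the candidate range 2 to 8, and the use of validation MSE as the selection criterion are assumptions made for this example rather than a description of our protocol. In practice, `toy_validation_mse` would be replaced by a routine that trains the model with dominant shuffle at the given $k$ and returns its validation error.

```python
from typing import Callable, Dict, Iterable, Tuple


def select_k(
    candidates: Iterable[int],
    validation_mse: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Grid search: evaluate each candidate k and return the one with the lowest MSE."""
    scores = {k: validation_mse(k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


def toy_validation_mse(k: int) -> float:
    """Hypothetical stand-in for 'train with k, return validation MSE'.

    The shape of this function is invented purely so the example runs;
    it is not derived from any experiment.
    """
    return 0.40 + 0.01 * abs(k - 3)


if __name__ == "__main__":
    # Candidate range is an assumption chosen to cover the values in Tabs. 10 and 11.
    best_k, scores = select_k(range(2, 9), toy_validation_mse)
    print("best k:", best_k)
```

Because the search space is a handful of small integers, an exhaustive search of this kind costs only a few training runs per dataset and horizon.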

B.4 Standard deviations

Tabs. 12, 13, 14, and 15 report the mean and standard deviation of the MSE over different runs. The small standard deviations indicate that the performance of our method is stable.

Model		ETTh1				ETTh2
Model		96	192	336	720	96	192	336	720
iTransformer	Baseline	0.392 $\pm$ 0.001	0.447 $\pm$ 0.002	0.483 $\pm$ 0.003	0.516 $\pm$ 0.003	0.303 $\pm$ 0.001	0.381 $\pm$ 0.000	0.412 $\pm$ 0.001	0.434 $\pm$ 0.002
	Mask [3]	0.390 $\pm$ 0.001	0.442 $\pm$ 0.002	0.475 $\pm$ 0.001	0.503 $\pm$ 0.003	0.301 $\pm$ 0.001	0.385 $\pm$ 0.003	0.414 $\pm$ 0.001	0.438 $\pm$ 0.005
	Mix [3]	0.388 $\pm$ 0.002	0.440 $\pm$ 0.002	0.477 $\pm$ 0.000	0.504 $\pm$ 0.004	0.301 $\pm$ 0.001	0.380 $\pm$ 0.001	0.414 $\pm$ 0.001	0.434 $\pm$ 0.003
	Ours	0.383 $\pm$ 0.001	0.438 $\pm$ 0.001	0.473 $\pm$ 0.002	0.492 $\pm$ 0.002	0.298 $\pm$ 0.002	0.382 $\pm$ 0.003	0.411 $\pm$ 0.004	0.428 $\pm$ 0.001

Table 12: Error bars on ETTh1 and ETTh2 datasets.

Model		ETTm1				ETTm2
Model		96	192	336	720	96	192	336	720
iTransformer	Baseline	0.344 $\pm$ 0.002	0.383 $\pm$ 0.003	0.421 $\pm$ 0.001	0.494 $\pm$ 0.003	0.183 $\pm$ 0.001	0.251 $\pm$ 0.002	0.311 $\pm$ 0.001	0.412 $\pm$ 0.001
	Mask [3]	0.347 $\pm$ 0.002	0.383 $\pm$ 0.005	0.420 $\pm$ 0.001	0.494 $\pm$ 0.004	0.179 $\pm$ 0.003	0.251 $\pm$ 0.001	0.311 $\pm$ 0.001	0.411 $\pm$ 0.002
	Mix [3]	0.334 $\pm$ 0.005	0.375 $\pm$ 0.002	0.421 $\pm$ 0.000	0.485 $\pm$ 0.002	0.178 $\pm$ 0.002	0.248 $\pm$ 0.001	0.311 $\pm$ 0.000	0.407 $\pm$ 0.002
	Ours	0.332 $\pm$ 0.001	0.374 $\pm$ 0.001	0.424 $\pm$ 0.001	0.492 $\pm$ 0.002	0.178 $\pm$ 0.002	0.246 $\pm$ 0.001	0.309 $\pm$ 0.001	0.409 $\pm$ 0.000

Table 13: Error bars on ETTm1 and ETTm2 datasets.

Model		Electricity				Traffic
Model		96	192	336	720	96	192	336	720
iTransformer	Baseline	0.152 $\pm$ 0.000	0.159 $\pm$ 0.001	0.179 $\pm$ 0.003	0.230 $\pm$ 0.013	0.399 $\pm$ 0.001	0.418 $\pm$ 0.000	0.428 $\pm$ 0.000	0.463 $\pm$ 0.000
	Mask [3]	0.153 $\pm$ 0.001	0.157 $\pm$ 0.001	0.173 $\pm$ 0.001	0.208 $\pm$ 0.005	0.395 $\pm$ 0.001	0.401 $\pm$ 0.005	0.418 $\pm$ 0.001	0.450 $\pm$ 0.002
	Mix [3]	0.151 $\pm$ 0.000	0.158 $\pm$ 0.001	0.173 $\pm$ 0.000	0.205 $\pm$ 0.003	0.400 $\pm$ 0.003	0.414 $\pm$ 0.004	0.424 $\pm$ 0.002	0.453 $\pm$ 0.003
	Ours	0.150 $\pm$ 0.000	0.156 $\pm$ 0.001	0.171 $\pm$ 0.000	0.199 $\pm$ 0.002	0.394 $\pm$ 0.000	0.412 $\pm$ 0.002	0.423 $\pm$ 0.002	0.448 $\pm$ 0.001

Table 14: Error bars on Electricity and Traffic datasets.

Model		Weather				Exchange Rate
Model		96	192	336	720	96	192	336	720
iTransformer	Baseline	0.175 $\pm$ 0.001	0.224 $\pm$ 0.001	0.281 $\pm$ 0.000	0.362 $\pm$ 0.003	0.086 $\pm$ 0.000	0.180 $\pm$ 0.000	0.335 $\pm$ 0.002	0.856 $\pm$ 0.004
	Mask [3]	0.178 $\pm$ 0.001	0.228 $\pm$ 0.002	0.284 $\pm$ 0.002	0.359 $\pm$ 0.001	0.090 $\pm$ 0.002	0.178 $\pm$ 0.001	0.329 $\pm$ 0.006	0.845 $\pm$ 0.008
	Mix [3]	0.175 $\pm$ 0.001	0.224 $\pm$ 0.000	0.279 $\pm$ 0.000	0.354 $\pm$ 0.000	0.089 $\pm$ 0.001	0.178 $\pm$ 0.001	0.328 $\pm$ 0.006	0.868 $\pm$ 0.008
	Ours	0.171 $\pm$ 0.001	0.221 $\pm$ 0.000	0.276 $\pm$ 0.000	0.351 $\pm$ 0.002	0.086 $\pm$ 0.001	0.176 $\pm$ 0.001	0.313 $\pm$ 0.006	0.821 $\pm$ 0.003

Table 15: Error bars on Weather and Exchange Rate datasets.
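
For reference, an entry of the form mean $\pm$ std can be obtained from per-run MSE values as sketched below. The helper name `summarize_runs`, the choice of the sample standard deviation, and the three MSE values are illustrative assumptions; the values are placeholders and not measurements from our runs.

```python
import statistics
from typing import Sequence


def summarize_runs(mse_per_run: Sequence[float]) -> str:
    """Format per-run MSE values as 'mean $\\pm$ std' with three decimals.

    Uses the sample standard deviation (n - 1 denominator); this choice is
    an assumption, and the population version would be statistics.pstdev.
    """
    mean = statistics.fmean(mse_per_run)
    std = statistics.stdev(mse_per_run)  # requires at least two runs
    return f"{mean:.3f} $\\pm$ {std:.3f}"


if __name__ == "__main__":
    # Placeholder MSE values for three hypothetical runs with different seeds.
    print(summarize_runs([0.382, 0.383, 0.384]))
```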