4.1 Experimental Settings
Datasets. We evaluate the performance of our proposed method on both synthetic and real-world time series datasets.
Synthetic data. To assess the efficacy of
Cedar in achieving generalization across diverse scenarios, we construct multiple synthetic time series datasets that manifest distinct patterns of invariance across domains; see Table
2 for a summary. These datasets satisfy the data assumptions from Section
3.1. We assume that there are no abrupt distribution shifts within each domain of time series, and all domains share a common characteristic, i.e., periodicity. Periodicity is a prevalent pattern in various real-world time series data, such as sales data in retail and temperature fluctuations. We use sinusoidal functions to generate periodic signals and apply Gaussian noise to them. When the trend is a shared attribute across domains, we use no trend (i.e., a horizontal trend). We use the first 270 timestamps for training, the next 90 timestamps for validation, and the remaining timestamps for testing. No exogenous attributes are used.
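To make the construction concrete, the following sketch illustrates how one such domain could be generated; the amplitude, period, slope, and noise scale shown are illustrative placeholders rather than the exact values used to build the datasets in Table 2.

```python
import numpy as np

def make_domain(length=500, period=24, amplitude=1.0, slope=0.0,
                noise_std=0.1, seed=0):
    """Generate one synthetic domain: a noisy sinusoid with an optional linear trend."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    signal = amplitude * np.sin(2 * np.pi * t / period)   # shared periodic component
    trend = slope * t                                      # zero slope => horizontal trend
    noise = rng.normal(0.0, noise_std, size=length)        # Gaussian noise
    return signal + trend + noise

# One domain with a horizontal trend; train/validation/test split by timestamps.
series = make_domain(length=500, slope=0.0, seed=42)
train, val, test = series[:270], series[270:360], series[360:]
```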
Real-world data. We also assess the domain generalization ability of
Cedar using real-world time series data, focusing on retail, transportation, and finance. The dataset summary is in Table
3.
—
Retail. We use Favorita [
41], which is a Kaggle dataset that contains five years (2013–2017) of daily sales for store-product combinations taken from a retail chain. We construct two datasets based on Favorita, namely,
Favorita-cat and
Favorita-store. In Favorita-cat, we focus on category-level sales in the same store, treating each category as a separate domain. Favorita-store comprises the time series data for a single category (Grocery I) across multiple stores. We use data from the year 2015. The training data span from 2015-03-01 to 2015-06-30. The validation samples are from 2015-07-01 to 2015-08-15. The remaining data are used for testing. The sales values are log-transformed.
—
Transportation. We use the U.S. Traffic Volume Data, which is openly available on the official website of the U.S. Department of Transportation. The traffic volume data are collected by state highway agencies. We use data from California from 2022-04 to 2022-11. The traffic volume is aggregated on a daily basis, considering both directions of travel, and the final values are divided by 1,000 to reduce the range of values (see the preprocessing sketch after this list). We use traffic volume data from 2022-04-01 to 2022-07-15 for training, data from 2022-07-16 to 2022-08-20 for validation, and the rest for testing.
—
Finance. We use Stock Exchange Data, which contains daily price data for indexes tracking stock exchanges collected from Yahoo! Finance.
We use the stock trading volume as the target variable, representing the number of shares of a security traded between its daily open and close. The training data span from 2020-01-01 to 2020-08-30, the validation data extend up to 2020-11-30, and the remaining data up to 2021-05-31 are used for testing. Since some stocks have large average volumes, we use a simple normalization, dividing all trading volumes by \(10^7\).
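As a rough illustration of the preprocessing described above, the following pandas sketch aggregates hourly traffic counts to daily totals over both directions and rescales the values, and applies the simple volume normalization to the stock data; the file and column names (`date`, `station`, `volume`) are hypothetical and do not necessarily match the raw files.

```python
import pandas as pd

# Hypothetical raw traffic records: one row per station, hour, and direction of travel.
traffic = pd.read_csv("traffic_volume_ca.csv", parse_dates=["date"])

# Daily volume per station, summed over hours and both directions, then divided by 1,000.
daily = (traffic.groupby(["station", pd.Grouper(key="date", freq="D")])["volume"]
                .sum()
                .div(1_000)
                .reset_index())

# Stock trading volumes: divide by 1e7 to reduce the value range,
# analogous to the log transform applied to the Favorita sales.
stocks = pd.read_csv("stock_exchange.csv", parse_dates=["date"])
stocks["volume"] = stocks["volume"] / 1e7
```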
Data assumption validation. We assess the alignment of real-world datasets with our assumptions primarily through prior knowledge and visualization methods. For instance, retail and transportation datasets demonstrate consistent seasonal patterns, and stock volume data accounts for overall market influence (Assumption
1). To mitigate abrupt distribution shifts (Assumption
2), we restrict the time sequence length for each domain and use category-level sales for retail, daily traffic volume, and daily stock volume data.
Exogenous attributes. We add numerical covariates consisting of time indicators (e.g., day of the week) to real-world datasets. The time indicators are represented by two Fourier terms [
28] to capture their periodic nature [
54].
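A minimal sketch of such Fourier features is given below; we assume a day-of-week periodicity here, and the exact set of time indicators used for each dataset may differ.

```python
import numpy as np
import pandas as pd

def fourier_terms(timestamps: pd.DatetimeIndex, period: float = 7.0) -> np.ndarray:
    """Encode a time indicator with two Fourier terms (one sine/cosine pair)."""
    phase = 2 * np.pi * timestamps.dayofweek.values / period
    return np.stack([np.sin(phase), np.cos(phase)], axis=1)  # shape: (T, 2)

dates = pd.date_range("2015-03-01", "2015-08-15", freq="D")
covariates = fourier_terms(dates)  # appended to the model inputs as numerical covariates
```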
Data Visualization. We visualize some datasets in Figure
1 to illustrate the disparity in patterns across domains. For the synthetic dataset PN-T, time series from different domains exhibit varying trends while sharing the same period and the same Gaussian noise applied to the periodic signals. For the real-world datasets, we observe diverse cross-domain and within-domain patterns that nevertheless roughly follow a pattern similar to the synthetic dataset.
Methods used for comparison. To evaluate the domain generalization performance of Cedar, we consider several popular, state-of-the-art domain generalization methods that are modality-agnostic (i.e., can be applied to time series data) and compatible with forecasting tasks:
—
DANN [
17] is an adversarial learning method that learns features that cannot discriminate between the training and test domains.
—
GroupDRO [
49] is the Group Distributionally Robust Optimization approach that aims to improve the robustness of models in the presence of domain shifts or changes in the data distribution.
—
MLDG [
34] is a meta-learning algorithm for domain generalization that performs meta-optimization on simulated train/test splits with domain shift.
—
IDGM [
53] is a gradient matching approach that learns invariance by maximizing the inner product between gradients from different domains.
—
wERM is a weighted empirical risk minimization method adapted from the time series prediction model under distribution shift using differentiable forgetting [
9].
wERM-exp and
wERM-mix are variants using two forgetting mechanisms, corresponding to exponential decay and a mixture of various functional forms of decay, respectively.
—
MMD [
61,
62] is a distribution matching method that aligns the distributions of the data representations from two domains.
Cedar and the above domain generalization methods can be applied to different base models for time series forecasting. Specifically, we consider two popular and representative time series models as base models:
—
DeepAR [
50] is an RNN-based probabilistic forecasting model.
—
WaveNet [
44] is a CNN-based forecasting model.
Apart from the base models combined with domain generalization methods, we also consider the following two types of methods as baselines:
—
Traditional time series models:
Seasonal Naive (SN) and
Exponential Smoothing (ES).
—
Latest deep learning models: The variational recurrent autoencoder
VRNN [
13], which learns latent variables that capture temporal dependencies, and an adaptive time series model, AdaRNN [15], which can be adapted to our domain generalization setting for time series forecasting.
Evaluation metrics. We employ both point accuracy and range accuracy metrics to evaluate probabilistic forecasting performance, following prior studies [
50,
54]. For point accuracy, we use the
normalized root mean squared error (NRMSE) and
symmetric mean absolute percentage error (sMAPE) [
3].
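With \(y_t\) denoting the observed value, \(\hat{y}_t\) the point prediction, and \(N\) the number of evaluated timestamps, these metrics take the standard form (we assume the conventional definitions from [3]; the exact normalization may differ):
\[
\text{NRMSE} = \frac{\sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}}{\frac{1}{N}\sum_{t=1}^{N}\left|y_t\right|},
\qquad
\text{sMAPE} = \frac{1}{N}\sum_{t=1}^{N}\frac{2\left|y_t - \hat{y}_t\right|}{\left|y_t\right| + \left|\hat{y}_t\right|}.
\]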
For range accuracy, we use the normalized quantile loss function [
50].
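Assuming the standard form from [50], with the sum taken over all evaluated timestamps, this loss can be written as
\[
Q(q) = \frac{2\sum_{t}\left[q\,\left(y_t - \hat{y}_t^q\right)\mathbb{1}\left\{y_t > \hat{y}_t^q\right\} + (1-q)\left(\hat{y}_t^q - y_t\right)\mathbb{1}\left\{y_t \le \hat{y}_t^q\right\}\right]}{\sum_{t}\left|y_t\right|},
\]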
where \(q\) is the quantile value, \(\hat{y}_t^q\) is the prediction for quantile \(q\), and \(\mathbb{1}\) is the indicator function. We report the scores when
\(q = 0.5\), denoted by Q(0.5), and the mean quantile performance over the nine quantiles in the range
\(q = \lbrace 0.1, 0.2, \ldots , 0.9\rbrace\), denoted by Q(mean). The evaluation scores are computed across all training/test domains.
Implementation details. We implemented, trained, and evaluated all methods using PyTorch [
45] 1.7.1 with CUDA 10.2 on a TITAN Xp GPU. For all datasets, we use the scaling mechanism from prior studies [
50,
54]. All parameters are initialized with Glorot initialization [
20] and trained using the Adam [
32] optimizer; the dropout rate is 0.3. The learning rate is searched from
\(\lbrace 0.0005, 0.001, 0.005\rbrace\). The batch size is 64 for all models and datasets. The hidden state size, which is kept consistent across layers for all models, is searched from
\(\lbrace 16, 32, 64\rbrace\). We adopt specific configurations for the number of hidden layers and the kernel size of the convolution operation based on prior work [
54]. For RNN-based models (DeepAR and VRNN), the number of hidden layers is 3. For CNN-based models (e.g., WaveNet), the number of hidden layers is 5 and the kernel size is 9. For the GroupDRO and MLDG baselines, we adopt the hyperparameter selection suggestions from previous work [
24]. For wERM, we use the hyperparameter initialization method introduced in the original code [
9]. For models that employ
maximum mean discrepancy (MMD) as the discrepancy measure, we utilize the squared linear MMD [
51] due to its efficiency and effectiveness. The coefficient multiplying the MMD term is searched from
\(\lbrace 10^i\rbrace _{i=-7}^{0}\). For AdaRNN, the number of RNN layers is 2, following the original paper’s suggestion. The MMD coefficient has to be tuned with other parameters, and we set its range to
\(\lbrace 0.001, 0.0001\rbrace\) to control the search space.
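For reference, one common reading of a squared linear MMD is the squared Euclidean distance between the mean representations of two domains (a linear-kernel MMD); whether [51] uses exactly this estimator is an assumption on our part. A minimal PyTorch sketch is:

```python
import torch

def squared_linear_mmd(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Squared MMD under a linear kernel: squared distance between mean representations.

    x: (n, d) representations from one domain; y: (m, d) representations from another.
    """
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

# Example: penalize the discrepancy between hidden states of two domains.
h_a = torch.randn(64, 32)   # batch of representations from domain A
h_b = torch.randn(64, 32)   # batch of representations from domain B
loss_mmd = squared_linear_mmd(h_a, h_b)
```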
For
Cedar, we grid-search the hyperparameter
\(\gamma\) applied to
\(\mathcal {L}_{DD}\) or
\(\mathcal {L}_{DDD}\) from the range
\(\lbrace 10^i\rbrace _{i=-7}^{0}\) and
p in Equation (
4) from
\(\lbrace 1,2\rbrace\). We do not tune
\(\gamma\) in conjunction with other hyperparameters, such as the learning rate. We directly utilize the optimal settings of the other hyperparameters learned from the base model. This allows us to focus on the impact of
\(\gamma\) on the model’s performance, without introducing additional variations from tuning other hyperparameters.
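A schematic of this tuning protocol is shown below; `train_and_validate` and the values in `base_best` are hypothetical placeholders standing in for our actual training loop and the base model's selected hyperparameters.

```python
import itertools
import random

def train_and_validate(gamma: float, p: int, lr: float, hidden_size: int) -> float:
    """Placeholder for the actual training loop; returns the validation loss."""
    return random.random()  # stand-in so the sketch runs end to end

base_best = {"lr": 0.001, "hidden_size": 32}    # illustrative values inherited from the tuned base model
gammas = [10.0 ** i for i in range(-7, 1)]       # {10^-7, ..., 10^0}
p_values = [1, 2]                                # p in Equation (4)

best = None
for gamma, p in itertools.product(gammas, p_values):
    val_loss = train_and_validate(gamma=gamma, p=p, **base_best)
    if best is None or val_loss < best[0]:
        best = (val_loss, gamma, p)
```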
Model selection. We use 80% of domains for training and 20% for testing and adopt the training-domain validation approach [
24].
We partition the available domains into training and test domains. Within the training domains, we further split the data into training and validation subsets in chronological order. Subsequently, we train models using the training subsets and choose the model with the lowest loss on the validation subsets. We run each experiment for 150 epochs with an early stopping criterion of 10 epochs. After selecting the best model parameters based on this criterion, we evaluate on 5 random seeds for all models and report the average results. For each seed, we shuffle the training and test domains, ensuring that the model’s generalization performance is assessed under varying conditions.
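The selection protocol can be summarized with the sketch below; the 80/20 domain split, the 5 seeds, and the epoch budget mirror the description above, while the number of domains, the chronological fraction, and the data structures are simplified placeholders.

```python
import random

def split_domains(domain_ids, train_ratio=0.8, seed=0):
    """Shuffle domains for a given seed and split them into training and test domains."""
    ids = list(domain_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

def chronological_split(series, train_frac=0.75):
    """Within a training domain, earlier timestamps form the training subset,
    later ones the validation subset (training-domain validation).
    The fraction is a placeholder; the actual split dates are dataset-specific."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

for seed in range(5):                       # 5 random seeds, averaged in the reported results
    train_domains, test_domains = split_domains(range(30), seed=seed)  # 30 domains is illustrative
    # ... train for up to 150 epochs with early stopping (patience 10),
    # select the checkpoint with the lowest validation loss, evaluate on test_domains.
```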
4.2 Results on Synthetic Data
Table
4 presents the results of the normalized quantile loss metrics Q(0.5) and Q(mean) on the four synthetic datasets. Table
5 shows the results of the point accuracy metrics NRMSE and sMAPE. We apply
Cedar to DeepAR and WaveNet and compare both with traditional time series models and deep learning models.
Among the traditional methods (SN and ES), ES performs well on datasets with fixed seasonality (PT-N and PN-T) but still falls behind some of the other methods. Among the deep learning baselines, the latent variable model (VRNN) and AdaRNN show poor performance and exhibit instability across cases. This might be due to the complexity of their model structures, which makes it difficult to learn diverse time series patterns. The original AdaRNN paper reports that the model’s performance drops significantly when the number of time series periods/domains grows beyond 10, which is consistent with its performance in our experiments. We also notice that its training time becomes extremely long when the number of domains is large (e.g., 30).
For domain generalization methods, we notice that widely used approaches such as GroupDRO and MLDG do not yield good performance on time series forecasting tasks. This can be attributed to their limited ability to account for certain inherent characteristics of time series data. The baselines DANN, IDGM, and MMD consistently deliver better results. The methods designed for distribution shifts, i.e., wERM-exp and wERM-mix, also perform well.
Cedar and the variant model
Cedar(-D) achieve favorable and stable performance compared to others when applied to both DeepAR and WaveNet. We see that the performance of
Cedar based on DeepAR sometimes surpasses that of WaveNet and vice versa. The effectiveness of the regularization is influenced by the temporal representation learned by the base model. The MMD method can be regarded as a naive variant of our approach, as it simply makes all domain representations similar (removing
\(d_{\mathcal {L}_{\text{fcst}}}\) in Equation (
2)). Its results are inferior to ours, indicating the efficacy of leveraging cross-domain forecasting performance to enhance domain generalization. For the two proposed regularization terms, we observe that in most cases (except on PN-T and T-PN),
Cedar consistently outperforms
Cedar(-D) or performs on par with
Cedar(-D). Hence, for the patterns observed in synthetic datasets, considering the prediction performance within individual domains is effective when applying cross-domain regularization.
4.4 Sensitivity Analysis
Ratio of training domains (\(M/K\)). We investigate the impact of different training domain ratios on our approach’s performance. Figure
2 shows the Q(0.5) results for the base model DeepAR and the model with our proposed method
Cedar, as we vary the training domain ratio from 0.2 to 0.8. The other metrics show a similar pattern. As the number of training domains increases, models tend to achieve better performance on test domains. This could be attributed to the fact that more training domains expose models to a greater variety of data, which enhances their ability to generalize to unseen domains. With fewer test domains, the model might also have already seen similar data during training, leading to better generalization performance.
Cedar generally performs better or shows competitive results in most cases. However, when the number of training domains is smaller (ratios of 0.2 and 0.4), we observe that the base model performs more stably across datasets. This might be because a reduced number of training domains makes generalization to unseen domains more challenging. In such scenarios, relying on the base model in the first place could be a safer option.
Values of
\(\gamma\). In Figure
3, we illustrate the impact of different values of
\(\gamma\) on the regularization term
\(\mathcal {R}_{DDD}\) in
Cedar by plotting the corresponding validation loss for all datasets. We observe that the optimal value of
\(\gamma\) varies across datasets (typically achieving better results when less than 1), depending on the scale of the loss and the magnitude of differences between domain losses. In datasets with substantial differences in losses between domains, a smaller
\(\gamma\) is preferred to avoid over-optimizing the regularization term at the expense of forecasting performance. Choosing an appropriate
\(\gamma\) is crucial to strike a balance between domain generalization and accurate forecasting performance at the sample level.
4.7 Computational Analysis
We conduct two experiments to evaluate the computational efficiency of Cedar, taking into account variations in both base forecasting model size and dataset size.
In the first experiment, we analyze the training time of
Cedar in comparison with the base models DeepAR and WaveNet while varying the base model size. These experiments are performed using the synthetic dataset NT-P with default settings, e.g., a batch size of 64 and a training domain ratio of 0.8. Figure
6 displays the increase in training time (per epoch) of
Cedar relative to the base model. We use the average training time of three epochs as the final measurement. We observe that the additional overhead incurred by
Cedar is very small, amounting to only a few seconds. Furthermore, this overhead exhibits a slight decreasing trend as the model size (i.e., the hidden state size) increases. Our generalization approach incurs a nearly constant time overhead, demonstrating its suitability for larger models. These findings indicate that
Cedar is well-suited for real-world applications with more complex data, which typically require large models for accurate forecasting.
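The per-epoch overhead is obtained simply by timing epochs and averaging. A minimal sketch is given below; `run_epoch` and the model labels are hypothetical stand-ins for one actual training epoch of each model.

```python
import time

def run_epoch(model_name: str) -> None:
    """Placeholder for one full training epoch of the given model."""
    time.sleep(0.01)  # stand-in for the actual forward/backward passes

def mean_epoch_time(model_name: str, n_epochs: int = 3) -> float:
    """Average wall-clock time per epoch over n_epochs."""
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch(model_name)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

base_time = mean_epoch_time("deepar")
cedar_time = mean_epoch_time("deepar+cedar")
overhead_seconds = cedar_time - base_time                    # absolute overhead (as in Figure 6)
increase_pct = 100 * (cedar_time - base_time) / base_time    # relative increase (as in Figure 7)
```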
In the second experiment, we investigate how
Cedar performs with larger time series datasets. We extend the synthetic dataset NT-P, increasing the length of each domain’s time series from 500 data points to 1k, 10k, and 100k data points. The training domain ratio (0.8) and the ratios of training, validation, and testing timestamps remain the same as in our main experiments. To accommodate the larger datasets, we increase the batch size to 1,024 and the hidden state size of the base model to 1,024. The results in Figure
7 show the percentage increase in training time (per epoch) for
Cedar compared to the base model. We again use the average training time over three epochs as the final measurement. Notably, when the base model is DeepAR, the percentage increase is consistently less than 9%. For WaveNet, there is some variability, but the largest increase remains under 10%. In some cases,
Cedar adds almost no additional time to the base model. These results highlight the efficiency of
Cedar when dealing with larger datasets.
4.8 Beyond Our Assumptions
Cedar may not work effectively in scenarios that violate our data assumptions. To illustrate this, we present two negative examples where our method could potentially encounter challenges.
Violating Assumption 1. We conduct an experiment using the
UCI_air dataset. This dataset is built from the Beijing Multi-site Air-quality Dataset, which contains hourly air pollutant data from 12 nationally controlled air-quality monitoring sites from 2013 to 2017.
Each monitoring site is considered as a domain, and we use PM2.5 data from 2016-01-01 to 2016-09-10 for training, data from 2016-09-11 to 2016-12-01 for validation, and the rest of the data up to 2017-02-28 for testing. Due to a large number of missing values, we convert the data to daily values by averaging the hourly values within each day. The historical window size is 28, and the prediction window is 7. In Figure
8(a), we observe a high degree of similarity and overlap across different domains, which violates Assumption
1; that is, \(\epsilon _l\) is close to 0. The presence of these highly similar patterns also raises concerns about the suitability of applying a domain generalization model in this context. We show the comparison of our method with the two base models in Figure
8(b). The base models perform better. This finding emphasizes the importance of considering the shared patterns and differences between domains to design effective domain generalization models (Assumption
1).
Violating Assumption 2. Assumption
2 assumes that the time series data should not exhibit abrupt distribution shifts within each domain. To test this assumption, we apply a straightforward method to manipulate the synthetic dataset NT-P by introducing random mean shifts within different segments while keeping other properties unchanged. A portion of the manipulated data
NT-P* is presented in Figure
9(a), and the comparison results are shown in Figure
9(b). Our observations reveal that when the base model is DeepAR, the performance differences between
Cedar and the base model are relatively minor, with the base model performing slightly better. However, for the base model WaveNet,
Cedar struggles to produce favorable results. When domain difficulties are considered in the regularization, the results become much less stable. This outcome can be attributed to the fact that the domain-difficulty factors considered during training do not transfer effectively to unseen domains that may exhibit substantial disparities. When time series data involve abrupt changes or significant distribution shifts (i.e., scenarios that contradict Assumption
2), some researchers [
15,
39] propose characterizing temporal distributions to segment time series based on distribution differences, thus establishing distinct domains. Their characterization method presents a potential solution to address the limitations of our current approach. We leave further analysis of this potential enhancement for future research.
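For concreteness, the manipulation used to build NT-P* can be sketched as follows; the segment count and shift scale are illustrative, and the actual values used to produce Figure 9 may differ.

```python
import numpy as np

def add_random_mean_shifts(series: np.ndarray, n_segments: int = 5,
                           shift_scale: float = 2.0, seed: int = 0) -> np.ndarray:
    """Split a series into contiguous segments and add a random constant offset to each,
    introducing abrupt within-domain distribution shifts (violating Assumption 2)."""
    rng = np.random.default_rng(seed)
    shifted = series.copy()
    bounds = np.linspace(0, len(series), n_segments + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        shifted[lo:hi] += rng.normal(0.0, shift_scale)   # random mean shift per segment
    return shifted

# Example: manipulate one NT-P domain to obtain its NT-P* counterpart.
original = np.sin(2 * np.pi * np.arange(500) / 24)
manipulated = add_random_mean_shifts(original)
```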