4.1. Experimental Setup
The experiment was conducted on an HP Shadow Wizard 9 machine running the Windows 11 operating system, with an Intel Core i5 CPU at 3.7 GHz, 16 GB of RAM, and an Nvidia RTX 4060 GPU, using the Python programming language and the deep learning frameworks PyTorch 1.13.1 and TensorFlow 2.13.1.
A total of two multivariate time series datasets were used in this experiment: the public SML2010 dataset [42] and the exchange rate dataset [43]. For the SML2010 dataset, indoor temperature was selected as the target sequence of the prediction experiment, and the other 16 sequences were taken as exogenous variable sequences related to the target. For the exchange rate dataset, the Singapore exchange rate series was selected as the target series, and the other seven sequences served as its exogenous sequences. Both datasets were divided into a training set, a validation set, and a test set in a ratio of 8:1:1. The exact sizes of the datasets are shown in Table 1, and the parameters of CNN-DA-RGRU are shown in Table 2.
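As a point of reference, the chronological 8:1:1 split described above can be sketched as follows; the function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def split_series(data: np.ndarray, ratios=(0.8, 0.1, 0.1)):
    """Chronologically split a multivariate series of shape (T, D)
    into training, validation, and test sets (no shuffling, since
    temporal order matters for time series)."""
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# e.g., SML2010: 4137 time steps x 17 variables (target + 16 exogenous)
series = np.random.rand(4137, 17)  # placeholder data
train, val, test = split_series(series)
```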
SML2010 dataset: This dataset was collected from a house built specifically for the Solar Decathlon competition. It contains about 40 days of monitoring data at a granularity of 15 min, comprising 21 time series (e.g., indoor temperature, outdoor temperature, and relative humidity) with 4137 data points in each series.
Exchange rate dataset: This dataset contains the daily exchange rates of eight countries (Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016, covering 26 years, with a time granularity of one day, eight variable series, and 7589 records.
To verify the prediction performance of the model in this chapter, RMSE, MAE, and MAPE were used as the evaluation metrics on both the SML2010 and exchange rate datasets.
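The three metrics follow their standard definitions; a minimal NumPy sketch is given below (the small eps guard against division by zero is our addition, not from the paper).

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # root mean squared error
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # mean absolute error
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8) -> float:
    # mean absolute percentage error, in percent
    return float(np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + eps))) * 100)
```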
In the experiment, we performed single-step prediction while gradually increasing the size of the sliding time window, setting it to 5, 10, 15, 20, 25, 30, and 50 in turn.
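A sketch of how such sliding windows are typically constructed for single-step prediction follows; it assumes the target variable sits in column 0, which is an illustrative convention rather than the authors' stated data layout.

```python
import numpy as np

def make_windows(data: np.ndarray, window: int, target_col: int = 0):
    """Convert a (T, D) series into supervised pairs for single-step
    prediction: X has shape (T - window, window, D); y holds the
    target variable one step after each window."""
    X, y = [], []
    for t in range(len(data) - window):
        X.append(data[t:t + window])            # past `window` steps, all D variables
        y.append(data[t + window, target_col])  # next-step target value
    return np.stack(X), np.array(y)

X, y = make_windows(np.random.rand(4137, 17), window=15)  # window=15 as chosen for SML2010
```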
Figure 2 shows the relationship between the sliding window size and the model evaluation metrics on the two datasets. As can be seen in Figure 2a, for the SML2010 dataset, the RMSE and MAE of the model range from 0 to 0.2, while the MAPE fluctuates from 0.1 to 0.5. When the sliding window size is set to 15, the RMSE, MAE, and MAPE reach their lowest values, indicating that the prediction error of the model is minimized. From Figure 2b, it can be seen that for the exchange rate dataset, the RMSE and MAE fluctuate between 0 and 0.01, while the MAPE fluctuates between 0.06 and 0.08. As the sliding window size is gradually increased from 5, the RMSE, MAE, and MAPE decrease until the window size reaches 10, at which point all three evaluation metrics reach their minimum values. Thus, with a sliding window size of 10, the model achieves the best prediction performance on the exchange rate dataset.
In addition to varying the sliding time window, choosing different convolution kernel sizes, pooling kernel sizes, and hidden layer unit sizes can also lead to different prediction results. Therefore, to minimize the impact of parameter variations on the experimental results, we conducted parameter sensitivity experiments for the single-step CNN-DA-RGRU model on both datasets to determine the convolution kernel size, the pooling kernel size, and the hidden layer unit size.
We first performed experiments on the sensitivity of the convolution kernel size. The other parameter settings were consistent with the results of the sliding time window experiments. In the experiments, the convolution kernel size was varied sequentially from 3 to 8 to observe the changes in RMSE, MAE, and MAPE.
Figure 3 shows the relationship between the convolution kernel size and the three metrics on the two datasets. When the convolution kernel size is 6, all three metrics reach their minimum values on both datasets; i.e., the model has the strongest spatial feature extraction ability and the best prediction results.
Having determined the convolution kernel size and the sliding time window size, we then conducted sensitivity experiments on the GRU hidden layer unit size, with the other parameter settings consistent with the above experiments. We set up 16 combinations of two hidden layers with sizes 32, 64, 128, and 256 and observed the changes in the three evaluation metrics.
Figure 4 shows the relationship between the hidden layer unit size and the three metrics on the two datasets. When the hidden layer unit sizes are [128, 128], the RMSE, MAE, and MAPE reach their minimum values, indicating that the model has the best temporal dependency extraction ability with this setting.
Finally, having determined the convolution kernel size, the sliding time window size, and the hidden layer unit size, we carried out the sensitivity experiment on the pooling kernel size, with the other parameters set according to the above experiments. The pooling kernel size was set to 1, 2, 3, and 4 in turn, and the changes in RMSE, MAE, and MAPE were observed.
Figure 5 shows the relationship between the pooling kernel size and the three metrics on the two datasets. When the pooling kernel size is 2, all three evaluation metrics reach their minimum values, indicating that the model has the strongest ability to eliminate redundant features with this setting.
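Collecting the values selected by these sensitivity experiments, the resulting hyperparameter choices can be summarized in a configuration block such as the following (the key names are hypothetical, not from the paper).

```python
# Hyperparameters fixed by the sensitivity experiments above
CONFIG = {
    "window_size": {"SML2010": 15, "exchange_rate": 10},
    "conv_kernel_size": 6,           # strongest spatial feature extraction
    "pool_kernel_size": 2,           # best removal of redundant features
    "gru_hidden_sizes": [128, 128],  # two stacked GRU layers
}
```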
4.2. Ablation Experiments
To verify the effectiveness of the different modules, namely CNN, DA, and RGRU, we conducted ablation experiments on the two datasets. The models containing different components are as follows (a configuration sketch follows the list):
With a single-layer attention module only: We replaced the two-layer attention module in the model with a single-layer attention module; this model is denoted CNN-DA-RGRU1.
Without the two-layer attention module: We removed the two-layer attention module from the model; this model is denoted CNN-DA-RGRU2.
Without the convolution module: We removed the convolution module from the model; this model is denoted CNN-DA-RGRU3.
Without the residual structure module: We removed the residual module from the model; this model is denoted CNN-DA-RGRU4.
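A minimal sketch of how these variants could be expressed as configuration flags (the flag names are our own; the paper does not specify an implementation):

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical flags mirroring the four ablation variants above."""
    attention_layers: int = 2    # 1 reproduces CNN-DA-RGRU1
    use_attention: bool = True   # False reproduces CNN-DA-RGRU2
    use_conv: bool = True        # False reproduces CNN-DA-RGRU3
    use_residual: bool = True    # False reproduces CNN-DA-RGRU4

VARIANTS = {
    "CNN-DA-RGRU":  AblationConfig(),
    "CNN-DA-RGRU1": AblationConfig(attention_layers=1),
    "CNN-DA-RGRU2": AblationConfig(use_attention=False),
    "CNN-DA-RGRU3": AblationConfig(use_conv=False),
    "CNN-DA-RGRU4": AblationConfig(use_residual=False),
}
```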
Figure 6, Figure 7, Figure 8 and Figure 9 show the metric results, actual values, and predictions of the CNN-DA-RGRU model and the ablation models on the SML2010 and exchange rate datasets. As can be seen from the figures, the introduction of the convolution module greatly improves the prediction results, since it can extract the spatial relationships between sequences. The prediction results of the two-layer attention model are better than those of the single-layer attention model, and removing the residual connections also reduces the prediction performance of the model. In the ablation experiments on the SML2010 dataset, the prediction performance of the above structures, in descending order, is CNN-DA-RGRU > CNN-DA-RGRU1 > CNN-DA-RGRU2 > CNN-DA-RGRU4 > CNN-DA-RGRU3. This shows that on the SML2010 dataset, removing the convolution module has the largest impact, followed by removing the residual connections, then removing the two-layer attention module, while reducing the attention to a single layer has the smallest impact. In the ablation experiments on the exchange rate dataset, the prediction performance, in descending order, is CNN-DA-RGRU > CNN-DA-RGRU1 > CNN-DA-RGRU2 > CNN-DA-RGRU3 > CNN-DA-RGRU4. This shows that on the exchange rate dataset, removing the residual connections has the greatest impact, followed by removing the convolution module, then removing the two-layer attention module, with the smallest impact being the reduction in the number of attention layers.
4.3. Comparative Experiments
In this experiment, a total of 13 deep learning models for multivariate prediction problems were selected and compared with CNN-DA-RGRU on the public datasets.
(1) GRU: Che et al. constructed a GRU model for multivariate time series prediction and achieved good results [30];
(2) BIGRU: This is a bidirectional gated recurrent unit neural network based on GRU;
(3) GRU-Attention: Jung et al. proposed a predictive model combining GRU and an attention mechanism for power prediction, and the results show that this model is superior to other models [45];
(4) BIGRU-Attention: Song et al. used the BIGRU-Attention model to forecast a multivariate tropical cyclone track dataset, and the experimental results showed that the accuracy of this model was improved compared with mainstream prediction models in tropical cyclone track prediction tasks [46];
(5) CNN-GRU: Gao et al. proposed a CNN-GRU model for wind speed prediction on a multivariate wind speed dataset and achieved a good prediction effect [33];
(6) CNN-BIGRU: This is a hybrid architecture combining CNN and bidirectional gated recurrent neural network based on CNN-GRU;
(7) LSTM: Sorkun et al. proposed a multivariate LSTM prediction model using multivariate meteorological data to predict solar radiation and, comparing the results with previous studies, found that the multivariate approach performed better than the previous univariate model [47];
(8) BILSTM: This is a bidirectional long short-term memory neural network model based on LSTM;
(9) LSTM-Attention: Ju and Liu proposed the LSTM-Attention model, and the experimental results show that the model has excellent performance [48];
(10) BILSTM-Attention: Hao et al. proposed a prediction model of atmospheric temperature based on the BILSTM-Attention model, and the results show that it can effectively improve the prediction accuracy [49];
(11) CNN-LSTM: Widiputra et al. proposed a hybrid multiple CNN-LSTM architecture, which showed its superiority in multiple financial forecasting tasks [31];
(12) CNN-BILSTM: Kim et al. used CNN and BILSTM to construct a hybrid model to study the system marginal price of Jeju Island in South Korea, and the research results show that the model has good forecasting performance in this task [50];
(13) DSA-Conv-LSTM: Xiao et al. used a two-stage attention mechanism and Conv-LSTM to construct a hybrid architecture for multivariate time series prediction, and the results show that this model has advantages over other baseline models [30].
Table 3, Table 4 and Table 5 present the MAE, MAPE, and RMSE evaluation metrics of each baseline model and the proposed model on the SML2010 dataset. As can be seen from the tables, the prediction accuracy of the plain GRU, LSTM, BIGRU, and BILSTM models is lower than that of the corresponding models with the attention mechanism added, in both single- and multi-step prediction. This is because, with the attention mechanism, the model pays more attention to the features that are important and related to the target sequence variable, thus improving the performance of multivariate prediction. In addition, since convolutional networks can effectively extract spatial features, that is, the relational features between variables, the combined models incorporating convolutional networks also perform better than the single models without them. In this comparison experiment, apart from the model proposed in this paper, the DSA-Conv-LSTM model achieves the best prediction results, because it can effectively extract spatial relationships and important features by using the convolution layer and the two-stage attention mechanism layer, and the ability of the two-stage attention mechanism to extract important features is stronger than that of a general attention mechanism.
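To make the feature-weighting idea concrete, here is a generic sketch of soft attention over input variables in PyTorch; it illustrates the mechanism only and is not the DA module or any baseline's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Generic soft attention over the D input variables: scores each
    variable at each time step and reweights it before the recurrent layer."""
    def __init__(self, n_features: int):
        super().__init__()
        self.score = nn.Linear(n_features, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features)
        weights = torch.softmax(self.score(x), dim=-1)  # per-variable weights
        return x * weights  # emphasize variables relevant to the target

out = FeatureAttention(17)(torch.randn(32, 15, 17))  # (batch, window, variables)
```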
However, the DSA-Conv-LSTM model also has its shortcomings. We found that although the model works best in single-step prediction, its error increases greatly as the step size grows, so its accuracy in multi-step prediction is not ideal. This is because errors accumulate over a large amount of information during the model's computation, leading to vanishing and exploding gradients. In contrast, our proposed CNN-DA-RGRU model contains a residual structure that provides direct cross-layer connections: the data processed by the convolutional module are added to the branch input of the next module, thus breaking the symmetry of the network, reducing network degradation, and alleviating the error accumulation problem mentioned earlier. Compared with the suboptimal DSA-Conv-LSTM model, over prediction steps 2–6 the RMSE of our model decreased by 0.0153, 0.0122, 0.0562, 0.0584, and 0.0491; the MAE decreased by 0.0083, 0.0055, 0.0396, 0.0423, and 0.0167; and the MAPE decreased by 0.0379%, 0.0031%, 0.2035%, 0.1776%, and 0.1191%, respectively. Observing the errors accumulated over the six-step prediction for the two models, the step-to-step RMSE increments of DSA-Conv-LSTM are 0.0242, 0.009, 0.0609, 0.0106, and 0.0293; the MAE increments are 0.0172, 0.0035, 0.0507, 0.0023, and 0.0122; and the MAPE increments are 0.0816%, 0.0174%, 0.2246%, 0.0265%, and 0.0772%, respectively. For the CNN-DA-RGRU model, the successive RMSE increments are 0.0063, 0.0197, 0.0136, 0.0032, and 0.0466; the MAE increments are 0.0047, 0.0071, 0.0063, 0.0023, and 0.0451; and the MAPE increments are 0.0236%, 0.0260%, 0.0204%, 0.1657%, and 0.0017%, respectively. It can be seen that in multi-step prediction, the accumulated error of CNN-DA-RGRU is mostly smaller than that of DSA-Conv-LSTM, which also shows the effectiveness of the residual connection structure.
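The cross-layer shortcut described above can be illustrated with a minimal PyTorch sketch; the channel count, activation, and single-convolution body are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Convolutional block whose output is added back to its input,
    providing the direct cross-layer connection that limits error
    accumulation in deeper networks."""
    def __init__(self, channels: int, kernel_size: int = 6):
        super().__init__()
        # padding="same" keeps the sequence length unchanged (stride 1)
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding="same")
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x)) + x  # shortcut: branch output + input

out = ResidualConvBlock(channels=17)(torch.randn(32, 17, 15))  # (batch, vars, window)
```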
Figure 10 and Appendix A Figure A1, Figure A2, Figure A3, Figure A4, Figure A5 and Figure A6 show the predictions of each model on the SML2010 dataset. Since the prediction results of all comparison models except DSA-Conv-LSTM differ significantly from those of the proposed CNN-DA-RGRU, only the prediction results of DSA-Conv-LSTM and CNN-DA-RGRU are compared with the real values in Figure 10. As the figure shows, both the proposed model and the second-best model predict the SML2010 dataset well, with the proposed model being superior. From these results, we can see that the prediction performance of the proposed CNN-DA-RGRU model is slightly better than that of the DSA-Conv-LSTM model on the SML2010 dataset and also better than that of the other baseline models.
Table 6, Table 7 and Table 8 show the MAE, MAPE, and RMSE evaluation metrics for the prediction performance of each model on the exchange rate dataset. From the data in the tables, it can be seen that, apart from the model proposed in this paper, the DSA-Conv-LSTM model has the best prediction results when the prediction horizon is 3–6 time steps. Compared with the second-ranked DSA-Conv-LSTM model, the RMSE of our model is reduced by 0.0003, 0.0006, 0.0006, and 0.0007; the MAE by 0.0007, 0.0007, 0.0005, and 0.0008; and the MAPE by 0.003%, 0.0078%, 0.0081%, and 0.0137%. Over the six-step prediction, the step-to-step RMSE increments of DSA-Conv-LSTM are 0.0003, 0.0017, 0.0007, 0.0005, and 0.0007; the MAE increments are 0.0004, 0.0013, 0.0002, and 0.0004; and the MAPE increments are 0.00065%, 0.0137%, 0.008%, 0.0075%, and 0.0073%, respectively. For the CNN-DA-RGRU model, the successive RMSE increments are 0.0002, 0.0011, 0.0004, 0.0005, and 0.0003; the MAE increments are 0.0002, 0.0003, 0.0004, 0.0005, and 0.0001; and the MAPE increments are 0.0047%, 0.0032%, 0.0032%, 0.0072%, and 0.0017%, respectively. It can be seen that in multi-step prediction, the accumulated error of CNN-DA-RGRU is smaller than that of DSA-Conv-LSTM. Secondly, the GRU-Attention, CNN-GRU, LSTM-Attention, BILSTM-Attention, and CNN-BILSTM models perform better in single-step prediction because they contain attention or convolutional layers, which can effectively extract spatio-temporal features and give more weight to important features. Apart from the model proposed in this paper, the DSA-Conv-LSTM model still has the best prediction performance on this dataset because it can extract spatio-temporal feature relationships and focus on important features. Overall, in both single-step and multi-step prediction, the CNN-DA-RGRU model outperforms the above models.
The prediction results of each model on the exchange rate dataset are shown in Figure 11 and Appendix A Figure A7, Figure A8, Figure A9, Figure A10, Figure A11 and Figure A12. Since the prediction results of the comparison models on this dataset, except for the DSA-Conv-LSTM model, differ considerably from those of the proposed CNN-DA-RGRU model, only the actual values and the comparative prediction results of DSA-Conv-LSTM and CNN-DA-RGRU are given in Figure 11. From the figure, it can be seen that the proposed CNN-DA-RGRU model outperforms DSA-Conv-LSTM when the prediction horizon is between three and six steps, and increasing the step size does not greatly affect the prediction performance of our model; even at longer horizons, it still predicts well. The reason the proposed model is not as good as the second-ranked DSA-Conv-LSTM model at small prediction steps is that, although the two models are comparable in extracting spatial features and important correlation features, the two-stage attention mechanism and the encoder–decoder LSTM structure of DSA-Conv-LSTM give it the advantage when the prediction step size is small. However, as the prediction step size gradually increases, the prediction performance of DSA-Conv-LSTM starts to decline due to the accumulation of prediction errors, whereas the residual blocks and residual connections of the CNN-DA-RGRU model allow a deeper network while limiting error accumulation, so its multi-step prediction performance decreases only slightly.
In summary, the proposed CNN-DA-RGRU model can better extract the spatial relationships between variables in multivariate series, pay more attention to important features, mitigate network degradation through its residual structure, and increase the paths of information flow. Its single-step and multi-step forecasting performance on the SML2010 and exchange rate datasets is generally better than that of the other time series forecasting models.