Proceeding Paper

Explaining When Deep Learning Models Are Better for Time Series Forecasting †

by Martín Solís 1,* and Luis-Alexander Calvo-Valverde 2
1 Business School, Instituto Tecnológico de Costa Rica, Cartago 159-7050, Costa Rica
2 Computer Engineering School, Instituto Tecnológico de Costa Rica, Cartago 159-7050, Costa Rica
* Author to whom correspondence should be addressed.
Presented at the 10th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 15–17 July 2024.
Eng. Proc. 2024, 68(1), 1; https://doi.org/10.3390/engproc2024068001
Published: 27 June 2024
(This article belongs to the Proceedings of The 10th International Conference on Time Series and Forecasting)

Abstract
There is a gap in knowledge about the conditions that explain why one method has better forecasting performance than another. Specifically, this research aims to find the factors that can influence deep learning models to work better with time series. We generated linear regression models to analyze whether 11 time series characteristics influence the performance of deep learning models versus statistical models and other machine learning models. For the analyses, 2000 time series from the M4 competition were selected. The results show findings that can help to better explain why a pretrained deep learning model is better than another kind of model.

1. Introduction

Different researchers have compared the performance of several forecasting methods using samples of time series to find out which method usually gives better results in forecasting regression tasks. Although these studies identified methods that tend to provide better results than others in general terms, no method always performs best [1,2,3,4]. This means that an algorithm may give better results when looking at the average test error, yet still not be the best for each individual time series. Therefore, research must move from identifying the best algorithm at a global level to analyzing the factors or reasons that make an algorithm work better under certain conditions.
On the other hand, in the field of time series, deep learning models have been gaining relevance in different areas because of their good predictive capacity [5,6]. However, there is a gap in the knowledge about when they can provide better predictions and about the conditions that favor those kinds of models. In [7], the authors found in a review of the literature that there is a lack of rigorous scientific publications that explain the benefits and limitations of the most popular algorithms for time series prediction. Hence, this research aims to identify the factors that can influence deep learning models to work better. The results could be useful to understand the behavior of deep learning on time series. They can help to identify the conditions for using those methods, suggest transformations of the time series to improve the prediction of the model, and clarify the elements that should be investigated in learning architectures in order to overcome their weaknesses.

2. Literature Review

In this review, we describe studies that have focused on comparing models across many time series. Most of them did not intend to understand the reasons why one model is better than another.
The first studies comparing the performance of traditional statistical models versus machine learning models on a larger scale were those of [3,4]. Both studies compared simple neural networks with other models using the M3 competition dataset. Recently, deep learning models have been gaining relevance in the field of forecasting. Therefore, some studies have emerged to compare the performance of deep learning with other machine learning models and statistical models on a larger scale. For example, [7] compared the performance of LSTM and MLP with other machine learning methods and statistical methods such as ARIMA and SARIMA. A total of 95 datasets were used for the comparisons. They used two strategies to evaluate their models: multi-step-ahead projection with approximate iteration and multi-step-ahead projection with updated iteration. Under both strategies, the LSTM was outperformed by machine learning and statistical models on chaotic, stochastic, and deterministic time series.
In [8], 120 intermittent and lumpy time series from the M5 competition datasets were selected to compare the performance of statistical, machine learning, and deep learning models. The authors found that modern deep learning methods, especially LSTMs, achieved good but not the best performance and that, with intermittent as well as lumpy time series, other models exceeded the deep learning models. In [9], using 1000 time series of the M4 competition, it was found that deep learning models had better results than machine learning models and statistical models. In [10], a comparative study was conducted on nine datasets (each containing several time series) with hourly, daily, and minute-level time series. The goal was to analyze the performance of a Gradient Boosting Regression Tree (GBRT) versus deep learning models such as LSTM, Temporal Fusion Transformer, DeepAR, and others. The results suggest that GBRT can compete with state-of-the-art DNN models and sometimes outperform them when the input and output structures of the GBRT are efficiently feature-engineered. In [11], a comparison was made between LSTM and ARIMA based on a monthly financial time series. Their findings showed that LSTM outperforms the ARIMA model. Likewise, [12] found that deep learning models outperformed ARIMA; the authors predicted the aviation demand of 48 airports and multiple aviation routes.

3. Materials and Methods

3.1. Data

We used the monthly M4 competition dataset [13]. We joined the training and testing datasets of the M4 competition; then, 2000 time series were randomly selected for training the deep learning models, and another 2000 time series were selected to apply and evaluate transfer learning. This second group of time series was called the target dataset.
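A minimal sketch of this split, assuming the public M4 monthly CSV layout (files named "Monthly-train.csv" and "Monthly-test.csv" with a "V1" series-id column); the file names, the join step, and the random seed are illustrative, not taken from the paper:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical file names and layout ("V1" is the series id, the remaining
# columns are observations), following the public M4 monthly CSV exports.
train = pd.read_csv("Monthly-train.csv")
test = pd.read_csv("Monthly-test.csv")
full = train.set_index("V1").join(test.set_index("V1"), rsuffix="_test")

# Randomly pick 2000 series for pretraining the deep learning models and
# another 2000 as the target dataset for applying and evaluating transfer learning.
ids = rng.permutation(full.index.to_numpy())
pretrain_df = full.loc[ids[:2000]]
target_df = full.loc[ids[2000:4000]]
```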

3.2. Models

The deep learning models implemented were three of the methods most traditionally used for time series: Long Short-Term Memory (LSTM), Temporal Convolutional Network (TCN), and Convolutional Neural Network (CNN). For comparison, we implemented three of the most widely used statistical methods: ARIMA, ETS, and THETA, and three machine learning methods: XGBoost, Random Forest, and Support Vector Machines. All models were generated to predict 12 months into the future. In the case of the machine learning models, the past 12 months were used as input.
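As an illustration of the input/output structure just described (12 past months as input, 12 future months as output), the following sketch builds supervised windows for the machine learning models; the function name and the synthetic series are ours, not from the paper:

```python
import numpy as np

def make_windows(series: np.ndarray, n_in: int = 12, n_out: int = 12):
    """Build (12 past values -> 12 future values) pairs, the input/output
    structure used for the machine learning models."""
    X, y = [], []
    for t in range(n_in, len(series) - n_out + 1):
        X.append(series[t - n_in:t])
        y.append(series[t:t + n_out])
    return np.array(X), np.array(y)

# Example on a synthetic monthly series of length 60.
X, y = make_windows(np.arange(60, dtype=float))
print(X.shape, y.shape)  # (37, 12) (37, 12)
```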

3.3. Training and Testing Process

The deep learning models were trained using the Adam optimizer and an early stopping criterion: training stopped after two epochs without improvement in the loss on the validation sample. The loss function was the mean absolute percentage error, and the batch size was equal to one, because it gave the best results.
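A minimal Keras-style sketch of this training configuration (Adam, MAPE loss, batch size of one, stop after two epochs without validation improvement); the LSTM layer size and the dummy data are placeholders, since the actual architectures were tuned as described in Table 1:

```python
import numpy as np
import tensorflow as tf

# Dummy windowed data standing in for the real M4 training windows.
X_train, y_train = np.random.rand(200, 12, 1), np.random.rand(200, 12)
X_val, y_val = np.random.rand(40, 12, 1), np.random.rand(40, 12)

# Placeholder LSTM; the real layer sizes come from the tuning in Table 1.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(12, 1)),
    tf.keras.layers.LSTM(60),
    tf.keras.layers.Dense(12),  # 12-month forecasting horizon
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mean_absolute_percentage_error")

# Stop after two epochs without improvement in the validation loss.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                              restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=1, callbacks=[early_stop])
```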
A back-testing process with three repetitions was applied to the 2000 time series of the target dataset to transfer the knowledge of the deep learning models. Each time series was divided into two parts: one part for training and twelve future points for testing. The MAPE metric was used to compute the performance, and the three MAPE results were averaged to obtain the final performance measure of each model on each time series. For the other models (machine learning and statistical models), the process was similar, but instead of using the training sample for transfer learning, it was used to train the models.
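The back-testing loop can be sketched as follows; the exact placement of the three test origins is our assumption, and the naive forecaster stands in for any of the fitted models:

```python
import numpy as np

def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

def backtest_mape(series: np.ndarray, fit_and_forecast, horizon: int = 12,
                  repetitions: int = 3) -> float:
    """Evaluate a model on three test windows of 12 future points each and
    average the MAPE over the repetitions."""
    scores = []
    for r in range(repetitions):
        cut = len(series) - horizon * (repetitions - r)
        train, test = series[:cut], series[cut:cut + horizon]
        scores.append(mape(test, fit_and_forecast(train, horizon)))
    return float(np.mean(scores))

# Example with a naive "repeat the last observation" forecaster.
naive = lambda train, h: np.repeat(train[-1], h)
print(backtest_mape(np.arange(1.0, 121.0), naive))
```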
For the transfer learning, only the last layer of the model was unfrozen, and the weights were updated with a learning rate of 0.000005, which was lower than the one used in the training phase. We decided to use pretrained models instead of training the models from scratch on the target time series, because it has been shown that, for monthly time series, pretrained deep learning models give better results [9].
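A sketch of the fine-tuning step under these settings (only the last layer unfrozen, learning rate 0.000005); the function wrapper, epoch budget, and validation split are illustrative assumptions:

```python
import tensorflow as tf

def fine_tune(pretrained: tf.keras.Model, X, y) -> tf.keras.Model:
    """Unfreeze only the last layer and update it with a small learning rate."""
    for layer in pretrained.layers[:-1]:
        layer.trainable = False
    pretrained.layers[-1].trainable = True
    pretrained.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
                       loss="mean_absolute_percentage_error")
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
    pretrained.fit(X, y, validation_split=0.2, epochs=50,
                   batch_size=1, callbacks=[early_stop])
    return pretrained
```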
The hyperparameters of each model were obtained with Bayesian optimization. The hyperparameters calibrated in the models during the training are in Table 1.
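As an illustration, part of the LSTM search space in Table 1 could be explored with a Bayesian optimization tuner such as KerasTuner; the library choice and the number of trials are our assumptions, not necessarily what was used here:

```python
import keras_tuner as kt
import tensorflow as tf

def build_lstm(hp):
    """Search space mirroring part of the LSTM rows of Table 1."""
    units = hp.Int("units", min_value=12, max_value=132, step=24)
    activation = hp.Choice("activation", ["linear", "relu", "tanh"])
    lr = hp.Choice("learning_rate", [1e-3, 1e-4, 1e-5])
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(12, 1)),
        tf.keras.layers.LSTM(units, activation=activation),
        tf.keras.layers.Dense(12),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="mean_absolute_percentage_error")
    return model

tuner = kt.BayesianOptimization(build_lstm, objective="val_loss", max_trials=20)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=30)
```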

3.4. Features for Time Series Characterization

Eleven features were chosen to characterize each time series. All the time series were standardized to a range between 2 and 3 before the estimation of the features so that they shared the same scale. The names and explanations of the features are given in Table 2.
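A small sketch of the preprocessing implied here: min-max rescaling of each series into the [2, 3] range and the computation of the "distance" feature of Table 2; the exact scaling formula is our assumption:

```python
import numpy as np

def rescale_2_3(series: np.ndarray) -> np.ndarray:
    """Min-max rescale a series into the [2, 3] range before feature estimation
    (the exact scaling routine used in the paper is an assumption here)."""
    lo, hi = series.min(), series.max()
    return 2.0 + (series - lo) / (hi - lo)

def distance_feature(target_vec: np.ndarray, train_vecs: np.ndarray) -> float:
    """Average Euclidean distance between one target feature vector and the
    feature vectors of the training series (the 'distance' feature of Table 2)."""
    return float(np.mean(np.linalg.norm(train_vecs - target_vec, axis=1)))
```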

3.5. Linear Regression

To analyze when deep learning models perform worse or better than other models, we computed linear regression models that used as the dependent variable the percentage change in the MAPE of each deep learning model with respect to each of the other models:
$$\text{percentage change} = \frac{\mathrm{MAPE}_{\text{deep learning}} - \mathrm{MAPE}_{\text{other model}}}{\mathrm{MAPE}_{\text{other model}}}$$
If the result is positive, the deep learning model generates an increment in the error; if it is negative, the deep learning model reduces the error.
For example, to compare the performance of TCN versus ARIMA, the percentage change was computed for each of the 2000 target time series. The result of this calculation is the dependent variable of the model. The independent variables are the time series features listed in Table 2. In this way, the coefficients and p-values of the regression model indicate whether the time series features have a statistically significant influence on the percentage change (the difference between TCN and ARIMA) and to what extent the features influence the differences between TCN and ARIMA. The models were fitted using ordinary least squares (OLS). We computed robust standard errors when the linear regression models violated the assumption of homoscedasticity.
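A sketch of this regression with statsmodels, using synthetic data and only three of the eleven features for brevity; the HC3 robust covariance choice is our assumption (any heteroscedasticity-consistent estimator would fit the description above):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic stand-ins: three of the eleven features for 2000 target series.
features = pd.DataFrame(rng.normal(size=(2000, 3)),
                        columns=["distance", "trend", "seasonal_strength"])
mape_deep = rng.uniform(5, 30, size=2000)    # e.g., TCN MAPE per series
mape_other = rng.uniform(5, 30, size=2000)   # e.g., ARIMA MAPE per series

# Dependent variable: percentage change in MAPE.
pct_change = (mape_deep - mape_other) / mape_other

# Standardize features so one unit equals one standard deviation, then fit OLS
# with heteroscedasticity-consistent (robust) standard errors.
X = sm.add_constant((features - features.mean()) / features.std())
ols = sm.OLS(pct_change, X).fit(cov_type="HC3")
print(ols.summary())
```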

4. Results

4.1. General Results

Figure 1 shows the average percentage change of the MAPE between models. Each bar in the figure shows the percentage change between a deep learning model and a statistical model or machine learning model. For example, the first bar indicates that the MAPE in the TCN models decreased almost 20% (on average) in relation to the XGBoost models.
The results suggest that the deep learning models surpass the machine learning models but not the statistical models. The fact that the statistical models have the best performance is contrary to the results of [9]; however, the methodology to evaluate the models differs between the two studies. In [9], the performance of the models was evaluated on seven fold instances without updating the model for each new instance, where each fold instance consists of 12 past input values $(X_{t-12}, X_{t-11}, \ldots, X_{t-1})$ and 12 future output values $(X_{t}, X_{t+1}, \ldots, X_{t+11})$. In this paper, however, the evaluation was on three instances with a horizon of 12, and the biggest difference was that the model was updated for each fold instance. Therefore, this result suggests that, when the models are updated for a new prediction, the statistical models tend to give the best results. Finally, in the comparison between deep learning models, the best in general terms was TCN, although its performance was similar to that of LSTM.
Table 3 and Table 4 show the coefficients of the multiple linear regression models. Each coefficient shows how the MAPE of the deep learning model changed when the feature increased by one unit. The features were standardized, which means that a change of one unit corresponds to one standard deviation. For example, the trend coefficient in Table 3 indicates that an increment of one standard deviation in the time series trend generates an increase of 8.0 percent in the MAPE of TCN in relation to the ARIMA MAPE. Therefore, ARIMA tends to work better than TCN as the strength of the trend increases. It is relevant to mention that the coefficients shaded in gray are significant at p < 0.05.

4.2. Comparison between Deep Learning Models and Statistical Models

The distance (Euclidean distance) between the target time series and the training dataset was the most influential feature when we compared the deep learning models to the ARIMA and ETS models. For example, in Table 3, we can see that the MAPE of TCN and LSTM increases by 19% and 27%, respectively, in relation to the ARIMA MAPE for each increment of one standard deviation in the distance. This result is reasonable, because deep learning models can generate worse results when a record differs from those used to create the model [18].
Two other significant variables were the outliers and normality. Increasing the number of outliers and the closeness of the time series to the normal distribution reduced the MAPE relative to the ARIMA and ETS models. On the other hand, an increment in the trend strength increased the MAPE in relation to the ARIMA MAPE, and an increment in gamma increased the MAPE in relation to the ETS MAPE. The comparisons of TCN and LSTM with the THETA models showed fewer significant variables. The most relevant result was that the seasonal strength favored the THETA model.

4.3. Comparison between Deep Learning Models and Machine Learning Models

We also found a result associated with the distance that is more difficult to explain. The finding suggests that the increase in distance favors deep learning models when compared against machine learning models (Table 4). This result occurred in five comparisons.
The most important factor affecting the performance of deep learning models is the trend. When the performance is compared with any of the machine learning models, the strength of the trend reduces the error. On the other hand, increments in nonlinearity, seasonal strength, and gamma increase the error of the deep learning models versus XGBoost and Random Forest.

5. Conclusions

The machine learning models showed lower accuracy than the pretrained deep learning models but could be more robust with less predictable time series. For example, variables like nonlinearity and seasonal_strength are associated with better performance of XGBoost and Random Forest, whereas the strength of the trend is negatively associated with the performance of the three machine learning models. The deep learning models could also work better with less predictable time series than the statistical models. For example, variables like the strength of the trend and seasonal strength are positively associated with the performance of the statistical models, and the presence of outliers is negatively associated.
Some transformations could improve the performance of pretrained models. The p-value of the normality test of the time series is positively associated with the performance of the deep learning models; therefore, any transformation that brings the time series closer to a normal distribution could improve the models.
This study offers findings that can help explain better why a pretrained deep learning model is better than another kind of model. Future studies should look at new variables that can help explain the differences between models. Additionally, new models, such as transformers, hybrid models, etc., could be incorporated. Another question to be answered is whether a model can be improved by excluding older time points when doing so changes the properties of the time series. The findings of these kinds of studies not only contribute to explaining when one model works better than another but can also provide evidence about possible transformations that improve the performance of deep learning models.

Author Contributions

Conceptualization, M.S. and L.-A.C.-V.; methodology, M.S. and L.-A.C.-V.; software, M.S.; validation, M.S.; formal analysis, M.S.; investigation, M.S. and L.-A.C.-V.; resources, M.S.; data curation, M.S.; writing—original draft preparation, M.S.; writing—review and editing, M.S. and L.-A.C.-V.; visualization, M.S.; supervision, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The time series used for training the models and evaluating their performance are available in the following repository: https://github.com/martin12cr/explainability.git.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Papacharalampous, G.; Tyralis, H.; Koutsoyiannis, D. Comparison of stochastic and machine learning methods for multi-step ahead forecasting of hydrological processes. Stoch. Environ. Res. Risk Assess. 2019, 33, 481–514. [Google Scholar] [CrossRef]
  2. Petropoulos, F.; Makridakis, S.; Assimakopoulos, V.; Nikolopoulos, K. Horses for Courses’ in demand forecasting. Eur. J. Oper. Res. 2014, 237, 152–163. [Google Scholar] [CrossRef]
  3. Makridakis, S.; Hibon, M. The M3-Competition: Results, conclusions and implications. Int. J. Forecast. 2000, 16, 451–476. [Google Scholar] [CrossRef]
  4. Crone, S.F.; Hibon, M.; Nikolopoulos, K. Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. Int. J. Forecast. 2011, 27, 635–660. [Google Scholar] [CrossRef]
  5. Sharma, A.; Jain, S.K. Deep Learning Approaches to Time Series Forecasting. In Recent Advances in Time Series Forecasting; CRC Press: Boca Raton, FL, USA, 2021. [Google Scholar] [CrossRef]
  6. Gamboa, J.C.B. Deep learning for time-series analysis. arXiv 2017, arXiv:1701.01887. [Google Scholar]
  7. Parmezan, A.R.S.; Souza, V.M.; Batista, G.E. Evaluation of statistical and machine learning models for time series prediction: Identifying the state-of-the-art and the best conditions for the use of each model. Inf. Sci. 2019, 484, 302–337. [Google Scholar] [CrossRef]
  8. Kiefer, D.; Grimm, F.; Bauer, M.; Van Dinther, C. Demand forecasting intermittent and lumpy time series: Comparing statistical, machine learning and deep learning methods. In Proceedings of the 54th Hawaii International Conference on System Sciences, Online, 4–9 January 2021. [Google Scholar] [CrossRef]
  9. Solís, M.; Calvo-Valverde, L.A. Performance of Deep Learning models with transfer learning for multiple-step-ahead forecasts in monthly time series. Intel. Artif. 2022, 25, 110–125. [Google Scholar] [CrossRef]
  10. Elsayed, S.; Thyssens, D.; Rashed, A.; Jomaa, H.S.; Schmidt-Thieme, L. Do we really need deep learning models for time series forecasting? arXiv 2021, arXiv:2101.02118. [Google Scholar]
  11. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. A comparison of ARIMA and LSTM in forecasting time series. In Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; IEEE: New York, NY, USA, 2018; pp. 1394–1401. [Google Scholar] [CrossRef]
  12. Kanavos, A.; Kounelis, F.; Iliadis, L.; Makris, C. Deep learning models for forecasting aviation demand time series. Neural Comput. Appl. 2021, 33, 329–343. [Google Scholar] [CrossRef]
  13. M4 Team. M4 Competitor’s Guide: Prizes and Rules. 2018. Available online: https://www.m4.unic.ac.cy/wpcontent/uploads/2018/03/M4-CompetitorsGuide.pdf (accessed on 25 January 2020).
  14. Wang, X.; Smith, K.; Hyndman, R. Characteristic-based clustering for time series data. Data Min. Knowl. Discov. 2006, 13, 335–364. [Google Scholar] [CrossRef]
  15. Narajewski, M.; Kley-Holsteg, J.; Ziel, F. tsrobprep—An R package for robust preprocessing of time series data. SoftwareX 2021, 16, 100809. [Google Scholar] [CrossRef]
  16. Komsta, L. Package ‘Outliers’. 2022. Available online: https://cran.r-project.org/web/packages/outliers/outliers.pdf (accessed on 25 January 2023).
  17. Yang, Y.; Hyndman, R. Introduction to the Tsfeatures Package. 2022. Available online: https://pkg.robjhyndman.com/tsfeatures/articles/tsfeatures.html (accessed on 30 August 2023).
  18. Day, O.; Khoshgoftaar, T.M. A survey on heterogeneous transfer learning. J. Big Data 2017, 4, 1–42. [Google Scholar] [CrossRef]
Figure 1. Average MAPE percentage change between the deep learning models and the other models.
Table 1. Grid search for the Bayesian optimization of machine learning and deep learning models.
CNN:
Number of layers between 1 and 2 (Batch normalization is applied after the first layer)
Filters between 12 and 132 with a step of 24
Kernel size between 2 and 12 with a step of 2
Max pooling with a size of 2 or without max pooling
Activation function among linear, ReLU, and tanh
Learning rate among 0.001, 0.0001, 0.00001
TCN:
Filters between 12 and 132 with a step of 24
Kernel size between 2 and 12 with a step of 2
Activation function among linear, ReLU, and tanh
Return sequences True or False
Learning rate among 0.001, 0.0001, 0.00001
Dilations of [1,2,4,8] or [1,2,4,8,16]
LSTM:
Recurrent units between 12 and 132 with a step of 24
Activation function among linear, ReLU, and tanh
Return sequences True or False
Learning rate among 0.001, 0.0001, 0.00001
XGBoost:
Max depth between 2 and 12
Learning rate between 0.01 and 1
estimators between 10 and 150
Support Vector Machines:
C between 0.01 and 10
Gamma between 0.001 and 0.33
Kernel = rbf
Random Forest:
estimators between 10 and 250
max_features between 1 and 15
min sample leaf between 2 and 8
max samples between 0.70 and 0.99
Table 2. Features computed for each time series.
Name | Explanation | Package
size | Number of time series points | -
stl_features_trend | The trend is a long-term change in the mean level [14]. This metric measures the strength of the trend; higher values mean a stronger trend. The formula applied is analogous to [14]: $Y_t = f_t + s_{1,t} + \cdots + s_{m,t} + e_t$, where $f_t$ = smoothed trend component, $s_{i,t}$ = i-th seasonal component, and $e_t$ = remainder component. $\text{Trend} = 1 - \frac{\mathrm{Var}(e_t)}{\mathrm{Var}(f_t + e_t)}$. | stl_features function from package tsfeatures
alpha_tsf | Alpha parameter of simple exponential smoothing. This feature measures the relevance of recent periods. High values mean more weight on recent lags for the forecast, and lower values imply a more even distribution of weight. | tsfeatures
kurtosis_fisher | A measure that characterizes the probability distribution of the time series. Higher values indicate more concentration around the mean. | PerformanceAnalytics
out | Averages the proportion of outliers according to two different measures: (a) the probability of being an outlier based on model-based clustering [15]; (b) the use of Student's t-test scores to find outliers [16]. | detect_outliers from package tsrobprep and scores from package outliers
pearson_test | The p-value of Pearson's normality test. Higher values indicate that the distribution of the time series is closer to the normal assumption. | pearson.test from nortest library
adf | The p-value of the augmented Dickey–Fuller test for the null hypothesis of a non-stationary time series. Higher values indicate that the time series is non-stationary. | adf.test from tseries package, k = 12
stl_features_linearity | Measures the linearity of a time series based on the coefficients of an orthogonal quadratic regression [17]. | tsfeatures
stl_features_seasonal_strength | Measures the strength of the seasonality; higher values mean higher strength. Following the decomposition of the time series described for the trend feature: $\text{seasonal} = 1 - \frac{\mathrm{Var}(e_t)}{\mathrm{Var}(s_{i,t} + e_t)}$. | tsfeatures
gamma_tsf | Gamma parameter of the ETS model. High values mean more weight on the recent seasonal period. | tsfeatures
white_noise | The p-value of the Box test. The null hypothesis is that the time series points are independently distributed. Higher values indicate that the points are independent and, therefore, the data may be white noise. | Box.test from tseries package, lag = 12
nonlinearity_tsf | Based on Teräsvirta's nonlinearity test, which takes larger values when the time series is nonlinear. | nonlinearity from tsfeatures
distance | A feature we created that measures the average Euclidean distance between the feature vector of a target time series and the 2000 feature vectors of the training dataset. Higher values represent larger distances. | -
Table 3. Coefficients of linear regression for comparison with statistical models.
Features | ARIMA (TCN / LSTM / CNN) | ETS (TCN / LSTM / CNN) | THETA (TCN / LSTM / CNN)
distance | 0.19 / 0.27 / 0.49 | 0.11 / 0.16 / 0.44 | 0.02 / 0.01 / 0.18
size | −0.01 / −0.05 / 0.04 | 0.03 / 0.00 / 0.11 | 0.03 / 0.02 / 0.08
trend | 0.08 / 0.09 / 0.14 | 0.03 / 0.02 / 0.08 | 0.04 / 0.02 / 0.08
alpha_tsf | 0.03 / 0.05 / 0.08 | 0.02 / 0.04 / 0.08 | −0.02 / 0.00 / 0.00
kurtosis_fisher | −0.04 / −0.05 / −0.21 | −0.07 / −0.09 / −0.23 | 0.01 / 0.03 / −0.09
out | −0.07 / −0.11 / −0.17 | −0.05 / −0.07 / −0.14 | 0.00 / 0.00 / −0.06
pearson_test | −0.07 / −0.09 / −0.18 | −0.04 / −0.06 / −0.16 | 0.00 / 0.01 / −0.06
adf | −0.02 / 0.00 / −0.03 | 0.00 / 0.02 / 0.00 | −0.03 / −0.01 / −0.04
linearity | −0.01 / −0.04 / 0.07 | 0.00 / −0.02 / 0.07 | −0.02 / −0.02 / 0.04
seasonal_strength | 0.03 / 0.00 / 0.16 | 0.01 / −0.01 / 0.13 | 0.06 / 0.06 / 0.18
gamma | 0.05 / 0.15 / 0.15 | 0.07 / 0.18 / 0.18 | −0.02 / 0.04 / 0.00
white_noise | −0.10 / −0.15 / −0.26 | −0.07 / −0.09 / −0.24 | −0.01 / −0.01 / −0.10
nonlinearity | −0.05 / −0.08 / −0.14 | −0.01 / −0.02 / −0.09 | 0.02 / 0.03 / 0.01
Note: Gray shadow coefficients mean p < 0.05.
Table 4. Coefficients of linear regression for comparison with machine learning models.
Features | XGBoost (TCN / LSTM / CNN) | SVM (TCN / LSTM / CNN) | RF (TCN / LSTM / CNN)
distance | −0.15 / −0.12 / −0.03 | −0.14 / −0.14 / −0.13 | −0.21 / −0.21 / −0.14
size | 0.08 / 0.06 / 0.12 | 0.02 / 0.01 / 0.03 | 0.11 / 0.10 / 0.17
trend | −0.19 / −0.23 / −0.22 | −0.28 / −0.29 / −0.30 | −0.23 / −0.27 / −0.28
alpha_tsf | 0.04 / 0.07 / 0.07 | 0.03 / 0.04 / 0.03 | 0.04 / 0.08 / 0.09
kurtosis_fisher | 0.08 / 0.07 / 0.01 | 0.09 / 0.10 / 0.08 | 0.10 / 0.11 / 0.05
out | 0.04 / 0.03 / 0.00 | 0.05 / 0.05 / 0.04 | 0.06 / 0.06 / 0.03
pearson_test | 0.04 / 0.03 / 0.00 | 0.06 / 0.06 / 0.06 | 0.06 / 0.06 / 0.03
adf | 0.02 / 0.04 / 0.04 | 0.01 / 0.01 / 0.02 | 0.02 / 0.03 / 0.03
linearity | −0.02 / −0.02 / 0.00 | −0.03 / −0.03 / −0.03 | 0.01 / 0.00 / 0.03
seasonal_strength | 0.05 / 0.05 / 0.12 | −0.02 / −0.02 / 0.00 | 0.04 / 0.04 / 0.11
gamma | 0.12 / 0.22 / 0.23 | 0.03 / 0.04 / 0.04 | 0.10 / 0.18 / 0.18
white_noise | 0.08 / 0.07 / 0.02 | 0.07 / 0.07 / 0.06 | 0.12 / 0.12 / 0.09
nonlinearity | 0.12 / 0.13 / 0.11 | 0.05 / 0.06 / 0.05 | 0.15 / 0.17 / 0.15
Note: Gray shadow coefficients mean p < 0.05.
