1 Introduction

Time series refers to a set of data points that are collected at regular time intervals over periods of time. Specifically, it is a sequence of observations recorded in successive time points that may either be continuous or discrete. Time series can be found across many disciplines, including meteorology [1, 2], econometrics [3, 4], energy consumption [5, 6], retail sales [7, 8], healthcare [9, 10], transportation [11, 12], and marketing [13, 14].

Time series can be classified into two categories: univariate and multivariate. A univariate time series involves one single variable observed across multiple units of time (e.g., daily stock prices). A multivariate time series, on the other hand, involves several variables observed over the same periods (e.g., temperature and air pressure over multiple locations) [15, 16]. Multivariate time series may be difficult to analyse due to the curse of dimensionality and the difficulty in capturing the relationships among the data’s different features [17]. As a result of these complex characteristics, accurate forecasting of multivariate time series is very challenging [18]. Several methods for time series forecasting have been proposed [19,20,21], including traditional statistical methods and deep learning models [22]. Traditional statistical methods such as the linear Auto-Regressive Integrated Moving Average (ARIMA) [23] and Vector Auto-Regression (VAR) [24] have been widely used for time series prediction; however, their performance is limited when dealing with high-dimensional and non-linear data [25, 26]. This matters because, in many real-world time series problems, the relationships between the variables are non-linear, and the temporal dependencies can change over time.

Deep learning models, with their ability to automatically learn hierarchical representations and capture complex patterns, have shown great potential in improving time series forecasting accuracy [27,28,29]. Models like Recurrent Neural Networks (RNNs) [30, 31], Convolutional Neural Networks (CNNs) [32, 33], and Temporal Convolutional Networks (TCNs) have been widely applied in this context [34, 35].

While deep learning offers significant potential for multivariate time series forecasting, traditional architectures have inherent limitations that hinder their effectiveness. Notably, recurrent neural networks (RNNs), despite their strength in sequence modeling, are susceptible to vanishing or exploding gradients during training [15]. Even LSTMs, which were designed to mitigate this problem, can struggle to capture the intricate interdependencies between multiple time series variables, a critical element for accurate forecasting [36]. Furthermore, LSTMs, when applied to high-dimensional multivariate data, encounter difficulties in learning long-term dependencies. This is partly due to their reliance on large weight matrices, which increase the risk of overfitting and limit their ability to capture distant temporal relationships [37, 38]. Similarly, convolutional neural networks (CNNs), with their constrained kernel sizes, struggle to grasp patterns across extended time lags, hindering their performance in longer-term forecasts [39, 40].

Capturing spatiotemporal dynamics is one of the fundamental challenges in time series forecasting research. From a theoretical perspective, real-world time series data often exhibit complex interdependencies that unfold across both spatial and temporal dimensions [41]. Whether modeling energy consumption patterns across interconnected grids [42], transportation flows within urban infrastructure networks [43], or ecological fluctuations between interlinked habitats, there are nonlocal spatial correlations that evolve dynamically over time. Traditional univariate and multivariate forecasting methodologies struggle to adequately account for such spatiotemporal intricacies [44].

In this paper, we address all these limitations by employing a hybrid deep learning model that fuses the spatial processing capabilities of Convolutional Neural Networks (CNNs) with the temporal modeling strengths of Recurrent Neural Networks (RNNs), creating a unified architecture that includes CNN_LSTM, CNN_BILSTM, and CNN_GRU configurations. Additionally, we enhance the model’s temporal reach by integrating Temporal Convolutional Networks (TCNs) with RNNs, resulting in hybrid TCN_RNN models such as TCN_LSTM, TCN_BILSTM, and TCN_GRU. These models are designed to effectively capture and analyze the spatiotemporal characteristics of multivariate time series data.

We will compare the effectiveness of our proposed hybrid model with leading deep learning models, including LSTM, GRU, and BiLSTM networks. This comparison aims to demonstrate the superiority of hybrid models in overcoming the challenges associated with spatiotemporal data in multivariate time series forecasting.

The structure of this paper is as follows: Sect. 2 presents an overview of existing research relevant to our work. Section 3 provides a comprehensive background of deep learning architectures, essential for understanding our proposed method. Section 4 presents our proposed method, detailing the architecture of the hybrid models. Section 5 describes the methodology employed in our study, including the datasets used, experimental design, and evaluation metrics. Section 6 presents the results obtained from our experiments, demonstrating the effectiveness of our proposed method. Finally, Sect. 7 concludes the paper.

2 Related Work

The field of time series forecasting has witnessed significant advancements, with researchers exploring the application of deep learning techniques to overcome the limitations of traditional statistical and machine learning approaches. This section provides an overview of the related work conducted in this area, discussing the advancements achieved and proposed solutions.

Statistical Models for Multivariate Time Series: Statistical models, such as Vector Auto-Regressive (VAR) [45, 46] and multivariate Exponential Smoothing (ES) [47, 48], have long been used for time series analysis. These models excel in capturing linear interdependencies among multiple time series variables. However, they are limited in their ability to model non-linear relationships and can suffer from overparameterization when dealing with high-dimensional data [49,50,51].

Traditional Machine Learning Models: Traditional machine learning models like Support Vector Machines (SVM) [52, 53] and k-Nearest Neighbors (kNN) have been adapted for time series forecasting [54]. These models provide non-linear modeling capabilities but often require extensive feature engineering and do not inherently capture sequential dependencies, which can be particularly limiting for complex time series data [55, 56].

Traditional Deep Learning Models: Deep learning models such as RNNs [11], LSTMs [57], and GRUs [58] have addressed some of the limitations of earlier statistical and machine learning approaches by effectively capturing long-term dependencies in sequential data. However, they can be computationally expensive, difficult to train, and may still encounter vanishing gradient issues in very long sequences [44].

3 Background of Deep Learning Architectures

This section presents a brief overview of the theory behind the selected deep learning models.

3.1 Long Short-Term Memory (LSTM)

LSTM was introduced to resolve the problems caused by vanishing and exploding gradients in RNNs [59]. It is recommended for modeling long-term dependencies, since chained memory blocks act as a memory that retains information from previous time steps. Each memory block contains a memory cell as well as three gates: an input gate (\(i_t\)), an output gate (\(o_t\)), and a forget gate (\(f_t\)).

First, the forget gate (\(f_t\)) decides which information from the previous cell state (\(C_{t-1}\)) is unnecessary and should be discarded, as shown in equation (1). This step aims to keep the model effective and scalable. Next, the input gate (\(i_t\)) determines which values should be updated, as shown in equation (2), and a \(\tanh \) layer generates a vector of new candidate values (\(\tilde{C}_t\)), as shown in equation (3). To update the cell state, the forget gate value (\(f_t\)) is multiplied by the previous cell state (\(C_{t-1}\)), and the product of the input gate value (\(i_t\)) and the candidate vector is added to the result, as shown in equation (4). Finally, the output gate (\(o_t\)) is computed as shown in equation (5), and the hidden state (\(h_t\)) is obtained by multiplying the output gate by \(\tanh \) of the current cell state (\(C_t\)), as shown in equation (6).

$$\begin{aligned} f_t= & {} \sigma (W_f \cdot [h_{t-1}, x_t] + b_f) \end{aligned}$$
(1)
$$\begin{aligned} i_t= & {} \sigma (W_i \cdot [h_{t-1}, x_t] + b_i) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{C}_t= & {} \tanh (W_C \cdot [h_{t-1}, x_t] + b_C) \end{aligned}$$
(3)
$$\begin{aligned} C_t= & {} f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t\end{aligned}$$
(4)
$$\begin{aligned} o_t= & {} \sigma (W_o \cdot [h_{t-1}, x_t] + b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t= & {} o_t \cdot \tanh (C_t) \end{aligned}$$
(6)

Here, \(h_{t-1}\) represents the output from the previous time step, \(x_t\) is the input at time t, and \([h_{t-1}, x_t]\) denotes their concatenation. The weight matrices for the input gate, output gate, forget gate, and candidate cell state are denoted by \(W_i\), \(W_o\), \(W_f\), and \(W_C\), respectively, and the corresponding bias terms are \(b_i\), \(b_o\), \(b_f\), and \(b_C\). The sigmoid activation function is represented by \(\sigma \), and \(\tanh \) denotes the hyperbolic tangent function. Figure 1 shows the architecture of the LSTM network.

Fig. 1 Architecture of LSTM network [60]
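
As an illustration of equations (1)-(6), the following minimal sketch performs one LSTM time step in plain Python. The function names (`lstm_step`, `affine`), the `params` dictionary layout, and the use of Python lists for vectors are assumptions made for this example; they are not the implementation used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W, a vector v and a bias b."""
    return [sum(w * x for w, x in zip(row, v)) + b_j for row, b_j in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following equations (1)-(6).

    params maps "f", "i", "C" and "o" to (W, b) pairs (hypothetical layout);
    every W has one row per hidden unit and acts on [h_{t-1}, x_t].
    """
    v = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]

    def layer(name, activation):
        W, b = params[name]
        return [activation(z) for z in affine(W, v, b)]

    f_t = layer("f", sigmoid)        # forget gate, eq. (1)
    i_t = layer("i", sigmoid)        # input gate, eq. (2)
    c_tilde = layer("C", math.tanh)  # candidate values, eq. (3)
    c_t = [f * c + i * g             # new cell state, eq. (4)
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = layer("o", sigmoid)        # output gate, eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # hidden state, eq. (6)
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate vector is zero, so the step simply halves the previous cell state. This gives a quick sanity check of the wiring.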

3.2 Gated Recurrent Unit (GRU)

The GRU is less complex than the LSTM, and its most significant advantage is that it requires less training time due to its simplified architecture [61]. The GRU consists of two main components: the update gate (\(u_t\)) and the reset gate (\(r_t\)). The update gate determines which information is necessary to carry forward to the next stage (equation (7)), while the reset gate decides how much of the past information to forget (equation (8)). Both gates depend on the current input and the previous hidden state.

$$\begin{aligned} u_t= & {} \sigma (w_u \cdot [h_{t-1}, x_t]) \end{aligned}$$
(7)
$$\begin{aligned} r_t= & {} \sigma (w_r \cdot [h_{t-1}, x_t]) \end{aligned}$$
(8)
$$\begin{aligned} h_t= & {} (1 - u_t) \cdot h_{t-1} + u_t \cdot \tanh (w_h \cdot [r_t \cdot h_{t-1}, x_t]) \end{aligned}$$
(9)

where \(w_u\) and \(w_r\) are the weights for the update and reset gates, respectively, and \(w_h\) is the weight of the candidate hidden state, in which the reset gate scales the previous hidden state. Figure 2 shows the architecture of the GRU network.

Fig. 2 Architecture of GRU network [62]
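
The GRU update can be sketched in the same way. This example assumes the standard formulation in which a separate candidate weight (called `W_h` here, a name introduced for illustration) acts on the reset-scaled previous state, and it omits bias terms as equations (7)-(9) do. It is a sketch under those assumptions rather than the code used in this study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, v):
    """Return W v for a list-of-rows matrix W and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_u, W_r, W_h):
    """One GRU time step; each W acts on a concatenation of state and input."""
    v = h_prev + x_t                                   # [h_{t-1}, x_t]
    u_t = [sigmoid(z) for z in linear(W_u, v)]         # update gate, eq. (7)
    r_t = [sigmoid(z) for z in linear(W_r, v)]         # reset gate, eq. (8)
    reset_state = [r * h for r, h in zip(r_t, h_prev)] + x_t
    h_cand = [math.tanh(z) for z in linear(W_h, reset_state)]  # candidate state
    # Interpolate between the old state and the candidate, eq. (9).
    return [(1 - u) * h + u * c for u, h, c in zip(u_t, h_prev, h_cand)]
```

Compared with the LSTM step, there is no separate cell state and one gate fewer, which is where the reduction in parameters and training time comes from.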

3.3 Bidirectional Long Short-term Memory (BiLSTM)

Bidirectional LSTMs (BiLSTMs) are a variant of the standard LSTM architecture that is well suited to modeling time series data. Unlike regular LSTMs, which can only use past context when processing sequential input, BiLSTMs contain two separate LSTM layers: one that processes the input sequence forwards in time and another that processes the sequence backwards in time [63].

This bidirectional design gives the BiLSTM access to both past and future context at each time step, as the forward layer summarizes the preceding time steps while the backward layer summarizes the following ones. The outputs from the two layers are then concatenated and fed into the next part of the network. This additional contextual information has been shown to improve BiLSTM performance over standard LSTMs on tasks involving sequential data, such as time series forecasting and natural language processing [64]. However, a potential limitation is that BiLSTMs require more parameters than regular LSTMs due to the separate forward and backward layers, which can increase the risk of overfitting on smaller datasets and require more computational resources to train. Figure 3 illustrates the bidirectional LSTM.

Fig. 3 Architecture of BiLSTM network [65]
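
The bidirectional wiring itself is independent of the recurrent cell. The sketch below shows only that wiring: it runs one recurrence forwards, another over the reversed sequence, re-aligns the backward states to the original time order, and concatenates the two per time step. The running-sum cell is a deliberately trivial stand-in for an LSTM cell, chosen so that the result can be read off by hand; all names are hypothetical.

```python
def run_direction(sequence, step, h0):
    """Apply a recurrent step along the sequence and keep every state."""
    states, h = [], h0
    for x in sequence:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(sequence, step_fwd, step_bwd, h0):
    """Concatenate forward and backward states at each time step."""
    fwd = run_direction(sequence, step_fwd, h0)
    # Process the reversed sequence, then flip back to the original order.
    bwd = run_direction(sequence[::-1], step_bwd, h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def running_sum(x, h):
    """Toy stand-in for a recurrent cell: accumulates the inputs seen so far."""
    return [h[0] + x]

outputs = bidirectional([1, 2, 3], running_sum, running_sum, [0])
print(outputs)  # [[1, 6], [3, 5], [6, 3]]
```

At the middle time step the forward half (3) has seen the inputs 1 and 2, while the backward half (5) has seen 3 and 2, which is the past-plus-future context described above.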

3.4 Convolutional Neural Network (CNN)

CNNs were originally developed for computer vision tasks to process grid-like data such as images [66]. The key aspects of a CNN include convolutional layers and pooling layers. Convolutional layers apply linear convolution operations between the input and learned filters (typically 3D arrays of weights) to extract local spatial features from the data [67]. The filters slide across the width and height of the input volume to generate a feature map at each position.

Pooling layers perform downsampling operations after convolutional layers to reduce the spatial size of the data and control overfitting [68]. Common pooling methods are max pooling, which outputs the maximum value from the region, and average pooling [5]. CNNs stack multiple convolutional and pooling layers to learn increasingly abstract features. The learned filters in a layer are adapted in subsequent layers to best model the input data.

The basic mathematical formulation of a 1D convolution operation is given in equation (10):

$$\begin{aligned} y[i] = (x * w)[i] = \sum _{k=-\infty }^{\infty } x[k] \cdot w[i-k] \end{aligned}$$
(10)

where x is the input, w is the filter, \(*\) denotes the convolution operation, and y is the output feature map. For multivariate time series with n variables at each time step, the input x would be a tensor of shape (number of samples, n, time steps). For time series forecasting, 1D convolutions along the temporal dimension allow CNNs to learn features directly from raw data without manual feature engineering [?]. CNNs have achieved state-of-the-art performance across domains by leveraging these properties. Figure 4 illustrates the CNN architecture.

Fig. 4 Architecture of CNN [44]
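
Equation (10) and the pooling step can be made concrete with a short example. The sketch below assumes finite sequences, so that terms outside the input or the filter are treated as zero, and non-overlapping max pooling; the function names are illustrative.

```python
def conv1d_full(x, w):
    """Discrete convolution y[i] = sum_k x[k] * w[i - k], as in equation (10)."""
    y = [0.0] * (len(x) + len(w) - 1)
    for k, x_k in enumerate(x):
        for j, w_j in enumerate(w):
            y[k + j] += x_k * w_j  # contributes to output index i = k + j
    return y

def max_pool1d(y, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(y[i:i + size]) for i in range(0, len(y) - size + 1, size)]

feature_map = conv1d_full([1, 2, 3], [1, 0, -1])
print(feature_map)                 # [1.0, 2.0, 2.0, -2.0, -3.0]
print(max_pool1d(feature_map, 2))  # [2.0, 2.0]
```

Note that most deep learning libraries implement the convolutional layer as a cross-correlation, that is, without flipping the filter. Because the filter weights are learned, the two conventions are equivalent in practice.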

3.5 Temporal Convolutional Network (TCN)

Temporal Convolutional Networks (TCNs) are a type of neural network that is particularly effective for time series prediction tasks. Unlike traditional recurrent neural networks (RNNs) like LSTMs and GRUs, TCNs use dilated convolutions to capture long-term dependencies in time series data [64].

The architecture of TCNs consists of multiple stacked convolutional layers. Each layer applies a dilated convolution operation, which allows the model to consider information from a wider range of previous time steps. The dilation rate increases exponentially with each layer, enabling the model to capture long-range dependencies. Accordingly, a 1D dilated convolution operation F on an element s of a sequence x can be defined as in equation (11):

$$\begin{aligned} F(s) = (x *_d f)(s) = \sum _{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \end{aligned}$$
(11)

Where \(f: \{0, \ldots , k - 1\} \rightarrow \mathbb {R}\) is the convolution kernel, k is the kernel size, d is the dilation factor, and \(s - d \cdot i\) represents the data of the past. d increases as the network gets deeper.

TCNs also employ residual connections that directly pass values from input to output of convolutional blocks, enabling very deep networks for capturing longer-term patterns [69]. Skip connections further allow information to skip over blocks to preserve temporal resolution.

In a TCN, the output at time t is computed as a function of the input at time t and the inputs at previous time steps only. This property ensures that the model’s predictions at a given time step are only influenced by past data, adhering to the causality principle typical of time series data. Figure 5 illustrates the TCN architecture.

Fig. 5 Temporal Convolutional Network (TCN) architecture [64]
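
The dilated causal convolution of equation (11) can be sketched as follows. The example assumes zero padding on the left, so that positions before the start of the sequence contribute nothing; the function name is illustrative.

```python
def dilated_causal_conv(x, f, d):
    """F(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i], with x = 0 before the start."""
    out = []
    for s in range(len(x)):
        total = 0.0
        for i, f_i in enumerate(f):
            idx = s - d * i          # only the present and past are accessed
            if idx >= 0:
                total += f_i * x[idx]
        out.append(total)
    return out

x = [1, 2, 3, 4, 5]
print(dilated_causal_conv(x, [1, 1], d=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

With kernel size 2 and dilation 2, each output adds the current value and the value two steps earlier. Changing a later input never alters an earlier output, which is the causality property described above.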

4 Proposed Method

In this section, we present the proposed hybrid TCN_RNN and CNN_RNN models, specifically CNN_LSTM, CNN_BiLSTM, CNN_GRU, TCN_LSTM, TCN_BiLSTM, and TCN_GRU.

4.1 Hybrid CNN_RNNs (LSTM, GRU or BiLSTM) Models

Our research proposes using hybrid CNN-RNN models, specifically CNN-LSTM, CNN-BiLSTM, and CNN-GRU, to predict multivariate time series. These models combine the strengths of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to handle the high-dimensional and spatiotemporal aspects of multivariate time series data.

The architecture of these models typically consists of three main components: the convolutional layers, the recurrent layers, and the output layer. The convolutional layers extract spatial features from the input data, while the recurrent layers capture the temporal dependencies.

In the case of CNN-LSTM, the input data is first fed into the CNN layers, which perform convolutional operations to extract relevant spatial features. The output of the CNN layers is then passed to the LSTM, which enables the model to capture long-term dependencies in the temporal dimension. The LSTM layers process the sequential information and generate hidden states that are passed through time. Finally, the output layer, typically composed of fully connected layers, produces the predicted values for the multivariate time series. Figure 6 illustrates the proposed architecture, which combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for time series analysis and forecasting. Substituting the LSTM layer with a GRU or BiLSTM layer offers a different set of benefits.

Fig. 6 Architecture of hybrid CNN-LSTM for time series forecasting

Similarly, CNN-BiLSTM incorporates bidirectional LSTM, which processes the input data in both forward and backward directions. This allows the model to capture temporal dependencies from past to future and from future to past, enhancing its ability to learn complex patterns in the data. On the other hand, CNN-GRU models replace the LSTM with gated recurrent units (GRUs). GRUs have a simplified architecture compared to LSTMs, making them computationally more efficient while still capturing temporal dependencies effectively.

These hybrid models address the challenges of high dimensionality and spatiotemporal dependencies in multivariate time series prediction in several ways. First, by leveraging the convolutional layers, they can automatically extract relevant spatial features from high-dimensional input data, reducing the dimensionality and focusing on the most informative aspects. Second, the recurrent layers, such as LSTM and GRU, enable the models to capture the temporal dependencies and patterns in the data across different time steps. Lastly, integrating both CNN and RNN components within the hybrid architecture ensures that the model can effectively capture spatial and temporal correlations, leading to improved prediction performance.
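
To illustrate how data flows through such a hybrid, the sketch below traces per-sample tensor shapes through a convolution, a pooling layer, a recurrent layer, and a dense output layer. It assumes an unpadded ("valid") 1D convolution with stride 1, non-overlapping pooling, and a recurrent layer that returns only its final hidden state. All layer sizes in the example call are hypothetical and are not the tuned values reported in Table 3.

```python
def cnn_lstm_shapes(window, n_features, filters, kernel_size,
                    pool_size, lstm_units, horizon):
    """Trace per-sample shapes through a Conv1D -> MaxPool -> LSTM -> Dense stack."""
    conv_len = window - kernel_size + 1   # unpadded convolution along time
    pool_len = conv_len // pool_size      # non-overlapping pooling
    if pool_len < 1:
        raise ValueError("window too short for this kernel and pool size")
    return [
        ("input", (window, n_features)),
        ("conv1d", (conv_len, filters)),
        ("max_pool", (pool_len, filters)),
        ("lstm", (lstm_units,)),          # final hidden state only
        ("dense", (horizon,)),            # one value per forecast step
    ]

for name, shape in cnn_lstm_shapes(window=24, n_features=8, filters=64,
                                   kernel_size=3, pool_size=2,
                                   lstm_units=50, horizon=24):
    print(f"{name:9s} {shape}")
```

The trace shows the dimensionality reduction described above: the convolution and pooling layers shorten the time axis from 24 to 11 steps before the recurrent layer processes the sequence.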

4.2 Hybrid TCN_RNNs (LSTM, GRU or BiLSTM) Models

In this paper, we introduce hybrid Temporal Convolutional Network (TCN)-RNN models, such as TCN-LSTM, TCN-BiLSTM, and TCN-GRU, to tackle multivariate time series forecasting. These models address the challenges of high dimensionality, spatiotemporal dependencies, and other complexities inherent in multivariate time series. The architecture of a hybrid TCN-RNN model combines the parallel computation benefits of TCNs with the sequential data processing capabilities of RNNs. TCNs use causal convolutions, ensuring that predictions at a given time step are influenced only by past data, preserving the temporal order of events. In addition, dilated convolutions expand the receptive field exponentially with depth without increasing the number of parameters per layer.
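
The effect of dilation on the receptive field can be checked with a short calculation. The sketch assumes one dilated causal convolution per layer and dilation factors that double with depth (1, 2, 4, ...); the kernel size, depth, and channel count are illustrative and are not taken from the experiments.

```python
def receptive_field(kernel_size, dilations):
    """Time steps (present plus past) visible to the output of the last layer."""
    return 1 + (kernel_size - 1) * sum(dilations)

def weights_per_layer(kernel_size, channels):
    """Weights of one 1D convolution with equal input and output channels."""
    return kernel_size * channels * channels  # does not depend on the dilation

kernel_size, n_layers = 3, 6
dilated = [2 ** level for level in range(n_layers)]  # 1, 2, 4, 8, 16, 32
undilated = [1] * n_layers

print(receptive_field(kernel_size, dilated))    # 127
print(receptive_field(kernel_size, undilated))  # 13
print(weights_per_layer(kernel_size, 32))       # 3072 in both cases
```

Under these assumptions, six dilated layers cover 127 time steps, compared with 13 for six ordinary convolutions with the same number of weights.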

On top of the TCN architecture, we integrate RNNs, such as LSTM, BiLSTM, and GRU, to capture both forward and backward temporal dependencies, further enriching the model’s predictive capacity. The LSTM, for instance, includes gates that regulate the flow of information. In Fig. 7, we can observe the overall structure of the hybrid TCN-LSTM architecture. The input time series data is passed through the TCN layers responsible for capturing local patterns and conducting temporal convolutions. Similarly, replacing the LSTM layer with a GRU or BiLSTM layer offers a different set of advantages.

Fig. 7 Architecture of hybrid TCN-LSTM for time series forecasting

These hybrid TCN-RNN models effectively manage the high-dimensional nature of multivariate time series by capturing relevant features through the TCN layers and modeling temporal sequences via RNNs. The spatiotemporal component is addressed by the TCN’s ability to process multiple time steps simultaneously and by the RNN’s proficiency in capturing temporal sequences, making the models suitable for forecasting tasks where both spatial and temporal factors are critical.

5 Experiments

In the Experiments section, we describe the dataset used for evaluation, the experimental setting, and the evaluation metric employed to assess the performance of our proposed method.

5.1 Dataset description

In our study, we employ two distinct datasets: one focusing on air quality and the other on traffic volume. The air quality dataset (Footnote 1), which spans from March 2004 to February 2005, provides historical data on various air pollutants, including carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and others. This dataset is derived from hourly averages collected by a set of five metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device, positioned at road level in a significantly polluted area within an Italian city. The traffic volume dataset, on the other hand, is designed to analyze and manage traffic flow, offering insights into the dynamics of vehicular movement and congestion patterns. By leveraging these datasets, we aim to develop and evaluate models that can effectively predict both air quality and traffic volume, contributing to environmental sustainability and efficient urban planning.

The traffic volume dataset (Footnote 2) contains hourly Interstate 94 westbound traffic volumes from 2012 to 2018, recorded at MN DoT ATR station 301, roughly halfway between Minneapolis and St Paul, Minnesota. The traffic volume is affected by hourly weather conditions and holidays. A detailed description of the datasets is presented in Table 1. Traffic volume is the target output, and the other variables are treated as the model’s inputs. The rain and snow columns were dropped because their minimum, Q1, Q2, and Q3 values were all zero.

Table 1 Dataset description

5.2 Data preprocessing

Data preprocessing plays a vital role in the overall effectiveness of deep learning algorithms. In this experiment, the data are preprocessed as follows. First, missing values are replaced with the mean of the values immediately before and after the missing value (known as the near mean approach). Outliers, which can disrupt the learning process, are identified and treated as missing values. Categorical data are difficult for deep learning algorithms to represent directly, so one-hot encoding is employed. This method transforms a categorical feature with n possible values into n indicator features, exactly one of which is active at a time. Here, one-hot encoding is applied to the categorical features holiday, weather_main, and weather_description. Finally, we normalize the data to ensure a fair comparison and appropriate weighting of variables with different scales, scaling each variable to values between 0 and 1 [70]. Deep learning networks are sensitive to the scale of their inputs, and the min-max scaler suits these networks. The min-max scaler is shown in equation (12):

$$\begin{aligned} x_{\textrm{scaled}} = \frac{x - x_{\textrm{min}}}{x_{\textrm{max}} - x_{\textrm{min}}} \end{aligned}$$
(12)

where x represents the original data, \(x_{\textrm{min}}\) and \(x_{\textrm{max}}\) are its minimum and maximum values, and \(x_{\textrm{scaled}}\) represents the normalized data.
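
The three preprocessing steps can be sketched in plain Python as follows. The function names are illustrative, missing values are represented by `None`, and the boundary rule (falling back to the single available neighbour when a gap sits at the start or end of the series) is an assumption that the text does not specify.

```python
def near_mean_impute(values):
    """Fill each None with the mean of the nearest observed values on each side."""
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            before = next((values[j] for j in range(i - 1, -1, -1)
                           if values[j] is not None), None)
            after = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            # Assumption: at a boundary only one neighbour exists, so use it alone.
            neighbours = [n for n in (before, after) if n is not None]
            out[i] = sum(neighbours) / len(neighbours)
    return out

def one_hot(labels):
    """Map n distinct labels to n indicator features (categories sorted by name)."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return rows, categories

def min_max_scale(values):
    """Equation (12); assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(near_mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
print(one_hot(["Rain", "Clear", "Rain"]))  # ([[0, 1], [1, 0], [0, 1]], ['Clear', 'Rain'])
print(min_max_scale([10, 20, 30]))         # [0.0, 0.5, 1.0]
```

In a forecasting setting, the minimum and maximum used for scaling are usually computed on the training portion only and then reused for the validation data, so that no information leaks from the held-out period.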

To forecast traffic flow in the traffic volume dataset and the true hourly averaged CO concentration (in mg/\(m^3\)) in the Air Quality dataset for the next h time steps, a sliding window approach is employed to transform the time series into input–output pairs. A window of size w is considered, where \(x_t\) represents the time-series data at time t, h denotes the forecasting horizon, and f indicates the deep learning model established through training. The input–output pairs can be represented as in equation (13):

$$\begin{aligned} {[}x_{t+1}, x_{t+2}, \ldots , x_{t+h} ] = f(x_t, x_{t-1}, \ldots , x_{t-w+1}) \end{aligned}$$
(13)
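
A minimal sketch of the sliding window transformation is given below. The function name `make_windows` is illustrative; each element of `series` may be a single value or, in the multivariate case, a row of features.

```python
def make_windows(series, w, h):
    """Build (input, target) pairs: w past observations and the next h observations."""
    pairs = []
    # t is the index of the first target; the inputs are the w values before it.
    for t in range(w, len(series) - h + 1):
        pairs.append((series[t - w:t], series[t:t + h]))
    return pairs

for inputs, targets in make_windows([0, 1, 2, 3, 4, 5], w=3, h=2):
    print(inputs, "->", targets)
# [0, 1, 2] -> [3, 4]
# [1, 2, 3] -> [4, 5]
```

A series of length T yields T - w - h + 1 pairs, so long windows or horizons reduce the number of training samples.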

5.3 Experimental Setting

This study aims to implement a hybrid deep learning model for multivariate time series forecasting. For each model, 80% of the training data is selected for training, and 20% is selected for validation. Fig. 8 illustrates the process of reshaping and splitting the dataset for further analysis and model training.

Fig. 8 Data reshape and split

The selection of hyperparameters plays a critical role in the performance of any deep learning algorithm. We tuned our forecasting models with a grid search [71], manually evaluating a series of hyperparameter combinations and selecting the configuration that performed best across the whole dataset. We use the Adam optimization algorithm to optimize the model parameters [72], as it adapts the learning rate during training. Table 2 outlines the hyperparameter search space for the deep learning models employed in this research, detailing the range of parameters explored to optimize model performance.

Table 2 Hyperparameter search space for deep learning models (grid search)

Table 3 delineates the optimal hyperparameters for the hybrid deep learning models applied to our datasets, showcasing the configuration that yielded the best performance.

We use Multi-Layer, Convolutional, and Recurrent Networks as basic building blocks and then combine them into heterogeneous architectures with different variants, trained with strategies such as dropout (drop_out = 0.2), skip connections, early stopping, adaptive learning rates, and filters and kernels of various sizes, among others. This study uses the ReLU activation function since it most effectively forecasts noisy non-stationary time series [73]. In addition, ReLU reduces training time, simplifying the model.

Table 3 Hyperparameter selections using the grid search

5.3.1 Baseline Models

Multilayer Perceptron (MLP): MLP is a type of artificial neural network that consists of multiple layers of interconnected nodes. It is commonly used for regression and classification tasks, including time-series forecasting. MLP learns complex nonlinear relationships between input and output variables by adjusting the weights and biases of the network during training, which gives it the ability to capture intricate patterns and dependencies in the data [74].

Support Vector Regression (SVR): SVR is a machine learning algorithm that extends the concept of support vector machines (SVM) to regression problems. It aims to find a function that best fits the training data while minimizing the error and maximizing the margin between the data points. SVR is particularly effective when dealing with nonlinear relationships in time-series data [75].

Vector Autoregression (VAR): VAR employs statistical modeling to examine and forecast the interconnectedness among multiple time series variables. VAR models leverage the lagged values of each variable as predictors to anticipate their future states [76].

Autoregressive Integrated Moving Average (ARIMA): The ARIMA model is a fundamental statistical tool for time-series analysis that combines autoregressive, differencing, and moving average components. It is widely utilized in time-series forecasting to capture and model the underlying patterns and trends within the data. ARIMA is particularly effective when dealing with stationary time series data, where the mean and variance remain constant over time [77].

Seasonal Autoregressive Integrated Moving Average with eXogenous variables (SARIMAX): SARIMAX builds upon the ARIMA model by integrating seasonal components and external variables. It is particularly useful for time-series forecasting when the data exhibits seasonal patterns and incorporates additional information (exogenous variables) to enhance prediction accuracy [78].

Table 4 presents a structured overview of the hyperparameters for the baseline models, including their names, descriptions, and default values where applicable.

Table 4 Hyperparameters of Baseline Models

5.4 Evaluation Metric

The evaluation of a model’s performance is a critical aspect of assessing its effectiveness. Several commonly employed evaluation metrics in the literature for time series models include the Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (\(R^2\)) [79].

Mean Squared Error (MSE): MSE measures the average squared difference between the predicted values and the actual target values. Equation (14) defines the MSE as:

$$\begin{aligned} \textrm{MSE} = \frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2 \end{aligned}$$
(14)

where n is the number of data points, \(y_i\) is the actual target value, and \(\hat{y}_i\) is the predicted value.

Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted values and the actual target values. Equation (15) defines the MAE as:

$$\begin{aligned} \textrm{MAE} = \frac{1}{n} \sum _{i=1}^{n} |y_i - \hat{y}_i| \end{aligned}$$
(15)

R-squared (\(R^2\)): \(R^2\) measures the proportion of the variance in the target variable that is predictable from the predictors. \(R^2\) is at most 1, with higher values indicating a better fit of the model to the data; it becomes negative when the model fits worse than a constant prediction of the mean. Equation (16) defines \(R^2\) as:

$$\begin{aligned} R^2 = 1 - \frac{\sum _{i=1}^{n} \left( y_i - \hat{y}_i \right) ^2}{\sum _{i=1}^{n} \left( y_i - \bar{y} \right) ^2} \end{aligned}$$
(16)

where \(\bar{y}\) is the mean of the actual target values.

Mean Absolute Percentage Error (MAPE): MAPE measures the average percentage error between the predicted values and the actual target values. Equation (17) defines the MAPE as:

$$\begin{aligned} \textrm{MAPE} = \frac{1}{n} \sum _{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right| }{\left| y_i \right| } \end{aligned}$$
(17)
Table 5 Results summary of all methods with two datasets (24 time steps)
Fig. 9 Comparison of forecasting values and actual values for traffic flow (Timesteps = 24)

Fig. 10 Comparison of forecasting values and actual values for Air Quality (Timesteps = 24)
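
The metrics of equations (14)-(17) can be computed as in the following sketch. RMSE, which is also reported in the results, is included as the square root of the MSE. The function names and sample values are illustrative, and MAPE is returned as a fraction, exactly as equation (17) is written.

```python
import math

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean) ** 2 for a in y)              # total sum of squares
    return 1 - ss_res / ss_tot

def mape(y, y_hat):
    """Undefined when an actual value is zero."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

actual = [1.0, 2.0, 3.0, 4.0]
good = [1.0, 2.0, 3.0, 6.0]
bad = [4.0, 3.0, 2.0, 1.0]

print(mse(actual, good), mae(actual, good), mape(actual, good))  # 1.0 0.5 0.125
print(r2(actual, bad))                                           # -3.0
```

The second call illustrates that \(R^2\) is negative whenever the predictions are worse than always predicting the mean of the actual values. MAPE becomes very large when actual values are close to zero, which should be kept in mind when comparing MAPE across datasets.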

6 Results and Discussion

In this study, we comprehensively evaluated various deep learning models for time series forecasting on two datasets: Traffic Volume and Air Quality. The models were assessed using several evaluation metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and R-squared (\(R^2\)).

Table 5 presents the evaluation results for the different models, categorized into baseline models, state-of-the-art neural networks, and our proposed models.

The baseline models, including Multilayer Perceptron (MLP), Support Vector Regression (SVR), Vector Autoregression (VAR), Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average with eXogenous variables (SARIMAX), showed relatively poor performance on both datasets. For Traffic Volume, they had high MSE, RMSE, MAE, and MAPE values, and negative \(R^2\) values, indicating that they did not fit the data well. Similarly, for Air Quality, they had high MSE, RMSE, MAE, and MAPE values, and low \(R^2\) values, suggesting that they struggled to capture the underlying patterns in the data. Among the statistical baselines, SARIMAX achieved the best performance for the Air Quality dataset, with an MSE of 1.5, RMSE of 1.2, MAE of 0.9, MAPE of 113, and an \(R^2\) of 0.2. In comparison, the MLP and SVR models demonstrated better performance on both datasets. MLP achieved an MSE of 0.0125, RMSE of 0.0751, MAE of 0.111, and an \(R^2\) of 0.80 for Traffic Volume, and an MSE of 0.0059, RMSE of 0.065, MAE of 0.076, and an \(R^2\) of 0.74 for Air Quality. SVR, on the other hand, had an MSE of 38.4, RMSE of 8, MAE of 9.02, and an \(R^2\) of 0.32 for Traffic Volume, and an MSE of 24.5, RMSE of 4.9, MAE of 4.02, and an \(R^2\) of 0.30 for Air Quality.

The state-of-the-art models, including Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BiLSTM), and Temporal Convolutional Network (TCN), showed significant improvements over the baseline models. For Traffic Volume, they had lower MSE, RMSE, MAE, and MAPE values and higher \(R^2\) values. For Air Quality, they also improved on all metrics. Among these models, TCN performed best on both datasets. For the Traffic Volume dataset, TCN achieved an MSE of 0.0023, RMSE of 0.047, MAE of 0.36, MAPE of 188, and an \(R^2\) of 0.950. Similarly, for the Air Quality dataset, TCN obtained an MSE of 0.0018, RMSE of 0.0424, MAE of 0.0299, MAPE of 31, and an \(R^2\) of 0.93.

Our proposed models, which combine a convolutional component (CNN or TCN) with a recurrent component (LSTM, BiLSTM, or GRU), showed notable improvements in forecasting performance. For the Traffic Volume dataset, the CNN-BiLSTM model achieved an MSE of 0.0020, RMSE of 0.045, and MAE of 0.030, along with an \(R^2\) of 0.960. The TCN-BiLSTM model performed best, with an MSE of 0.0018, RMSE of 0.042, MAE of 0.022, MAPE of 178, and the highest \(R^2\) of 0.976.

For the Air Quality dataset, the CNN-LSTM model achieved the lowest MSE of 0.0012, while the TCN-BiLSTM model outperformed all others with an RMSE of 0.422, MAE of 0.027, MAPE of 29, and the highest \(R^2\) of 0.94.

Figs. 9 and 10 compare the forecast and actual values for Traffic Flow and Air Quality, respectively.

Fig. 9: Comparison of forecasting values and actual values for Traffic Flow (Timesteps = 24). The sub-figures show the actual vs. predicted values for each state-of-the-art model (LSTM, GRU, BiLSTM, TCN) and each of our proposed models (CNN-LSTM, CNN-BiLSTM, CNN-GRU, TCN-LSTM, TCN-BiLSTM, TCN-GRU).

The TCN-BiLSTM model achieves the closest match between actual and predicted values, demonstrating its superior performance for Traffic Flow forecasting. The CNN-BiLSTM and TCN-LSTM models also show good accuracy, while the LSTM, GRU, and BiLSTM models have larger forecasting errors.

Fig. 10: Comparison of forecasting values and actual values for Air Quality (Timesteps = 24).

The TCN-BiLSTM model again achieves the closest match between actual and predicted values, demonstrating its effectiveness for Air Quality forecasting. The CNN-BiLSTM and TCN-LSTM models also show good accuracy, while the LSTM, GRU, and BiLSTM models have larger forecasting errors.

Overall, the visual comparisons in Figs. 9 and 10 are consistent with the quantitative results in Table 5. The TCN-BiLSTM model achieves the closest match between actual and predicted values on both datasets. The CNN-BiLSTM and TCN-LSTM models also show promising performance. These findings further support the effectiveness of the proposed hybrid CNN-RNN and TCN-RNN models for multivariate time series forecasting.

The superior performance of our proposed hybrid models can be attributed to their ability to effectively capture both spatial and temporal dependencies in the multivariate time series data. By combining convolutional and recurrent modules, these architectures can extract relevant features and dynamics, leading to more accurate forecasting results.
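To make the data flow of such a hybrid concrete, the following is a minimal, untrained sketch in plain Python: a causal dilated convolution extracts features from an input window, a recurrent pass summarizes them over time, and a linear readout produces a one-step forecast. It is an illustration only and not the architecture evaluated in this paper. The simple tanh recurrent cell is a stand-in for the LSTM, BiLSTM, and GRU layers, and all names, sizes, and the random weights (`init_params`, `hybrid_forecast`, four filters, dilation 2) are hypothetical.

```python
import math
import random


def causal_conv1d(window, kernels, dilation=1):
    """Causal dilated 1D convolution followed by ReLU.

    window:  T time steps, each a list of C input features.
    kernels: F filters, each a list of K taps, each tap a list of C weights.
    Returns T time steps with F features each; step t only sees steps <= t.
    """
    out = []
    for t in range(len(window)):
        step = []
        for taps in kernels:
            acc = 0.0
            for k, weights in enumerate(taps):
                src = t - k * dilation  # look back only, never ahead
                if src >= 0:  # positions before the window act as zero padding
                    acc += sum(w * x for w, x in zip(weights, window[src]))
            step.append(max(0.0, acc))
        out.append(step)
    return out


def recurrent_pass(features, w_in, w_rec):
    """Run a simple tanh recurrent cell over time and return the last hidden state."""
    hidden = len(w_rec)
    h = [0.0] * hidden
    for x in features:
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][i] * h[i] for i in range(hidden))
            )
            for j in range(hidden)
        ]
    return h


def hybrid_forecast(window, params):
    """Convolutional features -> recurrent summary -> linear readout."""
    feats = causal_conv1d(window, params["kernels"], params["dilation"])
    h = recurrent_pass(feats, params["w_in"], params["w_rec"])
    return sum(w * v for w, v in zip(params["w_out"], h)) + params["b_out"]


def init_params(n_inputs, n_filters=4, kernel_size=3, hidden=5, dilation=2, seed=0):
    """Random, untrained weights; every size here is an illustrative assumption."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(-0.5, 0.5)

    return {
        "dilation": dilation,
        "kernels": [[[u() for _ in range(n_inputs)] for _ in range(kernel_size)]
                    for _ in range(n_filters)],
        "w_in": [[u() for _ in range(n_filters)] for _ in range(hidden)],
        "w_rec": [[u() for _ in range(hidden)] for _ in range(hidden)],
        "w_out": [u() for _ in range(hidden)],
        "b_out": 0.0,
    }


# A synthetic window of 24 time steps with 3 variables.
params = init_params(n_inputs=3)
window = [[math.sin(0.3 * t), math.cos(0.2 * t), 0.01 * t] for t in range(24)]
print(hybrid_forecast(window, params))
```

In this sketch the convolution mixes all input variables within a short, dilated look-back at each time step, while the recurrent pass carries information across the whole window. In practice the weights would be learned, and a framework implementation with gated recurrent layers would replace the hand-written loops.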

7 Conclusion

In this paper, we comprehensively evaluated various deep learning models for multivariate time series forecasting on Traffic Volume and Air Quality datasets. Our proposed hybrid CNN-RNN and TCN-RNN models significantly outperformed both baseline and state-of-the-art models. These results highlight the effectiveness of combining convolutional and recurrent modules for multivariate time series forecasting. The hybrid models were able to capture both spatial and temporal dependencies, leading to more accurate predictions. The TCN-BiLSTM model achieved the best overall performance, with the highest \(R^2\) values on both datasets. The CNN-BiLSTM and TCN-LSTM models also showed promising results.

The hybrid models presented in our paper have demonstrated an outstanding ability to address some of the traditional challenges faced by other models in time series forecasting. These challenges include capturing complex non-linear patterns, dealing with noisy data, and accounting for the seasonality and trend components inherent in many time series datasets.

In essence, these hybrid models combine the best of both worlds: the feature extraction capabilities of CNNs and TCNs and the sequential data processing strengths of RNNs. This results in a more robust and accurate forecasting model that can outperform models relying on a single approach. As evidenced by our results, the hybrid models achieved lower error rates and higher \(R^2\) values, indicating better predictive performance on both datasets.

While our proposed hybrid models have shown promising results, there are several avenues for future research to further improve the accuracy and efficiency of time series forecasting. Firstly, we plan to explore the application of Transformer-based models, which have recently shown great success in natural language processing and computer vision tasks. By leveraging the self-attention mechanism, Transformer models can potentially capture more complex dependencies in multivariate time series data, leading to more accurate forecasting results. Secondly, we aim to investigate the feasibility of real-time time series forecasting using our proposed models. Developing efficient and scalable architectures would enable real-time forecasting applications in various domains.