Transformer-Based Spatiotemporal Graph Diffusion Convolution Network for Traffic Flow Forecasting

Wei, Siwei; Yang, Yang; Liu, Donghua; Deng, Ke; Wang, Chunzhi

doi:10.3390/electronics13163151

by Siwei Wei ^1,2, Yang Yang ^1, Donghua Liu ^3,*, Ke Deng ^1 and Chunzhi Wang ^1

^1 School of Computer Science, Hubei University of Technology, Wuhan 430070, China

^2 CCCC Second Highway Consultants Co., Ltd., Wuhan 430056, China

^3 Information Center, China Waterborne Transport Research Institute, Beijing 100080, China

^* Author to whom correspondence should be addressed.

Electronics 2024, 13(16), 3151; https://doi.org/10.3390/electronics13163151

Submission received: 28 June 2024 / Revised: 28 July 2024 / Accepted: 5 August 2024 / Published: 9 August 2024

Abstract:

Accurate traffic flow forecasting is a crucial component of intelligent transportation systems, playing a pivotal role in enhancing transportation intelligence. The integration of Graph Neural Networks (GNNs) and Transformers in traffic flow forecasting has gained significant adoption for enhancing prediction accuracy. Yet, the complex spatial and temporal dependencies present in traffic data continue to pose substantial challenges: (1) Most GNN-based methods assume that the graph structure reflects the actual dependencies between nodes, overlooking the complex dependencies present in the real-world context. (2) Standard time-series models are unable to effectively model complex temporal dependencies, hindering prediction accuracy. To tackle these challenges, the authors propose a novel Transformer-based Spatiotemporal Graph Diffusion Convolution Network (TSGDC) for Traffic Flow Forecasting, which leverages graph diffusion and transformer to capture the complexity and dynamics of spatial and temporal patterns, thereby enhancing prediction performance. The authors designed an Efficient Channel Attention (ECA) that learns separately from the feature dimensions collected by traffic sensors and the temporal dimensions of traffic data, aiding in spatiotemporal modeling. Chebyshev Graph Diffusion Convolution (GDC) is used to capture the complex dependencies within the spatial distribution. Sequence decomposition blocks, as internal operations of transformers, are employed to gradually extract long-term stable trends from hidden complex variables. Additionally, by integrating multi-scale dependencies, including recent, daily, and weekly patterns, accurate traffic flow predictions are achieved. Experimental results on various public datasets show that TSGDC outperforms conventional traffic forecasting models, particularly in accuracy and robustness.

Keywords:

traffic flow forecasting; graph diffusion convolution; efficient channel attention; transformer

1. Introduction

As a crucial element of intelligent transportation systems (ITS) [1], traffic flow forecasting entails using extensive observational data from sensors and devices in the road network to predict future traffic conditions over a defined period [2,3]. Accurate traffic flow forecasting is crucial for urban transportation management departments to optimize traffic planning and management strategies, enhance traffic safety, and improve traffic efficiency. Furthermore, it benefits travelers by enabling them to plan their routes ahead, reducing congestion and enhancing travel efficiency.

Recently, traffic flow prediction has become a focal point in academia. Researchers have employed methods such as statistics [4,5], machine learning, and deep learning to model traffic flow, capturing its temporal and spatial characteristics for accurate forecasting. Early statistical approaches, like the Historical Average model (HA) [6], focused on modeling traffic flow variations across different times and periods. However, these methods fail to capture the uncertainty, nonlinearity, and dynamics of traffic flow caused by unexpected events such as traffic accidents. ARIMA [7] and VAR [8] models treated traffic flow as stochastic time sequences, considering the correlations and dependencies between time sequence observations for dynamic prediction, thus addressing the issue of traffic flow dynamics to some extent. Although statistical short-term prediction methods can achieve high accuracy, they have certain limitations. Firstly, these methods struggle to capture nonlinear relationships and are only suitable for stable traffic conditions, making them inadequate for adapting to rapid changes in traffic patterns. Secondly, these methods fail to consider the influence of upstream and downstream traffic flow at previous time points on the prediction results. To address these issues, machine learning methods have emerged. Algorithms such as SVM [9], SVR [10], KNN [11], and hashing algorithms [12] have found broad application in improving the accuracy of traffic flow prediction. However, these algorithms also come with inherent limitations. For instance, SVM excels in managing nonlinear relationships but encounters significant computational complexity with large-scale data. SVR fits a regression function within an ε-insensitive margin, making it suitable for nonlinear regression problems. KNN predicts based on the nearest neighbors, capturing local data patterns but being sensitive to noise and high-dimensional data. Therefore, when choosing a machine learning method, it is crucial to consider these factors to effectively meet prediction requirements.

Recent advancements in new-generation information technologies have led to the widespread application of deep learning in areas such as computer vision and natural language processing [13,14,15,16,17,18]. Researchers have now turned to leveraging deep learning methods to model traffic flow, aiming to capture the nonlinearities, dynamics, and fine-grained characteristics within the data. For instance, CNNs [19] have been utilized for modeling traffic flow due to their prowess in capturing fine-grained features. GANs [20] have been employed to enhance the diversity and richness of real-world traffic data. Compared to traditional methods, deep learning models possess robust feature extraction and nonlinear modeling capabilities, enabling them to effectively capture the complex relationships between spatial nodes. This capability yields more accurate predictions and better decision-making when time sequence information is used, and it provides more effective solutions for intricate real-world problems. To address the non-Euclidean nature of real-world road networks, researchers have proposed leveraging GCN [21,22,23,24] to directly process traffic flow data structured as graphs, modeling the complex spatial dependencies within road networks. GCN-M [25] is capable of jointly addressing the tasks of missing value analysis and traffic flow prediction by utilizing an attention-based memory network to incorporate both local spatiotemporal features and global historical patterns. ISTGCN [26] uses physical network topology to explain traffic graph convolution and learns the accumulation of temporal and spatial properties by transmitting data from junctions at different timestamps. STG2Seq [27] employs gated residual GCN modules to capture spatiotemporal correlations in traffic data sequences for prediction. STN-GCN [28] utilizes spatiotemporal normalization and Temporal Convolutional Networks for detailed temporal feature extraction, while addressing spatial dependencies via a self-attention mechanism and employing curriculum learning [29,30] to optimize target sequences.

Despite the impressive performance achieved by current technologies in traffic flow prediction, there are still some challenges. Existing road-based network graphs fail to adequately capture higher-order dependencies between nodes, making it difficult to fully describe the complexity of traffic data and dynamically changing spatial complexities [31]. Additionally, while current models excel in time series modeling, the impact of various complex periodicities on traffic flow is significant in the real world, and the lack of a powerful model to decompose and model these effects makes it challenging for existing methods to capture complex and long-term temporal feature representations.

To tackle these challenges, the authors propose a novel Transformer-based Spatiotemporal Graph Diffusion Convolution Network approach for traffic flow prediction. The contributions of this paper are as follows:

(1) This model utilizes a newly designed Efficient Channel Attention [32] to capture spatial and temporal features while suppressing irrelevant ones.

(2) For spatial information modeling, the authors designed Chebyshev Graph Diffusion Convolution [33], which generates a diffusion matrix capable of capturing high-order dependencies between sensor nodes in the traffic graph. By using Chebyshev polynomials to approximate the high-order powers of the graph Laplacian matrix, it enhances the ability to model complex graph structures, capturing a wider range of topological structures and revealing potential interactions between nodes in real-world road networks.

(3) To capture the complex and long-term temporal dependencies in traffic signals, the authors were inspired by Autoformer [34] to design an Enhanced Transformer. The Enhanced Transformer can decompose the time series into trend and periodic components. Because of this property, the authors use recent, daily, and weekly traffic information sequences based on real-world periodic information as weighted inputs to the model, facilitating better periodic modeling of traffic data. Through an autocorrelation mechanism, the model can discover and aggregate long-term temporal dependencies, thereby more accurately modeling temporal patterns in traffic data.

(4) Evaluations on real-world datasets show that the TSGDC model outperforms state-of-the-art methods.

2. Related Work

2.1. Traffic Forecasting

Traffic flow prediction has long been a pivotal task in the transportation sector, critical for urban planning, traffic management, and public services. Traditional statistical methods attempt to predict future traffic flow through mathematical models and analyses of historical data. Time series analysis, such as the Historical Average (HA) [6] model, is one commonly used approach that infers future traffic conditions by modeling and analyzing past flow data. However, these methods tend to overlook the inherent complexity and dynamics of the transportation system, limiting their predictive power during emergencies or peak hours.

With the rise of machine learning techniques, data-driven approaches have emerged as a new trend in traffic flow prediction. These methods leverage large-scale historical data and advanced learning algorithms to better capture the complex relationships and dynamic changes in traffic flow. For instance, machine learning algorithms such as SVM [9], SVR [10], KNN [11], and hashing algorithms [12] have been applied to traffic flow prediction. By learning and fitting the characteristics of the data, they achieve accurate predictions of future traffic conditions. However, traditional machine learning methods struggle to uncover deep, complex correlations and nonlinear relationships within temporal sequences of traffic data. The advancements in deep learning technology have revolutionized traffic flow prediction. With deep learning models, researchers can more effectively automate the learning of abstract features within data and enhance predictive performance through multi-level network architectures. Among the various tasks in traffic flow prediction, RNNs [35] and their variant LSTM [36] networks have been widely employed. These models excel in processing time sequence data and capturing long-term dependencies. Meanwhile, CNNs have been utilized for spatial data processing, efficiently extracting spatial features from image data to predict traffic flow distributions more precisely. For instance, the CNN-GRU [37] model utilizes convolutionally processed feature sequences as inputs to the GRU, a variant of LSTM, effectively mitigating vanishing and exploding gradients when training on long sequences. Additionally, the STATF [38] model combines convolutional LSTM encoding layers with LSTM decoding layers, introducing an attention mechanism that adaptively learns the spatiotemporal dependencies and nonlinear correlational characteristics within urban traffic flow data, resulting in more accurate predictions. However, these models may face limitations regarding flexibility and robustness when modeling graph data and adapting to long-term trend changes.

2.2. Spatiotemporal Graph Neural Network

Spatiotemporal graph neural networks have emerged prominently in traffic flow prediction research, treating the transportation system as a dynamic spatiotemporal network. These networks employ techniques from graph neural networks to model spatiotemporal dependencies, including spatiotemporal sequence models and attention mechanisms. For example, the T-GCN [23] integrates GCN and GRU to handle spatial and temporal dependencies for traffic prediction. The DCRNN [39] uses graph convolutional networks and diffusion convolutional networks to predict spatiotemporal data. The STGCN [40] uses CNNs for temporal correlations and GCNs for spatial correlations. Graph WaveNet [41] introduces an adaptive adjacency matrix to discover spatial dependencies and combines graph convolutions with dilated causal convolutions for efficient spatiotemporal correlation capture. The STSGCN [42] constructs local spatiotemporal graphs and employs graph convolutional modules for local spatiotemporal correlations. The Z-GCNETs [43] incorporates a novel time-aware zigzag topological layer into time-conditioned GCNs. The STGODE [44] combines ordinary differential equations with GCNs, employing tensor-based ODEs to capture spatiotemporal dynamics. The STRGAT [45] employs deep, fully residual graph attention blocks for dynamic spatial feature aggregation of traffic network node information and designs sequence-to-sequence blocks to capture temporal dependencies in traffic flows. Attention-based models are capable of dynamically adapting to capture spatiotemporal dependencies based on real-time data. The ASTGCN [46] utilizes spatiotemporal attention mechanisms to capture dynamic correlations in traffic data, combining a basic traffic network structure with graph convolutions for spatial feature extraction and temporal convolutions to model dependencies over time. The GMAN [47] employs an encoder–decoder architecture that integrates ST-Attention blocks. 
Each block integrates a spatial attention mechanism for dynamic spatial correlations, a temporal attention mechanism for nonlinear temporal correlations, and a gated fusion mechanism to adaptively combine spatial and temporal representations. The transition attention mechanism links the encoder and decoder, capturing direct relationships between past and future time steps to reduce error propagation. Transformers have shown significant effectiveness in NLP and have been successfully applied to computer vision tasks as well. Due to their superior performance, they are widely used in various fields. The ASTTN [48] leverages the local self-attention mechanism of Transformers, stacking multiple spatiotemporal attention layers, limiting attention to spatially adjacent nodes, and introducing adaptive graphs to capture hidden spatiotemporal dependencies. The Traffic Transformer [49] uses a global encoder and a global–local decoder to extract global and local spatial features, capturing temporal features with a temporal embedding block. Additionally, position encoding and embedding blocks aid the model in understanding the absolute and relative positions of nodes. The DSTAGNN [50] leverages spatiotemporal awareness distance extracted from historical traffic data, effectively enhancing the representation of internal dynamic correlation attributes between road network nodes without relying on predefined static adjacency matrices, thus reducing reliance on prior knowledge of the road network, and utilizes multi-head self-attention to focus on long-term correlations in time sequence data. While the Traffic Transformer and DSTAGNN consider long-term temporal dependencies and the impact of road network nodes, their proposed models tend to be overly complex, providing limited gains.

Existing spatiotemporal graph neural networks provide relatively effective solutions for traffic flow prediction; yet, they fail to fully leverage traffic periodicity and still have limitations in considering the underlying logical information in traffic graphs and complex real-world temporal patterns. To address these issues, the proposed TSGDC in this paper introduces robust modeling of temporal and spatial data, employing advanced techniques such as Chebyshev Graph Diffusion Convolution, Enhanced Transformer, and Efficient Channel Attention, to efficiently capture and utilize spatiotemporal information. This enhances the model’s feature representation capabilities in dynamic spatiotemporal variations, leading to more accurate traffic flow predictions.

3. Problem Definition

The road network is defined as an undirected graph $G = (V, E, A)$, where $V = \{v_{1}, v_{2}, \dots, v_{N}\}$ represents the set of traffic sensors, $N$ is the number of traffic sensors, and $E$ denotes the edges indicating connectivity between sensors. The adjacency matrix $A \in R^{N \times N}$ measures connectivity based on distance or similarity between sensors: $A_{i, j} = 1$ signifies that nodes $i$ and $j$ are connected, and $A_{i, j} = 0$ signifies that they are not.

Traffic information on the road network is represented as feature values $X \in R^{T \times N \times C}$ associated with the nodes, where $T$ denotes the number of time steps and $C$ the number of traffic features gathered by the traffic sensors. $x_{t}^{c, i} \in R$ represents the $c$th feature value of traffic sensor $i$ at time $t$, while $x_{t}^{i} \in R^{C}$ represents all feature values of traffic sensor $i$ at time $t$. $X_{t} = (x_{t}^{1}, x_{t}^{2}, \dots, x_{t}^{N}) \in R^{N \times C}$ denotes all feature values of all sensors at time $t$, and $X_{n}^{T} = (X_{1}, X_{2}, \dots, X_{T}) \in R^{T \times N \times C}$ denotes the aggregate feature values of all sensors across $T$ time slices.

Given historical measurements $X_{n}^{T}$ of all sensors on the traffic network over the past $T$ time slices, the task is to predict the future traffic flow sequences $Y = (y^{1}, y^{2}, \dots, y^{N})^{T} \in R^{N \times P}$ of all nodes across the entire traffic network over the next $P$ time slices, where $y^{i} = (y_{T + 1}^{i}, y_{T + 2}^{i}, \dots, y_{T + P}^{i}) \in R^{P}$ represents the future traffic flow of node $i$ starting from time $T + 1$.
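The shapes in this definition can be made concrete with a minimal NumPy sketch. The sizes, the chain-shaped adjacency matrix, and the `persistence_forecast` helper are all hypothetical and chosen only for illustration; the helper is a naive placeholder that repeats the last observation, not the TSGDC model.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
T, N, C, P = 12, 5, 3, 6

# Binary adjacency matrix A of an undirected 5-sensor chain (assumed topology).
A = np.zeros((N, N), dtype=int)
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1

rng = np.random.default_rng(0)
X = rng.random((T, N, C))       # X_n^T: T time slices, N sensors, C features

X_t = X[0]                      # all sensors at one time step, shape (N, C)
x_t_i = X[0, 2]                 # one sensor at one time step, shape (C,)


def persistence_forecast(history, horizon, flow_channel=0):
    """Placeholder predictor: repeat each sensor's last observed flow value."""
    last = history[-1, :, flow_channel]               # shape (N,)
    return np.repeat(last[:, None], horizon, axis=1)  # shape (N, P)


Y = persistence_forecast(X, P)  # same shape as the prediction target Y
print(X.shape, X_t.shape, x_t_i.shape, Y.shape)
```

Any forecasting model for this task maps an array shaped like `X` (together with `A`) to an array shaped like `Y`.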

4. Methodology

The TSGDC model tackles two main challenges in traffic flow prediction: (1) capturing complex spatial dependencies among traffic sensors across the road network and (2) capturing temporal features embedded in historical traffic flow data sequences collected by sensors. Its overall architecture is depicted in Figure 1.

The input layer primarily collects recent, daily, and weekly historical traffic signal data as input data for the S-T block module. The S-T block module, which focuses on learning spatial and temporal features, consists of a spatial feature learning module and a temporal sequence learning module. The spatial feature learning module incorporates SpatialECA and Graph Diffusion Convolution to capture the dynamic complexity of spatial features. The temporal feature learning module comprises the enhanced Transformer with TimeECA that attends to the temporal dimension, along with a layer of convolution to model complex historical traffic time sequence information. Finally, the spatial and temporal features learned by S-T Block module are fused and propagated to the prediction layer via residual connections, enabling accurate traffic flow forecasting.

4.1. Input Layer

In real life, traffic flow typically exhibits different characteristics due to the influence of temporal factors. To capture comprehensive traffic flow features, the proposed TSGDC model introduces recent, daily, and weekly traffic flow sequences at the input layer. Assume that the current time is $t_{0}$, the length of the prediction window is $T_{p}$, and the daily sampling frequency of the traffic sensors is $q$. Three time sequence segments of lengths $T_{h}$, $T_{d}$, and $T_{w}$ are extracted along the time axis from the traffic flow data $X_{n}^{T} = (X_{1}, X_{2}, \dots, X_{T}) \in R^{T \times N \times C}$ collected by the $N$ traffic sensors in the road network over the historical period $T$ to serve as the input. The three time sequence segments are described as follows:

Recent segment $X_{h} = (X_{t_{0} - T_{h} + 1}, X_{t_{0} - T_{h} + 2}, \dots, X_{t_{0}}) \in R^{T_{h} \times N \times C}$ refers to the historical time sequence fragment that is adjacent to the prediction period. Given that the aggregation and dispersal of vehicles on actual roads is a gradual process, the flow state of a specific traffic sensor node at its previous time point often has a significant impact on its subsequent flow states.

Daily periodic segment

$$X_{d} = (X_{t_{0} - (T_{d} / T_{p}) \times q + 1}, \dots, X_{t_{0} - (T_{d} / T_{p}) \times q + T_{p}}, X_{t_{0} - (T_{d} / T_{p} - 1) \times q + 1}, \dots, X_{t_{0} - (T_{d} / T_{p} - 1) \times q + T_{p}}, \dots, X_{t_{0} - q + 1}, \dots, X_{t_{0} - q + T_{p}}) \in R^{T_{d} \times N \times C}$$

comprises sequence fragments from previous days corresponding to the target prediction period. Because of morning and evening peak hours and the daily routines of individuals, traffic flow data on roads tend to exhibit notable similarities during the corresponding time frames each day. The primary aim of the daily periodic segment is to precisely capture and model this daily periodic pattern inherent in traffic data.

Weekly periodic segment

$$X_{w} = (X_{t_{0} - 7 \times (T_{w} / T_{p}) \times q + 1}, \dots, X_{t_{0} - 7 \times (T_{w} / T_{p}) \times q + T_{p}}, X_{t_{0} - 7 \times (T_{w} / T_{p} - 1) \times q + 1}, \dots, X_{t_{0} - 7 \times (T_{w} / T_{p} - 1) \times q + T_{p}}, \dots, X_{t_{0} - 7 \times q + 1}, \dots, X_{t_{0} - 7 \times q + T_{p}}) \in R^{T_{w} \times N \times C}$$

comprises sequence fragments that match the target weekday and time period, spanning several weeks prior to the prediction horizon. Traffic flow data exhibit pronounced weekly periodic patterns, where traffic conditions on a Monday morning, for instance, tend to closely resemble those of the same hour and weekday in the preceding week, but might significantly differ from those observed on weekends. The primary objective of the weekly periodic segment is to precisely capture and model this week-based variation in traffic patterns.
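The index arithmetic of the three segments can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `periodic_segments`, the convention that `series[t - 1]` stores $X_{t}$, the requirement that $T_{d}$ and $T_{w}$ be multiples of $T_{p}$, and the sampling rate `q = 288` (one reading every five minutes) are all hypothetical.

```python
import numpy as np


def periodic_segments(series, t0, T_h, T_d, T_w, T_p, q):
    """Slice the recent, daily and weekly segments that end at time step t0.

    series has shape (L, N, C); series[t - 1] holds X_t (1-based t, as in the text).
    """
    if T_d % T_p or T_w % T_p or T_d < T_p or T_w < T_p:
        raise ValueError("T_d and T_w must be positive multiples of T_p")
    if T_p > q:
        raise ValueError("the prediction window must not exceed one day")
    n_days, n_weeks = T_d // T_p, T_w // T_p
    needed = max(T_h, n_days * q, 7 * n_weeks * q)
    if not needed <= t0 <= len(series):
        raise ValueError("not enough history before t0")

    # X_{t0 - T_h + 1}, ..., X_{t0}
    X_h = series[t0 - T_h:t0]
    # For each of the last n_days days: X_{t0 - i*q + 1}, ..., X_{t0 - i*q + T_p}
    X_d = np.concatenate(
        [series[t0 - i * q:t0 - i * q + T_p] for i in range(n_days, 0, -1)]
    )
    # For each of the last n_weeks weeks: X_{t0 - 7*i*q + 1}, ..., X_{t0 - 7*i*q + T_p}
    X_w = np.concatenate(
        [series[t0 - 7 * i * q:t0 - 7 * i * q + T_p] for i in range(n_weeks, 0, -1)]
    )
    return X_h, X_d, X_w


# Toy data: one sensor, one feature, value equal to its 0-based position.
q = 288                      # assumed: 5-minute sampling, 288 readings per day
L = 15 * q                   # a little over two weeks of history
series = np.arange(L, dtype=float).reshape(L, 1, 1)
t0 = L

X_h, X_d, X_w = periodic_segments(series, t0, T_h=12, T_d=24, T_w=24, T_p=12, q=q)
print(X_h.shape, X_d.shape, X_w.shape)
```

With these settings the daily segment stacks the matching windows from two days ago and one day ago, and the weekly segment stacks the matching windows from two weeks ago and one week ago, in chronological order.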

The three types of periodic information (recent, daily, and weekly) are assigned learnable weights $h$, $d$, and $w$, which adjust the impact of each component on the prediction task. Combining the weighted segments yields the input traffic signal $X_{w d h} = (w \odot X_{w} + d \odot X_{d} + h \odot X_{h}) \in R^{(w T_{w} + d T_{d} + h T_{h}) \times N \times C}$ for the entire model.

4.2. Spatiotemporal Block

The spatiotemporal block (S-T block) dynamically adjusts attention to automatically focus more on valuable information, thereby enhancing the ability to capture spatial and temporal correlation features in complex environments. As shown in Figure 2, the S-T block comprises a spatial feature learning module and a temporal sequences learning module. The spatial feature learning module is aimed at capturing spatial dimension features inherent in the topological structure among traffic sensors, while the temporal sequences learning module extracts temporal dimension features from historical time sequences. The synergistic interplay of these two modules empowers the S-T block to comprehensively understand and simulate the intricate dynamic changes within road traffic networks.

The spatial feature learning module includes the SpatialECA and Chebyshev Graph Diffusion Convolution, aiming to capture spatial dimension features effectively. The SpatialECA, which is an Efficient Channel Attention based on the traffic features C collected by traffic sensors, boosts model performance by focusing on key features through convolutional attention. Chebyshev Graph Diffusion Convolution employs the diffusion mechanism to create new edge relationships, enhancing data representation and capturing intricate relationships among traffic nodes. These components synergistically enhance the module’s ability to capture complex spatial features in traffic flow data, thereby improving its overall performance.

The time sequences learning module is used to capture complex temporal correlation features, consisting of two parts: TimeECA and an Enhanced Transformer. TimeECA, which is an Efficient Channel Attention based on the time steps T of traffic signals, adjusts the importance of different time channels through the attention mechanism, enhancing the ability to capture key features in time sequences. The Enhanced Transformer focuses on automatically learning long-term dependencies and periodicities in time sequences, thereby improving the representation of temporal features.

4.2.1. Spatial Feature Learning Module

As shown in Figure 2, the SpatialECA employed by the authors dynamically adjusts feature weights based on their importance, thereby enhancing the effectiveness of feature selection. The diffusion graph generated through graph diffusion processing captures higher-order dependencies between nodes more effectively, while Chebyshev GDC performs efficient graph convolution operations by approximating high-order polynomials, thereby capturing higher-order neighbor information introduced by the diffusion graph.

SpatialECA: SpatialECA dynamically adjusts the weights of different traffic feature channels through adaptive average pooling and one-dimensional convolution, thereby highlighting significant features and suppressing irrelevant ones. Specifically, in this module, the authors first perform a global average pooling operation on the traffic feature map $x^{l - 1}$, which is obtained by passing the input traffic signal $X_{w d h} \in R^{(w T_{w} + d T_{d} + h T_{h}) \times N \times C}$ through a linear layer. The size $k_{C}$ of an adaptive one-dimensional convolution kernel is then computed using Formula (1), where $C$ is the number of input feature channels, $b = 1$, and $\gamma = 2$. Subsequently, the channel weights $\omega_{C}$ are computed as shown in Formula (2), where $\mathrm{C1D}_{k_{C}}$ represents a one-dimensional convolution with kernel size $k_{C}$ and $\sigma$ is the Sigmoid activation function that scales the weights. Ultimately, the normalized weights are applied to the input traffic feature map to produce the output feature map, as shown in Formula (3).

$$k_{C} = \psi(C) = \left| \frac{\log_{2}(C)}{\gamma} + \frac{b}{\gamma} \right| \qquad (1)$$

$$\omega_{C} = \sigma(\mathrm{C1D}_{k_{C}}(\mathrm{AvgPool}(x^{l - 1}))) \qquad (2)$$

$$x = \omega_{C} \times x^{l - 1} \qquad (3)$$
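Formulas (1) to (3) can be sketched in plain NumPy as below. Several points are assumptions rather than details confirmed by the text: the kernel size is truncated and then bumped to an odd integer, following the convention of the original ECA work [32]; the feature map is laid out with channels first; the convolution uses zero padding and no bias; and the kernel weights, which a real model would learn, are random here. The names `eca_kernel_size` and `eca` are hypothetical.

```python
import math

import numpy as np


def eca_kernel_size(channels, gamma=2, b=1):
    """Formula (1), made an odd integer (assumed rounding rule, as in ECA [32])."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(x, weights):
    """Channel attention for x of shape (C, ...), given a 1-D kernel of odd length."""
    weights = np.asarray(weights, dtype=float)
    k = len(weights)
    channels = x.shape[0]
    # Global average pooling over every axis except the channel axis.
    pooled = x.reshape(channels, -1).mean(axis=1)              # shape (C,)
    # 1-D convolution across neighbouring channels with zero padding.
    padded = np.pad(pooled, k // 2)
    conv = np.array([padded[i:i + k] @ weights for i in range(channels)])
    omega = 1.0 / (1.0 + np.exp(-conv))                        # Formula (2)
    scale = omega.reshape((channels,) + (1,) * (x.ndim - 1))
    return scale * x, omega                                    # Formula (3)


rng = np.random.default_rng(0)
C, N, T = 8, 5, 12                      # hypothetical sizes
x_prev = rng.random((C, N, T))          # stands in for the feature map x^{l-1}

k_C = eca_kernel_size(C)                # (log2(8) + 1) / 2 = 2, bumped to 3
kernel = rng.standard_normal(k_C)       # random stand-in for learned weights
x_out, omega_C = eca(x_prev, kernel)
print(k_C, x_out.shape, omega_C.round(3))
```

Because the kernel spans only $k_{C}$ neighbouring channels, the attention weights are produced with $k_{C}$ parameters instead of a full $C \times C$ projection, which is the source of the module's efficiency.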

Chebyshev Graph Diffusion Convolution: The Chebyshev Graph Diffusion Convolution generates diffusion matrices that capture higher-order dependencies among sensor nodes in traffic graphs. This diffusion process smooths and propagates the features of traffic sensor nodes, allowing them to reflect their importance and relationships within a broader network structure. By using Chebyshev polynomials to approximate higher-order powers of the graph Laplacian matrix, the method further enhances the modeling capability for complex graph structures. The detailed steps are as follows. First, the adjacency matrix $A$ of graph $G$, which has $N$ nodes, is converted into a compressed sparse row matrix. Second, self-loops are added, producing a new matrix that connects each node to itself. Third, the symmetric transfer matrix $T$ is calculated: the degree vector of the nodes is computed, the inverse square root $D^{-\frac{1}{2}}$ of the corresponding degree matrix is formed, and $T$ is obtained from it. Fourth, the diffusion matrix $S$ is generated through a diffusion calculation based on Personalized PageRank. Finally, $S$ is sparsified according to the threshold $e$, and column normalization is applied to the resulting sparse matrix to obtain the column-normalized transfer matrix $\tilde{S}$ as the final output, as shown in the following formula:

$$\tilde{S} = \sum_{k = 0}^{\infty} a (1 - a)^{k} (D^{- 1} A)^{k} \qquad (4)$$

where $a$ denotes the teleport (restart) probability of the random walk and $k$ indexes the diffusion steps. The Laplacian matrix of the graph is then defined as $L = D - \tilde{S}$ and normalized to $L = I_{N} - D^{- \frac{1}{2}} \tilde{S} D^{- \frac{1}{2}} \in R^{N \times N}$, where $D \in R^{N \times N}$ is the diagonal degree matrix of the nodes with $D_{i i} = \sum_{j} A_{i j}$, and $I_{N}$ is the identity matrix. The Fourier basis, the orthogonal matrix $U$, is obtained from the eigendecomposition $L = U \Lambda U^{T}$, where $\Lambda = \mathrm{diag}([\lambda_{0}, \dots, \lambda_{N - 1}]) \in R^{N \times N}$ is the diagonal matrix of the eigenvalues of $L$. The graph Fourier transform of a graph signal $x$ is $\hat{x} = U^{T} x$, and the inverse Fourier transform is $x = U \hat{x}$. This article employs diagonal linear operators in the Fourier domain to substitute traditional operators in graph convolution, thereby facilitating the computation of graph convolutions. Its definition can be expressed using Formula (5):

$$\Theta *_{C} x = \Theta(L) x = \Theta(U \Lambda U^{T}) x = U \Theta(\Lambda) U^{T} x \qquad (5)$$

where $\Theta$ represents the convolution kernel and $*_{C}$ represents the graph convolution operation. Because directly performing eigenvalue decomposition of the Laplacian matrix is computationally expensive on large graphs, Chebyshev polynomials are used to approximate the solution:

$$\Theta *_{C} x = \Theta(L) x = \sum_{k = 0}^{K - 1} \theta_{k} T_{k}(\tilde{L}) x \qquad (6)$$

In Formula (6), $\theta \in R^{K}$ is the vector of Chebyshev polynomial coefficients $\theta_{k}$, $\tilde{L} = \frac{2}{\lambda_{max}} L - I_{N}$, and $\lambda_{max}$ is the maximum eigenvalue of the Laplacian matrix. The Chebyshev polynomials are defined recursively as $T_{k}(x) = 2 x T_{k - 1}(x) - T_{k - 2}(x)$, with $T_{0}(x) = 1$ and $T_{1}(x) = x$. This approximation avoids the costly eigendecomposition: the convolution kernel extracts information about the 0th- to $(K - 1)$th-order neighbors centered on each node in the graph. One graph convolution operation is applied to the input data $x^{l - 1}$ to derive the spatial feature representation $x_{S} = \Theta *_{C} x^{l - 1}$.
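The pipeline described in this subsection (diffusion, sparsification, column normalization, scaled Laplacian, Chebyshev filtering) can be sketched end to end in NumPy. This is an illustrative sketch under several assumptions, not the authors' code: dense matrices replace the compressed sparse row format; the infinite sum in Formula (4) is evaluated in its closed form $a (I - (1 - a) T)^{-1}$ using the symmetric transfer matrix described in the text; the diffusion matrix is symmetrized and its own row sums are used as degrees so that the Laplacian has real eigenvalues; and the values of `alpha`, `eps`, and `theta` are arbitrary. All function names are hypothetical.

```python
import numpy as np


def ppr_diffusion(A, alpha=0.15, eps=1e-4):
    """Personalized PageRank diffusion, thresholded and column-normalised."""
    n = A.shape[0]
    A_loop = A + np.eye(n)                                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
    T_sym = d_inv_sqrt[:, None] * A_loop * d_inv_sqrt[None, :]
    # Closed form of sum_k alpha * (1 - alpha)^k * T_sym^k.
    S = alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * T_sym)
    S = np.where(S >= eps, S, 0.0)                           # sparsify
    return S / S.sum(axis=0, keepdims=True)                  # column-normalise


def scaled_laplacian(W):
    """L_tilde = 2 L / lambda_max - I for a symmetric weight matrix W."""
    n = W.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(n)


def cheb_conv(L_tilde, x, theta):
    """Formula (6): sum_k theta_k T_k(L_tilde) x, using the Chebyshev recurrence."""
    t_prev, t_curr = x, L_tilde @ x                          # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * L_tilde @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out


# Toy road network: five sensors in a chain (assumed topology).
N = 5
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

S_tilde = ppr_diffusion(A)
W = 0.5 * (S_tilde + S_tilde.T)          # assumed symmetrisation step
L_tilde = scaled_laplacian(W)

rng = np.random.default_rng(0)
x_prev = rng.random((N, 2))              # node features, two channels
theta = np.array([0.5, 0.3, 0.2])        # K = 3, arbitrary coefficients
x_S = cheb_conv(L_tilde, x_prev, theta)
print(x_S.shape)
```

The recurrence needs only matrix-vector products with $\tilde{L}$, so no eigendecomposition is performed inside `cheb_conv`; the single eigenvalue computation in `scaled_laplacian` is used only to obtain $\lambda_{max}$.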

4.2.2. Time Sequences Learning Module

As illustrated in Figure 2, since the traffic signal inputs in this study encompass traffic flow features across different time scales, the authors utilize the TimeECA module, which adaptively selects the convolution kernel size based on the length of the input time series. This allows the model to better capture dependencies across time series of varying lengths. By decomposing the time series and using an efficient autocorrelation mechanism, the Enhanced Transformer effectively captures complex temporal dependencies. Additionally, a standard two-dimensional convolutional layer is appended after the Enhanced Transformer to integrate information from adjacent time slices and update traffic sensor node information.

TimeECA: To capture critical information at different time points, the authors first introduce the TimeECA operation in the temporal dimension of traffic signals. The TimeECA can automatically adjust its focus based on the dynamic changes in time sequence data, assisting the model in identifying and emphasizing important features at different time steps, thereby reducing noise interference. Since the calculation process of the Efficient Channel Attention has been detailed in the previous section, the authors simplify the computation of TimeECA as follows:

x_{T} = \mathrm{TimeECA}_{k_{T}} (x_{S}^{l - 1})

(7)

In Formula (7), the module applies the TimeECA operation to the spatially dependent output feature map x_{S}^{l - 1} learned from the previous layer to obtain an efficient temporal dependency representation x_{T}. The temporal convolution kernels k_{T} are computed using Formula (1) with the historical time steps T in the traffic flow sequence as input channels.

The Enhanced Transformer: The Enhanced Transformer decomposes the multi-period weighted time series of the model into trend and seasonal components, where the trend component captures long-term changes and the seasonal component captures periodic changes. This decomposition allows for more accurate modeling of multi-scale variations in time series. Additionally, this section employs an efficient attention mechanism based on autocorrelation (referred to below as the self-correlation mechanism) to maintain the ability to capture complex temporal dependencies.

The input of the Enhanced Transformer component is the sequence data x_{T}^{l - 1} \in R^{T \times N} from the past T time steps, which have been learned through the previous TimeECA. This compression of feature dimensions into the temporal and spatial dimensions of traffic signals helps avoid information loss or confusion, ultimately enhancing model performance. This approach is particularly crucial in scenarios where the model needs to consider both spatial and temporal information, as in the case of this study. For a traffic signal input sequence with a historical time sequence length of T, the decomposition process is outlined as follows:

x_{t} = \mathrm{AvgPool} (\mathrm{Padding} (x_{T}^{l - 1}))

(8)

x_{s} = x_{T}^{l - 1} - x_{t}

(9)

In this paper, \mathrm{AvgPool} implements a moving average, with a padding operation that keeps the length of the sequence constant, and x_{s}, x_{t} \in R^{T \times d} denote the seasonal component and the extracted trend-cyclical component, respectively. The process can be summarized as x_{s}, x_{t} = \mathrm{SeriesDecomp} (x_{T}^{l - 1}). The input representation for the Enhanced Transformer encoder is denoted as x_{en} \in R^{T \times N}. The decoder part of the Enhanced Transformer acts as a decomposition architecture whose input consists of a seasonal component x_{des} and a trend-cyclical component x_{det}. Each is initialized from two parts: the decomposition of the latter half of the encoder’s input, which supplies recent information, and a placeholder of length O filled with a scalar, represented as follows:

x_{ens}, x_{ent} = \mathrm{SeriesDecomp} (x_{en, \frac{T}{2} : T})

(10)

x_{des} = \mathrm{Concat} (x_{ens}, x_{0})

(11)

x_{det} = \mathrm{Concat} (x_{ent}, x_{Mean})

(12)

where x_{ens}, x_{ent} \in R^{\frac{T}{2} \times N} denote the seasonal and trend-cyclical parts of the latter half of x_{en} \in R^{T \times N}, and x_{0}, x_{Mean} \in R^{O \times N} denote the placeholders filled with zeros and with the mean of the input, respectively.

The encoder is mainly used for modeling the seasonal part. The decoder consists of two key components: the accumulation structure for the trend-cyclical component x_{det} and the stacked self-correlation mechanism for the seasonal component x_{des}. The detailed architecture of the encoder and decoder follows earlier Transformer models [34]. The final prediction of the Enhanced Transformer is obtained by summing the seasonal component and the trend-cyclical component produced by several encoder and decoder layers, represented as x_{T} = x_{s} + x_{t}.
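The decomposition in Formulas (8) and (9) and the decoder initialization in Formulas (10)–(12) can be sketched for a single sensor's series as follows. This is an illustrative sketch only: the kernel size, the edge-replication padding, and the function names are assumptions, since the source specifies only that padding keeps the sequence length constant.

```python
def moving_average(series, kernel):
    """Moving average whose output has the same length as the input.

    Assumption: the ends are padded by repeating the first and last values.
    """
    front = (kernel - 1) // 2
    back = kernel - 1 - front
    padded = [series[0]] * front + list(series) + [series[-1]] * back
    return [sum(padded[i:i + kernel]) / kernel for i in range(len(series))]


def series_decomp(series, kernel=3):
    """Split a series into (seasonal, trend) as in Formulas (8) and (9)."""
    trend = moving_average(series, kernel)
    seasonal = [value - t for value, t in zip(series, trend)]
    return seasonal, trend


def decoder_init(x_en, horizon, kernel=3):
    """Build the decoder inputs x_des and x_det as in Formulas (10)-(12)."""
    latter_half = x_en[len(x_en) // 2:]
    seasonal, trend = series_decomp(latter_half, kernel)
    mean_value = sum(x_en) / len(x_en)
    x_des = seasonal + [0.0] * horizon        # placeholder filled with zeros
    x_det = trend + [mean_value] * horizon    # placeholder filled with the input mean
    return x_des, x_det


history = [float(v) for v in range(1, 9)]     # T = 8 historical steps
seasonal_part, trend_part = series_decomp(history)
x_des, x_det = decoder_init(history, horizon=3)
```

By construction the seasonal and trend parts sum back to the original series, and each decoder input has length T/2 + O.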

The self-correlation mechanism improves information utilization by discovering period-based dependencies through the self-correlation of the sequence and by aggregating similar subsequences via time-delay aggregation. Subsequences at the same phase position of different periods are naturally similar, which helps the model exploit the data more effectively. For a real discrete time sequence {x_{t}}, its self-correlation S_{xx} (T) can be calculated using the following formula:

S_{xx} (T) = \lim_{L \to \infty} \frac{1}{L} \sum_{t = 1}^{L} x_{t} x_{t - T}

(13)

S_{xx} (T) describes the time-delay similarity between a discrete time sequence {x_{t}} and its lagged sequence {x_{t - T}}. The self-correlation mechanism relies on discovering period-based correlations between subsequences. In this paper, the authors utilize a time-delay aggregation block that rolls the sequence by each of the selected time delays T_{1}, \dots, T_{k}, in a role analogous to the self-attention mechanism. However, this method differs from the point-by-point dot-product aggregation employed in self-attention, as it aligns similar subsequences positioned at the same phase of the estimated periods. The subsequences are then aggregated using Softmax-normalized confidences. As depicted in Figure 3, in the single-head scenario, for the input time sequence x^{l - 1} \in R^{L \times N} of length L from the previous encoder or decoder layer, the query Q = x^{l - 1} W_{Q}, key K = x^{l - 1} W_{K}, and value V = x^{l - 1} W_{V} are obtained through projection with the learnable parameters W_{Q}, W_{K}, and W_{V}. This approach serves as an effective alternative to the self-attention mechanism. The self-correlation mechanism is expressed as follows:

T_{1}, \dots, T_{k} = \mathrm{argTopk} (S_{Q, K} (T)), T \in \{1, \dots, L\}

(14)

\hat{S}_{Q, K} (T_{1}), \dots, \hat{S}_{Q, K} (T_{k}) = \mathrm{SoftMax} (S_{Q, K} (T_{1}), \dots, S_{Q, K} (T_{k}))

(15)

\mathrm{Self\text{-}Correlation} (Q, K, V) = \sum_{i = 1}^{k} \mathrm{Roll} (V, T_{i}) \hat{S}_{Q, K} (T_{i})

(16)

where \mathrm{argTopk} (•) returns the k time delays with the largest self-correlation values, with k = ⌊c \times \log L⌋, where c is a hyperparameter. S_{Q, K} denotes the self-correlation between the sequences Q and K. \mathrm{Roll} (x, T) represents the roll operation on sequence x with time delay T, during which elements shifted beyond the first position are reintroduced at the last position. For the encoder-decoder self-correlation, K and V originate from the encoder’s output, while Q stems from the previous module of the decoder. In the multi-head version of the Enhanced Transformer, assuming a model with d_{model} hidden variables and h attention heads, the ith attention head is computed as head_{i} = \mathrm{Self\text{-}Correlation} (Q_{i}, K_{i}, V_{i}), where Q_{i}, K_{i}, V_{i} \in R^{L \times \frac{d_{model}}{h}} and i \in \{1, \dots, h\}. The subsequent aggregation synthesizes the information from the different heads:

\mathrm{MultiHead} (Q, K, V) = W_{output} \times \mathrm{Concat} (head_{1}, \dots, head_{h})

(17)
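The single-head mechanism of Formulas (14)–(16) can be illustrated on one-dimensional sequences. The sketch below makes several assumptions that the source does not fix: the correlation is estimated circularly over a finite length L (so lags run from 0 to L - 1, with lag L coinciding with lag 0), it is computed directly in O(L^2) rather than with a fast transform, and the logarithm is natural. All names are hypothetical.

```python
import math


def roll(sequence, shift):
    """Shift left by `shift`; elements pushed past the front re-enter at the end."""
    shift %= len(sequence)
    return sequence[shift:] + sequence[:shift]


def self_correlation(q, k, v, c=1.0):
    """Aggregate rolled copies of v, weighted by the top-k lag correlations of q and k."""
    length = len(q)
    # Finite-length circular estimate of S_{Q,K}(T) for each lag T.
    scores = [sum(q[t] * k[(t - lag) % length] for t in range(length)) / length
              for lag in range(length)]
    top_k = max(1, math.floor(c * math.log(length)))
    lags = sorted(range(length), key=lambda lag: scores[lag], reverse=True)[:top_k]
    # Softmax over the selected correlations (shifted by the max for stability).
    peak = max(scores[lag] for lag in lags)
    exps = [math.exp(scores[lag] - peak) for lag in lags]
    weights = [e / sum(exps) for e in exps]
    out = [0.0] * length
    for lag, weight in zip(lags, weights):
        out = [o + weight * r for o, r in zip(out, roll(v, lag))]
    return out, lags, weights


# A sequence with period 4: the strongest lags are multiples of the period.
signal = [0.0, 1.0, 0.0, -1.0] * 3
aggregated, chosen_lags, lag_weights = self_correlation(signal, signal, signal)
```

For this periodic input the selected delays are multiples of the period, so the rolled copies line up in phase and the aggregation reproduces the original sequence. This is the subsequence-level alignment that distinguishes the mechanism from point-wise dot-product attention.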

Time-dimensional convolution: In the final stage of the model, the authors introduce a standard two-dimensional convolutional layer and combine its output with a residual connection. This design integrates information from adjacent time slices and updates the feature representations of the nodes. The ReLU activation function is first applied to the feature map after the residual connection, providing a nonlinear transformation that helps mitigate vanishing gradients and is cheap to compute. Layer normalization is then used to standardize the feature map, mitigating the impact of internal covariate shift. Together, these steps enhance the model’s feature extraction capabilities for temporal data and improve its stability and generalization during training. Specifically, ψ_{T} and ψ_{res} are the learnable convolutional kernel parameters of the temporal convolution and the residual convolution, respectively, and x_{T}^{l - 1} \in R^{N \times C \times T} represents the input from the previous Enhanced Transformer module:

x^{l} = \mathrm{Ln} (\mathrm{ReLU} ((ψ_{T} \times x) + (ψ_{res} \times x_{T}^{l - 1})))

(18)

4.3. Prediction Layer

In this model, the input traffic signal data X_{wdh} \in R^{(w T_{w} + d T_{d} + h T_{h}) \times N \times C} pass through a linear layer before entering multiple S-T Block modules. These blocks sequentially extract spatial and temporal features from the input traffic signal data through the spatial feature learning operation \mathrm{Spatial} (•) and the temporal sequence learning operation \mathrm{Temporal} (•). This sequential extraction of spatiotemporal features enables the model to progressively decompose and abstract the complex structure of the input data. Through layer-by-layer processing, the model can effectively capture information and varying patterns at different levels within the data, thereby significantly enhancing its expressive power and prediction accuracy.

The output x_{\text{S-T Block}} from each S-T Block is concatenated along the temporal dimension and then undergoes a nonlinear transformation via the ReLU activation function. Subsequently, the concatenated result is processed by a two-dimensional convolutional layer with learnable convolutional kernel parameters ψ. This two-dimensional convolutional layer not only serves for feature extraction but also functions similarly to a fully connected layer, effectively mapping the characteristics of the input data. Finally, a fully connected layer further processes the refined features, leading to the final prediction \hat{Y}.

x_{\text{S-T Block}} = \mathrm{Temporal} (\mathrm{Spatial} (\mathrm{ReLU} (\mathrm{Linear} (X_{wdh}))))

(19)

\hat{Y} = \mathrm{Linear} (ψ \times \mathrm{ReLU} (\mathrm{cat} (x_{\text{S-T Block}_{1}}, x_{\text{S-T Block}_{2}}, \dots, x_{\text{S-T Block}_{N}})))

(20)

In this article, the authors apply the L1 Loss function to the model. During training, the network computes gradients based on the value of the loss function and utilizes the backpropagation algorithm to update the network’s parameters. This process helps the model adjust in a more favorable direction to predict traffic data features with greater accuracy. The computation process is as follows:

L (θ) = \sum_{i = t}^{t + P} |Y_{i} - {\hat{Y}}_{i}|

(21)

Here, θ encompasses all learnable parameters in the model. Optimization occurs by comparing predicted results \hat{Y} from the output layer with the actual traffic data labels Y.

5. Experimental Analysis

5.1. The Experimental Environment as Well as the Dataset

This experiment utilizes the PyTorch framework on a system featuring an Intel Core i7-8700×6 processor, an RTX2070 graphics card, and 32 GB of RAM. It employs CUDA version 10.0 and Python version 3.9.

To evaluate the TSGDC proposed by the authors, this study utilizes three publicly available datasets: PeMSD4, PeMSD7, and PeMSD8, which are sourced from the California Highway PeMS database. All three datasets record data every 5 min across dimensions including speed, traffic volume, and average occupancy rate. The data are partitioned into training, testing, and validation sets in a 6:2:2 ratio. Missing values are handled through linear interpolation, and data normalization is performed using the Z-Score method. Additional details are provided in Table 1.
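The preprocessing steps described above (linear interpolation of missing values, a 6:2:2 split, and Z-Score normalization) can be sketched for a single sensor's series as follows. The sketch assumes that missing readings are marked as None, that the split is chronological, and that the normalization statistics are estimated on the training portion only; none of these details is stated in the source, and the function names are hypothetical.

```python
import math


def interpolate_missing(series):
    """Fill None entries linearly between the nearest observed neighbours.

    Gaps at either end copy the nearest observed value.
    """
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            fraction = (i - left) / (right - left)
            filled[i] = series[left] + fraction * (series[right] - series[left])
    return filled


def split_622(series):
    """Split into three consecutive parts with a 6:2:2 length ratio."""
    n = len(series)
    first, second = n * 6 // 10, n * 8 // 10
    return series[:first], series[first:second], series[second:]


def zscore_fit(values):
    """Return (mean, population standard deviation) of the given values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot normalize a constant series")
    return mean, std


def zscore_apply(values, mean, std):
    return [(v - mean) / std for v in values]


raw = [10.0, None, 14.0, 16.0, None, None, 22.0, 24.0, 26.0, 28.0]
clean = interpolate_missing(raw)
part_a, part_b, part_c = split_622(clean)
mu, sigma = zscore_fit(part_a)
part_a_norm = zscore_apply(part_a, mu, sigma)
```

Fitting the mean and standard deviation on the first portion only is a common way to avoid leaking statistics from held-out data; the held-out portions would then be passed through `zscore_apply` with the same `mu` and `sigma`.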

5.2. Model Evaluation Metrics

The evaluation metrics employed in this study include MAE, RMSE, and MAPE. Below are their respective calculation formulas.

MAE = \frac{1}{N} \sum_{t = 1}^{N} | {\hat{Y}}_{t} - Y_{t} |

(22)

RMSE = \sqrt{\frac{1}{N} \sum_{t = 1}^{N} {({\hat{Y}}_{t} - Y_{t})}^{2}}

(23)

MAPE = \frac{1}{N} \sum_{t = 1}^{N} |\frac{{\hat{Y}}_{t} - Y_{t}}{Y_{t}}|

(24)

In Formulas (22)–(24), N denotes the length of the predicted sequence, Y_{t} represents the actual traffic flow value at a given time point, and \hat{Y}_{t} represents the predicted traffic flow at the same time point. The accuracy and performance of traffic flow prediction are evaluated using these indicators, where smaller values indicate lower errors between predicted and actual values and thus higher prediction accuracy.
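Formulas (22)–(24) translate directly into code. In the sketch below, MAPE is returned as a fraction rather than a percentage, and time points whose true value is zero are skipped to avoid division by zero; that masking is an assumption commonly made for traffic data, not something the source states.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Formula (22)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, Formula (23)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, Formula (24), as a fraction.

    Assumption: points with a true value of zero are excluded.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all true values are zero")
    return sum(abs((p - t) / t) for t, p in pairs) / len(pairs)


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, which is why the two metrics are usually reported together.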

5.3. Hyperparameter Settings

Extensive experiments were conducted in this study. To achieve optimal results, the number of model layers was uniformly set to 3, with weighted inputs for three types of periodic information. The learning parameters were set as w = 1, d = 1, and h = 2, and the embedding dimension of the fully connected layer was 64. Additionally, a Chebyshev polynomial parameter of 3 was adopted. In the configuration of the multi-head Enhanced Transformer, the encoder comprised two layers, the decoder had one layer, and the hidden variables were set to 16 channels with eight attention heads. The initial learning rate was set at 0.001, with a batch size of 32 and a maximum of 50 training epochs. The RAdam optimizer was chosen along with the L1 Loss function for training.
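For reference, the settings listed above can be collected in one configuration object. The values are taken from this section; the class name and field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TSGDCConfig:
    """Hyperparameters reported in Section 5.3 (field names are illustrative)."""
    num_layers: int = 3
    weekly_weight: int = 1        # w
    daily_weight: int = 1         # d
    recent_weight: int = 2        # h
    embed_dim: int = 64
    cheb_parameter: int = 3
    encoder_layers: int = 2
    decoder_layers: int = 1
    d_model: int = 16
    num_heads: int = 8
    learning_rate: float = 1e-3
    batch_size: int = 32
    max_epochs: int = 50
    optimizer: str = "RAdam"
    loss: str = "L1"

    def head_dim(self):
        """Channels per attention head, d_model / h."""
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads


config = TSGDCConfig()
```

With 16 hidden channels and eight heads, each head operates on two channels.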

5.4. Experimental Results

5.4.1. Baseline Models

In this study, the authors compare the predictive performance of the TSGDC model against fourteen baseline traffic flow prediction models on the PeMSD4, PeMSD7, and PeMSD8 datasets. These models include HA, ARIMA, VAR, LSTM, STGCN, DCRNN, Graph WaveNet, ASTGCN(r), STG2Seq, STSGCN, STGODE, Z-GCNETs, Traffic Transformer, and DSTAGNN.

HA [6]: models traffic flow changes over different time periods.
ARIMA [7]: combines autoregression and moving average techniques for time sequence data forecasting.
VAR [8]: captures the correlation in traffic flow sequences.
LSTM [36]: effective in modeling long-term dependencies in sequence data.
STGCN [40]: applies graph convolution to traffic prediction challenges.
DCRNN [39]: dynamically models traffic flow using diffusion processes and directed graphs.
Graph WaveNet [41]: integrates adaptive graph convolution and dilated causal convolution to handle spatial dependencies across different scales.
ASTGCN(r) [46]: integrates graph convolution and standard convolution with a spatiotemporal attention mechanism.
STG2Seq [27]: simulates spatiotemporal correlation using a gated residual GCN module.
STSGCN [42]: designed to capture heterogeneity in local spatiotemporal graphs using multiple modules.
STGODE [44]: uses tensor-based ordinary differential equations (ODEs) to model spatiotemporal dynamics.
Z-GCNETs [43]: integrates an innovative time-sensitive zigzag topological layer into time-conditioned GCNs.
Traffic Transformer [49]: utilizes the Transformer model to capture temporal dependencies.
DSTAGNN [50]: extracts dynamic temporal dependencies from receptive field features via multi-scale gated convolutions.

5.4.2. TSGDC Performance Analysis

Based on the experimental results in Table 2, the proposed TSGDC model exhibits outstanding performance on the PeMSD4, PeMSD7, and PeMSD8 datasets. The authors comprehensively evaluated the average MAE, RMSE, and MAPE metrics of each model. Compared to statistical models such as HA, ARIMA, and VAR, TSGDC significantly improves traffic flow prediction accuracy. These statistical models, constrained by linear assumptions, struggle to capture nonlinear fluctuations in traffic flow. Conversely, TSGDC effectively grasps nonlinear relationships in traffic flow prediction and considers the profound influence of upstream and downstream traffic flow in historical periods.

Compared to machine learning-based methods, TSGDC demonstrates higher flexibility in handling complex and dynamically changing traffic flow data. Traditional machine learning approaches face limitations in capturing spatiotemporal correlations in traffic flow data and often need to strike a balance between model complexity and generalization ability. However, TSGDC, with its unique architecture and mechanisms, can not only effectively process spatiotemporal information but also automatically learn and extract valuable traffic features, leading to more precise and reliable traffic flow predictions.

In capturing spatial dependencies in traffic graphs, classic deep learning models based on ordinary graph convolutions (e.g., STGCN, DCRNN, Graph WaveNet, ASTGCN(r), STG2Seq, STSGCN, STGODE, and Z-GCNETs) can process graph-structured data and capture direct connections between nodes, but they have limited ability to capture indirect relationships between nodes that are not directly connected. The Chebyshev Graph Diffusion Convolution employed in our model captures higher-order relationships between sensor nodes in the traffic graph structure. It integrates the contextual information of the sensor nodes, allowing information to propagate more extensively within the graph structure. By using multi-order polynomial approximation, it performs efficient graph convolution operations, thereby capturing spatial dependencies more comprehensively.

The Traffic Transformer model relies on a Transformer structure to analyze time sequences, but it falls short in modeling real-world traffic time features because it does not model recent, daily, and weekly periodicity as separate segments. Conversely, TSGDC, combined with TimeECA and the Enhanced Transformer, can attend to and decompose complex real-world traffic time sequences, significantly enhancing the model’s prediction performance.

Furthermore, the DSTAGNN model estimates dependencies between traffic sensor nodes and leverages spatiotemporal awareness distance and a multi-head self-attention mechanism to focus on long-term correlations in historical traffic time sequences, but it suffers from higher complexity. Our proposed TSGDC, by integrating an efficient and lightweight Enhanced Transformer model for temporal information modeling and Chebyshev Graph Diffusion Convolution for spatial information modeling, demonstrates superior prediction performance. The consistently superior performance of TSGDC across different datasets and its improvements in various metrics validate its advantages and broad applicability in the field of traffic flow prediction.

5.4.3. Visualization of Experiments

Visualization of short-term prediction results: In Figure 4 and Figure 5, the authors conducted a series of experiments to evaluate the predictive accuracy of our model across different forecast horizons, incrementally increasing from 5 min to 60 min, at 5 min intervals. The authors recorded the 12 sets of prediction errors on the PeMSD4 and PeMSD8 datasets, assessing them using MAE, RMSE, and MAPE metrics. As the forecast horizon expanded, the MAE, RMSE, and MAPE indicators gradually rose, indicating a decrease in predictive performance. This phenomenon could be attributed to several factors. Firstly, as the forecast horizon increases, the prediction task becomes more challenging, as the model has to capture more variations and trends over a longer period, which may lead to more significant errors when forecasting further into the future. Secondly, there may be data instability, where traffic flow or other related variables may be influenced by more external factors over a longer time frame, such as weather or holidays, adding to the prediction’s uncertainty. Furthermore, with an increasing time horizon, the model may encounter more cumulative errors, as each time step within the prediction period is influenced by previous errors. Finally, the limitations of the model itself could also be a contributing factor, requiring more complex or practical modeling approaches to handle longer-term forecasting tasks. In summary, these factors collectively contribute to the gradual decline in predictive performance as the forecast horizon increases. Overall, our method performs better than others in predicting traffic flow.

Visualization of graph diffusion and predictions: Figure 6 showcases the diffusion graph generated by the proposed Graph Diffusion Convolution after preprocessing the original graph. The horizontal and vertical axes represent the nodes in the traffic graph. Through the diffusion process, Graph Diffusion Convolution can create edge relationships that do not exist in the original graph, thereby enriching the graph’s topological structure, capturing hidden information that was previously undetected, and enhancing the model’s ability to understand and model graph data. The color variations in the heatmap represent the strength of connections between nodes, i.e., the weights of edges. Darker colors indicate tighter connections or higher weights, while lighter colors suggest weaker connections or lower weights. The diffusion graph, compared to the original graph, may contain weaker edges, which could be attributed to noise and uncertainty introduced during the diffusion process or to the model’s increased smoothing of local information in the graph. This smoothing effect can result in the heatmap appearing more flat and consistent after Graph Diffusion Convolution, obscuring some noisy details in the original adjacency matrix and thus appearing relatively darker in visualization.

The application of Graph Diffusion Convolution enhances the predictive power of traffic sensor node features in traffic graphs. By leveraging the diffusion process, this method captures richer local and global information, rendering the representation of traffic sensor nodes more substantial and meaningful. The data visualization compares the actual traffic flow characteristics of 307 traffic sensors in the PeMSD4 dataset and 883 traffic sensors in the PeMSD7 dataset at a specific moment with their predicted data, as depicted in Figure 7. It is evident that the real and predicted data align closely, demonstrating the remarkable performance of the TSGDC model in traffic prediction.

5.4.4. Hyperparametric Learning

By adjusting the learning parameters w, d, and h, the inputs of this study are weighted to integrate weekly cycle information, daily cycle information, and recent information, allowing their influence on prediction outcomes to be observed. The length of the input time sequence varies with these weights. The authors conducted experiments on the PeMSD8 dataset, and the results are presented in Figure 8, demonstrating the prediction effects across different time scales. With the baseline setting w = 1, d = 1, and h = 1, the model’s prediction results were suboptimal. Increasing the weight of the weekly or daily component, with combinations such as w = 2, d = 1, and h = 1, or w = 1, d = 2, and h = 1, improved the model’s prediction performance. The best prediction performance was obtained with w = 1, d = 1, and h = 2. These results indicate that the proposed model effectively captures information across different time scales: by integrating inputs from different time scales, it can analyze the data comprehensively and is more likely to identify the crucial information specific to each scale. Notably, placing additional weight on recent traffic flow information yields the largest improvement in predictive performance.

5.4.5. Ablation Analysis

To further explore the contribution of each component of our model, ablation studies were conducted on the following variants of the TSGDC model, which are compared against the complete model:

Chebyshev GDC: removing the Enhanced Transformer.
The Enhanced Transformer: eliminating the Chebyshev Graph Diffusion Convolution operation.
TSGDC(-ECA): removing the auxiliary network ECA.

These variations aim to thoroughly analyze how each module contributes to the overall performance of the model, thereby assessing the strengths of the TSGDC model.

Figure 9 presents a bar chart that details the performance of each variant model across different datasets. The complete TSGDC model, which integrates three core components (Chebyshev Graph Diffusion Convolution, the Enhanced Transformer, and the Efficient Channel Attention auxiliary network), performs best. Removing any single component significantly weakens the model’s performance. Specifically, both the Chebyshev GDC variant, with the Enhanced Transformer removed, and the Enhanced Transformer variant, with Chebyshev GDC eliminated, show decreased predictive capability compared to the TSGDC(-ECA) variant and the complete TSGDC model. This underscores the importance of a comprehensive and efficient spatiotemporal modeling strategy in traffic flow prediction. Furthermore, the adoption of Efficient Channel Attention effectively focuses on critical information in dynamic spatiotemporal traffic flow data, enhancing the model’s prediction accuracy.

Figure 10 exhibits the prediction results of our model for various time steps on the PeMSD4 dataset. It reveals that the complete TSGDC model achieves precise predictions across all time steps, emphasizing the criticality of robust modeling of spatiotemporal dependencies in traffic flow prediction. This finding highlights the critical role of these components in ensuring the model’s overall performance. Meanwhile, the performance differences exhibited by the model at different time steps may stem from variations in the regularity of the data across time scales. Some time steps may contain more noise or unexpected events, making it difficult for the model to accurately capture these irregularities, affecting prediction performance. Therefore, the inconsistencies displayed by the model across different time steps reflect differences in data feature modeling capabilities across various time scales, suggesting that the authors need to comprehensively consider the data characteristics at different time scales in modeling and predicting traffic flow data to enhance the model’s prediction performance and robustness.

These findings emphasize the importance of comprehensively considering spatiotemporal dependencies and adopting efficient attention mechanisms in traffic flow prediction. Future research can further explore ways to optimize these components and their interactions to further enhance the model’s performance and generalization capabilities.

6. Conclusions

This article introduces a novel deep learning framework, TSGDC, designed to tackle various challenges in traffic flow prediction. The article addresses two key dimensions: mining traffic network information and modeling time series. In terms of traffic network information mining, the traffic flow prediction problem is transformed into a feature learning task based on graph-structured data. The framework employs spatially efficient channel attention for adaptive feature weighting and uses a Chebyshev Graph Diffusion Convolutional network to capture the structural attributes of traffic network nodes and model this high-order topological structure. This approach not only flexibly learns node features but also enhances the overall capture of graph structural characteristics through global information propagation. Regarding time series modeling, the article incorporates time series information such as recent, daily, and weekly cycles, and performs weighted fusion of these time series inputs. TimeECA is utilized to capture dependencies in time series of varying lengths and explore the impact of different time scales on prediction outcomes. Additionally, an enhanced transformer model based on the transformer architecture leverages a self-correlation mechanism to more accurately capture long-term dependencies and local features in time series data. This method not only improves the model’s ability to learn but also enhances its generalization capability, enabling it to handle more complex sequence patterns effectively. Finally, the article validates the superior performance of TSGDC in traffic flow prediction through various experiments, including comparisons with baseline models, hyperparameter tuning, and ablation studies.

However, the current TSGDC model is primarily trained on traffic flow data from specific regions, which may limit its transferability. Future research could explore cross-domain transfer learning methods to enable the model to migrate across different cities or regions while maintaining strong predictive performance.

In summary, the proposed TSGDC framework provides an innovative solution and methodology for traffic flow prediction. It represents a significant theoretical advancement and holds broad practical application potential, which could lead to substantial improvements and advancements in traffic management and planning.

Author Contributions

Conceptualization, D.L. and Y.Y.; methodology, Y.Y.; software, Y.Y.; validation, S.W. and D.L.; formal analysis, Y.Y.; investigation, S.W.; resources, C.W.; data curation, K.D.; writing—original draft preparation, Y.Y.; writing—review and editing, D.L.; visualization, S.W.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62206116 and Key R&D Plan of Hubei Provincial Department of Science and Technology 2023BCB041.

Data Availability Statement

All data are available publicly online; they are also available on request from the corresponding author.

Conflicts of Interest

Author Siwei Wei was employed by the company CCCC Second Highway Consultants Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Overall framework of the proposed TSGDC model, which consists of three key components: (a) the input layer, which takes spatial and temporal information as input; (b) the S-T Block module, which captures comprehensive spatial and temporal features; and (c) the prediction layer, which forecasts traffic flow over a future period.

Figure 2. Architecture of the Spatiotemporal Block. It consists of the following key components: (a) Spatial feature learning module: capturing spatial dimension features. (b) Temporal sequences learning module: capturing temporal dimension features. GAP denotes global average pooling, C stands for the feature dimension of the traffic signal, N represents the spatial node, T signifies the time step, k represents the computed one-dimensional convolution kernel, and σ indicates the Sigmoid activation function.

Figure 3. Self-correlation mechanism.

Figure 4. Performance of the compared methods as the forecasting horizon lengthens on the PeMSD4 dataset.

Figure 5. Performance of the compared methods as the forecasting horizon lengthens on the PeMSD8 dataset.

Figure 6. Graph Diffusion Convolution visualization. (a) Original (left) and diffusion (right) plots of the PeMSD4 dataset. (b) Original (left) and diffusion (right) plots of the PeMSD7 dataset.

Figure 7. Comparison of predicted and actual data.

Figure 8. Hyperparameter study on the PeMSD8 dataset.

Figure 9. Ablation experiments.

Figure 10. Ablation experiments at different time steps for the PeMSD4 dataset.

Table 1. Dataset information.

Dataset	Nodes	Collection Time	Data Shape (nodes, time steps, features)
PeMSD4	307	2018/01/01–2018/02/28	(307, 16,992, 3)
PeMSD7	883	2017/05/01–2017/08/31	(883, 28,224, 3)
PeMSD8	170	2016/07/01–2016/08/31	(170, 17,856, 3)

Table 2. Comparison of experimental results of different models.

Model	PeMSD4 MAE	PeMSD4 RMSE	PeMSD4 MAPE	PeMSD7 MAE	PeMSD7 RMSE	PeMSD7 MAPE	PeMSD8 MAE	PeMSD8 RMSE	PeMSD8 MAPE
HA	28.22	41.85	19.99%	45.12	65.64	24.51%	23.08	34.18	14.52%
ARIMA	33.73	48.80	24.18%	38.17	59.27	19.46%	31.09	44.32	22.73%
VAR	23.75	36.66	18.09%	50.22	75.63	32.22%	23.46	36.33	15.42%
LSTM	27.14	41.59	18.20%	29.32	44.39	13.30%	22.20	34.06	14.20%
STGCN	22.70	35.55	14.59%	25.33	39.34	11.21%	18.02	27.83	11.40%
DCRNN	24.70	38.12	17.12%	25.22	38.61	11.82%	17.86	27.83	11.45%
Graph WaveNet	25.45	39.70	17.29%	26.39	41.50	11.97%	19.13	31.05	12.68%
ASTGCN(r)	22.93	35.22	16.56%	24.01	37.87	10.73%	18.61	28.16	13.08%
STG2Seq	25.20	38.48	18.77%	32.77	47.16	20.16%	20.17	30.71	17.32%
STSGCN	21.19	33.65	13.90%	24.26	39.03	10.21%	17.13	26.80	10.96%
STGODE	20.84	32.82	13.77%	22.59	37.54	10.14%	16.81	25.97	10.62%
Z-GCNETs	20.84	32.82	13.77%	21.77	35.17	9.25%	16.81	25.97	10.62%
Traffic Transformer	21.10	31.46	15.13%	22.07	34.21	10.12%	16.79	25.11	11.41%
DSTAGNN	19.30	31.46	12.70%	21.42	34.51	9.01%	15.67	24.77	9.94%
TSGDC	18.80	31.08	12.67%	19.96	33.25	8.54%	14.12	23.39	9.63%
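
The gap between two rows of Table 2 is easier to judge as a relative error reduction than as raw differences. The sketch below is illustrative only and is not part of the authors' code: it copies the DSTAGNN and TSGDC rows from Table 2 and computes, for each dataset and metric, the percentage by which TSGDC lowers the error. The names `TABLE2`, `relative_reduction`, and `improvements` are hypothetical, and DSTAGNN is chosen simply because it is the last baseline row, not because it is the best baseline on every metric.

```python
# Values copied from Table 2 as (MAE, RMSE, MAPE in %).
# Only the DSTAGNN and TSGDC rows are included; names here are illustrative.
TABLE2 = {
    "PeMSD4": {"DSTAGNN": (19.30, 31.46, 12.70), "TSGDC": (18.80, 31.08, 12.67)},
    "PeMSD7": {"DSTAGNN": (21.42, 34.51, 9.01), "TSGDC": (19.96, 33.25, 8.54)},
    "PeMSD8": {"DSTAGNN": (15.67, 24.77, 9.94), "TSGDC": (14.12, 23.39, 9.63)},
}
METRICS = ("MAE", "RMSE", "MAPE")


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def improvements(table=TABLE2):
    """Return {dataset: {metric: reduction in %}} of TSGDC relative to DSTAGNN."""
    return {
        dataset: {
            metric: relative_reduction(base, prop)
            for metric, base, prop in zip(METRICS, rows["DSTAGNN"], rows["TSGDC"])
        }
        for dataset, rows in table.items()
    }


if __name__ == "__main__":
    for dataset, gains in improvements().items():
        summary = ", ".join(f"{m}: {g:.2f}%" for m, g in gains.items())
        print(f"{dataset}: {summary}")
```

With the Table 2 values, the MAE reduction relative to DSTAGNN is roughly 2.6% on PeMSD4, 6.8% on PeMSD7, and 9.9% on PeMSD8, so the margin is smallest on PeMSD4.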

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
