1. Introduction
An efficient transportation system significantly reduces traffic congestion, enhances transportation efficiency, lowers logistics costs, and promotes economic growth. Intelligent transportation systems (ITSs) are crucial in modern city management and transportation planning. By collecting and analyzing traffic data, ITSs provide a scientific foundation for urban planning and transportation management. Additionally, ITSs can enhance infrastructure safety through traffic load modeling [1,2].
Traffic flow forecasting is a crucial part of ITS and a significant branch of spatial–temporal prediction. It involves analyzing historical traffic conditions, modeling the spatial–temporal dependencies in traffic flows, and using these data to estimate future traffic conditions at a specific location. The complexity of modeling these spatial–temporal dependencies is the central challenge of traffic forecasting.
Classical statistical methods such as the autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average (SARIMA) have been used to solve short-term traffic flow forecasting problems [3,4,5]. Statistical models are interpretable. However, they require predefined structures and usually assume that the data are stationary, which limits their ability to deal with the complexity and nonlinearity inherent in traffic data.
Machine learning methods do not rely on predefined structures and can automatically detect patterns in data, making them more suitable for traffic flow prediction tasks. For example, the support vector machine (SVM) is a representative traditional machine learning algorithm; SVM-based models have good generalization properties and are relatively easy to optimize [6,7,8]. However, their hyper-parameter tuning process increases computational complexity and training time. In contrast, ensemble learning methods that utilize boosting and bagging techniques not only reduce the need for extensive hyper-parameter tuning but are also easier to apply [9,10,11]. Despite this, they still face the inherent limitations of traditional machine learning models, resulting in constrained prediction accuracy.
Recently, the rapid development of deep learning has led to significant advancements in traffic flow forecasting, markedly improving prediction accuracy. Various deep learning methods, such as recurrent neural networks (RNNs) [12,13,14], temporal convolutional networks (TCNs) [15], and Transformers [16], can capture the temporal dependence in time sequences. Unlike RNNs and TCNs, the Transformer models sequence data entirely through an attention mechanism, enabling efficient parallel computation. The Transformer has been recognized as a robust neural network for modeling long time sequences and has been applied to various time series tasks. The Transformer's success originated in machine translation, which translates source sentences (source sequences) in one language into target sentences (target sequences) in another language. The source and target sequences are represented as tokens before being sent into sequence-to-sequence (Seq2seq) models [17]. There are clear similarities between machine translation and traffic flow prediction: the historical traffic flow data (source sequences) are the input, and the predicted future traffic flows (target sequences) are the output. Both historical and predicted traffic flow sequences can be represented as embeddings, like tokens. This similarity lays the foundation for applying the Transformer to traffic flow forecasting [18].
Traffic flow constantly changes on spatial–temporal scales, and the traffic flow data collected in a specific area consist of a set of time series that can be viewed as data defined over a graph domain. Traffic sensors at intersections correspond to the graph nodes; these sensors collect parameters such as the density and speed of the vehicle flow. The connectivity and distance of the paths between the sensors correspond to the edges of the graph and their weights. The traffic condition is the graph signal that changes over time. Since graph neural networks (GNNs) [19,20,21,22] are powerful tools for processing graph data, incorporating a GNN into a sequence model yields spatial–temporal graph neural networks (STGNNs) [23,24,25,26], which jointly capture spatial and temporal dependencies. It is also possible to incorporate graph information into the Transformer to support graph structure understanding within the Transformer; this combination is also a form of STGNN [27,28,29]. These hybrid network structures, which can extract complex spatial–temporal correlations, make spatial–temporal modeling networks increasingly complex. Additionally, GNNs often rely heavily on predefined graph structures or focus only on GNN learning, disregarding the fact that the graph structure evolves over time and a predefined structure may be suboptimal at the current moment.
The self-attention mechanism in the Transformer allows for the dynamic modeling of spatial dependencies and captures real-time traffic flow; this mechanism is equivalent to updating the graph structure during training. Approaches developed in previous studies [30] rely on the Transformer alone to extract spatial–temporal information from traffic flow, simplifying the process compared with complex STGNNs. However, temporal and spatial information is fed into the first layer of the Transformer encoder at the same time, which mixes the spatial and temporal content before the temporal correlation has been extracted independently.
The natural language processing (NLP) field has advanced remarkably in recent years with the development of pre-trained large language models (LLMs). These LLMs facilitate model training for various NLP downstream tasks and are expanding beyond the traditional NLP scope. However, spatial–temporal modeling has not fully benefited from this advancement. While pre-trained temporal models have been applied [31,32], the largest dataset for time series analysis is much smaller than those available in NLP [33]. As a result, there is still a lack of sufficient data to pre-train spatial–temporal foundation models. Several studies have attempted to address this gap by applying LLMs to spatial–temporal tasks [34,35,36,37,38]. For instance, [34] pioneered the use of LLMs for time series analysis, retraining the input embedding layer of the LLM to project time series data into the appropriate LLM dimensions. Ref. [35] combines the graph attention network (GAT), which specializes in capturing dependencies in graph structures, with an LLM to predict missing values in sequences: the LLM processes the time series data, embedding them into a high-dimensional space, and the GAT then integrates spatial information, enhancing the overall prediction accuracy. Ref. [36] aligns language and time series data, inputs the aligned time series embeddings into the LLM, and uses the Prompt-as-Prefix technique during LLM fine-tuning. In [37], LLMs were introduced into traffic flow forecasting for the first time, using fusion convolution to generate a spatial–temporal representation that is fed into the LLM. A recent study introduced UrbanGPT [38], which integrates a spatial–temporal dependency encoder with an instruction fine-tuning approach to better understand the intricate relationship between time and space. To seamlessly align spatial–temporal signals with LLMs, a spatial–temporal instruction tuning paradigm was developed, enabling the model to generalize effectively across various urban scenarios, even in data-scarce conditions.
This study proposes a spatial–temporal Transformer network incorporating a pre-trained language model (STTLM) to forecast traffic flow. The key contributions of this study are summarized as follows:
- (1)
We developed a framework that extracts spatial–temporal features from traffic data using the Transformer's self-attention mechanism together with purpose-designed embeddings that capture spatial–temporal dependencies.
- (2)
Our approach first uses the temporal Transformer (TT) to extract features related to temporal information separately. These features are then input into the spatial Transformer (ST) together with the unique embedding associated with spatial data. This method realizes the fusion of spatial–temporal information, avoids mixing spatial and temporal details during the initial self-attention process, and maximizes the role of the embeddings in the model.
- (3)
Additionally, we utilize pre-trained language models to improve sequence prediction performance without the need for complex alignment between temporal and linguistic data.
3. Method
3.1. Embedding Layer
We use $T$ to denote the number of time steps of the historical (source) traffic series and $T'$ to denote that of the forecast (target) traffic series. The source sequence is denoted by $X$ and the target sequence by $Y$. $X$ consists of the traffic feature matrices from time step $t-T+1$ to $t$, formulated as $X = [X_{t-T+1}, X_{t-T+2}, \dots, X_{t}]$, where each $X_i \in \mathbb{R}^{N \times d_s}$ and thus $X \in \mathbb{R}^{T \times N \times d_s}$. $N$ is the number of nodes, and $d_s$ is the dimension of the input features, equal to 1. The detailed process of generating embeddings based on source sequences is described as follows:
Feature embedding $E_f$: To retain the native information in the source sequences, we pass $X$ through a fully connected layer to obtain the feature embedding $E_f \in \mathbb{R}^{T \times N \times d}$, where $d$ is the embedding dimension.
Temporal continuity embedding $E_p$: Positional encoding is incorporated when using a Transformer as a language model [14]. A previous study [18] introduced relative position coding and global position coding for the temporal continuity of traffic flow. Both methods use hard coding (e.g., predefined sine/cosine functions and precomputed values). Temporal continuity encoding can instead be obtained through training rather than hard coding. We first assign random values to $E_p$ and let it learn the temporal continuity of the traffic sequence during training.
Periodicity embeddings $E_w$ and $E_d$: Unlike natural language, time series contain periodicity and temporal continuity information. For instance, traffic flows observed at the same time of day on different days may be extremely similar, so the same embedding can be assigned to data in the traffic sequence at the same moment of the day. Similarly, the same day in different weeks can correspond to the same embedding [30]. $E_w$ denotes the embedding of the weekly cycle, and $E_d$ denotes the embedding of the daily cycle. The weights of these embedding layers are also randomly initialized and then trained.
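To make the embedding design concrete, the following is a minimal PyTorch sketch of the learnable embeddings described above (the class and parameter names, the default dimensions, and the summation used to combine the embeddings are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class TemporalEmbedding(nn.Module):
    """Sketch of the learnable embeddings (hypothetical names and dimensions):
    - feature embedding: linear projection of the raw traffic feature (d_s = 1)
    - continuity embedding: a learnable positional table over the T input steps
    - periodic embeddings: lookup tables indexed by time-of-day and day-of-week
    """
    def __init__(self, steps_per_day=288, d_model=24, T=12, d_s=1):
        super().__init__()
        self.feat_proj = nn.Linear(d_s, d_model)              # feature embedding E_f
        self.pos_emb = nn.Parameter(torch.randn(T, d_model))  # continuity embedding E_p
        self.tod_emb = nn.Embedding(steps_per_day, d_model)   # daily-cycle embedding E_d
        self.dow_emb = nn.Embedding(7, d_model)                # weekly-cycle embedding E_w

    def forward(self, x, tod_idx, dow_idx):
        # x: (B, T, N, d_s); tod_idx, dow_idx: (B, T) integer indices
        e_f = self.feat_proj(x)                                # (B, T, N, d_model)
        e_p = self.pos_emb[None, :, None, :]                   # (1, T, 1, d_model)
        e_d = self.tod_emb(tod_idx)[:, :, None, :]             # (B, T, 1, d_model)
        e_w = self.dow_emb(dow_idx)[:, :, None, :]             # (B, T, 1, d_model)
        # combine the four embeddings (summation here; concatenation is also possible)
        return e_f + e_p + e_d + e_w
```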
Spatial embedding $E_s$: Time series from two traffic nodes that are physically close to each other may exhibit a time offset even if their waveforms are similar. The embeddings mentioned above do not reflect this association. However, if the time-domain waveforms are transformed into another domain, the effect of the time offset can be removed, revealing a strong correlation between spatial nodes. $E_s$ is therefore an embedding generated from information in the transform domain (obtained, e.g., via the DFT or a wavelet transform) to represent spatial information. We use the Haar wavelet to transform the traffic series of each node. The coefficients $l$ of the low-pass filter and the coefficients $h$ of the high-pass filter in the Haar wavelet transform are
$$l = \left[\tfrac{1}{\sqrt{2}},\ \tfrac{1}{\sqrt{2}}\right], \qquad h = \left[\tfrac{1}{\sqrt{2}},\ -\tfrac{1}{\sqrt{2}}\right].$$
The approximate coefficients $a[k]$ and detail coefficients $c[k]$ of a node series $x$ are calculated as follows:
$$a[k] = \frac{x[2k] + x[2k+1]}{\sqrt{2}}, \qquad c[k] = \frac{x[2k] - x[2k+1]}{\sqrt{2}}.$$
The spatial embedding $E_s$ is then constructed from these wavelet coefficients.
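As a worked example, the following sketch computes one level of the Haar wavelet transform for each node's series; the mapping from the resulting coefficients to $E_s$ (shown here as a concatenation) is an assumption for illustration:

```python
import numpy as np

def haar_dwt(series):
    """One-level Haar DWT along the last axis (series length assumed even).

    series: (N, T) array of traffic series, one row per node.
    Returns approximate and detail coefficients, each of shape (N, T // 2).
    """
    even, odd = series[..., 0::2], series[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass:  l = [1/sqrt(2),  1/sqrt(2)]
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: h = [1/sqrt(2), -1/sqrt(2)]
    return approx, detail

# Hypothetical spatial embedding: concatenate the transform-domain coefficients
# per node (the exact mapping to E_s is not specified in this sketch).
# approx, detail = haar_dwt(node_series)            # node_series: (N, T)
# E_s = np.concatenate([approx, detail], axis=-1)   # (N, T)
```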
We take $E_f$, $E_p$, $E_w$, and $E_d$ to have the same dimension $d$; combined, they form the temporal embedding fed into the TT encoder, while $E_s$ serves as the spatial embedding fed into the ST encoder.
3.2. Network Structure
Figure 1 demonstrates the STTLM framework, which consists of a spatial–temporal encoder and a pre-trained LM.
The spatial–temporal encoder consists of two components: the TT encoder and the ST encoder. The TT encoder processes the temporal dependency first, while the ST encoder handles the spatial dependency and integrates the spatial–temporal information.
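The ordering of the two encoders can be summarized by the following minimal sketch (the module interfaces and the additive fusion of the spatial embedding are assumptions for illustration, not the paper's exact implementation):

```python
import torch.nn as nn

class SpatialTemporalEncoder(nn.Module):
    """High-level sketch of the encoder ordering: temporal attention first (TT),
    then spatial attention over the nodes (ST). `tt` and `st` stand for attention
    stacks that preserve the (B, N, T, d) shape."""
    def __init__(self, tt: nn.Module, st: nn.Module):
        super().__init__()
        self.tt = tt  # temporal Transformer encoder
        self.st = st  # spatial Transformer encoder

    def forward(self, temporal_emb, spatial_emb):
        # temporal_emb: (B, N, T, d); spatial_emb: (N, d)
        z = self.tt(temporal_emb)                  # temporal dependencies, per node
        z = self.st(z + spatial_emb[:, None, :])   # fuse spatial embedding, model node interactions
        return z
```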
Each time step's embedding from the temporal embedding $E$ is input into the TT encoder. The output $Z_{att}$ of the self-attention layer is computed using the scaled dot-product attention mechanism:
$$Z_{att} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad Q = E W_{Q},\; K = E W_{K},\; V = E W_{V},$$
where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable weight matrices, and $\mathrm{Softmax}(QK^{\top}/\sqrt{d})$ is the attention score matrix that captures the temporal dependencies within the respective time series of the $N$ nodes. $Z_{att}$ then passes through layer normalization, a skip connection, and FFN layers to obtain the output $Z$ of the TT.
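A minimal PyTorch sketch of this temporal self-attention, applied along the time axis independently for each node, is shown below (single-head for brevity; the implementation in Section 4.2 uses 4 heads):

```python
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelfAttention(nn.Module):
    """Sketch of one TT self-attention layer; names are illustrative."""
    def __init__(self, d_model=24):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, e):
        # e: (B, N, T, d_model) -- the (B, T, N, d) embedding permuted so that
        # attention runs over the time axis of each node's sequence
        q, k, v = self.W_q(e), self.W_k(e), self.W_v(e)
        scores = q @ k.transpose(-2, -1) / (e.size(-1) ** 0.5)  # (B, N, T, T)
        attn = F.softmax(scores, dim=-1)                        # temporal attention scores
        z_att = attn @ v                                        # (B, N, T, d_model)
        return z_att  # followed by skip connection, LayerNorm, and FFN in the encoder
```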
As shown in Figure 2, we expand $Z$ along the node dimension into $Z'$ before inputting it into a pre-trained LM. The embedding of each time step of $Z'$ is fed into the LM as a token. Each token must be extended through padding to fit the input dimension of the hidden layer in the LM. The output of the last hidden layer of the LM is projected into the target sequences through a linear layer.
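The reshaping and padding step can be sketched as follows (the flattening order and the zero-padding to the LM hidden size, e.g., 4096 for Llama-7B, are assumptions used for illustration):

```python
import torch

def to_lm_tokens(z, lm_hidden=4096):
    """Sketch: treat each node's per-time-step embedding as one token and
    zero-pad the channel dimension to the LM hidden size.
    z: (B, N, T, d); returns (B * N, T, lm_hidden)."""
    B, N, T, d = z.shape
    tokens = z.reshape(B * N, T, d)                   # one token sequence per node
    pad = torch.zeros(B * N, T, lm_hidden - d, device=z.device, dtype=z.dtype)
    return torch.cat([tokens, pad], dim=-1)           # padded to the LM input width
```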
4. Experiment Details
4.1. Dataset
We evaluated the algorithms' performance on four real-world traffic prediction datasets: PEMS04, PEMS08, PEMS-BAY, and METR-LA. The METR-LA traffic dataset contains traffic information collected from 207 loop sensors on Los Angeles freeways. The PEMS-BAY, PEMS04, and PEMS08 traffic datasets were collected by the California Department of Transportation (Caltrans) Performance Measurement System (PeMS) [59]. The PEMS-BAY traffic dataset contains traffic information collected from 325 sensors in the Bay Area. The sampling interval for each dataset was 5 min, and the details are shown in Table 1.
4.2. Implementation
METR-LA and PEMS-BAY are divided into training, validation, and test sets in a ratio of 7:1:2, whereas PEMS04 and PEMS08 are divided in a ratio of 6:2:2. The training and test sets were obtained sequentially. If the history–prediction data pairs are shuffled before being divided according to these ratios, the prediction performance improves significantly; for example, the 1 h mean absolute error (MAE) on the METR-LA dataset can be reduced to 3.00, which is much better than the 3.31 shown in Table 2. However, to remain consistent with the baselines used for comparison, the test dataset is still taken from the last 20% of the sequence.
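A minimal sketch of the sequential split described above (the windowing into history–prediction pairs is omitted; function and parameter names are illustrative):

```python
def sequential_split(data, ratios=(0.7, 0.1, 0.2)):
    """Split a traffic series sequentially, without shuffling across split
    boundaries, as for METR-LA/PEMS-BAY; PEMS04/08 would use (0.6, 0.2, 0.2)."""
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]   # test set is the last part of the sequence
    return train, val, test
```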
The proposed model was implemented with PyTorch 2.0.1 on an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA). The temporal Transformer encoder was set up with three layers and the spatial Transformer encoder with four layers, each using 4 attention heads. We employed Llama-7B as the pre-trained LM; only the parameters of one hidden layer (decoder layer) were used, fine-tuned with LoRA.
When performing LoRA fine-tuning, the parameters of the pre-trained LM were frozen, and only the parameters of the newly added low-rank matrices were trained. This approach significantly reduces the number of trainable parameters and lowers the GPU requirements. Assuming that the original pre-trained parameter matrix is $W_0$, LoRA does not train $W_0$ directly. Instead, it adds $\Delta W = BA$ to the frozen $W_0$, where $A$ and $B$ are both low-rank matrices; the parameter matrix of the LM becomes $W = W_0 + \Delta W$. Suppose $W_0$ has dimensions $d \times k$; then $B$ has dimensions $d \times r$ and $A$ has dimensions $r \times k$, with the rank $r$ much smaller than $d$ and $k$. We implemented LoRA fine-tuning by inserting the $A$ and $B$ of $\Delta W$ as residual connections in the self-attention part (Figure 2), with $r$ set to 32.
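The following is a minimal sketch of such a LoRA residual branch wrapped around a frozen linear projection (the scaling factor and initialization follow common LoRA practice and are assumptions here; in practice a library such as PEFT can be used):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA fine-tuning: the pre-trained weight W0 is frozen and a
    low-rank update Delta_W = B @ A is added as a residual branch (r = 32)."""
    def __init__(self, base: nn.Linear, r=32, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze W0
        d_out, d_in = base.weight.shape                    # W0: d x k
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01) # A: r x k
        self.B = nn.Parameter(torch.zeros(d_out, r))       # B: d x r, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # base(x) uses the frozen W0; the low-rank branch adds Delta_W = B @ A
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```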
Table 2 provides the parameters of the models with different numbers of hidden layers used. STTLM_2L differs from STTLM by using two hidden layers from Llama-7B. As shown in Table 2, LoRA fine-tuning greatly reduced the number of trainable parameters in the pre-trained LM.
4.3. Metrics
We used three commonly used traffic prediction metrics [23]: MAE, root mean squared error (RMSE), and mean absolute percentage error (MAPE). Let $y_i$ denote the true values to be predicted, $\hat{y}_i$ the predicted values, and $n$ the number of observed samples. The metrics are defined as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|.$$
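Equivalently, as a small NumPy sketch (the epsilon guard against zero true values in MAPE is an implementation detail, not part of the formal definition; traffic benchmarks typically mask zero flows instead):

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-8):
    return np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + eps))) * 100.0
```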
Based on previous work, we compared performance on the METR-LA and PEMS-BAY datasets at horizons 3, 6, and 12 (15, 30, and 60 min). For the PEMS04 and PEMS08 datasets, we report the average performance over all 12 prediction horizons.
4.4. Baselines
Our proposed method was compared with several widely used baselines. Five models were considered: the STGNNs DCRNN [23], GWNet [24], AGCRN [25], and MTGNN [26], and the Transformer-based STAEformer [29]. The spatial–temporal encoding results from our method were also fed into four-layer (ST_4L) and seven-layer (ST_7L) Transformer decoders to compare their performance with that of the pre-trained LM. The baseline methods are summarized as follows:
DCRNN: a diffusion convolutional recurrent neural network that combines diffusion graph convolution with RNNs.
GWNet: a spatial–temporal graph convolutional network (STGCN) that integrates diffusion graph convolution with dilated one-dimensional convolution.
AGCRN: an adaptive graph convolutional recurrent neural network that merges adaptive graph learning with recurrent neural networks.
MTGNN: a spatial–temporal graph convolutional network that blends graph convolution with temporal convolution.
STAEformer: a Transformer network that combines spatial–temporal adaptive embedding with a Transformer encoder.