1. Introduction
In 2020, China put forward the goal of “reaching peak carbon by 2030 and achieving carbon neutrality by 2060” [1]. Global carbon dioxide emissions from energy combustion hit a historic peak of 36.8 billion tons in 2022. Against this background, wind energy has emerged as a pivotal area of investigation in many countries [2]. Nevertheless, meteorological factors in the atmosphere change rapidly, causing transient or short-term fluctuations in wind power generation, which poses a challenge to the optimal scheduling and safe operation of the power system [3,4]. By improving the accuracy of wind power prediction, the grid scheduling department can rationally arrange the power generation plan and improve the economy of grid operation.
With the development of Artificial Intelligence (AI) [5], machine learning [6] and deep learning models have been widely used in wind power prediction and are gradually replacing earlier methods. Reference [7] presents a distinctive hybrid forecasting model that integrates Long Short-Term Memory (LSTM) with Gauss, Morlet, Ricker, and Shannon wavelets for power forecasting; the wavelet transforms address issues such as gradient vanishing and nonlinear mapping, improving the precision of wind power prediction. Reference [8] introduces a framework for predicting wind power generation at 5 min intervals, which combines signal decomposition with a heuristic technique, CEMOLS, to optimize the parameters of neural networks. Reference [9] introduces the AMC-LSTM model for wind power forecasting, which uses an attention mechanism to dynamically assign weights to physical attribute data, Convolutional Neural Networks (CNN) to extract short-term features, and LSTM to capture long-term trends. In reference [10], three types of Deep Neural Networks (DNNs) were tested, and the best result was achieved with the Gated Recurrent Unit (GRU) network. Reference [11] introduces a novel spatiotemporal directed graph convolutional neural network and verifies its superiority in representing spatiotemporal correlations. Reference [12] combines point prediction and probability density prediction, and the results show accurate prediction under extreme weather. The literature outlined above demonstrates that prediction accuracy can be significantly enhanced by combining multiple models.
The Transformer is a highly effective deep learning architecture that performs strongly on sequential data and is becoming increasingly popular in wind power prediction. Reference [13] utilizes Transformer networks with a multi-head attention mechanism to effectively capture sequential dependencies, irrespective of their distance. Reference [14] investigates, for the first time, the application of the Transformer-based architectures Informer, LogSparse Transformer, and Autoformer. The study in [15] introduces a Transformer-based deep neural network incorporating the wavelet transform to predict wind speed and wind energy generation up to 6 h ahead. The author of [16] introduces a wind power forecasting model that integrates LSTM, to capture the temporal dynamics of weather data, with a Vision Transformer (ViT), which uses multi-head self-attention to connect the extracted features to the desired outputs. However, the conventional Transformer model encounters significant time complexity when handling long sequential data, which can be resolved with ProbSparse self-attention. The ProbSparse self-attention mechanism achieves a time complexity and memory consumption of O(L log L), a significant improvement over the self-attention mechanism of the conventional Transformer [17]. We use ProbSparse self-attention in place of the Transformer’s original self-attention mechanism and refer to the resulting model as the Psformer.
The Temporal Convolutional Network (TCN) uses convolutional operations to extract features from time series data and has demonstrated strong performance in many applications, including time series prediction and classification. Reference [18] employs a deep clustering model based on the Categorical Generative Adversarial Network (CGAN) for precise classification and enhances the TCN by integrating a gating mechanism into its activation function. In reference [19], an improved Temporal Convolutional Network (MITCN) is devised for multi-step time series prediction, incorporating the quadratic spline quantile function (QSQF) to enable probabilistic forecasting. Bidirectional neural networks have recently gained popularity because they capture contextual information about both past and future states, making it possible to accurately model and predict sequences with long-term dependencies. To address the issue of offshore wind power generation being affected by extreme weather conditions, reference [20] introduces a prediction method that merges convolution and attention mechanisms, integrating a bidirectional long short-term memory (BiLSTM) network with an attention mechanism (AM); the mean square error of this model is shown to be much lower than that of an LSTM without the bidirectional structure. In [21], BiLSTM’s bidirectional capability was likewise used to extract deep correlations when analyzing wind power data. Reference [22] proposed a model based on the XGBoost algorithm combined with financial technical indicators, achieving high prediction accuracy and speed. Reference [23] presents a hybrid model that combines BiTCN and BiLSTM with an attention mechanism, which improves the model’s capacity to concentrate on significant features. The recent literature thus shows that neural networks with both forward and backward processing are becoming more widely used than traditional unidirectional ones.
Data decomposition is a highly successful technique for extracting the underlying patterns in data, thereby lowering the complexity of training prediction models and improving their accuracy. The Wavelet Packet Transform (WPT) [24] uses a collection of orthogonal and rapidly decaying wavelet function bases to accurately fit signals. Ensemble Empirical Mode Decomposition (EEMD) [25] is an enhanced version of Empirical Mode Decomposition (EMD) [26]: it introduces white noise into the signal to fill in missing scales and has demonstrated excellent performance in signal decomposition. Variational Mode Decomposition (VMD) [27] is a completely non-recursive algorithm that iteratively searches for the optimal solution of a variational model to identify the center frequency and bandwidth of each component. The experimental findings of reference [28] demonstrate that the decomposition effect of VMD is more pronounced in power prediction than that of EMD.
According to the aforementioned research, most studies focus solely on enhancing prediction accuracy through the integration of one or two models; research combining signal decomposition, BiTCN feature extraction, and the Transformer model remains scarce. Therefore, we propose a novel hybrid prediction model named VMD-BiTCN-Psformer, which offers the following key contributions:
- (1) Through VMD, the original signal is decomposed as a whole into component signals. The decomposition acts on features of specific significance and yields more comprehensive information about the structure within the signal.
- (2) The concept of positional encoding is extended to extract hidden multi-scale temporal and seasonal information, which is then merged with the positional encoding.
- (3) BiTCN is introduced to extract features of the time segments near each observation point, broadening the range of information from which the model can learn and generate predictions, and the connection structure between the self-attention mechanism and the BiTCN output is adjusted.
- (4) ProbSparse self-attention is introduced, and the Psformer model is designed to integrate the characteristics of significant time points into the self-attention mechanism, enhancing the forecast accuracy of wind power while reducing computational complexity.
3. Model of VMD-BiTCN-Psformer
3.1. Improvement of Transformer’s Positional Encoding
For time series data, hidden temporal features are as essential as position information. Therefore, we improve the Transformer’s original positional encoding by incorporating temporal and seasonal information encoding. Based on the concept of positional encoding, the time features are processed to extract global time features from different timestamps. The time encoding, hourly encoding, daily encoding, and monthly encoding dimensions are normalized and then merged.
In Figure 1, $u_n$ is the projection of the wind farm data series, $p_n$ is the absolute positional encoding, and $t_n$, $h_r$, $d_e$, and $m_w$ represent the temporal and seasonal information encodings in the different time dimensions. After splicing the three parts, $c_n$ is obtained, and the equation for this is:

$$c^{t,i} = u^{t,i} + p^{t,i} + \sum_{z=1}^{Z} s_z^{t,i}$$

where $x^{t,i}$ represents the $i$-th column data for the $t$-th sequence input, $u^{t,i}$ is the normalized vector of the $i$-th column data for the $t$-th sequence input, $p^{t,i}$ is the absolute positional encoding of that input vector, $s_z^{t,i}$ is its encoding vector normalized according to the $z$-th time dimension, the summation term represents the data after normalization of the time, hour, day, and month encodings, and $Z$ is the number of time dimensions.
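To make the splicing concrete, the following is a minimal PyTorch sketch of such an embedding. It assumes an Informer-style merge by summation, a sinusoidal absolute positional encoding, and four normalized time dimensions (time, hour, day, month); the class and argument names (`SpliceEmbedding`, `n_time_dims`, etc.) are illustrative, not the paper’s implementation, and whether the paper merges by summation or concatenation is our assumption.

```python
import torch
import torch.nn as nn

class SpliceEmbedding(nn.Module):
    # Sketch of c = u + p + sum_z s_z: value projection + absolute positional
    # encoding + temporal/seasonal encodings (all names hypothetical).
    def __init__(self, n_features, d_model, n_time_dims=4, max_len=5000):
        super().__init__()
        # u: project the raw series to the model dimension with a 1-D convolution
        self.value_proj = nn.Conv1d(n_features, d_model, kernel_size=3, padding=1)
        # p: fixed sinusoidal absolute positional encoding (assumes even d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # s_z: one learnable map per time dimension (time, hour, day, month)
        self.time_embeds = nn.ModuleList(nn.Linear(1, d_model) for _ in range(n_time_dims))

    def forward(self, x, time_marks):
        # x: (B, L, n_features); time_marks: (B, L, n_time_dims), normalized
        u = self.value_proj(x.transpose(1, 2)).transpose(1, 2)   # (B, L, d_model)
        p = self.pe[: x.size(1)].unsqueeze(0)                    # (1, L, d_model)
        s = sum(emb(time_marks[..., z:z + 1]) for z, emb in enumerate(self.time_embeds))
        return u + p + s                                         # spliced encoding c
```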
3.2. BiTCN
TCN is a model that combines the benefits of causal convolution and dilated convolution, building upon the CNN paradigm. Figure 2 shows the structural diagram of TCN.
Preventing information leakage is crucial in long-term data prediction. In contrast to the convolution in a CNN, causal convolution is a unidirectional structure in which the features extracted at time $t$ are influenced only by the values of the current and previous moments; the original data after time $t$ cannot be used. By utilizing dilated convolution, the TCN can attain a larger receptive field with fewer layers: the input is sampled at intervals during convolution, so the receptive field grows exponentially as more layers are added. Thus, the TCN can effectively capture long-range dependencies and temporal patterns in the input data while reducing computational complexity. The dilated causal convolution of the TCN is defined as:
$$F(t) = \sum_{s=0}^{k-1} f(s)\, x_{t - d_m \cdot s}$$

where $F(t)$ represents the $t$-th output of the TCN, $k$ is the kernel size, $f(s)$ is the $s$-th element of the convolution kernel, $x$ is the input, and $d_m$ is the dilation factor.
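As an illustration of the left-padded sampling in this equation, here is a small PyTorch sketch of a dilated causal convolution layer; the class name and hyperparameter values are hypothetical.

```python
import torch
import torch.nn as nn

class DilatedCausalConv1d(nn.Module):
    # F(t) = sum_{s=0}^{k-1} f(s) * x_{t - d_m * s}: pad on the left only,
    # so the output at time t never sees inputs after t.
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, seq_len)
        x = nn.functional.pad(x, (self.left_pad, 0))  # causal: pad the past, not the future
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially
tcn = nn.Sequential(*(DilatedCausalConv1d(16, 3, 2 ** i) for i in range(4)))
y = tcn(torch.randn(8, 16, 96))  # receptive field: 1 + (3-1)*(1+2+4+8) = 31 steps
```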
TCN focuses exclusively on the forward convolutional computation of the input sequence and ignores the backward information of the prediction results. BiTCN captures the hidden features of wind power sequences by considering both forward and backward information, thereby better modeling long-term dependencies. BiTCN processes the input data through multiple convolutional layers, each of which extracts features through different filters and activation functions, while dilated convolution is used to enlarge the receptive field, i.e., to expand the model’s coverage of the input data without increasing the number of parameters. The structure of BiTCN is shown in Figure 3.
Wind power time series will change over time due to extreme weather, day–night alternation, and seasonal changes. Therefore, whether the current wind power value is an anomaly, a change point, or part of a state pattern depends strongly on its surrounding context. BiTCN can extend the receptive field of the model by increasing the number of layers and then extract the features of the corresponding time segments.
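A minimal sketch of how a bidirectional TCN can be assembled from two causal stacks is given below, reusing the hypothetical `DilatedCausalConv1d` class from the previous sketch; whether the paper concatenates or sums the two directions is our assumption.

```python
import torch
import torch.nn as nn

class BiTCN(nn.Module):
    # One causal stack reads the series forward, a second reads it reversed,
    # and the two feature maps are concatenated along the channel dimension.
    def __init__(self, channels, kernel_size=3, n_layers=4):
        super().__init__()
        def stack():
            return nn.Sequential(*(DilatedCausalConv1d(channels, kernel_size, 2 ** i)
                                   for i in range(n_layers)))
        self.fwd, self.bwd = stack(), stack()

    def forward(self, x):
        # x: (batch, channels, seq_len)
        f = self.fwd(x)                    # past -> future context
        b = self.bwd(x.flip(-1)).flip(-1)  # future -> past context
        return torch.cat([f, b], dim=1)    # (batch, 2*channels, seq_len)
```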
BiTCN is also introduced to mitigate an inherent limitation of the Transformer model. $Q$, $K$, and $V$ in the Transformer are different vectors obtained by multiplying the encoded and spliced initial vector with three weight matrices: $W_Q$, $W_K$, and $W_V$. In this case, the dot product of the query and key is computed as the correlation of time-point data, without considering the features of the time segments near each observation point. As shown in Figure 4, introducing BiTCN changes the connection structure of the self-attention mechanism: the time series data containing the feature information of different time segments serve as the input of the query and key vectors, while the original time series data are still used as the input of the value vector. The correlation computed from the BiTCN-generated query and key vectors can effectively establish relationships between the time segments of the series and highlight the feature information of key historical time segments.
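This rewiring can be expressed compactly. The sketch below, with hypothetical names and shapes, shows queries and keys projected from the BiTCN feature map while values are projected from the original spliced sequence.

```python
import torch
import torch.nn as nn

class BiTCNAttentionInput(nn.Module):
    # Modified connection: Q and K come from the BiTCN feature map
    # (time-segment features), V from the original spliced sequence.
    def __init__(self, d_model):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)

    def forward(self, bitcn_out, c):
        # bitcn_out, c: (batch, seq_len, d_model)
        Q = self.W_Q(bitcn_out)  # segment-aware queries
        K = self.W_K(bitcn_out)  # segment-aware keys
        V = self.W_V(c)          # values keep the original series
        return Q, K, V
```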
3.3. Improvement of Multi-Head Self-Attention Mechanism
The wind power sequences processed by BiTCN are input into the Psformer model. When handling long time sequences, the multi-head attention mechanism of the Transformer model often suffers from high computational complexity and slow computation. Therefore, this paper proposes a multi-head ProbSparse self-attention mechanism, whose structure is shown in Figure 5.
During the calculation of self-attention, not all time points are strongly connected: only a small number of dot products contribute significantly to the attention, while the impact of the others is weak and can be ignored. The ProbSparse self-attention mechanism therefore ranks the query vectors by relevance and performs the dot-product computation only on the selected vectors. Query vectors with relatively high relevance are kept as the primary elements of the sparse matrix, while those of relatively low importance are discarded, yielding the ProbSparse self-attention matrix.
In ProbSparse self-attention, each key can focus on only the $u$ major queries. The ProbSparse self-attention calculation process is shown in Algorithm 1. The improved mechanism function and the sparsity measurement are:

$$A(Q, K, V) = \mathrm{Softmax}\!\left(\frac{\bar{Q} K^{\top}}{\sqrt{d}}\right) V, \qquad \bar{M}(q_i, K) = \max_{j}\left\{\frac{q_i k_j^{\top}}{\sqrt{d}}\right\} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{d}}$$

where $\bar{Q}$ is the sparse part of the ProbSparse self-attention matrix, containing only the queries ranked highest by the measurement $\bar{M}$, $q_i$ and $k_i$ represent the $i$-th rows of $Q$ and $K$, respectively, and $L_K$ is the length of the keys. The time and space complexity of attention calculated with the ProbSparse method are $O(L \ln L)$.
Algorithm 1. ProbSparse Self-Attention Calculation Process
Input: tensors $Q$, $K$, $V$
Initialize: hyperparameter $c$, $u = c \ln m$, $U = m \ln n$
1. Choose $U$ dot-product pairs randomly from $K$ as $\bar{K}$
2. Compute the sample score $\bar{S} = Q\bar{K}^{\top}$
3. Compute the measurement $M = \max(\bar{S}) - \mathrm{mean}(\bar{S})$ row by row
4. Choose the first $u$ queries ranked by $M$ as $\bar{Q}$
5. $S_1 = \mathrm{Softmax}(\bar{Q}K^{\top}/\sqrt{d}) \cdot V$
6. $S_0 = \mathrm{mean}(V)$
7. Compose $S = \{S_1, S_0\}$ according to the original row order
Output: feature map $S$
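For concreteness, here is a single-head, batch-free Python sketch of Algorithm 1 following the Informer formulation cited as [17]; the fallback of the non-selected queries to mean(V) corresponds to steps 6–7, and the function and variable names are illustrative.

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    # Q: (m, d) queries; K, V: (n, d) keys and values (single head, no batch).
    m, d = Q.shape
    n = K.shape[0]
    u = max(1, min(m, int(c * math.log(m))))  # number of dominant queries
    U = max(1, min(n, int(m * math.log(n))))  # number of sampled keys

    # Steps 1-2: score each query against a random key sample
    K_sample = K[torch.randperm(n)[:U]]
    S_bar = Q @ K_sample.T                              # (m, U)
    # Step 3: sparsity measurement, max minus mean per query row
    M = S_bar.max(dim=-1).values - S_bar.mean(dim=-1)
    # Step 4: keep only the top-u queries
    top = M.topk(u).indices
    # Step 5: exact attention for the dominant queries only
    S1 = torch.softmax(Q[top] @ K.T / math.sqrt(d), dim=-1) @ V
    # Steps 6-7: the remaining queries fall back to mean(V)
    out = V.mean(dim=0, keepdim=True).repeat(m, 1)
    out[top] = S1
    return out

# usage: out = probsparse_attention(torch.randn(96, 64), torch.randn(96, 64), torch.randn(96, 64))
```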
3.4. VMD-BiTCN-Psformer Wind Power Prediction Model
The short-term wind power prediction process based on the VMD-BiTCN-Psformer is shown in Figure 6.
Step 1: The VMD technique is employed to decompose the wind power data into several sub-series that exhibit generally stable behavior at different frequencies. This process effectively decreases the complexity and nonlinearity of the sequence (a decomposition sketch is given after Step 5).
Step 2: Each sub-mode of the decomposed original wind farm data is mapped to form a time series matrix; the time features are projected to the model dimension through a one-dimensional convolutional layer and then spliced with the improved positional encoding matrix and the time series matrix.
Step 3: BiTCN is employed to extract the characteristics of the spliced matrices and capture the long-term dependencies within them.
Step 4: The output matrix of BiTCN is transformed into the query and key vectors by the weight matrices $W_Q$ and $W_K$, respectively, while the concatenation matrix is transformed into the value vector by the weight matrix $W_V$. The matrices $Q$, $K$, and $V$ are passed through the attention mechanism function and further processed to obtain the intermediate feature matrices used in the decoding layer.
Step 5: The output matrix of BiTCN, together with the masked concatenation matrix, enters the decoding layer of the Psformer to obtain the query vector. The key, value, and query vectors then pass through the next sub-layer of the Psformer decoder. The wind power prediction for each sub-mode is derived through the fully connected layer, and these predictions are aggregated to obtain the final wind power prediction.
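As referenced in Step 1, the following is a minimal sketch of the decomposition, assuming the open-source vmdpy implementation of VMD; the file name and parameter values (K, alpha, tau) are illustrative, not the paper’s settings.

```python
import numpy as np
from vmdpy import VMD  # pip install vmdpy; one common VMD implementation

# wind_power: 1-D array of the raw wind power series (hypothetical file name)
wind_power = np.loadtxt("wind_power.csv")

K = 8          # number of modes; the paper fixes this value (8 is illustrative)
alpha = 2000   # bandwidth constraint
tau = 0        # noise tolerance (0 = exact reconstruction)

# u: (K, len) array of sub-series; omega: center frequencies per iteration
u, u_hat, omega = VMD(wind_power, alpha, tau, K, DC=0, init=1, tol=1e-7)
# each row u[k] is a comparatively stable sub-series that is predicted separately
```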
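Pulling the steps together, the sketch below mirrors the data flow of Steps 2–5 for a single sub-mode, reusing the earlier hypothetical sketches (`SpliceEmbedding`, `BiTCN`, `BiTCNAttentionInput`, `probsparse_attention`); the full Psformer encoder–decoder is replaced by the single-head attention stub, so this only illustrates how the pieces connect.

```python
import torch
import torch.nn as nn

B, L, F, D = 4, 96, 1, 64                 # batch, length, features, model dim
sub_mode = torch.randn(B, L, F)           # one decomposed sub-series
marks = torch.rand(B, L, 4) - 0.5         # normalized time/hour/day/month marks

embed = SpliceEmbedding(F, D)             # Step 2: splicing
bitcn = BiTCN(D)                          # Step 3: time-segment features
fuse = nn.Linear(2 * D, D)                # fold the bidirectional channels back
proj = BiTCNAttentionInput(D)             # Step 4: Q/K from BiTCN, V from c
head = nn.Linear(D, 1)                    # Step 5 (simplified): prediction head

c = embed(sub_mode, marks)                                  # (B, L, D)
h = fuse(bitcn(c.transpose(1, 2)).transpose(1, 2))          # (B, L, D)
Q, K, V = proj(h, c)
attn = torch.stack([probsparse_attention(Q[b], K[b], V[b]) for b in range(B)])
sub_pred = head(attn)                                       # per-sub-mode forecast
# Step 5: repeat for every sub-mode and sum the forecasts for the final output
```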
5. Conclusions
We propose a VMD-BiTCN-Psformer model based on temporal and positional encoding, which aims to improve the accuracy of wind power forecasting.
Using the measured wind power data and the meteorological data from a wind farm in Fujian, the proposed VMD-BiTCN-Psformer prediction model was studied, with the following conclusions:
- (1)
The VMD-BiTCN-Psformer combination model is superior to simpler combination models, and the applications of VMD, BiTCN, and the Transformer all contribute to improving prediction accuracy.
- (2)
Table 4 shows that the MAE, RMSE, RRMSE, and R² of the wind power prediction results of the proposed model are better than those of the other models. The designed model’s predictions are closest to the actual values, so it has the most accurate prediction performance.
- (3)
We compared the models before and after improving the positional encoding and found that prediction accuracy increased with the improved encoding, demonstrating its effectiveness.
Our proposed model still has some limitations, which we will address in future work. In the VMD process, we fixed the number of decomposed modes; it would be more effective to optimize the VMD parameters with a swarm intelligence algorithm or to determine the number of sub-modes according to entropy. In addition, for power data close to 0, the predicted values of the network model tend to be much higher than the true values, an issue we need to address in our next study.