1. Introduction
Renewable energy, especially wind energy, has become key to alleviating the energy problem. The installed capacity of wind power is increasing year by year, and most wind farms are integrated into grids in the form of large-scale clusters. Because wind is fluctuating and intermittent, wind power not only provides clean energy but also brings severe challenges to the safe and stable operation of power systems. Accurate prediction of wind speed and wind power is therefore a fundamental requirement for the grid connection of wind power [1].
There is a tremendous amount of research on ultra-short-term wind power prediction, which can be divided into two types of methods: physical methods and statistical learning methods. The physical method models wind behavior according to the equations of atmospheric motion and can simulate the nonlinear characteristics of the wind process. However, the parameters of the physical model are not easy to obtain, and the model incurs expensive computational costs. It is therefore not directly suitable for ultra-short-term wind power prediction.
The statistical learning methods can be roughly divided into four categories. The first kind is the classical multivariate time series prediction method, which has a relatively solid statistical foundation. In essence, it extends the traditional Autoregressive Integrated Moving Average (ARIMA) model to multivariate time series and adopts the lasso method to select the key characteristic variables [2,3]. Its advantage is that the mathematical meaning is clear and the model parameters are convenient to adjust, so it is suitable for online application scenarios and is often used in practice. Its disadvantage is that it is a linear model and does not consider the intrinsic relationships among variables, which often leads to larger errors than other methods. The second kind adopts the idea of the probabilistic graphical model and models the wind process as a Gaussian process [4,5]. It uses the Gaussian function as the kernel function and can model nonlinear data. It can also provide the confidence interval of the model and output the probability distribution directly. However, the Gaussian process is a non-parametric model, and every inference has to use all samples in a matrix inversion, which becomes intractable when the data volume is large. It is more precise than the first kind of method, but it still struggles to capture the complex spatiotemporal relationships of the wind process. The third kind is the AI method, which includes both traditional machine learning and modern deep learning. Machine learning methods include the Back Propagation (BP) neural network [6], the Decision Tree (DT) and its advanced derivative XGBoost [7], and the Extreme Learning Machine (ELM) [8]. Deep learning is known for its ability to extract abstract features, and different researchers use different models to predict energy-related time series, including Long Short-Term Memory (LSTM) networks [9], the Long- and Short-term Time-series network (LSTNet) [10] and so on. The advantage is that, when the structure of the model is well designed, its accuracy is higher than that of the two kinds of methods above. The disadvantage is that the training time is relatively long and a Graphics Processing Unit (GPU) is needed, so these methods suit offline training followed by online application. The fourth kind is the hybrid method. It often combines different machine learning or deep learning methods, and there are many varieties [11]. Some methods also decompose the time series into several more predictable components by empirical mode decomposition or variational mode decomposition, and then establish a prediction model for each decomposed subsequence to increase the accuracy [12,13]. The hybrid method integrates the advantages of different methods, and its structure can be adjusted according to the practical engineering scenario.
Although the methods vary in form, the performance of a prediction method is partly decided by the data used. From the perspective of input data, in addition to methods that use the data of a single wind farm directly, there are also methods that use data from multiple wind farms. Relevant research and field observations show that there is obvious correlation among wind farms [14,15]. Cavalcante [16] proposed the Least Absolute Shrinkage and Selection Operator-Vector Autoregression (LASSO-VAR) model, which can take into consideration the historical data of all the wind farms in the region. However, it is still a linear regression model. Deep learning models such as the classic convolutional neural network (CNN) [17] and the stacked denoising auto-encoder (SDAE) [18] have been introduced for the prediction of multiple wind farms. They can effectively model the time-varying and nonlinear effects among closely related wind farms, but they do not consider the global geographical relation of wind farms in the region when dealing with the complex spatial and temporal features.
In most cases, we need to predict not only the power of each wind farm but also the regional wind power. The additive method, the extrapolation method and the statistical scale-up method are commonly used [19]. The additive (superposition) method predicts the power of all wind farms in the cluster and simply sums the results. The extrapolation method compares numerical weather prediction (NWP) data with historical meteorological data to find similar scenarios in a historical database. The statistical scale-up method obtains the regional wind power output by scaling up the prediction results of the reference wind farm. In addition, there are some downscaling methods for the spatial–temporal correlation analysis of wind power and wind speed [20,21,22]. The downscaling method derives higher-resolution wind speed or wind power from lower-resolution NWP or prediction results. The downscaled NWP wind speed can provide more precise information for wind power prediction [23].
In fact, building a comprehensive wind farm prediction model poses difficulties on two levels. The first is how to use the complex spatial–temporal relationship among the historical wind power data and NWP data of different wind farms effectively in order to increase the prediction accuracy. The second is how to obtain the output of every single wind farm and of the whole region efficiently, especially when the number of wind farms is large.
To address these two difficulties, we propose a hybrid prediction framework based on deep learning for regional wind power prediction, which we call the Multi-modal Multi-task Graph Spatiotemporal NETwork (M2GSNet). The main contributions are as follows:
(1) We design a spatiotemporal graph convolutional network that extracts the spatiotemporal features of the historical wind power and NWP data of wind farms in a given region. To the best of our knowledge, we are the first to employ a spectral graph neural network for ultra-short-term wind farm cluster power prediction. Compared with previous wind power prediction methods, it takes the global geographical location into consideration and makes better use of the historical wind power and NWP information of wind farms in a region. It reduces the normalized root mean square error (RMSE) in the fourth hour by 1.75%.
(2) We also design multi-task learning for the wind power prediction of all the wind farms. This enhances the learning efficiency by combining similar learning tasks and sharing the weights of some neural network layers. The power of every single wind farm and of the whole region can be predicted efficiently in one model. The time consumed to forecast 20 wind farms is only 4.1 times that for one wind farm. There is also great potential to extend the method to a region that contains hundreds of wind farms.
The rest of this paper is organized as follows. Section 2 analyzes the availability of NWP and formulates the problem of wind power prediction on the graph. Section 3 provides temporal and spatial dependency modeling based on graph convolution, through which the features of the historical wind power data and NWP data of different wind farms can be extracted. Based on these features, Section 4 proposes a multi-modal multi-task graph convolutional network for wind power prediction. The experimental results are reported in Section 5, and the conclusions are drawn in Section 6.
5. Case Study
The proposed method is also tested on the measurement data of a wind farm cluster in Northeast China. The model runs on a Linux server cluster (CPU: Intel Xeon(R) E5-2650 v4 @ 2.10 GHz; GPU: NVIDIA Tesla P100) using the deep learning framework PyTorch (1.4.0), with GPU acceleration to speed up the training process.
5.1. Data Set and Test Description
For the test system, only the historical wind power data and the NWP data are provided; wind speed data are not included. The historical wind power comes from field measurements and the NWP comes from the meteorology station. The NWP wind speeds at heights of 170, 100, 30 and 10 m are used for the analysis. According to the analysis in Section 2.1, the cubic NWP wind speed reflects the tendency of wind power more effectively, so we use the cubic NWP wind speed rather than the NWP wind speed itself as the input of the model. The locations of the wind farms in the cluster are shown in Figure 6.
The red points in the figure are the wind farms, and the two numbers in each array are the wind farm number and the capacity, respectively. The total capacity of the wind farm cluster is 2854.31 MW. Wind farms with strong correlation are linked together according to the adjacency matrix computed by Equation (6). Data from 2019 are used for model training and testing. The training set includes 8000 samples (from 2019-01-01 08:15:00 A.M. to 2019-03-25 04:00:00 P.M.) and the testing set includes 2000 samples (from 2019-03-25 04:15:00 P.M. to 2019-04-15 12:00:00 P.M.). The training samples are randomly shuffled to avoid overfitting of the model. The measurement and prediction interval of the data is 15 min. Because the wind power and the cubic wind speed have different units, normalization is applied as follows.
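As a minimal sketch of such a normalization, the snippet below assumes min-max scaling to [0, 1]; this is only one plausible choice and the exact formula may differ. The function names and the sample values are hypothetical.

```python
def cubic(wind_speeds):
    """Cube each NWP wind speed, since wind power scales roughly with v^3."""
    return [v ** 3 for v in wind_speeds]


def min_max_normalize(values):
    """Scale values to [0, 1] (assumed scheme); also return (min, max) for inversion."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values], (lo, hi)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)


# Hypothetical sample values, not taken from the data set.
power_mw = [12.0, 48.0, 30.0, 0.0]
nwp_speed_ms = [3.0, 6.0, 5.0, 2.0]

power_norm, power_range = min_max_normalize(power_mw)
cubic_norm, cubic_range = min_max_normalize(cubic(nwp_speed_ms))
```

In practice the minimum and maximum would normally be taken from the training set only and reused for the testing set, so that no test information leaks into training.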
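The root mean square error (RMSE) and mean absolute error (MAE) are selected as the evaluation metrics to assess the performance of the model on the testing set:

$$\mathrm{RMSE}_t = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i,t}-\hat{y}_{i,t}\right)^{2}}, \qquad \mathrm{MAE}_t = \frac{1}{N}\sum_{i=1}^{N}\left|y_{i,t}-\hat{y}_{i,t}\right|$$

where $y_{i,t}$ and $\hat{y}_{i,t}$ are the normalized true value and normalized predicted value in prediction scenario $i$ at prediction time step $t$, and $N$ is the number of scenarios in the test set. To represent the prediction error of scenarios, we design another index: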
In each scenario, there are 16 time steps and the maximum value of the time step is 16. The index is used to assess the similarity of a given scenario and it reflects the average deviation of true value and predicted value in each time step.
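The metrics described above can be sketched in plain Python. The per-step RMSE and MAE follow their standard definitions. The scenario index is written here as the mean absolute deviation over the time steps of one scenario, which is an assumed reading of the description rather than the confirmed form of Equation (19). Function names and sample values are hypothetical, and the example uses 2 time steps instead of 16 to stay short.

```python
import math


def rmse_per_step(y_true, y_pred):
    """RMSE at each prediction time step, averaged over all scenarios."""
    n, steps = len(y_true), len(y_true[0])
    return [
        math.sqrt(sum((y_true[i][t] - y_pred[i][t]) ** 2 for i in range(n)) / n)
        for t in range(steps)
    ]


def mae_per_step(y_true, y_pred):
    """MAE at each prediction time step, averaged over all scenarios."""
    n, steps = len(y_true), len(y_true[0])
    return [
        sum(abs(y_true[i][t] - y_pred[i][t]) for i in range(n)) / n
        for t in range(steps)
    ]


def scenario_error(y_true_i, y_pred_i):
    """Assumed scenario index: mean absolute deviation over the steps of one scenario."""
    return sum(abs(a - b) for a, b in zip(y_true_i, y_pred_i)) / len(y_true_i)


# Hypothetical normalized values: 2 scenarios x 2 time steps.
y_true = [[0.5, 0.6], [0.3, 0.2]]
y_pred = [[0.4, 0.6], [0.3, 0.4]]

rmse = rmse_per_step(y_true, y_pred)
mae = mae_per_step(y_true, y_pred)
scen = [scenario_error(t, p) for t, p in zip(y_true, y_pred)]
```

RMSE and MAE aggregate over scenarios at a fixed step, while the scenario index aggregates over steps within a fixed scenario, so the two views answer different questions about the same predictions.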
The sensitivity of some hyper-parameters, such as the learning rate and the hidden state number, is taken into account because they are very important for the training process [33]. However, it is impossible to do a grid search over the whole parameter space, so the hyper-parameters are determined by a grid search combined with human experience. The learning rate is chosen from the set (0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4). The hidden state numbers of the graph convolutional parts for historical wind power and for wind speed are both chosen from the set (10, 20, 30, 40, 50, 60, 70, 80, 90, 100). The input lengths of the historical wind power and the cubic NWP wind speed are chosen from the set (20, 30, 40, 50, 60). We determine the optimal values according to the prediction error in the fourth hour: the parameter combination with the lowest prediction error is selected. After adjusting the structure and parameters of the model, the parameters of the final model are as follows. In the M2GSNet model, the input characteristic matrix for wind power holds the power measurement information of each node on the graph. It includes 40 time steps, so 10 h of power data from 20 wind farms are required to predict the power of the next 16 time steps (4 h). The input characteristic matrix for NWP holds the NWP data of each node. It includes 20 time steps, so 5 h of future wind speed from 20 wind farms are required to predict the output. The NWP variables are the wind speeds at four different heights: 10, 30, 100 and 170 m. Data of 20 wind farms in Jilin Province are used for training and prediction. The wind power input is therefore a (20 × 40) matrix and the NWP input is a (20 × 80) matrix. The hidden state number is 60 for the wind power GCN module and 40 for the NWP GCN module. The adjacency matrix in the graph convolutional network is calculated from the distance between wind farms. The dimensions of the variables are labeled in Figure 5. The prediction error is calculated as the RMSE after normalization, as described above.
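The input dimensions stated above can be checked with a short sketch. It assumes that the NWP features of one wind farm are flattened step by step, with the four heights kept adjacent; the flattening order, the function name and the dummy values are assumptions made for illustration only.

```python
N_FARMS = 20                   # nodes on the graph
POWER_STEPS = 40               # historical power steps (10 h at 15 min)
NWP_STEPS = 20                 # future NWP steps (5 h at 15 min)
HEIGHTS_M = (10, 30, 100, 170)
HORIZON_STEPS = 16             # prediction horizon (4 h at 15 min)
STEP_MIN = 15


def build_nwp_features(nwp):
    """Cube each wind speed and flatten nwp[farm][step][height] into one row per farm."""
    return [[speed ** 3 for step in farm for speed in step] for farm in nwp]


# Dummy data: wind speeds 1, 2, 3, 4 m/s at the four heights, for every step and farm.
nwp = [[[float(h + 1) for h in range(len(HEIGHTS_M))]
        for _ in range(NWP_STEPS)]
       for _ in range(N_FARMS)]
power = [[0.0] * POWER_STEPS for _ in range(N_FARMS)]

x_nwp = build_nwp_features(nwp)
power_shape = (len(power), len(power[0]))   # (20, 40)
nwp_shape = (len(x_nwp), len(x_nwp[0]))     # (20, 80)
```

The 80 columns of the NWP input follow directly from 20 time steps times 4 heights, and the 40 power steps at 15 min resolution correspond to the 10 h of history mentioned above.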
The model is trained for 200 epochs with a batch size of 256. The optimizer is Adadelta [34] and the learning rate is 0.1. Five-fold cross-validation is used for verification.
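To see why an exhaustive search over the candidate sets listed above is impractical, the sketch below enumerates the full grid, assuming the two hidden state numbers and the two input lengths are chosen independently. The selection rule is the one described in the text (lowest fourth-hour error); `toy_error` is an invented stand-in whose minimum is placed at the final configuration reported above purely so that the example has a known answer. It is not the measured error surface.

```python
from itertools import product

learning_rates = [0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]
hidden_sizes = list(range(10, 101, 10))   # 10, 20, ..., 100
input_lengths = [20, 30, 40, 50, 60]

# (learning rate, power hidden size, NWP hidden size, power length, NWP length)
full_grid = list(product(learning_rates, hidden_sizes, hidden_sizes,
                         input_lengths, input_lengths))


def select_best(candidates, fourth_hour_error):
    """Return the candidate with the lowest fourth-hour prediction error."""
    return min(candidates, key=fourth_hour_error)


def toy_error(cfg):
    """Hypothetical error function; a real one would train and evaluate the model."""
    lr, h_power, h_nwp, len_power, len_nwp = cfg
    return abs(lr - 0.1) + (abs(h_power - 60) + abs(h_nwp - 40)
                            + abs(len_power - 40) + abs(len_nwp - 20)) / 100


grid_size = len(full_grid)
best = select_best(full_grid, toy_error)
```

With 22,500 combinations, each requiring a full training run, restricting the search to a hand-picked subset is the pragmatic choice the text describes.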
5.2. Baseline Model
M2GSNet is our proposed model, and it has three features. First, it utilizes the cubic NWP wind speed through multi-modal learning. Second, it adopts a spatiotemporal model to extract geographical information. Third, it uses multi-task learning to predict the power of each wind farm. To illustrate the accuracy improvement contributed by each feature, we design baseline models and several GCN model variants for an ablation study.
(1) MLP [6]: This is the multilayer perceptron model for regression, and the hidden state number is 800. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps. The sum of the wind power of all wind farms is the regional wind power.
(2) LSTM [9]: This includes two LSTM layers, and the dropout rate is 0.25. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps.
(3) ELM [35]: This uses the classic ELM model parameters. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps. The sum of the wind power of all wind farms is the regional wind power.
(4) LSTNet [10]: This uses the standard LSTNet structure and parameters and can take the spatiotemporal relationship of wind farms into consideration. However, it cannot make use of the geographical information of the wind farms when extracting their spatial features. The input only includes the historical wind power of each wind farm. The output is the wind power of each wind farm over 16 time steps.
(5) LSTNet_NWP: This uses the same structure and parameters as LSTNet. However, it also uses the cubic NWP wind speed as input and concatenates the spatial–temporal features of historical wind power and NWP. The output is the wind power of each wind farm over 16 time steps.
We also compare different M2GSNet variants for the ablation study; the characteristics of each model are shown in Table 1.
Here, M2GSNet denotes the model that uses the cubic NWP wind speed. w/o CW denotes the GCN model that uses only the raw NWP data without the cubic NWP wind speed. w/o AD denotes the GCN model that uses the cubic NWP wind speed but defines the graph by the wind speed time series correlation. w/o W denotes the GCN model that does not use the NWP. w/o MT1 denotes the GCN model that does not use multi-task learning and predicts the regional wind power directly. w/o MT2 denotes the GCN model that does not use multi-task learning and predicts the wind power of each wind farm separately: the same model structure is used, but a model is trained individually for each wind farm. The training hyper-parameters are the same as described above, and only the model structure differs.
5.3. The Main Prediction Results
5.3.1. The Prediction Results for Regional Wind Power
The prediction results of several structures of the M2GSNet are listed in Table 2. From Table 2, it is clear that M2GSNet performs best. In addition, methods that take the NWP into consideration are better than those that do not include NWP data. The prediction error of M2GSNet is smaller than that of LSTM by over 2 percent in the fourth hour. This means that it can reduce the prediction error of the whole cluster by more than 50 MW, which is important progress for the operation center of the power grid.
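The 50 MW figure can be checked with one line of arithmetic, assuming the normalized error is expressed relative to the total cluster capacity of 2854.31 MW (the normalization base is an assumption here). The variable names are hypothetical.

```python
CLUSTER_CAPACITY_MW = 2854.31   # total capacity of the wind farm cluster
rmse_gap = 0.02                 # "over 2 percent" lower normalized error than LSTM

# Lower bound on the error reduction in physical units.
error_reduction_mw = rmse_gap * CLUSTER_CAPACITY_MW
```

Under this assumption a 2 percent gap corresponds to roughly 57 MW, which is consistent with the statement that the reduction exceeds 50 MW.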
LSTNet is a deep learning method that takes the spatiotemporal relationship of wind farms in the cluster into consideration. It is an improved version of the spatiotemporal prediction model in [17]. From Table 2, we can see that LSTNet is indeed better than MLP, LSTM and ELM, which do not consider the spatiotemporal relationship. However, M2GSNet is better than LSTNet owing to its ability to extract geographical location features.
Even within M2GSNet, using multi-task learning to predict the wind power of each wind farm gives better results than predicting the regional wind power directly (w/o MT1) or predicting the wind power of each wind farm separately and summing the results (w/o MT2), which demonstrates the effectiveness of multi-task learning. In addition, using the cubic NWP wind speed in the multi-modal learning gives better results than using the NWP wind speed directly.
5.3.2. The Prediction Results of Each Wind Farm
M2GSNet is not only convenient for predicting the regional wind power; it can also output the detailed power of each wind farm from one training session. To verify the effectiveness of M2GSNet for single wind farm power prediction, the RMSEs of every wind farm over the 16 time steps are calculated. The RMSE data of the 20 wind farms in 1 h are used for the boxplot analysis. The results of MLP, LSTM, ELM, LSTNet and M2GSNet are displayed in Figure 7.
From Figure 7, we can see that the prediction errors of some wind farms are much higher than those of others. However, owing to the “smoothing effect” of the wind farm cluster, the prediction error of the cluster is much smaller than that of the individual wind farms. This result is very meaningful for the power grid dispatching center. In addition, according to the mean, maximum and minimum values, M2GSNet performs much better than the other methods in the statistical sense, especially in the 3rd and 4th hours. However, owing to the NWP feature fusion, the prediction error of M2GSNet in the 1st hour is a little higher than that of the other methods that do not consider the NWP. This suggests designing a mechanism to select models dynamically. For example, for ultra-short-term prediction within 1 h, we can choose a model with lower RMSE.
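One simple form of such a dynamic selection mechanism is to pick, for each prediction horizon, the model with the lowest validation RMSE at that horizon. The sketch below illustrates the idea only; the function name is hypothetical and the RMSE values are invented placeholders, not results from this study.

```python
def select_model_per_step(rmse_by_model):
    """For each prediction step, return the name of the model with the lowest RMSE."""
    steps = len(next(iter(rmse_by_model.values())))
    return [
        min(rmse_by_model, key=lambda name: rmse_by_model[name][t])
        for t in range(steps)
    ]


# Placeholder validation RMSEs at the 1st to 4th hour (invented for illustration).
rmse_by_model = {
    "LSTM": [0.05, 0.09, 0.13, 0.16],
    "M2GSNet": [0.06, 0.08, 0.11, 0.13],
}

chosen = select_model_per_step(rmse_by_model)
```

The selection should be based on a validation set rather than the test set, otherwise the reported test error of the combined predictor would be optimistic.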
5.3.3. Ablation Study
(1) The Comparison of Different Concatenation Methods
The feature fusion of historical wind power and NWP is very important for wind power prediction, and there are three commonly used feature fusion methods. The prediction results of the three methods are shown in Figure 8.
From Figure 8, it is clear that the bilinear and concatenate feature fusion methods are better than the nonlinear Tanh method. The prediction error of the concatenate method is slightly lower than that of the bilinear method, especially in the interval of 0.5–3.5 h. Considering that the bilinear method is more complex and has lower training efficiency, the concatenate method is chosen as the feature fusion method in our network.
(2) The Comparison of Training Time Consumption under Different Wind Farm Numbers
The training time of M2GSNet is crucial because it determines whether the model can be used in a large-scale renewable energy cluster that includes hundreds, or even thousands, of small wind farms. Therefore, we compare the training time of M2GSNet under different wind farm numbers. The results are shown in Figure 9.
According to the results, the training time for one wind farm is 66 min. If each wind farm uses its own model, the total training time is therefore more than 1200 min in this case. However, when multi-task learning is used, the training time falls to 271 min, which saves a lot of training time and resources. Notably, as the number of wind farms increases, the growth rate of the training time decreases. Therefore, when more wind farms are considered, the advantages of multi-task learning will be more remarkable.
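The figures quoted above can be combined in a short calculation, assuming the 20 separate models would be trained one after another on the same hardware. The variable names are hypothetical.

```python
SINGLE_FARM_MIN = 66    # training time for one wind farm
N_FARMS = 20
MULTI_TASK_MIN = 271    # training time with multi-task learning

separate_total_min = N_FARMS * SINGLE_FARM_MIN   # one model per wind farm
ratio = MULTI_TASK_MIN / SINGLE_FARM_MIN         # cost relative to one wind farm
saving = 1 - MULTI_TASK_MIN / separate_total_min # fraction of time saved
```

Training 20 separate models would take 1320 min, so the multi-task model needs about 4.1 times the single-farm time and saves close to 80% of the total.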
5.3.4. The Remarkable Error Analysis in Test Set
The prediction error of M2GSNet is analyzed in Figure 10. In this figure, M2GSNet is compared with MLP and LSTM, since they are the most commonly used machine learning and deep learning methods. The prediction results are visualized.
The left panels of the figure show the 4th-hour prediction results on the test set for the three methods, with the predicted values compared with the true values. The right panels visualize the locations of scenarios with larger prediction errors. Prediction error analysis is also very important because it tells us which kinds of scenarios are difficult to predict, so that methods can be designed to deal with them in the future. For each time step, we use different colors and sizes to represent the prediction errors: the darker the color and the smaller the size, the smaller the prediction error. Since the scenario error index of most scenarios is smaller than 0.15, we classify the prediction errors into four categories. If the prediction error according to the index in Equation (19) is smaller than 0.05 p.u., it belongs to the first category, which includes the scenarios that are predicted rather accurately. If the prediction error is between 0.05 and 0.1, it belongs to the second category, and the color value of this category in the figure is 5. If the prediction error is between 0.1 and 0.15, it belongs to the third category, and the color value is 10. If the prediction error is higher than 0.15, it belongs to the fourth category, and the color value is 20. Different colors and sizes thus reflect the prediction results of the same scenario. The ratios of the different categories are given in Table 3.
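The four-category classification can be sketched as follows. The thresholds 0.05, 0.1 and 0.15 come from the text; assigning a value that falls exactly on a threshold to the higher category is an assumption, as are the function names and the sample errors.

```python
from collections import Counter

THRESHOLDS = (0.05, 0.10, 0.15)   # upper bounds of categories 1, 2 and 3 (p.u.)


def error_category(err):
    """Map a scenario error to category 1 (most accurate) through 4 (least accurate)."""
    for category, upper in enumerate(THRESHOLDS, start=1):
        if err < upper:
            return category
    return 4


def category_ratios(errors):
    """Fraction of scenarios that fall into each of the four categories."""
    counts = Counter(error_category(e) for e in errors)
    return {c: counts.get(c, 0) / len(errors) for c in (1, 2, 3, 4)}


# Hypothetical scenario errors, not taken from the test set.
sample_errors = [0.02, 0.07, 0.12, 0.20, 0.04]
categories = [error_category(e) for e in sample_errors]
ratios = category_ratios(sample_errors)
```

Applying `category_ratios` to the scenario errors of each model would give the kind of per-category proportions that are compared across methods.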
From Figure 10 and Table 3, it is clear that M2GSNet performs better, since it has fewer points belonging to the high prediction error categories. However, Figure 10 also shows that most light-colored, large points are located at the turning points of the wind fluctuation process, which means that these turning points are hard to predict and often lead to higher prediction errors.
6. Conclusions
In this paper, we propose a spatiotemporal deep learning network for ultra-short-term wind power prediction. From the case study, we can draw the following conclusions:
(1) Adding a numerical weather forecast by virtue of multi-modal learning, especially the third power of wind speed as auxiliary information, can improve the accuracy of forecasts.
(2) The spatiotemporal graph neural network can extract the spatial–temporal feature of the wind farms effectively and is helpful in improving the accuracy of predictions compared to the other methods.
(3) By using the multi-task learning method, prediction accuracy can be improved, and the training time can also be reduced compared to additive methods.
In the follow-up study, we can consider designing a comprehensive method which can classify the wind process in advance and define the dynamic graph according to the spatial–temporal relationship among wind farms to further increase the accuracy.