1. Introduction
Predicting the future has fascinated human beings since their earliest existence. Such efforts can be seen in everyday domains such as energy management [1], telecommunications [2], pollution [3], bioinformatics [4], earthquakes [5], and so forth. Accurate predictions are essential in economic activities, since large forecasting errors in certain areas may entail large monetary losses.
Given this situation, the successful analysis of temporal data has been a challenging task for many researchers during the last decades and, indeed, it is difficult to think of any scientific branch with no time-dependent variables.
A thorough review of the existing techniques devoted to forecasting time series is provided in this survey. Although the classical Box-Jenkins methodology is also described, this text focuses particularly on those methodologies that make use of data mining techniques. Moreover, a family of energy-related time series is examined because of the scientific relevance it has exhibited during the last decade: electricity price and demand time series. These series have been chosen because they present some peculiarities, such as nonconstant mean and variance, high volatility or the presence of outliers, that turn forecasting into a particularly difficult task.
Electric power markets have become competitive due to the deregulation carried out in recent years, which allows the participation of all buyers, producers, investors and traders. Thus, the price of electricity is determined on the basis of this buying/selling system. Consequently, electricity-producing companies need to develop methods for optimal bidding [6].
On the other hand, load forecasting or demand forecasting consists of forecasting the amount of electricity required for a particular period of time. Demand forecasting plays an important role for electricity suppliers because both excess and insufficient energy production may lead to large costs and a significant reduction in profits.
Some works have already reviewed electricity price time series forecasting techniques. For instance, [7] compiles an extensive review of artificial neural networks, but it barely covers other data mining techniques. Also, Weron [8] presented an excellent review, describing many different approaches for several markets. However, neither of them focuses on the whole data mining paradigm. Moreover, they do not provide mathematical foundations for all the methods they evaluated. Indeed, this is perhaps the most significant strength of this paper, since information on the underlying mathematics is provided, as well as an exhaustive description of the measures typically used to evaluate performance. In short, this survey aims to provide the reader with a general overview of current data mining techniques used in time series analysis and to highlight the capabilities these techniques currently exhibit. As a case study, their application to a real-world set of energy-related series is reported.
As will be shown in subsequent sections, the majority of the techniques have been applied to the Pennsylvania-New Jersey-Maryland (PJM) [9], New York (NYISO) [10] and Spanish (OMEL) [11] electricity markets. By contrast, both the Australian National Electricity Market (ANEM) [12] and Ontario [13] follow a single-settlement real-time structure, and few researchers have dealt with such markets. ANEM is also well known for its volatility and frequent outliers, which makes this market a perfect target for robust forecasting. Additionally, the Californian electricity market (CAISO) [14] has been widely analyzed because of the well-known problems that it experienced in the second half of 2000. Some other markets appear in this work, given the relevance of the model applied. Such are the cases of the UK, India, Malaysia, Finland, Turkey, Egypt, Nord Pool, Brazil, Jordan, China, Taiwan and Greece. Note that most of them provide public access to data.
The remainder of this work is structured as follows.
Section 2 provides a formal description of a time series and describes its main features.
Section 3 describes the statistical indicators and errors typically used in this field, together with the concepts of persistence model and forecasting skill.
Section 4 describes the approaches based on linear methods. Classical Box-Jenkins methods such as AR, MA, ARMA, ARIMA, ARCH, GARCH and VAR are reviewed there. Note that, from that section on, every section consists of a brief mathematical description of the technique analyzed and a review of the most representative works.
Section 5 is a compendium of the non-linear forecasting techniques currently in use in the data mining domain. In particular, these methods are divided into global (neural networks, support vector machines, genetic programming) and local (nearest neighbors) methods.
Section 6 analyzes rule-based forecasting methods, providing a brief explanation of what a decision rule is and revisiting the latest and most relevant works in this domain.
Section 7 details the use of wavelets as a relevant method for hybridization and discusses the most relevant improvements achieved by means of these techniques.
Section 8 describes a compilation of works that cannot be classified in any of the aforementioned groups: forecasting approaches based on Markov processes, on Grey models, on Pattern-Sequence similarity or on manifold dimensionality reduction.
Section 9 is devoted to ensemble models, given the large number of them currently in use.
Finally, Section 10 summarizes the conclusions drawn from the exploration of all existing techniques.
2. Time Series Description
This section describes the features of temporal data and provides a mathematical description of this kind of data. A time series can be understood as a sequence of values observed over time and chronologically ordered. Time is a continuous variable; in practice, however, samples are recorded at constant intervals. When time is treated as a continuous variable, the discipline is commonly referred to as functional data analysis [15]. The description of this category is outside the scope of this survey.
Let $Y = \{y_1, \ldots, y_T\}$ be the historical data of a given time series. This series is thus formed by $T$ samples, where each $y_i$ represents the recorded value of the variable $y$ at time $i$. Therefore, the forecasting process consists of estimating the value of $y_{T+1}$ ($\hat{y}_{T+1}$), and the goal is to minimize the error, which is typically represented as a function of $y_{T+1} - \hat{y}_{T+1}$. This estimation can be extended to a prediction horizon greater than one, that is, when the objective is to predict a sample at time $T+h$ ($\hat{y}_{T+h}$). In this situation, the best prediction is reached when a function of $y_{T+h} - \hat{y}_{T+h}$ is minimized.
Time series can be represented graphically. In particular, the x-axis identifies the time ($t = 1, \ldots, T$), whereas the y-axis shows the values recorded at specific time stamps ($y_t$). This representation allows the visual detection of the most salient features of a series, such as the amplitude of oscillations, existing seasons and cycles, or the presence of anomalous data or outliers. Figure 1 illustrates, as an example, the price evolution for a particular period of 2006 in the Spanish electricity market.
Figure 1.
Time series example.
A usual strategy to analyze time series is to decompose them into three main components [16,17]: trend, seasonality and irregular components, also known as residuals.
Trend. It is the general movement that the variable exhibits during the observation period, without considering seasonality and irregular components. Some authors prefer to refer to the trend as the long-term movement that a time series shows. Trends can present different profiles, such as linear, exponential or parabolic.
Seasonality. This component typically represents periodic fluctuations of the variable under analysis. It consists of effects that are reasonably stable in timing, magnitude and direction. It can arise from several factors, such as weather conditions, economic cycles or holidays.
Residuals. Once the trend and cyclic oscillations have been calculated and removed, some residual values remain. These values can sometimes be large enough to mask the trend and the seasonality. In this case, the term outlier is used to refer to these residuals, and robust statistics are usually applied to cope with them [18]. These fluctuations can be of diverse origin, which makes their prediction almost impossible. However, if their origin can be detected or modeled, they can be thought of as precursors of trend changes.
Figure 2 depicts how a time series can be decomposed into the components described above.
Figure 2.
Time series main components decomposition.
Obviously, real-world time series present a meaningful irregular component, which makes their prediction an especially hard task. Some forecasting techniques focus on detecting trend and seasonality (especially traditional classical methods); however, residuals are the most challenging component to predict. The effectiveness of one technique or another is assessed according to its capability of forecasting this particular component. It is in the analysis of this component that data mining-based techniques have been shown to be particularly powerful, as this survey will attempt to show in the next sections.
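As an illustration of the decomposition into trend, seasonality and residuals, the following Python sketch separates a series under an additive model. It is not taken from any of the surveyed works: the function name `decompose_additive`, the centered moving-average estimate of the trend and the assumption that the seasonal period is known in advance are all choices made for this example.

```python
import numpy as np

def decompose_additive(y, period):
    """Split y into trend + seasonal + residual (additive model, known period)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # A centered moving average spanning one full period estimates the trend.
    # For an even period the two end points get half weight (a 2 x period average).
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # the trend is undefined at both ends of the series
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    # Seasonal effect: mean detrended value at each position within the period.
    detrended = y - trend
    phase_means = np.array(
        [np.nanmean(detrended[k::period]) for k in range(period)]
    )
    phase_means -= phase_means.mean()  # force the seasonal effects to sum to zero
    seasonal = np.tile(phase_means, n // period + 1)[:n]
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Synthetic example (hypothetical data): linear trend plus a period-4 pattern.
t = np.arange(24)
y = 0.5 * t + np.tile([1.0, -1.0, 2.0, -2.0], 6)
trend, seasonal, residual = decompose_additive(y, period=4)
```

On this noise-free example the residual is zero wherever the trend is defined. With real electricity data the residual would carry the irregular component discussed above.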
3. Accuracy Measures
The purpose of error measures is to obtain a clear and robust summary of the error distribution. It is common practice to calculate error measures by first calculating a loss function (usually eliminating the sign of the individual errors) and then computing an average. In the following, let $y_t$ be the observed value at time $t$, also called the reference value, and let $\hat{y}_t$ be the forecast for $y_t$. The error $e_t$ is then computed by $e_t = y_t - \hat{y}_t$. Hyndman and Koehler [19] give a detailed review of different accuracy measures used in forecasting and classify the measures into the groups detailed in the following subsections.
3.1. Scale-Dependent Measures
There are some commonly used accuracy measures whose scale depends on the scale of the data. These are useful when comparing different methods on the same set of data, but should not be used, for example, when comparing across data sets that have different scales.
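The most commonly used scale-dependent measures are based on the absolute error $|e_t|$ or the squared error $e_t^2$. These errors are averaged by arithmetic mean or median, leading to the mean absolute error (MAE, Equation (1)), the median absolute error (MDAE, Equation (2)), the mean squared error (MSE, Equation (3)) or the root mean squared error (RMSE, Equation (4)).
$$\mathrm{MAE} = \operatorname{mean}(|e_t|) \tag{1}$$
$$\mathrm{MDAE} = \operatorname{median}(|e_t|) \tag{2}$$
$$\mathrm{MSE} = \operatorname{mean}(e_t^2) \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \tag{4}$$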
When comparing forecast methods on a single data set, the MAE is popular as it is easy to understand and compute. While the MAE does not penalize extreme forecast errors, the MSE and RMSE emphasize the fact that the total forecast error is strongly affected by large individual errors, i.e., large errors are much more costly than small errors. Often, the RMSE is preferred to the MSE as it is on the same scale as the data. However, the MSE and RMSE are more sensitive to outliers than the MAE or MDAE.
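The four scale-dependent measures can be computed in a few lines. The sketch below is only illustrative: the function name `scale_dependent_errors` and the sample values are hypothetical, and the error is taken as observed minus forecast.

```python
import numpy as np

def scale_dependent_errors(y_true, y_pred):
    """Return MAE, MDAE, MSE and RMSE for paired observations and forecasts."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_e = np.abs(e)
    mse = np.mean(e ** 2)
    return {
        "MAE": float(np.mean(abs_e)),
        "MDAE": float(np.median(abs_e)),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
    }

# Hypothetical values: three small errors and one large error.
observed = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 13.0, 20.0]
scores = scale_dependent_errors(observed, forecast)
```

With these values the single error of 4 units raises the RMSE (about 2.12) well above the MDAE (1.0), which is the sensitivity to large errors described above.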
3.2. Percentage Errors
To address the scale dependency, the error can be divided by the reference value. Thus, the percentage error (PE) is given by $p_t = 100\, e_t / y_t$. Percentage errors have the advantage of being scale-independent and, therefore, they are frequently used to compare forecast performance across different data sets. The most commonly used measure is the Mean Absolute Percentage Error (MAPE, Equation (5)).
$$\mathrm{MAPE} = \operatorname{mean}(|p_t|) \tag{5}$$
These measures have the disadvantage of being infinite or undefined if $y_t = 0$ for any $t$ in the period of interest, and of having an extremely skewed distribution when any $y_t$ is close to zero. Where the data involve small counts (which is common with intermittent demand data), it is impossible to use these measures, as zero values of $y_t$ occur frequently.
When the median is used for averaging, these problems are easier to deal with, as single infinite or undefined values do not necessarily result in an infinite or undefined measure. However, percentage errors also have the disadvantage that they put a heavier penalty on positive errors than on negative errors. This observation led to the use of the so-called symmetric measures sMAPE and sMdAPE, defined in Equations (6) and (7).
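The asymmetry of percentage errors can be checked numerically. The sketch below compares the MAPE with a symmetric variant; the function names are hypothetical, and the sMAPE form used here (200 times the absolute error divided by the sum of observation and forecast) is one common convention, assumed for illustration, that requires strictly positive values.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any observation is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(100.0 * (y_true - y_pred) / y_true)))

def smape(y_true, y_pred):
    """Symmetric MAPE (assumed convention): error scaled by observation + forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(200.0 * np.abs(y_true - y_pred) / (y_true + y_pred)))

# The same absolute miss of 50 units, with observation and forecast swapped.
mape_a = mape([100.0], [150.0])   # observation 100, forecast 150
mape_b = mape([150.0], [100.0])   # observation 150, forecast 100
smape_a = smape([100.0], [150.0])
smape_b = smape([150.0], [100.0])
```

The two MAPE values differ (50 versus about 33.3) although the absolute miss is identical, whereas the symmetric variant gives 40 in both cases.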
3.3. Relative Errors
An alternative way of scaling is to divide each error by the error obtained using another standard forecasting method as a benchmark. Let $r_t = e_t / e_t^*$ denote the relative error, where $e_t^*$ is the forecast error obtained from the benchmark method. Usually, the benchmark method is the random walk, where $\hat{y}_t$ is equal to the last observation. Then we can define the Mean Relative Absolute Error (MRAE, Equation (8)) and the Median Relative Absolute Error (MdRAE, Equation (9)).
$$\mathrm{MRAE} = \operatorname{mean}(|r_t|) \tag{8}$$
$$\mathrm{MdRAE} = \operatorname{median}(|r_t|) \tag{9}$$
A serious deficiency in relative error measures is that $e_t^*$ can be small. In fact, $r_t$ has infinite variance because $e_t^*$ has positive probability density at 0. One common special case is when $e_t$ and $e_t^*$ are normally distributed, in which case $r_t$ has a Cauchy distribution.
3.4. Relative Measures
Rather than using relative errors, one can use relative measures. For example, let $\mathrm{MAE}_b$ denote the MAE from the benchmark method. Then, a relative MAE is given by:
$$\mathrm{RelMAE} = \frac{\mathrm{MAE}}{\mathrm{MAE}_b} \tag{10}$$
Similar measures can be defined using RMSE, MDAE or MAPE. An advantage of these methods is their interpretability. For example, the relative MAE measures the possible improvement of the proposed forecast method relative to the benchmark forecast method. When $\mathrm{RelMAE} < 1$, the proposed method is better than the benchmark method, and when $\mathrm{RelMAE} > 1$, the proposed method is worse than the benchmark method.
When the benchmark method is a random walk and the forecasts are all one-step forecasts, the relative RMSE is Theil’s U statistic, as defined in Equation (11). The random walk (where $\hat{y}_t$ is equal to the last observation) is the most common benchmark method for such calculations.
Theil’s U statistic is a normalized measure of total forecasting error, with $0 \le U \le 1$. This measure is affected by changes of scale and data transformations. For good forecast accuracy, it is desirable that Theil’s U statistic be close to zero; $U = 0$ means a perfect fit.
3.5. Persistence Model
Persistence is an important dynamic property of any time series and is usually related to its memory properties. Specifically, a time series is a persistent process if the effect of an infinitesimally small shock influences future predictions of the time series for a very long time. Thus, the longer the influence lasts, the higher the persistence.
If a series suffers an external shock, the degree of persistence provides information about the impact of the shock on the series, that is, whether it will soon revert to its mean path or be pushed further away from it. In a highly persistent series, a shock tends to persist for a long time and the series drifts away from its historical mean path. On the contrary, a time series with a low degree of persistence tends to return to its historical mean path after a shock.
The persistence of a time series model has been measured in different ways in the literature [20].
3.6. Forecasting Skill
The forecasting skill is a type of measure that scores the ability of a forecasting method to predict future values of a time series with respect to a reference model used as a benchmark. The forecasting skill is a scaled representation of the relative forecasting error, and its purpose is the same as that of the relative measures introduced in Subsection 3.4.
The most commonly used forecasting skill measure is shown in Equation (12), and it is based on the previously introduced mean squared error (MSE, see Equation (3)).
$$SS = 1 - \frac{\mathrm{MSE}_{f}}{\mathrm{MSE}_{r}} \tag{12}$$
where $\mathrm{MSE}_{f}$ is the error of the tested forecasting method and $\mathrm{MSE}_{r}$ is the error of the reference benchmark.
A perfect forecast implies $SS = 1$, a forecast with similar skill to the benchmark forecast produces an $SS$ close to 0, and a forecast that is less skillful than the benchmark produces a negative value.
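A small numerical check makes the three cases concrete. The sketch below assumes the MSE-based skill takes the usual form of one minus the ratio of the tested method's MSE to the benchmark's MSE, and it uses the random walk (last observation) as the reference. The function name `skill_score` and the data are hypothetical.

```python
import numpy as np

def skill_score(y_true, y_pred, y_ref):
    """MSE-based skill: 1 - MSE(tested forecast) / MSE(reference forecast)."""
    y_true = np.asarray(y_true, dtype=float)
    mse_forecast = np.mean((y_true - np.asarray(y_pred, dtype=float)) ** 2)
    mse_reference = np.mean((y_true - np.asarray(y_ref, dtype=float)) ** 2)
    return float(1.0 - mse_forecast / mse_reference)

# Hypothetical series; the random walk predicts each value with the previous one.
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
targets = series[1:]        # values to be predicted
random_walk = series[:-1]   # reference forecasts (last observation)

perfect = skill_score(targets, targets, random_walk)
same_as_reference = skill_score(targets, random_walk, random_walk)
close = skill_score(targets, targets + 0.5, random_walk)
poor = skill_score(targets, np.zeros(4), random_walk)
```

Here `perfect` equals 1, `same_as_reference` equals 0, `close` is slightly below 1 and `poor` is negative, matching the interpretation given above.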
4. Forecasting Based on Linear Methods
There exist real complex phenomena that cannot be represented by means of linear difference equations since they are not fully deterministic. Therefore, it may be desirable to insert a random component in order to allow greater flexibility in their analysis.
Linear forecasting methods are those that try to model the behavior of a time series by means of a linear function. Among all the existing techniques, seven are quite popular: AR, VAR, MA, ARMA, ARIMA, ARCH and GARCH. These models follow a common methodology, whose application to time series analysis was first introduced by Box and Jenkins. The original work has been extended and republished many times since its first appearance in 1970; the newest version can be found in [21].
Autoregressive (AR($p$)), moving average (MA($q$)), mixed (ARMA($p$, $q$)), autoregressive integrated moving average (ARIMA($p$, $d$, $q$)), autoregressive conditional heteroskedastic (ARCH($q$)) and generalized autoregressive conditional heteroskedastic (GARCH($p$, $q$)) models were described following this idea, where $p$ is the number of autoregressive parameters, $q$ is the number of moving average parameters and $d$ is the number of times the series must be differenced to become stationary. Vector autoregressive models (VAR($p$)) are the natural extension of AR models to multivariate time series, where $p$ denotes the number of lags considered in the system.
4.1. Autoregressive Processes
An autoregressive process (AR) is denoted by AR($p$), where $p$ is the order of the AR process. This process assumes that every $y_t$ can be expressed as a linear combination of some past values. It is a simple model, yet it adequately describes many real complex phenomena. The generalized AR model of order $p$ is described by:
$$y_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \epsilon_t$$
where $\alpha_i$ are the coefficients that model the linear combination, $\epsilon_t$ the adjustment error, and $p$ the order of the model.
When the error is small compared to the actual values, a future value can be estimated as follows:
$$\hat{y}_t = \sum_{i=1}^{p} \alpha_i y_{t-i}$$
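To make the estimation step concrete, the following sketch fits the coefficients of an autoregressive model by ordinary least squares and produces a one-step forecast as a linear combination of the last $p$ values. This is only one possible estimation strategy, chosen here for brevity; the function names `fit_ar` and `forecast_ar` and the noise-free test series are assumptions made for the example.

```python
import numpy as np

def fit_ar(y, p):
    """Estimate AR(p) coefficients by least squares (no intercept)."""
    y = np.asarray(y, dtype=float)
    # Column i-1 holds the series lagged by i, aligned with the targets y[p:].
    lagged = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    targets = y[p:]
    coeffs, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    return coeffs

def forecast_ar(y, coeffs):
    """One-step-ahead forecast: weighted sum of the last p observations."""
    p = len(coeffs)
    last_values = np.asarray(y, dtype=float)[-1:-p - 1:-1]  # newest first
    return float(np.dot(coeffs, last_values))

# Noise-free AR(2) series generated with hypothetical coefficients 0.5 and -0.3.
y = [1.0, 2.0]
for _ in range(10):
    y.append(0.5 * y[-1] - 0.3 * y[-2])

coeffs = fit_ar(y, p=2)
next_value = forecast_ar(y, coeffs)
```

Because the example series contains no error term, the fitted coefficients recover the generating values. On real data the adjustment error remains, and the forecast is only as good as the assumption that this error is small.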
4.2. Vector Autoregressive Models
Vector autoregressive models (VAR) are the natural extension of the univariate AR to multivariate time series. VAR models have been shown to be especially useful for describing dynamic behaviors in time series and, therefore, for forecasting. In a VAR process of order $p$ with $N$ variables (VAR($p$)), $N$ different equations are estimated. In each equation, a regression of the target variable on $p$ lags is carried out.
Unlike the univariate case, VAR allows each series to be related to its own lags and to the lags of the other series that form the system. For instance, in a system of two time series there are two equations, one for each variable. This two-series system ($y_{1,t}$, $y_{2,t}$) can be mathematically expressed as follows:
$$y_{1,t} = \sum_{i=1}^{p} \alpha_{11,i}\, y_{1,t-i} + \sum_{i=1}^{p} \alpha_{12,i}\, y_{2,t-i} + \epsilon_{1,t}$$
$$y_{2,t} = \sum_{i=1}^{p} \alpha_{21,i}\, y_{1,t-i} + \sum_{i=1}^{p} \alpha_{22,i}\, y_{2,t-i} + \epsilon_{2,t}$$
where $y_{i,t}$ for $i = 1, 2$ are the series to be modeled, and the $\alpha$'s the coefficients to be estimated.
Note that the selection of an optimum lag length is a critical task for VAR processes and, for this reason, has been widely discussed in the literature [22].
4.3. Moving Average Processes
When the error $\epsilon_t$ cannot be assumed to be negligible, AR processes are not valid. In this situation, it is practical to use the moving average (MA) process, where the series is represented as a linear combination of the error values:
$$y_t = \epsilon_t + \sum_{i=1}^{q} \beta_i \epsilon_{t-i}$$
where $q$ is the order of the MA model and $\beta_i$ the coefficients of the linear combination. As observed, it is not necessary to make explicit use of past values of $y_t$ to estimate its future value. Finally, MA processes are seldom used alone in practice.
4.4. Autoregressive Moving Average Processes
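Autoregressive and moving average models are combined in order to generate better approximations than that of Wold’s representation [23]. This hybrid model is called the autoregressive moving average process (ARMA) and is denoted by ARMA($p$, $q$). Formally:
$$y_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{i=1}^{q} \beta_i \epsilon_{t-i} + \epsilon_t$$
where $\alpha_i$ and $\beta_i$ are the autoregressive and moving average coefficients, respectively, and $\epsilon_t$ is the error. Again, ARMA assumes that $\epsilon_t$ is small compared to $y_t$ in order to estimate future values of $y_t$. The estimates of the past values of $\epsilon_t$ at time $t-i$ can be obtained from past actual values of $y_t$ and past estimated values of $y_t$:
$$\hat{\epsilon}_{t-i} = y_{t-i} - \hat{y}_{t-i}$$
Therefore, the estimate for $y_t$ is calculated as follows:
$$\hat{y}_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{i=1}^{q} \beta_i \hat{\epsilon}_{t-i}$$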
4.5. Generalized Autoregressive Conditional Heteroskedastic Processes
Autoregressive conditional heteroskedastic processes (ARCH), first presented in [24], and their extension, generalized autoregressive conditional heteroskedastic processes (GARCH), introduced in [25], are especially designed to deal with volatile time series, that is, with series that exhibit high volatility and outlying data (for detailed information refer to [26,27]). The ARCH model considers that the conditional variance depends on time, namely, as an MA process of order $q$ of the squared error values:
$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \gamma_i \epsilon_{t-i}^2$$
where $\sigma_t^2$ is the conditional variance at time $t$, $\epsilon_{t-i}$ are the past errors, and $\omega$ and $\gamma_i$ are the coefficients of the model. The extension of an ARCH model to a GARCH model is similar to the extension of AR models to ARMA models. The conditional variance depends on its own past values, weighted by coefficients $\delta_j$, in addition to the past values of the squared errors:
$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \gamma_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \delta_j \sigma_{t-j}^2$$
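The recursive nature of the GARCH variance can be sketched for the simplest case with one lag of each kind. In this illustration the conditional variance is a constant plus a weighted past squared error plus a weighted past variance. The parameter names `omega`, `gamma` and `delta`, the starting variance and the error values are hypothetical, and no parameter estimation is attempted.

```python
import numpy as np

def garch11_variance(errors, omega, gamma, delta, initial_variance):
    """Conditional variances of a GARCH(1,1)-type recursion.

    Entry 0 is the starting variance; entry t + 1 is the variance implied
    by the error observed at step t and the variance at step t.
    """
    errors = np.asarray(errors, dtype=float)
    variance = np.empty(len(errors) + 1)
    variance[0] = initial_variance
    for t in range(len(errors)):
        variance[t + 1] = omega + gamma * errors[t] ** 2 + delta * variance[t]
    return variance

# Hypothetical errors: a calm step, a zero error, then a spike.
variance = garch11_variance(
    [1.0, 0.0, 2.0], omega=0.1, gamma=0.2, delta=0.7, initial_variance=1.0
)
```

The spike in the last error raises the final variance (1.46) above the previous one (0.8), which is how such models track periods of high volatility. Setting `delta` to zero reduces the recursion to the ARCH case.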
4.6. Autoregressive Integrated Moving Average Processes
Autoregressive integrated moving average processes (ARIMA) are the most general methods and are the result of combining AR and MA processes. ARIMA models are denoted as ARIMA($p$, $d$, $q$), where $p$ is the number of autoregressive terms, $d$ the number of nonseasonal differences, and $q$ the number of lagged forecast errors in the prediction equation. These models follow a common methodology, whose application to time series analysis was first introduced by Box and Jenkins [21]. This methodology proposes an iterative process formed by four main steps, as illustrated in Figure 3.
Figure 3.
The Box-Jenkins methodology.
Identification of the model. The first task is to determine whether the time series is stationary or not, that is, whether the mean and variance of the stochastic process remain constant over time. If the time series does not satisfy this constraint, a transformation has to be applied and the time series has to be differenced until stationarity is reached. The number of times that the series has to be differenced is denoted by d and is one of the parameters to be determined in ARIMA models.
Estimation of the parameters. Once d is determined, the process is reduced to an ARMA model with parameters p and q. These parameters can be estimated by following non-linear strategies. Among them, three stand out: evolutionary algorithms, least squares (LS) minimization and maximum likelihood (ML). Evolutionary algorithms and LS consist of minimizing the squared forecasting error on a training set, while ML consists of maximizing the likelihood function, which is proportional to the probability of obtaining the data given the model.
Comparisons between different Box-Jenkins time series models can be easily found in the literature [28,29,30,31], but there are very few works comparing the results of different parameter estimation methods. ML and LS were compared in [32] to obtain an ARIMA model to predict the gold price. The results reported an error of 0.81% and 2.86% when using LS and ML, respectively. A comparative analysis of the autocorrelation function, conditional likelihood, unconditional likelihood and genetic algorithms in the context of streamflow forecasting was made in [33]. Although the four methods obtained similar results, the autocorrelation function and the ML-based methods were the most computationally costly, especially as the order of the model increased. For that reason, the authors finally recommended the use of evolutionary algorithms.
The good performance of several metaheuristics in solving optimization problems, along with the limitations of classical methods, such as low precision and poor convergence, has motivated recent works comparing evolutionary algorithms and traditional methods for parameter estimation in time series models [34,35]. In general, evolutionary algorithms obtain better results because the likelihood function is highly nonlinear; therefore, conventional methods usually converge to a local maximum, in contrast to genetic algorithms, which tend to find the global maximum [36].
Validation of the model. Once the ARIMA model has been estimated, several hypotheses have to be validated. Thus, the fit of the model, the residual values and the significance of the coefficients forming the model are required to meet certain requirements. If this step is not fulfilled, the process begins again and the parameters are recalculated.
In particular, an ARIMA model is validated if the estimated residuals behave as white noise, that is, if they exhibit a normal distribution as well as constant variance and zero mean and covariance. To determine whether they are white noise, the autocorrelation and partial autocorrelation functions are calculated. These values must be significantly small.
Additionally, to assess different models’ performance, Akaike information criterion (AIC) and Bayesian information criterion (BIC) measures are typically used (instead of classical error measures, such as MAE or RMSE) given their ability to avoid the overfitting that overparameterization causes.
A problem with the AIC is that it tends to overestimate the number of parameters in the model and this effect can be important in small samples. If AIC and BIC are compared, it can be seen that the BIC penalizes the introduction of new parameters more than the AIC does, hence it tends to choose more parsimonious models [37].
Forecasts. Finally, if the parameters have been properly determined and validated, the system is ready to perform forecasts.
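The difference between the two information criteria used in the validation step can be illustrated with a short sketch. It assumes Gaussian residuals whose variance is estimated by maximum likelihood, and it uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters and n the sample size. The function name and the residual values are hypothetical, and conventions for counting k vary between software packages.

```python
import numpy as np

def gaussian_aic_bic(residuals, n_params):
    """AIC and BIC from model residuals under a Gaussian likelihood."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    variance = np.mean(residuals ** 2)  # maximum-likelihood variance estimate
    # -2 ln L for Gaussian residuals evaluated at the ML variance.
    neg2_log_lik = n * (np.log(2.0 * np.pi * variance) + 1.0)
    aic = 2.0 * n_params + neg2_log_lik
    bic = n_params * np.log(n) + neg2_log_lik
    return float(aic), float(bic)

# Hypothetical residuals of a fitted model with three parameters.
rng = np.random.default_rng(0)
residuals = rng.normal(size=100)
aic, bic = gaussian_aic_bic(residuals, n_params=3)
penalty_gap = bic - aic  # equals n_params * (ln n - 2)
```

Since both criteria share the same likelihood term, they differ only in the penalty: BIC charges ln n per parameter and AIC charges 2. For samples with more than seven observations ln n exceeds 2, so the BIC penalizes additional parameters more heavily, in line with the remark above about parsimony.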
4.7. Related Work
The authors in [38] used the GARCH method to forecast the electricity prices in two regions of New York. The results were compared to those of different techniques such as dynamic regression (DR), transfer function models (TFM) and exponential smoothing. They also showed that accounting for the spike values and the heteroscedastic variance in these time series could improve the forecasting, reaching error rates lower than 2.5%.
García et al. [39] proposed a forecasting technique based on a GARCH model. This paper focused on the day-ahead forecasting of electricity prices with high-volatility periods. The proposal was tested on both the mainland Spanish and the Californian deregulated markets.
Also related to electricity price time series, the approach proposed by Malo et al. in [40] is equally noteworthy. In it, the authors considered a variety of specification tests for multivariate GARCH models that were used in dynamic hedging in the Nordic electricity markets. Moreover, hedging performance comparisons were conducted in terms of unconditional and conditional ex-post variance.
An application of ARMA models to electricity prices can be found in [41], where the exogenous variable is the electricity demand. The study was carried out with data from California. The average error is close to 10%.
In [42], ARIMA models, selected by means of the Bayesian information criterion, were proposed to obtain forecasts of electricity prices in the Spanish market. In addition, the work analyzed the optimal number of samples used to build the prediction models.
Weron et al. [43] presented twelve parametric and semi-parametric time series models to predict electricity prices for the next day. Moreover, in this work forecasting intervals were provided and evaluated taking into account the conditional and unconditional coverage. They concluded that the intervals obtained by semi-parametric models are better than those of parametric models.
Table 1 summarizes the content of this section. Note that 5+ models means that the approach has been compared to five or more models. As can be seen, linear methods were very popular in the early 2000s as the main methods for making predictions. Nowadays, however, this kind of method has become a baseline against which other methods are compared.
Table 1.
Summary on linear methods.
Reference | Technique | Outperforms | Metrics | Horizon | Year | Market |
--- | --- | --- | --- | --- | --- | --- |
[38] | GARCH | DR/TFM/Smoothing | RMSE/MAPE | 1 day | 2002 | NYISO |
[39] | GARCH | ARIMA | RMSE | 1 day | 2000 | CAISO/OMEL |
[40] | GARCH | 5+ models | MAPE/MAE | 1 day | 2004 | Northern Europe |
[41] | ARMA | 5+ models | RMSE | 1 day | 2000 | CAISO |
[42] | Mixed ARIMA | ARIMA | RMSE/MAPE | 1 day | 2000–2002 | OMEL |
[43] | ARIMA | 5+ models | MAE/MAPE | 1 day | 2004 | CAISO/Nord Pool |
8. Other Models
Despite the extensive description of methods provided in prior sections, some authors have proposed new forecasting approaches that cannot be classified into any of the aforementioned categories. For this reason, this section is devoted to introducing these works.
Transfer function models (TFM), known as dynamic econometric models in the economics literature, based on past electricity prices and demand were proposed by Nogales et al. in [111] to forecast day-ahead electricity prices, although the prices of all 24 h of the previous day were not known. They used the median as a measure due to the presence of outliers, and they stated that the model in which the demand was considered produced better forecasts.
The authors in [112] focused on one-year-ahead electricity demand prediction for winter seasons by defining a new Bayesian hierarchical model (BH). They provided the marginal posterior distributions of demand peaks. The one-year-ahead results were compared to those of the National Grid Transco (NGT) group in the United Kingdom.
A fuzzy inference system (FIS), adopted due to its transparency and interpretability, combined with traditional time series methods was proposed for day-ahead electricity price forecasting [113].
A novel non-parametric model using the manifold learning (MFL) methodology was proposed in [114] in order to predict electricity price time series. For this purpose, the authors used cluster analysis based on the embedded manifold of the original dataset. To be precise, they applied manifold-based dimensionality reduction to curve modeling, showing that the day-ahead curve can be represented by a low-dimensional manifold.
A different proposal can be found in [115], where a forecasting algorithm based on Grey models was introduced to predict the load of Shanghai. In the Grey model, the original data series was transformed to reduce its noise, and the accuracy was improved by using Markov chain techniques.
Clustering has also been used as an initial step to forecast electricity time series. For instance, the authors in [116,117] evaluated the performance of both K-means and Fuzzy C-Means in detecting patterns in the Spanish market. Later, these patterns were used to transform the time series into a sequence of labels, showing the benefits of using this information as a previous step in time series forecasting [118]. Finally, an extended and improved approach, PSF, was introduced in [119], where New York, Australian and Spanish electricity price and demand time series were successfully forecasted, showing remarkable performance compared to classical methods. The same method was adapted to forecast outliers (o-PSF) for the same markets in [120].
A method using a principal component analysis (PCA) network was introduced in [121] to forecast day-ahead prices. The PCA network extracts essential features from periodic information in the market. Later, these features are used as inputs to a multilayer feedforward network. The PJM market was used to test the proposed method, and the results were compared to those of ARIMA models.
Finally, Table 8 summarizes all the methods reviewed in this section.
Table 8.
Summary on other models for electricity forecasting.
Reference | Technique | Outperforms | Metrics | Horizon | Year | Market |
--- | --- | --- | --- | --- | --- | --- |
[111] | TFM | ARIMA | RMSE/MAPE | 1 day | 2003 | PJM |
[112] | BH | NGT | RMSE | 1 year | 2002/03 | UK |
[113] | FIS | ARMA/GARCH | RMSE/MAPE | 1 day | 2003/04 | PJM |
[114] | MFL | ARIMA/Holt | MSE | up to 1 month | 2010 | NYISO |
[115] | Grey-Markov | Grey | MRE | 1 day | 2005/06 | Shanghai |
[119] | PSF | 5+ methods | MRE/MAPE | 1 day | 2006 | NYISO/ANEM/OMEL |
[120] | o-PSF | 5+ methods | MRE/MAPE | 1 day | 2006 | NYISO/ANEM/OMEL |
[121] | PCA | ANN | MAE | 1 day | 2008 | PJM |
9. Ensemble Models
Recently, ensemble models have begun to receive attention from the research community due to their good performance on classification problems [
122,
123]. In general, ensemble models consist in combining different models in order to improve the accuracy of the individual models. In most works, the combination is based on a system of majority votes (bagging) or weighted majority votes (boosting).
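The two combination rules mentioned above can be sketched in a few lines. This is a minimal illustration for class labels only: the function names are assumptions made here, and how the base models are trained or how the weights are obtained is not covered.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Return the label predicted by the largest number of base models."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Return the label whose supporting models have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

For regression tasks such as load or price forecasting, the analogous rules are the plain and the weighted average of the individual forecasts.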
In recent years, ensemble techniques have also been applied to the prediction of energy time series. Fan et al. [
124] proposed a machine learning model based on Bayesian Clustering by Dynamics (BCD) and SVM. First, Bayesian clustering techniques were used to split the input data into 24 subsets. Then, SVM methods were applied to each subset to obtain the forecasts of the hourly electricity load for the city of New York.
The work in [
125] introduced a price forecasting method based on wavelet transform combined with ARIMA and GARCH models. The method was assessed on Spanish and PJM electricity markets and compared to some other forecasting methods.
An ensemble of RBF neural networks for short-term load forecasting in seven buildings from Italy can be found in [
126]. The main novelty of this work is the introduction of a new term in the objective function to minimize the correlation between the error of a network and the errors of the remaining networks in the ensemble. In this case, the results were compared to SARIMA, which proved to be more competitive in most of the buildings.
An ensemble of ELMs was presented in [
127] for short-term load forecasting in the Australian electricity market. Both the weights of the input layer and the number of nodes in the hidden layer of each ELM were randomly set. The median of the outputs generated by the ELMs was the final prediction. The results reported an error of 1.82% for the year 2010 versus 2.89%, 2.93%, and 2.86% obtained by a single ELM, a back-propagation ANN and an RBF neural network, respectively.
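A minimal sketch of such an ensemble is given below, assuming a tanh activation, Gaussian random input weights and a hidden-layer size drawn uniformly from a fixed range. None of these choices is stated in the text, and all function and parameter names are hypothetical.

```python
import numpy as np

def train_elm(X, y, n_hidden, rng):
    """Train one ELM: input weights are random and fixed; only output weights are fitted."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights by least squares
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

def elm_ensemble_predict(X_train, y_train, X_new,
                         n_models=15, hidden_range=(10, 30), seed=0):
    """Median of the forecasts of several ELMs with random hidden-layer sizes."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_models):
        n_hidden = int(rng.integers(hidden_range[0], hidden_range[1] + 1))
        model = train_elm(X_train, y_train, n_hidden, rng)
        outputs.append(predict_elm(model, X_new))
    return np.median(outputs, axis=0)            # median is robust to a few poor members
```

Using the median rather than the mean limits the influence of individual networks whose random initialization happens to generalize badly.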
Many ensembles of ANNs have recently been published in the literature for electricity price or load forecasting. In fact, most of the ensemble techniques proposed for regression tasks have been ensembles of ANNs. For instance, the authors in [
128] proposed the hybrid method PSF-NN, which combines pattern sequence similarity with neural networks. The results show that using an ensemble of NNs instead of a single NN in the NN component of PSF-NN is beneficial, since it yields better accuracy at an acceptable computational cost.
Another ensemble based on PSF was introduced in [
129]. In this case, five forecasting models were used, each based on a different clustering technique: K-means, SOM, hierarchical clustering, K-medoids, and Fuzzy C-means. The ensemble model was implemented with an iterative prediction procedure. The method was applied to the New York, Australian and Spanish markets, and the results were compared to those of the original PSF algorithm.
The performance of an ensemble of ANNs was compared with that of a Seasonal Autoregressive Integrated Moving Average (SARIMA) model, a Seasonal Autoregressive Moving Average (SARMA) model, a Random Forest, Double Exponential Smoothing and Multiple Regression in [
130], with the ensemble providing the best results. The ANNs composing the ensemble were trained with different subsets provided by a previous clustering.
An ensemble was proposed in [
131] to predict the load in California for the next day. The authors used a reference forecast made by the system operator as an input variable of the proposed method, and this prediction was improved by means of two Box-Jenkins time series models. Then, the forecasts provided by these two models were combined to obtain the final prediction. The weights of the combination were optimized by means of the least squares method. Moreover, the authors built different ensembles considering either global weights or weights depending on the hour or the day.
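The least squares combination of forecasts can be sketched as follows. This is an unconstrained formulation assumed here for illustration; the text does not state whether the cited work constrains the weights (for example, to sum to one). The function names and the `hours` argument are hypothetical.

```python
import numpy as np

def fit_combination_weights(forecasts, actual):
    """Global least-squares weights.

    forecasts : array of shape (n_samples, n_models), one column per model
    actual    : observed values, shape (n_samples,)
    """
    F = np.asarray(forecasts, dtype=float)
    y = np.asarray(actual, dtype=float)
    weights, *_ = np.linalg.lstsq(F, y, rcond=None)
    return weights

def fit_hourly_weights(forecasts, actual, hours):
    """One weight vector per hour of the day, fitted on that hour's samples only."""
    F = np.asarray(forecasts, dtype=float)
    y = np.asarray(actual, dtype=float)
    hours = np.asarray(hours)
    return {int(h): fit_combination_weights(F[hours == h], y[hours == h])
            for h in np.unique(hours)}

def combine(forecasts, weights):
    """Final prediction: weighted sum of the individual forecasts."""
    return np.asarray(forecasts, dtype=float) @ weights
```

Hour-dependent weights let the combination favor whichever model is more reliable at a given time of day, at the cost of fewer samples per fit.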
Finally, Table 9 summarizes all the methods reviewed in this section.
Table 9.
Summary on ensembles for electricity forecasting.
Reference | Technique | Outperforms | Metrics | Horizon | Year | Market |
---|---|---|---|---|---|---|
[124] | BCD+SVM | SVR | MAPE | 1 day | 2001–2003 | NYISO |
[125] | WL+GARCH | 5+ models | RMSE/MAPE | 1 day | 2002 | OMEL/PJM |
[126] | ANN | SARIMA | MSE/MAE/MAPE | 1 day | 2010 | Italy |
[127] | ELM | ANN/RBF | MAE/MAPE | 1 day | 2010 | ANEM |
[128] | PSF+ANN | 5+ models | MAE/MAPE | 1 day | 2010 | ANEM |
[129] | PSF+Clust | PSF | MRE/MAPE | 1 day | 2006 | NYISO/ANEM/OMEL |
[130] | ANN | SARIMA | MAPE | 1 day | 2012 | C & I |
[131] | ARIMA | 5+ models | RMSE/MAE/MAPE | 1 day | 2013 | CAISO/ERCOT |
10. Conclusions
This work is expected to serve as an initial guide for researchers interested in time series forecasting and, in particular, in forecasting based on data mining approaches. Thus, a brief but rigorous mathematical description of the main data mining techniques that have been applied to forecast time series has been provided. Due to the wide variety of applications of such techniques, one case study has been selected: the analysis of energy-related time series (electricity price and demand). The large number of works carried out on this topic during the last decade highlights the strengths that data mining had already exhibited in other fields. With reference to the type of prediction, it can be concluded that almost all methods use a prediction horizon of one day. Few works forecast recent years since, for comparative purposes, authors prefer to use older data. Moreover, some techniques have rarely been used so far in this research area, namely nearest neighbors and genetic programming. This suggests that much work remains to be done on such models. On the contrary, ANN and SVM have been extensively used for this forecasting task. Linear models are still being used, but mainly as baselines, since most of the data mining approaches outperform them in terms of accuracy. Wavelets and rule-based methods are mainly used in hybrid approaches and yield significant accuracy improvements when properly combined. The accuracy measures most commonly used are MAPE and RMSE. Finally, the current trend in electricity forecasting points to the development of ensembles, which exploit the individual strengths of each method.