Short Term Prediction of Groundwater Level Using Improved Random Forest Regression With A Combination of Random Features

Applied Water Science (2018) 8:125
https://doi.org/10.1007/s13201-018-0742-6
ORIGINAL ARTICLE
Short‑term prediction of groundwater level using improved random

forest regression with a combination of random features
Xuanhui Wang1,2,3 · Tailian Liu1 · Xilai Zheng2,3 · Hui Peng2,3 · Jia Xin2,3 · Bo Zhang2,3
Received: 7 March 2017 / Accepted: 7 June 2018 / Published online: 24 July 2018
© The Author(s) 2018
Abstract
To solve the problem where by the available on-site input data are too scarce to predict the level of groundwater, this paper
proposes an algorithm to make this prediction called the canonical correlation forest algorithm with a combination of random
features. To assess the effectiveness of the proposed algorithm, groundwater levels and meteorological data for the Daguhe
River groundwater source field, in Qingdao, China, were used. First, the results of a comparison among three regressors
showed that the proposed algorithm is superior in terms of forecasting variations in groundwater level. Second, the results
of experiments were used to show the comparative superiority of the proposed method in terms of training time and com-
plexity of parameter optimization. Third, using the proposed algorithm, the highest prediction accuracy was achieved by
employing precipitation P(t − 2), temperature T(t), and groundwater level H(t) as the best time lag. This improved random
forest regression model yielded higher accuracy in forecasting the variation in groundwater level. The proposed algorithm
can also be applied to cases involving low-dimensional data.
Keywords Groundwater level prediction · Improved random forest regression · Least squares support vector regression ·
Daguhe River groundwater source field
Abbreviations RMSE Root-mean-squared error

ANN Artificial neural network SVM Support vector machine
CCA Canonical correlation analysis
CCF-CRF Canonical correlation forests with a combina-
tion of random features Introduction
Forest-RC Features combination based random forest
GWL Groundwater level Fluctuations in groundwater level (GWL) can be used to
LS-SVR Least squares support vector regression evaluate groundwater stability and flow as well as the char-
MAE Mean absolute error acteristics of the aquifer. Moreover, groundwater is the main
R Coefficient of correlation source of farmland irrigation in major agricultural regions.
RF Random forest Agricultural water managers assess whether groundwater
RFR Random forest regression is needed during sowing season and how much is likely
to be provided (Barlow and Clark 2011; Dyer et al. 2015;
Kebede et al. 2014). Although most government agencies
collect GWL data once or twice a year in major agricultural
* Xilai Zheng areas, this is not sufficient for short-term studies. Therefore,
zhxilai@ouc.edu.cn
it is necessary to achieve acceptable prediction accuracy of
1
Science and Information College, Qingdao Agricultural GWL variations when previous information is not available
University, Qingdao 266109, China and the computational sources are limited. It is well known
2
Key Lab of Marine Environmental Science and Ecology, that the dynamic variation in GWL is influenced by mete-
Ministry of Education, Ocean University of China, orological phenomena, urbanization, tidal effects, and land
Qingdao 266100, China subsidence (Khaki et al. 2015). Of these, such meteorologi-
3
Shandong Provincial Key Laboratory of Marine Environment cal parameters as atmospheric pressure, frost, precipitation,
and Geological Engineering, Ocean University of China,
Qingdao 266100, China
13
Vol.:(0123456789)
125 Page 2 of 12 Applied Water Science (2018) 8:125
and evapotranspiration are considered important indicators RF has been widely used in surface and ground water
of fluctuations in GWL. hydrology. Many studies have applied RF to study ground-
Models to simulate groundwater can be classed into two water from different aspects, such as its vulnerability to
main groups: physical models and data-driven models. For nitrate pollution (Rodriguez-Galiano et al. 2014), the dis-
physical models, an appropriate synthesis of the parame- solution of organic nitrogen in it (Wang et al. 2016), and the
ters of the aquifer is used to determine the spatial variation potential mapping of groundwater (Rahmati et al. 2016).
in underground space. This information is challenging to Moreover, the RF model has been compared with the SVM
obtain even through expensive on-site surveys, which fur- in terms of surface water level prediction (Li et al. 2015).
ther increase computational cost. Therefore, the physical However, no study to date has applied RF to the prediction
domain needs to be partitioned to obtain a numerical solu- of GWL owing to low-dimensional input features.
tion (Taormina et al. 2012). An appropriate alternative to the Breiman (2001) introduced a random forest algorithm
physical model is the data-driven model, which can provide (Forest-RC) that used linear combinations of input variables
accurate predictions without a costly calibration time when when dealing with low-dimensional input data. Although
data are insufficient and the physical mechanism is not the Breiman (2001), Hamza and Larocque (2005) and Moudani
focus of research (Mohanty et al. 2015). Artificial neural (2013) have shown that Forest-RC yields better results than
network (ANN)-based techniques and the support vector those obtained with no linear combination, there are five key
machine (SVM) are data-driven models treated as standard limitations to this approach:
nonlinear estimators and can overcome the limitations of the
physical model (Behzad et al. 2010). 1. The extended number of dimensions in Forest-RC is
In groundwater hydrology, ANNs have been used in such limited when the original data contain few dimensions
applications as groundwater level prediction (Emamgholiza- of feature.
deh et al. 2014; He et al. 2014; Khaki et al. 2015). However, 2. The original feature information, which is important
it is known that ANN models incur the disadvantages of for regression and has a certain impact on its results, is
local minima and problems of overfitting. The SVM can removed by Forest-RC.
overcome these drawbacks (Hosseini and Mahjouri 2016). 3. The information in the new feature varies with the value
Comparative studies on the SVM and the ANN have been of L, which is the number of selected variables that are
performed in the context of GWL prediction. Behzad et al. combined. Furthermore, it directly affects the construc-
(2010) compared the SVM and ANN for predicting tran- tion of the decision trees. Experiments have shown that
sient GWL under variable pumping and weather conditions. different values of L lead to different generalization
Yoon et al. (2011) developed and compared two time-series capabilities (Luo et al. 2016).
forecasting models using ANN and SVM and applied them 4. A fixed value of L is contrary to randomness, which is
to forecast GWL fluctuations in a coastal aquifer recharged introduced by choosing training samples randomly and
from precipitation and tidal effects. Yoon et al. (2016) com- using a random subset of all features as the set of split-
pared the recursive prediction performance of the SVM and ting rules. The value of L must be adjusted for different
ANN for the long-term prediction of GWL and concluded prediction horizons for GWL, which limits the versatil-
that SVM is a better substitute for ANN in terms of accuracy ity of the Forest-RC algorithm in practical applications.
and robustness for the prediction of GWL fluctuation. 5. Forest-RC uses orthogonal (i.e., axis-aligned) decision
Although SVM is superior to ANN in reflecting the trees that only recognize axis-parallel splits of the fea-
dynamic variation in GWL, it takes more time because of ture space and lead to poorer performance of trees than
trial and error (Raghavendra and Deka 2014). It is also sensi- oblique decision trees. Zhang and Suganthan (2015)
tive to outliers and redundant data (Suykens et al. 2002). On showed that oblique random forests achieved better per-
the contrary, results of the prediction of random forest (RF) formance than their conventional axis-parallel counter-
are unaffected by outliers and redundant data (Rodriguez- part in computation time and classification accuracy.
Galiano et al. 2014). Comparative studies on RF and SVM
have been carried out from different perspectives. For exam- Based on the above issues, in this paper, we propose a
ple, Pal (2005) compared RF with SVM in terms of clas- method of GWL modeling based on canonical correlation
sification accuracy, training time, and user-defined param- forests with a combination of random features (CCF-CRF)
eters. They were also compared on four criteria in mineral that expands the low-dimensional feature vector space to high-
prospectivity modeling (Rodriguez-Galiano et al. 2015). In dimensional space and uses canonical correlation components
these case studies, the authors concluded that the RF model for oblique splits. A comparison of performance among CCF-
exhibited stronger predictive capabilities and higher robust- CRF, random forest regression (RFR) algorithm, and least
ness, with a lower complexity of parameterization and a squares support vector regression (LS-SVR) was conducted
shorter training time for optimization than the SVM. and showed that CCF-CRF can provide the most accurate
13
Applied Water Science (2018) 8:125 Page 3 of 12 125
short-term GWL forecasts. For this purpose, we chose the support vector regression (LS-SVR) (Pelckmans et al. 2002;
monitoring well Guxian along the groundwater source for the Suykens and Vandewalle 2000). LS-SVR is a series of defor-
Daguhe River for a case study. mations that can replace the quadratic programming problem
with the linear equation problem. The calculation can thus
Materials and methods be made easier, and the time needed to obtain a solution can
be effectively reduced.
Random forest regression algorithm (RFR) The basic principle of LS-SVR is introduced in this sec-
tion (Suykens and Vandewalle 2000). The training sample
Based on classification and regression trees (CART), ran- set was assumed to be {Xi , Yi }Ni=1, where Xi ∈ Rk are the input
dom forest (RF) builds a forest from bootstrap samples of data and Yi ∈ R the output data. The relationship between the
observed data. Having generated multiple decision trees in input and output data is as follows:
the process of training, RF is designed to generate an output
N
by majority vote (for classification) and the average of the ∑
single-tree method (for regression) (Breiman 2001). GWL as
Y(X) = 𝛼i G(X, Xi ) + b (2)
i=1
the output of this prediction is a continuous variable. Thus,
we focus on the regression model of RF. where G is a kernel-based function and 𝛼i and b are param-
A regression tree (RT), where each non-leaf node con- eters of the regression. The primary method of LS-SVR
tains a set of decision rules and each leaf node is the out- involves converting the parameter estimation problem into
come of a prediction, is a form of decision tree (DT) (Quin- the following quadratic programming problem:
lan 1993; Rodriguez-Galiano et al. 2014).
The two user-defined parameters of RF are B, the number 1 𝛾
min 𝛼 T 𝛼 + eT e (3)
𝛼,b,e 2 2
of trees in the forest, and D, the number of features used to
split the nodes. The default value of B is 500, but is reset
according to application. One third of the variables for the N
∑
regression task are set to the default value of D. s.t. Yi = 𝛼j G(Xi , Xj ) + b + ei , i = 1, … , N (4)
j
Steps of the RF regression algorithm are as follows (for
full details, see Breiman 2001): 𝛾 ∈ R+ is a penalty factor, ei ∈ R is the bias term, G is the
radial basis function (RBF), a kernel function that is the
1. Different bootstrap samples Xi (i = bootstrap iteration) best choice for researchers because it has higher accu-
are randomly drawn from the original dataset X. Two- racy and reliability than other kernel functions (Raghav-
thirds of the samples are included in a bootstrap sample endra and Deka 2014), and 𝛼i and b are parameters of
and one third as the out-of-bag samples. Each tree is the regression. With Ŷ = (Y1 , Y2 , … , YN )T ∈ RN and
constructed to correspond to a particular subset of the 1N = (1, 1, … , 1)T ∈ RN , the solutions are given by
bootstrap.
[ ][ ] [ ]
2. At a node in each tree, a new split is randomly selected I
𝛺 + 𝛾N 1N 𝛼 Ŷ
from all indices, and the input variable with the lowest = (5)
1 T
0 b 0
mean square error (MSE) is chosen as the splitting cri- N
terion of the regression tree.

where 𝛺i,j = G(Xi , Xj ) is the kernel matrix and IN the identity
3. The data splitting process in each internal node is
matrix of order N .
repeated according to the above steps until all rand-
omized trees have been grown and a stop condition is
reached. Canonical correlation forests with a combination
4. The final results of regression can be calculated as fol- of random features (CCF‑CRF)
lows, where B stands for the number of trees in the forest
and Tb represents each tree: The algorithm improves the predictive performance in two
respects:
B
1∑ (1) the use of a combination of random features that intro-
ŷ (xi ) = T (x ). (1)
B b=1 b i duces more randomness to build the decision tree, and (2)
the use of canonical correlation analysis (CCA) to generate
Least squares support vector regression (LS‑SVR) candidate hyperplane splits and projection bootstrapping to
construct ensembles of the oblique decision tree.
An extension of the standard support vector regression
(SVR) (Smola et al. 2004; Vapnik 1995) is least squares 1. A combination of random features
13
Two ingredients involved in the generalization error for same attribute for each node, which reduces the similarity
random forests: the strength of a predictor in the forest and of the decision trees. To reducing the correlation coef-
the correlation between trees (Breiman 2001). When the ficient and the generalization error, more randomness is
strength is fixed, the lower the correlation coefficient, the introduced into the building of individual regression trees
smaller the generalization error. Luo et al. (2016) proposed by the CCF-CRF algorithm.
that one solution that reduces the correlation coefficient
involves expanding the low-dimensional feature vector 2. CCA and projection bootstrapping
space to a high-dimensional space.
The original RF method can handle cases with only a Canonical correlation analysis (CCA) (Hotelling 1936) is
few inputs. The only viable solution to this problem is to used to calculate pairs of linear projections to maximize the
increase the dimensionality of the feature space, which can correlation coefficient between matrices in a co-projected
be done by creating linear combinations of input features. space. In the first step, features λ appearing in Algorithm 1
Forest-RC (Breiman 2001) is one such method and was are sampled without replacement from the input feature
introduced to define more features by using random linear set Fs, which is generated from the process in part 2.2 of
combinations of input variables. this algorithm. In the next step, a new training set {X′, Y′}
In Forest-RC, NL features are selected randomly from is selected from {X(:,λ), Y} using the bootstrap sampling
the original features at each node and linearly combined technique. CCA coefficients relating the features to the
together with a random number uniformly distributed in outputs are then calculated based on the new training set.
the interval [− 1, 1]. We thus get the variable names V, The canonical coefficients Φ corresponding to X are used to
whose formulation is as follows: generate the new features, which are obtained by projecting
the original features into the canonical component space.
L
∑ Furthermore, the function FINDBESTSPLIT obtains the
V= Ki Vi , Ki ∈ [−1, 1] (6) best split ξ by calculating the maximum information gain
i=1
(Quinlan 1986).
Thus, more features are defined by taking random linear From the above analysis, it is clear that the main differ-
combinations of the input variables, and the nodes are split ence between RF and CCF-CRF is that the latter uses CCA
using the optimal predictive variable among these new to analyze the relationship between the output and the fea-
features. tures and then splits the nodes using an exhaustive search in
As mentioned in the “Introduction” section, Forest-RC the projected space.
has five defects. To overcome them and improve the pre-
dictive performance of the ensemble, this paper proposes
an algorithm called canonical correlation forests with a Study area and data processing
combination of random features (CCF-CRF). The value
of L induces randomness to enable the feature space to Study site and data description
contain a greater number of novel and different features.
The original m-dimensional feature space is extended to The study site was located at the groundwater source field
n-dimensions, and this relationship is described by the fol- of Daguhe River in Qingdao, China. The monitoring well
lowing equation: Guxian, located in the northwest of the study area, was used
as an example to verify the effectiveness of the CCF-CRF
model. The groundwater source field of Daguhe River is
m
∑
(7)
i
n= CM . among the important public drinking water sources in Qing-
i=1
dao. It is located in the northwest of the city and covers an
It is clear from Eq. (7) that the new n-dimensional fea- area of approximately 4600 km2, with the temperature rang-
ture space not only contains the original feature space, ing from − 3.3 °C in January to 25.3 °C in July and average
but also contains the linear combination of arbitrary fea- annual precipitation of 625.3 mm. The geographical position
tures. CCF-CRF is thus an approach that creates a higher- of the site is shown in Fig. 2.
dimensional feature space than that of Forest-RC and RFR. Figure 3 shows a typical geological profile along section
The processing of the CCF-CRF algorithm is described in a–a′ nearby Guxian. The major aquifer systems of the study
detail in Fig. 1. site were constituted by quaternary alluvium and diluvium,
CCF-CRF, where the split at each node is based on consisting of unconsolidated gravel, sand, silt, and clay with
completely random linear combinations of features instead a width of 5–7 km. The maximum width was more than
of a single one, has a lower probability of choosing the 10 km, and the thickness ranged from 10 to 20 km. The aqui-
fer system was discretized vertically into two layers. There
13
Fig. 1 Description of CCF-CRF algorithm
were particles of small size in layer 1, which was covered area were the vegetation cover and the absence of a signifi-
with silty clay, and their width ranged from 2 to 5 km. Layer cant variation in topography. The depth to the groundwa-
2 was stocked with sand and gravel, with larger particle sizes ter was small, with a thin silt-like clay layer in this area,
with a thickness ranging from 4 to 8 km. The bottom could which created conditions favorable for rainfall infiltration.
be regarded as aquiclude composed of clay rock and sand- Precipitation was thus the primary and most critical vari-
stone and is also known as the lower confining bed. able affecting short-term fluctuations in groundwater level
In a majority of the area, the rainy season peaked in within the study site.
May through July. The most obvious characteristics of this
13
Fig. 2 Map of situation of the studied site (Guxian Station)
Fig. 3 Simple geological profile along the section a–a′ near by Guxian well
The precipitation in millimeters and temperature data in GWL values used in this modeling were the averages of two
Celsius were obtained from a meteorological station located measurements per day, whereas the corresponding daily pre-
near the Guxian GWL station. Daily data for the groundwa- cipitation value was the total precipitation measured over a
ter level for the site were collected from a remote monitor- 24-h period, and the corresponding daily minimum tempera-
ing system designed to provide accurate scientific basis for ture was the minimum value of all observations within a day.
the monitoring, exploitation, and protection of groundwa- From Fig. 4, it is clear that the lowest GWL (during the
ter from the Daguhe River groundwater source field. Daily study period) obtained in July 2013 and 2014, with warmer
13
months yielding the lowest GWL. This can be attributed to in Eq. (8), so that the variables (X) in the training and testing
higher evaporation rates owing to higher temperatures in the datasets ranged from − 1 to 1. In Eq. (8), Xnorm, X, Xmean, and
summer, which motivated higher groundwater extractions Xmax represent the normalized value, the real value, the mean
and soil-moisture deficits. On the contrary, highest GWL value, and the maximum value, respectively:
occurred in spring under natural conditions because recharge
X − Xmean
was relatively high and groundwater pumping extraction Xnorm = . (8)
2Xmax
remained relatively low.
The root-mean-square error (RMSE), the mean absolute
Data preprocessing and performance criteria error (MAE), and the correlation coefficient (R) are usually
employed as evaluation metrics of the model used in this study.
The RFR, LS-SVR, and CCF-CRF models for GWL fore- The RMSE is a standard metric for model error and represents
casting at the Guxian well station were written in MATLAB. the degree of linear relationship between the observed GWL
The applied data were gathered from January 1, 2013, to values and the forecasted values. As the errors are squared
November 19, 2014, at the Guxian well. For each dataset, before they are averaged, it is very sensitive to large errors
66.5% of the data were selected as the initial training sam- in a set of measurement data. The MAE is another useful
ples set T, and the remaining 33.5% were used as test sam- method that uses the deviation between the forecasted values
ples. To calibrate the parameters of the LS-SVR model, the and the actual values to reflect the accuracy of the system.
training samples set were divided into two parts: a training Equation (11) represents the MAE. The predictive capability
set and a calibration set. In the former, system behavior had of the model was evaluated by RMSE and MAE from differ-
been learned and the correlated patterns between the input ent points of view. Our analysis indicates that MAE is a good
dataset and the target values identified by LS-SVR. In the measure of overall error in the training and testing sets, and
latter, the optimal model parameters were determined under RMSE measures the goodness of fit relevant to high values.
the rule of minimizing error through a trial-and-error pro- The best fit between the observed and estimated values would
cess. In the testing stage, the performance of a fully speci- have been R = 1, RMSE = 0, and MAE = 0. These parameters
fied predictor was assessed on criteria for model perfor- were calculated by using the following equations:
mance. However, a separate calibration set was not needed ∑n o ̄o p ̄p
for the RF to obtain an unbiased estimate of the test set i=1 (Qi − Q )(Qi − Q )
R= � (9)
error because out-of-bag estimation is the internal cross- ∑n ̄ o )2 ∑n (Qp − Q
̄ p )2
(Qoi − Q
validation of the training set and a good tool for optimizing i=1 i=1 i
the parameters.
√
Before the data were treated by these three models, all input 1 ∑n
and output data in the training process were normalized using RMSE =
p
(Qi − Qoi )2 (10)
n i=1
the mean (Xmean) and maximum (Xmax) values, as described
Fig. 4 Time-series plot col-

lected at Guxian studied site
13
than 100 (Rodriguez-Galiano et al. 2012). The default value

n
1 ∑| p | of parameter D was 1/3 of the total number of variables.
MAE = |Q − Qoi | (11) According to Fig. 5, the LS-SVR model yielded the best
n i=1 | i |
R, RMSE, and MAE scores in 1-day prediction. However,
where n is the number of input samples, Qoi and Qi are the
p
as the lag time of the forest increased, CCF-CRF gradu-
observed and predicted GWLs at the ith time step, respec- ally began to exhibit performance superior to that of the
tively, Q
̄ o and Q
̄ p represent the mean values of the observed RFR and LS-SVR models in terms of overall trend. Fur-
and predicted GWLs, respectively. thermore, Fig. 5 shows that RFR outperformed LS-SVR for
long-term predictions primarily because it split each node
using only the best feature from a random subset of fea-
tures. This operation can be seen as internal feature selec-
Results and discussion tion with no equivalent in the LS-SVM. Above all, it can be
concluded that the CCF-CRF model was the most efficient
Comparison of RFR, LS‑SVR, and CCF‑CRF for daily groundwater forecasting for two reasons. First, the
to determine the best model split at each node was based on completely random linear
combinations of features instead of a single one and had a
The performance of the original RFR, LS-SVR, and CCF- lower probability of choosing the same attribute for each
CRF models is compared in Fig. 5 for forecasting varia- node, which reduced similarity among the decision trees.
tions in 1-, 3-, 5-, 7-, and 10-day groundwater levels at the Second, as an alternative to the bagging technique (Brei-
Guxian site. All performance evaluation measures of the man 1996), the projection bootstrap technique was used to
CCF-CRF model, including R, RMSE, and MAE, are shown improve accuracy and diversity within decision trees. It was
in red. The same input structures and prediction horizons found that 1-day prediction of the three models obtained
as in the CCF-CRF model were introduced to the LS-SVR better performance than 3-, 5-, 7-, and 10-day predictions.
model, whereas the RFR model had three driving factors: The accuracy of the forecasted results degraded as the lead
temperature, precipitation, and previous GWLs. The number time of the forecast increased. These results are consistent
of decision trees (B) was set to 100 because the differences with those of previous studies (Chang et al. 2015).
in computation time were more obvious, but those in accu- Another important criterion for model comparison is
racy were small when the number of decision trees was more computation time. In this context, Table 1 lists the train-
ing times and complexity for parameter optimization for
Fig. 5 The performance com-

parison of the original RFR
model, LS-SVR model, and
CCF-CRF model for forecasting
the 1-, 3-, 5-, 7-, and 10-day-
ahead groundwater level varia-
tions at Guxian site
13
both regressors. The first row of Table 1 shows the optimal Comparison between horizons to determine
parameter values corresponding to each model. For RF, the optimal time lag
the number of trees (B) was set to 100 and the number of
features, randomly selected at each node (F), was set to The lag times of precipitation, temperature, and historical
three. For CCF-CRF, the number of trees (B) was set to GWL have a significant influence on the predicted GWL
100 and the number of features (F) to seven. The train- (Yoon et al. 2011). Therefore, based on the results of a com-
ing times under the optimal parameter configuration were parison among the results reported in the previous subsec-
given in M. The time required to perform the parameter tion, a series of time lags were selected to predict daily GWL
optimization is compared in N. using the CCF-CRF model. From Table 2, it is evident that
Having obtained the optimal parameters, CCF-CRF was the model with time lag P(t − 2)T(t)H(t) represents the best
found to be 17 times faster than LS-SVR. On 330 training performance at different horizons during the testing stage,
samples, parameter optimization for LS-SVR took nearly which confirms the retardation effect of GWL fluctuation
36 times longer than that for CCF-CRF. Pal (2005) reached due to rainfall at this site. This can be attributed to the con-
the same conclusions, whereby LS-SVR was known for struction of the aquifer on the site. It was also found that
having a longer training time than RF because it often a 1-day yielded obtained better performance than 3-, 5-,
requires an optimization phase that is seldom straightfor- 7-, and 10-day predictions. The accuracy of the forecasted
ward, but CCF-CRF needs only slight parameter tuning results decreased as the lead time of the forecast increased.
for a short training time for optimization. This is shown in For a long lag time (7 and 10 days), the CFRC-CCF model
Table 1, where the computation time of the RFR ensemble with a lag time of P(t − 2)T(t)H(t) performed well, which is
was small, but Fig. 5 shows that the proposed CCF-CRF consistent with the results shown in Fig. 5.
was more efficient than RFR in most cases, particularly for
long-term predictions. For CCF-CRF, there was a trade-off Predicting groundwater level
between performance and computation time.
Once the best model scenario and corresponding time lag
were determined, the optimal model was used to predict
groundwater levels for 1-, 3-, 5-, 7-, and 10-day forecasts.
Table 1 Training times in seconds for LS-SVR, RFR, and CCF-CRF Based on the best model and the time lag, Fig. 6 shows the
LS-SVR RFR CCF-CRF scatter plots of optimal models for deferent horizons in the
testing period. It is clear that 1-day GWL predictions were
Optimized parameter γ = 108 B = 100 B = 100
configuration less scattered and closer to the straight line than those the
e = 5.5 F=3 F=7
other predictions. In general, the CCF-CRF model showed
M 298 28 39
impressive performance with respect to R and RMSE for
N 1399 57 71
all horizons. It was also found that it underestimated high
γ is a penal factor, and e is the bias terms of LS-SVR GWLs and overestimated low ones in predicting extreme
Table 2 Performance of the H(t + 1) H(t + 3) H(t + 5) H(t + 7) H(t + 10)

CCF-CRF model with different
lag times during the testing R RMSE R RMSE R RMSE R MSE R MSE
periods for 1-, 3-, 5-, 7-, and
10-day-ahead groundwater level P(t − 1)T(t)H(t) 0.9410 0.1649 0.9006 0.1942 0.8650 0.2254 0.8176 0.2634 0.7690 0.2955
forecasting for Guxian well P(t − 2)T(t)H(t) 0.9581 0.1338 0.9213 0.1476 0.8704 0.2238 0.8361 0.2478 0.8223 0.2668
station P(t − 3)T(t)H(t) 0.9553 0.1400 0.9063 0.1929 0.8368 0.2449 0.8239 0.2642 0.8089 0.2855
P(t − 4)T(t)H(t) 0.9482 0.1349 0.9087 0.1876 0.8580 0.2300 0.8197 0.2575 0.7728 0.2893
P(t)T(t − 1)H(t) 0.9572 0.1375 0.9012 0.1873 0.8552 0.2319 0.8372 0.2619 0.7420 0.3023
P(t)T(t − 2)H(t) 0.9479 0.1353 0.9051 0.1493 0.8597 0.2394 0.8278 0.2651 0.7906 0.2855
P(t)T(t − 3)H(t) 0.9509 0.1349 0.9073 0.1900 0.8548 0.2341 0.8006 0.2713 0.7020 0.3200
P(t)T(t − 4)H(t) 0.9571 0.1375 0.9013 0.1766 0.8643 0.2315 0.8301 0.2621 0.6846 0.3277
P(t)T(t)H(t − 1) 0.9401 0.1480 0.9004 0.1500 0.8424 0.2419 0.8244 0.2627 0.8015 0.2827
P(t)T(t)H(t − 2) 0.9127 0.1847 0.8531 0.2330 0.8277 0.2492 0.8011 0.2802 0.7064 0.3111
P(t)T(t)H(t − 3) 0.9346 0.1700 0.9008 0.1992 0.8298 0.2468 0.8293 0.2642 0.8003 0.2943
P(t)T(t)H(t − 4) 0.9267 0.1746 0.8976 0.2205 0.8177 0.2615 0.8189 0.2567 0.7947 0.2876
P represents daily precipitation value, T represents daily minimum temperature value, and H represents
daily GWL value. The lowest RMSE and highest R are in bold font
13
13
◂Fig. 6 Comparison of observed and predicted groundwater level of References

the optimal model for 1, 3, 5, 7, and 10 days (corresponding to a–e,
respectively) ahead during the testing period
Barlow JRB, Clark BR (2011) Simulation of water-use conservation
scenarios for the Mississippi Delta using an existing regional
groundwater flow model. USGS
occurrences, as found by Wunsch et al. (2018) and Emamg- Behzad M, Asghari K, Coppola EA (2010) Comparative study of
holizadeh et al. (2014). Therefore, GWL prediction should SVMs and ANNs in aquifer water level prediction. J Comput
be divided into low and high levels in case of a rapid drop Civ Eng 24(5):408–413
or rise. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://
It was this verified that the CCF-CRF model with doi.org/10.1023/A:1010933404324
P(t − 2)T(t)H(t) lag time offers relatively good agreement Chang J, Wang G, Mao T (2015) Simulation and prediction of
between the observed values and their corresponding meas- suprapermafrost groundwater level variation in response
ured values. to climate change using a neural network model. J Hydrol
529:1211–1220
Dyer J, Mercer A, Rigby JR, Grimes A (2015) Identification of
recharge zones in the lower Mississippi River alluvial aquifer
using high-resolution precipitation estimates. J Hydrol 531(Part
Conclusions 2):360–369. https://doi.org/10.1016/j.jhydrol.2015.07.016
Emamgholizadeh S, Moslemi K, Karami G (2014) Prediction the
groundwater level of bastam plain (Iran) by artificial neural
This paper applied CCF-CRF to short-term GWL prediction network (ANN) and adaptive neuro-fuzzy inference system
at the Guxian well at the groundwater source field of the (ANFIS). Water Resour Manag 28(15):5433–5446. https://doi.
Daguhe River. To evaluate the effectiveness of CCF-CRF, its org/10.1007/s11269-014-0810-0
accuracy was compared with LS-SVR and RFR, and results Hamza M, Larocque D (2005) An empirical comparison of ensem-
ble methods based on classification trees. J Stat Comput Simul
show that it can generate a more accurate estimation of GWL 75(8):629–643. https://doi.org/10.1080/00949650410001729472
for various prediction horizons (1, 3, 5, 7, 10 days). It was He Z, Zhang Y, Guo Q, Zhao X (2014) Comparative study of artifi-
also found that the CCF-CRF model offers a better trade- cial neural networks and wavelet artificial neural networks for
off between prediction performance and computation time groundwater depth data forecasting with various curve fractal
dimensions. Water Resour Manag 28(15):5297–5317. https://doi.
than both the other two algorithms. Based on the optimal org/10.1007/s11269-014-0802-0
model, the highest prediction accuracy was achieved using Hosseini SM, Mahjouri N (2016) Integrating support vector regression
precipitation P(t − 2), temperature T(t), and groundwater and a geomorphologic artificial neural network for daily rainfall-
level H(t) as time lag. The CCF-CRF yielded excellent per- runoff modeling. Appl Soft Comput J 38:329–345. https://doi.
org/10.1016/j.asoc.2015.09.049
formance, particularly over longer prediction horizons with Hotelling H (1936) Relations between two sets of variates. Biometrika
sparser data pairs. 28:321–377
Overall, the results of the case study are favorable and Kebede H, Fisher DK, Sui R, Reddy KN (2014) Irrigation methods and
show that CCF-CRF is a promising prediction tool in scheduling in the delta region of Mississippi: current status and
strategies to improve irrigation efficiency. Am J Plant Sci 5:2917
groundwater hydrology. Moreover, this paper provided case Khaki M, Yusoff I, Islami N (2015) Simulation of groundwater
study-based illustrations that show that it is suitable for situ- level through artificial intelligence system. Environ Earth Sci
ations where only low-dimensional input data are available. 73(12):8357–8367. https://doi.org/10.1007/s12665-014-3997-8
This study used data from only one station, and more data Li B, Yang G, Wan R, Dai X, Zhang Y (2015) Comparison of random
forests and other statistical methods for the prediction of lake
from other areas can be used to verify the conclusions of water level: a case study of the Poyang Lake in China. Hydrol Res
this study. Furthermore, research is needed to improve pre- 47(S1):69–83. https://doi.org/10.2166/nh.2016.264
dictive accuracy when there is sudden change in GWL over Luo Y, Huang D, Liu P (2016) An novel random forests and its appli-
consecutive time periods. cation to the classification of mangroves remote sensing image.
Multimed Tools Appl 75:9707–9722. https://doi.org/10.1007/
s11042-015-2906-9
Acknowledgements This study was supported by the National Key Mohanty S, Jha MK, Raul SK, Panda RK, Sudheer KP (2015) Using
Research Project (2016YFC0402810), the National Natural Science artificial neural network approach for simultaneous forecasting of
Foundation of China (51409236), and Research on Intelligent Deci- weekly groundwater levels at multiple sites. Water Resour Manag
sion Support System of Tea Garden Fertigation Based on Big Data 29(15):5521–5532. https://doi.org/10.1007/s11269-015-1132-6
(ZR2017LC027) of the Shandong Natural Science Foundation. Moudani W (2013) Dynamic features selection for heart disease clas-
sification. World Acad Sci Eng Technol 7:105–110
Open Access This article is distributed under the terms of the Crea- Pal M (2005) Random forest classifier for remote sensing classification.
tive Commons Attribution 4.0 International License (http://creativeco Int J Remote Sens 26(1):217–222. https://doi.org/10.1080/01431
mmons.org/licenses/by/4.0/), which permits unrestricted use, distribu- 160412331269698
tion, and reproduction in any medium, provided you give appropriate Pelckmans K, Suykens JAK, Van Gestel T, De Brabanter J, Lukas
credit to the original author(s) and the source, provide a link to the L, Hamers B, MoorB De, Vandewalle J (2002) LS-SVMlab: a
Creative Commons license, and indicate if changes were made. MATLAB/C toolbox for least squares support vector machines.
ESAT-SISTA, K.U. Leuven, Leuven
13
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106 approximation. Neurocomputing 48(1–4):85–105. https://doi.
Quinlan JR (1993) C4.5 programs for machine learning San Francisco, org/10.1016/S0925-2312(01)00644-0
1st edn. Morgan Kaufmann Publishers Inc., California, p 303 Taormina R, Chau K, Sethi R (2012) Artificial neural network simula-
Raghavendra NS, Deka PC (2014) Support vector machine applications tion of hourly groundwater levels in a coastal aquifer system of
in the field of hydrology: a review. Appl Soft Comput 19:372–386. the Venice lagoon. Eng Appl Artif Intell 25(8):1670–1676. https
https://doi.org/10.1016/j.asoc.2014.02.002 ://doi.org/10.1016/j.engappai.2012.02.009
Rahmati O, Reza H, Melesse AM (2016) Application of GIS-based Vapnik VN (1995) The nature of statistical learning theory. Springer,
data driven random forest and maximum entropy models for New York
groundwater potential mapping: a case study at Mehran Region, Wang B, Oldham C, Hipsey MR (2016) Comparison of machine learn-
Iran. Catena 137:360–372. https : //doi.org/10.1016/j.caten ing techniques and variables for groundwater dissolved organic
a.2015.10.010 nitrogen prediction in an urban area. Proc Eng 154:1176–1184.
Rodriguez-galiano VF, Ghimire B, Rogan J, Chica-olmo M, Rigol- https://doi.org/10.1016/j.proeng.2016.07.527
sanchez JP (2012) An assessment of the effectiveness of a ran- Wunsch A, Liesch T, Broda S (2018) Forecasting groundwater levels
dom forest classifier for land-cover classification. ISPRS J Pho- using nonlinear autoregressive networks with exogenous input
togramm Remote Sens 67:93–104. https://doi.org/10.1016/j.isprs (NARX). J Hydrol 2(2):1–15. https://doi.org/10.1016/j.jhydr
jprs.2011.11.002 ol.2018.01.045
Rodriguez-Galiano V, Mendes MP, Garcia-Soldado MJ, Chica-Olmo Yoon H, Jun S-C, Hyun Y, Bae G-O, Lee K-K (2011) A comparative
M, Ribeiro L (2014) Predictive modeling of groundwater nitrate study of artificial neural networks and support vector machines
pollution using random forest and multisource variables related to for predicting groundwater levels in a coastal aquifer. J Hydrol
intrinsic and specific vulnerability: a case study in an agricultural 396(1–2):128–138. https://doi.org/10.1016/j.jhydrol.2010.11.002
setting (Southern Spain). Sci Total Environ 476–477:189–206. Yoon H, Hyun Y, Ha K, Lee K-K, Kim G-B (2016) A method to
https://doi.org/10.1016/j.scitotenv.2014.01.001 improve the stability and accuracy of ANN- and SVM-based
Rodriguez-Galiano V, Sanchez-Castillo M, Chica-Olmo M, Chica- time series models for long-term groundwater level predictions.
Rivas M (2015) Machine learning predictive models for mineral Comput Geosci 90:144–155. https://doi.org/10.1016/j.cageo
prospectivity: an evaluation of neural networks, random forest, .2016.03.002
regression trees and support vector machines. Ore Geol Rev Zhang L, Suganthan PN (2015) Oblique decision tree ensemble via
71:804–818. https://doi.org/10.1016/j.oregeorev.2015.01.001 multi-surface proximal support vector machine. IEEE Trans
Smola AJ, Sch B, Schölkopf B (2004) A tutorial on support vector Cybern 45(10):2165–2176
regression. Stat Comput 14(3):199–222. https://doi.org/10.1023/
B:STCO.0000035301.49549.88 Publisher’s Note Springer Nature remains neutral with regard to
Suykens JAK, Vandewalle J (2000) Recurrent least squares support jurisdictional claims in published maps and institutional affiliations.
vector machines. IEEE Trans Circuits Syst I 47(7):1109–1114
Suykens JAK, De Brabanter J, Lukas L, Vandewalle J (2002) Weighted
least squares support vector machines: robustness and sparse
13

Short Term Prediction of Groundwater Level Using Improved Random Forest Regression With A Combination of Random Features

Uploaded by

Copyright:

Available Formats

Short Term Prediction of Groundwater Level Using Improved Random Forest Regression With A Combination of Random Features

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Short Term Prediction of Groundwater Level Using Improved Random Forest Regression With A Combination of Random Features

Uploaded by

Copyright:

Available Formats

Applied Water Science (2018) 8:125