Sentinel2 Paper
Hongwei Guo, Jinhui Jeanne Huang, Bowen Chen, Xiaolong Guo & Vijay P. Singh
To cite this article: Hongwei Guo, Jinhui Jeanne Huang, Bowen Chen, Xiaolong Guo & Vijay P.
Singh (2021) A machine learning-based strategy for estimating non-optically active water quality
parameters using Sentinel-2 imagery, International Journal of Remote Sensing, 42:5, 1841-1866,
DOI: 10.1080/01431161.2020.1846222
CONTACT Jinhui Jeanne Huang huangj@nankai.edu.cn B406, College of Environmental Science and
Engineering, 38 Tongyan Rd., Haihe Education Park, Jinnan District, Tianjin, P.R.China, 300350
Supplemental data for this article can be accessed here.
© 2020 Informa UK Limited, trading as Taylor & Francis Group
1842 H. GUO ET AL.
1. Introduction
In urban areas, waterbodies, such as lakes and reservoirs, may be polluted by illegal
discharges of industrial effluent and domestic sewage (Shao et al. 2006). Deterioration of
water quality may increase human exposure to diseases and harmful chemicals; reduce
ecosystem productivity and biodiversity; and damage aquaculture, agriculture and other
water-related industries (Hoekstra, Buurman, and Van Ginkel 2018; Brönmark and
Hansson 2002). Traditional water-quality monitoring methods are primarily based on
water sample collection and testing or automatic in-situ measurements. These methods
are either labour intensive or costly. In addition, most laboratory tests consume
reagents, and treating the waste generated by testing adds further cost.
Although these methods may have high accuracy, individual samples only reflect the
water quality at specific sampling points and are limited in characterizing water quality for
the entire water surface (Shuchman et al. 2013; Ritchie, Zimba, and Everitt 2003; O’Reilly
et al. 1998; Olmanson, Brezonik, and Bauer 2013). In many cases, the decision makers
would need a full picture of water characteristics over the entire water surface for water-
quality management. Remote sensing has been used to monitor water quality since the
1970s (Vignolo, Pochettino, and Cicerone 2006; Holyer 1978; Ritchie, Schiebe, and
McHenry 1976). Compared with traditional methods, remote sensing can provide the
full coverage required for dynamic water-quality monitoring (Duan, Ma, and Hu 2012).
Over the past several decades, scholars have carried out extensive research on
water-quality monitoring by remote sensing and have achieved good results in estimating
optically active parameters, such as Chlorophyll-a (Chl-a), suspended particulate matter
(SPM), coloured dissolved organic matter (CDOM), turbidity and transparency (Brezonik
et al. 2015; Hou et al. 2017; Shi et al. 2015; Bugnot et al. 2018; Shahzad et al. 2018; Doña et al.
2015). However, estimating non-optically active parameters, such as total phosphorus (TP),
total nitrogen (TN), and chemical oxygen demand (COD) directly from spectral characteristics
is difficult, because they are less likely to impact the optical characteristics measured by
satellite sensors (Deng, Zhang, and Cen 2019; Gholizadeh and Melesse 2017; Mathew,
Srinivasa Rao, and Mandla 2017; Xiong et al. 2020; Ferdous, Tauhid, and Rahman 2020;
Chang, Bai, and Chen 2017; Gao et al. 2015). Generally, non-optically active parameters
have been estimated indirectly based on the correlation between optically active parameters
and non-optically active parameters (Carlson 1977; Wu et al. 2010; Mathew, Srinivasa Rao, and
Mandla 2017). For instance, Chang, Xuan, and Yang (2013) estimated TP in Tampa Bay (USA)
with the Moderate-resolution Imaging Spectroradiometer (MODIS) images and genetic
programming models. The results indicated that Band 1, Band 3 and Band 4 of MODIS
images were most influential for the determination of TP concentrations. Li et al. (2017a)
developed empirical models to estimate TP and TN in the Xin’anjiang Reservoir (China) using
Land Remote-Sensing Satellite System (Landsat) 8 Operational Land Imager (OLI) images.
The Landsat 8 OLI-derived factors (Band 1 + Band 3 + Band 4)/Band 2 and Band 4/(Band
2 + Band 5) showed a strong correlation with TP and TN concentrations, respectively. Wang
et al. (2004) estimated COD in the reservoirs of Shenzhen (China) using Landsat Thematic
Mapper (TM) images. The results indicated that TM Band 1 to Band 4 were highly correlated
with organic pollution measurements (e.g. COD). Although previous studies on
non-optically active parameters are fairly limited, they demonstrate that
non-optically active parameters can be retrieved from optical characteristics.
INTERNATIONAL JOURNAL OF REMOTE SENSING 1843
The most widely used remote-sensing imagery in existing research comes from
Landsat TM, Enhanced Thematic Mapper Plus (ETM+) and OLI, Sea-Viewing Wide Field-of-View
Sensor (SeaWiFS), MODIS, and Medium Resolution Imaging Spectrometer (MERIS)
(Moses et al. 2009; Halme, Pellikka, and Mõttus 2019; Kishino, Tanaka, and Ishizaka 2005;
Shenglei et al. 2016; Keith et al. 2018). However, the temporal resolution of TM, ETM+ and
OLI data is 16 days, and the spatial resolution of SeaWiFS, MODIS and MERIS data is
250 × 250 m or coarser, which makes high-frequency characterization of the water quality
of small waterbodies challenging. Hyperspectral imagery contains
a large amount of continuous spectral information and provides more spectral characteristics
for water-quality retrieval (Brando and Dekker 2003; Li et al. 2017a; Gitelson et al.
2011). However, spaceborne hyperspectral data (e.g. data of Hyperion and Compact High
Resolution Imaging Spectrometer (CHRIS)) are at present experimental rather than
operational, and airborne hyperspectral data (e.g. data of Airborne Visible Infrared
Imaging Spectrometer (AVIRIS), Compact Airborne Spectrographic Imager (CASI), and
Contact Image Sensors (CIS)) have very limited spatial coverage and high cost (Lunetta
et al. 2009; Halme, Pellikka, and Mõttus 2019). By comparison, the recently launched
Sentinel-2 produces imagery with a spatial resolution of 10 × 10 m and a temporal
resolution of 5 days. It provides an opportunity to conduct high-frequency
water-quality monitoring for small waterbodies.
In summary, studies on non-optically active parameters have been fairly limited, and
most remote-sensing imagery has too coarse a spatial or temporal resolution
for high-frequency water-quality monitoring of small waterbodies. Therefore, this
study aims to retrieve non-optically active parameters for small urban waterbodies
using Sentinel-2 imagery. TP, TN and COD were selected as the target parameters,
since they may help in identifying the sources of illegal discharges from industrial
effluent or domestic sewage. A total of 255 possible band compositions of eight
Sentinel-2 imagery bands were compared to identify the most appropriate ones for
retrieving each water-quality parameter. Three machine-learning models, namely
Random Forest (RF), Support Vector Regression (SVR) and Neural Networks (NN),
were introduced in the empirical methods (Wang et al. 2018; Li et al. 2017a; Le et al.
2011), and compared to seek the most robust ones for retrieving the non-optically
active parameters. This study may help urban water management by providing a more
practical and efficient water-quality monitoring strategy of non-optically active
parameters.
Figure 1. Locations of the City of Tianjin (a), the study area (b) and the sampling points (green pin
labels) (c).
samples were quickly put into amber glass bottles to shield them from sunlight, and sent to the
laboratory for testing. The testing method for each water-quality parameter is listed in
Table 1. The field survey data from 20 May 2019 and 9 June 2019 (N = 40) constitute the
ground truth data set for model calibration and validation. The field survey data from
16 November 2018 (N = 20) were used as a repeat experiment to validate the
robustness of the developed models. The measurements in Lake Simcoe (N = 33) were
used as an independent data set to validate the model generalization.
The spatial distributions of the measured water-quality parameters on the three sampling
dates were visualized by ArcGIS 10.4 (Environmental Systems Research Institute, Inc.,
Redlands, California, USA). On 16 November 2018, the averages of TP, TN, and COD were
0.62 mg l−1, 1.37 mg l−1, and 31.70 mg l−1, respectively. On 20 May 2019, the averages of TP,
TN, and COD were 0.29 mg l−1, 0.66 mg l−1, and 50.45 mg l−1, respectively. On 9 June 2019, the
averages of TP and TN increased to 0.85 mg l−1 and 1.63 mg l−1, respectively. The average of
COD decreased to 17.45 mg l−1 (Figure 2).
Figure 2. Spatial distributions of the measured water-quality parameters. (a–c) represent TP, TN, COD,
respectively on 16 November 2018; (d–f) represent TP, TN, COD, respectively on 20 May 2019; (g–i) represent
TP, TN, COD, respectively on 9 June 2019. The bigger the dot size, the more serious the water pollution.
$$\mathrm{NDWI} = \frac{X_{\mathrm{Green}} - X_{\mathrm{NIR}}}{X_{\mathrm{Green}} + X_{\mathrm{NIR}}} \tag{1}$$
where $X_{\mathrm{Green}}$ and $X_{\mathrm{NIR}}$ are the pixel values of the green band and the NIR band,
respectively. For Sentinel-2 imagery, the green band and the NIR band are B3 and B8,
respectively.
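Equation (1) can be evaluated pixel by pixel, as in the following minimal Python sketch. The function name `ndwi`, the sample reflectance pairs and the rule that NDWI > 0 marks open water are illustrative assumptions and are not taken from the study.

```python
def ndwi(green, nir):
    """Normalized difference water index for one pixel (Equation 1)."""
    denom = green + nir
    if denom == 0:
        return 0.0  # assumption: treat an all-zero pixel as neutral
    return (green - nir) / denom

# Hypothetical (B3, B8) reflectance pairs: a water-like and a vegetation-like pixel
pixels = [(0.08, 0.02), (0.05, 0.30)]
values = [ndwi(green, nir) for green, nir in pixels]

# Assumed decision rule: positive NDWI flags open water
is_water = [value > 0 for value in values]
```

The threshold would normally be checked against the imagery of the specific waterbody before being used to mask pixels.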
Figure 3. BOA reflectance of the sampling points on 20 May 2019 (a) and 9 June 2019 (b). Different
colours represent different sampling points.
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{N} (\bar{y} - y_i)^2} \tag{2}$$
$$\mathrm{MAPE}(\%) = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{3}$$
$$\mathrm{RMSPE}(\%) = 100 \times \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{\hat{y}_i - y_i}{y_i} \right)^2} \tag{4}$$
where $y_i$, $\bar{y}$, and $\hat{y}_i$ are the measured, the mean, and the estimated water-quality
parameters, respectively; $N$ is the number of sampling points.
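The three metrics in Equations (2) to (4) can be written directly in Python. This is a minimal sketch using only the standard library; the function names and the sample values are hypothetical and serve only to show the arithmetic.

```python
import math

def r2(measured, estimated):
    """Coefficient of determination, Equation (2)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((e - m) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((mean - m) ** 2 for m in measured)
    return 1 - ss_res / ss_tot

def mape(measured, estimated):
    """Mean absolute percentage error in percent, Equation (3)."""
    n = len(measured)
    return 100 / n * sum(abs((e - m) / m) for m, e in zip(measured, estimated))

def rmspe(measured, estimated):
    """Root mean square percentage error in percent, Equation (4)."""
    n = len(measured)
    squared = sum(((e - m) / m) ** 2 for m, e in zip(measured, estimated))
    return 100 * math.sqrt(squared / n)

# Hypothetical concentrations in mg/l (not data from the study)
measured = [0.2, 0.4, 0.8]
estimated = [0.25, 0.4, 0.6]
scores = (r2(measured, estimated), mape(measured, estimated), rmspe(measured, estimated))
```

Because a root mean square is never smaller than the corresponding mean of absolute values, RMSPE is always at least as large as MAPE on the same data, which is a quick consistency check on reported values.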
Learning curves were used to tune model parameters. A learning curve plots model
performance on the test set on the y-axis against different values of a model parameter on the
x-axis. The optimal value of each model parameter was determined from its learning curve.
To prevent overfitting and improve model generalization, a 10-fold cross
validation was used to calculate the evaluation metrics of the model performances
(Rapinel et al. 2019). For each cross validation, the whole data set was randomly split into
a 70% training set (N = 28) and a 30% test set (N = 12). A model was fitted on the training set
and its performance was evaluated on the test set. The final R2, MAPE and RMSPE were
the averages over the 10 cross validations. The principles of the three machine-learning
models are described in the following subsections.
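The validation procedure described above (ten random 70%/30% splits with averaged metrics) can be sketched as follows. This is one reading of the text, not the authors' code: the helper `repeated_holdout`, the synthetic concentrations and the mean-value stand-in for a model are all assumptions made for illustration.

```python
import random

def repeated_holdout(n_samples, n_repeats=10, test_fraction=0.3, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[n_test:], order[:n_test]

def mape(measured, estimated):
    """Mean absolute percentage error in percent."""
    return 100 / len(measured) * sum(
        abs((e - m) / m) for m, e in zip(measured, estimated))

# Synthetic placeholder concentrations (N = 40), not data from the study
y = [0.2 + 0.02 * i for i in range(40)]

splits = list(repeated_holdout(len(y)))
scores = []
for train, test in splits:
    # Stand-in "model": predict the training mean for every test sample
    baseline = sum(y[i] for i in train) / len(train)
    scores.append(mape([y[i] for i in test], [baseline] * len(test)))

mean_mape = sum(scores) / len(scores)  # metric averaged over the 10 splits
```

With 40 samples this gives 28 training and 12 test samples per split, matching the numbers in the text. In scikit-learn the same splitting scheme is available as `ShuffleSplit(n_splits=10, test_size=0.3)`; a strict 10-fold partition of 40 samples would instead hold out 4 samples per fold.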
where c1 and c2 are the mean sample output values of D1 and D2, respectively; yi is the
measured value.
Then, at the two nodes after splitting, the splitting continues according to the above
principle. The final prediction result is the mean of the predictions of all decision trees.
In this study, RF was implemented with scikit-learn 0.21.3 in Python 3.7. Using learning
curves, the main parameters were set as listed in Table 2.
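The splitting rule and the averaging step can be illustrated with a short sketch. The splitting equation itself is not reproduced in this excerpt, so the code assumes the standard regression-tree criterion: choose the threshold that minimizes the summed squared deviation of the measured values from the means c1 and c2 of the two subsets D1 and D2. The function names and sample values are hypothetical, and this is not the scikit-learn implementation.

```python
def best_split(x, y):
    """Return (threshold, sse) of the best squared-error split on one feature."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k][0] == pairs[k - 1][0]:
            continue  # no threshold separates equal feature values
        d1 = [target for _, target in pairs[:k]]
        d2 = [target for _, target in pairs[k:]]
        c1 = sum(d1) / len(d1)  # mean output of D1
        c2 = sum(d2) / len(d2)  # mean output of D2
        sse = sum((t - c1) ** 2 for t in d1) + sum((t - c2) ** 2 for t in d2)
        if sse < best_sse:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_sse = sse
    return best_threshold, best_sse

def forest_predict(tree_predictions):
    """Random forest output: the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical reflectance values and TP concentrations (mg/l)
reflectance = [0.02, 0.03, 0.08, 0.09]
tp = [0.3, 0.3, 0.9, 0.9]
threshold, sse = best_split(reflectance, tp)
```

A full tree applies `best_split` recursively over all input bands, and a forest repeats this on bootstrap samples before averaging.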
$$\min \left( \frac{\lVert w \rVert^2}{2} + C \sum_{i=1}^{N} L_{\varepsilon}(x_i, y_i, f) \right) \tag{7}$$
where $L_{\varepsilon}$ is the loss function; $C$ is a pre-set penalty coefficient, which is used to punish
errors greater than $\varepsilon$; $\varepsilon$ is the deviation between the estimated and the measured values;
$N$ is the number of sampling points. Introducing slack variables $\xi_i$ and $\xi_i^*$, the solution of
Equation (7) can be transformed to Equation (9):
$$\min \left( \frac{\lVert w \rVert^2}{2} + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) \right) \tag{9}$$
The Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ are introduced to establish the Lagrange function and
then solve the dual problem of the original problem. Finally, the regression function of the
optimal hyperplane is obtained:
$$f(x) = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) K(x_i, x) + b \tag{11}$$
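Once the dual problem is solved, Equation (11) is simple to evaluate. The sketch below shows the ε-insensitive loss and the prediction function in plain Python. The radial basis function kernel, the value of `gamma`, and the support vectors and coefficients are assumptions chosen for illustration; the kernel used in the study is not stated in this excerpt.

```python
import math

def eps_insensitive_loss(y, f, eps):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y - f) - eps)

def rbf_kernel(u, v, gamma=1.0):
    """Assumed kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svr_predict(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + b

# Hypothetical fitted quantities: each dual_coefs entry is alpha_i - alpha_i*
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=0.1)
```

Only samples with a non-zero difference between the two multipliers contribute to the sum, which is why the fitted model can be stored as a list of support vectors.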
(1) Initialization. Initialize $w_{i,j}$ and $w_{j,k}$. Here $w_{i,j}$ is the connection weight between the $i$th
neuron in the input layer and the $j$th neuron in the hidden layer, and $w_{j,k}$ is the
connection weight between the $j$th neuron in the hidden layer and the $k$th neuron
in the output layer. Meanwhile, the threshold $a$ of the hidden layer, the threshold $b$ of
the output layer, the activation functions, and the learning rate $\eta$ are preset.
(2) Hidden layer output calculation. The output $H$ of the hidden layer is calculated
according to the input variable $x$, $w_{i,j}$ and $a$:
$$H_j = f\left( \sum_{i=1}^{N} w_{i,j} x_i - a_j \right), \quad j = 1, 2, \ldots, L \tag{14}$$
(3) Output layer output calculation. The predicted output $O$ is calculated according to $H$,
$w_{j,k}$, and $b$:
$$O_k = \sum_{j=1}^{L} H_j w_{j,k} - b_k, \quad k = 1, 2, \ldots, M \tag{15}$$
(4) Error calculation. The prediction error $e$ is calculated according to $O$ and the
expected output $y$. If the error meets the requirement, training is complete;
otherwise, proceed to step 5.
(5) Updating weights and thresholds. Return to step 2 after updating $w_{i,j}$, $w_{j,k}$, $a$, and
$b$ according to $e$.
The common activation functions in NN include the ReLU, sigmoid and tanh functions.
Among them, the ReLU function is most commonly used (Kurt et al. 2008).
ReLU is a piecewise linear function that mitigates the vanishing-gradient
problem of the sigmoid and tanh functions. The ReLU function is as follows:
$$g(z) = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases} \tag{16}$$
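The forward part of the training steps (Equations 14 to 16) can be traced with a small sketch. It uses one hidden layer with ReLU as the activation, with `w_ih[i][j]` connecting input neuron i to hidden neuron j and `w_ho[j][k]` connecting hidden neuron j to output neuron k, following the index order of Equations (14) and (15). All weights, thresholds and inputs are made-up numbers for illustration only.

```python
def relu(z):
    """ReLU activation, Equation (16)."""
    return z if z > 0 else 0.0

def forward(x, w_ih, a, w_ho, b):
    """Return (hidden, output) following Equations (14) and (15)."""
    hidden = [
        relu(sum(w_ih[i][j] * x[i] for i in range(len(x))) - a[j])
        for j in range(len(a))
    ]
    output = [
        sum(hidden[j] * w_ho[j][k] for j in range(len(hidden))) - b[k]
        for k in range(len(b))
    ]
    return hidden, output

# Hypothetical two-input, two-hidden-neuron, one-output network
x = [1.0, 2.0]
w_ih = [[0.5, -1.0], [0.25, 0.5]]  # rows: input neurons, columns: hidden neurons
a = [0.0, 0.5]                     # hidden-layer thresholds
w_ho = [[2.0], [3.0]]              # rows: hidden neurons, columns: output neurons
b = [0.5]                          # output-layer thresholds

hidden, output = forward(x, w_ih, a, w_ho, b)
expected = [2.0]
error = [y_k - o_k for y_k, o_k in zip(expected, output)]  # prediction error e
```

The second hidden neuron receives a negative pre-activation and is therefore set to zero by ReLU, so only the first hidden neuron contributes to the output.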
In this study, NN was implemented in Keras 2.2.4 of Python 3.7. The number of neurons in
the input layer was set to six, for there were six input imagery bands. By analysing the
model performances and calculation efficiency, the number of the hidden layers was set
to four. For TP and TN, the number of neurons in each hidden layer was set to 300. For
COD, the number of neurons in each hidden layer was set to 200.
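To give a sense of the size of these networks, the number of trainable weights and thresholds can be counted from the layer widths. The sketch assumes fully connected layers and a single output neuron (one water-quality parameter per model); the output width is an assumption, since it is not stated above.

```python
def dense_param_count(layer_sizes):
    """Weights plus thresholds of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Six input bands, four hidden layers, one assumed output neuron
tp_tn_layers = [6, 300, 300, 300, 300, 1]
cod_layers = [6, 200, 200, 200, 200, 1]

tp_tn_params = dense_param_count(tp_tn_layers)
cod_params = dense_param_count(cod_layers)
```

Under these assumptions the TP and TN networks hold 273,301 parameters and the COD network 122,201, both far more than the 28 training samples per split, which is why the comparison of training and test errors matters for these models.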
3. Results
3.1. Sentinel-2 imagery band selection
The average spectral shapes across the TP, TN and COD concentrations were shown in
Figure 4. The average BOA reflectance on the two sampling dates showed significant
differences. The BOA reflectance increased with the increase of the TP, TN and COD
concentrations. On each sampling date, BOA reflectance fluctuated significantly with the
changes of TP, TN and COD concentrations. The results indicated the possibility to
estimate TP, TN and COD from spectral characteristics.
Figure 4. Average spectral shapes across the concentrations of TP (a), TN (b), and COD (c). The BOA
reflectance was the average of the eight selected imagery bands in this study. When plotting, different
BOA reflectance corresponding to the same concentration was averaged.
Then, the standard deviations of the BOA reflectance in each band across the TP, TN
and COD concentrations were calculated and compared (Figure 5). For TP and TN, the
BOA reflectance of B3, B4, B5, B6, B7 and B8 fluctuated more obviously across the
concentration ranges. For COD, the BOA reflectance of B2, B3, B5, B6, B7 and B8 fluctuated more
obviously across the concentration ranges. The results suggested that the band
compositions of the above-mentioned bands could be used for the TP, TN and COD retrieval.
In order to further validate whether the above-mentioned band compositions were
most appropriate, this study used all possible band compositions (a total of 255) to
retrieve each water-quality parameter by multiple linear regression. R2 was selected to
evaluate the model performances (Figure 6).
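The figure of 255 follows from counting every non-empty subset of eight bands (2^8 − 1). A minimal sketch of the enumeration is given below. The band labels, in particular `B8A` as the eighth band, are an assumption made for illustration, and the regression fit that would score each composition by R2 is omitted.

```python
from itertools import combinations

# Assumed labels for the eight candidate Sentinel-2 bands
bands = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A"]

# Every non-empty subset of the candidate bands
compositions = [
    combo
    for size in range(1, len(bands) + 1)
    for combo in combinations(bands, size)
]

# Number of compositions for each band count
count_by_size = {
    size: sum(1 for combo in compositions if len(combo) == size)
    for size in range(1, len(bands) + 1)
}
```

In a full search, each tuple in `compositions` would select the input columns of a multiple linear regression, and the R2 values would then be averaged per band count.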
According to Figure 6, as the number of bands increased, the average R2 increased and
reached its maximum at six bands. The average R2 of the seven-band
compositions was greater than that of the eight-band composition. The most influential
bands were B3, B4 and B5, namely, the green and red bands (B5 is the vegetation red
edge band with a central wavelength of 705 nm). The most appropriate band compositions for TP, TN,
and COD retrieval were ‘B3 + B4 + B5 + B6 + B7 + B8’, ‘B3 + B4 + B5 + B6 + B7 + B8’ and
‘B2 + B3 + B5 + B6 + B7 + B8’, respectively.
Figure 5. Standard deviations of the BOA reflectance in each band across the TP, TN and COD
concentrations. The black squares, red dots and blue diamonds represent TP, TN and COD, respectively.
Figure 6. R2 of TP (a), TN (b) and COD (c) retrieval using different band compositions by multiple linear
regression.
The optimal models of TP, TN and COD retrieval were different. For TP, NN performed
best: R2 reached 0.94, and the MAPE and RMSPE were 12.43% and 16.80%,
respectively. For TN, RF performed best: R2 reached 0.88, and the MAPE and
RMSPE were 18.39% and 29.64%, respectively. For COD, SVR performed
best: R2 reached 0.86, and the MAPE and RMSPE were 12.55% and 18.75%, respectively.
Taking the performances of multiple linear regression as a comparison, machine learning
significantly improved the retrieval accuracy of each water-quality parameter (Table 4).
Furthermore, using the Kriging spatial interpolation (Carletti, Picci, and Romano 2000;
Beaulant et al. 2008; Wang et al. 2019) of ArcGIS 10.4, each water-quality parameter was
interpolated with the ground truth values. Figure 8 showed the comparisons of water-
quality parameters estimated by the spatial interpolations and machine-learning models.
Figure 7. The model performances of TP, TN, and COD retrieval. (a–c) represent the accuracy of RF; (d–
f) represent the accuracy of SVR; and (g–i) represent the accuracy of NN. The red dots and blue
triangles represent the sampling points on 20 May 2019 and 9 June 2019, respectively.
Table 4. Comparison of accuracy between machine learning and multiple linear regression.
            Machine learning                   Multiple linear regression
Parameter   R2     MAPE (%)   RMSPE (%)        R2     MAPE (%)   RMSPE (%)
TP          0.94   12.43      16.80            0.65   30.65      39.83
TN          0.88   18.39      29.64            0.76   22.14      36.24
COD         0.86   12.55      18.75            0.81   39.18      71.65
23.41%, 17.83% and 6.80% using the machine-learning model (RF) than the spatial
interpolation. For the COD estimation, the R2, RMSPE and MAPE were higher by 13.68%,
40.67% and 30.61% using the machine-learning model (SVR) than the spatial
interpolation. The results proved that compared to the spatial interpolation, machine learning
could recreate the dynamic ranges of the measured water-quality parameters in more
detail, and significantly improve the estimation accuracy.
machine-learning models to estimate TP, TN and COD. Then the estimated results were
output into the imagery to generate the water-quality distributions (Figure 9).
The spatial distributions of TP and TN tended to be consistent, which might be due to the
fact that TP and TN were from the same pollution sources. For example, domestic sewage
from residential areas was discharged into the lake. TP and TN in the south and west of the
lake were higher than those in the north and east of the lake. On 20 May 2019, the high
values of TP and TN were distributed in the southwest of the lake, while on 9 June 2019, the
area expanded from the southwest of the lake to the north of the lake. Accordingly, the
averages of TP and TN increased from 0.29 mg l−1 and 0.66 mg l−1 on 20 May 2019 to
0.85 mg l−1 and 1.63 mg l−1 on 9 June 2019, respectively. COD in the east of the lake was
higher than that in the west of the lake. On 20 May 2019, the high values of COD were
distributed in most areas except the southwest of the lake. By 9 June 2019, the area of high
values contracted eastward to the centre and east of the lake. Meanwhile, the average of
COD decreased from 50.45 mg l−1 on 20 May 2019 to 17.45 mg l−1 on 9 June 2019.
According to the mapping of TP and TN, domestic sewage containing N and P might be
continuously discharged into the lake. It was observed from the remote-sensing imagery that
there was a piece of farmland with an area about 1.25 km2 as well as a subdivision of
a residential area adjacent to the lake on the southwest. Therefore, the increases of TP and
TN in the south and west of the lake were likely related to the application of chemical fertilizer
and the discharges of domestic sewage. The high values of COD were widely distributed in
the centre and east of the lake. From 20 May 2019 to 9 June 2019, there was an obvious
contraction process to the east of the lake. According to this change, industrial effluent or
domestic sewage might be discharged into the east of the lake, which may originate from the
nearby residential areas and a pharmaceutical factory in the east of the lake.
Figure 9. Water-quality distributions on the two sampling dates. (a–c) represent TP, TN, COD,
respectively on 20 May 2019; (d–f) represent TP, TN, COD, respectively on 9 June 2019.
From the perspective of optical characteristics, the spectrum of lake water is mainly
affected by three optically active components: SPM, phytoplankton and CDOM (Xiong
et al. 2020; Wang et al. 2020). Many previous studies confirmed that different
components have different absorption characteristics. For instance, phytoplankton has obvious
absorption peaks in the blue band (430–500 nm) and the red band (650–750 nm) (Ma
et al. 2006; Pahlevan et al. 2020). CDOM absorbs strongly in the ultraviolet band
(280–400 nm), and the absorption decreases exponentially from the ultraviolet to
visible wavelengths (Mannino et al. 2014; Brezonik et al. 2015). In the west of the lake,
algae and aquatic plants grew continuously due to the high TP and TN, and the optical
characteristics of the lake water were dominated by phytoplankton. In the east of
the lake, by contrast, industrial effluent or domestic sewage containing plenty of organic
matter entered the lake, so the optical characteristics of the lake water were dominated
by CDOM and SPM. Therefore, the spatial distribution of water quality estimated from
spectral characteristics showed an obvious difference between TP and TN
concentrations on the one hand and COD on the other. This result was consistent with the
above analysis on the source tracing of pollutants in the lake.
It also could be observed that the water quality was not evenly distributed, although
the waterbody was fairly small (the surface area was only 0.60 km2). The study of water-
quality retrieval based on high spatial resolution remote-sensing imagery was therefore
crucial for many water management issues, e.g. identifying illegal discharges to urban
waterbodies and spills on the shore etc.
4. Discussion
4.1. Model robustness, generalization and limitations
The novelty of the approach proposed in this study is to determine the optimal band
composition for TP, TN and COD retrieval by analysing the correlation between 255 band
compositions and each water-quality parameter. Moreover, three machine-learning models,
i.e. RF, SVR and NN, were constructed for each water-quality parameter to seek the most
appropriate one. During model training, learning curves were used to tune each
model parameter to ensure optimal model performance. The evaluation metrics of the
model performance were the averages of a 10-fold cross validation. In each cross validation,
the test set is new to the model, which can improve the model robustness and
generalization to a certain extent. Figure 10 shows the model accuracy on the training set and test
set. There was no significant difference between the estimation errors of the two data sets.
Furthermore, we compared the satellite-derived results with the field survey data from
16 November 2018 (Figure 11). The R2 of TP, TN and COD all decreased, but remained above 0.60.
For TP and TN, the MAPE and RMSPE decreased, mainly because the concentration ranges
narrowed on 16 November 2018. The MAPE and RMSPE of COD increased to 30.49% and
37.41%, respectively. The results support the model robustness and generalization in the
local area.
Figure 10. Comparisons of the model accuracy on the training set and test set. (a–c) represent TP, TN,
COD, respectively. The black and red dots represent the estimated values from the training set and test
set, respectively. The grey squares represent the measured values.
To validate whether the developed models work well in other areas, we compared the
satellite-derived results with the field measurements of Lake Simcoe in 2018. Since COD is
not a regularly monitored parameter, no matches between satellite-derived results and field
measurements were generated. For TP and TN, 33 samples each were matched (Figure 12).
According to Figure 12, the TP model failed completely due to the huge gap in the
concentration ranges. The R2, MAPE and RMSPE of TN were 0.53, 31.69% and 59.32%,
respectively, so the model performance also decreased significantly. The results were
consistent with the research work of Cao et al. (2020). In addition, since the optical
characteristics of different waterbodies differ, the band composition is also
one of the factors that affect the model performance. Therefore, the developed models
are capable of providing reliable results in local areas, but have limitations when applied
to other areas. Band selection and parameter tuning with new data are necessary for
different areas.
Figure 11. The model performances of estimating TP, TN, and COD on 16 November 2018. (a–c)
represent TP, TN, COD, respectively.
Figure 12. The model performances of estimating TP and TN of Lake Simcoe in 2018. (a) and (b)
represent TP and TN, respectively.
Figure 13. Comparisons of the estimated water-quality parameters and the environmental quality
standards for surface water (MEE 2002). (a–c) represent TP, TN, COD, respectively. The red dashed lines
represent different water-quality classifications (noted with Roman numerals).
21.33% and 37.25% of the lake surface fell within Class II, Class III, Class IV and Class V,
respectively. On 9 June 2019, the water quality deteriorated. The Class III, Class IV, and
Class V area expanded to 28.63%, 25.65% and 74.35%, respectively. The Class II area
almost disappeared. COD of the lake surface on both 20 May 2019 and 9 June 2019
covered Class I to V. In part of the lake surface, COD was worse than Class V. The
average COD on both 20 May 2019 and 9 June 2019 fell within Class V. On
20 May 2019, 41.21% of the lake surface was worse than Class V, and 4.29%, 9.10%, 19.09%
and 26.30% of the lake surface fell within Class I to II, Class III, Class IV and Class V,
respectively. On 9 June 2019, the area with COD worse than Class V decreased to 34.49%.
The area of Class V and Class I to II decreased to 22.43% and 3.80%, respectively. The area of
Class III and Class IV increased to 11.44% and 27.84%, respectively.
These results indicated the potential feasibility of water-quality classification by remote
sensing. Visualization of water-quality classification can be used for integrating water-quality
online monitoring and early warning platforms. This helps the water management grasp the
water quality in real time and make reasonable decisions. In terms of retrieval models, using
machine-learning regression models, water-quality parameters can be estimated in specific
values. Then by comparing the estimated results with the water-quality evaluation standards,
the spatial distribution of water-quality classification can be acquired. In the future research,
machine-learning classification models (e.g. Convolutional Neural Networks (CNN), Support
Vector Machine (SVM) and eXtreme Gradient Boosting (XGBoost) etc.) can also be considered
to directly classify water quality by remote sensing (Mountrakis, Im, and Ogole 2011; Maxwell,
Warner, and Fang 2018). In this way, the model complexity will be reduced by a certain extent,
and consequently, the cost of model development will be reduced.
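The regression-then-compare route described above can be sketched for COD as follows. The class limits of 15, 20, 30 and 40 mg/l are an assumption quoted from the commonly cited COD limits of the Chinese surface-water standard and should be checked against MEE (2002) before use; the function names and pixel values are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Assumed upper limits (mg/l) for Class I-II, III, IV and V; verify against MEE (2002)
COD_LIMITS = [15.0, 20.0, 30.0, 40.0]
COD_CLASSES = ["I-II", "III", "IV", "V", "worse than V"]

def classify_cod(value):
    """Map an estimated COD concentration to a water-quality class."""
    return COD_CLASSES[bisect_left(COD_LIMITS, value)]

def class_shares(values):
    """Percentage of pixels falling in each class."""
    counts = Counter(classify_cod(v) for v in values)
    return {name: 100 * counts[name] / len(values) for name in COD_CLASSES}

# Hypothetical per-pixel COD estimates (mg/l) from a regression model
estimates = [10.0, 17.45, 25.0, 35.0, 50.45]
shares = class_shares(estimates)
```

Applying `class_shares` to all water pixels of a scene gives area percentages of the kind reported for each sampling date.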
from phytoplankton dominated water experienced low reflectance in red (650 to 750 nm)
wavelengths due to the absorption by Chl-a and other pigments. Meanwhile, other
relevant studies also show that there is a significant correlation between Chl-a and the
blue (430 to 500 nm), green (543 to 578 nm), red (650 to 750 nm) and NIR (780 to 1100 nm)
bands (Ma et al. 2006; 2009; Gitelson et al. 2011; Chang, Xuan, and Yang 2013; Li et al.
2017a). The increase of TP and TN thus changes the optical characteristics of waterbodies.
When industrial effluent containing plenty of organic matter is discharged into
waterbodies, insoluble substances directly lead to the turbidity increase. On the other hand,
aerobic microorganisms consume oxygen in the water to degrade organic matter. At
a certain depth, the dissolved oxygen might gradually decrease to 0, which results in
anaerobic conditions. In an anaerobic environment, the cyclic state of Fe3+ from Fe2O3 and
Fe(OH)3 in water could be disrupted, and a certain amount of Fe2+ accumulates.
Meanwhile, S from sulphate and organic sulphur is reduced to H2S, and the anaerobic
environment prevents microorganisms from assimilating H2S into organic sulphur
compounds. Unassimilated H2S might react with Fe2+ to form FeS. The FeS turns the water
black by adsorbing onto the suspended solids (SS), or by being raised into the water by the
bubbles generated from anaerobic decomposition (Duan et al. 2014) (Figure 14). The
increase of COD thus changes the optical characteristics of waterbodies, such as the
increase of reflectance in the green (543–578 nm) and red (650–750 nm) wavelengths
(Tan, Cherkauer, and Chaubey 2016). Based on the above analysis, it is feasible to retrieve
TP, TN and COD from spectral characteristics.
5. Conclusions
This research developed a machine learning-based strategy for non-optically active water-
quality parameters retrieval of small urban waterbodies based on Sentinel-2 imagery.
Compared with Landsat TM/ETM+, MODIS and other remote-sensing imagery, Sentinel-2
imagery with high spatiotemporal resolution makes it possible to retrieve water-quality
parameters for small urban waterbodies. The most influential Sentinel-2 imagery bands for
TP, TN and COD retrieval were B3, B4 and B5. The optimal retrieval accuracy of TP, TN and
COD was obtained from the band compositions of ‘B3 + B4 + B5 + B6 + B7 + B8’,
‘B3 + B4 + B5 + B6 + B7 + B8’ and ‘B2 + B3 + B5 + B6 + B7 + B8’, respectively. Compared
to the spatial interpolation and multiple linear regression, the retrieval performances for
non-optically active parameters were significantly improved by the optimized machine-
learning models and imagery band selection, especially for TP and TN. The developed
models are capable of providing reliable results in local areas, but also have limitations in
applying to other areas. Band selection and tuning parameters with new data are necessary
for different areas. According to the water-quality mapping by remote-sensing imagery and
the interviews of the residents in the neighbourhood, the pollutants, especially the illegal
discharges of industrial effluent and domestic sewage, were traced back to the source.
Water-quality classification based on the water-quality parameter estimations helps in the
integration of water-quality online monitoring and early warning systems. Machine-learning
classification models can alternatively be considered for water-quality classification by
remote sensing in future research. This study provides a new practical and efficient water-
quality monitoring method for managing small waterbodies.
Disclosure statement
The authors declared that they had no conflict of interest over any part or the entirety of the
presented study.
Funding
This work was supported by the National Key Research and Development Program of China under
[Grant 2016YFC0400709]; Ministry of Science and Technology of the People’s Republic of China; and
Science and Technology Commission of Tianjin Binhai New Area under [Grant BHXQKJXM-PT-ZJSHJ
-2017001].
ORCID
Hongwei Guo http://orcid.org/0000-0003-3663-5908
References
Beaulant, A. L., G. Perron, J. Kleinpeter, C. Weber, T. Ranchin, and L. Wald. 2008. “Adding Virtual
Measuring Stations to a Network for Urban Air Pollution Mapping.” Environment International 34
(5): 599–605. doi:10.1016/j.envint.2007.12.004.
Brando, V. E., and A. G. Dekker. 2003. “Satellite Hyperspectral Remote Sensing for Estimating
Estuarine and Coastal Water Quality.” IEEE Transactions on Geoscience and Remote Sensing 41:
1378–1387. doi:10.1109/TGRS.2003.812907.
Brezonik, P. L., L. G. Olmanson, J. C. Finlay, and M. E. Bauer. 2015. “Factors Affecting the Measurement
of CDOM by Remote Sensing of Optically Complex Inland Waters.” Remote Sensing of
Environment. doi:10.1016/j.rse.2014.04.033.
Brönmark, C., and L. A. Hansson. 2002. “Environmental Issues in Lakes and Ponds: Current State and
Perspectives.” Environmental Conservation 29: 290–307. doi:10.1017/S0376892902000218.
Bugnot, A. B., M. B. Lyons, P. Scanes, G. F. Clark, S. K. Fyfe, A. Lewis, and E. L. Johnston. 2018. “A Novel
Framework for the Use of Remote Sensing for Monitoring Catchments at Continental Scales.”
Journal of Environmental Management 217: 939–950. doi:10.1016/j.jenvman.2018.03.058.
Cao, Z., R. Ma, H. Duan, N. Pahlevan, J. Melack, M. Shen, and K. Xue. 2020. “A Machine Learning
Approach to Estimate Chlorophyll-A from Landsat-8 Measurements in Inland Lakes.” Remote
Sensing of Environment 248: 111974. doi:10.1016/j.rse.2020.111974.
Carletti, R., M. Picci, and D. Romano. 2000. “Kriging and Bilinear Methods for Estimating Spatial
Pattern of Atmospheric Pollutants.” Environmental Monitoring and Assessment 63 (2): 341–359.
doi:10.1023/A:1006293110652.
Carlson, R. E. 1977. “A Trophic State Index for Lakes.” Limnology and Oceanography 22: 361–369.
doi:10.4319/lo.1977.22.2.0361.
Chang, N. B., K. Bai, and C. F. Chen. 2017. “Integrating Multisensor Satellite Data Merging and Image
Reconstruction in Support of Machine Learning for Better Water Quality Management.” Journal of
Environmental Management 201: 227–240. doi:10.1016/j.jenvman.2017.06.045.
Chang, N. B., Z. Xuan, and Y. J. Yang. 2013. “Exploring Spatiotemporal Patterns of Phosphorus
Concentrations in a Coastal Bay with MODIS Images and Machine Learning Models.” Remote
Sensing of Environment 134: 100–110. doi:10.1016/j.rse.2013.03.002.
Chen, J., K. de Hoogh, J. Gulliver, B. Hoffmann, O. Hertel, M. Ketzel, M. Bauwelinck, et al. 2019.
“A Comparison of Linear Regression, Regularization, and Machine Learning Algorithms to
Develop Europe-Wide Spatial Models of Fine Particles and Nitrogen Dioxide.” Environment
International 130: 104934. doi:10.1016/j.envint.2019.104934.
Chen, J., and W. Quan. 2012. “Using Landsat/TM Imagery to Estimate Nitrogen and Phosphorus
Concentration in Taihu Lake, China.” IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing 5: 273–280. doi:10.1109/JSTARS.2011.2174339.
Deng, C., L. Zhang, and Y. Cen. 2019. “Retrieval of Chemical Oxygen Demand through Modified
Capsule Network Based on Hyperspectral Data.” Applied Sciences (Switzerland). doi:10.3390/app9214620.
Doña, C., N. B. Chang, J. M. Vicente Caselles, A. C. Sánchez, J. Delegido, and B. W. Vannah. 2015.
“Integrated Satellite Data Fusion and Mining for Monitoring Lake Water Quality Status of the
Albufera de Valencia in Spain.” Journal of Environmental Management 151: 416–426. doi:10.1016/j.jenvman.2014.12.003.
Duan, H., R. Ma, and C. Hu. 2012. “Evaluation of Remote Sensing Algorithms for Cyanobacterial
Pigment Retrievals during Spring Bloom Formation in Several Lakes of East China.” Remote
Sensing of Environment 126: 126–135. doi:10.1016/j.rse.2012.08.011.
Duan, H., R. Ma, S. A. Loiselle, Q. Shen, H. Yin, and Y. Zhang. 2014. “Optical Characterization of Black
Water Blooms in Eutrophic Waters.” Science of the Total Environment 482–483: 174–183.
doi:10.1016/j.scitotenv.2014.02.113.
Ferdous, J., M. Tauhid, and U. Rahman. 2020. “Developing an Empirical Model from Landsat Data
Series for Monitoring Water Salinity in Coastal Bangladesh.” Journal of Environmental
Management 255: 109861. doi:10.1016/j.jenvman.2019.109861.
Gao, Y., J. Gao, H. Yin, C. Liu, T. Xia, J. Wang, and Q. Huang. 2015. “Remote Sensing Estimation of the
Total Phosphorus Concentration in a Large Lake Using Band Combinations and Regional
Multivariate Statistical Modeling Techniques.” Journal of Environmental Management 151: 33–43.
doi:10.1016/j.jenvman.2014.11.036.
Gholizadeh, M. H., and A. M. Melesse. 2017. “Study on Spatiotemporal Variability of Water Quality
Parameters in Florida Bay Using Remote Sensing.” Journal of Remote Sensing & GIS 6 (3).
doi:10.4172/2469-4134.1000207.
Remote Sensing Data: A Case Study of the Wabash River and Its Tributary, Indiana.” Remote
Sensing 8 (6): 517. doi:10.3390/rs8060517.
Vapnik, V. N. 1995. “Adaptive and Learning Systems for Signal Processing, Communications and
Control.” In The Nature of Statistical Learning Theory, edited by Michael Jordan, 138–167. New
York: Springer-Verlag. doi:10.2307/1271368.
Vignolo, A., A. Pochettino, and D. Cicerone. 2006. “Water Quality Assessment Using Remote Sensing
Techniques: Medrano Creek, Argentina.” Journal of Environmental Management 81 (4): 429–433.
doi:10.1016/j.jenvman.2005.11.019.
Wang, J., M. Hu, B. Gao, H. Fan, and J. Wang. 2019. “A Spatiotemporal Interpolation Method for the
Assessment of Pollutant Concentrations in the Yangtze River Estuary and Adjacent Areas from
2004 to 2013.” Environmental Pollution. doi:10.1016/j.envpol.2019.05.132.
Wang, S., J. Li, B. Zhang, E. Spyrakos, A. N. Tyler, Q. Shen, F. Zhang, et al. 2018. “Trophic State
Assessment of Global Inland Waters Using a MODIS-Derived Forel-Ule Index.” Remote Sensing of
Environment. doi:10.1016/j.rse.2018.08.026.
Wang, S., J. Li, B. Zhang, Z. Lee, E. Spyrakos, L. Feng, C. Liu, et al. 2020. “Changes of Water Clarity in
Large Lakes and Reservoirs across China Observed from Long-Term MODIS.” Remote Sensing of
Environment. doi:10.1016/j.rse.2020.111949.
Wang, Y., H. Xia, J. Fu, and G. Sheng. 2004. “Water Quality Change in Reservoirs of Shenzhen, China:
Detection Using LANDSAT/TM Data.” Science of the Total Environment. doi:10.1016/j.scitotenv.2004.02.020.
Wu, C., J. Wu, Q. Jiaguo, L. Zhang, H. Huang, L. Lou, and Y. Chen. 2010. “Empirical Estimation of Total
Phosphorus Concentration in the Mainstream of the Qiantang River in China Using Landsat TM
Data.” International Journal of Remote Sensing 31: 2309–2324. doi:10.1080/01431160902973873.
Wu, M., W. Zhang, X. Wang, and D. Luo. 2009. “Application of MODIS Satellite Data in Monitoring
Water Quality Parameters of Chaohu Lake in China.” Environmental Monitoring and Assessment
148 (1–4): 255–264. doi:10.1007/s10661-008-0156-2.
Xiong, Y., Y. Ran, S. Zhao, H. Zhao, and Q. Tian. 2020. “Remotely Assessing and Monitoring Coastal
and Inland Water Quality in China: Progress, Challenges and Outlook.” Critical Reviews in
Environmental Science and Technology 50 (12): 1266–1302. doi:10.1080/10643389.2019.1656511.