1. Introduction
Soil heavy metal pollution is of great concern. In China, the last two decades of anthropogenic activities, such as industrial pollution, livestock wastewater, mine drainage, and chemical pesticides, have led to heavy metal pollution in soil. Especially in farmlands, heavy metal pollution not only destroys the normal function of soils and hinders crop growth, but also endangers human health through the food chain [1,2]. As one of the most rapidly developing areas in China, Guangdong province faces serious soil contamination: an estimated 40% of the soils in the Pearl River Delta are polluted by heavy metals [3]. Therefore, it is very important to devise methods that monitor soil heavy metal content accurately and in a timely manner and map its spatial distribution.
The conventional method of estimating soil heavy metal content is based on a regular soil sampling design with a soil measurement depth of 0–5 cm and subsequent chemical analysis of the sampled soils in the laboratory, followed by geostatistical interpolation of the data to obtain the spatial distributions of soil heavy metal content [4,5,6]. However, this method is time-consuming and costly and cannot provide accurate estimates of soil heavy metal content over large areas [7,8]. Remote sensing technologies can rapidly produce spatially explicit estimates of soil heavy metal content and monitor its dynamics at a regional scale at low cost. Because soil spectral reflectance is a cumulative property that derives from the inherent spectral behaviors of the heterogeneous combination of a soil’s physical and chemical properties, hyperspectral visible and near-infrared reflectance (VNIR) spectroscopy coupled with calibration techniques has been developed to predict various soil properties and soil heavy metal content [9,10,11,12]. Compared with traditional in situ measurements of soil heavy metal content, hyperspectral remote sensing techniques offer advantages for rapidly monitoring soil heavy metal content at a regional scale, such as near real-time detection, relatively low cost, and environmental friendliness [9,10,11,12,13,14,15,16,17]. These methods mainly use a 0–5 cm soil sampling depth, which does not yield a reliable measurement of soil heavy metal content in the studied areas because the distributions of heavy metals in soils are not homogeneous from the surface (0 m) to a depth of 1 m [18,19]. Soil samples should therefore be taken from layers deeper than 5 cm [18,19]. On the other hand, the VNIR portion of light has a limited penetration capacity, and the shortwave infrared spectrum is often needed to obtain spectral reflectance and absorption information on soil heavy metal contents from soil layers deeper than 5 cm [18,19].
The present calibration techniques of hyperspectral estimation models for the determination of heavy metal content can be divided into two categories: statistical analysis models [20,21] and machine-learning models [21,22,23]. However, these studies mainly focus on building relationships between soil heavy metal content and soil hyperspectral data, without considering the effect of soil water. Soil water has strong absorption features over the VNIR region [24], which may interfere with the generation of accurate hyperspectral estimation models. Moreover, soils that are characterized or polluted by different heavy metals have different spectral reflectance and absorption features across wavelengths [17,25,26]. In addition, the estimation models of soil heavy metal content are often developed from spectral variables derived from hyperspectral data measured on soil samples in the laboratory. It is very challenging to make these lab-derived models applicable to mapping the contents of heavy metals in soil using hyperspectral imagery at regional scales. One reason is that soil is a complex system: different soils have their own characteristics of spectral reflectance and absorption, and a soil’s properties cannot be easily assessed using spectral reflectance curves, even under controlled laboratory conditions [27,28]. Another is that the quality of space-borne and airborne hyperspectral data is often greatly affected by sensors, atmospheric conditions, and soil surface conditions such as vegetation cover, which makes applying the lab-derived relationships at regional and national scales problematic [29,30]. Thus, there is a strong need to link lab-derived estimation models with the characteristics of soil properties in the field, such as soil moisture content, to make it possible to generate digital maps of soil heavy metal contents [31].
The objective of this study was to develop a novel method for estimating and mapping soil heavy metal (As, Cd, and Hg) content using hyperspectral data. In addition to developing the estimation models from selected spectral variables derived from dry soil spectral reflectance (DSSR) measured in the laboratory, we explored how the ratio of DSSR to moisture soil spectral reflectance (MSSR) varies with soil moisture content, which links DSSR with MSSR for the estimation and mapping of soil heavy metal content. This method offers the potential to apply lab-derived models to mapping soil heavy metal contents at a regional scale. The method was examined in Guangdong province, China, and in the Conghua district of Guangzhou city, using both hyperspectral data collected in the laboratory and HuanJing-1A (HJ-1A) HyperSpectral Imager (HSI) images.
4. Discussions
Developing estimation models of soil heavy metal contents based on spectral reflectance data from soil samples and then applying them to regional-scale mapping with hyperspectral imagery, that is, generating spatially explicit estimates, is a complex process. Its accuracy varies greatly depending on many factors, including landscape complexity, the type of soil heavy metals and their chemical state or form, environmental conditions when measurements are collected, the spectral variables selected and used to develop the models, the spectral and spatial resolutions of the hyperspectral data, modeling methods, and sample sizes. This discussion focuses only on the following aspects.
First of all, pure metals do not absorb VNIR and mid-IR radiation. When soil heavy metals exhibit reflectance and absorption features, they can be estimated based on their relationships with the spectral features [14]. Soil heavy metals present at low content are often difficult to estimate directly using soil spectral features. However, soil heavy metals are often adsorbed by or bound to spectrally active soil constituents, depending on environmental conditions, which makes it possible to estimate their contents and derive their spatial distributions using spectral variables from remote sensing data, especially hyperspectral data [49,50]. Previous studies have shown the feasibility of predicting soil heavy metal content from spectroscopic reflectance [49,50]. However, it is critically important to select spectral variables that contribute significantly to the reduction of model fitting errors and the increase of estimation accuracy but are not correlated with each other [15].
For this purpose, several methods, such as correlation analysis, the variance inflation factor (VIF), and random forest, are available. Studies have also shown that the Boruta algorithm exhibits superior performance, with a higher accuracy and smaller error rate, compared to conventional statistical methods [32,43,51]. However, in our experiment it was also found that there was collinearity among the spectral variables selected by the Boruta algorithm. For example, the Boruta algorithm led to a total of 15 spectral variables for the estimation of soil heavy metal Cd, and most of them had VIF values greater than 10 (Table 6). We then used a stepwise regression with VIF to eliminate the collinearity among the spectral variables and identify the optimal relevant spectral variables (Table 6, Figure 14), which led to three spectral variables (FD1059, FD2178, and FD2379) that significantly contributed to the increase in the estimation accuracy and were not significantly correlated with each other. Thus, the integration of the Boruta algorithm with the stepwise regression and VIF worked well.
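The collinearity screening described above can be illustrated with a minimal sketch. It assumes the Boruta-selected spectral variables are held as columns of a NumPy array and repeatedly drops the variable with the largest VIF until every remaining value is at or below 10, the threshold mentioned above. This is a simplified stand-in for the stepwise regression with VIF used in the study, not a reproduction of it, and the function names `vif` and `drop_collinear` are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_vars).

    Each column is regressed on all the others (plus an intercept);
    VIF_j = 1 / (1 - R_j^2). Columns are assumed to be non-constant.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def drop_collinear(X, names, threshold=10.0):
    """Illustrative backward elimination (an assumption, not the study's
    exact procedure): drop the variable with the largest VIF until all
    VIF values are <= threshold. Returns the names that are kept."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names
```

In practice the variables that survive this screening would still need to be tested for their contribution to model accuracy, which is the role the stepwise regression plays in the procedure described above.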
Secondly, the spectral reflectance properties of soils over the VNIR spectrum (350–1000 nm) are attributed to the electromagnetic energy absorption caused by the electron transition of metal ions (such as Fe²⁺, Fe³⁺, and Mn²⁺). In the shortwave infrared spectrum (1000–2500 nm), the spectral absorptions of soils are mainly due to the stretching, bending, and deformation of chemical bonds in various molecular groups (including OH-, CO-OH, Al-OH, Fe-OH, Mg-OH) of soil constituents such as organic matter, layered silicates, carbonates, and sulfates. In general, different components or soil heavy metals have different characteristics of spectral absorption. For example, bands centered around 838 nm, 1930 nm, and 2148 nm are sensitive to soil lead content [26], while wavelengths centered around 460 nm, 1400 nm, 1900 nm, and 2200 nm are considered to be appropriate for studying the content of As and Cu in mining areas [17]. Liu [25] ranked the adsorption capacity of heavy metals in soils as Pt > Cd > Hg > As > Cr, which may explain why our study found that the model for estimating soil heavy metal Cd was the most accurate and the model for soil heavy metal As had the worst performance. In addition, because of the limited wavelength range (459 nm to 956 nm) of the HJ-1A image used for the Conghua district, the estimation models for the soil heavy metals Hg and As were less accurate at the regional scale than at the soil sample level.
Moreover, the purpose of developing estimation models using the hyperspectral data from soil samples is to apply them to map the contents of soil heavy metals at regional scales, that is, to generate spatially explicit estimates based on hyperspectral imagery [12,14,16]. This requires that the wavelength ranges and spectral resolutions of the soil-sample hyperspectral data used for model development be consistent with those of the hyperspectral imagery used for mapping the soil heavy metal contents at regional scales. If they are not consistent, the obtained models cannot be directly applied to mapping at regional scales. In this study, because the HJ-1A image had a much narrower range of wavelengths and a coarser spectral resolution than the hyperspectral data collected from the soil samples, the hyperspectral data were re-sampled and the estimation models were re-developed. The results at the regional scale showed a pattern of estimation accuracy similar to that at the soil sample level: the prediction model of Cd provided the most accurate estimates, followed by the models for Hg and As, at both the soil sample level for Guangdong province and the regional scale of Conghua district. This may imply the generalization and repeatability of the proposed method. However, the test sample sizes used to validate the prediction accuracy of the obtained models at both the soil sample level and the regional scale were relatively small, and further validation of the proposed method using larger sample sizes is needed in the future.
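The re-sampling step can be sketched as follows, assuming each image band has a Gaussian spectral response. The band centres and the full width at half maximum (FWHM) in the usage note are placeholders rather than the actual HJ-1A HSI specification, and `resample_to_bands` is a hypothetical helper, not the procedure used in this study.

```python
import numpy as np


def resample_to_bands(wavelengths, reflectance, centers, fwhm):
    """Convolve a finely sampled spectrum to coarser sensor bands.

    wavelengths, reflectance: 1-D arrays of the laboratory spectrum (nm, unitless)
    centers: 1-D array of band-centre wavelengths of the image sensor (nm)
    fwhm: assumed full width at half maximum of every band (nm)
    Returns one reflectance value per sensor band.
    """
    wavelengths = np.asarray(wavelengths, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Convert FWHM to the standard deviation of a Gaussian response.
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    # weights[i, j]: response of sensor band i at laboratory wavelength j.
    weights = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / sigma) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ reflectance
```

For example, a laboratory spectrum `spectrum` sampled every 1 nm from 350 nm to 2500 nm could be reduced to a set of placeholder bands covering 459 nm to 956 nm with `resample_to_bands(np.arange(350, 2501), spectrum, np.linspace(459, 956, 115), fwhm=5.0)`. The estimation models would then be re-fitted on the re-sampled variables, as described above.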
At present, almost all the studies on the development of estimation models using hyperspectral data to estimate the contents of soil heavy metals have focused on building the relationships of soil heavy metal contents with DSSR without considering the effect of soil water, which leads to lower estimation accuracy due to the interference of soil moisture. In this paper, we first built the relationship models of soil heavy metal contents with DSSR and then accounted for the relationship of the DSSR/MSSR ratio with soil moisture content. The latter can be used to derive the values of DSSR when the data of soil moisture and MSSR are available. It was found that the relationship of the DSSR/MSSR ratio with soil moisture content varied greatly at wavelengths from 340 nm to 1029 nm, and that the variations converged at 1029 nm and became stable beyond it. This implies that soil moisture content does not significantly affect the DSSR/MSSR ratio beyond 1029 nm. Therefore, beyond 1029 nm the relationship was stable and could be used to estimate the values of DSSR from the MSSR and soil moisture content data of hyperspectral images, which provides the potential of using the DSSR-derived models to estimate soil heavy metal contents from MSSR data. This finding is novel. However, in this study, the sample sizes used to develop and validate the estimation models were relatively small, which might have affected the assessment. In addition, it is often very difficult to obtain stable spectra when the soil samples are measured with the AvaField portable spectrometer. This is partly because the soil composition is not homogeneous and partly because the environmental conditions under which the soil samples and spectral data are collected also affect the accuracy of the spectral data. Thus, the characteristics of the moist and dry soil samples used to develop the models control the transformation of MSSR data to DSSR. In the future, more soil samples should be collected to improve and assess the transformation model.
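The use of the DSSR/MSSR ratio can be sketched as follows. The sketch assumes, purely for illustration, that the ratio in each band is a linear function of soil moisture content; the functional form actually fitted in this study is not restated in this section, and the names `fit_ratio_model` and `mssr_to_dssr` are hypothetical.

```python
import numpy as np


def fit_ratio_model(dssr, mssr, moisture):
    """Fit ratio = a + b * moisture separately for every band.

    dssr, mssr: arrays of shape (n_samples, n_bands) with paired dry and
                moist reflectance of the same soil samples
    moisture:   array of shape (n_samples,) with soil moisture content
    Returns (a, b), each of shape (n_bands,).
    The linear form is an assumption made for this sketch.
    """
    ratio = np.asarray(dssr, dtype=float) / np.asarray(mssr, dtype=float)
    moisture = np.asarray(moisture, dtype=float)
    A = np.column_stack([np.ones(moisture.shape[0]), moisture])
    coef, *_ = np.linalg.lstsq(A, ratio, rcond=None)
    return coef[0], coef[1]


def mssr_to_dssr(mssr, moisture, a, b):
    """Estimate dry-soil reflectance from moist-soil reflectance.

    DSSR_est = MSSR * (a + b * moisture), applied band by band.
    """
    mssr = np.asarray(mssr, dtype=float)
    moisture = np.asarray(moisture, dtype=float)
    return mssr * (a + b * moisture[..., None])
```

With coefficients fitted on paired laboratory measurements, `mssr_to_dssr` could be applied to image reflectance and a co-registered moisture product, after which the DSSR-based estimation models would be evaluated on the result. Given the finding above, such a transformation would be expected to be most dependable in bands beyond 1029 nm.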
In this study, the larger RRMSE values of the As and Hg estimates were mainly caused by overestimation for the soil samples with smaller values of soil heavy metal content and underestimation for those with larger values. Such overestimations and underestimations are often observed with global modeling, such as linear regression, which captures the global trends and ignores the local variability. However, the contents of heavy metals in soil are often spatially clustered and show spatial autocorrelation in addition to global trends. Thus, local variability-based modeling methods, such as geographically weighted regression and cokriging interpolation in geostatistics, would provide the potential to improve the prediction accuracy of Cd, Hg, and As contents [12,52,53].
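The pattern described above can be checked numerically. The sketch below assumes the common definition of RRMSE as the root mean square error divided by the mean of the observations, expressed in percent (the formula used in this study is not restated in this section). It also adds a hypothetical diagnostic, `residual_trend`, whose negative value indicates overestimation of small contents and underestimation of large ones.

```python
import numpy as np


def rrmse(observed, predicted):
    """Relative RMSE in percent: 100 * RMSE / mean(observed).

    This definition is an assumption of the sketch.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    return 100.0 * rmse / observed.mean()


def residual_trend(observed, predicted):
    """Slope of the residuals (predicted - observed) against the observations.

    A negative slope means low values tend to be overestimated and
    high values underestimated, i.e. predictions are pulled to the mean.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    slope, _intercept = np.polyfit(observed, predicted - observed, 1)
    return slope
```

A clearly negative `residual_trend` on the validation samples would be consistent with the explanation given above and would suggest that a local modeling method is worth testing.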
In order to derive the estimation models of the soil heavy metal content based on the hyperspectral data from the soil samples and apply the models to Conghua district, the pixels of the HJ-1A image and the SMAP data were assumed to be pure. This implies an assumption that the relationships hold only in homogeneous areas. In reality, the coarse spatial resolutions of 9 km × 9 km for SMAP and 100 m × 100 m for the HJ-1A image made it difficult to identify pure pixels. Pure pixels rarely exist, while mixed pixels often dominate a study area. It is unknown whether the relationships still hold true for mixed pixels. Thus, the conclusions presented here need support from additional studies.
Finally, in this study we used 65 soil samples to develop the models and 15 soil samples to validate the models for the whole Guangdong province, and 33 soil samples to validate the models for the Conghua district. Although the sampling design was based on different levels of potential pollution and soil types, the sample sizes were relatively small. The large coefficients of variation for the sample means of the soil heavy metal contents in Table 1 and Table 5 explain the large RRMSE values of the As and Hg predictions, especially As. However, the study focused on the development of the proposed method, not on the generation of the soil heavy metal content maps. Thus, the sample sizes were statistically acceptable. In future studies, larger sample sizes should be used to further develop and validate the proposed method.
5. Conclusions
It is well known that estimating and mapping the contents of soil heavy metals using hyperspectral data is quick and effective but very challenging because of complex landscapes, soil properties, the spectral variables selected, modeling methods, and model transferability. This study attempted to overcome some of the gaps that currently exist in this field by proposing a novel method. In this method, the optimal relevant spectral variables that significantly contributed to the reduction of model fitting errors and the improvement of estimation accuracy were first selected from the spectral indices derived from DSSR, using the integration of the Boruta algorithm with a stepwise regression and VIF. The estimation models of soil heavy metal content were developed using the selected spectral variables and field observations of soil heavy metal content. The model that accounted for the relationship of the spectral ratio of DSSR to MSSR with soil moisture content was then derived. The proposed method was examined and validated by estimating and mapping the contents of three soil heavy metals (As, Cd, and Hg) in Guangdong province, China, and in the Conghua district of Guangzhou city in the same province.
The results showed that (1) based on the RRMSE values from the validation datasets, the estimation model of soil heavy metal Cd content offered the most accurate estimates at both the soil sample level and regional scale, and the estimation model of As performed the worst; (2) the relationship of the DSSR/MSSR ratio with soil moisture content varied greatly before the wavelength of 1029 nm and became stable after that; (3) the DSSR/MSSR ratio model built up the linkage of DSSR with MSSR through soil moisture and provided the possibility of applying the DSSR-based models to map soil heavy metal contents at a regional scale using hyperspectral imagery; and (4) based on health standards, overall there were only a few soil samples seriously polluted by the soil heavy metals in the whole Guangdong province, while in the Conghua district of Guangzhou city, the serious pollution was mainly caused by Hg and As, with distributions mainly in the urbanized areas and croplands. This study implies that the new approach provided the potential to improve the estimation accuracy of the soil heavy metal contents, but Cd content was more reliably estimated than As and Hg.