5.1. Data Saturation Problem in Landsat Imagery and Potential Solution in Reducing the Saturation
In this study, we estimated the data saturation values of Landsat 5 TM imagery for six vegetation types using a spherical model from geostatistics. We obtained and compared the asymptote values of the spherical, exponential, and Gaussian models, and then derived the corresponding threshold values on the X axis, that is, the saturation values of forest AGB. The idea behind these models is that as forest AGB increases, the spectral reflectance changes quickly at first, then more slowly, and eventually becomes stable. Once the spectral reflectance becomes stable, the corresponding AGB value can be regarded as the saturation value. In geostatistics, these models are used to model the spatial autocorrelation of a random variable and to find the maximum distance of spatial autocorrelation. In this study, they are treated as general models characterized by an asymptote on the Y axis: by seeking the asymptote of the Y axis (that is, the spectral reflectance of Landsat TM band 7), the range parameter on the X axis (that is, forest AGB) is estimated. The results showed that the spherical model led to the smallest residuals, whereas the exponential and Gaussian models resulted in much larger and unreasonable saturation values (78 Mg/ha–423 Mg/ha) than the spherical model (55 Mg/ha–159 Mg/ha). This is mainly because the exponential and Gaussian models theoretically have smaller change rates along the Y axis and much larger asymptote values than the spherical model. This implies that when the spectral reflectance of Landsat band 7 becomes insensitive to increases in forest AGB, the asymptote values of the exponential and Gaussian models have not yet been reached and, thus, the data saturation values are inflated. Our results for the spherical model show that pine forests had the greatest saturation value, followed by mixed forests, fir forests, and broadleaf forests. Shrubs had the smallest saturation value.
The findings are consistent with the forest status of this study area. This is a novel examination of data saturation in a subtropical region and further studies are needed in other forest ecosystems such as in tropical and temperate regions. A better understanding of the data saturation problem in different vegetation types provides a foundation to find ways to reduce saturation.
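The way a range parameter can act as a saturation value may be easier to see in code. The following is a minimal sketch, not the authors' implementation: the function names, the nugget/sill/range parameterization, the practical-range factor of 3 in the exponential and Gaussian forms, the grid search, and all synthetic values are assumptions made for illustration. Given a candidate range, the spherical model is linear in its nugget and sill, so these two can be solved by ordinary least squares.

```python
import numpy as np

def spherical(x, nugget, sill, rng):
    """Spherical model: reaches nugget + sill exactly at x = rng and stays flat."""
    h = np.clip(np.asarray(x, dtype=float) / rng, 0.0, 1.0)
    return nugget + sill * (1.5 * h - 0.5 * h**3)

def exponential(x, nugget, sill, rng):
    """Exponential model (practical-range form): only approaches nugget + sill."""
    x = np.asarray(x, dtype=float)
    return nugget + sill * (1.0 - np.exp(-3.0 * x / rng))

def gaussian(x, nugget, sill, rng):
    """Gaussian model (practical-range form): only approaches nugget + sill."""
    x = np.asarray(x, dtype=float)
    return nugget + sill * (1.0 - np.exp(-3.0 * (x / rng) ** 2))

def fit_spherical(agb, refl, candidate_ranges):
    """Grid search over the range; nugget and sill come from linear least squares."""
    agb = np.asarray(agb, dtype=float)
    refl = np.asarray(refl, dtype=float)
    best = None
    for rng in candidate_ranges:
        shape = spherical(agb, 0.0, 1.0, rng)          # unit-sill curve for this range
        design = np.column_stack([np.ones_like(agb), shape])
        coef, *_ = np.linalg.lstsq(design, refl, rcond=None)
        sse = float(np.sum((design @ coef - refl) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(rng), float(coef[0]), float(coef[1]))
    sse, rng, nugget, sill = best
    return {"saturation_agb": rng, "nugget": nugget, "sill": sill, "sse": sse}

# Synthetic, noise-free reflectance that levels off at 150 Mg/ha (illustrative only).
agb = np.linspace(0.0, 300.0, 61)
refl = spherical(agb, nugget=0.12, sill=-0.07, rng=150.0)
fit = fit_spherical(agb, refl, candidate_ranges=np.arange(50.0, 301.0, 10.0))
print(fit["saturation_agb"])
```

Under these assumed forms, the spherical curve is exactly flat beyond its range, whereas the exponential and Gaussian curves are still changing at the same abscissa, which is one way to read the larger saturation values reported for those two models.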
Because data saturation may be caused by different factors, such as the remote sensing data themselves (e.g., spatial, spectral, radiometric, and temporal resolutions in optical sensor data), vegetation (e.g., species composition, stand structure, growth stages), and topography (e.g., aspect, elevation, slope), many potential solutions may be used to reduce the data saturation problem. Saturation occurs when the spectral values remain insensitive to increases in AGB beyond a certain value. An optical sensor is not able to penetrate a dense tree canopy. Once all canopy gaps have been closed by leaves and branches, trees may continue to grow and their biomass may continue to increase without changing the spectral signature of the canopy. The AGB saturation level varies with sensor type. C-band (6 cm wavelength) synthetic aperture radar (SAR) sensors may capture canopy roughness, but they are not able to penetrate beyond the top layer of leaves and thin branches. L-band (24 cm) SAR is better at penetrating the canopy, and P-band (70 cm) is even better and may capture the entire tree structure. As a rule of thumb, SAR signals may be able to penetrate structures that are narrower than the wavelength. The integration of Landsat TM and SAR data may mitigate the data saturation problem and, thus, improve AGB estimation [21]. Further research is needed to explore approaches to integrating multisensor data [67,68], especially the fusion of optical and radar or lidar data [6,21].
Lu’s study [30] has shown the importance of incorporating textures into spectral responses for improving AGB estimation performance, and this research also confirmed the necessity of image textures. Other research using textures from optical or SAR data reached similar conclusions [20,25,26,27,53]. The critical point is to identify specific textures for given vegetation types, because a good texture image for a given vegetation type depends on factors such as the spatial resolution of the remote sensing data, the complexity of forest stand structure, species composition, and the window size used to extract the textural image [6,30,52]. The difficulty of identifying the best textural images was encountered in this research, as different textural images were used for specific vegetation types, depending on the combination of texture measures, window size, and spectral bands. Another potential approach to reducing the data saturation problem is to use Landsat images from different seasons or time series [20,34,48]. This is important because vegetation types such as pine forest, broadleaf forest, and bamboo forest have their own phenology; thus, incorporating features inherent in vegetation phenology may benefit AGB estimation.
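To make the dependence of a textural image on window size concrete, the sketch below computes a simple moving-window variance for several window sizes. It is only an illustration: the variance measure stands in for whichever texture measures were actually used in this research, and the function name, the random `band7` array, and the choice of window sizes are all assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_variance(band, window=3):
    """Variance texture over a square moving window (no padding, so edges are dropped)."""
    band = np.asarray(band, dtype=float)
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    patches = sliding_window_view(band, (window, window))
    return patches.var(axis=(-2, -1))

# Hypothetical 20 x 20 pixel subset of a single spectral band.
band7 = np.random.default_rng(0).random((20, 20))

# One texture image per window size; each would be a separate candidate predictor.
textures = {w: window_variance(band7, w) for w in (3, 5, 7)}
for w, tex in textures.items():
    print(w, tex.shape)
```

Each combination of texture measure, window size, and spectral band yields a different candidate variable, which is why the search for the best textural image for a given vegetation type grows quickly.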
This research indicates the important role of stratification of vegetation types and slope aspects in reducing the data saturation problem. More research is needed to identify suitable stratification approaches, such as the optimal number of vegetation types and topographic factors. The key is to obtain a sufficient number of sample plots for each stratum. More strata require more sample plots, which is often a challenge because of the difficulty, time, and cost of collecting sample plots for AGB calculation. Also, some strata may be unnecessary, as shown in Table 10, where the regression models look similar for pine and fir on shady and semi-shady slopes. This raises a new question of how many strata are optimal considering the required number of sample plots for each stratum, the accuracy of AGB estimates, vegetation types, and the time and labor involved in developing AGB estimation models. To date, no studies have identified the optimal strata based on availability of sample plots, vegetation data, and ancillary data. Since vegetation types are required for stratification, accurate classification is needed, and 85% is regarded as a standard [
69]. In this research, six vegetation types—pine, fir, broadleaf, mixed forest, bamboo, and shrub—were classified using MLC with an overall classification accuracy of 78%. Because the vegetation types were used as stratification for developing AGB estimation models for each vegetation type, higher classification accuracy of these vegetation types is needed; but in reality, it is a challenge to produce highly accurate classification results based on Landsat TM spectral signatures due to the spectral confusion between the vegetation types and the impacts of topographic factors. In the near future, we will incorporate other data sources such as DEM, Landsat, and SAR to improve classification accuracy.
The findings about the data saturation values in different vegetation types and potential solutions to reduce the saturation problem may provide new insights into the selection of remote sensing data or design of spectral wavelengths in the future. This research indicated that shortwave infrared bands such as Landsat 5 TM bands 5 and 7 have strong relationships with AGB. More research is needed to relate this research to hyperspectral data to identify more sensitive spectral bands corresponding to different vegetation types.
5.2. Selection of Suitable Algorithms to Establish the Relationship between AGB and Remote Sensing Variables
Linear regression analysis is often used to develop AGB estimation models [
4,
6]. In this study, most of the determination coefficients (R²) varied from 0.35 to 0.5 for all the forest AGB models and, as expected, the results are similar to those in other studies [
70]. However, the residuals show a linear relationship with the AGB reference data, that is, overestimation for the smaller observations and underestimation for the larger ones (see Figure 7), which points to a problem with linear regression models. The overestimations and underestimations were mainly caused by global regression modeling. Moreover, the data saturation of Landsat spectral reflectance may have greatly contributed to the underestimation of AGB for the larger observations. Appropriate algorithms should be studied further to reduce these overestimations and underestimations. Several alternative approaches are possible. First, data from different sources, such as optical images and their textural variables, radar, lidar, topographic variables (slope and aspect) from DEM, soil properties, and vegetation types, can be combined to model their relationships with AGB and improve the accuracy of predictions [6,32]. Second, the relationships between AGB and independent variables can be modeled after stratification of a study area [71].
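One mechanism behind the linear residual pattern can be reproduced on synthetic data. The sketch below is an illustration only: the predictor, noise level, and sample size are assumed values chosen to give a coefficient of determination in roughly the range reported above. It relies on a general property of an ordinary least squares fit with an intercept, evaluated on its own training data: regressing the estimation error (predicted minus observed) on the observed values gives a slope of R² − 1, so a weak model overestimates small observations and underestimates large ones even before any sensor saturation is considered.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)                                   # hypothetical spectral predictor
agb = 80.0 + 20.0 * x + rng.normal(scale=25.0, size=n)   # synthetic reference AGB (Mg/ha)

# Global linear regression fitted by ordinary least squares.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, agb, rcond=None)
pred = design @ coef

r2 = 1.0 - np.sum((agb - pred) ** 2) / np.sum((agb - agb.mean()) ** 2)
error = pred - agb                        # positive values are overestimations
slope = np.polyfit(agb, error, 1)[0]      # trend of the error against reference AGB

print(round(r2, 3), round(slope, 3))
```

Because the slope is tied to R², any approach that explains more of the AGB variance, whether by adding variables, stratifying, or local modeling, should also flatten this trend.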
In this study, compared to non-stratification, the stratification of vegetation types and slope aspects led to the decrease of RMSEr based on the validation dataset. However, the obvious overestimations and underestimations for the smaller and larger observations, respectively, were still noticed. One purpose of stratification is to reduce the errors due to the global regression modeling by minimizing the within-strata variability and maximizing the between-strata variance [
71]. In this study, the stratification of vegetation types and slope aspects did, to some extent, increase the accuracy of AGB estimates. However, because of the large area and complicated landscapes, the within-strata variability of AGB for each vegetation type was still large. This was especially true for the young and mature forests, resulting in large overestimations in young forests and underestimations in mature forests. Therefore, a third set of alternative algorithms may be local modeling methods such as geographically weighted regression, co-kriging, and spatial co-simulation in geostatistics. In these local modeling algorithms, models are developed using the nearest sample plots within a neighborhood of a given radius. The neighborhood can be determined using the range of spatial autocorrelation. In geographically weighted regression, the parameters of the fitted regression models vary from place to place. Similarly, co-kriging and spatial co-simulation lead to variable weights for the sample data. That is, local modeling algorithms can capture the spatial variability of local areas and, thus, have great potential to reduce the overestimations and underestimations [7,71,72,73,74]. There is also a simple way to reduce the linear bias of AGB estimates: the sample plots that are at saturation can be excluded, and a simple linear regression can then be developed for each vegetation type from the data that are not saturated in Landsat imagery. With this approach, the saturated data in Landsat imagery could be flagged as saturated. This research indicated that overestimation is obvious when AGB is less than 40 Mg/ha. When AGB is small, the sites are mainly shrub, bamboo forest, new plantations, and young broadleaf forests, where the vegetation canopy is not sufficiently dense, so soil influences the surface reflectance. Previous research has indicated that a forest site can be treated as a combination of green vegetation, shade, soil, and nonphotosynthetic vegetation (e.g., stems, branches), and that these components can be separated using spectral mixture analysis [22,31].
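The idea that regression parameters vary from place to place in geographically weighted regression can be sketched as a distance-weighted least squares fit centred on each location of interest. This is a simplified illustration, not a full implementation of the methods cited above: the Gaussian kernel, the fixed bandwidth, the function name `gwr_fit_at`, and the synthetic plots are all assumptions.

```python
import numpy as np

def gwr_fit_at(coords, x, y, target_coord, bandwidth):
    """Weighted least squares centred on target_coord with a Gaussian distance kernel.

    Returns the local (intercept, slope) for a single predictor.
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.linalg.norm(coords - np.asarray(target_coord, dtype=float), axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)   # nearby plots count more
    root_w = np.sqrt(weights)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return beta

# Synthetic plots whose AGB-predictor slope increases from west to east.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1000.0, size=(400, 2))
x = rng.normal(size=400)
y = 50.0 + (1.0 + coords[:, 0] / 500.0) * x

west = gwr_fit_at(coords, x, y, target_coord=(0.0, 500.0), bandwidth=150.0)
east = gwr_fit_at(coords, x, y, target_coord=(1000.0, 500.0), bandwidth=150.0)
global_slope = np.polyfit(x, y, 1)[0]
print(west[1], east[1], global_slope)
```

A single global model can only report one slope for the whole area, whereas the local fits recover different slopes at the two locations. In practice the bandwidth might be tied to the range of spatial autocorrelation, as suggested in the text.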
Stratification led to a decrease in the number of sample plots for each stratum, and decreasing the number of sample plots generally improves the apparent performance of the model. Thus, the improvement of the models based on the stratifications is related not only to the relationships between forest AGB and spectral variables, but also to the sample size effect. To map AGB of African forests using a sample size of 26 plots, for example, Bastin et al. [75] tested the effect of sample size on the performance of the models. However, the effect of sample size may be obvious only when a small sample size (such as <30 plots) is used. As the sample size increases, the effect should gradually disappear. In this study, 589 sample plots and 213 sample plots were used for developing and validating the models, respectively (Table 3). When modeling was carried out based on stratification of vegetation types, large sample sizes were employed for all vegetation types except shrub. When modeling was conducted based on stratification of both vegetation types and slope aspects, most sample sizes were larger than 30, except for the strata involving bamboo and shrub (Table 3). Thus, in this study, we did not develop models based on the stratification of slope aspects for bamboo and shrub, and the effects of sample size on the performance of the models for the other strata were ignored. In this study, pixel-level predictions were conducted partly because spatially explicit estimates are needed for advanced and digital forest inventory, monitoring, and management, and partly because the detailed spatial distributions of forest AGB estimates make it possible to identify the areas with smaller and larger values of biomass and the corresponding uncertainties from potential overestimations and underestimations. In particular, areas with greater estimates indicate a higher possibility of data saturation. If only the estimates for large or small areas are of interest, and not pixel-level predictions, combining post-stratification and spatial modeling, or using other synthetic or small area estimation methods, may be more feasible. In that case, data saturation analyses may be less important. On the other hand, the estimates obtained with the approach used in this study may be improved using some kind of calibration technique.
In addition, the stepwise regression approach used in this study tends to increase the risk of overfitting, that is, a model that accounts for random error or noise instead of the underlying relationship. When too many independent variables are involved relative to the number of observations, overfitting is very likely and the resulting model will perform poorly in making predictions. In this study, in the case of non-stratification, a large number of observations was used and overfitting is unlikely to have occurred. However, in the case of stratification of vegetation types and slope aspects, the number of observations for some strata, such as bamboo and shrub, was relatively small and overfitting probably occurred, which might have led to uncertainties in the estimates. Techniques that can be used to avoid overfitting include cross-validation, regularization, pruning, and model comparison. This issue should be examined in future research.
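As a hedged illustration of how cross-validation can expose overfitting in a small stratum, the sketch below compares the k-fold RMSE of a one-variable model with that of a model carrying 15 additional pure-noise variables. The stratum size of 25 plots, the number of noise variables, the function name `kfold_rmse`, and all data are synthetic assumptions and do not reproduce the models in this study.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Out-of-fold RMSE of an ordinary least squares model with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    sq_err = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        a_train = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
        a_test = np.column_stack([np.ones(len(test)), X[test]])
        sq_err.append((a_test @ beta - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

# A small hypothetical stratum: one informative variable plus 15 noise variables.
rng = np.random.default_rng(7)
n = 25
signal = rng.normal(size=n)
noise_vars = rng.normal(size=(n, 15))
agb = 60.0 + 15.0 * signal + rng.normal(scale=10.0, size=n)

rmse_one = kfold_rmse(signal, agb)
rmse_many = kfold_rmse(np.column_stack([signal, noise_vars]), agb)
print(rmse_one, rmse_many)
```

The larger model fits each training fold more closely but typically predicts the held-out plots worse, which is the behavior a stepwise procedure could be screened for in strata with few plots.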
5.3. Uncertainties Due to Sample Plots
In this study, field observations of AGB were collected from 20 m × 20 m sample plots, which were smaller than the 30 m × 30 m spatial resolution of the Landsat TM images. The small plot size tends to increase the coefficient of variation of AGB, consequently leading to potential underestimations or overestimations of forest AGB at the plot level and potential non-normal distribution of AGB at the landscape level [
76]. Unfortunately, the amount of uncertainty due to the small plots in this study could not be quantified. However, the analysis of histograms based on plot-level observations and landscape-level estimates of forest AGB showed that the distributions of AGB were close to normal. Moreover, although the central coordinates of the sample plots were used in this study to extract the values of image pixels, the small plot size and its inconsistency with the spatial resolution of Landsat TM images have probably introduced errors in plot geolocation and in matching plots with image pixels, and, thus, uncertainties in the forest AGB estimates [77]. This could become more serious when texture measures from windows of 3 × 3 pixels, 5 × 5 pixels, etc., are employed. Wang and Zhang [78,79] studied the uncertainty due to plot geolocation error and mismatch of sample plots with image pixels and found that, as the distance of the mismatch increased, the estimation accuracy of forest AGB or carbon density clearly decreased.
In reality, a plot size of 20 m × 20 m is commonly used in forest inventory, considering the workload required during fieldwork and the representativeness of the plot within a forest site. To reduce geolocation errors between sample plots and Landsat imagery, a window size of 3 × 3 pixels has often been used [6,30] to extract the remotely sensed mean values [30]. In this research, each sample plot was first examined to make sure it was located within the forest sites and represented the surveyed forest stand well. In addition to plot size and geolocation error, another critical factor is the use of allometric models to calculate AGB for each plot from field measurements [80]. Improper selection of the allometric models for specific tree species may produce high uncertainty in the AGB calculation at the plot level, thus affecting the AGB estimation performance using the remote sensing data.
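The 3 × 3 window extraction mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the raster is taken to be north-up with a known upper-left origin and 30 m pixels, and the function names, coordinates, and the small test array are hypothetical.

```python
import numpy as np

def plot_to_pixel(x, y, x_origin, y_origin, pixel_size=30.0):
    """Convert map coordinates to (row, col) for a north-up raster with an upper-left origin."""
    col = int((x - x_origin) // pixel_size)
    row = int((y_origin - y) // pixel_size)
    return row, col

def window_mean(band, row, col, window=3):
    """Mean of the window centred on (row, col), clipped at the image edges."""
    half = window // 2
    r0, r1 = max(row - half, 0), min(row + half + 1, band.shape[0])
    c0, c1 = max(col - half, 0), min(col + half + 1, band.shape[1])
    return float(np.mean(band[r0:r1, c0:c1]))

# Hypothetical 5 x 5 pixel band and a plot centre given in map units (m).
band = np.arange(25, dtype=float).reshape(5, 5)
row, col = plot_to_pixel(x=75.0, y=925.0, x_origin=0.0, y_origin=1000.0)
value = window_mean(band, row, col, window=3)
print(row, col, value)
```

Averaging over the window makes the extracted value less sensitive to a plot centre that falls in a neighboring pixel, at the cost of mixing in reflectance from outside the 20 m × 20 m plot.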