1. Introduction
The impacts of global climate change and human activities on natural ecosystems are intensifying. Forests, which play a crucial role in the Earth’s carbon cycle and ecological balance, have consequently become key subjects in global ecological research. The forest canopy is a fundamental characteristic of forest structure. Canopy height is not only a critical parameter for measuring aboveground biomass but also fundamental to forest ecosystem research on topics such as primary productivity, biodiversity, and carbon cycling [1]. Large-scale, high-resolution forest height data are vital for assessing global and regional forest carbon stocks as well as the carbon balance in terrestrial ecosystems [2,3,4]. The rich biodiversity and rapid ecological changes in tropical rainforest regions make accurate estimations of forest canopy height essential for forest resource management, climate change research, and carbon stock monitoring.
Remote-sensing technology is an effective tool for studying forest canopy height. Forest canopy height estimation relies primarily on various remote-sensing data sources, including optical imagery, light detection and ranging (LiDAR) data, and digital elevation models (DEMs). Traditional optical remote sensing has been used for forest height inversion [5,6]. However, due to signal saturation and its inability to obtain vertical structural information of the canopy directly, its estimation accuracy is relatively low [7]. Microwave remote sensing can penetrate the canopy and extract vertical structural parameters. However, microwave signals are easily affected by terrain and suffer from saturation issues, limiting their application in complex forest environments. In contrast, LiDAR can penetrate the forest canopy and accurately capture vertical structural information. As a result, LiDAR has become a core tool in forest height research and is now widely used to measure and model forest canopy height [8,9,10].
Spaceborne LiDAR has distinct advantages for large-scale forest height inversion analysis and mapping [11,12,13,14,15,16]. The ICESat-1 satellite, launched in 2003, carried the Geoscience Laser Altimeter System (GLAS), the first spaceborne laser altimeter to provide continuous global observations, laying the foundation for global forest height research [16]. However, due to its low laser point density, ICESat-1 had limited spatial resolution and a narrower range of applications. The ICESat-2 satellite, launched in 2018, utilizes photon-counting technology to achieve a higher laser point density and a smaller spot size (with spot intervals as low as 0.7 m), significantly enhancing the resolution of forest height inversion data [17]. The Global Ecosystem Dynamics Investigation (GEDI) system, installed on the International Space Station, employs full-waveform LiDAR technology to provide high-density point cloud data, greatly improving the accuracy of forest height measurements [18,19]. Since the data acquisition periods of ICESat-2/ATLAS and GEDI largely overlap, these two datasets can be integrated in a geographically complementary manner to increase the density of forest height sample points [19], providing unprecedented opportunities for large-area, high-resolution forest height mapping.
Ground-based LiDAR can perform submeter-level high-precision measurements and rapidly capture the 3D structures and spectral information of target objects. It penetrates vegetation, is non-destructive, and yields dense, high-resolution point data. These characteristics make it highly promising for inversion of vegetation phenotypes, biochemical parameters, and biomass [20,21,22,23,24,25]. In forest canopy height research, ground-based LiDAR is often used to validate and calibrate spaceborne LiDAR data. By comparing the forest canopy heights measured by satellite LiDAR with those obtained by ground LiDAR, the accuracy and reliability of satellite LiDAR in measuring forest canopy heights can be verified.
In recent years, researchers have increasingly combined multisource remote-sensing data with machine-learning algorithms to enhance the accuracy of forest canopy height estimations. Previous studies have demonstrated that integrating LiDAR data with environmental factors (e.g., terrain and climate) can significantly improve canopy height prediction accuracy. Lefsky et al. combined ICESat spaceborne LiDAR data with Shuttle Radar Topography Mission (SRTM) terrain data to develop a model for estimating forest canopy height. Their model explained 59–68% of the variance in measured forest canopy height across the study areas, with root mean square errors (RMSEs) ranging from 4.85 to 12.66 m [26]. Using machine-learning algorithms, such as random forest (RF), Pourshamsi et al. successfully estimated forest height by integrating LiDAR and PolSAR data. The model achieved an average R² value of 0.70 and an RMSE of 10 m [27].
Machine-learning algorithms, such as RF, gradient boosting decision trees (GBDT), and deep-learning methods, have also been employed to process remote-sensing data, improving canopy height estimation accuracy. Shah et al. trained convolutional neural network (CNN) models on Landsat satellite imagery, successfully enhancing forest canopy height estimation accuracy. The predicted mean absolute error was 3.092 m, the mean squared error was 0.8872 m, and the variance was 0.864 m [28]. Stojanova et al. utilized spaceborne LiDAR data and machine-learning algorithms to estimate vegetation height in Slovenian forests, finding that the integrated approach significantly outperformed single- and multi-target regression trees [29].
While significant progress has been made in forest canopy height inversion analysis, challenges such as data saturation and low estimation accuracy remain for large-scale, high-resolution forest height estimations. This study builds on previous research by combining several data sources and methods to address these issues: (1) two types of spaceborne LiDAR data (ICESat-2 and GEDI) were combined to compensate for the limitations of single data sources regarding spatial coverage and accuracy; (2) various environmental factors (e.g., slope, temperature, and precipitation) were integrated to enhance the accuracy and comprehensiveness of the estimation results; (3) vegetation indices (e.g., the normalized difference vegetation index [NDVI]) derived from Landsat imagery were incorporated to improve model accuracy and robustness; (4) tree height data obtained from portable 3D LiDAR scanning were used as a validation dataset to enhance the credibility and interpretability of the model results; and (5) four machine-learning algorithms (RF, backpropagation neural network [BP], CNN, and GBDT) were evaluated to identify the best-performing models for estimating forest canopy height in the Hainan Tropical Rainforest National Park, China.
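For readers unfamiliar with the vegetation index named in item (3), NDVI is the normalized difference between near-infrared and red reflectance. The following is a minimal sketch of that standard formula; the function name `ndvi` and the sample reflectance values are illustrative assumptions, not the processing chain used in this study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectance values.

    NDVI = (NIR - Red) / (NIR + Red). Returns None when both bands are
    zero, because the ratio is undefined there.
    """
    denominator = nir + red
    if denominator == 0:
        return None
    return (nir - red) / denominator


# Hypothetical surface reflectances for a densely vegetated pixel.
dense_canopy = ndvi(nir=0.45, red=0.05)
print(f"NDVI = {dense_canopy:.2f}")
```

Dense, healthy vegetation reflects strongly in the near-infrared and absorbs red light, so its NDVI approaches 1, whereas bare soil and water give values near or below 0.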
By integrating multi-modal remote-sensing data and employing machine-learning algorithms, such as RF, GBDT, CNN, and BP, this study aimed to monitor forest canopy height in the Hainan Tropical Rainforest National Park from 2003 to 2023 (the technical roadmap is shown in Figure 1). The main objectives were as follows: (1) to improve the accuracy and reliability of forest canopy height estimations in the Hainan Tropical Rainforest National Park; (2) to identify the most suitable model for this region by comparing various algorithms; and (3) to map the forest canopy height distribution from 2003 to 2023. This study provides technical support for forest resource management, carbon storage monitoring, and climate change research.
3. Results
3.1. Accuracy Validation of the Canopy Height Remote Sensing Estimation Model
During the model development process, four machine-learning algorithms underwent multiple rounds of testing and parameter tuning based on their performance in the training and testing stages to enhance the model’s generalization ability and accuracy. Accuracy assessment using the 20% testing dataset served as internal validation, evaluating the model’s performance on unseen data within the same dataset. Accuracy assessment using independent datasets, including 140 LiDAR-scanned plots and 315 UAV-derived canopy height validation points, served as external validation to verify the model’s generalization capability to entirely new data. Table 1 and Table 2 summarize the average accuracy results from these evaluations, reflecting the performance of each algorithm in internal and external validation across different percentiles (RH80, RH85, RH90, and RH95).
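The accuracy statistics reported in this section (R², RMSE, RRMSE, and bias) can be computed from paired observed and predicted heights. The sketch below assumes common definitions that the text does not spell out: R² as the coefficient of determination, RRMSE as RMSE divided by the mean observed height, and bias as the mean of predicted minus observed. The function name and the sample values are hypothetical.

```python
import math


def validation_metrics(observed, predicted):
    """Return R2, RMSE (m), RRMSE (%) and bias (m) for paired heights."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed heights must not all be identical")
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,       # assumed: coefficient of determination
        "rmse": rmse,
        "rrmse": 100.0 * rmse / mean_obs,  # assumed: relative to mean observed height
        "bias": sum(residuals) / n,        # assumed sign: predicted minus observed
    }


# Hypothetical plot heights in metres, not data from this study.
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 15.0, 15.0, 19.0]
metrics = validation_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Applying the same function to the 20% testing subset and to the independent LiDAR and UAV points would yield the internal and external figures, respectively, provided the definitions match those used by the authors.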
From an algorithmic perspective, RF exhibited the best performance among the four algorithms, demonstrating the highest internal and external accuracy. Specifically, RF outperformed the other algorithms in terms of R² values for both the training and testing sets. For the RH80, RH85, RH90, and RH95 percentiles, the R² values of the RF model on the testing set were 0.60, 0.56, 0.59, and 0.60, respectively. These results indicate that the RF model demonstrated strong fitting ability and accuracy. Additionally, RF had the lowest RMSE and RRMSE values on the testing set, particularly for RH80 and RH85, with RMSE values of 3.11 m and 3.31 m and RRMSE values of 21.36% and 21.28%, respectively. These results suggest high accuracy and low error in RF predictions. Although the accuracy of all algorithms decreased in the external validation, RF remained relatively stable in terms of bias and RMSE, particularly at the RH95 percentile, where the bias was 4.74 m and RMSE was 6.24 m. While the errors were higher, RF still outperformed the other algorithms in terms of stability and precision. Therefore, considering its overall performance and stability across different percentiles, RF was deemed the most suitable algorithm.
Compared to RF, CNN and GBDT showed moderate performances in canopy height prediction. At the RH90 and RH95 percentiles, CNN had a testing R² of 0.46 and an RRMSE of 21.34%, while GBDT had a testing R² of 0.49 and an RRMSE of 21.26%. These results were lower than those of RF but still better than those of BP. Prediction errors increased as the percentiles increased. At RH95, both CNN and GBDT showed greater uncertainty. The BP algorithm performed the worst, with the lowest R² values (0.44 at RH80 and 0.39 at RH90) and the highest RRMSE (25.79% at RH80 and 22.56% at RH90), highlighting its limitations in modeling the nonlinear complexity of canopy height. In external validation, CNN and GBDT had R² values ranging from 0.39 to 0.43, with higher RRMSE values than RF, indicating greater prediction errors. BP performed the worst again, with an R² of 0.38 at RH80 and an RRMSE of 39.63%, showing the lowest accuracy and the highest prediction error among the models.
Overall, while CNN and GBDT performed better than BP, they were less accurate and stable than RF, which demonstrated the highest precision and generalization ability in canopy height prediction.
Regarding percentile selection, RH80 was identified as the optimal choice for the prediction model. According to the data in Table 1 and Table 2, RF exhibited the optimal performance at RH80, with a testing set R² of 0.60, RMSE of 3.11 m, and RRMSE of 21.36%, indicating the highest precision at this percentile. As the percentiles increased (e.g., RH85, RH90, and RH95), prediction errors gradually increased, particularly in external validation, where bias and RMSE values increased significantly, indicating greater uncertainty in predictions at higher canopy heights. RH80 typically corresponds to lower canopy heights, which are more accurately estimated in remote-sensing data processing compared to higher canopy heights. Therefore, selecting RH80 reduced errors and enhanced the reliability and stability of the model. Additionally, RH80 is applicable across a wide range of ecological environments and forest types, providing more accurate and generalizable results for canopy height estimation in diverse regions. Hence, considering both accuracy and applicability, RH80 was determined to be the most suitable percentile.
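To illustrate why a lower percentile such as RH80 yields a lower height than RH95 for the same footprint, the sketch below computes height percentiles from a list of return heights. This is a deliberately simplified stand-in: it treats a footprint as discrete above-ground return heights and uses linear interpolation, whereas the GEDI and ICESat-2 products derive their relative-height metrics through their own processing. The function name and the sample heights are hypothetical.

```python
def relative_height(heights, percentile):
    """Height (m) at the given percentile of the returns (linear interpolation)."""
    if not heights:
        raise ValueError("heights must not be empty")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie between 0 and 100")
    ordered = sorted(heights)
    # Fractional position of the percentile within the sorted returns.
    rank = (len(ordered) - 1) * percentile / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical above-ground return heights (m) for one footprint: 0, 1, ..., 20.
returns = [float(h) for h in range(21)]
rh = {p: relative_height(returns, p) for p in (80, 85, 90, 95)}
for p, height in rh.items():
    print(f"RH{p}: {height:.1f} m")
```

Because the percentiles are taken from the same sorted distribution, RH80 can never exceed RH95; the higher percentiles track the sparse returns from the tallest crowns.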
To comprehensively evaluate the performance of different percentiles (RH80, RH85, RH90, and RH95) in predicting canopy height in the Hainan Tropical Rainforest National Park, as well as the effectiveness of various machine-learning models (RF, CNN, GBDT, and BP), scatter plots for all models are provided in Appendix A.
3.2. Canopy Height Changes in the Hainan Tropical Rainforest National Park from 2003 to 2023
This study employed an RF-based canopy height estimation model for the RH80 percentile to predict canopy heights at 3,332,789 sample points (spaced at 30 m intervals) across Hainan Tropical Rainforest National Park from 2003 to 2023. The resulting 30 m spatial resolution canopy height distribution map is shown in Figure 4.
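As a rough check on what this sampling density implies, the area represented by the sample points can be worked out directly. The sketch assumes (this is not stated in the text) that each point stands for exactly one 30 m × 30 m cell; the variable names are illustrative.

```python
POINT_COUNT = 3_332_789  # sample points reported in the text
SPACING_M = 30           # point spacing in metres

# Assumption: each point represents one square cell of SPACING_M x SPACING_M.
cell_area_m2 = SPACING_M ** 2
total_area_km2 = POINT_COUNT * cell_area_m2 / 1_000_000

print(f"Area represented: {total_area_km2:.1f} km2")
```

Under that assumption the mapped points represent roughly 3000 km².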
The results indicated an overall increasing trend in rainforest canopy height during the study period, ranging from 2.95 to 22.02 m. In 2003, canopy heights were generally lower than in later years, reflecting the early stages of ecosystem recovery. By 2013, and especially in 2023, canopy heights had significantly increased, with most areas exceeding 20 m. This change highlights the substantial recovery of the tropical rainforest ecosystem over the past two decades, likely driven by a combination of natural resilience and conservation efforts in Hainan Province.
Further analysis revealed that areas with higher canopy heights were primarily concentrated in mountainous core protection zones with complex terrain and higher elevations, such as the central mountainous regions of various park divisions. These regions exhibited more extensive and contiguous high canopy coverage, suggesting a robust recovery closely linked to strict conservation policies and favorable natural conditions. In contrast, canopy heights in low-elevation and peripheral areas were generally lower and more fragmented, possibly constrained by environmental conditions and human activities. Notably, some areas exhibited relatively limited canopy height growth from 2008 to 2013, potentially due to climatic fluctuations and the pace of ecological recovery.
Figure 5 shows that the canopy height of different forest types exhibited a general increasing trend from 2003 to 2023, reflecting forest growth and recovery. The most significant increase occurred between 2003 and 2013, followed by a slower growth rate after 2013. Among these forest types, tropical lowland rainforests and tropical seasonal forests exhibited the fastest growth, with mean canopy heights increasing from 13.02 m to 14.51 m and from 13.55 m to 15.03 m, respectively, indicating strong recovery potential. In contrast, tropical montane cloud forests and tropical coniferous forests showed the least growth, with mean canopy heights increasing only slightly from 17.10 m to 17.37 m and from 15.04 m to 16.14 m, respectively, suggesting that these ecosystems may have reached a stable or mature stage. Tropical montane rainforests showed moderate growth, with the mean canopy height increasing from 16.54 m to 17.25 m, primarily before 2013.
The standard deviation (σ) across all forest types remained relatively stable, indicating that while canopy height increased, the internal height variability did not expand significantly. This suggests that forest recovery was a relatively uniform process rather than a localized surge in canopy height.
Overall, variations in tree height across forest types reflect significant differences in growth conditions and ecological recovery capacities. Canopy height changes from 2003 to 2023 highlight the recovery potential and spatial variability of the rainforest ecosystem, providing critical insights for future conservation and restoration initiatives. A more comprehensive analysis of canopy height dynamics will offer valuable scientific data for guiding targeted and refined protection measures and recovery strategies.
4. Discussion
4.1. Comparative Analysis of Four Machine Learning Algorithms for Forest Canopy Height Estimation
This section compares the performance of the four machine-learning algorithms used to build canopy height estimation models for the Hainan Tropical Rainforest National Park.
For internal validation, the RF algorithm achieved a training set R² of 0.71 and a test set R² of 0.60 (RH80), indicating high fitting accuracy for both the training and test sets. The external validation results also highlighted the robust performance of the RF model, with an R² of 0.45 and an RRMSE of 33.05%. These results demonstrated that RF outperformed the other algorithms when predicting new data, aligning with findings from several related studies. Ghosh et al. estimated the canopy height of the Bhitarkanika Mangrove Reserve in India and reported that the RF model achieved an internal validation RMSE of 1.57 m and an R² of 0.60, demonstrating the strong adaptability of RF across different geographic regions and ecosystems [34]. Peng et al. applied the RF algorithm to estimate five types of forest canopy structures using data from 60 tropical forest plots in Hainan Province. They found that the RF algorithm exhibited relatively low RRMSE values (10.60–27.44%), further confirming its reliability for canopy height estimation in tropical rainforests [35].
Although the CNN algorithm exhibited superior performance in terms of external validation bias (bias = 1.06 m, relative bias = 8.16%), indicating its potential in capturing nonlinear relationships and complex patterns, its RRMSE of 36.77% was higher than that of the RF model, which performed better overall. CNN is advantageous because of its ability to extract spatial features from high-resolution remote-sensing imagery; however, its adaptability to different forest types has certain limitations. Shah et al. used a CNN algorithm to estimate forest canopy height in the Coconino National Forest region using Landsat images and found that the CNN model performed well in this area, with a mean absolute error of 3.092 m, a mean squared error of 0.8872 m, and a variance of 0.864 m. These results indicated that while the CNN algorithm was effective for canopy height estimation in specific regions, it was less stable than the RF algorithm when applied to different environments and vegetation types [28].
In contrast, the BP algorithm exhibited the lowest performance in this study, with R² values of 0.44 in internal validation and 0.38 in external validation, along with an RRMSE of 39.63%, indicating its shortcomings in handling complex nonlinear problems and its inability to effectively capture patterns in canopy height changes. The GBDT algorithm performed slightly worse than RF, with an R² of 0.49 and an RRMSE of 24.67% on the test set. However, it still demonstrated some predictive ability, particularly in analyzing canopy height across different environments, where its performance remained relatively stable.
To further validate the superiority of the RF algorithm, we compared the accuracy results of this study with those of other relevant studies. The RF algorithm demonstrated strong adaptability across different locations and vegetation types, which aligns with previous studies, highlighting its robustness in accurately estimating canopy heights in diverse ecological environments. Jin et al. highlighted the transferability of the RF algorithm for estimating canopy height across different locations and vegetation types. They found that the RF model exhibited high accuracy, with R² > 0.6 and RMSE < 6 m, demonstrating the reliability of RF in large-scale canopy height estimation across diverse geographic regions and vegetation types [36]. Fayad et al. used ICESat/GLAS LiDAR waveform data and SRTM DEM to study canopy height in the tropical forests of French Guiana. Their results showed that the RF algorithm provided the highest estimation accuracy, with an RMSE of 3.4 m, outperforming multiple linear regression and principal component analysis [37]. Ghosh et al. applied the RF algorithm in combination with multiple remote-sensing data sources, including GEDI LiDAR, SAR backscatter, terrain, and canopy density, to estimate canopy height in different forest types in India. Their study also demonstrated that RF, when integrated with LiDAR and multisource data, could achieve high prediction accuracy, particularly in areas with complex terrain, further confirming its strong adaptability [38].
Overall, the RF algorithm exhibited excellent performance in this study and has been widely applied in other research. Therefore, its superior accuracy in estimating the canopy height of the Hainan tropical rainforest further validates its broad adaptability for remote-sensing applications.
4.2. Research on Forest Canopy Height Estimation in the Hainan Tropical Rainforest National Park
This study provides new scientific evidence for estimating canopy height and assessing ecological restoration in Hainan’s tropical rainforest through long-term time-series analysis, multi-modal data integration, and the comprehensive application of environmental factors.
The study monitored forest canopy height in the Hainan Tropical Rainforest National Park over 20 years (2003–2023). The results revealed spatial and temporal trends in canopy height and provided valuable historical data for studying tropical rainforest ecological restoration. The findings indicated an overall increasing trend in canopy height from 2003 to 2023. In particular, from 2013 to 2023, following the implementation of a series of ecological protection measures, canopy height increased significantly. These results align with the analysis by Zhong et al. of ecological restoration projects in Hainan Province, highlighting the critical role of ecological protection and natural recovery in forest restoration [39]. The variation in canopy height within tropical forests may be influenced by several factors, including climate conditions, elevation, species characteristics, and the implementation of conservation policies. The canopy height of lowland rainforests shows a stable increasing trend, indicating that under favorable water and heat conditions and strengthened conservation measures, lowland rainforests can maintain high productivity [40]. In contrast, canopy height in montane cloud forests remained nearly stable, likely constrained by lower temperatures and limited nutrient availability at higher elevations. The significant canopy height growth observed in seasonal forests may be attributed to species adaptability and ecosystem resilience under climate fluctuations. Seasonal forests exhibit strong growth potential and carbon sequestration capacity in favorable climates, which aligns with the findings of Chave et al. (2014) [41]. The growth trend in montane rainforests further supports their strong recovery ability under relatively balanced water and heat conditions at mid-to-high elevations. Conversely, coniferous forests, which occupy relatively small areas, have shown limited canopy height growth, as many of them had reached maturity or overmaturity by the time of the survey. This long-term analysis enhances our understanding of the restoration process of canopy height and its relationship with environmental changes.
This study integrated GEDI and ICESat-2 ATLAS spaceborne LiDAR data to overcome the limitations of single data sources in spatial coverage and accuracy. The high-density footprint points from the combined datasets enhanced the resolution of forest canopy height estimation while reducing errors and striping effects commonly observed in traditional remote-sensing methods. The advantage of multi-modal data integration enabled more accurate canopy height estimation in Hainan’s complex tropical rainforest environment, particularly in high-canopy areas. Compared to the global forest canopy height map developed by Potapov et al. using GEDI and Landsat data, this study improved spatial continuity and avoided underestimation in high-canopy regions, achieving greater accuracy and spatial stability [14].
The “Third Law of Geography”, proposed by Zhu et al., states that the more similar the geographic conditions of two locations, the more similar their target characteristics (here, forest canopy characteristics such as canopy height) tend to be, irrespective of spatial proximity alone. This theory provides a new perspective for studying forest canopy height by emphasizing the influence of geographic environment, climate conditions, and vegetation types, beyond spatial distance alone [42]. Environmental factors effectively capture canopy differences under varying geographic conditions, enhancing model generalization and accuracy. Thus, this study incorporated environmental factors such as elevation, slope, aspect, rainfall, temperature, and NDVI to optimize the canopy height estimation model. Future studies may include additional environmental factors to further improve model adaptability.
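The idea that environmental similarity, rather than distance, carries predictive information can be illustrated with a similarity-weighted average. The sketch below is only an illustration of that principle and is not the method used in this study, which supplied the environmental factors to machine-learning models. The Gaussian similarity function, the limiting-factor (minimum) rule, the scale values, and all sample data are assumptions made for the example.

```python
import math


def similarity(env_a, env_b, scales):
    """Environmental similarity in [0, 1] between two locations.

    Each covariate contributes a Gaussian similarity; the overall value is
    the minimum across covariates (an assumed limiting-factor rule).
    """
    per_factor = [
        math.exp(-(((a - b) / s) ** 2) / 2.0)
        for a, b, s in zip(env_a, env_b, scales)
    ]
    return min(per_factor)


def predict_height(target_env, samples, scales):
    """Similarity-weighted mean of sampled canopy heights (m)."""
    weights = [similarity(target_env, env, scales) for env, _ in samples]
    total = sum(weights)
    if total == 0:
        raise ValueError("no sample is similar enough to the target")
    return sum(w * h for w, (_, h) in zip(weights, samples)) / total


# Hypothetical samples: (elevation m, slope deg, temperature C) -> height m.
samples = [
    ((500.0, 10.0, 24.0), 15.0),
    ((1200.0, 25.0, 18.0), 18.0),
]
scales = (100.0, 5.0, 2.0)  # assumed tolerance per covariate

same_as_first = predict_height((500.0, 10.0, 24.0), samples, scales)
halfway = predict_height((850.0, 17.5, 21.0), samples, scales)
print(f"{same_as_first:.2f} m, {halfway:.2f} m")
```

A location whose environment matches the first sample inherits essentially that sample's height, while a location equally dissimilar to both samples receives their average, regardless of how far apart the points lie on the map.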
In this study, modeling was performed for different percentiles of the data, with RH80 achieving the optimal balance in internal and external validations. The internal validation RRMSE for RH80 was 21.36%, while the external validation RRMSE was 33.05%. These results indicated that RH80 was highly stable and reliable for predicting overall canopy height changes in the Hainan Tropical Rainforest National Park. The strong performance of RH80 was likely due to its sensitivity to the mid-height canopy, which typically forms the forest’s core structure. This core structure encompasses most of the photosynthetic biomass and plays a crucial role in essential ecological functions, such as maintaining species diversity and facilitating carbon absorption. Asner et al. emphasized that in tropical forests, the mid-height canopy region (similar to RH80) contains most of the functional leaf area index and productive tree species, making RH80 a key variable for large-scale ecological studies [43].
Although RH80 effectively represents overall changes in forest canopy height, higher percentiles (RH90 and RH95) have unique research value in specific contexts. Higher percentiles focus on the extreme canopy height values, particularly the distribution of tall trees, which is essential for studying forest carbon storage, biomass, and structural complexity. In this study, the RF model achieved an internal validation R² of 0.60 and a low RRMSE of 18.06% for RH95, indicating high accuracy in capturing the heights of large individual trees. This finding supports Lefsky et al., who concluded that higher percentiles more accurately reflect the tallest structural features of forest canopies, making them important indicators for studying tropical rainforest carbon storage and biomass. High-percentile metrics capture the vertical structural complexity of forests, which is crucial for understanding ecosystem stability and resilience. Using high-percentile metrics such as RH95 provides a more accurate representation of tree growth trends [44]. Therefore, high-percentile metrics such as RH90 and RH95 are indispensable for studies focusing on tall tree distribution, carbon storage assessment, or ecological extremes. In this study, RH80 was primarily selected for dynamic canopy height prediction due to its balance of accuracy and applicability, and results for RH85, RH90, and RH95 were retained for reference in future studies. Future research could dynamically adjust percentile selection based on ecological contexts to explore the applicability of different percentiles. Additionally, integrating ground sample data with high-resolution remote-sensing data could further validate and optimize percentile selection methods, providing tailored percentile indicators for various research objectives.
This study also enhanced model credibility and stability by incorporating external validation using portable 3D LiDAR data. This validation framework offers a novel solution to the challenge of validating large-scale forest canopy height estimates with low-resolution data. Nonetheless, the limitation of this study lies in the relatively limited spatial coverage of the validation data. Future research could improve canopy height estimation accuracy by collecting data from diverse regions and elevations.
5. Conclusions
This study developed a model to estimate canopy height in the Hainan Tropical Rainforest National Park using multi-modal remote-sensing data and machine-learning algorithms. It accurately estimated canopy height from 2003 to 2023. By integrating GEDI and ICESat-2 ATLAS satellite LiDAR data, the study enhanced spatial accuracy and mitigated signal saturation issues found in traditional optical remote sensing. Environmental factors such as elevation, slope, aspect, temperature, precipitation, and NDVI were also incorporated to optimize the model, revealing trends in canopy height changes over the past two decades.
While the study produced valuable results, it also had some limitations. In areas with low vegetation cover or complex terrain, estimation accuracy was lower. The portable 3D LiDAR data had limited spatial coverage; although it improved validation accuracy, it did not fully represent all regions. Expanding data collection could enhance accuracy. Additionally, the study did not account for geolocation biases in the GEDI and ICESat-2 ATLAS LiDAR footprints.
This study provided a high-accuracy method for estimating canopy height in the Hainan Tropical Rainforest National Park. It demonstrated the potential of combining remote-sensing data with machine learning for monitoring tropical rainforests. Future research should incorporate more validation data and develop improved methods for processing raw GEDI and ICESat-2 ATLAS data to further enhance canopy height estimation accuracy.