1. Introduction
In 2020, the United Nations released the Global Forest Resources Assessment report, which stated that, since 1990, a staggering 178 million hectares of forest have been lost worldwide, either legally or illegally [1]. Continued forest loss will have a major impact on the global climate balance and hinder the achievement of carbon neutrality goals [2,3]. Developed countries and regions, such as Japan, the EU, the UK, and Canada, have set their own carbon neutrality deadlines, and developing countries, such as China, are also striving to reach carbon neutrality. Emission trading is now a common practice to help cap carbon emissions in many countries [4]. Therefore, accurate and timely information on forest change is essential for accurate carbon accounting and carbon neutrality.
Forest change is mainly driven by natural factors or human activities [5]. Natural factors include tree diseases, forest fires, parasites, and extreme weather such as floods or hurricanes [6]. Human activities also play an important role in deforestation [7], for example through farmland reclamation, infrastructure construction, mining activities, and urbanization [8]. Remote sensing imagery has important advantages, such as free-use policies and long historical archives, which have made it the main data source for monitoring forest change worldwide [9,10,11,12,13,14,15]. For example, Landsat images are so far the most widely used data source for monitoring deforestation because of their long historical archive (over 40 years) and especially the open-access, free-use policy in place since 2008 [16]. Based on Landsat images, excellent forest change studies have been carried out in tropical regions [17], temperate regions [11], and even at the global scale [4], and many classic algorithms have been proposed [17,18,19,20,21]. For instance, the CCDC (Continuous Change Detection and Classification) algorithm [22] can detect land-cover change by modeling the temporal trajectory of every pixel over a long period. However, CCDC is computationally heavy and slow. An improved method, S-CCD (Stochastic Continuous Change Detection) [23], was proposed to solve this problem: it relieves the computation burden by treating seasonal forest change as a stochastic process and introducing a mathematical tool called the “State Space Model” to detect changes. The existing methods perform well on imagery with 30 m or 10 m medium spatial resolution. However, after analyzing existing forest change products, researchers pointed out that the forest change areas reported by these methods show large uncertainty, with the relatively coarse spatial resolution of the imagery being the main factor in this discrepancy [24].
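The CCDC/S-CCD idea sketched above can be illustrated with a toy harmonic-model detector: fit a trend-plus-seasonality model to a pixel's reflectance time series, then flag a change when several consecutive observations deviate strongly from the model's prediction. The sketch below is only a minimal illustration of this principle, not the published algorithms; the window length, threshold, and synthetic NDVI series are arbitrary assumptions.

```python
import numpy as np

def fit_harmonic(t, y):
    """Least-squares fit of a trend + annual harmonic model (CCDC-style)."""
    X = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((X @ coef - y) ** 2))
    return coef, rmse

def detect_change(t, y, n_train=24, k=3, thresh=5.0):
    """Flag a change when k consecutive residuals exceed thresh * RMSE."""
    coef, rmse = fit_harmonic(t[:n_train], y[:n_train])
    run = 0
    for i in range(n_train, len(t)):
        pred = (coef[0] + coef[1] * t[i]
                + coef[2] * np.cos(2 * np.pi * t[i])
                + coef[3] * np.sin(2 * np.pi * t[i]))
        run = run + 1 if abs(y[i] - pred) > thresh * max(rmse, 1e-6) else 0
        if run >= k:
            return i - k + 1  # index of the first anomalous observation
    return None

# Synthetic NDVI-like series: stable seasonality, then an abrupt drop (clearing).
t = np.arange(48) / 12.0                     # 4 years of monthly observations
ndvi = 0.6 + 0.2 * np.sin(2 * np.pi * t) + 0.01 * np.random.RandomState(0).randn(48)
ndvi[36:] -= 0.4                             # simulated deforestation at month 36
print(detect_change(t, ndvi))                # -> 36
```

In this toy setting the detector recovers the break at month 36; the real algorithms additionally handle multiple spectral bands, iterative model updates, and gaps in the observation record.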
Recently, using deep learning methods to monitor deforestation on medium-resolution imagery, such as Landsat-8 [25] or Sentinel-2A/2B [26] images, has attracted much attention. For example, [12] used a ResUnet model to detect deforestation with Landsat-8 and Sentinel-2A/B imagery and demonstrated that the deep learning method outperforms traditional machine learning methods such as the Random Forest classifier. High-resolution imagery, such as Planet (3.7 m) [27] and Kompsat-3 (0.7 m) [28], has also been used to monitor deforestation with deep learning methods, with good detection accuracy. However, there is still a lack of high-quality deforestation training datasets for the community to use in training deep learning models. A high-quality training dataset is essential for training good deep learning models, but generating a large training dataset is very time-consuming and expensive.
Another factor affecting the accuracy of deforestation detection is the structure of the deep learning model. Although several deforestation detection models have been proposed, such as Unet [13] and DeepLabV3+ [29], improving the model structure still has the potential to improve detection accuracy. For example, the attention Unet [15] achieves better accuracy than Unet and other segmentation models. However, most existing deforestation detection models cannot maintain high-resolution semantic feature forwarding throughout the whole training process, which decreases detection accuracy on narrow objects and other complex regions [30]. In this manuscript, we propose a new high-resolution deforestation detection network, namely SiamHRnet-OCR, which shows better detection accuracy than existing models. The main advantage of SiamHRnet-OCR is that high-resolution feature forwarding is kept throughout all model layers.
The major contributions of this manuscript are as follows:
(1) A new deforestation training sample dataset was proposed, containing a total of 8330 true color samples (512 × 512 pixels) of 2 m spatial resolution. This dataset was generated by visual interpretation in 11 provinces of China’s Yangtze River Economic Zone, and it will be open-sourced to the community to help researchers worldwide to conduct deforestation detection studies.
(2) A new deforestation detection model, SiamHRnet-OCR, which effectively improves detection accuracy, especially for narrow objects and complex regions.
(3) The design principle of SiamHRnet-OCR can provide some new insights for other research fields, for example, road or building change detection.
Related Work
Deforestation detection based on deep learning methods is a hot research topic, and both optical-based and SAR-based deep learning methods have been proposed [13,14]. On the whole, most existing deforestation detection models are encoder–decoder structures, with the Unet style being the most commonly used. For example, ForestNet [14] was designed with an encoder–decoder structure to classify the drivers of primary forest loss in Indonesia, and the results showed that it outperformed the Random Forest method [31]. To alleviate the effect of clouds on optical remote sensing images, dense time-series Sentinel-1 SAR imagery was used with a simple Unet model to map forest harvesting in California, USA and Rondonia, Brazil [13]. In addition, the Siamese CNN (S-CNN) [32], the Pyramid Feature Extraction Unet model (PFE-UNet) [33], the DeepLabv3+ model [29], and a combined LSTM–CNN model [34] have also been used to detect deforestation. To extend the feature extraction ability of CNNs, the attention module [35] has also been investigated and shows better precision than pure CNN structures. For example, [15] proposed an attention-based Unet model to detect deforestation, and the results show that the attention Unet achieves higher accuracy than the Unet, Residual Unet, ResNet50-SegNet, and FCN32-VGG16 models on Sentinel-2 optical remote sensing images.
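To illustrate how such an attention module re-weights features, the sketch below implements an additive attention gate of the kind used in attention Unet variants: skip-connection features are multiplied by a spatially varying coefficient in (0, 1) so that irrelevant background responses are suppressed. Random matrices stand in for learned 1×1 convolutions; all shapes and names are illustrative assumptions, not the cited models' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate: alpha = sigmoid(psi(relu(W_x x + W_g g))).

    x        : skip-connection features, shape (C, H, W)
    g        : gating signal from a deeper layer, shape (C, H, W)
    W_x, W_g : (C_int, C) matrices standing in for 1x1 convolutions
    psi      : (1, C_int) projection to a single attention channel
    """
    C, H, Wd = x.shape
    xf = x.reshape(C, -1)                       # flatten spatial dims: (C, H*W)
    gf = g.reshape(C, -1)
    q = np.maximum(W_x @ xf + W_g @ gf, 0.0)    # ReLU of the additive term
    alpha = sigmoid(psi @ q).reshape(1, H, Wd)  # attention coefficients in (0, 1)
    return x * alpha                            # re-weighted skip features

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
g = rng.standard_normal((8, 16, 16))
W_x = rng.standard_normal((4, 8)) * 0.1
W_g = rng.standard_normal((4, 8)) * 0.1
psi = rng.standard_normal((1, 4)) * 0.1
out = attention_gate(x, g, W_x, W_g, psi)
print(out.shape)   # (8, 16, 16)
```

Because the coefficients are bounded in (0, 1), the gate can only attenuate responses, never amplify them, which is what lets it suppress background clutter while passing change-relevant activations through.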
Deforestation detection can be defined as a classical change detection task, and it can also be understood as an extension of pixel-level image classification [9]. Therefore, excellent change detection models designed for building change detection or other domains can provide new insights for deforestation detection, such as SiamFCN [36], Unet++ [37], STAnet [38], DTCDSCN [39], ESCNet [40], and SNUNet [41]. In these models, the feature extraction process contains three main steps: first, a backbone, such as ResNet [42] or MobileNet [43], is used to extract multi-scale low-level and high-level semantic features. Second, the multi-scale semantic features are fused by concatenation and skip-connection operations [44,45,46]. Finally, a loss function is used to guide the optimization of feature extraction. In this process, a critical question is how to design a reasonable deep learning architecture that acquires rich and effective semantic features of objects. However, the downsampling operators in most deep learning models lead to irreversible information loss [47], especially for pixel-level classification in remote sensing imagery. As a result, change detection accuracy may decrease, especially in boundary or pseudo-change regions. HRnet (High-Resolution Network), proposed by [30], has achieved state-of-the-art accuracy in semantic segmentation tasks on natural images. The main advantage of HRnet is that it can capture effective context features of small targets, such as tree trunks and traffic lights, because it delivers deep semantic features at high resolution during the whole feature extraction process. In remote sensing images, the clarity of objects is mainly determined by the spatial resolution; if high-resolution features can be kept throughout the semantic feature extraction process, then even insignificant spectral changes and slight texture changes can in theory be distinguished based on effective high-level semantic features.
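The effect of aggressive downsampling on slender objects can be made concrete with a small NumPy experiment, using average pooling as a crude stand-in for strided convolutions (the image sizes and the one-pixel-wide "road" are illustrative assumptions):

```python
import numpy as np

def block_downsample(mask, factor):
    """Average-pool a binary mask by `factor` (a crude stand-in for strided convs)."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

road = np.zeros((128, 128))
road[:, 60] = 1.0    # a 1-pixel-wide "road" (2 m wide at 2 m resolution)

# At 1/32 resolution (typical deepest stage of encoder-decoder models),
# the road contributes only 32 of the 1024 pixels in each 32x32 cell.
deep = block_downsample(road, 32)
# At 1/4 resolution (the highest-resolution branch that HRnet keeps),
# the road still contributes 4 of the 16 pixels in each 4x4 cell.
hr = block_downsample(road, 4)

print(deep.max(), hr.max())   # 0.03125 0.25
```

In this toy setting, the 1/32-resolution cells respond at only 1/32 of full strength to the road, while the 1/4-resolution cells still respond at 1/4, which suggests why slender changes survive in a high-resolution feature stream but tend to vanish in deeply downsampled ones.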
2. Study Area
We conducted experiments in two large regions in southern China because a recent study reported many deforestation hotspots there [48]. The bi-temporal images of the two study areas are shown in Figure 1.
The first study area was Hengyang City in Hunan Province, China, chosen as the main study area due to its diverse land cover types and lower proportion of urban area. Hengyang City is located in central-southern China, with a land area of approximately 2621 km². The region has a subtropical monsoon climate, and the terrain is mainly hilly. The major forest types are evergreen broad-leaved forest, deciduous broad-leaved forest, and evergreen coniferous forest; most forests are planted, with little original forest cover. Planted forests grow quickly and are usually harvested after 5–10 years to make furniture and hand tools. According to public government statistics, deforestation occurs frequently in this region.
The second study area is Qujing City in Yunnan Province. Qujing is located in southwestern China, with a subtropical plateau monsoon climate and a total area of approximately 28,900 km². Forest cover in this region consists mainly of primary forest. In recent years, China's policies of "Poverty Alleviation" and "Common Prosperity" have increased investment in this region; consequently, more and more infrastructure, such as highways and railways, has been built here. In addition, to improve the living conditions of the original inhabitants, much cultivated land has also been developed in this region, and as a result, deforestation in the region has become serious in recent years.
5. Discussion
The deforestation detection results in Hengyang City and Qujing City indicate that the boundaries of the change regions produced by SiamHRnet-OCR are satisfactory, which leads us to address the following questions: What is the feature extraction ability of SiamHRnet-OCR? What are the advantages of SiamHRnet-OCR versus other deep learning models? What is the advantage of the deforestation detection results of SiamHRnet-OCR over existing deforestation products? To answer these questions, we conducted qualitative and quantitative experimental analyses.
5.1. Feature Extraction Ability of the SiamHRnet-OCR
In this study, we proposed a deforestation detection model, SiamHRnet-OCR, to monitor deforestation using high-resolution remote sensing images. To answer the first question (What is the feature extraction ability of SiamHRnet-OCR?), we used a feature visualization method to aid understanding [39]. The deep feature extraction module, the deep feature fusion module, and the OCR refine module in the SiamHRnet-OCR model are visualized at different stages, as shown below.
In Figure 10, we can see how the features change across the different layers of SiamHRnet-OCR (each feature map is the strongest feature response in the corresponding layer). The figure clearly shows that, as the model layers deepen, the response to change information in deforestation regions becomes more and more obvious. Interestingly, during the feature extraction process from Stage 1 to Stage 4, the extracted features gradually gather in the change regions and are finally accurately located in the deforestation regions. From the feature fusion layer to the OCR refine layer, the feature response in the “pseudo-change” regions is largely reduced. This means that the high-level semantic features extracted by the OCR refine module have a positive effect on hard-to-classify regions, indicating that the SiamHRnet-OCR model has a strong feature extraction ability to capture the change signal of deforestation, even in areas with subtle changes.
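The "strongest feature response" reduction used for Figure 10 can be sketched as a per-pixel maximum over channels followed by min-max normalization for display; this is a common visualization recipe and an illustrative reconstruction, not necessarily the exact code used:

```python
import numpy as np

def strongest_response(features):
    """Collapse a (C, H, W) activation tensor to the strongest per-pixel
    channel response, then min-max normalize to [0, 1] for display."""
    resp = features.max(axis=0)                 # strongest channel at each pixel
    lo, hi = resp.min(), resp.max()
    return (resp - lo) / (hi - lo + 1e-8)       # epsilon guards constant maps

# Random activations stand in for a real layer's output.
feats = np.random.default_rng(1).standard_normal((64, 32, 32))
vis = strongest_response(feats)
print(vis.shape)   # (32, 32)
```

The resulting single-channel map can be rendered as a heat map; strong responses concentrated inside true change regions are what the qualitative analysis above reads off the visualization.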
5.2. Comparison with Other Change Detection Methods Based on Deep Learning
To answer the second question (What are the advantages of the SiamHRnet-OCR vs. other deep learning models?), we first discuss the feature extraction ability of different deep learning models for elongated objects. A newly constructed road in the forest was selected for a detailed comparative analysis.
In Figure 11, the deforestation detection results of the semantic segmentation models, including Unet, PSPnet, and DeeplabV3+, are relatively worse than those of the change detection models such as Unet++, STAnet, DTCDSCN, ESCNet, and SNUNet. The detailed comparison shows some commission alarms in Unet, PSPnet, and DeeplabV3+. Essentially, the semantic segmentation models stack the two temporal images into a single six-band image (each time-phase image has three bands); although this easily transforms the change detection task into a semantic segmentation task, the feature extraction ability of semantic segmentation models may be weaker than that of change detection models, because change detection models can explicitly extract the difference between the two time-phase images [36]. The depths of SiamFCN and Unet are relatively shallow, so their deforestation detection results are relatively worse, because the high-level semantic features they extract are not sufficient to describe the differences in complex scenes; [42] also demonstrated that deeper models usually achieve higher accuracy than shallower ones. In terms of the spatial resolution of high-level semantic features, in most existing semantic segmentation or change detection models, such as DeepLabV3+, PSPnet, and SNUnet, the spatial resolution of high-level semantic features is 1/32 of the original input images. For slender targets in remote sensing images, such as roads or rivers, high-level semantic features are therefore lost in the deep layers, and omission alarms for slender objects in the final detection result increase. However, the deforestation detection result produced by SiamHRnet-OCR indicates that it can accurately capture slender object change, because the spatial resolution of both low-level and high-level semantic features in SiamHRnet-OCR is always kept at 1/4 of the original input images. Such a spatial resolution is suitable for slender object detection, and the above detection result confirms that this model structure is effective.
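The two input-handling strategies compared above, early stacking versus Siamese encoding with differencing or concatenation fusion, can be sketched as follows; the toy shared-weight 1×1 "encoder" is an illustrative assumption standing in for a real backbone:

```python
import numpy as np

def encode(img, W):
    """Toy shared-weight 'encoder': a 1x1 convolution as a matrix product + ReLU."""
    C, H, Wd = img.shape
    return np.maximum(W @ img.reshape(C, -1), 0.0).reshape(-1, H, Wd)

rng = np.random.default_rng(0)
t1 = rng.random((3, 64, 64))    # image at time 1 (3 bands)
t2 = rng.random((3, 64, 64))    # image at time 2

# Early fusion (semantic segmentation style): stack into one 6-band input.
stacked = np.concatenate([t1, t2], axis=0)        # (6, 64, 64)

# Siamese fusion: encode each epoch with the SAME weights, then fuse.
W = rng.standard_normal((16, 3)) * 0.1
f1, f2 = encode(t1, W), encode(t2, W)
fused_diff = np.abs(f1 - f2)                      # differencing variant
fused_cat = np.concatenate([f1, f2], axis=0)      # concatenation variant

print(stacked.shape, fused_diff.shape, fused_cat.shape)
```

The differencing variant makes the temporal change explicit in the fused features (and halves the fused channel count relative to concatenation), while early stacking leaves the network to discover the temporal relationship on its own.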
In Figure 11, both SiamHRnet-OCR (concatenation) and SiamHRnet-OCR (differencing) show a good visual effect, and the difference between them is negligible. How, then, does SiamHRnet-OCR perform on other objects with irregular shapes? An example experiment is shown in Figure 12.
In Figure 12a,b, we can see that the spectral difference between the deforestation regions in the bi-temporal images is large, and the shape of the change region is irregular. As shown in the deforestation detection results of the different deep learning models in Figure 12d–o, some omission alarms are produced by the semantic segmentation models, such as Unet, PSPnet, and DeeplabV3+. It could be that simply stacking the bi-temporal images into a multi-band image interferes with high-level semantic feature generation [36]. However, this phenomenon also occurs in some change detection models, for example, Unet++ and DTCDSCN. This result indicates that not all change detection models can achieve excellent performance in monitoring deforestation with high-resolution imagery. In Figure 12j,m, both STAnet and SNUnet achieve relatively good results, but a few omission alarms are still produced at the boundaries of the change regions, especially in the “pseudo-change” regions. On the whole, visually, both SiamHRnet-OCR (concatenation) and SiamHRnet-OCR (differencing) achieve better results than all the other models. Additionally, the SiamHRnet-OCR (differencing) model seems to achieve a better visual effect than the SiamHRnet-OCR (concatenation) model, for example at the edges of deforestation regions.
We have qualitatively compared and analyzed the change detection results of the different models; we also quantitatively evaluated all the models using the accuracy evaluation metrics described in Section 3.4. The detection accuracies of the different models are shown in Table 11.
In Table 11, among all the deep learning models, the Precision, F1, and OA indicators of the SiamHRnet-OCR (differencing) model achieve the highest scores, while the F1 indicator of the SiamHRnet-OCR (concatenation) model is slightly lower. Moreover, the complexity comparison between the two variants shows that SiamHRnet-OCR (differencing) has fewer parameters and a faster inference speed than SiamHRnet-OCR (concatenation): its FLOPs are only 48.77% of those of SiamHRnet-OCR (concatenation), and its parameter count is 79.58% of that of SiamHRnet-OCR (concatenation).
Although the inference time of SiamHRnet-OCR (differencing) is slower than that of lightweight models such as Unet, SiamFCN, and SNUNet, the Precision, F1, and OA indicators show that it achieves higher accuracy; for instance, its F1 indicator is 3.0% higher than that of the SNUnet model. Moreover, compared with relatively heavyweight models, such as PSPnet, DeepLabV3+, and ESCNet, the SiamHRnet-OCR (differencing) model has a faster inference speed. In addition, our experiment confirms the finding of recent research [61] that keeping high-resolution feature forwarding during training is important for acquiring rich and useful contextual semantic information, which improves detection accuracy for slender objects and other complex objects.
5.3. Comparison with an Existing Forest Change Product
What is the advantage of the deforestation detection results of SiamHRnet-OCR over existing deforestation products? To answer this question, we selected Hengyang City for comparison. The highest-resolution forest change product currently available that covers large regions is GFC-V1.8 (Hansen Global Forest Change V1.8) [4], with a 30 m spatial resolution. To maintain temporal consistency between GFC-V1.8 and our result, we selected the 2019 global forest loss layer of GFC-V1.8 for comparison. The GFC-V1.8 and SiamHRnet-OCR deforestation detection results are shown in Figure 13.
As shown in sub-regions A and B in Figure 13, the deforestation boundaries detected by SiamHRnet-OCR are accurate and almost identical to the GT boundaries. Although the spatial resolution of GFC-V1.8 is 30 m, it can still locate forest change fairly accurately. However, GFC-V1.8 produces a few omission alarms in the central part of sub-region A, because this region was covered by weeds with high NDVI values in the earlier time-phase image, causing it to be incorrectly treated as forest cover. By contrast, SiamHRnet-OCR can effectively distinguish between grass and forest in 2 m high-resolution remote sensing images. In sub-region C, the GFC-V1.8 product did not detect deforestation, perhaps due to cloud cover or missing image data.
The quantitative accuracy comparison between SiamHRnet-OCR and GFC-V1.8 is given in Table 12. All four accuracy assessment indicators of the SiamHRnet-OCR deforestation detection result are higher than those of the GFC-V1.8 product; in particular, the F1 indicator of SiamHRnet-OCR is 40.75% higher than that of GFC-V1.8. In terms of spatial detail, the visual effect of our result is also superior. It is worth noting that the GFC-V1.8 product is derived from Landsat imagery with 30 m spatial resolution, so the comparison between SiamHRnet-OCR and GFC-V1.8 is not entirely fair, and it cannot simply be concluded that the SiamHRnet-OCR result is better than the GFC-V1.8 product. Nevertheless, the comparison further confirms that deep learning methods are a good choice for achieving high-precision deforestation detection with high-resolution remote sensing imagery.
Statistical analysis indicates that the forest loss area detected by GFC-V1.8 in Hengyang City is 6.05 km², which is significantly lower than the GT (9.43 km²), while the total deforestation area detected by SiamHRnet-OCR is 11.24 km², slightly larger than the GT. There are three possible reasons for this difference: (1) GFC-V1.8 was produced from 30 m Landsat imagery, a relatively coarse spatial resolution that may not be suitable for high-precision deforestation detection; (2) some deforestation regions with slight spectral change cannot be captured by GFC-V1.8; and (3) a few commission errors were produced by SiamHRnet-OCR, e.g., water changing into bare land was regarded as deforestation.
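For reference, the reported areas follow directly from change-pixel counts and the pixel footprint (4 m² per pixel at 2 m GSD, 900 m² at 30 m). The pixel counts in the sketch below are hypothetical values back-computed from the areas above, shown only to make the conversion explicit:

```python
# Convert change-pixel counts to areas at a given ground sampling distance (GSD).
def pixels_to_km2(n_pixels, gsd_m):
    """Each pixel covers gsd_m x gsd_m metres; 1 km^2 = 1e6 m^2."""
    return n_pixels * gsd_m ** 2 / 1e6

# Hypothetical pixel counts chosen to reproduce the reported areas.
print(pixels_to_km2(2_810_000, 2.0))   # SiamHRnet-OCR at 2 m  -> 11.24 km^2
print(pixels_to_km2(6_722, 30.0))      # GFC-V1.8 at 30 m      -> ~6.05 km^2
```

The conversion also shows why coarse pixels blur area estimates: a single 30 m pixel commits 900 m² at once, whereas 2 m pixels resolve the same area in 225 steps.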
5.4. Limitations
With the help of a large quantity of high-quality deforestation training samples, this study investigated deforestation detection with high-resolution imagery and demonstrated the feasibility and efficiency of the SiamHRnet-OCR model in deforestation detection tasks. However, there is still room for further improvement.
(1) On the newly proposed deforestation detection dataset, SiamHRnet-OCR achieved excellent performance; however, further experiments are needed to verify whether SiamHRnet-OCR remains the best model on other change detection training datasets.
(2) The SiamHRnet-OCR model can so far only be applied to bi-temporal image change detection, and the next step is to extend the trained deep learning model to long time-series deforestation detection tasks.
(3) The SiamHRnet-OCR model produced a few omission errors in regions covered by cloud and cloud shadow. Automatic or semi-automatic cloud and cloud shadow masking algorithms can be used as a pre-processing step to further improve detection accuracy [62].