1. Introduction
With population growth, economic development, and other driving factors, land cover information has been identified as a key data component in many aspects of global change research and environmental applications [1,2]. Large-scale land cover classification and mapping provides a data source for much global change research and is an important input to global change models (such as net primary productivity models, ecosystem metabolism models, and carbon cycle models); most global change models must be supported by land cover information over large areas [3,4]. At the same time, the speed and magnitude of land change vary continuously over time and space. Two maps spanning a long interval lack the corresponding process information, whereas a long time series land cover dataset can capture the complexity of surface change [5] while quantifying it. Therefore, long time series land cover classification data are of great significance for land change monitoring [6], identification, and planning assessment [7,8].
With the development of remote sensing technology, the quantity and quality of remote sensing data are constantly increasing, and these data can be used for around-the-clock monitoring of geographic areas [9,10]. Large-scale, long time series land cover classification studies can therefore be carried out well. Many countries and international organizations have used different image processing techniques and data, such as Landsat, SPOT, Advanced Very High Resolution Radiometer (AVHRR) and Moderate Resolution Imaging Spectroradiometer (MODIS) data, to conduct land cover research at the regional, intercontinental and global scales, and global and regional land cover classification products have been produced by several countries and institutions [11,12].
There are still some shortcomings in the existing classification datasets. First, current global land cover datasets are mostly concentrated after 2000; classification data before 2000, especially temporally continuous products, are scarce due to the lack of continuous time series data [13,14]. In addition, land cover classification studies generally use surface reflectance data [15,16] or vegetation indices (VIs) derived from reflectance data [13,17]; the data categories are relatively limited and lack surface feature information. Different land types have different reflectance characteristics, which is the basis for using reflectance in land cover classification [18], but the spectral similarity of vegetation causes confusion between categories when reflectance alone is used for vegetation classification. One solution is to use time series data for land cover classification, which has been adopted by many researchers and has yielded good results [19,20]. Over the past three decades, satellite observations have produced a large number of time series remote sensing products, making it possible to obtain long time series land cover classifications dating back well before 2000, for example from the Global LAnd Surface Satellite (GLASS) products [21]. GLASS generates a range of geophysical, physical and chemical parameters using multi-source data and multi-algorithm integration. Many quantitative remote sensing products have been produced, such as albedo [22], evapotranspiration (ET) [23], leaf area index (LAI) [24], gross primary productivity (GPP) [25], and fraction of absorbed photosynthetically active radiation (FAPAR) [26]. These products cover the years 1982–2017 at 1–5 km spatial and 8-day temporal resolution. They not only contain a large amount of land surface information, such as vegetation cover, photosynthetic absorption, surface reflection, radiation emission, latent heat flux and biomass, but can also capture the interannual variation of surface features. Therefore, they are highly suitable for long time series land cover classification.
Traditional classification methods are divided into supervised and unsupervised classification. Unsupervised classification is represented by clustering algorithms such as ISODATA [27] and K-Means [29], while supervised classification includes Maximum Likelihood Classification [28] and, increasingly, machine learning algorithms. For example, Gopal et al. used an ANN to build Fuzzy ARTMAP [30], Boles et al. used the ISODATA algorithm to obtain a Temperate East Asia classification map [31], Homer et al. completed NLCD 2001 using decision trees [32], and Carrão et al. used the SVM algorithm to produce land classification maps [33]. The algorithms used in existing land cover classification products are relatively complex or require considerable manual intervention and manual feature extraction [34,35], and they still need a large amount of reference data [36]. In recent years, deep learning [37,38] has demonstrated the excellent performance of neural network models in remote sensing classification [39,40,41]. A typical application is the use of convolutional neural networks (CNNs) for land cover classification [42,43]. However, CNNs are better suited to remotely sensed images with strong spatial correlation: they cannot process time series information, and each attribute is treated as an independent input during classification. For quantitative remote sensing products with strong temporal correlation, this approach offers no advantage.
The key to long time series land cover classification is determining how to make full use of the rich seasonal patterns and ordering relationships in time series for the classification task. Recurrent neural networks (RNNs), especially long short-term memory (LSTM) networks [44], capture temporal correlation very well [45]. RNNs are therefore generally considered a good machine learning method for time series and land cover change studies [46,47]. Existing research has shown that the LSTM model achieves higher accuracy for time series classification than a CNN, and far better accuracy than an SVM [48]. The main goal of this study is to establish a long time series classification data extraction model based on deep learning, using long time series quantitative remote sensing products as the model input. By comparing the traditional methods with the LSTM model, we propose a Bi-LSTM-based land cover classification method for multi-temporal remote sensing classification. Furthermore, learning from imbalanced training data is a common problem [49]. Sample imbalance is common in remote sensing land cover classification, where rare categories may have far fewer samples than large categories. Studies have shown that balanced datasets have a positive impact on classification results [50]. SMOTE [51] can synthesize new samples by interpolation. Since the SMOTE algorithm may excessively increase the order of magnitude of rare samples, we made some algorithmic improvements: the 10 categories were divided into three layers by sample magnitude, and SMOTE was used for upsampling within each layer. By combining SMOTE with stratified sampling in this way, the land cover classification map generated by the model is closer to the actual situation.
This study is divided into six parts. After the introduction in Section 1, the data used in this time series classification model and related information are introduced in Section 2. Section 3 describes the deep learning model architecture used for classification and its optimization. Section 4 then summarizes and analyzes the results of this study, and evaluates the accuracy of the model and the accuracy and reliability of the temporal trends. Section 5 presents the views of this study and discusses our results. Finally, Section 6 concludes the paper.
3. Methodology
3.1. Long Short-Term Memory Networks
Time series data are data collected at different points in time; they reflect the state or the changes of some factor or phenomenon over time. The LSTM is a kind of neural network used for processing sequence data. Typically, a neural network contains an input layer, one or several hidden layers and an output layer [63]. The output is controlled by activation functions, which are specified in advance, and the layers are connected by weights. An ordinary feedforward network only establishes weighted connections between layers; the biggest difference of the LSTM is that recurrent connections are also established between neurons within a layer. A typical LSTM schematic is shown in Figure 2.
The sigmoid activation function [64] is represented by δ in Figure 2, and the calculation expressions of the three gates, the hidden layer output h_t and the state update C_t are as follows:

f_t = δ(W_f · [h_{t−1}, x_t] + b_f)
i_t = δ(W_i · [h_{t−1}, x_t] + b_i)
o_t = δ(W_o · [h_{t−1}, x_t] + b_o)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t
h_t = o_t ⊗ tanh(C_t)

It can be seen that the inputs of the three gates are h_{t−1} and x_t, and each gate has its own weights and bias. These parameters are tuned as the training process continues and play a role in the calculation of the state update and the hidden layer output.
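The gate computations described above can be sketched in a few lines of NumPy. This is an illustrative single-time-step forward pass, not the paper's implementation; the weight and variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W and b hold the weights/biases of the
    forget, input and output gates and the candidate state, in that
    order. Illustrative sketch only."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f = sigmoid(W[0] @ z + b[0])          # forget gate
    i = sigmoid(W[1] @ z + b[1])          # input gate
    o = sigmoid(W[2] @ z + b[2])          # output gate
    c_tilde = np.tanh(W[3] @ z + b[3])    # candidate state
    c_t = f * c_prev + i * c_tilde        # state update C_t
    h_t = o * np.tanh(c_t)                # hidden layer output h_t
    return h_t, c_t
```

Because the gates are bounded in (0, 1) and tanh in (−1, 1), the hidden output stays in (−1, 1) at every step, which keeps long sequences numerically stable.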
3.2. Time Series Classification Based on the Bi-LSTM
A multi-layer LSTM in two opposite directions [65] is used in this study to classify multiple land cover categories through spatio-temporal data processing. The LSTM-based land cover classification model is shown in Figure 3, where the classification model is expanded in chronological order. The output layer of the original LSTM network is removed, and the hidden layer outputs (h_1, h_2, …, h_{T−1}, h_T) are fed into an average pooling layer to obtain a vector h without time information [66]. This LSTM network, with the output layer removed and an average pooling layer added, is called the Forward LSTM. At the same time, we also use a reversed version of the LSTM called the Backward LSTM. The two structures are basically the same, except that the Backward LSTM requires the input data in reverse time order and produces its own pooled output vector. The Forward LSTM infers information from preceding time steps, while the Backward LSTM infers preceding information from subsequent time steps. The Bi-LSTM is a combination of the Forward and Backward LSTMs, and this two-way information extraction is helpful for classification. Existing experiments have shown that the classification performance of the Bi-LSTM is better than that of the LSTM [67].
The land cover classification model takes the time series of multivariate quantitative remote sensing products as input, and its output is the model's estimate of the category label. Note that the output of an ordinary RNN or LSTM network is a sequence of the same length as the input, which is obviously inconsistent with this research problem; changes must therefore be made to the LSTM model to adapt its output to the land cover classification problem.
To perform the classification task, we construct a deep architecture with multiple LSTM layers stacked together, which allows the extraction of high-level nonlinear temporal features from remote sensing time series. The proposed architecture is similar to a CNN that stacks several convolutional layers. Since the LSTM itself does not perform class prediction, a softmax layer is added on top of the LSTM network to perform multi-category prediction, following previous studies [68]. The category corresponding to the largest value among the softmax neurons is the final prediction.
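The pooling-and-softmax head described above can be sketched as follows, assuming the forward and backward hidden-state matrices have already been computed by the two LSTM stacks; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def bilstm_classify(H_fwd, H_bwd, W_out, b_out):
    """Average-pool the hidden states of the Forward and Backward LSTMs
    over time, concatenate the two pooled vectors, and apply the softmax
    layer; the argmax is the predicted class.
    Shapes: H_* is (T, n_hidden), W_out is (n_classes, 2 * n_hidden)."""
    h = np.concatenate([H_fwd.mean(axis=0), H_bwd.mean(axis=0)])  # time info removed
    p = softmax(W_out @ h + b_out)
    return int(np.argmax(p)), p
```

The average pooling makes the classifier insensitive to sequence length, which is what lets a per-pixel time series of any number of 8-day composites map to a single category label.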
3.3. Time Series Percentage Grading Classification Extraction Model Based on the Bi-LSTM
After obtaining the long time series land cover classification product set, this study investigates the extraction of long time series land cover category percentages. The land cover percentage grading model takes the multivariate quantitative remote sensing product time series and the land cover classification data as input, and its output is the model's estimate of the percentage level of the category. The structure of the model is basically the same as that of the LSTM-based land cover classification model; the difference is that the land cover data are added to the input as a new feature, and the output is changed to the estimated percentage level of the category. Since we need the proportions of the different categories in each pixel, a percentage level extraction model must be established for each category. The specific approach is to extract the spatial distribution of all levels of a given category, as shown in Figure 4.
The pixels at the different levels of the category are then extracted as model input samples. Since the percentage values are discretized, the fitting task can be simplified into a classification task. During training, this study uses the level of the category as the label and uses the land cover data and quantitative remote sensing products as features. After training, the category level of a pixel is predicted by inputting the quantitative remote sensing products and land cover data. In other words, fitting a continuous percentage value for a category is converted into a classification task over 6 discrete percentage levels. Therefore, the land cover percentage grading model can be built directly on the framework of the land cover classification model, with the land cover percentage grading data and quantitative remote sensing products as inputs; the model is then constructed and trained. For the 10 categories, a total of 10 land cover percentage grading models must be established.
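The discretization step above can be sketched as a small helper. The paper does not specify the exact bin edges of the 6 levels, so equal-width bins over 0–100% are an assumption made here purely for illustration.

```python
def percentage_to_level(p, n_levels=6):
    """Map a continuous category percentage (0-100) to one of n_levels
    discrete grades, turning the fitting problem into a classification
    task. Equal-width bins are an assumption, not the paper's edges."""
    if not 0.0 <= p <= 100.0:
        raise ValueError("percentage must lie in [0, 100]")
    level = int(p // (100.0 / n_levels))
    return min(level, n_levels - 1)   # p == 100 falls into the top grade
```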
3.4. Stratified SMOTE Algorithm
The differing magnitudes of training samples have always been an unavoidable problem affecting the accuracy of deep learning models. To address the imbalance of land cover samples, the traditional approach is to increase the number of small-class samples by upsampling, forcing the small classes to the same order of magnitude as the large classes. A common method is the SMOTE algorithm, which, in a nutshell, generates new minority-class samples by interpolation [51]. The idea of the algorithm is as follows: for each minority-class sample x, the Euclidean distances to all samples in the minority sample set are calculated to obtain its k nearest neighbors, and several samples are randomly selected from these neighbors. Assuming x̃ is a selected nearest neighbor, a new sample is constructed from the original sample according to Formula (6):

x_new = x + rand(0, 1) × (x̃ − x)
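The interpolation scheme of Formula (6) can be sketched as follows. This is a minimal illustration of the standard SMOTE idea, with names of our own choosing, not the paper's implementation.

```python
import numpy as np

def smote_upsample(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: for each synthetic sample, pick a random
    minority sample x, find its k nearest minority neighbors by
    Euclidean distance, pick one of them x_nn, and interpolate
    x_new = x + rand(0, 1) * (x_nn - x), as in Formula (6)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    new_samples = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip x itself
        x_nn = X_min[rng.choice(neighbors)]
        new_samples.append(x + rng.random() * (x_nn - x))
    return np.vstack(new_samples)
```

Because each synthetic point lies on a segment between two existing minority samples, the new samples never leave the per-feature range of the original minority set.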
However, the sample sizes of the different land cover categories vary greatly. For example, the cultivated land category can reach the order of 300,000 samples, while the urban category has only about 6000. When the SMOTE algorithm is applied directly, the urban category is increased from 6000 to 300,000 samples, of which 294,000 are synthesized by SMOTE. The original characteristics of the urban category are thus very likely to be lost, resulting in inaccurate classification results [69].
To mitigate the classification inaccuracy caused by this excessive sample increase, we used a sampling method called the stratified SMOTE algorithm to calibrate the samples [70]. The sample sets obtained with this method and with the traditional SMOTE algorithm were each input into the model for training, and the prediction results were compared. The stratified SMOTE algorithm yields better results than the traditional SMOTE algorithm, since the number of small-class samples is increased only within a reasonable range. This result is discussed later.
The specific operation of the stratified SMOTE algorithm is to first count the number of samples in each category and then divide the categories into three layers according to sample magnitude. The first layer contains farmland, grassland, woodland and bare land; the second layer contains shrubs; and the third layer contains wetland, water bodies, tundra, city, and ice and snow. The SMOTE algorithm is then applied within each of the three layers, yielding the sample set for model training.
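The effect of the layering can be sketched as a target-count computation: within each layer, minority classes are only upsampled to the largest class count in that same layer. The dictionary layout and the example counts below are illustrative assumptions, not the paper's data.

```python
def stratified_smote_targets(counts, layers):
    """Target sample counts for the stratified SMOTE scheme: each class
    is upsampled to the largest class count within its own layer, so a
    small class is never forced up to the magnitude of a class in
    another layer. `layers` maps a layer name to its category names."""
    targets = {}
    for classes in layers.values():
        cap = max(counts[c] for c in classes)
        for c in classes:
            targets[c] = cap
    return targets
```

With, say, 300,000 farmland samples in the first layer and 6000 city samples in the third layer, the city class is upsampled only to the size of the largest third-layer class rather than to 300,000.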
3.5. Evaluation
This study used the confusion matrix and the overall accuracy on the test set to evaluate the accuracy of the model and the classification results. The confusion matrix gives the classification accuracy of each category and allows the accuracies of different models to be compared against the overall accuracy. This study also used the F1-score [71] as an indicator of the model's classification ability, taking both the user's accuracy and the producer's (mapping) accuracy of the classification model into account:

F1 = 2 × P × U / (P + U)

Here, F1 is the F1-score of a single category, P is the producer's (mapping) accuracy from the classification confusion matrix, and U is the user's accuracy from the confusion matrix. The F1-score used in this study is the average F1-score obtained by simply averaging the F1-scores of all categories.
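The averaged F1-score defined above can be computed directly from a confusion matrix, as sketched below (an illustrative helper, assuming every class appears at least once among both the reference and predicted labels so that no division by zero occurs).

```python
import numpy as np

def macro_f1(cm):
    """Average F1-score from a confusion matrix (rows = reference
    classes, columns = predicted classes). The producer's (mapping)
    accuracy P is the per-class recall, the user's accuracy U is the
    per-class precision, and F1 = 2PU / (P + U), averaged over classes."""
    cm = np.asarray(cm, dtype=float)
    P = np.diag(cm) / cm.sum(axis=1)   # producer's accuracy (recall)
    U = np.diag(cm) / cm.sum(axis=0)   # user's accuracy (precision)
    return float(np.mean(2 * P * U / (P + U)))
```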
Finally, the standard deviation of accuracy across years is used as an index of the stability and applicability of the model in different years. A key concern for a long time series land cover classification model is that it maintain a high and stable classification accuracy across the classification results of different years, and the standard deviation of accuracy provides this information.
6. Conclusions
A long time series land cover classification deep learning model based on the Bi-LSTM is proposed. This model uses the CNLUCC China land cover classification data to form a set of long time series land cover classification samples and establishes the correspondence between the land cover types and the time series of quantitative remote sensing products. A 0.05° long time series land cover deep learning classification model for China was then trained, and its accuracy was evaluated.
The LSTM model has advantages over other models in processing time series data, since it preserves the temporal information of the data for time series classification. One dataset was used to train and compare the accuracies of three deep learning models: the CNN, LSTM and Bi-LSTM. The Bi-LSTM model achieved the highest accuracy under consistent parameters and was therefore chosen as the basic model for the long time series land cover model.
In this study, the long time series land cover classification sample set was randomly shuffled and input into the Bi-LSTM model for training. The training process was monitored visually; since a deep learning model may begin to overfit after a certain number of training iterations, such monitoring is necessary. Training was stopped after the test accuracy of the model had stabilized and before it began to decrease. Finally, a long time series land cover classification model was obtained with an overall accuracy of 84.2%.
The CNLUCC China land cover classification data for 1980, 1990, 1995, 2000, 2005, 2010 and 2015 were used as a reference to evaluate the accuracy of the land cover classification data obtained for the same years. All 10 categories have accuracies above 82%, and 5 of them have accuracies above 90%. While ensuring the overall accuracy, we also evaluated the classification accuracies of different years for the same category. The results show that these accuracies are approximately equal, with small deviations: the errors of the 10 categories do not exceed 6%, and the accuracy error does not exceed 0.8%. Because the 5 km classification pixels contain a large amount of mixed-category information, we also extracted the information by percentage grading, and the overall classification accuracy of each category reached more than 85%, with 8 categories above 90%, which demonstrates the feasibility and reliability of the Bi-LSTM model for time series land cover classification.