Data-Driven Selection of Land Product Validation Station Based on Machine Learning

Li, Ruoxi; Tao, Zui; Zhou, Xiang; Lv, Tingting; Wang, Jin; Xie, Futai; Zhai, Mingjian

doi:10.3390/rs14040813

by Ruoxi Li ^1,2, Zui Tao ^1,*, Xiang Zhou ^1, Tingting Lv ^1, Jin Wang ^1, Futai Xie ^3 and Mingjian Zhai ^1,2

^1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
^2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
^3 Beijing Institute of Radio Measurement, Beijing 100854, China
^* Author to whom correspondence should be addressed.

Remote Sens. 2022, 14(4), 813; https://doi.org/10.3390/rs14040813

Submission received: 30 November 2021 / Revised: 29 January 2022 / Accepted: 7 February 2022 / Published: 9 February 2022

(This article belongs to the Section Urban Remote Sensing)

Abstract:

Validation is a crucial technique for strengthening the application capabilities of Earth observation satellite data and for resolving quality problems in remote-sensing products. Observing land-surface parameters in the field is one of the key steps of validation, so the demand for long-term, stable validation stations has gradually increased. However, the current location-selection procedure for validation stations lacks a systematic and objective evaluation system. In this research, a data-driven selection of a land product validation station (DSS-LPV) based on Machine Learning is proposed. First, we construct an evaluation indicator system in which all factors affecting the location of validation stations are divided into surface characteristics, atmospheric conditions and the social environment. Then, multi-scale evaluation grids are constructed and indicators are allocated for spatial evaluation. Finally, four Machine Learning (ML) methods are used to learn from established, reliable stations, and different data-driven scoring models are constructed to explore the intrinsic relationship between evaluation indicators and station locations. The reliability of DSS-LPV is validated using the example of China and its established national-level land product validation stations. After a comparison of the four ML models, the random forest (RF), which had the highest accuracy, was selected as the modeling method of DSS-LPV. The correlation between the regression value of test stations and the target value is 0.9133. The average score of test stations is 0.8304. The test stations are generally located within the calculated hot-spot area of the score density map, which means that the map is highly consistent with the locations of the built stations. The results indicate that DSS-LPV is an effective method that can provide a reasonable geographical distribution of stations. The location-selection results can provide scientific decision-making support for the construction of land product validation stations.

Keywords:

land product validation station; location selection; data-driven; machine learning

1. Introduction

Modern remote sensing, while providing radiance data, increasingly provides end users with a series of high-level standard data products [1]. With the promotion of openness, sharing, interconnection and other services [2], multi-source and multi-temporal quantitative remote-sensing products provide better data support for resource and environmental monitoring, global change and sustainable development [3]. The quality of remote-sensing products is the key factor limiting their application [4]. The importance of accurately evaluating remote-sensing products has been generally recognized [3,4,5,6,7,8]. According to the definition of the Working Group on Calibration and Validation (WGCV) of the Committee on Earth Observation Satellites (CEOS), validation refers to the process of independently evaluating accuracy and uncertainty through the comparative analysis of remote-sensing products and reference data (relative truth values) that can represent ground targets [5]. Validation is therefore an important means of addressing the quality problems of remote-sensing products.

Obtaining land-surface parameters from in situ field observations is one of the key steps of validation [1,9]. Field observations are primarily based on deploying sample points or constructing validation stations [10]. Deploying sample points on site is more flexible and covers a larger observation area, which facilitates the comprehensive application of multiple types of remote-sensing products. However, this method requires various measuring instruments and a large number of surveyors [11], and limitations of time and cost make it impossible to obtain long-term data. With the development of communication technology and the increasing demand for time-series data, long-term stable validation stations have emerged, and new observation methods have been developed, such as wireless sensor networks (WSN) [12] and footprint observation [13,14,15,16]. Validation stations provide multi-spatial and multi-temporal data for the optimization of remote-sensing inversion product models and for research on validation algorithms, and thus play an important role in promoting the development of validation [17].

As early as 1999, the National Aeronautics and Space Administration (NASA) constructed validation stations and flux towers to undertake ground validation of land cover (LC), leaf area index (LAI), photosynthetically active radiation (PAR), net primary productivity (NPP) and other products in the BigFoot Project [18]. Subsequently, at the beginning of the 21st century, the European Space Agency (ESA) launched the Validation of European Remote Sensing Instruments (VALERI) project to validate land remote-sensing products made by the Moderate-resolution Imaging Spectroradiometer (MODIS), VEGETATION, Medium Resolution Imaging Spectrometer (MERIS) and Advanced Very High Resolution Radiometer (AVHRR) sensors [19]. From 2007 to 2015, China carried out the Heihe Watershed Allied Telemetry Experimental Research (WATER/HiWATER) in the Heihe River Basin for “atmosphere-hydrology-ecology” [20,21,22,23,24] and constructed ChinaFLUX. Since 2015, with the implementation of the National Civil Space Infrastructure Medium and Long-term Development plan, the Chinese Academy of Sciences (CAS) has constructed the first batch of 48 validation stations for different underlying surfaces to support the establishment of a long-term, stable validation observation network. For the marine environment, the Chinese State Oceanic Administration constructed the Hangzhou global ocean system field scientific observation and research station, Argo. In terms of international cooperation, the “Belt and Road Initiative” is a scientific and technological innovation plan that promotes cooperation through capacity building among BRICS countries; the trend toward globalized joint validation is therefore increasing, providing a guarantee of quantitative, accurate and scientific information for multilateral sustainable development. The joint deployment of stations is the main way to obtain long-term, stable data on the ground [25], and it is also a key step in promoting joint validation among countries. Validation stations have great application prospects and are in high demand, which poses a challenge to current station-selection methods.

Location selection has been identified in many research fields as a planning problem [26,27,28,29,30,31], which is a multi-criteria decision-making problem that evaluates and selects schemes under the influence of multiple factors and criteria [32,33,34,35,36,37]. Whether it is a multi-criteria decision-making method of location selection or location search, decision makers, managers, and different stakeholders evaluate the schemes based on prior knowledge and interest preferences [32]. Similarly, the location selection of validation stations involves many factors such as surface characteristics, atmospheric conditions, and the social environment. Long-term experiments with a reasonable cost-effectiveness ratio are required. Currently, the popular location-selection method is to designate candidate locations based on prior knowledge such as remote-sensing images and GIS data, then compare and “select” the optimal from several specific and precise candidate locations, such as the Analytic Hierarchy Process (AHP) [38,39,40,41,42,43,44,45], Gray Relational Analysis (GRA) [46,47,48] and Fuzzy Comprehensive Evaluation (FCE) [49,50]. During the actual work of location selection for a validation station, there is usually no prior knowledge provided by experts to preselect candidate areas, which first requires a “search” for candidate areas from a large space, based on demand. In the process of building the scoring model for the candidate area, when the experts’ understanding and judgment are inconsistent or even contradictory, the results will be biased. Repeated field investigations and expert demonstrations make the station selection process very cumbersome. The evaluation process is affected by the prior knowledge of experts, which makes it difficult to reuse and promote the model. Therefore, it is necessary to establish the location-selection requirements, standards, and the principles of validation stations [19,25,51]. 
With its rapid development, Machine Learning has been widely applied in many fields of remote sensing [52,53,54,55,56,57], but few studies have employed Machine Learning regression for the location selection of a land-product validation station. Machine Learning is driven by existing data and supports location-selection decisions for a validation station by building a relatively objective model. It attempts to explain the inherent relationship between the location and the multi-element evaluation indicators and to simulate the knowledge-system model framework, yielding a data-driven, quantitative evaluation system for validation stations.

Overall, in accordance with the requirements of CEOS on location selection for calibration and validation, this research puts forward a data-driven selection of a land-product validation station (DSS-LPV) based on Machine Learning. To achieve this, we (1) constructed an evaluation indicator system for the location selection of validation stations; (2) performed spatial evaluation based on a multi-scale grid; and (3) constructed a data-driven scoring model based on Machine Learning.

2. Data

2.1. Evaluation Indicator

To support the proposed principles and requirements for the construction of a validation station, we selected the important available indicators from the evaluation indicator system, using the establishment of Grassland, Forest and Agricultural Stations in China as an example. These three categories of station are land-surface vegetation stations with similar requirements for surface features, atmospheric conditions and the social environment, so the same evaluation indicators are selected. The sources and descriptions of the evaluation indicator data are shown in Table 1.

2.2. Machine Learning Dataset

The reliability of the training set affects the accuracy of the model. For the problem of location selection for a validation station, it is necessary to find reliably constructed stations to serve as training samples. For instance, China has constructed a number of national-level field observation stations to better research the ecological environment and monitor the ecosystem. These stations are distributed in areas with typical ecological regions, climate types and surface cover. Moreover, a close cooperative relationship has been established with the global ecological research and monitoring network, which is an important part of the Global Environmental Monitoring System (GEMS). National field-observation stations provide comprehensive observation data for many research fields, including the validation of remote-sensing products. The locations of the network have been repeatedly scrutinized on-site by many scientific researchers. After a lengthy period of inspection, a large number of existing research results have shown that the locations are relatively reasonable and have a certain degree of universality. Therefore, national field-observation stations are well suited as training samples for station selection, as exemplified by their establishment in China.

This research selects three categories of national-level field observation stations, including 16 Agricultural Stations, 9 Forest Stations, and 6 Grassland Stations, totaling 31 as is shown in Figure 1. The blue points represent training stations, and the red points represent test stations. The location, category and the corresponding evaluation indicator parameters of each station constitute the Machine Learning dataset.

2.3. Data Preprocessing

The objective of preprocessing is to transform evaluation indicator data of different types and formats into parameters that can be calculated quantitatively. All data are converted to a unified geographic coordinate system and reference datum, and data assigned to the same layer of the grid are resampled to the same resolution. The aim of quantification is to process basic data into parameters that can support evaluation through reasonable spatial analysis (e.g., buffer analysis, calculation of Euclidean distance) and data management (e.g., data normalization, piecewise assignment). However, basic data have different attributes and dimensions. According to their attribute characteristics, the indicators can be classified as target, spatial, temporal, and binary. The target indicators are determined by the goal and type of station. For example, if a forest station is to be constructed in Sichuan, China, the designated values of the target indicators are “Sichuan China” and “Forest”. The spatial indicators represent spatial information about the station, such as “traffic accessibility”, “surface uniformity” and “average aerosol depth”. The temporal indicators represent time information about the station, such as “annual average number of sunny days” and “observing instrument running time”. The binary indicators dictate whether the station is feasible. Only yes or no cases exist, so these indicators take only the values 0 and 1; examples are “high-prone natural disaster areas” and “natural protection areas”.

Classifying the indicators by attribute makes them easier to quantify. For evaluation indicators whose impact on the validation station can be clearly judged, such as cloud cover, precipitation and road distance, the attribute value can be substituted directly into the model. For the other evaluation indicators, whose impact cannot be clearly judged, indirect assignment methods are required. For example, validation stations cannot be constructed around cities; however, if the station is too far from an urban area, transportation may be inconvenient and materials may be scarce. For such evaluation indicators, it is necessary to set up a buffer zone around the urban area. Inside the buffer, the area is ignored. Outside the buffer, the closer a location is to the urban area, the higher the assigned value. The data-preprocessing method for the evaluation indicators is shown in Table 2.
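
The buffer-based assignment described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the 5 km buffer, the 100 km cut-off and the linear decay outside the buffer are all assumptions made for the example, and the settings actually used are those given in Table 2.

```python
from typing import Optional


def urban_distance_value(distance_km: float,
                         buffer_km: float = 5.0,
                         max_km: float = 100.0) -> Optional[float]:
    """Piecewise assignment for the urban-distance indicator.

    buffer_km and max_km are hypothetical values chosen for illustration.
    """
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    if distance_km <= buffer_km:
        # Inside the urban buffer: the cell is ignored (no value assigned).
        return None
    if distance_km >= max_km:
        # At or beyond the assumed cut-off: lowest value.
        return 0.0
    # Outside the buffer: the value falls linearly with distance (assumed form).
    return (max_km - distance_km) / (max_km - buffer_km)


for d in (2.0, 24.0, 100.0):
    print(d, urban_distance_value(d))
```

Returning `None` inside the buffer keeps ignored cells distinct from cells that are merely scored low, which matters when the values are later normalized.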

3. Methods

In this research, we propose a data-driven location-selection technique for validation stations. First, the principles of location selection for validation stations are put forward, the influencing factors are divided into surface characteristics, atmospheric conditions and the social environment, and a quantitative evaluation indicator system is constructed. Second, spatial evaluation based on a multi-scale grid is performed. The evaluation indicators are allocated according to the goals and application requirements of each layer of the grid. Through the calculation and selection of the first two layers of the grid, the candidate areas suitable for station construction are identified. Third, a data-driven scoring model for the candidate areas is built by applying Machine Learning to national-level field stations that have already been constructed. The data-driven objective scoring model can calculate reasonable hotspots for building stations. The DSS-LPV solves the problems caused by fuzzy qualitative evaluation and reduces the error caused by differences in experts’ prior knowledge. The scoring results can be used for further scientific evaluation, for analyzing the construction scenarios of the station selection area, and for providing a decision-making basis for the selection of the validation station. A flow chart is presented in Figure 2.

3.1. Constructing the Evaluation Indicator System

There are many factors involved in the location selection of a validation station. This research proposes the following basic principles based on the characteristics of the validation station. The first principle is “Representativeness”. The validation station should be located in a typical land cover under different climatic conditions or geographical features. The second is “Feasibility”, including the accessibility of the experimental area and the suitability of the field conditions. Accessibility refers to whether the location can be approached, and suitability refers to meeting the requirements of long-term stable observation. The third is “Convenience”. Convenience means that the validation station should be located in an area with a sufficient power supply, comprehensive logistical support, and convenient accommodation and transportation for surveyors. The fourth principle is “Combination optimization”. Potentially, the factors involved are divided into three categories: surface characteristics, atmospheric conditions and social environment. The validation-station-evaluation indicator system is an organic whole composed of several individual evaluation indicators and weighted values. Therefore, the refined indicators of each category should be quantifiable, non-repetitive, and independent of each other [58].

Supported by the “National Civil Space Infrastructure Terrestrial Observation Satellite Common Application Support Platform” project, the CAS constructed the first batch of 48 validation stations. We deepened our understanding of the principles of location selection through repeated multi-industry investigations, demonstrations, discussions and project reviews. Specifically, the following requirements for the construction of stations are put forward via the following three categories: (1) Surface characteristics. Taking into account the provisions of the general experimental area, in order to expand the scope of application of the validation station, the area of validation station should preferably be 2 km × 2 km [11,25]; the validation station should be located in terrain with small relative elevation differences such as plains and basins; the observation object around the station is the typical land-cover type under the climatic conditions or geographical features; the location should be far away from a nature reserve or military base; the location should be far away from areas with a high incidence of natural disasters such as earthquakes, volcanoes, and mudslides. (2) Atmospheric conditions. The photographing quality of optical remote sensing satellites is affected by atmospheric conditions. To ensure the effective acquisition of imagery during satellites in orbit, it is required that the station location possesses proper lighting conditions and atmospheric visibility. (3) Social environment. A high level of manpower and material resources will be required in the construction of a validation station, conducting the experiment, and the subsequent maintenance. So, the station should be located in an area with convenient transportation, sufficient power supply, and comprehensive logistics support. At the same time, regional development near the station will not cause major changes in ground features. 
According to the proposed principles and requirements for the construction of station, the quantitative indicators are refined and the evaluation indicator system is constructed.

3.2. Spatial Evaluation Based on Multi-Scale Grid

Multi-scale grid refers to multiple evaluation spaces with different spatial scales. The pixels in the space represent grids, and each grid has a corresponding evaluation indicator score and attributes. The input base map is the specified area required by the users, corresponding to the target indicators data in Section 2.3. Considering the area of input base map, the resolution of indicator data and the scale of the validation station, the evaluation grid is divided into three layers. The first layer dictates the type and goal of the station. The input base map includes large-scale geographic information data such as administrative boundary, climatic regionalization, topography and landform. The second layer screens out the areas where stations can be constructed. The input data is processed data used to evaluate the accessibility of the experimental area (such as seismic zones, nature reserves, airports) and the suitability of test conditions (such as atmospheric aerosol parameters, meteorological data). The third layer of the grid scores the candidate areas. Quantitative indicators suitable for the region (such as urban distance, road distance, population density) are selected from the validation-station evaluation indicator system constructed in Section 3.1 as input data to substitute into the scoring model, and the hot spots for the construction of stations are calculated. Then, the application capabilities of the results are evaluated, and auxiliary information related to station selection decisions is provided.
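
The second-layer screening can be illustrated with a small sketch that combines binary indicator grids. The convention that 1 marks an excluded cell, the function name and the toy masks are assumptions for illustration only; the paper states only that binary indicators take the values 0 and 1.

```python
from typing import List

Grid = List[List[int]]


def screen_candidates(exclusion_masks: List[Grid]) -> List[List[bool]]:
    """Return True for cells that no exclusion mask flags (all values 0)."""
    if not exclusion_masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(exclusion_masks[0]), len(exclusion_masks[0][0])
    for mask in exclusion_masks:
        if len(mask) != rows or any(len(row) != cols for row in mask):
            raise ValueError("all masks must share the same grid shape")
    return [
        [all(mask[r][c] == 0 for mask in exclusion_masks) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2 x 3 masks on the second-layer grid: 1 marks an excluded cell.
seismic_zone = [[0, 1, 0],
                [0, 0, 0]]
nature_reserve = [[0, 0, 0],
                  [1, 0, 0]]

candidates = screen_candidates([seismic_zone, nature_reserve])
print(candidates)
```

Only the cells that remain `True` would be passed on to the third layer for scoring.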

The scale of the grid is adjusted adaptively in proportion to the area of the input base map. Specifically, to ensure that the base map covers as many grid cells as possible, the cell size of the first layer is set to 1/20 of the minimum side length of the circumscribed rectangle of the base map, a value chosen through experiments on several areas; that is, the base map is covered by at least 20 × 20 cells. The cell size is reduced tenfold from layer to layer: the second layer is 1/10 of the first layer, and the third layer is 1/10 of the second layer. Considering the area of the validation station, the cell size of the third layer is kept within 1~2 km: if the computed size is larger than 2 km, it is set to 2 km; if it is less than 1 km, it is set to 1 km.
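
The sizing rule above can be written out as a short sketch. The function name and the example rectangles are hypothetical. The text does not say whether the first two layers are recomputed when the third layer is clamped, so this sketch leaves them unchanged.

```python
from typing import Tuple


def grid_sizes_km(width_km: float, height_km: float) -> Tuple[float, float, float]:
    """Cell sizes (km) of the three grid layers for a bounding rectangle."""
    if width_km <= 0 or height_km <= 0:
        raise ValueError("rectangle sides must be positive")
    first = min(width_km, height_km) / 20.0      # at least 20 x 20 cells
    second = first / 10.0                        # tenfold reduction
    third = min(max(second / 10.0, 1.0), 2.0)    # clamp to the 1-2 km range
    return first, second, third


print(grid_sizes_km(4000.0, 3000.0))
print(grid_sizes_km(800.0, 500.0))
```

For a 4000 km × 3000 km rectangle the three layers come out at 150 km, 15 km and 1.5 km. For an 800 km × 500 km rectangle the unclamped third layer would be 0.25 km, so it is raised to 1 km.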

The significance of multi-scale grid spatial evaluation involves two aspects. On the one hand, a large amount of high-resolution indicators data will lead to low calculation efficiency when the area of the input base map is large, such as the boundary of provincial and municipal administrative region, certain type of climatic division or topography, which is not conducive to method optimization. “Searching” candidate areas usually involves many indicators in a large region. The higher the resolution of indicators data, the lower the calculation efficiency; the lower resolution, the rougher the screening results. Therefore, the construction of an adaptive multi-scale grid can balance the relationship between the calculation efficiency of multi-source data and the accuracy of the result. On the other hand, multi-scale grid facilitates the normalization and standardization of multi-source data. Typically, there are large differences in the spatial resolution of a variety of remote sensing products or geographic information data. For example, the global population density raster data from 2000 to 2020 provided by the WorldPop website has a spatial resolution of 1 km (https://www.worldpop.org/geodata/ (accessed on 6 April 2021)), while the global land-cover product provided by the Zenodo website has a spatial resolution of 30 m (https://zenodo.org/record/ (accessed on 8 January 2021)). A better matching station construction scope and a reduction of the impact of scale conversion on the results is desired, so it is necessary to allocate the indicators to different scales.

3.3. Constructing the Data-Driven Scoring Model Based on Machine Learning

The first two layers of the grid are used for range judgments, and their indicators are applied independently. The third layer is a comprehensive judgment that requires quantitative calculation. This research uses Random Forest (RF) and three other Machine Learning methods that are popular for small samples: the Least Squares (LS) regression model, the Support Vector Machine (SVM) and Artificial Neural Networks (ANNs). Models built with these four methods were applied to the third-layer candidate regions. Of the Machine Learning dataset, 70% constitutes the training sample and 30% the test sample. The indicator parameters of each training station are the input, and the station category is the output. As elaborated in Section 2.2, agricultural stations are designated as category 1, forest stations as category 2, and grassland stations as category 3, so the standard values are 1, 2 and 3. The regression value calculated by the Machine Learning model is then compared with the corresponding standard value, which indicates whether the pixel is suitable as the location for such a station. The parameters of the models were set as follows. The LS model was iteratively fitted 10,000 times, and the combination with the smallest fitting error and the highest test accuracy was selected. The SVM model uses the widely used Sigmoid kernel function [59]. For the ANNs, a BP neural network was selected, that is, a multi-layer perceptron trained by back propagation using gradient descent with momentum and an adaptive learning rate [60]. The number of hidden layers is 10. As for the RF model, the RF algorithm proposed by Leo Breiman and Adele Cutler in 2001 has been regarded as one of the most precise prediction methods for classification and regression [61], as it can model complex interactions among input variables and is relatively robust to outliers. Furthermore, it is not sensitive to noise or over-fitting [62,63]. Existing studies have shown that the number of decision trees has little effect on the results [62,63,64,65,66,67], so the default of 500 trees is feasible. The model accuracy of these four Machine Learning methods was compared and the best-performing method was selected. The data-driven method was used to mine the internal connections between the evaluation indicators that affect the location of the station, learn the characteristics of the data, and obtain a reasonable and objective data model.
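
As a rough illustration of how the dataset is organized before modeling, the sketch below encodes the station categories as regression targets and draws a 70/30 split. The paper does not say how the split was drawn, so the seeded random shuffle, the function name and the placeholder station names are assumptions; only the category codes and the station counts come from the text. The model fitting itself is not shown.

```python
import random
from typing import List, Tuple

# Station categories used as regression targets.
CATEGORY = {"agricultural": 1, "forest": 2, "grassland": 3}


def split_stations(stations: List[Tuple[str, int]],
                   train_fraction: float = 0.7,
                   seed: int = 0) -> Tuple[list, list]:
    """Shuffle the stations and split them into training and test samples."""
    shuffled = stations[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder names; only the counts per category come from Section 2.2.
stations = (
    [(f"agricultural_{i}", CATEGORY["agricultural"]) for i in range(16)]
    + [(f"forest_{i}", CATEGORY["forest"]) for i in range(9)]
    + [(f"grassland_{i}", CATEGORY["grassland"]) for i in range(6)]
)
train, test = split_stations(stations)
print(len(train), len(test))
```

With 31 stations, rounding 70% gives 22 training stations and 9 test stations. A split stratified by category would be a reasonable alternative for such a small sample.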

3.4. Evaluation Approach

3.4.1. Correlation Evaluation

Spearman’s correlation coefficient was used to evaluate the relationship between the indicators and the station locations in the third-layer grid. With a small sample size, the data do not necessarily follow a normal distribution, whereas the multi-parameter correlation analysis used here requires normally distributed data. Therefore, the Kolmogorov–Smirnov (K-S) test was performed first. When the probability p is less than the significance level α, the null hypothesis is rejected; otherwise, it is accepted [68]. The significance level α in this research is 0.05, and the null hypothesis is that the parameter conforms to a normal distribution. A normal transformation was applied to an indicator when p < 0.05. After the normality test, Spearman’s correlation coefficient was calculated, as defined by Formula (1) [69]:

\rho = \frac{\sum_{i=1}^{N} (x_{mi} - \bar{x}_{m}) (x_{ni} - \bar{x}_{n})}{\sqrt{\sum_{i=1}^{N} (x_{mi} - \bar{x}_{m})^{2} \sum_{i=1}^{N} (x_{ni} - \bar{x}_{n})^{2}}} \qquad (1)

where N is the number of data, x_{mi} is the rank of the score of the i-th individual, x_{ni} is the rank of the evaluation indicator value of the i-th individual, and \bar{x}_{m} and \bar{x}_{n} are the corresponding mean ranks. Generally, in statistics, |\rho| ≤ 0.2 indicates that the correlation is relatively weak, 0.2 to 0.6 is moderately correlated, and 0.6 to 0.8 is strongly correlated [70].
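
Formula (1) amounts to a Pearson correlation computed on ranks. The sketch below uses only the Python standard library; the function names, the average-rank treatment of ties and the sample values are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Formula (1) applied to the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    if den == 0:
        raise ValueError("correlation is undefined for constant input")
    return num / den


# Hypothetical scores and road distances (km) for four grid cells.
scores = [0.9, 0.7, 0.8, 0.6]
road_distance = [1.0, 4.0, 2.5, 6.0]
rho = spearman_rho(scores, road_distance)
print(rho)
```

In this toy example the score decreases strictly as road distance increases, so the coefficient is -1.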

3.4.2. Percentage Deviation

The deviation of the evaluation indicator parameters from the average of the training stations was used to reflect the impact of the evaluation indicators on the score, as given in Formula (2):

Deviation = \frac{I_{Station} - \frac{\sum_{i=1}^{n} I_{Train}}{n}}{\frac{\sum_{i=1}^{n} I_{Train}}{n}} \times 100\% \qquad (2)

where n is the number of training stations, I_{Train} is the value of the indicator at a training station, and I_{Station} is the value of the indicator at the station being evaluated.
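
Formula (2) can be sketched in a few lines. The function name and the sample values are hypothetical, and the zero-mean check is an added safeguard that the paper does not discuss.

```python
from typing import Sequence


def percentage_deviation(station_value: float,
                         training_values: Sequence[float]) -> float:
    """Formula (2): deviation from the training-station mean, in percent."""
    if not training_values:
        raise ValueError("need at least one training value")
    mean = sum(training_values) / len(training_values)
    if mean == 0:
        raise ValueError("deviation is undefined when the training mean is 0")
    return (station_value - mean) / mean * 100.0


# Hypothetical indicator values: training mean is 10, station value is 12.
print(percentage_deviation(12.0, [8.0, 10.0, 12.0]))
```

Here the station value lies 20% above the training mean. A negative result means the station value is below the mean.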

4. Results

In this section, we demonstrate location selection for validation stations using DSS-LPV, following the example of the establishment of Grassland, Forest and Agricultural Stations in China.

4.1. Comparison of DSS-LPV Models Based on Four Machine Learning Methods

The DSS-LPV modeling results of the four Machine Learning methods suitable for small samples are shown in Figure 3. The test accuracy of each model is lower than its training accuracy. The target is the standard value of the sample. The correlation between the regression value and the target value is 0.9368 for the RF training sample and 0.9134 for the RF test sample, which is significantly better than the other modeling methods. This is consistent with existing research showing that Random Forest can effectively prevent overfitting [62,63], and it indicates that Random Forest is the most suitable method for the current dataset. Therefore, the Random Forest method, which had the highest modeling accuracy, was used to establish the scoring model of this research.

The calculation efficiency of the complete code is shown in Table 3. The data volume in Table 3 refers to the original basic data. The indicator data were clipped, then resampled and normalized. The experimental environment is based on Arcpy2.7 under Windows 10; the processor is an Intel(R) Core(TM) i9-10900K CPU and the memory is 64 GB. The running time was measured on this configuration.

4.2. Analysis of DSS-LPV Model Based on Random Forest

The DSS-LPV comprehensively considers a variety of factors to construct an evaluation indicator system and quantify the station selection issue. To verify the rationality and reliability of the DSS-LPV, besides the correlation between the regression value and the target value, an evaluation of the model results in an independent position and the overall area is also necessary. Therefore, further analysis was required for the absolute accuracy of the model regression value and the relative accuracy of the score density map.

4.2.1. Accuracy Verification of DSS-LPV Model

Based on the results of Section 4.1, the Random Forest model was adopted. The various indicator parameters of the training station were input, and the category of station was output. Taking the target as the standard line, the deviation between the regression value of the model and the standard line represents the extent to which the pixel is suitable for constructing this category of validation station. Theoretically, the deviation is in the range of −1 to 1, and as the absolute value of the deviation is subtracted from 1 to obtain the score, the score is between 0 and 1. The smaller the deviation, the higher the score, meaning it is more suitable for constructing this category of station. The deviation and score of the training samples are shown in Figure 4.
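
The conversion from regression value to score described above is simple enough to sketch directly. The function name and the sample values are hypothetical, and clamping the score at 0 is an added safeguard for deviations outside the theoretical range of −1 to 1.

```python
def suitability_score(regression_value: float, target_category: int) -> float:
    """Score = 1 - |regression value - target|, limited to [0, 1]."""
    deviation = regression_value - target_category
    # Clamping is an assumption added for deviations outside [-1, 1].
    return max(0.0, 1.0 - abs(deviation))


# Hypothetical regression outputs for cells evaluated as forest (category 2).
for value in (2.0, 2.25, 1.5, 3.5):
    print(value, suitability_score(value, 2))
```

A regression value equal to the target gives a score of 1, and the score falls linearly as the regression value moves away from the target in either direction.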

As shown in Figure 4, the deviation of the regression results of the training stations is between 0 and 0.5, and the score is between 0.5 and 1 and mostly greater than 0.6. In total, 45% of the stations have scores higher than 0.8, and 27% are close to 0.9. The model provides high scores to the built training stations, indicating that the regression result of the training samples is good.

The scores of the test stations in Table 4 are clearly in the high-value range. The highest score was obtained for Gonggashan S., with a score of 0.9627, followed by Luancheng S. and Inner Mongolia S. The average score is 0.8304. Stations with scores in the top 20% account for 78%, and stations in the top 10% account for 44%. The model provides high scores to the built test stations, indicating that the judgment of the model is consistent with the built stations. The regression results of the training stations and the test stations represent the absolute accuracy of the DSS-LPV model. The high scores of the built stations verify the accuracy of the model at single point locations.
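The summary statistics of the test stations can be recomputed from the scores listed in Table 4, as in the sketch below. Reading "top 10%" as a score of at least 0.9 is an assumption made here for illustration; the variable names are hypothetical.

```python
# Scores of the nine test stations, copied from Table 4.
test_scores = {
    "Qianyanzhou": 0.8111,
    "Hailun": 0.6428,
    "Yanting": 0.9142,
    "Luancheng": 0.9170,
    "Gonggashan": 0.9627,
    "Xishuangbanna": 0.7495,
    "Dinghushan": 0.7179,
    "Inner Mongolia": 0.9163,
    "Guyuan": 0.8423,
}

average = sum(test_scores.values()) / len(test_scores)
best_station = max(test_scores, key=test_scores.get)
# Assumption: "top 10%" is interpreted as score >= 0.9.
share_top = sum(s >= 0.9 for s in test_scores.values()) / len(test_scores)

print(f"average = {average:.4f}")
print(f"highest = {best_station}")
print(f"share with score >= 0.9 = {share_top:.0%}")
```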

4.2.2. Correlation Analysis of Evaluation Indicators and Score in the Third-Layer Grid

For station selection, in addition to the results of the model, we also examined the statistical relationship between the evaluation indicators and the location of the validation stations. After verifying the accuracy of the model, the correlation of the parameters of the third-scale grid scoring model was further analyzed.

As shown in Table 5, all the indicators are clearly correlated with the location of the station. Among them, slope, population density and night light are negatively correlated, indicating that most of the field validation observation stations are built in sparsely populated areas with a gentle slope. Urban distance, road distance and altitude are positively correlated, indicating that most of the validation stations are built in areas with convenient transportation but not too close to urban areas. Road distance, urban distance and population density are the most strongly correlated with station location, indicating that these three parameters are the main indicators that affect the model.
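The reading of Table 5 given above can be reproduced with a short sketch that ranks the indicators by the absolute value of their correlation and separates them by sign. The values are copied from Table 5; the variable names are hypothetical.

```python
# Correlation coefficients from Table 5.
rho = {
    "Urban distance": 0.379,
    "Road distance": 0.592,
    "Slope": -0.248,
    "Altitude": 0.369,
    "Population density": -0.616,
    "Night light": -0.310,
}

# Rank indicators by the strength of the correlation, ignoring its sign.
ranked = sorted(rho, key=lambda name: abs(rho[name]), reverse=True)
negative = [name for name, value in rho.items() if value < 0]
positive = [name for name, value in rho.items() if value > 0]

print("strongest three:", ranked[:3])
print("negatively correlated:", negative)
print("positively correlated:", positive)
```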

The higher the correlation between an evaluation indicator and the location of the station, the more the model is affected by this indicator. Consequently, a deviation in such an indicator may have a greater impact on the score of the model. To verify this, the Hailun agricultural station, the Xishuangbanna forest station and the Guyuan grassland station, which have the lower scores among the test stations, were selected to compare the deviations of their evaluation indicator parameters from the average of the training stations, as shown in Table 6. Hailun S. and Guyuan S. indeed have their largest deviations in the three indicators of road distance, urban distance and population density, indicating that the correlation analysis has a certain degree of reliability and can explain the low scores of these stations. Among the indicators of Xishuangbanna S., the largest deviation is in altitude. The altitude of Xishuangbanna S. is 524 m, while the average altitude of the forest stations is 1017 m. However, the way in which the indicators influence the regression results of the DSS-LPV model is complicated. Resolving this question requires more data support and discussion, which will be detailed in follow-up work.
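The comparison based on Table 6 can be sketched by sorting each station's percentage deviations and reading off the largest ones. The values are copied from Table 6; the helper name `largest_deviations` is hypothetical.

```python
# Percentage deviations from the training-station average, copied from Table 6.
deviation_pct = {
    "Hailun": {"Urban distance": 46.80, "Road distance": 60.27, "Slope": 9.44,
               "Altitude": 11.25, "Population density": 18.73, "Night light": 15.16},
    "Xishuangbanna": {"Urban distance": 25.83, "Road distance": 28.05, "Slope": 7.18,
                      "Altitude": 40.54, "Population density": 6.65, "Night light": 11.76},
    "Guyuan": {"Urban distance": 15.39, "Road distance": 38.59, "Slope": 3.73,
               "Altitude": 7.43, "Population density": 12.74, "Night light": 10.37},
}

def largest_deviations(station, n=3):
    """Return the n indicators with the largest deviation for one station."""
    values = deviation_pct[station]
    return sorted(values, key=values.get, reverse=True)[:n]

for station in deviation_pct:
    print(station, largest_deviations(station))
```

For Hailun and Guyuan the three largest deviations fall on the same indicators that Table 5 identifies as most strongly correlated with station location, whereas for Xishuangbanna the largest deviation is in altitude.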

4.2.3. Reliability Analysis of DSS-LPV Model Based on Score Density Map

In addition to the scoring of the single point location, the establishment of a score-density map allows for the model’s application within a certain area. A multi-scale grid within the administrative area corresponding to each station was established, evaluation indicators and basic data were allocated, and candidate areas were screened out layer by layer. The built RF model scores each pixel in the candidate area, and a score density map was established to characterize the hot spots suitable for station building.
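The layer-by-layer screening described above can be sketched as a sequence of filters followed by a scoring step. This is only an illustration under stated assumptions: the cell fields (`land_cover`, `buildable`, `road_km`), the function names and the toy scoring function are hypothetical, and the toy function stands in for the trained Random Forest model rather than reproducing it.

```python
def screen_and_score(cells, land_cover, score_fn):
    """Screen grid cells layer by layer, then score the remaining cells."""
    # First layer: keep cells whose land cover matches the station category.
    layer1 = [c for c in cells if c["land_cover"] == land_cover]
    # Second layer: keep only cells flagged as buildable by the binary masks.
    layer2 = [c for c in layer1 if c["buildable"]]
    # Third layer: score each remaining cell with the scoring model.
    return {c["id"]: score_fn(c) for c in layer2}

# Hypothetical cells; "road_km" stands in for the third-layer indicators.
cells = [
    {"id": "a", "land_cover": "forest", "buildable": True, "road_km": 2.0},
    {"id": "b", "land_cover": "forest", "buildable": False, "road_km": 1.0},
    {"id": "c", "land_cover": "cropland", "buildable": True, "road_km": 0.5},
    {"id": "d", "land_cover": "forest", "buildable": True, "road_km": 8.0},
]

def toy_score(cell):
    # Placeholder for the Random Forest score; not the model of this study.
    return max(0.0, 1.0 - cell["road_km"] / 10.0)

scores = screen_and_score(cells, "forest", toy_score)
# Cells ordered from highest to lowest score, i.e. candidate hot spots first.
hot_spots = sorted(scores, key=scores.get, reverse=True)
print(hot_spots)
```

Filtering before scoring means that the most expensive step, the per-pixel scoring, is only applied to the candidate area that survives the first two layers.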

Figure 5 shows the score density map of the administrative areas where the test stations are located. Most of the test stations are graded A, some are between A and B, and Yanting Station is graded B. The high scores of the test stations indicate that the hot-spot area calculated by the DSS-LPV model is basically consistent with the location of the test stations, which verifies the accuracy of the model over the whole area. The area-density map constructed from the pixel scores reflects the areas suitable for station construction and provides decision-making support for the selection of field observation stations in practical applications.

In the score density map, some stations are not located in the hot spots even though they obtain very high scores when substituted into the DSS-LPV model. For example, the location of the Gonggashan forest station in Sichuan province was not selected. The reason is that it was eliminated when the second-scale grid screened the buildable area: tracing the result back revealed that the station lies in the north–south seismic zone of the mainland China plate, where 2 moderate-strong earthquakes and 2 strong earthquakes have occurred within about 50 km of the station in the past 50 years. To prevent earthquakes from damaging precision-measuring instruments, the processing of the second-scale grid ignores earthquakes with magnitudes below 4.5, scores areas by their distance to earthquakes of magnitude 4.5 to 6, and places a 50 km protective buffer zone around earthquakes of magnitude 6 and above. Therefore, the algorithm excludes the location of the Gonggashan forest station. To further test the reliability of the research, the area was scored again after removing the seismic factors, and the expected result was obtained, as shown in Figure 6, which shows that DSS-LPV is still reliable.
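The seismic screening rule can be sketched as below. Several points are assumptions made for illustration: earthquakes are taken to be ignored below magnitude 4.5, the distance-based scoring is modeled as a linear ramp over a hypothetical `ramp_km` of 50 km (the text does not give the scoring function), distances are great-circle distances, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def seismic_score(lat, lon, quakes, buffer_km=50.0, ramp_km=50.0):
    """Return None if the location is excluded, else a score in [0, 1].

    quakes is a list of (lat, lon, magnitude) tuples.
    """
    score = 1.0
    for q_lat, q_lon, magnitude in quakes:
        if magnitude < 4.5:
            continue  # assumed: weak earthquakes are ignored
        distance = haversine_km(lat, lon, q_lat, q_lon)
        if magnitude >= 6.0:
            if distance <= buffer_km:
                return None  # inside the 50 km protective buffer
        else:
            # Hypothetical linear ramp: closer earthquakes lower the score.
            score = min(score, min(1.0, distance / ramp_km))
    return score

# Hypothetical location and earthquake catalogue.
print(seismic_score(29.6, 102.0, [(29.6, 102.1, 6.2)]))
print(seismic_score(29.6, 102.0, [(29.825, 102.0, 5.0)]))
```

Returning `None` for excluded locations keeps the hard exclusion of the buffer zone distinct from a merely low score, which mirrors how the Gonggashan location is removed before the third-layer scoring.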

5. Discussion

In this article, we proposed a multi-scale grid spatial evaluation and data-driven location-selection technique for the construction of validation stations. The main method was to divide the indicators that affect the location of the validation station into surface characteristics, atmospheric conditions and social environment, and to construct an evaluation-indicator system. We constructed three adaptive multi-scale grids and allocated the calculation indicators to each layer. By applying Machine Learning to the indicator parameters of the stations that had already been built, a data-driven scoring model was established to characterize the internal relationship between the evaluation indicators and the location of the station.

The advantages of DSS-LPV have been verified by various analyses. Firstly, unlike the traditional methods, DSS-LPV is an efficient and systematic method. It only needs the learning stations and evaluation indicators as input to calculate the pixel score automatically. Therefore, it saves the time spent in repeated expert argumentation and reduces the subjectivity introduced by expert knowledge. Secondly, DSS-LPV does not exclude the traditional methods. When no candidate area is specified, DSS-LPV can search for candidate areas according to the input conditions, after which the traditional method can be used to select the optimal location. Therefore, DSS-LPV can provide support for traditional methods. In brief, DSS-LPV explores more possibilities in the field of station selection and offers a quantitative, systematic evaluation method that can be effectively combined with traditional expert evaluation.

Additionally, DSS-LPV has some application extensions. DSS-LPV is suitable not only for the station selection of a single observation element, but also for a comprehensive observation station or comprehensive experiment. When the observation task includes more than one type of surface object (such as vegetation, water body, and atmosphere), the station-selection results of single observation elements obtained by this method can be integrated into a comprehensive validation station to observe multiple features. According to different phenology, weather and land cover, it can also suggest a time period suitable for field observation. In particular, the established evaluation-indicator library can be dynamically adjusted and expanded according to actual requirements. For example, this article mainly addresses optical satellite products, so atmospheric conditions are considered. For other types of targets, such as SAR satellite or luminous satellite products, there is no need to consider as many weather factors. For further data services and decision support, DSS-LPV can evaluate the rationality of a station location, use the correlation of model parameters to analyze the reasons for an unreasonable location, and provide decision-making support for project development and pre-evaluation.

There are also some limitations in this research. Firstly, the size of the dataset limits the Machine Learning model. The reliability of the training dataset determines the reliability of the model. Therefore, taking China as an example, national-level stations were selected as the training samples, and the limited number of national-level stations limits the construction of the model. With the continuous construction of stations of various types and industries, the training dataset will continue to grow, so the Machine Learning model can be iteratively expanded. Secondly, since location selection for validation is a rather complicated issue, the selection of evaluation indicators inevitably has certain limitations. To support the location-selection principle established in this study, the evaluation indicators were selected on the basis of an in-depth literature review. Despite these limitations, the focus of the study is to seek a quantitative method by which to explore, explain and deduce the location-selection process. Thirdly, the applicability of a model established by Machine Learning theoretically depends on which country's stations are studied. For countries that have no or few validation stations, more international cooperation in science and technology is needed to help find a solution. Furthermore, whether a model trained in one country is applicable to other countries is an issue that requires further research.

6. Conclusions

To respond actively to the trend of globalization and joint validation, we propose a data-driven selection technique that provides a reference for the construction of land product validation stations. The reliability of DSS-LPV is illustrated by the example of validation stations for vegetation parameters in China. By learning the national-level field stations in China and establishing scoring models, DSS-LPV was tested, analyzed and corroborated. After comparison with other models, the RF, which had the highest accuracy, was identified as the modeling method of DSS-LPV. The correlation between the regression value of the test stations and the target value is 0.9133, and the average score is 0.8304. The accuracy analysis, based on a score density map of the area where each station is located, found that the test stations are basically located within the calculated hot-spot area and all had high scores. The research shows that the scoring technique derived from RF modeling can provide decision support for the selection of field-observation stations for validation. Finally, the applicable scenarios and capabilities of the method were summarized. The method in this article does not reject the traditional method, but can be integrated with it to provide experts with more accurate data services. With the construction of civil space infrastructure and high-resolution major special projects, a greater number of validation stations have been built. In the future, new stations can be integrated into the model, thereby improving the accuracy of the scoring model.

Author Contributions

Conceptualization, X.Z., Z.T., T.L. and R.L.; methodology, X.Z., R.L. and Z.T.; validation, R.L., Z.T. and T.L.; formal analysis, R.L., Z.T. and F.X.; investigation, R.L., F.X. and T.L.; resources, Z.T., R.L. and M.Z.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, R.L., Z.T. and T.L.; visualization, R.L.; supervision, Z.T.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Technologies for Collaborative Processing and Joint Verification of Basic Common Products of Quantitative Remote Sensing for the “Belt and Road” under contract No. 2020YFE0200700 (National Key R&D Program of China).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank the developers of the ArcPy package, which helped us to accomplish the experiment, analysis and plotting. We also thank the journal’s editors and reviewers for providing insightful comments and suggestions for this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liang, S. Quantitative Remote Sensing of Land Surfaces; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2003.
2. Li, G.; Zhang, H.; Zhang, L.; Wang, Y.; Tian, C. Development and trend of Earth observation data sharing. J. Remote Sens. 2016, 20, 979–990.
3. Xu, G.; Liu, Q.; Chen, L.; Liu, L. Remote sensing for China’s sustainable development: Opportunities and challenges. J. Remote Sens. 2016, 20, 679–688.
4. Liang, S.; Fang, H.; Chen, M.; Shuey, C.J.; Walthall, C.; Daughtry, C.; Morisette, J.; Schaaf, C.; Strahler, A. Validating MODIS land surface reflectance and albedo products: Methods and preliminary results. Remote Sens. Environ. 2002, 83, 149–162.
5. Justice, C.; Belward, A.; Morisette, J.; Lewis, P.; Privette, J.; Baret, F. Developments in the ‘validation’ of satellite sensor products for the study of the land surface. Int. J. Remote Sens. 2000, 21, 3383–3390.
6. Morisette, J.T.; Privette, J.L.; Justice, C.O. A framework for the validation of MODIS Land products. Remote Sens. Environ. 2002, 83, 77–96.
7. Zhang, R.; Tian, J.; Li, Z.; Su, H.; Chen, S. Principles and methods for the validation of quantitative remote sensing products. Sci. Sin. (Terrae) 2010, 40, 211–222.
8. Wu, X.; Xiao, Q.; Wen, J.; You, D.; Hueni, A. Advances in quantitative remote sensing product validation: Overview and current status. Earth-Sci. Rev. 2019, 196, 102875.
9. Morisette, J.T.; Baret, F.; Privette, J.L.; Myneni, R.B. Validation of global moderate-resolution LAI products: A framework proposed within the CEOS land product validation subgroup. IEEE Trans. Geosci. Remote Sens. 2006, 44, 1804–1817.
10. Zeng, Y.; Li, J.; Liu, Q. Review article: Global LAI ground validation dataset and product validation framework. Adv. Earth Sci. 2012, 27, 165–174.
11. Bai, J.H.; Xiao, Q.; Liu, Q.H.; Wen, J.G. The research of construction the target ranges to validate remote sensing products. Remote Sens. Technol. Appl. 2015, 30, 573–578.
12. Council, N. Review of the WATERS Network Science Plan; National Academies Press: Washington, DC, USA, 2010.
13. Jia, Z.; Liu, S.; Xu, Z.; Chen, Y.; Zhu, M. Validation of remotely sensed evapotranspiration over the Hai River Basin, China. J. Geophys. Res. Atmos. 2012, 117, D13113.
14. Liu, S.M.; Xu, Z.W.; Zhu, Z.L.; Jia, Z.Z.; Zhu, M.J. Measurements of evapotranspiration from eddy-covariance systems and large aperture scintillometers in the Hai River Basin, China. J. Hydrol. Amst. 2013, 487, 24–38.
15. Song, Y.; Wang, J.; Yang, K.; Ma, M.; Xin, L.; Zhang, Z.; Wang, X. A revised surface resistance parameterisation for estimating latent heat flux from remotely sensed data. Int. J. Appl. Earth Obs. Geoinf. 2012, 17, 76–84.
16. Li, R.; Zhou, X.; Lv, T.; Tao, Z.; Wang, J.; Xie, F. Optimal sampling strategy for authenticity test in heterogeneous vegetated areas. Trans. Chin. Soc. Agric. Eng. 2021, 37, 177–186.
17. Ma, M.; Che, T.; Li, X.; Xiao, Q.; Zhao, K.; Xin, X. A Prototype Network for Remote Sensing Validation in China. Remote Sens. 2015, 7, 5187–5202.
18. Running, S.W.; Baldocchi, D.D.; Turner, D.P.; Gower, S.T.; Bakwin, P.S.; Hibbard, K.A. A Global Terrestrial Monitoring Network Integrating Tower Fluxes, Flask Sampling, Ecosystem Modeling and EOS Satellite Data. Remote Sens. Environ. 1999, 70, 108–127.
19. Baret, F.; Morissette, J.T.; Fernandes, R.; Champeaux, J.L. Evaluation of the Representativeness of Networks of Sites for the Global Validation and Intercomparison of Land Biophysical Products: Proposition of the CEOS-BELMANIP. IEEE Trans. Geosci. Remote Sens. 2006, 44, 1794–1803.
20. Wang, J.; Che, T.; Zhang, L.; Jin, R.; Wang, W. The cold regions hydrological remote sensing and ground-based synchronous observation experiment in the upper reaches of Heihe river. J. Glaciol. Geocryol. 2009, 31, 189–197.
21. Ma, M.; Liu, Q.; Yan, G.; Chen, E.; Xiao, Q. Simultaneous remote sensing and ground-based experiment in the Heihe river basin: Experiment of forest hydrology and arid region hydrology in the middle reaches. Adv. Earth Sci. 2009, 24, 681–695.
22. Jia, S.; Ma, M.; Yu, W. Validation of the LAI produce in Heihe river basin. Remote Sens. Technol. Appl. 2014, 29, 1037–1045.
23. Li, X.; Chen, G.; Liu, S.; Xiao, Q. Heihe Watershed Allied Telemetry Experimental Research (HiWATER): Scientific Objectives and Experimental Design. Bull. Am. Meteorol. Soc. 2013, 94, 1145–1160.
24. Li, X.; Li, X.; Li, Z.; Wang, J.; Ma, M. Progresses on the Watershed Allied Telemetry Experimental Research (WATER). Remote Sens. Technol. Appl. 2012, 27, 637–649.
25. Jin, R.; Li, X.; Ma, M.; Ge, Y.; Liu, S. Key methods and experiment verification for the validation of quantitative remote sensing products. Adv. Earth Sci. 2017, 32, 630–642.
26. Hakimi, S.L. Optimum Locations of Switching Centers and the Absolute Centers and Medians of a Graph. Oper. Res. 1964, 12, 450–459.
27. Hale, T.S.; Moberg, C.R. Location Science Research: A Review. Ann. Oper. Res. 2003, 123, 21–35.
28. Li, H.; Yu, L.; Cheng, E.W.L. A GIS-based site selection system for real estate projects. Constr. Innov. 2005, 5, 231–241.
29. Owen, S.H.; Daskin, M.S. Strategic facility location: A review. Eur. J. Oper. Res. 1998, 111, 423–447.
30. Norat, R.T.; Amparo, B.P.; Juan, B.V.; Francisco, M.V. The retail site location decision process using GIS and the analytical hierarchy process. Appl. Geogr. 2013, 40, 191–198.
31. Vlachopoulou, M.; Silleos, G.; Manthou, V. Geographic information systems in warehouse site selection decisions. Int. J. Prod. Econ. 2001, 71, 205–212.
32. Jacek, M. GIS-based multicriteria decision analysis: A survey of the literature. Int. J. Geogr. Inf. Sci. 2006, 20, 703–726.
33. Nas, B.; Cay, T.; Iscan, F.; Berktay, A. Selection of MSW landfill site for Konya, Turkey using GIS and multi-criteria evaluation. Environ. Monit. Assess. 2010, 160, 491–500.
34. Noorollahi, Y.; Yousefi, H.; Mohammadi, M. Multi-criteria decision support system for wind farm site selection using GIS. Sustain. Energy Technol. Assess. 2016, 13, 38–50.
35. Ozturk, D.; Kl, F. GIS-based multi-criteria decision analysis for parking site selection. Kuwait J. Sci. 2020, 47, 2–15.
36. Shao, M.; Han, Z.; Sun, J.; Xiao, C.; Zhang, S.; Zhao, Y. A review of multi-criteria decision making applications for renewable energy site selection. Renew. Energy 2020, 157, 377–403.
37. Wang, J.; Jing, Y.; Zhang, C.; Zhao, J. Review on multi-criteria decision analysis aid in sustainable energy decision-making. Renew. Sustain. Energy Rev. 2009, 13, 2263–2278.
38. Chen, Y.; Yu, J.; Khan, S. The spatial framework for weight sensitivity analysis in AHP-based multi-criteria decision making. Environ. Model. Softw. 2013, 48, 129–140.
39. Wang, G.; Qin, L.; Li, Q.; Chen, L. Landfill site selection using spatial information technologies and AHP: A case study in Beijing, China. J. Environ. Manag. 2009, 90, 2414–2421.
40. Messaoudi, D.; Settou, N.; Negrou, B.; Rahmouni, S.; Settou, B.; Mayou, I. Site selection methodology for the wind-powered hydrogen refueling station based on AHP-GIS in Adrar, Algeria. Energy Procedia 2019, 162, 67–76.
41. Othman, A.A.; Al-Maamar, A.F.; Al-Manmi, D.A.M.A.; Liesenberg, V.; Hasan, S.E.; Obaid, A.K.; Al-Quraishi, A.M.F. GIS-Based Modeling for Selection of Dam Sites in the Kurdistan Region, Iraq. ISPRS Int. J. Geo-Inf. 2020, 9, 244.
42. Rahmat, Z.G.; Niri, M.V.; Alavi, N.; Goudarzi, G.; Babaei, A.A.; Baboli, Z.; Hosseinzadeh, M. Landfill site selection using GIS and AHP: A case study: Behbahan, Iran. KSCE J. Civ. Eng. 2017, 21, 111–118.
43. Şener, Ş.; Şener, E.; Nas, B.; Karagüzel, R. Combining AHP with GIS for landfill site selection: A case study in the Lake Beyşehir catchment area (Konya, Turkey). Waste Manag. 2010, 30, 2037–2046.
44. Uyan, M. GIS-based solar farms site selection using analytic hierarchy process (AHP) in Karapinar region, Konya/Turkey. Renew. Sustain. Energy Rev. 2013, 28, 11–17.
45. Uyan, M. MSW landfill site selection by combining AHP with GIS for Konya, Turkey. Environ. Earth Sci. 2014, 71, 1629–1639.
46. Ma, C.; Yang, Y.; Wang, J.; Chen, Y.; Yang, D. Determining the Location of a Swine Farming Facility Based on Grey Correlation and the TOPSIS Method. Trans. ASABE 2017, 60, 1281.
47. Zhang, X.H.; Wang, W.B.; Zhang, S.M.; Zhu, Y.Q. Research on Location of Integrating Village Migration in Coal Mining Areas Based on AHP-Grey Correlation. Appl. Mech. Mater. 2013, 2546, 1851–1855.
48. Zolfani, S.H.; Yazdani, M.; Torkayesh, A.E.; Derakhti, A. Application of a Gray-Based Decision Support Framework for Location Selection of a Temporary Hospital during COVID-19 Pandemic. Symmetry 2020, 12, 886.
49. Chu, J.Y.; Su, Y.P. Comprehensive Evaluation Index System in the Application for Earthquake Emergency Shelter Site. Adv. Mater. Res. 2011, 1035, 79–83.
50. Qin, C.; Li, B.; Shi, B.; Qin, T.; Xiao, J.; Xin, Y. Location of substation in similar candidates using comprehensive evaluation method base on DHGF. Measurement 2019, 146, 152–158.
51. Jiang, X.; Li, Z.; Xi, X.; Li, X.; Li, Z. Basic frame of remote sensing validation system. Arid Land Geogr. 2008, 31, 567–571.
52. Ali, I.; Greifeneder, F.; Stamenkovic, J.; Neumann, M.; Notarnicola, C. Review of Machine Learning Approaches for Biomass and Soil Moisture Retrievals from Remote Sensing Data. Remote Sens. 2015, 7, 16398–16421.
53. Lary, D.J.; Alavi, A.H.; Gandomi, A.H.; Walker, A.L. Machine learning in geosciences and remote sensing. Geosci. Front. 2016, 7, 3–10.
54. Hengl, T.; de Jesus, J.M.; Heuvelink, G.B.M.; Gonzalez, M.R.; Kilibarda, M.; Blagotic, A.; Shangguan, W.; Wright, M.N.; Geng, X.Y.; Bauer-Marschallinger, B.; et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748.
55. Holloway, J.; Mengersen, K. Statistical Machine Learning Methods and Remote Sensing for Sustainable Development Goals: A Review. Remote Sens. 2018, 10, 1365.
56. Shirmard, H.; Farahbakhsh, E.; Muller, R.D.; Chandra, R. A review of machine learning in processing remote sensing data for mineral exploration. Remote Sens. Environ. 2022, 268, 112750.
57. Gong, J. Chances and Challenges for Development of Surveying and Remote Sensing in the Age of Artificial Intelligence. Geomat. Inf. Sci. Wuhan Univ. 2018, 43, 1788–1796.
58. Shi, Z.; Lin, W.; Li, Z. Research on Site Selection of Radar Test Site Based on System Comprehensive Evaluation Method. In Proceedings of the 4th Annual Meeting of the Electronic Repair Group of the Chinese Society of Naval Architecture and Information Equipment Support Seminar, Chengdu, China, 1 October 2005; p. 5.
59. Cherkassky, V. The nature of statistical learning theory. IEEE Trans. Neural Netw. 1997, 8, 1564.
60. Haykin, S. Neural Networks: A Comprehensive Foundation, 3rd ed.; Prentice Hall: Hoboken, NJ, USA, 2007.
61. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
62. Belgiu, M.; Dragut, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. 2016, 114, 24–31.
63. Guan, H.Y.; Li, J.; Chapman, M.; Deng, F.; Ji, Z.; Yang, X. Integration of orthoimagery and lidar data for object-based urban thematic mapping using random forests. Int. J. Remote Sens. 2013, 34, 5166–5186.
64. Koreen, M.; Murray, R. On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping. Remote Sens. 2015, 7, 8489–8515.
65. Lawrence, R.L.; Wood, S.D.; Sheley, R.L. Mapping invasive plants using hyperspectral imagery and Breiman Cutler classifications (randomForest). Remote Sens. Environ. 2006, 100, 356–362.
66. Nitze, I.; Barrett, B.; Cawkwell, F. Temporal optimisation of image acquisition for land cover classification with Random Forest and MODIS time-series. Int. J. Appl. Earth Obs. Geoinf. 2015, 34, 136–146.
67. Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104.
68. Schellhaas, H. A modified Kolmogorov-Smirnov test for a rectangular distribution with unknown parameters: Computation of the distribution of the test statistic. Stat. Pap. 1999, 40, 343.
69. Meddis, R. Statistics Using Ranks; Blackwell Pub: Oxford, UK, 1984.
70. Senthilnathan, S. Usefulness of Correlation Analysis. SSRN Electron. J. 2019.

Figure 1. The distribution of Machine Learning stations: they are national field observation stations in China; the blue points represent training stations, and the red points represent test stations.

Figure 2. Flow chart of DSS-LPV.

Figure 3. Modeling accuracy of DSS-LPV based on four Machine Learning methods: (a–d) represent regression accuracy of training and test of LS, SVM, BP, and RF models, respectively. R represents the correlation between regression value and target. The yellow points are the training set, the blue points are the test set, and the straight lines are the fitting lines.

Figure 4. The regression deviation and score of the training stations by DSS-LPV: the blue histogram represents the deviation between the regression value and target of training stations; the grey histogram represents the score of training stations.

Figure 5. Score density map of the test stations: (a) grassland stations, including Guyuan S. and Inner Mongolia S.; (b) forest stations, including Dinghushan S., Xishuangbanna S. and Gonggashan S.; (c) agricultural stations, including Hailun S., Luancheng S., Qianyanzhou S., and Yanting S. The depth of the color represents the grade of the score.

Figure 6. Gonggashan Station score density map: (a) the score density map by DSS-LPV based on Random Forest; the green points represent the locations of earthquakes below magnitude 6 that have occurred in the past 50 years; the pink points represent the locations of earthquakes of magnitude 6 or higher that have occurred in the past 50 years; the pink circles are 50-kilometer buffer zones; and (b) the result without considering seismic factors.

Table 1. Source and introduction of the evaluation indicator data.

Category	Indicator	Source	Extent	Spatial Resolution	Time
Surface features	Acreage	Satellite remote sensing image	Global	30 m/8 m/2 m	2011–2021
	Slope	TanDEM-X	Global	3 arcseconds	2010–2015
	Altitude	TanDEM-X	Global	3 arcseconds	2010–2015
	Earthquake prone area	China Earthquake Administration	China		1900–2013
	Nature reserve	Resource and Environment Science and Data Center	China		2018
	Land cover classification	GLC_FCS 30-2020 product	Global	30 m	2019–2020
Atmospheric conditions	Aerosol optical depth	MODIS/Terra	Global	3 km	2000–2021
	Aerosol cloud/water vapor	MODIS/Terra	Global	1°	2000–2021
	Climate and weather	CEDA/WorldClim/NKN	Global	0.5°/2.5′/(1/24)°	1970–2018
	Number of sunny days	CEDA/WorldClim/NKN	Global	0.5°/2.5′/(1/24)°	1970–2018
Social environment	Administrative divisions	GADM	Global		2018
	Traffic accessibility	Geofabrik			2018
	Population density	WorldPop		1 km	2000–2020
	Power supply conditions	NOAA/NASA		3 arcseconds/500 m	1992–2013/2016

1 degree equals 60 arcminutes, 1 arcminute equals 60 arcseconds; the arc length corresponding to a 1° difference in longitude at the equator is about 111 km.
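The angular resolutions in Table 1 can be converted to approximate ground distances with the relations given in the note above. The sketch below uses the stated figure of about 111 km per degree at the equator; the function names are hypothetical, and the results are only rough equatorial values because the east–west ground distance per degree of longitude shrinks with latitude.

```python
KM_PER_DEGREE_AT_EQUATOR = 111.0  # approximate value stated in the table note

def degrees_to_km(degrees):
    """Approximate ground distance at the equator for an angle in degrees."""
    return degrees * KM_PER_DEGREE_AT_EQUATOR

def arcminutes_to_km(arcminutes):
    # 1 degree = 60 arcminutes
    return degrees_to_km(arcminutes / 60.0)

def arcseconds_to_km(arcseconds):
    # 1 arcminute = 60 arcseconds, so 1 degree = 3600 arcseconds
    return degrees_to_km(arcseconds / 3600.0)

print(f"3 arcseconds ~ {arcseconds_to_km(3) * 1000:.1f} m")
print(f"2.5 arcminutes ~ {arcminutes_to_km(2.5):.3f} km")
print(f"0.5 degrees ~ {degrees_to_km(0.5):.1f} km")
```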

Table 2. Evaluation indicator data preprocessing method.

Classification	Indicators	Processing Method
Target	Administrative divisions	Select attribute
	Climatic regionalization	Select attribute
	Land cover classification	Select the type code
Binary	Nature reserve	Clip raster based on vector
Binary	Earthquake prone area	Filter year/Calculate Euclidean distance/Buffer
Spatial	Cloud	Calculate the annual average value
	Precipitation	Calculate the annual average value
	Road network	Calculate Euclidean distance
	Urban area	Calculate Euclidean distance/Set threshold/Buffer
	Slope	Calculate slope
	Altitude	Take the absolute value
	Population density	Piecewise assign
	Night light	Piecewise assign
Temporal	Number of sunny days	Piecewise assign
Temporal	Observation time	Set threshold/Select

Table 3. Evaluation indicators volume and operation efficiency of DSS-LPV model.

Grid	Indicators	Volume	Running Time
The first layer	Administrative divisions	30.4 MB	15.7 s
	Climatic regionalization	644 KB
	Land cover classification	11.4 GB
The second layer	Nature reserve	612 KB	11.6 s
	Earthquake prone area	276 KB
	Cloud	402 MB
	Precipitation	1.66 GB
The third layer	Road	1.64 GB	62.6 s
	Urban distance	11.4 GB
	Slope	32.2 GB
	Altitude	32.2 GB
	Population density	46.6 MB
	Night light	2.33 GB

Table 4. Scores of test stations by DSS-LPV based on Random Forest.

Stations	Type	Score
Qianyanzhou	Agricultural	0.8111
Hailun		0.6428
Yanting		0.9142
Luancheng		0.9170
Gonggashan	Forest	0.9627
Xishuangbanna		0.7495
Dinghushan		0.7179
Inner Mongolia	Grassland	0.9163
Guyuan		0.8423
Average		0.8304
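
As a quick consistency check, the average in Table 4 can be recomputed from the nine station scores. This is a minimal sketch using only the values listed in the table; the variable names are illustrative.

```python
from statistics import mean

# Scores of the nine test stations, copied from Table 4.
scores = {
    "Qianyanzhou": 0.8111,
    "Hailun": 0.6428,
    "Yanting": 0.9142,
    "Luancheng": 0.9170,
    "Gonggashan": 0.9627,
    "Xishuangbanna": 0.7495,
    "Dinghushan": 0.7179,
    "Inner Mongolia": 0.9163,
    "Guyuan": 0.8423,
}

average = round(mean(scores.values()), 4)
lowest = min(scores, key=scores.get)
highest = max(scores, key=scores.get)

print(f"average score: {average}")
print(f"lowest: {lowest} ({scores[lowest]}), highest: {highest} ({scores[highest]})")
```

The recomputed mean agrees with the reported 0.8304; Hailun has the lowest score and Gonggashan the highest.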

Table 5. Correlation between evaluation indicators and score in the third-layer grid.

Indicator	Urban Distance	Road Distance	Slope	Altitude	Population Density	Night Light
$ρ$	0.379	0.592	−0.248	0.369	−0.616	−0.310

Table 6. Percentage deviation of indicator parameters in the third-layer grid.

Stations	Urban Distance	Road Distance	Slope	Altitude	Population Density	Night Light
Hailun	46.80%	60.27%	9.44%	11.25%	18.73%	15.16%
Xishuangbanna	25.83%	28.05%	7.18%	40.54%	6.65%	11.76%
Guyuan	15.39%	38.59%	3.73%	7.43%	12.74%	10.37%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

