1. Introduction
Land surface changes result from numerous driving forces, including tropical deforestation, rangeland modification, agricultural intensification, urbanization, social-political factors and the worldwide interconnectedness of places and people [1,2,3,4,5]. Telecoupling provides a new avenue of research that enables natural and social scientists to understand and generate information for managing how humans and nature can sustainably coexist [6,7,8,9,10]. With global issues such as climate change, surging population growth and continued energy shortages, more attention is being directed toward global environmental change [10,11,12]. Because landscapes are enormously complex systems, most methods for studying the driving forces decompose the system into different functional components and then describe the state of the landscape, the processes within the landscape, and the reactions of the landscape over time [3,13,14]. In addition, monitoring the locations and distributions of land cover change is important for establishing links between policy decisions, regulatory actions and the resulting land-use activities [15,16]. Land use/land cover change (LUCC) analysis has become a fundamental component of environmental change and sustainability research [17,18]. As stated by Li [19], apart from analyses of the driving mechanisms of LUCC, existing studies are largely dedicated to designing methods that detect changes more accurately [11,20,21,22] or that predict the spatial-temporal trends of LUCC [23,24,25,26], which can be performed at a fine scale [27,28,29,30,31,32] or a global scale [33,34,35,36,37].
Over the last few decades, dramatic changes have occurred throughout the world. LUCC is fundamentally a spatial process, resulting from the complex interaction of social, ecological and geophysical processes [38]. Land use maps describing the current status tend to be more detailed for agricultural and urban land use classes than for others. Satellite data have long been the primary source of information about the earth’s land surface, and a wide variety of techniques, ranging from unsupervised algorithms to parametric supervised algorithms and machine learning algorithms (MLA), have been used to classify land cover over large areas from satellite imagery [39,40]. Generally, post-classification comparison and multi-date composite image change detection are the two most commonly used change detection methods [41,42,43], and satellite remote sensing and GIS techniques can also be used to examine the spatial and temporal patterns of LUCC [44]. Moreover, after classification, overlaying two layers from different dates, in raster or vector format, can directly reveal the changes in pixels or polygons [45,46,47]. In addition, a hybrid classification that combines different types of methods or incorporates different data sources may provide an even better choice [48,49]. In recent years, the classification accuracy of these methods has been greatly enhanced by the increase in spatial resolution, such as very high-resolution (VHR) remote sensing images [50], and by the optimization of knowledge-based approaches, which play a role in use-specific photointerpretation and machine learning-based algorithms [51,52]. In the near future, increasingly robust and finely tuned approaches are likely to emerge for recognizing objects from images.
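To make the pixel-level comparison mentioned above concrete, the following minimal sketch cross-tabulates two classified rasters into a change matrix. It assumes the rasters are already co-registered NumPy arrays of integer class codes; the array names, sizes and class codes are purely illustrative.

```python
import numpy as np

def change_matrix(class_t1: np.ndarray, class_t2: np.ndarray, n_classes: int) -> np.ndarray:
    """Cross-tabulate two co-registered classification rasters.

    Entry (i, j) counts the pixels labelled i at time 1 and j at time 2;
    off-diagonal entries therefore indicate change.
    """
    assert class_t1.shape == class_t2.shape
    # Encode each (from, to) pair as a single index, then count occurrences.
    pairs = class_t1.astype(np.int64) * n_classes + class_t2.astype(np.int64)
    counts = np.bincount(pairs.ravel(), minlength=n_classes * n_classes)
    return counts.reshape(n_classes, n_classes)

# Hypothetical 4-class example with random labels.
t1 = np.random.randint(0, 4, size=(1000, 1000))
t2 = np.random.randint(0, 4, size=(1000, 1000))
print(change_matrix(t1, t2, n_classes=4))
```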
When multi-temporal images are available, how to detect the changes among them has become a disparate, highly variable and ever-expanding area of research [53]. After classification, the land surface can easily be partitioned into different classes of land use/land cover parcels. On this basis, some research on change analysis adopts post-classification methods, termed post-classification comparison [54,55]. For such comparisons, pixel-by-pixel [56,57] or object-by-object comparisons [58,59,60] are usually conducted. As suggested by Zhou et al. [59], object-based approaches provide a better means of change detection than pixel-based methods because they offer a more effective way to incorporate spatial information and expert knowledge into the change detection process. With the resultant objects in polygon layers from the segmentation, land parcels are further split through GIS overlay functions, and the classification can then be preserved at the object level [1,61]. Generally, two types of spatial models, vector and raster, are widely used to represent the earth’s phenomena and serve as the data formats for performing spatial operations [62]. Therefore, to detect changes from multi-temporal data, GIS operations can be applied to raster images, vector polygon layers, or a combination of both representing the land classifications.
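As a sketch of the vector case, the snippet below intersects two classified polygon layers with GeoPandas and flags parcels whose class label differs between the two dates. The file names and the "class" attribute are assumptions for illustration, not the datasets used in this study, and both layers are assumed to share the same coordinate reference system.

```python
import geopandas as gpd

# Hypothetical classified land cover layers for two dates.
lc_2015 = gpd.read_file("landcover_2015.shp").rename(columns={"class": "class_t1"})
lc_2020 = gpd.read_file("landcover_2020.shp").rename(columns={"class": "class_t2"})

# Intersection overlay splits the parcels wherever the two layers cross,
# so every resulting polygon carries one label from each date.
changes = gpd.overlay(lc_2015, lc_2020, how="intersection")

# Keep only the parcels whose land cover class actually changed.
changed_parcels = changes[changes["class_t1"] != changes["class_t2"]]
changed_parcels.to_file("changed_parcels.shp")
```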
With the advances in sensor and computer technology, the approach to collecting, managing and analyzing spatial data is going through a revolution. Monitoring geographical conditions can provide consistent measurements of landscape conditions, allowing detection of both abrupt changes and slow trends over time [54]. National geographical conditions are viewed as an important part of the basic national conditions, including the spatial distribution of and relationships among natural elements, cultural elements and socio-economic information [63]. Routine monitoring provides a persistent data source for LUCC analysis; however, the very large data volumes and complicated computing processes require high-performance computing (HPC) technologies to meet these challenges. In geographical information science, geocomputation refers to the adoption of a large-scale, computationally intensive approach to the problems of physical and human geography in particular, and the geosciences in general [64]. In essence, HPC serves as an important component of geospatial cyberinfrastructure (GCI) and is critical to large-scale geospatial data processing and problem-solving [65,66]. In the era of spatial big data, cluster computing, cloud computing and GPU computing play complementary, overlapping and even crucial roles in the processing of geospatial data [67]. In fact, the utilization of HPC systems in spatial big data processing has become increasingly widespread in recent years [68], for example in image classification [69], land use change analysis [70] and urban growth simulation [71]. Performing land use change analysis over massive spatial data at a national scale often entails a very heavy computational load, especially when processing multi-temporal land cover data of finer resolution with more complicated procedures. In particular, a theoretical approach to capturing the computational loads can be used to direct the two necessary steps for solving this geospatial problem: spatial decomposition and task scheduling [72,73].
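A simple form of spatial decomposition is to split the national layers along a regular grid and treat each tile as an independent subtask; the sketch below builds such tiles with Shapely and clips a layer against them. The grid size and layer name are illustrative assumptions, and the decomposition developed in this paper additionally weights the subtasks by their estimated computational load rather than using a uniform grid alone.

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import box

def grid_tiles(bounds, nx, ny):
    """Build an nx-by-ny list of rectangular tiles covering the given bounds."""
    minx, miny, maxx, maxy = bounds
    xs = np.linspace(minx, maxx, nx + 1)
    ys = np.linspace(miny, maxy, ny + 1)
    return [box(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for i in range(nx) for j in range(ny)]

layer = gpd.read_file("landcover_2015.shp")           # hypothetical input layer
tiles = grid_tiles(layer.total_bounds, nx=10, ny=10)  # 100 rectangular subtasks

# Clip the layer against each tile; every non-empty clip becomes one subtask.
subtasks = [gpd.clip(layer, tile) for tile in tiles]
subtasks = [s for s in subtasks if not s.empty]
```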
This paper presents a new approach to this problem: a cluster computing paradigm for conducting the data moving, clipping, overlay operation, area calculation and matrix building. As a validation case, we used it to quickly detect the land use changes over China from multi-temporal layers. It should be noted that the overlay operation is performed on digital maps in vector format for the different temporal land cover layers; however, we expect that it could also be applied to raster maps with minor modifications.
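To make the overlay, area calculation and matrix-building steps concrete, the sketch below extends the intersection overlay shown earlier into an area-weighted transition matrix. It assumes a projected, metric coordinate system and reuses the hypothetical file names and "class_t1"/"class_t2" attributes from that example.

```python
import geopandas as gpd

lc_t1 = gpd.read_file("landcover_2015.shp").rename(columns={"class": "class_t1"})
lc_t2 = gpd.read_file("landcover_2020.shp").rename(columns={"class": "class_t2"})

# Overlay the two temporal layers and measure every intersected parcel.
inter = gpd.overlay(lc_t1, lc_t2, how="intersection")
inter["area_km2"] = inter.geometry.area / 1e6  # assumes an equal-area CRS in metres

# Transition matrix: rows are classes at time 1, columns are classes at time 2,
# cells hold the total area that moved from one class to the other.
matrix = inter.pivot_table(index="class_t1", columns="class_t2",
                           values="area_km2", aggfunc="sum", fill_value=0)
print(matrix)
```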
The remainder of the paper is organized as follows. Section 2 introduces the detailed method, including the LUCC procedures, the manner of spatial decomposition and the stream scheduling mechanism. Next, two individual experiments based on the estimating factors are given in Section 3. Section 4 provides a detailed discussion of the obtained results, and finally the main conclusions of the work are presented in Section 5.
4. Discussion
Generally, time complexity is a measure of the time a function or procedure requires to complete its task. However, the hardware environment must also be considered, because different CPU frequencies, memory bandwidths and disk transfer rates will undoubtedly influence the time usage. Therefore, a gap exists between time complexity and actual time usage. Through a polynomial regression analysis, we established the relationship between the actual time usage and data volume factors such as the number of geometries and vertices. Moreover, the estimated time usage of the different procedures could be transformed into a unified function. According to the testing results in the Supplementary Materials, the Pearson correlation coefficient (PCC) between the observed time and the estimated time was more than 0.99. Thus, the estimated time function closely reflects the actual time usage by referencing the number of geometry vertices or the number of polygons. According to the analysis of the testing data, spatial heterogeneity is a common phenomenon that is determined both by the data and by the corresponding procedures. Herein, two individual experiments were conducted: the first on a single workstation and the second on a cluster.
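As an illustration of this estimation step, the following sketch fits a polynomial time model to timing measurements against the number of vertices and checks the agreement with the Pearson correlation coefficient. The sample data are invented and the polynomial degree is an assumption, not the model fitted in this study.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical measurements: number of vertices processed vs. observed seconds.
vertices = np.array([1e4, 5e4, 1e5, 5e5, 1e6, 5e6])
observed = np.array([0.8, 3.9, 8.1, 41.0, 85.0, 430.0])

# Fit a low-degree polynomial time model t(n) and evaluate it on the samples.
coeffs = np.polyfit(vertices, observed, deg=2)
estimated = np.polyval(coeffs, vertices)

# Agreement between observed and estimated time usage.
r, _ = pearsonr(observed, estimated)
print(f"Pearson correlation coefficient: {r:.4f}")
```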
The results from the first experiment indicated that an average speedup of 5.1 times over the sequential version could be achieved with 12 processes. When fewer than 6 processes were allocated, a stable improvement was obtained; however, there was no further performance gain when more than 6 processes were allocated. The main reason was I/O bottlenecks: when multiple processes read data from or dump data to the disk concurrently, the overall performance easily declines due to the I/O speed limit of the storage drive. In addition, multiple procedures working on different CPU cores may also have slowed the computation when accessing the shared memory space. Therefore, to alleviate the contention on slow disks, a straightforward remedy is to adopt a faster drive, such as a solid-state drive (SSD). Where resources allow, loading all the data into main memory would be an ideal way to handle large-scale data processing. According to our earlier tests [79], using a streaming framework that coordinates the I/O events with the computation may also provide an effective way of alleviating this bottleneck. Thus, an individual procedure should be dedicated to continuously reading the data and producing a continuous stream of data blocks; other procedures can then concurrently process the data blocks acquired from the stream.
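A minimal sketch of such a streaming arrangement is given below: one producer process reads data blocks into a bounded queue, while several consumer processes take blocks from the queue and process them concurrently. The block contents and the processing step are placeholders, not the actual I/O and overlay routines used in this work.

```python
import multiprocessing as mp

SENTINEL = None  # marks the end of the stream

def reader(queue, n_blocks):
    """Producer: continuously read (simulated) data blocks into the queue."""
    for i in range(n_blocks):
        block = f"data block {i}"   # placeholder for a chunk read from disk
        queue.put(block)
    for _ in range(mp.cpu_count()):
        queue.put(SENTINEL)         # one stop signal per consumer

def worker(queue):
    """Consumer: process blocks as soon as they become available."""
    while True:
        block = queue.get()
        if block is SENTINEL:
            break
        _ = len(block)              # placeholder for the real computation

if __name__ == "__main__":
    q = mp.Queue(maxsize=16)        # bounded queue throttles the reader
    consumers = [mp.Process(target=worker, args=(q,)) for _ in range(mp.cpu_count())]
    producer = mp.Process(target=reader, args=(q, 100))
    for p in consumers + [producer]:
        p.start()
    for p in consumers + [producer]:
        p.join()
```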
When performing the task on a single workstation, we focused on fully exploiting the parallelism of multicore CPUs. In a cluster environment, the difficulty lies in how to allocate the computational loads to different workstations, so a quantitative representation of the task must first be established. According to the results of the second experiment, the average difference ratio of the partitioned loads remained at a relatively low level, which demonstrated that the estimation and partitioning method could effectively decompose the computational loads. Furthermore, two temporal land cover layers covering mainland China were decomposed into multiple sub-datasets with even loads, so this extremely time-consuming task was completed in a very short period of time.
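The decomposition in this paper relies on graph partitioning; as a simpler stand-in that conveys the same load-balancing idea, the sketch below assigns subtasks to workstations with a greedy longest-processing-time heuristic, always giving the next-heaviest subtask to the least-loaded machine. The load values are hypothetical estimates of the kind produced by the time functions above.

```python
import heapq

def balance_loads(estimated_loads, n_workers):
    """Greedy LPT assignment: heaviest subtasks first, each to the least-loaded worker."""
    # Min-heap of (accumulated load, worker id).
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    # Visit subtasks in decreasing order of estimated load.
    for task_id, load in sorted(enumerate(estimated_loads),
                                key=lambda t: t[1], reverse=True):
        total, w = heapq.heappop(heap)
        assignment[w].append(task_id)
        heapq.heappush(heap, (total + load, w))
    return assignment

# Hypothetical per-subtask load estimates (e.g., predicted seconds).
loads = [120.0, 95.5, 80.0, 64.2, 40.1, 33.3, 21.0, 12.8]
print(balance_loads(loads, n_workers=3))
```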
In addition, two improvements should be made in the near future. The first concerns the lack of assessment of how geometric complexity influences time consumption: although the time functions give an approximate estimation of the observed time, some deeper relationships may have been obscured. The second concerns applicability. All the tests were performed in a homogeneous environment in which every workstation had an identical configuration; in practice, heterogeneous environments are very common, and the method should be adjusted accordingly.
In fact, the earth’s surface is changing at a remarkable pace, and more and more large-scale, short-period and high-resolution data are becoming available with the development of sensor technology. The proposed method was only used to perform a two-temporal-layer LUCC analysis, and no geomorphic layers were involved for a deeper analysis. When dealing with more temporal and thematic layers, the time costs will increase sharply. Therefore, performing change analysis for a large-scale region, such as a country, a continent or even the globe, especially using high-performance computing facilities, remains a challenge that should be addressed.
5. Conclusions
This work presented a novel graph-based spatial decomposition to handle the very heavy computational load of overlaying multi-temporal land cover data. The approach quantitatively represents the computational loads as graph vertices and edges, and balanced subtasks are formed through graph partitioning. First, PCC analysis was conducted to select the determining factors that influence time usage; then, stream scheduling was used to achieve an efficient parallel LUCC method. Together with a master-slave architecture connecting the distributed workstations and a stream scheduling method, we performed two experiments to test the performance improvements on a single workstation and on a cluster.
Clearly, selecting the proper factors and regression methods contributes to a unified representation that quantitatively transforms each procedure in a complex workflow into a time function with the data size as a measurable variable. On this basis, the graph-based spatial decomposition provides a balanced load allocation method, which considers not only the volume of the data to be processed but also the spatial operations performed on the data. The LUCC analysis, as the main research object, is thus optimized efficiently. It is worth mentioning that the time consumed in detecting the two-temporal changes over China was reduced from roughly two weeks to less than 6 h.
In the near future, research should also be done to validate the feasibility of applying the approach to raster maps and to explore the potential influence of geometric complexity, which may also govern the parallel efficiency to some degree. In addition, heterogeneous environments should also be considered.