1. Introduction
High-resolution remote sensing images contain many diverse ground features and can provide detailed spatial information for many geographical applications [1]. Traditional pixel-based methods are usually low in accuracy and exhibit various problems when processing high-resolution remote sensing images [2]. In object-based image analysis (OBIA), images are processed on the basis of meaningful objects rather than individual pixels, which can greatly improve the results for high-resolution remote sensing images [3,4]. OBIA has shown superior performance compared to pixel-based methods in many studies and has become an important approach for processing very-high-resolution remote sensing images [5,6,7].
Unsupervised image segmentation methods, which can partition remote sensing images into meaningful segments (geo-objects), are foundational procedures for OBIA [8,9]. Through the integration of point/pixel, edge, region and texture features, unsupervised segmentation methods can generate homogeneous segments as the minimal units of an image, thereby greatly facilitating the fusion of knowledge in remote sensing image processing and analysis [10,11]. However, with the continuously increasing resolution of remote sensing images, the collected image data increasingly tend to contain high interclass homogeneity and intraclass heterogeneity; this characteristic makes it difficult to recognize high-quality segments based on shallow features [12].
Currently, when shallow models encounter difficulties, the most common approach is to resort to deep models, and indeed, deep-learning-based methods for the supervised classification and semantic segmentation of remote sensing images have achieved great success [13]. Through deep neural networks, higher-level features can be extracted, significantly improving the recognition accuracy for ground objects in remote sensing images [14,15]. For unsupervised classification, unsupervised deep neural network (UDNN) methods are also available [16,17]. Unfortunately, although UDNNs have a high recognition ability, when these models are directly used for unsupervised segmentation, many obstacles arise, including an uncontrollable refinement process, excessive fragmentation at the border and excessive computing resource requirements (Section 3 will discuss these obstacles in detail). Although the direct use of UDNNs enables a much better recognition of ground objects in remote sensing images than is possible with shallow methods, these obstacles cause the corresponding unsupervised segmentation results to lack practical value. As a result, although deep learning methods for remote sensing have been developed for many years, many remote sensing applications that require OBIA still rely on shallow segmentation methods to obtain segments; consequently, research on unsupervised remote sensing image segmentation lags behind the state of the art in the field of image processing.
To introduce deep learning into unsupervised remote sensing segmentation in a manner suitable for practical application, this article proposes a hierarchical object-focused and grid-based deep unsupervised segmentation method for high-resolution remote sensing images (HOFG). Different from traditional approaches, HOFG does not directly convert the output of a deep neural network into segments but rather constructs a grid through which the output is indirectly converted into segments; with the help of this mechanism, the corresponding lazy deep segmentation method (LDSM) can greatly suppress the generation of fragments and support the handling of large input images. At the same time, HOFG does not rely on a neural network to recognize all segments simultaneously but instead uses a gradual focusing process that hierarchically proceeds from whole large objects to the detailed boundaries of those objects, thereby reducing the precision expected of the LDSM and making the entire segmentation process more controllable. In experiments, HOFG is compared with three shallow unsupervised segmentation methods and one deep method. The results show that HOFG retains the recognition ability of UDNNs, allowing it to obtain fewer segments while maintaining a high accuracy; it therefore outperforms the other methods in terms of segmentation quality. Our article offers the following contributions:
- (1).
Aiming at practical use, we analyze the obstacles encountered when attempting to use a UDNN to obtain unsupervised remote sensing image segmentation results.
- (2).
A hierarchical object-focused and grid-based method is proposed to address these obstacles. Specifically, lazy and iterative processes are used to gradually achieve unsupervised segmentation targets, rather than pursuing a powerful UDNN that can obtain excellent results in a single shot.
- (3).
The unsupervised classification ability of UDNNs is transformed into a controllable and stable segmentation ability, making UDNN models suitable for practical remote sensing applications.
- (4).
The proposed method can serve as a new framework for solving the problem of deep unsupervised segmentation in remote sensing. The corresponding source code is shared in order to motivate and facilitate further research on deep unsupervised remote sensing image segmentation.
3. Obstacles Hindering Deep Unsupervised Remote Sensing Image Segmentation
Deep neural networks perform excellently in supervised classification, semantic segmentation and object detection for remote sensing images. UDNNs have also been developed in the field of computer vision, and it is reasonable to expect that these UDNNs could also be applied to remote sensing images and achieve good unsupervised segmentation results. However, to date, the applications and methods regarding deep unsupervised remote sensing image segmentation have unfortunately remained scarce.
The reason is not that unsupervised remote sensing image segmentation has no research value, but rather that there are serious obstacles that hinder deep neural networks’ performance in the unsupervised segmentation of remote sensing images. In this section, we present an example to illustrate these obstacles. For a remote sensing image Iexample whose size is 600 × 600, we use a typical deep segmentation model Mexample trained via an unsupervised, optimization-based backpropagation training process [16,17]. Mexample classifies Iexample to obtain the unsupervised classification results (the labels are randomly assigned), and pixels that are adjacent and have the same classification label are further assigned a unique segment label. The corresponding results are shown in Figure 1.
In Figure 1a, we show deep results obtained with few iterations, moderate iterations and many iterations. Generally, as the number of iterations increases, the segments become larger, and the number of segments decreases. Compared with the results of the traditional shallow methods, the characteristics of deep methods are as follows:
Advantage: obviously superior recognition ability. It can be seen from Figure 1a that the deep model has an attractive advantage: in the deep segmentation results, buildings, roads and trees are separated into obvious and larger segments, reflecting the excellent ability of UDNNs to distinguish obvious ground objects. With either few or many iterations, the corresponding segments are very stably recognized.
UDNNs use high-level features instead of the specific local color space considered in traditional methods, so they perform better in the recognition of obvious objects. If this advantageous capability can be introduced into the unsupervised remote sensing image segmentation process, it will be very helpful for improving the segmentation quality and even achieving an object-by-object segmentation.
However, as shown in Figure 1b, UDNNs still face many obstacles in the unsupervised segmentation of remote sensing images.
(1) Obstacle 1: uncontrollable refinement process. With traditional methods, the iteration parameters can easily be adjusted to shift the segmentation result from undersegmentation to oversegmentation, controllably producing results from coarse to very fine.
This process is difficult to control with UDNNs; we selected two locations as examples. At location 1, in the results obtained with “few iterations”, part of the roof is not separated (undersegmentation); as the number of iterations increases, the roof is separated. At location 2, an area of grass is correctly separated in the results obtained with “few iterations”, but as the number of iterations increases, this location becomes undersegmented, and the boundaries of this area are destroyed.
The trends of these two locations conflict, and we cannot decide the direction of refinement by controlling the UDNN iterative process. We are unable to determine whether the segmentation results will be fine enough by modifying the iteration number, and large-scale cross-boundary segments will always exist randomly. Hence, no matter how the deep learning parameters and the number of iterations are adjusted, unacceptable segments will always exist in large regions, and because of the randomness of the training process, these erroneous regions may vary greatly each time the method is executed.
(2) Obstacle 2: excessive fragmentation at the border. In the results of traditional methods, the boundaries between segments are simple, single-line borders; this is a very common characteristic.
In contrast, the UDNN results contain many very small fragments. With “few iterations”, random, very small segments exist at both the borders and the interiors of larger ground objects, leading to too many segments. With “many iterations”, these small segments are arranged along the boundaries of the larger objects, forming a large number of rings and double-line boundaries.
Therefore, for UDNNs, no matter what we do, we will encounter a large number of random fragment segments, causing the number of segments produced via deep learning to be much higher than that produced using traditional methods. These small segments not only fail to effectively transmit spatial information but also have a negative impact on subsequent analysis and calculations.
(3) Obstacle 3: excessive computing resource requirements. Traditional methods can easily process remote sensing images in conventional size ranges (e.g., smaller than 50,000 × 50,000); thus, image size is an issue that rarely requires attention.
A UDNN model needs to be loaded into the GPU’s memory for execution (processing a deep neural network on a CPU would be very slow). As the UDNN’s input size increases, its demand for GPU memory grows rapidly. In particular, a computer equipped with 11 GB of video memory (our experimental computer) has difficulty loading a deep model that takes input images larger than 1000 × 1000, and remote sensing images are usually much larger than this. For the segmentation of larger images, one of two strategies is usually adopted in the computer vision field. (i) Resizing to a smaller image: a large image is resized to a smaller one, the deep model processes the smaller image, and the results are then rescaled to the original size. With this downscaling and subsequent upscaling, the boundary between two pixels in the low-resolution result becomes a several-pixel-wide boundary at the original resolution, which may produce jagged boundaries. (ii) Cutting into patches: since unsupervised labels are randomly assigned, the same label value is unlikely to correspond to the same object in two adjacent patches. Consequently, it is difficult to correctly overlap or fuse the separated patches to reintegrate the results, and this strategy can cause significant disconnections at the borders of patches, creating false boundaries that do not exist in the original image.
Therefore, from the perspective of application value alone, deep unsupervised segmentation methods are very difficult to use for remote sensing images; traditional shallow methods are more controllable and practical. This is why, to date, there have been few applications of and little research on deep neural networks for unsupervised remote sensing image segmentation.
However, it should also be recognized that as long as these three obstacles can be overcome, we can take advantage of the superior recognition capability of UDNNs to greatly improve the segmentation quality. Therefore, the main target of our research is to address these three obstacles and thereby make UDNNs practically usable for unsupervised remote sensing image segmentation.
4. Methodology
4.1. Overall Design of the Method
As analyzed in Section 3, UDNNs produce very attractive unsupervised segmentation results. Especially compared to shallow methods based on local color or texture features, deep neural networks can effectively recognize “objects” in complex high-resolution remote sensing data, and this characteristic is very important for the understanding and processing of remote sensing content.
However, we cannot practically use this capability of UDNNs unless we can overcome the three obstacles: the uncontrollable refinement process, excessive fragmentation and the processing of large remote sensing images. Common attempts to do so usually rely on proposing a more complex and delicate deep neural network structure. Unfortunately, in the absence of training samples, it is very difficult to design a “perfect” neural network with sufficiently subtle control of its iterations and parameters to effectively address these obstacles. Therefore, our approach does not rely on such an attempt.
There are two strategies that have emerged in the fields of image processing and artificial intelligence (AI) that can help us solve the problem at hand:
- (1)
Hierarchical: one may consider distinguishing the most obvious objects first, then further separating the most obvious region for each object and continuing this process until all boundary details can be distinguished. In this hierarchical iterative process, it is not necessary to recognize all segments in a single shot; it is only necessary to be able to subdivide the segments generated in the previous round.
- (2)
Superpixel/grid-based: segment labels can be assigned to units based on superpixels or grid cells rather than single pixels. This can effectively prevent the formation of fragments that are too small, and the establishment of superpixel/grid boundaries can also suppress the emergence of jagged borders caused by the resizing process. At present, it is easy to obtain oversegmented results of uniform size using shallow unsupervised segmentation methods, and these results can be directly used as a set of superpixels or grid cells for this purpose.
Inspired by the above two strategies, this paper proposes a hierarchical object-focused and grid-based deep unsupervised segmentation method for high-resolution remote sensing images (HOFG). The overall design of HOFG is illustrated in Figure 2.
As shown in Figure 2, the input to HOFG is a remote sensing image Iimg, and the output is its corresponding segmentation results Rresult = {r1, r2, …, rn}, where ri = {seglabel, {pixel-locations}} is a segment defined in terms of the segment label seglabel and the corresponding pixel locations. In HOFG, a traditional shallow segmentation method is also used to perform an excessive oversegmentation to obtain the grid segment results Rgrid = {r1, r2, …, rn}. In Rgrid, it is necessary only to ensure the minimal occurrence of cross-boundary segments, and traditional methods can easily achieve this when the segments are sufficiently small and large in quantity. The segments in Rgrid are regarded as the smallest unit “grid cells” in HOFG, and all assignments of the segment labels in HOFG are based on these grid cells rather than on individual pixels.
A UDNN is adopted as the core of the deep segmentation model in HOFG, but it is not expected that all segments will be obtained by applying the UDNN only once, nor that the UDNN will exhibit a “perfect” performance. Instead, this paper proposes a lazy deep segmentation method (LDSM) for segment recognition, in which a relatively simple UDNN is used. The LDSM can process only input images that are smaller than a certain size and will recognize the most obvious objects in such an input image. In addition, the LDSM segmentation results are based on the grid cells in Rgrid, not individual pixels. These properties make the LDSM easier to realize than a “perfect” deep neural network.
On the basis of the LDSM, a hierarchical object-focused process is introduced in HOFG. First, the LDSM attempts to recognize the most obvious objects to obtain the first-round segmentation results R1; then, for each segment in R1, the LDSM performs another round of recognition and segmentation to obtain R2, which is finer than R1. Thus, the HOFG process is gradually refined from focusing on whole objects to focusing on the local details of the objects, and the final results Rsegment are obtained.
For HOFG, the initial segmentation result is R0 = {r1}, where r1 contains the entirety of Iimg. During subsequent iterations, the segment set Ri−1 obtained in the (i−1)th iteration is further separated into Ri in the ith iteration until the final segmentation results are obtained.
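This outer refinement loop can be sketched in Python; `toy_split` is a hypothetical stand-in for the LDSM (the real step involves a neural network), but it illustrates the guarantee that each round only subdivides the previous round's segments:

```python
# Sketch of HOFG's hierarchical refinement loop (hypothetical helper names).
# Each segment is a (label, frozenset_of_pixels) pair; the splitter maps one
# segment set Ri-1 to a finer set Ri, standing in for the LDSM.

def hofg_iterate(r0, ldsm_split, n_iterations):
    """Apply the LDSM stand-in repeatedly: R0 -> R1 -> ... -> Rn."""
    results = [r0]
    for _ in range(n_iterations):
        results.append(ldsm_split(results[-1]))
    return results

def toy_split(segments):
    """Toy LDSM stand-in: split every multi-pixel segment in half."""
    out = []
    for _, pixels in segments:
        pts = sorted(pixels)
        if len(pts) > 1:
            mid = len(pts) // 2
            out.append((len(out), frozenset(pts[:mid])))
            out.append((len(out), frozenset(pts[mid:])))
        else:
            out.append((len(out), frozenset(pts)))
    return out

# R0 holds a single segment covering a toy 4 x 4 "image".
r0 = [(0, frozenset((x, y) for x in range(4) for y in range(4)))]
history = hofg_iterate(r0, toy_split, 3)
# The segment count never decreases across iterations (cf. Obstacle 1).
assert all(len(a) <= len(b) for a, b in zip(history, history[1:]))
```

Because every iteration only partitions existing segments, the pixels of the image are conserved and the refinement is monotone, regardless of how the splitter behaves internally.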
With the above design, we use deep neural networks in the LDSM and thereby indirectly use the recognition ability of deep models; this enables the advantage discussed in Section 3 to be incorporated into the results. Moreover, the three obstacles discussed in Section 3 can be solved:
- (1)
Uncontrollable refinement process: as the iterative process proceeds, Ri is refined on the basis of Ri−1, and the number of segments in Ri must be greater than or equal to the number in Ri−1. Therefore, although the number of segments identified by a UDNN is uncontrollable, the segmentation process of HOFG must be progressively refined as the number of iterations increases. This successfully solves Obstacle 1.
- (2)
Excessive fragmentation at the border: due to the introduction of the grid-based strategy, the segment label of each grid cell is determined by the label of the majority of its internal pixels, and small pixel-level fragments do not appear in the results. Therefore, an excessive fragmentation at the border does not occur, overcoming Obstacle 2.
- (3)
Excessive computing resource requirements: the use of downscaled input images in the LDSM prevents the UDNN from receiving excessively large images. This prevents the introduction of excessively large deep models during HOFG iterations, and the demand for GPU memory is always kept within a reasonable range. Additionally, the segment boundaries in the output are based directly on the grid cells rather than on the pixel-level UDNN results; therefore, the use of downscaled input images in the UDNN (which would otherwise yield jagged boundaries) has little impact on the final results. These characteristics directly address the challenges presented by Obstacle 3.
4.2. Construction of the Grid and Initialization of the Set of Segmentation Results
For an input image Iimg, HOFG needs to initialize two sets to support the execution of the whole method: (1) the segmentation result set R0, which will serve as the input data for the first iteration of HOFG, and (2) the “grid” set Rgrid = {r1, r2, …, rn}, where the ri will be used as the basic units for segmentation.
For the initialization of the segmentation result set R0, because HOFG needs to segment the entire image in the first iteration, R0 = {r1}; there is only one segment r1 in R0, and r1 contains all the pixels of the whole image.
To be suitable as the “grid” set for HOFG, Rgrid needs to satisfy two criteria: (1) regardless of what label is given, a segment that crosses a boundary will inevitably cause some pixels to be misclassified, destroying the corresponding boundary in the final results; therefore, there must be as few cross-boundary segments as possible in Rgrid. (2) Since the segmentation function in HOFG is performed by a neural network, an excessive segment height/width will detrimentally impact the layer-by-layer feature extraction, so the compactness of the segments needs to be high to ensure that the image surrounding each segment is close to a square.
To meet the above two criteria, HOFG adopts the simple linear iterative clustering (SLIC) algorithm, a traditional shallow unsupervised segmentation method. This algorithm is initiated with k clusters, Sslic = {c1, c2, …, ck}, with ci = {li, ai, bi, xi, yi}, where li, ai and bi are the color values of cluster ci in the CIELAB color space and xi and yi are the center coordinates of ci in the image. During the execution of the SLIC algorithm, two parameters control the segmentation results. The first is the target number of segments, Pslicnum; the merging process stops when the number of image segments is less than or equal to Pslicnum (the final number of segments is usually not strictly equal to Pslicnum). The second is the compactness parameter, Pcompact, which defines a balance between color and space, with a higher Pcompact giving more weight to spatial proximity [52]. The ability to adjust Pslicnum and Pcompact makes the SLIC results controllable: a larger Pslicnum can be specified to obtain oversegmented results and avoid cross-boundary segments, and a larger Pcompact can be specified to improve the compactness of each segment.
Based on the above discussion, the initialization process of HOFG is described in Algorithm 1.
Algorithm 1: HOFG initialization algorithm (HOFG-Init)
Input: Iimg
Output: Rgrid, R0
Begin
  Cinit-seg = Segment Iimg via SLIC with segments = Pslicnum and compactness = Pcompact;
  labelid = 0; Rgrid = ø; R0 = ø;
  foreach ci in Cinit-seg
    ri = {labelid, {pixel positions of ci}};
    Rgrid ← ri;
    labelid = labelid + 1;
  r1 = {0, {all pixel positions of Iimg}};
  R0 ← r1;
  return Rgrid, R0;
End
Through the HOFG-Init algorithm, we obtain Rgrid and R0 for the HOFG method.
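HOFG-Init can be sketched in Python with numpy; here a regular block grid stands in for the SLIC oversegmentation (any oversegmentation label map would do), and its cells play the role of Rgrid:

```python
import numpy as np

def hofg_init(img_h, img_w, cell_labels):
    """Build Rgrid from an oversegmentation label map and initialize R0.

    cell_labels: (H, W) int array from any oversegmentation (e.g. SLIC);
    the caller supplies it.  Returns (Rgrid, R0) as lists of
    (segment_label, set_of_pixel_positions) pairs, mirroring Algorithm 1.
    """
    r_grid = []
    for label_id, cell in enumerate(np.unique(cell_labels)):
        ys, xs = np.nonzero(cell_labels == cell)
        r_grid.append((label_id, set(zip(ys.tolist(), xs.tolist()))))
    # R0 holds a single segment r1 covering the whole image.
    r0 = [(0, {(y, x) for y in range(img_h) for x in range(img_w)})]
    return r_grid, r0

# Stand-in for SLIC output: a regular grid of 2 x 2 blocks on a 4 x 4 image.
h = w = 4
blocks = (np.arange(h)[:, None] // 2) * 2 + (np.arange(w)[None, :] // 2)
r_grid, r0 = hofg_init(h, w, blocks)
```

In HOFG itself the `cell_labels` array would come from SLIC with a large Pslicnum and Pcompact; the construction of Rgrid and R0 is otherwise identical.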
4.3. Lazy Deep Segmentation Method (LDSM)
As shown in Figure 2, HOFG does not try to obtain the segmentation result in just one step but transforms the segmentation process into an iterative process. In the initial stage, the segmentation result is R0, and R1 is obtained after the first iteration; after that, the i-th iteration converts Ri−1 to Ri. The LDSM acts as a bridge for each iteration and can be thought of as a mapping function that realizes Ri = LDSM(Ri−1).
4.3.1. Unsupervised Segmentation by a Deep Neural Network
HOFG employs a deep model Muc to perform unsupervised pixelwise classification. Unsupervised deep models have a very different structure and training process than supervised models. For Muc, we adopt the strategy proposed by Kim and Kanezaki [16,17]. The model structure and the corresponding segmentation process are shown in Figure 3.
As shown in Figure 3, the input subimage Isub has size Widthsub × Heightsub, and its number of bands is Bandsub. Muc can perform an unsupervised classification of Isub and obtain results that assign a label to each pixel of Isub. For the unsupervised classification process of Muc, the maximum number of categories is Nmax-category, and the minimum number of categories is Nmin-category. There are three components in Muc:
- (1)
Feature extraction component: this component contains N groups of layers, where each group consists of a convolutional layer (filter size is 3 × 3, padding is ‘same’ and the number of output channels is Nmax-category), a layer that applies the rectified linear unit (ReLU) activation function, and a batch normalization layer. Through this component, the high-level features of the input image can be extracted.
- (2)
Output component: this component contains a convolutional layer (filter size is 1 × 1, padding is ‘same’ and the number of output channels is Nmax-category) and a batch normalization layer. Through the processing of these layers, a Widthsub × Heightsub × Nmax-category matrix MXsub-map is obtained, in which a number between 0 and 1 is used to express the degree to which each pixel belongs to each category.
- (3)
Argmax component: this component uses the argmax function to process the matrix MXsub-map from the previous component to obtain Csub-label, which is a Widthsub × Heightsub × 1 matrix containing a discrete class label for each pixel.
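The shapes flowing through the output and argmax components can be illustrated in numpy (a sketch only: the random MXsub-map stands in for the real convolutional output):

```python
import numpy as np

width_sub, height_sub, n_max_category = 8, 8, 20

# Stand-in for MXsub-map: per-pixel membership scores for each category,
# as produced by the output component (random here, for shape only).
rng = np.random.default_rng(0)
mx_sub_map = rng.random((width_sub, height_sub, n_max_category))

# Argmax component: collapse the category axis to one discrete label per pixel.
c_sub_label = np.argmax(mx_sub_map, axis=-1)

assert mx_sub_map.shape == (width_sub, height_sub, n_max_category)
assert c_sub_label.shape == (width_sub, height_sub)
assert 0 <= c_sub_label.min() and c_sub_label.max() <= n_max_category - 1
```

The argmax step is what turns the continuous Widthsub × Heightsub × Nmax-category response map into the discrete label image Csub-label used downstream.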
Unlike in supervised neural networks, Muc needs to be initialized for each input image Isub. During initialization, the weights of Muc are randomly assigned, so the output MXsub-label of an untrained Muc will be composed of random category labels in the range [0, Nmax-category − 1]. At this point, there will be a difference between MXsub-map and MXsub-label; this difference can be used as the unsupervised loss function for the evaluation of the neural network:

$$loss = loss_{category} + \mu \cdot loss_{space} \tag{1}$$

where $\mu$ is a weight balancing the two terms.
Here, losscategory represents the loss of the output on the category labels, and it is expressed as follows:

$$loss_{category} = -\sum_{n=1}^{Width_{sub} \times Height_{sub}} \; \sum_{i=1}^{N_{max\text{-}category}} \delta(i - c_n)\,\ln MX_{sub\text{-}map}(n, i) \tag{2}$$

where $c_n = MX_{sub\text{-}label}(n)$ is the category label assigned to pixel $n$, and $\delta(t) = 1$ if $t = 0$ and $0$ otherwise. Meanwhile, lossspace represents the constraint of spatial continuity, and it is defined as follows:

$$loss_{space} = \sum_{\xi=1}^{Width_{sub}-1} \; \sum_{\eta=1}^{Height_{sub}-1} \left( \left\| MX_{\xi+1,\eta} - MX_{\xi,\eta} \right\|_1 + \left\| MX_{\xi,\eta+1} - MX_{\xi,\eta} \right\|_1 \right) \tag{3}$$

where $MX_{\xi,\eta}$ denotes the $N_{max\text{-}category}$-dimensional response vector of MXsub-map at pixel $(\xi, \eta)$.
Thus, Formula (1) considers constraints on both feature similarity and spatial continuity [17]. Based on this loss, the weights of Muc can be adjusted through backpropagation, and this process can be iteratively repeated. As the iteration progresses, MXsub-label will contain fewer categories (some categories will no longer appear in the results) that are more continuous (pixels with similar high-level features will be assigned to the same category). The corresponding training process is described in Algorithm 2.
Algorithm 2: Unsupervised classification model training (UCM-Train)
Input: Isub, Nmax-train, Nmin-category, Nmax-category
Output: Muc
Begin
  Muc = Create model based on the size of Isub and Nmax-category;
  while (Nmax-train > 0)
    [MXsub-map, MXsub-label] = Process Isub with Muc;
    loss = loss(MXsub-map, MXsub-label);
    Update weights in Muc ← backpropagation with loss;
    Nmax-train = Nmax-train − 1;
    Ncurrent-category = number of categories in MXsub-label;
    if (Nmin-category >= Ncurrent-category) break;
  return Muc;
End
For the UCM-Train algorithm, Nmax-category, Nmax-train and Nmin-category will all have an impact on the final results. For common unsupervised classification tasks, these parameters are difficult to specify, and doing so requires a large number of repeated tests. Fortunately, the LDSM is not required to produce perfect unsupervised classification results; it only needs to separate the most obvious objects in Isub.
Nmax-category determines the range of changes to the random class labels in the initial state; the larger the value specified for this parameter is, the more categories the model can recognize, but more iterations are also required to merge the pixels in each object. Nmin-category determines the fewest categories the model will recognize. The default values of these two parameters for HOFG are Nmax-category = 20 and Nmin-category = 3.
Nmax-train determines the maximum number of iterations; with a larger Nmax-train, Muc will output larger and more continuous classification segments, which is suitable for the classification of the overall image, while a smaller Nmax-train will make the classification more sensitive to category differences, suitable for the segmentation of boundaries and object details. For HOFG, the value of Nmax-train is related to the iteration sequence. By default, Nmax-train is 100 in the first iteration, 50 in the second iteration and 20 in the third iteration and later.
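The loss of Formula (1) that drives UCM-Train can be sketched in numpy (a simplified, non-differentiable illustration with an assumed weight `mu`; in practice the value is computed on the network output with automatic differentiation):

```python
import numpy as np

def unsupervised_loss(mx_map, mu=5.0):
    """mx_map: (H, W, C) per-pixel response map, normalized per pixel.

    Returns losscategory (cross-entropy against the argmax labels) plus
    mu * lossspace (L1 continuity between neighboring response vectors).
    """
    labels = np.argmax(mx_map, axis=-1)                     # MXsub-label
    picked = np.take_along_axis(mx_map, labels[..., None], axis=-1)
    loss_category = -np.log(picked + 1e-12).sum()
    loss_space = (np.abs(np.diff(mx_map, axis=0)).sum()     # vertical pairs
                  + np.abs(np.diff(mx_map, axis=1)).sum())  # horizontal pairs
    return loss_category + mu * loss_space

rng = np.random.default_rng(1)
raw = rng.random((6, 6, 4))
mx_map = raw / raw.sum(axis=-1, keepdims=True)  # normalize per pixel
value = unsupervised_loss(mx_map)

# A perfectly confident, spatially constant map has (near) zero loss.
flat = np.zeros((6, 6, 4)); flat[..., 0] = 1.0
assert unsupervised_loss(flat) < 1e-6
assert value > 0.0
```

Minimizing this quantity simultaneously sharpens the per-pixel category confidence and smooths the labels across neighboring pixels, which is why the category count shrinks as training proceeds.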
The results of Muc are unsupervised classification results, not segmentation results; pixels that are adjacent and have the same category label can be assigned the same segment label. Accordingly, the process of subimage segmentation by the deep neural network is described in Algorithm 3.
Algorithm 3: Subimage segmentation by a deep neural network (SUB-Seg)
Input: Isub
Output: Rsub
Begin
  Muc = UCM-Train(Isub, Nmax-train, Nmin-category, Nmax-category);
  MXsub-label = Use Muc to process Isub;
  Rsub = ø; segid = 1;
  seedpixel = first pixel in MXsub-label with label ≠ −1;
  while (seedpixel is not null)
    pixelpositions = Perform flood filling at seedpixel based on the same category label;
    MXsub-label[pixelpositions] = −1;
    Rsub ← {segid, {pixelpositions}};
    seedpixel = first pixel in MXsub-label with label ≠ −1;
    segid = segid + 1;
  return Rsub;
End
For a subimage Isub, the SUB-Seg algorithm first uses UCM-Train to train an unsupervised deep model Muc, which can generate the unsupervised classification results MXsub-label. SUB-Seg then uses the flood fill algorithm to find adjacent pixels with the same class label and assign them to the same segment; once all pixels have been assigned, SUB-Seg yields Rsub (the unsupervised segmentation results for Isub).
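The flood-fill grouping used by SUB-Seg can be sketched in pure Python (a 4-connected variant; the connectivity is our assumption, as the paper does not specify it):

```python
from collections import deque

def label_segments(category_map):
    """Group adjacent pixels with equal category labels into segments.

    category_map: list of lists of ints (the unsupervised class labels).
    Returns (segment_id_map, number_of_segments).
    """
    h, w = len(category_map), len(category_map[0])
    seg = [[-1] * w for _ in range(h)]       # -1 means "not yet assigned"
    seg_id = 0
    for y0 in range(h):
        for x0 in range(w):
            if seg[y0][x0] != -1:
                continue
            # Breadth-first flood fill from the seed pixel.
            queue = deque([(y0, x0)])
            seg[y0][x0] = seg_id
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and seg[ny][nx] == -1
                            and category_map[ny][nx] == category_map[y][x]):
                        seg[ny][nx] = seg_id
                        queue.append((ny, nx))
            seg_id += 1
    return seg, seg_id

cats = [[0, 0, 1],
        [0, 1, 1],
        [2, 2, 1]]
seg, n = label_segments(cats)  # three connected regions
```

Note that two regions sharing a category label but not touching receive distinct segment ids, which is exactly the classification-to-segmentation conversion SUB-Seg performs.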
4.3.2. LDSM Process
The LDSM is responsible for segmenting each ri in Ri−1 to obtain a set of finer results, Ri. The LDSM uses the deep neural network model Muc and the SUB-Seg algorithm to perform the segmentation. The “lazy” nature of the LDSM is reflected in three aspects:
- (1)
Low segmentation requirements: the LDSM focuses on refining each segment obtained from the previous step of HOFG. Therefore, in the current step, SUB-Seg does not need to perform a perfect and thorough segmentation; instead, for a separable ri, it is sufficient to find the most obviously different objects in it and separate it into two or more segments accordingly. This significantly reduces the difficulty of the SUB-Seg task.
- (2)
Low segment border and pixel precision requirements: the basic segmentation units of the LDSM are the grid cells from Rgrid. Because a pixel-level fragmentation and the formation of jagged boundaries during the resizing are prevented by the grid-based strategy, the LDSM does not need to pursue precision at the level of all borders and pixels.
- (3)
Low image size requirements: due to the relatively loose requirements of (1) and (2), when a large segment/image needs to be processed, the LDSM can directly downscale the image before processing it. This makes the LDSM easier to adapt to larger remote sensing images.
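The downscaling decision in aspect (3) amounts to reducing an image so that neither dimension exceeds the LDSM's maximum input size while preserving the aspect ratio; a sketch (the helper name is ours):

```python
def downscale_size(width, height, p_max_size):
    """Return (new_w, new_h) such that max(new_w, new_h) <= p_max_size.

    Images already within the threshold are returned unchanged; otherwise
    both dimensions are scaled by the same factor (aspect ratio preserved).
    """
    longest = max(width, height)
    if longest <= p_max_size:
        return width, height
    scale = p_max_size / longest
    return max(1, int(width * scale)), max(1, int(height * scale))

assert downscale_size(600, 600, 1000) == (600, 600)     # small: untouched
assert downscale_size(4000, 2000, 1000) == (1000, 500)  # scaled by 0.25
```

Because the grid correction later snaps labels back to grid-cell boundaries, the resolution lost here does not propagate into the final segment borders.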
Based on the above characteristics, the process of the LDSM is shown in Figure 4.
As shown in Figure 4, for an image Iimg and the corresponding segmentation results Ri−1, the LDSM can further separate Ri−1 to output finer segmentation results Ri. The LDSM can be divided into two stages:
- (1)
Processing of all segments
For this step, the input is the segment set Ri−1 = {r1, r2, …, rn}, and the output is the segmented and labeled image Iseg-label. Each ri may need to be further separated into smaller segments. For ri, the corresponding rectangular image Isub-ori is cut from Iimg. Then, to focus on the segmentation task for ri, the pixels in Isub-ori that are not in ri are replaced with a constant background color.
We define a threshold value Pmax-size, which specifies the maximum input image size for the LDSM. Isub-ori may be larger than Pmax-size; in this case, it needs to be reduced to a size smaller than Pmax-size to obtain the input for SUB-Seg, Isub. SUB-Seg then processes Isub to obtain the corresponding segmentation results Rsub; subsequently, Rsub is resized to obtain the corresponding original-size results Rsub-ori. Based on the segments in Rsub-ori, pixel labels are written into the image to obtain Iseg-label, which can be regarded as the refined segment image for Ri−1. Since the SUB-Seg algorithm itself is not perfect, fragments will be present in the results; at the same time, the downscaling and subsequent upscaling in the processing of each ri may produce jagged boundaries. Therefore, there will be many defects in Iseg-label, which will require correction in the next stage.
- (2)
Correction based on Rgrid
The grid cells in Rgrid are treated as the basic units of the segmentation process. The pixels within a grid cell must all be assigned the same segment label; to this end, the label of the majority of the pixels determines the label of the entire grid cell. In this way, unnecessary fragments that are too small can be filtered out, and the boundaries of the grid cells can be used to replace the jagged boundaries.
Based on the above stages, the corresponding process is described in Algorithm 4.
Algorithm 4: Lazy deep segmentation method (LDSM)
Input: Iimg, Rgrid, Ri−1
Output: Ri
Begin
  # Processing of all segments
  Iseg-label = empty label image; globallabel = 1;
  foreach ri in Ri−1
    if (ri is too small)
      Iseg-label[pixel locations of ri] = globallabel++;
      continue;
    Isub-ori = cut a subimage from Iimg based on a rectangle completely containing ri;
    Isub = downscale Isub-ori if Isub-ori is larger than Pmax-size;
    Rsub = SUB-Seg(Isub);
    Rsub-ori = upscale Rsub to the original size;
    foreach ri′ in Rsub-ori
      Iseg-label[pixel locations of ri′] = globallabel++;
  # Correction based on Rgrid
  Ri = ∅; globallabel = 1;
  foreach ri in Rgrid
    mlabel = the majority label among the pixels in Iseg-label[pixel locations of ri];
    Iseg-label[pixel locations of ri] = mlabel;
  labellist = unique(Iseg-label);
  foreach li in labellist
    pixelpositions = the pixels in Iseg-label where label = li;
    Ri ← {globallabel++, {pixelpositions}};
  return Ri;
End
The LDSM can further separate the segments in Ri−1 to output Ri, where Ri is a finer segment set than Ri−1. By executing the LDSM multiple times, the segmentation results for Iimg can be gradually refined.
4.4. Hierarchical Object-Focused Segmentation Process
Based on the LDSM, HOFG iteratively and hierarchically narrow their focus from the entire remote sensing image to the details of the objects. This process is illustrated in
Figure 5.
As shown in
Figure 5, HOFG obtain
R0 and
Rgrid after initialization via the HOFG-Init algorithm. Starting from
R0, the LDSM then continuously refines the segmentation results in sequence, producing
R1,
R2, … to
Rn. The overall algorithm process is described in Algorithm 5.
Algorithm 5: Hierarchical object-focused and grid-based deep unsupervised segmentation method for high-resolution remote sensing images (HOFG)
Input: Iimg, Nmax-iteration
Output: Rresult
Begin
  [R0, Rgrid] = HOFG-Init(Iimg);
  Ri−1 = R0; inum = 1;
  while (inum < Nmax-iteration)
    Ri = LDSM(Iimg, Ri−1, Rgrid);
    if (Ri has not changed compared to Ri−1)
      break;
    Ri−1 = Ri;
    inum = inum + 1;
  Rresult = Ri;
  return Rresult;
End
More iterations of HOFG will result in more segments. There are two termination conditions for HOFG: either the maximum number of iterations, Nmax-iteration, is reached, or Ri no longer changes from one iteration to the next. With the support of the LDSM, HOFG gradually separate Ri−1 into finer results Ri and progressively achieve the final segmentation goal.
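The outer loop of Algorithm 5 can be sketched as a small driver function. This is an illustrative skeleton, not the authors' code: `init_fn` and `ldsm_fn` stand in for HOFG-Init and the LDSM, and the convergence check is simplified to an equality test on the segmentation results.

```python
def hofg(image, max_iterations, init_fn, ldsm_fn):
    """Iterative HOFG driver in the spirit of Algorithm 5: refine the
    segmentation until it stops changing or the iteration cap is hit."""
    r_prev, r_grid = init_fn(image)          # HOFG-Init: R0 and Rgrid
    for _ in range(max_iterations):
        r_next = ldsm_fn(image, r_prev, r_grid)
        if r_next == r_prev:                 # second termination condition
            break
        r_prev = r_next
    return r_prev
```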
5. Experiments
5.1. Method Implementation
In this study, all algorithms for HOFG were implemented in Python 3.7, the deep neural network was realized with PyTorch 1.7 and scikit-image 0.18 was used for image access/processing. All the methods were tested on a computer equipped with an Intel i9-9900K CPU with 64 GB of main memory and an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of video memory.
To investigate the differences between HOFG and the traditional methods, we selected the following methods for a comparison:
- (1)
SLIC: SLIC clusters pixels in the CIELAB color space and can quickly generate compact superpixels. SLIC is a relatively easy-to-control traditional shallow segmentation method, in which the target number of segments and the compactness of the segments can be specified [
52].
- (2)
Watershed: the watershed algorithm uses a gradient image as a landscape, which is then flooded from given markers. The
markers parameter determines the initial number of markers, which, in turn, determines the final output results [
53].
- (3)
Felzenszwalb: the Felzenszwalb algorithm is a graph-based image segmentation method, in which the
scale parameter influences the size of the segments in the results [
54].
- (4)
UDNN: a UDNN is used to classify a remote sensing image in an unsupervised manner [
16,
17], and segment labels are then assigned via the flood fill algorithm. As analyzed in
Section 3, a UDNN cannot handle remote sensing images that are too large. Therefore, for large remote sensing images, we resize them to a smaller size that is suitable for processing on our computer and then restore the results to the original size before assigning the segment labels.
- (5)
HOFG: for the method proposed in this article, the maximum input size Pmax-size of the LDSM is set to 600 × 600, and the maximum number of iterations, Nmax-iteration, is specified as five.
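For reference, the three shallow baselines are all available in scikit-image (version 0.18 is used in this study). The following is a minimal usage sketch; the parameter values shown here are illustrative, not the exact experimental settings, and a bundled sample image stands in for a remote sensing patch.

```python
import numpy as np
from skimage.data import astronaut          # bundled RGB stand-in image
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import slic, watershed, felzenszwalb

img = astronaut()

# (1) SLIC: target segment count and compactness are the key knobs.
seg_slic = slic(img, n_segments=400, compactness=10, start_label=1)

# (2) Watershed: flood a gradient "landscape" from `markers` seed points.
seg_ws = watershed(sobel(rgb2gray(img)), markers=400, compactness=0.001)

# (3) Felzenszwalb: `scale` influences the size of the output segments.
seg_fz = felzenszwalb(img, scale=100)
```

Each call returns an integer label image of the same height and width as the input, which is the common representation assumed throughout the evaluation.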
To test the segmentation capabilities of the five methods, we adopt the Vaihingen and Potsdam images from the ISPRS WG II/4 dataset. The characteristics of these images are summarized in
Table 1.
5.2. Reference-Based Evaluation Strategy
A key criterion for the quality of unsupervised segmentation results is whether the segments destroy the boundaries of the ground objects. In this regard, a segment may exist in one of two situations: (i) the segment does not cross any object boundary, and all pixels in the segment are correct, or (ii) the segment crosses at least one boundary between two objects, and some pixels in the segment are incorrect. Based on this understanding, we use the strategy illustrated in
Figure 6 to evaluate the accuracy of the segments.
As shown in
Figure 6a, to evaluate a set of unsupervised segmentation results
Rresult, we can introduce a ground-truth image
Igroundtruth. For each segment
ri in
Rresult, its intersection with
Igroundtruth is taken, and the category label associated with the largest proportion of the pixels is regarded as that segment’s ground-truth category label; once all segments have been assigned a category label in this way,
Iseg-intersect is obtained. We adopt the overall accuracy (OA) to describe the accuracy of
Iseg-intersect:

OA = Ncorrect-pixels/Npixels,
where
Npixels is the number of pixels in
Igroundtruth and
Ncorrect-pixels is the number of pixels with correct category labels in
Iseg-intersect. We also adopt the mean intersection over union (mIoU) as the evaluation metric. For a certain category
i, the intersection over union (IoU) is defined as follows:

IoUi = |Rgt-i ∩ Rseg-i|/|Rgt-i ∪ Rseg-i|,
where
Rgt-i is the set of pixels belonging to category
i in
Igroundtruth and
Rseg-i is the set of pixels belonging to category
i in
Iseg-intersect. The mean IoU (mIoU) is defined as follows:

mIoU = (1/Ncategories) ∑i IoUi,

where Ncategories is the number of categories.
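The two metrics can be computed directly from the label images. This is a straightforward NumPy sketch of the definitions above, assuming `gt` (Igroundtruth) and `pred` (Iseg-intersect) are integer category-label arrays of the same shape; it averages the IoU over the categories present in the ground truth.

```python
import numpy as np

def overall_accuracy(gt, pred):
    """OA = Ncorrect-pixels / Npixels."""
    return float(np.mean(gt == pred))

def mean_iou(gt, pred):
    """mIoU: mean of per-category IoU = |Rgt-i ∩ Rseg-i| / |Rgt-i ∪ Rseg-i|."""
    ious = []
    for c in np.unique(gt):
        inter = np.logical_and(gt == c, pred == c).sum()
        union = np.logical_or(gt == c, pred == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```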
Nevertheless, the OA metric CANNOT be used as the sole criterion for evaluating the results of an unsupervised segmentation algorithm. As shown in
Figure 6b, for a set of segmentation results, when the number of segments is large and the size of the segments is sufficiently small, any segments that cross a boundary will also be very small. Under such extreme conditions, such as one pixel corresponding to one segment, the OA value can reach 100%. All the methods participating in a comparison may be capable of producing very small segments and consequently reaching OA values approaching 100%, but this does not necessarily mean that all the corresponding methods are excellent. Therefore, a good segmentation evaluation method needs to consider both the OA and the number of segments.
As shown in
Figure 6c, our evaluation strategy is based on reference results. Since SLIC is a traditional segmentation method in which the number and compactness of the segments can be easily controlled, we select SLIC as the reference method. By adjusting the SLIC parameters, we can obtain a high OA score =
x while ensuring that the segments are of a reasonable number and size. The corresponding SLIC results are then taken as the reference against which to measure the segmentation quality. Since the OA variations of the various methods are not linear, it is impossible to obtain an OA strictly equal to
x; therefore, we consider a value close to or greater than
x to reach the OA target. For methods (2)–(5), we adjust the parameters to attempt to achieve an OA in the interval of [
x-0.5%, 100%] with as few segments as possible. The parameter settings for all methods are given in
Table 2.
For similar OA scores, a method that can produce fewer and larger segments is considered to have a higher object recognition ability, indicating that this method is better; otherwise, its recognition ability and, thus, its segmentation quality are lower.
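The selection rule of this evaluation strategy can be stated concisely in code. The following is an illustrative helper (not part of the paper's pipeline) that, given (OA, segment count) pairs from several parameter settings of a method, picks the fewest-segment result whose OA lands in the target interval [x − 0.5%, 100%].

```python
def fewest_segments_reaching(target_oa, candidates):
    """Among (oa, n_segments) results from different parameter settings,
    return the result with the fewest segments whose OA is at least
    target_oa - 0.005; return None if no setting qualifies."""
    qualifying = [c for c in candidates if c[0] >= target_oa - 0.005]
    return min(qualifying, key=lambda c: c[1]) if qualifying else None
```

Under this rule, a method that needs fewer segments to reach a comparable OA is judged to have the higher object recognition ability.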
5.3. Iterative Process of HOFG and Comparison of Results
In this section, we again use the image patch in
Figure 1 as the test image. First, method (1), the SLIC method, is used as the reference method. For this test image, target segments = 400 and compactness = 10 are used as the SLIC method’s parameters, and OA = 91.75% is obtained. With the goal of reaching or exceeding this accuracy, four iterations of HOFG are performed, with the results being shown in
Figure 7.
As shown in
Figure 7, SLIC with a segment number parameter of 600 is adopted in HOFG to initialize
Rgrid, which finally contains 516 segments; the iterative process of HOFG then begins. In the first iteration, only 82 segments are generated; from the whole image, the large areas of houses, roads and trees are segmented, yielding results close to the ideal target of “one object corresponding to one segment” for OBIA. From these results, it can be seen that as a deep method, HOFG have obvious advantages compared with traditional shallow methods; however, the results of the first iteration also have some defects. For example, in the region enclosed by the green circle, some large-scale ground objects are connected together (corresponding to a large cross-boundary segment). Consequently, the OA is only 80.57%, failing to surpass the OA of the reference SLIC results. As the iterative process continues, the second iteration produces 143 segments with OA = 87.48%, and the third iteration produces 221 segments with OA = 91.11%. Finally, in the fourth iteration, OA = 92.04%, and the number of segments is 247. At this point, the accuracy of HOFG exceeds the reference accuracy of SLIC, indicating that no further iterations are needed.
For all five methods participating in the comparison, their segmentation results for this test image are shown in
Figure 8.
The corresponding OA values and numbers of segments are listed in
Table 3.
As seen in
Figure 8 and
Table 3, with SLIC as the reference method, both the watershed and Felzenszwalb algorithms can achieve the OA target, but they need to generate more segments to achieve this accuracy. For the watershed algorithm, the number of segments is 484, and for the Felzenszwalb algorithm, the number of segments is 428. Among the three shallow methods, SLIC uses the fewest segments to achieve better results, which is why in many OBIA applications for remote sensing, SLIC is the preferred segmentation algorithm.
For the UDNN method, we consider the segmentation results obtained after both 5 iterations and 15 iterations; the reason for performing these two consecutive tests is to show that no matter what training parameters are assigned, it is very difficult for a UDNN to achieve the OA target. With five iterations, the UDNN generates many large-area cross-boundary segments, seriously affecting its accuracy, and the OA is only 84.81%. Moreover, this low number of iterations also results in a serious fragmentation, and the number of segments is the highest at 1346 (note that this number is different from the number of segments in
Figure 1 because the neural network training process is subject to a certain level of randomness). With 15 iterations, not only are the problems seen after 5 iterations not solved, but new problems also emerge. As shown in
Figure 8, stripe-shaped and connected segments appear at the boundaries of objects, introducing more errors into the segmentation results and causing the OA to decrease to 77.40%, while the number of segments is reduced to 773. Although obvious objects are recognized by the UDNN, the low OA and the excessively large number of segments cause the UDNN results to lack practical value for remote sensing applications.
In contrast, HOFG obtain OA = 92.04% and number of segments = 247, achieving the highest accuracy with the fewest segments among all the methods. HOFG can obtain an OA that is 7.2% higher than that of the UDNN by using only approximately 18% as many segments. The above results also show that Obstacles 1 and 2, identified in
Section 3, are overcome, indicating that HOFG have the following advantages:
Overcoming Obstacle 1: the segmentation refinement process is controllable. As shown in
Figure 7, segmentation with HOFG is a process of a step-by-step refinement. In each iteration, HOFG split the large segments identified in the previous iteration into smaller segments. Accordingly, a reasonable OA goal can be achieved by controlling the number of iterations of the HOFG process.
Overcoming Obstacle 2: the fragmentation and stripe-shaped segments are suppressed. In HOFG, the segments are not directly generated from the unsupervised classification output of a UDNN; instead, Rgrid is used as a filter to suppress fragmentation and stripe-shaped segments. This strategy guarantees that the number and size of the generated segments will be within a reasonable range.
Because of the above two characteristics, HOFG inherit the advantages of deep learning methods while avoiding their typical shortcomings, thereby endowing HOFG with a practical application value.
5.4. Performing Segmentation on Larger Remote Sensing Images
To test the performance of HOFG on larger remote sensing images, we adopt two remote sensing images from the ISPRS dataset: (1) test image 1, from the Vaihingen image file “top_mosaic_09cm_area11.tif”, with dimensions of 1893 × 2566, and (2) test image 2, from the Potsdam image file “top_potsdam_2_10_RGB.tif”, with dimensions of 6000 × 6000.
These two images are relatively large. The three shallow methods can easily process them; however, a neural network model constructed for such a large input size would be very large, far beyond what our computer’s GPU could load. Therefore, to segment these large images using a UDNN, the strategy is to downscale them to a smaller size, perform the segmentation and then upscale the segmentation results to the original size. HOFG have their own mechanism for processing large images and can therefore process these two images directly.
The test images and their corresponding ground-truth segmentation images and segmentation quality evaluation results are shown in
Figure 9a.
The segmentation results for the two images, along with their category labels, are presented in
Figure 9b. Since the images are very large and the numbers of segments are similarly large, it is difficult to clearly see any individual segments once the images have been sized for presentation in this paper, so
Figure 9b does not list the segmentation results of the methods. We show examples of detailed differences in local areas in
Figure 10.
Due to the high OA target, the labeled segmentation results of all the methods except for the UDNN method are fairly similar. The UDNN cannot reach the reference OA, and fragmentation can also be seen in
Figure 9 (marked with circles). Taking the SLIC method as the reference method, the performance of the five methods is shown in
Table 4.
For test image 1, the initial segment number parameter of SLIC is specified as 7000, and the number of segments obtained after an algorithm execution is 5799. For test image 2, the initial segment number parameter of SLIC is specified as 9000, and the final number of segments obtained after the algorithm execution is 7882. With SLIC as the reference method, the watershed algorithm needs more segments to achieve a corresponding OA, whereas the number of segments produced by the Felzenszwalb algorithm is close to that of SLIC. For the UDNN method, as discussed in
Section 5.3, it is impossible to achieve the target OA, so we choose a set of output results with a reasonable number of segments and a relatively high OA. It can be seen that the number of segments generated by the UDNN is much higher than the number produced by the shallow methods, which again proves that a UDNN is not suitable for directly performing an unsupervised segmentation of the remote sensing images. In contrast, HOFG achieve the OA goal with the fewest segments among all the methods. Detailed comparisons of the five methods on test image 1 are shown in
Figure 10.
As seen from
Figure 10a,b, the three shallow unsupervised segmentation methods (SLIC, watershed and Felzenszwalb) cannot truly distinguish objects in remote sensing images; their segmentation results are based only on local color clustering, thresholding or graph results. Due to this focus on local characteristics, a larger image size affects only the processing time, so these shallow methods can be effectively adapted to the processing of large remote sensing images.
For the UDNN, some attractive results can be observed in
Figure 10a: buildings, trees and impervious surfaces are all identified as large, whole segments. Unfortunately, however, it can also be observed that some segments span multiple objects (marked with red circles), and due to the image downscaling and upscaling process, the segment boundaries exhibit a certain sawtooth phenomenon. These problems directly decrease the segmentation accuracy of the UDNN. In
Figure 10b, since test image 2 is much larger, the UDNN needs to perform a downscaling and upscaling to an even greater extent, so the sawtooth phenomenon is more obvious, and some building boundaries are even destroyed. As seen from these results, a UDNN has the high recognition ability characteristic of deep neural networks, but many uncontrollable problems directly affect the final results. Especially for large remote sensing images, the problems of sawtooth boundaries, cross-boundary segments and broken boundaries become more obvious.
Regarding the results of HOFG, it can be seen that on the one hand, HOFG can use larger segments to capture objects in an image, and buildings, roads and cars are all recognized, indicating that HOFG retain the recognition ability of deep neural networks. On the other hand, for both test image 1 and test image 2, the large size of these images does not lead to undesirable results of HOFG, and segments with relatively accurate boundaries can still be obtained. From the results in
Figure 10a,b, it can be inferred that HOFG can be successfully used for the processing of large remote sensing images to achieve a higher-quality segment recognition. Specifically, HOFG have the following advantage:
Overcoming Obstacle 3:
For test images 1 and 2, the 11 GB of GPU memory provided in this experiment is insufficient to directly load the corresponding UDNN models. HOFG provide a mechanism to reduce the computational resource requirements and thus have the ability to process large remote sensing images. Due to the use of the grid strategy and the LDSM, HOFG can handle large remote sensing images and obtain reasonable boundaries and segments.
Processing large remote sensing images is a necessary ability for any practically applicable method in this field, and HOFG have this ability, allowing them to play a role in practical applications.
5.5. Execution Time Comparison
To analyze the execution times of the methods, we use the five methods to process the test images and run each method five times. The average execution time comparison of the five methods is shown in
Table 5.
The execution times of the three shallow methods, SLIC, watershed and Felzenszwalb, are directly associated with the number of pixels; the larger the number of pixels is, the longer the time. There is no obvious difference in the speeds of these three methods. The UDNN uses a size reduction strategy; in this example, the size of test image 2 is 6000 × 6000, while the UDNN uses only a 600 × 600 input, which is equivalent to reducing the amount of data and the number of computations by 10 × 10 = 100 times. Thus, the UDNN is the fastest among all the methods for test image 2. HOFG need to use SLIC to obtain Rgrid and then perform multiple rounds of UDNN processing on many subimages. Although the quantity of input data for each processing step is small, the total run time increases. Among all the methods, HOFG are the slowest; for test image 2 in particular, HOFG need close to 15 min to run. Although this time is barely acceptable, the speed of HOFG could be increased. From the experimental results, we note that the UCM-Train algorithm of HOFG must run completely from its initial state in each iteration, which consumes a considerable share of the HOFG execution time. In future work, we will improve UCM-Train so that it can reuse the trained state from the previous HOFG iteration, which will greatly improve the training speed of HOFG.
5.6. Experiments on More Remote Sensing Images
To test the remote sensing processing capabilities of HOFG more extensively, this section introduces more remote sensing image files, the details of which are listed in
Table 6.
As shown in
Table 6, we consider 10 high-resolution remote sensing images, of which images 1 to 5 are from the Vaihingen dataset and images 6 to 10 are from the Potsdam dataset. For the reference SLIC method, segmentation number = 7000 is used for all the Vaihingen images, and segmentation number = 9000 is used for all the Potsdam images. The
Rgrid parameter of the HOFG is
Pslicnum = (1.5 × SLIC parameter), and the HOFG process is iterated to achieve the same OA as that of the reference SLIC results. The results for the ten images are compared in
Table 7.
As shown in
Table 7 and analyzed in the previous sections, HOFG are a controllable segmentation method and can therefore reach the target OA for all the images. Regarding the number of iterations needed to reach the OA target, most images require four iterations; the exception is image 7, which requires five iterations. For all images, HOFG produce fewer segments than SLIC. The “Percentage” column of
Table 7 presents the value of (HOFG segment number)/(SLIC segment number) × 100%; from this column, it can be seen that for these remote sensing images, HOFG can achieve an OA similar to that of SLIC with, on average, 81.73% as many segments as SLIC. Among the test images, HOFG perform the best on image 5, needing only 69.39% as many segments as SLIC to achieve a comparable segmentation accuracy.
6. Conclusions
To better analyze complex high-resolution remote sensing data via segmentation, we need to take advantage of the higher object recognition ability offered by deep learning to improve the results of the unsupervised segmentation methods. However, directly converting the output of a UDNN into segments does not result in an improved segmentation effect; indeed, in most cases, the results of a UDNN are inferior to those of shallow methods.
This paper proposes a hierarchical object-focused and grid-based deep unsupervised segmentation method for high-resolution remote sensing images (HOFG). By means of an iterative and incremental improvement mechanism, HOFG can overcome the problems caused by the direct use of UDNNs and yield superior unsupervised segmentation results. Experiments show that HOFG have the following advantages.
- (1)
Inheriting the recognition ability of deep models: in place of shallow models, a UDNN is adopted in HOFG to identify and segment objects based on the high-level features of remote sensing images, enabling the generation of larger and more complete segments to represent typical objects such as buildings, roads and trees.
- (2)
Enabling a controllable refinement process: with the direct use of a UDNN, it is difficult to control the oversegmentation or undersegmentation of the results; consequently, a UDNN is unable to achieve a given OA target. In HOFG, the LDSM is used in each iteration to refine the results from the previous iteration; this ensures that the segments can be progressively refined as the iterative process proceeds, thus making the HOFG segmentation process controllable.
- (3)
Reducing fragmentation at the border: a reference grid is used in the LDSM, and the UDNN results are indirectly transformed into segments on the basis of this grid. In this way, segments that are too small, stripe shaped or inappropriately connected can be filtered out.
- (4)
Providing the ability to process large remote sensing images: as the input image size increases, the corresponding UDNN model also increases in size, making it difficult for ordinary GPUs to load. Consequently, the direct use of a UDNN requires a resizing of the input image, which can lead to jagged segment boundaries. The grid-based strategy of the LDSM and the iterative improvement process in HOFG reduce the detrimental influence of resizing, allowing HOFG to obtain high-quality segmentation results even when processing very large images.
HOFG also have shortcomings: they are slower than traditional shallow methods. The key factor affecting the execution speed lies in the UCM-Train algorithm; by improving its training strategy, the execution time of HOFG can be significantly reduced. We will focus our research on this issue to make HOFG more useful in applications that require results quickly or repeatedly.
HOFG not only inherit the advantages of deep models but also avoid the problems caused by their direct use; thus, HOFG can obtain superior results compared to either shallow or other deep methods. Moreover, their controllability and relatively stable performance make HOFG as easy to use as ordinary shallow unsupervised segmentation methods. HOFG thus combine the advantages of both shallow and deep models for unsupervised remote sensing image segmentation and provide a way to utilize deep models indirectly. As a result, HOFG have theoretical significance for exploring the applications of deep learning in the field of remote sensing.