1. Introduction
With the rapid development of remote-sensing technology, a large number of RS images are captured daily by satellites, airplanes, and drones, and understanding their content has become an increasingly urgent practical need. In the field of computer vision, natural images (e.g., COCO [1,2], ADE20K [3], Cityscapes [4], and Mapillary Vistas [5]) are captured within a local region at or near ground level for specific purposes; they therefore tend to have a visual center, and pixels of the same category usually form spatially contiguous regions. In contrast, RS images (e.g., ISPRS [6] and LoveDA [7]) are taken from a high-altitude perspective, so ground objects are scattered across all corners of the image. In addition, objects of very different sizes appear simultaneously in RS images. This is evident in urban scenes, where some objects occupy large areas while others cover only small regions; for instance, road surfaces usually span a large area, whereas cars occupy minimal space. Although techniques for understanding natural images have matured in computer vision, practice has shown that directly applying these existing models to complex RS images does not yield satisfactory results because of the significant differences in visual appearance.
Semantic segmentation is a fundamental image understanding task that assigns a class label to every pixel. It is a delicate yet challenging task, especially for sophisticated high-resolution RS images with rich ground details and many multi-scale objects. Recently, techniques for this task have been advanced primarily by following developments in deep learning for natural images. Since the fully convolutional network [8] was first proposed, convolution served for a long time as the most common basic operation for constructing semantic segmentation models, including the well-known PSPNet [9], UNet [10], and DeepLabv3+ [11]. However, the emergence of the vision transformer [12] has changed this paradigm, and a large body of transformer-based work now defines the state of the art in semantic segmentation, such as MaskFormer [13] and Mask2Former [14]. Most notably, MaskFormer [13] rethinks per-pixel classification as mask classification with learnable queries, facilitated by a DETR-like [15] architecture. Later, Mask2Former [14] further combined masked attention mechanisms with multi-scale features, yielding a powerful capability for visual representation. This observation motivates us to investigate, with an improved model, its ability to segment complex RS images.
More specifically, the purpose of this study is to design an improved model that exploits the queries of Mask2Former [14] to better capture the characteristics of remote-sensing images and thereby achieve higher segmentation performance. In this study, the queries are improved along three pathways to adapt to RS scenes, which yields a new neural architecture. First, we propose a query scenario module (QSM). Considering the complexity of RS scenes, such as building clusters, field landscapes, and road scenarios, it is intuitive that different scenarios should be associated with different queries. In addition, the large number of classes in natural scenes inevitably results in many learnable queries in the original Mask2Former [14], which increases the computational load and the number of model parameters; this burden could grow even further given the various scenarios of multi-scale objects scattered across RS images. Therefore, we design the QSM to adaptively distinguish various scenarios. Technically, this module decreases the number of queries, thereby reducing the computational load and the number of model parameters, and at the same time selects the queries suited to different scenarios, which helps adjust the scene adaptability of the model.
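As a rough illustration of this selection idea (a sketch under our own assumptions in PyTorch, not the concrete design detailed in Section 3; all module and parameter names are hypothetical), the QSM can be pictured as scoring a large pool of candidate queries with a global scene descriptor and keeping only the top-scoring subset:

```python
import torch
import torch.nn as nn

class QueryScenarioSketch(nn.Module):
    """Illustrative only: select a scenario-dependent subset of learnable queries."""
    def __init__(self, pool_size=100, num_selected=20, dim=256):
        super().__init__()
        self.query_pool = nn.Embedding(pool_size, dim)       # large pool of candidate queries
        self.scorer = nn.Linear(dim, pool_size)               # scores the pool from a global descriptor
        self.num_selected = num_selected

    def forward(self, image_feat):                            # image_feat: (B, C, H, W), C == dim assumed
        g = image_feat.mean(dim=(2, 3))                       # global scene descriptor, (B, C)
        scores = self.scorer(g)                               # (B, pool_size)
        top = scores.topk(self.num_selected, dim=1).indices   # scenario-adaptive selection
        return self.query_pool.weight[top]                    # (B, num_selected, dim)
```

Working with the smaller, scenario-selected set of queries is what would yield the reduction in computational load and parameters mentioned above.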
Second, we introduce a query position module (QPM). The motivation for this module is the variation in the spatial distribution of targets in RS images. In natural scene images, cars and pedestrians are typically located in the bottom half of the image, while the sky and trees are generally found in the upper region. In contrast, in RS images the target positions are not fixed; a car, for example, may appear anywhere in the image. Consequently, compared with natural images, the spatial information of targets is more significant for transformer-based models in remote-sensing scenarios. However, in conventional transformer-based models, only a simple encoding such as the cosine position encoding is incorporated into the image features [12], while a learnable embedding is added to the queries later. For the queries to be learned, this could result in insufficient position awareness of the images. To this end, the QPM is proposed to address this issue by integrating the cosine position encoding of the input image features into the queries. This enhancement aims to increase the model's position sensitivity in RS images and thereby improve segmentation performance.
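As a rough, parameter-free illustration of this integration (a sketch under our own assumptions, not the exact QPM formulation of Section 3), one can build the standard 2D sinusoidal position encoding of the feature map and inject into each query the encoding of the spatial locations it attends to:

```python
import torch

def sine_pos_encoding(h, w, dim=256, temperature=10000.0):
    """2D sinusoidal (cosine) position encoding, returns (h * w, dim); dim must be divisible by 4."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    omega = 1.0 / temperature ** (torch.arange(dim // 4, dtype=torch.float32) / (dim // 4))
    out = []
    for coord in (ys.flatten(), xs.flatten()):           # encode the y and x axes separately
        ang = coord[:, None] * omega[None, :]            # (h*w, dim/4)
        out.extend([ang.sin(), ang.cos()])
    return torch.cat(out, dim=1)                         # (h*w, dim)

class QueryPositionSketch(torch.nn.Module):
    """Illustrative only: fold the image position encoding into the queries."""
    def forward(self, queries, feat):                     # queries: (B, N, C), feat: (B, C, H, W)
        b, c, h, w = feat.shape
        pe = sine_pos_encoding(h, w, dim=c).to(feat)      # (H*W, C)
        # Assumption: weight the spatial encodings by each query's affinity to the pixels.
        attn = torch.softmax(queries @ feat.flatten(2), dim=-1)   # (B, N, H*W)
        return queries + attn @ pe.unsqueeze(0)           # queries gain position awareness, (B, N, C)
```

Because only fixed sinusoidal encodings and existing features are used, such an injection adds no learnable parameters, consistent with the property noted for the QPM.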
Finally, we propose a query attention module (QAM), since mining effective information from visual features, for example in the channel or spatial domain, has been demonstrated to be a practical approach for a variety of visual tasks [16,17,18,19,20]. Inspired by ODConv [16], we believe that the model's performance can be enhanced by explicitly incorporating learnable feature-attention modules into the learnable queries of the transformer-based model [14]. Based on this justification, the QAM is developed and positioned between the duplicate transformer decoder layers to highlight discriminative query features. As a result, the QAM ensures comprehensive utilization of the supervisory information and exploitation of fine-grained details, and thus helps extract pertinent query features. From the performance perspective, introducing the QAM allows the model to improve the representation quality of the learnable queries in the early transformer decoder layers; in the subsequent layers, the QAM effectively calibrates and accumulates pertinent information from the queries, thereby enhancing the model's response to varying image features.
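For intuition only, the recalibration role of such a query attention step can be sketched as a simple channel-attention gate applied to the query embeddings handed from one decoder stage to the next (ODConv itself is a multi-dimensional dynamic convolution; this simplified placeholder is our own assumption and not the QAM defined in Section 3):

```python
import torch.nn as nn

class QueryAttentionSketch(nn.Module):
    """Illustrative only: recalibrate query features passed between decoder stages."""
    def __init__(self, dim=256, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, queries):                 # queries: (B, N, C) from the previous decoder stage
        weights = self.gate(queries)            # per-query, per-channel attention weights
        return queries * weights                # emphasize informative channels, suppress the rest
```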
The key of our model lies in the innovation of the queries in the Mask2Former [14], taking into account the differences between RS scenes and natural ones. The main contributions and primary work can be summarized as follows:
We introduce the query scenario module. Considering the diversity of scenarios and the finite categories within RS datasets, we adaptively select effective queries as the input for the transformer decoder layer. This approach aims to enhance the model’s performance while reducing the number of model parameters and computational load.
We introduce the query position module. Regarding the complexity of positions in remote-sensing images, we incorporate the position encoding of image features into queries. This strategy is intended to further enhance the model’s capacity to perceive targets.
We propose the query attention module. We incorporate attention modules between duplicate transformer decoder layers to better mine valuable information from learnable queries. This approach is specifically designed to augment the extraction of valid query features.
The performance of IQ2Former for segmenting RS images has been assessed on three challenging public datasets: the Vaihingen dataset, the Potsdam dataset, and the LoveDA dataset. Comprehensive experimental results and ablation studies, including both numerical scores and visual segmentation results, demonstrate the effectiveness of the proposed model.
The remainder of this paper is structured as follows: Section 1 describes the background, motivation, objectives, and hypotheses of this study. Section 2 reviews the related works. Details of the proposed method are provided in Section 3. Experimental results are presented in Section 4. Discussions can be found in Section 5, followed by conclusions in Section 6.
4. Experiment
4.1. Data Description
The performance of the proposed IQ2Former has been evaluated on three challenging public benchmark datasets, all of which focus on the semantic segmentation of remote sensing images. Details about these datasets are given in Table 1. The Vaihingen [6] and Potsdam [6] datasets ceased being updated in 2018, taking the form we see today, while the LoveDA dataset [7] was published in 2021. The ground truth of the Vaihingen [6] and Potsdam [6] datasets encompasses six categories: impervious surface, building, low vegetation, tree, car, and clutter/background. We omit the segmentation results for the semantically meaningless clutter/background class. The ground truth of the LoveDA dataset consists of seven categories: building, road, water, barren, forest, agriculture, and background.
4.2. Baseline Model Description
FCN: The fully convolutional network [8] is the pioneering work of the deep learning era in semantic segmentation. FCN replaces fully connected layers with convolutional layers, enabling the network to process input images of arbitrary size and to generate outputs with corresponding spatial dimensions. FCN remains the most important baseline in semantic segmentation tasks.
PSPNet: The pyramid scene parsing network [9] utilizes a pyramid pooling module that gathers contextual information from diverse areas of an image, enabling the network to obtain a holistic understanding of the scene. This module effectively captures both local and global context by hierarchically partitioning the input feature map and performing spatial pyramid pooling operations.
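For reference, the core of such a pyramid pooling module can be sketched as follows (a simplified version using the commonly cited 1, 2, 3, and 6 pooling grids; channel sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingSketch(nn.Module):
    """Simplified pyramid pooling: pool at several grid sizes, project, upsample, concatenate."""
    def __init__(self, in_ch=2048, out_ch=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1)) for b in bins)

    def forward(self, x):                                   # x: (B, in_ch, H, W)
        h, w = x.shape[-2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)               # local features plus multi-scale context
```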
DeepLabV3+: DeepLabv3+ [11] uses dilated convolution to effectively enlarge the receptive field of filters, allowing the network to capture more contextual information. In addition, a feature pyramid structure [61] is introduced to combine features at various spatial resolutions. DeepLabV3+ was among the most advanced semantic segmentation algorithms before the advent of transformers.
OCRNet: The object-contextual representations network [23] first obtains coarse segmentation results from a general backbone and forms object region representations by gathering the pixel embeddings within each region. It then computes the relationship between each pixel and each object region. The final segmentation result is obtained by enhancing the representation of each pixel with the object-contextual representation.
UPerNet: The unified perceptual parsing network [62] mimics the human recognition of multiple levels of the visual world and unifies the datasets containing various scenes, objects, parts, materials, and textures. Using a feature pyramid network [61] and different detection heads, UPerNet can be applied to multi-task learning in addition to semantic segmentation.
MaskFormer: Rather than predicting the class of each pixel, MaskFormer [13] was the first to predict a set of binary masks, each associated with a single global class label prediction. Under the supervision of both a mask loss and a category loss, MaskFormer achieves excellent performance in both semantic and panoptic segmentation tasks.
Mask2Former: The critical component of Mask2Former [14] is masked attention, which extracts local features by confining cross-attention within the predicted mask regions. In this way, the research effort is reduced by at least three times while performance is improved by a significant margin. Mask2Former is capable of addressing all image segmentation tasks, including panoptic, instance, and semantic segmentation.
4.3. Experiment Details
Backbone: Our IQ2Former is compatible with any backbone architecture. For a fair comparison, the standard convolution-based ResNet with 50 layers, ResNet with 101 layers, and the transformer-based Swin Transformer (Swin-L) are used as our visual backbones. All backbones are pre-trained on ImageNet-1K [63] unless stated otherwise.
Pixel decoder: As shown in Figure 1, the four different-resolution outputs of the pixel decoder are expressed as $F_i$, where $i \in \{1, 2, 3, 4\}$. They are feature maps with resolutions of 1/32, 1/16, 1/8, and 1/4 of the input, respectively, in our experiments. Similarly to Mask2Former [14], the same multi-scale deformable attention transformer (MSDeformAttn) [64] is utilized as our pixel decoder.
Transformer decoder: In total, there are three consecutive transformer decoders in our IQ2Former (i.e., nine layers as a whole). The output of the QSM is the input to the first transformer decoder of IQ2Former. Note that the number of queries is set to 20 for the Vaihingen and Potsdam datasets, while it is set to 40 for the LoveDA dataset; the reason for this choice is discussed in Section 5. Each transformer decoder contains three QPMs and one QAM. Among them, the role of the QPM is to enhance the queries' perception of image positions. An auxiliary loss is added to every intermediate transformer decoder layer, so a competitive query is already obtained there; the proposed QAM is therefore used to further exploit the query output of the previous transformer decoder layer.
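The arrangement described above can be summarized by the following schematic pseudocode (module and argument names are placeholders; the concrete definitions are those of Section 3):

```python
# Schematic forward pass of the decoder as described in the text:
# 3 decoder stages x 3 layers = 9 layers in total, QSM before the first stage,
# a QPM inside every layer, and a QAM between consecutive stages.
def decode(image_feats, multi_scale_feats, qsm, stages, qams, heads):
    queries = qsm(image_feats)                       # scenario-adaptive query selection (QSM)
    aux_outputs = []
    for s, stage in enumerate(stages):               # three decoder stages
        for layer in stage:                          # three layers per stage, each containing a QPM
            queries = layer(queries, multi_scale_feats)
            aux_outputs.append(heads(queries))       # auxiliary supervision at intermediate layers
        if s < len(stages) - 1:
            queries = qams[s](queries)               # recalibrate queries before the next stage (QAM)
    return aux_outputs[-1], aux_outputs[:-1]         # final prediction and auxiliary predictions
```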
Losses: The overall training loss has two components: the classification loss and the mask loss. The classification loss is a cross-entropy loss, denoted by $\mathcal{L}_{cls}$. The mask loss integrates the binary cross-entropy loss and the dice loss [65], and is denoted by $\mathcal{L}_{mask}$ for clarity. The overall training loss can be expressed as $\mathcal{L} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{mask}\mathcal{L}_{mask}$, where $\lambda_{cls}$ and $\lambda_{mask}$ are hyper-parameters. In our study, the balance weights of the overall training loss follow the default settings of Mask2Former [14].
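For concreteness, a minimal sketch of how these terms combine is given below; the numerical weights shown are placeholders standing in for the Mask2Former defaults and are not values asserted by this paper:

```python
import torch.nn.functional as F

def dice_loss(mask_logits, gt_masks, eps=1.0):
    """Soft dice loss; mask_logits, gt_masks: (N, H, W) for N matched query-target pairs."""
    p, g = mask_logits.sigmoid().flatten(1), gt_masks.flatten(1)
    return (1 - (2 * (p * g).sum(-1) + eps) / (p.sum(-1) + g.sum(-1) + eps)).mean()

def total_loss(cls_logits, gt_classes, mask_logits, gt_masks,
               lambda_cls=2.0, lambda_mask=5.0):        # placeholder weights, not asserted by the paper
    l_cls = F.cross_entropy(cls_logits, gt_classes)     # classification loss L_cls
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, gt_masks) + \
             dice_loss(mask_logits, gt_masks)           # mask loss L_mask = BCE + dice
    return lambda_cls * l_cls + lambda_mask * l_mask
```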
Inference: Assume that $P \in \mathbb{R}^{N \times (K+1)}$ and $M \in [0,1]^{N \times H \times W}$ are the predicted per-query class probabilities and binary masks, respectively. Here, $K$ represents the total number of object classes, and $N$ is the number of object queries. The per-pixel output $O[h, w] = \arg\max_{k \in \{1,\dots,K\}} \sum_{i=1}^{N} P_i(k)\, M_i[h, w]$ performs a matrix multiplication and sums over the query dimension. In this way, the query dimension is eliminated, and a probability distribution over classes is obtained for each pixel of the output feature map. The ultimate segmentation result is $O$, computed without considering the no-object class ⌀.
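This corresponds to the standard Mask2Former-style semantic inference, which can be written compactly as follows (a sketch assuming class logits of shape (N, K+1), with the last entry being the no-object class, and mask logits of shape (N, H, W)):

```python
import torch

def semantic_inference(P, M):
    """P: (N, K+1) class logits (last index = no-object); M: (N, H, W) mask logits."""
    probs = P.softmax(dim=-1)[:, :-1]                 # (N, K): drop the no-object class
    masks = M.sigmoid()                               # (N, H, W): per-query soft binary masks
    # Multiply and sum over the query dimension to get a per-pixel class distribution.
    seg = torch.einsum("nk,nhw->khw", probs, masks)   # (K, H, W)
    return seg.argmax(dim=0)                          # (H, W): final label map O
```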
Batch size and learning rate: In our experiments, the batch size is set to 8, and all models are trained for 80 k iterations. The AdamW optimizer [66] and the poly learning rate schedule [67], with a weight decay of 0.05, are adopted for both the ResNet and Swin-transformer backbones.
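A minimal PyTorch sketch of this optimization setup is given below; the initial learning rate and the poly exponent are placeholders (common defaults), since their values are not specified here:

```python
import torch

def build_optimizer_and_scheduler(model, max_iters=80_000, base_lr=1e-4, power=0.9):
    # base_lr and power are placeholders / common defaults, not values asserted by the paper.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
    poly = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** power)  # poly decay over iterations
    return optimizer, poly
```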
Data augmentation: During training, random scale jittering (between 0.5 and 2.0), random horizontal flipping, random cropping, and random color jittering are used for data augmentation. During testing, test-time augmentation (TTA) is utilized in our experiments. Specifically, TTA creates multiple augmented copies of each image in the test set, lets the model make a prediction for each copy, and then aggregates these predictions, returning the final result with the highest number of votes. Random flipping and multi-scale testing are adopted in this paper.
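As an illustration, such a flip-and-rescale TTA loop might be implemented as follows; this sketch averages per-pixel class scores rather than counting literal votes, which is a common way to realize the aggregation, and it assumes the model returns per-class score maps:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tta_predict(model, image, scales=(0.75, 1.0, 1.25)):
    """image: (1, 3, H, W); returns an (H, W) label map aggregated over flips and scales."""
    _, _, h, w = image.shape
    total = 0
    for s in scales:
        img_s = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(img_s, dims=[-1]) if flip else img_s
            logits = model(inp)                                   # assumed output: (1, K, h_s, w_s)
            if flip:
                logits = torch.flip(logits, dims=[-1])            # undo the flip on the prediction
            logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
            total = total + logits.softmax(dim=1)                 # accumulate per-class scores
    return total.argmax(dim=1)[0]                                 # (H, W)
```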
Metric: IoU (intersection over union) is the quotient of the intersection and the union of the predicted segmentation and the annotated region. mIoU (mean intersection over union) is the mean IoU over all classes. OA (overall accuracy, also known as pixel accuracy) is the number of correctly classified pixels divided by the total number of pixels.
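All three metrics can be computed from a single confusion matrix, as in the following sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape; returns (per-class IoU, mIoU, OA)."""
    mask = (gt >= 0) & (gt < num_classes)                       # ignore invalid labels
    conf = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)                                          # correctly classified pixels per class
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-10)         # intersection / union per class
    return iou, iou.mean(), tp.sum() / (conf.sum() + 1e-10)     # IoU, mIoU, OA
```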
Environment: All models are trained with four A100-PCIE graphics cards, each with a memory capacity of 40 GB. The conda environment configuration is as follows: Python 3.8.17, NumPy 1.24.3, PyTorch 2.0.0, TorchVision 0.15.0, CUDA 11.7, MMCV [68] 2.0.1, and MMSegmentation [69] 1.1.0.
4.4. Experiment Results
To provide a thorough comparison of the models, we list the OA and mIoU scores obtained by the eight models on the Vaihingen, Potsdam, and LoveDA datasets in Table 2, Table 3, and Table 4, respectively. All values are obtained from the corresponding test images and then averaged across all categories. Note that two methods are adopted to report the final performance. The first is to directly calculate OA and mIoU on the images in the test dataset. The second is to calculate OA and mIoU with test-time augmentation, which creates multiple copies of each test image by random flipping and resizing.
As can be seen from the comparisons in these tables, our IQ2Former model outperforms the other baseline models by a large margin with both the ResNet [70] and Swin [54] backbones. Figure 5 shows radar plots of the results obtained with test-time augmentation on the three datasets to further compare our model with the baseline models, category by category. The points in these plots represent the corresponding mIoU scores, obtained via test-time augmentation on the test datasets. From these plots, it can be seen that the curves of our IQ2Former always lie in the outermost region, indicating that it achieves higher performance than the baseline models.
For convenience of comparison, Figure 6, Figure 7, and Figure 8 show the visual segmentation results on sample images obtained by all models, including FCN [8], PSPNet [9], DeepLabv3+ [11], OCRNet [23], UPerNet [62], MaskFormer [13], Mask2Former [14], and IQ2Former. For the sake of fairness, the backbone of all models is ResNet-101 [70].
To summarize, the above comparisons demonstrate that our IQ2Former is capable of successfully segmenting RS images with high resolution.
4.5. Ablation Study
This subsection describes the ablation experiments conducted to evaluate the effectiveness of the three components proposed in our model. Table 5 illustrates the validity of the query scenario module of Section 3.2.1, the query position module of Section 3.2.2, and the query attention module of Section 3.2.3. It can be seen that each of the fundamental components helps enhance the performance. It is worth pointing out that the QPM introduces no increase in the number of parameters.
For the query scenario module, we have already verified that the performance of our IQ2Former is higher than that of Mask2Former [14]. Table 6 and Table 7 serve as the foundation for selecting the hyper-parameters of the QSM on the Vaihingen and Potsdam datasets, and Table 8 is the basis for assigning the hyper-parameters of the QSM on the LoveDA dataset. The factors related to computational efficiency are listed in Table 9, including the number of parameters and the number of floating-point operations (FLOPs) in giga multiply-accumulate operations (GMACs).
6. Conclusions
This paper has proposed IQ2Former for semantic segmentation of RS images. Technically, we have improved the query capability of the model in three respects. This improvement rests on the fact that the embedding of the querying mechanism largely determines the representational power of MaskFormer-like models. To handle different remote-sensing image scenarios, the QSM is designed to learn to group the queries from feature maps and to select different scenarios such as urban and rural areas, building clusters, and parking lots. To classify small targets in complex RS images, the QPM is constructed to assign image position information to the queries without increasing the number of parameters. To utilize the lower-level features ignored by Mask2Former, the QAM is positioned between the duplicate transformer decoder layers and mainly utilizes the characteristics of ODConv to extract valuable information from the queries of the previous stage. With our QAM, the supervisory information is fully utilized, and the fine-grained information is further exploited to achieve high-quality segmentations.
Comprehensive experiments have been conducted on three challenging public RS image datasets. Our model achieves 91.44% OA and 83.59% mIoU on the Vaihingen dataset, 91.92% OA and 87.89% mIoU on the Potsdam dataset, and 72.63% OA and 56.31% mIoU on the LoveDA dataset. In addition, the ablation experiments and visual segmentation figures all demonstrate the effectiveness and superiority of our IQ2Former. In the future, we would like to pursue research in three directions. First, the query mechanism developed in our model could be optimized with more powerful attention and lightweight tricks. Second, we could comprehensively evaluate the performance of our model on datasets containing RS images with low resolution, noise, or a degree of distortion. Third, we would also like to extend the application of our model to multi-spectral and hyper-spectral RS data.