ACMFNet: Attention-Based Cross-Modal Fusion Network for Building Extraction of Remote Sensing Images

Baiyu Chen, Zongxu Pan, Senior Member, IEEE, Jianwei Yang, and Hui Long

Abstract— In recent years, significant progress has been made in extracting buildings from high spatial resolution (HSR) remote sensing images due to the rapid development of deep learning (DL). However, the existing methods still have some limitations in maintaining the detail integrity of building footprints. First, skip connections typically involve the direct concatenation of feature maps from adjacent levels, which inevitably leads to misalignment due to semantic differences. Second, the integration of building-related details remains a challenging task in the context of cross-modal remote sensing images. Third, the oversimplified upsampling structure used in previous methods may lead to loss of spatial details. In this article, we propose a novel building extraction method, the attention-based cross-modal fusion network (ACMFNet), based on cross-modal HSR remote sensing images using an encoder–decoder structure. First, we propose a global and local feature refinement module (GL-FRM) to refine features and establish contextual dependencies at multiple scales and levels, mitigating the spatial discrepancy among multilevel features. Meanwhile, a cross-modal fusion module is utilized to integrate complementary features extracted from multispectral (MS) data and normalized digital surface model (nDSM) data. In addition, we employ a lightweight residual upsampling module (RUM) for feature resolution recovery. We conducted complete experiments on two benchmark datasets, and the results indicate that our proposed ACMFNet achieves state-of-the-art (SOTA) performance without bells and whistles.

Index Terms— Building extraction, cross-modal fusion, digital surface model, high spatial resolution (HSR), remote sensing image.

Manuscript received 21 March 2024; revised 19 April 2024 and 7 May 2024; accepted 10 May 2024. Date of publication 14 May 2024; date of current version 24 May 2024. This work was supported by the Youth Innovation Promotion Association, Chinese Academy of Sciences (CAS). (Corresponding author: Zongxu Pan.)
The authors are with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China, also with the Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China (e-mail: chenbaiyu22@mails.ucas.ac.cn; zxpan@mail.ie.ac.cn; yangjianwei20@mails.ucas.ac.cn; lh_lh885@126.com).
Digital Object Identifier 10.1109/TGRS.2024.3400979

I. INTRODUCTION

WITH the rapid advancement of aerospace and sensor technology, researchers can easily acquire a substantial amount of high spatial resolution (HSR) remote sensing images that depict the ecological environment and human activities [1]. Currently, the semantic segmentation task developed in the computer vision community has gained significant attention in the field of remote sensing [2], [3], [4]. Building extraction from HSR remote sensing images can be classified under this task, aiming to label all pixels in an image as either building or nonbuilding classes [5], and it plays a pivotal role in applications such as urban planning [6], [7], [8], population statistics [9], economic assessment [10], and disaster management [11], [12].

Since entering the deep learning (DL) era, the fully convolutional network (FCN) [13], as a mainstream framework for automatic building extraction, has transformed the general image segmentation task into a pixel-level classification task. Based on the FCN framework, a series of methods for automatic building extraction have been proposed [14], [15], [16]. However, despite the excellent progress of DL-based building extraction methods, there are still some persistent problems in the current remote sensing building extraction field.

A. Invalid Propagation of Semantics and Inefficient Self-Attention (SA) Computation Strategy

The majority of DL-based building extraction methods rely on an encoder–decoder architecture. The encoder encodes the original input image into low-resolution feature maps, while the decoder recovers pixel predictions from them. Typically, past classical methods [17] use skip connections to directly concatenate features from corresponding stages of the encoder and decoder along the channel dimension. However, semantic differences in the feature maps of different stages can lead to spatial dimension mismatches, thus limiting the accuracy of the extracted buildings. For the above reasons, some recent approaches try to combine the local contextual features provided by convolutional neural networks (CNNs) with the global contextual features provided by Transformers [18] to improve the network performance [19]. However, applications of the vision Transformer (ViT) [20], [21], [22] tend to require large amounts of memory and computational cost. Therefore, finding an SA computation strategy that strikes a good balance between performance and computational cost is equally crucial for the building extraction task.

B. Underutilization of Cross-Modal Information

Recent studies have demonstrated the efficacy of incorporating multisource information in enhancing the robustness of building extraction. With advancements in airborne light detection and ranging (LiDAR) technology, the normalized digital surface model (nDSM) has emerged as a resource for extracting complementary features [22], [23].
By providing elevation data for objects, nDSM facilitates the segmentation of tall structures, such as buildings. However, it is susceptible to noise points, which can compromise data quality, underscoring the significance of integrating diverse cross-modal data. Currently, fusion methods can be broadly categorized into three main groups. Some algorithms directly integrate data from different modalities into a single feature extraction network at the input stage. Others employ separate networks to extract features from each modality and only fuse them during the final prediction stage [24]. However, these approaches do not fully exploit the information available in multiple modalities. At present, advanced fusion networks predominantly adopt feature-level fusion, where cross-modal features are extracted through independent branches and fused at various scales for utilization at different stages of the network [5], [25], [26]. These methods enable more comprehensive utilization of cross-modal information but necessitate an effective fusion strategy to extract complementary information from the cross-modal data while mitigating noise impact.

C. Rough Upsampling Structure

One of the crucial functions of the decoder is to upsample semantically rich visual features from low resolution to the input resolution. While some methods employ a simple combination of bilinear interpolation and convolution modules as an upsampling module [19], [29], this structure tends to be too coarse, resulting in the loss of fine-grained information during feature restoration. Especially when it comes to the fine structure and textural details of an image, the simple combination approach struggles to capture the complex patterns and variations in the original image. Therefore, a rational and fine-grained upsampling module is an integral part of the building extraction network.

To tackle these challenging issues, we propose an attention-based cross-modal fusion network (ACMFNet) for extracting buildings in HSR remote sensing images. We employ a conventional encoder–decoder structure, where two identical backbone branches are utilized in the encoder to extract features from multispectral (MS) data and nDSM data. First, we use the global and local feature refinement module (GL-FRM) to establish context dependencies at multiple scales and levels through large-scale convolutions and attention computations. It combines the deep semantic feature maps of the decoder with the shallow fine-grained feature maps of the encoder, effectively solving the problem of spatial dimension mismatch. Second, in order to better extract valid information from multiple modalities, we use a cross-modal complementary feature fusion module (CFM) to integrate the complementarities between the two modalities. In addition, we introduce a lightweight residual upsampling module (RUM) to recover the resolution of these feature maps. The main contributions of this research can be summarized as follows.
1) We propose GL-FRM to enhance features and establish contextual dependencies at multiple scales and levels, mitigating the spatial discrepancy among multilevel features.
2) We introduce CFM to fuse MS data and nDSM data by weighting each modality and removing irrelevant parts.
3) A lightweight RUM is employed to restore the resolution of the feature maps.
4) The proposed ACMFNet achieves comparable results with state-of-the-art (SOTA) algorithms on two challenging benchmarks.

The remaining sections of this article are organized as follows. We first provide a brief introduction of related work in Section II. Then, in Section III, we present our proposed ACMFNet in detail. Furthermore, we report and analyze comprehensive experimental results in Section IV. Finally, conclusions and suggestions are given in Section V.

II. RELATED WORK

In this section, we review the recent literature from two perspectives: 1) representative works in MS building extraction and 2) representative works in cross-modal semantic segmentation.

A. MS Building Extraction

Compared with middle-/low-resolution remote sensing images, HSR remote sensing images offer more intricate details regarding land objects, thereby presenting challenges in building extraction tasks by intensifying intraclass variance and diminishing interclass variance [30]. Over the past two decades, numerous methodologies have been developed for building extraction from HSR remote sensing images. Early algorithms primarily relied on the physical principles of buildings [31], establishing criteria that best represent the local appearance of building roofs based on spatial or spectral features. Ngo et al. [32] employed geometric features to extract building regions, while Guo and Du [33] differentiated buildings from the background through disparities in spectral reflectance. In addition, certain algorithms incorporated spatial constraints between buildings and their surrounding environment [34], such as shadows cast by buildings on the ground [35], to facilitate building extraction. Although these methods performed well for buildings governed by similar physical rules, they were unable to handle complex correlations between buildings and the background due to their limited universality and robustness.

With the rapid advancement of DL, the CNN has emerged as the predominant approach for building extraction from remote sensing imagery in recent years. Mnih [35] was the first to introduce the CNN for patch-based extraction of building roofs from high-resolution aerial images. Saito et al. [36] designed a neural network comprising three convolutional layers and two fully connected layers. Zuo et al. [16] incorporated multiscale features into their FCN-based model. Khalel and El-Saban [37] refined the output using U-Net with multiple stages, achieving relatively accurate building extraction. However, their models were limited by an inadequate representation of global information. Consequently, attention mechanisms have been introduced in some studies to address this limitation. Guo et al. [38] proposed a multiloss neural network that enhances sensitivity and suppresses background interference through attention modules focused on irrelevant feature areas.
Tian et al. [39] integrated a convolutional attention module into skip connections to extract salient multiscale features effectively. Das and Chand [40] employed various attention modules to improve the overall feature representation. Nevertheless, these methods still heavily rely on convolutional operations without liberating the network from its CNN structure, thereby retaining certain limitations in modeling global information.

Methods based on ViT have brought great progress and development to semantic segmentation [41], [42], [43], and numerous researchers have incorporated ViT into building extraction tasks. Chen et al. [44] proposed a sparse token Transformer to effectively capture global dependencies among tokens in both spatial and channel dimensions. Wang et al. [45] introduced the Swin Transformer [46] as an encoder and devised a novel dense connection feature aggregation module for resolution restoration and segmentation map generation. Li et al. [29] designed an efficient linear complexity kernel attention mechanism to alleviate the computational requirements of attention operations. Some approaches combine CNN with Transformer to leverage their respective strengths in extracting global and local information. He et al. [46] and Wang et al. [19], respectively, proposed encoders with dual-path structures by parallelizing the Swin Transformer with CNN, enabling rich spatial detail encoding through spatial context paths while capturing global dependencies via global context paths, thus achieving better results. Zhang et al. [47] designed an efficient dual spatial attention Transformer (DSAFormer) to address the defects of the standard ViT, in which a dual attention structure allows the two branches to complement each other.

B. Cross-Modal Semantic Segmentation

Single-modal semantic segmentation methods primarily emphasize the fusion of low-level spatial details and high-level semantic information. In contrast, cross-modal models integrate multiple dimensions of information, leveraging the complementarity across diverse data sources to extract more comprehensive features.

1) RGB-Depth (RGB-D) Semantic Segmentation: Successful RGB-D (RGB and depth) semantic segmentation heavily relies on the effective integration of cross-modal features. Most methods carry out feature fusion in the encoder. Wang et al. [48] proposed a feature transformation network to identify common features between cross-modal data and enhance the representation of shared information. Hazirbas et al. [49] introduced FuseNet, which integrates depth features at different levels into the RGB encoder as the network deepens. Jiang et al. [50] further extended FuseNet by incorporating a top-down path for fusing multilevel features. Hu et al. [51] proposed a three-branch network architecture and employed an attention complementary module to extract weighted features from both RGB and depth branches simultaneously. Chen et al. [52] introduced a unified and efficient cross-modal guidance encoder that recalibrates RGB feature responses and accurately extracts depth information through multiple stages while alternately aggregating both modalities. Seichter et al. [53] presented the efficient scene analysis network (ESANet), an RGB-D semantic segmentation method based on an improved ResNet encoder that achieves higher accuracy with lower computational cost. Yue et al. [54] proposed a two-stage decoder that effectively integrates high-level and low-level features while suppressing low-order detail noise from the shallow layers of the encoder. Zhang et al. [55] presented the cross-modal fusion framework for RGB-X (CMX), a unified fusion framework that utilizes a cross-modal feature calibration module to calibrate bimodal characteristics and extensively exchanges long-range contexts via feature fusion modules.

2) MS and nDSM Building Extraction: The integration of nDSM data with MS data aims to address the information deficiencies associated with traditional single-modal processing. However, due to the high acquisition cost of nDSM data, there is limited research on semantic segmentation methods based on nDSM data sources. Audebert et al. [24] proposed an efficient multiscale method for semantic segmentation of remote sensing images, exploring early and late fusion strategies for nDSM and MS data. Peng et al. [22] introduced a dual-path dense convolution multimodal network that combines dense connections and FCN, extracting features from both MS and nDSM data before fusing them together. Nevertheless, their approach merely employed elementwise addition to fuse the two modalities' features, failing to fully exploit the rich information provided by nDSM data. Huang et al. [25], on the other hand, utilized an improved residual learning network as an encoder to learn multilevel features from fused data while introducing gated feature annotation units to reduce unnecessary feature transmission. Zhang et al. [26] adaptively selected and combined complementary features from each modality through an attention-aware multimodal fusion block in order to effectively learn and utilize multimodal information. Zhou et al. [56] proposed a gate fusion network, which incorporates a gate fusion module aimed at eliminating redundant features from both MS and nDSM data. Similarly, Hosseinpour et al. [5] used two independent encoders for extracting multilevel features from MS and nDSM data while achieving effective fusion through a gate fusion module. Hong et al. [27] introduced SpectralGPT, the first foundation model designed specifically for spectral remote sensing data, which also demonstrates significant potential in advancing cross-modal semantic segmentation tasks [28].

III. METHOD

In this section, we describe the proposed ACMFNet framework in detail. Our ACMFNet follows the encoder–decoder structure, as visualized in Fig. 1. GL-FRM, CFM, and RUM make up our proposed method's three main parts. First, the overall framework employed in this work is introduced in Section III-A. Then, we go into the details of the proposed modules in Sections III-B–III-D.

A. Framework Overview

The architecture of ACMFNet, as illustrated in Fig. 1, corresponds to a deeply supervised encoder–decoder network. In the encoder, we have incorporated two separate branches for extracting features from MS data and nDSM data. For these branches, ResNet-34 [57] has been chosen as the backbone due to its moderate depth and residual structure, which are compatible with real-time operation owing to the smaller number of operations. To suit the semantic segmentation task requirements, all fully connected layers of ResNet-34 have been removed. Except for the first convolution module, which has a single input channel on the nDSM data branch, both branches share an identical network configuration comprising six module stages that include five convolutional layers and one max-pooling layer for extracting multiscale features. After each feature extraction stage, CFM is employed to extract complementary parts between the two modalities for feature recombination. In the decoder, GL-FRM establishes context dependencies at multiple scales and levels through large-scale convolutions and attention computations by combining deep semantic feature maps from the decoder with shallow fine-grained feature maps from the encoder, which effectively addresses the spatial dimension mismatch issues between them. Finally, the lightweight RUM is utilized to restore the resolution of these feature maps.
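The per-stage pairing of the two backbone streams with a fusion module can be sketched roughly as follows. This is a minimal sketch assuming torchvision's resnet34 and grouping the backbone into five feature stages for brevity (the paper counts six stages including the max-pooling layer); the names DualBranchEncoder and fuse_module_factory are illustrative and are not taken from the authors' released code.

```python
import torch.nn as nn
from torchvision.models import resnet34


class DualBranchEncoder(nn.Module):
    def __init__(self, fuse_module_factory):
        super().__init__()
        ms, dsm = resnet34(), resnet34()
        # The nDSM branch takes a single-channel input, so its stem is replaced.
        dsm.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep only the convolutional stages; the classifier head is dropped.
        self.ms_stages = nn.ModuleList([
            nn.Sequential(ms.conv1, ms.bn1, ms.relu),
            nn.Sequential(ms.maxpool, ms.layer1),
            ms.layer2, ms.layer3, ms.layer4,
        ])
        self.dsm_stages = nn.ModuleList([
            nn.Sequential(dsm.conv1, dsm.bn1, dsm.relu),
            nn.Sequential(dsm.maxpool, dsm.layer1),
            dsm.layer2, dsm.layer3, dsm.layer4,
        ])
        # One cross-modal fusion module (CFM) after every extraction stage.
        channels = [64, 64, 128, 256, 512]
        self.fusions = nn.ModuleList([fuse_module_factory(c) for c in channels])

    def forward(self, ms, dsm):
        fused_feats = []
        for ms_stage, dsm_stage, fuse in zip(self.ms_stages, self.dsm_stages, self.fusions):
            ms, dsm = ms_stage(ms), dsm_stage(dsm)
            fused_feats.append(fuse(ms, dsm))  # multiscale features passed to the decoder
        return fused_feats
```

The fused outputs would then feed the decoder through skip connections, as in Fig. 1.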
Fig. 1. Architecture of our proposed ACMFNet. ACMFNet contains three important modules: CFM, GL-FRM, and RUM.

Fig. 2. (a) Original image. (b) Ground truth. (c) Shallow fine-grained feature from encoder. (d) Deep semantic feature from decoder. (e) Fusion by CAT. (f) Fusion by GL-FRM.

B. GL-FRM

During the process of recovering pixel predictions from low-resolution feature maps in the decoder, incorporating feature maps from the corresponding scales of the encoder into the deep semantic information can effectively provide multiscale and multilevel contextual information, thereby enhancing segmentation precision. It is a common practice to directly concatenate features along the channel dimension. However, as depicted in Fig. 2(c) and (d), these two types of features exhibit semantic disparities that result in feature misalignment as the network goes deeper. Hence, an explicit and dynamic establishment of position correspondence between feature maps becomes necessary [59]. To address this issue, we propose GL-FRM, which leverages extensive convolutional operations and attention calculations to establish multiscale and multilevel contextual dependencies, thus alleviating mismatches between deep semantic information and shallow fine-grained spatial details.

The deep semantic feature X_d from the decoder and the shallow fine-grained feature X_e from the encoder are initially concatenated in the channel dimension to obtain X, as illustrated in Fig. 3. To refine local features, a sparsely connected architecture is employed to stack various types of convolution kernels (3 × 3, 5 × 5, 7 × 7, and MaxPool) for extracting densely refined data X_l [60]:

    X = \mathrm{Cat}(X_e, X_d)
    X_l = \mathrm{Cat}(\mathrm{Conv}_{3\times3}(X), \mathrm{Conv}_{5\times5}(X), \mathrm{Conv}_{7\times7}(X), \mathrm{MaxPool}(X))    (1)

where Cat denotes the channelwise concatenation (CAT) and Conv_{a×a} denotes a convolutional layer with a convolutional kernel size of a × a.
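A rough sketch of GL-FRM is given below. It implements the local branch of Eq. (1) together with the downsampled self-attention branch that is detailed next in Eqs. (2) and (3); the equal channel splitting, the 1 × 1 projection after max pooling, the single-head attention, and the bilinear alignment of the decoder feature are our simplifying assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLFRM(nn.Module):
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        in_ch = enc_ch + dec_ch                       # X = Cat(X_e, X_d)
        branch_ch = out_ch // 4                       # out_ch assumed divisible by 4
        self.conv3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.conv7 = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, branch_ch, 1))
        self.token_merge = nn.Conv2d(in_ch, out_ch, 4, stride=4)  # non-overlapping 4x4 merge
        self.attn = nn.MultiheadAttention(out_ch, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(out_ch, out_ch, 1))

    def forward(self, x_e, x_d):
        x_d = F.interpolate(x_d, size=x_e.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x_e, x_d], dim=1)
        # Local refinement, Eq. (1): parallel 3x3 / 5x5 / 7x7 convolutions and max pooling.
        x_l = torch.cat([self.conv3(x), self.conv5(x), self.conv7(x), self.pool(x)], dim=1)
        # Global refinement, Eqs. (2)-(3): merge 4x4 pixel tokens, run self-attention on the
        # shortened sequence, and upsample the result back to the input size.
        t = self.token_merge(x)
        b, c, h, w = t.shape
        seq = t.flatten(2).transpose(1, 2)            # (B, HW/16, C)
        seq, _ = self.attn(seq, seq, seq)
        x_g = seq.transpose(1, 2).reshape(b, c, h, w)
        x_g = F.interpolate(x_g, size=x.shape[2:], mode="bilinear", align_corners=False)
        return self.mlp(x_g + x_l)                    # Z = MLP(X_g + X_l)
```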
Fig. 3. GL-FRM.

For global feature refinement, we aim to leverage the powerful coding capability of the Transformer for conducting feature refinement and exploring correlations between channels. In the traditional ViT, a multihead SA mechanism is employed, where the input image sequence X is multiplied by three weight matrices W_q, W_k, and W_v to derive the query (Q), key (K), and value (V). By utilizing dot product calculations, the query tokens are individually matched with the key tokens, resulting in an attention map that reflects token similarity. The value component represents the important features extracted from each token, which are then multiplied with the attention map. Through such computations performed on all tokens, SA facilitates global information exchange. The process can be illustrated as follows:

    Q, K, V = X W_q, X W_k, X W_v
    \mathrm{SA}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V    (2)

where Softmax represents the Softmax function and SA represents the SA calculation. However, this mechanism also incurs a substantial computational cost that scales quadratically with the number of tokens. Wang et al. [21] employ a strategy of reducing the sequence length during the SA calculation to reduce the computational cost. However, in order to maintain the output size, the query is still processed at full resolution, resulting in significant computational overhead. Considering the localized nature of buildings in HSR images, dense pixelwise SA is unnecessary for establishing global dependencies. To reduce the computational cost, we simultaneously decrease the length of the Q, K, and V sequences to establish coarse global dependencies. We employ nonoverlapping 4 × 4 convolutions to merge adjacent pixel tokens, thereby reducing the burden during the SA calculations. Subsequently, we upsample the feature maps to match the input size, denoted as X_g. The refinement features X_l and X_g are then fused and fed into a multilayer perceptron (MLP) for further information fusion. The resulting output is denoted as Z:

    X' = \mathrm{Conv}_{4\times4}(X)
    Q, K, V = X' W_q, X' W_k, X' W_v
    X_g = \mathrm{Upsample}(\mathrm{SA}(Q, K, V))
    Z = \mathrm{MLP}(X_g + X_l)    (3)

where Upsample denotes bilinear interpolation upsampling and MLP denotes the MLP calculation.

C. CFM

Fig. 4. Cross-modal complementary feature fusion module.

In the task of building extraction from cross-modal HSR remote sensing imagery, it is crucial to determine the optimal utilization of MS and nDSM information. MS data encompass a wealth of spectral information pertaining to objects and scenes, while nDSM provides additional contour and positional details. The inherent disparities between these two data modalities may lead to increased noise or redundancy when integrating features. To effectively integrate MS and nDSM information, we propose CFM, enabling the network to focus on acquiring more complementary informational features from both MS and nDSM sources. As illustrated in Fig. 4, we implement a channel attention approach based on the squeeze-and-excitation [58] mechanism that learns to accentuate valuable channels using global information while suppressing less informative ones.

In the multibranch architecture, we have MS input feature maps X_M and nDSM input feature maps X_D. For both branches, we use global average pooling (GAP) as the channel descriptor for channel attention, followed by a 1 × 1 convolutional block to explore interchannel correlations. The resulting convolutional output is activated using the sigmoid function to restrict the weight vector values between 0 and 1. Subsequently, these weight vectors are multiplied with the input feature maps of both branches to obtain X'_M and X'_D. Due to the increase in dimensionality caused by concatenating feature maps, there is an excessive computational burden and an inclusion of redundant information. To address this issue, we utilize a 1 × 1 convolutional layer to reduce the feature dimensionality to 32 for eliminating redundancy. The formulaic representation of the CFM module is as follows:

    X'_M = X_M \otimes \sigma(\mathrm{Conv}_{1\times1}(\mathrm{AvgPool}(X_M)))
    X'_D = X_D \otimes \sigma(\mathrm{Conv}_{1\times1}(\mathrm{AvgPool}(X_D)))
    Z = \mathrm{Conv}_{1\times1}(\mathrm{Cat}(X'_M, X'_D))    (4)

where Z denotes the final output feature representation, AvgPool denotes GAP, and σ denotes the sigmoid function.
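A minimal sketch of CFM following Eq. (4) is shown below: squeeze-and-excitation-style channel gating of each modality followed by a 1 × 1 fusion convolution. The 32-channel output follows the description above; the module names and the absence of normalization layers are illustrative assumptions. An instance of this module could serve as the fuse_module_factory of the encoder sketch in Section III-A.

```python
import torch
import torch.nn as nn


class CFM(nn.Module):
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.gate_m = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(in_ch, in_ch, 1), nn.Sigmoid())
        self.gate_d = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(in_ch, in_ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * in_ch, out_ch, 1)   # reduce the concatenated channels

    def forward(self, x_m, x_d):
        x_m = x_m * self.gate_m(x_m)   # X'_M = X_M (x) sigma(Conv1x1(GAP(X_M)))
        x_d = x_d * self.gate_d(x_d)   # X'_D = X_D (x) sigma(Conv1x1(GAP(X_D)))
        return self.fuse(torch.cat([x_m, x_d], dim=1))   # Z = Conv1x1(Cat(X'_M, X'_D))
```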
Fig. 5. RUM.

D. RUM

An excessively simple upsampling structure will lead to coarse extractions. To address this limitation, we propose an enhanced lightweight RUM, as illustrated in Fig. 5. The input, denoted as X, is first upsampled using bilinear interpolation to match the resolution of the lateral features, yielding X_up. In the residual branch, we explore channel correlations and reduce the feature dimensions through a 1 × 1 convolution block, denoted as X_res. Simultaneously, in the main branch, X_up is processed by BatchNorm, an activation function, and convolution layers to obtain the feature X_conv, which is then combined with X_res to produce the output feature Z. This residual structure effectively enhances and preserves more of the original features. Moreover, within the RUM architecture, standard convolution modules are replaced with depthwise separable convolution (DSC) [61] modules for parameter reduction without compromising performance or sacrificing network efficiency in the decoder component. Consequently, the resolution of the feature maps is doubled, while the number of channels is halved. The overall process of RUM can be summarized as follows:

    X_{up} = \mathrm{Upsample}(X)
    X_{res} = \mathrm{Conv}_{1\times1}(X_{up})
    X_{conv} = \mathrm{DeepConv}(X_{up})
    Z = X_{res} + X_{conv}    (5)

where DeepConv consists of a BatchNorm layer, a rectified linear unit (ReLU) activation function, and a DSC layer. The decoder employs six RUMs to progressively upsample the low-resolution feature maps, aligning them with the resolution of the input.
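A minimal sketch of RUM following Eq. (5) is given below, with bilinear upsampling, a 1 × 1 residual projection, and a BatchNorm-ReLU-depthwise-separable-convolution main branch. Halving the channels while doubling the resolution follows the text; the exact composition of the DSC block is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F


class RUM(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2
        self.res_proj = nn.Conv2d(in_ch, out_ch, 1)            # X_res = Conv1x1(X_up)
        self.deep_conv = nn.Sequential(                         # X_conv = DeepConv(X_up)
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise
            nn.Conv2d(in_ch, out_ch, 1),                           # pointwise
        )

    def forward(self, x):
        x_up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.res_proj(x_up) + self.deep_conv(x_up)       # Z = X_res + X_conv
```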
IV. EXPERIMENTS AND ANALYSIS

This section is divided into four parts. First, we provide a detailed description of the two benchmark datasets. Second, the whole implementation process is described through the training settings, loss function, and evaluation metrics. After that, an ablation study is conducted to verify the effects of each proposed module. Finally, we quantitatively and qualitatively compare the proposed ACMFNet with 16 competitive methods in the form of tables and pictures, respectively.

A. Datasets

In this article, two benchmark datasets, namely, Vaihingen and Potsdam, were used to verify the effectiveness of the proposed ACMFNet. Both encompass urban scenes and were provided by Commission II/4 of the ISPRS.

1) Vaihingen Dataset: Vaihingen is a relatively small village characterized by numerous small multistory buildings and standalone structures. Vaihingen [62] consists of 33 true orthophoto (TOP) images captured by advanced airborne sensors, along with their corresponding DSM data, covering an area of 1.38 km2 in the town of Vaihingen, Germany. Each image patch has varying dimensions, approximately 2000 × 2500 pixels, with a ground sampling distance (GSD) of around 9 cm. These patches contain only three bands (Vaihingen-IRRG), namely, infrared (IR), red (R), and green (G). The original ground truth encompasses six primary land cover classes: impervious surface, buildings, low vegetation, trees, cars, and clutter. For the purpose of building extraction in this study, only the foreground (building) and background classes are utilized. According to the data provider, a total of 16 patches were selected for training, while the remaining patches were allocated for testing.

2) Potsdam Dataset: Potsdam is a quintessential historical city characterized by expansive architectural areas, narrow streets, and a densely populated settlement structure. Similar to Vaihingen, the dataset for Potsdam [62] comprises 38 TOP images accompanied by corresponding DSM patches, encompassing an area of 3.42 km2 within the city of Potsdam, Germany. Each patch measures 6000 × 6000 pixels and has a GSD of 5 cm. These patches are available in two modes: the red (R), green (G), and blue (B) bands (Potsdam-RGB), and the near-infrared (NIR), red (R), and green (G) bands (Potsdam-IRRG). The classes in the ground-truth data align with those used in Vaihingen. Specifically, the training set consists of 24 RGB patches, while the remaining patches are allocated for testing purposes using RGB imagery.

B. Implementation Details

1) Loss Function: Following BuildFormer [19], we trained ACMFNet with a joint loss function that incorporates boundary supervision. The joint loss function can be defined as follows:

    \mathcal{L} = L_{ce}(Y, \hat{Y}) + L_{dice}(Y, \hat{Y}) + L_{bce}(L(Y), L(\hat{Y}))    (6)

where Y and Ŷ represent the predicted label and the true label, L_ce is the cross-entropy loss, L_dice is the dice loss, L(·) represents the Laplacian convolution used to extract building boundaries from the predicted and true labels, and L_bce represents the binary cross-entropy loss applied to the building boundaries.
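A minimal sketch of the joint loss of Eq. (6) is given below, assuming logits of shape (B, 2, H, W) and integer labels of shape (B, H, W). The specific Laplacian kernel and the soft-Dice formulation are common choices and not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)


def laplacian_edges(x):
    """|Laplacian| of a (B, H, W) map, used here as a soft boundary map."""
    edges = F.conv2d(x.unsqueeze(1), LAPLACIAN.to(x.device), padding=1).abs()
    return edges.squeeze(1).clamp(0.0, 1.0)


def dice_loss(prob, target, eps=1e-6):
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)


def joint_loss(logits, target):
    """L = L_ce + L_dice + L_bce on Laplacian boundaries, as in Eq. (6)."""
    prob = torch.softmax(logits, dim=1)[:, 1]            # probability of the building class
    l_ce = F.cross_entropy(logits, target)
    l_dice = dice_loss(prob, target.float())
    pred_edge = laplacian_edges(prob).clamp(1e-6, 1 - 1e-6)
    true_edge = laplacian_edges(target.float())
    l_bce = F.binary_cross_entropy(pred_edge, true_edge)
    return l_ce + l_dice + l_bce
```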
2) Evaluation Metrics: To quantitatively evaluate the performance of our method and the other methods involved in the comparison, we use intersection over union (IoU), F1 score, precision, and recall, which are widely used in building extraction. For each category, IoU is defined as the ratio of the intersection and union of the predicted value and the true value:

    \mathrm{IoU} = \frac{TP}{TP + FN + FP}    (7)

The F1 score is a classical criterion for binary classification between targets of interest and nontargets, which is equal to the harmonic mean of precision and recall:

    \mathrm{Precision} = \frac{TP}{TP + FP}
    \mathrm{Recall} = \frac{TP}{TP + FN}
    \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (8)

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively.
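The metrics of Eqs. (7) and (8) can be computed from binary prediction and ground-truth masks as in the following sketch (1 = building, 0 = background); the small epsilon terms are added only to avoid division by zero.

```python
import numpy as np


def building_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fn + fp + 1e-12)                             # Eq. (7)
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)    # Eq. (8)
    return {"IoU": iou, "Precision": precision, "Recall": recall, "F1": f1}
```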
TABLE I: IoU and the corresponding time required to train each epoch for various strides on the Potsdam dataset.

3) Training Settings: The original images from the Potsdam and Vaihingen datasets were cropped to dimensions of 512 × 512 using a sliding window. Using the Potsdam dataset as an example, we present the IoU and the corresponding time required to train each epoch for various strides in Table I. It was observed that these two metrics achieved optimal performance at a stride of 256 × 256; therefore, we set the stride length accordingly. To effectively expand the training data and mitigate overfitting, we randomly augmented the training set by applying horizontal and vertical flips as well as rotations of 90°, 180°, and 270°.

To ensure fairness, all experiments were conducted under identical conditions. All models used in the experiments were implemented using PyTorch 1.10.0 (CUDA 11.3) on an NVIDIA RTX 4090 GPU with 24-GB RAM. During the training phase, we employed the AdamW optimizer and utilized a cosine strategy to adjust the learning rate, which facilitates the model in escaping local optima during training and converging toward a global optimal solution. Specifically, we set the initial learning rate to 0.001, the initial restart epoch T0 to 15, and the learning rate recovery speed Tmult to 2. A batch size of 8 was used, and a maximum of 105 epochs were executed. The training loss of ACMFNet on the two datasets is illustrated in Fig. 6.

Fig. 6. Training loss of ACMFNet on Vaihingen and Potsdam datasets.
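A minimal sketch of the training setup described above is given below: 512 × 512 crops extracted with a 256-pixel stride, random flips and 90°/180°/270° rotations, AdamW with an initial learning rate of 0.001, and cosine annealing with warm restarts (T_0 = 15, T_mult = 2, the standard PyTorch scheduler). The helper functions and their names are illustrative; the model and data objects are placeholders.

```python
import random
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts


def sliding_window_crops(image, size=512, stride=256):
    # Yield size x size crops over the last two (spatial) dimensions.
    h, w = image.shape[-2:]
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield image[..., top:top + size, left:left + size]


def augment(ms, dsm, label):
    # Random horizontal/vertical flips and rotations by multiples of 90 degrees.
    if random.random() < 0.5:
        ms, dsm, label = (t.flip(-1) for t in (ms, dsm, label))
    if random.random() < 0.5:
        ms, dsm, label = (t.flip(-2) for t in (ms, dsm, label))
    k = random.choice([0, 1, 2, 3])
    ms, dsm, label = (torch.rot90(t, k, dims=(-2, -1)) for t in (ms, dsm, label))
    return ms, dsm, label


def make_optimizer(model):
    optimizer = AdamW(model.parameters(), lr=1e-3)
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=15, T_mult=2)
    # Batch size 8 and up to 105 epochs per the text; step the scheduler once per epoch.
    return optimizer, scheduler
```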
C. Ablation Study

To validate the effectiveness of the proposed ACMFNet, extensive ablation experiments are presented in this section. We conduct ablation experiments on the Vaihingen and Potsdam datasets to evaluate the contribution of each proposed module to our ACMFNet, and the results are shown in Table II, which reports the IoU and F1 score of the different variations. We note that the baseline uses elementwise summation (SUM) to combine data from both modalities. On Vaihingen, by solely introducing CFM, IoU and F1 exhibit a growth of 0.48% and 0.23%, respectively. Upon the simultaneous introduction of CFM and GL-FRM, the result is improved by 0.78% IoU and 0.38% F1. Further enhancement is observed with the concurrent introduction of CFM, GL-FRM, and RUM, resulting in an increase of 1.26% for IoU and 0.61% for F1 score. On Potsdam, the incorporation of these three modules also demonstrates a similar positive trend in terms of IoU and F1 score improvement as observed on Vaihingen. The results in Table II verify that each component of ACMFNet helps boost the building extraction performance.

1) Effect of Multilevel Feature Fusion and GL-FRM: We conducted a study on the efficacy of multilevel feature fusion and GL-FRM, and the results are presented in Table III. In ACMFNet, common multilevel feature fusion is incorporated into the encoder–decoder architecture, where the fused features from CFM in the encoder are concatenated with the high-level features in the decoder through skip connections. To explore their effectiveness, we eliminated the skip connections from ACMFNet. The findings revealed a decrease in IoU of 1.39% and 0.84% on Vaihingen and Potsdam, respectively. Furthermore, we investigated the effectiveness of GL-FRM in multilevel feature fusion by replacing it with channelwise CAT, a commonly used method; the experiment demonstrated significant improvements in building extraction results with GL-FRM, an increase of 0.47% IoU on the Vaihingen dataset and 0.18% on the Potsdam dataset. Furthermore, relying solely on the global feature or the local feature results in a decline in performance.

As depicted in Fig. 7, relying solely on the deep semantic information of the decoder for feature resolution recovery results in a significant loss of spatial details. To address this limitation, introducing fine-grained features at the corresponding scale of the encoder can offer multiscale and multilevel information, leading to clearer contours and finer segmentation effects.
TABLE II: Ablation experiments of ACMFNet. The best performance is displayed in bold. The effectiveness of the three proposed modules is verified by the performance change relative to the baseline. The proposed ACMFNet combines their benefits to maximize the building extraction performance.

TABLE III: Evaluation metrics of the ablation experiment for multilevel feature fusion and GL-FRM.

TABLE IV: Evaluation metrics of the ablation experiment for cross-modal fusion and CFM.

TABLE V: Evaluation metrics of the ablation experiment for RUM.

However, directly concatenating feature maps using CAT introduces semantic differences and spatial dimension mismatches [see Fig. 2(e)]. In contrast, GL-FRM effectively mitigates this issue by establishing context dependencies through extensive convolution and attention computations across multiple scales and levels. This approach supplements the extracted features with abundant global information while matching and integrating information from both feature maps. Consequently, it further refines building-related details, effectively excludes background noise, and enhances the accuracy of building extraction [see Fig. 2(f)].

2) Effect of Cross-Modal Fusion and CFM: We investigated the efficacy of cross-modal fusion and CFM, and the corresponding results are presented in Table IV. To validate the effectiveness of integrating nDSM into building extraction tasks, we completely eliminated the nDSM stream from ACMFNet, consequently removing CFM as well. The findings demonstrated a significant enhancement in building extraction results when utilizing nDSM, with an increase in IoU of 1.35% and 2.71% for the Vaihingen and Potsdam datasets, respectively. Subsequently, we replaced CFM with two fusion techniques, elementwise SUM and CAT, aiming to explore the efficacy of CFM in cross-modal fusion. The outcomes obtained from these two fusion methods exhibited similarity, with CAT slightly surpassing SUM. However, both fell short compared with the results achieved using CFM.

As depicted in Fig. 8, when dealing with roof structures that exhibit texture features similar to the ground and shadows, relying solely on MS can result in missed detections and false positives. However, incorporating the nDSM stream significantly mitigates this issue. Eliminating CFM compromises the integrity of the building extraction outcomes. On the one hand, MS data are prone to noise, such as light interference, and there exists a significant presence of shadows caused by surrounding buildings. On the other hand, nDSM contains a certain amount of noise points. Moreover, it poses a challenge to differentiate buildings from objects of similar height. Inherent disparities exist between these two modalities. Thus, simply performing SUM or CAT operations would undermine the accuracy of the building extraction results. Nevertheless, employing CFM enables the extraction of complementary information from both modalities and thereby enhances the precision of the building extraction outcomes.

3) Effect of RUM: We conducted a study to evaluate the efficacy of the proposed RUM, and the experimental results are presented in Fig. 9 and Table V. The utilization of residual structures led to an improvement in IoU of 0.60% and 0.19% on the Vaihingen and Potsdam datasets, respectively, owing to the network's ability to preserve more original semantic information during the upsampling process. Furthermore, substituting regular 3 × 3 convolutions with DSC resulted in an IoU increase of 0.66% and 0.28% while reducing the parameter count by approximately 0.1 M. By employing RUM, we facilitated the transfer of richer semantic information related to building objects from deep convolution blocks to lower layers during upsampling, thereby achieving superior performance in building extraction tasks while maintaining a lightweight network structure.
Fig. 7. Visualization results of ablation experiments on multilevel feature fusion and GL-FRM. (a) Original image. (b) nDSM. (c) Ground truth. (d) ACMFNet without multilevel feature fusion. (e) Replace GL-FRM with CAT. (f) GL-FRM without global stream. (g) GL-FRM without local stream. (h) ACMFNet (proposed).

Fig. 8. Visualization results of ablation experiments on cross-modal fusion and CFM. (a) Original image. (b) nDSM. (c) Ground truth. (d) ACMFNet without nDSM. (e) Replace CFM with SUM. (f) Replace CFM with CAT. (g) ACMFNet (proposed).

D. Comparisons With Other Methods

We compared ACMFNet with 16 DL-based semantic segmentation methods, including U-Net [14], FCN [13], DeepLabv3+ [63], the dense dilated convolutions merging network (DDCM-Net) [64], the multi-attention network (MA-Net) [29], DC-Swin [45], BuildFormer [19], the dual spatial attention Transformer net (DSATNet) [47], V-FuseNet [24], the cross-modal gated fusion network (CMGFNet) [5], FuseNet [49], RedNet [50], the attention complementary network (ACNet) [51], SA-Gate [52], ESANet [53], and CMX [55]. As the methods in [13], [14], [19], [29], [45], [63], and [64] are all based on a single modality, we only used MS data for them. To ensure a fair comparison, the first four methods all chose ResNet-34 as the backbone. Since the methods in [49], [50], [51], [52], [53], and [55] are all semantic segmentation methods for RGB-D scenes, we used nDSM instead of depth data. The building extraction results for all methods were obtained by running the available codes.

1) Comparisons With Single-Modal Methods: Table VI shows the numerical results of the single-modal semantic segmentation methods used for building extraction. Our proposed method achieved an IoU of 88.41% on Vaihingen, which is a 0.22% improvement over BuildFormer, and outperformed the other algorithms in terms of precision and F1.
Fig. 9. Visualization results of ablation experiments on RUM. (a) Original image. (b) nDSM. (c) Ground truth. (d) RUM without residual. (e) Replace DSC with conventional convolution. (f) ACMFNet (proposed).

TABLE VI: Comparison of different single-modal networks with the proposed method.

TABLE VII: Comparison of different cross-modal networks with the proposed method.

In addition, our method achieved the best IoU, recall, and F1 scores on Potsdam. To qualitatively analyze the building extraction performance of the proposed ACMFNet, we visualized its results alongside the results of the other three competitive methods as well as the ground truth.
Fig. 10. Visualization results compared with single-modal methods. (a) Original image. (b) Ground truth. (c) DeepLabv3+. (d) MA-Net. (e) DSATNet. (f) BuildFormer. (g) ACMFNet without nDSM (proposed).

Fig. 11. Parameters and IoU of our ACMFNet and other methods. (a) Comparisons with single-modal methods. (b) Comparisons with cross-modal methods.

From Fig. 10, it is easy to find that, compared with the other three methods, ACMFNet has a stronger ability to distinguish buildings from the background, while the edges of the extracted buildings are clearer. DeepLabv3+ utilizes a unique spatial pyramid pooling module and decoder structure, resulting in superior performance compared with other CNN-based methods. U-Net integrates spatial information from lower level features through skip connections, yielding segmentation results slightly inferior to DeepLabv3+. In contrast, BuildFormer incorporates an additional global contextual pathway in the encoder to capture global dependencies, providing better global modeling capability. However, its overly simplistic decoder structure leads to rough resolution recovery. Our proposed method utilizes ResNet-34 as the encoder and incorporates GL-FRM in the decoder to match deep semantic features with shallow fine-grained features and establish contextual dependencies at the spatial level, which results in better accuracy in recognizing building pixels and maintaining building integrity.
Fig. 12. Visualization results compared with cross-modal methods. (a) Original image. (b) nDSM. (c) Ground truth. (d) CMGFNet. (e) ACNet. (f) ESANet. (g) CMX. (h) ACMFNet (proposed).

In addition, Fig. 11(a) shows the parameters of each method, with the ST-U-shaped network (ST-UNet) having a parameter count of 160.97 M, far exceeding the other methods yet yielding inferior results. This indicates that using a complex encoder structure does not necessarily lead to performance improvements and may instead increase model complexity, resulting in unnecessary computations. Our proposed method achieved optimal results while significantly reducing the parameter count because of the dimension reduction of features at each stage of the encoder and the lightweight decoder module.

2) Comparisons With Cross-Modal Methods: Table VII presents the numerical results of the cross-modal semantic segmentation methods employed for building extraction. Our proposed method achieved the highest IoU of 89.76%, demonstrating a notable improvement of 0.95% compared with CMGFNet (88.81%) on the Vaihingen dataset. Furthermore, our approach outperformed the other algorithms in terms of precision, recall, and F1 scores. On the Potsdam dataset, our method attained an impressive IoU score of 93.13% and surpassed the alternative algorithms in the recall and F1 metrics as well. Particularly noteworthy is our algorithm's exceptional recall rate of 96.98%, surpassing the other approaches by a significant margin of 0.76%, which indicates fewer instances where building pixels were missed during the segmentation process. We selected four methods with relatively good performance to visualize the results in Fig. 12. In comparison with other cross-modal methods, our proposed ACMFNet demonstrates accurate segmentation of building areas across various challenging scenarios. When dealing with buildings heavily affected by shadows or tree height, our method excels at distinguishing nonbuilding areas while maintaining higher integrity in building extraction due to the fusion of complementary features in cross-modal data through CFM. Moreover, for buildings exhibiting irregular geometric structures, our approach achieves clearer and more complete boundary structures by leveraging the comprehensive integration and utilization of high-level semantic features, low-level spatial details, and global contextual information via GL-FRM. In addition, as shown in Fig. 11(b), our method outperforms the RGB-D-based semantic segmentation methods while significantly reducing the parameter count. Compared with the SOTA method CMGFNet for cross-modal building extraction, our method introduces fewer additional parameters yet yields significant improvements in extraction results.

V. CONCLUSION

In this article, we propose ACMFNet, a novel building extraction method for cross-modal HSR remote sensing images that adopts an encoder–decoder structure. To alleviate spatial mismatch, we propose GL-FRM, which establishes multiscale and multilevel contextual dependencies by combining large-scale convolution and attention computation. It integrates deep semantic feature maps from the decoder with shallow fine-grained feature maps from the encoder. Furthermore, CFM is employed to integrate the complementary parts between MS data and nDSM data.
In addition, a lightweight RUM is used to recover the resolution of the feature maps. Extensive experimental results on the Vaihingen and Potsdam datasets show that the proposed ACMFNet achieves superior building extraction performance. However, limitations exist in building boundary extraction: the segmentation results do not align perfectly with building shapes, and the boundaries appear less smooth. In future work, we will explore encoding methods for boundary features to overcome these limitations. Furthermore, the model's adaptability to different scenes and ever-changing remote sensing images can be improved by incorporating incremental learning [65], [66] and domain adaptation techniques [67].

REFERENCES

[1] H. Bi, F. Xu, Z. Wei, Y. Xue, and Z. Xu, "An active deep learning approach for minimally supervised PolSAR image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 9378–9395, Nov. 2019, doi: 10.1007/s13735-017-0141-z.
[2] X. Yuan, J. Shi, and L. Gu, "A review of deep learning methods for semantic segmentation of remote sensing imagery," Expert Syst. Appl., vol. 169, May 2021, Art. no. 114417.
[3] F. Ma, F. Zhang, Q. Yin, D. Xiang, and Y. Zhou, "Fast SAR image segmentation with deep task-specific superpixel sampling and soft graph convolution," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5214116.
[4] F. Ma, F. Zhang, D. Xiang, Q. Yin, and Y. Zhou, "Fast task-specific region merging for SAR image segmentation," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5222316, doi: 10.1109/TGRS.2022.3141125.
[5] H. Hosseinpour, F. Samadzadegan, and F. D. Javan, "CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 184, pp. 96–115, Feb. 2022, doi: 10.1016/j.isprsjprs.2021.12.007.
[6] M. Belgiu and L. Dragut, "Comparing supervised and unsupervised multiresolution segmentation approaches for extracting buildings from very high resolution imagery," ISPRS J. Photogramm. Remote Sens., vol. 96, pp. 67–75, Oct. 2014.
[7] F. Zhang, X. Sun, F. Ma, and Q. Yin, "Superpixelwise likelihood ratio test statistic for PolSAR data and its application to built-up area extraction," ISPRS J. Photogramm. Remote Sens., vol. 209, pp. 233–248, Mar. 2024, doi: 10.1016/j.isprsjprs.2024.02.009.
[8] C. Li, B. Zhang, D. Hong, J. Yao, and J. Chanussot, "LRR-Net: An interpretable deep unfolding network for hyperspectral anomaly detection," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5513412.
[9] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, "Building detection in very high resolution multispectral data with deep learning features," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2015, pp. 1873–1876.
[10] D. Griffiths and J. Boehm, "Improving public data for building segmentation from convolutional neural networks (CNNs) for fused airborne LiDAR and image data using active contours," ISPRS J. Photogramm. Remote Sens., vol. 154, pp. 70–83, Aug. 2019.
[11] Z. Zheng, Y. Zhong, J. Wang, A. Ma, and L. Zhang, "Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters," Remote Sens. Environ., vol. 265, Nov. 2021, Art. no. 112636.
[12] L. Dong and J. Shan, "A comprehensive review of earthquake-induced building damage detection with remote sensing techniques," ISPRS J. Photogramm. Remote Sens., vol. 84, pp. 85–99, Oct. 2013.
[13] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2015, pp. 234–241.
[15] S. Ji, S. Wei, and M. Lu, "Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 1, pp. 574–586, Jan. 2019.
[16] T. Zuo, J. Feng, and X. Chen, "HF-FCN: Hierarchically fused fully convolutional network for robust building extraction," in Proc. Asian Conf. Comput. Vis., 2017, pp. 291–302.
[17] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: Redesigning skip connections to exploit multiscale features in image segmentation," IEEE Trans. Med. Imag., vol. 39, no. 6, pp. 1856–1867, Jun. 2020.
[18] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[19] L. Wang, S. Fang, X. Meng, and R. Li, "Building extraction with vision transformer," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5625711.
[20] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[21] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 548–558.
[22] C. Peng, Y. Li, L. Jiao, Y. Chen, and R. Shang, "Densely based multiscale and multi-modal fully convolutional networks for high-resolution remote-sensing image semantic segmentation," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 8, pp. 2612–2626, Aug. 2019.
[23] Z. Cao et al., "End-to-end DSM fusion networks for semantic segmentation in high-resolution aerial images," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 11, pp. 1766–1770, Oct. 2019.
[24] N. Audebert, B. Le Saux, and S. Lefèvre, "Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks," ISPRS J. Photogramm. Remote Sens., vol. 140, pp. 20–32, Jun. 2018.
[25] J. Huang, X. Zhang, Q. Xin, Y. Sun, and P. Zhang, "Automatic building extraction from high-resolution aerial images and LiDAR data using gated residual refinement network," ISPRS J. Photogramm. Remote Sens., vol. 151, pp. 91–105, May 2019.
[26] P. Zhang et al., "A hybrid attention-aware fusion network (HAFNet) for building extraction from high-resolution imagery and LiDAR data," Remote Sens., vol. 12, no. 22, p. 3764, Nov. 2020.
[27] D. Hong et al., "SpectralGPT: Spectral remote sensing foundation model," IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 3, 2024, doi: 10.1109/TPAMI.2024.3362475.
[28] D. Hong, C. Li, B. Zhang, N. Yokoya, J. A. Benediktsson, and J. Chanussot, "Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation," Innov. Geosci., vol. 2, no. 1, 2024, Art. no. 100055.
[29] R. Li et al., "Multiattention network for semantic segmentation of fine-resolution remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5607713.
[30] A. Bokhovkin and E. Burnaev, "Boundary loss for remote sensing imagery semantic segmentation," in Proc. Int. Symp. Neural Netw. Cham, Switzerland: Springer, 2019, pp. 388–401.
[31] J. Li, X. Huang, L. Tu, T. Zhang, and L. Wang, "A review of building detection from very high resolution optical remote sensing images," GISci. Remote Sens., vol. 59, no. 1, pp. 1199–1225, Dec. 2022.
[32] T.-T. Ngo, V. Mazet, C. Collet, and P. de Fraipont, "Shape-based building detection in visible band images using shadow information," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 3, pp. 920–932, Mar. 2017.
[33] Z. Guo and S. Du, "Mining parameter information for building extraction and change detection with very high-resolution imagery and GIS data," GISci. Remote Sens., vol. 54, no. 1, pp. 38–63, Jan. 2017.
[34] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[35] V. Mnih, "Machine learning for aerial image labeling," Ph.D. dissertation, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2013.
[36] S. Saito, T. Yamashita, and Y. Aoki, "Multiple object extraction from aerial imagery with convolutional neural networks," Electron. Imag., vol. 2016, no. 10, pp. 1–9, 2016.
[37] A. Khalel and M. El-Saban, "Automatic pixelwise object labeling for aerial imagery using stacked U-Nets," 2018, arXiv:1803.04953.
[38] M. Guo, H. Liu, Y. Xu, and Y. Huang, "Building extraction based on U-Net with an attention block and multiple losses," Remote Sens., vol. 12, no. 9, p. 1400, Apr. 2020.
[39] Q. Tian, Y. Zhao, Y. Li, J. Chen, X. Chen, and K. Qin, "Multiscale building extraction with refined attention pyramid networks," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[40] P. Das and S. Chand, "AttentionBuildNet for building extraction from aerial imagery," in Proc. Int. Conf. Comput., Commun., Intell. Syst. (ICCCIS), Feb. 2021, pp. 576–580.
[41] S. Zheng et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6881–6890.
[42] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7242–7252.
[43] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 12077–12090.
[44] K. Chen, Z. Zou, and Z. Shi, “Building extraction from remote sensing images with sparse token transformers,” Remote Sens., vol. 13, no. 21, p. 4441, Nov. 2021.
[45] L. Wang, R. Li, C. Duan, C. Zhang, X. Meng, and S. Fang, “A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022.
[46] X. He, Y. Zhou, J. Zhao, D. Zhang, R. Yao, and Y. Xue, “Swin transformer embedding UNet for remote sensing image semantic segmentation,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 21644229.
[47] R. Zhang, Z. Wan, Q. Zhang, and G. Zhang, “DSAT-Net: Dual spatial attention transformer for building extraction from aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 20, pp. 1–5, 2023, doi: 10.1109/LGRS.2023.3304377.
[48] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang, “Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 664–679.
[49] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture,” in Proc. Asian Conf. Comput. Vis., 2016, pp. 213–228.
[50] J. Jiang, L. Zheng, F. Luo, and Z. Zhang, “RedNet: Residual encoder–decoder network for indoor RGB-D semantic segmentation,” 2018, arXiv:1806.01054.
[51] X. Hu, K. Yang, L. Fei, and K. Wang, “ACNET: Attention based network to exploit complementary features for RGBD semantic segmentation,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1440–1444, doi: 10.1109/ICIP.2019.8803025.
[52] X. Chen et al., “Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 561–577.
[53] D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, and H.-M. Gross, “Efficient RGB-D semantic segmentation for indoor scene analysis,” 2020, arXiv:2011.06961.
[54] Y. Yue, W. Zhou, J. Lei, and L. Yu, “Two-stage cascaded decoder for semantic segmentation of RGB-D images,” IEEE Signal Process. Lett., vol. 28, pp. 1115–1119, 2021.
[55] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 12, pp. 14679–14694, Dec. 2023, doi: 10.1109/TITS.2023.3300537.
[56] W. Zhou, J. Jin, J. Lei, and J.-N. Hwang, “CEGFNet: Common extraction and gate fusion network for scene parsing of remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5405110.
[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[58] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[59] X. Li et al., “Semantic flow for fast and accurate scene parsing,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 775–793.
[60] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[61] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1800–1807.
[62] ISPRS. Accessed: 2020. [Online]. Available: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
[63] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder–decoder with Atrous separable convolution for semantic image segmentation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 801–818.
[64] Q. Liu, M. Kampffmeyer, R. Jenssen, and A.-B. Salberg, “Dense dilated convolutions’ merging network for land cover classification,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 6309–6320, Sep. 2020.
[65] H. Huang, F. Gao, J. Sun, J. Wang, A. Hussain, and H. Zhou, “Novel category discovery without forgetting for automatic target recognition,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 4408–4420, 2024.
[66] F. Gao et al., “SAR target incremental recognition based on features with strong separability,” IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5202813.
[67] D. Hong et al., “Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks,” Remote Sens. Environ., vol. 299, Dec. 2023, Art. no. 113856.

Baiyu Chen received the B.S. degree in measurement and control technology and instrument from the University of Science and Technology Beijing, Beijing, China, in 2022. She is currently pursuing the M.Eng. degree with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing.
Her research interests include computer vision and remote sensing image semantic segmentation.

Zongxu Pan (Senior Member, IEEE) received the B.Eng. degree in electronic and information engineering from Harbin Institute of Technology, Harbin, China, in 2010, and the Ph.D. degree in information and communication engineering from Tsinghua University, Beijing, China, in 2015.
He is currently an Associate Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing. His research interests focus on deep learning-based target detection and recognition in optical remote sensing and synthetic aperture radar images.
Dr. Pan serves as a Guest Editor of Remote Sensing, a Review Editor of Frontiers in Remote Sensing, and a reviewer for several top journals. He was selected as the Best Reviewer of IEEE Geoscience and Remote Sensing Letters in 2022.

Jianwei Yang received the B.Eng. degree in electronics and information technology from Beijing Forestry University, Beijing, China, in 2020. He is currently pursuing the Ph.D. degree with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing.
His research interests include computer vision and remote sensing image object tracking.

Hui Long received the B.Eng. degree in photogrammetry from Huazhong Agricultural University, Wuhan, China, in 1997, and the Ph.D. degree in cartography and geographic information systems from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2006.
He is currently a Professor with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China. His research interests focus on satellite ground system design, remote sensing satellite information processing, and intelligent object detection.