0% found this document useful (0 votes)

16 views

Domain Adaptive and Interactive Differential Attention Network For Remote Sensing Image Change Detection

Uploaded by

yumiaowang8

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views

Domain Adaptive and Interactive Differential Attention Network For Remote Sensing Image Change Detection

Uploaded by

yumiaowang8

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL.

62, 2024 5616316

Domain Adaptive and Interactive Differential

Attention Network for Remote Sensing
Image Change Detection
Yuliang Ji , Weiwei Sun , Senior Member, IEEE, Yumiao Wang , Zhiyong Lv , Senior Member, IEEE,
Gang Yang , Yuanzeng Zhan, and Chong Li

Abstract— The objective of change detection (CD) is to Index Terms— Change detection (CD), convolutional neural
identify the altered region between dual-temporal images. network (CNN), domain adaptation, interactive differential
In pursuit of more precise change maps, numerous state-of- attention module (IDAM), remote sensing (RS), transformer.
the-art (SOTA) methods design neural networks with robust
discriminative capabilities. The convolutional neural network I. I NTRODUCTION
(CNN)-transformer model is specifically designed to integrate
the strengths of the CNN and transformer, facilitating effective
coupling of feature information. However, previous CNN-
transformer studies have not effectively mitigated the interference
C HANGE detection (CD) seeks to recognize the alter-
ations between the same geographical areas of dual-
temporal images, ultimately generating a binary change
of feature distribution differences as well as pseudovariations map. In recent decades, remote sensing (RS) has robust
between two images due to cloud occlusion, imaging conditions,
and other factors. In this article, we propose a domain adaptive capabilities of surface coverage and time-series observations,
and interactive differential attention network (DA-IDANet). and combined with CD techniques has been successfully
This model incorporates domain adaptive constraints (DACs) applied in various fields, encompassing urban expansion [1],
to mitigate the interference of pseudovariations by mapping deforestation [2], damage assessment [3], and forest cover
the two images to the same deep feature space for feature
mapping [4].
alignment. Furthermore, we designed the interactive differential
attention module (IDAM), which effectively improves the The approaches of CD can be broadly categorized into
feature representation and promotes the coupling of interactive two main groups: traditional approaches [5] and deep
differential discriminant information, thereby minimizing the learning approaches [6]. Traditional CD methods primarily
impact of irrelevant information. Experiments on four datasets utilize techniques, such as statistical analysis [7], signal
demonstrate the superior validity and robustness of our proposed
model compared to other SOTA methods, as evident from both
processing [8], or machine learning [9]. When confronted
quantitative analysis and qualitative comparisons. The code will with RS images exhibiting complex features or mixed pixels,
be available online (https://github.com/Jyl199904/DA-IDANet). traditional methods are limited by their modeling capabilities,
making it difficult to efficiently identify changing regions [10].
Moreover, traditional methods typically depend on manual
Manuscript received 6 February 2024; revised 8 March 2024;
accepted 22 March 2024. Date of publication 27 March 2024; date of feature extraction or threshold setting, a time-consuming, and
current version 3 April 2024. This work was supported in part by the subjective process, especially when handling large-scale [11].
National Natural Science Foundation of China under Grant 42122009 and In recent years, it has become increasingly convenient to
Grant 42201354, in part by the Zhejiang Province “Pioneering Soldier”
and “Leading Goose” Research and Development Project under Grant obtain high-resolution (HR) images. Nevertheless, there are
2023C01027, in part by the Zhejiang Provincial Natural Science Foundation rich texture information and complex spatial structures of
of China under Grant LQ22D010007, in part by the Public Projects of Ningbo ground objects in HR images, and traditional methods do not
City under Grant 2021S089 and Grant 2022S101, in part by the Ningbo
Science and Technology Innovation 2025 Major Special Project under Grant adequately capture complex and abstract features of images.
2021Z107 and Grant 2022Z032, and in part by the Youth Scientist Project To overcome the limitations of traditional methods, many
National Key Research and Development Program of China under Grant researchers have turned to advanced techniques, such as deep
2023YFF1305600. (Corresponding authors: Weiwei Sun; Yumiao Wang.)
Yuliang Ji and Chong Li are with the Faculty of Mathemat- learning, to address the CD task in an end-to-end pattern.
ics and Statistics, Ningbo University, Ningbo 315211, China (e-mail: Deep learning models, with their capacity to capture
benpaodexiaoliang@outlook.com; lichong.work@outlook.com). and model complex nonlinear relationships, find extensive
Weiwei Sun, Yumiao Wang, and Gang Yang are with the Department of
Geography and Spatial Information Techniques, Ningbo University, Ningbo applications in the field of CD using RS images, showcasing
315211, China (e-mail: sunweiwei@nbu.edu.cn; wymfrank@whu.edu.cn; notable effectiveness [12]. As the mainstream technology
yanggang@nbu.edu.cn). of deep learning, convolutional neural networks (CNNs)
Zhiyong Lv is with the School of Computer Science and Engi-
neering, Xi’an University of Technology, Xi’an 710048, China (e-mail: possess formidable feature extraction capabilities, enabling
Lvzhiyong_fly@hotmail.com). the automatic extraction of multilevel features containing rich
Yuanzeng Zhan is with the Institute of Surveying and Mapping Science spectral information and spatial features from RS images [13],
and Technology of Zhejiang Province, Hangzhou 310012, China (e-mail:
geozhanyz@hotmail.com). [14], [15]. For instance, Wiratama et al. [16] devised a
Digital Object Identifier 10.1109/TGRS.2024.3382116 novel CNN-based model that amalgamates information from
1558-0644 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Ningbo University. Downloaded on April 23,2024 at 02:01:39 UTC from IEEE Xplore. Restrictions apply.
5616316 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 62, 2024

neighboring pixels in the image, facilitating the identification to capture spatiotemporal context and refining original
of altered regions through the interconnection of local features features through a decoder, which facilitates the efficient
across various levels. Zhan et al. [17] crafted a weight- utilization of spatial contextual information and modeling,
sharing Siamese deep CNN for pixelwise feature extraction markedly enhancing the efficacy of CD. Zhang et al. [33]
and constructed feature vectors around altered pixel pairs, devised a U-shaped structure for the transformer framework,
accentuating the regions of change. Chang et al. [18] leveraging the Swin transformer as the basic unit. The
proposed a multiscale joint information extraction method model processes dual-temporal images through the encoder to
based on space and spectrum, which can accurately segment extract multiscale features, concurrently restoring the details
multitemporal image endmembers for CD. The existing of change information through the decoder, which can obtain
deep learning methods often face challenges related to more accurate CD results.
optimization problems with sparse constraints. Consequently, In contrast to the straightforward CNN model that relies
Feng et al. [19], [20] proposed a semisupervised CNN on convolutional layers to adeptly capture local features, the
converted to reinforcement learning framework and a band- transformer model has the ability to concurrently attend to
attention-based graph convolutional network for hyperspectral various segments of the input sequence through the multihead
band selection, respectively. To fully utilize the complementary self-attention mechanism. This characteristic proves beneficial
information between dual-temporal images and solve the for global context modeling. Constructing a joint CNN-
problem of interscale variations, Du et al. [21] combined transformer model allows for a comprehensive amalgamation
the discriminator network with CD network in an end-to- of the strengths of both architectures, effectively addressing
end manner. Huang et al. [22] designed a spatiotemporal multiscale changes in ground objects. This model excels
augmentation and interlayer fusion network to improve in capturing local details while simultaneously considering
the feature representation of the changing objects, thereby global correlations [36], [37], [38]. Li et al. [39] introduced
efficiently explore the information of the dual-temporal the ConvTrans block, which dynamically aggregates global
differences. features from the transformer module and local features
CNN models capture the spatial distribution and structure from the CNN. This approach effectively enhances the
of ground objects in images through convolutional operations, robustness of CD at different scales, but it exhibits suboptimal
which can progressively abstractly represent the information performance when applied to extremely small altered areas.
in the image layer by layer. However, the CNN model Chen et al. [40] presented a fully connected CNN-transformer,
is constrained by a fixed receptive field and limited connecting UNet3+ and transformer in a Siamese structure.
comprehension of deeper level information. This limitation This model utilized a progressive attention module to
can lead to an inadequate response to changes in ground deeply integrate features extracted by both the CNN and
objects at different scales and challenges in distinguishing transformer, thereby addressing the issue of information loss
certain complex changes in ground objects [23]. The attention between scales in CD. While the CNN-transformer model
mechanism can assist the model in concentrating on dynamic has demonstrated excellent performance in CD tasks, it still
regions to augment feature representation [24]. With the exhibits certain shortcomings. First, the increasing complexity
incorporation of the attention mechanism, models can more of ground objects in bitemporal HR images, coupled with
effectively concentrate on areas with important information stylistic variations, such as illumination and cloud occlusion,
in the images [25]. It adaptively learns the importance may result in inconsistent feature representations for the
of different parts of the original input images, thereby same objects. Previous CNN-transformer models struggle to
alleviating the constraint of fixed receptive fields inherent effectively mitigate the interference of the pseudovariations
in CNN-only approaches. Hence, in recent years, many caused by differences in feature distributions. Furthermore,
researchers have incorporated attention mechanisms into CD the existing methods are frequently difficult to sufficiently
tasks to enhance accuracy. For instance, Liu et al. [24] leverage the interactive differential discriminant information
devised feature exchange and channel attention modules to between input images, leading to ambiguous or inaccurate
effectively model contextual information in dual-temporal detection of ground objects and building boundaries.
images. Feng et al. [26] introduced a joint attention module To effectively address these challenges, we propose
by combining self-attention and cross-attention, which guides a domain adaptive and interactive differential attention
the global feature distribution on the input side and facilitates network (DA-IDANet). Specifically, DA-IDANet first extracts
the coupling of information. multilevel features by a CNN backbone and employs domain
Recently, transformer [27], initially designed to address adaptive constraints (DACs) mapping the input images to the
natural language processing tasks, has sparked a considerable same feature space for feature alignment. Then, a transformer
sensation in the field of computer vision. In comparison to module undertakes the encoding and decoding stages to
models solely based on CNN, transformer showcases robust aggregate context information. To fully leverage the interactive
capabilities in modeling long-range dependencies [28], [29], differential discriminant information between input images at
[30], which can capture global contextual information and different moments, an interactive differential attention module
process long-range associated features. This development has (IDAM) is proposed in DA-IDANet. The contributions of our
significantly propelled advancements in CD algorithms [31], work can be summarized as follows.
[32], [33], [34]. For instance, Chen et al. [35] introduced 1) We have devised a hybrid CNN-transformer model that
a dual-temporal images transformer, employing an encoder effectively integrates local–global features and spatial

Authorized licensed use limited to: Ningbo University. Downloaded on April 23,2024 at 02:01:39 UTC from IEEE Xplore. Restrictions apply.
JI et al.: DA-IDANet FOR REMOTE SENSING IMAGE CHANGE DETECTION 5616316

context information, thereby enhancing the accuracy of the model that constructs multiscale feature fusion is better
CD tasks. equipped to distinguish complex surface structures and
2) We formulate DACs to mitigate the interferences of changing areas in RS images. Lv et al. [49] integrated
pseudovariations and feature distribution difference by cross-layer blocks to fuse multiscale features and multilevel
mapping the two images to the same deep feature space information based on UNet, thereby augmenting the model’s
for feature alignment. ability to capture feature information. Peng et al. [50]
3) The IDAM is designed to effectively integrate interactive introduced UNet++, which produced feature maps with
differential discriminant information between the two higher spatial precision by incorporating global and fine-
images at different moments, thereby preserving crucial grained information. Shuai et al. [51] introduced a method
information of changing areas and maximizing the utilizing graph convolutional operations to extract multiscale
accuracy of boundary detection. neighborhood features of nodes and then concatenated them
The remainder of this article is organized as follows. through the attention mechanism.
Section II provides a brief review of related work. The details It is noteworthy that CNN, being adept at capturing
of our proposed framework are elucidated in Section III. local details and small differences, plays a crucial role in
Section IV conducts a comprehensive experimental evaluation. extracting multiscale feature information essential for CD
The final section, Section V, summarizes this article. tasks. However, CNN lacks the ability to model long-range
dependencies and contextual information interactions. There-
II. R ELATED W ORK fore, after the CNN extracts local features, the combination
Recently, with the ongoing evolution of deep learning of the transformer architecture facilitates the establishment of
in CD, most mainstream algorithms are mainly based long-distance spatial dependencies and contextual information
on feature learning. From the conventional CNN models interactions to improve the feature representation.
extracting local features of images and performing fusion
process to the subsequent introduction of attention mechanism
to enhance feature representation. Then, transformer was B. Attention-Based Method
proposed to capture global features and establish long-distance
dependencies in feature representations. This section will delve In CD, the changing areas within input images are
into these aspects, considering the work from the perspectives frequently concentrated in localized regions, while most areas
of CNN models, attention mechanisms, and transformer. remain static. Moreover, buildings or ground objects in the
input images may be subject to various interferences, such as
lighting and noise. Through the incorporation of the attention
A. CNN-Based Method mechanism, the model can allocate more attention to areas
CNN-based models can leverage deeper neural network that are particularly sensitive to changes, thereby reducing
structures and incorporate information of different scales, the processing of redundant information. This enhancement
which effectively capture the local spatial and texture details contributes to the overall robustness of the model. In recent
of RS images. Notably, the residual network [41] facilitates years, the attention mechanism has been integrated into
the direct transfer of information across layers through numerous studies of CD tasks. These mechanisms include
the introduction of residual connections. This mechanism channel attention [47], [52], spatial attention [25], [47],
increases network depth and effectively mitigates the vanishing channel and spatial attention [43], [48], and so forth.
gradient problem. Ding et al. [42] introduced a deep To effectively improve the multiscale feature representation,
supervision module to enhance the capability of middle and Wang et al. [53] proposed the adaptive attention mechanism,
deep layers in extracting more salient features. To better effectively combines channel and spatial features to capture
utilize the effective feature information in the deep network, changes at various scales. Ding et al. [42] introduced cross-
Zhang et al. [43] proposed a deep supervised fusion network, layer addition and skip connection modules guided by spatial
employing a fully convolutional dual-stream architecture to attention to aggregate multilevel contextual information and
extract highly representative deep features and then input into direct the network’s attention to changing areas. As the neural
a deep differential discriminant network for CD. Li et al. [44] network’s layers deepen, there is frequently an associated par-
differentiated the changing areas of the target by leveraging tial loss of information. Fang et al. [47] incorporated channel
deep contextual features. To mitigate the limited image attention to enhance intermediate feature representation for
rotation capability and exploit the unique correlation between modeling contextual information. Zhang et al. [43] employed
image bands, Mei et al. [45] proposed cyclic polar coordinate the attention module to fuse deep features across diverse
convolutional layers to deal with rotational invariance for scales within the initial image and incorporate differential
feature learning. In addition, Ma et al. [46] proposed a features to reconstruct the change map. To enhance the
diverse band selection algorithm based on spectral correlation feature representation by exploiting the dependency between
for effective image classification. Numerous experiments channel and spatial location, Liu et al. [54] introduced a
have substantiated the effectiveness of multiscale feature paired attention module to strengthen the mutual influence
fusion [43], [47], [48]. This not only aids the model in between channels and spatial positions. Therefore, designing
adapting to changes at different scales but also enables a suitable attention mechanism can improve the performance
the model to possess a larger receptive field. Consequently, and efficiency of the model.

Fig. 1. Architecture of the proposed DA-IDANet. Bitemporal image features are extracted for the pretrained ResNet-18 and feature alignment is performed
by domain adaptive constraints. Then, the contextual information is aggregated by the transformer encoder and decoder, and the interaction difference attention
module is designed to supplement the interaction difference discriminative information between images. Finally, the final binary variation map is obtained by
going through the classifier.

C. Transformer-Based Method CNN and transformer in a parallel manner. Liu et al. [66]
introduced a CNN-transformer framework for multiscale upper
Transformer [27], introduced in 2017 as a novel architec- and lower information aggregation, utilizing CNN to extract
ture, has garnered attention for its ability to model remote local features from images and decoding context information
dependencies, making it particularly impactful in processing through the transformer multiscale aggregator, which can
sequence data. Due to its exceptional performance, researchers effectively identify the changing areas. Nevertheless, earlier
have extended the application of transformer to the realm of CNN-transformer models lack the emphasis on interactive
computer vision, yielding impressive results in various tasks, differential discriminant information between two images
such as target detection [55], [56], superresolution [57], [58], captured at distinct moments. Furthermore, they struggle
semantic segmentation [47], [59], image classification [60], to effectively alleviate the interference of pseudovariations
target tracking, video segmentation [55], and more. Recently, and disparities in feature distribution resulting from factors,
transformer has proven to have some growth potential such as light irradiation, cloud occlusion, and other external
in CD [33]. With transformer’s advantage in long-range influences. Therefore, we introduce the DACs and IDAM
dependency modeling, which can help the model learn more designed to align features extracted from dual-temporal
discriminative global features, Chen et al. [35] proposed that images into a common feature space, fully harnessing the
the dual-temporal image transformer can effectively model interactive differential discriminant information between them.
context in the spatiotemporal domain and feedback the learned Ultimately, it is possible to achieve effective mitigation of
context information to the pixel space. Mei et al. [61] feature distribution differences and improve the accuracy of
designed group-aware hierarchical transformer to concentrate model detection boundaries.
the extracted global features for efficient image classification.
Yan et al. [62] enhanced global view feature extraction using III. M ETHODOLOGY
the Swin transformer architecture and combined multilevel
visual features in a pyramid manner. To further enhance A. Architecture Overview
the global dependence of multiscale features and maximize The proposed DA-IDANet is depicted in Fig. 1. Specifically,
the effective utilization of spatial information within refined we input dual-temporal images into the pretrained ResNet-18
features, Ke and Zhang [63] presented a hybrid multiscale for feature extraction, and three different scales are selected
transformer module, utilizing a detailed self-attention mech- to circumvent redundancy among information of features and
anism to model representation attention across mixed scales diminish the computational load. Subsequently, the extracted
within each image. Simultaneously, Song et al. [64] introduced features from the two images are aligned at each scale
a multiscale Swin transformer combined with a deep to the same deep feature space through DACs, thereby
supervision network, which incorporated deep supervision mitigating the interference of pseudovariations and feature
aggregation to enhance the distinguishability of multiscale distribution differences. Following this, the multiscale feature
features and fully utilized the refined multiscale spatial map undergoes the transformer encoder and decoder for the
information. Models integrating CNN and transformer enable aggregation of context information. Then, the intermediate
the harnessing of the strengths of both synthetically, fostering features are fed to the IDAM, utilizing the interactive
effective coupling of feature information. Feng et al. [65] differential discriminant information between input images
introduced an intrascale cross-interaction and interscale feature to better focus on the changing areas and reduce the
fusion network to fully harness the potential of integrating interference of irrelevant information. Finally, the feature maps

are concatenated along the channel dimensions and passed vector f in the regenerated Hilbert space should be less than
through the classifier to obtain the final change binary map. or equal to 1. The function φ(·) corresponding to the kernel
function maps the random variable x to infinite dimensions
B. Multiscale Feature Extraction Network (φ(·)), and f (x) represented the dot product between the base
vector f and φ(x) in the regenerated Hilbert space.
DA-IDANet utilizes ResNet-18 [41] as the backbone
Initially, expand the expectation and subsequently express
network framework for extracting multiscale RS image
the function f (x) in the Hilbert space as the inner product of
features. The feature extraction module primarily comprises
the mapping function φ(·) and the basis vector f . This means
a 7 × 7 convolutional layer and four residual layers. A stride of
that mapping the vector f (x) to the Hilbert space through
2 is employed with the 7 × 7 convolutional layer, resulting in a
the function φ(·), followed by performing an inner product
change of channels from 256 to 64. Simultaneously, it extracts
with the base vector f within a unit ball in the space. This
the shallow features of the original images. Following this,
process accomplishes the transformation to higher dimensions.
the feature maps undergo a 3 × 3 max pooling layer,
Consequently, it can be inferred as follows:
aiming to decrease the size of feature maps while preserving Z
salient features. This strategy helps reduce the computational
E p f (x) = p(d x) f (x)
complexity and number of parameters in subsequent neural ZX
network layers, thereby enhancing computational efficiency.
= p(d x)⟨φ(x), f ⟩
In the backbone network, each residual block comprises X
two 3 × 3 convolutional layers, two batch normalization
Z
layers, and a rectified linear unit function. The convolutional = p(d x)φ(x), f . (2)
X
layer is primarily employed to extract image features, and
skip connections are introduced to directly add the input and Combined with the basic definition, the following expression
convolutional results, ensuring consistency in the features. can be obtained:
Both the second residual layer and the third residual layer MMD[ p, q, H ] := sup (E p [ f (x)] − E q [ f (y)])
set the stride to 2, which will cause the size of the feature f ϵ H,∥ f ∥ H ≤1

map to be reduced by half. After the fourth residual block, ≤ sup µ p − µq H

∥ f ∥H
the size of the final feature map is 1/8 of the input image f ϵ H,∥ f ∥ H ≤1

size. This structure facilitates the gradual extraction of deep ≤ µ p − µq H

(3)
features from the image and diminishes the spatial size of the R
where µ p = X p(d x)φ(x) represents the kernel mean
feature map for subsequent processing and analysis.
embedding. Assume that the source-domain data and target-
domain data amounts are m and n, respectively, the final
C. Domain Adaptation Constraint Block formula can be obtained as follows:
The objective of domain adaptation is to apply knowledge
MMD[ p, q, H ] = µ p − µq H
learned from one or more domains to another domain by
n n
mapping it into a unified feature space [67]. Maximum 1X 1 X
mean discrepancy (MMD) is a common domain adaptation = φ(xi ) − φ yj . (4)
n i=1 m j=1
technique, which fosters feature alignment between different H
domains by measuring the distribution distance of two random
variables between the source and target domains. Constructing D. Interactive Differential Attention Module
the MMD loss can maximize the similarity of feature The introduction of the attention mechanism aids the
representations in the feature space, thereby reducing the model in more effectively pinpointing potential change areas
distribution difference between the features of the source and enhances the model’s perceptual capabilities, ultimately
and target domains. While the two images originate from improving the accuracy of CD. Usually, with the increasing
a unified sensor, the influence of external factors, such number of layers in the neural network, the model tends to
as light irradiation, cloud cover, and imaging conditions, lose certain details from the original images. Additionally,
can lead to variations in spectral curves and differences in the changing areas might constitute only a small portion of
feature distribution within the same category of features. the entire image and exhibit sparse distribution, which may
By introducing DACs, images from two different moments result in blurred or inaccurate boundary detection. Therefore,
are aligned to the same deep feature space, thereby reducing we propose the IDAM to effectively supplement the learning
the interference caused by feature distribution differences as of interactive differential discriminant information between
well as pseudovariations. The MMD is defined as the input images. With the IDAM, the regions associated
with change can be dynamically focused on, thus reducing
MMD F, p, q := sup E p f (x) − E q f (y) (1)
the interference of irrelevant information and maximizing the
where E P (·) is expressed as expected value, sup is expressed accuracy of boundary detection.
as supremum, and f (·) is expressed as a mapping function. Built upon the concept of the traditional self-attention
The function domain in MMD was defined as any vector mechanism, the IDAM attains global context awareness by
f within the unit sphere of the regenerated Hilbert space modeling dependencies between various positions in the
(i.e., ∥ f ∥H ≤ 1), which means that the norm of the base sequence. The feature map is represented as a 3-D tensor

feature maps at different moments. Similarly, we also calculate

F j′ (Q i , K j , V j ) and F jinter . The interactive discriminative
information between images enhances the sensitivity to surface
change information, which further aids the model in effectively
distinguishing the contextual information of the changed area
and identifying the location of the change

F diff = Fiinter − F jinter (8)

Fi′′ Conv Concat F , Fiinter
diff

=
Fig. 2. Architecture of the proposed interactive differential attention module.
F j′′ Conv Concat F diff , F jinter

= (9)
Fiida Fiinter · Softmax Fi′′

=
and then reshaped into a 2-D matrix. The self-attention F jida = F jinter · Softmax F j′′ .

(10)
mechanism incorporates three sets of weight matrices,
which typically match the dimensions of the input features As depicted in formula (8), we execute a differential operation
and the expected output dimensions. The attention score between Fiinter and F jinter to obtain the differential features
is computed by taking the dot product between Q and F diff . Subsequently, the differential features F diff and the
K , followed by dividing the result by a scaling factor interactive features Fiinter are concatenated on the channel
to prevent gradient explosion. Throughout this process, dimension. Following a sequence of convolutional operations,
an attention score matrix is generated for different positions, the softmax and normalization functions are applied to
indicating the degree of correlation between each position obtain the probability distribution. The IDAM efficiently
and other positions. Ultimately, the attention score matrix mitigates the issue of partial information loss caused by
is normalized into a probability distribution through the an increasing number of neural network layers during the
softmax operation to acquire the attention weight matrix. training. It maximizes the utilization of interactive differential
Subsequently, the V matrix is weighted and operated by discriminative feature information between dual-temporal
the attention weight matrix to yield the output of the images, ultimately leading to enhancements in both the overall
self-attention mechanism, as shown in formula (5). This accuracy and robustness of the model.
output encapsulates information about the importance of each
position in the overall sequence, facilitating global context
awareness. This mechanism proves valuable in capturing E. Prediction Head and Loss Function
long-distance dependencies and strengthening the model’s
In the concluding phase of generating the binary changed
representation capability
map, a universal prediction head is employed. Initially,
Q i K iT

three decoded aggregated maps at different scales are
Self_Attention(Q i , K i , Vi ) = Softmax √ Vi (5)
d upsampled to the same size as the input images with bilinear
interpolation. Subsequently, these maps are converted into a
where the vector sequence Q i ∈ R N ×C , K i ∈ R N ×C , and differential prediction through two 1 × 1 convolutional layers
Vi ∈ R N ×C , where N = H × W . While the self-attention and batch normalization. Following multiscale differential
mechanism effectively captures global context information acquisition, aggregation, and reorganization, we generated
by modeling dependence relationships at different locations, three predictions (Pcat′ ′′
, Pcat ′′′
, and Pcat ). During the training
it exhibits certain limitations in capturing interactive differen- phase, we optimize the model parameters by minimizing the
tial discriminant information between features extracted from domain adaptation constraint loss L mmd and cross-entropy loss
dual-temporal images. Therefore, we designed the IDAM to L ce between the three predictions and the ground truth one by
address the above limitations. one
As depicted in Fig. 2, the features Fi and F j of the two
images extracted in the previous stage are, respectively, input Pcat = Conv(conv(concat(up(F i da j ), (up(F i da j )))) (11)
into a 3 × 3 convolutional layer to obtain Q i , K j , and N
V j , proceeding to execute the corresponding cross-attention 1 X
L ce (Y, P) = − [Yi log2 Pi + (1 − Yi )log2 (1 − P i )]
mechanism operation N i=1
T
! (12)
Q i K j
Fi′ (Q i , K j , V j ) = Softmax √ · Vj (6) n n
di 1X 1X
L mmd Fi , F j , H = φ(xi ) − φ yj (13)
= Ai,cross n i=1 n j=1
j · Vj H
Fiinter = Fi + Fi′ (Q i , K j , V j ) (7)
where N represents the total number of pixels in the change
where i and j, respectively, represent the sequence numbers map, n represents the total number of samples, Yi is the label
of the feature maps at different moments, and Ai,cross
j represents of sample i, Pi is the probability of sample i to be predicted
the cross-attention probability distribution. We acquire the as a positive class, σ represents the sigmoid function, GT is
global attention distribution by normalizing Q and K of the ground truth, and λ is the weighting factors of DAC loss.

The total training loss is given by the following formula: including FC-EF [14], FC-Siam-Di [14], FC-Siam-Conc [14],
3 IFNet [43], SNUNet [47], BiT [35], DASNet [70], STANet [1],
X
L total = L ce (GT, σ (Pi )) + λ L mmd (Fi , F j , H ). (14) FTN [62], DMINet [26], and MSCANet [66].
i=1 1) FC-EF [14] is a CNN-based CD network built upon
′′′ UNet, which takes dual-temporal RS images as the input
In the ultimate inference stage, prediction Pcat was selected
layer of the network and fuses at the image level.
as the output result map and converted into a single-channel
2) FC-Siam-Di [14] builds upon the improvement of FC-
map through an argmax operation.
EF, extracting multilevel features from dual-temporal
RS images and fusing them using the shared-weight
IV. E XPERIMENTS
characteristics of the Siamese network.
A. Data Description 3) FC-Siam-Conc [14] further refines FC-EF by incorpo-
To assess the practical performance of the proposed rating feature-level fusion through a Siamese network.
DA-IDANet, we conducted extensive experiments on four 4) IFNet [43] leverages multiscale features from dual-
representative public RS CD datasets. temporal images and integrates the differential features
1) CDD [68]: The CDD dataset, sourced from Google obtained through the attention module to produce the
Earth, is a public dataset focused on detecting seasonal ultimate binary change map.
changes in the same region. It comprises multispectral 5) SNUNet [47] is a densely connected Siamese network
and HR images, with resolutions ranging from 0.03 to that enhances the feature representation in the middle
1 m, typically obtained from satellites or aerial photogra- layer and models context using the channel attention
phy to cover a diverse geographical area and changes at module.
different times. After cropping and rotating operations, 6) BiT [35] models the spatiotemporal context of dual-
16 000 image patches of the size 256 × 256 pixels temporal images using the transformer encoder, feeds
are derived from the original images, which were back the learned context information, and refines the
randomly split into 10 000/3000/3000 for training, initial features through the transformer decoder.
validation, and testing. 7) DASNet [70] captures extended dependencies using a
2) LEVIR-CD: The learning, vision (LEVIR-CD) [1] is dual attention mechanism to acquire a more discrimina-
an extensive very HR (VHR, 0.5 m/pixel) CD dataset. tive feature representation.
It encompasses 637 pairs of image patches with a size of 8) STANet [1] computes attention weights between pixels
1024 pixels, capturing significant changes over a span of at diverse times and locations using a self-attention
5–14 years. The dual-temporal images of LEVIR-CD are module, leveraging this information to generate a more
sourced from 20 distinct regions across various cities in discriminative feature representation.
TX, USA. For experimental purpose, we cropped them 9) FTN [62] proposes a progressive attention module and
into smaller image pairs with a size of 256 pixels, and leverages Swin transformer to learn more discriminative
they were randomly divided into 7120/1024/2048 for global features by modeling long-range dependencies.
training, validation, and testing. 10) DMINet [26] utilizes the intertemporal joint attention
3) CLCD-CD [69]: The Change in Land Cover Dataset to steer the global feature distribution while eliciting
(CLCD) comprises 600 pairs of farmland change representational information coupling.
images. Among these, they were randomly divided into 11) MSCANet [66] adopts the CNN-transformer structure,
360/120/120 for training, validation, and testing. The extracts multiscale features through the CNN backbone
spatial resolution ranges from 0.5 to 2 m. Each set network, and then aggregates contextual information
of samples includes two 512 × 512 images and is based on the transformer structure.
accompanied by a binary tag indicating the change in 2) Experimental Details: The proposed DA-IDANet and
farmland. all compared methods were experimented on an NVIDIA
4) WHU-CD [70]: The Wuhan University dataset encom- GeForce RTX 3080Ti workstation, using the PyTorch frame-
passes the region affected by the earthquake in February work as the model development environment. Throughout
2011, along with the area that underwent reconstruction the model training phase, we applied data augmentation
several years after the earthquake. The dataset spans techniques to the dataset, encompassing vertical and horizontal
approximately 20 km2 and includes 12 796 buildings. flipping, scaling, and cropping to mitigate overfitting during
It comprises a pair of aerial images with a spatial size of training. The model underwent training utilizing the adaptive
32 507 × 15 354 pixels with a resolution of 0.2 m/pixel. moment estimation, employing a batch size of 8, λ was set as
The original images were cropped into smaller image 0.4. The training spanned 100 epochs, and the initial learning
pairs with the size of 256 pixels and randomly divided rate was 1e-4.
into 6096/762/762 for training, validation, and testing. 3) Evaluation Metrics: To evaluate the performance and
accuracy of our proposed model compared with other SOTA
B. Experimental Setup methods, we use five evaluation metrics to measure the
1) Comparison With State of the Arts (SOTAs): To similarity between the predicted change probability map and
assess the effectiveness and accuracy of our DA-IDANet in the ground truth. These metrics include precision (Pre), recall
CD, we compared with several SOTA approaches, which (Rec), F1 score (F1), intersection over union (IOU), and kappa

TABLE I
C OMPARISON R ESULTS ON THE LEVIR-CD AND CDD DATASET. T HE B EST R ESULTS A RE H IGHLIGHTER IN B OLD .
A LL R ESULTS A RE E XPRESSED AS P ERCENTAGE (%)

(Kap). The definition of each indicator is as follows: Nevertheless, DA-IDANet excels over FTN in various other
TP performance metrics. On the CLCD-CD, DA-IDANet outper-
Precision = (15) formed MSCANet by 3.48% in recall, 2.22% in F1 score,
TP + FP
TP and 2.45% in IOU. However, compared to FTN, DA-IDANet
Recall = (16) dropped by 1.51% in precision. On the LEVIR-CD dataset,
TP + FN
2 ∗ Precision ∗ Recall we improved the F1 score by 0.78% and the IOU by 0.87%
F1 = (17) compared to MSCANet. In comparison with IFNet, DA-
Precision + Recall
TP IDANet experienced a 2.04% decrease in precision, suggesting
IOU = (18) that the deeply supervised image fusion design of IFNet
TP + FP + FN
TP + TN yields superior precision benefits. Among other indicators,
OA = (19) DA-IDANet maintains certain advantages. Similarly, on the
TP + TN + FP + FN WHU-CD dataset, we achieved improvements of 1.01% in
OA − P
Kappa = (20) recall, 0.85% in F1 score, and 1.28% in IOU compared to
1−P MSCANet. These numerical results provide clear evidence
(TP + FP)(TP + FN)+(FN + TN)(TP + TN)
P= that the DAC and IDAM contribute meaningfully to CD tasks.
(TP + FP + TN + FN)2 Visualization results for each model on different datasets are
(21) presented in Figs. 3–6. These figures showcase the accuracy
where TP, TN, FP, and FN represent the number of of different models in detecting changing areas using different
true positives, true negatives, false positives, and false colors to represent true positives (TP, white), true negatives
negatives, respectively. In kappa, P represents the hypo- (TN, black), false positives (FP, red), and false negatives (FN,
thetical probability that the reference is consistent with the green).
1) Visualization on LEVIR-CD: Fig. 3(a) and (d) depicts
prediction.
the relatively scattered small buildings, while Fig. 3(b) and (c)
showcases the areas possibly affected by trees, with blocked
C. Performance Comparison and Result Analysis large and small buildings. Fig. 3(e) and (f) presents the
The quantitative comparison of CD indicators is presented examples of house building renovations. In the detection
in Tables I and II. The DA-IDANet demonstrates significant of large buildings, the DA-IDANet consistently exhibits
advantages over other SOTA models through five indicators. more accurate predictions of complete change maps. This
For instance, on the CDD dataset, compared to MSCANet, demonstrates the effectiveness of the IDAM in addressing
DA-IDANet exhibited improvements of 2.62% in Recall, problems, such as boundary blurring, missed detections,
1.77% in F1 score, and 1.82% in IOU. Compared to FTN, and false detections that may occur during the detection
which boasts superior precision owing to its progressive processing. Importantly, our proposed model demonstrates
attention mechanism designed based on Swin transformer, DA- robust performance in detecting buildings of various sizes,
IDANet exhibits a 2.57% decrease in precision indicators. even in the presence of tree occlusion or light interference. The

TABLE II
C OMPARISON R ESULTS ON THE CLCD-CD AND WHU-CD DATASET

Fig. 3. Visual results of different models on test samples of LEVIR-CD. (a)–(f) Six representative samples. Different colors are used to facilitate a better
observation. White represents a TP, black is a TN, red indicates an FP, and green stands as an FN.

visual comparison in Fig. 3, combined with the quantitative influenced by seasonal changes or severe weather conditions
analysis of the evaluation metrics in Table I, underscores during CD tasks, such as occlusion by lawns or snow cover.
the leading performance of our proposed DA-IDANet on the Other methods might miss or incorrectly identify roads or
LEVIR-CD dataset. buildings, and edge detection for medium and large buildings
2) Visualization on CDD: These selected images depict may lack clarity. However, with the incorporation of DACs,
changes in ground objects or buildings corresponding to the features extracted from dual-temporal images can be
seasonal variations. Fig. 4(a), (b), (d), and (f) illustrates the aligned to a common feature space. This feature alignment
changes in buildings or roads occurring with the shifting effectively mitigates the interference of pseudovariations
seasons. Fig. 4(c) and (e) showcases different types of ground arising from differences in feature distributions. Consequently,
objects and the presence of small vehicles captured after the DA-IDANet demonstrates a more accurate detection of
seasonal changes. It is noteworthy that the model may be changing areas.

Fig. 4. Visual results of different models on test samples of CDD. (a)–(f) Six representative samples. Different colors are used to facilitate a better observation.
White represents a TP, black is a TN, red indicates an FP, and green stands as an FN.

Fig. 5. Visual results of different models on test samples of CLCD. (a)–(f) Six representative samples. Different colors are used to facilitate a better
observation. White represents a TP, black is a TN, red indicates an FP, and green stands as an FN.

3) Visualization on CLCD: The CLCD dataset 4) Visualization on WHU-CD: Fig. 6(a), (e), and (f) depicts
encompasses a variety of changes in farmland and other the areas with relatively dense or small housing buildings,
land features. Fig. 5(a)–(f) illustrates the changes in while Fig. 6(b)–(d) showcases the instances of greenhouse
buildings on farmland, modifications to roads, changes in construction and asphalt-repainted roads. Examining several
farmland, and alterations in rivers, emphasizing the model’s pairs of representative cases visually, where architectural
crucial accuracy in detecting these changes. Accurately change areas are small or dense, poses a challenge for CD
identifying changing areas in complex farmland and other tasks. Our DA-IDANet demonstrates a lower number of FPs
land features poses a significant challenge. Based on both and FNs compared to other models, achieving more accurate
visual renderings and the insights from Table II, our and efficient change maps.
proposed model demonstrates superior accuracy in detecting
changes in such intricate environments compared to other D. Ablation Analysis
models, including the ability to discern more complete Ablation experiments were conducted to assess the
boundaries. effectiveness and rationality of each proposed module by

Fig. 6. Visual results of different models on the test samples of WHU-CD. (a)–(f) Six representative samples. Different colors are used to facilitate a better
observation. White represents a TP, black is a TN, red indicates an FP, and green stands as an FN.

TABLE III
A BLATION R ESULTS (%) ON CD DATASET. T HE S YMBOL “ X ” I NDICATES T HAT THE C ORRESPONDING M ODULE WAS
R EMOVED OR O PERATION WAS N OT P ERFORMED . T HE B EST R ESULTS A RE H IGHLIGHTED IN B OLD

swapping their positions or removing them from the CD than after the feature extraction network. According to the
network. Table III provides the experimental results on four result in Table III, the F1 scores of the modified model
datasets. on LEVIR-CD, CDD, WHU-CD, and CLCD are 80.80%,
1) DACs: To assess the impact of DACs on model 88.33%, 89.34%, and 72.51%, respectively. It is evident that
performance, we conducted experiments by removing the changing the position of the DACs, compared with the original
DACs. When DACs are removed, the F1 scores of the model solution, caused a drop in accuracy. This indicates that higher
dropped significantly, with 81.29% on LEVIR-CD, 88.91% benefits can be obtained by conducting feature alignment after
on CDD, 89.92% on WHU-CD, and 73.52% on CLCD, the CNN extracts multiscale features.
respectively. It can fully illustrate that DACs play a certain 2) IDAM: To precisely assess the overall impact of the
role in the feature alignment of the model. IDAM on model performance, we removed the IDAM. The
In addition, to evaluate the reasonableness of the location result in Table III reveals that the F1 scores of the model
of the domain adaptation constraints, we placed the constraint on the LEVIR-CD, CDD, WHU-CD, and CLCD are 80.97%,
after the transformer aggregates the context information rather 88.27%, 89.84%, and 72.79%, respectively. After removing the

Fig. 7. (a) Training loss changes with increasing training epoch. (b) Validation F1 changes with increasing training epoch.

IDAM, the attention to the interactive differential discriminant TABLE IV

information between the two images would be significantly E FFICIENCY C OMPARISON A NALYSIS W ITH D IFFERENT I NPUT S IZES
reduced, thereby affecting the performance of the model.
Additionally, we compared the difference in performance
of the IDAM at different locations. We attempted to position
the IDAM before the transformer aggregates contextual
information, but the effectiveness was not as significant as the
original design solution. It can be observed that the F1 scores
of the model on the LEVIR-CD, CDD, WHU-CD, and CLCD
are 81.17%, 88.85%, 89.75%, and 72.76%, respectively. The
rate is lower than the original scheme by 0.52%, 1.02%,
0.75%, and 2.22%. From this, it can be concluded that utilizing
the IDAM after aggregating contextual information through
transformer helps the model better identify the changed areas.
3) Multiscale features: To verify the influence of the
introduced multiscale features, we removed the features of
two scale. The result in Table III indicates that removing the
features of two scale caused a decrease in F1 scores on the
LEVIR-CD, CDD, WHU-CD, and CLCD by 2.66%, 2.19%,
0.83%, and 7.3%, respectively. Additionally, compared to the
original solution, the IOU fell by 3.49%, 2.5%, 1.31%, and
6.41%, respectively. Smaller scale features aid in detecting
small changes in ground features, while larger scales are
effective in capturing changes in the overall ground feature
structure. Combining multiscale information allows for a more 256 × 256 pixels and 512 × 512 pixels. As shown in
comprehensive and accurate representation of surface changes. Table IV, FC-EF, FC-Siam-Conc, and FC-Siam-Di have lower
Simultaneously, we conducted a comparison of the variation FLOPs and Params compared to other models at different input
in loss during the training process and F1 scores during image sizes. This is attributed to their basis on traditional
the validation process, as illustrated in Fig. 7. The loss of CNN structures without the incorporation of complex network
complete model can converge faster and which has more stable architectures, resulting in lower computational and space
F1 scores. This suggests that DA-IDANet exhibits a better complexity. IFNet and DASNet exhibit higher FLOPs and
performance of generalization. Params compared to other models. This is attributed to their
relatively complex attention calculation mechanism, leading
E. Model Efficiency Assessment to a larger number of network parameters. The proposed DA-
To compare the operational efficiency of various models, IDANet has 15.45G FLOPs and 16.45M Params when the
we evaluate the computational complexity and space com- input image size is 256 pixels. This indicates that our proposed
plexity using FLOPs and Params for input image sizes of model can maintain excellent CD performance while also

Fig. 8. Network visualization taking images from LEVIR-CD as examples.

Fig. 9. Network visualization results of different models taking images from LEVIR-CD, CDD, CLCD, and WHU-CD as examples.

having relatively lower model computational complexity and For large agricultural lands or dense buildings, there may be
space complexity compared to other models. external factors that interfere with the level of concern for
the model to judge the changed areas. However, the changed
areas that DA-IDANet focuses on are the closest to the actual
F. Neural Network Visualization labeled region while other approaches tend to overlook some
To explore the impact of each module in DA-IDANet significant information. It demonstrates superior accuracy and
during the processing, we specifically chose a pair of image effectiveness of DA-IDANet in CD tasks.
instances from the LEVIR-CD. The visualization of each
stage of the neural network is presented in Fig. 8. Initially, V. C ONCLUSION
a pair of dual-temporal images is selected as input, and For the CD tasks using RS images, this article proposes a
three different scale features are extracted by ResNet-18 CNN-transformer model incorporating DACs and an IDAM.
in a Siamese structure. Subsequently, the extracted features Specifically, we first extract three different scale features
undergo aggregation with contextual information through the through CNN and utilize DACs for feature alignment to reduce
encoding and decoding stages of the transformer. We introduce the distribution difference between input images. Then, the
the IDAM to guide the model to focus more on relevant transformer performs the encoding and decoding stage to
changing areas while suppressing irrelevant interference. aggregate context information. The proposed IDAM efficiently
The visual results in the figure clearly demonstrate that utilizes the interactive differential discriminant information
the IDAM enhances the model’s attention toward changed of input images, dynamically minimizing interference from
areas, assigning lower attention weights to unchanged areas. irrelevant information. We performed extensive experiments
Ultimately, the convolutional operation on the obtained change on four datasets: LEVIR-CD, CDD, CLCD, as well as WHU-
probability map reveals more accurate change localization CD. The results indicate that our approach achieves SOTA
and fine boundaries. Additionally, four representative image performance concerning accuracy and effectiveness compared
instances are selected from the WHU-CD, CDD, and CLCD to other approaches. While our proposed model is specifically
datasets to visualize the final stage’s predictions in Fig. 9. designed for CD tasks, some modules within our approach
Through comparisons with other SOTA approaches, the may have potential applications in other domains. Future
proposed model excels in achieving more complete boundary research endeavors may explore the adaptation of the proposed
detection while minimizing attention to unchanged areas. model to heterogeneous images.

R EFERENCES [21] Z. Du, X. Li, J. Miao, Y. Huang, H. Shen, and L. Zhang, “Concatenated
deep-learning framework for multitask change detection of optical and
[1] H. Chen and Z. Shi, “A spatial–temporal attention-based method and a SAR images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens.,
new dataset for remote sensing image change detection,” Remote Sens., vol. 17, pp. 719–731, 2024, doi: 10.1109/JSTARS.2023.3333959.
vol. 12, no. 10, p. 1662, 2020, doi: 10.3390/rs12101662. [22] Y. Huang, X. Li, Z. Du, and H. Shen, “Spatiotemporal enhancement and
[2] P. de Bem, O. de Carvalho Junior, R. F. Guimarães, and R. T. Gomes, interlevel fusion network for remote sensing images change detection,”
“Change detection of deforestation in the Brazilian Amazon using IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5609414, doi:
Landsat data and convolutional neural networks,” Remote Sens., vol. 12, 10.1109/TGRS.2024.3360516.
no. 6, p. 901, Mar. 2020, doi: 10.3390/rs12060901. [23] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, “Multi-scale
[3] J. Z. Xu, W. Lu, Z. Li, P. Khaitan, and V. Zaytseva, “Building damage object detection in remote sensing imagery with convolutional neural
detection in satellite imagery using convolutional neural networks,” networks,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22,
2019, arXiv:1910.06444. Nov. 2018, doi: 10.1016/j.isprsjprs.2018.04.003.
[4] S. Ye, J. Rogan, Z. Zhu, and J. R. Eastman, “A near-real-time approach [24] W. Liu, Y. Lin, W. Liu, Y. Yu, and J. Li, “An attention-based multiscale
for monitoring forest disturbance using Landsat time series: Stochastic transformer network for remote sensing image change detection,” ISPRS
continuous change detection,” Remote Sens. Environ., vol. 252, J. Photogramm. Remote Sens., vol. 202, pp. 599–609, Aug. 2023, doi:
Jan. 2021, Art. no. 112167. 10.1016/j.isprsjprs.2023.07.001.
[5] R. Touati, M. Mignotte, and M. Dahmane, “Multimodal change detection [25] X. Peng, R. Zhong, Z. Li, and Q. Li, “Optical remote sensing image
in remote sensing images using an unsupervised pixel pairwise-based change detection based on attention mechanism and image difference,”
Markov random field model,” IEEE Trans. Image Process., vol. 29, IEEE Trans. Geosci. Remote Sens., vol. 59, no. 9, pp. 7296–7307,
pp. 757–767, 2020, doi: 10.1109/TIP.2019.2933747. Sep. 2021, doi: 10.1109/TGRS.2020.3033009.
[6] L. Ru, B. Du, and C. Wu, “Multi-temporal scene classification and [26] Y. Feng, J. Jiang, H. Xu, and J. Zheng, “Change detection on remote
scene change detection with correlation based fusion,” IEEE Trans. sensing images using dual-branch multilevel intertemporal network,”
Image Process., vol. 30, pp. 1382–1394, 2021, doi: 10.1109/TIP.2020. IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 4401015,
3039328. doi: 10.1109/TGRS.2023.3241257.
[7] E. F. Lambin and A. H. Strahlers, “Change-vector analysis in [27] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
multitemporal space: A tool to detect and categorize land-cover change Process. Syst., vol. 30, 2017, pp. 6000–6010.
processes using high temporal-resolution satellite data,” Remote Sens. [28] B. Wu et al., “Visual transformers: Token-based image representation
Environ., vol. 48, no. 2, pp. 231–244, May 1994. and processing for computer vision,” 2020, arXiv:2006.03677.
[8] B. Du, L. Ru, C. Wu, and L. Zhang, “Unsupervised deep slow feature [29] G. Li, L. Zhu, P. Liu, and Y. Yang, “Entangled transformer for
analysis for change detection in multi-temporal remote sensing images,” image captioning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 9976–9992, Oct. 2019, pp. 8927–8936.
Dec. 2019, doi: 10.1109/TGRS.2019.2930682. [30] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers
[9] Z. Liu, G. Li, G. Mercier, Y. He, and Q. Pan, “Change detection in for image recognition at scale,” 2020, arXiv:2010.11929.
heterogenous remote sensing images via homogeneous pixel transfor- [31] Z. Wang, Y. Zhang, L. Luo, and N. Wang, “TransCD: Scene change
mation,” IEEE Trans. Image Process., vol. 27, no. 4, pp. 1822–1834, detection via transformer-based architecture,” Opt. Exp., vol. 29, no. 25,
Apr. 2018, doi: 10.1109/TIP.2017.2784560. pp. 41409–41427, 2021.
[10] P.-F. Hsieh, L. C. Lee, and N.-Y. Chen, “Effect of spatial resolution on [32] G. Wang, B. Li, T. Zhang, and S. Zhang, “A network combining a
classification errors of pure and mixed pixels in remote sensing,” IEEE transformer and a convolutional neural network for remote sensing
Trans. Geosci. Remote Sens., vol. 39, no. 12, pp. 2657–2663, 2001, doi: image change detection,” Remote Sens., vol. 14, no. 9, p. 2228,
10.1109/36.975000. May 2022, doi: 10.3390/rs14092228.
[11] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised deep [33] C. Zhang, L. Wang, S. Cheng, and Y. Li, “SwinSUNet: Pure
feature extraction for remote sensing image classification,” IEEE Trans. transformer network for remote sensing image change detection,” IEEE
Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016, doi: Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5224713, doi:
10.1109/TGRS.2015.2478379. 10.1109/TGRS.2022.3160007.
[12] Y. Ge, X. Zhang, P. M. Atkinson, A. Stein, and L. Li, “Geoscience- [34] W. G. C. Bandara and V. M. Patel, “A transformer-based Siamese
aware deep learning: A new paradigm for remote sensing,” Sci. Remote network for change detection,” in Proc. IEEE Int. Geosci. Remote Sens.
Sens., vol. 5, Jun. 2022, Art. no. 100047. Symp., Jul. 2022, pp. 207–210.
[13] A. Asokan and J. Anitha, “Change detection techniques for remote [35] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection
sensing applications: A survey,” Earth Sci. Inform., vol. 12, no. 2, with transformers,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2021,
pp. 143–160, Jun. 2019. Art. no. 5607514, doi: 10.1109/TGRS.2021.3095166.
[14] R. Caye Daudt, B. Le Saux, and A. Boulch, “Fully convolutional [36] Z. Li, G. Chen, and T. Zhang, “A CNN-transformer hybrid approach
Siamese networks for change detection,” in Proc. 25th IEEE Int. Conf. for crop classification using multitemporal multisensor images,” IEEE
Image Process. (ICIP), Oct. 2018, pp. 4063–4067. J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 847–858,
[15] R. C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change 2020, doi: 10.1109/JSTARS.2020.2971763.
detection for multispectral Earth observation using convolutional neural [37] Q. He, Q. Yang, and M. Xie, “HCTNet: A hybrid CNN-transformer
networks,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2018, network for breast ultrasound image segmentation,” Comput. Biol. Med.,
pp. 2115–2118. vol. 155, Mar. 2023, Art. no. 106629.
[16] W. Wiratama, J. Lee, S.-E. Park, and D. Sim, “Dual-dense convolution [38] Q. Jia and H. Shu, “BiTr-Unet: A CNN-transformer combined network
network for change detection of high-resolution panchromatic imagery,” for mri brain tumor segmentation,” in Proc. Int. MICCAI Brainlesion
Appl. Sci., vol. 8, no. 10, p. 1785, Oct. 2018. Workshop. Cham, Switzerland: Springer, 2021, pp. 3–14.
[17] Y. Zhan, K. Fu, M. Yan, X. Sun, H. Wang, and X. Qiu, “Change [39] W. Li, L. Xue, X. Wang, and G. Li, “MCTNet: A multi-scale CNN-
detection based on deep Siamese convolutional network for optical transformer network for change detection in optical remote sensing
aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 10, images,” in Proc. 26th Int. Conf. Inf. Fusion (FUSION), Jun. 2023,
pp. 1845–1849, Oct. 2017, doi: 10.1109/LGRS.2017.2738149. pp. 1–5.
[18] M. Chang, X. Meng, W. Sun, G. Yang, and J. Peng, “Col- [40] M. Chen et al., “A full-scale connected CNN–transformer network for
laborative coupled hyperspectral unmixing based subpixel change remote sensing image change detection,” Remote Sens., vol. 15, no. 22,
detection for analyzing coastal wetlands,” IEEE J. Sel. Topics Appl. p. 5383, Nov. 2023, doi: 10.3390/rs15225383.
Earth Observ. Remote Sens., vol. 14, pp. 8208–8224, 2021, doi: [41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
10.1109/JSTARS.2021.3104164. image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
[19] J. Feng et al., “Deep reinforcement learning for semisupervised (CVPR), Jun. 2016, pp. 770–778.
hyperspectral band selection,” IEEE Trans. Geosci. Remote Sens., [42] Q. Ding, Z. Shao, X. Huang, and O. Altan, “DSA-Net: A novel
vol. 60, 2022, Art. no. 5501719, doi: 10.1109/TGRS.2021.3049372. deeply supervised attention-guided network for building change
[20] J. Feng et al., “Dual-graph convolutional network based on band detection in high-resolution remote sensing images,” Int. J. Appl.
attention and sparse constraint for hyperspectral band selection,” Knowl.- Earth Observ. Geoinf., vol. 105, Dec. 2021, Art. no. 102591, doi:
Based Syst., vol. 231, Nov. 2021, Art. no. 107428. 10.1016/j.jag.2021.102591.

[43] C. Zhang et al., “A deeply supervised image fusion network for change [62] T. Yan, Z. Wan, and P. Zhang, “Fully transformer network for change
detection in high resolution bi-temporal remote sensing images,” ISPRS detection of remote sensing images,” in Proc. Asian Conf. Comput. Vis.,
J. Photogramm. Remote Sens., vol. 166, pp. 183–200, Aug. 2020, doi: 2022, pp. 1691–1708.
10.1016/j.isprsjprs.2020.06.003. [63] Q. Ke and P. Zhang, “Hybrid-TransCD: A hybrid transformer remote
[44] X. Li, Z. Du, Y. Huang, and Z. Tan, “A deep translation (GAN) based sensing image change detection network via token aggregation,” ISPRS
change detection network for optical and SAR remote sensing images,” Int. J. Geo-Inf., vol. 11, no. 4, p. 263, Apr. 2022.
ISPRS J. Photogramm. Remote Sens., vol. 179, pp. 14–34, Sep. 2021, [64] F. Song, S. Zhang, T. Lei, Y. Song, and Z. Peng, “MSTDSNet-CD:
doi: 10.1016/j.isprsjprs.2021.07.007. Multiscale Swin transformer and deeply supervised network for change
[45] S. Mei, R. Jiang, M. Ma, and C. Song, “Rotation-invariant feature detection of the fast-growing urban regions,” IEEE Geosci. Remote Sens.
learning via convolutional neural network with cyclic polar coordinates Lett., vol. 19, pp. 1–5, 2022, doi: 10.1109/LGRS.2022.3165885.
convolutional layer,” IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, [65] Y. Feng, H. Xu, J. Jiang, H. Liu, and J. Zheng, “ICIF-Net: Intra-
Art. no. 5600713, doi: 10.1109/TGRS.2022.3233726. scale cross-interaction and inter-scale feature fusion network for
[46] M. Ma, S. Mei, F. Li, Y. Ge, and Q. Du, “Spectral correlation-based bitemporal remote sensing images change detection,” IEEE Trans.
diverse band selection for hyperspectral image classification,” IEEE Geosci. Remote Sens., vol. 60, 2022, Art. no. 4410213, doi:
Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5508013, doi: 10.1109/TGRS.2022.3168331.
10.1109/TGRS.2023.3263580. [66] M. Liu, Z. Chai, H. Deng, and R. Liu, “A CNN-transformer network
[47] S. Fang, K. Li, J. Shao, and Z. Li, “SNUNet-CD: A densely with multiscale context aggregation for fine-grained cropland change
connected Siamese network for change detection of VHR images,” detection,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens.,
IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2021, doi: vol. 15, pp. 4297–4306, 2022, doi: 10.1109/JSTARS.2022.3177235.
10.1109/LGRS.2021.3056416. [67] C. Zhang et al., “A domain adaptation neural network for change
detection with heterogeneous optical and SAR remote sensing images,”
[48] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, “A
Int. J. Appl. Earth Observ. Geoinf., vol. 109, May 2022, Art. no. 102769.
deeply supervised attention metric-based network and an open aerial
image dataset for remote sensing change detection,” IEEE Trans. [68] M. A. Lebedev, Y. V. Vizilter, O. V. Vygolov, V. A. Knyaz,
Geosci. Remote Sens., vol. 60, 2022, Art. no. 5604816, doi: and A. Y. Rubis, “Change detection in remote sensing images using
10.1109/TGRS.2021.3085870. conditional adversarial networks,” Int. Arch. Photogramm., Remote Sens.
Spatial Inf. Sci., vol. 42, pp. 565–571, May 2018.
[49] Z. Lv, H. Huang, L. Gao, J. A. Benediktsson, M. Zhao, and C. Shi,
“Simple multiscale UNet for change detection with heterogeneous [69] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource
remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 19, building extraction from an open aerial and satellite imagery data
pp. 1–5, 2022, doi: 10.1109/LGRS.2022.3173300. set,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 1, pp. 574–586,
Jan. 2018, doi: 10.1109/TGRS.2018.2858817.
[50] D. Peng, Y. Zhang, and H. Guan, “End-to-end change detection for [70] J. Chen et al., “DASNet: Dual attentive fully convolutional Siamese
high resolution satellite images using improved UNet++,” Remote Sens., networks for change detection in high-resolution satellite images,” IEEE
vol. 11, no. 11, p. 1382, 2019, doi: 10.3390/rs11111382. J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 1194–1206,
[51] W. Shuai, F. Jiang, H. Zheng, and J. Li, “MSGATN: A superpixel-based 2020, doi: 10.1109/JSTARS.2020.3037893.
multi-scale Siamese graph attention network for change detection in
remote sensing images,” Appl. Sci., vol. 12, no. 10, p. 5158, May 2022.
[52] H. Jiang, X. Hu, K. Li, J. Zhang, J. Gong, and M. Zhang,
“PGA-SiamNet: Pyramid feature-based attention-guided Siamese net-
work for remote sensing orthoimagery building change detec-
tion,” Remote Sens., vol. 12, no. 3, p. 484, Feb. 2020, doi:
10.3390/rs12030484. Yuliang Ji received the B.S. degree in mathematics
[53] D. Wang, X. Chen, M. Jiang, S. Du, B. Xu, and J. Wang, “ADS- from Huanghuai University, Zhumadian, Chain,
Net: An attention-based deeply supervised network for remote sensing in 2022. He is currently pursuing the M.S degree
image change detection,” Int. J. Appl. Earth Observ. Geoinf., vol. 101, with the School of Mathematics and Statistics,
Sep. 2021, Art. no. 102348. Ningbo University, Ningbo, China.
[54] Y. Liu, C. Pang, Z. Zhan, X. Zhang, and X. Yang, “Building His research interests include remote sensing
change detection for remote sensing images using a dual-task image change detection through deep learning.
constrained deep Siamese convolutional network model,” IEEE Geosci.
Remote Sens. Lett., vol. 18, no. 5, pp. 811–815, May 2021, doi:
10.1109/LGRS.2020.2988032.
[55] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using
shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Oct. 2021, pp. 9992–10002.
[56] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S.
Zagoruyko, “End-to-end object detection with transformers,” in Proc.
Eur. Conf. Comput. Vis. (ECCV), vol. 12346, Glasgow, U.K.: Springer,
Aug. 2020, pp. 213–229. Weiwei Sun (Senior Member, IEEE) received
[57] F. Zhu, C. Sun, C. Wang, and B. Zhu, “A double transformer residual the B.S. degree in surveying and mapping and
super-resolution network for cross-resolution person re-identification,” the Ph.D. degree in cartography and geographic
Egyptian J. Remote Sens. Space Sci., vol. 26, no. 3, pp. 768–776, information engineering from Tongji University,
Dec. 2023. Shanghai, China, in 2007 and 2013, respectively.
From 2011 to 2012, he studied at the Department
[58] X. Chai, F. Shao, Q. Jiang, and H. Ying, “TCCL-Net: Transformer-
of Applied Mathematics, University of Maryland,
convolution collaborative learning network for omnidirectional
College Park, MD, USA, working as a Visiting
image super-resolution,” Knowl.-Based Syst., vol. 274, Aug. 2023,
Scholar with the famous Prof. John Benedetto to
Art. no. 110625.
study on the dimensionality reduction of hyperspec-
[59] S. Zheng et al., “Rethinking semantic segmentation from a sequence- tral image. From 2014 to 2016, he studied at the
to-sequence perspective with transformers,” in Proc. IEEE/CVF Conf. State Key Laboratory for Information Engineering in Surveying, Mapping and
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886. Remote Sensing (LIESMARS), Wuhan University, Wuhan, China, working
[60] J. Lanchantin, T. Wang, V. Ordonez, and Y. Qi, “General multi- as a Post-Doctoral Researcher to study intelligent processing in hyperspectral
label image classification with transformers,” in Proc. IEEE/CVF Conf. imagery. From 2017 to 2018, he worked with the Department of Electrical
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16473–16483. and Computer Engineering, Mississippi State University, Starkville, MS, USA,
[61] S. Mei, C. Song, M. Ma, and F. Xu, “Hyperspectral image also working as a Visiting Scholar in hyperspectral image processing. He is
classification using group-aware hierarchical transformer,” IEEE Trans. currently a Full Professor with Ningbo University, Ningbo, China. He has
Geosci. Remote Sens., vol. 60, 2022, Art. no. 5539014, doi: published more than 80 journal articles, and his research interests include
10.1109/TGRS.2022.3207933. hyperspectral image processing with machine learning.

Yumiao Wang received the B.S. degree in GIS from Yuanzeng Zhan received the B.S. degree in
Anhui Normal University, Wuhu, China, in 2014, geographic information system from Zhejiang Uni-
and the Ph.D. degree in cartography and GIS from versity, Hangzhou, China, in 2008, and the M.S.
Wuhan University, Wuhan, China, in 2021. degree in ocean remote sensing from the Institute
He is currently an Assistant Researcher with of Oceanography of State Oceanic Administration,
Ningbo University, Ningbo, China. His research Hangzhou, in 2011.
interests focus on agricultural remote sensing and He is currently the Deputy Director of the Institute
machine learning. of Satellite Remote Sensing (ISRS), Hangzhou. His
research interests include satellite remote sensing
monitoring.

Zhiyong Lv (Senior Member, IEEE) received

the M.S. and Ph.D. degrees from the School
of Remote Sensing and Information Engineering,
Wuhan University, Wuhan, China, in 2008 and 2014,
respectively.
He was an Engineer in surveying and worked with
the First Institute of Photogrammetry and Remote
Sensing, Xi’an, China, from 2008 to 2011. He is
currently working with the School of Computer
Science and Engineering, Xi’an University of
Technology, Xi’an. His research interests include
multihyperspectral and high-resolution remotely sensed image processing,
spatial feature extraction, neural networks, pattern recognition, deep learning,
and remote-sensing applications.

Gang Yang received the M.S. degree in geo-

graphical information systems from the Hunan
University of Science and Technology, Xiangtan, Chong Li received the B.S. degree in mathematics
China, in 2012, and the Ph.D. degree from the from Xinyang University, Xinyang, China, in 2018,
School of Resource and Environmental Sciences, and the M.S. degree in mathematics from Henan
Wuhan University, Wuhan, China, in 2016. University, Kaifeng, China, in 2021. She is currently
He is currently an Associate Professor with pursuing the Ph.D. degree with the Faculty of
Ningbo University, Ningbo, China. His research Mathematics and Statistics, Ningbo University,
interests include missing information reconstruction Ningbo, China.
of remote sensing images, cloud removal of remote Her research interests include hyperspectral image
sensing images, and remote sensing time-series processing with transfer learning.
products temporal reconstruction.

Authorized licensed use limited to: Ningbo University. Downloaded on April 23,2024 at 02:01:39 UTC from IEEE Xplore. Restrictions apply.