TransCD: scene change detection via transformer-based architecture

Zhixue Wang; Yu Zhang; Lin Luo; Nan Wang

doi:10.1364/OE.440720

1. Introduction

Scene change detection (SCD) is a task of detecting interesting changes in a scene using a pair of images taken at different times. The interesting changes mainly lie in some structural changes, such as moving vehicles, building demolition, and traffic signs. Due to camera motion or environment variation, bi-temporal images would exhibit very large variability that is induced by noisy changes, such as viewpoint, outdoor conditions and dynamic changes. Hence, the core idea of a SCD algorithm is that it is able to detect interesting changes while rejecting noisy ones. In general, SCD assigns a binary change map to each pair of bi-temporal images, and the change map denotes the label (i.e., change or no change) of every pair of pixels [1–8]. SCD has applications in a lot of different disciplines, such as driver assistance systems [1,2], video surveillance [3–5,9], foreground segmentation [10–13], visualization of damages of a natural disaster and maintaining/updating the 3D model of a city [14–16] and so on.

In SCD, the critical issue is how to identify changed and unchanged areas. Meanwhile, since changes include semantic changes of interest and noisy changes [17], SCD algorithms must be able to distinguish the semantic changes from the noisy changes. In detail, semantic changes mainly include buildings, traffic signs and other road-side structures being constantly added or removed, and the moved vehicles or pedestrians; noisy changes which sometimes make SCD task more difficult are divided into radiometric changes (illumination intensity variation, shadow), geometric changes (viewpoint differences caused by camera motion), and sensor noise [17,18]. As shown in Fig. 1, (a) and (f) denote radiometric changes. In (a), the bad weather affects the identification of semantic changes, while in (f), the shadow of the vehicle belongs to noisy changes. Images (b), (c) and (d) are geomantic changes. Due to camera motion and jitter, the bi-temporal images are unregistered and even have a large viewpoint, which has a great influence on the image difference-based SCD methods. The reason is that a large viewpoint will result in many noisy changes which caused by the spatial position difference of the pixel pairs. Due to the zoom out of the camera, the change object in (e) has different scales. Hence, to precisely detect these changes from such an image pair, the core idea lies in whether a SCD algorithm is able to overcome noisy changes.

Fig. 1. Challenging examples from CDNet-2014 dataset. From up to down, the first, second, and third row denote the image at ${T_0}$, the image at ${T_1}$, and the ground truth. We list some semantic changes with typical noisy changes. a. Bad weather; b. Large viewpoint; c. Small viewpoint; d. Camera Jitter; e. Zoom Out; f. Shadow.

Download Full Size | PDF

Recently, inspired by the high precision that Convolutional Neural Networks (CNNs) have achieved in the dense prediction task, many CNNs-based approaches have been proposed in SCD [5–11,19–21]. Alcantarilla et al. [1] proposed a system for performing structural change detection in street-view videos captured. This approach meets the need for more frequent and efficient updates in the large-scale maps used in autonomous vehicle navigation. Sakurada et al. [14] proposed a novel change detection method that uses features of CNNs in combination with super-pixel segmentation. It indicated that the comparison of CNN features gives a low-resolution map of scene changes and is robust to illumination changes and viewpoint differences. Lei et al. [22] presented a novel Hierarchical Paired Channel Fusion Network (HPCFNet). By hierarchically combining the explored fusion of channel pairs at multiple feature levels, this network can improve the accuracy of corresponding change maps. Santana et al. [23] proposed a novel Siamese U-Nets to address SCD task, such that the model learns to perform semantic segmentation using background reference frames only. Therefore, any object that comes up into the scene defines a change. However, this method is not robust enough to images with a large viewpoint. There are various studies that have been done to tackle the problem of noisy changes in SCD. Guo et al. [17] proposed a fully convolutional Siamese metric network (CosimNet) based on Deeplabv2. To overcome noisy changes, they introduced a thresholded contrastive loss (TCL) loss function, which used L2 distance to measure the similarity of paired features. Sakurada et al. [24] mentioned that not only the spatial relation between the two images should be considered, but also the temporal relations. Only by modeling both of relations, changes can be accurately detected. Another way to overcome noisy changes is to divide SCD task into change detection and semantic extraction. Reference [25] proposed a novel semantic change detection scheme with only weak supervision. In spite of their exceptional representational power, CNN-based approaches generally exhibit limitations for modeling explicit long-range relations, due to the intrinsic locality of convolution operations. Chen et al. [26] presented an attention module named Temporal Attention Module (TAM). In addition, based on TAM, they introduced a more efficient and light-weight version named Dynamic Receptive Temporal Attention Module (DRTAM) and proposed the Concurrent Horizontal and Vertical Attention (CHVA) to improve the accuracy of the network on specific challenging entities.

In the above methods, if the extracted features already contain rich semantic information (i.e., changed objects and unchanged objects), using the feature difference will get good change information. As shown in Fig. 1(a), changes can be identified by only making difference between two image features. However, for the other pair of examples, the identification of change objects should be under the premise of considering global semantic context. For example, in (b) and (e), a lot of structural noisy changes will be obtained if simply making difference between two images. In (f), both the car and its shadow will be detected as changes. Therefore, for the detection of these challenges, the global semantic information of the input images should be fully combined to make the judgment [24].

In order to successfully differentiate between semantic changes and noise changes, the detection method must be capable of modelling both interesting changes and noisy changes. In this paper, we designed a SCD model based on Vision Transformer (ViT). Using the Transformer to focus on modeling the global context at all stages, this model can better establish global relationships between feature representation. The modeling of global semantic relationships of images benefits to identify interesting changes and noisy changes. Unlike the method in [17], the proposed method does not need to analyze the change information of the extracted features, but directly models the change information into the global features. This way is conducive to the model to identify change information through global information and have a stronger robustness.

In this paper, we explore the potential of Transformer in SCD task. As shown in Fig. 2, the proposed TransCD consists of three components:

Fig. 2. Illustration of the proposed TransCD. ${I_0}$ and ${I_1}$ denote input images acquired at different times. TransCD uses a conventional CNN backbone to learn a feature representation of the input images. Then the model flattens the extracted features and adds a positional embedding to them before inputting it into the Siamese ViT. Finally, the context-rich tokens obtained by the Siamese ViT are fed into the prediction head to produce change maps.

Download Full Size | PDF

CNN backbone: It is able to model images from low level spatial dimension into high level feature space and generate embedded semantic tokens.

Siamese vision transformer (SViT): SViT accepts embedded semantic tokens as input, and outputs context-rich tokens after modeling the semantic information and establishing long-range relationships between each token.

Prediction head: Prediction head calculates the feature difference maps of the two context-rich tokens, and then maps the high-level feature representations back to pixel space through a light CNN decoder to generate change map.

The contribution can be summarized as follows:

1. We explore the potential of Vision Transformer in SCD task and design a Vision Transformer-based SCD model named TransCD.
2. From the idea that Transformer is able to model global semantic relations, we design an SViT to capture long-range context for image semantics and generate high-level semantic information, which both of them benefit to change detection. Meanwhile, it is robust to viewpoint difference, illumination, and other challenging noisy changes.
3. Compared with the selected baselines, the proposed method achieves 0.9298, 0.9427, and 0.9361 in terms of the precision, recall, and F1 score, on the CDnet-2014 dataset, respectively.

The rest of this paper is organized as follows: Section 2 explores other works that were used for inspiration or comparison during the development of this work. The proposed method is described in Section 2. Section 4 shows the experimental results and a comparison with other state-of-the-art methods. Finally, the conclusion of this paper is drawn in Section 5.

2. Method

Transformer was first proposed in the field of Natural Language Proposing (NLP) to solve sequence-to-sequence tasks [27]. After achieving state-of-the -arts in many NLP tasks, it has inspired the vision community to study its applications in computer vision (CV) since it enables modeling long-range dependency within images. The first application on CV is Vision Transformer (ViT) [28], which showed a pure Transformer-based image classification network. After that, various Transformer-based models are proposed, including object detection [29–32], image classification [33,34], segmentation [35,36], tracking [37,38], pose recognition [39], and image generation [40,41] and so on.

Some of the most recent research has focused on the characteristics of Transformer itself. For example, Yinjie et al. [22] studied the robustness of ViT to adversarial examples; In order to validate the robustness of Transformer, Srinadh et al. [42] performed an extensive study of a variety of different measures and compared the findings to ResNet baselines; To visualize the class activation map of Transformer, Chefer et al. [43] proposed a novel way to compute relevancy for Transformer networks; In [44,45], they proposed a standard ViT backbone to make a good trade-off between performance and efficiency.

Due to the strong representation and modeling long-range context ability of Transformer, we explore the potential of it in SCD task. Our motivation is that thanks to the ability of modeling global semantic relations, the model incorporated with Transformer is able to extract global semantic information from images, which is more conducive to the model to identify interesting changes from noisy changes. Note that the proposed SViT is inspired by the work in [28,29].

2.1 Proposed change detection model

Given two images ${I_0}$, ${I_1} \in {{\mathbb R}^{{H_I} \times {W_I} \times {C_I}}}$ acquired at different times ${t_0}$, and ${t_1}$, ${H_I} \times {W_I}$ denotes the spatial resolution and ${C_I} = 3$ denotes the number of channels. The purpose of SCD is to predict a corresponding pixel-wise change map $\widehat p$ with the size of ${H_I} \times {W_I}$. The most common way is to directly train a Siamese network to map images into high-level feature representations, which are then used to analyze change information. Unlike existing approaches, the proposed method introduces how to incorporate Transformer into the procedure of detecting changes.

The overview of our Transformer-based change detection architecture (TransCD) is depicted in Fig. 2. TransCD consists of three components: the CNN backbone, the Siamese Vision Transformer (SViT), and the prediction head. The CNN backbone extracts input image features to produce patch embedding and is formulated as a Siamese network with shared weights. The SViT is able to establish global relations and capture long-range dependency over embedded tokens, and output context-rich tokens. By accepting context-rich tokens as input, the prediction head will produce a binary change map according to a light decoder network.

Patch embedding. Let ${F_0}$ and ${F_1}$ denote the features of ${I_0}$ and ${I_1}$ generated by the CNN backbone, respectively. ${F_{raw}}$ denotes the raw image features which need to be tokenized. This procedure can be defined as follows.

(1)$${F_{raw}} = {\textrm{CNN}} (I),\textrm{ }I = \{{{I_0},{I_1}} \}.$$

Note that ${F_{raw}}$ has the size of $H \times W \times C$, where H, W denote the spatial size and C denotes the channel dim of the features. After that, ${F_{raw}}$ is inputted into a tokenizer to produce semantic tokens. As shown in Fig. 3, the tokenizer can accept the features of the bi-temporal images extracted by the CNN backbone as input, and output semantic tokens successively using a convolution layer, flatten operation, and transpose operation. It worth noting that the tokenizer can also input bi-temporal images as input and output semantic tokens if CNN backbone is not used to extract features. Then, a learnable position embedding ${E_{pos}} \in {{\mathbb R}^{L \times {C_h}}}$ is added to the semantic tokens to retain positional information. Finally, we obtain token sequences ${T_{raw}} \in {{\mathbb R}^{L \times {C_h}}}$, where $L\textrm{ = }H \times W$ denotes the number of tokens.

(2)$${T_{raw}}\textrm{ = Tokenizer} ({F_{raw}}) + {E_{pos}},$$

where ‘Tokenizer’ denotes operations of convolution, flatten, and transpose. All operations including features embedding and tokenization are called the patch embedding E. In short,

(3)$${T_{raw}} = [{\textrm{E} (x_p^1);\textrm{E} (x_p^2);\textrm{E} (x_p^3);\ldots ;\textrm{E} (x_p^L)} ]+ {\textrm{E}_{pos}},$$

where $E(x_p^i)$ denotes the ith pair of image patches $x_p^i$ is mapped into a latent embedding space using the patch embedding E. Hence, $x_p^i$ has the spatial size of $({H_I}/H,{W_I}/W)$.

Fig. 3. Illustration of the tokenizer. The tokenizer accepts the image I or the output features ${F_{raw}}$ of the image I extracted by the CNN backbone as input and output semantic tokens ${T_{raw}}$. Where the grid size is $H \times W$.

Download Full Size | PDF

Refining tokens. The embedded sequence tokens ${T_{raw}}$ need to be inputted into SViT to obtain context-rich tokens ${T_{new}} \in {{\mathbb R}^{L \times {C_h}}}$.

(4)$${T_{new}} = \textrm{SViT} ({T_{raw}}),\textrm{ }{T_{raw}} = \{{{T_{raw0}},{T_{raw1}}} \}.$$

Predicting change map. Now, we obtain two pairs of context-rich tokens corresponding to the input bi-temporal images. They will be inputted into the prediction head ${E_{ph}}$ to produce the change map $\widehat p \in {{\mathbb R}^{{H_I} \times {W_I}}}$. First, we use the transpose and reshape operation to restore the context-rich tokens to feature maps, and then we get feature difference maps (FDM) through an element-wise absolute of the subtraction of the two feature maps. This procedure can be formulated as follows,

(5)$$\widehat p = {\textrm{E} _{ph}}(|{\textrm{reshape} (\textrm{transpose} ({T_{new0}})) - \textrm{reshape} (\textrm{transpose} ({T_{new1}}))} |),$$

where ${T_{new0}}$ and ${T_{new1}}$ denote the refined tokens with rich contexts corresponding to the bi-temporal images ${I_0}$ and ${I_1}$, respectively.

Loss function. In the training phase, we minimize the ${L_1}$ loss to optimize model parameters. In general, the loss function is formulated as:

(6)$$L = {||{g - \widehat p} ||_1},$$

where $g \in {{\mathbb R}^{{H_I} \times {W_I}}}$ is the ground truth, and ${||\cdot ||_1}$ denotes the ${L_1}$ distance.

2.2 Siamese vision transformer

The proposed SViT is formed by two weight-shared ViT. Our ViT consists of ${N_E}$ layers of encoders and ${N_D}$ layers of decoders, the details are shown in Fig. 4. The encoder follows the standard architecture of the Transformer, which consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP) block. Layer-normalization (LN) is applied before every block. Residual connection is applied after every block. It is worth noting that the MLP contains two layers with a Gaussian error linear units (GELU) activation [28]. The encoder inputs ${T_{raw}}$ and output ${T_{guide}}$ to facilitate the context modeling of the decoder.

Fig. 4. Illustration of the proposed ViT. This architecture shows the sub-network in SViT, and SViT consists of two ViTs with shared weights. In the proposed ViT, the encoder accepts ${T_{raw}}$ as input and output ${T_{guide}}$, the decoder accepts ${T_{guide}}$ and ${T_{raw}}$ as inputs and produce refined tokens ${T_{new}}$. The motivation is that ${T_{guide}}$ is able to facilitate the encoder to obtain context-rich tokens.

Download Full Size | PDF

At each layer $\ell $, The procedure of the encoder can be described as follows:

(7)$${z_0} = [{\textrm{E} (x_p^1);\textrm{E} (x_p^2);\textrm{E} (x_p^3);\ldots ;\textrm{E} (x_p^L)} ]+ {\textrm{E}_{pos}},$$

(8)$${z^{\prime}}_\ell = \textrm{MSA} (\textrm{LN} ({z_{\ell - 1}})) + {z_{\ell - 1}},\textrm{ }\ell = 1 \cdot{\cdot} \cdot {N_E},$$

(9)$${z_\ell } = \textrm{MLP} (\textrm{LN} ({z^{\prime}}_\ell )) + {z^{\prime}}_\ell ,\textrm{ }\ell = 1 \cdot{\cdot} \cdot {N_E},$$

(10)$${T_{guide}} = LN ({z_{{N_E}}}).$$

The decoder has a similar architecture to the encoder. The only difference is the Multi-head attention (MA) block. In the proposed decoder, it accepts ${T_{raw}}$ and ${T_{guide}}$ as inputs and outputs refined tokens ${T_{new}}$. Here, ${T_{raw}}$ denotes change queries, and ${T_{guide}}$ denotes compact high-level semantic information that well reveals the change of interest. Our motivation is that ${T_{guide}}$ is able to facilitate the global relationship establishment of tokens ${T_{raw}}$ and produce refined context-rich tokens. The procedure of the decoder can be formulated as follows.

(11)$${y_0} = ({{T_{guide}},{T_{raw}}} ),$$

(12)$${y^{\prime}}_\ell = \textrm{MAT}({_{guide}},\textrm{LN} ({y_{\ell - 1}})) + {y_{\ell - 1}},\textrm{ }\ell = 1 \cdot{\cdot} \cdot {N_D},$$

(13)$${y_\ell } = \textrm{MLP} (\textrm{LN} ({y^{\prime}}_\ell )) + {y^{\prime}}_\ell ,\textrm{ }\ell = 1 \cdot{\cdot} \cdot {N_D},$$

(14)$${T_{new}} = \textrm{LN} ({y_{{N_D}}}).$$

3. Experiment implementation details

3.1 Datasets

CDNet-2014 dataset: CDNet-2014 dataset [46,47] consists of 31 videos depicting indoor and outdoor scenes with boats, vehicles, pedestrians, traffic signs and other road-side structures. The videos have been obtained with different cameras ranging from low-resolution IP (internet protocol) cameras, through mid-resolution camcorders and PTZ (Pan/Tilt/Zoom) cameras, to far- and near-infrared cameras. As a consequence, spatial resolutions of the videos vary from 320×240 to 720×576. The dataset is divided into 11 categories, including ‘Bad Weather’, ‘Baseline’, ‘Camera Jitter’, ‘Dynamic Background’, ‘Intermittent Object Motion’, ‘Low Framerate’, ‘Night Videos’, ‘PTZ’, ‘Shadow’, ‘Thermal’, and ‘Turbulence’. The training and validation set contains 73276 and 18319 pairs of images, respectively. In this paper, to the other researchers’ operation, all images were scaled to 512 × 512.

3.2 Optimization and evaluation

In the experiment, all networks were trained with the Adam algorithm with the learning rate 2e-4. All experiments were implemented using Pytorch 1.2.0 and with two Nvidia RTX2080Ti GPUs of 11G memory and Intel i9-9900X CPU. More details can be found in our code source.

In order to evaluate the performance of the proposed method, we use three evaluation measures: precision (Pre), recall (Re), and F1 score (F1).

To validate the effectiveness of TransCD, we set several models for comparison. The detailed configurations of the following models are listed in Table 1, where the second and the third columns denote the layer number of the encoder and decoder, the fourth column indicates whether CNN backbone (ResNet18 is used in this paper) is used to extract features in tokenizer, and grid size shows the spatial size of embedded features that used to produce tokens.

Table 1. Configurations of the proposed models

View Table | View all tables in this article

BaseCD: a compact change detection model based on feature difference that consists of a CNN backbone (ResNet18) and the prediction head.

TransCD(SViT-E1-D1-16): our lightest model, a light SViT-based model which consists of a one-layer encoder and a one-layer decoder, the embedding grid size is 16×16.

TransCD(SViT-E1-D1-32): a light SViT-based model which consists of a one-layer encoder and a one-layer decoder, the embedding grid size is 32×32.

TransCD(SViT-E4-D4-16): a deep SViT-based model which consists of a four-layer encoder and a four-layer decoder, the embedding grid size is 16×16.

TransCD(SViT-E4-D4-32): a deep SViT-based model which consists of a four-layer encoder and a four-layer decoder, the embedding grid size is 32×32.

To further evaluate the efficiency of the proposed method, we set two additional models which use ResNet18 as the CNN backbone.

TransCD(Res-SViT-E1-D1-16), TransCD(Res-SViT-E1-D1-32) and TransCD(Res-SViT-E4-D4-16).

4. Experiment results

4.1 Comparison with the popular methods

In this section, to evaluate the performance of the proposed method, we compared it with several popular methods, CosimNet [17], HPCFNet [22], SuBSENSE (Self-Balanced Sensitivity Segmenter) [6], IUTIS-3 (In Unity There Is Strength 3) [7], and Cascade CNN [20].

CosimNet [17]: From the idea that detecting changes can be formulated as comparing dissimilarities between a pair of image features, this method incorporated the thresholded contrastive loss into a Siamese network based on Deeplabv3. Their experiments demonstrated this model is robust to a lot of challenging conditions, such as illumination changes, large viewpoint changes caused by camera motion and zooming.

HPCFNet [22]: The method proposed a novel Hierarchical Paired Channel Fusion Network (HPCFNet), which utilized the adaptive fusion of paired feature channels. The motivation is that an effective feature fusion method can improve the accuracy of the corresponding change maps.

SuBSENSE [6]: SuBSENSE is a universal pixel-level segmentation method that relies on spatiotemporal binary features as well as color information to detect changes. This method allowed changed foreground objects to be detected more easily while most illumination variations are ignored.

IUTIS-3[7]: This paper investigated how state-of-the-art SCD algorithms can be combined and used to create a more robust algorithm using Genetic Programming (GP) to automatically select the best algorithms.

Cascaded CNN [20]: Note that this method is used to solve the Moving Object Segmentation (MOS) task instead of SCD. On the task of MOS, researchers mainly contribute to segmenting the moving foreground object (changes) for each frame in video sequence.

The experimental results on CDNet-2014 are shown in Table 2. We can observe that all SViT-based models have achieved good performance, and have improved significantly when compared with the popular SCD models. For example, compared with CosimNet and HPCFNet, TransCD(SViT-E4-D4-32) has improvements of 7.70% and 7.31% in terms of the F1 score, respectively. Even SViT-E1-D1-16 increased by 5.27% and 4.88%, respectively. Compared with the MOS method, except TransCD(SViT-E1-D1-16), other SViT-based models outperform Cascade CNN. For example, TransCD(SViT-E1-D1-32) improved by 1.26%, and TransCD(SViT-E4-D4-32) improved by 1.52% in terms of the F1 score. When you compared BaseCD to SViT-based models, it is very clear that SViT-based models significantly outperform BaseCD, even TransCD(SViT-E1-D1-16) has an improvement of 11.68%. It indicates that SViT performs an important role in TransCD and indeed improves the performance. It is worth noting that TransCD(SViT-E1-D1-32) has fewer parameters. For example, the parameters of CosimNet are 6.7 times as much as, but TransCD(SViT-E1-D1-32) yielded a 7.44% improvement than that of CosimNet. It indicates TransCD is more efficient and effective than purely convolutional models.As shown in Fig. 5, we illustrated the F1 curve of each category on our lightest model TransCD(SViT-E1-D1-16) in the validation phase. For some easy categories, such as bad weather, baseline and camera jitter, it can be observed that TransCD(SViT-E1-D1-16) achieved high performance at early training. For some challenging categories,such as PTZ and night videos, as the training goes on, the F1 score gradually increases. Hence, the average-F1 curve increases sharply at the beginning and slowly at the end of the training.

Fig. 5. The F1 score of the proposed TransCD(SViT-E1-D1-16) in validation phase.

Download Full Size | PDF

Table 2. Comparison with popular methods

View Table | View all tables in this article

To quantitatively describe the performance of the proposed TransCD, we listed the metrics of each category on the best-performance model TransCD(SViT-E4-D4-32). As shown in Table 3, bad weather, baseline, camera jitter and dynamic background achieved 0.9588, 0.9626, 0.9584, and 0.9557 in terms of the F1 score, respectively. For challenging categories PTZ and night videos, the F1 are 0.8852 and 0.8860. In addition, except PTZ and night videos, all the other categories are over 90% in terms of the F1 score.

Table 3. Metrics for TransCD(SViT-E4-D4-32) on every category of CDNet-2014 dataset

View Table | View all tables in this article

As shown in Fig. 6, we visualized changes maps of different methods for comparison. For easy categories such that bad weather and baseline, BaseCD, CosimNet, and TransCD(SViT-E4-D4-32) all achieved satisfactory results. In the second pair of examples of intermittent object motion, BaseCD and CosimNet have many false detections, as well as the first pair examples of low framerate for CosimNet. For the second examples of low framerate, BaseCD missed many changes. BaseCD is a compact change detection model based on feature difference, and it is not good at overcoming noisy changes. From PTZ, it shows BaseCD does not handle it very well when facing large viewpoint. CosimNet proposed a TCL to tolerate noisy changes, and the results indicate that CosimNet outperforms BaseCD. However, change maps obtained by CosimNet are not fine enough. By comparison, TransCD(SViT-E4-D4-32) achieved more accurate detection results and has a higher tolerance for noisy changes.

Fig. 6. Visualization of change maps obtained by BaseCD, CosimNet, and TransCD(SViT-E4-D4-32). We visualized two pairs of typical examples for each category. From the left to right in each pair of examples, it is the image at ${T_0}$, the image at ${T_1}$, the ground truth, and the change map predicted by BaseCD, CosimNet, and the proposed TransCD(SViT-E4-D4-32), respectively.

Download Full Size | PDF

4.2 Ablation studies

In order to investigate the performance of the proposed SViT-based models in different settings, we set up a series of ablation experiments based on TransCD(SViT-E1-D1-16). They are about token length, hidden size, the layer number of the encoder and decoder, and the CNN backbone.

Effects of the token length (the grid size): From Fig. 3, the size of grid denotes the number of tokens finally generated ($L = H \times W$, where $H \times W$ denotes the grid size). The features corresponding to every grid area in the original image can be represented by a token. Therefore, the number of tokens denotes the amount of feature information in the raw image. Greater number of tokens denotes more rich features which can achieve better performance. To this end, we set up four grid sizes, $4 \times 4$, $8 \times 8$, $16 \times 16$, and $32 \times 32$, to test the performance of the proposed model. As shown in Table 4, the performance varies greatly under different grid sizes. The grid with size of $32 \times 32$ outperformed other grid sizes and obtained the best F1 score of 0.9335, and $4 \times 4$ had the worst performance with only 0.7872. A reasonable explanation is that the feature information is not enough when an image with spatial size of $512 \times 512$ is tokenized into only 16 tokens under the grid size of $4 \times 4$. Therefore, the dense grid benefits to better performance.

Table 4. Effects of the token length

View Table | View all tables in this article

Effects of the hidden size: The hidden size ${C_h}$ denotes the channel size of each token. The larger the hidden size is, the more and richer the feature information that can be represented by each token. As shown in Table 5, we set comparative experiments for different hidden sizes. With the decrease of the hidden size, the F1 score dropped gradually. Due to the self-attention mechanism in MSA, the change of hidden size is accompanied by the exponential increase in the complexity of computing. Therefore, the hidden size should be set to achieve a good trade-off between improving performance and computing complexity.

Table 5. Effects of the hidden size

View Table | View all tables in this article

Effects of the layers of the encoder and decoder: In TransCD, SViT consists of the encoder and decoder. Our motivation is that the encoder is able to facilitate the global relationship establishment of the embedded features and to guide the decoder to produce context-rich tokens. To explore the performance of SViT in different number of layers, we set experiments on different number of layers of the encoder and decoder.

As shown in Table 6, we divided the experimental setting into four categories, ID1: SViT only consists of a one-layer encoder; ID2-4: the decoder has more layers than the encoder; ID5-7: the decoder has fewer layers than the encoder; ID8-11: the decoder has the same number of layers as the encoder.

Table 6. Effects of layers of the encoder and decoder

View Table | View all tables in this article

From the results of ID1-4, with the increase of decoder layers, the performance increases gradually. For example, ID4 is 2.30% higher than ID1. From the comparison between ID2 and ID5, ID3 and ID6, ID4 and ID7, we can observe that the models that the decoder layers are larger than that of the encoder outperform their opposites. From the comparison results between ID2 and ID9, ID3 and ID10, ID4 and ID11, it indicates that the increase of the layer number of the encoder has little influence on the model when the layer number of the decoder is fixed. The above results also confirm that the encoder is more of guiding the decoder to produce the context-rich tokens and are an auxiliary component.

Effects of the CNN backbone: In the traditional CNN-based SCD models, the feature extraction of bi-temporal images is all dependent on CNN backbone. To explore if more sophisticated features benefit to SViT, we set up different experiments as shown in Table 7. In the models used ResNet18 as backbone, tokens are obtained by assigning tokenizing upon image features extracted by the backbone. The models without CNN backbone directly tokenizes input images as tokens. From the experiments in Table 7, the models with CNN backbone outperforms the one without CNN backbone. For example, TransCD(Res-SViT-E1-D1-16) is 1.27% higher than TransCD(SViT-E1-D1-16), and compared with TransCD(SViT-E4-D4-16), TransCD(Res-SViT-E4-D4-16) has an improvement of 0.28% in terms of the F1 score. However, it cannot be ignored that CNN backbone greatly increases the model parameters. As shown in Table 2, ResNet18 increases the parameters by about 10M.

Table 7. Effects of the CNN backbone

View Table | View all tables in this article

4.3 Visualization

In SCD task, there remains a lot of variability in bi-temporal images, which results in large dissimilarity between them. The variability may be induced by changes of interest, such as structural changes (construction, building demolition, traffic signs), but also from nuisances such as viewpoint changes, outdoor conditions (illumination, weather) and dynamic changes (pedestrians, vehicles, vegetation) [1]. Therefore, the key idea of SCD is whether a model is robust to noisy changes. To intuitively show the test results, we visualized the change maps detected by TransCD(SViT-E1-D1-16), as shown in Fig. 7.

Fig. 7. Visualization of change maps obtained by TransCD(SViT-E1-D1-16). We visualized three pairs of typical examples for each category on our lightest model. From the left to right in each pair of examples, it is the image at ${T_0}$, the image at ${T_1}$, the ground truth, and the change map predicted by our model, respectively.

Download Full Size | PDF

Note that all images in category Camera Jitter have small viewpoint due to unstable cameras. From Table 3, the F1 score of Camera Jitter achieved 0.9584, indicating that TransCD performs well to small viewpoint changes. As shown in Fig. 7 Camera Jitter, the three pairs of images all have a small viewpoint, the model accurately detected the interesting changes.

For intermittent object motion, it is a category that intends to test how a model adapts to background changes. For example, in the first pair of examples, the interesting change is the trash can, but the moving pedestrians and cars caused a large change in the background, as well as the third pair of examples. The model accurately identified the interesting changes. Please note that moving pedestrians and cars may be interesting changes in the other scenarios. However, in the first pair of examples, the model detected the trash can instead of identifying the moving pedestrians or cars as interesting changes. This indicates that TransCD is able to correctly detect semantic changes instead of remembering pedestrians and cars.

In the category of night videos, the main challenge is that low-visibility of vehicles and their very strong headlights that cause over saturation will make detection more difficult. In addition, headlights cause halos and reflection on the street. From the second pair of examples, although there are strong headlights and halos, the model obtained fine change maps.

Another challenging category is PTZ, the images in this category are accompanied with camera motion (pan, tilt or zoom) which will cause many false positives. Due to the camera motion, the background of the second and third pairs of examples is very different, which result in great dissimilarities. Our TransCD also detected the changed person and train.

In the categories of shadow, thermal and turbulence, the main challenges are the shadow of changes, heat stamps and reflection, and air turbulence and distortion caused by the heat, respectively. As can be seen from the results, the model is robust to the above nuisance.

As illustrated in Fig. 7, all of change maps have a smooth boundary and clear background. In addition, we used T-SNE (T-distributed Stochastic Neighbor Embedding) [48] algorithm to visualize the feature distribution of context-rich Tokens embedded by SViT. The results in Fig. 8 show that SViT achieved a good distinction between changed and unchanged information. The main reason is that the SViT is able to model relevant semantic changes and establish long-range connections between different objects. In addition, meaning that SViT leads to a more robust performance.

Fig. 8. Visualization of the features embedded by SViT. From left to right in each row, it is the image at ${T_0}$, the image at ${T_1}$, the ground truth, and the change map predicted by our model, two-dimensional feature embedding (where red denotes unchanged pair-features and blue denotes changed one), respectively.

Download Full Size | PDF

4.4 Supplementary experiments on VL-CMU-CD dataset

To further validate the effectiveness of the proposed Transformer-based model, we implemented experiments on another CD dataset named VL-CMU-CD [1] dataset. VL-CMU-CD dataset is a challenging CD dataset with a long time span. It includes various semantic changes, such as new building, road signs, construction areas, and noisy changes, such as viewpoint changes, illumination variation, season and weather changes. The author divided this dataset into a training set of 933 pairs and a test set of 429 pairs. During training phase, we resize all of examples to 512 × 512 since their raw resolution is 1024 × 768. Different from experiments on CDNet-2014 dataset, the learning rate and batch-size are set to 0.0005 and 4, respectively.

Comparison results are listed in Table 8. We only evaluated four light Transformer-based models on VL-CMU-CD dataset as the dataset is smaller than CDNet-2014 dataset. From Table 8, we can see that TransCD(Res-SViT-E1-D1-16) and TransCD(Res-SViT-E1-D1-32) obtained better performance than TransCD(SViT-E1-D1-16) and TransCD(SViT-E1-D1-32). For example, TransCD(Res-SViT-E1-D1-16) achieved 71.52% in terms of the F1 score and had an improvement of 19.78% than TransCD(SViT-E1-D1-16), and TransCD(Res-SViT-E1-D1-32) achieved 63.96% in terms of the F1 score and had an improvement of 7.57% than TransCD(SViT-E1-D1-32). It indicates that TransCD with ResNet18 outperforms the one without ResNet18 when be trained on small datasets. In addition, our models achieved competitive performance when compared with the popular methods, such as CosimNet and FCN-metrics.

Table 8. Comparison with popular methods on VL-CMU-CD dataset

View Table | View all tables in this article

Figure 9 visualized change maps for above four models. For the first two image pairs, although there are a lot of various light and shadow changes, four models have detected semantic changes. In row 3, two images have a strong light contrast and little viewpoint difference, the road sign is correctly identified as changes. In row 8, 9, and 10, since two images acquired with a long time span, trees and plants in two images have very different appearance which brings great difficulty to our models; TransCD(Res-SViT-E1-D1-16), however, has detected semantic changes. For images in row 7, the snow in the second image brings an obvious change area, only TransCD(Res-SViT-E1-D1-16) obtained a good change map. From above results, it shows that our TransCD is robust to noisy changes and able to detect changes among different scenes.

Fig. 9. Visualization of change maps on VL-CMU-CD dataset. From left to right in each row, it is the image at ${T_0}$, the image at ${T_1}$, the ground truth, and the change map predicted by TransCD(SViT-E1-D1-16), TransCD(SViT-E1-D1-32), TransCD(Res-SViT-E1-D1-16), and TransCD(Res-SViT-E1-D1-32), respectively.

Download Full Size | PDF

5. Conclusion

In this article, a Transformer-based framework named TransCD is proposed for the scene change detection task. To enhance the feature representation, we introduce SViT to establish global semantic relations and model long-range context. After incorporating SViT into a feature difference framework, TransCD is able to model both spatial and temporal relations. Experimental results show that all SViT-based models achieve significant improvements and have fewer parameters when compared with pure CNN-based SCD models. The challenging dataset named CDNet-2014 is used to verify the model’s effectiveness. Extensive results indicate that the proposed model is robust to noisy changes, such as viewpoints, illumination variation and other outdoor conditions. Compared with the selected pure CNN-based SCD models, the proposed method achieves an improvement of 7.31% in terms of the F1 score.

Funding

The Science and Technology Program of Sichuan (2021YJ0080); National Natural Science Foundation of China (61771409).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are available in Ref. [1,46,47]. Models and codes are available in [49].

References

1. P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, “Street-view change detection with deconvolutional networks,” Auton Robot 42(7), 1301–1322 (2018). [CrossRef]

2. C. Y. Fang, S. W. Chen, and C. S. Fuh, “Automatic change detection of driving environments in a vision-based driver assistance system,” IEEE Trans. Neural Netw. 14(3), 646–657 (2003). [CrossRef]

3. R. Collins, A. Lipton, and T. Kanade, “Introduction to the special section on video surveillance,” IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 745–746 (2000). [CrossRef]

4. C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Machine Intell. 22(8), 747–757 (2000). [CrossRef]

5. C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: real-time tracking of the human body,” IEEE Trans. Pattern Anal. Machine Intell. 19(7), 780–785 (1997). [CrossRef]

6. P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, “Subsense: A universal change detection method with local adaptive sensitivity,” IEEE Trans. on Image Process. 24(1), 359–373 (2015). [CrossRef]

7. S. Bianco, G. Ciocca, and R. Schettini, “Combination of video change detection algorithms by genetic programming,” IEEE Trans. Evol. Computat. 21(6), 914–928 (2017). [CrossRef]

8. D. Lelescu and D. Schonfeld, “Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream,” IEEE Trans. Multimedia 5(1), 106–117 (2003). [CrossRef]

9. S. K. Yedla, V. M. Manikandan, and P.V., “Real-time Scene Change Detection with Object Detection for Automated Stock Verification,” in 5th International Conference on Devices, Circuits and Systems (ICDCS) (2020), pp. 157-161.

10. . Lim, L. Ang and H. Y. Keles, “Learning multi-scale features for foreground segmentation,” Pattern Anal Applic 23(3), 1369–1380 (2020). [CrossRef]

11. H. Lee, H. Kim, and J. Kim, “Background Subtraction Using Background Sets with Image- and Color-Space Reduction,” IEEE Trans. Multimedia. 18(10), 2093–2103 (2016). [CrossRef]

12. X. Zhang, C. Zhu, H. Wu, Z. Liu, and Y. Xu, “An Imbalance Compensation Framework for Background Subtraction,” IEEE Trans. Multimedia. 19(11), 2425–2438 (2017). [CrossRef]

13. J. Seo and S. D. Kim, “Recursive on-line (2D)2PCA and its application to long-term background subtraction,” IEEE Trans. Multimedia. 16(8), 2333–2344 (2014). [CrossRef]

14. S. Ken and T. Okatani, “Change Detection from a Street Image Pair using CNN Features and Superpixel Segmentation,” in British Machine Vision Conference (2015), pp. 6101–6112.

15. X. H. Li, Z. S. Du, Y. Y. Huang, and Z. Y. Tan, “A deep translation (GAN) based change detection network for optical and SAR remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing 179, 14–34 (2021). [CrossRef]

16. Z. X. Wang, C. Peng, Y. Zhang, N. Wang, and L. Luo, “Fully convolutional siamese networks based change detection for optical aerial images with focal contrastive loss,” Neurocomputing 457, 155–167 (2021). [CrossRef]

17. E. Q. Guo, X. S. Fu, J. W. Zhu, M. Deng, Y. Liu, Q. Zhu, and H. F. Li, “Learning to measure change: Fully convolutional siamese metric networks for scene change detection,” arXiv preprint arXiv:1810.09111, (2018).

18. R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, “Image change detection algorithms: a systematic survey,” IEEE Trans. on Image Process. 14(3), 294–307 (2005). [CrossRef]

19. S. Bianco, G. Ciocca, and R. Schettini, “How far can you get by combining change detection algorithms?” in International Conference on Image Analysis and Processing (2017), pp. 96–107.

20. Y. Wang, Z. Luo, and P.-M. Jodoin, “Interactive deep learning method for segmenting moving objects,” Pattern Recognit. Lett. 96, 66–75 (2017). [CrossRef]

21. B. N. Subudhi, T. Veerakumar, S. Esakkirajan, and A. Ghosh, “Kernelized Fuzzy Modal Variation for Local Change Detection from Video Scenes,” IEEE Trans. Multimedia 22(4), 912–920 (2020). [CrossRef]

22. Y. Lei, D. Peng, P. Zhang, Q. Ke, and H. Li, “Hierarchical Paired Channel Fusion Network for Street Scene Change Detection,” IEEE Trans. on Image Process. 30, 55–67 (2021). [CrossRef]

23. M. S. Santana, D. Colombo, L. Junior, V. Albuquerque, T. Moreira, and J. Papa, “A novel siamese-based approach for scene change detection with applications to obstructed routes in hazardous environments,” IEEE Intell. Syst. 35(1), 44–53 (2020). [CrossRef]

24. K. Sakurada, W. Wang, N. Kawaguchi, and R. Nakamura, “Dense optical flow-based change detection network robust to difference of camera viewpoints,” arXiv preprint arXiv:1712.02941 (2017).

25. Ken Sakurada, M. Shibuya, and W. Wang, “Weakly supervised silhouette-based semantic scene change detection,” in Proceedings of IEEE International Conference on Robotics and Automation (IEEE, 2020), pp. 6861–6867.

26. Shuo Chen, Kailun Yang, and Rainer Stiefelhagen, “DR-TANet: Dynamic Receptive Temporal Attention Network for Street Scene Change Detection,” arXiv preprint arXiv:2103.00879 (2021).

27. A. Vaswani, S. Noam, P. Niki, U. Jakob, L. Jones, N. G. Aidan, and K. Lukasz, “Attention is all you need,” in Advances in neural information processing systems (2017), pp. 5998–6008.

28. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16 × 16 words: Transformers for image recognition at scale,” in The Ninth International Conference on Learning Representations (ICLR) (2021).

29. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in the16th European Conference on Computer Vision (ECCV) (2020), pp. 213-229.

30. Z. Zhu, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in The Ninth International Conference on Learning Representations (ICLR) (2021).

31. C. Zou, B. H. Wang, Y. Hu, J. Q. Liu, Q. Wu, Y. Zhao, B. X. Li, C. G. Zhang, C. Zhang, Y.C. Wei, and J. Sun, “End-to-End Human Object Interaction Detection with HOI Transformer,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 11825–11834.

32. D. J. Chen, Hsieh, Y. He, and T. L. Liu, “Adaptive Image Transformer for One-Shot Object Detection,” in the in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 12247–12256

33. H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning (2021), pp. 10347–10357.

34. B. Wu, C. Xu, X. Dai, A. Wan, P. Zang, Z. Y. Masayoshi, J. Gonzalez, K. Keutzer, and P. Vajda, “Visual transformers: Token-based image representation and processing for computer vision,” arXiv preprint arXiv:2006.03677 (2020).

35. S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. Torr, and L. Zhang, “Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 6881–6890.

36. D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Feature pyramid transformer,” in European Conference on Computer Vision (ECCV) (2020), pp. 323–339.

37. X. Chen, B. Yan, J. W. Zhu, D. Wang, X. Y. Yang, and H. C. Lu, “Transformer Tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 8126–8135.

38. N. Wang, W. G. Zhou, J. Wang, and H. Q. Li, “Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 1571–1580.

39. K. Li, S. J. Wang, X. Zhang, Y. F. Xu, W. J. Xu, and W. Z. Tu, “Pose Recognition with Cascade Transformers,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 1944–1953.

40. M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in Proceedings of the 37th International Conference on Machine Learning (ICML) (2020), pp. 1691–1703.

41. D. A. Hudson and C. L. Zitnick, “Generative Adversarial Transformers,” in The 38th International Conference on Machine Learning (ICML) (2021).

42. S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit, “Understanding Robustness of Transformers for Image Classification,” arXiv preprint arXiv:2103.14586 (2021).

43. Hila Chefer, S. Gur, and L. Wolf, “Transformer Interpretability Beyond Attention Visualization,” in the in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 782–791.

44. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” arXiv preprint arXiv:2103.14030 (2021).

45. H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “CvT: Introducing Convolutions to Vision Transformers,” arXiv preprint arXiv:2103.15808 (2021).

46. N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, “Change detection. net: A new change detection benchmark dataset,” in Proceedings of IEEE computer society conference on computer vision and pattern recognition workshops (IEEE, 2012), pp. 1–8.

47. Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, “Cdnet 2014: an expanded change detection benchmark dataset,” in Proceedings of IEEE conference on computer vision and pattern recognition workshops (IEEE, 2014), pp. 393–400.

48. L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” J. Mach. Learn. Res. 9, 2579–2605 (2008).

49. Z. Wang, Y. Zhang, L. Luo, and N. Wang, “TransCD: Scene Change Detection via Transformer-based Architecture,” GitHub (2021), https://github.com/wangle53/TransCD.

Models	Re	Pre	F1	#.Param.
SuBSENSE [6]	0.8124	0.7509	0.7408	-
IUTIS-3 [7]	0.7779	0.7875	0.7551	-
CosimNet [17]	0.8036	0.9383	0.8591	38.32M
HPCFNet [22]	-	-	0.8630	-
Cascaded CNN [20]	0.9506	0.8997	0.9209	-
BaseCD	0.8063	0.8281	0.7950	15.67M
TransCD(SViT-E1-D1-16)	0.8897	0.9360	0.9118	6.33M
TransCD(SViT-E1-D1-32)	0.9197	0.9482	0.9335	5.71M
TransCD(SViT-E4-D4-16)	0.9157	0.9404	0.9277	11.06M
TransCD(SViT-E4-D4-32)	0.9298	0.9427	0.9361	10.44M
TransCD(Res-SViT-E1-D1-16)	0.9119	0.9377	0.9245	17.24M
TransCD(Res-SViT-E4-D4-16)	0.9232	0.9383	0.9305	21.98M

Categories	Re	Pre	F1
Bad Weather	0.9531	0.9645	0.9588
Baseline	0.9590	0.9663	0.9626
Camera Jitter	0.9549	0.9620	0.9584
Dynamic Background	0.9486	0.0.9629	0.9557
Intermittent Object Motion	0.9356	0.9594	0.9473
Low Framerate	0.9114	0.9345	0.9228
Night Videos	0.8518	0.8805	0.8660
PTZ	0.9072	0.8643	0.8852
Shadow	0.9516	0.9691	0.9603
Thermal	0.9711	0.9770	0.9361
Turbulence	0.8837	0.9294	0.9060
Overall	0.9298	0.9427	0.9361

Token Length	Re	Pre	F1
4×4	0.7296	0.8720	0.7872
8×8	0.8387	0.9105	0.8716
16×16	0.8897	0.9360	0.9118
32×32	0.9197	0.9482	0.9335

Hidden Size	Re	Pre	F1
32	0.8180	0.9222	0.8636
64	0.8610	0.9308	0.8937
128	0.8737	0.9387	0.9038
256	0.8897	0.9360	0.9118

ID	Encoder	Decoder	Re	Pre	F1
1	1	0	0.8722	0.9422	0.9045
2	1	2	0.8903	0.9466	0.9169
3	1	4	0.9089	0.9381	0.9231
4	1	8	0.9238	0.9315	0.9275
5	2	1	0.8925	0.9388	0.9146
6	4	1	0.8875	0.9414	0.9128
7	8	1	0.8966	0.9328	0.9138
8	1	1	0.8897	0.9360	0.9118
9	2	2	0.8999	0.9393	0.9186
10	4	4	0.9157	0.9404	0.9277
11	8	8	0.9122	0.9443	0.9278

TransCD: scene change detection via transformer-based architecture

Abstract

1. Introduction

2. Method

2.1 Proposed change detection model

2.2 Siamese vision transformer

3. Experiment implementation details

3.1 Datasets

3.2 Optimization and evaluation

4. Experiment results

4.1 Comparison with the popular methods

4.2 Ablation studies

4.3 Visualization

4.4 Supplementary experiments on VL-CMU-CD dataset

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (8)

Equations (14)

Optics Express

Models	Encoder	Decoder	Backbone	Grid Size
BaseCD	0	0	ResNet18	16×16
TransCD(SViT-E1-D1-16)	1	1	No	16×16
TransCD(SViT-E1-D1-32)	1	1	No	32×32
TransCD(SViT-E4-D4-16)	4	4	No	16×16
TransCD(SViT-E4-D4-32)	4	4	No	32×32
TransCD(Res-SViT-E1-D1-16)	1	1	ResNet18	16×16
TransCD(Res-SViT-E1-D1-32)	1	1	ResNet18	32×32
TransCD(Res-SViT-E4-D4-16)	4	4	ResNet18	16×16

Models	Re	Pre	F1
CDNet [1]	0.8500	0.4000	0.5500
FCN-CD [1]	0.9300	0.4300	0.5800
CosimNet [17]	-	-	0.7060
FCN-Metrics [17]	-	-	0.7210
TransCD(SViT-E1-D1-16)	0.4524	0.6043	0.5174
TransCD(SViT-E1-D1-32)	0.4807	0.6821	0.5639
TransCD(Res-SViT-E1-D1-16)	0.6627	0.7768	0.7152
TransCD(Res-SViT-E1-D1-32)	0.5825	0.7091	0.6396