
Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection

Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji, Shuai Li, and Mao Ye

This work was supported in part by the National Natural Science Foundation of China (62276048) and Chengdu Science and Technology Projects (2023-YF06-00009-HZ). Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji and Mao Ye are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, P.R. China (e-mail: dengyanluo@126.com; xiangyp@uestc.edu.cn; wanghu0833cv@gmail.com; jiluping@uestc.edu.cn; cvlab.uestc@gmail.com). Shuai Li is with the School of Control Science and Engineering, Shandong University, Jinan 250000, P.R. China (e-mail: shuaili@sdu.edu.cn). *corresponding author

Abstract

The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. The current state-of-the-art methods are mainly based on ConvLSTM to aggregate information from adjacent frames to facilitate the detection of the current frame. However, these methods implicitly utilize motion information only in the training stage and fail to explicitly explore motion compensation, resulting in poor performance in the case of a video sequence including large motion. In this paper, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages. Specifically, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Then, the feature refinement module adaptively fuses the aligned features and further aggregates useful spatio-temporal information by means of the proposed Attention-guided Deformable Fusion (AGDF) block. In addition, to improve the alignment of adjacent frames with the current frame, we extend the traditional loss function by introducing a new motion compensation loss. Extensive experimental results demonstrate that the proposed DFAR method achieves the state-of-the-art performance on two benchmark datasets including DAUB and IRDST.

Index Terms:
Moving infrared dim-small target, target detection, multi-frame, deformable convolution, motion compensation loss.

I Introduction

Moving infrared dim-small target detection plays an important role in numerous practical applications, such as civil and military surveillance, because harsh external environmental conditions do not degrade the quality of infrared images [1]. Although video target detection [2, 3, 4, 5, 6, 7] and small object detection [8, 9, 10, 11] have developed rapidly in recent years, video infrared dim-small target detection remains challenging for the following reasons: 1) Due to the long imaging distances, the infrared target is extremely small compared to the background. 2) The low intensity makes it difficult to accurately extract the shape and texture information of targets from complex backgrounds. 3) In actual scenarios, sea surfaces, buildings, clouds, and other clutter can easily disturb or submerge small targets. Therefore, it is necessary to study moving infrared dim-small target detection.

In the past few decades, researchers have performed many studies on infrared dim-small target detection. The early work is mainly based on traditional paradigms, such as background modeling [12] and data structure [13, 14]. Although traditional methods have achieved satisfactory performance, there are too many hand-crafted components, which makes their application impractical in complex scenarios. In contrast, learning-based methods can use deep neural networks to globally optimize the model on a large number of training samples. Therefore, learning-based methods are becoming increasingly popular, greatly promoting the development of infrared dim-small target detection.

According to the number of input frames used, existing learning-based methods can be categorized into single-frame based approaches and multi-frame based approaches. The early works [15, 16, 17, 18, 19, 20] focused on single-frame target detection, that is, only one frame is input into the network at a time. Although they can improve detection performance to a certain extent, their performance is still limited due to ignoring the temporal information implied in consecutive frames. To explore the temporal information, multi-frame target detection methods are proposed [21, 22, 23]. This kind of scheme allows the network to not only model intra-frame spatial features, but also extract inter-frame temporal information of the target to enhance feature representation.

Figure 1: Illustrating the differences between our method and the state-of-the-art method SSTNet [23]. The LSTM-based SSTNet implicitly aggregates the information from adjacent frames only in the training stage, while our method utilizes deformable convolution (DCN) to perform explicit inter-frame feature alignment in the training and inference stages, and applies the introduced Motion Compensation (MC) loss to supervise the temporal alignment.

The existing multi-frame detection methods are mainly based on 3D convolutional structures [21, 22] or the Convolutional Long Short-Term Memory (ConvLSTM) structure [23]. However, these methods implicitly utilize motion information and fail to explicitly explore motion compensation, which limits the network’s ability to model complex and large-scale motion. Furthermore, the current state-of-the-art method SSTNet [23] only utilizes ConvLSTM to improve network optimization in the training stage, which results in the frame to be detected not actually using the past-future frame information in the inference stage, as illustrated in Fig. 1. In addition, models based on 3D convolution or LSTM struggle to achieve a good trade-off between computational cost and detection performance.

To alleviate the above issues, in this work, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution (DCN) [24] to explore motion context information during both the training and inference stages, as presented in Fig. 1. Our motivation stems from two observations. First, in moving infrared dim-small target detection, there are many cases in which the target cannot be detected in the current frame but is easier (or harder) to perceive in some adjacent frames. Second, DCN has a strong capability for modeling geometric transformations.

Therefore, in order to adaptively utilize useful information from adjacent frames, two main parts are devised. On the one hand, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Specifically, the TDA module uses features from both the current frame and the adjacent frame to dynamically predict the sampling offsets of the convolution kernels. More specifically, the DCAF block uses channel attention to fuse multi-scale features extracted by multiple dilated convolutions, giving the predicted offsets an adaptive receptive field. Then, the dynamic kernels are applied to the features from adjacent frames to perform the temporal alignment. On the other hand, the feature refinement module adaptively fuses the feature of the current frame with the aligned adjacent features, and further aggregates effective spatio-temporal information through the proposed Attention-guided Deformable Fusion (AGDF) blocks. In particular, the AGDF block adopts a pyramid offset generation scheme and fuses multi-scale deformable offsets at the pixel level, which provides the model with the ability to implicitly model complex and large motions. In addition, to improve the alignment effect, we introduce a new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ by measuring the $L_1$ distance between the aligned adjacent features and the current frame feature.

The main contributions of this paper can be summarized as follows: (1) A new Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution is proposed to mine the temporal information implied in consecutive frames, and to effectively align each adjacent frame with the target frame through a designed Temporal Deformable Alignment (TDA) module. (2) We propose a feature refinement module to adaptively fuse the aligned adjacent features, and further explore valuable spatio-temporal details from the fused aligned features with the proposed Attention-guided Deformable Fusion (AGDF) blocks. (3) A new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ is proposed to supervise the alignment of adjacent frames with the current frame.

Experimental results demonstrate that the proposed DFAR method achieves superior performance compared with the state-of-the-art methods in both quantitative and qualitative evaluations. The remainder of the paper is organized as follows: Section II introduces the related works, including single-frame based infrared small target detection schemes and multi-frame based infrared small target detection schemes. Section III elaborates on the structure and details of the proposed method. Section IV shows comparative experiments and ablation studies on two benchmark datasets. Finally, Section V draws conclusions.

II Related Works

II-A Single-frame Infrared Small Target Detection

According to the number of frames used, infrared small target detection schemes can be classified into two categories: single-frame methods and multi-frame methods. Single-frame infrared small target detection aims to accurately detect targets in a single infrared image, which can be divided into two categories: traditional methods [12, 13, 14, 25, 26, 27, 28, 29, 30] and deep learning-based methods [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42].

Traditional methods can be further categorized into background modeling, data structure and target feature methods. Background modeling methods estimate and suppress the background effect, such as top-hat [12] and max-mean [25]. Data structure methods are based on the sparsity of the target and the low rank of the background to separate the target and the background, such as TNLRS [14] and IPI [26]. Target feature methods are based on the feature differences between the target and its neighboring regions to detect the target, such as LCM [27] and WSLCM [29].

Since the introduction of the Multi-Layer Perceptron (MLP) network into small target detection by Liu et al. [31], deep-learning-based algorithms have received much research attention. To take advantage of the Convolutional Neural Network (CNN), Dai et al. [32] proposed ACM to aggregate low-level and deep-level features. To explicitly establish long-range contextual information, Liu et al. [35] designed a transformer-based method for infrared small target detection. Later on, Lin et al. [36] designed IR-TransDet to integrate the benefits of the CNN and the transformer to properly extract global semantic information and features of small targets. Although these methods perform well in single-image infrared dim-small target detection, they are less effective in video infrared dim-small target detection because they ignore temporal information.

II-B Multi-frame Infrared Small Target Detection

Figure 2: The framework of the proposed DFAR approach consists of four parts. (a) A feature extraction module is applied to extract the spatial information from the input clip $I_{[t-R,t+R]}$ and obtain the extracted features $F_{[t-R,t+R]}^{E}$. (b) The extracted visual features $F_i^E$ and $F_t^E$ are concatenated in the channel dimension and fed into the Temporal Deformable Alignment (TDA) module based on the Dilated Convolution Attention Fusion (DCAF) blocks for alignment, where $i \in [t-R,t+R]$ and $i \neq t$. (c) The aligned features and the extracted target feature are input into the feature refinement module based on the attention weight block and the designed Attention-guided Deformable Fusion (AGDF) blocks to adaptively fuse and refine spatio-temporal information. (d) The refined feature $F_D$ is fed into the detection head module for calculating the detection loss. The network is optimized under the supervision of the traditional detection loss and the introduced Motion Compensation (MC) loss. Herein, temporal radius $R=2$.

To simultaneously utilize spatio-temporal information, multi-frame detection methods have been proposed in recent years. The early multi-frame methods [43, 44, 45, 46, 47, 48, 49] are also not learning-based, and they mainly rely on tensor optimization. For instance, Sun et al. [43] developed STTV-WNIPT, which combines spatial and temporal information to separate the target and background. Then, Kwan et al. [44] proposed to use optical flow to improve detection performance. Later on, Wu et al. [48] proposed to construct a 4-D spatio-temporal tensor and decompose it into a low-dimensional tensor. However, these methods are heavily dependent on traditional priors and handcrafted features, resulting in poor detection performance in complex scenes with clutter and noise.

To overcome the above weakness, learning-based multi-frame detection methods [21, 22, 23, 50, 51, 52, 53, 54, 55] have been proposed. For example, Du et al. [51] proposed STFBD to use multiple frames for video infrared dim-small target detection. Yan et al. [52] then designed STDMANet to explore temporal multi-scale features. However, these methods directly concatenate multiple frames or features to construct spatio-temporal tensors, resulting in crude fusion of motion information. Later on, Zhang et al. [22] utilized 3D CNN and traditional priors to mine motion information. Meanwhile, Li et al. [53] developed DTUM based on 3D CNN to encode the motion direction into features and extract the motion information of targets. After that, Tong et al. [55] introduced ST-Trans based on the transformer to learn the spatio-temporal dependencies between successive frames of small infrared targets. More recently, Chen et al. [23] proposed SSTNet based on ConvLSTM for multi-frame infrared small target detection. Although these works have achieved state-of-the-art performance, they model motion information implicitly and do not compensate for it explicitly, which limits their performance. Moreover, the models based on 3D CNN, transformer and LSTM tend to have heavy parameters and computation. In addition, the SSTNet method does not utilize ConvLSTM in the inference stage but only in the training stage to aggregate past-current-future frames, which may also lead to overfitting on the training dataset. In contrast, we directly incorporate deformable alignment into our module, allowing explicit guidance during both the training and inference stages, thus achieving better performance and faster speed than the existing learning-based multi-frame methods.

III The Proposed Method

III-A Overview

Given a video clip of $2R+1$ consecutive frames $I_{[t-R,t+R]}$, the middle frame $I_t \in \mathbb{R}^{C\times H\times W}$ is the target frame to be detected and the other frames are the reference frames. Here, $R$ is the temporal radius (i.e., the number of input frames is $2\times R+1$), $C$ refers to the channel number, and $H\times W$ denotes the frame size. Our goal is to improve detection performance by enabling the network to learn motion context features. The overall structure of our DFAR method is shown in Fig. 2, which consists of four modules: a feature extraction module, a Temporal Deformable Alignment (TDA) module based on Dilated Convolution Attention Fusion (DCAF) blocks for feature alignment, a feature refinement module based on the attention weight block and the proposed Attention-guided Deformable Fusion (AGDF) blocks, and a detection head module for target detection.

As shown in Fig. 2, firstly, a feature extraction module is applied to extract the spatial information for each frame from the input clip $I_{[t-R,t+R]}$ and obtain the extracted features $F_{[t-R,t+R]}^{E} \in \mathbb{R}^{c\times h\times w}$, where $c$, $h$, and $w$ denote the channel, height, and width of the feature $F_i^E$, respectively. The feature extraction can be represented as:

$$F_{[t-R,t+R]}^{E} = FE(I_{[t-R,t+R]}), \tag{1}$$

where $FE(\cdot)$ denotes the feature extraction module, which is a three-level pyramid structure, and each level contains a standard $3\times 3$ convolutional layer with a stride of 1 and a $3\times 3$ convolutional layer with a stride of 2 for downsampling. Then, each adjacent frame feature $F_i^E$ enters the TDA module along with the current frame feature $F_t^E$ for temporal alignment:

$$F_i^A = TDA(F_i^E, F_t^E), \quad i \in [t-R,t+R] \text{ and } i \neq t, \tag{2}$$

where $F_i^A$ is the aligned feature of adjacent frame $i$. Furthermore, each aligned feature $F_i^A$ and the extracted target feature $F_t^E$ are input into the feature refinement module to further gather valuable spatio-temporal information:

$$F_D = FR(F_i^A, F_t^E), \tag{3}$$

where $FR(\cdot)$ denotes the feature refinement module, and $F_D$ is the refined feature. Finally, following SSTNet [23], the refined feature $F_D$ is input to the detection head YOLOX [56] for target detection. It should be noted that for the first few frames and the last few frames in a video sequence, where the number of adjacent frames is less than $2\times R$, we repeatedly pad the clip with the target frame until there are $2\times R$ adjacent frames.
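The boundary padding described above can be sketched as an index computation. This is a minimal illustration, not the authors' code: the function name `clip_indices` is hypothetical, and placing each padded copy of the target frame at the position of the missing neighbour is an assumption, since the text only states that the target frame is repeated.

```python
def clip_indices(t, num_frames, radius=2):
    """Return the 2R+1 frame indices used to detect frame t.

    Neighbours that fall outside the sequence are replaced by the
    target frame t itself (assumed placement of the padded copies).
    """
    if not 0 <= t < num_frames:
        raise ValueError("target index out of range")
    indices = []
    for i in range(t - radius, t + radius + 1):
        indices.append(i if 0 <= i < num_frames else t)
    return indices


# Interior frame: a regular window of 2R+1 = 5 frames.
print(clip_indices(5, 10))  # [3, 4, 5, 6, 7]
# First frame: the two missing past frames are filled with frame 0.
print(clip_indices(0, 10))  # [0, 0, 0, 1, 2]
```

With this convention every clip has the same length, so the alignment and refinement stages see a fixed number of inputs regardless of where the target frame sits in the sequence.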

The details of the TDA module and the feature refinement module are explained in the following Sec. III-B and Sec. III-C, while the proposed Motion Compensation (MC) loss $\mathcal{L}_{MC}$ is presented in Sec. III-D.
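As a rough preview of the MC loss, which the Introduction describes as an $L_1$ distance between the aligned adjacent features and the current frame feature, the following NumPy sketch may help. The mean reduction over elements, the averaging over adjacent frames, and the name `motion_compensation_loss` are assumptions for illustration; Sec. III-D gives the actual definition.

```python
import numpy as np


def motion_compensation_loss(aligned_feats, target_feat):
    """Average L1 distance between aligned adjacent features and the target feature.

    aligned_feats: list of arrays shaped (c, h, w), one per adjacent frame
    target_feat:   array shaped (c, h, w) for the current frame
    The mean over elements and over frames is an assumed normalization.
    """
    per_frame = [np.mean(np.abs(f - target_feat)) for f in aligned_feats]
    return float(np.mean(per_frame))


target = np.zeros((2, 3, 3))
# Perfectly aligned features give zero loss.
print(motion_compensation_loss([target.copy(), target.copy()], target))
```

The loss decreases as the aligned features approach the current frame feature, which is the supervision signal the alignment module needs.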

Figure 3: An example of moving infrared dim-small target. The target is more easily perceived in the adjacent frame 806 than in the current frame 807 and more difficult to be detected in the adjacent frame 808.

III-B Temporal Deformable Alignment Module

The motivation for inter-frame alignment comes from our observation that in the moving infrared dim-small target detection task, there are many situations where it is not easy to detect the target in the current frame, but the related target information exists in adjacent frames. Fig. 3 shows an example of this case, where the target in frame 807 is difficult to detect, whereas the target in frame 806 is much more easily perceived. Moreover, deformable convolution (DCN) has shown promising performance at capturing the motion cues of targets in some low-level vision tasks such as video super-resolution [57] and video deraining [58]. Therefore, we design a TDA module based on DCN to aggregate temporal information.

Concretely, firstly, the adjacent feature $F_i^E$ and the target feature $F_t^E$ are concatenated in the channel dimension and then passed through a $3\times 3$ convolutional layer to make the number of channels consistent:

$$F_{ti} = Conv([F_t, F_i]), \tag{4}$$

where $[\cdot,\cdot]$ denotes the concatenation operation in the channel axis. Then, inspired by the residual dense network [59], we design a Dilated Convolution Attention Fusion (DCAF) block as the basic block to further integrate the temporal information:

$$F_{ti}^{\prime} = (DCAF)^{4}(F_{ti}). \tag{5}$$

The stacked blocks allow the network to have a large enough receptive field to aggregate information from distant spatial locations. Finally, the aggregated feature $F_{ti}^{\prime}$ is input to a $3\times 3$ convolutional layer to generate the corresponding deformable sampling parameters for alignment:

$$\Theta = Conv(F_{ti}^{\prime}), \tag{6}$$

where $\Theta \in \mathbb{R}^{d\times 2K^2\times h\times w}$ is the offset field for the deformable convolutional kernel; $d$ and $K^2$ denote the number of deformable groups and the kernel size of the deformable convolution, respectively. Then, deformable convolution with the predicted offsets $\Theta$ is applied to the feature $F_i^E$ to get the aligned feature $F_i^A$:

$$F_i^A(p) = \sum_{k=1}^{K^2} \omega_k \cdot F_i^E(p + p_k + \Delta p_k), \tag{7}$$

where $p_k$ denotes the sampling grid with $K^2$ sampling locations, and $\omega_k$ represents the weights for each location $p$; $\Delta p_k$ is the learnable offset for the $k$-th location, that is, $\Theta = \{\Delta p_k\}$. As $p + p_k + \Delta p_k$ can be fractional, bilinear interpolation is adopted as in [60]. It should be noted that the above process only describes the prediction and application of offsets. In fact, we also generate and apply masks for deformable convolution [24].
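To make the sampling rule in Eq. (7) concrete, the following NumPy sketch evaluates it for a single channel. It is a simplified illustration under stated assumptions rather than the paper's implementation: the function names, the `(K*K, 2, h, w)` offset layout with `(dy, dx)` ordering, and zero padding outside the map are all assumptions, and the modulation masks and deformable groups mentioned above are omitted.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Sample a 2-D map at fractional (y, x); points outside the map read as 0."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * feat[yy, xx]
    return value


def deform_conv_single(feat, weights, offsets, K=3):
    """Single-channel form of Eq. (7): sum_k w_k * feat(p + p_k + dp_k).

    feat:    (h, w) adjacent-frame feature map
    weights: (K*K,) kernel weights w_k
    offsets: (K*K, 2, h, w) offsets dp_k stored as (dy, dx) (assumed layout)
    """
    h, w = feat.shape
    r = K // 2
    # Regular sampling grid p_k of a K x K kernel centred on p.
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = np.zeros((h, w), dtype=np.float64)
    for py in range(h):
        for px in range(w):
            for k, (gy, gx) in enumerate(grid):
                y = py + gy + offsets[k, 0, py, px]
                x = px + gx + offsets[k, 1, py, px]
                out[py, px] += weights[k] * bilinear_sample(feat, y, x)
    return out


feat = np.arange(16, dtype=np.float64).reshape(4, 4)
weights = np.zeros(9)
weights[4] = 1.0  # only the kernel centre contributes
offsets = np.zeros((9, 2, 4, 4))
offsets[:, 1] = 1.0  # every sample is shifted one pixel to the right
aligned = deform_conv_single(feat, weights, offsets)
```

In this toy setting the output at each position equals the input one pixel to its right, which is how a learned offset field can pull a displaced target in an adjacent frame back to its position in the current frame.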

Figure 4: The structure of the Dilated Convolution Attention Fusion (DCAF) block. It contains 4 dilated convolutions with dilation rates from 1 to 4. Unless otherwise indicated, the kernel size and dilation rate of each convolutional layer are set to $3\times 3$ and 1, respectively.

The structure of the DCAF block is depicted in Fig. 4. In detail, to reduce computational cost while ensuring detection performance, we first use a $3\times 3$ convolutional layer to halve the number of channels. Then, 4 dilated convolutions with dilation rates from 1 to 4 are used to obtain feature maps with different receptive fields. These feature maps are hierarchically added before concatenating them with the original input feature to acquire an effective receptive field. After that, the channel attention [61] and a $1\times 1$ convolutional layer are utilized to fuse the concatenated multi-scale feature and restore the input channels of the DCAF block. The channel attention mechanism can play a selection role for different receptive fields, which enables our TDA module to adaptively model different degrees of motion between frames. Finally, the local skip connection with residual scaling is applied to complete our DCAF block.
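A small arithmetic sketch shows why the four dilated branches see different spatial extents. It uses the standard relation that a $k\times k$ kernel with dilation rate $r$ spans $k + (k-1)(r-1)$ pixels per side; this formula is general background and is not stated in the source, the helper name `effective_kernel` is hypothetical, and the effect of the hierarchical addition between branches is not modeled here.

```python
def effective_kernel(kernel_size, dilation):
    """Side length covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# The four 3x3 branches of the DCAF block, dilation rates 1 to 4.
sizes = [effective_kernel(3, r) for r in range(1, 5)]
print(sizes)  # [3, 5, 7, 9]
```

The branches therefore cover neighbourhoods from 3 to 9 pixels wide at the same parameter cost, and the channel attention that follows can weight them according to how large the inter-frame motion is.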

Remark. Although optical flow can also be used for explicit temporal alignment, optical flow estimation only predicts an offset for each coordinate, and this single-coordinate single-offset mechanism severely restricts the modeling ability in more complex scenarios. In addition, per-pixel motion estimation often incurs a heavy computational load. In contrast, our method aligns the target frame with adjacent frames at the feature level, which gives the network strong capability and flexibility to handle various motion conditions in temporal scenes. We visualize the feature maps in Fig. 5 to intuitively illustrate the effectiveness of the TDA module. We can see from Fig. 5 that after alignment, the target feature in frame 806 is closer to the target feature in frame 807. This makes it easier for the network to perceive the target with the help of information from adjacent frames. At the same time, the target feature in frame 808, which is prone to introducing new artifacts, becomes more easily perceived.

Figure 5: An example of visualizing feature maps. The feature maps are obtained by averaging all corresponding channel features. The white area is the anchor area to be aligned, and the red areas indicate the target areas. After alignment, the target features in adjacent frames are closer to the target features in the detected frame and become more easily perceived and utilized. Zoom in for the best view.

III-C Feature Refinement Module

Although we explicitly explore motion information, ineffective alignment can lead to worse detection results if adjacent frames are too blurry. As shown in Fig. 3, the target is harder to detect in frame 808 than in frame 807, and the alignment may introduce new artifacts, making the target more susceptible to disturbance. Therefore, we design a feature refinement module to adaptively fuse and refine useful temporal information from the aligned features.

As shown in Fig. 2, we first aggregate the aligned features $F_i^A$ with the extracted feature $F_t^E$ via a concatenation operation followed by a $1\times 1$ convolutional layer to derive the fused alignment feature $F_f^a$:

$$F_f^a = Conv([F_{t-R}^A, \ldots, F_{t-1}^A, F_t^E, F_{t+1}^A, \ldots, F_{t+R}^A]). \tag{8}$$

Then, a global average pooling operation and two $1 \times 1$ convolutional layers are adopted to generate the attention weights $W_{2R+1}$:

$$W_{2R+1} = \mathrm{Conv}_{2R+1}(\mathrm{Conv}(\mathrm{GAP}(F_f^a))), \tag{9}$$

where $\mathrm{GAP}(\cdot)$ represents the global average pooling operation, and $\mathrm{Conv}_{2R+1}(\cdot)$ denotes the convolution operation that generates the weights $W_{2R+1}$ for the features to be aggregated. After that, the adaptive fusion weights $W_{2R+1}$ are element-wise multiplied with the features $F_{2R+1} = \{F_{t-R}^{A}, \ldots, F_{t-1}^{A}, F_t^{E}, F_{t+1}^{A}, \ldots, F_{t+R}^{A}\}$:

$$\widetilde{F}_{2R+1} = F_{2R+1} \otimes W_{2R+1}, \tag{10}$$

where $\otimes$ refers to the element-wise multiplication operation. Finally, the modulated features $\widetilde{F}_{2R+1}$ are concatenated along the channel dimension and then passed through a $1 \times 1$ bottleneck convolution to obtain the fused coarse feature $F_f^c$:

$$F_f^c = \mathrm{Conv}([\widetilde{F}_{t-R}, \cdots, \widetilde{F}_{t-1}, \widetilde{F}_{t}, \widetilde{F}_{t+1}, \cdots, \widetilde{F}_{t+R}]). \tag{11}$$

The obtained feature $F_f^c$ contains both intra-frame spatial relation and inter-frame temporal information, and the temporal dependence among frames is concatenated along the channel dimension. Therefore, we need to design an effective fusion structure to enable the network to dynamically aggregate spatio-temporal information. Fortunately, the attention mechanism can highlight the important information and suppress the less useful information. Thus, we propose a spatial attention and channel attention guided AGDF block to enable the network to concentrate on more valuable information and enhance discriminative learning ability.
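
The adaptive fusion steps of Eqs. (8)-(11), which produce the fused coarse feature, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `conv1x1` and `adaptive_fusion`, the random weights, the ReLU and sigmoid activations, and the choice of one attention weight per frame and channel are all hypothetical, since the text does not specify them.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: mixes channels independently at every pixel.
    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def adaptive_fusion(features, w_fuse, w_mid, w_attn, w_out):
    """features: list of 2R+1 arrays of shape (C, H, W), ordered t-R .. t+R."""
    stacked = np.concatenate(features, axis=0)        # (n*C, H, W)
    f_a = conv1x1(stacked, w_fuse)                    # Eq. (8): fused alignment feature
    pooled = f_a.mean(axis=(1, 2), keepdims=True)     # GAP -> (C, 1, 1)
    hidden = np.maximum(conv1x1(pooled, w_mid), 0.0)  # ReLU is an assumption
    logits = conv1x1(hidden, w_attn)                  # Eq. (9) -> (n*C, 1, 1)
    weights = 1.0 / (1.0 + np.exp(-logits))           # sigmoid is an assumption
    modulated = stacked * weights                     # Eq. (10), broadcast over H and W
    return conv1x1(modulated, w_out)                  # Eq. (11): fused coarse feature

rng = np.random.default_rng(0)
R, C, H, W = 2, 4, 8, 8          # toy sizes, not the paper's settings
n = 2 * R + 1                    # number of frames
feats = [rng.standard_normal((C, H, W)) for _ in range(n)]
fused = adaptive_fusion(
    feats,
    w_fuse=rng.standard_normal((C, n * C)),
    w_mid=rng.standard_normal((C, C)),
    w_attn=rng.standard_normal((n * C, C)),
    w_out=rng.standard_normal((C, n * C)),
)
print(fused.shape)
```

Because the weights are computed from globally pooled features, each frame's contribution is rescaled uniformly over all spatial positions in this sketch; spatially varying refinement is left to the AGDF blocks.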

Figure 6: The structure of the Attention-guided Deformable Fusion (AGDF) block. Unless otherwise indicated, the kernel size and stride of each convolutional layer are set to $3 \times 3$ and 1, respectively.

The structure of the AGDF block is illustrated in Fig. 6. Assuming that the input feature map of the AGDF block is $F_x$, we first use a $3 \times 3$ convolutional layer to halve the number of channels and reduce the computational cost:

$$F_x^{\prime} = \mathrm{Conv}(F_x). \tag{12}$$

Then, spatial attention and channel attention [62] are applied to aggregate spatial information and temporal dependence, respectively:

$$F_x^{s} = \mathrm{SA}(F_x^{\prime}), \tag{13}$$
$$F_x^{c} = \mathrm{CA}(F_x^{\prime}), \tag{14}$$

where $\mathrm{SA}(\cdot)$ and $\mathrm{CA}(\cdot)$ represent spatial attention and channel attention operations, respectively. Furthermore, a $1 \times 1$ convolutional layer is adopted to combine the two branch features and make the number of channels consistent:

$$F_x^{sc} = \mathrm{Conv}([F_x^{s}, F_x^{c}]). \tag{15}$$

After that, we utilize a multi-scale structure to predict offsets and perform deformable fusion. Specifically, we use a strided convolution (SConv) to downsample $F_x^{sc}$ by a factor of 2 and predict offsets at two different scales:

$$\Theta_1^{sc} = \mathrm{Conv}(F_x^{sc}), \tag{16}$$
$$\Theta_2^{sc} = \mathrm{Conv}(\mathrm{SConv}(F_x^{sc})). \tag{17}$$

Then, deformable convolution with the fused multi-scale offsets is applied to the mixed spatio-temporal feature $F_x^{sc}$ to further fuse spatio-temporal information:

$$F_x^{f} = \mathrm{DCN}(F_x^{sc}, (\Theta_2^{sc})^{\uparrow 2} + \Theta_1^{sc}), \tag{18}$$

where $(\cdot)^{\uparrow 2}$ refers to upscaling by a factor of 2. The deformable convolution with pyramid-generated offsets allows the AGDF block to have a larger and adaptive receptive field to aggregate information. Finally, a $3 \times 3$ convolutional layer is used to increase the number of channels to ensure consistent input and output channels for the AGDF block:

$$F_y = \mathrm{Conv}(F_x^{f}). \tag{19}$$

Remark. Although stacked convolutional layers or multi-scale convolutional structures can also increase the receptive field of the network, they may cause gradient vanishing/exploding and degradation problems. In contrast, our AGDF block employs deformable convolution with jointly predicted offsets to model spatio-temporal information from the fused two-branch features, leading to more efficient use of both intra-frame and inter-frame information. Although a three-level pyramid structure may bring further performance improvement, the accompanying increase in computational cost is unacceptable. Our AGDF block balances performance and computational efficiency through the designed two-level pyramid structure.
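
The two-level offset fusion inside Eq. (18) can be illustrated with a small NumPy sketch. It covers only the offset term, i.e., upsampling the coarse offsets and adding the fine ones, and not the deformable convolution itself. The function names, the nearest-neighbour interpolation, the single deformable group, and the 18-channel layout (9 sampling points of a 3 × 3 kernel, each with a vertical and a horizontal offset) are assumptions made for illustration; the text does not state the interpolation method or whether offset magnitudes are rescaled after upsampling.

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array (assumed interpolation)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_offsets(theta1, theta2):
    """Offset term of Eq. (18): upsampled coarse offsets plus fine offsets."""
    return upsample2x_nearest(theta2) + theta1

# Toy example: fine offsets at 4x4 resolution, coarse offsets at 2x2 resolution.
theta1 = np.full((18, 4, 4), 0.5)   # fine-scale offsets, Eq. (16)
theta2 = np.ones((18, 2, 2))        # coarse-scale offsets, Eq. (17)
fused_offsets = fuse_offsets(theta1, theta2)
print(fused_offsets.shape)
```

In this sketch the coarse branch contributes a smooth component predicted from a larger context, and the fine branch adds a per-pixel correction on top of it.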

III-D The Loss Function

Although the deformable alignment has the potential to capture motion context and align $F_i^E$ with $F_t^E$, it is difficult to train deformable convolution without a supervision loss [63]. Training instability regularly leads to offset overflow, degrading the final performance. Therefore, to improve the temporal alignment, we introduce a new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ as follows:

$$\mathcal{L}_{MC} = \sum_{i=t-R,\, i \neq t}^{t+R} \mathcal{L}_1\left(F_i^A, F_t^E\right), \tag{20}$$

where $\mathcal{L}_1(\cdot,\cdot)$ represents the $L_1$ loss. Then, the total loss function is formulated as:

$$\mathcal{L} = \lambda \mathcal{L}_{reg} + \mathcal{L}_{cls} + \mathcal{L}_{obj} + \eta \mathcal{L}_{MC}, \tag{21}$$

where $\mathcal{L}_{reg}$ is a regression loss, $\mathcal{L}_{cls}$ denotes a classification loss, and $\mathcal{L}_{obj}$ refers to an IoU loss; $\lambda$ and $\eta$ are two hyper-parameters to balance loss terms. Here, following the setting of YOLOX [56], $\lambda$ is fixed to 5.
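
Eqs. (20) and (21) can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function names are hypothetical, the mean reduction inside the $L_1$ loss is an assumption (the text does not state the reduction), and the detection losses are passed in as plain numbers because their computation belongs to the YOLOX head.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference (mean reduction is an assumption)."""
    return float(np.mean(np.abs(a - b)))

def motion_compensation_loss(aligned, f_t):
    """Eq. (20): sum of L1 distances between each aligned neighbour feature
    F_i^A (i != t) and the extracted current-frame feature F_t^E.
    aligned: dict mapping frame offset i - t to an array shaped like f_t."""
    return sum(l1_loss(f_a, f_t) for f_a in aligned.values())

def total_loss(l_reg, l_cls, l_obj, l_mc, lam=5.0, eta=1.0):
    """Eq. (21) with the paper's settings lambda = 5 and eta = 1 as defaults."""
    return lam * l_reg + l_cls + l_obj + eta * l_mc

# Toy example with R = 2: two perfectly aligned neighbours, two misaligned ones.
f_t = np.zeros((2, 3, 3))
aligned = {
    -2: np.full((2, 3, 3), 0.5),
    -1: np.zeros((2, 3, 3)),
    +1: np.zeros((2, 3, 3)),
    +2: np.full((2, 3, 3), -0.5),
}
mc = motion_compensation_loss(aligned, f_t)
total = total_loss(l_reg=0.1, l_cls=0.2, l_obj=0.3, l_mc=mc)
print(mc, total)
```

Only the misaligned neighbours contribute to the MC term, which is what lets it act as direct supervision for the alignment module.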

IV Experiments

IV-A Datasets and Quantitative Evaluation Metrics

Datasets. Following [23], we conduct extensive experiments on two moving infrared dim-small target detection datasets, i.e., DAUB [64] and IRDST [19]. The DAUB dataset consists of 10 training video sequences with a total of 8982 frames and 7 test video sequences with a total of 4795 frames. The IRDST dataset contains 42 training video sequences with a total of 20398 frames and 43 test video sequences with a total of 20258 frames.

Quantitative Evaluation Metrics. To evaluate the detection performance, four standard evaluation metrics for infrared dim-small target detection are adopted: Precision (Pr), Recall (Re), $F1$ score and average precision (here mAP50, the mean average precision at an IoU threshold of 0.5). Precision, Recall and $F1$ are defined as follows:

$$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \tag{22}$$
$$\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{23}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{24}$$

where TP, FP, and FN denote the numbers of correct predictions (true positives), false detections (false positives), and missed targets (false negatives), respectively. The $F1$ score is the harmonic mean of Pr and Re and therefore summarizes both in a single value.
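
Eqs. (22)-(24) translate directly into a few lines of Python. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the text specifies. The last lines check that the Pr and Re values reported for DFAR in Table I reproduce the reported $F1$ values.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from detection counts, Eqs. (22)-(24)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 from precision and recall given in the same unit (e.g., percent)."""
    return 2 * precision * recall / (precision + recall)

# Made-up counts, for illustration only.
p, r, f = detection_metrics(tp=90, fp=10, fn=30)
print(p, r, f)

# Consistency check against the DFAR rows of Table I.
print(round(f1_from(98.82, 98.62), 2))  # DAUB: 98.72
print(round(f1_from(95.91, 94.66), 2))  # IRDST: 95.28
```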

IV-B Implementation Details

Network settings. Each convolutional layer in the feature extraction module has 48 filters (except for the last layer, which has 64 filters). The kernel size of all deformable convolutions in the network is set to $3 \times 3$; the number of deformable groups in the TDA module and the AGDF block is set to 8 and 32, respectively. There are 4 DCAF blocks in the TDA module and 4 AGDF blocks in the feature refinement module. The other settings of the proposed DFAR method are described in Section III.

Model training. For a fair comparison, following [23], the number of input frames is set to 5 (i.e., temporal radius $R = 2$). During training, the input frames are resized to $544 \times 544$, and the batch size is set to 4. The model is trained with the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 1 \times 10^{-8}$ for 20 epochs. The learning rate is initially set to $1 \times 10^{-4}$ and kept constant throughout training. For the hyper-parameters in Eq. (21), following YOLOX [56], $\lambda$ is set to 5; $\eta$ is selected by grid search and set to 1. The proposed model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 3090 GPU.

TABLE I: Overall Performance Comparison in Terms of mAP50, Precision (Pr), Recall (Re) and $F1$ Score on the DAUB and IRDST Datasets. Red and Blue Colors Indicate the Best and the Second-Best Performance, Respectively.

| Scheme | Methods | Publication | DAUB mAP50 (%) | DAUB Pr (%) | DAUB Re (%) | DAUB F1 (%) | IRDST mAP50 (%) | IRDST Pr (%) | IRDST Re (%) | IRDST F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| single-frame | ACM [32] | WACV 2021 | 72.30 | 76.84 | 95.31 | 85.09 | 67.74 | 81.44 | 84.01 | 82.71 |
| single-frame | RISTD [65] | IEEE GRSL 2022 | 82.73 | 88.54 | 94.41 | 91.38 | 78.92 | 86.56 | 92.63 | 89.49 |
| single-frame | ISNet [66] | CVPR 2022 | 83.43 | 88.64 | 95.04 | 91.73 | 75.00 | 87.78 | 86.81 | 87.29 |
| single-frame | UIUNet [67] | IEEE TIP 2022 | 88.23 | 94.63 | 94.79 | 94.71 | 70.96 | 87.28 | 82.08 | 84.60 |
| single-frame | SANet [68] | ICASSP 2023 | 87.90 | 94.14 | 94.22 | 94.18 | 77.98 | 85.42 | 92.13 | 88.64 |
| single-frame | AGPCNet [69] | IEEE TAES 2023 | 73.08 | 78.49 | 94.41 | 85.72 | 73.86 | 83.77 | 89.18 | 86.39 |
| single-frame | RDIAN [19] | IEEE TGRS 2023 | 83.69 | 90.55 | 93.37 | 91.94 | 71.99 | 84.41 | 86.48 | 85.43 |
| single-frame | DNANet [18] | IEEE TIP 2023 | 89.24 | 95.66 | 94.83 | 95.24 | 76.84 | 90.08 | 86.81 | 88.42 |
| single-frame | SIRST5K [70] | IEEE TGRS 2024 | 88.45 | 94.48 | 94.97 | 94.72 | 72.64 | 86.17 | 85.65 | 85.91 |
| single-frame | MSHNet [71] | CVPR 2024 | 89.23 | 97.27 | 92.26 | 94.70 | 78.50 | 88.89 | 89.63 | 89.26 |
| single-frame | RPCANet [72] | WACV 2024 | 85.75 | 89.12 | 97.58 | 93.16 | 73.29 | 85.02 | 87.13 | 86.06 |
| single-frame | SCTransNet [73] | IEEE TGRS 2024 | 88.26 | 93.50 | 95.50 | 94.53 | 78.27 | 89.67 | 88.43 | 89.05 |
| multi-frame | DTUM [53] | IEEE TNNLS 2023 | 88.24 | 95.15 | 93.60 | 94.37 | 80.98 | 90.62 | 90.46 | 90.54 |
| multi-frame | SSTNet [23] | IEEE TGRS 2024 | 94.33 | 97.77 | 97.91 | 97.84 | 83.25 | 91.13 | 92.24 | 91.68 |
| multi-frame | DFAR (Ours) | - | 96.56 | 98.82 | 98.62 | 98.72 | 89.88 | 95.91 | 94.66 | 95.28 |

Figure 7: PR curve comparison on the DAUB dataset. The larger the area under the curve, the better the method.

IV-C Comparisons with State-of-The-Art Methods

To demonstrate the advantage of our method, we compare it with various learning-based infrared dim-small target detection approaches whose codes are available, including 12 single-frame based detection approaches: ACM [32], RISTD [65], ISNet [66], UIUNet [67], SANet [68], AGPCNet [69], RDIAN [19], DNANet [18], SIRST5K [70], MSHNet [71], RPCANet [72] and SCTransNet [73], and 2 multi-frame based detection approaches: DTUM [53] and SSTNet [23]. For all compared methods, input frames are resized to $544 \times 544$ in both training and testing. Since the datasets we use provide bounding box annotations, to make the comparison as fair as possible, for all single-frame detection methods (except the SANet method) and the DTUM method, which are based on pixel-level segmentation, the YOLOX detection head [56] is added to the output of their networks to generate bounding boxes. Then, these methods are retrained on our training set. For the SSTNet method, we directly run the public training and test codes.

Figure 8: PR curve comparison on the IRDST dataset. The proposed DFAR approach (red curve) obviously outperforms other methods.

IV-C1 Overall Quantitative Comparison

Table I presents the quantitative comparison results on two datasets, averaged over all frames of each test sequence. We can see that our method achieves better results than all the compared methods on two datasets in terms of four standard evaluation metrics, i.e., mAP50, Pr, Re and $F1$.

To be specific, on the DAUB dataset, our method reaches the highest mAP50 of 96.56%, a relative improvement of 2.36% over SSTNet (94.33%); the highest Pr of 98.82%, a relative improvement of 1.07% over SSTNet (97.77%); the highest Re of 98.62%, a relative improvement of 0.73% over SSTNet (97.91%); and the highest $F1$ of 98.72%, a relative improvement of 0.90% over SSTNet (97.84%).

Figure 9: Two groups of visualization comparisons on the DAUB dataset; GT: ground truth. Red box denotes a detected target, and detection region is amplified (blue box).

Similarly, on the IRDST dataset, our method reaches the highest mAP50 of 89.88%, a relative improvement of 7.96% over SSTNet (83.25%); the highest Pr of 95.91%, a relative improvement of 5.25% over SSTNet (91.13%); the highest Re of 94.66%, a relative improvement of 2.62% over SSTNet (92.24%); and the highest $F1$ of 95.28%, a relative improvement of 3.93% over SSTNet (91.68%). Our method thus significantly outperforms the current state-of-the-art method SSTNet on the IRDST dataset. One possible reason is that the video sequences in the IRDST dataset contain larger motions than those in the DAUB dataset, and it is difficult for the LSTM-based SSTNet to implicitly aggregate inter-frame information. This demonstrates the robustness of our DFAR approach, which is based on explicit DCN alignment, in modeling complex and large motions.
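
The percentage gains quoted in this comparison are consistent with relative improvements over SSTNet computed from the Table I values, as the short check below shows. The function name `relative_gain` and the dictionary layout are illustrative only; the numbers are taken from Table I.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# (DFAR, SSTNet) pairs from Table I.
table_i = {
    "DAUB":  {"mAP50": (96.56, 94.33), "Pr": (98.82, 97.77),
              "Re": (98.62, 97.91), "F1": (98.72, 97.84)},
    "IRDST": {"mAP50": (89.88, 83.25), "Pr": (95.91, 91.13),
              "Re": (94.66, 92.24), "F1": (95.28, 91.68)},
}

gains = {
    dataset: {name: round(relative_gain(ours, base), 2)
              for name, (ours, base) in metrics.items()}
    for dataset, metrics in table_i.items()
}
print(gains["DAUB"])   # {'mAP50': 2.36, 'Pr': 1.07, 'Re': 0.73, 'F1': 0.9}
print(gains["IRDST"])  # {'mAP50': 7.96, 'Pr': 5.25, 'Re': 2.62, 'F1': 3.93}
```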

TABLE II: Inference Complexity Comparison on the DAUB Dataset. For a Fair Comparison, All Methods are Retested on an NVIDIA GeForce RTX 3090. The Results are Reported by Model Parameters (Params), Floating-Point Operations (FLOPs) and Frames Per Second (FPS). The Best Results are Marked in Bold.

| Methods | Frames | mAP50 | F1 | Params | FLOPs | FPS | PCR |
|---|---|---|---|---|---|---|---|
| ACM [32] | 1 | 72.30 | 85.09 | 3.01M | 28.17G | 16.39 | 2.57 |
| RISTD [65] | 1 | 82.73 | 91.38 | 3.26M | 92.95G | 9.91 | 0.89 |
| ISNet [66] | 1 | 83.43 | 91.73 | 3.48M | 300.64G | 6.44 | 0.28 |
| UIUNet [67] | 1 | 88.23 | 94.71 | 53.03M | 515.83G | 2.10 | 0.17 |
| SANet [68] | 1 | 87.90 | 94.18 | 12.40M | 47.46G | 6.26 | 1.85 |
| AGPCNet [69] | 1 | 73.08 | 85.72 | 14.85M | 413.61G | 3.17 | 0.18 |
| RDIAN [19] | 1 | 83.69 | 91.94 | 2.71M | 57.37G | 13.91 | 1.46 |
| DNANet [18] | 1 | 89.24 | 95.24 | 7.19M | 152.58G | 3.47 | 0.58 |
| SIRST5K [70] | 1 | 88.45 | 94.72 | 11.28M | 204.88G | 6.64 | 0.43 |
| MSHNet [71] | 1 | 89.23 | 94.70 | 6.56M | 78.77G | 12.42 | 1.13 |
| RPCANet [72] | 1 | 85.75 | 93.16 | 3.18M | 432.27G | 13.13 | 0.20 |
| SCTransNet [73] | 1 | 88.26 | 94.53 | 13.68M | 115.34G | 8.12 | 0.77 |
| DTUM [53] | 5 | 88.24 | 94.37 | 2.79M | 117.10G | 7.58 | 0.75 |
| SSTNet [23] | 5 | 94.33 | 97.84 | 11.95M | 139.53G | 6.71 | 0.68 |
| Ours | 5 | 96.56 | 98.72 | 7.53M | 100.87G | 7.60 | 0.96 |

IV-C2 Precision–recall (PR) Curve Comparison

To comprehensively evaluate performance, we further evaluate the PR curves of all approaches on the DAUB and IRDST datasets. The larger the area under the curve, the better the method. We can see from Figs. 7 and 8 that our approach is obviously superior to other comparative methods on two datasets. In general, the proposed DFAR approach achieves a better balance of precision and recall than other methods.

IV-C3 Inference Complexity Comparison

We use Frames Per Second (FPS), Floating-point Operations (FLOPs) and the number of model parameters to compare inference complexity. As shown in Table II, the proposed DFAR approach is better than the state-of-the-art LSTM-based method SSTNet in terms of both inference speed and model size. In addition, our method has a moderate Performance-cost Ratio (PCR; i.e., mAP50/FLOPs). In particular, the PCR of our method on the DAUB dataset is 0.96, which is 41.18% higher than that of SSTNet (0.68). Although ACM, RISTD, RDIAN and MSHNet have smaller model sizes, fewer FLOPs and faster inference, Tables I and II show that their detection accuracy is clearly lower, so our method achieves a good complexity-performance trade-off.
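
The PCR values can be reproduced from the mAP50 and FLOPs columns of Table II, as in the short sketch below. The function name `pcr` is illustrative. Note that the quoted 41.18% corresponds to the ratio of the rounded PCR values (0.96 and 0.68).

```python
def pcr(map50_percent, gflops):
    """Performance-cost ratio: mAP50 (in percent) divided by FLOPs (in G)."""
    return map50_percent / gflops

ours = round(pcr(96.56, 100.87), 2)     # DFAR row of Table II
sstnet = round(pcr(94.33, 139.53), 2)   # SSTNet row of Table II
gain = round((ours - sstnet) / sstnet * 100.0, 2)

print(ours, sstnet, gain)  # 0.96 0.68 41.18
```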

IV-C4 Visualization Comparison

In Fig. 9 and Fig. 10, we present two groups of detection results for visual comparison on the two datasets, respectively. We can see from Figs. 9 and 10 that our method can often accurately detect moving dim-small targets, while other methods usually result in missed or false detections.

Specifically, on the DAUB dataset, as shown in Fig. 9, our method precisely detects the target in the first group of visualization comparison (the first three rows). However, the ISNet, AGPCNet, RDIAN and RPCANet methods miss the target, and the target bounding boxes generated by the RISTD and UIUNet methods are inaccurate and too large. Furthermore, in the second group of comparison (the fourth to sixth rows), the UIUNet, MSHNet and SSTNet methods miss the target; the RISTD, SANet and DTUM methods produce false detections, and the RISTD method even detects two targets.

Similarly, on the IRDST dataset, as shown in Fig. 10, most methods fail to detect the targets in both groups since the targets are difficult to perceive, while our method can still detect the targets correctly. In summary, these qualitative comparison results verify the superiority of our method.

Figure 10: Two groups of visualization comparisons on the IRDST dataset; GT: ground truth. Red box denotes a detected target, and detection region is amplified (blue box).
TABLE III: Ablation Study of Proposed Temporal Deformable Alignment (TDA) Module, Motion Compensation Loss $\mathcal{L}_{MC}$ and Feature Refinement (FR) Module.

| TDA | $\mathcal{L}_{MC}$ | FR | DAUB mAP50 (%) | DAUB Pr (%) | DAUB Re (%) | DAUB F1 (%) | IRDST mAP50 (%) | IRDST Pr (%) | IRDST Re (%) | IRDST F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| - | - | - | 84.02 | 87.40 | 97.14 | 92.02 | 74.13 | 86.24 | 88.28 | 87.25 |
| ✓ | - | - | 92.62 | 95.40 | 97.87 | 96.62 | 81.44 | 90.50 | 90.86 | 90.68 |
| ✓ | ✓ | - | 93.05 | 96.52 | 97.89 | 97.20 | 82.44 | 91.92 | 92.12 | 92.02 |
| - | - | ✓ | 91.92 | 95.44 | 97.71 | 96.56 | 80.88 | 92.43 | 88.56 | 90.45 |
| ✓ | - | ✓ | 94.69 | 97.79 | 97.93 | 97.86 | 87.14 | 95.14 | 92.28 | 93.69 |
| ✓ | ✓ | ✓ | 96.56 | 98.82 | 98.62 | 98.72 | 89.88 | 95.91 | 94.66 | 95.28 |

IV-D Ablation Study

IV-D1 Effects of Different Assemblies

To evaluate the effectiveness of our TDA module, feature refinement (FR) module and motion compensation loss, we build a baseline model that directly uses a $3 \times 3$ convolutional layer with 320 filters to fuse the concatenated multi-frame features and then feeds the fused feature into the detection head. Then, we insert different components into the baseline and retrain these models with the same experimental settings. The experimental results are presented in Table III.

Effectiveness of TDA module. We insert the TDA module before multi-frame feature fusion to evaluate the effectiveness of explicit feature-level alignment. The results in the 1st and 2nd rows (or the 4th and 5th rows) of Table III demonstrate the effectiveness of our proposed TDA module. In particular, the results in the 4th and 5th rows show that alignment is more important for the IRDST dataset, which contains more videos with large motion. Additionally, to evaluate the effectiveness of the designed DCAF block, we replace the DCAF blocks in the TDA module with eight $3 \times 3$ convolutional layers with 64 filters, which keeps the module parameters and computational cost essentially unchanged. With this modified TDA module, the mAP50 and $F1$ of the baseline model are only 89.86 and 94.18 on the DAUB dataset, and only 78.24 and 88.83 on the IRDST dataset, which demonstrates the effectiveness of the designed DCAF block.

Figure 11: The impact of hyper-parameter $\eta$ in the loss function on two datasets.

Effectiveness of motion compensation loss $\mathcal{L}_{MC}$. Here, we show the necessity of the motion compensation loss $\mathcal{L}_{MC}$. The results in the 5th and 6th rows of Table III show that, even with feature alignment, the performance is still sub-optimal without the supervision of the motion compensation loss. One possible reason is that deformable convolution is inherently difficult to train, and training instability often leads to offset overflow, which degrades the final performance [63]. Therefore, it is necessary to use the motion compensation loss to supervise temporal alignment. We also investigate the impact of the hyper-parameter $\eta$ in the loss function. Fig. 11 shows that the $F1$ score has a stable region and peaks at $\eta = 1.0$ on the DAUB dataset and at $\eta = 1.4$ on the IRDST dataset. Since the average $F1$ over the two datasets is highest at $\eta = 1.0$, the hyper-parameter $\eta$ is set to 1.0.

Effectiveness of feature refinement module. We use the FR module to replace the simple multi-frame fusion operation (i.e., concatenation operation) to evaluate its effectiveness. From the results in the 1st and 4th rows (or the 3rd and 6th rows) of Table III, we can conclude that our feature refinement module contributes substantially to the performance improvement. For a deeper investigation, we conduct an additional ablation study on the branch configuration of the FR module. There are two variants of the FR module: FR without AFS (removing the Adaptive Fusion Structure (AFS) guided by the attention mechanism and using a simple concatenation operation to fuse multiple features), and FR without AGDF (using eight $3 \times 3$ convolutional layers with 64 filters to replace the AGDF blocks). Experimental results are presented in Table IV. We can see that the AFS and the AGDF block each boost the performance of the baseline, but either one alone performs worse than the full FR module.

TABLE IV: Ablation Study on Feature Refinement Module. AFS: Adaptive Fusion Structure Guided by the Attention Mechanism.

| AFS | AGDF | DAUB mAP50 (%) | DAUB F1 (%) | IRDST mAP50 (%) | IRDST F1 (%) |
|---|---|---|---|---|---|
| - | - | 84.02 | 92.02 | 74.13 | 87.25 |
| ✓ | - | 86.81 | 93.51 | 76.88 | 88.63 |
| - | ✓ | 88.23 | 94.28 | 78.23 | 89.30 |
| ✓ | ✓ | 91.92 | 96.56 | 80.88 | 90.45 |

Figure 12: The impact of temporal radius $R$ on two datasets. The performance is positively correlated with the temporal radius $R$, but reaches saturation very quickly and even decreases.

IV-D2 Effectiveness of Temporal Radius $R$

Although we follow the SSTNet method [23] and use 5 frames for model training, an ablation study is also conducted on the temporal radius $R$ to analyze its impact on performance. Due to limited memory, we only study values of $R$ up to 5. As shown in Fig. 12, the performance of our DFAR method generally increases with the temporal radius $R$, but quickly reaches a plateau and even decreases when the number of frames is too large. This phenomenon is more obvious on the IRDST dataset. The reason is that when the temporal radius is too large, the greater motion makes it difficult for the model to align frames effectively, and inaccurate alignment may even introduce new artifacts and degrade performance. After comprehensively considering computational complexity and accuracy, we set $R = 2$.

V Conclusion

We proposed a new end-to-end network for moving infrared dim-small target detection. Specifically, a TDA module based on the designed DCAF block explicitly aligns adjacent frames with the current frame at the feature level to mine temporal information. Furthermore, a feature refinement module with the adaptive fusion structure and AGDF blocks adaptively fuses and refines useful temporal information from the aligned features. In addition, we extend the traditional loss function with a new motion compensation loss to improve the temporal alignment. Both qualitative and quantitative experimental results demonstrate the effectiveness of the proposed DFAR method, which significantly improves detection performance over the state-of-the-art methods. Ablation studies further show the effectiveness of the individual components of our method.

References

  • [1] T. R. Goodall, A. C. Bovik, and N. G. Paulter, “Tasking on natural statistics of infrared images,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 65–79, 2016.
  • [2] L. Li, Q. Hu, and X. Li, “Moving object detection in video via hierarchical modeling and alternating optimization,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2021–2036, 2019.
  • [3] A. Guzman-Pando and M. I. Chacon-Murguia, “Deepfoveanet: Deep fovea eagle-eye bioinspired model to detect moving objects,” IEEE Transactions on Image Processing, vol. 30, pp. 7090–7100, 2021.
  • [4] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, “Minet: Meta-learning instance identifiers for video object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 6879–6891, 2021.
  • [5] L. Jiao, R. Zhang, F. Liu, S. Yang, B. Hou, L. Li, and X. Tang, “New generation deep learning for video object detection: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 8, pp. 3195–3215, 2021.
  • [6] Q. Qi, T. Hou, Y. Lu, Y. Yan, and H. Wang, “Dgrnet: A dual-level graph relation network for video object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 4128–4141, 2023.
  • [7] X. Zhao, H. Liang, P. Li, G. Sun, D. Zhao, R. Liang, and X. He, “Motion-aware memory network for fast video salient object detection,” IEEE Transactions on Image Processing, vol. 33, pp. 709–721, 2024.
  • [8] C. Liu, W. Ding, J. Yang, V. Murino, B. Zhang, J. Han, and G. Guo, “Aggregation signature for small object tracking,” IEEE Transactions on Image Processing, vol. 29, pp. 1738–1747, 2020.
  • [9] S. Deng, S. Li, K. Xie, W. Song, X. Liao, A. Hao, and H. Qin, “A global-local self-adaptive network for drone-view object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 1556–1569, 2021.
  • [10] C. Deng, M. Wang, L. Liu, Y. Liu, and Y. Jiang, “Extended feature pyramid network for small object detection,” IEEE Transactions on Multimedia, vol. 24, pp. 1968–1979, 2022.
  • [11] S. Chen, L. Ji, S. Zhu, and M. Ye, “Micpl: Motion-inspired cross-pattern learning for small-object detection in satellite videos,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2024.
  • [12] X. Bai and F. Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognition, vol. 43, no. 6, pp. 2145–2156, 2010.
  • [13] X. Kong, C. Yang, S. Cao, C. Li, and Z. Peng, “Infrared small target detection via nonconvex tensor fibered rank approximation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–21, 2021.
  • [14] H. Zhu, H. Ni, S. Liu, G. Xu, and L. Deng, “Tnlrs: Target-aware non-local low-rank modeling with saliency filtering regularization for infrared small target detection,” IEEE Transactions on Image Processing, vol. 29, pp. 9546–9558, 2020.
  • [15] H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8509–8518.
  • [16] P. Zhang, L. Zhang, X. Wang, F. Shen, T. Pu, and C. Fei, “Edge and corner awareness-based spatial–temporal tensor model for infrared small-target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 12, pp. 10708–10724, 2021.
  • [17] K. Wang, S. Du, C. Liu, and Z. Cao, “Interior attention-aware network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [18] B. Li, C. Xiao, L. Wang, Y. Wang, Z. Lin, M. Li, W. An, and Y. Guo, “Dense nested attention network for infrared small target detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1745–1758, 2023.
  • [19] H. Sun, J. Bai, F. Yang, and X. Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.
  • [20] F. Lin, K. Bao, Y. Li, D. Zeng, and S. Ge, “Learning contrast-enhanced shape-biased representations for infrared small target detection,” IEEE Transactions on Image Processing, vol. 33, pp. 3047–3058, 2024.
  • [21] X. Liu, X. Li, L. Li, X. Su, and F. Chen, “Dim and small target detection in multi-frame sequence using bi-conv-lstm and 3d-conv structure,” IEEE Access, vol. 9, pp. 135845–135855, 2021.
  • [22] Z. Zhang, P. Gao, S. Ji, X. Wang, and P. Zhang, “Infrared small target detection combining deep spatial-temporal prior with traditional priors,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [23] S. Chen, L. Ji, J. Zhu, M. Ye, and X. Yao, “SSTNet: Sliced spatio-temporal network with cross-slice convlstm for moving infrared dim-small target detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
  • [24] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: More deformable, better results,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
  • [25] S. D. Deshpande, M. H. Er, R. Venkateswarlu, and P. Chan, “Max-mean and max-median filters for detection of small targets,” in Signal and Data Processing of Small Targets 1999, vol. 3809. SPIE, 1999, pp. 74–83.
  • [26] C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4996–5009, 2013.
  • [27] C. P. Chen, H. Li, Y. Wei, T. Xia, and Y. Y. Tang, “A local contrast method for small infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 574–581, 2013.
  • [28] A. Novak, N. Armstrong, T. Caelli, and I. Blair, “Bayesian contrast measures and clutter distribution determinants of human target detection,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1115–1126, 2017.
  • [29] S. Moradi, P. Moallem, and M. F. Sabahi, “Fast and robust small infrared target detection using absolute directional mean difference algorithm,” Signal Processing, vol. 177, p. 107727, 2020.
  • [30] X. Zhou, P. Li, Y. Zhang, X. Lu, and Y. Hu, “Deep low-rank and sparse patch-image network for infrared dim and small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023.
  • [31] M. Liu, H.-y. Du, Y.-j. Zhao, L.-q. Dong, M. Hui, and S. Wang, “Image small target detection based on deep learning with snr controlled sample generation,” Current Trends in Computer Science and Mechanical Automation, vol. 1, pp. 211–220, 2017.
  • [32] Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Asymmetric contextual modulation for infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 950–959.
  • [33] ——, “Attentional local contrast networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 11, pp. 9813–9824, 2021.
  • [34] Y. Zhang, Y. Zhang, Z. Shi, R. Fu, D. Liu, Y. Zhang, and J. Du, “Enhanced cross-domain dim and small infrared target detection via content-decoupled feature alignment,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
  • [35] F. Liu, C. Gao, F. Chen, D. Meng, W. Zuo, and X. Gao, “Infrared small and dim target detection with transformer under complex backgrounds,” IEEE Transactions on Image Processing, vol. 32, pp. 5921–5932, 2023.
  • [36] J. Lin, S. Li, L. Zhang, X. Yang, B. Yan, and Z. Meng, “IR-TransDet: Infrared dim and small target detection with ir-transformer,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [37] Y. Dai, X. Li, F. Zhou, Y. Qian, Y. Chen, and J. Yang, “One-stage cascade refinement networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–17, 2023.
  • [38] T. Liu, Q. Yin, J. Yang, Y. Wang, and W. An, “Combining deep denoiser and low-rank priors for infrared small target detection,” Pattern Recognition, vol. 135, p. 109184, 2023.
  • [39] T. Ma, H. Wang, J. Liang, J. Peng, Q. Ma, and Z. Kai, “Msma-net: An infrared small target detection network by multiscale super-resolution enhancement and multilevel attention fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024.
  • [40] T. Chen, Z. Tan, Q. Chu, Y. Wu, B. Liu, and N. Yu, “Tci-former: Thermal conduction-inspired transformer for infrared small target detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 1201–1209.
  • [41] Y. Huang, X. Zhi, J. Hu, L. Yu, Q. Han, W. Chen, and W. Zhang, “Fddba-net: Frequency domain decoupling bidirectional interactive attention network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
  • [42] H. Yang, J. Liu, Z. Wang, Z. Fu, Q. Tan, and S. Niu, “Mapff: Multiangle pyramid feature fusion network for infrared dim small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
  • [43] Y. Sun, J. Yang, Y. Long, and W. An, “Infrared small target detection via spatial-temporal total variation regularization and weighted tensor nuclear norm,” IEEE Access, vol. 7, pp. 56667–56682, 2019.
  • [44] C. Kwan and B. Budavari, “Enhancing small moving target detection performance in low-quality and long-range infrared videos using optical flow techniques,” Remote Sensing, vol. 12, no. 24, p. 4024, 2020.
  • [45] M. Uzair, R. S. Brinkworth, and A. Finn, “Bio-inspired video enhancement for small moving target detection,” IEEE Transactions on Image Processing, vol. 30, pp. 1232–1244, 2021.
  • [46] G. Wang, B. Tao, X. Kong, and Z. Peng, “Infrared small target detection using nonoverlapping patch spatial–temporal tensor factorization with capped nuclear norm regularization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2021.
  • [47] T. Liu, J. Yang, B. Li, C. Xiao, Y. Sun, Y. Wang, and W. An, “Nonconvex tensor low-rank approximation for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2021.
  • [48] F. Wu, H. Yu, A. Liu, J. Luo, and Z. Peng, “Infrared small target detection using spatio-temporal 4d tensor train and ring unfolding,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [49] H. Wang, Z. Zhong, F. Lei, J. Peng, and S. Yue, “Bio-inspired small target motion detection with spatio-temporal feedback in natural scenes,” IEEE Transactions on Image Processing, vol. 33, pp. 451–465, 2024.
  • [50] J. Du, D. Li, Y. Deng, L. Zhang, H. Lu, M. Hu, X. Shen, Z. Liu, and X. Ji, “Multiple frames based infrared small target detection method using cnn,” in Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021, pp. 1–6.
  • [51] J. Du, H. Lu, L. Zhang, M. Hu, S. Chen, Y. Deng, X. Shen, and Y. Zhang, “A spatial-temporal feature-based detection framework for infrared dim small target,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2021.
  • [52] P. Yan, R. Hou, X. Duan, C. Yue, X. Wang, and X. Cao, “Stdmanet: Spatio-temporal differential multiscale attention network for small moving infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
  • [53] R. Li, W. An, C. Xiao, B. Li, Y. Wang, M. Li, and Y. Guo, “Direction-coded temporal u-shape module for multiframe infrared small target detection,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [54] H. Deng, Y. Zhang, Y. Li, K. Cheng, and Z. Chen, “Bemst: Multiframe infrared small-dim target detection using probabilistic estimation of sequential backgrounds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.
  • [55] X. Tong, Z. Zuo, S. Su, J. Wei, X. Sun, P. Wu, and Z. Zhao, “St-trans: Spatial-temporal transformer for infrared small target detection in sequential images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–19, 2024.
  • [56] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
  • [57] H. Song, W. Xu, D. Liu, B. Liu, Q. Liu, and D. N. Metaxas, “Multi-stage feature fusion network for video super-resolution,” IEEE Transactions on Image Processing, vol. 30, pp. 2923–2934, 2021.
  • [58] W. Yan, L. Xu, W. Yang, and R. T. Tan, “Feature-aligned video raindrop removal with temporal constraints,” IEEE Transactions on Image Processing, vol. 31, pp. 3440–3448, 2022.
  • [59] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
  • [60] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
  • [61] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 286–301.
  • [62] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
  • [63] K. C. K. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy, “Understanding deformable alignment in video super-resolution,” in AAAI, 2021.
  • [64] B. Hui, Z. Song, H. Fan, P. Zhong, W. Hu, X. Zhang, J. Lin, H. Su, W. Jin, Y. Zhang et al., “A dataset for infrared image dim-small aircraft target detection and tracking under ground/air background,” Sci. Data Bank, vol. 5, no. 12, p. 4, 2019.
  • [65] Q. Hou, Z. Wang, F. Tan, Y. Zhao, H. Zheng, and W. Zhang, “Ristdnet: Robust infrared small target detection network,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
  • [66] M. Zhang, R. Zhang, Y. Yang, H. Bai, J. Zhang, and J. Guo, “ISNet: Shape matters for infrared small target detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 877–886.
  • [67] X. Wu, D. Hong, and J. Chanussot, “UIU-Net: U-net in u-net for infrared small object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 364–376, 2022.
  • [68] J. Zhu, S. Chen, L. Li, and L. Ji, “Sanet: Spatial attention network with global average contrast learning for infrared small target detection,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [69] T. Zhang, L. Li, S. Cao, T. Pu, and Z. Peng, “Attention-guided pyramid context networks for detecting infrared small target under complex background,” IEEE Transactions on Aerospace and Electronic Systems, 2023.
  • [70] Y. Lu, Y. Lin, H. Wu, X. Xian, Y. Shi, and L. Lin, “Sirst-5k: Exploring massive negatives synthesis with self-supervised learning for robust infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024.
  • [71] Q. Liu, R. Liu, B. Zheng, H. Wang, and Y. Fu, “Infrared small target detection with scale and location sensitivity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17490–17499.
  • [72] F. Wu, T. Zhang, L. Li, Y. Huang, and Z. Peng, “Rpcanet: Deep unfolding rpca based infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4809–4818.
  • [73] S. Yuan, H. Qin, X. Yan, N. Akhtar, and A. Mian, “Sctransnet: Spatial-channel cross transformer network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.