Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection

Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji, Shuai Li, and Mao Ye^∗

This work was supported in part by the National Natural Science Foundation of China (62276048) and Chengdu Science and Technology Projects (2023-YF06-00009-HZ). Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji and Mao Ye are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, P.R. China (e-mail: dengyanluo@126.com; xiangyp@uestc.edu.cn; wanghu0833cv@gmail.com; jiluping@uestc.edu.cn; cvlab.uestc@gmail.com). Shuai Li is with the School of Control Science and Engineering, Shandong University, Jinan 250000, P.R. China (e-mail: shuaili@sdu.edu.cn). ^∗Corresponding author.

Abstract

The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. The current state-of-the-art methods are mainly based on ConvLSTM to aggregate information from adjacent frames to facilitate the detection of the current frame. However, these methods implicitly utilize motion information only in the training stage and fail to explicitly explore motion compensation, resulting in poor performance in the case of a video sequence including large motion. In this paper, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages. Specifically, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Then, the feature refinement module adaptively fuses the aligned features and further aggregates useful spatio-temporal information by means of the proposed Attention-guided Deformable Fusion (AGDF) block. In addition, to improve the alignment of adjacent frames with the current frame, we extend the traditional loss function by introducing a new motion compensation loss. Extensive experimental results demonstrate that the proposed DFAR method achieves the state-of-the-art performance on two benchmark datasets including DAUB and IRDST.

Index Terms:

Moving infrared dim-small target, target detection, multi-frame, deformable convolution, motion compensation loss.

I Introduction

Moving infrared dim-small target detection plays an important role in numerous practical applications, such as civil and military surveillance, because harsh external environmental conditions do not degrade the quality of infrared images [1]. Although video target detection [2, 3, 4, 5, 6, 7] and small object detection [8, 9, 10, 11] have developed rapidly in recent years, video infrared dim-small target detection remains challenging for the following reasons: 1) Owing to the long imaging distance, the infrared target is extremely small compared to the background. 2) The low intensity makes it difficult to accurately extract the shape and texture information of targets from complex backgrounds. 3) In actual scenarios, sea surfaces, buildings, clouds, and other clutter can easily disturb or submerge small targets. Therefore, it is necessary to study moving infrared dim-small target detection.

In the past few decades, researchers have performed many studies on infrared dim-small target detection. The early work is mainly based on traditional paradigms, such as background modeling [12] and data structure [13, 14]. Although traditional methods have achieved satisfactory performance, there are too many hand-crafted components, which makes their application impractical in complex scenarios. In contrast, learning-based methods can use deep neural networks to globally optimize the model on a large number of training samples. Therefore, learning-based methods are becoming increasingly popular, greatly promoting the development of infrared dim-small target detection.

According to the number of input frames used, existing learning-based methods can be categorized into single-frame based approaches and multi-frame based approaches. The early works [15, 16, 17, 18, 19, 20] focused on single-frame target detection, that is, only one frame is input into the network at a time. Although they can improve detection performance to a certain extent, their performance is still limited due to ignoring the temporal information implied in consecutive frames. To explore the temporal information, multi-frame target detection methods are proposed [21, 22, 23]. This kind of scheme allows the network to not only model intra-frame spatial features, but also extract inter-frame temporal information of the target to enhance feature representation.

Figure 1: Illustration of the differences between our method and the state-of-the-art method SSTNet [23]. The LSTM-based SSTNet implicitly aggregates the information from adjacent frames only in the training stage, while our method utilizes deformable convolution (DCN) to perform explicit inter-frame feature alignment in the training and inference stages, and applies the introduced Motion Compensation (MC) loss to supervise the temporal alignment.

The existing multi-frame detection methods are mainly based on a 3D convolutional structure [21, 22] or a Convolutional Long Short-Term Memory (ConvLSTM) structure [23]. However, these methods implicitly utilize motion information and fail to explicitly explore motion compensation, which limits the network's ability to model complex and large-scale motion. Furthermore, the current state-of-the-art method SSTNet [23] only utilizes ConvLSTM to improve network optimization in the training stage, so the frame to be detected does not actually use the past-future frame information in the inference stage, as illustrated in Fig. 1. In addition, models based on 3D convolution or LSTM struggle to achieve a good trade-off between computational cost and detection performance.

To alleviate the above issues, in this work, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution (DCN) [24] to explore motion context information during both the training and inference stages, as presented in Fig. 1. Our motivation stems from two observations: in moving infrared dim-small target detection, there are many cases in which the target cannot be detected in the current frame but is easier (or harder) to perceive in some adjacent frames; and DCN has strong capabilities in modeling geometric transformations.

Therefore, to adaptively utilize useful information from adjacent frames, two main parts are devised. On the one hand, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Specifically, the TDA module uses features from both the current frame and the adjacent frame to dynamically predict the sampling offsets of the convolution kernels. More specifically, the DCAF block uses channel attention to fuse multi-scale features extracted by multiple dilated convolutions, giving the predicted offsets an adaptive receptive field. Then, the dynamic kernels are applied to features from adjacent frames to perform the temporal alignment. On the other hand, the feature refinement module adaptively fuses the feature of the current frame with the aligned adjacent features, and further aggregates effective spatio-temporal information through the proposed Attention-guided Deformable Fusion (AGDF) blocks. In particular, the AGDF block adopts a pyramid offset generation scheme and fuses multi-scale deformable offsets at the pixel level, which gives the model the ability to implicitly model complex and large motions. In addition, to improve the alignment effect, we introduce a new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ that measures the $L_{1}$ distance between the aligned adjacent features and the current frame feature.

The main contributions of this paper can be summarized as follows: (1) A new Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution is proposed to mine the temporal information implied in consecutive frames, and to effectively align the target frame with each of its adjacent frames through a designed Temporal Deformable Alignment (TDA) module. (2) We propose a feature refinement module to adaptively fuse the aligned adjacent features, and further explore valuable spatio-temporal details from the fused aligned features with the proposed Attention-guided Deformable Fusion (AGDF) blocks. (3) A new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ is proposed to supervise the alignment of adjacent frames with the current frame.

Experimental results demonstrate that the proposed DFAR method achieves superior performance compared with the state-of-the-art methods in both quantitative and qualitative evaluations. The remainder of the paper is organized as follows: Section II introduces the related works, including single-frame based infrared small target detection schemes and multi-frame based infrared small target detection schemes. Section III elaborates on the structure and details of the proposed method. Section IV shows comparative experiments and ablation studies on two benchmark datasets. Finally, Section V draws conclusions.

II Related Works

II-A Single-frame Infrared Small Target Detection

According to the number of frames used, infrared small target detection schemes can be classified into two categories: single-frame methods and multi-frame methods. Single-frame infrared small target detection aims to accurately detect targets in a single infrared image, which can be divided into two categories: traditional methods [12, 13, 14, 25, 26, 27, 28, 29, 30] and deep learning-based methods [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42].

Traditional methods can be further categorized into background modeling, data structure and target feature methods. Background modeling methods estimate and suppress the background effect, such as top-hat [12] and max-mean [25]. Data structure methods are based on the sparsity of the target and the low rank of the background to separate the target and the background, such as TNLRS [14] and IPI [26]. Target feature methods are based on the feature differences between the target and its neighboring regions to detect the target, such as LCM [27] and WSLCM [29].

Since the introduction of the Multi-layer Perceptron (MLP) network into small target detection by Liu et al. [31], deep-learning-based algorithms have received much research attention. To take advantage of the Convolutional Neural Network (CNN), for example, Dai et al. [32] proposed ACM to aggregate low-level and deep-level features. To explicitly establish long-range contextual information, Liu et al. [35] designed a transformer-based method for infrared small target detection. Later on, Lin et al. [36] designed IR-TransDet to integrate the benefits of the CNN and the transformer to properly extract global semantic information and features of small targets. Although these methods have performed well in single-image infrared dim-small target detection, they are less effective in video infrared dim-small target detection because they ignore temporal information.

II-B Multi-frame Infrared Small Target Detection

To simultaneously utilize spatio-temporal information, multi-frame detection methods have been proposed in recent years. The early multi-frame methods [43, 44, 45, 46, 47, 48, 49] are not learning-based either, and they mainly rely on tensor optimization. For instance, Sun et al. [43] developed STTV-WNIPT, which combines spatial and temporal information to separate the target and background. Then, Kwan et al. [44] proposed to use optical flow to improve detection performance. Later on, Wu et al. [48] proposed to construct a 4-D spatio-temporal tensor and decompose it into a low-dimensional tensor. However, these methods are heavily dependent on traditional priors and handcrafted features, resulting in poor detection performance in complex scenes with clutter and noise.

To overcome the above weakness, learning-based multi-frame detection methods [21, 22, 23, 50, 51, 52, 53, 54, 55] have been proposed. For example, Du et al. [51] proposed STFBD, which uses multiple frames for video infrared dim-small target detection. Yan et al. [52] then designed STDMANet to explore temporal multi-scale features. However, these methods directly concatenate multiple frames or features to construct spatio-temporal tensors, resulting in crude fusion of motion information. Later on, Zhang et al. [22] utilized 3D CNN and traditional priors to mine motion information. Meanwhile, Li et al. [53] developed DTUM, based on 3D CNN, to encode the motion direction into features and extract the motion information of targets. After that, Tong et al. [55] introduced ST-Trans, based on the transformer, to learn the spatio-temporal dependencies between successive frames of small infrared targets. More recently, Chen et al. [23] proposed SSTNet, based on ConvLSTM, for multi-frame infrared small target detection. Although these works have achieved state-of-the-art performance, they model motion information implicitly rather than compensating for it explicitly, which limits their performance. Moreover, models based on 3D CNN, transformer and LSTM tend to have heavy parameters and computation. In addition, the SSTNet method uses ConvLSTM to aggregate past–current–future frames only in the training stage and not in the inference stage, which may also lead to overfitting on the training dataset. In contrast, we directly incorporate deformable alignment into our module, allowing explicit guidance during both the training and inference stages, thus achieving better performance and faster speed than the existing learning-based multi-frame methods.

III The Proposed Method

III-A Overview

Given a video clip of $2R+1$ consecutive frames $I_{[t-R,t+R]}$ , the middle frame $I_{t}\in\mathbb{R}^{C\times H\times W}$ is the target frame to be detected and the other frames are the reference frames. Here, $R$ is the temporal radius (i.e., the number of input frames is $2\times R+1$ ), $C$ refers to the channel number, and $H\times W$ denotes the frame size. Our goal is to improve detection performance by enabling the network to learn motion context features. The overall structure of our DFAR method is shown in Fig. 2, which consists of four modules: a feature extraction module, a Temporal Deformable Alignment (TDA) module based on Dilated Convolution Attention Fusion (DCAF) blocks for feature alignment, a feature refinement module based on the attention weight block and the proposed Attention-guided Deformable Fusion (AGDF) blocks, and a detection head module for target detection.

As shown in Fig. 2, firstly, a feature extraction module is applied to extract the spatial information for each frame from the input clip $I_{[t-R,t+R]}$ and obtain the extracted features $F_{[t-R,t+R]}^{E}\in\mathbb{R}^{c\times h\times w}$ , where $c$ , $h$ , and $w$ denote the channel, height, and width of the feature $F_{i}^{E}$ , respectively. The feature extraction can be represented as:

F_{[t-R,t+R]}^{E}=FE(I_{[t-R,t+R]}),

(1)

where $FE(\cdot)$ denotes the feature extraction module, which is a three-level pyramid structure, and each level contains a standard $3\times 3$ convolutional layer with a stride of 1 and a $3\times 3$ convolutional layer with a stride of 2 for downsampling. Then, each adjacent frame feature $F_{i}^{E}$ enters the TDA module along with the current frame feature $F_{t}^{E}$ for temporal alignment:

F_{i}^{A}=TDA(F_{i}^{E},F_{t}^{E}),i\in[t-R,t+R]\text{ and }i\neq t,

(2)

where $F_{i}^{A}$ is the aligned feature of the $i$ -th frame. Furthermore, each aligned feature $F_{i}^{A}$ and the extracted target feature $F_{t}^{E}$ are input into the feature refinement module to further gather valuable spatio-temporal information:

F_{D}=FR(F_{i}^{A},F_{t}^{E}),

(3)

where $FR(\cdot)$ denotes the feature refinement module, and $F_{D}$ is the refined feature. Finally, following SSTNet [23], the refined feature $F_{D}$ is input to the detection head YOLOX [56] for target detection. It should be noted that for the first few frames and the last few frames in a video sequence, where the number of available adjacent frames is less than $2\times R$ , we pad the clip by repeating the target frame until there are $2\times R$ adjacent frames.
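The boundary padding described above can be sketched as follows. This is only an illustration under one reading of the rule: any neighbor index that falls outside the sequence is replaced by the target index itself. The function name `clip_indices` and its arguments are hypothetical and do not come from the paper or its code.

```python
def clip_indices(t, num_frames, radius):
    """Return the 2R+1 frame indices of the clip centred on frame t.

    Assumption: neighbours outside [0, num_frames) are replaced by t,
    i.e. the clip is padded with copies of the target frame.
    """
    indices = []
    for i in range(t - radius, t + radius + 1):
        indices.append(i if 0 <= i < num_frames else t)
    return indices


print(clip_indices(0, 100, 2))   # [0, 0, 0, 1, 2]
print(clip_indices(50, 100, 2))  # [48, 49, 50, 51, 52]
print(clip_indices(99, 100, 2))  # [97, 98, 99, 99, 99]
```

With this convention every clip has exactly $2R+1$ entries, so the network input shape does not depend on the position of the target frame in the sequence.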

The details of the TDA module and the feature refinement module are explained in the following Sec. III-B and Sec. III-C, while the proposed Motion Compensation (MC) loss $\mathcal{L}_{MC}$ is presented in Sec. III-D.

III-B Temporal Deformable Alignment Module

The motivation for inter-frame alignment comes from our observation that in the moving infrared dim-small target detection task, there are many situations where it is not easy to detect the target in the current frame, but the related target information exists in adjacent frames. Fig. 3 shows an example of this case, where the target in frame 807 is difficult to detect, whereas the target in frame 806 is much more easily perceived. Moreover, deformable convolution (DCN) has shown promising performance in capturing the motion cues of targets in some low-level vision tasks such as video super-resolution [57] and video deraining [58]. Therefore, we design a TDA module based on DCN to aggregate temporal information.

Concretely, firstly, the adjacent feature $F_{i}^{E}$ and the target feature $F_{t}^{E}$ are concatenated in the channel dimension and then passed through a $3\times 3$ convolutional layer to make the number of channels consistent:

F_{ti}=Conv([F_{t}^{E},F_{i}^{E}]),

(4)

where $[\cdot,\cdot]$ denotes the concatenation operation in the channel axis. Then, inspired by the residual dense network [59], we design a Dilated Convolution Attention Fusion (DCAF) block as the basic block to further integrate the temporal information:

F_{ti}^{{}^{\prime}}=(DCAF)^{4}(F_{ti}).

(5)

The stacked blocks allow the network to have a large enough receptive field to aggregate information from distant spatial locations. Finally, the aggregated feature $F_{ti}^{{}^{\prime}}$ is input to a $3\times 3$ convolutional layer to generate the corresponding deformable sampling parameters for alignment:

\Theta=Conv(F_{ti}^{{}^{\prime}}),

(6)

where $\Theta\in\mathbb{R}^{d\times 2K^{2}\times h\times w}$ is the offset field for the deformable convolutional kernel; $d$ and $K^{2}$ denote the number of deformable groups and the number of sampling locations of the deformable convolution kernel, respectively. Then, deformable convolution with the predicted offsets $\Theta$ is applied to the feature $F_{i}^{E}$ to get the aligned feature $F_{i}^{A}$ :

F_{i}^{A}(p)=\sum_{k=1}^{K^{2}}\omega_{k}\cdot F_{i}^{E}(p+p_{k}+\Delta p_{k}),

(7)

where $p_{k}$ denotes the $k$ -th location of the regular sampling grid with $K^{2}$ sampling locations, and $\omega_{k}$ represents the kernel weight of the $k$ -th sampling location; $\Delta p_{k}$ is the learnable offset for the $k$ -th location at position $p$ , that is $\Theta=\{\Delta p_{k}\}$ . As $p+p_{k}+\Delta p_{k}$ can be fractional, bilinear interpolation is adopted as in [60]. It should be noted that the above process only describes the prediction and application of offsets. In fact, we also generate and apply masks for deformable convolution [24].
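To make Eq. (7) concrete, the following NumPy sketch evaluates the deformable sum at a single location of a single-channel map with one deformable group. It is a simplified illustration, not the authors' implementation: the modulation masks are omitted, samples outside the map are treated as zero (an assumption), and the helper names `bilinear` and `deform_conv_at` are hypothetical.

```python
import numpy as np


def bilinear(feat, y, x):
    """Bilinearly sample a 2-D map at fractional (y, x); outside the map counts as zero."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            value += weight * feat[yy, xx]
    return value


def deform_conv_at(feat, weights, offsets, p):
    """Evaluate Eq. (7) at one location p = (py, px).

    feat:    (h, w) single-channel feature map, standing in for F_i^E
    weights: (K*K,) kernel weights omega_k
    offsets: (K*K, 2) offsets (dy, dx), standing in for Delta p_k at p
    """
    k = int(round(np.sqrt(len(weights))))
    r = k // 2
    # Regular grid p_k in row-major order, e.g. (-1,-1) ... (1,1) for K = 3.
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = 0.0
    for w_k, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        out += w_k * bilinear(feat, p[0] + gy + oy, p[1] + gx + ox)
    return out


feat = np.arange(25, dtype=float).reshape(5, 5)
weights = np.zeros(9)
weights[4] = 1.0                      # identity kernel: only the centre tap is active
offsets = np.zeros((9, 2))
print(deform_conv_at(feat, weights, offsets, (2, 2)))  # 12.0
offsets[4] = (0.0, 0.5)               # shift the centre tap half a pixel to the right
print(deform_conv_at(feat, weights, offsets, (2, 2)))  # 12.5
```

With zero offsets the operation reduces to an ordinary convolution; a half-pixel offset on the centre tap returns the average of the two horizontally adjacent values, which is the bilinear interpolation step mentioned above.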

The structure of the DCAF block is depicted in Fig. 4. In detail, to reduce computational cost while ensuring detection performance, we first use a $3\times 3$ convolutional layer to halve the number of channels. Then, 4 dilated convolutions with a dilation rate from 1 to 4 are used to obtain feature maps with different receptive fields. These feature maps are hierarchically added before concatenating them with the original input feature to acquire an effective receptive field. After that, the channel attention [61] and a $1\times 1$ convolutional layer are utilized to fuse the concatenated multi-scale feature and restore the input channels of the DCAF block. The channel attention mechanism can play a selection role for different receptive fields, which enables our TDA module to adaptively model different degrees of motion between frames. Finally, the local skip connection with residual scaling is applied to complete our DCAF block.
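As a quick check on why dilation rates 1 to 4 yield different receptive fields, the spatial extent covered by a single dilated kernel follows the standard relation $k_{\text{eff}}=k+(k-1)(r-1)$ for kernel size $k$ and dilation rate $r$. The sketch below assumes $3\times 3$ dilated kernels, which the text does not state explicitly, and the function name is illustrative.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


sizes = [effective_kernel(3, d) for d in range(1, 5)]
print(sizes)  # [3, 5, 7, 9]
```

Under that assumption the four branches cover 3, 5, 7 and 9 pixels per side, which is the range of scales the channel attention then selects among.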

Remark. Although optical flow can also be used for explicit temporal alignment, optical flow estimation only predicts one offset for each coordinate, and this single-coordinate single-offset mechanism severely restricts the modeling ability in more complex scenarios. In addition, per-pixel motion estimation often incurs a heavy computational load. In contrast, our method aligns the target frame with adjacent frames at the feature level, which gives the network strong capability and flexibility to handle various motion conditions in temporal scenes. We visualize the feature maps in Fig. 5 to intuitively illustrate the effectiveness of the TDA module. We can see from Fig. 5 that after alignment, the target feature in frame 806 is closer to the target feature in frame 807. This makes it easier for the network to perceive the target with the help of information from adjacent frames. At the same time, the object feature in frame 808 that is prone to introducing new artifacts becomes more easily perceived.

III-C Feature Refinement Module

Although we explicitly explore motion information, ineffective alignment can lead to worse detection results if adjacent frames are too blurry. As shown in Fig. 3, the target is harder to detect in frame 808 than in frame 807, and the alignment may introduce new artifacts, making the target more susceptible to disturbance. Therefore, we design a feature refinement module to adaptively fuse and improve useful temporal information from the aligned features.

As shown in Fig. 2, we first aggregate the aligned features $F_{i}^{A}$ with the extracted feature $F_{t}^{E}$ via a concatenation operation followed by a 1 $\times$ 1 convolutional layer to derive the fused alignment feature $F_{f}^{a}$ :

F_{f}^{a}=Conv([F_{t-R}^{A},\ldots,F_{t-1}^{A},F_{t}^{E},F_{t+1}^{A},\ldots,F_{t+R}^{A}]).

(8)

Then, a global average pooling operation and two $1\times 1$ convolutional layers are adopted to generate the attention weights $W_{2R+1}$ :

W_{2R+1}=Conv_{2R+1}(Conv(GAP(F_{f}^{a}))),

(9)

where $GAP(\cdot)$ represents the global average pooling operation, and $Conv_{2R+1}(\cdot)$ denotes the convolution operation required to generate weights $W_{2R+1}$ for the features to be aggregated. After that, the adaptive fusion weights $W_{2R+1}$ are element-wise multiplied with the features $F_{2R+1}=\{F_{t-R}^{A},\ldots,F_{t-1}^{A},F_{t}^{E},F_{t+1}^{A},\ldots,F_{t+R}^{A}\}$ :

\widetilde{F}_{2R+1}=F_{2R+1}\otimes W_{2R+1},

(10)

where $\otimes$ refers to the element-wise multiplication operation. Finally, the modulated features $\widetilde{F}_{2R+1}$ are concatenated in the channel dimension and then passed through a 1 $\times$ 1 bottleneck convolution to obtain the fused coarse feature $F_{f}^{c}$ :

F_{f}^{c}=Conv([\widetilde{F}_{t-R},\cdots,\widetilde{F}_{t-1},\widetilde{F}_{t},\widetilde{F}_{t+1},\cdots,\widetilde{F}_{t+R}]).

(11)
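The data flow of Eqs. (8) to (11) can be traced with a small NumPy sketch. Several details here are assumptions rather than facts from the paper: each frame is assumed to receive one scalar weight, a ReLU is placed between the two $1\times 1$ convolutions, and a sigmoid normalizes the weights. All function and parameter names are hypothetical, and the random matrices merely stand in for learned $1\times 1$ convolution kernels.

```python
import numpy as np


def conv1x1(x, weight):
    """Apply a 1x1 convolution: x is (c_in, h, w), weight is (c_out, c_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def adaptive_fuse(features, w_fuse, w_hidden, w_out, w_bottleneck):
    """features: list of 2R+1 arrays of shape (c, h, w), target feature in the middle."""
    fused = conv1x1(np.concatenate(features, axis=0), w_fuse)          # Eq. (8)
    pooled = fused.mean(axis=(1, 2), keepdims=True)                    # GAP -> (c, 1, 1)
    hidden = np.maximum(conv1x1(pooled, w_hidden), 0.0)                # ReLU (assumption)
    logits = conv1x1(hidden, w_out)                                    # (2R+1, 1, 1)
    weights = 1.0 / (1.0 + np.exp(-logits))                            # Eq. (9), sigmoid (assumption)
    modulated = [f * weights[i] for i, f in enumerate(features)]       # Eq. (10)
    coarse = conv1x1(np.concatenate(modulated, axis=0), w_bottleneck)  # Eq. (11)
    return coarse, weights.reshape(-1)


rng = np.random.default_rng(0)
R, c, h, w = 2, 4, 6, 6
n = 2 * R + 1
features = [rng.standard_normal((c, h, w)) for _ in range(n)]
coarse, weights = adaptive_fuse(
    features,
    w_fuse=rng.standard_normal((c, n * c)),
    w_hidden=rng.standard_normal((c, c)),
    w_out=rng.standard_normal((n, c)),
    w_bottleneck=rng.standard_normal((c, n * c)),
)
print(coarse.shape, weights.shape)  # (4, 6, 6) (5,)
```

The point of the sketch is the shape bookkeeping: the pooled descriptor produces $2R+1$ weights, each weight rescales one frame's feature, and the bottleneck convolution maps the concatenation back to $c$ channels.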

The obtained feature $F_{f}^{c}$ contains both intra-frame spatial relation and inter-frame temporal information, and the temporal dependence among frames is concatenated along the channel dimension. Therefore, we need to design an effective fusion structure to enable the network to dynamically aggregate spatio-temporal information. Fortunately, the attention mechanism can highlight the important information and suppress the less useful information. Thus, we propose a spatial attention and channel attention guided AGDF block to enable the network to concentrate on more valuable information and enhance discriminative learning ability.

The structure of the AGDF block is illustrated in Fig. 6. Assuming that the input feature map of the AGDF block is $F_{x}$ , we first use a $3\times 3$ convolutional layer to halve the number of channels for low computational cost:

F_{x}^{{}^{\prime}}=Conv(F_{x}).

(12)

Then, spatial attention and channel attention [62] are applied to aggregate spatial information and temporal dependence, respectively:

F_{x}^{s}=SA(F_{x}^{{}^{\prime}}),

(13)

F_{x}^{c}=CA(F_{x}^{{}^{\prime}}),

(14)

where $SA(\cdot)$ and $CA(\cdot)$ represent spatial attention and channel attention operations, respectively. Furthermore, a $1\times 1$ convolutional layer is adopted to combine two branch features and make the number of channels consistent:

F_{x}^{sc}=Conv([F_{x}^{s},F_{x}^{c}]).

(15)

After that, we utilize a multi-scale structure to predict offsets and perform deformable fusion. Specifically, we use a strided convolution to downsample $F_{x}^{sc}$ by a factor of 2 and predict offsets at two different scales:

\Theta_{1}^{sc}=Conv(F_{x}^{sc}),

(16)

\Theta_{2}^{sc}=Conv(SConv(F_{x}^{sc})).

(17)

Then, deformable convolution with the fused multi-scale offsets is applied to the mixed spatio-temporal feature $F_{x}^{sc}$ to further fuse spatio-temporal information:

F_{x}^{f}=DCN(F_{x}^{sc},(\Theta_{2}^{sc})^{\uparrow 2}+\Theta_{1}^{sc}),

(18)

where $(\cdot)^{\uparrow 2}$ refers to upscaling by a factor of 2. The deformable convolution with pyramid generated offsets allows the AGDF block to have a larger and adaptive receptive field to aggregate information. Finally, a $3\times 3$ convolutional layer is used to increase the number of channels to ensure consistent input and output channels for the AGDF block:

F_{y}=Conv(F_{x}^{f}).

(19)
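The offset argument of Eq. (18), $(\Theta_{2}^{sc})^{\uparrow 2}+\Theta_{1}^{sc}$, can be illustrated in NumPy as below. The interpolation mode is not stated in the text, so nearest-neighbor upsampling is an assumption; the sketch also follows the equation literally and does not rescale the coarse offset values after upsampling. The function names are hypothetical.

```python
import numpy as np


def upsample2(x):
    """Nearest-neighbour 2x upsampling over the last two axes (assumed mode)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)


def fuse_offsets(theta1, theta2):
    """Offset argument of Eq. (18): upsampled coarse offsets plus fine offsets."""
    return upsample2(theta2) + theta1


theta1 = np.zeros((18, 8, 8))   # fine-scale offsets (2*K*K = 18 channels for one group, K = 3)
theta2 = np.ones((18, 4, 4))    # coarse-scale offsets predicted at half resolution
fused = fuse_offsets(theta1, theta2)
print(fused.shape)  # (18, 8, 8)
```

Because the coarse branch sees a downsampled feature, each of its offsets summarizes a larger neighborhood, and adding it to the fine-scale prediction gives a per-pixel offset informed by both scales.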

Remark. Although the stacked convolutional layers or multi-scale convolutional structures can also increase the receptive field of the network, gradient vanishing/exploding and degradation problems may be caused. In contrast, our AGDF block employs deformable convolution with joint predicted offsets to model spatio-temporal information from the fused two-branch features, leading to more efficient use of both intra-frame and inter-frame information. Although the three-level pyramid structure may bring more performance improvement, the accompanying increase in computational cost is unacceptable. Our AGDF block reaches a balance between the performance and computational efficiency by the designed two-level pyramid structure.

III-D The Loss Function

Although the deformable alignment has the potential to capture motion context and align $F_{i}^{E}$ with $F_{t}^{E}$ , it is difficult to train deformable convolution without a supervision loss [63]. Training instability regularly leads to offset overflow, degrading the final performance. Therefore, to improve the temporal alignment, we introduce a new Motion Compensation (MC) loss $\mathcal{L}_{MC}$ as follows:

\mathcal{L}_{MC}=\sum_{i=t-R,\,i\neq t}^{t+R}\mathcal{L}_{1}\left(F_{i}^{A},F_{t}^{E}\right),

(20)

where $\mathcal{L}_{1}(\cdot,\cdot)$ represents $L_{1}$ loss. Then, the total loss function is formulated as:

\mathcal{L}=\lambda\mathcal{L}_{reg}+\mathcal{L}_{cls}+\mathcal{L}_{obj}+\eta\mathcal{L}_{MC},

(21)

where $\mathcal{L}_{reg}$ is a regression loss, $\mathcal{L}_{cls}$ denotes a classification loss, and $\mathcal{L}_{obj}$ refers to an IoU loss; $\lambda$ and $\eta$ are two hyper-parameters to balance loss terms. Here, following the setting of YOLOX [56], $\lambda$ is fixed to 5.
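A minimal NumPy sketch of Eqs. (20) and (21) follows. The mean reduction inside the $L_{1}$ term is an assumption, since the text does not say whether the distance is summed or averaged over elements, and the detection loss values passed to `total_loss` are placeholders. The function names are hypothetical.

```python
import numpy as np


def mc_loss(aligned, target):
    """Eq. (20): sum over adjacent frames of the L1 distance to the target feature.

    Assumption: the L1 distance is averaged over all feature elements.
    """
    return sum(np.abs(a - target).mean() for a in aligned)


def total_loss(l_reg, l_cls, l_obj, l_mc, lam=5.0, eta=1.0):
    """Eq. (21) with lambda = 5 and eta = 1 as reported in the paper."""
    return lam * l_reg + l_cls + l_obj + eta * l_mc


target = np.zeros((2, 3, 3))                          # stands in for F_t^E
aligned = [np.full((2, 3, 3), 0.5) for _ in range(4)]  # four aligned neighbours (R = 2)
l_mc = mc_loss(aligned, target)
print(l_mc)                                  # 2.0
print(total_loss(0.25, 0.5, 0.25, l_mc))     # 4.0
```

The MC term only involves the aligned features and the target feature, so it supervises the offsets directly without requiring any motion annotation.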

IV Experiments

IV-A Datasets and Quantitative Evaluation Metrics

Datasets. Following [23], we conduct extensive experiments on two moving infrared dim-small target detection datasets, i.e., DAUB [64] and IRDST [19]. The DAUB dataset consists of 10 training video sequences with a total of 8982 frames and 7 test video sequences with a total of 4795 frames. The IRDST dataset contains 42 training video sequences with a total of 20398 frames and 43 test video sequences with a total of 20258 frames.

Quantitative Evaluation Metrics. To evaluate the detection performance, four standard evaluation metrics for infrared dim-small target detection are adopted, namely Precision (Pr), Recall (Re), $F1$ score, and mean average precision (mAP₅₀, i.e., the mean average precision at an IoU threshold of 0.5). These quantitative evaluation metrics are defined as follows:

\text{ Precision }=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},

(22)

\text{ Recall }=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},

(23)

F1=\frac{2\times\text{ Precision }\times\text{ Recall }}{\text{ Precision }+\text{ Recall }},

(24)

where TP, FP, and FN denote the number of correct predictions (true positives), false detections (false positives), and missing targets (false negatives), respectively. The $F1$ score is a reliable and comprehensive evaluation metric on Pr and Re.
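Eqs. (22) to (24) translate directly into code. The counts in the first call are made-up numbers for illustration only; the second call recomputes $F1$ from the Pr and Re values reported for DFAR on DAUB in Table I. The function names are hypothetical, and the sketch does not guard against zero denominators.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (22)-(24) from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """Eq. (24) from precision and recall given in the same unit (here, percent)."""
    return 2 * precision * recall / (precision + recall)


p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)   # illustrative counts
print(f"{p:.2f} {r:.2f} {f1:.4f}")                    # 0.90 0.75 0.8182
print(f"{f1_from(98.82, 98.62):.2f}")                 # 98.72
```

The second line of output agrees with the $F1$ entry of 98.72 listed for DFAR on DAUB, which confirms that the table's $F1$ column is the harmonic mean of its Pr and Re columns.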

IV-B Implementation Details

Network settings. Each convolutional layer in the feature extraction module has 48 filters (except for the last layer, which has 64 filters). The kernel size of all deformable convolutions in the network is set to $3\times 3$ ; the number of deformable groups in the TDA module and the AGDF block is set to 8 and 32, respectively. There are 4 DCAF blocks and 4 AGDF blocks in the TDA module and feature refinement module, respectively. The other settings of the proposed DFAR method have been described in Section III.
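These settings fix the number of channels that the offset-prediction convolutions must output. Per the definition of $\Theta$ in Section III-B, the offset field has $d\times 2K^{2}$ channels; for modulated deformable convolution the mask is commonly taken to have $d\times K^{2}$ channels, which is assumed here rather than stated in the paper. The function name is hypothetical.

```python
def dcn_channels(groups, kernel):
    """Channel counts of the offset field (d * 2 * K^2) and mask (d * K^2, assumed)."""
    offsets = groups * 2 * kernel * kernel
    masks = groups * kernel * kernel
    return offsets, masks


print(dcn_channels(8, 3))   # TDA module: (144, 72)
print(dcn_channels(32, 3))  # AGDF block: (576, 288)
```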

Model training. To have a fair comparison, following [23], the number of input frames is set to 5 (i.e., temporal radius $R=2$ ). In the training process, the input frames are resized to $544\times 544$ , and the batch size is set to 4. The model is trained by the Adam optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.999$ and $\varepsilon=1\times 10^{-8}$ for 20 epochs. The learning rate is initially set to $1\times 10^{-4}$ and kept constant throughout training. For the hyper-parameters in equation (21), following YOLOX [56], $\lambda$ is set to 5; $\eta$ is selected by grid search and set to 1. The proposed model is implemented using PyTorch and trained on an NVIDIA GeForce RTX 3090 GPU.

TABLE I: Overall Performance Comparison in Terms of mAP₅₀, Precision (Pr), Recall (Re) and F1 Score on the DAUB and IRDST Datasets. Red and Blue Colors Indicate the Best and the Second-Best Performance, Respectively.

Scheme	Methods	Publication	DAUB				IRDST
Scheme	Methods	Publication	mAP₅₀ (%)	Pr (%)	Re (%)	F1 (%)	mAP₅₀ (%)	Pr (%)	Re (%)	F1 (%)
single-frame based detection	ACM [32]	WACV 2021	72.30	76.84	95.31	85.09	67.74	81.44	84.01	82.71
	RISTD [65]	IEEE GRSL 2022	82.73	88.54	94.41	91.38	78.92	86.56	92.63	89.49
	ISNet [66]	CVPR 2022	83.43	88.64	95.04	91.73	75.00	87.78	86.81	87.29
	UIUNet [67]	IEEE TIP 2022	88.23	94.63	94.79	94.71	70.96	87.28	82.08	84.60
	SANet [68]	ICASSP 2023	87.90	94.14	94.22	94.18	77.98	85.42	92.13	88.64
	AGPCNet [69]	IEEE TAES 2023	73.08	78.49	94.41	85.72	73.86	83.77	89.18	86.39
	RDIAN [19]	IEEE TGRS 2023	83.69	90.55	93.37	91.94	71.99	84.41	86.48	85.43
	DNANet [18]	IEEE TIP 2023	89.24	95.66	94.83	95.24	76.84	90.08	86.81	88.42
	SIRST5K [70]	IEEE TGRS 2024	88.45	94.48	94.97	94.72	72.64	86.17	85.65	85.91
	MSHNet [71]	CVPR 2024	89.23	97.27	92.26	94.70	78.50	88.89	89.63	89.26
	RPCANet [72]	WACV 2024	85.75	89.12	97.58	93.16	73.29	85.02	87.13	86.06
	SCTransNet [73]	IEEE TGRS 2024	88.26	93.50	95.50	94.53	78.27	89.67	88.43	89.05
multi-frame based detection	DTUM [53]	IEEE TNNLS 2023	88.24	95.15	93.60	94.37	80.98	90.62	90.46	90.54
	SSTNet [23]	IEEE TGRS 2024	94.33	97.77	97.91	97.84	83.25	91.13	92.24	91.68
	DFAR (Ours)	-	96.56	98.82	98.62	98.72	89.88	95.91	94.66	95.28

IV-C Comparisons with State-of-The-Art Methods

To demonstrate the advantage of our method, we compare it with various learning-based infrared dim-small target detection approaches whose codes are available, including 12 single-frame based detection approaches: ACM [32], RISTD [65], ISNet [66], UIUNet [67], SANet [68], AGPCNet [69], RDIAN [19], DNANet [18], SIRST5K [70], MSHNet [71], RPCANet [72] and SCTransNet [73], and 2 multi-frame based detection approaches: DTUM [53] and SSTNet [23]. The input frames for all comparison methods are resized to $544\times 544$ in both training and testing. Since the datasets we used are based on bounding box annotations, to make the comparison as fair as possible, for all single-frame detection methods (except the SANet method) and the DTUM method, which are based on pixel-level segmentation, the detection head YOLOX [56] is added to the output of their networks to generate bounding boxes. Then, these methods are retrained on our training set. For the SSTNet method, we directly run the public training and test codes.

IV-C1 Overall Quantitative Comparison

Table I presents the quantitative comparison results on the two datasets, averaged over all frames of each test sequence. Our method achieves better results than all the compared methods on both datasets in terms of the four standard evaluation metrics, i.e., mAP₅₀, Pr, Re and $F1$ .

To be specific, on the DAUB dataset, our method reaches the highest value of every metric, and the gains quoted here are relative to the SSTNet scores: the mAP₅₀ of 96.56% is 2.36% higher than that of SSTNet (94.33%); the Pr of 98.82% is 1.07% higher than that of SSTNet (97.77%); the Re of 98.62% is 0.73% higher than that of SSTNet (97.91%); and the $F1$ of 98.72% is 0.90% higher than that of SSTNet (97.84%).

Similarly, on the IRDST dataset, our method reaches the highest value of every metric, again with gains expressed relative to the SSTNet scores: the mAP₅₀ of 89.88% is 7.96% higher than that of SSTNet (83.25%); the Pr of 95.91% is 5.25% higher than that of SSTNet (91.13%); the Re of 94.66% is 2.62% higher than that of SSTNet (92.24%); and the $F1$ of 95.28% is 3.93% higher than that of SSTNet (91.68%). Our method therefore outperforms the current state-of-the-art method SSTNet by a clearly larger margin on the IRDST dataset than on the DAUB dataset. One possible reason is that the video sequences in the IRDST dataset contain larger motions than those in the DAUB dataset, which makes it difficult for the LSTM-based SSTNet to implicitly aggregate inter-frame information. This demonstrates the robustness of our DFAR approach, which is based on explicit DCN alignment, in modeling complex and large motions.
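The percentage gains quoted in the two paragraphs above are consistent with relative improvements over SSTNet rather than absolute differences in percentage points. The following minimal sketch recomputes them from the Table I scores; the function name `relative_gain` and the dictionary layout are illustrative choices, not part of the original work.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (DFAR, SSTNet) scores copied from Table I.
TABLE_I = {
    "DAUB": {
        "mAP50": (96.56, 94.33),
        "Pr": (98.82, 97.77),
        "Re": (98.62, 97.91),
        "F1": (98.72, 97.84),
    },
    "IRDST": {
        "mAP50": (89.88, 83.25),
        "Pr": (95.91, 91.13),
        "Re": (94.66, 92.24),
        "F1": (95.28, 91.68),
    },
}

if __name__ == "__main__":
    for dataset, metrics in TABLE_I.items():
        for name, (ours, sstnet) in metrics.items():
            gain = relative_gain(ours, sstnet)
            print(f"{dataset:5s} {name:5s} relative gain: {gain:.2f}%")
```

For instance, the DAUB mAP₅₀ gain is (96.56 − 94.33) / 94.33 ≈ 2.36%, whereas the absolute difference is 2.23 percentage points.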

TABLE II: Inference Complexity Comparison on the DAUB Dataset. For a Fair Comparison, All Methods are Retested on an NVIDIA GeForce RTX 3090. The Results are Reported by Model Parameters (Params), Floating-Point Operations (FLOPs), Frames Per Second (FPS) and Performance–cost Ratio (PCR). The Best Results are Marked in Bold.

Methods	Frames	mAP₅₀^↑	F1^↑	Params^↓	FLOPs^↓	FPS^↑	PCR^↑
ACM [32]	1	72.30	85.09	3.01M	28.17G	16.39	2.57
RISTD [65]	1	82.73	91.38	3.26M	92.95G	9.91	0.89
ISNet [66]	1	83.43	91.73	3.48M	300.64G	6.44	0.28
UIUNet [67]	1	88.23	94.71	53.03M	515.83G	2.10	0.17
SANet [68]	1	87.90	94.18	12.40M	47.46G	6.26	1.85
AGPCNet [69]	1	73.08	85.72	14.85M	413.61G	3.17	0.18
RDIAN [19]	1	83.69	91.94	2.71M	57.37G	13.91	1.46
DNANet [18]	1	89.24	95.24	7.19M	152.58G	3.47	0.58
SIRST5K [70]	1	88.45	94.72	11.28M	204.88G	6.64	0.43
MSHNet [71]	1	89.23	94.70	6.56M	78.77G	12.42	1.13
RPCANet [72]	1	85.75	93.16	3.18M	432.27G	13.13	0.20
SCTransNet [73]	1	88.26	94.53	13.68M	115.34G	8.12	0.77
DTUM [53]	5	88.24	94.37	2.79M	117.10G	7.58	0.75
SSTNet [23]	5	94.33	97.84	11.95M	139.53G	6.71	0.68
Ours	5	96.56	98.72	7.53M	100.87G	7.60	0.96

IV-C2 Precision–recall (PR) Curve Comparison

To evaluate performance more comprehensively, we further compare the PR curves of all approaches on the DAUB and IRDST datasets. The larger the area under the curve, the better the method. Figs. 7 and 8 show that our approach is clearly superior to the compared methods on both datasets. In general, the proposed DFAR approach achieves a better balance of precision and recall than the other methods.
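As a hedged illustration of how a PR curve and the area under it can be obtained, the sketch below ranks detections by confidence, accumulates precision and recall, and integrates the curve with all-point interpolation. The interpolation scheme, the function names, and the toy detections are assumptions for illustration only; the evaluation code used in the paper may differ. The `f1_score` helper is the usual harmonic mean of precision and recall, which reproduces the $F1$ values of Table I from the listed Pr and Re.

```python
from typing import List, Sequence, Tuple


def pr_curve(detections: Sequence[Tuple[float, bool]],
             num_gt: int) -> Tuple[List[float], List[float]]:
    """Precision/recall after each detection, ranked by confidence.

    `detections` holds (confidence, is_true_positive) pairs; whether a
    detection is a true positive (e.g. IoU >= 0.5) is decided upstream.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return precisions, recalls


def average_precision(precisions: Sequence[float],
                      recalls: Sequence[float]) -> float:
    """Area under the PR curve using all-point interpolation."""
    p = [0.0] + list(precisions) + [0.0]
    r = [0.0] + list(recalls) + [1.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy detections, not taken from the paper.
    toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
    precisions, recalls = pr_curve(toy, num_gt=4)
    print("AP:", average_precision(precisions, recalls))
    # Consistency check against Table I (DFAR on DAUB).
    print("F1:", round(f1_score(98.82, 98.62), 2))
```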

IV-C3 Inference Complexity Comparison

We use Frames Per Second (FPS), Floating-point Operations (FLOPs) and model parameters to compare inference complexity. As shown in Table II, the proposed DFAR approach is better than the state-of-the-art LSTM-based method SSTNet in terms of inference speed, FLOPs and model size. In addition, our method has a moderate Performance–cost Ratio (PCR; i.e., mAP₅₀/FLOPs, with FLOPs measured in G). In particular, the PCR of our method on the DAUB dataset is 0.96, which is 41.18% higher in relative terms than that of SSTNet (0.68). Although the ACM, RISTD, RDIAN and MSHNet methods have fewer parameters, fewer FLOPs and faster inference speed, their detection accuracy is considerably lower; taking Table I and Table II together, our method achieves a good complexity-performance trade-off.
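The PCR column of Table II can be reproduced directly from its definition. The short sketch below does this for SSTNet and DFAR; the function name `pcr` is an illustrative choice. Note that the 41.18% figure matches the PCR values after rounding to two decimals (0.96 and 0.68); with the unrounded ratios the relative gain is slightly higher, at roughly 41.6%.

```python
def pcr(map50: float, gflops: float) -> float:
    """Performance-cost ratio: mAP50 (%) divided by FLOPs (in G)."""
    return map50 / gflops


# (mAP50, FLOPs in G) copied from Table II.
TABLE_II = {
    "SSTNet": (94.33, 139.53),
    "DFAR": (96.56, 100.87),
}

if __name__ == "__main__":
    raw = {name: pcr(*vals) for name, vals in TABLE_II.items()}
    rounded = {name: round(v, 2) for name, v in raw.items()}
    print("PCR (rounded):", rounded)

    # Relative gain computed from the rounded values, as reported in the text.
    gain_rounded = (rounded["DFAR"] - rounded["SSTNet"]) / rounded["SSTNet"] * 100
    # Relative gain computed from the unrounded ratios.
    gain_raw = (raw["DFAR"] - raw["SSTNet"]) / raw["SSTNet"] * 100
    print(f"gain from rounded PCR: {gain_rounded:.2f}%")
    print(f"gain from unrounded PCR: {gain_raw:.2f}%")
```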

IV-C4 Visualization Comparison

In Fig. 9 and Fig. 10, we present two groups of detection results for visual comparison on the two datasets, respectively. We can see from Figs. 9 and 10 that our method can often accurately detect moving dim-small targets, while other methods usually result in missed or false detections.

Specifically, on the DAUB dataset, as shown in Fig. 9, our method precisely detects the target in the first group of visual comparisons (the first three rows). In contrast, the ISNet, AGPCNet, RDIAN and RPCANet methods miss the target, and the bounding boxes generated by the RISTD and UIUNet methods are inaccurate and too large. Furthermore, in the second group of comparisons (the fourth to sixth rows), the UIUNet, MSHNet and SSTNet methods miss the target, while the RISTD, SANet and DTUM methods produce false detections; the RISTD method even detects two targets.

On the IRDST dataset, as shown in Fig. 10, most methods fail to detect the targets in both groups because the targets are difficult to perceive, while our method still detects them correctly. In summary, these qualitative comparison results verify the superiority of our method.

TABLE III: Ablation Study of Proposed Temporal Deformable Alignment (TDA) Module, Motion Compensation Loss $\mathcal{L}_{MC}$ and Feature Refinement (FR) Module.

TDA	$\boldsymbol{\mathcal{L}_{MC}}$	FR	DAUB				IRDST
TDA	$\boldsymbol{\mathcal{L}_{MC}}$	FR	mAP₅₀ (%)	Pr (%)	Re (%)	F1 (%)	mAP₅₀ (%)	Pr (%)	Re (%)	F1 (%)
-	-	-	84.02	87.40	97.14	92.02	74.13	86.24	88.28	87.25
$\checkmark$	-	-	92.62	95.40	97.87	96.62	81.44	90.50	90.86	90.68
$\checkmark$	$\checkmark$	-	93.05	96.52	97.89	97.20	82.44	91.92	92.12	92.02
-	-	$\checkmark$	91.92	95.44	97.71	96.56	80.88	92.43	88.56	90.45
$\checkmark$	-	$\checkmark$	94.69	97.79	97.93	97.86	87.14	95.14	92.28	93.69
$\checkmark$	$\checkmark$	$\checkmark$	96.56	98.82	98.62	98.72	89.88	95.91	94.66	95.28

IV-D Ablation Study

IV-D1 Effects of Different Assemblies

To evaluate the effectiveness of our TDA module, feature refinement (FR) module and motion compensation loss, we build a baseline model that directly uses a $3\times 3$ convolutional layer with 320 filters to fuse the concatenated multi-frame features and then feeds the fused feature into the detection head. We then insert different components into the baseline and retrain these models with the same experimental settings. The experimental results are presented in Table III.

Effectiveness of TDA module. We insert the TDA module before multi-frame feature fusion to evaluate the effectiveness of explicit feature-level alignment. The results in the 1st and 2nd rows (or the 4th and 5th rows) of Table III confirm the effectiveness of the proposed TDA module. Specifically, the results in the 4th and 5th rows show that alignment is more important for the IRDST dataset, which contains more videos with large motion. Additionally, to evaluate the effectiveness of the designed DCAF block, we replace the DCAF blocks in the TDA module with eight $3\times 3$ convolutional layers with 64 filters, which keeps the module parameters and computational cost essentially unchanged. The mAP₅₀ and $F1$ of the baseline model with the modified TDA module are only 89.86 and 94.18 on the DAUB dataset, and only 78.24 and 88.83 on the IRDST dataset. This demonstrates the effectiveness of the designed DCAF block.

Effectiveness of motion compensation loss $\boldsymbol{\mathcal{L}_{MC}}$ . Here, we show the necessity of the motion compensation loss $\mathcal{L}_{MC}$ . The results in the 5th and 6th rows of Table III show that, even with feature alignment, the performance remains sub-optimal without the supervision of the motion compensation loss. One possible reason is that deformable convolution is inherently difficult to train, and training instability often leads to offset overflow, which deteriorates the final performance [63]. Therefore, it is necessary to use the motion compensation loss to supervise temporal alignment. Besides, we also investigate the impact of the hyper-parameter $\eta$ in the loss function. Fig. 11 shows that the $F1$ score has a stable region, peaking at $\eta=1.0$ on the DAUB dataset and at $\eta=1.4$ on the IRDST dataset. In addition, the average $F1$ over the two datasets is highest when $\eta=1.0$ , so the hyper-parameter $\eta$ is set to 1.0.

Effectiveness of feature refinement module. We use the FR module to replace the simple multi-frame fusion operation (i.e., the concatenation operation) to evaluate its effectiveness. From the results in the 1st and 4th rows (or the 3rd and 6th rows) of Table III, we can conclude that our feature refinement module contributes substantially to the performance improvement. For a deeper investigation, we conduct an additional ablation study on the branch configuration of the FR module. There are two variants of the FR module: FR without AFS (removing the Adaptive Fusion Structure (AFS) guided by the attention mechanism and using a simple concatenation operation to fuse multiple features), and FR without AGDF (replacing the AGDF block with eight $3\times 3$ convolutional layers with 64 filters). The experimental results are presented in Table IV. Both the AFS and the AGDF block improve the performance of the baseline, but each alone performs worse than the complete FR module.

TABLE IV: Ablation Study on Feature Refinement Module. AFS: Adaptive Fusion Structure Guided by the Attention Mechanism.

AFS	AGDF	DAUB		IRDST
AFS	AGDF	mAP₅₀ (%)	F1 (%)	mAP₅₀ (%)	F1 (%)
-	-	84.02	92.02	74.13	87.25
$\checkmark$	-	86.81	93.51	76.88	88.63
-	$\checkmark$	88.23	94.28	78.23	89.30
$\checkmark$	$\checkmark$	91.92	96.56	80.88	90.45

IV-D2 Effectiveness of Temporal Radius $R$

Although we follow the SSTNet method [23] in using 5 frames for model training, an ablation study is also conducted on the temporal radius $R$ to analyze its impact on performance. Due to limited memory, we only study values of $R$ up to 5. As shown in Fig. 12, the performance of our DFAR method generally increases with the temporal radius $R$ , but it quickly reaches a plateau and even decreases when the number of frames is too large. This phenomenon is more obvious on the IRDST dataset. The reason is that when the temporal radius is too large, the greater motion makes it difficult for the model to align the frames effectively, and inaccurate alignment may even introduce new artifacts and degrade performance. After comprehensively considering computational complexity and accuracy, we set $R=2$ .
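To make the relation between the temporal radius and the number of input frames concrete, the sketch below gathers the frame indices for a window of radius $R$ around the current frame. It assumes a symmetric window of $2R+1$ frames (consistent with $R=2$ and 5 input frames) and clamps indices at the sequence boundaries; both the boundary handling and the function name `temporal_window` are assumptions, since they are not specified in this section.

```python
from typing import List


def temporal_window(t: int, radius: int, num_frames: int) -> List[int]:
    """Indices of the 2 * radius + 1 frames centred on frame `t`.

    Indices falling outside [0, num_frames - 1] are clamped, so boundary
    frames are repeated (an assumed padding strategy).
    """
    if not 0 <= t < num_frames:
        raise ValueError("t must be a valid frame index")
    return [min(max(t + offset, 0), num_frames - 1)
            for offset in range(-radius, radius + 1)]


if __name__ == "__main__":
    # R = 2 yields the 5-frame input used in the experiments.
    print(temporal_window(t=10, radius=2, num_frames=100))
    # At the start of a sequence the first frame is repeated.
    print(temporal_window(t=0, radius=2, num_frames=100))
```

Under this assumption, increasing $R$ from 2 to 5 grows the input from 5 to 11 frames, which is in line with the memory limitation mentioned above.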

V Conclusion

We proposed a new end-to-end network for moving infrared dim-small target detection. Specifically, a TDA module based on the designed DCAF block is proposed by explicitly aligning adjacent frames with the current frame at the feature level to mine temporal information. Furthermore, a feature refinement module with the adaptive fusion structure and AGDF blocks is designed to adaptively fuse and refine useful temporal information from the aligned features. In addition, we extend the traditional loss function by introducing a new motion compensation loss to improve the temporal alignment. Both qualitative and quantitative experimental results demonstrate the effectiveness of our proposed DFAR method, which can significantly improve detection performance compared with the state-of-the-art methods. Ablation studies are also conducted to show the effectiveness of different component assemblies in our method.

References

[1] T. R. Goodall, A. C. Bovik, and N. G. Paulter, “Tasking on natural statistics of infrared images,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 65–79, 2016.
[2] L. Li, Q. Hu, and X. Li, “Moving object detection in video via hierarchical modeling and alternating optimization,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2021–2036, 2019.
[3] A. Guzman-Pando and M. I. Chacon-Murguia, “Deepfoveanet: Deep fovea eagle-eye bioinspired model to detect moving objects,” IEEE Transactions on Image Processing, vol. 30, pp. 7090–7100, 2021.
[4] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, “Minet: Meta-learning instance identifiers for video object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 6879–6891, 2021.
[5] L. Jiao, R. Zhang, F. Liu, S. Yang, B. Hou, L. Li, and X. Tang, “New generation deep learning for video object detection: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 8, pp. 3195–3215, 2021.
[6] Q. Qi, T. Hou, Y. Lu, Y. Yan, and H. Wang, “Dgrnet: A dual-level graph relation network for video object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 4128–4141, 2023.
[7] X. Zhao, H. Liang, P. Li, G. Sun, D. Zhao, R. Liang, and X. He, “Motion-aware memory network for fast video salient object detection,” IEEE Transactions on Image Processing, vol. 33, pp. 709–721, 2024.
[8] C. Liu, W. Ding, J. Yang, V. Murino, B. Zhang, J. Han, and G. Guo, “Aggregation signature for small object tracking,” IEEE Transactions on Image Processing, vol. 29, pp. 1738–1747, 2020.
[9] S. Deng, S. Li, K. Xie, W. Song, X. Liao, A. Hao, and H. Qin, “A global-local self-adaptive network for drone-view object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 1556–1569, 2021.
[10] C. Deng, M. Wang, L. Liu, Y. Liu, and Y. Jiang, “Extended feature pyramid network for small object detection,” IEEE Transactions on Multimedia, vol. 24, pp. 1968–1979, 2022.
[11] S. Chen, L. Ji, S. Zhu, and M. Ye, “Micpl: Motion-inspired cross-pattern learning for small-object detection in satellite videos,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2024.
[12] X. Bai and F. Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognition, vol. 43, no. 6, pp. 2145–2156, 2010.
[13] X. Kong, C. Yang, S. Cao, C. Li, and Z. Peng, “Infrared small target detection via nonconvex tensor fibered rank approximation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–21, 2021.
[14] H. Zhu, H. Ni, S. Liu, G. Xu, and L. Deng, “Tnlrs: Target-aware non-local low-rank modeling with saliency filtering regularization for infrared small target detection,” IEEE Transactions on Image Processing, vol. 29, pp. 9546–9558, 2020.
[15] H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8509–8518.
[16] P. Zhang, L. Zhang, X. Wang, F. Shen, T. Pu, and C. Fei, “Edge and corner awareness-based spatial–temporal tensor model for infrared small-target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 12, pp. 10 708–10 724, 2021.
[17] K. Wang, S. Du, C. Liu, and Z. Cao, “Interior attention-aware network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
[18] B. Li, C. Xiao, L. Wang, Y. Wang, Z. Lin, M. Li, W. An, and Y. Guo, “Dense nested attention network for infrared small target detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1745–1758, 2023.
[19] H. Sun, J. Bai, F. Yang, and X. Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.
[20] F. Lin, K. Bao, Y. Li, D. Zeng, and S. Ge, “Learning contrast-enhanced shape-biased representations for infrared small target detection,” IEEE Transactions on Image Processing, vol. 33, pp. 3047–3058, 2024.
[21] X. Liu, X. Li, L. Li, X. Su, and F. Chen, “Dim and small target detection in multi-frame sequence using bi-conv-lstm and 3d-conv structure,” IEEE Access, vol. 9, pp. 135 845–135 855, 2021.
[22] Z. Zhang, P. Gao, S. Ji, X. Wang, and P. Zhang, “Infrared small target detection combining deep spatial-temporal prior with traditional priors,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[23] S. Chen, L. Ji, J. Zhu, M. Ye, and X. Yao, “SSTNet: Sliced spatio-temporal network with cross-slice convlstm for moving infrared dim-small target detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[24] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: More deformable, better results,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
[25] S. D. Deshpande, M. H. Er, R. Venkateswarlu, and P. Chan, “Max-mean and max-median filters for detection of small targets,” in Signal and Data Processing of Small Targets 1999, vol. 3809. SPIE, 1999, pp. 74–83.
[26] C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4996–5009, 2013.
[27] C. P. Chen, H. Li, Y. Wei, T. Xia, and Y. Y. Tang, “A local contrast method for small infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 574–581, 2013.
[28] A. Novak, N. Armstrong, T. Caelli, and I. Blair, “Bayesian contrast measures and clutter distribution determinants of human target detection,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1115–1126, 2017.
[29] S. Moradi, P. Moallem, and M. F. Sabahi, “Fast and robust small infrared target detection using absolute directional mean difference algorithm,” Signal Processing, vol. 177, p. 107727, 2020.
[30] X. Zhou, P. Li, Y. Zhang, X. Lu, and Y. Hu, “Deep low-rank and sparse patch-image network for infrared dim and small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023.
[31] M. Liu, H.-y. Du, Y.-j. Zhao, L.-q. Dong, M. Hui, and S. Wang, “Image small target detection based on deep learning with snr controlled sample generation,” Current Trends in Computer Science and Mechanical Automation, vol. 1, pp. 211–220, 2017.
[32] Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Asymmetric contextual modulation for infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 950–959.
[33] ——, “Attentional local contrast networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 11, pp. 9813–9824, 2021.
[34] Y. Zhang, Y. Zhang, Z. Shi, R. Fu, D. Liu, Y. Zhang, and J. Du, “Enhanced cross-domain dim and small infrared target detection via content-decoupled feature alignment,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[35] F. Liu, C. Gao, F. Chen, D. Meng, W. Zuo, and X. Gao, “Infrared small and dim target detection with transformer under complex backgrounds,” IEEE Transactions on Image Processing, vol. 32, pp. 5921–5932, 2023.
[36] J. Lin, S. Li, L. Zhang, X. Yang, B. Yan, and Z. Meng, “IR-TransDet: Infrared dim and small target detection with ir-transformer,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[37] Y. Dai, X. Li, F. Zhou, Y. Qian, Y. Chen, and J. Yang, “One-stage cascade refinement networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–17, 2023.
[38] T. Liu, Q. Yin, J. Yang, Y. Wang, and W. An, “Combining deep denoiser and low-rank priors for infrared small target detection,” Pattern Recognition, vol. 135, p. 109184, 2023.
[39] T. Ma, H. Wang, J. Liang, J. Peng, Q. Ma, and Z. Kai, “Msma-net: An infrared small target detection network by multiscale super-resolution enhancement and multilevel attention fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024.
[40] T. Chen, Z. Tan, Q. Chu, Y. Wu, B. Liu, and N. Yu, “Tci-former: Thermal conduction-inspired transformer for infrared small target detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 1201–1209.
[41] Y. Huang, X. Zhi, J. Hu, L. Yu, Q. Han, W. Chen, and W. Zhang, “Fddba-net: Frequency domain decoupling bidirectional interactive attention network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
[42] H. Yang, J. Liu, Z. Wang, Z. Fu, Q. Tan, and S. Niu, “Mapff: Multiangle pyramid feature fusion network for infrared dim small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
[43] Y. Sun, J. Yang, Y. Long, and W. An, “Infrared small target detection via spatial-temporal total variation regularization and weighted tensor nuclear norm,” IEEE Access, vol. 7, pp. 56 667–56 682, 2019.
[44] C. Kwan and B. Budavari, “Enhancing small moving target detection performance in low-quality and long-range infrared videos using optical flow techniques,” Remote Sensing, vol. 12, no. 24, p. 4024, 2020.
[45] M. Uzair, R. S. Brinkworth, and A. Finn, “Bio-inspired video enhancement for small moving target detection,” IEEE Transactions on Image Processing, vol. 30, pp. 1232–1244, 2021.
[46] G. Wang, B. Tao, X. Kong, and Z. Peng, “Infrared small target detection using nonoverlapping patch spatial–temporal tensor factorization with capped nuclear norm regularization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2021.
[47] T. Liu, J. Yang, B. Li, C. Xiao, Y. Sun, Y. Wang, and W. An, “Nonconvex tensor low-rank approximation for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2021.
[48] F. Wu, H. Yu, A. Liu, J. Luo, and Z. Peng, “Infrared small target detection using spatio-temporal 4d tensor train and ring unfolding,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[49] H. Wang, Z. Zhong, F. Lei, J. Peng, and S. Yue, “Bio-inspired small target motion detection with spatio-temporal feedback in natural scenes,” IEEE Transactions on Image Processing, vol. 33, pp. 451–465, 2024.
[50] J. Du, D. Li, Y. Deng, L. Zhang, H. Lu, M. Hu, X. Shen, Z. Liu, and X. Ji, “Multiple frames based infrared small target detection method using cnn,” in Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021, pp. 1–6.
[51] J. Du, H. Lu, L. Zhang, M. Hu, S. Chen, Y. Deng, X. Shen, and Y. Zhang, “A spatial-temporal feature-based detection framework for infrared dim small target,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2021.
[52] P. Yan, R. Hou, X. Duan, C. Yue, X. Wang, and X. Cao, “Stdmanet: Spatio-temporal differential multiscale attention network for small moving infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[53] R. Li, W. An, C. Xiao, B. Li, Y. Wang, M. Li, and Y. Guo, “Direction-coded temporal u-shape module for multiframe infrared small target detection,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
[54] H. Deng, Y. Zhang, Y. Li, K. Cheng, and Z. Chen, “Bemst: Multiframe infrared small-dim target detection using probabilistic estimation of sequential backgrounds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.
[55] X. Tong, Z. Zuo, S. Su, J. Wei, X. Sun, P. Wu, and Z. Zhao, “St-trans: Spatial-temporal transformer for infrared small target detection in sequential images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–19, 2024.
[56] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[57] H. Song, W. Xu, D. Liu, B. Liu, Q. Liu, and D. N. Metaxas, “Multi-stage feature fusion network for video super-resolution,” IEEE Transactions on Image Processing, vol. 30, pp. 2923–2934, 2021.
[58] W. Yan, L. Xu, W. Yang, and R. T. Tan, “Feature-aligned video raindrop removal with temporal constraints,” IEEE Transactions on Image Processing, vol. 31, pp. 3440–3448, 2022.
[59] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[60] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
[61] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 286–301.
[62] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[63] K. C. K. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy, “Understanding deformable alignment in video super-resolution,” in AAAI, 2021.
[64] B. Hui, Z. Song, H. Fan, P. Zhong, W. Hu, X. Zhang, J. Lin, H. Su, W. Jin, Y. Zhang et al., “A dataset for infrared image dim-small aircraft target detection and tracking under ground/air background,” Sci. Data Bank, vol. 5, no. 12, p. 4, 2019.
[65] Q. Hou, Z. Wang, F. Tan, Y. Zhao, H. Zheng, and W. Zhang, “Ristdnet: Robust infrared small target detection network,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
[66] M. Zhang, R. Zhang, Y. Yang, H. Bai, J. Zhang, and J. Guo, “ISNet: Shape matters for infrared small target detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 877–886.
[67] X. Wu, D. Hong, and J. Chanussot, “UIU-Net: U-net in u-net for infrared small object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 364–376, 2022.
[68] J. Zhu, S. Chen, L. Li, and L. Ji, “Sanet: Spatial attention network with global average contrast learning for infrared small target detection,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[69] T. Zhang, L. Li, S. Cao, T. Pu, and Z. Peng, “Attention-guided pyramid context networks for detecting infrared small target under complex background,” IEEE Transactions on Aerospace and Electronic Systems, 2023.
[70] Y. Lu, Y. Lin, H. Wu, X. Xian, Y. Shi, and L. Lin, “Sirst-5k: Exploring massive negatives synthesis with self-supervised learning for robust infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024.
[71] Q. Liu, R. Liu, B. Zheng, H. Wang, and Y. Fu, “Infrared small target detection with scale and location sensitivity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 490–17 499.
[72] F. Wu, T. Zhang, L. Li, Y. Huang, and Z. Peng, “Rpcanet: Deep unfolding rpca based infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4809–4818.
[73] S. Yuan, H. Qin, X. Yan, N. Akhtar, and A. Mian, “Sctransnet: Spatial-channel cross transformer network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.