
Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Chen Zhou  Peng Cheng  Junfeng Fang  Yifan Zhang  Yibo Yan  Xiaojun Jia  Yanyan Xu  Kun Wang  Xiaochun Cao
kevinzc9@bjfu.edu.cn xuyanyan@bjfu.edu.cn  wk520529wjh@gmail.com
caoxiaochun@mail.sysu.edu.cn
Yanyan Xu and Kun Wang are corresponding authors. Chen Zhou and Peng Cheng contributed equally to this work.
Chen Zhou, Peng Cheng and Yanyan Xu are with the Beijing Forestry University.
Yibo Yan is with the Hong Kong University of Science and Technology.
Yifan Zhang is with the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS).
Junfeng Fang is with the National University of Singapore.
Xiaojun Jia and Kun Wang are with the Nanyang Technological University. Xiaochun Cao is with the Sun Yat-sen University.
Abstract

Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these “optimization techniques”. Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training “techniques”, which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly convert high-performing single-modality models into dual-modality models, integrating our advanced training techniques. Our code is available at https://github.com/cpboost/double-co-detr

Index Terms:
Multispectral object detection, Multimodal feature fusion, Spatial alignment, Data augmentation

1 Introduction

Multispectral object detection is a powerful technology that leverages both visible light and infrared spectra for object detection, and it has been widely adopted in various real-world applications [1, 2, 3, 4, 5, 6], including anomaly detection in surveillance systems [7, 8, 9, 10, 11], obstacle recognition in autonomous vehicles [4, 12, 13, 14, 15], defect identification in industrial inspection [5, 6, 16, 17, 18], and threat detection in defense and security [19, 20, 21], to name just a few. While many traditional object detection algorithms [5, 6, 17, 22, 19] have primarily relied on information from a single modality, recent advancements have explored more sophisticated multispectral architectures [23, 24, 25, 26, 27, 28, 29, 30]. In numerous cases, fully exploiting the information from multiple modalities has demonstrated significant advantages [28]. For instance, in low-light conditions, leveraging infrared spectra can enhance the performance of visible light detection, and in complex scenarios, combining information from both spectra can improve detection accuracy [31, 32, 33, 34]. Recently, with the rapid development of satellite remote sensing and thermal imaging technologies [17], many challenging detection datasets have emerged, covering conditions such as low light and extreme weather [7, 17]. Multispectral detection architectures have demonstrated strong performance on these datasets [6, 17, 22].

However, training multispectral object detection models is known to be highly challenging [23, 28, 29, 35, 36, 37]. Beyond the common issues encountered in training deep architectures, such as vanishing gradients and overfitting [22, 28], multispectral models face several unique challenges that limit their progress on these datasets:

  • The first challenge lies in effectively utilizing dual-modality data. Simultaneously processing visible and infrared data increases the complexity of dual-modality feature fusion, which may result in suboptimal integration of information from both modalities [23, 36]. This issue is particularly pronounced in earlier multispectral models, where the fusion process often led to information loss, preventing the models from fully leveraging the strengths of both modalities [35, 36]. Additionally, registration discrepancies between the two modalities and the lack of modality-specific enhancement strategies further constrain model performance [37].

  • The second major challenge is the lack of an effective optimization strategy for converting high-performance single-modality models into dual-modality models. Despite the emergence of numerous powerful single-modality object detection frameworks in recent years [38, 39, 40, 41], there has yet to be a robust method for effectively harnessing the potential of these models while addressing the unique challenges of multispectral object detection.

To address the aforementioned challenges, promising approaches can be categorized into ➀ dual-modality architectural fusion [26, 27, 28] and ➁ modality-specific enhancements [31, 32, 33], both of which we classify as “training techniques”. The former involves adapting single-modality architectures to dual-modality structures, integrating advanced backbone networks, and employing diverse feature fusion strategies. The latter focuses on processing data from both modalities using techniques such as modality-specific data augmentation and alignment calibration [6]. While these techniques generally contribute to the effective training of multispectral object detection models, their benefits are not always significant or consistent [35, 36, 37]. Furthermore, it is often difficult to distinguish the performance improvements achieved through more complex dual-modality architectures from those gained via these “training techniques”.

In some extreme cases, contrary to initial expectations, single-modality models enhanced with certain optimization techniques may even outperform carefully designed, complex dual-modality architectures [27, 28, 29, 30]. This casts doubt on the pursuit of increased complexity, thereby rendering it a less attractive approach. These observations highlight a critical gap in the study of multispectral object detection: the lack of a standardized benchmark that can fairly and consistently evaluate the effectiveness of training techniques for dual-modality models. Without disentangling the effects of architectural complexity from the “training techniques” applied, it may remain unclear whether multispectral object detection should inherently perform better under otherwise identical conditions.

Our Contribution

To establish such a fair benchmark, our first step was to conduct a comprehensive investigation into the design philosophies and implementation details of dozens of popular multispectral object detection techniques, including various backbone networks, dual-modality fusion strategies, and alignment techniques. Unfortunately, we discovered that even on the same datasets, the implementation of hyperparameter configurations (such as hidden layer dimensions, learning rates, weight decay, dropout rates, number of training epochs, and early stopping patience) is highly inconsistent and often varies depending on specific circumstances. This inconsistency makes it challenging to draw any fair or reliable conclusions.

To this end, we conducted a detailed analysis of these sensitive hyperparameters and standardized them into a “best” hyperparameter set, consistently applied across all experiments. This standardization provides a fair and reproducible benchmark for training multispectral object detection models. Subsequently, we explored various combinations of training techniques across several classical multispectral object detection datasets, leveraging common single-modality model backbones and optimizing them for dual-modality detection tasks.

The results of our comprehensive study were highly significant. Based on the characteristics of different single-modality model backbones, framework features, and detection sample characteristics, we developed several effective training techniques and optimization strategies, enabling us to achieve state-of-the-art results on multiple representative datasets. Furthermore, we proposed several optimization strategies with strong transferability, demonstrating excellent performance across multiple dual-modality public datasets. (Footnote 1: Our research was awarded the championship in the Global Artificial Intelligence Innovation Competition (GAIIC), https://gaiic.caai.cn/ai2024, out of over 1,200 participants, 1,000+ teams, and 8,200+ submissions.)

Specifically, our contributions are as follows:

Multimodal Feature Fusion: we introduce advanced multimodal feature fusion techniques to effectively integrate visible and infrared data, enhancing the feature representation capabilities of multispectral object detection models, especially in complex environments.

Dual-Modality Data Augmentation: we employ modality-specific data augmentation strategies that cater to the distinct characteristics of visible and infrared data, improving the model’s robustness in varying environmental and complex scenarios.

Alignment Optimization: by implementing precise alignment techniques, we improve spatial consistency between visible and infrared data, reducing inter-modality misalignment and significantly enhancing performance in low-light object detection and multimodal information fusion.

Optimizing Single-Modality Models for Dual-Modality Tasks: we provide a new benchmark and training techniques to effectively adapt high-performing single-modality models into dual-modality detection models. Through these optimizations, single-modality models outperform even complex, large-scale dual-modality detection models, offering strong support for their migration to dual-modality tasks.

2 Related work

2.1 Multispectral Object Detection & Training Challenges

Multispectral object detection has achieved state-of-the-art performance in applications like autonomous driving and drone-based remote sensing [5, 6, 7, 8, 9, 10]. However, implementing multispectral detection is challenging, especially when dealing with images from distinct spectra, such as visible light (RGB) and thermal infrared (TIR) [11, 16, 17]. Existing methods [42, 43, 44, 45] face several issues, including spectral differences [46], spatial misalignment [47], and high sensitivity to environmental conditions [48], limiting their generalization across diverse scenarios. While recent studies have introduced various training techniques, they often struggle to deliver consistent performance improvements when applied to complex remote sensing data [49, 50], differing from the dual-modality detection benchmarks discussed in this paper.

To address these challenges, techniques such as multimodal feature fusion, registration alignment, and dual-modality data augmentation have been developed in recent years [24, 25, 26, 27, 28]. The following sections provide a detailed exploration of these techniques and their applications.

2.2 Multimodal Feature Fusion

In multispectral object detection, feature fusion plays a crucial role in enhancing model performance. Current fusion methods are generally categorized into three types: pixel-level, feature-level, and decision-level fusion. Pixel-level fusion [51, 52, 53] integrates RGB and TIR images at the input stage, allowing early information combination but potentially introducing noise or misalignment due to differences in resolution and viewpoints. Feature-level fusion [54, 55, 56, 57, 58] combines high-level features from both modalities at intermediate layers, utilizing techniques like concatenation, weighting, or attention mechanisms to better capture complementary information, though it may add computational overhead [59, 60]. Decision-level fusion [50, 61, 62, 63, 64, 65] merges independent detection results from each modality at the final stage, providing efficiency and stable performance, especially when the modalities offer relatively independent information.
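As a rough illustration of the decision-level case, the sketch below merges two per-modality detection lists after each detector has run independently. The IoU threshold, the score-averaging rule, and the names `iou` and `fuse_decisions` are assumptions made for this example; they are not taken from any of the cited methods.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, float]             # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_decisions(rgb: List[Detection], tir: List[Detection],
                   iou_thr: float = 0.5) -> List[Detection]:
    """Average matched RGB/TIR detections; keep unmatched ones unchanged."""
    fused: List[Detection] = []
    used_tir = set()
    for box_r, score_r in rgb:
        # Find the best still-unmatched TIR detection that overlaps enough.
        best_j, best_iou = -1, iou_thr
        for j, (box_t, _) in enumerate(tir):
            if j in used_tir:
                continue
            overlap = iou(box_r, box_t)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            box_t, score_t = tir[best_j]
            used_tir.add(best_j)
            box = tuple((r + t) / 2 for r, t in zip(box_r, box_t))
            fused.append((box, (score_r + score_t) / 2))
        else:
            fused.append((box_r, score_r))
    # TIR-only detections (for example, pedestrians invisible in RGB at night).
    fused.extend(d for j, d in enumerate(tir) if j not in used_tir)
    return fused
```

Because each modality is processed separately until this last step, a failure in one stream (such as a dark RGB frame) does not suppress detections from the other.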

2.3 Dual-Modality Data Augmentation

In multispectral object detection, data augmentation is crucial for improving model generalization and reducing overfitting [27, 28]. While traditional techniques like flipping, rotation, and scaling work well in single-modality detection [25, 26], the fusion of RGB and TIR images introduces higher complexity. A common approach is to apply synchronized augmentation to both RGB and TIR images [3, 6] to ensure consistency between the modalities. Techniques such as random cropping, scaling, and color transformations increase image diversity and help the model adapt to varying environmental conditions [66, 67]. Additionally, some studies propose joint data augmentation methods, such as mixed modal augmentation, which exchanges pixels or features between modalities to enhance robustness against modality differences [4, 5, 68, 69], ultimately improving detection performance in challenging scenarios.
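The idea of synchronized augmentation can be sketched with a paired horizontal flip: one random draw decides the transform, and it is then applied identically to the RGB image, the TIR image, and the bounding boxes. The function name, the array layouts, and the box format are assumptions for illustration, and NumPy is used only for convenience.

```python
import numpy as np


def paired_horizontal_flip(rgb, tir, boxes, rng, p=0.5):
    """Flip both modalities and the boxes together, or leave all three untouched.

    rgb:   (H, W, 3) array
    tir:   (H, W) or (H, W, 1) array, pixel-aligned with rgb
    boxes: (N, 4) float array of [x1, y1, x2, y2] in pixel coordinates
    rng:   numpy.random.Generator supplying the single shared random draw
    """
    if rng.random() >= p:
        return rgb, tir, boxes
    width = rgb.shape[1]
    rgb_f = rgb[:, ::-1].copy()
    tir_f = tir[:, ::-1].copy()
    boxes_f = boxes.copy()
    # After mirroring, the old right edge becomes the new left edge.
    boxes_f[:, 0] = width - boxes[:, 2]
    boxes_f[:, 2] = width - boxes[:, 0]
    return rgb_f, tir_f, boxes_f
```

Drawing the random decision once, rather than once per modality, is what keeps the two images spatially consistent after augmentation.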

2.4 Registration Alignment

Registration alignment techniques are employed to address spatial discrepancies between images from different sensors, such as RGB and TIR. Differences in resolution and viewpoints often lead to misalignment and distortion, which can negatively impact feature fusion and detection performance [70, 71]. Traditional alignment methods [72, 73, 74], such as scaling, rotation, and affine transformation, are used to align the images but tend to be limited in complex scenes or when nonlinear deformations are present. Recently, deep learning-based alignment techniques [64, 75, 76, 77, 78, 79] have emerged, achieving pixel-level precision by learning feature mappings between RGB and TIR images, and using contrastive loss or self-supervised learning to ensure spatial consistency. Some methods also incorporate attention mechanisms to dynamically adjust feature alignment [47, 70, 74, 80], enhancing both local detail and global consistency.
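For the traditional case, an affine alignment between the two sensors can be sketched as a least-squares fit on matched point pairs, followed by mapping TIR coordinates into the RGB frame. This is a minimal sketch assuming that point correspondences are already available; the helper names `estimate_affine` and `apply_affine` are hypothetical, and the sketch does not handle the nonlinear deformations mentioned above.

```python
import numpy as np


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix A with dst ~= A @ [x, y, 1]^T.

    Needs at least three non-collinear point pairs.
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    design = np.hstack([src, np.ones((src.shape[0], 1))])    # (N, 3)
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)    # (3, 2)
    return params.T                                          # (2, 3)


def apply_affine(matrix, pts):
    """Map (N, 2) points with a 2x3 affine matrix."""
    pts = np.asarray(pts, dtype=float)
    return pts @ matrix[:, :2].T + matrix[:, 2]


# Example: TIR keypoints and the same keypoints located in the RGB image.
tir_pts = [[0, 0], [100, 0], [0, 100], [100, 100]]
true_map = np.array([[1.1, 0.0, 5.0],
                     [0.0, 0.9, -3.0]])
rgb_pts = apply_affine(true_map, tir_pts)
estimated = estimate_affine(tir_pts, rgb_pts)
```

With noise-free correspondences the estimate recovers the scale and offset exactly; with real keypoints it returns the best fit in the least-squares sense.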

3 Method

In this section, we systematically discuss how to improve existing dual-modality object detection algorithms. The discussion focuses on three key aspects: multimodal feature fusion, dual-modality data augmentation, and registration alignment. Specifically, Section 3.1 details the hyperparameter configurations and datasets used in our experiments, while Sections 3.2, 3.3, and 3.4 discuss multimodal feature fusion, dual-modality data augmentation, and registration alignment, respectively.

TABLE I: Configurations of the optimal hyperparameters adopted to implement different single models for training on the KAIST dataset.
Method              Total epochs  Learning rate & decay  Weight decay        Dropout
YOLOv3 [2]          100           $1 \times 10^{-2}$     $5 \times 10^{-4}$  0.5
Faster R-CNN [81]   80            $1 \times 10^{-2}$     $1 \times 10^{-4}$  0.5
SSD [82]            120           $2 \times 10^{-3}$     $5 \times 10^{-4}$  0.5
RetinaNet [83]      100           $1 \times 10^{-3}$     $1 \times 10^{-4}$  0.3
EfficientDet [84]   150           $5 \times 10^{-4}$     $4 \times 10^{-5}$  0.3
Mask R-CNN [85]     90            $2 \times 10^{-2}$     $1 \times 10^{-4}$  0.5
YOLOv5 [86]         300           $1 \times 10^{-2}$     $5 \times 10^{-4}$  0.5
CenterNet [87]      140           $1 \times 10^{-3}$     $1 \times 10^{-4}$  0.4
FCOS [88]           120           $2.5 \times 10^{-3}$   $1 \times 10^{-4}$  0.5
Cascade R-CNN [85]  100           $5 \times 10^{-3}$     $1 \times 10^{-4}$  0.5

3.1 Standardized Experimental Configuration

We conducted a comprehensive analysis of previous single-modality object detection models applied to dual-modality detection tasks. To further enhance model performance and improve the robustness of our benchmarking, we also performed hyperparameter optimization and fine-tuning on these models. Based on dual-modality datasets (including both RGB and TIR data), we systematically explored the adaptability of single-modality models in integrating multimodal information, with particular emphasis on their performance across different modalities. The key hyperparameter configurations are presented in Table I.

Through a grid search approach, we optimized the hyperparameters for all methods and identified the most generalizable and effective configuration. This configuration was selected based on the best performance of various single-modality detection models across multiple datasets, focusing on parameters such as learning rate, weight decay, and dropout. Ultimately, we proposed this “optimal hyperparameter configuration” and strictly adhered to it in our experiments.
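The selection procedure described here can be sketched as an exhaustive search that scores each configuration by its average over several datasets. The candidate values below are illustrative picks loosely based on Table I, and `toy_evaluate` is a hypothetical stand-in for a full train-and-validate run; neither is the search space actually used in the experiments. A lower-is-better metric such as Miss Rate would be negated before being passed in.

```python
from itertools import product
from statistics import mean

# Illustrative candidate values (assumed, not the paper's actual grid).
SEARCH_SPACE = {
    "lr": [1e-2, 5e-3, 1e-3],
    "weight_decay": [5e-4, 1e-4],
    "dropout": [0.3, 0.5],
}


def grid_search(evaluate, datasets, space=SEARCH_SPACE):
    """Return the configuration with the highest score averaged over datasets."""
    keys = list(space)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = mean(evaluate(cfg, name) for name in datasets)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_evaluate(cfg, dataset):
    """Placeholder score that peaks at lr=1e-2, weight_decay=1e-4, dropout=0.5."""
    return -(abs(cfg["lr"] - 1e-2) * 10
             + abs(cfg["weight_decay"] - 1e-4) * 100
             + abs(cfg["dropout"] - 0.5))


best_cfg, best_score = grid_search(toy_evaluate, ["KAIST", "FLIR", "DroneVehicle"])
```

Averaging over datasets before comparing configurations favors settings that generalize, rather than ones tuned to a single benchmark.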

Specifically, the final configuration consists of a learning rate of 0.01 with decay, a weight decay of 0.0001, and a dropout rate of 0.5. We believe that this setup provides stable and efficient performance across a range of multispectral object detection tasks, ensuring fair comparisons between different methods under the same conditions.

In each experiment, we trained for up to 200 epochs, with early stopping set to a patience of 20 epochs. To minimize the impact of random variations, each experiment was repeated 20 times, and the results were averaged to obtain the final performance metrics.
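The protocol above (at most 200 epochs, early stopping with a patience of 20, results averaged over 20 runs) can be sketched as follows. The callbacks `validate_epoch` and `make_run` are hypothetical placeholders for one epoch of training plus validation and for building a seeded run; the sketch assumes a higher-is-better validation score.

```python
from statistics import mean

MAX_EPOCHS = 200
PATIENCE = 20
NUM_REPEATS = 20


def train_with_early_stopping(validate_epoch, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Stop once the validation score has not improved for `patience` epochs.

    validate_epoch(epoch) trains for one epoch and returns the validation score.
    Returns (best score, number of epochs actually run).
    """
    best_score, best_epoch, epochs_run = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        epochs_run = epoch + 1
        score = validate_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_score, epochs_run


def repeated_runs(make_run, num_repeats=NUM_REPEATS):
    """Average the best score over independent runs, one seed per run."""
    return mean(train_with_early_stopping(make_run(seed))[0]
                for seed in range(num_repeats))
```

For a score that improves until epoch 30 and then plateaus, the loop stops after 51 epochs: the 20 non-improving epochs 31 through 50 exhaust the patience.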

Our subsequent experiments utilized the KAIST [89], FLIR, and DroneVehicle [90] datasets. The KAIST dataset is a benchmark for pedestrian detection, combining visible and infrared images to evaluate multispectral detection algorithms. The FLIR dataset includes multi-class vehicle and pedestrian detection tasks with high-resolution thermal imagery. The DroneVehicle dataset focuses on multi-class object detection from a drone’s perspective, covering various complex scenarios.

Regarding performance evaluation, we employed task-specific metrics tailored to the characteristics of each dataset. For the KAIST dataset, we selected Miss Rate as the primary evaluation metric due to its sensitivity to missed detections, which is critical in this context. In contrast, for the FLIR and DroneVehicle datasets, we used mean Average Precision (mAP) as the evaluation metric, as these datasets involve multi-class object detection, and mAP provides a more comprehensive assessment of detection accuracy across classes. This dataset-specific approach ensures a thorough and accurate evaluation of each method’s performance in diverse tasks.
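To make the KAIST metric concrete, the sketch below computes a plain miss rate and a log-average miss rate. The text above only says "Miss Rate"; the log-average form, sampled at nine FPPI (false positives per image) values log-spaced in [10^-2, 10^0], is the convention commonly used for pedestrian benchmarks and is assumed here rather than confirmed by the source.

```python
import math


def miss_rate(num_missed, num_ground_truth):
    """Fraction of ground-truth objects that were not detected."""
    return num_missed / num_ground_truth if num_ground_truth else 0.0


def log_average_miss_rate(fppi, mr, num_points=9):
    """Geometric mean of the miss rate sampled at FPPI values log-spaced in [1e-2, 1e0].

    fppi, mr: points of the miss-rate curve, sorted by increasing FPPI.
    """
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    samples = []
    for ref in refs:
        # Miss rate at the last curve point whose FPPI does not exceed the reference;
        # if the curve never reaches that FPPI, count it as missing everything.
        candidates = [m for f, m in zip(fppi, mr) if f <= ref]
        samples.append(candidates[-1] if candidates else 1.0)
    return math.exp(sum(math.log(max(s, 1e-10)) for s in samples) / num_points)
```

Lower values are better for both quantities, which is the opposite direction from mAP on FLIR and DroneVehicle.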

By leveraging a unified hyperparameter configuration and dataset-specific evaluation metrics, we ensure fair and consistent comparisons between different methods, providing a robust foundation for subsequent performance improvements.

3.2 Multimodal Feature Fusion

Formulations. In multispectral object detection, multimodal feature fusion techniques aim to effectively integrate complementary information from both RGB and TIR images, enhancing detection accuracy and model robustness. Multimodal feature fusion can be categorized into three main approaches: pixel-level fusion, feature-level fusion, and decision-level fusion.

  1. Pixel-Level Fusion. The fusion of RGB and TIR images at the pixel level involves a series of tensor-based transformations, incorporating both adaptive weighting and convolutional refinement to effectively integrate the complementary modalities. Let $\mathcal{I}_{\text{R}} \in \mathbb{R}^{h \times w \times 3}$ denote the RGB image and $\mathcal{I}_{\text{T}} \in \mathbb{R}^{h \times w \times 1}$ denote the TIR image. Initially, the TIR image is expanded to a three-channel format:

    $$\mathcal{I}_{\text{T}}^{\prime} = \mathcal{E}\left(\mathcal{I}_{\text{T}}, \mathcal{J}_{3}\right) = \sum_{c=1}^{3} \left(\mathcal{I}_{\text{T}} \otimes \mathcal{J}_{3}^{(c)}\right) \in \mathbb{R}^{h \times w \times 3}, \tag{1}$$

    where $\mathcal{J}_{3}$ denotes a tensor of shape $1 \times 1 \times 3$, employed to replicate the TIR image along the channel dimension via the tensor outer product operation $\otimes$. The summation term $\sum_{c=1}^{3}$ indicates that this operation is applied independently to each channel, expanding the single-channel TIR image $\mathcal{I}_{\text{T}}$ into a three-channel representation $\mathcal{I}_{\text{T}}^{\prime}$ consistent with the structure of RGB images.

    Next, adaptive pixel-wise weighting matrices $\mathcal{W}_{\text{R}} \in \mathbb{R}^{h \times w \times 3}$ and $\mathcal{W}_{\text{T}} \in \mathbb{R}^{h \times w \times 3}$ are introduced for the RGB and TIR channels. The intermediate fused image $\mathcal{I}_{\text{interim}}$ is defined as:

    $$\mathcal{I}_{\text{interim}} = \mathcal{W}_{\text{R}} \odot \mathcal{I}_{\text{R}} + \mathcal{W}_{\text{T}} \odot \mathcal{I}_{\text{T}}^{\prime} + \boldsymbol{\eta}(\mathbf{n}), \tag{2}$$

    where $\boldsymbol{\eta}(\mathbf{n})$ represents a noise model parameterized by a Gaussian random vector $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, accounting for the inherent uncertainty in sensor measurements.

    To refine the fusion process and incorporate spatial context, convolutional transformations are applied using modality-specific kernels $\mathcal{K}_{\text{R}} \in \mathbb{R}^{k \times k \times 3 \times 3}$ and $\mathcal{K}_{\text{T}} \in \mathbb{R}^{k \times k \times 3 \times 3}$. The convolutional outputs are expressed as:

    $$\varphi_{\ell} = \begin{cases} \mathcal{K}_{\text{R}} \ast \mathcal{I}_{\text{R}} + \beta_{\text{R}}, & \text{if } \ell = \text{R} \\ \mathcal{K}_{\text{T}} \ast \Psi_{\text{T}} + \beta_{\text{T}}, & \text{if } \ell = \text{T} \end{cases}, \tag{3}$$

    where $\ast$ denotes the convolution operation, and $\beta_{\text{R}}$ and $\beta_{\text{T}}$ are the learnable bias terms.

    The final fused image $\mathcal{I}_{\text{fuse}}$ is obtained through a non-linear fusion strategy that incorporates spatially adaptive weight mappings. We introduce non-linear mapping functions $\mathcal{F}_{\text{R}}(\cdot, \cdot)$ and $\mathcal{F}_{\text{T}}(\cdot, \cdot)$. The fused image $\mathcal{I}_{\text{fuse}}$ is expressed as:

    $$\mathcal{I}_{\text{fuse}} = \mathcal{F}_{\text{R}}\left(\mathcal{I}_{\text{interim}}, \mathcal{W}_{\text{R}}\right) \odot \varphi_{\text{R}} + \mathcal{F}_{\text{T}}\left(\mathcal{I}_{\text{interim}}, \mathcal{W}_{\text{T}}\right) \odot \varphi_{\text{T}}. \tag{4}$$

    The functions $\mathcal{F}_{\text{R}}(\cdot, \cdot)$ and $\mathcal{F}_{\text{T}}(\cdot, \cdot)$ can be simplified as follows:

    $$\mathcal{F}_{\ell}\left(\mathcal{I}_{\text{interim}}, \mathcal{W}_{\ell}\right) = \sigma\left(\mathcal{W}_{\ell} + \boldsymbol{\alpha}_{\ell} \odot \delta\left(\mathcal{G}_{\ell}\left(\mathcal{I}_{\text{interim}}\right)\right)\right), \tag{5}$$

    where $\ell \in \{\text{R}, \text{T}\}$, $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ represents the hyperbolic tangent function, and $\mathcal{G}_{\ell}(\cdot)$ is a non-linear spatial filtering operation. The term $\boldsymbol{\alpha}_{\ell} \in \mathbb{R}^{h \times w \times 3}$ is a learnable scaling tensor.

    The proposed pixel-level fusion scheme integrates adaptive weighting, convolutional refinement, and a multi-layered non-linear transformation pipeline to enhance representation capacity. The noise modeling term $\boldsymbol{\eta}(\mathbf{n})$ improves robustness, while the activation functions $\sigma(\cdot)$ and $\delta(\cdot)$ facilitate non-linear interactions between the RGB and TIR modalities.
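The pixel-level pipeline of Eqs. (1)-(5) can be sketched in NumPy as a forward pass with fixed (not learned) parameters. Several choices here are assumptions made only for illustration: the undefined input of the TIR branch in Eq. (3) is taken to be the expanded TIR image, the non-linear spatial filter is a 3×3 median shared by both branches, the convolution is a zero-padded cross-correlation as in common deep learning frameworks, and the noise term is passed in as a pre-sampled array.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(x, kernel, bias):
    """x: (h, w, c_in); kernel: (k, k, c_in, c_out) with odd k; zero 'same' padding."""
    k = kernel.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, kernel.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each kernel tap is a per-pixel channel-mixing matrix product.
            out += padded[i:i + h, j:j + w] @ kernel[i, j]
    return out + bias


def median3x3(x):
    """Assumed stand-in for the non-linear spatial filter G: a 3x3 median per channel."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)


def pixel_level_fusion(rgb, tir, w_r, w_t, k_r, k_t, b_r, b_t, a_r, a_t, noise=0.0):
    """rgb: (h, w, 3); tir: (h, w, 1); w_*, a_*: (h, w, 3); k_*: (k, k, 3, 3)."""
    tir3 = np.repeat(tir, 3, axis=2)                   # Eq. (1): replicate TIR to 3 channels
    interim = w_r * rgb + w_t * tir3 + noise           # Eq. (2): weighted sum plus noise sample
    phi_r = conv2d_same(rgb, k_r, b_r)                 # Eq. (3), RGB branch
    phi_t = conv2d_same(tir3, k_t, b_t)                # Eq. (3), TIR branch (input assumed)
    filtered = median3x3(interim)
    f_r = sigmoid(w_r + a_r * np.tanh(filtered))       # Eq. (5), RGB gate
    f_t = sigmoid(w_t + a_t * np.tanh(filtered))       # Eq. (5), TIR gate
    return f_r * phi_r + f_t * phi_t                   # Eq. (4): gated combination
```

With zero weights and scaling tensors, both gates equal 0.5, so the output reduces to the plain average of the two convolved branches, which is a convenient sanity check.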

  2. Feature-Level Fusion. The mainstream feature-level fusion methods primarily include convolution-based Network-in-Network (NIN) modules and bidirectional attention-based Iterative Cross-modal Feature Enhancement (ICFE) modules [91]. The following sections provide an in-depth introduction to each approach.

    NIN Module. To achieve independent localized non-linear transformations on the RGB and TIR modalities, we first designed a network-in-network module integrated with a residual structure. This module serves as a foundational step for subsequent cross-modal feature enhancement and interaction, where we leverage $1 \times 1$ convolutions to apply fine-grained, spatially localized non-linear mappings that improve the expressiveness of feature representations. Let $\mathcal{X}_{\text{R}}^{l} \in \mathbb{R}^{H \times W \times C}$ and $\mathcal{X}_{\text{T}}^{l} \in \mathbb{R}^{H \times W \times C}$ denote the RGB and TIR modality feature maps at layer $l$, respectively. We define learnable $1 \times 1$ convolution kernels $\mathcal{W}_{\text{R}}^{(1 \times 1)} \in \mathbb{R}^{C \times C}$ and $\mathcal{W}_{\text{T}}^{(1 \times 1)} \in \mathbb{R}^{C \times C}$, applying them with residual connections to each modality for localized feature transformations, as follows:

    $$\mathcal{D}_{\ell}^{l} = \begin{cases} \mathcal{X}_{\text{R}}^{l} + \left(\mathcal{W}_{\text{R}}^{(1 \times 1)} \ast \mathcal{X}_{\text{R}}^{l} + \zeta_{\text{R}}^{l}\right), & \text{if } \ell = \text{R} \\ \mathcal{X}_{\text{T}}^{l} + \left(\mathcal{W}_{\text{T}}^{(1 \times 1)} \ast \mathcal{X}_{\text{T}}^{l} + \zeta_{\text{T}}^{l}\right), & \text{if } \ell = \text{T} \end{cases}. \tag{6}$$

    The residual connection within this transformation preserves original modality-specific information in the transformed feature, mitigating potential information loss or distortion. To achieve adaptive fusion of RGB and TIR modalities, we introduce dynamic weighting coefficients αRsubscript𝛼R\alpha_{\text{R}}italic_α start_POSTSUBSCRIPT R end_POSTSUBSCRIPT and αTsubscript𝛼T\alpha_{\text{T}}italic_α start_POSTSUBSCRIPT T end_POSTSUBSCRIPT, computed through a transformation ν()𝜈\nu(\cdot)italic_ν ( ⋅ ) followed by a shared non-linear function σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ) applied to each transformed feature map:

    αR=σ(ν(𝒟Rl)),αT=σ(ν(𝒟Tl)).formulae-sequencesubscript𝛼R𝜎𝜈superscriptsubscript𝒟R𝑙subscript𝛼T𝜎𝜈superscriptsubscript𝒟T𝑙\alpha_{\text{R}}=\sigma(\nu(\mathcal{D}_{\text{R}}^{l})),\quad\alpha_{\text{T% }}=\sigma(\nu(\mathcal{D}_{\text{T}}^{l})).italic_α start_POSTSUBSCRIPT R end_POSTSUBSCRIPT = italic_σ ( italic_ν ( caligraphic_D start_POSTSUBSCRIPT R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) ) , italic_α start_POSTSUBSCRIPT T end_POSTSUBSCRIPT = italic_σ ( italic_ν ( caligraphic_D start_POSTSUBSCRIPT T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) ) . (7)

    The final fused feature 𝒟fuselsuperscriptsubscript𝒟fuse𝑙\mathcal{D}_{\text{fuse}}^{l}caligraphic_D start_POSTSUBSCRIPT fuse end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT at layer l𝑙litalic_l is given as follows:

    𝒟fusel=αR𝒟Rl+αT𝒟Tl.superscriptsubscript𝒟fuse𝑙direct-productsubscript𝛼Rsuperscriptsubscript𝒟R𝑙direct-productsubscript𝛼Tsuperscriptsubscript𝒟T𝑙\mathcal{D}_{\text{fuse}}^{l}=\alpha_{\text{R}}\odot\mathcal{D}_{\text{R}}^{l}% +\alpha_{\text{T}}\odot\mathcal{D}_{\text{T}}^{l}.caligraphic_D start_POSTSUBSCRIPT fuse end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = italic_α start_POSTSUBSCRIPT R end_POSTSUBSCRIPT ⊙ caligraphic_D start_POSTSUBSCRIPT R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT + italic_α start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ⊙ caligraphic_D start_POSTSUBSCRIPT T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT . (8)

    Through these operations, the NIN module not only performs modality-specific, localized feature transformations but also enables adaptive and balanced feature fusion. This module strengthens the feature discriminability and robustness, while preserving localized information via non-linear activation and residual connections.
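The NIN computation in Eqs. (6)-(8) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the text does not define $\zeta$, $\nu(\cdot)$, or $\sigma(\cdot)$, so the sketch assumes that $\zeta$ is a per-channel bias, that $\nu$ is global average pooling over the spatial dimensions, and that $\sigma$ is a sigmoid. All function names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on an (H, W, C) map: a per-pixel linear map over channels."""
    return x @ weight + bias

def nin_branch(x, weight, bias):
    """Eq. (6): residual 1x1 transformation of one modality (bias plays the role of zeta, assumed)."""
    return x + conv1x1(x, weight, bias)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_weight(d):
    """Eq. (7) under assumed choices: nu = global average pooling, sigma = sigmoid.
    Returns per-channel weights of shape (1, 1, C)."""
    return sigmoid(d.mean(axis=(0, 1), keepdims=True))

def nin_fuse(x_rgb, x_tir, w_rgb, b_rgb, w_tir, b_tir):
    """Eqs. (6)-(8): transform each modality, then blend with adaptive weights."""
    d_rgb = nin_branch(x_rgb, w_rgb, b_rgb)
    d_tir = nin_branch(x_tir, w_tir, b_tir)
    return fusion_weight(d_rgb) * d_rgb + fusion_weight(d_tir) * d_tir

# Toy inputs: random feature maps and small random kernels.
rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x_rgb = rng.normal(size=(H, W, C))
x_tir = rng.normal(size=(H, W, C))
w_rgb = rng.normal(scale=0.1, size=(C, C))
w_tir = rng.normal(scale=0.1, size=(C, C))
b_rgb = np.zeros(C)
b_tir = np.zeros(C)

fused = nin_fuse(x_rgb, x_tir, w_rgb, b_rgb, w_tir, b_tir)
print(fused.shape)  # (4, 5, 8)
```

Because a 1x1 convolution mixes channels independently at every pixel, it reduces to a matrix product over the channel axis, which is why the fused map keeps the input's spatial shape.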

    ICFE Module. The ICFE module progressively enhances feature representations of RGB and TIR modalities by iteratively exchanging and refining complementary information, ultimately producing a single fused feature representation. Let $\mathcal{T}_{\text{R}}^{(0)}$ and $\mathcal{T}_{\text{T}}^{(0)}$ represent the initial RGB and TIR features, respectively, and let the final fused feature representation after $n$ iterations be denoted as $\mathcal{V}_{\text{fuse}}^{(n)}$. The following outlines the detailed formulae of this process.

    At the $k$-th iteration, multi-head queries, keys, and values are generated for both the RGB and TIR modalities. Suppose there are $H$ attention heads, indexed by $h$. For the $h$-th attention head, we compute the query matrix $\mathcal{Q}_{\text{R}}^{(k,h)}$ for RGB features, and the key matrix $\mathcal{K}_{\text{T}}^{(k,h)}$ and value matrix $\mathcal{V}_{\text{T}}^{(k,h)}$ for TIR features:

    $$\mathcal{Q}_{\text{R}}^{(k,h)}=\mathcal{T}_{\text{R}}^{(k)}\mathcal{W}_{\text{Q}}^{(h)},\quad\mathcal{K}_{\text{T}}^{(k,h)}=\mathcal{T}_{\text{T}}^{(k)}\mathcal{W}_{\text{K}}^{(h)},\quad\mathcal{V}_{\text{T}}^{(k,h)}=\mathcal{T}_{\text{T}}^{(k)}\mathcal{W}_{\text{V}}^{(h)}, \tag{9}$$

    where $\mathcal{W}_{\text{Q}}^{(h)},\mathcal{W}_{\text{K}}^{(h)},\mathcal{W}_{\text{V}}^{(h)}\in\mathbb{R}^{d\times d_{H}}$ are learnable projection matrices, and $d_{H}=d/H$ represents the dimensionality per attention head.

    To obtain the cross-modally enhanced RGB features $\mathcal{Z}_{\text{R}}^{(k,h)}$, we calculate the weighted matrix by applying the softmax function to the scaled dot product of the query and key matrices, then multiply it with the value matrix:

    $$\mathcal{Z}_{\text{R}}^{(k,h)}=\operatorname{softmax}\left(\frac{\mathcal{Q}_{\text{R}}^{(k,h)}\left(\mathcal{K}_{\text{T}}^{(k,h)}\right)^{\top}}{\sqrt{d_{H}}}\right)\mathcal{V}_{\text{T}}^{(k,h)}. \tag{10}$$

    Then, we concatenate the features from all attention heads (denoted by $\Gamma$ as the concatenation operation) and project them back to the original feature space using an output projection matrix $\mathcal{W}_{\text{O}}$:

    $$\mathcal{Z}_{\text{R}}^{(k)}=\Gamma\left(\mathcal{Z}_{\text{R}}^{(k,1)},\ldots,\mathcal{Z}_{\text{R}}^{(k,H)}\right)\mathcal{W}_{\text{O}}, \tag{11}$$

    where $\Gamma(\cdot)$ represents the concatenation operation applied across all attention heads.

    In each iteration, the RGB and TIR features are combined to produce an intermediate fused feature representation $\mathcal{V}_{\text{fuse}}^{(k)}$, with learnable weighting coefficients $\lambda^{(k)}$ and $\mu^{(k)}$ controlling the fusion:

    $$\mathcal{V}_{\text{fuse}}^{(k)}=\lambda^{(k)}\odot\mathcal{Z}_{\text{R}}^{(k)}+\mu^{(k)}\odot\mathcal{Z}_{\text{T}}^{(k)}, \tag{12}$$

    where $\mathcal{Z}_{\text{T}}^{(k)}$ is the cross-modally enhanced TIR feature obtained symmetrically to $\mathcal{Z}_{\text{R}}^{(k)}$.

    To further enhance non-linear representation capabilities, a non-linear activation function $\delta(\cdot)$ is applied with residual connection to the fused feature in each iteration. After $n$ iterations, the final fused feature representation is given by:

    $$\mathcal{V}_{\text{fuse}}^{(n)}=\mathcal{V}_{\text{fuse}}^{(n-1)}+\delta\left(\mathcal{V}_{\text{fuse}}^{(n-1)}\right). \tag{13}$$
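One ICFE iteration (Eqs. (9)-(13)) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions rather than the authors' code. Feature maps are assumed to be flattened into $(N, d)$ token matrices, and the same projection matrices are reused for both attention directions because the notation carries no modality subscript. The coefficients $\lambda$ and $\mu$ are fixed scalars here, and $\delta$ is taken to be tanh. How the per-modality features are updated between iterations is not specified in the text, so only a single step is shown. All function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(t_query, t_context, w_q, w_k, w_v, w_o):
    """Eqs. (9)-(11): queries from one modality attend to keys/values of the other.
    t_query, t_context: (N, d); w_q, w_k, w_v: (H, d, d_H); w_o: (d, d)."""
    num_heads, _, d_head = w_q.shape
    heads = []
    for h in range(num_heads):
        q = t_query @ w_q[h]                        # Eq. (9), queries
        k = t_context @ w_k[h]                      # Eq. (9), keys
        v = t_context @ w_v[h]                      # Eq. (9), values
        attn = softmax(q @ k.T / np.sqrt(d_head))   # Eq. (10), attention weights
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1) @ w_o     # Eq. (11)

def icfe_step(t_rgb, t_tir, params, lam, mu):
    """One iteration: bidirectional cross-attention, weighted fusion, residual non-linearity."""
    z_rgb = cross_attention(t_rgb, t_tir, *params)  # RGB queries, TIR keys/values
    z_tir = cross_attention(t_tir, t_rgb, *params)  # symmetric direction
    v_fuse = lam * z_rgb + mu * z_tir               # Eq. (12)
    return v_fuse + np.tanh(v_fuse)                 # residual form of Eq. (13), delta = tanh (assumed)

rng = np.random.default_rng(0)
N, d, num_heads = 6, 8, 2
d_head = d // num_heads
t_rgb = rng.normal(size=(N, d))
t_tir = rng.normal(size=(N, d))
params = (
    rng.normal(scale=0.1, size=(num_heads, d, d_head)),  # W_Q
    rng.normal(scale=0.1, size=(num_heads, d, d_head)),  # W_K
    rng.normal(scale=0.1, size=(num_heads, d, d_head)),  # W_V
    rng.normal(scale=0.1, size=(d, d)),                  # W_O
)
fused = icfe_step(t_rgb, t_tir, params, lam=0.5, mu=0.5)
print(fused.shape)  # (6, 8)
```

Each head works in a $d_H = d/H$ dimensional subspace, so concatenating the $H$ head outputs restores the original width $d$ before the output projection.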
  3. Decision-Level Fusion. In decision-level fusion, RGB and TIR modalities undergo separate feature extraction and preliminary detection, and their fusion occurs at the final decision stage. Let the detection results for RGB and TIR modalities be denoted as $\mathcal{M}_{\text{R}}$ and $\mathcal{M}_{\text{T}}$, respectively. The following describes two advanced fusion strategies for combining these decisions.

    Confidence-Based Weighting with Normalization. To refine the fusion process, confidence scores $\mathcal{C}_{\text{R}}$ and $\mathcal{C}_{\text{T}}$ reflect each modality’s reliability and serve as normalization factors. These scores are obtained through a scaling function $\psi(\cdot)$ and normalized using $\tau(\cdot)$:

    $$\mathcal{C}_{\text{R}}=\tau\left(\psi\left(\mathcal{M}_{\text{R}}\right)\right),\quad\mathcal{C}_{\text{T}}=\tau\left(\psi\left(\mathcal{M}_{\text{T}}\right)\right). \tag{14}$$

    The confidence-weighted fusion result $\mathcal{Q}_{\text{fuse}}$ is:

    $$\mathcal{Q}_{\text{fuse}}=\frac{\left(\mathcal{C}_{\text{R}}\odot\mathcal{M}_{\text{R}}+\mathcal{C}_{\text{T}}\odot\mathcal{M}_{\text{T}}\right)+\epsilon}{\mathcal{C}_{\text{R}}+\mathcal{C}_{\text{T}}+\epsilon}, \tag{15}$$

    where $\odot$ represents element-wise weighting, and $\epsilon$ is a small constant to prevent division by zero, thereby stabilizing the computation.
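As a rough illustration of Eqs. (14)-(15), the sketch below fuses two arrays of detection scores. It makes several assumptions that are not in the text: the detections of the two modalities are already matched one-to-one, $\psi$ is a temperature scaling, and $\tau$ is a logistic squashing into $(0, 1)$. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def confidence(scores, temperature=1.0):
    """Eq. (14) under assumed choices for psi and tau."""
    scaled = np.asarray(scores, dtype=float) / temperature  # psi: temperature scaling (assumed)
    return 1.0 / (1.0 + np.exp(-scaled))                    # tau: logistic squashing (assumed)

def confidence_weighted_fusion(m_rgb, m_tir, eps=1e-6):
    """Eq. (15): element-wise confidence-weighted average of two score arrays."""
    m_rgb = np.asarray(m_rgb, dtype=float)
    m_tir = np.asarray(m_tir, dtype=float)
    c_rgb, c_tir = confidence(m_rgb), confidence(m_tir)
    return (c_rgb * m_rgb + c_tir * m_tir + eps) / (c_rgb + c_tir + eps)

# Three matched detections; the third has identical scores in both modalities.
m_rgb = np.array([0.90, 0.20, 0.55])
m_tir = np.array([0.60, 0.80, 0.55])
q_fuse = confidence_weighted_fusion(m_rgb, m_tir)
print(q_fuse)
```

Since the confidences are positive, each fused score is, up to the small $\epsilon$ terms, a weighted average of the two inputs. It therefore stays between them, and it equals the shared value when the two modalities agree.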

    Hierarchical Fusion with Multi-stage Process. Hierarchical fusion enhances robustness by applying both local and global fusion steps. Initially, a region-based fusion is applied independently within each modality. This local fusion step can be represented as:

    $$\mathcal{Q}_{\text{local}}=h_{\text{local}}\left(\kappa_{\text{R}}\cdot\mathcal{M}_{\text{R}},\ \kappa_{\text{T}}\cdot\mathcal{M}_{\text{T}}\right), \tag{16}$$

    where $h_{\text{local}}(\cdot)$ represents the local fusion function, such as Simple Average, Confidence-Weighted Average, or Maximum Selection, and $\kappa_{\text{R}}$ and $\kappa_{\text{T}}$ are weighting factors specific to each modality.

    After obtaining the locally fused results, a global aggregation function combines these results across regions or categories. The global fusion step is given by:

    $$\mathcal{Q}_{\text{fuse}}=h_{\text{global}}\left(\sum_{i=1}^{N}\theta_{i}\,\mathcal{Q}_{\text{local}}^{(i)}\right), \tag{17}$$

    where $h_{\text{global}}(\cdot)$ denotes the global fusion function, $N$ is the number of local regions or categories, and $\theta_{i}$ are adaptive coefficients for each local fused region $\mathcal{Q}_{\text{local}}^{(i)}$.

    This hierarchical approach provides finer control over region-specific interactions, enhancing robustness in complex scenes.
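The two-stage process of Eqs. (16)-(17) can be sketched as below. It is a simplified illustration, not the authors' implementation. Each of the $N$ regions is assumed to carry a vector of $K$ per-category scores, $h_{\text{local}}$ is either a simple average or maximum selection, the coefficients $\theta_i$ are supplied by the caller rather than learned, and $h_{\text{global}}$ defaults to the identity. All names are hypothetical.

```python
import numpy as np

def local_fuse(m_rgb, m_tir, kappa_rgb=1.0, kappa_tir=1.0, mode="average"):
    """Eq. (16): fuse modality-weighted results region by region."""
    a = kappa_rgb * np.asarray(m_rgb, dtype=float)
    b = kappa_tir * np.asarray(m_tir, dtype=float)
    if mode == "average":   # Simple Average
        return 0.5 * (a + b)
    if mode == "max":       # Maximum Selection
        return np.maximum(a, b)
    raise ValueError(f"unknown local fusion mode: {mode}")

def global_fuse(q_local, theta, h_global=lambda x: x):
    """Eq. (17): weighted sum over the N regions, then h_global (identity by default, assumed)."""
    theta = np.asarray(theta, dtype=float)
    return h_global(np.tensordot(theta, q_local, axes=1))

# N = 3 regions, K = 2 categories per region.
m_rgb = np.array([[0.8, 0.1], [0.6, 0.3], [0.2, 0.7]])
m_tir = np.array([[0.7, 0.2], [0.4, 0.5], [0.3, 0.9]])
q_local = local_fuse(m_rgb, m_tir, mode="average")
q_fuse = global_fuse(q_local, theta=[0.5, 0.3, 0.2])
print(q_local.shape, q_fuse.shape)  # (3, 2) (2,)
```

Keeping the local and global stages as separate functions mirrors the description above: the local rule can be swapped without touching the aggregation across regions.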

Experimental Observations

We first evaluate the three fusion methods through experiments and identify feature-level fusion as the most effective approach. Building on this insight, we further optimize the combination of feature-level fusion modules to achieve the best performance.

TABLE II: Performance metrics of advanced single-modality detection models under different fusion mechanisms. The results are averaged over 100 independent runs, with the standard deviations provided. The best result in each row is highlighted in bold.

| Backbone | Method | Dataset | Pixel-Fusion | Feature-Fusion | Decision-Fusion | RGB-Output | TIR-Output |
|---|---|---|---|---|---|---|---|
| Resnet50 | YOLO-V5 [86] | KAIST | 29.31 ± 1.21 | **15.17 ± 1.59** | 17.37 ± 2.24 | 18.39 ± 1.75 | 17.89 ± 2.92 |
| Resnet50 | YOLO-V5 [86] | FLIR | 63.53 ± 2.81 | **73.22 ± 1.66** | 68.71 ± 1.20 | 67.84 ± 1.17 | 68.23 ± 1.75 |
| Resnet50 | CO-DETR [92] | KAIST | 26.17 ± 2.01 | **14.67 ± 2.35** | 16.83 ± 1.17 | 17.65 ± 2.55 | 17.17 ± 1.57 |
| Resnet50 | CO-DETR [92] | FLIR | 62.39 ± 2.65 | **78.97 ± 2.50** | 69.36 ± 2.40 | 68.93 ± 2.12 | 68.35 ± 2.74 |
| Resnet50 | RTMDET [93] | KAIST | 23.59 ± 1.64 | **14.13 ± 2.58** | 18.97 ± 1.15 | 17.36 ± 1.85 | 16.33 ± 1.45 |
| Resnet50 | RTMDET [93] | FLIR | 57.81 ± 1.97 | **75.36 ± 2.31** | 66.29 ± 1.71 | 64.32 ± 2.44 | 63.97 ± 2.12 |
| Resnet50 | DINO [94] | KAIST | 27.73 ± 1.98 | 18.93 ± 2.16 | **16.67 ± 1.12** | 19.57 ± 2.66 | 17.98 ± 1.81 |
| Resnet50 | DINO [94] | FLIR | 57.69 ± 2.87 | **76.83 ± 2.33** | 69.15 ± 2.14 | 67.96 ± 2.02 | 67.71 ± 2.33 |
| Vit-L | YOLO-V5 [86] | KAIST | 28.99 ± 1.98 | **13.52 ± 2.18** | 17.61 ± 2.32 | 18.02 ± 2.45 | 18.63 ± 2.66 |
| Vit-L | YOLO-V5 [86] | FLIR | 62.72 ± 1.91 | **72.67 ± 2.71** | 68.72 ± 1.99 | 68.12 ± 2.47 | 68.61 ± 2.99 |
| Vit-L | CO-DETR [92] | KAIST | 27.95 ± 2.53 | **13.63 ± 1.53** | 16.85 ± 2.34 | 18.36 ± 1.90 | 17.17 ± 2.97 |
| Vit-L | CO-DETR [92] | FLIR | 63.30 ± 2.89 | **76.55 ± 2.35** | 70.72 ± 1.52 | 67.27 ± 2.95 | 69.65 ± 2.11 |
| Vit-L | RTMDET [93] | KAIST | 22.60 ± 2.32 | 15.11 ± 2.65 | **14.53 ± 2.95** | 16.34 ± 2.15 | 16.19 ± 2.98 |
| Vit-L | RTMDET [93] | FLIR | 56.75 ± 2.78 | **74.39 ± 2.19** | 66.79 ± 1.61 | 65.63 ± 2.22 | 65.07 ± 1.68 |
| Vit-L | DINO [94] | KAIST | 26.97 ± 2.68 | **12.21 ± 2.95** | 15.54 ± 1.95 | 19.89 ± 1.85 | 18.08 ± 2.34 |
| Vit-L | DINO [94] | FLIR | 56.11 ± 1.89 | **77.12 ± 1.99** | 70.61 ± 1.62 | 68.37 ± 2.95 | 69.97 ± 2.11 |

  1. Fusion Method Experiments. In our preliminary experiments, we compared the effects of the three fusion methods (pixel-level, feature-level, and decision-level) on the improved multispectral model. The experimental results are reported in Table II. The choice of fusion method has a significant impact on the detection accuracy of the optimized model.

    (i) Observations on Pixel-Level Fusion. Pixel-level fusion exhibits lower stability and detection accuracy than single-modality detection on most datasets, with only slight improvements observed in a few specific cases. This may be because pixel-level fusion combines the RGB and TIR images at the input stage, introducing a significant amount of redundant information and noise. As a result, the model struggles to learn the key features of each modality.

    (ii) Observations on Feature-Level Fusion. Compared to single-modality detection, feature-level fusion demonstrates significant improvements in both stability and detection accuracy across most datasets. This is likely because feature-level fusion operates on the high-level features extracted by the backbone, allowing for efficient fusion while minimizing redundant features and preserving as much valuable information as possible.

    (iii) Observations on Decision-Level Fusion. Compared to single-modality detection, decision-level fusion can improve accuracy to some extent, but it is unstable with certain methods, such as the RTMDet framework [93]. This instability may arise because decision-level fusion processes RGB and TIR modality information independently and merges them only at the decision stage. Consequently, this approach struggles to leverage complementary information between the two modalities, especially in scenarios where such information is crucial, such as varying weather conditions or significant changes in viewpoint.

  2. Feature-Fusion Experiments. To determine the most effective fusion strategy, we selected the best-performing feature-level fusion method from prior experiments for further analysis. Using single-modality detection models as baselines, we introduced the NIN and ICFE modules under different input modalities. This approach enabled a systematic evaluation of their contributions to feature representation and fusion performance. Key results are shown in Figure 1, along with notable findings.

    Figure 1: Performance metrics obtained from 100 independent repetitions on the KAIST, FLIR, and DroneVehicle datasets using different backbones and feature fusion modules. The letter B represents the baseline, I represents the ICFE module, and N represents the NIN module, while the content in parentheses indicates the modality input to the fusion module. The left images show the results from the experiments using the Yolov5 detector, while the right images present the results from the experiments using the Co-Detr detector. Each column in the figures, from left to right, represents: the results with Resnet50 as the backbone on the KAIST, FLIR, and DroneVehicle datasets, followed by the results with Vit-L as the backbone on the same datasets.
    1. (i):

      Observations on Datasets. After applying fusion modules, all detection frameworks showed varying degrees of improvement. Notably, on datasets with significant changes in lighting conditions, shadows, and viewpoints (e.g., the FLIR dataset), both the NIN-structured fusion module and the ICFE-structured fusion module exhibited more pronounced performance gains. This is likely because complementary information plays a crucial role in improving detection accuracy when the two modalities differ substantially, which highlights the effectiveness of the fusion modules.

    2. (ii):

      Observations on Fusion Modules. We found that different fusion module architectures exhibit high sensitivity to various backbone networks. Specifically, in detection networks using Resnet50 as the backbone, the NIN-structured fusion module showed notable improvements in detection accuracy. On the other hand, for backbones based on the Vit-L structure, the ICFE module demonstrated better performance when fusing data from the RGB and TIR channels. This difference in performance may be attributed to the fact that Resnet50 is a convolution-based architecture, where the NIN module effectively fuses local features, maintaining the continuity and consistency of convolutional features, thus leading to better results. In contrast, Vit-L excels at capturing global features, and the ICFE module, with its cross-feature and attention mechanisms, further enhances the fusion of global information, resulting in superior performance.

    3. (iii):

      Observations on the ICFE Fusion Module Branches. For the branch inputs of the ICFE module, we experimented with various connection methods, as illustrated in Figure 1. The experimental results show that using the ICFE module alone for fusion, regardless of the connection method, failed to consistently improve the detection accuracy. This outcome may be attributed to the fact that when only a single module is used for fusion with inputs from the same modality, the ICFE module may repeatedly amplify background noise or irrelevant features, causing the model to focus excessively on the noise rather than the target, thereby reducing detection performance. Furthermore, when inputs from different modalities (RGB and TIR) are used, their features are not deeply fused or integrated (e.g., through NIN’s nonlinear transformation), meaning the complementary information between modalities is not fully leveraged.

      We further attempted to add an NIN connection structure after the iterative ICFE module, using different input methods. The experimental results indicate that the R+T+NIN connection significantly improves detection accuracy, while the R+R and T+T configurations, following NIN extraction, resulted in poorer performance. This is likely because the NIN module can integrate and fuse cross-modality features at a finer granularity, leading to notable improvements in detection performance.

    4. (iv):

      Observations on Robustness. The experimental results indicate that different input configurations (e.g., R+T, R+R, T+T) have a significant impact on the model’s robustness. When using the same modality inputs (R+R or T+T), the model’s detection performance tends to be unstable and more susceptible to background noise. In contrast, when using the R+T combination, especially when coupled with the NIN module for feature fusion, the model demonstrates significantly higher robustness across various environmental conditions. These findings suggest that the complementary information between modalities plays a crucial role in enhancing the model’s ability to withstand environmental uncertainty and noise interference.

3.3 Dual-Modality Data Augmentation

Formulations. Dual-modality data augmentation is a vital technique for enhancing the performance of multispectral object detection models. By applying consistent or complementary transformations to both modalities during training, this approach not only ensures the correlation between features from the two data sources but also enables the simulation of specific test scenarios (e.g., low-light conditions or small samples). Additionally, it effectively addresses information loss caused by feature dimensionality reduction, particularly in cases where the data distributions of the two modalities differ significantly. Mainstream dual-modality data augmentation strategies can be broadly categorized into three types: Geometric Transformations, Pixel-Level Transformations, and Multimodal-Specific Enhancements. These strategies will be detailed in the following sections.

  1. Geometric Transformations. Geometric transformation strategies involve a range of spatial modifications designed to maximize the geometric diversity of training samples, enabling the model to generalize more effectively to varied object poses, orientations, scales, and viewpoints. The overall approach to geometric transformation strategies is outlined below, with most transformations formulated based on the following equation. Let the input image be represented by $\mathcal{I}$, the processed image by $\mathcal{I}^{\prime}$, and the geometric transformation function by $\mathcal{F}_{g}$. This transformation can be formalized as:

    $$\mathcal{I}^{\prime} = \mathcal{F}_{g}(\mathcal{I}) = \rho \cdot \mathcal{I} + \Upsilon, \tag{18}$$

    where $\rho$ denotes the composite affine transformation matrix, which integrates non-uniform scaling, complex rotation, and controlled mirroring, and $\Upsilon$ represents the non-linear offset coefficient.

    The matrix $\rho$ can be decomposed as:

    $$\rho = \mathcal{S}(c_{x}, c_{y}) \cdot \mathcal{R}(\theta) \cdot \mathcal{U}_{\ell}(\phi) \cdot \mathcal{E}(t_{x}, t_{y}), \tag{19}$$

    where each component transformation is defined as follows:

    - $\mathcal{S}(c_{x}, c_{y})$ represents a non-uniform scaling matrix, applying differential scaling along the $x$ and $y$ axes:

    $$\mathcal{S}(c_{x}, c_{y}) = \begin{bmatrix} c_{x} & 0 \\ 0 & c_{y} \end{bmatrix}, \tag{20}$$

    where $c_{x}$ and $c_{y}$ are the horizontal and vertical scaling factors, respectively, which may vary based on context-specific augmentation parameters.

    - $\mathcal{R}(\theta)$ denotes the rotation matrix, which rotates the image by an angle $\theta$ in the 2D plane:

    $$\mathcal{R}(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}. \tag{21}$$

    - $\mathcal{U}_{\ell}(\phi)$ represents the mirroring transformation, capable of inducing horizontal or vertical flips, denoted as follows:

    $$\mathcal{U}_{\ell}(\phi) = \left\{ \begin{array}{ll} \begin{bmatrix} -\cos(\phi) & 0 \\ 0 & \cos(\phi) \end{bmatrix}, & \text{if } \ell = \text{horizontal} \\[12pt] \begin{bmatrix} \cos(\phi) & 0 \\ 0 & -\cos(\phi) \end{bmatrix}, & \text{if } \ell = \text{vertical} \end{array} \right. , \tag{22}$$

    where $\phi$ is a stochastic parameter controlling the mirroring type, potentially following a probabilistic distribution to introduce randomness into the flipping process. This matrix may be further generalized to incorporate combinations of horizontal and vertical mirroring transformations, represented as:

    $$\mathcal{U}(\phi_{h}, \phi_{v}) = \begin{bmatrix} \cos(\phi_{h}) \cdot \cos(\phi_{v}) & 0 \\ 0 & \cos(\phi_{h}) \cdot \cos(\phi_{v}) \end{bmatrix}. \tag{23}$$

    - $\mathcal{E}(t_{x}, t_{y})$ is the translation matrix, introducing positional shifts along the $x$ and $y$ axes:

    $$\mathcal{E}(t_{x}, t_{y}) = \begin{bmatrix} 1 & 0 & t_{x} \\ 0 & 1 & t_{y} \\ 0 & 0 & 1 \end{bmatrix}, \tag{24}$$

    where $t_{x}$ and $t_{y}$ represent horizontal and vertical translations, respectively. These shifts may vary based on contextual constraints to simulate different spatial orientations.
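
The decomposition in Eqs. (19)-(24) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the benchmark's implementation. It assumes that the 2x2 scaling, rotation, and mirroring matrices are embedded into 3x3 homogeneous matrices so that they can be multiplied with the 3x3 translation matrix, that the offset $\Upsilon$ of Eq. (18) is zero, and that the same matrix is applied to both modalities (synchronized augmentation). All function names are hypothetical.

```python
import numpy as np

def to_homogeneous(m2):
    """Embed a 2x2 linear map into a 3x3 homogeneous matrix (assumption of this sketch)."""
    m3 = np.eye(3)
    m3[:2, :2] = m2
    return m3

def scaling(cx, cy):
    # Eq. (20): non-uniform scaling along x and y.
    return to_homogeneous(np.array([[cx, 0.0], [0.0, cy]]))

def rotation(theta):
    # Eq. (21): rotation by theta (radians) in the 2D plane.
    c, s = np.cos(theta), np.sin(theta)
    return to_homogeneous(np.array([[c, -s], [s, c]]))

def mirroring(phi, axis):
    # Eq. (22): horizontal or vertical flip, modulated by cos(phi).
    c = np.cos(phi)
    if axis == "horizontal":
        return to_homogeneous(np.array([[-c, 0.0], [0.0, c]]))
    if axis == "vertical":
        return to_homogeneous(np.array([[c, 0.0], [0.0, -c]]))
    raise ValueError(f"unknown axis: {axis}")

def translation(tx, ty):
    # Eq. (24): translation in homogeneous coordinates.
    m = np.eye(3)
    m[0, 2], m[1, 2] = tx, ty
    return m

def compose(cx, cy, theta, phi, axis, tx, ty):
    # Eq. (19): rho = S * R * U * E (the translation acts on a point first).
    return scaling(cx, cy) @ rotation(theta) @ mirroring(phi, axis) @ translation(tx, ty)

def apply_to_points(rho, pts):
    """Apply rho to an (N, 2) array of (x, y) coordinates, e.g. box corners."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (pts_h @ rho.T)[:, :2]

# Synchronized use: one sampled matrix is shared by the RGB and TIR annotations.
rho = compose(cx=2.0, cy=2.0, theta=0.0, phi=0.0, axis="horizontal", tx=1.0, ty=0.0)
corners = np.array([[0.0, 0.0], [3.0, 1.0]])
rgb_corners = apply_to_points(rho, corners)
tir_corners = apply_to_points(rho, corners)
```

Sharing one sampled matrix across the two modalities keeps the RGB and TIR annotations aligned after augmentation. Warping the pixel grids themselves would additionally need an image-resampling routine, which is omitted here.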

  2. Pixel-Level Transformations. Pixel-level transformation strategies modify the pixel values of an image, such as by adding noise, adjusting colors, or altering contrast, to simulate various imaging conditions. This enhances the model’s robustness to lighting variations, noise, and diverse environmental factors. The following introduces pixel-level transformation strategies, with most transformations adhering to the approach outlined below. Let the pixel matrix of the image be $\mathcal{P}$; the transformation can be expressed through the following steps:

    Noise Addition. To simulate sensor noise or environmental interference, Gaussian noise $\mathcal{N}(\sigma)$ with a standard deviation of $\sigma$ is added to the pixel matrix:

    $$\mathcal{P}_{\text{noise}} = \mathcal{P} + \mathcal{N}(\sigma), \tag{25}$$

    where $\mathcal{N}(\sigma)$ represents Gaussian noise with variance $\sigma^{2}$.

    Color Adjustment. To simulate different lighting conditions or sensor biases, color adjustment is applied using a scaling factor $\alpha$:

    $$\mathcal{P}_{\text{color}} = \mathcal{C}(\alpha) \cdot \mathcal{P}_{\text{noise}}, \tag{26}$$

    where $\alpha$ is the color adjustment factor that controls the brightness or saturation of each channel.

    Contrast Adjustment. To enhance or reduce image details, contrast adjustment is applied using a contrast factor $\beta$:

    $$\mathcal{P}_{\text{contrast}} = \mathcal{D}(\beta) \cdot (\mathcal{P}_{\text{color}} - \mu) + \mu, \tag{27}$$

    where $\beta$ is the contrast adjustment factor and $\mu$ is the mean pixel value used for centering the pixel matrix.

    Final Pixel Transformation. The final pixel transformation combines all the above operations:

    $$\mathcal{P}^{\prime} = \mathcal{D}(\beta) \cdot \mathcal{C}(\alpha) \cdot (\mathcal{P} + \mathcal{N}(\sigma)). \tag{28}$$
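
The three steps of Eqs. (25)-(27) can be chained as in the following NumPy sketch. It is illustrative only and makes several assumptions: $\mathcal{C}(\alpha)$ is taken to be a per-channel gain, $\mathcal{D}(\beta)$ a scalar gain, $\mu$ the mean of the color-adjusted image, and the result is clipped to the 8-bit range. The function names are hypothetical. The sketch keeps the mean-centering of Eq. (27), so it coincides with the compact form of Eq. (28) only when $\mu = 0$ or $\beta = 1$.

```python
import numpy as np

def add_noise(p, sigma, rng):
    # Eq. (25): additive zero-mean Gaussian noise with standard deviation sigma.
    return p + rng.normal(0.0, sigma, size=p.shape)

def adjust_color(p, alpha):
    # Eq. (26): per-channel gain; alpha broadcasts over an H x W x C array.
    return p * np.asarray(alpha, dtype=float)

def adjust_contrast(p, beta):
    # Eq. (27): scale deviations from the mean, then restore the mean.
    mu = p.mean()
    return beta * (p - mu) + mu

def pixel_transform(p, sigma, alpha, beta, rng):
    """Chain noise, color, and contrast adjustment, then clip to [0, 255]."""
    out = add_noise(p.astype(float), sigma, rng)
    out = adjust_color(out, alpha)
    out = adjust_contrast(out, beta)
    return np.clip(out, 0.0, 255.0)

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 4, 3)).astype(float)
augmented = pixel_transform(rgb, sigma=2.0, alpha=[1.0, 0.9, 1.1], beta=1.2, rng=rng)
```

For a single-channel TIR image the same functions apply with a scalar `alpha`. Whether the two modalities share the sampled parameters depends on whether a synchronized or a complementary strategy is intended.
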
  3. Multimodal-Specific Enhancements. This class of strategies focuses on the unique characteristics of dual-light data, employing dual-channel synchronized or complementary enhancements tailored to specific test scenarios. By applying different augmentation methods to each modality, these strategies effectively enhance the cooperative performance of multimodal images and improve accuracy in targeted detection scenarios. Let the RGB image be denoted as $\mathcal{I}_{\text{R}}$ and the TIR image as $\mathcal{I}_{\text{T}}$. The multimodal-specific enhancement can be expressed as:

    $$\begin{bmatrix} \mathcal{I}^{\prime}_{\text{R}} \\ \mathcal{I}^{\prime}_{\text{T}} \end{bmatrix} = \tau\left(\mathcal{I}_{\text{R}}, \mathcal{I}_{\text{T}}\right), \tag{29}$$

    where $\tau$ represents the multimodal enhancement function, which may include cross-modal alignment and modality-specific feature enhancement. $\mathcal{I}^{\prime}_{\text{R}}$ and $\mathcal{I}^{\prime}_{\text{T}}$ represent the enhanced RGB and TIR images, respectively. Specifically, the enhancement process can be further detailed as:

    $$\begin{bmatrix} \mathcal{I}^{\prime}_{\text{R}} \\ \mathcal{I}^{\prime}_{\text{T}} \end{bmatrix} = \begin{bmatrix} \varpi_{\text{R}}\left(\mathcal{I}_{\text{R}}, \mathcal{A}_{\text{T}} \cdot \mathcal{L}(\mathcal{I}_{\text{T}})\right) \\ \varpi_{\text{T}}\left(\mathcal{I}_{\text{T}}, \mathcal{A}_{\text{R}} \cdot \mathcal{L}(\mathcal{I}_{\text{R}})\right) \end{bmatrix}. \tag{30}$$

    The functions $\varpi_{\text{R}}$ and $\varpi_{\text{T}}$ denote modality-specific enhancement operations applied to the input images, incorporating their corresponding aligned features. The matrices $\mathcal{A}_{\text{T}}$ and $\mathcal{A}_{\text{R}}$ are modality-specific alignment matrices, while $\mathcal{L}(\mathcal{I}_{\text{modality}})$ serves as the feature extraction function that identifies crucial features within each image for optimized information integration.
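
To make the structure of Eq. (30) concrete, the sketch below instantiates each symbol with a deliberately simple stand-in. None of these choices come from the paper. The feature extractor $\mathcal{L}$ is assumed to be a normalized gradient-magnitude edge map, the alignment terms $\mathcal{A}_{\text{R}}$ and $\mathcal{A}_{\text{T}}$ are assumed to be scalar weights (i.e., the pair is already registered), and $\varpi$ is assumed to be an additive blend followed by clipping. Images are single-channel floats in $[0, 1]$, and all names are hypothetical.

```python
import numpy as np

def extract_features(img):
    """Stand-in for L: gradient-magnitude edge map scaled to [0, 1]."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag

def enhance(img, guide, strength):
    """Stand-in for varpi: add the cross-modal guidance map and clip to [0, 1]."""
    return np.clip(img + strength * guide, 0.0, 1.0)

def tau(img_r, img_t, a_r=1.0, a_t=1.0, strength=0.2):
    """Eq. (30): each modality is enhanced using aligned features of the other."""
    guide_from_t = a_t * extract_features(img_t)  # A_T * L(I_T)
    guide_from_r = a_r * extract_features(img_r)  # A_R * L(I_R)
    out_r = enhance(img_r, guide_from_t, strength)
    out_t = enhance(img_t, guide_from_r, strength)
    return out_r, out_t

# Toy night-time pair: the RGB frame is completely dark, the TIR frame shows a warm object.
img_r = np.zeros((8, 8))
img_t = np.zeros((8, 8))
img_t[2:6, 2:6] = 1.0
out_r, out_t = tau(img_r, img_t)
```

In this toy pair the dark RGB frame receives the object outline from the TIR frame, while the TIR frame is left unchanged because the RGB frame carries no edges to transfer.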

Experimental observations

Based on the single-modality object detection model Co-Detr, we made adaptive modifications to construct a baseline model suitable for multispectral object detection. As multispectral object detection augmentation strategies often need to adapt to specific application scenarios, test set sample characteristics, and varying weather and lighting conditions, we first conducted experiments exploring a set of synchronized augmentation techniques focused on geometric and pixel-level transformations. The experimental results are shown in Figure 2. Building upon these methods, we further investigated specific augmentation strategies tailored to the unique characteristics of dual-light samples. The experimental results are shown in Figures 3 and 4.

Figure 2: The performance of general geometric and pixel-level augmentations (using different backbones) on the KAIST, FLIR, and DroneVehicle datasets. The left figure illustrates the results of various geometric augmentations, where B denotes the baseline, R represents random rotation, S signifies multi-scale scaling, C stands for random cropping, F corresponds to random flipping, and T indicates random translation. The right figure presents the results of general pixel-level augmentations, with B as the baseline, BL for random blurring, NI for noise injection, S for random sharpening, O for random occlusion, and CJ for color jittering. $\triangle$ represents the mean performance difference between this method and the baseline.
  1. General Augmentation Strategy Experiments. In this section, we conducted dual-channel synchronized augmentation experiments using various geometric and pixel-level strategies, revealing several key insights.

    1. (i):

      Observations on Geometric Transformations. The experimental data indicates that applying a combination of random rotation, multi-scale scaling, and random cropping results in performance improvements across multiple datasets. However, strategies such as random flipping and random translation show poorer performance on the KAIST dataset. This could be attributed to the fact that the combination of random rotation, multi-scale scaling, and random cropping effectively simulates samples from various perspectives and angles, thus enhancing the model’s ability to adapt to different viewpoints, angles, scales, and deformations. On the other hand, strategies like flipping and translation may produce illogical images for certain samples (e.g., flipping upright pedestrians in the KAIST dataset leads to unnatural postures), which disrupts the inherent distribution patterns and modality alignment in some datasets, negatively affecting detection performance.

    2. (ii):

      Observations on Pixel-Level Transformations. The overall performance improvements from pixel-level augmentation strategies are less significant compared to geometric transformations or spatial alignment methods. For instance, even the most effective combination in our experiments yielded only a 2.5% increase in recognition accuracy over the baseline, which is relatively modest when compared to methods such as feature fusion. Besides, a large number of pixel-level augmentation strategies (three or more) exhibit high sensitivity to different datasets. Specifically, we observed that the combination of Color Jitter+Random Sharpening+Random Blurring significantly improved recognition accuracy on the KAIST dataset, but the same combination performed poorly on the FLIR dataset. When more than four pixel-level augmentation strategies were applied, recognition accuracy often plateaued or even decreased across multiple datasets.

  2. Experiments on Unique Augmentation Strategies. For specific scenarios, such as low-light/nighttime conditions and very small sample cases, we selected 500 images from the original dataset that exhibit these characteristics for targeted testing. We experimented with various combinations of dual-channel augmentation strategies, which include dual-channel synchronized augmentation and complementary augmentation. Below are some interesting observations:

    Figure 3: The performance metrics of different augmentation strategies applied to nighttime/low-light samples. The top image shows the results using dual-channel synchronized augmentation, while the bottom image displays results with dual-channel complementary augmentation. In both images, B represents the baseline, C stands for CLAHE, RL denotes random lighting, and L indicates light enhancement. In the bottom image, each set of parentheses indicates the specific augmentation strategies applied to each modality, with the order representing the RGB and TIR channels, respectively. $\triangle$ represents the mean performance difference.
    1. (i):

      Observations on Strategies for Nighttime/Low-Light Samples. We conducted experiments comparing both synchronous and complementary augmentation strategies to identify the most effective combination for enhancing performance in low-light conditions. We found that complementary augmentation outperforms synchronous augmentation in improving overall recognition accuracy. This improvement is particularly pronounced in low-light conditions, where the strengths of complementary augmentation are more evident. Specifically, in low-light environments, the RGB modality tends to suffer from information loss, such as reduced contrast and increased noise, while the TIR modality, which captures thermal radiation, continues to provide stable target information even in the absence of illumination. Thus, adopting a complementary augmentation strategy allows each modality to better leverage its respective strengths. In addition, the complementary augmentation combination of random lighting and light enhancement for the TIR channel, paired with CLAHE for the RGB channel, achieved excellent results across all datasets. This success can be attributed to the complementary strategy’s ability to enhance the adaptability of the RGB channel to varying lighting conditions, while simultaneously improving the clarity of edges and shapes in the infrared samples.

    2. (ii):

      Observations on Strategies for Small Samples. From the experimental data, it is evident that the augmentation strategy improves recognition accuracy. Specifically, the stitching operation proved to be highly effective in addressing the problem of very small samples individually, while the other two augmentation techniques did not consistently improve recognition accuracy. Independent use of both Stitcher and Fastmosaic led to notable improvements in recognition accuracy. In particular, Fastmosaic was the preferred choice for large-scale datasets (such as KAIST), while Stitcher performed better on more complex datasets (such as FLIR). Interestingly, when the two methods were combined, recognition accuracy decreased compared to their individual use. This outcome could be attributed to an imbalance in data distribution caused by the combination, which failed to provide the model with additional useful information.

Figure 4: The performance metrics of different augmentation strategies on small sample sets. $\triangle$ represents the mean performance difference between this method and the baseline. In this figure, B represents the baseline, S denotes Stitcher [95], F stands for Fastmosaic [2], R represents Region Resampling, and M indicates Small-Object Magnification.
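
As a rough illustration of how stitching-style augmentation can be kept synchronized across modalities, the sketch below tiles four registered RGB/TIR pairs into a 2x2 grid and remaps their boxes. It is a generic example and not the Stitcher [95] or Fastmosaic [2] implementation. It assumes that all images share the same even height and width, that boxes are given as `(x1, y1, x2, y2)` in pixels, and that nearest-neighbor downsampling by a factor of two is acceptable. The function name is hypothetical.

```python
import numpy as np

def stitch_pairs(rgb_list, tir_list, boxes_list):
    """Tile four RGB/TIR pairs into one 2x2 collage per modality and remap the boxes."""
    assert len(rgb_list) == len(tir_list) == len(boxes_list) == 4
    h, w = rgb_list[0].shape[:2]
    hh, hw = h // 2, w // 2
    rgb_out = np.zeros_like(rgb_list[0])
    tir_out = np.zeros_like(tir_list[0])
    boxes_out = []
    for k, (rgb, tir, boxes) in enumerate(zip(rgb_list, tir_list, boxes_list)):
        row, col = divmod(k, 2)
        y0, x0 = row * hh, col * hw
        # The same downsampling and placement are used for both modalities.
        rgb_out[y0:y0 + hh, x0:x0 + hw] = rgb[::2, ::2][:hh, :hw]
        tir_out[y0:y0 + hh, x0:x0 + hw] = tir[::2, ::2][:hh, :hw]
        # Halve the box coordinates, then shift them into the destination tile.
        b = np.asarray(boxes, dtype=float).reshape(-1, 4) * 0.5
        b[:, [0, 2]] += x0
        b[:, [1, 3]] += y0
        boxes_out.append(b)
    return rgb_out, tir_out, np.concatenate(boxes_out, axis=0)

rgbs = [np.full((8, 8, 3), k + 1, dtype=np.uint8) for k in range(4)]
tirs = [np.full((8, 8), 10 * (k + 1), dtype=np.uint8) for k in range(4)]
boxes = [[[2, 2, 6, 6]] for _ in range(4)]
rgb_mosaic, tir_mosaic, mosaic_boxes = stitch_pairs(rgbs, tirs, boxes)
```

Each object ends up at half its original size, which increases the share of small targets in a batch. Because one layout is applied to both modalities, the RGB and TIR collages stay pixel-aligned.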

3.4 Registration Alignment

Formulations. In multispectral object detection tasks, factors such as sensor viewpoints, resolution discrepancies, and varying weather conditions can lead to spatial misalignment between RGB and TIR images. Such misalignment often introduces inconsistencies during feature fusion, thereby degrading detection performance. To address these issues, researchers have developed various registration and alignment strategies, which can be broadly categorized into Feature Alignment-Based Methods and Feature Fusion-Based Methods. By applying these registration techniques at different stages of training and testing, the alignment between RGB and TIR images can be effectively improved, significantly enhancing recognition accuracy. The following sections provide a detailed discussion of these two categories.

  1. Feature Alignment-Based Methods. The main goal of these methods is to address spatial misalignment between RGB and TIR images through precise feature matching and alignment. The Loftr approach exemplifies this objective by leveraging a Transformer-based architecture to achieve pixel-level feature matching between RGB and TIR images, allowing for high-precision geometric alignment [96]. This approach enables the calculation of transformation parameters (such as affine or perspective transformations) that can be applied to register the images effectively.

    Let the RGB image be denoted as $\mathcal{I}_{\text{R}}$ and the TIR image as $\mathcal{I}_{\text{T}}$. The coarse and fine features extracted from these images are represented as $\Phi_{\text{R}}$ and $\Phi_{\text{T}}$, respectively. The matching function $\varrho_{\text{m}}(\Phi_{\text{R}}, \Phi_{\text{T}})$ can be formulated as follows:

    $$\varrho_{\text{m}}(\Phi_{\text{R}}, \Phi_{\text{T}}) = \left\{ (p_{i}, q_{j}) \mid \hat{p}_{i} = \sigma\left( \frac{\Phi_{\text{R}}(p_{i}) \cdot \Phi_{\text{T}}(q_{j})}{\tau} \right) \right\}, \tag{31}$$

    where $(p_{i}, q_{j})$ represents matched point pairs across RGB and TIR modalities, and $\tau$ is a temperature parameter controlling the similarity distribution. $\sigma$ denotes the softmax function, and $\hat{p}_{i}$ represents the point with the highest matching score in the RGB modality for each $q_{j}$ in the TIR modality.

    The geometric transformation $\mathcal{T}_{g}$ is then estimated based on these matched points by minimizing a distance-based objective:

    $$\theta^{*} = \arg\min_{\theta} \sum_{(p_{i}, q_{j}) \in M} \left\| \mathcal{T}_{g}(p_{i}, \theta) - q_{j} \right\|^{2}, \tag{32}$$

    where $\mathcal{T}_{g}(p_{i}, \theta)$ represents the transformed location of $p_{i}$ in the TIR image space, with $\theta$ containing transformation parameters for an affine or homography matrix $\mathcal{A}$ and translation vector $\mathcal{B}$. Once optimized, the transformation can be applied to obtain the aligned image:

    $$\mathcal{I}_{\text{aligned}}(x, y) = \mathcal{T}_{g}(\mathcal{I}_{\text{T}}, \theta^{*}) = \mathcal{A}(\theta^{*}) \cdot \mathcal{I}_{\text{T}} + \mathcal{B}. \tag{33}$$

    To further improve the registration accuracy, a joint loss that combines a feature consistency loss and an alignment loss is introduced, expressed as:

    $$\mathcal{L}=\lambda_{1}\sum_{(p_{i},q_{j})\in M}\left\|\Phi_{\text{R}}(p_{i})-\Phi_{\text{T}}(q_{j})\right\|^{2}+\lambda_{2}\mathcal{L}_{\text{alignment}}(\theta), \tag{34}$$

    where $\mathcal{L}_{\text{alignment}}(\theta)$ measures the alignment quality based on transformation parameters, and $\lambda_{1}$ and $\lambda_{2}$ are weighting coefficients to balance the two loss terms. This method demonstrates exceptional alignment capabilities in scenes with pronounced parallax and varying viewpoints, enabling efficient image registration.
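The two steps described above, softmax-based matching (Eq. 31) and least-squares estimation of the transformation (Eq. 32), can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular matcher: it assumes descriptors are already available as plain arrays, restricts $\mathcal{T}_{g}$ to an affine map, and the function names `match_points` and `fit_affine` as well as the default `tau=0.1` are hypothetical choices made for this example.

```python
import numpy as np


def softmax(scores: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def match_points(feat_rgb: np.ndarray, feat_tir: np.ndarray, tau: float = 0.1):
    """For each TIR descriptor q_j, pick the RGB descriptor p_i with the
    highest softmax score of the temperature-scaled dot product (Eq. 31).

    feat_rgb: (N, C) RGB descriptors; feat_tir: (M, C) TIR descriptors.
    Returns a list of (i, j) index pairs, one per TIR point.
    """
    sim = feat_rgb @ feat_tir.T / tau        # (N, M) similarity matrix
    prob = softmax(sim, axis=0)              # normalise over RGB candidates
    best_rgb = prob.argmax(axis=0)           # best RGB index for each TIR point
    return [(int(i), j) for j, i in enumerate(best_rgb)]


def fit_affine(src: np.ndarray, dst: np.ndarray):
    """Least-squares affine fit minimising sum ||A p_i + b - q_j||^2 (Eq. 32).

    src, dst: (K, 2) matched coordinates, K >= 3 non-collinear points.
    Returns the 2x2 matrix A and the translation vector b.
    """
    ones = np.ones((src.shape[0], 1))
    design = np.hstack([src, ones])          # rows are [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)  # shape (3, 2)
    return params[:2].T, params[2]


# Toy example: six RGB points and their TIR counterparts under a known affine map.
rng = np.random.default_rng(0)
rgb_pts = rng.uniform(0, 100, size=(6, 2))
true_A = np.array([[0.9, -0.1], [0.1, 0.9]])
true_b = np.array([5.0, -3.0])
tir_pts = rgb_pts @ true_A.T + true_b

descriptors = np.eye(6)                      # idealised, perfectly distinctive descriptors
pairs = match_points(descriptors, descriptors)
src = rgb_pts[[i for i, _ in pairs]]
dst = tir_pts[[j for _, j in pairs]]
A, b = fit_affine(src, dst)
print(np.round(A, 3), np.round(b, 3))
```

In practice the matched set usually contains outliers, so a robust estimator (for example RANSAC around the same least-squares fit) would typically replace the plain fit shown here.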

  2. Feature Fusion-Based Methods. Feature fusion-based methods aim to effectively combine deep RGB and TIR features to generate a fused image, thereby achieving modality alignment. SuperFusion is a prime example, employing a multilevel fusion strategy that includes data-level transformation, feature-level attention mechanisms, and final Bird’s Eye View (BEV) alignment [97].

    Given an RGB image $\mathcal{I}_{\text{R}}$ and a TIR image $\mathcal{I}_{\text{T}}$, the process begins by extracting feature maps $\mathcal{X}_{\text{R}}$ and $\mathcal{X}_{\text{T}}$ through separate convolutional backbones. To enhance depth perception, a sparse depth map $\mathcal{D}_{\text{sparse}}$ is generated by projecting TIR depth information into the RGB image plane. A completion function $\mathcal{S}(\cdot)$ then generates a dense depth map $\mathcal{D}_{\text{dense}}$:

    $$\mathcal{D}_{\text{dense}}=\mathcal{S}(\mathcal{D}_{\text{sparse}}). \tag{35}$$

    In the feature fusion stage, cross-attention is used to align features from both modalities, where RGB features $\mathcal{X}_{\text{R}}$ guide the enhancement of TIR features $\mathcal{X}_{\text{T}}$. The cross-attention matrix $\mathcal{H}$ incorporates depth information from $\mathcal{D}_{\text{dense}}$ and is defined as:

    $$\mathcal{H}=\sigma\left(\frac{\mathcal{Q}\mathcal{K}^{T}\cdot\mathcal{D}_{\text{dense}}}{\sqrt{d}}\right),\quad\mathcal{Q}=\mathcal{W}_{\text{q}}\mathcal{X}_{\text{R}},\quad\mathcal{K}=\mathcal{W}_{\text{k}}\mathcal{X}_{\text{T}}, \tag{36}$$

    where $\mathcal{W}_{\text{q}}$ and $\mathcal{W}_{\text{k}}$ are learned weights, $d$ is a scaling factor, and $\sigma$ denotes the softmax function. This mechanism aligns features across modalities by using depth information to refine attention, allowing RGB features to enrich TIR information in the fused representation.

    The resulting attention matrix $\mathcal{H}$ is then used to enhance the TIR features:

    $$\mathcal{X}_{\text{T}}^{\prime}=\mathcal{H}\cdot\mathcal{V},\quad\mathcal{V}=\mathcal{W}_{\text{v}}\mathcal{X}_{\text{T}}, \tag{37}$$

    where $\mathcal{W}_{\text{v}}$ is a learned weight matrix for generating the value matrix $\mathcal{V}$, and $\mathcal{X}_{\text{T}}^{\prime}$ represents the TIR features enhanced by the RGB guidance.

    Finally, a BEV alignment module refines the fused feature map by learning a flow field $\Delta$ to warp RGB features $\mathcal{X}_{\text{R}}$, achieving better alignment with the enhanced TIR features $\mathcal{X}_{\text{T}}^{\prime}$. The aligned RGB image $\mathcal{I}_{\text{aligned}}$ can be expressed as:

    $$\mathcal{I}_{\text{aligned}}(x,y)=\sum_{x^{\prime},y^{\prime}}\mathcal{X}_{\text{T}}^{\prime}(x^{\prime},y^{\prime})\,w(x,y,x^{\prime},y^{\prime},\Delta), \tag{38}$$

    where $w(x,y,x^{\prime},y^{\prime},\Delta)$ represents the bilinear interpolation weights based on the flow field $\Delta$ to adjust the alignment features. The interpolation weights $w$ can be defined as:

    $$w(x,y,x^{\prime},y^{\prime},\Delta)=\prod_{i\in\{x,y\}}\max\big(0,1-|i^{\prime}-i-\Delta_{i}|\big). \tag{39}$$

    These weights ensure that the spatial position of the RGB features is precisely adjusted according to the flow field $\Delta$, allowing for better alignment with the TIR features.

    The entire process is optimized by a joint loss function $\mathcal{L}$ that combines feature consistency and alignment error terms, weighted by $\lambda_{1}$ and $\lambda_{2}$:

    $$\mathcal{L}=\lambda_{1}\,\mathcal{L}_{\text{feature}}+\lambda_{2}\,\mathcal{L}_{\text{alignment}}, \tag{40}$$

    where the feature consistency term $\mathcal{L}_{\text{feature}}=\sum_{(p_{i},q_{j})\in M}\left\|\mathcal{X}_{\text{R}}(p_{i})-\mathcal{X}_{\text{T}}^{\prime}(q_{j})\right\|^{2}$ minimizes the difference between matched feature pairs $(p_{i},q_{j})$ in set $M$, and the alignment term $\mathcal{L}_{\text{alignment}}=\sum_{x,y}\left\|\Delta(x,y)-\Delta^{*}(x,y)\right\|^{2}$ measures the deviation from the ideal alignment $\Delta^{*}$.
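To make Eqs. (36) to (39) concrete, the following NumPy sketch implements depth-modulated cross-attention and flow-based bilinear warping on small arrays. It is not SuperFusion's code. Several assumptions are made here: features are flattened to one row per spatial location (so projections are right-multiplied, $XW$, rather than $WX$), the dense depth map is treated as an $N\times N$ weight applied element-wise to the attention logits, the warp handles a single-channel map, and all function names are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def depth_guided_attention(x_rgb, x_tir, w_q, w_k, w_v, depth):
    """Eqs. (36)-(37): queries from RGB, keys and values from TIR.

    x_rgb, x_tir: (N, C) flattened feature maps.
    w_q, w_k, w_v: (C, d) projection matrices.
    depth: (N, N) weights multiplied element-wise into the logits
           (an assumption of this sketch about how D_dense enters).
    """
    q, k, v = x_rgb @ w_q, x_tir @ w_k, x_tir @ w_v
    d = q.shape[-1]
    logits = (q @ k.T) * depth / np.sqrt(d)
    attn = softmax(logits, axis=-1)          # each row sums to 1
    return attn @ v                          # enhanced TIR features


def bilinear_weight(x, y, xs, ys, flow_x, flow_y):
    """Eq. (39): product over both axes of max(0, 1 - |i' - i - Delta_i|)."""
    wx = max(0.0, 1.0 - abs(xs - x - flow_x))
    wy = max(0.0, 1.0 - abs(ys - y - flow_y))
    return wx * wy


def warp(feature, flow):
    """Eq. (38): out(x, y) = sum over (x', y') of feature(x', y') * w(...).

    feature: (H, W) map; flow: (H, W, 2) array holding (dx, dy) per pixel.
    Samples that fall outside the map contribute zero.
    """
    h, w = feature.shape
    out = np.zeros_like(feature, dtype=float)
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y, x]
            # Only the four neighbours of (x + dx, y + dy) can have non-zero weight.
            x0, y0 = int(np.floor(x + dx)), int(np.floor(y + dy))
            for ys in (y0, y0 + 1):
                for xs in (x0, x0 + 1):
                    if 0 <= xs < w and 0 <= ys < h:
                        out[y, x] += feature[ys, xs] * bilinear_weight(x, y, xs, ys, dx, dy)
    return out


# Toy check: a half-pixel shift to the right averages horizontal neighbours.
feature = np.arange(12, dtype=float).reshape(3, 4)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 0.5
shifted = warp(feature, flow)
print(shifted[0, 0])  # average of feature[0, 0] and feature[0, 1]
```

The explicit double loop mirrors the summation in Eq. (38) for readability; a practical implementation would vectorise the sampling or rely on a framework's built-in grid sampling operator.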

Figure 5: Comparison of registration results using LoFTR and SuperFusion under different viewpoints and lighting conditions. The first and second rows present the RGB and TIR channel images, respectively. The third and fourth rows showcase the registration outcomes of the LoFTR and SuperFusion methods. Regions with significant registration discrepancies are highlighted.
Figure 6: Visualization of intermediate point registration results using the LoFTR method in sparse and dense sample scenarios.

Experimental observations

We used LoFTR and SuperFusion separately to register RGB and TIR images, and then replaced the original RGB or TIR images with the fused images during model training and testing. The registration results in different scenarios are shown in Figures 5 and 6, and the performance metrics of the registration methods are reported in Figure 7. The main findings are as follows:

  1. (i):

    Observation on Registration Performance. The experimental results demonstrate that LoFTR and SuperFusion exhibit distinct advantages and characteristics in generating fused RGB images. LoFTR focuses on precise feature matching and geometric alignment, ensuring that the fused RGB image is spatially well-aligned with the TIR image, with each pixel accurately corresponding to its counterpart. As shown in Figure 5, LoFTR performs well in images with high sample density, displaying strong spatial stability, likely because dense samples provide more feature mapping information. However, its performance deteriorates in sparser scenes, sometimes producing ghosting and overlapping artifacts that make subsequent detection steps difficult.

    In contrast, SuperFusion excels at handling sparse scenes where LoFTR struggles, effectively preserving sample information and image features. However, it may distort the geometric characteristics of certain scenes, such as the vertical structures of bridges, whereas LoFTR remains largely unaffected in such scenarios.

    Figure 7: The performance metrics of different registration methods at various stages are presented. The image on the left represents registration during the training phase, while the image on the right represents registration during the testing phase. In this figure, b1 corresponds to the method where the registered image replaces the original RGB image, and b2 corresponds to the method where the registered image replaces the original TIR image.
  2. (ii):

    Observations on Registration Methods. The results in Figure 7 indicate that training the multispectral object detection model with LoFTR-registered data yields a substantial increase in recognition accuracy, whereas training with SuperFusion-processed data shows limited impact. During testing, however, both LoFTR and SuperFusion enhance recognition accuracy. The training-time advantage is likely due to LoFTR’s ability to address data inconsistencies via feature alignment, thereby improving data quality and facilitating more effective feature learning.

    While SuperFusion is effective for multimodal fusion, it may introduce redundancy and complexity in the training data, potentially diverting the model’s focus from key features and limiting accuracy gains. In testing, both methods improve recognition accuracy by refining data quality or enriching feature information. Importantly, both registration frameworks perform best when generating RGB data based on the TIR reference, likely because the TIR-based RGB retains essential thermal information, supporting reliable performance in challenging conditions such as low light, smoke, or nighttime environments.

  3. (iii):

    Observations on Application Scenarios. Experimental results indicate that LoFTR excels in scenarios with significant rotational deviation or displacement between RGB and TIR images. This effectiveness is likely due to LoFTR’s precise feature matching and geometric transformations, which mitigate spatial misalignments. Conversely, SuperFusion is better suited to environments affected by adverse weather or low resolution, where it efficiently integrates multimodal data despite these challenges.

4 Optimal Combination of Individual Techniques

Figure 8: Ablation experiment results on the KAIST, FLIR, and DroneVehicle datasets. The experimental configurations strictly adhere to the setups outlined in the “Best Technique Combination”.

In the previous section, we evaluated various training techniques for multispectral object detection under consistent conditions. However, extending a single-modality model to dual-modality with only one technique often yields suboptimal performance, as no single method fully addresses challenges like feature misalignment, overfitting, and fusion conflicts. Therefore, given the diverse methods in multispectral frameworks, relying on a single technique to enhance model performance is impractical. Our benchmark analysis highlights effective combinations of techniques and offers new insights for designing multispectral object detection models.

Best Technique Combinations. With the optimal hyperparameter settings, the following combinations are recommended:

  • KAIST Dataset: A single-modality Co-Detr-based model, utilizing the Vit-L backbone and ICFE feature fusion, applies a dual-channel synchronization enhancement strategy through Stitcher, multi-scale scaling, and illumination augmentation. The model leverages SuperFusion for alignment on the test set, resulting in significant improvements in detection accuracy.
  • FLIR Dataset: A single-modality Co-Detr-based model, integrating the Vit-L backbone with ICFE+NIN feature fusion, achieves dual-channel synchronization enhancement through FastMosaic, multi-scale scaling, and illumination augmentation. The model employs LoFTR for alignment on the test set, delivering exceptional performance.
  • DroneVehicle Dataset: A single-modality Co-Detr-based model, combining the Vit-L backbone and ICFE+NIN feature fusion, applies LoFTR for alignment on the test set. The model adopts complementary enhancement strategies (CLAHE for the RGB channel and random illumination and contrast enhancement for the TIR channel) and applies synchronized augmentation to both channels using Stitcher and multi-scale scaling. This approach leads to significant improvements in detection performance under low-light conditions.
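For readers who want to script experiments around these recommendations, the three recipes can be written down as a plain configuration mapping. The sketch below is only illustrative: the dictionary layout, the key names, and the `describe` helper are hypothetical and are not taken from the authors' repository; only the technique names come from the text above.

```python
# Hypothetical configuration layout; only the technique names come from the paper.
BEST_COMBINATIONS = {
    "KAIST": {
        "detector": "Co-Detr",
        "backbone": "Vit-L",
        "fusion": "ICFE",
        "synchronized_augmentation": [
            "Stitcher", "multi-scale scaling", "illumination augmentation",
        ],
        "complementary_augmentation": {},
        "test_alignment": "SuperFusion",
    },
    "FLIR": {
        "detector": "Co-Detr",
        "backbone": "Vit-L",
        "fusion": "ICFE+NIN",
        "synchronized_augmentation": [
            "FastMosaic", "multi-scale scaling", "illumination augmentation",
        ],
        "complementary_augmentation": {},
        "test_alignment": "LoFTR",
    },
    "DroneVehicle": {
        "detector": "Co-Detr",
        "backbone": "Vit-L",
        "fusion": "ICFE+NIN",
        "synchronized_augmentation": ["Stitcher", "multi-scale scaling"],
        "complementary_augmentation": {
            "RGB": "CLAHE",
            "TIR": "random illumination and contrast enhancement",
        },
        "test_alignment": "LoFTR",
    },
}


def describe(dataset: str) -> str:
    """Return a one-line summary of the recommended recipe for a dataset."""
    cfg = BEST_COMBINATIONS[dataset]
    return (
        f"{dataset}: {cfg['detector']} + {cfg['backbone']} + {cfg['fusion']}, "
        f"test-time alignment via {cfg['test_alignment']}"
    )


for name in BEST_COMBINATIONS:
    print(describe(name))
```

Keeping the recipes in one mapping makes the differences between datasets easy to compare: the fusion module, the mosaic-style augmentation, and the test-time registration method are the only entries that change.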

4.1 Optimal Trick Combinations and Ablation Study

We have summarized the optimal technique combinations for the KAIST, FLIR, and DroneVehicle datasets above. Additionally, we conducted detailed ablation studies to validate the effectiveness of these combinations, as shown in Figure 8. For each dataset, we tested 5 to 6 different combination variants by removing or substituting certain techniques. The results consistently demonstrate the significant effectiveness of our selected combinations, and the observed performance variations on specific samples are highly consistent with the conclusions we presented in Sections 3.2, 3.3, and 3.4.

4.2 Comparison with Leading Frameworks

To further validate the effectiveness of the optimized single-modality model based on the best technique combinations, we compared it with other advanced frameworks specifically designed for multispectral object detection, including MBNet, MLPD, and MSDS-RCNN. As shown in Tables IV, V, and VI, by integrating our training techniques into the single-modality model, the optimized model achieves the best overall results among the compared multispectral detection frameworks on both small-scale and large-scale datasets.

4.3 Transferring Technique Combinations

The final plausibility check is to determine whether certain technique combinations remain effective across multiple multispectral object detection datasets. To this end, we selected the combination of “LoFTR for test alignment + ICFE for feature fusion”, as these two techniques consistently demonstrated optimal performance in the majority of scenarios covered in Sections 3.2, 3.3, and 3.4. This combination also performed comparably to other top-performing combinations on the FLIR and DroneVehicle datasets. Specifically, we evaluated this approach on two additional open-source multispectral detection datasets: (i) the LLVIP dataset and (ii) the CVC-14 dataset. In these transfer studies, we strictly adhered to the “best configuration point” settings outlined in Section 3.1.

TABLE III: Performance metrics of models with and without our strategy on the LLVIP and CVC-14 datasets. The results are averaged over multiple independent runs, with the standard deviations provided.

| Method | Strategy | LLVIP mAP50 (%) | LLVIP mAP (%) | CVC-14 $MR^{2}$ (%)↓ |
| --- | --- | --- | --- | --- |
| SSD [82] | w/o | 90.25 ± 1.76 | 53.52 ± 2.45 | 68.39 ± 1.78 |
| | with | 92.13 ± 2.45 | 54.39 ± 2.31 | 37.16 ± 2.18 |
| RetinaNet [83] | w/o | 94.81 ± 2.13 | 55.18 ± 1.29 | 47.87 ± 2.75 |
| | with | 95.15 ± 1.89 | 57.87 ± 2.48 | 29.63 ± 1.32 |
| Cascade R-CNN [85] | w/o | 95.12 ± 2.23 | 56.81 ± 2.61 | 42.36 ± 2.91 |
| | with | 95.58 ± 1.68 | 57.99 ± 1.35 | 22.15 ± 1.54 |
| Faster R-CNN [85] | w/o | 94.63 ± 2.78 | 54.53 ± 2.43 | 51.97 ± 1.97 |
| | with | 94.97 ± 2.11 | 56.15 ± 1.95 | 24.31 ± 1.87 |
| DDQ-DETR [98] | w/o | 93.91 ± 1.67 | 58.67 ± 1.49 | 52.78 ± 2.41 |
| | with | 94.86 ± 2.26 | 60.13 ± 1.87 | 26.51 ± 1.53 |

TABLE IV: Comparison of our most effective detection model with other advanced frameworks on the KAIST dataset. We use bold red font and underline to highlight the best results.

| Method | $MR^{2}$ (%)↓ All | $MR^{2}$ (%)↓ Day | $MR^{2}$ (%)↓ Night |
| --- | --- | --- | --- |
| FusionRPN+BF [99] | 18.31 | 19.54 | 16.33 |
| IAF-RCNN [100] | 15.55 | 14.97 | 16.89 |
| IATDNN-IAMSS [101] | 14.41 | 14.30 | 15.29 |
| MBNet [102] | 8.43 | 8.79 | 8.10 |
| MLPD [103] | 7.21 | 6.83 | 7.68 |
| MSDS-RCNN [104] | 7.34 | 8.98 | 6.94 |
| Ours | 6.23 | 6.91 | 6.19 |

TABLE V: Comparison of our most effective detection model with other advanced frameworks on the FLIR dataset. We use bold red font and underline to highlight the best results.

| Method | AP50 (%) Bicycle | AP50 (%) Car | AP50 (%) Person | mAP (%) |
| --- | --- | --- | --- | --- |
| MMTOD-CG [105] | 50.38 | 70.61 | 63.42 | 61.47 |
| MMTOD-UNIT [105] | 49.28 | 70.78 | 64.33 | 61.46 |
| CFR [106] | 57.95 | 84.92 | 74.46 | 72.44 |
| BU-ATT [107] | 56.01 | 87.11 | 76.08 | 73.06 |
| BU-LTT [107] | 57.43 | 86.31 | 75.65 | 73.13 |
| CFT [108] | 61.44 | 89.55 | 84.28 | 78.42 |
| Ours | 68.71 | 89.51 | 85.30 | 81.17 |

TABLE VI: Comparison of our most effective detection model with other advanced frameworks on the DroneVehicle dataset. We use bold red font and underline to highlight the best results.

| Method | AP50 (%) Car | AP50 (%) Freight Car | AP50 (%) Truck | AP50 (%) Bus | AP50 (%) Van | mAP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| RetinaNet-OBB [83] | 65.36 | 15.69 | 32.81 | 61.34 | 16.26 | 38.29 |
| Mask R-CNN [85] | 88.98 | 36.84 | 47.79 | 78.17 | 36.65 | 57.69 |
| Cascade Mask R-CNN [85] | 80.95 | 31.00 | 38.27 | 66.62 | 25.01 | 48.37 |
| UA-CMDet [109] | 87.35 | 41.27 | 62.69 | 84.17 | 39.82 | 63.06 |
| CALNet [110] | 86.32 | 60.67 | 67.15 | 86.52 | 53.68 | 70.87 |
| TSFADet [111] | 89.01 | 51.97 | 68.51 | 83.06 | 46.95 | 67.9 |
| Gliding Vertex [112] | 89.99 | 42.75 | 59.71 | 79.79 | 44.19 | 63.29 |
| Ours | 92.05 | 63.39 | 71.95 | 88.93 | 57.12 | 74.69 |

As shown in Table III, the selected technique combination significantly improved the performance of the single-modality model on various multispectral datasets in most cases, particularly in scenarios with complex backgrounds and varying lighting conditions. This combination consistently enhanced model performance across different datasets, with the CVC-14 dataset showing the largest gain: $MR^{2}$ dropped by up to 31.23 percentage points (SSD, from 68.39% to 37.16%). The strong transferability of this technique combination suggests its potential to serve as a robust baseline for future research in multispectral object detection, while also offering new training strategies for optimizing single-modality detection models.
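The per-detector gains on CVC-14 can be recomputed directly from Table III. The short script below uses only the mean values reported in the table (standard deviations are ignored) and computes the absolute reduction in $MR^{2}$ for each detector; the variable names are illustrative.

```python
# Mean MR^2 (%) on CVC-14 from Table III: (without strategy, with strategy).
cvc14_mr = {
    "SSD": (68.39, 37.16),
    "RetinaNet": (47.87, 29.63),
    "Cascade R-CNN": (42.36, 22.15),
    "Faster R-CNN": (51.97, 24.31),
    "DDQ-DETR": (52.78, 26.51),
}

# Absolute reduction in percentage points (lower MR^2 is better).
reductions = {
    name: round(before - after, 2)
    for name, (before, after) in cvc14_mr.items()
}

best = max(reductions, key=reductions.get)
for name, drop in reductions.items():
    print(f"{name}: -{drop:.2f} points")
print("largest reduction:", best)
```

The largest reduction belongs to SSD, 68.39 − 37.16 = 31.23 percentage points, which is the figure quoted in the text; the remaining detectors improve by roughly 18 to 28 points.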

5 Conclusion

Multispectral object detection is a rapidly advancing field, yet significant challenges remain in effectively integrating multimodal information to adapt to diverse environmental conditions. In this study, we propose a standardized benchmark with fair and consistent experimental setups to drive progress in this domain. We conducted extensive experiments across multiple public datasets, focusing on three critical aspects of multispectral detection: multimodal feature fusion, dual-modality data augmentation, and registration alignment. Through a comprehensive analysis of our results, we identified the most effective technique combinations and established new performance benchmarks for multispectral object detection.

Additionally, we introduce a novel training strategy to optimize single-modality models for dual-modality tasks, laying the groundwork for adapting high-performing single-modality models to dual-modality scenarios. We believe that the strong baselines and optimized technique combinations presented in this work will facilitate fairer and more practical evaluations in multispectral object detection research. This work sets a robust foundation for future studies and opens new avenues for enhancing multispectral object detection performance.

References

  • [1] S. Jha, C. Seo, E. Yang, and G. P. Joshi, “Real time object detection and trackingsystem for video surveillance system,” Multimedia Tools and Applications, vol. 80, no. 3, pp. 3981–3996, 2021.
  • [2] C. Kumar, R. Punitha et al., “Yolov3 and yolov4: Multiple object detection for surveillance applications,” in 2020 Third international conference on smart systems and inventive technology (ICSSIT). IEEE, 2020, pp. 1316–1321.
  • [3] A. Balasubramaniam and S. Pasricha, “Object detection in autonomous vehicles: Status and open challenges,” arXiv preprint arXiv:2201.07706, 2022.
  • [4] M. Carranza-García, J. Torres-Mateo, P. Lara-Benítez, and J. García-Gutiérrez, “On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data,” Remote Sensing, vol. 13, no. 1, p. 89, 2020.
  • [5] Y. Zuo, J. Wang, and J. Song, “Application of yolo object detection network in weld surface defect detection,” in 2021 IEEE 11th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 2021, pp. 704–710.
  • [6] Z. Qiu, S. Wang, Z. Zeng, and D. Yu, “Automatic visual defects inspection of wind turbine blades via yolo-based small object detection approach,” Journal of electronic imaging, vol. 28, no. 4, p. 043023, 2019.
  • [7] K. A. Joshi and D. G. Thakore, “A survey on moving object detection and tracking in video surveillance system,” International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
  • [8] P. K. Mishra and G. Saroha, “A study on video surveillance system for object detection and tracking,” in 2016 3rd international conference on computing for sustainable global development (INDIACom). IEEE, 2016, pp. 221–226.
  • [9] L. L. Presti and M. La Cascia, “Real-time object detection in embedded video surveillance systems,” in 2008 ninth international workshop on image analysis for multimedia interactive services. IEEE, 2008, pp. 151–154.
  • [10] S. Varma and M. Sreeraj, “Object detection and classification in surveillance system,” in 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS). IEEE, 2013, pp. 299–303.
  • [11] J. C. Nascimento and J. S. Marques, “Performance evaluation of object detection algorithms for video surveillance,” IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 761–774, 2006.
  • [12] R. Nabati and H. Qi, “Rrpn: Radar region proposal network for object detection in autonomous vehicles,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3093–3097.
  • [13] J. Lu, H. Sibai, E. Fabry, and D. Forsyth, “No need to worry about adversarial examples in object detection in autonomous vehicles,” arXiv preprint arXiv:1707.03501, 2017.
  • [14] L. Peng, H. Wang, and J. Li, “Uncertainty evaluation of object detection algorithms for autonomous vehicles,” Automotive Innovation, vol. 4, no. 3, pp. 241–252, 2021.
  • [15] D. Feng, A. Harakeh, S. L. Waslander, and K. Dietmayer, “A review and comparative study on probabilistic object detection in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 9961–9980, 2021.
  • [16] D. He, K. Xu, and P. Zhou, “Defect detection of hot rolled steels with a new object detection framework called classification priority network,” Computers & Industrial Engineering, vol. 128, pp. 290–297, 2019.
  • [17] X. Wang, X. Jia, C. Jiang, and S. Jiang, “A wafer surface defect detection method built on generic object detection network,” Digital Signal Processing, vol. 130, p. 103718, 2022.
  • [18] J. Yuan, X. Zheng, L. Peng, K. Qu, H. Luo, L. Wei, J. Jin, and F. Tan, “Identification method of typical defects in transmission lines based on yolov5 object detection algorithm,” Energy Reports, vol. 9, pp. 323–332, 2023.
  • [19] W. R. Tribe, D. A. Newnham, P. F. Taday, and M. C. Kemp, “Hidden object detection: security applications of terahertz technology,” in Terahertz and Gigahertz Electronics and Photonics III, vol. 5354.   SPIE, 2004, pp. 168–176.
  • [20] S. Akcay and T. Breckon, “Towards automatic threat detection: A survey of advances of deep learning within x-ray security imaging,” Pattern Recognition, vol. 122, p. 108245, 2022.
  • [21] J. B. Sigman, G. P. Spell, K. J. Liang, and L. Carin, “Background adaptive faster r-cnn for semi-supervised convolutional object detection of threats in x-ray images,” in Anomaly Detection and Imaging with X-Rays (ADIX) V, vol. 11404.   SPIE, 2020, pp. 12–21.
  • [22] K. J. Liang, J. B. Sigman, G. P. Spell, D. Strellis, W. Chang, F. Liu, T. Mehta, and L. Carin, “Toward automatic threat recognition for airport x-ray baggage screening with deep convolutional object detection,” arXiv preprint arXiv:1912.06329, 2019.
  • [23] Z. Hou, C. Yang, Y. Sun, S. Ma, X. Yang, and J. Fan, “An object detection algorithm based on infrared-visible dual modal feature fusion,” Infrared Physics & Technology, vol. 137, p. 105107, 2024.
  • [24] Q. Ji and Y. Qi, “Dual-mode object detection algorithm based on feature enhancement and feature fusion,” in Journal of Physics: Conference Series, vol. 2816, no. 1.   IOP Publishing, 2024, p. 012091.
  • [25] X. Yan, D. Tian, D. Zhou, C. Wang, and W. Zhang, “Iv-yolo: A lightweight dual-branch object detection network,” 2024.
  • [26] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1341–1360, 2020.
  • [27] X. Xu, Y. Li, G. Wu, and J. Luo, “Multi-modal deep feature learning for rgb-d object detection,” Pattern Recognition, vol. 72, pp. 300–313, 2017.
  • [28] Y. Li, A. W. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, Y. Lu, D. Zhou, Q. V. Le et al., “Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17182–17191.
  • [29] R. Guo, D. Li, and Y. Han, “Deep multi-scale and multi-modal fusion for 3d object detection,” Pattern Recognition Letters, vol. 151, pp. 236–242, 2021.
  • [30] Y. Xu, X. Yu, J. Zhang, L. Zhu, and D. Wang, “Weakly supervised rgb-d salient object detection with prediction consistency training and active scribble boosting,” IEEE Transactions on Image Processing, vol. 31, pp. 2148–2161, 2022.
  • [31] K. Geng, W. Zou, G. Yin, Y. Li, Z. Zhou, F. Yang, Y. Wu, and C. Shen, “Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach,” Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering, vol. 233, no. 9, pp. 2270–2283, 2019.
  • [32] D. Yang, X. Liu, H. He, and Y. Li, “Air-to-ground multimodal object detection algorithm based on feature association learning,” International Journal of Advanced Robotic Systems, vol. 16, no. 3, p. 1729881419842995, 2019.
  • [33] B. Wan, X. Zhou, Y. Sun, T. Wang, C. Lv, S. Wang, H. Yin, and C. Yan, “Mffnet: Multi-modal feature fusion network for vdt salient object detection,” IEEE Transactions on Multimedia, 2023.
  • [34] B. Ghari, A. Tourani, A. Shahbahrami, and G. Gaydadjiev, “Pedestrian detection in low-light conditions: A comprehensive survey,” Image and Vision Computing, p. 105106, 2024.
  • [35] X. Wang, T. Sun, R. Yang, C. Li, B. Luo, and J. Tang, “Quality-aware dual-modal saliency detection via deep reinforcement learning,” Signal Processing: Image Communication, vol. 75, pp. 158–167, 2019.
  • [36] B. Jiang, Z. Zhou, X. Wang, J. Tang, and B. Luo, “Cmsalgan: Rgb-d salient object detection with cross-view generative adversarial networks,” IEEE Transactions on Multimedia, vol. 23, pp. 1343–1353, 2020.
  • [37] K. Song, J. Wang, Y. Bao, L. Huang, and Y. Yan, “A novel visible-depth-thermal image dataset of salient object detection for robotic visual perception,” IEEE/ASME Transactions on Mechatronics, vol. 28, no. 3, pp. 1558–1569, 2022.
  • [38] S. Takken, “Hardware efficient co-detr model for mobile applications,” B.S. thesis, University of Twente, 2024.
  • [39] T. Diwan, G. Anirudh, and J. V. Tembhurne, “Object detection using yolo: Challenges, architectural successors, datasets and applications,” Multimedia Tools and Applications, vol. 82, no. 6, pp. 9243–9275, 2023.
  • [40] W. Fang, L. Wang, and P. Ren, “Tinier-yolo: A real-time object detection method for constrained environments,” IEEE Access, vol. 8, pp. 1935–1944, 2019.
  • [41] R. Huang, J. Pedoeem, and C. Chen, “Yolo-lite: a real-time object detection algorithm optimized for non-gpu computers,” in 2018 IEEE international conference on big data (big data).   IEEE, 2018, pp. 2503–2510.
  • [42] Z. Wang, M. Xiao, J. He, C. Zhang, and K. Fu, “Bimodal information fusion network for salient object detection based on transformer,” in 2022 3rd International Conference on Pattern Recognition and Machine Learning (PRML).   IEEE, 2022, pp. 38–48.
  • [43] N. Yan, T. Zhou, C. Gu, A. Jiang, and W. Lu, “Bimodal-based object detection and instance segmentation models for substation equipments,” in IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society.   IEEE, 2020, pp. 428–434.
  • [44] Y. Zhang, L. Yuan, Y. Guo, Z. He, I.-A. Huang, and H. Lee, “Discriminative bimodal networks for visual localization and detection with natural language queries,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 557–566.
  • [45] J. Zheng, L. Wang, J. Liu, H. Wang, S. Wang, L. Wang, and J. Zhang, “An inspection method of rail head surface defect via bimodal structured light sensors,” International Journal of Machine Learning and Cybernetics, vol. 14, no. 5, pp. 1903–1920, 2023.
  • [46] Z. Tang, T. Xu, Z. Feng, X. Zhu, H. Wang, P. Shao, C. Cheng, X.-J. Wu, M. Awais, S. Atito et al., “Revisiting rgbt tracking benchmarks from the perspective of modality validity: A new benchmark, problem, and method,” arXiv preprint arXiv:2405.00168, 2024.
  • [47] A. Lu, W. Wang, C. Li, J. Tang, and B. Luo, “After: Attention-based fusion router for rgbt tracking,” arXiv preprint arXiv:2405.02717, 2024.
  • [48] Y. Cao, W. Ming, H. Li, B. He, and P. Yu, “Anchor-free ranking-based localization optimized siamese rgb-t object tracking network,” in 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT).   IEEE, 2024, pp. 1047–1051.
  • [49] X. Dai, X. Yuan, and X. Wei, “Tirnet: Object detection in thermal infrared images for autonomous driving,” Applied Intelligence, vol. 51, no. 3, pp. 1244–1261, 2021.
  • [50] Z. Tang, T. Xu, H. Li, X.-J. Wu, X. Zhu, and J. Kittler, “Exploring fusion strategies for accurate rgbt visual object tracking,” Information Fusion, vol. 99, p. 101881, 2023.
  • [51] L. Zhang, M. Danelljan, A. Gonzalez-Garcia, J. Van De Weijer, and F. Shahbaz Khan, “Multi-modal fusion for end-to-end rgb-t tracking,” in Proceedings of the IEEE/CVF International conference on computer vision workshops, 2019, pp. 0–0.
  • [52] T. Zhang, X. He, Y. Luo, Q. Zhang, and J. Han, “Exploring target-related information with reliable global pixel relationships for robust rgb-t tracking,” Pattern Recognition, vol. 155, p. 110707, 2024.
  • [53] Y. Zhu, C. Li, J. Tang, B. Luo, and L. Wang, “Rgbt tracking by trident fusion network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 2, pp. 579–592, 2021.
  • [54] Y. Zhu, C. Li, J. Tang, and B. Luo, “Quality-aware feature aggregation network for robust rgbt tracking,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 121–130, 2020.
  • [55] Z. Tu, W. Pan, Y. Duan, J. Tang, and C. Li, “Rgbt tracking via reliable feature configuration,” Science China Information Sciences, vol. 65, no. 4, p. 142101, 2022.
  • [56] S. Zhai, Y. Wu, L. Liu, and J. Tang, “Rgbt tracking based on modality feature enhancement,” Multimedia Tools and Applications, vol. 83, no. 10, pp. 29311–29330, 2024.
  • [57] W. Zhou, Y. Pan, J. Lei, L. Ye, and L. Yu, “Defnet: Dual-branch enhanced feature fusion network for rgb-t crowd counting,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24540–24549, 2022.
  • [58] W. Gao, G. Liao, S. Ma, G. Li, Y. Liang, and W. Lin, “Unified information fusion network for multi-modal rgb-d and rgb-t salient object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2091–2106, 2021.
  • [59] X. Xiao, X. Xiong, F. Meng, and Z. Chen, “Multi-scale feature interactive fusion network for rgbt tracking,” Sensors, vol. 23, no. 7, p. 3410, 2023.
  • [60] Y. Cai, X. Sui, and G. Gu, “Multi-modal multi-task feature fusion for rgbt tracking,” Information Fusion, vol. 97, p. 101816, 2023.
  • [61] J. Peng, H. Zhao, and Z. Hu, “Dynamic fusion network for rgbt tracking,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 3822–3832, 2022.
  • [62] Y. Zhao, H. Lai, and G. Gao, “Rmfnet: Redetection multimodal fusion network for rgbt tracking,” Applied Sciences, vol. 13, no. 9, p. 5793, 2023.
  • [63] S.-I. Oh and H.-B. Kang, “Object detection and classification by decision-level fusion for intelligent vehicle systems,” Sensors, vol. 17, no. 1, p. 207, 2017.
  • [64] Y. Zhang, H. Yu, Y. He, X. Wang, and W. Yang, “Illumination-guided rgbt object detection with inter-and intra-modality fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
  • [65] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931.
  • [66] J. Orlosky, P. Kim, K. Kiyokawa, T. Mashita, P. Ratsamee, Y. Uranishi, and H. Takemura, “Vismerge: Light adaptive vision augmentation via spectral and temporal fusion of non-visible light,” in 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).   IEEE, 2017, pp. 22–31.
  • [67] M. V. Andersen, R. Greer, A. Møgelmose, and M. M. Trivedi, “Learning to find missing video frames with synthetic data augmentation: A general framework and application in generating thermal images using rgb cameras,” in 2024 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2024, pp. 104–109.
  • [68] J. Chen, W. Yang, C. Liu, and L. Yao, “A data augmentation method for skeleton-based action recognition with relative features,” Applied Sciences, vol. 11, no. 23, p. 11481, 2021.
  • [69] J. Lambrecht and L. Kästner, “Towards the usage of synthetic data for marker-less pose estimation of articulated robots in rgb images,” in 2019 19th International Conference on Advanced Robotics (ICAR).   IEEE, 2019, pp. 240–247.
  • [70] Z. Tu, Z. Li, C. Li, and J. Tang, “Weakly alignment-free rgbt salient object detection with deep correlation network,” IEEE Transactions on Image Processing, vol. 31, pp. 3752–3764, 2022.
  • [71] M. Yuan, Y. Wang, and X. Wei, “Translation, scale and rotation: cross-modal alignment meets rgb-infrared vehicle detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 509–525.
  • [72] T. Zhang, X. He, Q. Jiao, Q. Zhang, and J. Han, “Amnet: Learning to align multi-modality for rgb-t tracking,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [73] L. Liu, C. Li, Y. Xiao, R. Ruan, and M. Fan, “Rgbt tracking via challenge-based appearance disentanglement and interaction,” IEEE Transactions on Image Processing, 2024.
  • [74] H. Li, J. Liu, Y. Zhang, and Y. Liu, “A deep learning framework for infrared and visible image fusion without strict registration,” International Journal of Computer Vision, vol. 132, no. 5, pp. 1625–1644, 2024.
  • [75] J. Tang, D. Fan, X. Wang, Z. Tu, and C. Li, “Rgbt salient object detection: Benchmark and a novel cooperative ranking approach,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4421–4433, 2019.
  • [76] H. O. Velesaca, G. Bastidas, M. Rouhani, and A. D. Sappa, “Multimodal image registration techniques: a comprehensive survey,” Multimedia Tools and Applications, pp. 1–29, 2024.
  • [77] M. Brenner, N. H. Reyes, T. Susnjak, and A. L. Barczak, “Rgb-d and thermal sensor fusion: a systematic literature review,” IEEE Access, 2023.
  • [78] Y. Zhang, C. Xu, W. Yang, G. He, H. Yu, L. Yu, and G.-S. Xia, “Drone-based rgbt tiny person detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 204, pp. 61–76, 2023.
  • [79] L. Tang, Y. Deng, Y. Ma, J. Huang, and J. Ma, “Superfusion: A versatile image registration and fusion network with semantic awareness,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 12, pp. 2121–2137, 2022.
  • [80] L. Liu, C. Li, Y. Xiao, and J. Tang, “Quality-aware rgbt tracking via supervised reliability learning and weighted residual guidance,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3129–3137.
  • [81] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” 2016. [Online]. Available: https://arxiv.org/abs/1506.01497
  • [82] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, SSD: Single Shot MultiBox Detector.   Springer International Publishing, 2016, pp. 21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2
  • [83] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
  • [84] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” 2020. [Online]. Available: https://arxiv.org/abs/1911.09070
  • [85] F. Liu, S. Guan, K. Yu, and H. Gong, “Infrared target detection based on the fusion of mask r-cnn and image enhancement network,” in 2022 China Automation Congress (CAC), 2022, pp. 2011–2016.
  • [86] A. S. Geetha, “Comparing yolov5 variants for vehicle detection: A performance analysis,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12550
  • [87] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” 2019. [Online]. Available: https://arxiv.org/abs/1904.01355
  • [88] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019. [Online]. Available: https://arxiv.org/abs/1904.07850
  • [89] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedestrian detection: Benchmark dataset and baselines,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [90] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” 2021. [Online]. Available: https://arxiv.org/abs/2003.02437
  • [91] J. Shen, Y. Chen, Y. Liu, X. Zuo, H. Fan, and W. Yang, “Icafusion: Iterative cross-attention guided feature fusion for multispectral object detection,” 2023. [Online]. Available: https://arxiv.org/abs/2308.07504
  • [92] Z. Zong, G. Song, and Y. Liu, “Detrs with collaborative hybrid assignments training,” 2023. [Online]. Available: https://arxiv.org/abs/2211.12860
  • [93] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, “Rtmdet: An empirical study of designing real-time object detectors,” 2022. [Online]. Available: https://arxiv.org/abs/2212.07784
  • [94] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” 2021. [Online]. Available: https://arxiv.org/abs/2104.14294
  • [95] Y. Chen, P. Zhang, Z. Li, Y. Li, X. Zhang, L. Qi, J. Sun, and J. Jia, “Dynamic scale training for object detection,” 2021. [Online]. Available: https://arxiv.org/abs/2004.12432
  • [96] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” 2021. [Online]. Available: https://arxiv.org/abs/2104.00680
  • [97] H. Dong, W. Gu, X. Zhang, J. Xu, R. Ai, H. Lu, J. Kannala, and X. Chen, “Superfusion: Multilevel lidar-camera fusion for long-range hd map generation,” 2024. [Online]. Available: https://arxiv.org/abs/2211.15656
  • [98] S. Zhang, X. Wang, J. Wang, J. Pang, C. Lyu, W. Zhang, P. Luo, and K. Chen, “Dense distinct query for end-to-end object detection,” 2023. [Online]. Available: https://arxiv.org/abs/2303.12776
  • [99] D. König, M. Adam, C. Jarvers, G. Layher, H. Neumann, and M. Teutsch, “Fully convolutional region proposal networks for multispectral person detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 243–250, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:8436249
  • [100] C. Li, D. Song, R. Tong, and M. Tang, “Illumination-aware faster r-cnn for robust multispectral pedestrian detection,” 2018. [Online]. Available: https://arxiv.org/abs/1803.05347
  • [101] D. Guan, Y. Cao, J. Liang, Y. Cao, and M. Y. Yang, “Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection,” 2018. [Online]. Available: https://arxiv.org/abs/1802.09972
  • [102] K. Zhou, L. Chen, and X. Cao, “Improving multispectral pedestrian detection by addressing modality imbalance problems,” 2020. [Online]. Available: https://arxiv.org/abs/2008.03043
  • [103] J. Kim, H. Kim, T. Kim, N. Kim, and Y. Choi, “Mlpd: Multi-label pedestrian detector in multispectral domain,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7846–7853, 2021.
  • [104] C. Li, D. Song, R. Tong, and M. Tang, “Multispectral pedestrian detection via simultaneous detection and segmentation,” 2018. [Online]. Available: https://arxiv.org/abs/1808.04818
  • [105] C. Devaguptapu, N. Akolekar, M. M. Sharma, and V. N. Balasubramanian, “Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1029–1038.
  • [106] H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral fusion for object detection with cyclic fuse-and-refine blocks,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 276–280.
  • [107] M. Kieu, A. D. Bagdanov, and M. Bertini, “Bottom-up and layerwise domain adaptation for pedestrian detection in thermal images,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 17, no. 1, Apr. 2021. [Online]. Available: https://doi.org/10.1145/3418213
  • [108] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion transformer for multispectral object detection,” 2022. [Online]. Available: https://arxiv.org/abs/2111.00273
  • [109] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022.
  • [110] X. He, C. Tang, X. Zou, and W. Zhang, “Multispectral object detection via cross-modal conflict-aware learning,” in Proceedings of the 31st ACM International Conference on Multimedia, ser. MM ’23.   New York, NY, USA: Association for Computing Machinery, 2023, pp. 1465–1474. [Online]. Available: https://doi.org/10.1145/3581783.3612651
  • [111] M. Yuan, Y. Wang, and X. Wei, “Translation, scale and rotation: Cross-modal alignment meets rgb-infrared vehicle detection,” 2022. [Online]. Available: https://arxiv.org/abs/2209.13801
  • [112] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, “Gliding vertex on the horizontal bounding box for multi-oriented object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 4, pp. 1452–1459, Apr. 2021. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2020.2974745
Chen Zhou is a master’s student at Beijing Forestry University and currently works at the TeleAI Artificial Intelligence Research Institute of China Telecom. His research focuses on computer vision, speech-driven lip motion, and multimodal generative algorithms. He has won several prestigious awards, including the championship in the Global AI Technology Innovation Competition for Dual-Spectrum Object Detection in Drone Perspectives, the championship in the Ruikang Robotics Developer Competition for Algorithm Optimization, and third place in the “Smart Balance House” AI Challenge for the Target Recognition System Speed Competition.
Peng Cheng is a master’s student at Beijing Forestry University. His research focuses on computer vision, data compression, and multimodal technologies. He has achieved remarkable results in various competitions, including the championship in the SEED Jiangsu Big Data Development and Application Competition, the Global AI Technology Innovation Competition for Dual-Spectrum Object Detection in Drone Perspectives, and the CVPR Challenge on Low-light Object Detection. Altogether, he has won over 20 prestigious competition awards.
Junfeng Fang obtained his PhD from the University of Science and Technology of China. His research primarily focuses on Model Editing and LLM Explainability. He has published in top conferences and journals, including NeurIPS, KDD, ICLR, and TKDE. He has also served as a reviewer for major conferences and journals such as ICLR, KDD, NeurIPS, ICML, and TKDE.
Yibo Yan is currently a Ph.D. candidate in the Artificial Intelligence Thrust, Hong Kong University of Science and Technology (Guangzhou), and the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His primary research interests include multimodal learning, large language models, and natural language processing.
Yifan Zhang (yifanzhang.cs@gmail.com) is with the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China.
Xiaojun Jia received his Ph.D. degree from the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, and the School of Cyber Security, University of Chinese Academy of Sciences, Beijing. He is now a Research Fellow in the Cyber Security Research Centre @ NTU, Nanyang Technological University, Singapore. His research interests include computer vision, deep learning, and adversarial machine learning.
Yanyan Xu received her Ph.D. from the Institute of Software, Chinese Academy of Sciences, and her M.Sc. and B.Sc. from Sun Yat-sen University. She has since joined the School of Information Science and Technology, Beijing Forestry University, where she is currently an associate professor. She also spent a year as a visiting scholar at the Leiden Institute of Advanced Computer Science, Leiden University, in the Netherlands. Her research interests include artificial intelligence, speech processing, and large language models.
Kun Wang obtained his PhD from the University of Science and Technology of China and is currently a postdoctoral researcher at Nanyang Technological University. His research primarily focuses on applications of graph structures, covering areas such as sparsification, data mining (especially spatiotemporal forecasting and Earth sciences), and LLM Agent studies. Dr. Wang is dedicated to enhancing the trustworthiness, interpretability, generalization, and robustness of algorithms in deep learning and artificial intelligence. As the first or corresponding author, he has published over 20 papers in top conferences and journals, including TPAMI, ICML, NeurIPS, KDD, ICLR, AAAI, WWW, and TKDE. He has also served as a reviewer for major conferences and journals such as ICLR, KDD, NeurIPS, ICML, and TKDE.
Xiaochun Cao (Senior Member, IEEE) received the B.E. and M.E. degrees in computer science from Beihang University, China, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, USA. After graduation, he spent about three years at ObjectVideo Inc., as a Research Scientist. From 2008 to 2012, he was a Professor with Tianjin University, Tianjin, China. He has been a Professor with the Institute of Information Engineering, Chinese Academy of Sciences, since 2012. He is currently with the School of Cyber Security, Sun Yat-sen University, China. He is on the Editorial Boards of IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. From 2004 to 2010, he was a recipient of the Piero Zamperoni Best Student Paper Award from the International Conference on Pattern Recognition.