Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Chen Zhou^† Peng Cheng^† Junfeng Fang Yifan Zhang Yibo Yan Xiaojun Jia Yanyan Xu^‡ Kun Wang^‡ Xiaochun Cao
kevinzc9@bjfu.edu.cn xuyanyan@bjfu.edu.cn wk520529wjh@gmail.com
caoxiaochun@mail.sysu.edu.cn Yanyan Xu and Kun Wang are corresponding authors. Chen Zhou and Peng Cheng contributed equally to this work.
Chen Zhou, Peng Cheng and Yanyan Xu are with the Beijing Forestry University.
Yibo Yan is with the Hong Kong University of Science and Technology.
Yifan Zhang is with the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS).
Junfeng Fang is with the National University of Singapore.
Xiaojun Jia and Kun Wang are with the Nanyang Technological University. Xiaochun Cao is with the Sun Yat-sen University.

Abstract

Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these “optimization techniques”. Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training “techniques”, which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques. Our code is available at https://github.com/cpboost/double-co-detr

Index Terms:

Multispectral object detection, Multimodal feature fusion, Spatial alignment, Data augmentation

1 Introduction

Multispectral object detection is a powerful technology that leverages both visible light and infrared spectra for object detection, and it has been widely adopted in various real-world applications [1, 2, 3, 4, 5, 6], including anomaly detection in surveillance systems [7, 8, 9, 10, 11], obstacle recognition in autonomous vehicles [4, 12, 13, 14, 15], defect identification in industrial inspection [5, 6, 16, 17, 18], and threat detection in defense and security [19, 20, 21], to name just a few. While many traditional object detection algorithms [5, 6, 17, 22, 19] have primarily relied on information from a single modality, recent advancements have explored more sophisticated multispectral architectures [23, 24, 25, 26, 27, 28, 29, 30]. In numerous cases, fully exploiting the information from multiple modalities has demonstrated significant advantages [28]. For instance, in low-light conditions, leveraging infrared spectra can enhance the performance of visible light detection, and in complex scenarios, combining information from both spectra can improve detection accuracy [31, 32, 33, 34]. Recently, with the rapid development of satellite remote sensing and thermal imaging technologies [17], many challenging detection datasets have emerged (covering conditions such as low light and extreme weather) [7, 17]. Multispectral detection architectures have demonstrated strong performance on these datasets [6, 17, 22].

However, training multispectral object detection models is known to be highly challenging [23, 28, 29, 35, 36, 37]. Beyond the common issues encountered in training deep architectures, such as vanishing gradients and overfitting [22, 28], multispectral models face several unique challenges that limit their progress on these datasets:

➤ The first challenge lies in effectively utilizing dual-modality data. Simultaneously processing visible and infrared data increases the complexity of dual-modality feature fusion, which may result in suboptimal integration of information from both modalities [23, 36]. This issue is particularly pronounced in earlier multispectral models, where the fusion process often led to information loss, preventing the models from fully leveraging the strengths of both modalities [35, 36]. Additionally, registration discrepancies between the two modalities and the lack of modality-specific enhancement strategies further constrain model performance [37].

➤ The second major challenge is the lack of an effective optimization strategy for converting high-performance single-modality models into dual-modality models. Despite the emergence of numerous powerful single-modality object detection frameworks in recent years [38, 39, 40, 41], there has yet to be a robust method for effectively harnessing the potential of these models while addressing the unique challenges of multispectral object detection.

To address the aforementioned challenges, the promising approaches can be categorized into ➀ dual-modality architectural fusion [26, 27, 28] and ➁ modality-specific enhancements [31, 32, 33], both of which we classify as “training techniques”. The former involves adapting single-modality architectures to dual-modality structures, integrating advanced backbone networks, and employing diverse feature fusion strategies. The latter focuses on processing data from both modalities using techniques such as modality-specific data augmentation and alignment calibration [6]. While these techniques generally contribute to the effective training of multispectral object detection models, their benefits are not always significant or consistent [35, 36, 37]. Furthermore, it is often difficult to distinguish the performance improvements achieved through more complex dual-modality architectures from those gained via these “training techniques”.

In some extreme cases, contrary to initial expectations, single-modality models enhanced with certain optimization techniques may even outperform carefully designed, complex dual-modality architectures [27, 28, 29, 30]. This casts doubt on the pursuit of increased complexity, thereby rendering it a less attractive approach. These observations highlight a critical gap in the study of multispectral object detection: the lack of a standardized benchmark that can fairly and consistently evaluate the effectiveness of training techniques for dual-modality models. Without disentangling the effects of architectural complexity from the “training techniques” applied, it may remain unclear whether multispectral object detection should inherently perform better under otherwise identical conditions.

Our Contribution

To establish such a fair benchmark, our first step was to conduct a comprehensive investigation into the design philosophies and implementation details of dozens of popular multispectral object detection techniques, including various backbone networks, dual-modality fusion strategies, and alignment techniques. Unfortunately, we discovered that even on the same datasets, the implementation of hyperparameter configurations (such as hidden layer dimensions, learning rates, weight decay, dropout rates, number of training epochs, and early stopping patience) is highly inconsistent and often varies depending on specific circumstances. This inconsistency makes it challenging to draw any fair or reliable conclusions.

To this end, we conducted a detailed analysis of these sensitive hyperparameters and standardized them into a “best” hyperparameter set, consistently applied across all experiments. This standardization provides a fair and reproducible benchmark for training multispectral object detection models. Subsequently, we explored various combinations of training techniques across several classical multispectral object detection datasets, leveraging common single-modality model backbones and optimizing them for dual-modality detection tasks.

Our comprehensive study produced several notable results. Based on the characteristics of different single-modality model backbones, framework features, and detection sample characteristics, we developed several effective training techniques and optimization strategies, enabling us to achieve state-of-the-art results on multiple representative datasets. Furthermore, we proposed several optimization strategies with strong transferability, demonstrating excellent performance across multiple dual-modality public datasets.¹

¹ Our research was awarded the championship in the Global Artificial Intelligence Innovation Competition (GAIIC, https://gaiic.caai.cn/ai2024), out of over 1,200 participants, 1,000+ teams, and 8,200+ submissions.

Specifically, our contributions are as follows:

➥ Multimodal Feature Fusion: we introduce advanced multimodal feature fusion techniques to effectively integrate visible and infrared data, enhancing the feature representation capabilities of multispectral object detection models, especially in complex environments.

➥ Dual-Modality Data Augmentation: we employ modality-specific data augmentation strategies that cater to the distinct characteristics of visible and infrared data, improving the model’s robustness in varying environmental and complex scenarios.

➥ Alignment Optimization: by implementing precise alignment techniques, we improve spatial consistency between visible and infrared data, reducing inter-modality misalignment and significantly enhancing performance in low-light object detection and multimodal information fusion.

➥ Optimizing Single-Modality Models for Dual-Modality Tasks: we provide a new benchmark and training techniques to effectively adapt high-performing single-modality models into dual-modality detection models. Through these optimizations, single-modality models outperform even complex, large-scale dual-modality detection models, offering strong support for their migration to dual-modality tasks.

2 Related work

2.1 Multispectral Object Detection & Training Challenges

Multispectral object detection has achieved state-of-the-art performance in applications like autonomous driving and drone-based remote sensing [5, 6, 7, 8, 9, 10]. However, implementing multispectral detection is challenging, especially when dealing with images from distinct spectra, such as visible light (RGB) and thermal infrared (TIR) [11, 16, 17]. Existing methods [42, 43, 44, 45] face several issues, including spectral differences [46], spatial misalignment [47], and high sensitivity to environmental conditions [48], limiting their generalization across diverse scenarios. While recent studies have introduced various training techniques, they often struggle to deliver consistent performance improvements when applied to complex remote sensing data [49, 50], differing from the dual-modality detection benchmarks discussed in this paper.

To address these challenges, techniques such as multimodal feature fusion, registration alignment, and dual-modality data augmentation have been developed in recent years [24, 25, 26, 27, 28]. The following sections provide a detailed exploration of these techniques and their applications.

2.2 Multimodal Feature Fusion

In multispectral object detection, feature fusion plays a crucial role in enhancing model performance. Current fusion methods are generally categorized into three types: pixel-level, feature-level, and decision-level fusion. Pixel-level fusion [51, 52, 53] integrates RGB and TIR images at the input stage, allowing early information combination but potentially introducing noise or misalignment due to differences in resolution and viewpoints. Feature-level fusion [54, 55, 56, 57, 58] combines high-level features from both modalities at intermediate layers, utilizing techniques like concatenation, weighting, or attention mechanisms to better capture complementary information, though it may add computational overhead [59, 60]. Decision-level fusion [50, 61, 62, 63, 64, 65] merges independent detection results from each modality at the final stage, providing efficiency and stable performance, especially when the modalities offer relatively independent information.

2.3 Dual-Modality Data Augmentation

In multispectral object detection, data augmentation is crucial for improving model generalization and reducing overfitting [27, 28]. While traditional techniques like flipping, rotation, and scaling work well in single-modality detection [25, 26], the fusion of RGB and TIR images introduces higher complexity. A common approach is to apply synchronized augmentation to both RGB and TIR images [3, 6] to ensure consistency between the modalities. Techniques such as random cropping, scaling, and color transformations increase image diversity and help the model adapt to varying environmental conditions [66, 67]. Additionally, some studies propose joint data augmentation methods, such as mixed modal augmentation, which exchanges pixels or features between modalities to enhance robustness against modality differences [4, 5, 68, 69], ultimately improving detection performance in challenging scenarios.
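As a minimal sketch of the synchronized augmentation described above, the following NumPy example applies one shared random horizontal flip to an RGB image, its TIR counterpart, and the bounding boxes. The function name, the `[x1, y1, x2, y2]` pixel box format, and the flip probability are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def synchronized_hflip(rgb, tir, boxes, rng, p=0.5):
    """Flip RGB (h, w, 3), TIR (h, w, 1) and boxes (n, 4) together.

    A single random draw decides the flip for both modalities, so the
    pair stays spatially consistent. Boxes are [x1, y1, x2, y2] in pixels
    (an assumed format for this sketch).
    """
    if rng.random() >= p:
        return rgb, tir, boxes
    width = rgb.shape[1]
    flipped = boxes.copy()
    flipped[:, 0] = width - boxes[:, 2]  # new x1 mirrors old x2
    flipped[:, 2] = width - boxes[:, 0]  # new x2 mirrors old x1
    return rgb[:, ::-1], tir[:, ::-1], flipped
```

Drawing the random decision once, rather than once per modality, is what keeps the two views and their shared annotations aligned.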

2.4 Registration Alignment

Registration alignment techniques are employed to address spatial discrepancies between images from different sensors, such as RGB and TIR. Differences in resolution and viewpoints often lead to misalignment and distortion, which can negatively impact feature fusion and detection performance [70, 71]. Traditional alignment methods [72, 73, 74], such as scaling, rotation, and affine transformation, are used to align the images but tend to be limited in complex scenes or when nonlinear deformations are present. Recently, deep learning-based alignment techniques [64, 75, 76, 77, 78, 79] have emerged, achieving pixel-level precision by learning feature mappings between RGB and TIR images, and using contrastive loss or self-supervised learning to ensure spatial consistency. Some methods also incorporate attention mechanisms to dynamically adjust feature alignment [47, 70, 74, 80], enhancing both local detail and global consistency.

3 Method

In this section, we systematically discuss how to improve existing dual-modality object detection algorithms. The discussion focuses on three key aspects: multimodal feature fusion, dual-modality data augmentation, and registration alignment. Specifically, Section 3.1 details the hyperparameter configurations and datasets used in our experiments, while Sections 3.2, 3.3, and 3.4 discuss multimodal feature fusion, dual-modality data augmentation, and registration alignment, respectively.

TABLE I: Configurations of the optimal hyperparameters adopted to implement different single models for training on the KAIST dataset.

Method	Total epoch	Learning rate & Decay	Weight decay	Dropout
YOLOv3 [2]	100	$1\times 10^{-2}$	$5\times 10^{-4}$	0.5
Faster R-CNN [81]	80	$1\times 10^{-2}$	$1\times 10^{-4}$	0.5
SSD [82]	120	$2\times 10^{-3}$	$5\times 10^{-4}$	0.5
RetinaNet [83]	100	$1\times 10^{-3}$	$1\times 10^{-4}$	0.3
EfficientDet [84]	150	$5\times 10^{-4}$	$4\times 10^{-5}$	0.3
Mask R-CNN [85]	90	$2\times 10^{-2}$	$1\times 10^{-4}$	0.5
YOLOv5 [86]	300	$1\times 10^{-2}$	$5\times 10^{-4}$	0.5
CenterNet [87]	140	$1\times 10^{-3}$	$1\times 10^{-4}$	0.4
FCOS [88]	120	$2.5\times 10^{-3}$	$1\times 10^{-4}$	0.5
Cascade R-CNN [85]	100	$5\times 10^{-3}$	$1\times 10^{-4}$	0.5

3.1 Standardized Experimental Configuration

We conducted a comprehensive analysis of previous single-modality object detection models applied to dual-modality detection tasks. To further enhance model performance and improve the robustness of our benchmarking, we also performed hyperparameter optimization and fine-tuning on these models. Based on dual-modality datasets (including both RGB and TIR data), we systematically explored the adaptability of single-modality models in integrating multimodal information, with particular emphasis on their performance across different modalities. The key hyperparameter configurations are presented in Table I.

Through a grid search approach, we optimized the hyperparameters for all methods and identified the most generalizable and effective configuration. This configuration was selected based on the best performance of various single-modality detection models across multiple datasets, focusing on parameters such as learning rate, weight decay, and dropout. Ultimately, we proposed this “optimal hyperparameter configuration” and strictly adhered to it in our experiments.

Specifically, the final configuration consists of a learning rate of 0.01 with decay, a weight decay of 0.0001, and a dropout rate of 0.5. We believe that this setup provides stable and efficient performance across a range of multispectral object detection tasks, ensuring fair comparisons between different methods under the same conditions.

In each experiment, we trained for up to 200 epochs, with early stopping set to a patience of 20 epochs. To minimize the impact of random variations, each experiment was repeated 20 times, and the results were averaged to obtain the final performance metrics.
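The protocol above can be sketched in a few lines of plain Python. The numeric values come from the text; the dictionary keys, the function names, and the `validate` callback (assumed to train one epoch and return a score where higher is better) are hypothetical and only serve the illustration.

```python
from statistics import mean

# Values quoted in the text; key names are illustrative, not the authors'.
CONFIG = {
    "learning_rate": 0.01,
    "weight_decay": 1e-4,
    "dropout": 0.5,
    "max_epochs": 200,
    "patience": 20,
    "num_runs": 20,
}

def train_with_early_stopping(validate, max_epochs, patience):
    """Return (best_score, epochs_run).

    `validate(epoch)` is a hypothetical callback that trains for one epoch
    and returns a validation score (higher is better).
    """
    best, stale = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        score = validate(epoch)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                return best, epoch
    return best, max_epochs

def averaged_result(make_validate, cfg=CONFIG):
    """Repeat the experiment with different seeds and average the best scores."""
    scores = [
        train_with_early_stopping(
            make_validate(seed), cfg["max_epochs"], cfg["patience"]
        )[0]
        for seed in range(cfg["num_runs"])
    ]
    return mean(scores)
```

For a lower-is-better metric such as Miss Rate, the callback would return the negated value so that the same comparison applies.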

Our subsequent experiments utilized the KAIST [89], FLIR, and DroneVehicle [90] datasets. The KAIST dataset is a benchmark for pedestrian detection, combining visible and infrared images to evaluate multispectral detection algorithms. The FLIR dataset includes multi-class vehicle and pedestrian detection tasks with high-resolution thermal imagery. The DroneVehicle dataset focuses on multi-class object detection from a drone’s perspective, covering various complex scenarios.

Regarding performance evaluation, we employed task-specific metrics tailored to the characteristics of each dataset. For the KAIST dataset, we selected Miss Rate as the primary evaluation metric due to its sensitivity to missed detections, which is critical in this context. In contrast, for the FLIR and DroneVehicle datasets, we used mean Average Precision (mAP) as the evaluation metric, as these datasets involve multi-class object detection, and mAP provides a more comprehensive assessment of detection accuracy across classes. This dataset-specific approach ensures a thorough and accurate evaluation of each method’s performance in diverse tasks.
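To make the two metrics concrete, the sketch below computes a plain miss rate (one minus recall) and mAP as the unweighted mean of per-class AP values. Both helper names are illustrative. The sketch assumes detections have already been matched to ground truth; KAIST results are commonly reported as a log-average miss rate over a range of false positives per image, which is not implemented here.

```python
def miss_rate(num_true_positives, num_ground_truth):
    """Fraction of ground-truth objects that were not detected (1 - recall)."""
    if num_ground_truth == 0:
        return 0.0
    return (num_ground_truth - num_true_positives) / num_ground_truth

def mean_average_precision(per_class_ap):
    """mAP as the unweighted mean of per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```

Lower is better for the miss rate and higher is better for mAP, which is why the two dataset groups cannot be compared on a single scale.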

By leveraging a unified hyperparameter configuration and dataset-specific evaluation metrics, we ensure fair and consistent comparisons between different methods, providing a robust foundation for subsequent performance improvements.

3.2 Multimodal Feature Fusion

Formulations. In multispectral object detection, multimodal feature fusion techniques aim to effectively integrate complementary information from both RGB and TIR images, enhancing detection accuracy and model robustness. Multimodal feature fusion can be categorized into three main approaches: pixel-level fusion, feature-level fusion, and decision-level fusion.

▲

Pixel-Level Fusion. The fusion of RGB and TIR images at the pixel level involves a series of tensor-based transformations, incorporating both adaptive weighting and convolutional refinement to effectively integrate the complementary modalities. Let $\mathcal{I}_{\text{R}}\in\mathbb{R}^{h\times w\times 3}$ denote the RGB image and $\mathcal{I}_{\text{T}}\in\mathbb{R}^{h\times w\times 1}$ denote the TIR image. Initially, the TIR image is expanded to a three-channel format:

\mathcal{I}_{\text{T}}^{\prime}=\mathcal{E}\left(\mathcal{I}_{\text{T}},\mathcal{J}_{3}\right)=\sum_{c=1}^{3}\left(\mathcal{I}_{\text{T}}\otimes\mathcal{J}_{3}^{(c)}\right)\in\mathbb{R}^{h\times w\times 3},

(1)

where $\mathcal{J}_{3}$ denotes a tensor of shape $1\times 1\times 3$ , employed to replicate the TIR image along the channel dimension via the tensor outer product operation $\otimes$ . The summation term $\sum_{c=1}^{3}$ indicates that this operation is applied independently to each channel, expanding the single-channel TIR image $\mathcal{I}_{\text{T}}$ into a three-channel representation $\mathcal{I}_{\text{T}}^{{}^{\prime}}$ consistent with the structure of RGB images.
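A minimal NumPy sketch of the channel expansion in Eq. (1) follows. It assumes that each $\mathcal{J}_{3}^{(c)}$ is a one-hot selector of shape $1\times 1\times 3$, so that summing the three products copies the TIR values into every channel; this reading and the function name are assumptions for illustration.

```python
import numpy as np

def expand_tir(tir):
    """Replicate a single-channel TIR image (h, w, 1) to (h, w, 3), as in Eq. (1)."""
    if tir.ndim != 3 or tir.shape[2] != 1:
        raise ValueError("expected a TIR image of shape (h, w, 1)")
    # Assumed form of J_3^{(c)}: one-hot selectors with shape (1, 1, 3).
    basis = np.eye(3).reshape(3, 1, 1, 3)
    # Broadcasting (h, w, 1) * (1, 1, 3) places the image in channel c.
    return sum(tir * basis[c_idx] for c_idx in range(3))
```

Under this reading the result is identical to `np.repeat(tir, 3, axis=2)`, which is the more direct way to write it in practice.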

Next, adaptive pixel-wise weighting matrices $\mathcal{W}_{\text{R}}\in\mathbb{R}^{h\times w\times 3}$ and $\mathcal{W}_{\text{T}}\in\mathbb{R}^{h\times w\times 3}$ are introduced for the RGB and TIR channels. The intermediate fused image $\mathcal{I}_{\text{interim}}$ is defined as:

\mathcal{I}_{\text{interim}}=\mathcal{W}_{\text{R}}\odot\mathcal{I}_{\text{R}}+\mathcal{W}_{\text{T}}\odot\mathcal{I}_{\text{T}}^{\prime}+\boldsymbol{\eta}(\mathbf{n}),

(2)

where $\boldsymbol{\eta}(\mathbf{n})$ represents a noise model parameterized by a Gaussian random vector $\mathbf{n}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ , accounting for the inherent uncertainty in sensor measurements.
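The weighted sum in Eq. (2) can be sketched as follows. For illustration only, the noise model is assumed to be the identity applied to isotropic Gaussian noise (that is, $\boldsymbol{\Sigma}=\sigma^{2}\mathbf{I}$), and the weight maps are passed in as plain arrays rather than learned; the function and parameter names are hypothetical.

```python
import numpy as np

def interim_fusion(rgb, tir3, w_r, w_t, noise_std=0.0, rng=None):
    """Eq. (2): W_R * I_R + W_T * I_T' + eta(n), all of shape (h, w, 3).

    Assumption: eta is the identity and n is isotropic Gaussian noise with
    standard deviation `noise_std`; set it to 0 to disable the noise term.
    """
    rng = np.random.default_rng() if rng is None else rng
    if noise_std > 0:
        noise = rng.normal(0.0, noise_std, size=rgb.shape)
    else:
        noise = 0.0
    return w_r * rgb + w_t * tir3 + noise
```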

To refine the fusion process and incorporate spatial context, convolutional transformations are applied using modality-specific kernels $\mathcal{K}_{\text{R}}\in\mathbb{R}^{k\times k\times 3\times 3}$ and $\mathcal{K}_{\text{T}}\in\mathbb{R}^{k\times k\times 3\times 3}$ . The convolutional outputs are expressed as:

\varphi_{\ell}=\begin{cases}\mathcal{K}_{\text{R}}\ast\mathcal{I}_{\text{R}}+\beta_{\text{R}},&\text{if }\ell=\text{R}\\ \mathcal{K}_{\text{T}}\ast\Psi_{\text{T}}+\beta_{\text{T}},&\text{if }\ell=\text{T}\end{cases},

(3)

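where $\ast$ denotes the convolution operation, $\Psi_{\text{T}}$ is the TIR input to the convolution (not defined explicitly; presumably the expanded image $\mathcal{I}_{\text{T}}^{\prime}$), and $\beta_{\text{R}}$ and $\beta_{\text{T}}$ are the learnable bias terms.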

The final fused image $\mathcal{I}_{\text{fuse}}$ is obtained through a non-linear fusion strategy that incorporates spatially adaptive weight mappings. We introduce non-linear mapping functions $\mathcal{F}_{\text{R}}(\cdot,\cdot)$ and $\mathcal{F}_{\text{T}}(\cdot,\cdot)$ . The $\mathcal{I}_{\text{fuse}}$ is expressed as:

\mathcal{I}_{\text{fuse}}=\mathcal{F}_{\text{R}}\left(\mathcal{I}_{\text{interim}},\mathcal{W}_{\text{R}}\right)\odot\varphi_{\text{R}}+\mathcal{F}_{\text{T}}\left(\mathcal{I}_{\text{interim}},\mathcal{W}_{\text{T}}\right)\odot\varphi_{\text{T}}.

(4)

The functions $\mathcal{F}_{\text{R}}(\cdot,\cdot)$ and $\mathcal{F}_{\text{T}}(\cdot,\cdot)$ can be simplified as follows:

\mathcal{F}_{\ell}\left(\mathcal{I}_{\text{interim}},\mathcal{W}_{\ell}\right)=\sigma\left(\mathcal{W}_{\ell}+\boldsymbol{\alpha}_{\ell}\odot\delta\left(\mathcal{G}_{\ell}\left(\mathcal{I}_{\text{interim}}\right)\right)\right),

(5)

where $\ell\in\{\text{R},\text{T}\}$ , $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ represents the hyperbolic tangent function, and $\mathcal{G}_{\ell}(\cdot)$ is a non-linear spatial filtering operation. The term $\boldsymbol{\alpha}_{\ell}\in\mathbb{R}^{h\times w\times 3}$ is a learnable scaling tensor.

The proposed pixel-level fusion scheme integrates adaptive weighting, convolutional refinement, and a multi-layered non-linear transformation pipeline to enhance representation capacity. The noise modeling term $\boldsymbol{\eta}(\mathbf{n})$ improves robustness, while the activation functions $\sigma(\cdot)$ and $\delta(\cdot)$ facilitate non-linear interactions between the RGB and TIR modalities.
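The refinement and gating steps of Eqs. (3)-(5) can be sketched in NumPy as below. Several choices are assumptions made for the sketch: the convolution is a naive "same" cross-correlation with an odd kernel size, the TIR branch convolves the expanded three-channel TIR image, and $\mathcal{G}_{\ell}$ is stood in for by a 3×3 local maximum filter. All function names are hypothetical, and parameters are passed in rather than learned.

```python
import numpy as np

def conv2d_same(x, kernel, bias):
    """Naive 'same' cross-correlation. x: (h, w, c_in); kernel: (k, k, c_in, c_out), k odd."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = x.shape
    out = np.zeros((h, w, kernel.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted view times the (c_in, c_out) slice of the kernel.
            out += xp[i:i + h, j:j + w, :] @ kernel[i, j]
    return out + bias

def local_max(x, k=3):
    """Stand-in for the non-linear spatial filter G (an assumption): k x k max filter."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w, _ = x.shape
    windows = [xp[i:i + h, j:j + w, :] for i in range(k) for j in range(k)]
    return np.max(windows, axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(interim, w, alpha, g=local_max):
    """Eq. (5): sigma(W + alpha * tanh(G(I_interim)))."""
    return sigmoid(w + alpha * np.tanh(g(interim)))

def fuse_pixels(rgb, tir3, interim, w_r, w_t, k_r, k_t, b_r, b_t, a_r, a_t):
    """Eqs. (3)-(4): gate each convolved modality and sum the two branches."""
    phi_r = conv2d_same(rgb, k_r, b_r)
    phi_t = conv2d_same(tir3, k_t, b_t)
    return gate(interim, w_r, a_r) * phi_r + gate(interim, w_t, a_t) * phi_t
```

With identity kernels, zero biases, and zero weights and scales, both gates equal 0.5 and the output reduces to the plain average of the two inputs, which is a convenient sanity check.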

▲

Feature-Level Fusion. The mainstream feature-level fusion methods primarily include convolution-based Network-in-Network (NIN) modules and bidirectional attention-based Iterative Cross-modal Feature Enhancement (ICFE) modules [91]. The following sections provide an in-depth introduction to each approach.

NIN Module. To achieve independent localized non-linear transformations on the RGB and TIR modalities, we first designed a network-in-network module integrated with a residual structure. This module serves as a foundational step for subsequent cross-modal feature enhancement and interaction, where we leverage 1x1 convolutions to apply fine-grained, spatially localized non-linear mappings that improve the expressiveness of feature representations. Let $\mathcal{X}_{\text{R}}^{l}\in\mathbb{R}^{H\times W\times C}$ and $\mathcal{X}_{\text{T}}^{l}\in\mathbb{R}^{H\times W\times C}$ denote the RGB and TIR modality feature maps at layer $l$ , respectively. We define learnable 1x1 convolution kernels $\mathcal{W}_{\text{R}}^{(1\times 1)}\in\mathbb{R}^{C\times C}$ and $\mathcal{W}_{\text{T}}^{(1\times 1)}\in\mathbb{R}^{C\times C}$ , applying them with residual connections to each modality for localized feature transformations, as follows:

\mathcal{D}_{\ell}^{l}=\begin{cases}\mathcal{X}_{\text{R}}^{l}+\left(\mathcal{W}_{\text{R}}^{(1\times 1)}\ast\mathcal{X}_{\text{R}}^{l}+\zeta_{\text{R}}^{l}\right),&\text{if }\ell=\text{R}\\ \mathcal{X}_{\text{T}}^{l}+\left(\mathcal{W}_{\text{T}}^{(1\times 1)}\ast\mathcal{X}_{\text{T}}^{l}+\zeta_{\text{T}}^{l}\right),&\text{if }\ell=\text{T}\end{cases}.

(6)

The residual connection within this transformation preserves original modality-specific information in the transformed feature, mitigating potential information loss or distortion. To achieve adaptive fusion of RGB and TIR modalities, we introduce dynamic weighting coefficients $\alpha_{\text{R}}$ and $\alpha_{\text{T}}$ , computed through a transformation $\nu(\cdot)$ followed by a shared non-linear function $\sigma(\cdot)$ applied to each transformed feature map:

\alpha_{\text{R}}=\sigma(\nu(\mathcal{D}_{\text{R}}^{l})),\quad\alpha_{\text{T}}=\sigma(\nu(\mathcal{D}_{\text{T}}^{l})).

(7)

The final fused feature $\mathcal{D}_{\text{fuse}}^{l}$ at layer $l$ is given as follows:

\mathcal{D}_{\text{fuse}}^{l}=\alpha_{\text{R}}\odot\mathcal{D}_{\text{R}}^{l}+\alpha_{\text{T}}\odot\mathcal{D}_{\text{T}}^{l}.

(8)

Through these operations, the NIN module not only performs modality-specific, localized feature transformations but also enables adaptive and balanced feature fusion. This module strengthens the feature discriminability and robustness, while preserving localized information via non-linear activation and residual connections.
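Eqs. (6)-(8) can be sketched in NumPy by treating the 1×1 convolution as a per-pixel matrix product over channels. The sketch assumes that $\nu(\cdot)$ is global average pooling per channel and that $\sigma(\cdot)$ is the sigmoid; neither choice is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def nin_branch(x, w, bias):
    """Eq. (6): residual 1x1 convolution. x: (H, W, C); w: (C, C); bias: (C,)."""
    return x + (x @ w + bias)

def dynamic_weight(d):
    """Eq. (7), assuming nu = global average pooling and sigma = sigmoid."""
    pooled = d.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, C)
    return 1.0 / (1.0 + np.exp(-pooled))

def nin_fuse(x_r, x_t, w_r, w_t, b_r, b_t):
    """Eq. (8): weighted sum of the two transformed feature maps."""
    d_r = nin_branch(x_r, w_r, b_r)
    d_t = nin_branch(x_t, w_t, b_t)
    return dynamic_weight(d_r) * d_r + dynamic_weight(d_t) * d_t
```

Because each weight is computed independently per modality, $\alpha_{\text{R}}$ and $\alpha_{\text{T}}$ are not constrained to sum to one in this form.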

ICFE Module. The ICFE module progressively enhances feature representations of RGB and TIR modalities by iteratively exchanging and refining complementary information, ultimately producing a single fused feature representation. Let $\mathcal{T}_{\text{R}}^{(0)}$ and $\mathcal{T}_{\text{T}}^{(0)}$ represent the initial RGB and TIR features, respectively, and let the final fused feature representation after $n$ iterations be denoted as $\mathcal{V}_{\text{fuse}}^{(\text{n})}$ . The following outlines the detailed formulae of this process.

At the k-th iteration, multi-head queries, keys, and values are generated for both the RGB and TIR modalities. Suppose there are H attention heads, indexed by h. For the h-th attention head, we compute the query matrix $\mathcal{Q}_{\text{R}}^{(\text{k,h})}$ for RGB features, and the key matrix $\mathcal{K}_{\text{T}}^{(\text{k,h})}$ and value matrix $\mathcal{V}_{\text{T}}^{(\text{k,h})}$ for TIR features:

\mathcal{Q}_{\text{R}}^{(\text{k,h})}=\mathcal{T}_{\text{R}}^{(\text{k})}\mathcal{W}_{\text{Q}}^{(\text{h})},\quad\mathcal{K}_{\text{T}}^{(\text{k,h})}=\mathcal{T}_{\text{T}}^{(\text{k})}\mathcal{W}_{\text{K}}^{(\text{h})},\quad\mathcal{V}_{\text{T}}^{(\text{k,h})}=\mathcal{T}_{\text{T}}^{(\text{k})}\mathcal{W}_{\text{V}}^{(\text{h})},

(9)

where $\mathcal{W}_{\text{Q}}^{(\text{h})},\mathcal{W}_{\text{K}}^{(\text{h})},\mathcal{W}_{\text{V}}^{(\text{h})}\in\mathbb{R}^{\text{d}\times\text{d}_{\text{H}}}$ are learnable projection matrices, and $\text{d}_{\text{H}}=\text{d}/\text{H}$ represents the dimensionality per attention head.

To obtain the cross-modally enhanced RGB features $\mathcal{Z}_{\text{R}}^{\text{(k,h)}}$ , we calculate the weighted matrix by applying the softmax function to the scaled dot product of the query and key matrices, then multiply it with the value matrix:

\mathcal{Z}_{\text{R}}^{(\text{k,h})}=\text{softmax}\left(\frac{\mathcal{Q}_{\text{R}}^{(\text{k,h})}\left(\mathcal{K}_{\text{T}}^{(\text{k,h})}\right)^{\top}}{\sqrt{\text{d}_{\text{H}}}}\right)\mathcal{V}_{\text{T}}^{(\text{k,h})}.

(10)

Then, we concatenate the features from all attention heads (denoted by $\Gamma$ as the concatenation operation) and project them back to the original feature space using an output projection matrix $\mathcal{W}_{\text{O}}$ :

\mathcal{Z}_{\text{R}}^{(\text{k})}=\Gamma\left(\mathcal{Z}_{\text{R}}^{(\text{k,1})},\ldots,\mathcal{Z}_{\text{R}}^{(\text{k,H})}\right)\mathcal{W}_{\text{O}},

(11)

where $\Gamma(\cdot)$ represents the concatenation operation applied across all attention heads.
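The cross-modal attention of Eqs. (9)-(11) can be sketched in NumPy as follows. The sketch assumes the feature maps have been flattened to token matrices of shape (N, d) and that the per-head projections are stacked along a leading axis; the function names and argument layout are illustrative, and the weights are passed in rather than learned.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(t_q, t_kv, w_q, w_k, w_v, w_o):
    """Queries from t_q (N_q, d); keys and values from t_kv (N_kv, d).

    w_q, w_k, w_v: (H, d, d_H) stacked per-head projections; w_o: (H * d_H, d).
    """
    d_h = w_q.shape[2]
    heads = []
    for h in range(w_q.shape[0]):
        q = t_q @ w_q[h]                          # Eq. (9)
        k = t_kv @ w_k[h]
        v = t_kv @ w_v[h]
        attn = softmax(q @ k.T / np.sqrt(d_h))    # Eq. (10)
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1) @ w_o   # Eq. (11)
```

Calling `cross_attention(t_rgb, t_tir, ...)` gives the enhanced RGB features; swapping the first two arguments gives the TIR counterpart.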

In each iteration, the RGB and TIR features are combined to produce an intermediate fused feature representation $\mathcal{V}_{\text{fuse}}^{(\text{k})}$ , with learnable weighting coefficients $\lambda^{(\text{k})}$ and $\mu^{(\text{k})}$ controlling the fusion:

\mathcal{V}_{\text{fuse}}^{(\text{k})}=\lambda^{(\text{k})}\odot\mathcal{Z}_{\text{R}}^{(\text{k})}+\mu^{(\text{k})}\odot\mathcal{Z}_{\text{T}}^{(\text{k})},

(12)

where $\mathcal{Z}_{\text{T}}^{(\text{k})}$ is the cross-modally enhanced TIR feature obtained symmetrically to $\mathcal{Z}_{\text{R}}^{(\text{k})}$ .

To further enhance non-linear representation capabilities, a non-linear activation function $\delta(\cdot)$ is applied with residual connection to the fused feature in each iteration. After $n$ iterations, the final fused feature representation is given by:

\mathcal{V}_{\text{fuse}}^{(\text{n})}=\mathcal{V}_{\text{fuse}}^{(\text{n-1})}+\delta\left(\mathcal{V}_{\text{fuse}}^{(\text{n-1})}\right).

(13)
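The per-iteration fusion of Eq. (12) and the residual activation of Eq. (13) are short enough to sketch directly. The sketch takes the enhanced features as given, uses scalar or broadcastable weights in place of the learnable coefficients, and assumes $\delta(\cdot)$ is the hyperbolic tangent. The text does not specify how the modality features are updated between iterations, so no iteration loop is shown.

```python
import numpy as np

def fuse_step(z_r, z_t, lam, mu):
    """Eq. (12): weighted combination of the enhanced RGB and TIR features."""
    return lam * z_r + mu * z_t

def residual_activation(v, delta=np.tanh):
    """Eq. (13): v + delta(v); tanh is an assumed choice of delta."""
    return v + delta(v)
```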

▲

Decision-Level Fusion. In decision-level fusion, RGB and TIR modalities undergo separate feature extraction and preliminary detection, and their fusion occurs at the final decision stage. Let the detection results for RGB and TIR modalities be denoted as $\mathcal{M}_{\text{R}}$ and $\mathcal{M}_{\text{T}}$ , respectively. The following describes two advanced fusion strategies for combining these decisions.

Confidence-Based Weighting with Normalization. To refine the fusion process, confidence scores $\mathcal{C}_{\text{R}}$ and $\mathcal{C}_{\text{T}}$ reflect each modality’s reliability and serve as normalization factors. These scores are obtained through a scaling function $\psi(\cdot)$ and normalized using $\tau(\cdot)$ :

\mathcal{C}_{\text{R}}=\tau(\psi(\mathcal{M}_{\text{R}})),\quad\mathcal{C}_{\text{T}}=\tau(\psi(\mathcal{M}_{\text{T}})).

(14)

The confidence-weighted fusion result $\mathcal{Q}_{\text{fuse}}$ is:

\mathcal{Q}_{\text{fuse}}=\frac{\left(\mathcal{C}_{\text{R}}\odot\mathcal{M}_{\text{R}}+\mathcal{C}_{\text{T}}\odot\mathcal{M}_{\text{T}}\right)+\epsilon}{\mathcal{C}_{\text{R}}+\mathcal{C}_{\text{T}}+\epsilon},

(15)

where $\odot$ represents element-wise weighting, and $\epsilon$ is a small constant to prevent division by zero, thereby stabilizing the computation.
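A minimal NumPy sketch of Eqs. (14)-(15) is given below. It assumes that $\mathcal{M}_{\text{R}}$ and $\mathcal{M}_{\text{T}}$ are score vectors for detections that have already been matched across modalities, that $\psi(\cdot)$ is a temperature scaling, and that $\tau(\cdot)$ clips to $[0, 1]$. These choices and the function names are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def confidence(scores, temperature=1.0):
    """Eq. (14) with assumed psi = temperature scaling and tau = clipping to [0, 1]."""
    scaled = np.asarray(scores, dtype=float) / temperature
    return np.clip(scaled, 0.0, 1.0)

def confidence_weighted_fusion(m_r, m_t, eps=1e-6):
    """Eq. (15): confidence-weighted average of matched detection scores."""
    m_r = np.asarray(m_r, dtype=float)
    m_t = np.asarray(m_t, dtype=float)
    c_r, c_t = confidence(m_r), confidence(m_t)
    return (c_r * m_r + c_t * m_t + eps) / (c_r + c_t + eps)
```

Note that, as Eq. (15) is written, $\epsilon$ appears in both numerator and denominator, so the fused value tends to 1 when both confidences are zero; an implementation may prefer to keep $\epsilon$ in the denominator only.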

Hierarchical Fusion with Multi-stage Process. Hierarchical fusion enhances robustness by applying both local and global fusion steps. Initially, a region-based fusion is applied independently within each modality. This local fusion step can be represented as:

\mathcal{Q}_{\text{local}}=h_{\text{local}}\left(\kappa_{\text{R}}\cdot\mathcal{M}_{\text{R}},\kappa_{\text{T}}\cdot\mathcal{M}_{\text{T}}\right),

(16)

where $h_{\text{local}}(\cdot)$ represents the local fusion function, such as Simple Average, Confidence-Weighted Average, or Maximum Selection, and $\kappa_{\text{R}}$ and $\kappa_{\text{T}}$ are weighting factors specific to each modality.

After obtaining the locally fused results, a global aggregation function combines these results across regions or categories. The global fusion step is given by:

\mathcal{Q}_{\text{fuse}}=h_{\text{global}}\left(\sum_{i=1}^{N}\theta_{i}\,\mathcal{Q}_{\text{local}}^{(\text{i})}\right),

(17)

where $h_{\text{global}}(\cdot)$ denotes the global fusion function, $N$ is the number of local regions or categories, and $\theta_{i}$ are adaptive coefficients for each local fused region $\mathcal{Q}_{\text{local}}^{(\text{i})}$ .

This hierarchical approach provides finer control over region-specific interactions, enhancing robustness in complex scenes.
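
The two-stage process of Eqs. (16) and (17) can be sketched as follows, assuming one scalar score per region and modality. The dictionary of local fusers, the normalisation of $\theta_{i}$, and the use of clipping as $h_{\text{global}}$ are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical implementations of the three h_local options named in the text.
LOCAL_FUSERS = {
    "average": lambda a, b: 0.5 * (a + b),
    "max": np.maximum,
    # Each score acts as its own confidence weight (an assumption).
    "confidence": lambda a, b: (a * a + b * b) / (a + b + 1e-6),
}

def hierarchical_fusion(m_r, m_t, theta, kappa_r=1.0, kappa_t=1.0, mode="average"):
    """Local fusion per region (Eq. 16), then weighted global aggregation (Eq. 17)."""
    m_r, m_t = np.asarray(m_r, float), np.asarray(m_t, float)
    q_local = LOCAL_FUSERS[mode](kappa_r * m_r, kappa_t * m_t)
    theta = np.asarray(theta, float)
    theta = theta / theta.sum()  # assumption: coefficients sum to 1
    # h_global is taken to be a clip to [0, 1] in this sketch.
    return float(np.clip(np.dot(theta, q_local), 0.0, 1.0))

score = hierarchical_fusion([0.8, 0.2], [0.4, 0.6], theta=[1, 1], mode="max")
```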

Experimental Observations

We first evaluate the three fusion methods through experiments and identify feature-level fusion as the most effective approach. Building on this insight, we further optimize the combination of feature-level fusion modules to achieve the best performance.

TABLE II: Performance metrics of advanced single-modality detection models under different fusion mechanisms. The results are averaged over 100 independent runs, with the standard deviations provided. We use bold red font and underline to highlight the best results.

Backbone	Method	Dataset	Pixel-Fusion	Feature-Fusion	Decision-Fusion	RGB-Output	TIR-Output
Resnet50	YOLO-V5 [86]	KAIST	$29.31_{\pm 1.21}$	$\mathbf{\underline{\color{red}15.17_{\pm 1.59}}}$	$17.37_{\pm 2.24}$	$18.39_{\pm 1.75}$	$17.89_{\pm 2.92}$
Resnet50	YOLO-V5 [86]	FLIR	$63.53_{\pm 2.81}$	$\mathbf{\underline{\color{red}73.22_{\pm 1.66}}}$	$68.71_{\pm 1.20}$	$67.84_{\pm 1.17}$	$68.23_{\pm 1.75}$
Resnet50	CO-DETR [92]	KAIST	$26.17_{\pm 2.01}$	$\mathbf{\underline{\color{red}14.67_{\pm 2.35}}}$	$16.83_{\pm 1.17}$	$17.65_{\pm 2.55}$	$17.17_{\pm 1.57}$
Resnet50	CO-DETR [92]	FLIR	$62.39_{\pm 2.65}$	$\mathbf{\underline{\color{red}78.97_{\pm 2.50}}}$	$69.36_{\pm 2.40}$	$68.93_{\pm 2.12}$	$68.35_{\pm 2.74}$
Resnet50	RTMDET [93]	KAIST	$23.59_{\pm 1.64}$	$\mathbf{\underline{\color{red}14.13_{\pm 2.58}}}$	$18.97_{\pm 1.15}$	$17.36_{\pm 1.85}$	$16.33_{\pm 1.45}$
Resnet50	RTMDET [93]	FLIR	$57.81_{\pm 1.97}$	$\mathbf{\underline{\color{red}75.36_{\pm 2.31}}}$	$66.29_{\pm 1.71}$	$64.32_{\pm 2.44}$	$63.97_{\pm 2.12}$
Resnet50	DINO [94]	KAIST	$27.73_{\pm 1.98}$	$18.93_{\pm 2.16}$	$\mathbf{\underline{\color{red}16.67_{\pm 1.12}}}$	$19.57_{\pm 2.66}$	$17.98_{\pm 1.81}$
Resnet50	DINO [94]	FLIR	$57.69_{\pm 2.87}$	$\mathbf{\underline{\color{red}76.83_{\pm 2.33}}}$	$69.15_{\pm 2.14}$	$67.96_{\pm 2.02}$	$67.71_{\pm 2.33}$
Vit-L	YOLO-V5 [86]	KAIST	$28.99_{\pm 1.98}$	$\mathbf{\underline{\color{red}13.52_{\pm 2.18}}}$	$17.61_{\pm 2.32}$	$18.02_{\pm 2.45}$	$18.63_{\pm 2.66}$
Vit-L	YOLO-V5 [86]	FLIR	$62.72_{\pm 1.91}$	$\mathbf{\underline{\color{red}72.67_{\pm 2.71}}}$	$68.72_{\pm 1.99}$	$68.12_{\pm 2.47}$	$68.61_{\pm 2.99}$
Vit-L	CO-DETR [92]	KAIST	$27.95_{\pm 2.53}$	$\mathbf{\underline{\color{red}13.63_{\pm 1.53}}}$	$16.85_{\pm 2.34}$	$18.36_{\pm 1.90}$	$17.17_{\pm 2.97}$
Vit-L	CO-DETR [92]	FLIR	$63.30_{\pm 2.89}$	$\mathbf{\underline{\color{red}76.55_{\pm 2.35}}}$	$70.72_{\pm 1.52}$	$67.27_{\pm 2.95}$	$69.65_{\pm 2.11}$
Vit-L	RTMDET [93]	KAIST	$22.60_{\pm 2.32}$	$15.11_{\pm 2.65}$	$\mathbf{\underline{\color{red}14.53_{\pm 2.95}}}$	$16.34_{\pm 2.15}$	$16.19_{\pm 2.98}$
Vit-L	RTMDET [93]	FLIR	$56.75_{\pm 2.78}$	$\mathbf{\underline{\color{red}74.39_{\pm 2.19}}}$	$66.79_{\pm 1.61}$	$65.63_{\pm 2.22}$	$65.07_{\pm 1.68}$
Vit-L	DINO [94]	KAIST	$26.97_{\pm 2.68}$	$\mathbf{\underline{\color{red}12.21_{\pm 2.95}}}$	$15.54_{\pm 1.95}$	$19.89_{\pm 1.85}$	$18.08_{\pm 2.34}$
Vit-L	DINO [94]	FLIR	$56.11_{\pm 1.89}$	$\mathbf{\underline{\color{red}77.12_{\pm 1.99}}}$	$70.61_{\pm 1.62}$	$68.37_{\pm 2.95}$	$69.97_{\pm 2.11}$

★
Fusion Method Experiments. In our preliminary experiments, we compared the effects of the three fusion methods (pixel-level, feature-level, and decision-level) on the improved multispectral model. The results are reported in Table II. The choice of fusion method has a significant impact on the detection accuracy of the optimized model.
1. (i):
  
  Observations on Pixel-Level Fusion. Pixel-level fusion exhibits lower stability and detection accuracy compared to single-modality detection on most datasets, with only slight improvements observed in a few specific cases. This may be attributed to the fact that pixel-level fusion combines the dual-light images at the input stage, introducing a significant amount of redundant information and noise. As a result, the model struggles to effectively learn the key features from each modality.
2. (ii):
  
  Observations on Feature-Level Fusion. Compared to single-modality detection, feature-level fusion demonstrated significant improvements in both stability and detection accuracy across most datasets. This is likely due to the fact that feature-level fusion effectively utilizes high-level features extracted by the backbone, allowing for efficient fusion while minimizing redundant features and preserving as much valuable information as possible.
3. (iii):
  
  Observations on Decision-Level Fusion. Compared to single-modality detection, decision-level fusion can improve accuracy to some extent, but it demonstrates instability with certain methods, such as the RTMDet framework [93]. This instability may stem from the fact that decision-level fusion processes RGB and TIR modality information independently, merging them only at the decision stage. Consequently, this approach struggles to effectively leverage complementary information between the two modalities, especially in scenarios where such information is crucial, like varying weather conditions or significant changes in viewpoints.
★

Feature-Fusion Experiments. To determine the most effective fusion strategy, we selected the best-performing feature-level fusion method from prior experiments for further analysis. Using single-modality detection models as baselines, we introduced the NIN and ICFE modules under different input modalities. This approach enabled a systematic evaluation of their contributions to feature representation and fusion performance. Key results are shown in Figure 1, along with notable findings.

Figure 1: Performance metrics obtained from 100 independent repetitions on the KAIST, FLIR, and DroneVehicle datasets using different backbones and feature fusion modules. The letter B represents the baseline, I represents the ICFE module, and N represents the NIN module, while the content in parentheses indicates the modality input to the fusion module. The left images show the results from the experiments using the Yolov5 detector, while the right images present the results from the experiments using the Co-Detr detector. Each column in the figures, from left to right, represents: the results with Resnet50 as the backbone on the KAIST, FLIR, and DroneVehicle datasets, followed by the results with Vit-L as the backbone on the same datasets.
1. (i):
  
  Observations on Datasets. After applying fusion modules, all detection frameworks showed varying degrees of improvement. Notably, on datasets with significant changes in lighting conditions, shadows, and viewpoints (e.g., the FLIR dataset), both the NIN-structured and the ICFE-structured fusion modules yielded more pronounced performance gains. This is likely because, when the two modalities differ substantially, complementary information plays a crucial role in improving detection accuracy, which highlights the effectiveness of the fusion modules.
2. (ii):
  
  Observations on Fusion Modules. We found that different fusion module architectures exhibit high sensitivity to various backbone networks. Specifically, in detection networks using Resnet50 as the backbone, the NIN-structured fusion module showed notable improvements in detection accuracy. On the other hand, for backbones based on the Vit-L structure, the ICFE module demonstrated better performance when fusing data from the RGB and TIR channels. This difference in performance may be attributed to the fact that Resnet50 is a convolution-based architecture, where the NIN module effectively fuses local features, maintaining the continuity and consistency of convolutional features, thus leading to better results. In contrast, Vit-L excels at capturing global features, and the ICFE module, with its cross-feature and attention mechanisms, further enhances the fusion of global information, resulting in superior performance.
3. (iii):
  
  Observations on the ICFE Fusion Module Branches. For the branch inputs of the ICFE module, we experimented with various connection methods, as illustrated in Figure 1. The experimental results show that using the ICFE module alone for fusion, regardless of the connection method, failed to consistently improve the detection accuracy. This outcome may be attributed to the fact that when only a single module is used for fusion with inputs from the same modality, the ICFE module may repeatedly amplify background noise or irrelevant features, causing the model to focus excessively on the noise rather than the target, thereby reducing detection performance. Furthermore, when inputs from different modalities (RGB and TIR) are used, their features are not deeply fused or integrated (e.g., through NIN’s nonlinear transformation), meaning the complementary information between modalities is not fully leveraged.
  
  We further attempted to add an NIN connection structure after the iterative ICFE module, using different input methods. The experimental results indicate that the R+T+NIN connection significantly improves detection accuracy, while the R+R and T+T configurations, following NIN extraction, resulted in poorer performance. This is likely because the NIN module integrates and fuses cross-modality features at a finer granularity, leading to notable improvements in detection performance.
4. (iv):
  
  Observations on Robustness. The experimental results indicate that different input configurations (e.g., R+T, R+R, T+T) have a significant impact on the model’s robustness. When using the same modality inputs (R+R or T+T), the model’s detection performance tends to be unstable and more susceptible to background noise. In contrast, when using the R+T combination, especially when coupled with the NIN module for feature fusion, the model demonstrates significantly higher robustness across various environmental conditions. These findings suggest that the complementary information between modalities plays a crucial role in enhancing the model’s ability to withstand environmental uncertainty and noise interference.

3.3 Dual-Modality Data Augmentation

Formulations. Dual-modality data augmentation is a vital technique for enhancing the performance of multispectral object detection models. By applying consistent or complementary transformations to both modalities during training, this approach not only ensures the correlation between features from the two data sources but also enables the simulation of specific test scenarios (e.g., low-light conditions or small samples). Additionally, it effectively addresses information loss caused by feature dimensionality reduction, particularly in cases where the data distributions of the two modalities differ significantly. Mainstream dual-modality data augmentation strategies can be broadly categorized into three types: Geometric Transformations, Pixel-Level Transformations, and Multimodal-Specific Enhancements. These strategies will be detailed in the following sections.

▲

Geometric Transformations. Geometric transformation strategies involve a range of spatial modifications designed to maximize the geometric diversity of training samples, enabling the model to generalize more effectively to varied object poses, orientations, scales, and viewpoints. The overall approach to geometric transformation strategies is outlined below, with most transformations formulated based on the following equation. Let the input image be represented by $\mathcal{I}$ , the processed image by $\mathcal{I}^{\prime}$ , and the geometric transformation function by $\mathcal{F}_{g}$ . This transformation can be formalized as:

\mathcal{I}^{\prime}=\mathcal{F}_{g}(\mathcal{I})=\rho\cdot\mathcal{I}+\Upsilon,

(18)

where $\rho$ denotes the composite affine transformation matrix, which integrates non-uniform scaling, complex rotation, and controlled mirroring. The $\Upsilon$ represents the non-linear offset coefficient.

The matrix $\rho$ can be decomposed as:

\rho=\mathcal{S}(c_{x},c_{y})\cdot\mathcal{R}(\theta)\cdot\mathcal{U}_{\ell}(\phi)\cdot\mathcal{E}(t_{x},t_{y}),

(19)

where each component transformation is defined as follows:

- $\mathcal{S}(c_{x},c_{y})$ represents a non-uniform scaling matrix, applying differential scaling along the $x$ and $y$ axes:

\mathcal{S}(c_{x},c_{y})=\begin{bmatrix}c_{x}&0\\ 0&c_{y}\end{bmatrix},

(20)

where $c_{x}$ and $c_{y}$ are the horizontal and vertical scaling factors, respectively, which may vary based on context-specific augmentation parameters.

- $\mathcal{R}(\theta)$ denotes the rotation matrix, which rotates the image by an angle $\theta$ in the 2D plane:

\mathcal{R}(\theta)=\begin{bmatrix}\cos(\theta)&-\sin(\theta)\\ \sin(\theta)&\cos(\theta)\end{bmatrix}.

(21)

- $\mathcal{U}_{\ell}(\phi)$ represents the mirroring transformation, capable of inducing horizontal or vertical flips, denoted as follows:

\mathcal{U}_{\ell}(\phi)=\begin{cases}\begin{bmatrix}-\cos(\phi)&0\\ 0&\cos(\phi)\end{bmatrix},&\text{if }\ell=\text{horizontal},\\[12pt] \begin{bmatrix}\cos(\phi)&0\\ 0&-\cos(\phi)\end{bmatrix},&\text{if }\ell=\text{vertical},\end{cases}

(22)

where $\phi$ is a stochastic parameter controlling the mirroring type, potentially drawn from a probability distribution to introduce randomness into the flipping process. This matrix may be further generalized to incorporate combinations of horizontal and vertical mirroring transformations, represented as:

\mathcal{U}(\phi_{h},\phi_{v})=\begin{bmatrix}\cos(\phi_{h})\cdot\cos(\phi_{v})&0\\ 0&\cos(\phi_{h})\cdot\cos(\phi_{v})\end{bmatrix}.

(23)

- $\mathcal{E}(t_{x},t_{y})$ is the translation matrix, introducing positional shifts along the $x$ and $y$ axes:

\mathcal{E}(t_{x},t_{y})=\begin{bmatrix}1&0&t_{x}\\ 0&1&t_{y}\\ 0&0&1\end{bmatrix},

(24)

where $t_{x}$ and $t_{y}$ represent horizontal and vertical translations, respectively. These shifts may vary based on contextual constraints to simulate different spatial orientations.
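
The composition in Eq. (19) can be sketched in NumPy as below. Since $\mathcal{E}$ is a $3\times 3$ matrix while the other factors are $2\times 2$, this sketch assumes the $2\times 2$ factors are lifted to homogeneous $3\times 3$ form so that the product is defined; the helper names are hypothetical. Applying the same $\rho$ to the coordinates of both the RGB and TIR images is one way to keep a synchronized pair aligned.

```python
import numpy as np

def lift(m2):
    """Embed a 2x2 linear map into a 3x3 homogeneous matrix (assumption)."""
    m3 = np.eye(3)
    m3[:2, :2] = m2
    return m3

def scaling(cx, cy):                      # Eq. (20)
    return lift(np.diag([cx, cy]))

def rotation(theta):                      # Eq. (21)
    c, s = np.cos(theta), np.sin(theta)
    return lift(np.array([[c, -s], [s, c]]))

def mirror(axis, phi=0.0):                # Eq. (22)
    c = np.cos(phi)
    return lift(np.diag([-c, c]) if axis == "horizontal" else np.diag([c, -c]))

def translation(tx, ty):                  # Eq. (24)
    m = np.eye(3)
    m[0, 2], m[1, 2] = tx, ty
    return m

def compose(cx, cy, theta, axis, tx, ty):
    """rho = S . R . U . E as in Eq. (19); the rightmost factor acts first."""
    return scaling(cx, cy) @ rotation(theta) @ mirror(axis) @ translation(tx, ty)

def transform_points(rho, pts):
    """Apply rho to an (n, 2) array of pixel or box-corner coordinates."""
    pts = np.asarray(pts, float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (rho @ pts_h.T).T[:, :2]

rho = compose(2.0, 2.0, 0.0, "horizontal", 1.0, 0.0)
corners_rgb = transform_points(rho, [[1.0, 1.0]])
corners_tir = transform_points(rho, [[1.0, 1.0]])  # same rho for both modalities
```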

▲

Pixel-Level Transformations. Pixel-level transformation strategies modify the pixel values of an image, such as by adding noise, adjusting colors, or altering contrast, to simulate various imaging conditions. This enhances the model’s robustness to lighting variations, noise, and diverse environmental factors. The following introduces pixel-level transformation strategies, with most transformations adhering to the approach outlined below. Let the pixel matrix of the image be $\mathcal{P}$ , the transformation can be expressed through the following steps:

Noise Addition. To simulate sensor noise or environmental interference, Gaussian noise $\mathcal{N}(\sigma)$ with a standard deviation of $\sigma$ is added to the pixel matrix:

\mathcal{P}_{\text{noise}}=\mathcal{P}+\mathcal{N}(\sigma),

(25)

where $\mathcal{N}(\sigma)$ represents Gaussian noise with variance $\sigma^{2}$ .

Color Adjustment. To simulate different lighting conditions or sensor biases, color adjustment is applied using a scaling factor $\alpha$ :

\mathcal{P}_{\text{color}}=\mathcal{C}(\alpha)\cdot\mathcal{P}_{\text{noise}},

(26)

where $\alpha$ is the color adjustment factor that controls the brightness or saturation of each channel.

Contrast Adjustment. To enhance or reduce image details, contrast adjustment is applied using a contrast factor $\beta$ :

\mathcal{P}_{\text{contrast}}=\mathcal{D}(\beta)\cdot(\mathcal{P}_{\text{color}}-\mu)+\mu,

(27)

where $\beta$ is the contrast adjustment factor and $\mu$ is the mean pixel value used for centering the pixel matrix.

Final Pixel Transformation. The final pixel transformation combines all the above operations:

\mathcal{P}^{\prime}=\mathcal{D}(\beta)\cdot\left(\mathcal{C}(\alpha)\cdot(\mathcal{P}+\mathcal{N}(\sigma))-\mu\right)+\mu.

(28)
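
Chaining Eqs. (25) to (27) gives a compact pixel-level pipeline, sketched below. The sketch assumes pixel values in $[0,1]$, treats $\mathcal{C}(\alpha)$ and $\mathcal{D}(\beta)$ as scalar gains, takes $\mu$ as the mean of the color-adjusted image, and adds a final clip that is not part of the equations; all of these are illustrative choices.

```python
import numpy as np

def pixel_transform(p, sigma, alpha, beta, rng):
    """Noise, color gain, then contrast about the mean, following Eqs. (25)-(27)."""
    noisy = p + rng.normal(0.0, sigma, size=p.shape)   # Eq. (25)
    colored = alpha * noisy                            # Eq. (26), scalar gain
    mu = colored.mean()                                # centering value (assumption)
    contrasted = beta * (colored - mu) + mu            # Eq. (27)
    return np.clip(contrasted, 0.0, 1.0)               # keep a valid range (added here)

rng = np.random.default_rng(0)
image = rng.random((4, 4))
augmented = pixel_transform(image, sigma=0.02, alpha=1.1, beta=1.2, rng=rng)
```

For synchronized augmentation, the same `sigma`, `alpha`, and `beta` would be used for the RGB and TIR images of a pair.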

▲

Multimodal-Specific Enhancements. This class of strategies focuses on the unique characteristics of dual-light data, employing dual-channel synchronized or complementary enhancements tailored to specific test scenarios. By applying different augmentation methods to each modality, these strategies effectively enhance the cooperative performance of multimodal images and improve accuracy in targeted detection scenarios. Let the RGB image be denoted as $\mathcal{I}_{\text{R}}$ and the TIR image as $\mathcal{I}_{\text{T}}$ . The multimodal-specific enhancement can be expressed as:

\begin{bmatrix}\mathcal{I}^{\prime}_{\text{R}}\\ \mathcal{I}^{\prime}_{\text{T}}\end{bmatrix}=\tau\left(\mathcal{I}_{\text{R}},\mathcal{I}_{\text{T}}\right),

(29)

where $\tau$ represents the multimodal enhancement function, which may include cross-modal alignment and modality-specific feature enhancement. The $\mathcal{I}^{\prime}_{\text{R}}$ and $\mathcal{I}^{\prime}_{\text{T}}$ represent the enhanced RGB and TIR images, respectively. Specifically, the enhancement process can be further detailed as:

\begin{bmatrix}\mathcal{I}^{\prime}_{\text{R}}\\ \mathcal{I}^{\prime}_{\text{T}}\end{bmatrix}=\begin{bmatrix}\varpi_{\text{R}}\left(\mathcal{I}_{\text{R}},\mathcal{A}_{\text{T}}\cdot\mathcal{L}(\mathcal{I}_{\text{T}})\right)\\ \varpi_{\text{T}}\left(\mathcal{I}_{\text{T}},\mathcal{A}_{\text{R}}\cdot\mathcal{L}(\mathcal{I}_{\text{R}})\right)\end{bmatrix}.

(30)

The functions $\varpi_{\text{R}}$ and $\varpi_{\text{T}}$ denote modality-specific enhancement operations applied to the input images, incorporating their corresponding aligned features. The matrices $\mathcal{A}_{\text{T}}$ and $\mathcal{A}_{\text{R}}$ are modality-specific alignment matrices, while $\mathcal{L}(\mathcal{I}_{\text{modality}})$ serves as the feature extraction function that identifies crucial features within each image for optimized information integration.

Experimental observations

Based on the single-modality object detection model Co-Detr, we made adaptive modifications to construct a baseline model suitable for multispectral object detection. As multispectral object detection augmentation strategies often need to adapt to specific application scenarios, test set sample characteristics, and varying weather and lighting conditions, we first conducted experiments exploring a set of synchronized augmentation techniques focused on geometric and pixel-level transformations. The experimental results are shown in Figure 2. Building upon these methods, we further investigate specific augmentation strategies tailored to the unique characteristics of dual-light samples. The experimental results are shown in Figures 3 and 4.

Figure 2: The performance of general geometric and pixel-level augmentations (using different backbones) on the KAIST, FLIR, and DroneVehicle datasets. The left figure illustrates the results of various geometric augmentations, where B denotes the baseline, R represents random rotation, S signifies multi-scale scaling, C stands for random cropping, F corresponds to random flipping, and T indicates random translation. The right figure presents the results of general pixel-level augmentations, with B as the baseline, BL for random blurring, NI for noise injection, S for random sharpening, O for random occlusion, and CJ for color jittering. $\triangle$ represents the mean performance difference between this method and the baseline.

★

General Augmentation Strategy Experiments. In this section, we conducted dual-channel synchronized augmentation experiments using various geometric and pixel-level strategies, revealing several key insights.
1. (i):
  
  Observations on Geometric Transformations. The experimental data indicate that combining random rotation, multi-scale scaling, and random cropping improves performance across multiple datasets, whereas random flipping and random translation perform worse on the KAIST dataset. A likely explanation is that rotation, scaling, and cropping together simulate samples from varied perspectives, improving the model’s adaptability to different viewpoints, angles, scales, and deformations. Flipping and translation, by contrast, may produce illogical images for certain samples (e.g., flipping upright pedestrians in the KAIST dataset leads to unnatural postures), which disrupts the inherent distribution patterns and modality alignment in some datasets and degrades detection performance.
2. (ii):
  
  Observations on Pixel-Level Transformations. The overall performance improvements from pixel-level augmentation strategies are smaller than those from geometric transformations or spatial alignment methods. For instance, even the most effective combination in our experiments yielded only a 2.5% increase in recognition accuracy over the baseline, which is modest compared to methods such as feature fusion. In addition, combinations of three or more pixel-level augmentation strategies are highly sensitive to the dataset. Specifically, the combination of Color Jitter+Random Sharpening+Random Blurring significantly improved recognition accuracy on the KAIST dataset but performed poorly on the FLIR dataset. When more than four pixel-level augmentation strategies were applied, recognition accuracy often plateaued or even decreased across multiple datasets.

★

Experiments on Unique Augmentation Strategies. For specific scenarios, such as low-light/nighttime conditions and very small sample cases, we selected 500 images from the original dataset that exhibit these characteristics for targeted testing. We experimented with various combinations of dual-channel augmentation strategies, which include dual-channel synchronized augmentation and complementary augmentation. Below are some interesting observations:

Figure 3: The performance metrics of different augmentation strategies applied to nighttime/low-light samples. The top image shows the results using dual-channel synchronized augmentation, while the bottom image displays results with dual-channel complementary augmentation. In both images, B represents the baseline, C stands for CLAHE, RL denotes random lighting, and L indicates light enhancement. In the bottom image, each set of parentheses indicates the specific augmentation strategies applied to each modality, with the order representing the RGB and TIR channels, respectively. $\triangle$ represents the mean performance difference.
1. (i):
  
  Observations on Strategies for Nighttime/Low-Light Samples. We conducted experiments comparing both synchronous and complementary augmentation strategies to identify the most effective combination for enhancing performance in low-light conditions. We found that complementary augmentation outperforms synchronous augmentation in improving overall recognition accuracy. This improvement is particularly pronounced in low-light conditions, where the strengths of complementary augmentation are more evident. Specifically, in low-light environments, the RGB modality tends to suffer from information loss, such as reduced contrast and increased noise, while the TIR modality, which captures thermal radiation, continues to provide stable target information even in the absence of illumination. Thus, adopting a complementary augmentation strategy allows each modality to better leverage its respective strengths. In addition, the complementary augmentation combination of random lighting and light enhancement for the TIR channel, paired with CLAHE for the RGB channel, achieved excellent results across all datasets. This success can be attributed to the complementary strategy’s ability to enhance the adaptability of the RGB channel to varying lighting conditions, while simultaneously improving the clarity of edges and shapes in the infrared samples.
2. (ii):
  
  Observations on Strategies for Small Samples. From the experimental data, it is evident that the augmentation strategy improves recognition accuracy. Specifically, the stitching operation proved to be highly effective in addressing the problem of very small samples individually, while the other two augmentation techniques did not consistently improve recognition accuracy. Independent use of both Stitcher and Fastmosaic led to notable improvements in recognition accuracy. In particular, Fastmosaic was the preferred choice for large-scale datasets (such as KAIST), while Stitcher performed better on more complex datasets (such as FLIR). Interestingly, when the two methods were combined, recognition accuracy decreased compared to their individual use. This outcome could be attributed to an imbalance in data distribution caused by the combination, which failed to provide the model with additional useful information.
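
The difference between synchronized and complementary augmentation can be illustrated with a small sketch in which each modality receives its own list of operations. Everything here is an assumption made for illustration: the function names are hypothetical, the brightness range is arbitrary, and plain global histogram equalisation is used as a simple stand-in for CLAHE rather than CLAHE itself.

```python
import numpy as np

def random_lighting(img, rng, low=0.6, high=1.4):
    """Scale global brightness by a random gain (illustrative range)."""
    return np.clip(img * rng.uniform(low, high), 0.0, 1.0)

def equalize(img, bins=256):
    """Global histogram equalisation, a simple stand-in for CLAHE."""
    hist, edges = np.histogram(img, bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum() / hist.sum()
    return np.interp(img.ravel(), edges[1:], cdf).reshape(img.shape)

def augment_pair(rgb, tir, rgb_ops, tir_ops):
    """Apply one list of operations per modality. Identical lists give the
    synchronized case; different lists give the complementary case."""
    for op in rgb_ops:
        rgb = op(rgb)
    for op in tir_ops:
        tir = op(tir)
    return rgb, tir

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))   # toy RGB image with values in [0, 1)
tir = rng.random((8, 8))      # toy TIR image with values in [0, 1)
rgb_aug, tir_aug = augment_pair(
    rgb, tir,
    rgb_ops=[equalize],                            # contrast enhancement on RGB
    tir_ops=[lambda x: random_lighting(x, rng)],   # lighting perturbation on TIR
)
```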

3.4 Registration Alignment

Formulations. In multispectral object detection tasks, factors such as sensor viewpoints, resolution discrepancies, and varying weather conditions can lead to spatial misalignment between RGB and TIR images. Such misalignment often introduces inconsistencies during feature fusion, thereby degrading detection performance. To address these issues, researchers have developed various registration and alignment strategies, which can be broadly categorized into Feature Alignment-Based Methods and Feature Fusion-Based Methods. By applying these registration techniques at different stages of training and testing, the alignment between RGB and TIR images can be effectively improved, significantly enhancing recognition accuracy. The following sections provide a detailed discussion of these two categories.

▲

Feature Alignment-Based Methods. The main goal of these methods is to address spatial misalignment between RGB and TIR images through precise feature matching and alignment. The LoFTR approach exemplifies this objective by leveraging a Transformer-based architecture to achieve pixel-level feature matching between RGB and TIR images, allowing for high-precision geometric alignment [96]. The resulting matches are used to compute transformation parameters (such as affine or perspective transformations) that can be applied to register the images effectively.

Let the RGB image be denoted as $\mathcal{I}_{\text{R}}$ and the TIR image as $\mathcal{I}_{\text{T}}$ . The features (coarse and fine) extracted from the RGB and TIR images are denoted as $\Phi_{\text{R}}$ and $\Phi_{\text{T}}$ , respectively. The matching function $\varrho_{\text{m}}(\Phi_{\text{R}},\Phi_{\text{T}})$ can be formulated as follows:

\varrho_{\text{m}}(\Phi_{\text{R}},\Phi_{\text{T}})=\left\{(p_{i},q_{j})\mid\hat{p}_{i}=\sigma\left(\frac{\Phi_{\text{R}}(p_{i})\cdot\Phi_{\text{T}}(q_{j})}{\tau}\right)\right\},

(31)

where $(p_{i},q_{j})$ represents matched point pairs across RGB and TIR modalities, and $\tau$ is a temperature parameter controlling the similarity distribution. $\sigma$ denotes the softmax function, and $\hat{p}_{i}$ represents the point with the highest matching score in the RGB modality for each $q_{j}$ in the TIR modality.
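
A minimal sketch of the matching rule in Eq. (31) is given below: for each TIR descriptor, a temperature-scaled softmax over all RGB descriptors is computed and the highest-scoring RGB point is kept. This is a deliberate simplification and not the matching layer of [96]; the function name and the `min_prob` threshold are assumptions.

```python
import numpy as np

def match_features(phi_r, phi_t, tau=0.1, min_prob=0.5):
    """Return (i, j) index pairs linking RGB point i to TIR point j."""
    sim = phi_r @ phi_t.T / tau                      # (n_rgb, n_tir) similarities
    sim = sim - sim.max(axis=0, keepdims=True)       # numerical stability
    prob = np.exp(sim)
    prob /= prob.sum(axis=0, keepdims=True)          # softmax over RGB points
    best = prob.argmax(axis=0)                       # best RGB index per TIR point
    return [(int(i), j) for j, i in enumerate(best) if prob[i, j] >= min_prob]

# Toy descriptors: the TIR set is a permutation of the RGB set.
phi_r = np.eye(3)
phi_t = np.eye(3)[[2, 0, 1]]
matches = match_features(phi_r, phi_t)
```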

The geometric transformation $\mathcal{T}_{g}$ is then estimated based on these matched points by minimizing a distance-based objective:

\theta^{*}=\arg\min_{\theta}\sum_{(p_{i},q_{j})\in M}\left\|\mathcal{T}_{g}(p_{i},\theta)-q_{j}\right\|^{2},

(32)

where $\mathcal{T}_{g}(p_{i},\theta)$ represents the transformed location of $p_{i}$ in the TIR image space, with $\theta$ containing transformation parameters for an affine or homography matrix $\mathcal{A}$ and translation vector $\mathcal{B}$ . Once optimized, the transformation can be applied to obtain the aligned image:

\mathcal{I}_{\text{aligned}}(x,y)=\mathcal{T}_{g}(\mathcal{I}_{\text{T}},\theta^{*})=\mathcal{A}(\theta^{*})\cdot\mathcal{I}_{\text{T}}+\mathcal{B}.

(33)

To further improve the registration accuracy, a joint loss that combines a feature consistency loss and an alignment loss is introduced, expressed as:

\mathcal{L}=\lambda_{1}\sum_{(p_{i},q_{j})\in M}\left\|\Phi_{\text{R}}(p_{i})-\Phi_{\text{T}}(q_{j})\right\|^{2}+\lambda_{2}\mathcal{L}_{\text{alignment}}(\theta),

(34)

where $\mathcal{L}_{\text{alignment}}(\theta)$ measures the alignment quality based on transformation parameters, and $\lambda_{1}$ and $\lambda_{2}$ are weighting coefficients to balance the two loss terms. This method demonstrates exceptional alignment capabilities in scenes with pronounced parallax and varying viewpoints, enabling efficient image registration.
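
For the affine case, the objective in Eq. (32) has a closed-form least-squares solution. The sketch below assumes the matched points are given as two $(n,2)$ arrays with at least three non-collinear pairs; `fit_affine` is a hypothetical helper, and a robust estimator would normally be preferred when matches contain outliers.

```python
import numpy as np

def fit_affine(p, q):
    """Least-squares A (2x2) and b (2,) such that A @ p_i + b approximates q_i."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    design = np.hstack([p, np.ones((len(p), 1))])         # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(design, q, rcond=None)   # shape (3, 2)
    return params[:2].T, params[2]

# Synthetic check: recover a known transform from four matched points.
a_true = np.array([[1.1, 0.1], [-0.1, 0.9]])
b_true = np.array([2.0, -1.0])
p = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = p @ a_true.T + b_true
a_est, b_est = fit_affine(p, q)
```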

▲

Feature Fusion-Based Methods. Feature fusion-based methods aim to effectively combine deep RGB and TIR features to generate a fused image, thereby achieving modality alignment. SuperFusion is a prime example, employing a multilevel fusion strategy that includes data-level transformation, feature-level attention mechanisms, and final Bird’s Eye View (BEV) alignment [97].

Given an RGB image $\mathcal{I}_{\text{R}}$ and a TIR image $\mathcal{I}_{\text{T}}$ , the process begins by extracting feature maps $\mathcal{X}_{\text{R}}$ and $\mathcal{X}_{\text{T}}$ through separate convolutional backbones. To enhance depth perception, a sparse depth map $\mathcal{D}_{\text{sparse}}$ is generated by projecting TIR depth information into the RGB image plane. A completion function $\mathcal{S}(\cdot)$ then generates a dense depth map $\mathcal{D}_{\text{dense}}$ :

\mathcal{D}_{\text{dense}}=\mathcal{S}(\mathcal{D}_{\text{sparse}}).

(35)

In the feature fusion stage, cross-attention is used to align features from both modalities, where RGB features $\mathcal{X}_{\text{R}}$ guide the enhancement of TIR features $\mathcal{X}_{\text{T}}$ . The cross-attention matrix $\mathcal{H}$ incorporates depth information from $\mathcal{D}_{\text{dense}}$ and is defined as:

\mathcal{H}=\sigma\left(\frac{\mathcal{Q}\mathcal{K}^{T}\cdot\mathcal{D}_{\text{dense}}}{\sqrt{d}}\right),\quad\mathcal{Q}=\mathcal{W}_{\text{q}}\mathcal{X}_{\text{R}},\quad\mathcal{K}=\mathcal{W}_{\text{k}}\mathcal{X}_{\text{T}},

(36)

where $\mathcal{W}_{\text{q}}$ and $\mathcal{W}_{\text{k}}$ are learned weight matrices, $d$ is a scaling factor, and $\sigma$ denotes the softmax function. This mechanism aligns features across modalities by using depth information to refine attention, allowing RGB features to enrich TIR information in the fused representation.

The resulting attention matrix $\mathcal{H}$ is then used to enhance the TIR features:

\mathcal{X}_{\text{T}}^{\prime}=\mathcal{H}\cdot\mathcal{V},\quad\mathcal{V}=\mathcal{W}_{\text{v}}\mathcal{X}_{\text{T}},

(37)

where $\mathcal{W}_{\text{v}}$ is a learned weight matrix for generating the value matrix $\mathcal{V}$ , and $\mathcal{X}_{\text{T}}^{\prime}$ represents the TIR features enhanced by the RGB guidance.
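
Eqs. (36) and (37) can be sketched in NumPy as follows. The sketch stores features as rows, so the projections are written `x @ w` rather than $\mathcal{W}\mathcal{X}$, interprets the product with $\mathcal{D}_{\text{dense}}$ as an element-wise modulation of the attention logits, and takes $d$ to be the query dimension; these are assumptions, as is the function name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_guided_cross_attention(x_r, x_t, w_q, w_k, w_v, d_dense):
    """RGB queries attend to TIR keys and values, with depth-modulated logits."""
    q, k, v = x_r @ w_q, x_t @ w_k, x_t @ w_v
    d = q.shape[-1]
    h = softmax((q @ k.T) * d_dense / np.sqrt(d), axis=-1)   # Eq. (36)
    return h @ v, h                                          # Eq. (37)

rng = np.random.default_rng(0)
x_r, x_t = rng.normal(size=(5, 4)), rng.normal(size=(6, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
x_t_enh, attn = depth_guided_cross_attention(
    x_r, x_t, w_q, w_k, w_v, d_dense=np.ones((5, 6))
)
```

With an all-zero depth map the logits vanish, so every query attends uniformly and the output reduces to the mean of the value vectors, which is a convenient sanity check.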

Finally, a BEV alignment module refines the fused feature map by learning a flow field $\Delta$ to warp RGB features $\mathcal{X}_{\text{R}}$ , achieving better alignment with the enhanced TIR features $\mathcal{X}_{\text{T}}^{\prime}$ . The aligned RGB image $\mathcal{I}_{\text{aligned}}$ can be expressed as:

\mathcal{I}_{\text{aligned}}(x,y)=\sum_{x^{\prime},y^{\prime}}\mathcal{X}_{\text{T}}^{\prime}(x^{\prime},y^{\prime})\,w(x,y,x^{\prime},y^{\prime},\Delta),

(38)

where $w(x,y,x^{\prime},y^{\prime},\Delta)$ represents the bilinear interpolation weights based on the flow field $\Delta$ to adjust the alignment features. The interpolation weights $w$ can be defined as:

w(x,y,x^{\prime},y^{\prime},\Delta)=\prod_{i\in\{x,y\}}\max\big(0,\,1-|i^{\prime}-i-\Delta_{i}|\big).

(39)

These weights ensure that the spatial position of the RGB features is precisely adjusted according to the flow field $\Delta$ , allowing for better alignment with the TIR features.
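As a concrete illustration of Eqs. (38) and (39), the sketch below warps a small single-channel feature map with a per-pixel flow field. It is a direct, unoptimized transcription of the summation. The single-channel layout, the `feature[y][x]` indexing, the per-pixel `(dx, dy)` flow format, and the function names are assumptions made for this example. Samples that fall outside the map contribute zero, which follows from the `max(0, ...)` term.

```python
def bilinear_weight(x, y, xs, ys, dx, dy):
    """Eq. (39): product of per-axis weights max(0, 1 - |i' - i - delta_i|)."""
    wx = max(0.0, 1.0 - abs(xs - x - dx))
    wy = max(0.0, 1.0 - abs(ys - y - dy))
    return wx * wy


def warp(feature, flow):
    """Eq. (38): resample `feature` at the positions shifted by `flow`.

    feature: 2D list indexed as feature[y][x] (one channel, for brevity).
    flow:    2D list of (dx, dy) offsets with the same height and width.
    """
    height, width = len(feature), len(feature[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            out[y][x] = sum(
                feature[ys][xs] * bilinear_weight(x, y, xs, ys, dx, dy)
                for ys in range(height)
                for xs in range(width)
            )
    return out


if __name__ == "__main__":
    feature = [[0.0, 2.0]]
    half_pixel_right = [[(0.5, 0.0), (0.5, 0.0)]]
    print(warp(feature, half_pixel_right))
```

With a zero flow the map is returned unchanged, because the weight is one only when the source and target positions coincide. A practical implementation would restrict the sum to the four neighbours of each sampling position instead of looping over the whole map.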

The entire process is optimized by a joint loss function $\mathcal{L}$ that combines feature consistency and alignment error terms, weighted by $\lambda_{1}$ and $\lambda_{2}$ :

\mathcal{L}=\lambda_{1}\,\mathcal{L}_{\text{feature}}+\lambda_{2}\,\mathcal{L}_{\text{alignment}},

(40)

where the feature consistency term $\mathcal{L}_{\text{feature}}=\sum_{(p_{i},q_{j})\in\text{M}}\left\|\mathcal{X}_{\text{R}}(p_{i})-\mathcal{X}_{\text{T}}^{\prime}(q_{j})\right\|^{2}$ minimizes the difference between matched feature pairs $(p_{i},q_{j})$ in set M, and the alignment term $\mathcal{L}_{\text{alignment}}=\sum_{x,y}\left\|\Delta(x,y)-\Delta^{*}(x,y)\right\|^{2}$ measures the deviation from the ideal alignment $\Delta^{*}$ .
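The joint objective of Eq. (40) can be written out as a short plain-Python sketch. The data layout is an assumption made for illustration: features are lists of vectors indexed by position, matches are `(p, q)` index pairs, and flow fields are flat lists of `(dx, dy)` tuples. The function names and the example values of the two weights are likewise hypothetical.

```python
def feature_loss(x_r, x_t_enh, matches):
    """Sum of squared L2 distances between matched RGB and enhanced TIR features."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x_r[p], x_t_enh[q]))
        for p, q in matches
    )


def alignment_loss(flow, flow_ref):
    """Sum of squared L2 distances between the predicted and reference flow."""
    return sum(
        (dx - rx) ** 2 + (dy - ry) ** 2
        for (dx, dy), (rx, ry) in zip(flow, flow_ref)
    )


def joint_loss(x_r, x_t_enh, matches, flow, flow_ref, lam1, lam2):
    """Eq. (40): weighted sum of the feature and alignment terms."""
    return lam1 * feature_loss(x_r, x_t_enh, matches) + lam2 * alignment_loss(flow, flow_ref)


if __name__ == "__main__":
    x_r = [[1.0, 2.0], [0.0, 0.0]]
    x_t_enh = [[1.0, 0.0], [3.0, 4.0]]
    matches = [(0, 0), (1, 1)]
    flow = [(1.0, 0.0), (0.0, 0.0)]
    flow_ref = [(0.0, 0.0), (0.0, 2.0)]
    print(joint_loss(x_r, x_t_enh, matches, flow, flow_ref, lam1=1.0, lam2=0.5))
```

The two weights trade feature agreement against flow accuracy: setting the second weight to zero leaves only the feature consistency term, and setting the first to zero leaves only the flow supervision.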

Experimental observations

We utilized Loftr and SuperFusion to register RGB and TIR images separately and experimented by replacing the original RGB or TIR images with the fused images during model training and testing. The registration results in different scenarios can be observed in Figures 5 and 6. The performance metrics of different registration methods can be found in Figure 7. Below are some interesting findings:

(i) Observation on Registration Performance. The experimental results demonstrate that Loftr and SuperFusion exhibit distinct advantages and characteristics in generating fused RGB images. Loftr focuses on precise feature matching and geometric alignment, ensuring that the fused RGB image is spatially well-aligned with the TIR image, with each pixel accurately corresponding to its counterpart. As shown in Figure 5, Loftr performs well in images with high sample density, displaying strong spatial stability, likely due to the greater availability of feature mapping information provided by the dense samples. However, its performance deteriorates in sparser scenes, sometimes leading to issues such as ghosting and overlapping artifacts, which make the subsequent detection steps difficult.

In contrast, SuperFusion excels at handling sparse scenes where Loftr struggles, effectively preserving sample information and image features. However, it may impact the geometric characteristics of certain scenes, such as the vertical structures of bridges, whereas Loftr remains largely unaffected in such scenarios.

Figure 7: The performance metrics of different registration methods at various stages are presented. The image on the left represents registration during the training phase, while the image on the right represents registration during the testing phase. In this figure, b1 corresponds to the method where the registered image replaces the original RGB image, and b2 corresponds to the method where the registered image replaces the original TIR image.

(ii) Observations on Registration Methods. The results in Figure 7 indicate that training the multispectral object detection model with Loftr-registered data yields a substantial increase in recognition accuracy, whereas training with SuperFusion-processed data shows limited impact. During testing, however, both Loftr and SuperFusion enhance recognition accuracy. This advantage is likely due to Loftr’s ability to address data inconsistencies via feature alignment during training, thereby improving data quality and facilitating more effective feature learning.

While SuperFusion is effective for multimodal fusion, it may introduce redundancy and complexity in the training data, potentially diverting the model’s focus from key features and limiting accuracy gains. In testing, both methods improve recognition accuracy by refining data quality or enriching feature information. Importantly, both registration frameworks perform best when generating RGB data based on the TIR reference, likely because the TIR-based RGB retains essential thermal information, supporting reliable performance in challenging conditions such as low light, smoke, or nighttime environments.

(iii) Observations on Application Scenarios. Experimental results indicate that Loftr excels in scenarios with significant rotational deviation or displacement between RGB and TIR images. This effectiveness is likely due to Loftr’s precise feature matching and geometric transformations, which effectively mitigate spatial misalignments. Conversely, SuperFusion demonstrates greater suitability in environments affected by adverse weather or low resolution, where it efficiently integrates multimodal data despite these challenges.

4 Optimal Combination of Individual Techniques

In the previous section, we evaluated various training techniques for multispectral object detection under consistent conditions. However, extending a single-modality model to dual-modality with only one technique often yields suboptimal performance, because no single method fully addresses challenges such as feature misalignment, overfitting, and fusion conflicts. Relying on one technique alone is therefore impractical. Our benchmark analysis instead highlights effective combinations of techniques and offers new insights for designing multispectral object detection models.

4.1 Optimal Trick Combinations and Ablation Study

We have summarized the optimal technique combinations for the KAIST, FLIR, and DroneVehicle datasets above. Additionally, we conducted detailed ablation studies to validate the effectiveness of these combinations, as shown in Figure 8. For each dataset, we tested 5 to 6 different combination variants by removing or substituting certain techniques. The results consistently demonstrate the significant effectiveness of our selected combinations, and the observed performance variations on specific samples are highly consistent with the conclusions we presented in Sections 3.2, 3.3, and 3.4.

4.2 Comparison with Leading Frameworks

To further validate the effectiveness of the optimized single-modality model based on the best technique combinations, we compared it with other advanced frameworks specifically designed for multispectral object detection, including MBNet, MLPD, and MSDS-RCNN. As shown in Tables IV, V, and VI, by integrating our training techniques into the single-modality model, the optimized model achieves the best overall results (the All miss rate on KAIST and the mAP on FLIR and DroneVehicle), surpassing previously well-designed multispectral detection frameworks on both small-scale and large-scale datasets.

4.3 Transferring Technique Combinations

The final plausibility check is to determine whether certain technique combinations remain effective across multiple multispectral object detection datasets. To this end, we selected the combination of “Loftr for test alignment + ICFE for feature fusion”, as these two techniques consistently demonstrated optimal performance in the majority of scenarios covered in Sections 3.2, 3.3, and 3.4. This combination also performed comparably to other top-performing combinations on the FLIR and DroneVehicle datasets. Specifically, we evaluated this approach on two additional open-source multispectral detection datasets: (i) the LLVIP dataset and (ii) the CVC-14 dataset. In these transfer studies, we strictly adhered to the “best configuration point” settings outlined in Section 3.1.

TABLE III: Performance metrics of models with and without our strategy on the LLVIP and CVC-14 datasets. The results are averaged over multiple independent runs, with the standard deviations provided.

Method	Strategy	LLVIP		CVC-14
		mAP50(%)	mAP(%)	$MR^{2}$ (%)↓
SSD [82]	w/o	$90.25_{\pm 1.76}$	$53.52_{\pm 2.45}$	$68.39_{\pm 1.78}$
	with	$92.13_{\pm 2.45}$	$54.39_{\pm 2.31}$	$37.16_{\pm 2.18}$
RetinaNet [83]	w/o	$94.81_{\pm 2.13}$	$55.18_{\pm 1.29}$	$47.87_{\pm 2.75}$
	with	$95.15_{\pm 1.89}$	$57.87_{\pm 2.48}$	$29.63_{\pm 1.32}$
Cascade R-CNN [85]	w/o	$95.12_{\pm 2.23}$	$56.81_{\pm 2.61}$	$42.36_{\pm 2.91}$
	with	$95.58_{\pm 1.68}$	$57.99_{\pm 1.35}$	$22.15_{\pm 1.54}$
Faster R-CNN [85]	w/o	$94.63_{\pm 2.78}$	$54.53_{\pm 2.43}$	$51.97_{\pm 1.97}$
	with	$94.97_{\pm 2.11}$	$56.15_{\pm 1.95}$	$24.31_{\pm 1.87}$
DDQ-DETR [98]	w/o	$93.91_{\pm 1.67}$	$58.67_{\pm 1.49}$	$52.78_{\pm 2.41}$
	with	$94.86_{\pm 2.26}$	$60.13_{\pm 1.87}$	$26.51_{\pm 1.53}$

TABLE IV: Comparison of our most effective detection model with other advanced frameworks on the KAIST dataset. We use bold red font and underline to highlight the best results.

Method	$MR^{2}$ (%)↓
	All	Day	Night
FusionRPN+BF [99]	18.31	19.54	16.33
IAF-RCNN [100]	15.55	14.97	16.89
IATDNN-IAMSS [101]	14.41	14.30	15.29
MBNet [102]	8.43	8.79	8.10
MLPD [103]	7.21	6.83	7.68
MSDS-RCNN [104]	7.34	8.98	6.94
Ours	6.23	6.91	6.19

TABLE V: Comparison of our most effective detection model with other advanced frameworks on the FLIR dataset. We use bold red font and underline to highlight the best results.

Method	AP50 (%)			mAP (%)
	Bicycle	Car	Person	
MMTOD-CG [105]	50.38	70.61	63.42	61.47
MMTOD-UNIT [105]	49.28	70.78	64.33	61.46
CFR [106]	57.95	84.92	74.46	72.44
BU-ATT [107]	56.01	87.11	76.08	73.06
BU-LTT [107]	57.43	86.31	75.65	73.13
CFT [108]	61.44	89.55	84.28	78.42
Ours	68.71	89.51	85.30	81.17

TABLE VI: Comparison of our most effective detection model with other advanced frameworks on the DroneVehicle dataset. We use bold red font and underline to highlight the best results.

Method	AP50 (%)					mAP (%)
	Car	Freight Car	Truck	Bus	Van	
RetinaNet-OBB [83]	65.36	15.69	32.81	61.34	16.26	38.29
Mask R-CNN [85]	88.98	36.84	47.79	78.17	36.65	57.69
Cascade Mask R-CNN [85]	80.95	31.00	38.27	66.62	25.01	48.37
UA-CMDet [109]	87.35	41.27	62.69	84.17	39.82	63.06
CALNet [110]	86.32	60.67	67.15	86.52	53.68	70.87
TSFADet [111]	89.01	51.97	68.51	83.06	46.95	67.90
Gliding Vertex [112]	89.99	42.75	59.71	79.79	44.19	63.29
Ours	92.05	63.39	71.95	88.93	57.12	74.69

As shown in Table III, the selected technique combination significantly improved the performance of the single-modality model on various multispectral datasets in most cases, particularly in scenarios with complex backgrounds and varying lighting conditions. The improvement holds across datasets and is largest on CVC-14, where the miss rate drops by up to 31.23 percentage points (SSD, from 68.39% to 37.16%). The strong transferability of this technique combination suggests its potential to serve as a robust baseline for future research in multispectral object detection, while also offering new training strategies for optimizing single-modality detection models.
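The size of the CVC-14 improvement can be checked directly from the mean values in Table III. The short script below is only an arithmetic aid: the numbers are copied from the table, the standard deviations are ignored, and the variable and function names are chosen for this example.

```python
# CVC-14 miss rates (%) copied from Table III: (without strategy, with strategy).
cvc14_miss_rate = {
    "SSD": (68.39, 37.16),
    "RetinaNet": (47.87, 29.63),
    "Cascade R-CNN": (42.36, 22.15),
    "Faster R-CNN": (51.97, 24.31),
    "DDQ-DETR": (52.78, 26.51),
}


def miss_rate_reductions(table):
    """Absolute reduction in miss rate (percentage points) for each detector."""
    return {name: round(before - after, 2) for name, (before, after) in table.items()}


reductions = miss_rate_reductions(cvc14_miss_rate)
best = max(reductions, key=reductions.get)
print(best, reductions[best])
```

The reductions are absolute differences in percentage points, not relative changes, and a lower miss rate is better.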

5 Conclusion

Multispectral object detection is a rapidly advancing field, yet significant challenges remain in effectively integrating multimodal information to adapt to diverse environmental conditions. In this study, we propose a standardized benchmark with fair and consistent experimental setups to drive progress in this domain. We conducted extensive experiments across multiple public datasets, focusing on three critical aspects of multispectral detection: multimodal feature fusion, dual-modality data augmentation, and registration alignment. Through a comprehensive analysis of our results, we identified the most effective technique combinations and established new performance benchmarks for multispectral object detection.

Additionally, we introduce a novel training strategy to optimize single-modality models for dual-modality tasks, laying the groundwork for adapting high-performing single-modality models to dual-modality scenarios. We believe that the strong baselines and optimized technique combinations presented in this work will facilitate fairer and more practical evaluations in multispectral object detection research. This work sets a robust foundation for future studies and opens new avenues for enhancing multispectral object detection performance.

References

[1] S. Jha, C. Seo, E. Yang, and G. P. Joshi, “Real time object detection and tracking system for video surveillance system,” Multimedia Tools and Applications, vol. 80, no. 3, pp. 3981–3996, 2021.
[2] C. Kumar, R. Punitha et al., “Yolov3 and yolov4: Multiple object detection for surveillance applications,” in 2020 Third international conference on smart systems and inventive technology (ICSSIT). IEEE, 2020, pp. 1316–1321.
[3] A. Balasubramaniam and S. Pasricha, “Object detection in autonomous vehicles: Status and open challenges,” arXiv preprint arXiv:2201.07706, 2022.
[4] M. Carranza-García, J. Torres-Mateo, P. Lara-Benítez, and J. García-Gutiérrez, “On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data,” Remote Sensing, vol. 13, no. 1, p. 89, 2020.
[5] Y. Zuo, J. Wang, and J. Song, “Application of yolo object detection network in weld surface defect detection,” in 2021 IEEE 11th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 2021, pp. 704–710.
[6] Z. Qiu, S. Wang, Z. Zeng, and D. Yu, “Automatic visual defects inspection of wind turbine blades via yolo-based small object detection approach,” Journal of electronic imaging, vol. 28, no. 4, pp. 043 023–043 023, 2019.
[7] K. A. Joshi and D. G. Thakore, “A survey on moving object detection and tracking in video surveillance system,” International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
[8] P. K. Mishra and G. Saroha, “A study on video surveillance system for object detection and tracking,” in 2016 3rd international conference on computing for sustainable global development (INDIACom). IEEE, 2016, pp. 221–226.
[9] L. L. Presti and M. La Cascia, “Real-time object detection in embedded video surveillance systems,” in 2008 ninth international workshop on image analysis for multimedia interactive services. IEEE, 2008, pp. 151–154.
[10] S. Varma and M. Sreeraj, “Object detection and classification in surveillance system,” in 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS). IEEE, 2013, pp. 299–303.
[11] J. C. Nascimento and J. S. Marques, “Performance evaluation of object detection algorithms for video surveillance,” IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 761–774, 2006.
[12] R. Nabati and H. Qi, “Rrpn: Radar region proposal network for object detection in autonomous vehicles,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3093–3097.
[13] J. Lu, H. Sibai, E. Fabry, and D. Forsyth, “No need to worry about adversarial examples in object detection in autonomous vehicles,” arXiv preprint arXiv:1707.03501, 2017.
[14] L. Peng, H. Wang, and J. Li, “Uncertainty evaluation of object detection algorithms for autonomous vehicles,” Automotive Innovation, vol. 4, no. 3, pp. 241–252, 2021.
[15] D. Feng, A. Harakeh, S. L. Waslander, and K. Dietmayer, “A review and comparative study on probabilistic object detection in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 9961–9980, 2021.
[16] D. He, K. Xu, and P. Zhou, “Defect detection of hot rolled steels with a new object detection framework called classification priority network,” Computers & Industrial Engineering, vol. 128, pp. 290–297, 2019.
[17] X. Wang, X. Jia, C. Jiang, and S. Jiang, “A wafer surface defect detection method built on generic object detection network,” Digital Signal Processing, vol. 130, p. 103718, 2022.
[18] J. Yuan, X. Zheng, L. Peng, K. Qu, H. Luo, L. Wei, J. Jin, and F. Tan, “Identification method of typical defects in transmission lines based on yolov5 object detection algorithm,” Energy Reports, vol. 9, pp. 323–332, 2023.
[19] W. R. Tribe, D. A. Newnham, P. F. Taday, and M. C. Kemp, “Hidden object detection: security applications of terahertz technology,” in Terahertz and Gigahertz Electronics and Photonics III, vol. 5354. SPIE, 2004, pp. 168–176.
[20] S. Akcay and T. Breckon, “Towards automatic threat detection: A survey of advances of deep learning within x-ray security imaging,” Pattern Recognition, vol. 122, p. 108245, 2022.
[21] J. B. Sigman, G. P. Spell, K. J. Liang, and L. Carin, “Background adaptive faster r-cnn for semi-supervised convolutional object detection of threats in x-ray images,” in Anomaly Detection and Imaging with X-Rays (ADIX) V, vol. 11404. SPIE, 2020, pp. 12–21.
[22] K. J. Liang, J. B. Sigman, G. P. Spell, D. Strellis, W. Chang, F. Liu, T. Mehta, and L. Carin, “Toward automatic threat recognition for airport x-ray baggage screening with deep convolutional object detection,” arXiv preprint arXiv:1912.06329, 2019.
[23] Z. Hou, C. Yang, Y. Sun, S. Ma, X. Yang, and J. Fan, “An object detection algorithm based on infrared-visible dual modal feature fusion,” Infrared Physics & Technology, vol. 137, p. 105107, 2024.
[24] Q. Ji and Y. Qi, “Dual-mode object detection algorithm based on feature enhancement and feature fusion,” in Journal of Physics: Conference Series, vol. 2816, no. 1. IOP Publishing, 2024, p. 012091.
[25] X. Yan, D. Tian, D. Zhou, C. Wang, and W. Zhang, “Iv-yolo: A lightweight dual-branch object detection network,” 2024.
[26] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1341–1360, 2020.
[27] X. Xu, Y. Li, G. Wu, and J. Luo, “Multi-modal deep feature learning for rgb-d object detection,” Pattern Recognition, vol. 72, pp. 300–313, 2017.
[28] Y. Li, A. W. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, Y. Lu, D. Zhou, Q. V. Le et al., “Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17 182–17 191.
[29] R. Guo, D. Li, and Y. Han, “Deep multi-scale and multi-modal fusion for 3d object detection,” Pattern Recognition Letters, vol. 151, pp. 236–242, 2021.
[30] Y. Xu, X. Yu, J. Zhang, L. Zhu, and D. Wang, “Weakly supervised rgb-d salient object detection with prediction consistency training and active scribble boosting,” IEEE Transactions on Image Processing, vol. 31, pp. 2148–2161, 2022.
[31] K. Geng, W. Zou, G. Yin, Y. Li, Z. Zhou, F. Yang, Y. Wu, and C. Shen, “Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach,” Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering, vol. 233, no. 9, pp. 2270–2283, 2019.
[32] D. Yang, X. Liu, H. He, and Y. Li, “Air-to-ground multimodal object detection algorithm based on feature association learning,” International Journal of Advanced Robotic Systems, vol. 16, no. 3, p. 1729881419842995, 2019.
[33] B. Wan, X. Zhou, Y. Sun, T. Wang, C. Lv, S. Wang, H. Yin, and C. Yan, “Mffnet: Multi-modal feature fusion network for vdt salient object detection,” IEEE Transactions on Multimedia, 2023.
[34] B. Ghari, A. Tourani, A. Shahbahrami, and G. Gaydadjiev, “Pedestrian detection in low-light conditions: A comprehensive survey,” Image and Vision Computing, p. 105106, 2024.
[35] X. Wang, T. Sun, R. Yang, C. Li, B. Luo, and J. Tang, “Quality-aware dual-modal saliency detection via deep reinforcement learning,” Signal Processing: Image Communication, vol. 75, pp. 158–167, 2019.
[36] B. Jiang, Z. Zhou, X. Wang, J. Tang, and B. Luo, “Cmsalgan: Rgb-d salient object detection with cross-view generative adversarial networks,” IEEE Transactions on Multimedia, vol. 23, pp. 1343–1353, 2020.
[37] K. Song, J. Wang, Y. Bao, L. Huang, and Y. Yan, “A novel visible-depth-thermal image dataset of salient object detection for robotic visual perception,” IEEE/ASME Transactions on Mechatronics, vol. 28, no. 3, pp. 1558–1569, 2022.
[38] S. Takken, “Hardware efficient co-detr model for mobile applications,” B.S. thesis, University of Twente, 2024.
[39] T. Diwan, G. Anirudh, and J. V. Tembhurne, “Object detection using yolo: Challenges, architectural successors, datasets and applications,” Multimedia Tools and Applications, vol. 82, no. 6, pp. 9243–9275, 2023.
[40] W. Fang, L. Wang, and P. Ren, “Tinier-yolo: A real-time object detection method for constrained environments,” IEEE Access, vol. 8, pp. 1935–1944, 2019.
[41] R. Huang, J. Pedoeem, and C. Chen, “Yolo-lite: a real-time object detection algorithm optimized for non-gpu computers,” in 2018 IEEE international conference on big data (big data). IEEE, 2018, pp. 2503–2510.
[42] Z. Wang, M. Xiao, J. He, C. Zhang, and K. Fu, “Bimodal information fusion network for salient object detection based on transformer,” in 2022 3rd International Conference on Pattern Recognition and Machine Learning (PRML). IEEE, 2022, pp. 38–48.
[43] N. Yan, T. Zhou, C. Gu, A. Jiang, and W. Lu, “Bimodal-based object detection and instance segmentation models for substation equipments,” in IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2020, pp. 428–434.
[44] Y. Zhang, L. Yuan, Y. Guo, Z. He, I.-A. Huang, and H. Lee, “Discriminative bimodal networks for visual localization and detection with natural language queries,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 557–566.
[45] J. Zheng, L. Wang, J. Liu, H. Wang, S. Wang, L. Wang, and J. Zhang, “An inspection method of rail head surface defect via bimodal structured light sensors,” International Journal of Machine Learning and Cybernetics, vol. 14, no. 5, pp. 1903–1920, 2023.
[46] Z. Tang, T. Xu, Z. Feng, X. Zhu, H. Wang, P. Shao, C. Cheng, X.-J. Wu, M. Awais, S. Atito et al., “Revisiting rgbt tracking benchmarks from the perspective of modality validity: A new benchmark, problem, and method,” arXiv preprint arXiv:2405.00168, 2024.
[47] A. Lu, W. Wang, C. Li, J. Tang, and B. Luo, “After: Attention-based fusion router for rgbt tracking,” arXiv preprint arXiv:2405.02717, 2024.
[48] Y. Cao, W. Ming, H. Li, B. He, and P. Yu, “Anchor-free ranking-based localization optimized siamese rgb-t object tracking network,” in 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT). IEEE, 2024, pp. 1047–1051.
[49] X. Dai, X. Yuan, and X. Wei, “Tirnet: Object detection in thermal infrared images for autonomous driving,” Applied Intelligence, vol. 51, no. 3, pp. 1244–1261, 2021.
[50] Z. Tang, T. Xu, H. Li, X.-J. Wu, X. Zhu, and J. Kittler, “Exploring fusion strategies for accurate rgbt visual object tracking,” Information Fusion, vol. 99, p. 101881, 2023.
[51] L. Zhang, M. Danelljan, A. Gonzalez-Garcia, J. Van De Weijer, and F. Shahbaz Khan, “Multi-modal fusion for end-to-end rgb-t tracking,” in Proceedings of the IEEE/CVF International conference on computer vision workshops, 2019.
[52] T. Zhang, X. He, Y. Luo, Q. Zhang, and J. Han, “Exploring target-related information with reliable global pixel relationships for robust rgb-t tracking,” Pattern Recognition, vol. 155, p. 110707, 2024.
[53] Y. Zhu, C. Li, J. Tang, B. Luo, and L. Wang, “Rgbt tracking by trident fusion network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 2, pp. 579–592, 2021.
[54] Y. Zhu, C. Li, J. Tang, and B. Luo, “Quality-aware feature aggregation network for robust rgbt tracking,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 121–130, 2020.
[55] Z. Tu, W. Pan, Y. Duan, J. Tang, and C. Li, “Rgbt tracking via reliable feature configuration,” Science China Information Sciences, vol. 65, no. 4, p. 142101, 2022.
[56] S. Zhai, Y. Wu, L. Liu, and J. Tang, “Rgbt tracking based on modality feature enhancement,” Multimedia Tools and Applications, vol. 83, no. 10, pp. 29 311–29 330, 2024.
[57] W. Zhou, Y. Pan, J. Lei, L. Ye, and L. Yu, “Defnet: Dual-branch enhanced feature fusion network for rgb-t crowd counting,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24 540–24 549, 2022.
[58] W. Gao, G. Liao, S. Ma, G. Li, Y. Liang, and W. Lin, “Unified information fusion network for multi-modal rgb-d and rgb-t salient object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2091–2106, 2021.
[59] X. Xiao, X. Xiong, F. Meng, and Z. Chen, “Multi-scale feature interactive fusion network for rgbt tracking,” Sensors, vol. 23, no. 7, p. 3410, 2023.
[60] Y. Cai, X. Sui, and G. Gu, “Multi-modal multi-task feature fusion for rgbt tracking,” Information Fusion, vol. 97, p. 101816, 2023.
[61] J. Peng, H. Zhao, and Z. Hu, “Dynamic fusion network for rgbt tracking,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 3822–3832, 2022.
[62] Y. Zhao, H. Lai, and G. Gao, “Rmfnet: Redetection multimodal fusion network for rgbt tracking,” Applied Sciences, vol. 13, no. 9, p. 5793, 2023.
[63] S.-I. Oh and H.-B. Kang, “Object detection and classification by decision-level fusion for intelligent vehicle systems,” Sensors, vol. 17, no. 1, p. 207, 2017.
[64] Y. Zhang, H. Yu, Y. He, X. Wang, and W. Yang, “Illumination-guided rgbt object detection with inter-and intra-modality fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
[65] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931.
[66] J. Orlosky, P. Kim, K. Kiyokawa, T. Mashita, P. Ratsamee, Y. Uranishi, and H. Takemura, “Vismerge: Light adaptive vision augmentation via spectral and temporal fusion of non-visible light,” in 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2017, pp. 22–31.
[67] M. V. Andersen, R. Greer, A. Møgelmose, and M. M. Trivedi, “Learning to find missing video frames with synthetic data augmentation: A general framework and application in generating thermal images using rgb cameras,” in 2024 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2024, pp. 104–109.
[68] J. Chen, W. Yang, C. Liu, and L. Yao, “A data augmentation method for skeleton-based action recognition with relative features,” Applied Sciences, vol. 11, no. 23, p. 11481, 2021.
[69] J. Lambrecht and L. Kästner, “Towards the usage of synthetic data for marker-less pose estimation of articulated robots in rgb images,” in 2019 19th International Conference on Advanced Robotics (ICAR). IEEE, 2019, pp. 240–247.
[70] Z. Tu, Z. Li, C. Li, and J. Tang, “Weakly alignment-free rgbt salient object detection with deep correlation network,” IEEE Transactions on Image Processing, vol. 31, pp. 3752–3764, 2022.
[71] M. Yuan, Y. Wang, and X. Wei, “Translation, scale and rotation: cross-modal alignment meets rgb-infrared vehicle detection,” in European Conference on Computer Vision. Springer, 2022, pp. 509–525.
[72] T. Zhang, X. He, Q. Jiao, Q. Zhang, and J. Han, “Amnet: Learning to align multi-modality for rgb-t tracking,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
[73] L. Liu, C. Li, Y. Xiao, R. Ruan, and M. Fan, “Rgbt tracking via challenge-based appearance disentanglement and interaction,” IEEE Transactions on Image Processing, 2024.
[74] H. Li, J. Liu, Y. Zhang, and Y. Liu, “A deep learning framework for infrared and visible image fusion without strict registration,” International Journal of Computer Vision, vol. 132, no. 5, pp. 1625–1644, 2024.
[75] J. Tang, D. Fan, X. Wang, Z. Tu, and C. Li, “Rgbt salient object detection: Benchmark and a novel cooperative ranking approach,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4421–4433, 2019.
[76] H. O. Velesaca, G. Bastidas, M. Rouhani, and A. D. Sappa, “Multimodal image registration techniques: a comprehensive survey,” Multimedia Tools and Applications, pp. 1–29, 2024.
[77] M. Brenner, N. H. Reyes, T. Susnjak, and A. L. Barczak, “Rgb-d and thermal sensor fusion: a systematic literature review,” IEEE Access, 2023.
[78] Y. Zhang, C. Xu, W. Yang, G. He, H. Yu, L. Yu, and G.-S. Xia, “Drone-based rgbt tiny person detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 204, pp. 61–76, 2023.
[79] L. Tang, Y. Deng, Y. Ma, J. Huang, and J. Ma, “Superfusion: A versatile image registration and fusion network with semantic awareness,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 12, pp. 2121–2137, 2022.
[80] L. Liu, C. Li, Y. Xiao, and J. Tang, “Quality-aware rgbt tracking via supervised reliability learning and weighted residual guidance,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3129–3137.
[81] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” 2016. [Online]. Available: https://arxiv.org/abs/1506.01497
[82] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” Springer International Publishing, 2016, pp. 21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2
[83] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
[84] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” 2020. [Online]. Available: https://arxiv.org/abs/1911.09070
[85] F. Liu, S. Guan, K. Yu, and H. Gong, “Infrared target detection based on the fusion of mask r-cnn and image enhancement network,” in 2022 China Automation Congress (CAC), 2022, pp. 2011–2016.
[86] A. S. Geetha, “Comparing yolov5 variants for vehicle detection: A performance analysis,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12550
[87] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” 2019. [Online]. Available: https://arxiv.org/abs/1904.01355
[88] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019. [Online]. Available: https://arxiv.org/abs/1904.07850
[89] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedestrian detection: Benchmark dataset and baselines,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[90] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” 2021. [Online]. Available: https://arxiv.org/abs/2003.02437
[91] J. Shen, Y. Chen, Y. Liu, X. Zuo, H. Fan, and W. Yang, “Icafusion: Iterative cross-attention guided feature fusion for multispectral object detection,” 2023. [Online]. Available: https://arxiv.org/abs/2308.07504
[92] Z. Zong, G. Song, and Y. Liu, “Detrs with collaborative hybrid assignments training,” 2023. [Online]. Available: https://arxiv.org/abs/2211.12860
[93] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, “Rtmdet: An empirical study of designing real-time object detectors,” 2022. [Online]. Available: https://arxiv.org/abs/2212.07784
[94] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” 2021. [Online]. Available: https://arxiv.org/abs/2104.14294
[95] Y. Chen, P. Zhang, Z. Li, Y. Li, X. Zhang, L. Qi, J. Sun, and J. Jia, “Dynamic scale training for object detection,” 2021. [Online]. Available: https://arxiv.org/abs/2004.12432
[96] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “LoFTR: Detector-free local feature matching with transformers,” 2021. [Online]. Available: https://arxiv.org/abs/2104.00680
[97] H. Dong, W. Gu, X. Zhang, J. Xu, R. Ai, H. Lu, J. Kannala, and X. Chen, “SuperFusion: Multilevel LiDAR-camera fusion for long-range HD map generation,” 2024. [Online]. Available: https://arxiv.org/abs/2211.15656
[98] S. Zhang, X. Wang, J. Wang, J. Pang, C. Lyu, W. Zhang, P. Luo, and K. Chen, “Dense distinct query for end-to-end object detection,” 2023. [Online]. Available: https://arxiv.org/abs/2303.12776
[99] D. König, M. Adam, C. Jarvers, G. Layher, H. Neumann, and M. Teutsch, “Fully convolutional region proposal networks for multispectral person detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 243–250. [Online]. Available: https://api.semanticscholar.org/CorpusID:8436249
[100] C. Li, D. Song, R. Tong, and M. Tang, “Illumination-aware Faster R-CNN for robust multispectral pedestrian detection,” 2018. [Online]. Available: https://arxiv.org/abs/1803.05347
[101] D. Guan, Y. Cao, J. Liang, Y. Cao, and M. Y. Yang, “Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection,” 2018. [Online]. Available: https://arxiv.org/abs/1802.09972
[102] K. Zhou, L. Chen, and X. Cao, “Improving multispectral pedestrian detection by addressing modality imbalance problems,” 2020. [Online]. Available: https://arxiv.org/abs/2008.03043
[103] J. Kim, H. Kim, T. Kim, N. Kim, and Y. Choi, “MLPD: Multi-label pedestrian detector in multispectral domain,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7846–7853, 2021.
[104] C. Li, D. Song, R. Tong, and M. Tang, “Multispectral pedestrian detection via simultaneous detection and segmentation,” 2018. [Online]. Available: https://arxiv.org/abs/1808.04818
[105] C. Devaguptapu, N. Akolekar, M. M. Sharma, and V. N. Balasubramanian, “Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1029–1038.
[106] H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral fusion for object detection with cyclic fuse-and-refine blocks,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 276–280.
[107] M. Kieu, A. D. Bagdanov, and M. Bertini, “Bottom-up and layerwise domain adaptation for pedestrian detection in thermal images,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 17, no. 1, Apr. 2021. [Online]. Available: https://doi.org/10.1145/3418213
[108] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion transformer for multispectral object detection,” 2022. [Online]. Available: https://arxiv.org/abs/2111.00273
[109] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022.
[110] X. He, C. Tang, X. Zou, and W. Zhang, “Multispectral object detection via cross-modal conflict-aware learning,” in Proceedings of the 31st ACM International Conference on Multimedia, ser. MM ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 1465–1474. [Online]. Available: https://doi.org/10.1145/3581783.3612651
[111] M. Yuan, Y. Wang, and X. Wei, “Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection,” 2022. [Online]. Available: https://arxiv.org/abs/2209.13801
[112] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, “Gliding vertex on the horizontal bounding box for multi-oriented object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 4, pp. 1452–1459, Apr. 2021. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2020.2974745