Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, and Hui-Liang Shen, Senior Member, IEEE

This work was supported in part by the National Key R&D Program of China under grant 2023YFB3209800, in part by the Natural Science Foundation of Zhejiang Province under grant D24F020006, in part by the National Natural Science Foundation of China under grant 62301484, and in part by the Jinhua Science and Technology Bureau Project. (Corresponding authors: Si-Yuan Cao and Hui-Liang Shen.)

X. Zhang, R. Zhang, Z. Wu, X. Zhang, and X. Bai are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: zxue2019@zju.edu.cn, runmin_zhang@zju.edu.cn, jeffw@zju.edu.cn, zhangxh2023@zju.edu.cn, shawnnnkb@zju.edu.cn).

S.-Y. Cao is with the Ningbo Research Institute, College of Information Science and Electronic Engineering, Zhejiang University, China (e-mail: cao_siyuan@zju.edu.cn).

F. Wang is with the School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China (e-mail: wangf@zucc.edu.cn).

H.-L. Shen is with the College of Information Science and Electronic Engineering, Jinhua Institute, Zhejiang University, and also with the Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, China (e-mail: shenhl@zju.edu.cn).

Abstract

Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is becoming more pronounced, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the causes of the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. In addition, we find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are also key obstacles to performance. Focusing on these three problems, we propose corresponding solutions, including a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.

Index Terms:

Multispectral object detection; feature fusion; weakly supervised learning; knowledge distillation

I Introduction

Multispectral object detection has been widely studied, since multispectral images can provide complementary information to achieve consistent detection in various lighting conditions [3, 4, 5, 6, 7, 8, 9, 10]. This complementarity is illustrated in Fig. 1 (a) and (b). Given the multispectral inputs, modern multispectral detectors adopt one of three fusion strategies: early-fusion, medium-fusion, and late-fusion, shown in Fig. 1 (c) - (e). The medium and late-fusion strategies often achieve superior performance compared to early-fusion [11, 12, 13, 5, 4, 14]. However, they use a two-branch structure, making model deployment on edge devices expensive. In contrast, the early-fusion strategy adopts a simple single-branch structure, facilitating deployment on edge devices. Nevertheless, its performance is low, and few works address this problem, which widens the gap between high performance and high efficiency.

To resolve this conflict, in this work we focus on improving the performance of the early-fusion strategy while maintaining its high efficiency. We first conduct pilot studies and observe that a plain early-fusion strategy cannot consistently obtain improved performances compared to single-modality inputs. Based on this observation, we rethink the early-fusion strategy and summarize three key obstacles: 1) the information interference problem when simply concatenating the multispectral images, 2) the domain gap existing in thermal and RGB images, and 3) the weak feature representation of the single-branch structure. Focusing on these obstacles, we propose corresponding solutions.
- Information interference problem refers to the potential suppression of important information in one modality by another. In the plain early-fusion strategy, previous works [15] typically feed concatenated multispectral images into a convolution layer to generate a fused feature. The convolution layer generally has a small receptive field, so with such limited context it is hard to determine which modality carries the important information. We address this issue by first recognizing that object shapes are agnostic to visible and infrared wavelengths, and then devising a module that fuses multispectral images based on object shape saliency, named the shape-priority early-fusion (ShaPE) module.
- Domain gap between RGB and thermal images is usually neglected in previous works, which generally adopt an RGB pre-trained backbone network to extract features from both RGB and thermal images [13, 5]. However, the domain gap may cause a shift in the representation distribution. This issue is also recognized in [16] for an RGB-D task. Different from previous works, we introduce a weakly supervised learning method to address this issue. Within this method, the backbone network jointly uses RGB and thermal images to learn the representation of CLIP [17], since CLIP has demonstrated promising zero-shot generalizability in bridging the domain gap [18]. Additionally, we introduce a segmentation auxiliary branch. Our method allows the backbone network to reduce representation shifts and improve semantic localization ability.
- Weak feature representation problem results from the early-fusion strategy employing a single-branch structure. This structure has fewer parameters and simpler fusion modules compared to medium and late-fusion strategies. We address this issue by introducing the knowledge distillation (KD) technique [19]. In KD, a key problem is how to align the feature dimensions between the teacher and student models. Previous works generally introduce a convolution layer so that the student model learns all knowledge from the teacher model [20, 21]. However, we show that not all information in the teacher model is helpful for downstream tasks. Therefore, we introduce core knowledge distillation (CoreKD) to transfer the most crucial knowledge for specific downstream tasks, resembling the human learning process in which the teacher highlights key knowledge so that students can understand and absorb it quickly.

Experimental results validate that our efficient multispectral early-fusion (EME) detector achieves a significant performance improvement without considerably increasing the number of parameters, as shown in Fig. 1 (f). Besides, our EME outperforms the previous state-of-the-art approaches. In summary, our contributions are threefold:

- Different from previous works, we summarize three key obstacles limiting the early-fusion strategy, including information interference, domain gap, and representation learning.
- For each obstacle, we propose the corresponding solution: we develop 1) a ShaPE module to address the information interference issue, 2) a weakly supervised learning method to reduce domain gap and improve semantic localization abilities, and 3) a CoreKD to enhance the representation learning of single-branch networks.
- Extensive experiments validate that the early-fusion strategy, equipped with our ShaPE module, weakly supervised learning, and CoreKD technique, shows significant improvement. Additionally, we only retain the ShaPE module during the inference phase. Consequently, our method is efficient and achieves improved performance.

II Related Work

In this section, we offer a brief overview of multispectral object detection and introduce related works in weakly supervised learning and knowledge distillation.

II-A Multispectral Object Detection

According to fusion strategies, multispectral object detection can be classified into three categories: early-fusion, medium-fusion and late-fusion strategies. Previous works [11], [12] and [22] confirm that both medium-fusion and late-fusion strategies outperform the early-fusion strategy.

However, both the medium-fusion and late-fusion strategies adopt a two-branch structure that limits their use on resource-limited edge devices. Previous works notice this weakness and provide some solutions. For example, in [14], a model using the medium-fusion strategy is first trained as a teacher, and its knowledge is transferred to a student model. The student model only receives RGB images as inputs. Although it saves resources, it discards important complementary information from thermal images. The work [13] introduces a domain adaptation technique. It uses a medium-fusion model to guide the learning of a single-branch model, which only receives thermal images as inputs and thus discards complementary information from RGB images. To employ complementary information while saving computational resources, [23] transfers knowledge from a medium-fusion model to an early-fusion model. Nevertheless, it neglects the information interference problem. Some works in the image fusion field [1, 24, 25] demonstrate that fused images can improve detectors, but the fusion process still introduces an additional computational burden.

Different from previous works, we identify the information interference problem in early-fusion strategies. By addressing this problem, we fully employ the complementary information in multispectral images, without significantly increasing computational burden.

II-B Weakly Supervised Learning in Object Detection

Weakly supervised learning has received much attention in object localization and detection, as comprehensively surveyed in [26]. Recent works in multispectral object detection adopt this technique. Based on the weak annotations they utilize, we can coarsely divide them into image-level and box-level weakly supervised learning approaches.

In image-level weakly supervised learning approaches, previous works mainly employ the illumination condition of RGB images as weighting factors to determine the modality importance [14, 5, 27, 13]. In box-level approaches, previous works [15, 28] mainly employ the bounding-box annotations to generate masks. They use these masks to construct spatial attention mechanisms, highlighting representations within target regions.

Different from previous works, we use weakly supervised learning to address the domain gap problem in RGB and thermal images. We employ image-level labels to construct a multi-label classification auxiliary task. This task can fully exploit the complementary information in multispectral images, instead of solely using information from one modality. Along with the powerful CLIP model [17] and box-level weak labels, our method can reduce the domain gap and obtain precise semantic localization abilities.

II-C Knowledge Distillation

Knowledge distillation was first introduced in [19]. It aims to improve a lightweight student model by learning knowledge from a high-capacity teacher model. According to the distillation approach, this technique can be roughly divided into two groups: logit distillation [19] and feature distillation [20]. The former lets a student model learn the logits of a teacher model, while the latter lets a student model learn the features of a teacher model. These distillation approaches are also applied to object detection [29, 21]. Recently, some works in multispectral object detection also employ the knowledge distillation technique [23, 14]. In the distillation process, they generally introduce a projection layer to align the numbers of feature channels of the teacher and the student. The purpose of this approach is to learn all representations in the teacher model.

Different from previous works, we first confirm that not all information in teacher features is beneficial to downstream tasks, including classification and regression. Based on this, we propose a core knowledge distillation technique to transfer the most important features for the downstream tasks to the student model.

III Method

Fig. 2 illustrates the overview of our method. We adopt a single-branch structure as the baseline model considering its low memory cost. To boost its performance, we develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation (CoreKD). In the following, we describe the ShaPE module in Section III-A, the weakly supervised auxiliary learning method in Section III-B, and the CoreKD in Section III-C. These contributions are designed to enhance information fusion, feature extraction, and feature classification abilities, respectively.

III-A Shape-Priority Early-Fusion Module

Observation. Given a pair of RGB-T images, the plain early-fusion strategy concatenates them in the channel dimension and then feeds them into a detector. With the plain strategy, we conduct pilot studies on the M3FD [1] dataset. We first train three commonly used one-stage detectors: RetinaNet [30], GFL [31] and YOLOv5 [2]. Then, we compute the mean values and standard deviations of their detection results and illustrate the computed results in Fig. 3. Besides, we also train these detectors using single-modality images as input for comparisons. We have the following two observations. First, the plain early-fusion strategy cannot achieve consistent improvement compared with single-modality input. Second, for objects that require color to identify, such as ‘Traffic Light’, the plain early-fusion strategy yields worse results than the RGB input.

Motivation. We attribute the above phenomena to the convolutional inductive bias, namely, local connectivity and weight sharing. The process of 2D convolution involves two steps: (1) sampling across the concatenated RGB-T images using a regular grid $\mathcal{R}$; (2) summing the sampled values with weighting factor $\mathbf{W}$. The grid $\mathcal{R}$ determines both the receptive field size and dilation. For example,

$$
\mathcal{R}=\left\{(-3,-3),(-3,-2),\dots,(2,3),(3,3)\right\}
$$

defines a 7 $\times$ 7 kernel with dilation 1. For each position $\mathbf{p}_{0}$ on the output feature map $\mathrm{O}$, we have

$$
\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\rm rgb,t\}}\mathbf{W}_{j}(\mathbf{p}_{n})\cdot\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{1}
$$

where $\mathbf{p}_{n}$ enumerates the positions in $\mathcal{R}$.
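As a minimal illustration of Eq. (1), the following NumPy sketch computes one output channel of the plain early-fusion convolution. The function name `plain_early_fusion`, the zero padding, and the reduction of each modality to a single channel are assumptions made for illustration; they are not taken from the released implementation.

```python
import numpy as np


def plain_early_fusion(img_rgb, img_t, w_rgb, w_t):
    """Eq. (1) for one output channel.

    img_rgb, img_t: (H, W) arrays, one channel per modality (assumption).
    w_rgb, w_t:     (k, k) kernels with odd k, one per modality.
    """
    k = w_rgb.shape[0]
    r = k // 2
    h, w = img_rgb.shape
    # Zero padding keeps the output the same size as the input (assumption).
    pad_rgb = np.pad(img_rgb, r)
    pad_t = np.pad(img_t, r)
    out = np.zeros((h, w), dtype=np.float64)
    # Enumerate the offsets p_n = (dy - r, dx - r) of the regular grid R.
    for dy in range(k):
        for dx in range(k):
            out += w_rgb[dy, dx] * pad_rgb[dy:dy + h, dx:dx + w]
            out += w_t[dy, dx] * pad_t[dy:dy + h, dx:dx + w]
    return out
```

Every output value depends only on the k x k neighborhood around its position, which is the limited context discussed in this section.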

This process indicates that the plain early-fusion strategy is a pixel-level weighting method, with weights learned from data. However, the limited receptive field makes it difficult for a pixel-level weighting method to determine which modality is important. This weakness may result in valuable information from one modality being suppressed by another. As an example, Fig. 4 (c) depicts the feature map generated from the RGB-T images of Fig. 4 (a) and (b) using the plain early-fusion strategy. The close-up shows that the ‘Traffic Light’ in the fused feature map does not preserve the significant information of the RGB image.

The straightforward solutions to this weakness are: (1) enlarging the receptive field by using a larger kernel or more convolutional layers so that the model can judge the modality importance based on a broader range of contexts, or (2) increasing the number of convolutional kernels so that the model can learn more representations. However, these solutions increase memory costs and computational burden, making them unfriendly to edge devices.

ShaPE Module. We observe that shape is an inherent attribute of an object: any object visible in both RGB and thermal images has a consistent shape across them. Thus, we consider the salience of shape as a modifying factor to adaptively determine the modality importance, and design the shape-priority early-fusion (ShaPE) module. In the ShaPE module, the RGB and thermal images are modified by self-gating masks. In this context, Eq. (1) becomes:

$$
\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\rm rgb,t\}}\mathbf{W}_{j}(\mathbf{p}_{n})\cdot\mathbf{M}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n})\cdot\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{2}
$$

where $\mathbf{M}_{\rm rgb}$ and $\mathbf{M}_{\rm t}$ denote the self-gating masks of RGB and thermal images, respectively.

In the following, we describe the generation process of self-gating masks $\mathbf{M}_{\rm rgb}$ and $\mathbf{M}_{\rm t}$. Since our ShaPE module focuses on the shapes of objects and structural contributions of different modalities to the fused features, we employ the gradients and structural similarities in our method. For easy understanding, we visualize some important intermediate results in Fig. 4. Given the RGB-T images as shown in Fig. 4 (a) and (b), we compute their gradients

$$
\begin{aligned}
\nabla\mathbf{I}_{\rm rgb}(\mathbf{p}_{0}) &= \sqrt{(\nabla_{x}\mathbf{I}_{\rm rgb}(\mathbf{p}_{0}))^{2}+(\nabla_{y}\mathbf{I}_{\rm rgb}(\mathbf{p}_{0}))^{2}},\\
\nabla\mathbf{I}_{\rm t}(\mathbf{p}_{0}) &= \sqrt{(\nabla_{x}\mathbf{I}_{\rm t}(\mathbf{p}_{0}))^{2}+(\nabla_{y}\mathbf{I}_{\rm t}(\mathbf{p}_{0}))^{2}},
\end{aligned}
$$

as shown in Fig. 4 (d) and (e). We then generate the union gradient as the reference using

$$
\nabla\mathbf{I}^{\prime}_{\rm ref}(\mathbf{p}_{0})=\max(\nabla\mathbf{I}_{\rm rgb}(\mathbf{p}_{0}),\nabla\mathbf{I}_{\rm t}(\mathbf{p}_{0})).
$$

We further use max-pooling within a 3 $\times$ 3 neighborhood $\mathcal{R}^{\prime}$ to boost the reference gradient, which is written as

$$
\nabla\mathbf{I}_{\rm ref}(\mathbf{p}_{0})=\max_{\mathbf{p}_{n}\in\mathcal{R}^{\prime}}\nabla\mathbf{I}^{\prime}_{\rm ref}(\mathbf{p}_{0}+\mathbf{p}_{n}),
$$

as shown in Fig. 4 (f).

To determine the structural contributions of each modality to the fused features, we compute the structural similarities between the single-modality gradient images $\{\nabla\mathbf{I}_{\rm rgb},\nabla\mathbf{I}_{\rm t}\}$ and the reference gradient image $\nabla\mathbf{I}_{\rm ref}$. Inspired by [32], for each patch $\mathcal{R}$, we compute three fundamental properties: the means $\{\mu_{\rm rgb},\mu_{\rm t},\mu_{\rm ref}\}$, the standard deviations $\{\sigma_{\rm rgb},\sigma_{\rm t},\sigma_{\rm ref}\}$, and the covariances $\{\sigma_{({\rm rgb,ref})},\sigma_{({\rm t,ref})}\}$ between the single-modality gradient images and the reference gradient images. In this context, we generate the self-gating masks:

$$
\begin{aligned}
\mathbf{M}_{\rm rgb}^{\prime} &= \frac{(2\mu_{\rm rgb}\cdot\mu_{\rm ref}+\xi_{1})\cdot(2\sigma_{({\rm rgb,ref})}+\xi_{2})}{(\mu_{\rm rgb}^{2}+\mu_{\rm ref}^{2}+\xi_{1})\cdot(\sigma_{\rm rgb}^{2}+\sigma_{\rm ref}^{2}+\xi_{2})},\\
\mathbf{M}_{\rm t}^{\prime} &= \frac{(2\mu_{\rm t}\cdot\mu_{\rm ref}+\xi_{1})\cdot(2\sigma_{({\rm t,ref})}+\xi_{2})}{(\mu_{\rm t}^{2}+\mu_{\rm ref}^{2}+\xi_{1})\cdot(\sigma_{\rm t}^{2}+\sigma_{\rm ref}^{2}+\xi_{2})},
\end{aligned}
$$

where $\xi_{1}=(k_{1}L)^{2}$ and $\xi_{2}=(k_{2}L)^{2}$ are used to prevent instability. $L$ is the dynamic range of the gradient images, $k_{1}=0.01$, and $k_{2}=0.03$.

Since the ranges of both $\mathbf{M}_{\rm rgb}^{\prime}$ and $\mathbf{M}_{\rm t}^{\prime}$ are $[-1,1]$, we then normalize the self-gating masks and obtain

$$
\mathbf{M}_{\rm rgb}=\frac{\exp(\mathbf{M}^{\prime}_{\rm rgb})}{\sum\limits_{j\in\{\rm rgb,t\}}\exp(\mathbf{M}^{\prime}_{j})},\;\mathbf{M}_{\rm t}=\frac{\exp(\mathbf{M}^{\prime}_{\rm t})}{\sum\limits_{j\in\{\rm rgb,t\}}\exp(\mathbf{M}^{\prime}_{j})}, \tag{3}
$$

as shown in Fig. 4 (g) and (h). According to Eq. (2), we can finally generate the fused feature map as shown in Fig. 4 (i).
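The mask generation described above can be sketched in NumPy as follows. This is an illustrative reading rather than the released implementation: the central-difference gradient operator (`np.gradient`), the uniform 7 $\times$ 7 statistics window, the edge padding, the default dynamic range of 1, and all function names are assumptions, and each modality is taken to be a single-channel image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def grad_mag(img):
    """Gradient magnitude from central differences (operator is an assumption)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.sqrt(gx ** 2 + gy ** 2)


def windows(x, k):
    """All k x k patches of x (edge-padded), shape (H, W, k, k)."""
    return sliding_window_view(np.pad(x, k // 2, mode="edge"), (k, k))


def similarity_map(a, ref, k=7, dyn_range=1.0, k1=0.01, k2=0.03):
    """Per-pixel structural similarity between a and ref (the M' terms)."""
    xi1, xi2 = (k1 * dyn_range) ** 2, (k2 * dyn_range) ** 2
    wa, wr = windows(a, k), windows(ref, k)
    mu_a, mu_r = wa.mean(axis=(-2, -1)), wr.mean(axis=(-2, -1))
    var_a, var_r = wa.var(axis=(-2, -1)), wr.var(axis=(-2, -1))
    cov = (wa * wr).mean(axis=(-2, -1)) - mu_a * mu_r
    num = (2 * mu_a * mu_r + xi1) * (2 * cov + xi2)
    den = (mu_a ** 2 + mu_r ** 2 + xi1) * (var_a + var_r + xi2)
    return num / den


def shape_masks(img_rgb, img_t, k=7, dyn_range=1.0):
    """Self-gating masks (M_rgb, M_t) for two (H, W) images."""
    g_rgb, g_t = grad_mag(img_rgb), grad_mag(img_t)
    ref = np.maximum(g_rgb, g_t)               # union gradient
    ref = windows(ref, 3).max(axis=(-2, -1))   # 3 x 3 max-pooling, stride 1
    m_rgb = similarity_map(g_rgb, ref, k, dyn_range)
    m_t = similarity_map(g_t, ref, k, dyn_range)
    e_rgb, e_t = np.exp(m_rgb), np.exp(m_t)    # two-way softmax of Eq. (3)
    return e_rgb / (e_rgb + e_t), e_t / (e_rgb + e_t)
```

Because Eq. (3) is a two-way softmax, the two masks sum to one at every pixel. Under Eq. (2), the gated inputs `m_rgb * img_rgb` and `m_t * img_t` would then be passed to the first convolution.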

III-B Weakly Supervised Learning Method

In RGB-T object detection, a non-negligible issue is the lack of backbone networks pre-trained on large-scale RGB-T datasets. This is because few datasets in the RGB-T image recognition field reach the scale of ImageNet [33] or COCO [34]. Previous works generally use backbone networks pre-trained on ImageNet. However, the domain gap between thermal and RGB images would cause representation distribution shifts, as illustrated in Fig. 5 (a) and (b). This is because the backbone network is trained solely on RGB images, but is applied to thermal images.

To handle this issue, we turn to the powerful Contrastive Language-Image Pre-training (CLIP) [17] model. It has been confirmed that CLIP can bridge domain gaps [18], since it is trained using a huge number of (image, text) pairs. In this context, we feed both RGB and thermal images into the backbone network, and let it learn the representation generated by the CLIP model. Specifically, we first present a CLIP-driven image-level weakly supervised learning method. This method enables the network to recognize the classes of objects in a pair of RGB-T images while locating their coarse regions. For fine-grained localization, we then introduce a box-level weakly supervised learning method. Fig. 6 illustrates the architecture of the weakly supervised learning method.

CLIP-Driven Image-Level Weak Supervision. To learn the CLIP model’s knowledge, we construct the image-level weak supervision method. Based on three considerations, we adopt the multi-label classification task as the image-level weak supervision: (1) the CLIP model can be viewed as a classifier, (2) this auxiliary task can fully use the complementary information in the RGB-T images, and (3) by summarizing all classes and removing duplicates in an image, we can easily construct the ground-truth multi-label targets based on detection annotations.

Nevertheless, the original CLIP model is only trained to recognize a single object per image [17] and is not suitable for multi-label classification [36]. To address this issue, we introduce a Divide-and-Aggregation CLIP (DA-CLIP) model. DA-CLIP first divides input images into multiple crops. Each crop is then fed into CLIP. The predictions for all crops are finally aggregated by a max-pooling operation on each class. Considering that DA-CLIP may generate inaccurate predictions, we construct a learnable adapter, which consists of three fully-connected (FC) layers, to fine-tune the result of DA-CLIP. To prevent overfitting, we add a dropout layer in the adapter. We denote the predicted probability from the adapter as $\mathbf{\hat{q}}_{\rm ad}\in\mathbb{R}^{c}$, where $c$ denotes the number of classes.
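The divide-and-aggregate step can be sketched as below. The non-overlapping grid of crops and the names `divide_and_aggregate` and `score_fn` are assumptions for illustration; `score_fn` is a hypothetical stand-in for any function that maps a crop to a vector of $c$ per-class scores (in the paper this role is played by CLIP), and the adapter is not included.

```python
import numpy as np


def divide_and_aggregate(image, score_fn, grid=(2, 2)):
    """Score each crop of an (H, W, C) image and max-pool the scores per class.

    score_fn is a hypothetical callable: crop -> array of shape (c,).
    The non-overlapping grid layout is an assumption.
    """
    h, w = image.shape[:2]
    rows, cols = grid
    ys = np.linspace(0, h, rows + 1, dtype=int)   # row boundaries of the crops
    xs = np.linspace(0, w, cols + 1, dtype=int)   # column boundaries of the crops
    scores = []
    for i in range(rows):
        for j in range(cols):
            crop = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            scores.append(score_fn(crop))
    # A class is kept if any single crop is confident about it.
    return np.max(np.stack(scores), axis=0)
```

Taking the maximum over crops lets a class that appears in only one region survive aggregation, which is what makes the output usable as a multi-label prediction.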

For the backbone network, we add an auxiliary classification head on its top. The head consists of a global average pooling (GAP) operation and one FC layer. We denote the predicted probability from the classification head as $\mathbf{\hat{q}}_{\rm bb}\in\mathbb{R}^{c}$.

We adopt the mutual learning approach [37] to train the backbone network and the adapter simultaneously. In this approach, an important step is that one model generates soft targets for the other model using the softmax function. However, this approach cannot be directly applied to the multi-label classification problem, since it requires the sum of predicted probabilities to be one, which is rarely satisfied in multi-label classification. To address this issue, we draw inspiration from self-training KD [38] and construct the soft targets for the adapter and backbone network as

$$
\mathbf{\tilde{q}}_{\rm ad}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm ad},\quad\mathbf{\tilde{q}}_{\rm bb}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm bb},
$$

where $\mathbf{q}\in\mathbb{R}^{c}$ denotes a ground-truth multi-label target, and $\lambda$ denotes a balancing factor set to 0.1. In this context, we compute the binary cross-entropy (BCE) losses


$$
\begin{aligned}
\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb}) &= -\sum_{i=1}^{c}\left[\tilde{q}_{{\rm ad},i}\log(\hat{q}_{{\rm bb},i})+(1-\tilde{q}_{{\rm ad},i})\log(1-\hat{q}_{{\rm bb},i})\right], && \text{(4a)}\\
\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad}) &= -\sum_{i=1}^{c}\left[\tilde{q}_{{\rm bb},i}\log(\hat{q}_{{\rm ad},i})+(1-\tilde{q}_{{\rm bb},i})\log(1-\hat{q}_{{\rm ad},i})\right]. && \text{(4b)}
\end{aligned}
$$
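A small NumPy sketch of the soft targets and the two BCE terms is given below. The function names and the clipping constant `eps` are assumptions added for numerical safety. In an automatic-differentiation framework the soft targets would typically be treated as constants, which is also an assumption rather than something stated in the text.

```python
import numpy as np


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy summed over classes; eps is an added safeguard."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.sum(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))


def mutual_losses(q, q_hat_ad, q_hat_bb, lam=0.1):
    """Return (H(q~_ad, q^_bb), H(q~_bb, q^_ad)) for multi-label target q."""
    # Soft targets: ground truth blended with each model's own prediction.
    q_tilde_ad = (1.0 - lam) * q + lam * q_hat_ad
    q_tilde_bb = (1.0 - lam) * q + lam * q_hat_bb
    loss_bb = bce(q_tilde_ad, q_hat_bb)  # Eq. (4a): adapter target supervises backbone
    loss_ad = bce(q_tilde_bb, q_hat_ad)  # Eq. (4b): backbone target supervises adapter
    return loss_bb, loss_ad
```

Since each soft target is a convex combination of values in $[0,1]$, it stays in $[0,1]$ per class without requiring the class probabilities to sum to one, which is why this construction suits the multi-label setting.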

To showcase the semantic localization effect of our CLIP-driven image-level weak supervision, we visualize the class activation map (CAM) of the backbone network in Fig. 8 (a). CAM is a useful tool for understanding which regions the network focuses on to predict a class. We can observe that the backbone network can coarsely localize regions of ‘Person’, ‘Car’, and ‘Traffic Light’ in the image.

Box-Level Weak Supervision. To precisely localize the semantic regions, we introduce box-level weak supervision. The ground-truth box-level target is generated by directly filling the area within an annotation box with its corresponding class index. In this context, we add an auxiliary segmentation head on top of the backbone network to predict the target. Denoting the ground-truth box-level target mask as $\mathbf{G}$, and the predicted mask as $\mathbf{\hat{G}}$, we compute the BCE loss between them as

$$
\mathcal{H}(\mathbf{G},\mathbf{\hat{G}})=-\sum_{n=1}^{N}\left[G_{n}\log(\hat{G}_{n})+(1-G_{n})\log(1-\hat{G}_{n})\right], \tag{5}
$$

where $N$ denotes the number of elements in the mask.
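One way to build such a target is sketched below. The text does not specify how the class index is encoded for the BCE loss, so the one-binary-channel-per-class layout used here is an assumption, as are the function name and the `(x1, y1, x2, y2)` integer pixel box format.

```python
import numpy as np


def box_level_target(boxes, labels, num_classes, height, width):
    """Rasterise boxes into a (num_classes, H, W) binary target.

    boxes:  iterable of (x1, y1, x2, y2) integer pixel coordinates (assumed format).
    labels: class index of each box.
    """
    target = np.zeros((num_classes, height, width), dtype=np.float32)
    for (x1, y1, x2, y2), c in zip(boxes, labels):
        # Fill the box area in the channel of its class.
        target[c, y1:y2, x1:x2] = 1.0
    return target
```

With this layout, $N$ in Eq. (5) would count all elements of the multi-channel mask, and overlapping boxes of different classes do not overwrite each other.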

We visualize attention maps of the backbone network for different classes, as shown in Fig. 8 (b). Using the box-level weak supervision, the backbone network can precisely localize objects of interest, such as ‘Car’. Nevertheless, it may miss some useful information in the image. Therefore, we combine the CLIP-driven image-level weak supervision and the box-level weak supervision. The results presented in Fig. 8 (c) show that our weakly supervised learning method effectively enables the backbone network to localize the important semantic regions.

Effect Validation. When our weakly supervised learning method is employed, Fig. 5 (c) and (d) demonstrate that the domain gap between RGB and thermal features is reduced. This implies that the backbone network can extract information from RGB and thermal images without bias. To further illustrate this effect, we visualize the feature maps generated by the ResNet-50 [39] in Fig. 8. The generation process of these feature maps is as follows: First, we resize all features of the ResNet-50 across four stages to the same resolution as the input images. Then, we aggregate these features along the channel dimension using $\texttt{sum}(\texttt{softmax}(\mathbf{F},\texttt{dim=0})\otimes\mathbf{F},\texttt{dim=0})$, where $\mathbf{F}\in\mathbb{R}^{D\times H\times W}$ represents the concatenated feature. $D$, $H$, and $W$ denote its depth, height, and width, respectively. $\otimes$ denotes the element-wise product.
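The aggregation expression above can be written in NumPy as the following sketch. The function name is illustrative, the max-subtraction is a standard numerical-stability step that does not change the softmax, and the preceding resizing and concatenation of the stage features are assumed to have been done already.

```python
import numpy as np


def aggregate_features(feat):
    """Compute sum(softmax(F, dim=0) * F, dim=0) for F of shape (D, H, W)."""
    # Subtracting the per-pixel maximum leaves the softmax unchanged.
    shifted = feat - feat.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over the depth axis
    return (weights * feat).sum(axis=0)            # (H, W) map
```

At each pixel the result is a softmax-weighted average over depth, so it lies between the smallest and largest channel response and is dominated by the strongest channels.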

Fig. 8 (a) and (b) present the RGB and thermal images in one example scene. Fig. 8 (c) and (d) illustrate their corresponding feature maps. Fig. 8 (e) shows the RGB-T feature map without using our weakly supervised learning method. Fig. 8 (f) shows the feature map using our weakly supervised learning method. Observing Fig. 8 (e), we note that the ResNet-50 tends to acquire information primarily from the RGB image. In contrast, the feature map in Fig. 8 (f) demonstrates that our method enables the ResNet-50 to gather important information from both RGB and thermal images.

III-C Core Knowledge Distillation

Problem Description. To further improve the detection accuracy of the early-fusion strategy without increasing its computational cost, we introduce the knowledge distillation technique [19]. To achieve knowledge transfer, we instruct the student model to mimic intermediate features of the teacher model. In this process, a primary obstacle is that the student and teacher features have different numbers of channels. Previous works introduce convolution layers to align their feature channel numbers [20, 21], while neglecting whether the teacher’s knowledge is helpful to the student. To address this issue, we propose core knowledge distillation (CoreKD).

CoreKD Architecture. We use YOLOv5 [2] as an example and illustrate the knowledge distillation architecture in Fig. 9. In its architecture, we use the early-fusion single-branch structure as the student model and the medium-fusion two-branch structure as the teacher model. In the student model, a pair of RGB-T images is first concatenated, then fed into different network modules, and finally converted into predicted results. In the teacher model, the RGB and thermal images are respectively fed into different backbone networks. The generated multispectral features are fused in the feature space through concatenation and convolution operations. The fused features are then fed into the subsequent network modules and converted into predicted results. The predicted results of both the student and teacher models consist of bounding boxes and class-specific confidence scores.

CoreKD Formulation. Since we apply the same distillation techniques to different feature pyramid levels, we only describe the technique at one level and omit the subscript for simplicity. In the head modules of Fig. 9, we denote the input features of the student and teacher models as $\mathbf{X}^{\rm S}$ and $\mathbf{X}^{\rm T}$, respectively. Feature distillation typically transfers the teacher’s knowledge to the student by minimizing the loss [20]

$$
\mathcal{L}^{\prime\prime}_{\rm feat}=\|\mathcal{A}(\mathbf{X}^{\rm S})-\mathbf{X}^{\rm T}\|_{2}^{2}, \tag{6}
$$

where $\mathcal{A}$ denotes an adaptation layer used to match the channel dimensions between the student and teacher features. Previous works usually use a convolution layer as the adaptation layer [20, 21]. This approach aims to make $\mathcal{A}(\mathbf{X}^{\rm S})$ learn all information in the teacher feature $\mathbf{X}^{\rm T}$. However, they neglect whether all the information in $\mathbf{X}^{\rm T}$ is beneficial for downstream tasks, including classification and regression.

To address this problem, we revisit the structure of the head module in the teacher model. As shown in Fig. 9, the official implementation of YOLOv5 uses a ‘$1\times 1\;\texttt{Conv}$’ layer to output the predicted results

$$
\mathbf{\hat{Y}}^{\rm T}=\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T}),
$$

where $\mathbf{W}^{\rm T}$ denotes the weighting factor in the teacher’s head module. According to the 2D convolution formulation in Eq. (1), we can infer that the weighting factor $\mathbf{W}^{\rm T}$ reflects the importance of a channel map in $\mathbf{X}^{\rm T}$ for the downstream feature. We visualize the histogram of $\mathbf{W}^{\rm T}$ in Fig. 10. It is evident that most of the values in $\mathbf{W}^{\rm T}$ are close to 0. This implies that only a few feature representations in $\mathbf{X}^{\rm T}$ are important for the downstream tasks. We call these important feature representations the core knowledge in the teacher model.

To learn this core knowledge, we modify the feature loss Eq. (6) into

$$
\mathcal{L}^{\prime}_{\rm feat}=\|\texttt{Conv}(\mathcal{A}(\mathbf{X}^{\rm S});\mathbf{W}^{\rm T})-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})\|_{2}^{2}. \tag{7}
$$

This modification ensures that $\mathcal{A}(\mathbf{X}^{\text{S}})$ and $\mathbf{X}^{\text{T}}$ are projected into an identical space constructed by $\mathbf{W}^{\text{T}}$ , and that the projected features are close to each other. Furthermore, to avoid introducing the adaption layer $\mathcal{A}$ , we construct a core knowledge convolution (Core Knowledge Conv) operator by sampling the weighting factor $\mathbf{W}^{\rm T}$ . We denote the sampling process as $\mathcal{S}(\cdot)$ . In the process, we first obtain the channel dimension $d$ of the student feature $\mathbf{X}^{\rm S}$ , then sample the top- $d$ values along the ‘in_channel’ axis from $\mathbf{W}^{\rm T}$ based on their absolute values. Finally, we obtain the sampled weighting factor $\mathcal{S}(\mathbf{W}^{\rm T})$ . In this context, we rewrite the feature loss given in Eq. (7) as

\begin{aligned}\mathcal{L}_{\rm feat}&=\|\mathbf{\hat{Y}}^{\rm CT}-\mathbf{\hat{Y}}^{\rm T}\|_{2}^{2}\\&=\|\texttt{Conv}(\mathbf{X}^{\rm S};\mathcal{S}(\mathbf{W}^{\rm T}))-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})\|_{2}^{2},\end{aligned}

(8)

where $\mathbf{\hat{Y}}^{\rm CT}$ denotes the output of core knowledge convolution. When using this feature loss, we keep the weighting factor $\mathbf{W}^{\rm T}$ fixed and only compute the gradient with respect to the student feature $\mathbf{X}^{\rm S}$ .
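The sampling step $\mathcal{S}(\cdot)$ and the resulting feature loss can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: features are flattened to [channels][pixels], the $1\times 1$ convolution weights are stored as [out_channels][in_channels], the top- $d$ selection is done separately for each output channel, and the selected weights keep their original channel order. The names `sample_core_weights`, `conv1x1`, and `core_kd_loss` are hypothetical.

```python
def sample_core_weights(w_teacher, d):
    """For each output channel, keep the d in-channel weights with largest |value|."""
    sampled = []
    for row in w_teacher:
        top = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)[:d]
        # Keeping the original channel order is an assumption of this sketch.
        sampled.append([row[i] for i in sorted(top)])
    return sampled


def conv1x1(x, w):
    """Apply a 1x1 convolution: x is [C][N] (N pixels), w is [O][C]; returns [O][N]."""
    n_pixels = len(x[0])
    return [[sum(w_o[c] * x[c][p] for c in range(len(x))) for p in range(n_pixels)]
            for w_o in w]


def core_kd_loss(x_student, x_teacher, w_teacher):
    """Squared L2 distance between the core-knowledge output and the teacher output."""
    d = len(x_student)  # channel dimension of the student feature
    y_ct = conv1x1(x_student, sample_core_weights(w_teacher, d))
    y_t = conv1x1(x_teacher, w_teacher)
    return sum((a - b) ** 2 for ra, rb in zip(y_ct, y_t) for a, b in zip(ra, rb))


w_t = [[0.0, 2.0, -3.0, 0.1]]        # teacher head: 1 output, 4 input channels
x_t = [[5.0], [1.0], [1.0], [5.0]]   # teacher feature: 4 channels, 1 pixel
x_s = [[1.0], [1.0]]                 # student feature: 2 channels, 1 pixel
loss = core_kd_loss(x_s, x_t, w_t)
```

In an automatic-differentiation framework, the teacher weights would additionally be frozen so that gradients flow only to the student feature, as stated above.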

III-D Loss Function

Our efficient multispectral early-fusion (EME) single-branch model is trained using all the losses described above. The total loss is

\mathcal{L}_{\rm total}=\mathcal{L}_{\rm cls}+\mathcal{L}_{\rm reg}+\mathcal{L}_{\rm weak}+\mathcal{L}_{\rm feat},

(9)

where $\mathcal{L}_{\rm cls}$ and $\mathcal{L}_{\rm reg}$ represent the classification and regression losses defined by a detector [30, 31, 2], respectively. $\mathcal{L}_{\rm weak}$ is the summation of weakly supervised losses defined in Eq. (4) and Eq. (5):

\mathcal{L}_{\rm weak}=\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb})+\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad})+\mathcal{H}(\mathbf{G},\mathbf{\hat{G}}).

Table III: Comparison with image-fusion-based object detection approaches on the M3FD dataset [1].
(a) Dataset Splitting Method: Random Splitting
	Thermal [2]	RGB [2]	AUIF [40]	CDDF [24]	DDcGAN [41]	DIVF [25]	DenseF [42]	PSF [43]	RFN [44]	SeAF [45]	TarDAL [1]	U2F [46]	EME (Ours)
mAP	49.10	52.40	53.30	53.00	52.20	52.70	53.40	53.10	53.50	53.10	52.50	53.40	54.20
mAP50	77.30	81.90	81.90	80.90	81.60	81.50	81.70	82.00	81.70	82.20	81.00	81.90	83.40
Person	79.30	68.40	76.70	76.30	73.60	74.50	76.50	76.70	75.30	77.00	79.10	77.00	79.90
Car	87.90	90.80	91.00	91.00	90.70	91.10	91.40	90.80	91.00	91.10	90.50	91.20	91.80
Bus	87.20	92.20	90.00	90.10	90.70	91.60	89.40	90.10	89.40	91.20	89.40	90.70	89.20
Motor	70.00	74.00	72.60	69.20	74.80	73.50	72.80	73.30	73.30	72.20	70.30	71.30	76.20
TrafficLight	55.90	80.30	77.40	75.40	76.90	74.80	77.20	78.20	77.40	77.60	72.70	77.70	78.30
Truck	83.40	85.70	83.70	83.10	82.90	83.40	82.90	82.90	83.90	84.10	84.00	83.60	85.30
(b) Dataset Splitting Method: M3FD-zxSplit
	Thermal [2]	RGB [2]	AUIF [40]	CDDF [24]	DDcGAN [41]	DIVF [25]	DenseF [42]	PSF [43]	RFN [44]	SeAF [45]	TarDAL [1]	U2F [46]	EME (Ours)
mAP	34.90	36.10	38.30	38.60	37.10	37.10	38.90	38.00	38.20	38.90	39.10	38.70	41.50
mAP50	57.20	60.20	62.00	61.90	61.00	60.80	62.40	61.10	61.30	62.20	61.90	61.90	66.80
Person	74.60	55.90	72.20	71.90	67.30	67.60	72.30	71.70	70.50	72.50	75.50	72.40	77.60
Car	80.20	84.80	85.50	85.60	84.90	85.20	85.90	85.50	85.80	85.50	85.00	85.50	87.40
Bus	58.30	65.70	58.60	61.80	61.60	59.80	61.40	58.30	61.30	61.50	60.90	60.10	65.10
Motor	48.00	45.10	49.10	47.60	49.00	48.70	49.60	45.80	44.60	47.50	46.80	50.80	55.60
TrafficLight	27.30	56.80	49.80	48.70	49.10	51.20	48.60	50.90	49.70	50.80	46.90	48.00	54.90
Truck	54.80	52.70	56.70	55.50	53.80	52.60	56.80	54.70	55.80	55.70	56.70	54.70	60.30

IV Experiments

IV-A Experimental Setup

Datasets. Our experiments are conducted on the M3FD dataset [1] and the FLIR dataset [35]. The M3FD dataset contains 4200 pairs of well-aligned RGB and thermal images and covers 6 object classes: ‘Person’, ‘Car’, ‘Bus’, ‘Motorcycle’, ‘Traffic Light’, and ‘Truck’. Since this dataset does not provide a unified data split, previous works have used random splitting to determine the train and validation sets [1]. However, images in this dataset are sampled from video sequences, meaning that two adjacent frames may contain nearly identical content. Random splitting can therefore place adjacent frames in different sets, which results in information leakage between the train and validation sets. To address this problem, we first manually divide the dataset into 73 video segments based on different scenes. Then, we collect the first 70% of images in each video segment as the train set and the remaining images as the validation set. Finally, we obtain 2905 and 1295 pairs of RGB-T images in the train and validation sets, respectively. We name this data split ‘M3FD-zxSplit’ and release it to the public (https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection). For the performance evaluation in Section IV-B, we use this data split. When comparing with state-of-the-art approaches in Section IV-C, we employ both ‘M3FD-zxSplit’ and random splitting. Our random splitting refers to randomly selecting 80% of the images as the train set and the remaining images as the validation set. The FLIR dataset originally contains unaligned RGB-T image pairs. The work [57] develops a data-processing approach to align these images and obtains 7381, 1056, and 2111 image pairs in the train, validation, and test sets, respectively. This dataset contains 3 classes: ‘Person’, ‘Bicycle’, and ‘Car’.
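The segment-wise splitting procedure can be sketched as follows. This is an illustrative reconstruction rather than the released split script: the function name `split_by_segment`, the dictionary layout, and the floor rounding of the 70% boundary are assumptions.

```python
def split_by_segment(segments, train_percent=70):
    """Split each video segment chronologically into train and validation parts.

    segments: dict mapping a segment id to its frames in temporal order.
    train_percent: integer share of leading frames assigned to the train set.
    """
    train, val = [], []
    for seg_id in sorted(segments):
        frames = segments[seg_id]
        # Floor rounding of the boundary is an assumption of this sketch.
        n_train = len(frames) * train_percent // 100
        train.extend(frames[:n_train])
        val.extend(frames[n_train:])
    return train, val


# Toy example with two hypothetical segments of 10 and 5 frames.
toy_segments = {"scene_a": [f"a_{i:02d}" for i in range(10)],
                "scene_b": [f"b_{i:02d}" for i in range(5)]}
train_set, val_set = split_by_segment(toy_segments)
```

Because the boundary is placed once per segment, only the frames around that boundary are temporally adjacent across the two sets, whereas random splitting scatters adjacent frames throughout both sets.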

Evaluation Metrics. We use the standard mean Average Precision (mAP) with IoU thresholds ranging from 0.5 to 0.95 across various object scales as metrics.
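As a brief illustration of the metric, the sketch below computes the IoU of two boxes and lists the thresholds over which precision is averaged. The step of 0.05 between thresholds follows the COCO convention and is assumed here; the helper names are hypothetical, and the per-class averaging that yields the final mAP is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds 0.50, 0.55, ..., 0.95 (step 0.05 is the COCO convention, assumed here).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_per_threshold):
    """Average per-threshold AP values (dict keyed by threshold) into one score."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# A prediction that covers three quarters of the ground-truth box.
overlap = iou((0, 0, 4, 4), (1, 0, 4, 4))
passed = [t for t in IOU_THRESHOLDS if overlap >= t]
```

In this toy case the prediction counts as a match at the six loosest thresholds only, which is why mAP is a stricter measure of localization quality than mAP50, the score at the single threshold 0.5.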

Implementation Details. We incorporate our three key modules into commonly used one-stage detectors, including RetinaNet [30], GFL [31], and YOLOv5 [2]. For RetinaNet and GFL, we adopt the implementations in the MMDetection toolbox [58]. For YOLOv5, we use its official implementation [2]. We keep the training settings consistent with the corresponding baselines.

Inference Efficiency Evaluations. We assess the inference efficiency of our method (Python implementation) on the edge device NVIDIA AGX Orin with 64GB memory. We also evaluate the complexity of our method using FLOPs and the number of parameters. All results are presented in Tables I and II.

IV-B Performance Evaluation

Table I and Table II present the performance of our method on the M3FD [1] and FLIR [35] datasets. Key observations include: (1) the medium-fusion strategy adds more parameters and FLOPs compared to the early-fusion strategy; (2) the medium-fusion strategy achieves better performance compared to single-modality inputs, whereas the plain early-fusion strategy does not consistently improve performance; (3) our EME method, incorporating the ShaPE module, weakly supervised learning, and CoreKD techniques into the plain early-fusion strategy, achieves significant performance improvement without significantly increasing parameters and FLOPs; (4) the inference time of our EME method is longer than that of the baseline method, since the structural similarity computation process has not been optimized when calculating the self-gating mask; (5) our EME method can outperform the medium-fusion strategy in both performance and efficiency to some extent.
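Observations (3) to (5) can be made concrete with a short calculation on the GFL-Res50 rows of Table I. The numbers are copied from the table; the helper `relative_change` is a hypothetical name introduced only for this illustration.

```python
# Values copied from the GFL-Res50 rows of Table I (M3FD-zxSplit).
medium = {"flops_g": 94.110, "time_s": 0.172, "map": 34.60}
baseline = {"flops_g": 61.663, "time_s": 0.114, "map": 33.90}
eme = {"flops_g": 61.718, "time_s": 0.151, "map": 37.10}


def relative_change(new, ref):
    """Signed relative change of `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref


flops_vs_medium = relative_change(eme["flops_g"], medium["flops_g"])
time_vs_medium = relative_change(eme["time_s"], medium["time_s"])
time_vs_baseline = relative_change(eme["time_s"], baseline["time_s"])
map_gain_vs_medium = eme["map"] - medium["map"]
```

For this detector, EME uses roughly 34% fewer FLOPs and about 12% less inference time than medium fusion while gaining 2.5 mAP, at the cost of about 32% more inference time than the plain early-fusion baseline.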

Fig. 11 and Fig. 12 present visualization results for two example scenes from the M3FD [1] and FLIR [35] datasets, respectively. We observe that false positives or false negatives in the single-modality results may affect the plain early-fusion strategy. For instance, the person missed in Fig. 12 (e) is also absent in Fig. 12 (g), despite being detected in Fig. 12 (f). Moreover, false positives in Fig. 12 (f) affect the detection results of plain early-fusion, as shown in Fig. 12 (g). These phenomena confirm that the problem of information interference is a key obstacle to performance in the plain early-fusion strategy. Our EME effectively alleviates this problem.

IV-C Comparison with the State-of-The-Art Approaches

We use the one-stage YOLOv5 [2] detector as the baseline, and incorporate our proposed modules to construct the effective multispectral early-fusion (EME) model. Table III and Table IV compare our EME with previous state-of-the-art approaches on the M3FD [1] and FLIR [35] datasets.

In Table III, we compare our EME with 10 state-of-the-art image-fusion-based object detection approaches [40, 24, 41, 25, 42, 43, 44, 45, 1, 46]. We first generate fused images using their official implementations, and then train YOLOv5 [2] on these fused images with the same training settings. The results show that our EME achieves the best mAP and mAP50 under both splitting methods. We also observe that the results in Table III (a) are markedly higher than those in Table III (b); for example, our EME reaches 54.20 mAP with random splitting but 41.50 mAP with M3FD-zxSplit. This indicates that random splitting causes information leakage, which inflates the scores and makes it difficult to improve performance. Fig. 13 presents an example scene for visualization.

In Table IV, we compare our EME with 13 multispectral object detection approaches. These approaches include (1) medium-fusion strategies, such as CBF [47], MCG [48], MUN [48], CFR [50], GAFF [15], SMPD [52], MSAT [54], CSAA [55], and MFPT [56]; (2) domain adaptation and single-modality detection approaches, such as ODS [49], BU [51], and ThDe [53]; and (3) a late-fusion strategy, ProbEn3 [22]. The results show that our EME also achieves state-of-the-art performance on the FLIR dataset [35], with the highest mAP50 among the compared approaches.

V Conclusions

In this paper, we have proposed the effective multispectral early-fusion (EME) detector, which achieves both high performance and efficiency. We identify three performance obstacles, namely information interference, domain gap, and weak feature representation, and address them with a shape-priority early-fusion module, weakly supervised learning, and core knowledge distillation, respectively. Extensive experiments demonstrate the effectiveness and efficiency of our EME.

References

[1] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2022.
[2] G. Jocher, “YOLOv5 by Ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[3] Z. Chen and X. Huang, “Pedestrian Detection for Autonomous Vehicle Using Multi-Spectral Cameras,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 2, pp. 211–219, 2019.
[4] W. Zhou, S. Dong, M. Fang, and L. Yu, “CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene Parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 1919–1929, 2024.
[5] Y. Liu, C. Hu, B. Zhao, Y. Huang, and X. Zhang, “Region-Based Illumination-Temperature Awareness and Cross-Modality Enhancement for Multispectral Pedestrian Detection,” IEEE Transactions on Intelligent Vehicles, pp. 1–12, 2024.
[6] M. A. Farooq, W. Shariff, and P. Corcoran, “Evaluation of Thermal Imaging on Embedded GPU Platforms for Application in Vehicular Assistance Systems,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1130–1144, 2022.
[7] M. Ding, W.-H. Chen, and Y.-F. Cao, “Thermal Infrared Single-Pedestrian Tracking for Advanced Driver Assistance System,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 814–824, 2023.
[8] W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-Aware Network With Hierarchical Multimodal Fusion for RGB-T Urban Scene Understanding,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 48–58, 2023.
[9] Y. Guo, H. Kong, and S. Gu, “Unsupervised Multi-Spectrum Stereo Depth Estimation for All-Day Vision,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 501–511, 2024.
[10] Y. Zhu, C. Li, J. Tang, and B. Luo, “Quality-Aware Feature Aggregation Network for Robust RGB-T Tracking,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 121–130, 2021.
[11] J. Liu, S. Zhang, S. Wang, and D. N. Metaxas, “Multispectral Deep Neural Networks for Pedestrian Detection,” in Proceedings of the British Machine Vision Conference, 2016.
[12] J. Wagner, V. Fischer, M. Herman, S. Behnke et al., “Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks,” in Proceedings of the European Symposium on Artificial Neural Networks, vol. 587, 2016, pp. 509–514.
[13] Q. Xie, T.-Y. Cheng, Z. Dai, V. Tran, N. Trigoni, and A. Markham, “Illumination-Aware Hallucination-Based Domain Adaptation for Thermal Pedestrian Detection,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[14] T. Liu, K.-M. Lam, R. Zhao, and G. Qiu, “Deep Cross-Modal Representation Learning and Distillation for Illumination-Invariant Pedestrian Detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 315–329, 2021.
[15] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Guided Attentive Feature Fusion for Multispectral Pedestrian Detection,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2021, pp. 72–80.
[16] B. Yin, X. Zhang, Z. Li, L. Liu, M.-M. Cheng, and Q. Hou, “DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation,” in Proceedings of the International Conference on Learning Representations, 2024.
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models from Natural Language Supervision,” in Proceedings of the International Conference on Machine Learning, vol. 139, 2021, pp. 8748–8763.
[18] Z. Lai, N. Vesdapunt, N. Zhou, J. Wu, C. P. Huynh, X. Li, K. K. Fu, and C.-N. Chuah, “PADCLIP: Pseudo-Labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 16 109–16 119.
[19] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in Proceedings of the Advances in Neural Information Processing Systems Workshop, 2015.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for Thin Deep Nets,” in Proceedings of the International Conference on Learning Representations, 2015.
[21] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning Efficient Object Detection Models with Knowledge Distillation,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[22] Y.-T. Chen, J. Shi, Z. Ye, C. Mertz, D. Ramanan, and S. Kong, “Multimodal Object Detection via Probabilistic Ensembling,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 139–158.
[23] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Low-Cost Multispectral Scene Analysis with Modality Distillation,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2022, pp. 803–812.
[24] Z. Zhao, H. Bai, J. Zhang, Y. Zhang, S. Xu, Z. Lin, R. Timofte, and L. Van Gool, “CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2023, pp. 5906–5916.
[25] L. Tang, X. Xiang, H. Zhang, M. Gong, and J. Ma, “DIVFusion: Darkness-Free Infrared and Visible Image Fusion,” Information Fusion, vol. 91, pp. 477–493, 2023.
[26] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, “Weakly Supervised Object Localization and Detection: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5866–5885, 2021.
[27] Y. Zhang, H. Yu, Y. He, X. Wang, and W. Yang, “Illumination-Guided RGBT Object Detection with Inter- and Intra-Modality Fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
[28] X. Zhang, X. Zhang, Z. Sheng, and H.-L. Shen, “TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection,” arXiv preprint arXiv:2305.16580, 2023.
[29] Z. Li, P. Xu, X. Chang, L. Yang, Y. Zhang, L. Yao, and X. Chen, “When Object Detection Meets Knowledge Distillation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” in Proceedings of the International Conference on Computer Vision, 2017, pp. 2980–2988.
[31] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 002–21 012, 2020.
[32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
[35] “FREE FLIR Thermal Dataset for Algorithm Training,” https://www.flir.com/oem/adas/adas-dataset-form/.
[36] R. Abdelfattah, Q. Guo, X. Li, X. Wang, and S. Wang, “CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 1348–1357.
[37] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep Mutual Learning,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
[38] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting Knowledge Distillation via Label Smoothing Regularization,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2020, pp. 3903–3911.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[40] Z. Zhao, S. Xu, J. Zhang, C. Liang, C. Zhang, and J. Liu, “Efficient and Model-Based Infrared and Visible Image Fusion via Algorithm Unrolling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1186–1196, 2022.
[41] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, “DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 4980–4995, 2020.
[42] H. Li and X.-J. Wu, “DenseFuse: A Fusion Approach to Infrared and Visible Images,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2019.
[43] L. Tang, H. Zhang, H. Xu, and J. Ma, “Rethinking the Necessity of Image Fusion in High-Level Vision Tasks: A Practical Infrared and Visible Image Fusion Network Based on Progressive Semantic Injection and Scene Fidelity,” Information Fusion, vol. 99, p. 101870, 2023.
[44] H. Li, X.-J. Wu, and J. Kittler, “RFN-Nest: An End-to-End Residual Fusion Network for Infrared and Visible Images,” Information Fusion, vol. 73, pp. 72–86, 2021.
[45] L. Tang, J. Yuan, and J. Ma, “Image Fusion in the Loop of High-Level Vision Tasks: A Semantic-Aware Real-Time Infrared and Visible Image Fusion Network,” Information Fusion, vol. 82, pp. 28–42, 2022.
[46] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2Fusion: A Unified Unsupervised Image Fusion Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[47] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional Block Attention Module,” in Proceedings of the European Conference on Computer Vision, 2018.
[48] C. Devaguptapu, N. Akolekar, M. M. Sharma, and V. N. Balasubramanian, “Borrow from Anywhere: Pseudo Multi-Modal Object Detection in Thermal Imagery,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[49] F. Munir, S. Azam, M. A. Rafique, A. M. Sheri, and M. Jeon, “Thermal Object Detection using Domain Adaptation through Style Consistency,” arXiv preprint arXiv:2006.00821, 2020.
[50] H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks,” in Proceedings of the International Conference on Image Processing, 2020, pp. 276–280.
[51] M. Kieu, A. D. Bagdanov, and M. Bertini, “Bottom-Up and Layerwise Domain Adaptation for Pedestrian Detection in Thermal Images,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 17, no. 1, 2021.
[52] Q. Li, C. Zhang, Q. Hu, P. Zhu, H. Fu, and L. Chen, “Stabilizing Multispectral Pedestrian Detection with Evidential Hybrid Fusion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 3017–3029, 2024.
[53] Y. Cao, T. Zhou, X. Zhu, and Y. Su, “Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery,” in Proceedings of the International Conference on Computer and Communications, 2019, pp. 1965–1969.
[54] S. You, X. Xie, Y. Feng, C. Mei, and Y. Ji, “Multi-Scale Aggregation Transformers for Multispectral Object Detection,” IEEE Signal Processing Letters, vol. 30, pp. 1172–1176, 2023.
[55] Y. Cao, J. Bin, J. Hamari, E. Blasch, and Z. Liu, “Multimodal Object Detection by Channel Switching and Spatial Attention,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2023, pp. 403–411.
[56] Y. Zhu, X. Sun, M. Wang, and H. Huang, “Multi-Modal Feature Pyramid Transformer for RGB-Infrared Object Detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 9, pp. 9984–9995, 2023.
[57] V. Sam, K. Ali, M. Christian, K. Laurent, and E. Lutz, “Robust Environment Perception for Automated Driving: A Unified Learning Pipeline for Visual-Infrared Object Detection,” in IEEE Intelligent Vehicles Symposium, 2022, pp. 367–374.
[58] MMDetection Contributors, “OpenMMLab Detection Toolbox and Benchmark,” 2018. [Online]. Available: https://github.com/open-mmlab/mmdetection

Xue Zhang received his B.S. and M.Sc. degrees from Shandong Jianzhu University and Shandong University, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are optical compressive imaging, image classification, object detection, and knowledge distillation.

Si-Yuan Cao received his B.Eng. degree in electronic information engineering from Tianjin University in 2016, and Ph.D. degree in electronic science and technology from Zhejiang University in 2022. He is currently an Assistant Researcher in Ningbo Innovation Center, Zhejiang University, China. His research interests are multispectral/multimodal image registration, homography estimation, place recognition, and image processing.

Fang Wang received her B.Eng. and Ph.D. in Design and Construction of Naval Architecture and Ocean Structure from Harbin Engineering University in 2007 and 2012, respectively. She is currently an Associate Professor with the School of Information and Electrical Engineering, Hangzhou City University. Her current research interests include the autonomous control of unmanned vehicles, and urban air mobility.

Runmin Zhang received his B.Eng. degree from Zhejiang University in 2022. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University, China. His research interests are image registration and multimodal image restoration.

Zhe Wu received his B.Eng. degree from Xidian University in 2019 and M.Sc. degree from the University of Edinburgh in 2020. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are image restoration and image enhancement.

Xiaohan Zhang received the B.Eng. degree from Beijing Jiaotong University, China, in 2022. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are object detection and image processing.

Xiaokai Bai received his B.Eng. degree from Zhejiang University in 2023. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University, China. His research interests are 3D object detection and autonomous driving.

Hui-Liang Shen (Senior Member, IEEE) received his B.Eng. and Ph.D. degrees in Electronic Engineering from Zhejiang University, Hangzhou, China, in 1996 and 2002, respectively. He was a Research Associate and Research Fellow with The Hong Kong Polytechnic University from 2001 to 2005. He is currently a Professor with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. His research interests are multispectral imaging, image processing, computer vision, deep learning, and machine learning.

Table I: Performance and efficiency on the M3FD dataset [1] (M3FD-zxSplit).
	Detector	FLOPs (↓)	Parameters (↓)	Time (↓)	mAP (↑)	mAP50 (↑)	Person (↑)	Car (↑)	Bus (↑)	Motor (↑)	TrafficLight (↑)	Truck (↑)
RGB	RetinaNet-Res50	61.893G	36.434M	0.106s	30.90	51.10	44.60	74.50	58.10	44.10	36.00	49.50
Thermal	RetinaNet-Res50	61.893G	36.434M	0.106s	29.30	47.10	59.50	71.20	55.00	35.80	10.30	50.60
RGB-T Medium Fusion	RetinaNet-Res50	94.611G	47.582M	0.170s	33.50	53.70	60.20	77.30	61.70	45.50	25.80	51.80
Baseline: RGB-T Early Fusion	RetinaNet-Res50	62.164G	36.434M	0.110s	32.10	50.90	58.80	75.50	59.50	40.00	22.00	49.30
+ ShaPE	RetinaNet-Res50	62.218G	36.434M	0.149s	32.90	52.30	61.00	77.20	59.30	40.40	24.40	51.50
+ ShaPE + WeakSup.	RetinaNet-Res50	62.218G	36.434M	0.149s	33.50	52.70	59.70	76.60	63.10	38.60	23.70	54.70
+ ShaPE + WeakSup. + CoreKD	RetinaNet-Res50	62.218G	36.434M	0.149s	33.70	53.10	60.90	76.50	59.30	42.90	25.80	53.30
RGB	GFL-Res50	61.392G	32.270M	0.110s	32.40	53.12	48.80	77.80	60.00	43.70	38.40	50.00
Thermal	GFL-Res50	61.392G	32.270M	0.110s	29.80	48.70	64.60	73.90	54.20	36.50	15.20	47.70
RGB-T Medium Fusion	GFL-Res50	94.110G	43.419M	0.172s	34.60	55.30	65.20	79.80	62.60	39.50	35.20	49.80
Baseline: RGB-T Early Fusion	GFL-Res50	61.663G	32.271M	0.114s	33.90	53.07	64.00	78.40	54.50	39.50	29.70	52.30
+ ShaPE	GFL-Res50	61.718G	32.271M	0.151s	35.40	55.70	65.80	79.10	63.00	41.90	30.20	54.20
+ ShaPE + WeakSup.	GFL-Res50	61.718G	32.271M	0.151s	35.90	56.30	65.10	79.80	66.00	41.80	30.90	54.00
+ ShaPE + WeakSup. + CoreKD	GFL-Res50	61.718G	32.271M	0.151s	37.10	57.60	68.50	81.30	63.30	42.70	35.90	53.90

Table II: Performance and efficiency on the FLIR dataset [35].
	Detector	FLOPs (↓)	Parameters (↓)	Time (↓)	mAP (↑)	mAP50 (↑)	Person (↑)	Bicycle (↑)	Car (↑)
RGB	RetinaNet-Res50	61.893G	36.434M	0.106s	28.10	59.50	45.10	55.70	77.90
Thermal	RetinaNet-Res50	61.893G	36.434M	0.106s	38.50	71.00	62.40	66.30	84.30
RGB-T Medium Fusion	RetinaNet-Res50	94.611G	47.582M	0.170s	38.60	71.50	61.90	67.50	85.10
Baseline: RGB-T Early Fusion	RetinaNet-Res50	62.164G	36.434M	0.110s	37.40	69.50	60.40	63.90	84.30
+ ShaPE	RetinaNet-Res50	62.218G	36.434M	0.149s	38.60	71.40	61.30	68.30	84.50
+ ShaPE + WeakSup.	RetinaNet-Res50	62.218G	36.434M	0.149s	38.40	71.60	61.50	68.40	85.00
+ ShaPE + WeakSup. + CoreKD	RetinaNet-Res50	62.218G	36.434M	0.149s	38.60	71.80	61.90	68.30	85.00
RGB	GFL-Res50	61.392G	32.270M	0.110s	31.90	63.70	51.70	57.60	81.80
Thermal	GFL-Res50	61.392G	32.270M	0.110s	42.60	75.10	69.60	68.80	86.90
RGB-T Medium Fusion	GFL-Res50	94.110G	43.419M	0.172s	42.70	75.80	69.80	70.10	87.70
Baseline: RGB-T Early Fusion	GFL-Res50	61.663G	32.271M	0.114s	42.20	74.70	69.70	67.50	87.00
+ ShaPE	GFL-Res50	61.718G	32.271M	0.151s	42.60	75.60	69.70	70.10	87.10
+ ShaPE + WeakSup.	GFL-Res50	61.718G	32.271M	0.151s	43.10	76.60	71.10	70.70	87.90
+ ShaPE + WeakSup. + CoreKD	GFL-Res50	61.718G	32.271M	0.151s	44.00	78.10	73.10	72.40	88.80

Table IV: Comparison with multispectral object detection approaches on the FLIR dataset [35].
	CBF [47]	MCG [48]	MUN [48]	ODS [49]	CFR [50]	GAFF [15]	BU [51]	SMPD [52]	ThDe [53]	MSAT [54]	CSAA [55]	MFPT [56]	ProbEn3 [22]	EME (Ours)
mAP50	67.20	61.40	61.54	69.62	72.39	72.90	73.20	73.58	74.60	76.20	79.20	80.00	83.76	84.80
Bicycle	60.50	50.26	49.43	55.53	55.77	-	57.40	56.20	60.04	-	-	67.70	73.49	79.80
Car	83.60	70.63	70.72	82.33	84.91	-	86.50	85.80	85.52	-	-	89.00	90.14	92.80
Person	57.60	63.31	64.47	71.01	74.49	-	75.60	78.74	78.24	-	-	83.20	87.65	81.20