
Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, and
Hui-Liang Shen, Senior Member, IEEE
This work was supported in part by the National Key R&D Program of China under grant 2023YFB3209800, in part by the Natural Science Foundation of Zhejiang Province under grant D24F020006, in part by the National Natural Science Foundation of China under grant 62301484, and in part by the Jinhua Science and Technology Bureau Project. (Corresponding authors: Si-Yuan Cao and Hui-Liang Shen.) X. Zhang, R. Zhang, Z. Wu, X. Zhang, and X. Bai are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: zxue2019@zju.edu.cn, runmin_zhang@zju.edu.cn, jeffw@zju.edu.cn, zhangxh2023@zju.edu.cn, shawnnnkb@zju.edu.cn). S.-Y. Cao is with the Ningbo Research Institute, College of Information Science and Electronic Engineering, Zhejiang University, China (e-mail: cao_siyuan@zju.edu.cn). F. Wang is with the School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China (e-mail: wangf@zucc.edu.cn). H.-L. Shen is with the College of Information Science and Electronic Engineering, Jinhua Institute, Zhejiang University, and also with the Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, China (e-mail: shenhl@zju.edu.cn).
Abstract

Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is becoming increasingly pronounced, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the causes of the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. In addition, we find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are also key obstacles to performance. Focusing on these three problems, we propose corresponding solutions, including a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.

Index Terms:
Multispectral object detection; feature fusion; weakly supervised learning; knowledge distillation
Figure 1: Multispectral object detection and fusion strategies. (a) In Scene-1, objects are easier to detect in the thermal image. (b) In Scene-2, objects are easier to detect in the RGB image. (c) Early-fusion strategy. (d) Medium-fusion strategy. (e) Late-fusion strategy. (f) Detection results of different strategies on the M3FD dataset [1]. YOLOv5 [2] is adopted as the baseline in this experiment. The area of each circle denotes the number of parameters.

I Introduction

Multispectral object detection has been widely studied, since multispectral images can provide complementary information to achieve consistent detection in various lighting conditions [3, 4, 5, 6, 7, 8, 9, 10]. This complementarity is illustrated in Fig. 1 (a) and (b). Given the multispectral inputs, modern multispectral detectors develop three fusion strategies: early-fusion, medium-fusion, and late-fusion, shown in Fig. 1 (c) - (e). The medium and late-fusion strategies often achieve superior performance compared to early-fusion [11, 12, 13, 5, 4, 14]. However, they use a two-branch structure, making model deployment on edge devices expensive. In contrast, the early-fusion strategy adopts a simple single-branch structure, facilitating deployment on edge devices. Nevertheless, its performance is low, and there are few works to address this problem, resulting in an increasing gap between high performance and high efficiency.

To resolve this conflict, in this work we focus on improving the performance of the early-fusion strategy while maintaining its high efficiency. We first conduct pilot studies and observe that a plain early-fusion strategy cannot consistently obtain improved performances compared to single-modality inputs. Based on this observation, we rethink the early-fusion strategy and summarize three key obstacles: 1) the information interference problem when simply concatenating the multispectral images, 2) the domain gap existing in thermal and RGB images, and 3) the weak feature representation of the single-branch structure. Focusing on these obstacles, we propose corresponding solutions.
- Information interference problem refers to the potential suppression of important information in one modality by another. In the plain early-fusion strategy, previous works [15] typically feed concatenated multispectral images into a convolution layer and generate a fused feature. The convolution layer generally has a small receptive field. Therefore, with only limited context, it is hard for this approach to determine which modality's information is important. We address this issue by first recognizing that object shapes are agnostic to visible and infrared wavelengths, and then devising a module that fuses multispectral images based on object shape saliency, named the shape-priority early-fusion (ShaPE) module.
- Domain gap between RGB and thermal images is usually neglected in previous works. They generally adopt an RGB pre-trained backbone network to extract features from both RGB and thermal images [13, 5]. However, the domain gap may cause a representation distribution shift. This issue is also recognized in the work [16] on an RGB-D task. Different from previous works, we introduce a weakly supervised learning method to address this issue. Within this method, the backbone network jointly uses RGB and thermal images to learn the representation of CLIP [17], since CLIP has demonstrated promising zero-shot generalizability in bridging the domain gap [18]. Additionally, we introduce a segmentation auxiliary branch. Our method allows the backbone network to reduce representation shifts and improve semantic localization ability.
- Weak feature representation problem results from the early-fusion strategy employing a single-branch structure. This structure has fewer parameters and simpler fusion modules compared to medium and late-fusion strategies. We address this issue by introducing the knowledge distillation (KD) technique [19]. In KD, a key problem is how to align the feature dimensions between teacher and student models. Previous works generally introduce a convolution layer for the student model to learn all knowledge from the teacher model [20, 21]. However, we show that not all information in the teacher model is helpful for downstream tasks. Therefore, we introduce core knowledge distillation (CoreKD) to transfer the most crucial knowledge for specific downstream tasks, resembling the human learning process where the teacher highlights key knowledge for quick understanding and absorption by the students.

Experimental results validate that our efficient multispectral early-fusion (EME) detector achieves a significant performance improvement without considerably increasing the number of parameters, as shown in Fig. 1 (f). Besides, our EME outperforms the previous state-of-the-art approaches. In summary, our contributions are threefold:

  • Different from previous works, we summarize three key obstacles limiting the early-fusion strategy, including information interference, domain gap, and representation learning.

  • For each obstacle, we propose the corresponding solution: we develop 1) a ShaPE module to address the information interference issue, 2) a weakly supervised learning method to reduce domain gap and improve semantic localization abilities, and 3) a CoreKD to enhance the representation learning of single-branch networks.

  • Extensive experiments validate that the early-fusion strategy, equipped with our ShaPE module, weakly supervised learning, and CoreKD technique, shows significant improvement. Additionally, we only retain the ShaPE module during the inference phase. Consequently, our method is efficient and achieves improved performance.

II Related Work

In this section, we offer a brief overview of multispectral object detection and introduce related works in weakly supervised learning and knowledge distillation.

II-A Multispectral Object Detection

According to fusion strategies, multispectral object detection can be classified into three categories: early-fusion, medium-fusion and late-fusion strategies. Previous works [11], [12] and [22] confirm that both medium-fusion and late-fusion strategies outperform the early-fusion strategy.

However, both the medium-fusion and late-fusion strategies adopt a two-branch structure that limits their use on resource-limited edge devices. Previous works notice this weakness and provide some solutions. For example, in [14], a model using the medium-fusion strategy is first trained as a teacher, and its knowledge is transferred to a student model. The student model only receives RGB images as inputs. Although it saves resources, it discards important complementary information from thermal images. The work [13] introduces a domain adaptation technique. It uses a medium-fusion model to guide the learning of a single-branch model, which only receives thermal images as inputs and therefore discards complementary information from RGB images. To employ complementary information while saving computational resources, [23] transfers knowledge from a medium-fusion model to an early-fusion model. Nevertheless, it neglects the information interference problem. Some works in the image fusion field [1, 24, 25] demonstrate that fused images can improve detectors, but the fusion process still introduces an additional computational burden.

Different from previous works, we identify the information interference problem in early-fusion strategies. By addressing this problem, we fully employ the complementary information in multispectral images, without significantly increasing computational burden.

II-B Weakly Supervised Learning in Object Detection

Weakly supervised learning has received much attention in object localization and detection, as comprehensively surveyed in [26]. Recent works in multispectral object detection adopt this technique. Based on the weak annotations they utilize, we can coarsely divide them into image-level and box-level weakly supervised learning approaches.

In image-level weakly supervised learning approaches, previous works mainly employ the illumination condition of RGB images as weighting factors to determine the modality importance [14, 5, 27, 13]. In box-level approaches, previous works [15, 28] mainly employ the bounding-box annotations to generate masks. They use these masks to construct spatial attention mechanisms, highlighting representations within target regions.

Different from previous works, we use weakly supervised learning to address the domain gap problem in RGB and thermal images. We employ image-level labels to construct a multi-label classification auxiliary task. This task can fully exploit the complementary information in multispectral images, instead of solely using information from one modality. Along with the powerful CLIP model [17] and box-level weak labels, our method can reduce the domain gap and obtain precise semantic localization abilities.

II-C Knowledge Distillation

Knowledge distillation was first introduced in [19]. It aims to improve a lightweight student model by learning knowledge from a high-capacity teacher model. According to the distillation approach, this technique can be roughly divided into two groups: logit distillation [19] and feature distillation [20]. The former lets a student model learn the logits of a teacher model, while the latter lets a student model learn the features of a teacher model. These distillation approaches are also applied to object detection [29, 21]. Recently, some works in multispectral object detection also employ the knowledge distillation technique [23, 14]. In the distillation process, they generally introduce a projection layer to align the number of feature channels between teacher and student. The purpose of this approach is to learn all representations in the teacher model.
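As a rough illustration of the projection-based feature distillation described above, the following NumPy sketch projects student features to the teacher's channel number with a 1×1 projection and compares them to the teacher features. The mean-squared-error objective, the function name `feature_distillation_loss`, and all tensor shapes are assumptions chosen for illustration; they are not taken from the cited works.

```python
import numpy as np

def feature_distillation_loss(student_feat, teacher_feat, projection):
    """MSE between teacher features and 1x1-projected student features.

    student_feat: (C_s, H, W), teacher_feat: (C_t, H, W), projection: (C_t, C_s).
    A 1x1 convolution is a per-pixel linear map over channels, hence the einsum.
    """
    projected = np.einsum("ts,shw->thw", projection, student_feat)
    return float(np.mean((projected - teacher_feat) ** 2))

# Hypothetical shapes: 4 student channels, 8 teacher channels, 5x5 feature map.
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 5, 5))
projection = rng.standard_normal((8, 4))
perfect_teacher = np.einsum("ts,shw->thw", projection, student)

loss_matched = feature_distillation_loss(student, perfect_teacher, projection)
loss_offset = feature_distillation_loss(student, perfect_teacher + 1.0, projection)
```

In this formulation every teacher channel contributes equally to the loss, which is the "learn all representations" behavior discussed in the text.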

Different from previous works, we first confirm that not all information in teacher features is beneficial to downstream tasks, including classification and regression. Based on this, we propose a core knowledge distillation technique to transfer the most important features for the downstream tasks to the student model.

III Method

Fig. 2 illustrates the overview of our method. We adopt a single-branch structure as the baseline model considering its low memory cost. To boost its performance, we develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation (CoreKD). In the following, we describe the ShaPE module in Section III-A, the weakly supervised auxiliary learning method in Section III-B, and the CoreKD in Section III-C. These contributions are designed to enhance information fusion, feature extraction, and feature classification abilities, respectively.

Figure 2: Overview of our method. We adopt the single-branch structure as the baseline model and develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation. The ShaPE module remains in both the inference and training phases, while the other two modules are removed in the inference phase.

III-A Shape-Priority Early-Fusion Module

Observation. Given a pair of RGB-T images, the plain early-fusion strategy concatenates them in the channel dimension and then feeds them into a detector. With the plain strategy, we conduct pilot studies on the M3FD [1] dataset. We first train three commonly used one-stage detectors: RetinaNet [30], GFL [31] and YOLOv5 [2]. Then, we compute the mean values and standard deviations of their detection results and illustrate the computed results in Fig. 3. Besides, we also train these detectors using single-modality images as input for comparisons. We have the following two observations. First, the plain early-fusion strategy cannot achieve consistent improvement compared with single-modality input. Second, for objects that require color to identify, such as ‘Traffic Light’, the plain early-fusion strategy yields worse results than the RGB input.

Figure 3: Pilot studies conducted on the M3FD [1] dataset. We use three detectors as baselines: RetinaNet [30], GFL [31] and YOLOv5 [2]. Each bar and error bar represent the mean value and standard deviation of the results obtained by these three detectors. ‘RGB’ represents detectors that only take RGB images as inputs, while ‘T’ represents detectors that only take thermal images as inputs. ‘PlainRGB-T’ denotes detectors that use the plain early-fusion strategy. The ‘All’ column illustrates the mAP50 for all classes, and the other columns illustrate the AP50 for specific classes. Red lines indicate cases where the plain RGB-T early-fusion strategy obtains worse results than detectors that use single-modality inputs.

Motivation. We attribute the above phenomena to the convolutional inductive bias, namely, local connectivity and weight sharing. The process of 2D convolution involves two steps: (1) sampling across the concatenated RGB-T images using a regular grid $\mathcal{R}$; (2) summing the sampled values with weighting factor $\mathbf{W}$. The grid $\mathcal{R}$ determines both the receptive field size and dilation. For example,

$$\mathcal{R}=\left\{(-3,-3),(-3,-2),\dots,(2,3),(3,3)\right\}$$

defines a 7×7 kernel with dilation 1. For each position $\mathbf{p}_{0}$ on an output feature map $\mathrm{O}$, we have

$$\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\mathrm{rgb},\mathrm{t}\}}\mathbf{W}_{j}(\mathbf{p}_{n})\cdot\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{1}$$

where $\mathbf{p}_{n}$ enumerates the positions in $\mathcal{R}$.

This process indicates that the plain early-fusion strategy is a pixel-level weighting method, with weights learned from data. However, the limited receptive field of pixel-level weighting makes it difficult for the learned weights to determine which modality is important. This weakness may result in valuable information from one modality being suppressed by another. As an example, Fig. 4 (c) depicts the feature map generated from the RGB-T images of Fig. 4 (a) and (b) using the plain early-fusion strategy. It can be observed from the close-up that the ‘Traffic Light’ in the fused feature map does not preserve the significant information of the RGB image.
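To make Eq. (1) concrete, the following NumPy sketch evaluates it directly. It is a simplified illustration under several assumptions that are not from the paper: each modality is a single-channel image, there is one output channel, the grid is 3×3 instead of 7×7, borders are zero-padded, and the weights are random. The function names are hypothetical. The sketch also checks that convolving the channel-concatenated input gives the same result as summing the per-modality responses, which is why the plain strategy amounts to a learned pixel-level weighting.

```python
import numpy as np

def conv2d_single(image, weight):
    """Cross-correlate one 2D image with one k x k kernel (zero padding, stride 1)."""
    k = weight.shape[0]
    r = k // 2
    h, w = image.shape
    padded = np.pad(image, r)
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            # weight[dy, dx] plays the role of W_j(p_n) with p_n = (dy - r, dx - r).
            out += weight[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def plain_early_fusion(images, weights):
    """Eq. (1): sum over modalities j and offsets p_n of W_j(p_n) * I_j(p_0 + p_n)."""
    return sum(conv2d_single(images[j], weights[j]) for j in images)

def conv2d_concat(stacked, kernel):
    """The same operation on a (C, H, W) concatenated input with a (C, k, k) kernel."""
    _, h, w = stacked.shape
    r = kernel.shape[1] // 2
    padded = np.pad(stacked, ((0, 0), (r, r), (r, r)))
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(kernel * padded[:, y:y + 2 * r + 1, x:x + 2 * r + 1])
    return out

rng = np.random.default_rng(0)
images = {"rgb": rng.random((6, 8)), "t": rng.random((6, 8))}
weights = {"rgb": rng.standard_normal((3, 3)), "t": rng.standard_normal((3, 3))}

fused = plain_early_fusion(images, weights)
fused_concat = conv2d_concat(
    np.stack([images["rgb"], images["t"]]),
    np.stack([weights["rgb"], weights["t"]]),
)
```

Note that the weights depend only on the offset inside the grid, not on the image content, so the relative importance of the two modalities is fixed after training for every location.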

The straightforward solutions to this weakness are: (1) enlarging the receptive field by using a larger kernel or more convolutional layers so that the model can judge the modality importance based on a broader range of contexts, or (2) increasing the number of convolutional kernels so that the model can learn more representations. However, these solutions increase memory costs and computational burden, making them unfriendly to edge devices.

ShaPE Module. We realize that shape is an inherent attribute of an object. Any object visible in both RGB and thermal images has a consistent shape across the two modalities. Thus, we consider the salience of shape as a modifying factor to adaptively determine the modality importance, and design the shape-priority early-fusion (ShaPE) module. In the ShaPE module, the RGB and thermal images are modified by self-gating masks. In this context, Eq. (1) becomes:

$$\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\mathrm{rgb},\mathrm{t}\}}\mathbf{W}_{j}(\mathbf{p}_{n})\cdot\mathbf{M}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n})\cdot\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{2}$$

where $\mathbf{M}_{\mathrm{rgb}}$ and $\mathbf{M}_{\mathrm{t}}$ denote the self-gating masks of RGB and thermal images, respectively.
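One way to read Eq. (2), offered here as an interpretation rather than a statement from the authors: because each mask is evaluated at the same location $\mathbf{p}_{0}+\mathbf{p}_{n}$ as the image it multiplies, the gating can be folded into the input. With a gated image $\tilde{\mathbf{I}}_{j}$ (a symbol introduced here only for illustration), Eq. (2) has the same form as Eq. (1):

```latex
\tilde{\mathbf{I}}_{j}(\mathbf{p}) = \mathbf{M}_{j}(\mathbf{p}) \cdot \mathbf{I}_{j}(\mathbf{p}),
\qquad
\mathrm{O}(\mathbf{p}_{0}) = \sum_{\mathbf{p}_{n}\in\mathcal{R}} \sum_{j\in\{\mathrm{rgb},\mathrm{t}\}}
\mathbf{W}_{j}(\mathbf{p}_{n}) \cdot \tilde{\mathbf{I}}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}).
```

Under this reading, the gating adds one elementwise multiplication per modality before the first convolution and leaves the kernel size unchanged.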

In the following, we describe the generation process of self-gating masks $\mathbf{M}_{\mathrm{rgb}}$ and $\mathbf{M}_{\mathrm{t}}$. Since our ShaPE module focuses on the shapes of objects and structural contributions of different modalities to the fused features, we employ the gradients and structural similarities in our method. For easy understanding, we visualize some important intermediate results in Fig. 4. Given the RGB-T images as shown in Fig. 4 (a) and (b), we compute their gradients

$$\begin{aligned}
\nabla\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0}) &= \sqrt{(\nabla_{x}\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0}))^{2}+(\nabla_{y}\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0}))^{2}},\\
\nabla\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0}) &= \sqrt{(\nabla_{x}\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0}))^{2}+(\nabla_{y}\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0}))^{2}},
\end{aligned}$$

as shown in Fig. 4 (d) and (e). We then generate the union gradient as the reference using

$$\nabla\mathbf{I}^{\prime}_{\mathrm{ref}}(\mathbf{p}_{0})=\max(\nabla\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0}),\nabla\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0})).$$

We further use max-pooling within a 3×3 neighborhood $\mathcal{R}^{\prime}$ to boost the reference gradient, which is written as

$$\nabla\mathbf{I}_{\mathrm{ref}}(\mathbf{p}_{0})=\max_{\mathbf{p}_{n}\in\mathcal{R}^{\prime}}\nabla\mathbf{I}^{\prime}_{\mathrm{ref}}(\mathbf{p}_{0}+\mathbf{p}_{n}),$$

as shown in Fig. 4 (f).
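The three steps above (per-modality gradient magnitude, pixel-wise union, and 3×3 max-pooling) can be sketched in NumPy as follows. The text does not specify the gradient operator or how borders are handled, so this sketch assumes central differences via `np.gradient`, single-channel inputs, and stride-1 pooling with zero padding. The function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(image):
    """sqrt((d/dx I)^2 + (d/dy I)^2), using central differences (an assumption)."""
    grad_y, grad_x = np.gradient(image.astype(float))
    return np.sqrt(grad_x ** 2 + grad_y ** 2)

def max_pool_3x3(x):
    """Stride-1 max over a 3x3 neighborhood; zero padding is safe because x >= 0."""
    h, w = x.shape
    padded = np.pad(x, 1)
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(shifted, axis=0)

def reference_gradient(rgb_gray, thermal):
    """Return both gradient images and the boosted reference gradient."""
    grad_rgb = gradient_magnitude(rgb_gray)
    grad_t = gradient_magnitude(thermal)
    union = np.maximum(grad_rgb, grad_t)   # pixel-wise max: the union gradient
    return grad_rgb, grad_t, max_pool_3x3(union)

# Toy inputs: a vertical edge visible only in "RGB", a horizontal edge only in "thermal".
rgb_gray = np.zeros((8, 8))
rgb_gray[:, 4:] = 1.0
thermal = np.zeros((8, 8))
thermal[4:, :] = 1.0

grad_rgb, grad_t, grad_ref = reference_gradient(rgb_gray, thermal)
```

In this toy case the reference gradient contains both edges, each thickened by one pixel, so it is at least as large as either single-modality gradient everywhere.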

Figure 4: Illustration of fused feature map generation process for the plain early-fusion strategy and our ShaPE module. (a) RGB image. (b) Thermal image. (c) Fused feature map generated using the plain early-fusion strategy, with a close-up indicated by a white circle line. (d) and (e) are gradient images of the RGB and thermal images, respectively. (f) Boosted reference gradient image. (g) and (h) are self-gating masks of the RGB and thermal images, respectively. (i) Fused feature map generated by our ShaPE module.

To determine the structural contributions of each modality to the fused features, we compute the structural similarities between single-modality gradient images $\{\nabla\mathbf{I}_{\mathrm{rgb}}, \nabla\mathbf{I}_{\mathrm{t}}\}$ and the reference gradient image $\nabla\mathbf{I}_{\mathrm{ref}}$. Inspired by [32], for each patch $\mathcal{R}$, we compute three fundamental properties: the means $\{\mu_{\mathrm{rgb}},\mu_{\mathrm{t}},\mu_{\mathrm{ref}}\}$, the standard deviations $\{\sigma_{\mathrm{rgb}},\sigma_{\mathrm{t}},\sigma_{\mathrm{ref}}\}$, and the covariances $\{\sigma_{(\mathrm{rgb},\mathrm{ref})}, \sigma_{(\mathrm{t},\mathrm{ref})}\}$ between the single-modality gradient images and the reference gradient image. In this context, we generate the self-gating masks:

$$\begin{aligned}
\mathbf{M}_{\mathrm{rgb}}^{\prime} &= \frac{(2\mu_{\mathrm{rgb}}\cdot\mu_{\mathrm{ref}}+\xi_{1})\cdot(2\sigma_{(\mathrm{rgb},\mathrm{ref})}+\xi_{2})}{(\mu_{\mathrm{rgb}}^{2}+\mu_{\mathrm{ref}}^{2}+\xi_{1})\cdot(\sigma_{\mathrm{rgb}}^{2}+\sigma_{\mathrm{ref}}^{2}+\xi_{2})},\\
\mathbf{M}_{\mathrm{t}}^{\prime} &= \frac{(2\mu_{\mathrm{t}}\cdot\mu_{\mathrm{ref}}+\xi_{1})\cdot(2\sigma_{(\mathrm{t},\mathrm{ref})}+\xi_{2})}{(\mu_{\mathrm{t}}^{2}+\mu_{\mathrm{ref}}^{2}+\xi_{1})\cdot(\sigma_{\mathrm{t}}^{2}+\sigma_{\mathrm{ref}}^{2}+\xi_{2})},
\end{aligned}$$

where $\xi_{1}=(k_{1}L)^{2}$ and $\xi_{2}=(k_{2}L)^{2}$ are used to prevent instability. $L$ is the dynamic range of the gradient images, $k_{1}=0.01$, and $k_{2}=0.03$.

Since the ranges of both $\mathbf{M}_{\mathrm{rgb}}^{\prime}$ and $\mathbf{M}_{\mathrm{t}}^{\prime}$ are $[-1,1]$, we then normalize the self-gating masks and obtain

$$\mathbf{M}_{\mathrm{rgb}}=\frac{\exp(\mathbf{M}^{\prime}_{\mathrm{rgb}})}{\sum\limits_{j\in\{\mathrm{rgb},\mathrm{t}\}}\exp(\mathbf{M}^{\prime}_{j})},\quad\mathbf{M}_{\mathrm{t}}=\frac{\exp(\mathbf{M}^{\prime}_{\mathrm{t}})}{\sum\limits_{j\in\{\mathrm{rgb},\mathrm{t}\}}\exp(\mathbf{M}^{\prime}_{j})}, \tag{3}$$

as shown in Fig. 4 (g) and (h). According to Eq. (2), we can finally generate the fused feature map as shown in Fig. 4 (i).
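The normalization in Eq. (3) is a pixel-wise two-way softmax over the raw masks. The following is a minimal sketch on nested Python lists; the function name `normalize_masks` and the toy mask values are hypothetical and only serve to show the arithmetic.

```python
import math


def normalize_masks(m_rgb_raw, m_t_raw):
    """Pixel-wise softmax over the two raw self-gating masks (Eq. 3)."""
    m_rgb, m_t = [], []
    for row_rgb, row_t in zip(m_rgb_raw, m_t_raw):
        out_rgb, out_t = [], []
        for a, b in zip(row_rgb, row_t):
            ea, eb = math.exp(a), math.exp(b)
            out_rgb.append(ea / (ea + eb))
            out_t.append(eb / (ea + eb))
        m_rgb.append(out_rgb)
        m_t.append(out_t)
    return m_rgb, m_t


# Toy 2x2 raw masks with values in [-1, 1] (illustrative only).
m_rgb_raw = [[1.0, 0.0], [-1.0, 0.5]]
m_t_raw = [[-1.0, 0.0], [1.0, 0.5]]
m_rgb, m_t = normalize_masks(m_rgb_raw, m_t_raw)
```

Because the raw masks lie in $[-1,1]$, each normalized weight stays between $1/(1+e^{2})\approx 0.119$ and $1/(1+e^{-2})\approx 0.881$, and the two weights sum to one at every pixel.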

III-B Weakly Supervised Learning Method

In RGB-T object detection, a non-negligible issue is the lack of backbone networks pre-trained on large-scale RGB-T datasets. This is because there are few large-scale datasets like ImageNet [33] and COCO [34] in the RGB-T image recognition field. Previous works generally use backbone networks pre-trained on ImageNet. However, the domain gap between thermal and RGB images causes representation distribution shifts, as illustrated in Fig. 5 (a) and (b), because the backbone network is trained solely on RGB images but is applied to thermal images.

Figure 5: T-SNE visualization of RGB and thermal image features. (a) and (b) visualize the image features of the M3FD [1] and FLIR [35] datasets using the ImageNet pre-trained ResNet-50 backbone network. (c) and (d) visualize the image features of the same datasets using the ResNet-50 trained with our weakly supervised learning method. Additionally, we present corresponding images of six pairs of feature points.

To handle this issue, we turn to the powerful Contrastive Language-Image Pre-training (CLIP) [17] model. It has been confirmed that CLIP can bridge domain gaps [18], since it is trained using a huge number of (image, text) pairs. In this context, we feed both RGB and thermal images into the backbone network, and let it learn the representation generated by the CLIP model. Specifically, we first present a CLIP-driven image-level weakly supervised learning method. This method enables the network to recognize the classes of objects in a pair of RGB-T images while locating their coarse regions. For fine-grained localization, we then introduce a box-level weakly supervised learning method. Fig. 6 illustrates the architecture of the weakly supervised learning method.

CLIP-Driven Image-Level Weak Supervision. To learn the CLIP model’s knowledge, we construct the image-level weak supervision method. Based on three considerations, we adopt the multi-label classification task as the image-level weak supervision: (1) the CLIP model can be viewed as a classifier, (2) this auxiliary task can fully use the complementary information in the RGB-T images, and (3) by summarizing all classes and removing duplicates in an image, we can easily construct the ground-truth multi-label targets based on detection annotations.
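Consideration (3) can be sketched in a few lines: the class indices of all annotated boxes in an image are collected, duplicates are removed, and the result is written as a multi-hot vector. The function name `multilabel_target` and the example annotation are hypothetical.

```python
def multilabel_target(box_labels, num_classes):
    """Multi-hot vector with 1.0 for every class that appears at least once."""
    present = set(box_labels)  # removes duplicate classes
    return [1.0 if c in present else 0.0 for c in range(num_classes)]


# Hypothetical image with three annotated boxes: two of class 1, one of class 4.
q = multilabel_target([1, 4, 1], num_classes=6)
```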

Nevertheless, the original CLIP model is only trained for recognizing a single object per image [17] and is not suitable for multi-label classification [36]. To address this issue, we introduce a Divide-and-Aggregation CLIP (DA-CLIP) model. DA-CLIP first divides input images into multiple crops. Each crop is then fed into CLIP. All predictions of these crops are finally aggregated by a max-pooling operation on each class. Considering that DA-CLIP may generate inaccurate predictions, we construct a learnable adapter, which consists of three fully-connected (FC) layers, to fine-tune the result of DA-CLIP. To prevent overfitting, we add a dropout layer in the adapter. We denote the predicted probability from the adapter as $\mathbf{\hat{q}}_{\rm ad}\in\mathbb{R}^{c}$, where $c$ denotes the number of classes.
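The divide-and-aggregate step can be sketched as follows. This is not the authors' implementation: the non-overlapping grid layout, the function names `divide_into_crops` and `aggregate_max`, and the per-crop scores (placeholders standing in for CLIP outputs) are all assumptions, since the text does not specify how the crops are generated.

```python
def divide_into_crops(height, width, rows, cols):
    """Return (top, left, bottom, right) windows of a rows x cols grid.

    Assumption: a non-overlapping grid; the paper does not fix the layout.
    """
    crops = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            crops.append((top, left, bottom, right))
    return crops


def aggregate_max(per_crop_scores):
    """Class-wise max-pooling over the per-crop class scores."""
    return [max(scores) for scores in zip(*per_crop_scores)]


crops = divide_into_crops(height=4, width=6, rows=2, cols=2)
# Placeholder scores for two crops and three classes (not real CLIP outputs).
aggregated = aggregate_max([[0.1, 0.9, 0.2], [0.7, 0.3, 0.2]])
```

Taking the maximum per class means a class counts as present if any single crop supports it, which suits images containing several small objects of different classes.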

Figure 6: Illustration of the weakly supervised learning method. It consists of a divide-and-aggregation CLIP model (DA-CLIP), an adapter, a backbone, two auxiliary heads used for classification and segmentation, and weakly supervised losses. All modules except the DA-CLIP are updated, and only the backbone network remains in the inference phase.

For the backbone network, we add an auxiliary classification head on its top. The head consists of a global average pooling (GAP) operation and one FC layer. We denote the predicted probability from the classification head as $\mathbf{\hat{q}}_{\rm bb}\in\mathbb{R}^{c}$.

We adopt the mutual learning approach [37] to train the backbone network and the adapter simultaneously. In this approach, an important step is that one model generates soft targets for the other model using the softmax function. However, this approach cannot be directly applied to the multi-label classification problem, since it requires the sum of predicted probabilities to be one, which is rarely satisfied in multi-label classification. To address this issue, we draw inspiration from self-training KD [38] and construct the soft targets for the adapter and backbone network as

$$\mathbf{\tilde{q}}_{\rm ad}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm ad},\quad\mathbf{\tilde{q}}_{\rm bb}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm bb},$$

where $\mathbf{q}\in\mathbb{R}^{c}$ denotes a ground-truth multi-label target, and $\lambda$ denotes a balancing factor set to 0.1. In this context, we compute the binary cross-entropy (BCE) losses

$$\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb})=-\sum_{i=1}^{c}\left[\tilde{q}_{{\rm ad},i}\log(\hat{q}_{{\rm bb},i})+(1-\tilde{q}_{{\rm ad},i})\log(1-\hat{q}_{{\rm bb},i})\right], \tag{4a}$$

$$\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad})=-\sum_{i=1}^{c}\left[\tilde{q}_{{\rm bb},i}\log(\hat{q}_{{\rm ad},i})+(1-\tilde{q}_{{\rm bb},i})\log(1-\hat{q}_{{\rm ad},i})\right]. \tag{4b}$$
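The soft targets and the two mutual-learning BCE losses can be traced numerically with the sketch below. The function names, the probability clamp `eps`, and the toy predictions are assumptions for illustration. In an autograd framework the soft targets would presumably be treated as constants, which plain Python does not model.

```python
import math


def soft_target(q, q_hat, lam=0.1):
    """Blend the ground-truth target with a prediction: (1 - lam) * q + lam * q_hat."""
    return [(1.0 - lam) * qi + lam * pi for qi, pi in zip(q, q_hat)]


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy summed over classes (Eq. 4); eps avoids log(0)."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


q = [1.0, 0.0, 1.0]         # ground-truth multi-label target
q_hat_ad = [0.8, 0.3, 0.6]  # toy adapter prediction
q_hat_bb = [0.7, 0.1, 0.9]  # toy backbone-head prediction

q_tilde_ad = soft_target(q, q_hat_ad)  # target for the backbone, Eq. (4a)
q_tilde_bb = soft_target(q, q_hat_bb)  # target for the adapter, Eq. (4b)
loss_bb = bce(q_tilde_ad, q_hat_bb)
loss_ad = bce(q_tilde_bb, q_hat_ad)
```

With $\lambda=0.1$ the soft targets stay close to the ground truth, so each model is guided mostly by the annotations and only slightly by its peer.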
Figure 7: Illustration of the class activation map (CAM) of the backbone network. Each row’s triplet of images represents the CAM for a specific class, using (a) image-level auxiliary learning only, (b) box-level auxiliary learning only, and (c) both image-level and box-level auxiliary learning.
Figure 8: Illustration of feature maps generated by the backbone network. (a) and (b) present the RGB and thermal images. (c) and (d) present their corresponding feature maps. (e) and (f) present the feature maps generated by the ResNet-50 trained without and with our weakly supervised learning, respectively. The close-up is highlighted with a red box.

To showcase the semantic localization effect of our CLIP-driven image-level weak supervision, we visualize the class activation map (CAM) of the backbone network in Fig. 7 (a). CAM is a useful tool for understanding which regions the network focuses on to predict a class. We can observe that the backbone network can coarsely localize regions of ‘Person’, ‘Car’, and ‘Traffic Light’ in the image.

Box-Level Weak Supervision. To precisely localize the semantic regions, we introduce box-level weak supervision. The ground-truth box-level target is generated by directly filling the area within an annotation box with its corresponding class index. In this context, we add an auxiliary segmentation head on top of the backbone network to predict the target. Denoting the ground-truth box-level target mask as $\mathbf{G}$, and the predicted mask as $\mathbf{\hat{G}}$, we compute the BCE loss between them as

$$\mathcal{H}(\mathbf{G},\mathbf{\hat{G}})=-\sum_{n=1}^{N}\left[G_{n}\log(\hat{G}_{n})+(1-G_{n})\log(1-\hat{G}_{n})\right], \tag{5}$$

where $N$ denotes the number of elements in the mask.
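Generating the box-level target amounts to painting each annotation box into an empty mask. The sketch below shows only this filling step. The box format `(x1, y1, x2, y2, class_index)` with exclusive right and bottom edges, the background value 0, and the rule that later boxes overwrite earlier ones where they overlap are assumptions not stated in the text.

```python
def box_level_target(height, width, boxes, background=0):
    """Fill each annotation box with its class index.

    boxes: list of (x1, y1, x2, y2, class_index) in pixel coordinates,
    with x2 and y2 exclusive (an assumption for this sketch).
    """
    mask = [[background] * width for _ in range(height)]
    for x1, y1, x2, y2, cls in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = cls  # later boxes overwrite earlier ones
    return mask


# Two hypothetical, partly overlapping boxes of classes 2 and 1.
mask = box_level_target(4, 5, [(1, 0, 3, 2, 2), (2, 1, 5, 4, 1)])
```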

Figure 9: Illustration of the knowledge distillation technique. The student model adopts an early-fusion single-branch structure, while the teacher model adopts a medium-fusion two-branch structure. In the training phase, both the pre-trained teacher model and the core knowledge convolution module are fixed, while only the student model is updated. After training, only the student model is used for deployment. In this diagram, we use YOLOv5 [2] as an example, and it can be easily extended to other detectors.

We visualize attention maps of the backbone network for different classes, as shown in Fig. 7 (b). Using the box-level weak supervision, the backbone network can precisely localize objects of interest, such as ‘Car’. Nevertheless, it may miss some useful information in the image. Therefore, we combine the CLIP-driven image-level weak supervision and the box-level weak supervision. The results presented in Fig. 7 (c) show that our weakly supervised learning method can effectively allow the backbone network to localize the important semantic regions.

Effect Validation. When our weakly supervised learning method is employed, Fig. 5 (c) and (d) demonstrate that the domain gap between RGB and thermal features is reduced. This implies that the backbone network can extract information from RGB and thermal images without bias. To further illustrate this effect, we visualize the feature map generated by the ResNet-50 [39] in Fig. 8. The generation process of these feature maps is as follows: First, we resize all features of the ResNet-50 across four stages to the same resolution as the input images. Then, we aggregate these features along the channel dimension using $\texttt{sum}(\texttt{softmax}(\mathbf{F},\texttt{dim=0})\otimes\mathbf{F},\texttt{dim=0})$, where $\mathbf{F}\in\mathbb{R}^{D\times H\times W}$ represents the concatenated feature. $D$, $H$, and $W$ denote its depth, height, and width, respectively. $\otimes$ denotes the element-wise product.
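The channel aggregation above is a softmax-weighted average over the depth axis, computed independently at each pixel. A minimal sketch on nested lists follows; the function name `aggregate_channels` and the toy feature are hypothetical, and subtracting the per-pixel maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math


def aggregate_channels(feat):
    """feat: D x H x W nested lists. Returns the H x W map sum_d softmax_d(F) * F."""
    depth, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [feat[d][y][x] for d in range(depth)]
            peak = max(values)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in values]
            norm = sum(exps)
            out[y][x] = sum(e / norm * v for e, v in zip(exps, values))
    return out


# Toy feature with D=2, H=1, W=2.
feature_map = aggregate_channels([[[0.0, 2.0]], [[0.0, 0.0]]])
```

At every pixel the output lies between the smallest and largest channel response and leans toward the largest, so strongly activated channels dominate the visualization.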

Fig. 8 (a) and (b) present the RGB and thermal images in one example scene. Fig. 8 (c) and (d) illustrate their corresponding feature maps. Fig. 8 (e) shows the RGB-T feature map without using our weakly supervised learning method. Fig. 8 (f) shows the feature map using our weakly supervised learning method. Observing Fig. 8 (e), we note that the ResNet-50 tends to acquire information primarily from the RGB image. In contrast, the feature map in Fig. 8 (f) demonstrates that our method enables the ResNet-50 to gather important information from both RGB and thermal images.

III-C Core Knowledge Distillation

Problem Description. To further improve the detection accuracy of the early-fusion strategy without increasing its computational cost, we introduce the knowledge distillation technique [19]. To achieve knowledge transfer, we instruct the student model to mimic intermediate features of the teacher model. In this process, a primary obstacle is that the student's number of feature channels differs from the teacher's. Previous works introduce convolution layers to align their feature channel numbers [20, 21], while neglecting whether the teacher’s knowledge is helpful to the student. To address this issue, we propose core knowledge distillation (CoreKD).

CoreKD Architecture. We use YOLOv5 [2] as an example and illustrate the knowledge distillation architecture in Fig. 9. In its architecture, we use the early-fusion single-branch structure as the student model and the medium-fusion two-branch structure as the teacher model. In the student model, a pair of RGB-T images is first concatenated, then fed into different network modules, and finally converted into predicted results. In the teacher model, the RGB and thermal images are respectively fed into different backbone networks. The generated multispectral features are fused in the feature space through concatenation and convolution operations. The fused features are then fed into the subsequent network modules and converted into predicted results. The predicted results of both the student and teacher models consist of bounding boxes and class-specific confidence scores.

CoreKD Formulation. Since we apply the same distillation techniques to different feature pyramid levels, we only describe the technique at one level and omit the subscript for simplicity. In the head modules of Fig. 9, we denote the input features of the student and teacher models as $\mathbf{X}^{\rm S}$ and $\mathbf{X}^{\rm T}$, respectively. Feature distillation typically transfers the teacher’s knowledge to the student by minimizing the loss [20]

$$\mathcal{L}^{\prime\prime}_{\rm feat}=\|\mathcal{A}(\mathbf{X}^{\rm S})-\mathbf{X}^{\rm T}\|_{2}^{2}, \tag{6}$$

where $\mathcal{A}$ denotes an adaptation layer used to match the channel dimensions between the student and teacher features. Previous works usually use a convolution layer as the adaptation layer [20, 21]. This approach aims to make $\mathcal{A}(\mathbf{X}^{\rm S})$ learn all information in the teacher feature $\mathbf{X}^{\rm T}$. However, they neglect whether all the information in $\mathbf{X}^{\rm T}$ is beneficial for downstream tasks, including classification and regression.

To address this problem, we revisit the structure of the head module in the teacher model. As shown in Fig. 9, the official implementation of YOLOv5 uses a ‘$1\times 1\;\texttt{Conv}$’ layer to output the predicted results

$$\mathbf{\hat{Y}}^{\rm T}=\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T}),$$

where $\mathbf{W}^{\rm T}$ denotes the weighting factor in the teacher’s head module. According to the 2D convolution formulation in Eq. (1), we can infer that the weighting factor $\mathbf{W}^{\rm T}$ reflects the importance of a channel map in $\mathbf{X}^{\rm T}$ for the downstream feature. We visualize the histogram of $\mathbf{W}^{\rm T}$ in Fig. 10. It is evident that most of the values in $\mathbf{W}^{\rm T}$ are close to 0. This implies that only a few feature representations in $\mathbf{X}^{\rm T}$ are important for the downstream tasks. We call these important feature representations the core knowledge in the teacher model.

To learn this core knowledge, we modify the feature loss Eq. (6) into

$$\mathcal{L}^{\prime}_{\rm feat}=\|\texttt{Conv}(\mathcal{A}(\mathbf{X}^{\rm S});\mathbf{W}^{\rm T})-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})\|_{2}^{2}. \tag{7}$$

This modification ensures that $\mathcal{A}(\mathbf{X}^{\rm S})$ and $\mathbf{X}^{\rm T}$ are projected into an identical space constructed by $\mathbf{W}^{\rm T}$, and that the projected features are close to each other. Furthermore, to avoid introducing the adaptation layer $\mathcal{A}$, we construct a core knowledge convolution (Core Knowledge Conv) operator by sampling the weighting factor $\mathbf{W}^{\rm T}$. We denote the sampling process as $\mathcal{S}(\cdot)$. In the process, we first obtain the channel dimension $d$ of the student feature $\mathbf{X}^{\rm S}$, then sample the top-$d$ values along the ‘in_channel’ axis from $\mathbf{W}^{\rm T}$ based on their absolute values. Finally, we obtain the sampled weighting factor $\mathcal{S}(\mathbf{W}^{\rm T})$. In this context, we rewrite the feature loss given in Eq. (7) as

$$\begin{aligned}\mathcal{L}_{\rm feat}&=\|\mathbf{\hat{Y}}^{\rm CT}-\mathbf{\hat{Y}}^{\rm T}\|_{2}^{2}\\&=\|\texttt{Conv}(\mathbf{X}^{\rm S};\mathcal{S}(\mathbf{W}^{\rm T}))-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})\|_{2}^{2},\end{aligned} \tag{8}$$

where $\mathbf{\hat{Y}}^{\rm CT}$ denotes the output of the core knowledge convolution. When using this feature loss, we keep the weighting factor $\mathbf{W}^{\rm T}$ fixed and only compute the gradient with respect to the student feature $\mathbf{X}^{\rm S}$.
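The sampling step and the resulting loss in Eq. (8) can be traced on a toy example. The sketch below reflects one reading of the sampling process, not the authors' code: for every output channel it keeps the $d$ in-channel weights with the largest absolute value, in their original order, and pairs them with the student channels in that order. The helper names `conv1x1`, `sample_top_d`, and `squared_l2`, the flattened-pixel layout, and all numbers are hypothetical.

```python
def conv1x1(feat, weight):
    """1x1 convolution. feat: C x N (channels x flattened pixels); weight: O x C."""
    num_pixels = len(feat[0])
    return [
        [sum(w_row[c] * feat[c][p] for c in range(len(feat))) for p in range(num_pixels)]
        for w_row in weight
    ]


def sample_top_d(weight, d):
    """Per output channel, keep the d in-channel weights with the largest |value|.

    Assumption: kept weights retain their original channel order.
    """
    sampled = []
    for row in weight:
        ranked = sorted(range(len(row)), key=lambda c: abs(row[c]), reverse=True)
        keep = sorted(ranked[:d])
        sampled.append([row[c] for c in keep])
    return sampled


def squared_l2(a, b):
    """Squared L2 distance between two equally shaped nested lists."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy teacher head: 2 output channels, 4 input channels; most weights near zero.
w_teacher = [[0.9, 0.0, -0.5, 0.01],
             [0.02, -0.8, 0.0, 0.6]]
x_teacher = [[1.0, 2.0], [0.5, 0.5], [2.0, 0.0], [1.0, 1.0]]  # 4 channels, 2 pixels
x_student = [[1.0, 2.0], [2.0, 0.0]]                          # 2 channels, 2 pixels

w_core = sample_top_d(w_teacher, d=len(x_student))  # fixed, not trained
y_teacher = conv1x1(x_teacher, w_teacher)
y_core = conv1x1(x_student, w_core)
loss_feat = squared_l2(y_core, y_teacher)
```

Since both outputs live in the teacher's prediction space, no adaptation layer is needed, and the near-zero teacher weights that are dropped contribute little to `y_teacher` in the first place.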

Figure 10: Weighting factor histograms of the teacher’s head module in Fig. 9. (a), (b), and (c) correspond to the level-0, level-1, and level-2 convolution weighting factor histograms, respectively.

III-D Loss Function

Our efficient multispectral early-fusion (EME) single-branch model is trained using all the losses described above. The total loss is

$$\mathcal{L}_{\rm total}=\mathcal{L}_{\rm cls}+\mathcal{L}_{\rm reg}+\mathcal{L}_{\rm weak}+\mathcal{L}_{\rm feat}, \tag{9}$$

where $\mathcal{L}_{\rm cls}$ and $\mathcal{L}_{\rm reg}$ represent the classification and regression losses defined by a detector [30, 31, 2], respectively. $\mathcal{L}_{\rm weak}$ is the summation of the weakly supervised losses defined in Eq. (4) and Eq. (5):

$$\mathcal{L}_{\rm weak}=\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb})+\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad})+\mathcal{H}(\mathbf{G},\mathbf{\hat{G}}).$$
TABLE I: Performance on the M3FD dataset [1]. The best results in the mAP and mAP50 columns are highlighted in bold and marked in red, while the second best ones are underlined and marked in green.
Detector FLOPs (↓) Parameters (↓) Time (↓) mAP (↑) mAP50 (↑) Person (↑) Car (↑) Bus (↑) Motor (↑) TrafficLight (↑) Truck (↑)
RGB RetinaNet-Res50 61.893G 36.434M 0.106s 30.90 51.10 44.60 74.50 58.10 44.10 36.00 49.50
Thermal RetinaNet-Res50 61.893G 36.434M 0.106s 29.30 47.10 59.50 71.20 55.00 35.80 10.30 50.60
RGB-T Medium Fusion RetinaNet-Res50 94.611G 47.582M 0.170s 33.50 53.70 60.20 77.30 61.70 45.50 25.80 51.80
Baseline: RGB-T Early Fusion RetinaNet-Res50 62.164G 36.434M 0.110s 32.10 50.90 58.80 75.50 59.50 40.00 22.00 49.30
+ ShaPE RetinaNet-Res50 62.218G 36.434M 0.149s 32.90 52.30 61.00 77.20 59.30 40.40 24.40 51.50
+ ShaPE + WeakSup. RetinaNet-Res50 62.218G 36.434M 0.149s 33.50 52.70 59.70 76.60 63.10 38.60 23.70 54.70
+ ShaPE + WeakSup. + CoreKD RetinaNet-Res50 62.218G 36.434M 0.149s 33.70 53.10 60.90 76.50 59.30 42.90 25.80 53.30
RGB GFL-Res50 61.392G 32.270M 0.110s 32.40 53.12 48.80 77.80 60.00 43.70 38.40 50.00
Thermal GFL-Res50 61.392G 32.270M 0.110s 29.80 48.70 64.60 73.90 54.20 36.50 15.20 47.70
RGB-T Medium Fusion GFL-Res50 94.110G 43.419M 0.172s 34.60 55.30 65.20 79.80 62.60 39.50 35.20 49.80
Baseline: RGB-T Early Fusion GFL-Res50 61.663G 32.271M 0.114s 33.90 53.07 64.00 78.40 54.50 39.50 29.70 52.30
+ ShaPE GFL-Res50 61.718G 32.271M 0.151s 35.40 55.70 65.80 79.10 63.00 41.90 30.20 54.20
+ ShaPE + WeakSup. GFL-Res50 61.718G 32.271M 0.151s 35.90 56.30 65.10 79.80 66.00 41.80 30.90 54.00
+ ShaPE + WeakSup. + CoreKD GFL-Res50 61.718G 32.271M 0.151s 37.10 57.60 68.50 81.30 63.30 42.70 35.90 53.90
Figure 11: Detection results of the GFL [31] detector on two example scenes from the M3FD [1] dataset. (a) and (e) display results using only RGB images. (b) and (f) show results using only thermal images. (c) and (g) demonstrate results using the plain RGB-T early-fusion strategy. (d) and (h) depict results using our EME method. Solid boxes represent detection results. Green dashed boxes mark missed objects (false negatives) while yellow dashed boxes mark false positives.
TABLE II: Performance on the FLIR dataset [35]. The best results in the mAP and mAP50 columns are highlighted in bold and marked in red, while the second best ones are underlined and marked in green.
Detector FLOPs (↓) Parameters (↓) Time (↓) mAP (↑) mAP50 (↑) Person (↑) Bicycle (↑) Car (↑)
RGB RetinaNet-Res50 61.893G 36.434M 0.106s 28.10 59.50 45.10 55.70 77.90
Thermal RetinaNet-Res50 61.893G 36.434M 0.106s 38.50 71.00 62.40 66.30 84.30
RGB-T Medium Fusion RetinaNet-Res50 94.611G 47.582M 0.170s 38.60 71.50 61.90 67.50 85.10
Baseline: RGB-T Early Fusion RetinaNet-Res50 62.164G 36.434M 0.110s 37.40 69.50 60.40 63.90 84.30
+ ShaPE RetinaNet-Res50 62.218G 36.434M 0.149s 38.60 71.40 61.30 68.30 84.50
+ ShaPE + WeakSup. RetinaNet-Res50 62.218G 36.434M 0.149s 38.40 71.60 61.50 68.40 85.00
+ ShaPE + WeakSup. + CoreKD RetinaNet-Res50 62.218G 36.434M 0.149s 38.60 71.80 61.90 68.30 85.00
RGB GFL-Res50 61.392G 32.270M 0.110s 31.90 63.70 51.70 57.60 81.80
Thermal GFL-Res50 61.392G 32.270M 0.110s 42.60 75.10 69.60 68.80 86.90
RGB-T Medium Fusion GFL-Res50 94.110G 43.419M 0.172s 42.70 75.80 69.80 70.10 87.70
Baseline: RGB-T Early Fusion GFL-Res50 61.663G 32.271M 0.114s 42.20 74.70 69.70 67.50 87.00
+ ShaPE GFL-Res50 61.718G 32.271M 0.151s 42.60 75.60 69.70 70.10 87.10
+ ShaPE + WeakSup. GFL-Res50 61.718G 32.271M 0.151s 43.10 76.60 71.10 70.70 87.90
+ ShaPE + WeakSup. + CoreKD GFL-Res50 61.718G 32.271M 0.151s 44.00 78.10 73.10 72.40 88.80
Figure 12: Detection results of the GFL [31] detector on two example scenes from the FLIR [35] dataset. (a) and (e) display results using only RGB images. (b) and (f) show results using only thermal images. (c) and (g) demonstrate results using the plain RGB-T early-fusion strategy. (d) and (h) depict results using our EME method. Solid boxes represent detection results. Green dashed boxes mark missed objects (false negatives) while yellow dashed boxes mark false positives.
TABLE III: Comparisons with state-of-the-art approaches on the M3FD dataset [1].
(a) Dataset Splitting Method: Random Splitting
Thermal [2] RGB [2] AUIF [40] CDDF [24] DDcGAN [41] DIVF [25] DenseF [42] PSF [43] RFN [44] SeAF [45] TarDAL [1] U2F [46] EME (Ours)
mAP 49.10 52.40 53.30 53.00 52.20 52.70 53.40 53.10 53.50 53.10 52.50 53.40 54.20
mAP50 77.30 81.90 81.90 80.90 81.60 81.50 81.70 82.00 81.70 82.20 81.00 81.90 83.40
Person 79.30 68.40 76.70 76.30 73.60 74.50 76.50 76.70 75.30 77.00 79.10 77.00 79.90
Car 87.90 90.80 91.00 91.00 90.70 91.10 91.40 90.80 91.00 91.10 90.50 91.20 91.80
Bus 87.20 92.20 90.00 90.10 90.70 91.60 89.40 90.10 89.40 91.20 89.40 90.70 89.20
Motor 70.00 74.00 72.60 69.20 74.80 73.50 72.80 73.30 73.30 72.20 70.30 71.30 76.20
TrafficLight 55.90 80.30 77.40 75.40 76.90 74.80 77.20 78.20 77.40 77.60 72.70 77.70 78.30
Truck 83.40 85.70 83.70 83.10 82.90 83.40 82.90 82.90 83.90 84.10 84.00 83.60 85.30
(b) Dataset Splitting Method: M3FD-zxSplit
Thermal [2] RGB [2] AUIF [40] CDDF [24] DDcGAN [41] DIVF [25] DenseF [42] PSF [43] RFN [44] SeAF [45] TarDAL [1] U2F [46] EME (Ours)
mAP 34.90 36.10 38.30 38.60 37.10 37.10 38.90 38.00 38.20 38.90 39.10 38.70 41.50
mAP50 57.20 60.20 62.00 61.90 61.00 60.80 62.40 61.10 61.30 62.20 61.90 61.90 66.80
Person 74.60 55.90 72.20 71.90 67.30 67.60 72.30 71.70 70.50 72.50 75.50 72.40 77.60
Car 80.20 84.80 85.50 85.60 84.90 85.20 85.90 85.50 85.80 85.50 85.00 85.50 87.40
Bus 58.30 65.70 58.60 61.80 61.60 59.80 61.40 58.30 61.30 61.50 60.90 60.10 65.10
Motor 48.00 45.10 49.10 47.60 49.00 48.70 49.60 45.80 44.60 47.50 46.80 50.80 55.60
TrafficLight 27.30 56.80 49.80 48.70 49.10 51.20 48.60 50.90 49.70 50.80 46.90 48.00 54.90
Truck 54.80 52.70 56.70 55.50 53.80 52.60 56.80 54.70 55.80 55.70 56.70 54.70 60.30
Figure 13: Detection results of the YOLOv5 [2] detector on one example scene from the M3FD [1] dataset. (a) and (b) respectively show the results using only a thermal image and only an RGB image. (c)-(l) display the detection results using fused images obtained from 10 different image fusion approaches. (m) demonstrates the results using our EME method.
TABLE IV: Comparisons with state-of-the-art approaches on the FLIR [35] dataset.
CBF [47] MCG [48] MUN [48] ODS [49] CFR [50] GAFF [15] BU [51] SMPD [52] ThDe [53] MSAT [54] CSAA [55] MFPT [56] ProbEn3 [22] EME (Ours)
mAP50 67.20 61.40 61.54 69.62 72.39 72.90 73.20 73.58 74.60 76.20 79.20 80.00 83.76 84.80
Bicycle 60.50 50.26 49.43 55.53 55.77 - 57.40 56.20 60.04 - - 67.70 73.49 79.80
Car 83.60 70.63 70.72 82.33 84.91 - 86.50 85.80 85.52 - - 89.00 90.14 92.80
Person 57.60 63.31 64.47 71.01 74.49 - 75.60 78.74 78.24 - - 83.20 87.65 81.20

IV Experiments

IV-A Experimental Setup

Datasets. Our experiments are conducted on the M3FD dataset [1] and the FLIR dataset [35]. The M3FD dataset contains 4200 pairs of well-aligned RGB and thermal images and covers 6 object classes: ‘Person’, ‘Car’, ‘Bus’, ‘Motorcycle’, ‘Traffic Light’, and ‘Truck’. Since this dataset does not provide a unified data split, previous works have used a random splitting approach to determine the train and validation sets [1]. However, images in this dataset are sampled from video sequences, meaning that two adjacent frames may contain identical content. In this context, the random splitting approach results in information leakage between the train and validation sets. To address this problem, we first manually divide the dataset into 73 video segments based on different scenes. Then, we collect the first 70% of images in each video segment as the train set and the remaining images as the validation set. Finally, we obtain 2905 and 1295 pairs of RGB-T images in the train and validation sets, respectively. We name this data split ‘M3FD-zxSplit’ and release it to the public (https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection). For the performance evaluation in Section IV-B, we use this data split. When comparing with state-of-the-art approaches in Section IV-C, we employ both ‘M3FD-zxSplit’ and random splitting. Our random splitting refers to randomly selecting 80% of the images as the train set and the remaining images as the validation set. The FLIR dataset originally contains unaligned RGB-T image pairs. The work [57] develops a data-processing approach to align these images and obtains 7381, 1056, and 2111 image pairs in the train, validation, and test sets, respectively. This dataset contains 3 classes: ‘Person’, ‘Bicycle’, and ‘Car’.
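The segment-wise split can be sketched as follows. This is not the released split script: the function names, the integer-percentage rounding (floor), and the toy segments are assumptions, and the actual ‘M3FD-zxSplit’ file lists should be taken from the repository above.

```python
def split_segment(frames, train_percent=70):
    """First train_percent% of a temporally ordered segment -> train, rest -> val."""
    cut = len(frames) * train_percent // 100  # assumption: round down
    return frames[:cut], frames[cut:]


def split_dataset(segments, train_percent=70):
    """Apply the per-segment split to every video segment and merge the results."""
    train, val = [], []
    for frames in segments:
        seg_train, seg_val = split_segment(frames, train_percent)
        train.extend(seg_train)
        val.extend(seg_val)
    return train, val


# Two hypothetical segments with 10 and 5 frames (frame ids are placeholders).
segments = [list(range(10)), list(range(100, 105))]
train_ids, val_ids = split_dataset(segments)
```

Because the cut is made inside each segment along the time axis, train and validation frames touch at only one point per segment, unlike a random split where nearly every validation frame has a train neighbor.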

Evaluation Metrics. We use the standard mean Average Precision (mAP) with IoU thresholds ranging from 0.5 to 0.95 across various object scales as metrics.
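As a minimal illustration of the IoU thresholds underlying this metric, the following Python sketch computes the IoU of two boxes and lists the thresholds at which a prediction would count as a match. The step of 0.05 between thresholds follows the common COCO convention and is an assumption here, as are the helper names `box_iou` and `matched_thresholds`; the full mAP computation (score ranking, precision-recall integration, per-class averaging) is omitted.

```python
# IoU thresholds 0.50, 0.55, ..., 0.95 (step of 0.05 assumed, COCO-style).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which pred_box would be accepted as matching gt_box."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A loosely localized box is accepted only at the lower thresholds, so averaging AP over the whole range rewards tighter localization than AP at a single threshold of 0.5.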

Implementation Details. We incorporate our three key modules into commonly used one-stage detectors, including RetinaNet [30], GFL [31], and YOLOv5 [2]. For RetinaNet and GFL, we adopt the implementations in the MMDetection toolbox [58]. For YOLOv5, we use its official implementation [2]. We keep the training settings consistent with the corresponding baselines.

Inference Efficiency Evaluations. We assess the inference efficiency of our method (Python implementation) on the edge device NVIDIA AGX Orin with 64GB memory. We also evaluate the complexity of our method using FLOPs and the number of parameters. All results are presented in Tables I and II.
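A latency measurement of this kind can be sketched as follows. This is a generic timing harness and not the evaluation script used in the paper: the function name `mean_latency_ms`, the warm-up count, and the number of timed runs are assumptions, and `run_once` stands for any callable that performs one forward pass.

```python
import time


def mean_latency_ms(run_once, warmup=10, runs=50):
    """Average wall-clock time of run_once() in milliseconds.

    Warm-up calls are executed first and excluded from the timing so that
    one-off costs (lazy initialization, caching) do not skew the average.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs
```

When the callable launches asynchronous GPU work, the device would additionally need to be synchronized before each timer reading; that step is left out of this sketch.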

IV-B Performance Evaluation

Table I and Table II present the performance of our method on the M3FD [1] and FLIR [35] datasets. Key observations include: (1) the medium-fusion strategy adds more parameters and FLOPs compared to the early-fusion strategy; (2) the medium-fusion strategy achieves better performance compared to single-modality inputs, whereas the plain early-fusion strategy does not consistently improve performance; (3) our EME method, incorporating the ShaPE module, weakly supervised learning, and CoreKD techniques into the plain early-fusion strategy, achieves significant performance improvement without significantly increasing parameters and FLOPs; (4) the inference time of our EME method is longer than that of the baseline method, since the structural similarity computation process has not been optimized when calculating the self-gating mask; (5) our EME method can outperform the medium-fusion strategy in both performance and efficiency to some extent.

Fig. 12 and Fig. 12 present visualization results for two example scenes from M3FD [1] and FLIR [35] datasets, respectively. We observe that false positives or false negatives in the single-modality results may affect the plain early-fusion strategy. For instance, the person missed in Fig. 12 (e) is also absent in Fig. 12 (g), despite being detected in Fig. 12 (f). Moreover, false positives in Fig. 12 (f) affect the detection results of plain early-fusion, as shown in Fig. 12 (g). These phenomena confirm that the problem of information interference is a key obstacle to performance in the plain early-fusion strategy. Clearly, our EME effectively alleviates this problem.

IV-C Comparison with the State-of-The-Art Approaches

We use the one-stage YOLOv5 [2] detector as the baseline, and incorporate our proposed modules to construct the effective multispectral early-fusion (EME) model. Table III and Table IV compare our EME with previous state-of-the-art approaches on the M3FD [1] and FLIR [35] datasets.

In Table III, we compare our EME with 10 state-of-the-art image-fusion-based object detection approaches [40, 24, 41, 25, 42, 43, 44, 45, 1, 46]. We first generate fused images using their official implementations, and then train YOLOv5 [2] on these fused images with the same training settings. The results show that our EME achieves state-of-the-art performance. We also observe that the results in Table III (a) are clearly better than those in Table III (b). This indicates that random splitting causes information leakage and makes it difficult to improve performance. Fig. 13 presents an example scene for visualization.

In Table IV, we compare our EME with 13 multispectral object detection approaches. These approaches include (1) medium-fusion strategies, such as CBF [47], MCG [48], MUN [48], CFR [50], GAFF [15], SMPD [52], MSAT [54], CSAA [55], and MFPT [56]; (2) domain adaptation and single-modality detection approaches, such as ODS [49], BU [51], and ThDe [53]; and (3) late-fusion strategy [22]. The results show that our EME also achieves state-of-the-art performance on the FLIR dataset [35].

V Conclusions

In this paper, we have proposed the effective multispectral early-fusion (EME) detector, which achieves both high performance and high efficiency. We identify three performance obstacles, namely information interference, domain gap, and weak feature representation, and address them with a shape-priority early-fusion module, weakly supervised learning, and core knowledge distillation, respectively. Extensive experiments demonstrate the effectiveness and efficiency of our EME.

References

  • [1] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2022.
  • [2] G. Jocher, “YOLOv5 by Ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
  • [3] Z. Chen and X. Huang, “Pedestrian Detection for Autonomous Vehicle Using Multi-Spectral Cameras,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 2, pp. 211–219, 2019.
  • [4] W. Zhou, S. Dong, M. Fang, and L. Yu, “CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene Parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 1919–1929, 2024.
  • [5] Y. Liu, C. Hu, B. Zhao, Y. Huang, and X. Zhang, “Region-Based Illumination-Temperature Awareness and Cross-Modality Enhancement for Multispectral Pedestrian Detection,” IEEE Transactions on Intelligent Vehicles, pp. 1–12, 2024.
  • [6] M. A. Farooq, W. Shariff, and P. Corcoran, “Evaluation of Thermal Imaging on Embedded GPU Platforms for Application in Vehicular Assistance Systems,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1130–1144, 2022.
  • [7] M. Ding, W.-H. Chen, and Y.-F. Cao, “Thermal Infrared Single-Pedestrian Tracking for Advanced Driver Assistance System,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 814–824, 2023.
  • [8] W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-Aware Network With Hierarchical Multimodal Fusion for RGB-T Urban Scene Understanding,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 48–58, 2023.
  • [9] Y. Guo, H. Kong, and S. Gu, “Unsupervised Multi-Spectrum Stereo Depth Estimation for All-Day Vision,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 501–511, 2024.
  • [10] Y. Zhu, C. Li, J. Tang, and B. Luo, “Quality-Aware Feature Aggregation Network for Robust RGB-T Tracking,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 121–130, 2021.
  • [11] J. Liu, S. Zhang, S. Wang, and D. N. Metaxas, “Multispectral Deep Neural Networks for Pedestrian Detection,” in Proceedings of the British Machine Vision Conference, 2016.
  • [12] J. Wagner, V. Fischer, M. Herman, S. Behnke et al., “Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks,” in Proceedings of the European Symposium on Artificial Neural Networks, vol. 587, 2016, pp. 509–514.
  • [13] Q. Xie, T.-Y. Cheng, Z. Dai, V. Tran, N. Trigoni, and A. Markham, “Illumination-Aware Hallucination-Based Domain Adaptation for Thermal Pedestrian Detection,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [14] T. Liu, K.-M. Lam, R. Zhao, and G. Qiu, “Deep Cross-Modal Representation Learning and Distillation for Illumination-Invariant Pedestrian Detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 315–329, 2021.
  • [15] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Guided Attentive Feature Fusion for Multispectral Pedestrian Detection,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2021, pp. 72–80.
  • [16] B. Yin, X. Zhang, Z. Li, L. Liu, M.-M. Cheng, and Q. Hou, “DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation,” in Proceedings of the International Conference on Learning Representations, 2024.
  • [17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models from Natural Language Supervision,” in Proceedings of the International Conference on Machine Learning, vol. 139, 2021, pp. 8748–8763.
  • [18] Z. Lai, N. Vesdapunt, N. Zhou, J. Wu, C. P. Huynh, X. Li, K. K. Fu, and C.-N. Chuah, “PADCLIP: Pseudo-Labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 16109–16119.
  • [19] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in Proceedings of the Advances in Neural Information Processing Systems Workshop, 2015.
  • [20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for Thin Deep Nets,” in Proceedings of the International Conference on Learning Representations, 2015.
  • [21] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning Efficient Object Detection Models with Knowledge Distillation,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [22] Y.-T. Chen, J. Shi, Z. Ye, C. Mertz, D. Ramanan, and S. Kong, “Multimodal Object Detection via Probabilistic Ensembling,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 139–158.
  • [23] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Low-Cost Multispectral Scene Analysis with Modality Distillation,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2022, pp. 803–812.
  • [24] Z. Zhao, H. Bai, J. Zhang, Y. Zhang, S. Xu, Z. Lin, R. Timofte, and L. Van Gool, “CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2023, pp. 5906–5916.
  • [25] L. Tang, X. Xiang, H. Zhang, M. Gong, and J. Ma, “DIVFusion: Darkness-Free Infrared and Visible Image Fusion,” Information Fusion, vol. 91, pp. 477–493, 2023.
  • [26] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, “Weakly Supervised Object Localization and Detection: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5866–5885, 2021.
  • [27] Y. Zhang, H. Yu, Y. He, X. Wang, and W. Yang, “Illumination-Guided RGBT Object Detection with Inter- and Intra-Modality Fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
  • [28] X. Zhang, X. Zhang, Z. Sheng, and H.-L. Shen, “TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection,” arXiv preprint arXiv:2305.16580, 2023.
  • [29] Z. Li, P. Xu, X. Chang, L. Yang, Y. Zhang, L. Yao, and X. Chen, “When Object Detection Meets Knowledge Distillation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” in Proceedings of the International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [31] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012, 2020.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
  • [35] “FREE FLIR Thermal Dataset for Algorithm Training,” https://www.flir.com/oem/adas/adas-dataset-form/.
  • [36] R. Abdelfattah, Q. Guo, X. Li, X. Wang, and S. Wang, “CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 1348–1357.
  • [37] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep Mutual Learning,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
  • [38] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting Knowledge Distillation via Label Smoothing Regularization,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2020, pp. 3903–3911.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [40] Z. Zhao, S. Xu, J. Zhang, C. Liang, C. Zhang, and J. Liu, “Efficient and Model-Based Infrared and Visible Image Fusion via Algorithm Unrolling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1186–1196, 2022.
  • [41] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, “DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 4980–4995, 2020.
  • [42] H. Li and X.-J. Wu, “DenseFuse: A Fusion Approach to Infrared and Visible Images,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2019.
  • [43] L. Tang, H. Zhang, H. Xu, and J. Ma, “Rethinking the Necessity of Image Fusion in High-Level Vision Tasks: A Practical Infrared and Visible Image Fusion Network Based on Progressive Semantic Injection and Scene Fidelity,” Information Fusion, vol. 99, p. 101870, 2023.
  • [44] H. Li, X.-J. Wu, and J. Kittler, “RFN-Nest: An End-to-End Residual Fusion Network for Infrared and Visible Images,” Information Fusion, vol. 73, pp. 72–86, 2021.
  • [45] L. Tang, J. Yuan, and J. Ma, “Image Fusion in the Loop of High-Level Vision Tasks: A Semantic-Aware Real-Time Infrared and Visible Image Fusion Network,” Information Fusion, vol. 82, pp. 28–42, 2022.
  • [46] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2Fusion: A Unified Unsupervised Image Fusion Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [47] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional Block Attention Module,” in Proceedings of the European Conference on Computer Vision, 2018.
  • [48] C. Devaguptapu, N. Akolekar, M. M. Sharma, and V. N. Balasubramanian, “Borrow from Anywhere: Pseudo Multi-Modal Object Detection in Thermal Imagery,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [49] F. Munir, S. Azam, M. A. Rafique, A. M. Sheri, and M. Jeon, “Thermal Object Detection using Domain Adaptation through Style Consistency,” arXiv preprint arXiv:2006.00821, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:219176719
  • [50] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks,” in Proceedings of the International Conference on Image Processing, 2020, pp. 276–280.
  • [51] M. Kieu, A. D. Bagdanov, and M. Bertini, “Bottom-Up and Layerwise Domain Adaptation for Pedestrian Detection in Thermal Images,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 17, no. 1, 2021.
  • [52] Q. Li, C. Zhang, Q. Hu, P. Zhu, H. Fu, and L. Chen, “Stabilizing Multispectral Pedestrian Detection with Evidential Hybrid Fusion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 3017–3029, 2024.
  • [53] Y. Cao, T. Zhou, X. Zhu, and Y. Su, “Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery,” in Proceedings of the International Conference on Computer and Communications, 2019, pp. 1965–1969.
  • [54] S. You, X. Xie, Y. Feng, C. Mei, and Y. Ji, “Multi-Scale Aggregation Transformers for Multispectral Object Detection,” IEEE Signal Processing Letters, vol. 30, pp. 1172–1176, 2023.
  • [55] Y. Cao, J. Bin, J. Hamari, E. Blasch, and Z. Liu, “Multimodal Object Detection by Channel Switching and Spatial Attention,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2023, pp. 403–411.
  • [56] Y. Zhu, X. Sun, M. Wang, and H. Huang, “Multi-Modal Feature Pyramid Transformer for RGB-Infrared Object Detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 9, pp. 9984–9995, 2023.
  • [57] V. Sam, K. Ali, M. Christian, K. Laurent, and E. Lutz, “Robust Environment Perception for Automated Driving: A Unified Learning Pipeline for Visual-Infrared Object Detection,” in IEEE Intelligent Vehicles Symposium, 2022, pp. 367–374.
  • [58] MMDetection Contributors, “OpenMMLab Detection Toolbox and Benchmark,” 2018. [Online]. Available: https://github.com/open-mmlab/mmdetection
Xue Zhang received his B.S. and M.Sc. degrees from Shandong Jianzhu University and Shandong University, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are optical compressive imaging, image classification, object detection, and knowledge distillation.
Si-Yuan Cao received his B.Eng. degree in electronic information engineering from Tianjin University in 2016, and Ph.D. degree in electronic science and technology from Zhejiang University in 2022. He is currently an Assistant Researcher in Ningbo Innovation Center, Zhejiang University, China. His research interests are multispectral/multimodal image registration, homography estimation, place recognition, and image processing.
Fang Wang received her B.Eng. and Ph.D. in Design and Construction of Naval Architecture and Ocean Structure from Harbin Engineering University in 2007 and 2012, respectively. She is currently an Associate Professor with the School of Information and Electrical Engineering, Hangzhou City University. Her current research interests include the autonomous control of unmanned vehicles, and urban air mobility.
Runmin Zhang received his B.Eng. degree from Zhejiang University in 2022. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University, China. His research interests are image registration and multimodal image restoration.
Zhe Wu received his B.Eng. degree from Xidian University in 2019 and M.Sc. degree from the University of Edinburgh in 2020. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are image restoration and image enhancement.
Xiaohan Zhang received the B.Eng. degree from Beijing Jiaotong University, China, in 2022. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests are object detection and image processing.
Xiaokai Bai received his B.Eng. degree from Zhejiang University in 2023. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University, China. His research interests are 3D object detection and autonomous driving.
Hui-Liang Shen (Senior Member, IEEE) received his B.Eng. and Ph.D. degrees in Electronic Engineering from Zhejiang University, Hangzhou, China, in 1996 and 2002, respectively. He was a Research Associate and Research Fellow with The Hong Kong Polytechnic University from 2001 to 2005. He is currently a Professor with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. His research interests are multispectral imaging, image processing, computer vision, deep learning, and machine learning.