Multimodal Object Detection via Probabilistic Ensembling (ECCV 2022)

³ Argo AI   ⁴ Texas A&M University
∗ Equal contribution. Most of the work was done when the authors were with CMU.
† Equal supervision.
1 Introduction
Object detection is a canonical computer vision problem that has been greatly
advanced by the end-to-end training of deep neural detectors [48,26]. Such de-
tectors are widely adopted in various safety-critical systems such as autonomous
vehicles (AVs) [22,7]. Motivated by AVs that operate in both day and night,
we study multimodal object detection with RGB and thermal cameras, since
the latter can provide much stronger object signatures under poor illumina-
tion [29,56,35,12,63,4].
Multimodal Data. There exist several challenges in multimodal detection.
One is the lack of data. While there exist large repositories of annotated single-modal
(RGB) datasets and pre-trained models, there is much less annotated
data for other modalities (thermal), and even fewer annotations that pair them
together. One often-ignored aspect is the alignment of the modalities: aligning RGB
and thermal images requires special purpose hardware, e.g., a beam-splitter [29]
or a specialized rack [52] for spatial alignment, and a GPS clock synchronizer for
temporal alignment [45]. Fusion on unaligned RGB-thermal inputs (cf. Fig. 4)
remains relatively unexplored. For example, even annotating bounding boxes is
cumbersome because separate annotations are required for each modality, in-
creasing overall cost. As a result, many unaligned datasets annotate only one
modality (e.g., FLIR [20]), further complicating multimodal learning.
Multimodal Fusion. The central question in multimodal detection is how to
fuse information from different modalities. Previous work has explored strategies
for fusion at various stages [9,56,35,62,63,4], which are often categorized into
early-, mid- and late-fusion. Early-fusion constructs a four-channel RGB-thermal
input [53], which is then processed by a (typical) deep network. In contrast, mid-
fusion keeps RGB and thermal inputs in different streams and then merges their
features downstream within the network (Fig. 2a) [53,39,33]. The vast majority
of past work focuses on architectural design of where and how to merge. Our
key contribution is the exploration of an extreme variant of very-late fusion of
detectors trained on separate modalities (Fig. 2b) through detector ensembling.
Fig. 2. High-level comparisons between mid- and late-fusion. (a) Prior work: mid-fusion
of features, e.g., concatenating features computed by single-modal feature extractors.
(b) Our focus: late-fusion of detections via a detector ensemble that fuses detections
from independent detectors, e.g., two single-modal detectors trained with RGB and
thermal images respectively.

Though conceptually simple, ensembling can be effective because one can learn
from single-modal datasets that often dwarf the size of multimodal datasets.
However, ensembling can be practically challenging because different detectors
might not fire on the same object. For example, RGB-based detectors often
fail to fire in nighttime conditions, implying one needs to deal with “missing”
detections during fusion.
Probabilistic Ensembling (ProbEn). We derive our very-late fusion ap-
proach, ProbEn, from first principles: simply put, if single-modal signals are
conditionally independent of each other given the true label, the optimal fusion
strategy is given by Bayes rule [44]. ProbEn requires no learning, and so does not
require any multimodal data for training. Importantly, ProbEn elegantly handles
“missing” modalities via probabilistic marginalization. While ProbEn is derived
assuming conditional independence, we empirically find that it can be used to
fuse outputs that are not strictly independent, by fusing outputs from other fu-
sion methods (both off-the-shelf and trained in-house). In this sense, ProbEn is a
general technique for ensembling detectors. We achieve significant improvements
over prior art, both on aligned and unaligned multimodal benchmarks.
Why ensemble? One may ask why detector ensembling should be re-
garded as an interesting contribution, given that ensembling is a well-studied
approach [21,32,3,13] that is often viewed as an “engineering detail” for improv-
ing leaderboard performance [34,28,25]. Firstly, we show that the precise ensem-
bling technique matters, and prior approaches proposed in the (single-modal)
detection literature such as score-averaging [34,14] or max-voting [57], are not
as effective as ProbEn, particularly when dealing with missing modalities. Sec-
ondly, to our knowledge, we are the first to propose detector ensembling as a
fusion method for multimodal detection. Though quite simple, it is remarkably
effective and should be considered a baseline for future research.
2 Related Work
Object Detection and Detector Ensembling. State-of-the-art detectors
train deep neural networks on large-scale datasets such as COCO [37] and often
focus on architectural design [40,46,47,48]. Crucially, most architectures gener-
ate overlapping detections which need to be post-processed with non-maximal
suppression (NMS) [10,5,51]. Overlapping detections could also be generated by
detectors tuned for different image crops and scales, which typically make use
of ensembling techniques for post-processing their output [1,25,28]. Somewhat
surprisingly, although detector ensembling and NMS are widely studied in single-
modal RGB detection, to the best of our knowledge, they have not been used to
(very) late-fuse multimodal detections; we find them remarkably effective.
Multimodal Detection, particularly with RGB-thermal images, has at-
tracted increasing attention. The KAIST pedestrian detection dataset [29] is one
of the first benchmarks for RGB-thermal detection, fostering growth of research
in this area. Inspired by the successful RGB-based detectors [48,46,40], current
multimodal detectors train deep models with various methods for fusing mul-
timodal signals [9,56,35,62,63,4,64,63,31]. Most of these multimodal detection
methods work on aligned RGB-thermal images, but it is unclear how they per-
form on heavily unaligned modalities such as images in Fig. 4 taken from FLIR
dataset [20]. We study multimodal detection under both aligned and unaligned
RGB-thermal scenarios.
Multimodal Fusion is the central question in multi-
modal detection. Compared to early-fusion that simply concatenates RGB and
thermal inputs, mid-fusion of single-modal features performs better [53]. There-
fore, most multimodal methods study how to fuse features and focus on designing
new network architectures [53,39,33]. Because RGB-thermal pairs might not be
aligned, some methods train an RGB-thermal translation network to synthesize
aligned pairs, but this requires annotations in each modality [12,41,30]. Inter-
estingly, few works explore learning from unaligned data that are annotated
only in a single modality; we show that mid-fusion architectures can still learn
in this setting by acting as an implicit alignment network. Finally, few fusion
architectures explore (very) late fusion of single-modal detections via detector
ensembling. Most that do simply take heuristic (weighted) averages of confidence
scores [23,35,63]. In contrast, we introduce probabilistic ensembling (ProbEn) for
late-fusion, which significantly outperforms prior methods on both aligned and
unaligned RGB-thermal data.
We now present multimodal fusion strategies for detection. We first point out
that single-modal detectors are viable methods for processing multimodal sig-
nals, and so include them as a baseline. We also include fusion baselines for
early-fusion, which concatenates RGB and thermal as a four-channel input, and
mid-fusion, which concatenates single-modal features inside a network (Fig. 2).
As a preview of results, we find that mid-fusion is generally the most effective
baseline (Table 1). Surprisingly, this holds even for unaligned data that is an-
notated with a single modality (Fig. 4), indicating that mid-fusion can perform
some implicit alignment (Table 3).
We describe strategies for late-fusing detectors from different modalities, or
detector ensembling. We begin with a naive approach (Fig. 1). Late-fusion needs
to fuse scores and boxes; we discuss the latter at the end of this section.
Naive Pooling. The simplest possible strategy is to naively pool detections
from multiple modalities together. This typically results in multiple detections
overlapping the same ground-truth object (Fig. 1a).
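As a sketch of these late-fusion baselines (our illustration using torchvision, not the paper's released code), naive pooling simply concatenates the per-modality detection lists, and NMS fusion additionally suppresses overlapping boxes, keeping only the highest-scoring one:

```python
import torch
from torchvision.ops import nms

def pool_detections(boxes_rgb, scores_rgb, boxes_thm, scores_thm):
    """Naive pooling: concatenate detections from both modalities."""
    boxes = torch.cat([boxes_rgb, boxes_thm])      # (N1+N2, 4) in (x1, y1, x2, y2)
    scores = torch.cat([scores_rgb, scores_thm])   # (N1+N2,)
    return boxes, scores

def nms_fusion(boxes_rgb, scores_rgb, boxes_thm, scores_thm, iou_thresh=0.5):
    """NMS fusion: pool, then keep only the highest-scoring box among overlaps."""
    boxes, scores = pool_detections(boxes_rgb, scores_rgb, boxes_thm, scores_thm)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```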
Probabilistic Ensembling (Algorithm 1). Assume we have an object with label y (e.g., a “person”) and
measured signals from two modalities: x1 (RGB) and x2 (thermal). We write
out our formulation for two modalities, but the extension to multiple (evaluated
in our experiments) is straightforward. Crucially, we assume measurements are
conditionally independent given the object label y: p(x_1, x_2 | y) = p(x_1|y)\,p(x_2|y).
This can also be written as p(x_1|y) = p(x_1|x_2, y), which may be easier to intuit.
Given the person label y, predict its RGB appearance x_1; if this prediction
would not change given knowledge of the thermal signal x_2, then conditional
independence holds. We wish to infer labels given multimodal measurements:
p(y|x_1,x_2) \propto p(x_1|y)\,p(x_2|y)\,p(y) \propto \frac{p(x_1|y)\,p(y)\;p(x_2|y)\,p(y)}{p(y)} \propto \frac{p(y|x_1)\,p(y|x_2)}{p(y)} \qquad (4)
The above suggests a simple approach to fusion that is provably optimal when
single-modal features are conditionally independent given the true object label:
1. Train independent single-modal classifiers that predict the distributions over
the label y given each individual feature modality p(y|x1 ) and p(y|x2 ).
2. Produce a final score by multiplying the two distributions, dividing by the
class prior distribution, and normalizing the final result (4) to sum-to-one.
To obtain the class prior p(y), we can simply normalize the counts of per-class
examples. Extending ProbEn (4) to M modalities is simple:

p(y \mid \{x_i\}_{i=1}^{M}) \propto \frac{\prod_{i=1}^{M} p(y|x_i)}{p(y)^{M-1}} \qquad (5)
Fig. 3. Missing modalities. The orange-person (a) fails to trigger a thermal detec-
tion (b), resulting in a single-modal RGB detection (0.85 confidence). To generate an
output set of detections (for downstream metrics such as average precision), this detec-
tion must be compared to the fused multimodal detection of the red-person (RGB: 0.80,
thermal: 0.70). (c) Averaging confidences for the red-person lowers its score (0.75)
below the orange-person's, which is unintuitive because additional detections should
boost confidence. (d) ProbEn increases the red-person's fused score to 0.90, allowing for
proper comparisons to single-modal detections.
We write the posterior of class-k given modality i in terms of the single-modal
logit score s_i[k]. For notational simplicity, we suppress its dependence on the
underlying input modality x_i: p(y{=}k|x_i) = \frac{\exp(s_i[k])}{\sum_j \exp(s_i[j])} \propto \exp(s_i[k]),
where we exploit the fact that the partition function in the denominator is not
a function of the class label k. We now plug the above into Eq. (5):
p(y{=}k \mid \{x_i\}_{i=1}^{M}) \propto \frac{\prod_{i=1}^{M} p(y{=}k|x_i)}{p(y{=}k)^{M-1}} \propto \frac{\exp\big(\sum_{i=1}^{M} s_i[k]\big)}{p(y{=}k)^{M-1}} \qquad (6)
ProbEn is thus equivalent to summing logits, dividing by the class prior and nor-
malizing via a softmax. Our derivation (6) reveals that summing logits without
the division may over-count class priors, where the over-counting grows with the
number of modalities M. The supplement shows that dividing by the class prior
p(y) marginally helps. In practice, we empirically find that assuming uniform
priors works surprisingly well, even on imbalanced datasets. This is the default
for our experiments, unless otherwise noted.
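For concreteness, below is a minimal sketch of ProbEn score fusion following Eq. (6); the function name and the uniform-prior default are ours, not the released implementation.

```python
import numpy as np

def proben_scores(logits, priors=None):
    """Fuse per-modality class logits via ProbEn (Eq. 6).

    logits: list of M arrays of shape (K,), one per modality that fired;
            modalities without a detection are simply omitted (marginalized).
    priors: optional array of shape (K,) holding class priors p(y);
            None corresponds to the uniform prior used by default.
    """
    logits = np.stack(logits)                  # (M, K)
    fused = logits.sum(axis=0)                 # sum logits over modalities
    if priors is not None:
        M = logits.shape[0]
        fused -= (M - 1) * np.log(priors)      # divide by p(y)^(M-1), in log-space
    fused -= fused.max()                       # numerical stability
    post = np.exp(fused)
    return post / post.sum()                   # softmax normalization
```

For the two-class numbers of Fig. 3, proben_scores([np.log([0.80, 0.20]), np.log([0.70, 0.30])]) returns roughly [0.90, 0.10], whereas averaging the confidences gives 0.75.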
Missing modalities. Importantly, summing and averaging behave profoundly
differently when fusing across “missing” modalities (Fig. 3). Intuitively, differ-
ent single-modal detectors often do not fire on the same object. This means
that to output a final set of detections above a confidence threshold (e.g., nec-
essary for computing precision-recall metrics), one will need to compare scores
from fused multi-modal detections with single modal detections, as illustrated in
Fig. 3. ProbEn elegantly deals with missing modalities because probabilistically-
normalized multi-modal posteriors p(y|x1 , x2 ) can be directly compared with
single-modal posteriors p(y|x1 ).
Bounding Box Fusion. Thus far, we have focused on fusion of class pos-
teriors. We now extend ProbEn to probabilistically fuse bounding box (bbox)
coordinates of overlapping detections. We repurpose the derivation from (4) for
a continuous bbox label rather than a discrete one. Specifically, we write z for
the continuous random variable defining the bounding box (parameterized by its
centroid, width, and height) associated with a given detection. We assume single-
modal detections provide a posterior p(z|xi ) that takes the form of a Gaussian
with a single variance σi2 , i.e., p(z|xi ) = N (µi , σi2 I) where µi are box coordinates
predicted from modality i. We also assume a uniform prior on p(z), implying
bbox coordinates can lie anywhere in the image plane. Doing so, we can write
p(z | x_1, x_2) \propto p(z|x_1)\,p(z|x_2) \propto \exp\Big(\frac{\Vert z-\mu_1 \Vert^2}{-2\sigma_1^2}\Big)\exp\Big(\frac{\Vert z-\mu_2 \Vert^2}{-2\sigma_2^2}\Big) \propto \exp\Big(\frac{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}{-2}\,\Vert z-\mu \Vert^2\Big), \quad \text{where}\quad \mu=\frac{\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}} \qquad (8)
We refer the reader to the supplement for a detailed derivation. Eq. (8) sug-
gests a simple way to probabilistically fuse box coordinates: compute a weighted
average of box coordinates, where weights are given by the inverse covariance.
We explore three methods for setting σ_i². The first method, “avg”, fixes σ_i² = 1,
amounting to simply averaging bounding box coordinates. The second, “s-avg”,
approximates σ_i² ≈ 1/p(y=k|x_i), implying that more confident detections should have
a higher weight when fusing box coordinates. This performs marginally better
than simply averaging. The third, “v-avg”, trains the detector to predict regression
variance/uncertainty using the Gaussian negative log-likelihood (GNLL)
loss [42] alongside the box regression loss. Interestingly, incorporating GNLL not
only produces better variance/uncertainty estimates that help fusion, but also
improves the detection performance of the trained detectors (details in supplement).
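A minimal sketch (ours) of the inverse-variance box fusion in Eq. (8): passing all-ones variances recovers “avg”, passing 1/p(y=k|x_i) recovers “s-avg”, and passing the variances predicted by the GNLL head recovers “v-avg”.

```python
import numpy as np

def fuse_boxes(boxes, variances=None):
    """Fuse overlapping boxes of the same object from different modalities (Eq. 8).

    boxes:     (M, 4) array of box coordinates.
    variances: (M,) per-detection variances sigma_i^2; None means all ones,
               which reduces to plain coordinate averaging ("avg").
    Returns the precision-weighted average of the coordinates.
    """
    boxes = np.asarray(boxes, dtype=float)
    if variances is None:
        variances = np.ones(len(boxes))
    weights = 1.0 / np.asarray(variances, dtype=float)   # inverse variance = precision
    weights /= weights.sum()
    return (weights[:, None] * boxes).sum(axis=0)
```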
4 Experiments
We validate different fusion methods on two datasets: KAIST [29] which is re-
leased under the Simplified BSD License, and FLIR [20] (Fig. 4), which allows
for non-commercial educational and research purposes. Because the two datasets
contain personally identifiable information such as faces and license plates, we
ensure that we (1) use them only for research, and (2) will release our code
and models to the public without redistributing the data. We first describe im-
plementation details and then report the experimental results on each dataset
(alongside their evaluation metrics) in separate subsections.
4.1 Implementation
We conduct experiments with PyTorch [43] on a single GPU (Nvidia GTX 2080).
We train our detectors (based on Faster-RCNN) with Detectron2 [54], using
SGD and a learning rate of 5e-3.

[Figure: qualitative testing examples shown in columns. Top: detections by our
mid-fusion model. Bottom: detections by ProbEn fusing the thermal-only and
mid-fusion models. Green, red and blue boxes denote true positives, false negatives
(missed detections) and false positives.]

For data augmentation, we adopt random flipping
and resizing. We pre-train our detector on COCO dataset [37]. As COCO has
only RGB images, fine-tuning the pre-trained detector on thermal inputs needs
careful pre-processing of thermal images (detailed below).
Pre-processing. All RGB and thermal images have intensity in [0, 255]. In
training an RGB-based detector, RGB input images are commonly processed
with mean subtraction [54], where the mean values are computed over all
the training images. Similarly, we calculate the mean value (135.438) over the
thermal training data. We find that using this precise mean subtraction for thermal
images yields better performance when fine-tuning the pre-trained detector.
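For illustration only, a minimal pre-processing sketch under our reading of this setup; the RGB means below are Detectron2's commonly used defaults (an assumption on our part), while 135.438 is the thermal mean quoted above.

```python
import numpy as np

RGB_MEAN = np.array([103.530, 116.280, 123.675])  # assumed Detectron2 default (BGR order)
THERMAL_MEAN = 135.438                            # mean over the thermal training images

def preprocess(rgb, thermal):
    """rgb: HxWx3 in [0, 255]; thermal: HxW in [0, 255] (replicated to 3 channels later)."""
    return rgb.astype(np.float32) - RGB_MEAN, thermal.astype(np.float32) - THERMAL_MEAN
```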
Stage-wise Training. We fine-tune the pre-trained detector to train single-
modal detectors and the early-fusion detectors. To train a mid-fusion detector,
we truncate the already-trained single-modal detectors, concatenate their features,
add a new detection head, and train the whole model (Fig. 2a). The late-fusion meth-
ods fuse detections from (single-modal) detectors. Note that all the late-fusion
methods are non-learned. We also experimented with learning-based late-fusion
methods (e.g., learning to fuse logits) but find them to be only marginally bet-
ter than ProbEn (9.08 vs. 9.16 in LAMR using argmax box fusion). Therefore,
we focus on the non-learned late fusion methods in the main paper and study
learning-based ones in the supplement.
Post-processing. When ensembling two detectors, we find it crucial to cal-
ibrate scores, particularly when we fuse detections from our in-house models
and off-the-shelf models released by others. We adopt the simple temperature
scaling for score calibration [24]. Please refer to the supplement for details.
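As a sketch of the temperature scaling we adopt (the exact fitting procedure is in the supplement; the function below is illustrative), a single scalar T per detector rescales its logits before the softmax:

```python
import numpy as np

def calibrate(logits, T=1.0):
    """Temperature-scaled posterior: p(y=k|x) proportional to exp(s[k] / T).

    T > 1 softens over-confident scores; T is typically fit on held-out
    detections for each detector before its scores are fused with ProbEn.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()
```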
Dataset. For KAIST [29], we use a sanitized train-set (7,601 examples) [35] and a
cleaned test-set (2,252 examples) [38]. We also follow the literature [29] and evaluate
under the “reasonable setting”, ignoring annotated persons that are occluded (tagged
by KAIST) or too small (<55 pixels). We follow this literature for fair comparison with
recent methods.
Metric. We measure detection performance with the Log-Average Miss Rate
(LAMR), which is a standard metric in pedestrian detection [15] and KAIST [29].
LAMR is computed by averaging the miss rate (false negative rate) at nine false
positives per image (FPPI) rates evenly spaced in log-space in the range 10^{-2}
to 10^{0} [29]. It does not evaluate detections that match ignored ground-
truth [15,29]. A true positive is a detection that matches a ground-truth object
with IoU>0.5 [29]; false positives are detections that do not match any ground-
truth; false negatives are miss-detections.
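For reference, a minimal sketch of this metric under one common convention (assuming an already-computed FPPI/miss-rate curve sorted by increasing FPPI); edge cases are simplified.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of miss rates sampled at 9 FPPI points log-spaced in [1e-2, 1e0]."""
    fppi, miss_rate = np.asarray(fppi), np.asarray(miss_rate)
    refs = np.logspace(-2.0, 0.0, 9)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # miss rate at the largest FPPI not exceeding the reference; if the curve
        # never reaches this FPPI, conservatively assume a miss rate of 1
        sampled.append(miss_rate[idx[-1]] if len(idx) else 1.0)
    sampled = np.clip(sampled, 1e-10, None)   # avoid log(0)
    return float(np.exp(np.mean(np.log(sampled))))
```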
Dataset. The FLIR dataset [20] consists of RGB images (captured by a FLIR
BlackFly RGB camera with 1280x1024 resolution) and thermal images (acquired
by a FLIR Tau2 thermal camera with 640x512 resolution). We resize all images to
resolution 640x512. FLIR has 10,228 unaligned RGB-thermal image pairs and
annotates only the thermal images (Fig. 4). Image pairs are split into a train-set
(8,862 images) and a validation set (1,366 images). FLIR evaluates three classes
with imbalanced numbers of examples [8,30,60,41,12]: 28,151 persons, 46,692 cars,
and 4,457 bicycles. Following [60], we remove 108 thermal images in the val-set
that do not have the RGB counterparts. For breakdown analysis w.r.t day/night
scenes, we manually tag the validation images with “day” (768) and “night”
(490). We will release our annotations to the public.
Misaligned modalities. Because FLIR’s RGB and thermal images are
heavily unaligned, it labels only thermal images and does not have RGB an-
notations. We can still train Early and MidFusion models using multimodal
inputs and the thermal annotations. These detectors might learn to internally
align the unaligned modalities to predict bounding boxes according to the ther-
mal annotations. Because we do not have an RGB-only detector, our ProbEn
ensembles EarlyFusion, MidFusion, and thermal-only detectors.
Metric. We measure performance using Average Precision (AP) [17,49]. Pre-
cision is computed over testing images within a single class, with true positives
that overlap ground-truth bounding boxes (e.g., IoU>0.5). Computing the av-
erage precision (AP) across all classes measures the performance in multi-class
object detection. Following [12,41,60,30,8], we define a true positive as a detec-
tion that overlaps a ground-truth with IoU>0.5. Note that AP used in the
multimodal detection literature is different from mAP [37], which averages over
different AP’s computed with different IoU thresholds.
[Figure: qualitative comparison of ground-truth annotations, thermal-only detector
outputs, and ProbEn outputs.]
Acknowledgement. This work was supported by the CMU Argo AI Center for Au-
tonomous Vehicle Research. Authors acknowledge valuable discussions with Jessica
Lee, Peiyun Hu, Jianren Wang, David Held, Kangle Deng, and Michel Laverne.
References
1. Akiba, T., Kerola, T., Niitani, Y., Ogawa, T., Sano, S., Suzuki, S.: Pfdet: 2nd
place solution to open images challenge 2018 object detection track. arXiv preprint
arXiv:1809.00778 (2018) 3
2. Albaba, B.M., Ozer, S.: Synet: An ensemble network for object detection in uav
images. In: 2020 25th International Conference on Pattern Recognition (ICPR).
pp. 10227–10234. IEEE (2021) 11
3. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine learning 36(1), 105–139 (1999) 3
4. Bertini, M., del Bimbo, A.: Task-conditioned domain adaptation for pedestrian
detection in thermal imagery. In: ECCV (2020) 2, 4, 11
5. Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms–improving object detec-
tion with one line of code. In: ICCV (2017) 3
6. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: Yolact: Real-time instance segmentation.
In: ICCV (2019) 5
7. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A.,
Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous
driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 11621–11631 (2020) 1
8. Cao, Y., Zhou, T., Zhu, X., Su, Y.: Every feature counts: An improved one-stage
detector in thermal imagery. In: IEEE International Conference on Computer and
Communications (ICCC) (2019) 12, 13, 14
9. Choi, H., Kim, S., Park, K., Sohn, K.: Multi-spectral pedestrian detection based on
accumulated object proposal with fully convolutional networks. In: International
Conference on Pattern Recognition (ICPR) (2016) 2, 4
10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005) 3
11. Dawid, A.P.: Conditional independence in statistical theory. Journal of the Royal
Statistical Society: Series B (Methodological) 41(1), 1–15 (1979) 6
12. Devaguptapu, C., Akolekar, N., M Sharma, M., N Balasubramanian, V.: Borrow
from anywhere: Pseudo multi-modal object detection in thermal imagery. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops. pp. 0–0 (2019) 2, 4, 12, 13, 14
13. Dietterich, T.G.: Ensemble methods in machine learning. In: International work-
shop on multiple classifier systems. pp. 1–15. Springer (2000) 3
14. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark.
In: CVPR (2009) 3, 5
15. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of
the state of the art. IEEE transactions on pattern analysis and machine intelligence
34(4), 743–761 (2011) 10
16. Dong, W.: https://github.com/wushidonguc/two-stream-action-recognition-keras/
blob/master/fuse_validate_model.py. commit 0a3e722 19
17. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman,
A.: The pascal visual object classes challenge: A retrospective. International journal
of computer vision 111(1), 98–136 (2015) 12
18. Feichtenhofer, C.: https://github.com/feichtenhofer/twostreamfusion/blob/
master/cnn_ucf101_fusion.m. commit 3e313c4 19
19. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fu-
sion for video action recognition. In: CVPR (2016) 19
39. Liu, J., Zhang, S., Wang, S., Metaxas, D.N.: Multispectral deep neural networks
for pedestrian detection. BMVC (2016) 2, 4, 5, 6, 11, 13
40. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd:
Single shot multibox detector. In: ECCV. pp. 21–37. Springer (2016) 3, 4
41. Munir, F., Azam, S., Rafique, M.A., Sheri, A.M., Jeon, M.: Thermal object
detection using domain adaptation through style consistency. arXiv preprint
arXiv:2006.00821 (2020) 4, 12, 13, 14
42. Nix, D.A., Weigend, A.S.: Estimating the mean and variance of the target proba-
bility distribution. In: Proceedings of 1994 ieee international conference on neural
networks (ICNN’94). vol. 1, pp. 55–60. IEEE (1994) 8
43. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
8
44. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible in-
ference. Elsevier (2014) 3, 6
45. Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R.,
Ng, A.Y.: Ros: an open-source robot operating system. In: ICRA workshop on
open source software. vol. 3, p. 5. Kobe, Japan (2009) 2
46. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: CVPR (2016) 3, 4
47. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017) 3
48. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. In: NeurIPS (2015) 1, 3, 4
49. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog-
nition challenge. International journal of computer vision 115(3), 211–252 (2015)
12
50. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog-
nition in videos. In: NeurIPS. pp. 568–576 (2014) 19, 24
51. Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes
from different object detection models. Image and Vision Computing 107, 104117
(2021) 3, 5
52. Valverde, F.R., Hurtado, J.V., Valada, A.: There is more than meets the eye: Self-
supervised multi-object detection and tracking with sound by distilling multimodal
knowledge. In: CVPR (2021) 2
53. Wagner, J., Fischer, V., Herman, M., Behnke, S.: Multispectral pedestrian detec-
tion using deep fusion convolutional neural networks. In: Proceedings of European
Symposium on Artificial Neural Networks (2016) 2, 4
54. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://
github.com/facebookresearch/detectron2 (2019) 8, 9
55. Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues
in a hybrid deep learning framework for video classification. In: Proceedings of the
ACM international conference on Multimedia (2015) 19
56. Xu, D., Ouyang, W., Ricci, E., Wang, X., Sebe, N.: Learning cross-modal deep
representations for robust pedestrian detection. In: CVPR (2017) 2, 4
57. Xu, P., Davoine, F., Denoeux, T.: Evidential combination of pedestrian detectors.
In: British Machine Vision Conference. pp. 1–14 (2014) 3, 5, 6, 13
58. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R.,
Toderici, G.: Beyond short snippets: Deep networks for video classification. In:
CVPR (2015) 19
59. Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. arXiv
preprint arXiv:1703.06953 (2017) 13
60. Zhang, H., Fromont, E., Lefèvre, S., Avignon, B.: Multispectral fusion for object
detection with cyclic fuse-and-refine blocks. In: IEEE International Conference on
Image Processing (ICIP) (2020) 12, 13, 14
61. Zhang, H., Fromont, E., Lefèvre, S., Avignon, B.: Guided attentive feature fusion
for multispectral pedestrian detection. In: WACV (2021) 11, 13, 14, 23
62. Zhang, L., Liu, Z., Zhang, S., Yang, X., Qiao, H., Huang, K., Hussain, A.: Cross-
modality interactive attention network for multispectral pedestrian detection. In-
formation Fusion 50, 20–29 (2019) 2, 4, 11
63. Zhang, L., Zhu, X., Chen, X., Yang, X., Lei, Z., Liu, Z.: Weakly aligned cross-modal
learning for multispectral pedestrian detection. In: ICCV (2019) 2, 4, 11
64. Zhou, K., Chen, L., Cao, X.: Improving multispectral pedestrian detection by ad-
dressing modality imbalance problems. In: Computer Vision–ECCV 2020: 16th Eu-
ropean Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII
16. pp. 787–803. Springer (2020) 4, 11
65. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. In: ICCV (2017) 13
66. Zhu, Y.: https://github.com/bryanyzhu/two-stream-pytorch/blob/master/
scripts/eval_ucf101_pytorch/temporal_demo.py. commit 32b6354 19
67. Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In:
ECCV. pp. 391–405. Springer (2014) 5
Appendix
The appendix provides additional studies about the proposed probabilistic en-
sembling technique (ProbEn). Below is a sketch of the document, and we refer the
reader to each of these sections for details.
– Section 6: Analysis of ProbEn and comparisons to other late-fusion methods.
– Section 7: Score calibration for ProbEn
– Section 8: Further study of weighted score fusion
– Section 9: Further study of class prior in ProbEn
– Section 10: A detailed derivation of probabilistic box fusion
– Section 11: A study of fusing more and better models
– Section 12: Qualitative results and video demo
where we exploit the fact that the partition function in the denominator is not
a function of the class label k. We now plug the above into Eq. 6:
p(y{=}k|x_1,x_2) \propto \frac{p(y{=}k|x_1)\,p(y{=}k|x_2)}{p(y{=}k)} \propto \frac{e^{s_1[k] + s_2[k]}}{p(y{=}k)} \qquad (10)
[Fig. 7: fused score as a function of the relative per-modality logits s1 and s2 for
different score-fusion methods (e.g., NMS).]
s_{\text{AvgLogit}}[k] = 0.5\,(s_1[k] + s_2[k]), \qquad s_{\text{Bayes}}[k] = s_1[k] + s_2[k] \qquad (12)
Note that the relative ordering of the fused logits does not necessarily imply the
same holds for the final posterior because the other class logits are needed to
compute the softmax partition function. One particularly simple case to ana-
lyze is a single-class detector k ∈ {0, 1}, as is true for the KAIST benchmark
(that evaluates only pedestrians). Here we can analytically compute posteriors
by looking at the relative logit score si = si [1] − si [0] for modality i (by relying
on the well-known fact that a 2-class softmax function reduces to a sigmoid func-
tion of the relative input scores). We visualize the fused probability as a function
of the relative per-modality logits s1 and s2 in Fig. 7. Finally, Table 5 explic-
itly compares the performance of such fusion approaches with other diagnostic
variants. We refer the reader to both captions for more analysis.
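To make the single-class analysis concrete, a small numerical sketch (our numbers, matching Fig. 3) comparing AvgLogit and Bayes/ProbEn fusion of relative logits:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# relative logits s_i = s_i[1] - s_i[0] from two modalities whose
# single-modal posteriors are 0.80 (RGB) and 0.70 (thermal)
s1, s2 = np.log(0.80 / 0.20), np.log(0.70 / 0.30)

p_avglogit = sigmoid(0.5 * (s1 + s2))  # AvgLogit fusion: ~0.75, below the stronger modality
p_bayes    = sigmoid(s1 + s2)          # Bayes/ProbEn fusion (uniform prior): ~0.90
```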
(a) RGB-detector (b) thermal-detector (c) NMS fusion (d) ProbEn fusion
Fig. 9. ProbEn handles false positives by lowering scores. Fig. 7 (d) shows that ProbEn
will reduce the fused score of overlapping detections with at least one low-scoring modal-
ity. This is an example from KAIST, where RGB- and thermal-detectors produce false-
positive pedestrian detections for the statues. NMS fusion keeps the higher-scoring
false-positive, while ProbEn lowers the fused score yet keeps a higher score for
the true-positive (which contains overlapping detections with consistently high scores).
ProbEn assumes that detectors return true class posteriors. However, deep net-
works are notoriously over-confident in their predictions, even when wrong [24].
One popular calibration strategy is adding a temperature parameter T to the softmax,
i.e., p(y{=}k|x_i) \propto \exp(s_i[k]/T_i), where T_i is tuned on held-out validation data.
More generally, one can learn per-class weights for fusing single-modal logits:

s_{\text{Learned}}[k] = w_1[k]\, s_1[k] + w_2[k]\, s_2[k] \qquad (14)
One can view ProbEn, AvgLogits, and Temperature Scaling as special cases
of the above. ProbEn and AvgLogits use predefined weights that do not require
learning and so are easy to implement. Temperature scaling requires single-modal
validation data to tune each temperature parameter, but does not require mul-
timodal learning. This can be advantageous in settings where modalities do not
align (e.g., FLIR) or where there exists larger collections of single-modal train-
ing data (e.g., COCO training data for RGB detectors). Truly joint learning of
weights requires multimodal training data, but joint learning may better deal
with correlated modalities by downweighting the contribution of modalities that
are highly correlated (and don’t provide independent sources of information). We
experimented with joint learning of the weights with logistic regression. To do
so, we assembled training examples of overlapping single-modal detections (and
cached logit scores) encountered during NMS, assigning a binary target label
(corresponding to true vs false positive detection). After training on such data,
we observe a small improvement over non-learned fusion (Table 5), consistent
with prior art on late fusion [50]. We also tested learning-based late fusion meth-
ods on the FLIR dataset. We further tested learning class priors. However, these
methods do not yield better performance than the simple non-learned ProbEn
(both achieve 82.91 AP). The reason is that FLIR annotations are inconsistent
across frames, making it hard for learning-based late fusion methods to shine,
as explained in Figs. 14 and 11.
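A minimal sketch of this learned late fusion under our reading of the setup (the scikit-learn call is illustrative, not the exact training script): logistic regression over pairs of cached relative logits with a binary true/false-positive target, recovering weights of the form in Eq. (14).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion_weights(X, y):
    """X: (N, 2) cached relative logits (s_rgb, s_thermal) for overlapping detection
    pairs collected during NMS; y: (N,) with 1 for true positives, 0 otherwise."""
    clf = LogisticRegression()
    clf.fit(X, y)              # learns w_1, w_2 and a bias term
    return clf

def fused_score(clf, s_rgb, s_thermal):
    """Posterior that the fused detection is a true positive."""
    return clf.predict_proba(np.array([[s_rgb, s_thermal]]))[0, 1]
```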
Fig. 11. We zoom in on a frame from Fig. 14 to visualize more clearly that the ground-
truth annotations can miss even bicycles and persons, as shown in the third image.
In contrast, our ProbEn model can detect these unlabeled objects (cf. red arrows).
This illustrates the annotation issues in the FLIR dataset.
To compute class priors on FLIR, we count the training examples of each of the
three classes and assign the fourth, background class a dummy number. Then, we
normalize them to sum to one as class priors. We vary the background prior and
evaluate the final detection performance measured by AP at IoU>0.5, as shown in
Fig. 12. Clearly, ProbEn works better with uniform priors than with the computed
class priors.
We further ablate which class is more important by manually assigning a
prior. Concretely, we vary one class prior while fixing the others to be equal.
We plot the performance vs. the per-class prior in Fig. 13. Tuning specific class
priors yields only marginal improvements compared to using a uniform prior.
p(z | x_1, x_2) \propto p(z|x_1)\,p(z|x_2)
\propto \exp\Big(\frac{\Vert z-\mu_1 \Vert^2}{-2\sigma_1^2}\Big)\exp\Big(\frac{\Vert z-\mu_2 \Vert^2}{-2\sigma_2^2}\Big)
\propto \exp\Big(\frac{z^{T}z - 2\mu_1^{T}z + \mu_1^{T}\mu_1}{-2\sigma_1^2} + \frac{z^{T}z - 2\mu_2^{T}z + \mu_2^{T}\mu_2}{-2\sigma_2^2}\Big)
\propto \exp\Big(\frac{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}{-2}\Big(z^{T}z - 2\,\frac{\frac{\mu_1^{T}}{\sigma_1^2}+\frac{\mu_2^{T}}{\sigma_2^2}}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}\,z\Big)\Big)
\propto \exp\Big(\frac{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}{-2}\,\Vert z-\mu \Vert^2\Big), \quad \text{where}\quad \mu=\frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2}
Fig. 12. A study of ProbEn with class priors as class frequencies in the training set. We
use FLIR dataset for this study as it has 3 imbalanced classes. We fuse three models
(Thermal, Early and Mid) as used in the main paper. As there is a background class, we
vary the background prior and proportionally change the remaining class priors. Clearly, ProbEn
with uniform class priors performs better than using the computed priors. Tuning the
background prior does not notably affect the final detection performance once this prior
is set to be larger than 0.1.
Fig. 13. A study of tuning a single class prior while keeping others the same. Moti-
vated by the superior performance of ProbEn with uniform priors, we tune each class
prior while fixing the others to be equal. We study this on the FLIR dataset by fusing
three models (Thermal, Early and Mid). We can see that tuning specific classes only
marginally improves detection performance.
Fig. 14. We demonstrate inconsistent annotations in the FLIR dataset with four con-
secutive frames from the validation set. The top row lists four RGB frames for reference.
The middle row displays thermal images and the ground-truth annotations. Looking at the
annotations in the orange rectangle, we can see that the annotations are not consistent
across frames. This is a critical issue that prevents learning-based late fusion from
improving further on the FLIR dataset. The bottom row displays the detection results
of ProbEn fusing the three models (Thermal, Early, and Mid). Interestingly, the predic-
tions look more reasonable in detecting pedestrians within the orange rectangles. In
this sense, the predictions are “better” than the annotations, intuitively explaining why
learning-based late fusion does not improve performance further. Please also refer to
Fig. 11 for a zoomed-in visualization.
Table 8. ProbEn always outperforms NMS when applied to the same en-
semble of (even strong) detections. Results are comparable to Table 1.
fusing MLPD and GAFF on KAIST (LAMR↓ in %)
method Day Night All
MLPD 7.96 6.95 7.58
GAFF 8.25 3.46 6.38
NMS (MLPD+GAFF) 7.63 6.76 7.24
ProbEn (MLPD+GAFF) 6.23 3.79 5.38
NMS3 w/ MLPD 7.34 7.03 7.13
ProbEn3 w/ MLPD 7.81 5.02 6.76
NMS3 w/ GAFF 8.29 3.46 6.36
ProbEn3 w/ GAFF 6.04 3.59 5.14
Fig. 15. We attach a demo video in our Github repository. The demo video is generated
based on a testing video (captured at night) provided by the FLIR dataset. Here we
display two video frames of the same scene, comparing detections by a thermal-only
single-modal detector and by ProbEn fusing three detectors (Thermal, Early-fusion
and Mid-fusion). The thermal detector misses a car and produces an overly large
bounding box for the rightmost car (right frame); in contrast, ProbEn successfully
detects all the cars and produces tight bounding boxes. We refer the reader to the
video demo for a more convincing visualization.
Fig. 16. Qualitative results on more testing examples in KAIST dataset. We place
RGB-thermal images in pairs: in each macro row, we show RGB images in the upper
row and thermal images in the lower row. Over the RGB images, we overlay the detection
results from our MidFusion model; on the thermal images, we show results from our
best-performing ProbEn model. Green, red and blue boxes stand for true positives,
false negatives (missed persons) and false positives.
Fig. 17. Qualitative results on more testing examples in FLIR dataset. We place RGB-
thermal images in triplet: in each macro row (divided by the black line), we show RGB
images in the upper row and thermal images in two lower rows. Over RGB images,
we overlay ground-truth annotations, highlighting that RGB and thermal images are
strongly unaligned. To avoid clutter, we do not mark class labels for the bounding boxes.
On the thermal images, we show detection results from our thermal-only (mid-row) and
best-performing ProbEn (with bounding box fusion) model (bottom-row). Green, red
and blue boxes stand for true positives, false negatives (missed persons) and false
positives.