
¹ University of Freiburg, Germany
² Qualcomm SARL France
³ QT Technologies Ireland Limited
⁴ Federal University of Rio Grande, Brazil
⁵ University of Technology Nuremberg, Germany
http://letsmap.cs.uni-freiburg.de

LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping

Nikhil Gosala¹, Kürsat Petek¹, B Ravi Kiran², Senthil Yogamani³, Paulo Drews-Jr⁴, Wolfram Burgard⁵, Abhinav Valada¹

Abstract

Semantic Bird’s Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only $1\%$ of BEV labels and no additional labeled data.

Keywords: Unsupervised Representation Learning · Semantic BEV Mapping · Scene Understanding

1 Introduction

Semantic Bird’s Eye View (BEV) maps are essential for autonomous driving as they offer rich, occlusion-aware information for height-agnostic applications including object tracking, collision avoidance, and motion control. Instantaneous BEV map estimation that does not rely on large amounts of annotated data is crucial for the rapid deployment of autonomous vehicles in novel domains. However, the majority of existing BEV mapping approaches follow a fully supervised learning paradigm and thus rely on large amounts of annotated BEV data, which is extremely arduous to obtain and hinders the scalability of autonomous vehicles to novel environments [cit:bev-seg-pan2020vpn, cit:bev-seg-lu2019ved, cit:bev-seg-pon, cit:bev-seg-lss]. Recent works circumvent this problem by leveraging frontal view (FV) semantic labels both for learning scene geometry and for generating BEV pseudolabels [cit:bev-seg-skyeye], or by leveraging semi-supervised learning using pairs of labeled and unlabeled samples [cit:bev-seg-s2g2]. However, the reliance on FV labels as well as the integrated network design of both approaches gives rise to three main challenges: (1) FV labels offer scene geometry supervision only along class boundaries, which limits the geometric reasoning ability of the model; (2) FV labels are dataset-specific, and any change in class definition mandates full model retraining; and (3) tightly coupled network designs hinder the quick adoption of the latest advances from the literature.

Figure 1: LetsMap: The first unsupervised framework for label-efficient semantic BEV mapping. We use RGB image sequences to independently learn scene geometry (yellow) and scene representation (blue) in an unsupervised pretraining step, before adapting it to semantic BEV mapping in a label-efficient finetuning step.

In this work, we address these limitations by proposing the first unsupervised representation learning framework for predicting semantic BEV maps from monocular FV images in a label-efficient manner. Our approach, LetsMap, utilizes the spatiotemporal consistency and dense representation offered by FV image sequences to alleviate the need for manually annotated data. To this end, we disentangle the two sub-tasks of semantic BEV mapping, i.e., scene geometry modeling and scene representation learning, into two disjoint neural pathways (Fig. 1) and learn them using an unsupervised pretraining step. We then finetune the resultant model for semantic BEV mapping using only a small fraction of labels in BEV. LetsMap explicitly learns to model the scene geometry via the geometric pathway by leveraging implicit fields, while learning scene representations via the semantic pathway using a novel temporal masked autoencoder (T-MAE) mechanism. During pretraining, we supervise the geometric pathway by exploiting the spatial and temporal consistency of the multi-camera FV images across multiple timesteps and train the semantic pathway by enforcing reconstruction of the FV images for both the current and future timesteps using the masked image of only the current timestep. We extensively evaluate LetsMap on the KITTI-360 [cit:dataset-kitti360] and nuScenes [cit:dataset-nuscenes] datasets and demonstrate that our approach performs on par with existing fully-supervised and self-supervised approaches while using only $1\%$ of BEV labels, without leveraging any additional labeled data.

2 Related Work

In this section, we discuss existing work on semantic BEV mapping, scene geometry estimation from monocular cameras, and image-based scene representation learning.

BEV Segmentation: Monocular semantic BEV mapping methods typically focus on learning a lifting mechanism to transform features from FV to BEV. Early works such as VED [cit:bev-seg-lu2019ved] and VPN [cit:bev-seg-pan2020vpn] learn the transformation without using scene geometry, which limits their real-world performance. PON [cit:bev-seg-pon] solves this issue by incorporating scene geometry into the network design, while LSS [cit:bev-seg-lss] learns a depth distribution to transform features from FV to BEV. PanopticBEV [cit:bev-seg-panopticbev] splits the world into flat and non-flat regions and transforms them to BEV using two disjoint pathways. Recent methods use transformers to generate BEV features from both single images [cit:bev-seg-tiim] and multi-view images [cit:bev-seg-bevformerv2]. Some works also use multi-modal data to augment monocular cameras [cit:bev-seg-hdmapnet, cit:bev-seg-bevfusion, cit:bev-seg-simplebev, schramm2024bevcar]. All the aforementioned approaches follow a fully supervised learning paradigm and rely on vast amounts of resource-intensive human-annotated semantic BEV labels. Recent works reduce reliance on BEV ground truth labels by combining labeled and unlabeled images in a semi-supervised manner [cit:bev-seg-s2g2] or by leveraging FV labels to generate BEV pseudolabels and train the network in a self-supervised manner [cit:bev-seg-skyeye]. However, these approaches rely on additional labeled data or use tightly coupled network designs, which limits their ability to scale to new environments or incorporate the latest advances in the literature. In this paper, we propose a novel unsupervised label-efficient approach that first learns scene geometry and scene representation in a modular, label-free manner before adapting to semantic BEV mapping using only a small fraction of BEV semantic labels.

Monocular Scene Geometry Estimation: Scene geometry estimation is a fundamental challenge in computer vision and is a core component of 3D scene reconstruction. Initial approaches use techniques such as multi-view stereo [furukawa2009accurate] and visual SLAM [cit:visual-slam, vodisch2022continual], while recent approaches leverage learnable functions in the form of ray distance functions [cit:ray-distance-functions] or implicit neural fields [cit:nerf-orig]. Early approaches based on neural radiance fields were optimized on single scenes and relied on substantial amounts of training data [cit:nerf-orig]. PixelNeRF [cit:pixel-nerf] addresses these issues by conditioning NeRF on input images, enabling simultaneous optimization across different scenes. Recent works improve upon PixelNeRF by decoupling color from scene density estimation [cit:behind-the-scenes], and by using a tri-planar representation to query the neural field from any world point [cit:neo360]. In our approach, we leverage implicit fields to predict the volumetric density from a single monocular FV image and use it to constrain the uniformly lifted 2D scene representation features.

Scene Representation Learning: Early works used pretext tasks such as image permutation [cit:ssl-image-permutation], rotation prediction [cit:ssl-image-rotation], noise discrimination [hindel2023inod], and frame ordering [lang2024self] to learn scene representations, but these tasks were primitive and generalized poorly across diverse tasks. [cit:ssl-moco, cit:ssl-simclr] propose using contrastive learning to learn scene representations, and [cit:ssl-swav] builds upon this paradigm by removing the need for negative samples during training. Recent works propose masked autoencoders [cit:ssl-mae], wherein the network predicts masked input image patches using its learned high-level understanding of the scene. More recently, foundation models such as DINO [cit:dino-v1] and DINOv2 [cit:dino-v2] employ self-distillation on large amounts of curated data to learn rich representations of the scene. However, all these approaches work on single-timestep images and fail to leverage scene consistency over multiple timesteps. In this work, we explicitly enforce scene consistency over multiple timesteps by proposing a novel temporal masked autoencoding strategy to learn rich scene representations.

3 Technical Approach

In this section, we present LetsMap, the first unsupervised learning framework for predicting semantic BEV maps from monocular FV images using a label-efficient training paradigm. Fig. 2 illustrates our framework. The key idea of our approach is to leverage sequences of multi-camera FV images to learn the two core sub-tasks of semantic BEV mapping, i.e., scene geometry modeling and scene representation learning, using two disjoint neural pathways following a label-free paradigm, before adapting the network to the downstream task in a label-efficient manner. We achieve this behavior by splitting the training protocol into sequential FV pretraining and BEV finetuning stages. The FV pretraining stage learns to explicitly model the scene geometry by enforcing scene consistency over multiple views using the photometric loss ($\mathcal{L}_{\text{photom}}$, Sec. 3.2) while learning the scene representation by reconstructing a masked input image over multiple timesteps using the reconstruction loss ($\mathcal{L}_{\text{rgb}}$, Sec. 3.3). Once pretraining is complete, the finetuning phase adapts the network to the task of semantic BEV mapping using the cross-entropy loss on the tiny fraction of available BEV labels ($\mathcal{L}_{\text{bev}}$, Sec. 3.4). The total loss of the network is thus computed as:

$$\mathcal{L}=\begin{cases}\mathcal{L}_{\text{photom}}+\mathcal{L}_{\text{rgb}} & \text{when pretraining}\\ \mathcal{L}_{\text{bev}} & \text{when finetuning}\end{cases}. \qquad (1)$$
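
The stage-dependent objective of Eq. (1) can be sketched as a simple switch. This is a minimal illustration and not the authors' implementation: the function name `total_loss`, the stage strings, and the scalar inputs are assumptions made for this example.

```python
def total_loss(stage, l_photom=0.0, l_rgb=0.0, l_bev=0.0):
    """Select the objective of Eq. (1) for a training stage.

    The stage names and scalar loss inputs are hypothetical; in practice
    each term would come from the corresponding network head.
    """
    if stage == "pretraining":
        # Label-free stage: photometric (geometry) + RGB reconstruction (semantics).
        return l_photom + l_rgb
    if stage == "finetuning":
        # Label-efficient stage: cross-entropy on the available BEV labels only.
        return l_bev
    raise ValueError(f"unknown stage: {stage!r}")


pretrain_value = total_loss("pretraining", l_photom=0.5, l_rgb=0.25)
finetune_value = total_loss("finetuning", l_bev=1.5)
```

Keeping the two stages in separate branches mirrors the sequential protocol described above: no BEV label enters the objective during pretraining.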

3.1 Network Architecture

Our proposed LetsMap architecture, as shown in Fig. 2, consists of a pretrained DINOv2 [cit:dino-v2] (ViT-b) backbone to generate multi-scale features from an input image; a geometry pathway comprising a convolution-based adapter followed by an implicit neural field to predict the scene geometry; a semantic pathway encompassing a sparse convolution-based adapter to capture representation-specific features; an RGB reconstruction head to facilitate reconstruction of the masked input image patches over multiple timesteps; and a BEV semantic head to generate a semantic BEV map from the input monocular FV image during the finetuning phase.

During pretraining, an input image $\mathcal{I}_{0}$ is processed by the backbone to generate feature maps of three scales. The geometry pathway, $\mathcal{G}$ , processes these multi-scale features using a BiFPN [tan2020efficientdet] layer followed by an implicit field module to generate the volumetric density of the scene at the current timestep. In a parallel branch, a masking module first randomly masks non-overlapping patches in $\mathcal{I}_{0}$ and the backbone then processes the visible patches to generate the corresponding image features. The semantic pathway $\mathcal{S}$ then generates the representation-specific features using a five-layer adapter that ensures propagation of masked regions using the convolution masking strategy outlined in [cit:spark-masked-convolution]. We then uniformly lift the resultant 2D features to 3D using the camera projection equation and multiply them with the volumetric density computed from $\mathcal{G}$ to generate scene-consistent voxel features. We warp the voxel grid to multiple timesteps using the ego-motion and collapse it into 2D by applying the camera projection equation along the depth dimension. The RGB reconstruction head then predicts the pixel values for each of the masked patches to reconstruct the image at different timesteps. During finetuning, we disable image masking and orthographically collapse the voxel features along the height dimension to generate the BEV features. A BEV semantic head processes these features to generate semantic BEV predictions.
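
The lifting, density weighting, and height collapse described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single-channel feature map, a pinhole camera with hypothetical intrinsics, nearest-pixel lookup, and a plain sum over the height axis (the text does not specify the reduction).

```python
import math


def lift_and_collapse(feat, density, xs, ys, zs, intrinsics):
    """Lift a 2D feature map into a voxel grid, weight it by density,
    and sum over height to obtain a BEV grid indexed as bev[iz][ix].

    Camera frame (assumed): x right, y down, z forward.
    density is indexed as density[iz][iy][ix].
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(feat), len(feat[0])
    bev = [[0.0 for _ in xs] for _ in zs]
    for iz, z in enumerate(zs):
        if z <= 0:
            continue  # points behind the camera cannot be projected
        for ix, x in enumerate(xs):
            for iy, y in enumerate(ys):
                # Pinhole projection of the voxel centre onto the image plane.
                col = math.floor(fx * x / z + cx)
                row = math.floor(fy * y / z + cy)
                if 0 <= row < height and 0 <= col < width:
                    # Uniform lifting: every voxel on a camera ray receives the
                    # same 2D feature; the density decides how much it counts.
                    bev[iz][ix] += density[iz][iy][ix] * feat[row][col]
    return bev


# Toy example: 4x4 feature map, 2x2x2 voxel grid, two occupied voxels.
feat = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
density = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
density[0][1][0] = 1.0   # z = 1.0, y = 0.5, x = -0.5
density[1][0][1] = 0.5   # z = 2.0, y = -0.5, x = 0.5
bev = lift_and_collapse(feat, density, xs=[-0.5, 0.5], ys=[-0.5, 0.5],
                        zs=[1.0, 2.0], intrinsics=(2.0, 2.0, 2.0, 2.0))
```

Because empty voxels have zero density, features lifted along a ray only survive where the geometry pathway predicts occupied space, which is what makes the voxel features scene-consistent.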

$\alpha_{i}=1-\exp(-\sigma_{\mathbf{x}_{i}}\delta_{i}),$ (3)
$\hat{d}_{\mathbf{u}}=\sum_{i=1}^{K}\Big(\prod_{j=1}^{i-1}(1-\alpha_{j})\Big)\alpha_{i}d_{i},$ (4)
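
As a worked illustration of Eqs. (3) and (4), the sketch below accumulates the expected depth along one camera ray. It assumes the standard volume-rendering opacity $\alpha_{i}=1-\exp(-\sigma_{\mathbf{x}_{i}}\delta_{i})$, with $\sigma_{\mathbf{x}_{i}}$ the density at sample $i$, $\delta_{i}$ the spacing between consecutive samples, and $d_{i}$ the distance of sample $i$ from the camera; the function name and the toy inputs are hypothetical.

```python
import math


def expected_depth(sigmas, deltas, depths):
    """Expected ray termination depth from per-sample densities."""
    transmittance = 1.0  # product of (1 - alpha_j) over all previous samples
    depth = 0.0
    for sigma, delta, d in zip(sigmas, deltas, depths):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        depth += transmittance * alpha * d      # contribution weighted as in Eq. (4)
        transmittance *= 1.0 - alpha            # light surviving past this sample
    return depth


# Free space for two samples, then an opaque surface at distance 3.
surface_depth = expected_depth([0.0, 0.0, 1e9], [1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
# A ray through empty space accumulates no depth.
empty_depth = expected_depth([0.0, 0.0], [1.0, 1.0], [1.0, 2.0])
```

With zero density the opacity is 0 and the ray passes through untouched; with very large density the opacity saturates at 1 and the ray terminates at that sample.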

Method	FV	BEV	Road	Side.	Build.	Terrain	Person	2-Wh.	Car	Truck	mIoU
IPM [cit:bev-seg-ipm-original]	100%	-	53.03	24.90	15.19	32.31	0.20	0.36	11.59	1.90	17.44
VED [cit:bev-seg-lu2019ved]	-	100%	65.97	35.41	37.28	34.34	0.13	0.07	23.83	8.89	25.74
VPN [cit:bev-seg-pan2020vpn]	-	100%	69.90	34.31	33.65	40.17	0.56	2.26	27.76	6.10	26.84
PON [cit:bev-seg-pon]	-	100%	67.98	31.13	29.81	34.28	2.28	2.16	37.99	8.10	26.72
PoBEV [cit:bev-seg-panopticbev]	-	100%	70.14	35.23	34.68	40.72	2.85	5.63	39.77	14.38	30.42
PoBEV [cit:bev-seg-panopticbev]	-	1%	60.41	20.97	24.65	23.38	0.15	0.23	21.71	1.23	19.09
SkyEye [cit:bev-seg-skyeye]	100%	1%	69.26	33.48	32.79	39.46	0.00	0.34	32.36	7.93	26.94
LetsMap (Ours)	0%	1%	70.58	34.26	40.68	38.53	1.35	4.74	30.94	10.58	28.96
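
The mIoU column can be cross-checked against the per-class scores. The sketch below assumes that mIoU is the unweighted mean over the eight listed classes, which is consistent with the LetsMap row of the table above up to rounding; the helper name `mean_iou` is hypothetical.

```python
def mean_iou(per_class_iou):
    """Unweighted mean of per-class IoU scores (in percent)."""
    return sum(per_class_iou) / len(per_class_iou)


# Road, Side., Build., Terrain, Person, 2-Wh., Car, Truck (LetsMap row above).
letsmap_row = [70.58, 34.26, 40.68, 38.53, 1.35, 4.74, 30.94, 10.58]
letsmap_miou = mean_iou(letsmap_row)  # about 28.96, matching the reported value
```

Because every class carries equal weight, rare classes such as Person and 2-Wh. pull the mean down as strongly as the well-segmented Road class pulls it up.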

Method	FV	BEV	Road	Side.	Manm.	Terrain	Person	2-Wh.	Car	Truck	mIoU
IPM [cit:bev-seg-ipm-original]	100%	-	43.51	9.05	26.21	16.60	0.14	0.72	4.65	3.67	13.07
VED [cit:bev-seg-lu2019ved]	-	100%	67.97	25.23	49.69	31.51	0.80	1.28	21.85	17.51	26.98
VPN [cit:bev-seg-pan2020vpn]	-	100%	66.47	23.94	47.65	33.19	2.02	4.13	22.66	18.33	27.30
PON [cit:bev-seg-pon]	-	100%	67.50	24.49	47.02	30.86	2.49	6.85	26.68	18.85	28.09
PoBEV [cit:bev-seg-panopticbev]	-	100%	70.15	27.87	50.04	35.32	3.89	7.06	31.60	21.27	30.90
PoBEV [cit:bev-seg-panopticbev]	-	$\approx\frac{1}{40}\%$	64.55	19.85	45.21	28.45	1.20	1.06	20.45	11.48	24.03
LetsMap (Ours)	0%	$\approx\frac{1}{40}\%$	67.72	27.06	47.10	34.78	3.31	5.79	21.92	13.57	27.66

BEV	Model	FV	PT	Epochs	Road	Side.	Build.	Terr.	Pers.	2-Wh.	Car	Truck	mIoU
1%	PoBEV	✗	-	100	60.41	20.97	24.65	23.38	0.15	0.23	21.71	1.23	19.09
	SkyEye	✓	✓		69.26	33.48	32.79	39.46	0.00	0.34	32.36	7.93	26.94
	LetsMap	✗	✗		69.40	32.09	34.75	35.27	1.01	2.79	28.76	7.66	26.47
	LetsMap	✗	✓		70.58	34.26	40.68	38.53	1.35	4.74	30.94	10.58	28.96
5%	PoBEV	✗	-	80	64.45	27.36	30.15	31.66	0.69	0.98	29.75	6.06	23.89
	SkyEye	✓	✓		72.16	37.20	34.89	42.97	4.77	9.16	40.74	9.88	31.47
	LetsMap	✗	✗		72.80	37.89	38.59	40.06	2.34	5.62	34.86	16.26	31.05
	LetsMap	✗	✓		73.74	39.56	42.07	41.49	2.46	6.32	34.68	14.88	31.90
10%	PoBEV	✗	-	50	66.58	30.28	31.76	34.50	1.22	3.28	33.43	7.56	26.08
	SkyEye	✓	✓		73.36	38.30	37.54	44.62	4.80	9.67	42.84	10.06	32.65
	LetsMap	✗	✗		74.31	38.45	40.04	41.26	3.19	6.02	35.56	16.53	31.92
	LetsMap	✗	✓		74.74	39.40	43.63	43.33	2.91	6.95	37.62	18.09	33.33
50%	PoBEV	✗	-	30	69.88	33.81	33.40	40.48	2.47	4.63	38.81	9.84	29.16
	SkyEye	✓	✓		73.10	39.23	38.08	45.72	4.05	10.44	44.72	12.10	33.43
	LetsMap	✗	✗		73.89	38.42	42.25	41.46	2.26	6.26	37.20	15.08	32.10
	LetsMap	✗	✓		74.29	38.48	43.87	42.77	2.80	5.22	37.68	15.20	32.54
100%	PoBEV	✗	-	20	70.14	35.23	34.68	40.72	2.85	5.63	39.77	14.38	30.42
	SkyEye	✓	✓		73.57	39.45	38.74	46.06	3.95	9.66	45.21	10.92	33.44
	LetsMap	✗	✗		74.22	39.39	42.86	42.96	2.55	6.66	35.68	17.11	32.68
	LetsMap	✗	✓		74.81	38.59	42.58	43.67	3.52	6.21	38.47	15.24	32.88

Model	Geometric	Semantic	Road	Side.	Build.	Terr.	Pers.	2-Wh.	Car	Truck	mIoU
L1	✗	✗	69.40	32.09	34.75	35.27	1.01	2.79	28.76	7.66	26.47
L2	✓	✗	70.85	34.34	38.12	35.03	0.93	4.06	29.79	8.84	27.75
L3	✓	✓	70.58	34.26	40.68	38.53	1.35	4.74	30.94	10.58	28.96

3.2 Geometric Pathway

3.3 Semantic Pathway

3.4 BEV Finetuning

4 Experimental Results

4.1 Datasets

4.2 Training Protocol

4.3 Quantitative Results

4.4 Ablation Study

S.1 Additional Ablative Experiments

S.1.1 DINOv2 Backbone Variants

S.1.2 Masking Patch Size

S.1.3 Impact of DINOv2 on Baseline Approaches

S.1.4 BEV Finetuning using SkyEye Split

S.2 Additional Qualitative Results

Patch Size	Road	Side.	Build.	Terr.	Pers.	2-Wh.	Car	Truck	mIoU
14	71.45	33.41	36.89	37.48	0.75	3.69	30.05	9.23	27.87
28	70.58	34.26	40.68	38.53	1.35	4.74	30.94	10.58	28.96
56	70.02	34.10	38.87	37.88	1.37	4.71	30.91	9.66	28.44

BEV	Model	FV	PT	Backbone	Road	Side.	Build.	Terr.	Pers.	2-Wh.	Car	Truck	mIoU
1%	PoBEV	✗	-	Native	60.41	20.97	24.65	23.38	0.15	0.23	21.71	1.23	19.09
	SkyEye	✓	✓	Native	69.26	33.48	32.79	39.46	0.00	0.34	32.36	7.93	26.94
	PoBEV	✗	-	DINOv2	62.36	21.02	27.18	24.22	0.04	0.12	17.50	0.95	19.17
	SkyEye	✓	✓		65.13	29.56	29.02	34.22	0.78	2.87	26.04	5.12	24.09
	LetsMap	✗	✓		70.58	34.26	40.68	38.53	1.35	4.74	30.94	10.58	28.96
5%	PoBEV	✗	-	Native	64.45	27.36	30.15	31.66	0.69	0.98	29.75	6.06	23.89
	SkyEye	✓	✓	Native	72.16	37.20	34.89	42.97	4.77	9.16	40.74	9.88	31.47
	PoBEV	✗	-	DINOv2	67.61	30.73	30.97	32.80	0.42	0.47	25.48	5.58	24.26
	SkyEye	✓	✓		69.84	34.19	32.80	37.13	2.54	4.74	32.49	7.93	27.71
	LetsMap	✗	✓		73.74	39.56	42.07	41.49	2.46	6.32	34.68	14.88	31.90
10%	PoBEV	✗	-	Native	66.58	30.28	31.76	34.50	1.22	3.28	33.43	7.56	26.08
	SkyEye	✓	✓	Native	73.36	38.30	37.54	44.62	4.80	9.67	42.84	10.06	32.65
	PoBEV	✗	-	DINOv2	68.99	33.17	35.81	34.15	0.70	1.58	29.74	10.06	26.77
	SkyEye	✓	✓		72.19	36.18	35.26	39.84	3.78	5.61	36.95	10.44	30.03
	LetsMap	✗	✓		74.74	39.40	43.63	43.33	2.91	6.95	37.62	18.09	33.33
50%	PoBEV	✗	-	Native	69.88	33.81	33.40	40.48	2.47	4.63	38.81	9.84	29.16
	SkyEye	✓	✓	Native	73.10	39.23	38.08	45.72	4.05	10.44	44.72	12.10	33.43
	PoBEV	✗	-	DINOv2	73.04	37.38	37.86	41.31	1.82	3.83	37.13	14.85	30.90
	SkyEye	✓	✓		73.66	38.85	41.49	41.73	2.90	6.99	38.43	12.42	32.06
	LetsMap	✗	✓		74.29	38.48	43.87	42.77	2.80	5.22	37.68	15.20	32.54
100%	PoBEV	✗	-	Native	70.14	35.23	34.68	40.72	2.85	5.63	39.77	14.38	30.42
	SkyEye	✓	✓	Native	73.57	39.45	38.74	46.06	3.95	9.66	45.21	10.92	33.44
	PoBEV	✗	-	DINOv2	73.29	37.81	40.23	42.11	1.78	3.32	38.66	17.42	31.83
	SkyEye	✓	✓		73.51	39.13	40.04	42.08	3.17	5.90	39.29	12.72	31.98
	LetsMap	✗	✓		74.81	38.59	42.58	43.67	3.52	6.21	38.47	15.24	32.88

BEV	Model	FV	PT	Epochs	Road	Side.	Build.	Terr.	Pers.	2-Wh.	Car	Truck	mIoU
1%	PoBEV	✗	-	100	61.70	17.10	27.81	26.72	0.07	0.36	21.51	0.84	19.51
	SkyEye	✓	✓		72.56	34.33	36.70	41.66	0.00	0.16	33.85	10.29	28.71
	LetsMap	✗	✗		70.89	33.88	37.71	37.41	0.80	2.87	31.59	6.59	27.72
	LetsMap	✗	✓		72.94	37.79	43.70	38.29	0.87	2.57	30.62	10.86	29.70
10%	PoBEV	✗	-	50	70.00	32.75	38.07	34.43	0.80	3.33	34.46	9.25	27.89
	SkyEye	✓	✓		76.07	40.30	40.30	45.33	3.75	8.15	42.64	10.73	33.41
	LetsMap	✗	✗		76.69	40.41	42.55	42.17	1.33	6.57	40.46	18.06	33.53
	LetsMap	✗	✓		74.47	41.16	46.31	43.31	5.48	8.80	41.55	21.24	35.29
50%	PoBEV	✗	-	30	72.09	35.64	36.64	42.41	1.61	3.92	41.41	9.77	30.44
	SkyEye	✓	✓		76.43	39.89	45.22	46.64	5.10	7.93	42.43	12.30	34.49
	LetsMap	✗	✗		75.46	39.45	42.71	39.69	3.85	5.70	41.88	17.82	33.32
	LetsMap	✗	✓		76.54	42.65	49.23	41.47	3.36	8.61	38.76	19.42	35.01

[Figure: qualitative results, rows (a) to (h); columns: Input FV Image, $1\%$, $5\%$, $10\%$, $50\%$, $100\%$]

Masking Ratio	0%	25%	50%	75%	90%
mIoU	27.75	27.87	28.22	28.96	27.31
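
The random masking of non-overlapping patches ablated in the table above can be sketched as follows. This is a minimal illustration rather than the authors' masking module: the function name, the image size of 224 × 448 pixels, and the fixed seed are assumptions, while the patch size of 28 and the masking ratio of 75% are the best-performing settings in the ablation tables.

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, seed=None):
    """Return a grid of booleans, True where a patch is masked.

    The image is divided into non-overlapping patch_size x patch_size
    patches and a fixed fraction of them is masked uniformly at random.
    """
    rng = random.Random(seed)
    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(r * cols + c) in masked for c in range(cols)] for r in range(rows)]


# Hypothetical 224 x 448 image: 8 x 16 patches, 96 of 128 masked at 75%.
mask = random_patch_mask(224, 448, patch_size=28, mask_ratio=0.75, seed=0)
num_masked = sum(cell for row in mask for cell in row)
```

Sampling patch indices without replacement keeps the masked fraction exact, so each training image exposes the same number of visible patches to the backbone.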

Backbone	vit-s	vit-b	vit-l	vit-g
mIoU	25.55	28.96	28.40	28.16