
¹ University of Freiburg, Germany
² Qualcomm SARL France
³ QT Technologies Ireland Limited
⁴ Federal University of Rio Grande, Brazil
⁵ University of Technology Nuremberg, Germany
http://letsmap.cs.uni-freiburg.de

LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping

Nikhil Gosala¹, Kürsat Petek¹, B Ravi Kiran², Senthil Yogamani³, Paulo Drews-Jr⁴, Wolfram Burgard⁵, Abhinav Valada¹

Abstract

Semantic Bird’s Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision-making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data.

Keywords:
Unsupervised Representation Learning Semantic BEV Mapping Scene Understanding

1 Introduction

Semantic Bird’s Eye View (BEV) maps are essential for autonomous driving as they offer rich, occlusion-aware information for height-agnostic applications including object tracking, collision avoidance, and motion control. Instantaneous BEV map estimation that does not rely on large amounts of annotated data is crucial for the rapid deployment of autonomous vehicles in novel domains. However, the majority of existing BEV mapping approaches follow a fully supervised learning paradigm and thus rely on large amounts of annotated data in BEV, which is extremely arduous to obtain and hinders the scalability of autonomous vehicles to novel environments [cit:bev-seg-pan2020vpn, cit:bev-seg-lu2019ved, cit:bev-seg-pon, cit:bev-seg-lss]. Recent works circumvent this problem by leveraging frontal view (FV) semantic labels for learning both scene geometry and generating BEV pseudolabels [cit:bev-seg-skyeye], or by leveraging semi-supervised learning using pairs of labeled and unlabeled samples [cit:bev-seg-s2g2]. However, the reliance on FV labels as well as the integrated network design of both approaches gives rise to three main challenges: (1) FV labels offer scene geometry supervision only along class boundaries which limits the geometric reasoning ability of the model; (2) FV labels are dataset-specific and any change in class definition mandates full model retraining; and (3) tightly coupled network designs hinder the quick adoption of latest advances from literature.

Refer to caption
Figure 1: LetsMap: The first unsupervised framework for label-efficient semantic BEV mapping. We use RGB image sequences to independently learn scene geometry (yellow) and scene representation (blue) in an unsupervised pretraining step, before adapting it to semantic BEV mapping in a label-efficient finetuning step.

In this work, we address these limitations by proposing the first unsupervised representation learning framework for predicting semantic BEV maps from monocular FV images in a label-efficient manner. Our approach, LetsMap, utilizes the spatiotemporal consistency and dense representation offered by FV image sequences to alleviate the need for manually annotated data. To this end, we disentangle the two sub-tasks of semantic BEV mapping, i.e., scene geometry modeling and scene representation learning, into two disjoint neural pathways (Fig. 1) and learn them using an unsupervised pretraining step. We then finetune the resultant model for semantic BEV mapping using only a small fraction of labels in BEV. LetsMap explicitly learns to model the scene geometry via the geometric pathway by leveraging implicit fields, while learning scene representations via the semantic pathway using a novel temporal masked autoencoder (T-MAE) mechanism. During pretraining, we supervise the geometric pathway by exploiting the spatial and temporal consistency of the multi-camera FV images across multiple timesteps and train the semantic pathway by enforcing reconstruction of the FV images for both the current and future timesteps using the masked image of only the current timestep. We extensively evaluate LetsMap on the KITTI-360 [cit:dataset-kitti360] and nuScenes [cit:dataset-nuscenes] datasets and demonstrate that our approach performs on par with existing fully-supervised and self-supervised approaches while using only 1% of BEV labels, without leveraging any additional labeled data.

2 Related Work

In this section, we discuss existing work on semantic BEV mapping, scene geometry estimation from monocular cameras, and image-based scene representation learning.

BEV Segmentation: Monocular semantic BEV mapping methods typically focus on learning a lifting mechanism to transform features from FV to BEV. Early works of VED [cit:bev-seg-lu2019ved] and VPN [cit:bev-seg-pan2020vpn] learn the transformation without using scene geometry, which limits their performance in the real world. PON [cit:bev-seg-pon] solves this issue by incorporating scene geometry into the network design while LSS [cit:bev-seg-lss] learns a depth distribution to transform features from FV to BEV. PanopticBEV [cit:bev-seg-panopticbev] splits the world into flat and non-flat regions and transforms them to BEV using two disjoint pathways. Recent methods use transformers to generate BEV features from both single image [cit:bev-seg-tiim] and multi-view images [cit:bev-seg-bevformerv2]. Some works also use multi-modal data to augment monocular cameras [cit:bev-seg-hdmapnet, cit:bev-seg-bevfusion, cit:bev-seg-simplebev, schramm2024bevcar]. All the aforementioned approaches follow a fully supervised learning paradigm and rely on vast amounts of resource-intensive human-annotated semantic BEV labels. Recent works reduce reliance on BEV ground truth labels by combining labeled and unlabeled images in a semi-supervised manner [cit:bev-seg-s2g2] or by leveraging FV labels to generate BEV pseudolabels and train the network in a self-supervised manner [cit:bev-seg-skyeye]. However, these approaches rely on additional labeled data or use tightly coupled network designs which limits their ability to scale to new environments or incorporate the latest advances in literature. In this paper, we propose a novel unsupervised label-efficient approach that first learns scene geometry and scene representation in a modular, label-free manner before adapting to semantic BEV mapping using only a small fraction of BEV semantic labels.

Monocular Scene Geometry Estimation: Scene geometry estimation is a fundamental challenge in computer vision and is a core component of 3D scene reconstruction. Initial approaches use techniques such as multi-view stereo [furukawa2009accurate] and visual SLAM [cit:visual-slam, vodisch2022continual] while recent approaches leverage learnable functions in the form of ray distance functions [cit:ray-distance-functions] or implicit neural fields [cit:nerf-orig]. Early neural radiance field-based approaches were optimized on single scenes and relied on substantial amounts of training data [cit:nerf-orig]. PixelNeRF [cit:pixel-nerf] addresses these issues by conditioning NeRF on input images, enabling simultaneous optimization across different scenes. Recent works improve upon PixelNeRF by decoupling color from scene density estimation [cit:behind-the-scenes], and by using a tri-planar representation to query the neural field from any world point [cit:neo360]. In our approach, we leverage implicit fields to generate the volumetric density from a single monocular FV image and use it to constrain the uniformly-lifted 2D scene representation features.

Scene Representation Learning: Early works used pretext tasks such as image permutation [cit:ssl-image-permutation], rotation prediction [cit:ssl-image-rotation], noise discrimination [hindel2023inod], and frame ordering [lang2024self] to learn scene representation, but these tasks were primitive and the learned features lacked generalization across diverse tasks. [cit:ssl-moco, cit:ssl-simclr] propose using contrastive learning to learn scene representation, and [cit:ssl-swav] builds upon this paradigm by removing the need for negative samples during training. Recent works propose masked autoencoders [cit:ssl-mae] wherein masked input image patches are predicted by the network using the learned high-level understanding of the scene. More recently, foundation models such as DINO [cit:dino-v1] and DINOv2 [cit:dino-v2] employ self-distillation on large amounts of curated data to learn rich representations of the scene. However, all these approaches work on single timestep images and fail to leverage scene consistency over multiple timesteps. In this work, we explicitly enforce scene consistency over multiple timesteps by proposing a novel temporal masked autoencoding strategy to learn rich scene representations.

3 Technical Approach

Refer to caption
Figure 2: Overview of LetsMap, our novel unsupervised representation learning framework for label-efficient semantic BEV mapping. The crux of our approach is to leverage FV image sequences to independently model scene geometry and learn scene representation using two disjoint pathways following an unsupervised training paradigm. The resulting model is then finetuned on a small fraction of BEV labels to the task of semantic BEV mapping.

In this section, we present LetsMap, the first unsupervised learning framework for predicting semantic BEV maps from monocular FV images using a label-efficient training paradigm. An overview of our framework is illustrated in Fig. 2. The key idea of our approach is to leverage sequences of multi-camera FV images to learn the two core sub-tasks of semantic BEV mapping, i.e., scene geometry modeling and scene representation learning, using two disjoint neural pathways following a label-free paradigm, before adapting it to the downstream task in a label-efficient manner. We achieve this desired behavior by splitting the training protocol into sequential FV pretraining and BEV finetuning stages. The FV pretraining stage learns to explicitly model the scene geometry by enforcing scene consistency over multiple views using the photometric loss ($\mathcal{L}_{\text{photom}}$, Sec. 3.2) while learning the scene representation by reconstructing a masked input image over multiple timesteps using the reconstruction loss ($\mathcal{L}_{\text{rgb}}$, Sec. 3.3). Once pretraining is complete, the finetuning phase adapts the network to the task of semantic BEV mapping using the cross-entropy loss on the tiny fraction of available BEV labels ($\mathcal{L}_{\text{bev}}$, Sec. 3.4). The total loss of the network is thus computed as:

$$\mathcal{L} = \begin{cases} \mathcal{L}_{\text{photom}} + \mathcal{L}_{\text{rgb}} & \text{when pretraining} \\ \mathcal{L}_{\text{bev}} & \text{when finetuning} \end{cases}. \tag{1}$$

3.1 Network Architecture

Our proposed LetsMap architecture, as shown in Fig. 2, consists of a pretrained DINOv2 [cit:dino-v2] (ViT-b) backbone to generate multi-scale features from an input image; a geometry pathway comprising a convolution-based adapter followed by an implicit neural field to predict the scene geometry; a semantic pathway encompassing a sparse convolution-based adapter to capture representation-specific features; an RGB reconstruction head to facilitate reconstruction of the masked input image patches over multiple timesteps; and a BEV semantic head to generate a semantic BEV map from the input monocular FV image during the finetuning phase.

During pretraining, an input image $\mathcal{I}_0$ is processed by the backbone to generate feature maps of three scales. The geometry pathway, $\mathcal{G}$, processes these multi-scale features using a BiFPN [tan2020efficientdet] layer followed by an implicit field module to generate the volumetric density of the scene at the current timestep. In a parallel branch, a masking module first randomly masks non-overlapping patches in $\mathcal{I}_0$ and the backbone then processes the visible patches to generate the corresponding image features. The semantic pathway $\mathcal{S}$ then generates the representation-specific features using a five-layer adapter that ensures propagation of masked regions using the convolution masking strategy outlined in [cit:spark-masked-convolution]. We then uniformly lift the resultant 2D features to 3D using the camera projection equation and multiply them with the volumetric density computed from $\mathcal{G}$ to generate scene-consistent voxel features. We warp the voxel grid to multiple timesteps using the ego-motion and collapse it into 2D by applying the camera projection equation along the depth dimension. The RGB reconstruction head then predicts the pixel values for each of the masked patches to reconstruct the image at different timesteps. During finetuning, we disable image masking and orthographically collapse the voxel features along the height dimension to generate the BEV features. A BEV semantic head processes these features to generate semantic BEV predictions.
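The lifting and collapsing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the LetsMap implementation: features are scalars rather than multi-channel vectors, sampling is nearest-neighbour, the height axis is collapsed by summation, and the names `project` and `lift_to_bev` as well as the grid layout are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def lift_to_bev(feat2d, density, xs, ys, zs, fx, fy, cx, cy):
    """Lift a 2D feature map into a voxel grid, weight it by the density, and
    collapse the height axis (y) into a BEV grid indexed as bev[iz][ix].

    feat2d[row][col] is a scalar feature (assumption: one channel only).
    density[ix][iy][iz] is the volumetric density of each voxel.
    xs, ys, zs are the voxel-centre coordinates in the camera frame.
    """
    height, width = len(feat2d), len(feat2d[0])
    bev = [[0.0 for _ in xs] for _ in zs]
    for iz, z in enumerate(zs):
        if z <= 0:
            continue  # voxels behind the camera are never observed
        for ix, x in enumerate(xs):
            for iy, y in enumerate(ys):
                u, v = project((x, y, z), fx, fy, cx, cy)
                col, row = int(round(u)), int(round(v))
                if 0 <= row < height and 0 <= col < width:
                    # Uniform lifting: every voxel along a ray receives the same
                    # 2D feature; the density decides how much of it is kept.
                    bev[iz][ix] += density[ix][iy][iz] * feat2d[row][col]
    return bev
```

Summation is only one possible collapse operator; the text does not state which reduction is used, so a mean or a learned projection would be equally plausible choices.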

3.2 Geometric Pathway

Figure 3: (a) An illustration of our neural implicit field module. It leverages spatio-temporal consistency offered by multi-camera images to model the scene geometry. (b) FV predictions from our unsupervised pretraining step. An FV image (top left) is processed by the geometry pathway to generate a volumetric density of the field which generates a depth map (top right) upon ray casting. In parallel, a masked FV image (bottom left) is processed by the semantic pathway to reconstruct the masked image (bottom right).

The goal of the geometric pathway $\mathcal{G}$ is to explicitly model scene geometry in a label-free manner using only the spatio-temporal images obtained from cameras onboard an autonomous vehicle. Explicit scene geometry modeling allows the network to reason about occlusions and disocclusions in the scene, thus improving the quality of predictions in the downstream task. To this end, we design the task of scene geometry learning using an implicit field formulation wherein the main goal is to estimate the volumetric density of the scene in the camera coordinate system given a monocular FV image, as shown in Fig. 3(a). We multiply the estimated volumetric density with the uniformly-lifted semantic features to generate the geometrically consistent semantic features (see Sec. 3.3).

We generate the volumetric density for the scene by following the idea of image-conditioned NeRF outlined in [cit:pixel-nerf]. Firstly, we retrieve the image features $f$ for randomly sampled points, $\mathbf{x} = (x, y, z)$, along every camera ray by projecting them onto the 2D image plane and computing the value for each projection location using bilinear interpolation. We then pass the image features along with their positional encodings into a two-layer MLP, $\phi$, to estimate the volumetric density, $\sigma_{\mathbf{x}}$, at each of the sampled locations. Mathematically, the volumetric density at location $\mathbf{x}$ is computed as:

$$\sigma_{\mathbf{x}} = \phi(f_{\mathbf{u}_{\mathbf{x}}}, \gamma(\mathbf{u}_{\mathbf{x}}, d_{\mathbf{x}})), \tag{2}$$

where $\gamma(\cdot,\cdot)$ represents the sinusoidal positional encoding computed using the 2D projection $\mathbf{u}_{\mathbf{x}}$ of $\mathbf{x}$ on the image plane and the distance $d_{\mathbf{x}}$ of $\mathbf{x}$ from the camera origin.
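The density query of Eq. (2) can be illustrated with a small, dependency-free sketch. Several details below are assumptions made for illustration only: the number of encoding frequencies, the ReLU hidden activation, the softplus output that keeps the density non-negative, and the names `positional_encoding`, `linear`, and `density_mlp`. The bilinear feature lookup is assumed to have happened already, so `feature` stands for the sampled feature vector.

```python
import math


def positional_encoding(values, num_freqs=4):
    """Sinusoidal encoding: each value maps to sin/cos pairs at growing frequencies."""
    encoded = []
    for value in values:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * value
            encoded.append(math.sin(angle))
            encoded.append(math.cos(angle))
    return encoded


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def density_mlp(feature, u, d, params):
    """Two-layer MLP mapping (image feature, encoded pixel and distance) to a density."""
    x = list(feature) + positional_encoding([u[0], u[1], d])
    hidden = [max(0.0, h) for h in linear(x, params["w1"], params["b1"])]  # ReLU
    raw = linear(hidden, params["w2"], params["b2"])[0]
    # Numerically stable softplus; the output activation is an assumption.
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))
```

With a 2-dimensional feature and four frequencies, the MLP input has 2 + 3 · 2 · 4 = 26 entries, which fixes the shape of `params["w1"]`.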

During training, we optimize $\phi$ by first computing the depth map from $\sigma$ and then computing the photometric loss between the multi-view FV images at both the current as well as future timesteps. Specifically, for a camera ray through pixel location $\mathbf{u}$, we estimate the corresponding depth $\hat{d}_{\mathbf{u}}$ by computing the integral of intermediate depths over the probability of ray termination at a given distance. Accordingly, we sample $k$ points, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, on each camera ray and compute $\sigma$ at each of these locations. We then compute the probability of ray termination $\alpha_i$ between every pair of consecutive points ($\mathbf{x}_i$, $\mathbf{x}_{i+1}$) to determine the distance at which the ray is terminated, i.e., the depth $\hat{d}_{\mathbf{u}}$. Mathematically,

$$\alpha_i = 1 - \exp(-\sigma_{\mathbf{x}_i}\delta_i), \tag{3}$$
$$\hat{d}_{\mathbf{u}} = \sum_{i=1}^{k}\Big(\prod_{j=1}^{i-1}(1-\alpha_j)\Big)\alpha_i d_i, \tag{4}$$

where $d_i$ is the distance of $\mathbf{x}_i$ from the camera center, and $\delta_i = d_{i+1} - d_i$. A depth map output from $\mathcal{G}$ is illustrated in Fig. 3(b). We use the computed depth map to supervise the geometric pathway $\mathcal{G}$ using the photometric loss between RGB images generated using inverse and forward warping. Inverse warping is described as:

$$I'_{\text{tgt},\text{inv}}(p_{\text{src}}) = I_{\text{tgt}}\langle K T_{\text{src}\rightarrow\text{tgt}}\, d(p_{\text{src}})\, K^{-1} p_{\text{src}} \rangle, \tag{5}$$

where $K$ is the intrinsic camera matrix, $\langle\cdot\rangle$ denotes the bilinear sampling operator, and $p_{\text{src}}$ is a pixel coordinate in the source image. Similarly, forward warping is described as:

$$I'_{\text{tgt},\text{fwd}}(K T_{\text{src}\rightarrow\text{tgt}}\, d(p_{\text{src}})\, K^{-1} p_{\text{src}}) = I_{\text{src}}(p_{\text{src}}). \tag{6}$$

We prevent occlusions and disocclusions across timesteps from corrupting the overall photometric loss by retaining only the pixelwise minimum for each of the forward and inverse photometric losses. The photometric loss is then computed as:

$$\mathcal{L}_{\text{photom}} = \lVert I'_{\text{tgt},\text{fwd}} - I_{\text{tgt}} \rVert_1 + \lVert I'_{\text{tgt},\text{inv}} - I_{\text{src}} \rVert_1 \tag{7}$$
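The two geometric building blocks of this section, depth rendering from density and the pixel mapping used inside both warping operators, can be sketched as follows. The sketch uses the standard volume-rendering termination probability $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ and treats the last sample as having no interval (an assumption, since $\delta_i$ needs a following sample). The function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the 3×4 pose layout are hypothetical choices for illustration, not the authors' code.

```python
import math


def render_depth(sigmas, dists):
    """Expected termination depth of one ray.

    sigmas[i] is the density at sample i; dists[i] is its distance from the
    camera, in increasing order.
    """
    depth, transmittance = 0.0, 1.0
    for i in range(len(dists) - 1):
        delta = dists[i + 1] - dists[i]
        alpha = 1.0 - math.exp(-sigmas[i] * delta)  # probability of stopping here
        depth += transmittance * alpha * dists[i]
        transmittance *= 1.0 - alpha                # probability of continuing
    return depth


def reproject(p_src, depth, intrinsics, pose):
    """Map a source pixel to target-image coordinates: K T d K^{-1} p.

    intrinsics = (fx, fy, cx, cy); pose is a 3x4 matrix [R | t] from the source
    camera frame to the target camera frame.
    """
    fx, fy, cx, cy = intrinsics
    u, v = p_src
    # Back-project the pixel to a 3D point at the given depth.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform to move the point into the target frame.
    xt, yt, zt = [r[0] * point[0] + r[1] * point[1] + r[2] * point[2] + r[3] for r in pose]
    # Project back onto the target image plane.
    return fx * xt / zt + cx, fy * yt / zt + cy
```

Inverse warping samples the target image at the returned coordinates, whereas forward warping writes the source pixel value to them; the photometric loss then compares the warped and observed images.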

3.3 Semantic Pathway

The semantic pathway $\mathcal{S}$ aims to facilitate the learning of holistic feature representations for various scene elements in a label-free manner. This rich pretrained representation enables efficient adaptation to semantic classes during finetuning. To this end, we learn the representations of scene elements by masking out random patches in the input image and then forcing the network to generate pixel-wise predictions for each of the masked patches (Fig. 3(b)). Moreover, we also exploit the temporal consistency of static elements in the scene by reconstructing the RGB images at future timesteps $t_1, t_2, \ldots, t_n$ using the masked RGB input at timestep $t_0$. This novel formulation of temporal masked autoencoding (T-MAE) allows our network to learn spatially and semantically consistent features which improve its occlusion reasoning ability and accordingly its performance on semantic BEV mapping.
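The random patch masking that drives this objective can be sketched as below. The defaults mirror the patch size of 28 and masking ratio of 0.75 reported in the training protocol; the function name, the seeded generator, and the boolean-grid output format are assumptions made for illustration.

```python
import random


def random_patch_mask(height, width, patch=28, ratio=0.75, seed=0):
    """Return a grid of booleans (True = masked) over non-overlapping patches."""
    rows, cols = height // patch, width // patch
    num_patches = rows * cols
    num_masked = int(round(ratio * num_patches))
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(r * cols + c) in masked for c in range(cols)] for r in range(rows)]
```

For a $448 \times 1344$ image this yields a $16 \times 48$ grid in which 576 of the 768 patches are hidden from the network.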

Our semantic pathway $\mathcal{S}$, shown in Fig. 2, masks the input image $\mathcal{I}_0$ using a binary mask $M_0$ with a masking ratio $m$, and generates the corresponding masked semantic 3D voxel grid $V^{\mathcal{S}}_0$. We then multiply $V^{\mathcal{S}}_0$ with the volumetric density $\sigma$ obtained from the geometric pathway $\mathcal{G}$ to generate the intermediate masked voxel grid $V_0$. During pretraining, we densify $V_0$ by filling the masked regions using a common mask token [M], and generate pseudo voxel grids $V_{0\rightarrow i}$ by warping $V_0$ using the known camera poses between the current and the $i^{\text{th}}$ timesteps. Mathematically,

$$V_{0\rightarrow i} = T_{0\rightarrow i} V_0, \tag{8}$$

where $T_{0\rightarrow i}$ is the transformation between camera poses at timesteps $t_0$ and $t_i$. We then independently use the voxel grids $V_0, V_{0\rightarrow 1}, V_{0\rightarrow 2}, \ldots, V_{0\rightarrow i}$ as inputs to an RGB reconstruction head to reconstruct the RGB images $\hat{\mathcal{I}}_0, \hat{\mathcal{I}}_{0\rightarrow 1}, \hat{\mathcal{I}}_{0\rightarrow 2}, \ldots, \hat{\mathcal{I}}_{0\rightarrow i}$. We compute the L2 loss on the normalized pixel values of every patch between $\mathcal{I}_k$ and $\hat{\mathcal{I}}_k$ to generate the supervision for the semantic pathway $\mathcal{S}$. We thus compute the reconstruction loss as:

$$\mathcal{L}_{\text{rgb}} = \sum_{i=0}^{n} \lVert \mathcal{I}^{p}_{i} - \hat{\mathcal{I}}^{p}_{0\rightarrow i} \rVert_2, \tag{9}$$

where $\mathcal{I}^{p}$ denotes the per-patch normalized image.
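A small sketch of the reconstruction loss in Eq. (9) is given below. It assumes that each patch is normalized by its own mean and standard deviation, that the network output already lives in this normalized space (the convention used by masked autoencoders), and that patches are flat lists of pixel values; the helper names and the `eps` stabilizer are hypothetical.

```python
import math


def normalize_patch(patch, eps=1e-6):
    """Normalize one patch (flat list of pixel values) to zero mean, unit variance."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / math.sqrt(var + eps) for p in patch]


def reconstruction_loss(target_patches, predicted_patches):
    """Sum over timesteps of the L2 norm between normalized targets and predictions.

    target_patches[i][j] holds the raw pixels of patch j in the image at timestep i;
    predicted_patches[i][j] holds the network output for the same patch.
    """
    total = 0.0
    for targets, preds in zip(target_patches, predicted_patches):
        squared = 0.0
        for target, pred in zip(targets, preds):
            for t, p in zip(normalize_patch(target), pred):
                squared += (t - p) ** 2
        total += math.sqrt(squared)
    return total
```

Restricting the loss to masked patches, if desired, only requires passing those patches to the function.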

3.4 BEV Finetuning

We set up the network for finetuning by disabling image masking and discarding the RGB reconstruction head. We finetune the network on semantic BEV mapping by training the model on a fraction of BEV ground truth semantic labels using the cross entropy loss function. Mathematically,

$$\mathcal{L}_{\text{bev}} = \text{CE}(B, \hat{B}), \tag{10}$$

where $B$ and $\hat{B}$ are the semantic BEV ground truth and semantic BEV prediction masks, respectively.
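For concreteness, the cross-entropy of Eq. (10) over a BEV grid can be written as below. This is a generic sketch: it averages over all cells, applies no class weighting, and ignores no labels, none of which is specified in the text; the function name and the nested-list layout are hypothetical.

```python
import math


def bev_cross_entropy(logits, labels):
    """Mean cell-wise cross-entropy.

    logits[r][c] is a list of per-class scores for BEV cell (r, c);
    labels[r][c] is the ground-truth class index of that cell.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            peak = max(scores)  # subtract the maximum for numerical stability
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_norm - scores[label]
            count += 1
    return total / count
```

An untrained head that outputs equal scores for all eight classes evaluated in Table 1 would score $\ln 8 \approx 2.08$ under this definition.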

4 Experimental Results

In this section, we present quantitative and qualitative results of our novel unsupervised label-efficient semantic BEV mapping framework, LetsMap, and provide extensive ablative experiments to demonstrate the benefit of our proposed contributions.

4.1 Datasets

We evaluate LetsMap on two large-scale autonomous driving datasets, i.e., KITTI-360 [cit:dataset-kitti360] and nuScenes [cit:dataset-nuscenes]. Since neither dataset provides semantic BEV labels, we adopt the label generation pipeline outlined in PoBEV [cit:bev-seg-panopticbev] with minor modifications to discard the occlusion mask to generate the semantic BEV ground truth labels. We sample one forward-facing perspective image from either fisheye camera for multi-camera supervision in KITTI-360 but use only a single camera when training on nuScenes due to the lack of sufficient field-of-view overlap between the spatial cameras. For the KITTI-360 dataset, we hold out sequence 10 for validation and use the remaining 8 sequences for training. For the nuScenes dataset, we follow the train-val split from [cit:bev-seg-pon] and obtain 702 train and 142 validation sequences.

4.2 Training Protocol

We train LetsMap on images of size $448 \times 1344$ and $448 \times 896$ for KITTI-360 and nuScenes, respectively. We select these image sizes to ensure compatibility with both the image encoder as well as the lower scales of the BiFPN adapter module since they are divisible by both 14 and 32. The pretraining phase follows a label-free paradigm and trains the network using only spatio-temporal FV images with a window size of 4, masking ratio of 0.75, and masking patch size of 28 for 20 epochs with an initial learning rate (LR) of 0.005 which is decayed by a factor of 0.5 at epoch 15 and 0.2 at epoch 18. We finetune the network on the task of semantic BEV mapping for 100 epochs using only 1% of BEV labels for the KITTI-360 dataset and one sample from every scene for the nuScenes dataset ($\approx \frac{1}{40}\%$). We use an LR of 0.005 during finetuning and decay it by a factor of 0.5 at epoch 75 and 0.2 at epoch 90. We optimize LetsMap using the SGD optimizer with a batch size of 12, momentum of 0.9, and weight decay of 0.0001.
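The step-wise learning-rate schedule and the image-size constraint above can be checked with a short sketch. Two points are assumptions rather than statements from the text: the decay factors are applied cumulatively (so the rate after both steps is 0.005 · 0.5 · 0.2), and epochs are counted from zero. The function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.005, milestones=((15, 0.5), (18, 0.2))):
    """Step schedule: multiply the rate by each factor once its epoch is reached."""
    lr = base_lr
    for start_epoch, factor in milestones:
        if epoch >= start_epoch:
            lr *= factor  # cumulative decay (assumption)
    return lr


def is_compatible(height, width, divisors=(14, 32)):
    """Check that both image sides are divisible by every required stride."""
    return all(side % d == 0 for side in (height, width) for d in divisors)
```

The finetuning schedule follows from the same function by passing `milestones=((75, 0.5), (90, 0.2))`.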

4.3 Quantitative Results

We evaluate the performance of LetsMap on the KITTI-360 dataset by comparing it with the self-supervised approach SkyEye [cit:bev-seg-skyeye] as well as the fully-supervised baselines outlined in SkyEye. However, since SkyEye cannot be trained on nuScenes due to the lack of FV labels, we compare our approach with only the fully-supervised baselines on the nuScenes dataset. For all experiments, we use the code provided by the authors and ensure fair comparison by using the training protocols described in their original manuscripts. We use the standard mIoU metric for quantifying the performance [hurtado2022semantic]. Tab. 1 and Tab. 2 present the results of this evaluation for KITTI-360 and nuScenes, respectively. For these experiments, we report metrics obtained when fully-supervised approaches are trained using 100% of BEV labels, SkyEye is pretrained using 100% of FV labels and finetuned on a tiny fraction of BEV labels, while LetsMap is trained on only a tiny fraction of BEV labels, i.e., 1% on KITTI-360 and one sample per scene (≈ 1/40 %) on nuScenes.
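For reference, the mIoU metric can be sketched in plain Python as follows. This is a generic formulation rather than the evaluation code used in the paper. The ignore label `255`, the flattened list inputs, and the choice to average only over classes that occur in the prediction or the ground truth are assumptions made for illustration.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (ground truth, prediction) pairs; rows index the ground truth."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t != ignore_index:  # ASSUMPTION: 255 marks unlabeled cells
            conf[t][p] += 1
    return conf


def per_class_iou(conf):
    """IoU_c = TP / (TP + FP + FN); None for classes that never occur."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(conf):
    """Mean IoU in percent over the classes that occur."""
    valid = [iou for iou in per_class_iou(conf) if iou is not None]
    return 100.0 * sum(valid) / len(valid) if valid else float("nan")


# Toy example with three classes on six flattened BEV cells.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(confusion_matrix(pred, target, num_classes=3)))
```

In practice, the confusion matrix would be accumulated over all validation samples before the per-class IoU values are computed.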

Table 1: Evaluation of semantic BEV mapping on the KITTI-360 dataset. All metrics are reported in [%].

| Method | FV | BEV | Road | Side. | Build. | Terrain | Person | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IPM [cit:bev-seg-ipm-original] | 100% | - | 53.03 | 24.90 | 15.19 | 32.31 | 0.20 | 0.36 | 11.59 | 1.90 | 17.44 |
| VED [cit:bev-seg-lu2019ved] | - | 100% | 65.97 | 35.41 | 37.28 | 34.34 | 0.13 | 0.07 | 23.83 | 8.89 | 25.74 |
| VPN [cit:bev-seg-pan2020vpn] | - | 100% | 69.90 | 34.31 | 33.65 | 40.17 | 0.56 | 2.26 | 27.76 | 6.10 | 26.84 |
| PON [cit:bev-seg-pon] | - | 100% | 67.98 | 31.13 | 29.81 | 34.28 | 2.28 | 2.16 | 37.99 | 8.10 | 26.72 |
| PoBEV [cit:bev-seg-panopticbev] | - | 100% | 70.14 | 35.23 | 34.68 | 40.72 | 2.85 | 5.63 | 39.77 | 14.38 | 30.42 |
| PoBEV [cit:bev-seg-panopticbev] | - | 1% | 60.41 | 20.97 | 24.65 | 23.38 | 0.15 | 0.23 | 21.71 | 1.23 | 19.09 |
| SkyEye [cit:bev-seg-skyeye] | 100% | 1% | 69.26 | 33.48 | 32.79 | 39.46 | 0.00 | 0.34 | 32.36 | 7.93 | 26.94 |
| LetsMap (Ours) | 0% | 1% | 70.58 | 34.26 | 40.68 | 38.53 | 1.35 | 4.74 | 30.94 | 10.58 | 28.96 |

Table 2: Evaluation of semantic BEV mapping on the nuScenes dataset. All metrics are reported in [%].

| Method | FV | BEV | Road | Side. | Manm. | Terrain | Person | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IPM [cit:bev-seg-ipm-original] | 100% | - | 43.51 | 9.05 | 26.21 | 16.60 | 0.14 | 0.72 | 4.65 | 3.67 | 13.07 |
| VED [cit:bev-seg-lu2019ved] | - | 100% | 67.97 | 25.23 | 49.69 | 31.51 | 0.80 | 1.28 | 21.85 | 17.51 | 26.98 |
| VPN [cit:bev-seg-pan2020vpn] | - | 100% | 66.47 | 23.94 | 47.65 | 33.19 | 2.02 | 4.13 | 22.66 | 18.33 | 27.30 |
| PON [cit:bev-seg-pon] | - | 100% | 67.50 | 24.49 | 47.02 | 30.86 | 2.49 | 6.85 | 26.68 | 18.85 | 28.09 |
| PoBEV [cit:bev-seg-panopticbev] | - | 100% | 70.15 | 27.87 | 50.04 | 35.32 | 3.89 | 7.06 | 31.60 | 21.27 | 30.90 |
| PoBEV [cit:bev-seg-panopticbev] | - | ≈ 1/40 % | 64.55 | 19.85 | 45.21 | 28.45 | 1.20 | 1.06 | 20.45 | 11.48 | 24.03 |
| LetsMap (Ours) | 0% | ≈ 1/40 % | 67.72 | 27.06 | 47.10 | 34.78 | 3.31 | 5.79 | 21.92 | 13.57 | 27.66 |

We observe from Tab. 1 that our approach, LetsMap, outperforms four of the five fully-supervised baselines by more than 2 pp while using only 1% of BEV labels. Notably, LetsMap also exceeds SkyEye by 2.02 pp without using any additional labeled data. We also note that our approach significantly outperforms SkyEye on the static classes of road and building, as well as the dynamic classes of person, 2-wheeler, and truck. This improvement stems from explicit modeling of both scene geometry and representation which ensures well-constrained extents of dynamic objects as well as efficient mapping of scene elements to BEV classes using only 1% of BEV labels. Although better than SkyEye, we observe that LetsMap underperforms PoBEV for most dynamic classes, reporting 8.83 pp and 3.80 pp lower on car and truck, respectively. This is likely due to either insufficient views for training the implicit field or the presence of moving objects which results in its sub-optimal performance. Increasing the number of timesteps and sampling more perspective images from the fisheye cameras could address this limitation.
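The margins quoted above follow directly from Tab. 1. The short Python check below recomputes them from values copied out of the table; the dictionary layout and helper name are illustrative only.

```python
# mIoU values copied from Table 1 (KITTI-360), in [%].
FULLY_SUPERVISED_MIOU = {
    "IPM": 17.44,
    "VED": 25.74,
    "VPN": 26.84,
    "PON": 26.72,
    "PoBEV (100% BEV)": 30.42,
}
SKYEYE_MIOU = 26.94
LETSMAP_MIOU = 28.96

# Per-class IoU for car and truck: (PoBEV with 100% BEV labels, LetsMap).
CLASS_IOU = {"car": (39.77, 30.94), "truck": (14.38, 10.58)}


def gap_pp(a, b):
    """Difference a - b in percentage points, rounded to two decimals."""
    return round(a - b, 2)


margins = {name: gap_pp(LETSMAP_MIOU, miou) for name, miou in FULLY_SUPERVISED_MIOU.items()}
beaten_by_2pp = [name for name, margin in margins.items() if margin > 2.0]

print("margins over fully-supervised baselines:", margins)
print("baselines exceeded by more than 2 pp:", beaten_by_2pp)
print("margin over SkyEye:", gap_pp(LETSMAP_MIOU, SKYEYE_MIOU))
for cls, (pobev, letsmap) in CLASS_IOU.items():
    print(cls, "deficit to PoBEV:", gap_pp(pobev, letsmap))
```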

On the nuScenes dataset, we note that LetsMap is comparable to most of the fully-supervised baselines but is consistently outperformed by the state-of-the-art approach PoBEV. nuScenes, being extremely dynamic and diverse, presents a significant challenge to our implicit field formulation which enforces a static scene constraint. This is especially evident in the car and truck classes which report 9.68 pp and 7.70 pp lower than PoBEV. Nonetheless, LetsMap is able to efficiently learn the scene representations of static classes, resulting in a comparable performance with all baselines while using only 1% of annotated data.

4.4 Ablation Study

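Table 3: Ablation study on the impact of our unsupervised pretraining on the overall network performance. The column “FV” shows whether the models leverage FV pretraining, and the column “PT” denotes whether the models have been pretrained. All experiments are on the KITTI-360 dataset.

| BEV | Model | FV | PT | Epochs | Road | Side. | Build. | Terr. | Pers. | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% | PoBEV | - | ✗ | 100 | 60.41 | 20.97 | 24.65 | 23.38 | 0.15 | 0.23 | 21.71 | 1.23 | 19.09 |
| 1% | SkyEye | ✓ | ✓ | 100 | 69.26 | 33.48 | 32.79 | 39.46 | 0.00 | 0.34 | 32.36 | 7.93 | 26.94 |
| 1% | LetsMap | ✗ | ✗ | 100 | 69.40 | 32.09 | 34.75 | 35.27 | 1.01 | 2.79 | 28.76 | 7.66 | 26.47 |
| 1% | LetsMap | ✗ | ✓ | 100 | 70.58 | 34.26 | 40.68 | 38.53 | 1.35 | 4.74 | 30.94 | 10.58 | 28.96 |
| 5% | PoBEV | - | ✗ | 80 | 64.45 | 27.36 | 30.15 | 31.66 | 0.69 | 0.98 | 29.75 | 6.06 | 23.89 |
| 5% | SkyEye | ✓ | ✓ | 80 | 72.16 | 37.20 | 34.89 | 42.97 | 4.77 | 9.16 | 40.74 | 9.88 | 31.47 |
| 5% | LetsMap | ✗ | ✗ | 80 | 72.80 | 37.89 | 38.59 | 40.06 | 2.34 | 5.62 | 34.86 | 16.26 | 31.05 |
| 5% | LetsMap | ✗ | ✓ | 80 | 73.74 | 39.56 | 42.07 | 41.49 | 2.46 | 6.32 | 34.68 | 14.88 | 31.90 |
| 10% | PoBEV | - | ✗ | 50 | 66.58 | 30.28 | 31.76 | 34.50 | 1.22 | 3.28 | 33.43 | 7.56 | 26.08 |
| 10% | SkyEye | ✓ | ✓ | 50 | 73.36 | 38.30 | 37.54 | 44.62 | 4.80 | 9.67 | 42.84 | 10.06 | 32.65 |
| 10% | LetsMap | ✗ | ✗ | 50 | 74.31 | 38.45 | 40.04 | 41.26 | 3.19 | 6.02 | 35.56 | 16.53 | 31.92 |
| 10% | LetsMap | ✗ | ✓ | 50 | 74.74 | 39.40 | 43.63 | 43.33 | 2.91 | 6.95 | 37.62 | 18.09 | 33.33 |
| 50% | PoBEV | - | ✗ | 30 | 69.88 | 33.81 | 33.40 | 40.48 | 2.47 | 4.63 | 38.81 | 9.84 | 29.16 |
| 50% | SkyEye | ✓ | ✓ | 30 | 73.10 | 39.23 | 38.08 | 45.72 | 4.05 | 10.44 | 44.72 | 12.10 | 33.43 |
| 50% | LetsMap | ✗ | ✗ | 30 | 73.89 | 38.42 | 42.25 | 41.46 | 2.26 | 6.26 | 37.20 | 15.08 | 32.10 |
| 50% | LetsMap | ✗ | ✓ | 30 | 74.29 | 38.48 | 43.87 | 42.77 | 2.80 | 5.22 | 37.68 | 15.20 | 32.54 |
| 100% | PoBEV | - | ✗ | 20 | 70.14 | 35.23 | 34.68 | 40.72 | 2.85 | 5.63 | 39.77 | 14.38 | 30.42 |
| 100% | SkyEye | ✓ | ✓ | 20 | 73.57 | 39.45 | 38.74 | 46.06 | 3.95 | 9.66 | 45.21 | 10.92 | 33.44 |
| 100% | LetsMap | ✗ | ✗ | 20 | 74.22 | 39.39 | 42.86 | 42.96 | 2.55 | 6.66 | 35.68 | 17.11 | 32.68 |
| 100% | LetsMap | ✗ | ✓ | 20 | 74.81 | 38.59 | 42.58 | 43.67 | 3.52 | 6.21 | 38.47 | 15.24 | 32.88 |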

In this section, we investigate the influence of various components of our approach by performing an ablation study on the KITTI-360 dataset. Specifically, we evaluate the impact of model pretraining when presented with varying amounts of labeled BEV data, the benefit of each of our neural pathways, and the effect of varying masking ratios on the overall performance of the network.

Impact of Model Pretraining: In this section, we study the impact of model pretraining by finetuning our model with and without pretraining with varying percentages of labeled BEV data. Accordingly, we establish five percentage splits of BEV labels, i.e., 1%, 5%, 10%, 50%, and 100%, and sample three random sets for each percentage split. We train each percentage split three times, once using each random set, and report the mean value to mitigate the risk of random chance affecting the final results. Moreover, we also train the best two baselines, i.e., PoBEV and SkyEye, across all percentage splits as a reference for evaluating our approach. Tab. 3 presents the results of this ablation study.
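The split protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the sampling code used for the paper: labeled samples are assumed to be drawn uniformly at random, and the seeds and all function names are hypothetical.

```python
import random
from statistics import mean

FRACTIONS = (0.01, 0.05, 0.10, 0.50, 1.00)  # percentage splits from the text
SEEDS = (0, 1, 2)                           # hypothetical seeds, one per random set


def sample_label_subset(sample_ids, fraction, seed):
    """Reproducibly pick round(fraction * N) labeled samples (at least one).

    ASSUMPTION: samples are drawn uniformly at random and independently
    for every seed.
    """
    ids = list(sample_ids)
    rng = random.Random(seed)
    k = max(1, round(fraction * len(ids)))
    return sorted(rng.sample(ids, k))


def build_splits(sample_ids, fractions=FRACTIONS, seeds=SEEDS):
    """Map each fraction to one label subset per seed."""
    return {f: [sample_label_subset(sample_ids, f, s) for s in seeds] for f in fractions}


def mean_over_runs(scores):
    """Mean of the metric over the runs, as reported per percentage split."""
    return mean(scores)


splits = build_splits(range(1000))
for fraction, subsets in splits.items():
    print(fraction, [len(subset) for subset in subsets])
```

Each subset would be used for one finetuning run, and `mean_over_runs` would then be applied to the three resulting mIoU values of a percentage split.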

We observe that our model trained using our unsupervised pretraining strategy, LetsMap, consistently outperforms our model without pretraining across all percentage splits. The most substantial improvements of 2.49 pp and 1.41 pp occur when finetuning with only 1% and 10% of BEV labels, respectively. We also note that LetsMap outperforms PoBEV by 9.87 pp and SkyEye by 2.02 pp when using only 1% of BEV labels. At extremely low percentage splits, PoBEV does not encounter enough BEV labels to learn the mapping from FV to BEV, while the FV semantic-based pretraining of SkyEye does not impart sufficient geometric modeling and representation learning ability to the network. The notable improvement over SkyEye is primarily attributed to the superior segmentation performance on static classes such as road and building as well as non-moving dynamic objects such as trucks and buses. This improvement directly stems from the use of implicit neural fields to model scene geometry which helps the network to effectively reason about static elements in the scene. Moreover, we highlight that LetsMap finetuned using only 5% of BEV labels already outperforms the state-of-the-art fully supervised approach PoBEV trained using 100% of BEV labels, thus underscoring the impact of model pretraining in reducing the dependence on large quantities of labeled data.
We also note that SkyEye consistently outperforms our approach across four of the five percentage splits on person, two-wheeler, and car. We believe that the superior performance of SkyEye stems from the presence of 100% FV labels which provide unparalleled semantic knowledge during the pretraining phase. Nevertheless, our approach still yields competitive results without using any additional labeled data, thus highlighting the impact of our unsupervised pretraining mechanism.
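To make the label-efficiency argument concrete, the following Python sketch recomputes the pretraining gain per percentage split and finds the smallest split at which pretrained LetsMap exceeds PoBEV trained with 100% of BEV labels. The mIoU values are copied from Tab. 3; the variable and function names are illustrative.

```python
# mIoU values copied from Table 3 (KITTI-360), in [%].
SPLITS = ["1%", "5%", "10%", "50%", "100%"]
LETSMAP_WITHOUT_PT = [26.47, 31.05, 31.92, 32.10, 32.68]
LETSMAP_WITH_PT = [28.96, 31.90, 33.33, 32.54, 32.88]
POBEV_FULL_LABELS = 30.42  # PoBEV trained with 100% of BEV labels


def pretraining_gains():
    """Gain of pretraining in percentage points for every split."""
    return {
        split: round(with_pt - without_pt, 2)
        for split, with_pt, without_pt in zip(SPLITS, LETSMAP_WITH_PT, LETSMAP_WITHOUT_PT)
    }


def smallest_split_beating(reference, scores):
    """First (smallest) split whose score exceeds `reference`, else None."""
    for split, score in zip(SPLITS, scores):
        if score > reference:
            return split
    return None


gains = pretraining_gains()
largest_two = sorted(gains, key=gains.get, reverse=True)[:2]

print("gain per split [pp]:", gains)
print("two largest gains at:", largest_two)
print("LetsMap exceeds fully supervised PoBEV from:",
      smallest_split_beating(POBEV_FULL_LABELS, LETSMAP_WITH_PT))
```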

Table 4: Ablation study to investigate the efficacy of various network components. All experiments are on the KITTI-360 dataset using 1% of BEV labels.

| Model | Geometric | Semantic | Road | Side. | Build. | Terr. | Pers. | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| L1 | ✗ | ✗ | 69.40 | 32.09 | 34.75 | 35.27 | 1.01 | 2.79 | 28.76 | 7.66 | 26.47 |
| L2 | ✓ | ✗ | 70.85 | 34.34 | 38.12 | 35.03 | 0.93 | 4.06 | 29.79 | 8.84 | 27.75 |
| L3 | ✓ | ✓ | 70.58 | 34.26 | 40.68 | 38.53 | 1.35 | 4.74 | 30.94 | 10.58 | 28.96 |

Influence of Network Components: In this section, we quantify the impact of the geometric and semantic pathways on the overall performance of the network by incrementally incorporating each component into the pretraining step and finetuning the resultant model on 1% of BEV labels. Tab. 4 presents the results of this ablation study. The first row, comprising model L1, illustrates a network without our unsupervised pretraining and serves as a baseline to assess the improvement brought about by the other components. Model L2 incorporates the geometric pathway into the pretraining which results in an improvement of 1.28 pp over model L1. The inclusion of the geometric pathway during pretraining allows the implicit field to learn the scene geometry and reason about occlusions which helps improve the IoU metric on most of the classes by nearly 1 pp. Upon the incorporation of our novel temporal MAE strategy via the semantic pathway in model L3, we observe a significant 1.21 pp improvement over model L2. By learning to reconstruct the missing information in the masked patches over multiple timesteps, the network learns spatially and temporally consistent representations of various scene components which allows it to easily map the learned representation to the semantic BEV task using only 1% of BEV labels.

Impact of Mask Ratios: In this experiment, we evaluate the impact of different masking ratios on the overall performance of the model and present the results in Tab. 5. We observe that a masking ratio of 75% is ideal for our novel temporal masked autoencoding mechanism. Lower masking ratios do not present a sufficiently challenging pretraining task and thus result in only marginal improvements over model L2 in Tab. 4, while higher masking ratios mask out a significant portion of vital information resulting in worse performance as compared to a model with no masked autoencoding.

Table 5: Ablation study on the impact of masking ratio. All experiments are on the KITTI-360 dataset using 1% of BEV labels.

| Masking Ratio | 0% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| mIoU | 27.75 | 27.87 | 28.22 | 28.96 | 27.31 |

Supplementary Material

In this supplementary material, we present additional experimental results to analyze the performance of LetsMap, our unsupervised representation learning framework for semantic BEV mapping. To this end, we present further ablation experiments in Sec. S.1 and additional qualitative results in Sec. S.2.

S.1 Additional Ablative Experiments

In this section, we present additional ablative experiments to further study the impact of various parameters on the overall performance of the model. Specifically, we study the influence of (1) different DINOv2 variants (Sec. S.1.1) and (2) masking patch size in our novel T-MAE module (Sec. S.1.2) on the overall performance of the model. Further, we also present the results obtained when the native backbones of the best baselines are replaced with our backbone (Sec. S.1.3) and when the BEV percentage split defined in SkyEye [cit:bev-seg-skyeye] is used for model finetuning (Sec. S.1.4).

S.1.1 DINOv2 Backbone Variants

In this section, we study the influence of different variants of the DINOv2 backbone on the overall performance of the model. Specifically, we first pretrain the model using four variants, namely, vit-s, vit-b, vit-l, and vit-g, and finetune each of them using 1% of semantic BEV labels. Tab. S.1 presents the results of this ablation study. We observe that vit-s yields the lowest performance among all variants, achieving 3.41 pp lower than vit-b. Being the smallest of all variants, vit-s does not generate features that are as representative as its larger counterparts, thus resulting in its reduced overall performance.

The three larger variants, i.e., vit-b, vit-l, and vit-g, however, yield very similar mIoU scores with the difference between the highest and lowest performance being only 0.80 pp. In other words, the performance of semantic BEV mapping saturates after vit-b and does not improve upon using a larger backbone. We infer that this behavior can be attributed to one of the following reasons: (1) larger DINOv2 backbones provide better features for FV tasks, but these features do not easily transfer to the task of BEV mapping, or (2) 1% of BEV labels are insufficient to leverage the full potential of larger backbones. Given the similar performance of vit-b as compared to vit-l and vit-g while being more efficient in terms of number of parameters, we use the vit-b variant of the DINOv2 backbone in this work.

Table S.1: Ablation study on the impact of different DINOv2 backbones on the overall model performance. All models in this experiment are finetuned using only 1% of BEV labels. All metrics are reported in [%] on the KITTI-360 dataset.

| Backbone | vit-s | vit-b | vit-l | vit-g |
|---|---|---|---|---|
| mIoU | 25.55 | 28.96 | 28.40 | 28.16 |

Table S.2: Ablation study on the impact of masking patch size in T-MAE on the overall performance of the model. All models are finetuned using only 1% of labels in BEV. All metrics are reported in [%] on the KITTI-360 dataset.

| Patch Size | Road | Side. | Build. | Terr. | Per. | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 71.45 | 33.41 | 36.89 | 37.48 | 0.75 | 3.69 | 30.05 | 9.23 | 27.87 |
| 28 | 70.58 | 34.26 | 40.68 | 38.53 | 1.35 | 4.74 | 30.94 | 10.58 | 28.96 |
| 56 | 70.02 | 34.10 | 38.87 | 37.88 | 1.37 | 4.71 | 30.91 | 9.66 | 28.44 |

S.1.2 Masking Patch Size

In this section, we analyze the influence of masking patch sizes used for masking the input image in our novel temporal MAE (T-MAE) module on the overall performance of the model. To this end, we first pretrain the model using masking patches of size 14, 28, and 56, and then finetune the resultant model on 1% of BEV labels. Tab. S.2 presents the results of this ablation study.

We observe that a masking patch size of 28 gives the highest mIoU score across all the evaluated patch sizes. A smaller patch size does not mask out enough of an object and consequently does not present a challenging reconstruction task during the unsupervised pretraining phase. In contrast, a larger patch size masks out significant distinguishing regions in the image which hinders the representation learning ability of the network during the pretraining phase. The effect of patch sizes is noticeable across all classes while being significant for dynamic objects which experience a substantial reduction in the IoU scores when too little of the object is masked out. Given these observations, we use a patch size of 28 in our LetsMap framework.
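To illustrate how the patch size changes the masking granularity, the sketch below builds a random patch mask for a 448 × 1344 KITTI-360 image at a masking ratio of 0.75. It assumes a non-overlapping patch grid with uniformly random patch selection, as in standard masked autoencoders; the exact masking procedure of T-MAE may differ, and all names are illustrative.

```python
import random


def patch_grid(height, width, patch_size):
    """Number of non-overlapping patches along each image axis."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch_size, width // patch_size


def sample_patch_mask(height, width, patch_size, mask_ratio, seed=0):
    """Boolean grid with True marking masked patches.

    ASSUMPTION: patches are chosen uniformly at random without replacement.
    """
    rows, cols = patch_grid(height, width, patch_size)
    num_patches = rows * cols
    num_masked = round(mask_ratio * num_patches)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(r * cols + c) in masked for c in range(cols)] for r in range(rows)]


for patch_size in (14, 28, 56):
    mask = sample_patch_mask(448, 1344, patch_size, mask_ratio=0.75)
    num_masked = sum(sum(row) for row in mask)
    print(patch_size, "grid:", (len(mask), len(mask[0])), "masked patches:", num_masked)
```

Under these assumptions, the masked image area is identical for all three patch sizes; only the size of each contiguous masked region changes.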

S.1.3 Impact of DINOv2 on Baseline Approaches

In this section, we analyze the impact of the DINOv2 backbone on the overall performance of the baseline models. Specifically, we replace the native backbones of the two best baselines, PanopticBEV (PoBEV) [cit:bev-seg-panopticbev] and SkyEye [cit:bev-seg-skyeye], with a pretrained DINOv2 backbone as used in our model. We follow the setting defined in Sec. 4.4 of the main paper and report the results when finetuning with varying percentages of BEV labels in Tab. S.3. We observe that PoBEV reports slightly better performance across all percentage splits when using the DINOv2 backbone, with the largest improvement of 1.74 pp observed when using 50% of BEV labels, followed by 1.41 pp when using 100% of BEV labels. In contrast, we observe that the performance of SkyEye deteriorates when the native encoder is replaced with the DINOv2 backbone. At lower percentage splits of 1%, 5%, and 10%, the BEV segmentation performance drops by 2.85 pp, 3.76 pp, and 2.62 pp, respectively, which indicates that the SkyEye framework is unable to adapt the DINOv2 features to this task. The performance drop is observed across most classes and is especially large for car and two-wheeler at splits of 5% and above, which we believe is a consequence of not having an explicit scene geometry estimation module to estimate the extent of objects in the scene. We infer that the native backbone of SkyEye absorbs a significant chunk of scene geometry, but when replaced with a frozen backbone as in our model, SkyEye fails to learn sufficient geometric information.
We thus conclude that using our backbone in the baseline approaches results in only a slight improvement in PoBEV and deteriorates the BEV segmentation performance in SkyEye.

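Table S.3: Performance of baseline approaches when using the DINOv2 backbone as used in LetsMap. All experiments are on the KITTI-360 dataset.

| BEV | Model | FV | PT | Backbone | Road | Side. | Build. | Terr. | Pers. | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% | PoBEV | - | ✗ | Native | 60.41 | 20.97 | 24.65 | 23.38 | 0.15 | 0.23 | 21.71 | 1.23 | 19.09 |
| 1% | SkyEye | ✓ | ✓ | Native | 69.26 | 33.48 | 32.79 | 39.46 | 0.00 | 0.34 | 32.36 | 7.93 | 26.94 |
| 1% | PoBEV | - | ✗ | DINOv2 | 62.36 | 21.02 | 27.18 | 24.22 | 0.04 | 0.12 | 17.50 | 0.95 | 19.17 |
| 1% | SkyEye | ✓ | ✓ | DINOv2 | 65.13 | 29.56 | 29.02 | 34.22 | 0.78 | 2.87 | 26.04 | 5.12 | 24.09 |
| 1% | LetsMap | ✗ | ✓ | DINOv2 | 70.58 | 34.26 | 40.68 | 38.53 | 1.35 | 4.74 | 30.94 | 10.58 | 28.96 |
| 5% | PoBEV | - | ✗ | Native | 64.45 | 27.36 | 30.15 | 31.66 | 0.69 | 0.98 | 29.75 | 6.06 | 23.89 |
| 5% | SkyEye | ✓ | ✓ | Native | 72.16 | 37.20 | 34.89 | 42.97 | 4.77 | 9.16 | 40.74 | 9.88 | 31.47 |
| 5% | PoBEV | - | ✗ | DINOv2 | 67.61 | 30.73 | 30.97 | 32.80 | 0.42 | 0.47 | 25.48 | 5.58 | 24.26 |
| 5% | SkyEye | ✓ | ✓ | DINOv2 | 69.84 | 34.19 | 32.80 | 37.13 | 2.54 | 4.74 | 32.49 | 7.93 | 27.71 |
| 5% | LetsMap | ✗ | ✓ | DINOv2 | 73.74 | 39.56 | 42.07 | 41.49 | 2.46 | 6.32 | 34.68 | 14.88 | 31.90 |
| 10% | PoBEV | - | ✗ | Native | 66.58 | 30.28 | 31.76 | 34.50 | 1.22 | 3.28 | 33.43 | 7.56 | 26.08 |
| 10% | SkyEye | ✓ | ✓ | Native | 73.36 | 38.30 | 37.54 | 44.62 | 4.80 | 9.67 | 42.84 | 10.06 | 32.65 |
| 10% | PoBEV | - | ✗ | DINOv2 | 68.99 | 33.17 | 35.81 | 34.15 | 0.70 | 1.58 | 29.74 | 10.06 | 26.77 |
| 10% | SkyEye | ✓ | ✓ | DINOv2 | 72.19 | 36.18 | 35.26 | 39.84 | 3.78 | 5.61 | 36.95 | 10.44 | 30.03 |
| 10% | LetsMap | ✗ | ✓ | DINOv2 | 74.74 | 39.40 | 43.63 | 43.33 | 2.91 | 6.95 | 37.62 | 18.09 | 33.33 |
| 50% | PoBEV | - | ✗ | Native | 69.88 | 33.81 | 33.40 | 40.48 | 2.47 | 4.63 | 38.81 | 9.84 | 29.16 |
| 50% | SkyEye | ✓ | ✓ | Native | 73.10 | 39.23 | 38.08 | 45.72 | 4.05 | 10.44 | 44.72 | 12.10 | 33.43 |
| 50% | PoBEV | - | ✗ | DINOv2 | 73.04 | 37.38 | 37.86 | 41.31 | 1.82 | 3.83 | 37.13 | 14.85 | 30.90 |
| 50% | SkyEye | ✓ | ✓ | DINOv2 | 73.66 | 38.85 | 41.49 | 41.73 | 2.90 | 6.99 | 38.43 | 12.42 | 32.06 |
| 50% | LetsMap | ✗ | ✓ | DINOv2 | 74.29 | 38.48 | 43.87 | 42.77 | 2.80 | 5.22 | 37.68 | 15.20 | 32.54 |
| 100% | PoBEV | - | ✗ | Native | 70.14 | 35.23 | 34.68 | 40.72 | 2.85 | 5.63 | 39.77 | 14.38 | 30.42 |
| 100% | SkyEye | ✓ | ✓ | Native | 73.57 | 39.45 | 38.74 | 46.06 | 3.95 | 9.66 | 45.21 | 10.92 | 33.44 |
| 100% | PoBEV | - | ✗ | DINOv2 | 73.29 | 37.81 | 40.23 | 42.11 | 1.78 | 3.32 | 38.66 | 17.42 | 31.83 |
| 100% | SkyEye | ✓ | ✓ | DINOv2 | 73.51 | 39.13 | 40.04 | 42.08 | 3.17 | 5.90 | 39.29 | 12.72 | 31.98 |
| 100% | LetsMap | ✗ | ✓ | DINOv2 | 74.81 | 38.59 | 42.58 | 43.67 | 3.52 | 6.21 | 38.47 | 15.24 | 32.88 |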

S.1.4 BEV Finetuning using SkyEye Split

In this section, we report the results obtained upon finetuning both the baselines as well as our model with the single random set generated for each BEV percentage split as defined in SkyEye [cit:bev-seg-skyeye]. Please note that all networks are finetuned using the corresponding percent of BEV ground truth labels. Tab. S.4 presents the results of this study. We observe that our pretraining strategy significantly improves the performance of our model across all three percentage splits with the largest improvement of 1.98 pp observed when using 1% of BEV labels. We also note that LetsMap outperforms SkyEye by 0.99 pp and 1.88 pp when using 1% and 10% of BEV labels, respectively, which highlights the impact of our approach in low label regimes. Thus, in line with Sec. 4.4 and Tab. 3, we conclude that our novel pretraining strategy positively influences network performance on the BEV segmentation task and results in competitive segmentation performance even in extremely low label regimes.

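Table S.4: Ablation study on the impact of our unsupervised pretraining on the overall network performance using the finetuning split defined in SkyEye. All experiments are on the KITTI-360 dataset.

| BEV | Model | FV | PT | Epochs | Road | Side. | Build. | Terr. | Pers. | 2-Wh. | Car | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% | PoBEV | - | ✗ | 100 | 61.70 | 17.10 | 27.81 | 26.72 | 0.07 | 0.36 | 21.51 | 0.84 | 19.51 |
| 1% | SkyEye | ✓ | ✓ | 100 | 72.56 | 34.33 | 36.70 | 41.66 | 0.00 | 0.16 | 33.85 | 10.29 | 28.71 |
| 1% | LetsMap | ✗ | ✗ | 100 | 70.89 | 33.88 | 37.71 | 37.41 | 0.80 | 2.87 | 31.59 | 6.59 | 27.72 |
| 1% | LetsMap | ✗ | ✓ | 100 | 72.94 | 37.79 | 43.70 | 38.29 | 0.87 | 2.57 | 30.62 | 10.86 | 29.70 |
| 10% | PoBEV | - | ✗ | 50 | 70.00 | 32.75 | 38.07 | 34.43 | 0.80 | 3.33 | 34.46 | 9.25 | 27.89 |
| 10% | SkyEye | ✓ | ✓ | 50 | 76.07 | 40.30 | 40.30 | 45.33 | 3.75 | 8.15 | 42.64 | 10.73 | 33.41 |
| 10% | LetsMap | ✗ | ✗ | 50 | 76.69 | 40.41 | 42.55 | 42.17 | 1.33 | 6.57 | 40.46 | 18.06 | 33.53 |
| 10% | LetsMap | ✗ | ✓ | 50 | 74.47 | 41.16 | 46.31 | 43.31 | 5.48 | 8.80 | 41.55 | 21.24 | 35.29 |
| 50% | PoBEV | - | ✗ | 30 | 72.09 | 35.64 | 36.64 | 42.41 | 1.61 | 3.92 | 41.41 | 9.77 | 30.44 |
| 50% | SkyEye | ✓ | ✓ | 30 | 76.43 | 39.89 | 45.22 | 46.64 | 5.10 | 7.93 | 42.43 | 12.30 | 34.49 |
| 50% | LetsMap | ✗ | ✗ | 30 | 75.46 | 39.45 | 42.71 | 39.69 | 3.85 | 5.70 | 41.88 | 17.82 | 33.32 |
| 50% | LetsMap | ✗ | ✓ | 30 | 76.54 | 42.65 | 49.23 | 41.47 | 3.36 | 8.61 | 38.76 | 19.42 | 35.01 |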

S.2 Additional Qualitative Results

[Figure S.1: image grid with rows (a) to (h) and columns Input FV Image, 1%, 5%, 10%, 50%, 100%.]
Figure S.1: Qualitative results obtained when LetsMap is finetuned using 1%, 5%, 10%, 50%, and 100% of labels in BEV. Figures (a-d) depict predictions on the KITTI-360 dataset, while figures (e-h) show the predictions on the nuScenes dataset.

In this section, we qualitatively evaluate the performance of our model by comparing the semantic BEV maps obtained when the amount of BEV supervision is gradually increased from 1% to 100%. Fig. S.1 presents the results of this evaluation. Fig. S.1(a, b, c, d) present the results on the KITTI-360 dataset and Fig. S.1(e, f, g, h) present the results on the nuScenes dataset.

We observe that the semantic BEV map predictions are largely consistent across all the percentage splits of the two datasets with only minor differences pertaining to the predicted object extents. This behavior is evident in Fig. S.1(d, f) where the model finetuned with 1% of BEV data tends to stretch objects along the radial direction, while models finetuned with higher percentage splits are not significantly affected by this factor. Moreover, we note that the 1% model is able to both detect and localize all the objects in the BEV map to a high degree of accuracy, with only minor errors in the heading of the detected objects (Fig. S.1(c)). Further, we observe in Fig. S.1(a, f, h) that the model finetuned with 1% labels is able to accurately reason about occlusions in the scene, such as the road behind the truck in Fig. S.1(a) and the regions beyond the curve in the road in Fig. S.1(h). This occlusion handling ability stems from the use of an independent implicit field-based geometry pathway to reason about the scene geometry in the unsupervised pretraining step. In some cases, however, the scene priors learned during the pretraining step do not generalize well to a given image input. For example, we observe in Fig. S.1(c) that the grass patch next to the vehicle in the adjacent lane is erroneously predicted as road by the 1% model, while the models finetuned with more than 10% BEV data accurately capture this characteristic. Nonetheless, these observations reinforce the fact that our unsupervised pretraining step encourages the network to learn rich geometric and semantic representations of the scene which allows models finetuned with extremely small BEV percentage splits to generate accurate BEV maps.