
A streamlined framework for BEV-based 3D object detection with prior masking

Published: 01 October 2024

Abstract

In autonomous driving, perception tasks based on the Bird's-Eye-View (BEV) representation have attracted considerable research attention due to their numerous benefits. Despite recent advances in performance, efficiency remains a challenge for real-world deployment. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. The success of our network is primarily attributed to two designs: the lifting strategy and a tailored BEV encoder. The lifting strategy converts 2D image features into 3D representations. Since the images carry no depth information, we introduce a prior mask for the BEV feature that assesses, at low cost, the significance of features along each camera ray. Moreover, we design a lightweight BEV encoder that significantly boosts the capacity of this physically interpretable representation. Within the encoder, we exploit the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we add a 2D object detection auxiliary head to exploit the cues offered by 2D detection, and we leverage 4D information to mine cues within the image sequence. Benefiting from these designs, our network captures abundant semantic information from 3D scenes and strikes a good balance between efficiency and performance.
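The lifting idea described above can be sketched in a few lines. The paper's exact prior-mask formulation is not given here, so everything below is an illustrative assumption: the exponential decay, the bin count, and the function name `lift_with_prior_mask` are all hypothetical stand-ins for "copy features along the camera ray, then weight bins by a cheap prior instead of a predicted depth distribution."

```python
import numpy as np

def lift_with_prior_mask(img_feat, num_depth, decay=0.05):
    """Lift a 2D feature map (C, H, W) into a frustum volume (C, D, H, W)
    without a learned depth distribution: copy each pixel feature to every
    depth bin, then down-weight bins farther along the camera ray with a
    cheap exponential prior. Illustrative sketch only, not the paper's code."""
    volume = np.repeat(img_feat[:, None], num_depth, axis=1)  # (C, D, H, W)
    prior = np.exp(-decay * np.arange(num_depth))             # (D,)
    return volume * prior[None, :, None, None]

feat = np.ones((8, 4, 6))                  # toy per-camera feature map
vol = lift_with_prior_mask(feat, num_depth=10)
print(vol.shape)                           # (8, 10, 4, 6)
```

The appeal of such a prior, as the abstract notes, is that it costs almost nothing at runtime compared with predicting a per-pixel depth distribution.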

Highlights

Efficient BEV framework for 3D object detection.
Incorporates 2D auxiliary branch and 4D information.
GPU memory-efficient lifting strategy.
Prior mask evaluates feature importance at different depths.
Tailored BEV encoder improves performance.
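The 4D (temporal) cue from the highlights can likewise be illustrated with a minimal fusion step. This is a generic sketch, not the paper's operator: `np.roll` with an integer cell shift is a toy stand-in for a proper ego-motion warp, and channel concatenation is just one common way to merge BEV features across frames.

```python
import numpy as np

def fuse_bev_sequence(bev_curr, bev_prev, ego_shift):
    """Toy temporal fusion of BEV maps shaped (C, X, Y): roughly align the
    previous frame's BEV grid to the current ego pose with an integer cell
    shift (a real system would apply a sub-cell SE(2)/SE(3) warp with
    interpolation), then concatenate along channels."""
    aligned = np.roll(bev_prev, shift=ego_shift, axis=(1, 2))
    return np.concatenate([bev_curr, aligned], axis=0)       # (2C, X, Y)

curr = np.random.rand(16, 32, 32)
prev = np.random.rand(16, 32, 32)
fused = fuse_bev_sequence(curr, prev, ego_shift=(2, -1))
print(fused.shape)                                           # (32, 32, 32)
```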



Published In

Image and Vision Computing, Volume 150, Issue C, October 2024, 517 pages

Publisher

Butterworth-Heinemann, United States

Author Tags

1. Multi-camera
2. Bird's-eye-view (BEV) representation
3. 3D object detection
4. Autonomous driving

            Qualifiers

            • Research-article
