
A streamlined framework for BEV-based 3D object detection with prior masking

Published: 01 October 2024

Abstract

In autonomous driving, perception tasks based on the Bird's-Eye-View (BEV) representation have attracted considerable research attention due to their numerous benefits. Despite recent advances in performance, efficiency remains a challenge for real-world deployment. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. The success of our network is primarily attributed to two designs: the lifting strategy and a tailored BEV encoder. The lifting strategy converts 2D image features into 3D representations. Since the images carry no depth information, we introduce a prior mask for the BEV feature that assesses, at low cost, the significance of features along each camera ray. Moreover, we design a lightweight BEV encoder that significantly boosts the capacity of this physically interpretable representation. Within the encoder, we exploit the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we add a 2D object detection auxiliary head to exploit the cues offered by 2D detection, and we leverage 4D information to mine cues within the image sequence. Benefiting from these designs, our network captures abundant semantic information from 3D scenes and strikes a good balance between efficiency and performance.
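The lifting idea described above can be sketched in a few lines. The paper's exact prior-mask formulation is not given here, so everything below is an illustrative assumption: the exponential decay, the bin count, and the function name `lift_with_prior_mask` are all hypothetical stand-ins for "copy features along the camera ray, then weight bins by a cheap prior instead of a predicted depth distribution."

```python
import numpy as np

def lift_with_prior_mask(img_feat, num_depth, decay=0.05):
    """Lift a 2D feature map (C, H, W) into a frustum volume (C, D, H, W)
    without a learned depth distribution: copy each pixel feature to every
    depth bin, then down-weight bins farther along the camera ray with a
    cheap exponential prior. Illustrative sketch only, not the paper's code."""
    volume = np.repeat(img_feat[:, None], num_depth, axis=1)  # (C, D, H, W)
    prior = np.exp(-decay * np.arange(num_depth))             # (D,)
    return volume * prior[None, :, None, None]

feat = np.ones((8, 4, 6))                  # toy per-camera feature map
vol = lift_with_prior_mask(feat, num_depth=10)
print(vol.shape)                           # (8, 10, 4, 6)
```

The appeal of such a prior, as the abstract notes, is that it costs almost nothing at runtime compared with predicting a per-pixel depth distribution.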

Highlights

Efficient BEV framework for 3D object detection.
Incorporates 2D auxiliary branch and 4D information.
GPU memory-efficient lifting strategy.
Prior mask evaluates feature importance at different depths.
Tailored BEV encoder improves performance.
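The 4D (temporal) cue from the highlights can likewise be illustrated with a minimal fusion step. This is a generic sketch, not the paper's operator: `np.roll` with an integer cell shift is a toy stand-in for a proper ego-motion warp, and channel concatenation is just one common way to merge BEV features across frames.

```python
import numpy as np

def fuse_bev_sequence(bev_curr, bev_prev, ego_shift):
    """Toy temporal fusion of BEV maps shaped (C, X, Y): roughly align the
    previous frame's BEV grid to the current ego pose with an integer cell
    shift (a real system would apply a sub-cell SE(2)/SE(3) warp with
    interpolation), then concatenate along channels."""
    aligned = np.roll(bev_prev, shift=ego_shift, axis=(1, 2))
    return np.concatenate([bev_curr, aligned], axis=0)       # (2C, X, Y)

curr = np.random.rand(16, 32, 32)
prev = np.random.rand(16, 32, 32)
fused = fuse_bev_sequence(curr, prev, ego_shift=(2, -1))
print(fused.shape)                                           # (32, 32, 32)
```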



Published In

Image and Vision Computing, Volume 150, Issue C, October 2024, 517 pages

Publisher

Butterworth-Heinemann, United States

Author Tags

1. Multi-camera
2. Bird's-eye-view (BEV) representation
3. 3D object detection
4. Autonomous driving

            Qualifiers

            • Research-article
