Abstract
Integrating LiDAR and camera information into a Bird's-Eye-View (BEV) representation has become a crucial component of 3D object detection in autonomous driving. However, existing methods are susceptible to inaccuracies in the calibration between the LiDAR and camera sensors. Such inaccuracies cause errors in depth estimation for the camera branch, ultimately leading to misalignment between the LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. To address errors caused by inaccurate point cloud projection, we introduce a LocalAlign module that obtains neighbor-aware depth features via graph matching. Additionally, we propose a GlobalAlign module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEVFusion by 1.6% on the nuScenes validation set. Importantly, GraphBEV outperforms BEVFusion by 8.3% under conditions with misalignment noise.
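The abstract only sketches how the LocalAlign module works. As a rough, hypothetical illustration of the neighbor-aware idea (gathering candidate depths from each query pixel's nearest projected LiDAR points, so that a small calibration offset does not discard the true depth), one could build a k-nearest-neighbor graph in the image plane. The function name, array shapes, and brute-force distance computation below are our own assumptions for clarity, not the paper's actual implementation:

```python
import numpy as np

def neighbor_aware_depth(proj_uv, proj_depth, query_uv, k=3):
    """Gather candidate depths from the k nearest projected LiDAR points.

    proj_uv:    (N, 2) pixel coordinates of LiDAR points projected onto the image
    proj_depth: (N,)   depth of each projected point
    query_uv:   (M, 2) pixel locations needing a depth estimate
    Returns:    (M, k) depths of each query's k nearest neighbors in the image plane
    """
    # Squared pixel distances between every query pixel and every projected point.
    d2 = ((query_uv[:, None, :] - proj_uv[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest projected points per query (the KNN graph edges).
    knn = np.argsort(d2, axis=1)[:, :k]
    # Each query receives k depth candidates instead of a single, possibly
    # mis-projected one, which is what makes the estimate calibration-tolerant.
    return proj_depth[knn]
```

In practice a KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the brute-force distance matrix, and the gathered depths would be fused by a learned layer rather than returned raw; this sketch only shows the graph-matching neighborhood lookup.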
References
Bai, X., et al.: TransFusion: robust lidar-camera fusion for 3D object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1090–1099 (2022)
Bi, J., Wei, H., Zhang, G., Yang, K., Song, Z.: Dyfusion: cross-attention 3D object detection with dynamic fusion. IEEE Lat. Am. Trans. 22(2), 106–112 (2024)
Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9287–9296 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Cai, Q., Pan, Y., Yao, T., Ngo, C.W., Mei, T.: Objectfusion: multi-modal 3D object detection with object-centric fusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 18067–18076 (2023)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, X., Zhang, T., Wang, Y., Wang, Y., Zhao, H.: Futr3D: a unified sensor fusion framework for 3D detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 172–181 (2023)
Chen, Y., Liu, J., Zhang, X., Qi, X., Jia, J.: Largekernel3D: scaling up kernels in 3D sparse CNNs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13488–13498 (2023)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F.: AutoAlignV2: deformable feature aggregation for dynamic multi-modal 3D object detection. arXiv:2207.10316 (2022)
Chen, Z., et al.: AutoAlign: pixel-instance feature aggregation for multi-modal 3D object detection. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (2022). https://doi.org/10.24963/ijcai.2022/116
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel R-CNN: towards high performance voxel-based 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1201–1209 (2021)
Dong, Y., et al.: Benchmarking robustness of 3D object detection to common corruptions in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1022–1032 (2023)
Fan, L., et al.: Embracing single stride 3D object detector with sparse transformer. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022). https://doi.org/10.1109/cvpr52688.2022.00827
Ge, C., et al.: Metabev: solving sensor failures for bev detection and map segmentation. arXiv preprint arXiv:2304.09801 (2023)
He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: a set-to-set approach to 3D object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8417–8427 (2022)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Huang, T., Liu, Z., Chen, X., Bai, X.: EPNet: enhancing point features with image semantics for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_3
Jiang, Y., et al.: Polarformer: multi-camera 3D object detection with polar transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1042–1050 (2023)
Kim, Y., Park, K., Kim, M., Kum, D., Choi, J.W.: 3D dual-fusion: dual-domain dual-query camera-lidar fusion for 3D object detection. arXiv preprint arXiv:2211.13529 (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705 (2019)
Li, X., et al.: LogoNet: towards accurate 3D object detection with local-to-global cross-modal fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17524–17534 (2023)
Li, X., et al.: Homogeneous multi-modal feature fusion and interaction for 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 691–707. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19839-7_40
Li, Y., et al.: Deepfusion: Lidar-camera deep fusion for multi-modal 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17182–17191 (2022)
Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: Bevstereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, pp. 1486–1494 (2023). https://doi.org/10.1609/aaai.v37i2.25234
Li, Y., et al.: Bevdepth: acquisition of reliable depth for multi-view 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1477–1485 (2023)
Li, Z., Wang, F., Wang, N.: Lidar R-CNN: an efficient and universal 3D object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7546–7555 (2021)
Li, Z., et al.: Bevformer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Liang, T., et al.: Bevfusion: a simple and robust lidar-camera fusion framework. In: Advances in Neural Information Processing Systems, vol. 35, pp. 10421–10434 (2022)
Liu, L., et al.: Sparsedet: a simple and effective framework for fully sparse lidar-based 3D object detection. arXiv preprint arXiv:2406.10907 (2024)
Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31
Liu, Y., et al.: Petrv2: a unified framework for 3D perception from multi-camera images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3262–3272 (2023)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Huang, T., Li, B., Chen, X., Wang, X., Bai, X.: Epnet++: cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 45(7), 8324–8341 (2023). https://doi.org/10.1109/TPAMI.2022.3228806
Liu, Z., et al.: Bevfusion: multi-task multi-sensor fusion with unified bird’s-eye view representation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 2774–2781 (2023)
Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel CNN for efficient 3D deep learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Mao, J., et al.: Voxel transformer for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173 (2021)
Miao, Z., et al.: PVGNet: a bottom-up one-stage 3D object detector with integrated multi-level features. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). https://doi.org/10.1109/cvpr46437.2021.00329
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. arXiv preprint arXiv:2210.02443 (2022)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Shi, S., et al.: PV-RCNN++: point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vision 131(2), 531–551 (2023)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779 (2019)
Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1991–1999 (2019)
Sindagi, V.A., Zhou, Y., Tuzel, O.: Mvx-net: multimodal voxelnet for 3D object detection. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 7276–7282. IEEE (2019)
Song, Z., Jia, C., Yang, L., Wei, H., Liu, L.: GraphAlign++: an accurate feature alignment by graph matching for multi-modal 3D object detection. IEEE Trans. Circ. Syst. Video Technol. 34, 2619–2632 (2023)
Song, Z., et al.: ContrastAlign: toward robust bev feature alignment via contrastive learning for multi-modal 3D object detection. arXiv preprint arXiv:2405.16873 (2024)
Song, Z., et al.: Robustness-aware 3D object detection in autonomous driving: a review and outlook. arXiv preprint arXiv:2401.06542 (2024)
Song, Z., Wei, H., Bai, L., Yang, L., Jia, C.: GraphAlign: enhancing accurate feature alignment by graph matching for multi-modal 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3358–3369 (2023)
Song, Z., Wei, H., Jia, C., Xia, Y., Li, X., Zhang, C.: VP-Net: voxels as points for 3D object detection. IEEE Trans. Geosci. Remote Sens. 61, 1–12 (2023)
Song, Z., et al.: Voxelnextfusion: a simple, unified, and effective voxel fusion framework for multimodal 3-D object detection. IEEE Trans. Geosci. Remote Sens. 61, 1–12 (2023). https://doi.org/10.1109/TGRS.2023.3331893
Song, Z., et al.: RoboFusion: towards robust multi-modal 3D object detection via SAM. arXiv preprint arXiv:2401.03907 (2024)
OpenPCDet Development Team: OpenPCDet: an open-source toolbox for 3D object detection from point clouds (2020). https://github.com/open-mmlab/OpenPCDet
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: PointPainting: sequential fusion for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4604–4612 (2020)
Wang, C., Ma, C., Zhu, M., Yang, X.: PointAugmenting: cross-modal augmentation for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11794–11803 (2021)
Wang, L., et al.: SAT-GCN: self-attention graph convolutional network-based 3D object detection for autonomous driving. Knowl.-Based Syst. 259, 110080 (2023)
Wang, L., et al.: Multi-modal 3D object detection in autonomous driving: a survey and taxonomy. IEEE Trans. Intell. Veh. 8, 3781–3798 (2023)
Wang, L., et al.: Fuzzy-NMS: improving 3D object detection with fuzzy classification in NMS. IEEE Trans. Intell. Veh. (2024)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180–191. PMLR (2022)
Xu, B., Chen, Z.: Multi-level fusion based 3D object detection from monocular images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2345–2353 (2018)
Xu, S., Li, F., Song, Z., Fang, J., Wang, S., Yang, Z.X.: Multi-sem fusion: multimodal semantic fusion for 3D object detection. IEEE Trans. Geosci. Remote Sens. 62, 1–14 (2024)
Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
Yang, C., et al.: Bevformer V2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17830–17839 (2023)
Yang, L., et al.: Bevheight++: toward robust visual centric 3D object detection. arXiv preprint arXiv:2309.16179 (2023)
Yang, L., et al.: Bevheight: a robust framework for vision-based roadside 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21611–21620 (2023)
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: sparse-to-dense 3D object detector for point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960 (2019)
Yang, Z., Chen, J., Miao, Z., Li, W., Zhu, X., Zhang, L.: DeepInteraction: 3D object detection via modality interaction. arXiv:2208.11112 (2022)
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). https://doi.org/10.1109/cvpr46437.2021.01161
Yin, T., Zhou, X., Krähenbühl, P.: Multimodal virtual point 3D detection. In: Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507 (2021)
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3D-CVF: generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 720–736. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58583-9_43
Yu, K., et al.: Benchmarking the robustness of lidar-camera fusion for 3D object detection. arXiv:2205.14951 (2022)
Zhang, C., et al.: Robust-fusionNet: deep multimodal sensor fusion for 3-D object detection under severe weather conditions. IEEE Trans. Instrum. Meas. 71, 1–13 (2022)
Zhang, G., Xie, J., Liu, L., Wang, Z., Yang, K., Song, Z.: URFormer: unified representation lidar-camera 3D object detection with transformer. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp. 401–413. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-8435-0_32
Zhang, Y., Chen, J., Huang, D.: CAT-Det: contrastively augmented transformer for multi-modal 3D object detection. arXiv:2204.00325 (2022)
Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499 (2018)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Acknowledgements
We sincerely appreciate the helpful discussions provided by Hongyu Pan from Horizon Robotics. This work was supported by the National Key R&D Program of China (2018AAA0100302).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Song, Z. et al. (2025). GraphBEV: Towards Robust BEV Feature Alignment for Multi-modal 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_20
DOI: https://doi.org/10.1007/978-3-031-73347-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73346-8
Online ISBN: 978-3-031-73347-5