
GraphBEV: Towards Robust BEV Feature Alignment for Multi-modal 3D Object Detection

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Integrating LiDAR and camera information into a Bird’s-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to inaccurate calibration between the LiDAR and camera sensors. Such inaccuracies produce errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. To address errors caused by inaccurate point cloud projection, we introduce a LocalAlign module that employs neighbor-aware depth features via graph matching. Additionally, we propose a GlobalAlign module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEVFusion by 1.6% on the nuScenes validation set. Importantly, GraphBEV outperforms BEVFusion by 8.3% under conditions with misalignment noise.
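This preview does not include the authors' implementation, so the sketch below is an illustrative reconstruction from the abstract alone, not GraphBEV's actual code. The function `local_align_depths` and the module `GlobalAlign` are hypothetical names: the former gathers neighbor-aware depth candidates for projected LiDAR points with a k-d tree (one plausible reading of the neighbor graph the abstract builds via graph matching), and the latter predicts a small offset field and resamples camera BEV features onto the LiDAR BEV grid before fusion.

```python
# Illustrative sketch only -- assumed shapes and names, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.spatial import cKDTree


def local_align_depths(pix_uv: torch.Tensor, depths: torch.Tensor, k: int = 8):
    """Gather neighbor-aware depth candidates for each projected LiDAR point.

    pix_uv: (N, 2) pixel coordinates of points projected into the image.
    depths: (N,)   projected depths, possibly corrupted by calibration noise.
    Returns (N, k) depths of the k nearest projected neighbors, so a later
    module can weight candidates instead of trusting a single projection.
    """
    uv = pix_uv.detach().cpu().numpy()
    tree = cKDTree(uv)
    _, nbr_idx = tree.query(uv, k=k)  # (N, k) neighbor indices
    return depths[torch.as_tensor(nbr_idx, dtype=torch.long)]


class GlobalAlign(nn.Module):
    """Warp camera BEV features onto the LiDAR BEV grid via learned offsets."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-location (dx, dy) offset field, in BEV pixels.
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor):
        b, _, h, w = cam_bev.shape
        flow = self.offset(torch.cat([lidar_bev, cam_bev], dim=1))  # (B,2,H,W)
        # Identity sampling grid in grid_sample's normalized [-1, 1] coords.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=cam_bev.device),
            torch.linspace(-1.0, 1.0, w, device=cam_bev.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        # Convert pixel offsets to normalized units, then resample camera BEV.
        scale = torch.tensor([2.0 / w, 2.0 / h], device=cam_bev.device)
        grid = base + flow.permute(0, 2, 3, 1) * scale
        cam_aligned = F.grid_sample(cam_bev, grid, align_corners=True)
        return torch.cat([lidar_bev, cam_aligned], dim=1)
```

A `grid_sample`-based warp is just one way to realize "rectifying BEV misalignment"; the published GlobalAlign design may generate and apply offsets differently.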



Notes

  1. https://github.com/minrk/scipy-1/blob/master/scipy/spatial/ckdtree.c
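For reference, SciPy exposes this C k-d tree implementation through `scipy.spatial.cKDTree`. A toy query of the kind a neighbor-aware depth module would need looks like the following; the array contents are fabricated for illustration:

```python
# Toy nearest-neighbor query via SciPy's k-d tree (the public wrapper
# around the C implementation linked in the note above).
import numpy as np
from scipy.spatial import cKDTree

uv = np.random.rand(1000, 2) * [1600.0, 900.0]  # fake projected pixel coords
tree = cKDTree(uv)
dists, idx = tree.query(uv, k=8)  # 8 nearest projected neighbors per point
print(idx.shape)  # (1000, 8); each point's first neighbor is itself
```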


Acknowledgements

We sincerely appreciate the helpful discussions with Hongyu Pan from Horizon Robotics. This work was supported by the National Key R&D Program of China (2018AAA0100302).

Author information

Correspondence to Caiyan Jia.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Song, Z. et al. (2025). GraphBEV: Towards Robust BEV Feature Alignment for Multi-modal 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_20


  • DOI: https://doi.org/10.1007/978-3-031-73347-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73346-8

  • Online ISBN: 978-3-031-73347-5

  • eBook Packages: Computer Science, Computer Science (R0)
