Abstract
Integrating LiDAR and camera information into a Bird's-Eye-View (BEV) representation has become a crucial component of 3D object detection in autonomous driving. However, existing methods are susceptible to inaccuracies in the calibration between the LiDAR and camera sensors. Such inaccuracies cause errors in depth estimation for the camera branch, ultimately leading to misalignment between the LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. To address errors caused by inaccurate point cloud projection, we introduce a LocalAlign module that obtains neighbor-aware depth features via graph matching. Additionally, we propose a GlobalAlign module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEVFusion by 1.6% on the nuScenes validation set. Importantly, GraphBEV outperforms BEVFusion by 8.3% under conditions with misalignment noise.
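The abstract only sketches how the LocalAlign module works. As a rough, hypothetical illustration of the neighbor-aware idea (gathering candidate depths from each query pixel's nearest projected LiDAR points, so that a small calibration offset does not discard the true depth), one could build a k-nearest-neighbor graph in the image plane. The function name, array shapes, and brute-force distance computation below are our own assumptions for clarity, not the paper's actual implementation:

```python
import numpy as np

def neighbor_aware_depth(proj_uv, proj_depth, query_uv, k=3):
    """Gather candidate depths from the k nearest projected LiDAR points.

    proj_uv:    (N, 2) pixel coordinates of LiDAR points projected onto the image
    proj_depth: (N,)   depth of each projected point
    query_uv:   (M, 2) pixel locations needing a depth estimate
    Returns:    (M, k) depths of each query's k nearest neighbors in the image plane
    """
    # Squared pixel distances between every query pixel and every projected point.
    d2 = ((query_uv[:, None, :] - proj_uv[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest projected points per query (the KNN graph edges).
    knn = np.argsort(d2, axis=1)[:, :k]
    # Each query receives k depth candidates instead of a single, possibly
    # mis-projected one, which is what makes the estimate calibration-tolerant.
    return proj_depth[knn]
```

In practice a KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the brute-force distance matrix, and the gathered depths would be fused by a learned layer rather than returned raw; this sketch only shows the graph-matching neighborhood lookup.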
References
Bai, X., et al.: TransFusion: robust lidar-camera fusion for 3D object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1090–1099 (2022)
Bi, J., Wei, H., Zhang, G., Yang, K., Song, Z.: Dyfusion: cross-attention 3D object detection with dynamic fusion. IEEE Lat. Am. Trans. 22(2), 106–112 (2024)
Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9287–9296 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Cai, Q., Pan, Y., Yao, T., Ngo, C.W., Mei, T.: Objectfusion: multi-modal 3D object detection with object-centric fusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 18067–18076 (2023)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, X., Zhang, T., Wang, Y., Wang, Y., Zhao, H.: Futr3D: a unified sensor fusion framework for 3D detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 172–181 (2023)
Chen, Y., Liu, J., Zhang, X., Qi, X., Jia, J.: Largekernel3D: scaling up kernels in 3D sparse CNNs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13488–13498 (2023)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F.: AutoAlignV2: deformable feature aggregation for dynamic multi-modal 3D object detection. arXiv:2207.10316 (2022)
Chen, Z., et al.: AutoAlign: pixel-instance feature aggregation for multi-modal 3D object detection. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (2022). https://doi.org/10.24963/ijcai.2022/116
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel R-CNN: towards high performance voxel-based 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1201–1209 (2021)
Dong, Y., et al.: Benchmarking robustness of 3D object detection to common corruptions in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1022–1032 (2023)
Fan, L., et al.: Embracing single stride 3D object detector with sparse transformer. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022). https://doi.org/10.1109/cvpr52688.2022.00827
Ge, C., et al.: Metabev: solving sensor failures for bev detection and map segmentation. arXiv preprint arXiv:2304.09801 (2023)
He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: a set-to-set approach to 3D object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8417–8427 (2022)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Huang, T., Liu, Z., Chen, X., Bai, X.: EPNet: enhancing point features with image semantics for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_3
Jiang, Y., et al.: Polarformer: multi-camera 3D object detection with polar transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1042–1050 (2023)
Kim, Y., Park, K., Kim, M., Kum, D., Choi, J.W.: 3D dual-fusion: dual-domain dual-query camera-lidar fusion for 3D object detection. arXiv preprint arXiv:2211.13529 (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705 (2019)
Li, X., et al.: LogoNet: towards accurate 3D object detection with local-to-global cross-modal fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17524–17534 (2023)
Li, X., et al.: Homogeneous multi-modal feature fusion and interaction for 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 691–707. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19839-7_40
Li, Y., et al.: Deepfusion: Lidar-camera deep fusion for multi-modal 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17182–17191 (2022)
Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: Bevstereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, pp. 1486–1494 (2023). https://doi.org/10.1609/aaai.v37i2.25234
Li, Y., et al.: Bevdepth: acquisition of reliable depth for multi-view 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1477–1485 (2023)
Li, Z., Wang, F., Wang, N.: Lidar R-CNN: an efficient and universal 3D object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7546–7555 (2021)
Li, Z., et al.: Bevformer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 1–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_1
Liang, T., et al.: Bevfusion: a simple and robust lidar-camera fusion framework. In: Advances in Neural Information Processing Systems, vol. 35, pp. 10421–10434 (2022)
Liu, L., et al.: Sparsedet: a simple and effective framework for fully sparse lidar-based 3D object detection. arXiv preprint arXiv:2406.10907 (2024)
Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31
Liu, Y., et al.: Petrv2: a unified framework for 3D perception from multi-camera images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3262–3272 (2023)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Huang, T., Li, B., Chen, X., Wang, X., Bai, X.: Epnet++: cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 45(7), 8324–8341 (2023). https://doi.org/10.1109/TPAMI.2022.3228806
Liu, Z., et al.: Bevfusion: multi-task multi-sensor fusion with unified bird’s-eye view representation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 2774–2781 (2023)
Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel CNN for efficient 3D deep learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Mao, J., et al.: Voxel transformer for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173 (2021)
Miao, Z., et al.: PVGNet: a bottom-up one-stage 3D object detector with integrated multi-level features. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). https://doi.org/10.1109/cvpr46437.2021.00329
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. arXiv preprint arXiv:2210.02443 (2022)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194–210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Shi, S., et al.: PV-RCNN++: point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vision 131(2), 531–551 (2023)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779 (2019)
Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1991–1999 (2019)
Sindagi, V.A., Zhou, Y., Tuzel, O.: Mvx-net: multimodal voxelnet for 3D object detection. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 7276–7282. IEEE (2019)
Song, Z., Jia, C., Yang, L., Wei, H., Liu, L.: GraphAlign++: an accurate feature alignment by graph matching for multi-modal 3D object detection. IEEE Trans. Circ. Syst. Video Technol. 34, 2619–2632 (2023)
Song, Z., et al.: ContrastAlign: toward robust bev feature alignment via contrastive learning for multi-modal 3D object detection. arXiv preprint arXiv:2405.16873 (2024)
Song, Z., et al.: Robustness-aware 3D object detection in autonomous driving: a review and outlook. arXiv preprint arXiv:2401.06542 (2024)
Song, Z., Wei, H., Bai, L., Yang, L., Jia, C.: GraphAlign: enhancing accurate feature alignment by graph matching for multi-modal 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3358–3369 (2023)
Song, Z., Wei, H., Jia, C., Xia, Y., Li, X., Zhang, C.: VP-Net: voxels as points for 3D object detection. IEEE Trans. Geosci. Remote Sens. 61, 1–12 (2023)
Song, Z., et al.: Voxelnextfusion: a simple, unified, and effective voxel fusion framework for multimodal 3-D object detection. IEEE Trans. Geosci. Remote Sens. 61, 1–12 (2023). https://doi.org/10.1109/TGRS.2023.3331893
Song, Z., et al.: RoboFusion: towards robust multi-modal 3D object detection via SAM. arXiv preprint arXiv:2401.03907 (2024)
OpenPCDet Development Team: OpenPCDet: an open-source toolbox for 3D object detection from point clouds (2020). https://github.com/open-mmlab/OpenPCDet
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: PointPainting: sequential fusion for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4604–4612 (2020)
Wang, C., Ma, C., Zhu, M., Yang, X.: PointAugmenting: cross-modal augmentation for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11794–11803 (2021)
Wang, L., et al.: SAT-GCN: self-attention graph convolutional network-based 3D object detection for autonomous driving. Knowl.-Based Syst. 259, 110080 (2023)
Wang, L., et al.: Multi-modal 3D object detection in autonomous driving: a survey and taxonomy. IEEE Trans. Intell. Veh. 8, 3781–3798 (2023)
Wang, L., et al.: Fuzzy-NMS: improving 3D object detection with fuzzy classification in NMS. IEEE Trans. Intell. Veh. (2024)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180–191. PMLR (2022)
Xu, B., Chen, Z.: Multi-level fusion based 3D object detection from monocular images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2345–2353 (2018)
Xu, S., Li, F., Song, Z., Fang, J., Wang, S., Yang, Z.X.: Multi-sem fusion: multimodal semantic fusion for 3D object detection. IEEE Trans. Geosci. Remote Sens. 62, 1–14 (2024)
Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
Yang, C., et al.: Bevformer V2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17830–17839 (2023)
Yang, L., et al.: Bevheight++: toward robust visual centric 3D object detection. arXiv preprint arXiv:2309.16179 (2023)
Yang, L., et al.: Bevheight: a robust framework for vision-based roadside 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21611–21620 (2023)
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: sparse-to-dense 3D object detector for point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960 (2019)
Yang, Z., Chen, J., Miao, Z., Li, W., Zhu, X., Zhang, L.: DeepInteraction: 3D object detection via modality interaction. arXiv:2208.11112 (2022)
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021). https://doi.org/10.1109/cvpr46437.2021.01161
Yin, T., Zhou, X., Krähenbühl, P.: Multimodal virtual point 3D detection. In: Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507 (2021)
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3D-CVF: generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 720–736. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58583-9_43
Yu, K., et al.: Benchmarking the robustness of lidar-camera fusion for 3D object detection. arXiv:2205.14951 (2022)
Zhang, C., et al.: Robust-fusionNet: deep multimodal sensor fusion for 3-D object detection under severe weather conditions. IEEE Trans. Instrum. Meas. 71, 1–13 (2022)
Zhang, G., Xie, J., Liu, L., Wang, Z., Yang, K., Song, Z.: URFormer: unified representation lidar-camera 3D object detection with transformer. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp. 401–413. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-8435-0_32
Zhang, Y., Chen, J., Huang, D.: CAT-Det: contrastively augmented transformer for multi-modal 3D object detection. arXiv:2204.00325 (2022)
Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499 (2018)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Acknowledgements
We sincerely appreciate the helpful discussions provided by Hongyu Pan from Horizon Robotics. This work was supported by the National Key R&D Program of China (2018AAA0100302).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Song, Z. et al. (2025). GraphBEV: Towards Robust BEV Feature Alignment for Multi-modal 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15084. Springer, Cham. https://doi.org/10.1007/978-3-031-73347-5_20
DOI: https://doi.org/10.1007/978-3-031-73347-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73346-8
Online ISBN: 978-3-031-73347-5