Abstract
Query-based transformers have shown great potential for modeling long-range attention in many image-domain tasks, but they have rarely been considered for LiDAR-based 3D object detection because of the overwhelming size of point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the features of the center candidates as the query embeddings in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding boxes from the output center feature representations. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN- and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer.
Z. Zhou: Work done during an internship at TuSimple.
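The first two steps described in the abstract (selecting center candidates from a heatmap, then using their features as transformer queries) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `select_center_queries` and `cross_attention` are hypothetical, the attention is a plain single-head scaled dot-product without learned projections, and the tiny 2x3 grid stands in for a real bird's-eye-view feature map. The keys and values here come from the same feature map; for multi-frame fusion they could, under the same assumption, come from the features of other frames.

```python
import math


def select_center_queries(heatmap, features, k):
    """Pick the k highest-scoring grid cells and return their (row, col)
    positions together with the feature vectors stored at those cells."""
    scored = [
        (score, r, c)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
    ]
    # Sort by descending score; ties are broken by position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    centers = [(r, c) for _, r, c in scored[:k]]
    queries = [list(features[r][c]) for r, c in centers]
    return centers, queries


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections,
    an assumption made to keep the sketch short)."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
            for key in keys
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy inputs: a 2x3 center heatmap and a 2-channel feature per cell.
heatmap = [[0.1, 0.9, 0.2],
           [0.3, 0.05, 0.8]]
features = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
            [[0.5, 0.5], [0.0, 0.0], [2.0, 0.0]]]

centers, queries = select_center_queries(heatmap, features, k=2)

# Flatten the grid so every cell acts as a key/value token.
tokens = [cell for row in features for cell in row]
updated = cross_attention(queries, tokens, tokens)
```

Because only `k` center features act as queries, the attention cost in this sketch scales with `k` times the number of tokens rather than with the square of the number of tokens, which is one way to read the abstract's claim about reduced computational complexity.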
Acknowledgements
We thank Yufei Xie for his help with refactoring the code for release.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, Z., Zhao, X., Wang, Y., Wang, P., Foroosh, H. (2022). CenterFormer: Center-Based Transformer for 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_29
DOI: https://doi.org/10.1007/978-3-031-19839-7_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19838-0
Online ISBN: 978-3-031-19839-7
eBook Packages: Computer Science, Computer Science (R0)