DOI: 10.1145/3664647.3680890

Point Cloud Reconstruction Is Insufficient to Learn 3D Representations

Published: 28 October 2024

Abstract

This paper revisits the development of generative self-supervised learning on 2D images and 3D point clouds in autonomous driving. In 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, through model analysis we find that the gap in weight distribution between self-supervised and supervised learning is substantial when only low-level features are employed as the pretext task on 3D point clouds: low-level features represented by PoInt Cloud reconsTruction are insUfficient to learn 3D REpresentations (dubbed PICTURE). To advance the development of pretext tasks, we propose a unified generative self-supervised framework. First, we demonstrate that high-level features exhibit semantic consistency with downstream tasks, and we use them as an additional pretext task to strengthen the understanding of semantic information during pre-training. Next, we propose inter-class and intra-class discrimination-guided masking (I2Mask), which adaptively sets the masking ratio for each superclass based on the attributes of the high-level features. On the Waymo and nuScenes datasets, we achieve 75.13% mAP and 72.69% mAPH for 3D object detection, 79.4% mIoU for 3D semantic segmentation, and 18.4% mIoU for occupancy prediction. Extensive experiments demonstrate the effectiveness and necessity of high-level features.
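The abstract's core mechanism, I2Mask, adapts the masking ratio per superclass rather than masking all tokens uniformly. The sketch below is a minimal illustration of that idea only, not the paper's implementation: it assumes each token carries a superclass label and uses a hypothetical frequency-based heuristic (frequent superclasses masked more aggressively, rare ones less) with illustrative hyperparameters `base_ratio` and `delta`.

```python
import numpy as np

def adaptive_mask(superclass_ids, base_ratio=0.7, delta=0.2, seed=0):
    """Assign each token a mask probability based on its superclass.

    Hypothetical sketch of class-adaptive masking: rarer superclasses get a
    lower masking ratio so enough of their tokens survive for the encoder,
    while common superclasses are masked more aggressively. `base_ratio`
    and `delta` are illustrative, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    ids = np.asarray(superclass_ids)
    classes, counts = np.unique(ids, return_counts=True)
    freq = counts / counts.sum()  # superclass frequency in this scene
    # More frequent superclass -> higher masking ratio; keep ratios in [0, 0.95].
    ratio = np.clip(
        base_ratio + delta * (freq - freq.mean()) / (freq.std() + 1e-8),
        0.0, 0.95,
    )
    per_token_ratio = ratio[np.searchsorted(classes, ids)]
    mask = rng.random(ids.shape) < per_token_ratio  # True = token is masked
    return mask, dict(zip(classes.tolist(), ratio.tolist()))
```

For example, in a scene with 90 "road" tokens (class 0) and 10 "pedestrian" tokens (class 1), the frequent class receives a higher masking ratio than the rare one.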

Supplemental Material

MP4 File - Recorded Presentation Video
We revisit the development of generative self-supervised learning in 2D images and 3D point clouds in autonomous driving. In 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, through model analysis we find that the gap in weight distribution between self-supervised and supervised learning is substantial when only low-level features are employed as the pretext task on 3D point clouds. Low-level features represented by point cloud reconstruction are insufficient to learn 3D representations (dubbed PICTURE).


Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. autonomous driving
  2. multimedia foundation models
  3. point cloud scene understanding
  4. self-supervised learning

Qualifiers

  • Research-article

Conference

MM '24
Sponsor:
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
