CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection

Authors:

Dhanalaxmi Gaddam,

Fahad Shahbaz Khan,

Rao Muhammad Anwer, and

Hisham Cholakkal

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia

December 2022

Article No.: 27, Pages 1 - 8

https://doi.org/10.1145/3551626.3564938

Published: 13 December 2022

Abstract

Existing deep learning-based 3D object detectors typically rely on the appearance of individual objects and do not explicitly attend to the rich contextual information of the scene. In this work, we propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework, which takes a 3D scene as input and explicitly integrates useful contextual information of the scene at multiple levels to predict a set of object bounding boxes along with their corresponding semantic labels. To this end, we utilize a context enhancement network that captures contextual information at different levels of granularity, followed by a multi-stage refinement module that progressively refines the box positions and class predictions. Extensive experiments on the large-scale ScanNetV2 benchmark reveal the benefits of our proposed method, leading to an absolute improvement of 2.0% over the baseline. In addition to 3D object detection, we investigate the effectiveness of our CMR3D framework for the problem of 3D object counting. Our source code is available at https://github.com/Dhanalaxmi17/CMR3D.
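The idea of progressive refinement can be illustrated with a small sketch. This is not the CMR3D implementation (the abstract does not give the module's details); it only shows the general pattern of a cascade in which each stage predicts a residual on the current box and an updated set of class scores. All names here (`refine`, `count_objects`, `make_toy_stage`) are hypothetical, the "stages" are hand-written stand-ins for learned networks, and counting by tallying predicted labels is one simple assumption about how detections could be turned into counts.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Axis-aligned box as (cx, cy, cz, dx, dy, dz): center and size.
Box = Tuple[float, ...]
# A stage maps the current box to (residual per box parameter, per-class scores).
Stage = Callable[[Box], Tuple[Sequence[float], Sequence[float]]]


def refine(box: Box, stages: List[Stage]) -> Tuple[Box, int]:
    """Run the stages in order; each one nudges the box and re-scores the class."""
    if not stages:
        raise ValueError("at least one refinement stage is required")
    scores: Sequence[float] = []
    for stage in stages:
        residual, scores = stage(box)
        box = tuple(b + r for b, r in zip(box, residual))
    # The label comes from the scores produced by the final stage.
    label = max(range(len(scores)), key=lambda i: scores[i])
    return box, label


def count_objects(labels: List[int]) -> Counter:
    """Counting by detection (an assumption): tally predicted labels per class."""
    return Counter(labels)


def make_toy_stage(target: Box, scores: Sequence[float]) -> Stage:
    """Hypothetical stand-in for a learned stage: move halfway toward a fixed target."""
    def stage(box: Box) -> Tuple[Sequence[float], Sequence[float]]:
        residual = [0.5 * (t - b) for t, b in zip(target, box)]
        return residual, scores
    return stage


target = (1.0, 1.0, 1.0, 2.0, 2.0, 2.0)
stages = [make_toy_stage(target, [0.2, 0.8]) for _ in range(3)]
refined, label = refine((0.0, 0.0, 0.0, 1.0, 1.0, 1.0), stages)
counts = count_objects([label, 1, 0])
```

In this toy setup each stage halves the remaining error, so three stages bring the initial box most of the way to the target. In a real detector, each stage would be a trained network conditioned on scene features.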

Supplementary Material

PDF File (a27-gaddam-supp.pdf)

Supplemental material.


Index Terms

CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Robotics

Published In

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia

December 2022

296 pages

ISBN:9781450394789

DOI:10.1145/3551626

Conference Chair: Shuqiang Jiang (CAS)

General Chairs: Kiyoharu Aizawa (The University of Tokyo), Phoebe Chen (La Trobe), Keiji Yanai (The University of Electro-Communications)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MMAsia '22

Sponsor:

SIGMM

MMAsia '22: ACM Multimedia Asia

December 13 - 16, 2022

Tokyo, Japan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%
