DOI: 10.1145/3595916.3626405
Research Article

Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios

Published: 01 January 2024

Abstract

This work introduces a pioneering, engineerable approach to 3D visual grounding (3DVG). We first summarize the main challenges facing 2D visual grounding (2DVG) and 3DVG: the absence of depth information in 2DVG, the memory and computational demands of global point clouds, limitations in dynamic scenarios, and a limited understanding of spatial localization reference frames. To address these, we propose Re_3DVG, a method for fragmented point cloud scenarios. By combining instance segmentation with transformer models, our approach establishes robust correspondences between text queries and object instances within the shared visible range. We also introduce FragCloud3DRef, a dataset built on ScanNet and supplemented with RGB data, object segmentations, and textual descriptions, which supports the evaluation of the proposed model. Experimental results show that our model outperforms conventional 3DVG and 2DVG models, establishing a strong benchmark for future research in this area. The source code and dataset are available at https://github.com/zehantan6970/Reimagining_3DVG.
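
As a rough illustration of the mechanism the abstract describes, the following is a minimal, hypothetical PyTorch sketch of transformer-based matching between a text query and segmented object instances. Everything here is an assumption for illustration: the class name InstanceTextMatcher, the PointNet-style pooling that stands in for a real instance-segmentation backbone, and all dimensions. It is not the paper's Re_3DVG implementation.

import torch
import torch.nn as nn

class InstanceTextMatcher(nn.Module):
    """Score segmented object instances against an embedded text query.

    Hypothetical sketch: a per-point MLP with max-pooling (PointNet-style)
    stands in for real instance-segmentation features, and a transformer
    decoder lets the instance features attend to the query tokens.
    """

    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # Per-point MLP; max-pooling over points yields one feature per instance.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model)
        )
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.score_head = nn.Linear(d_model, 1)  # per-instance matching score

    def forward(self, instances, text_emb):
        # instances: list of (N_i, 3) point sets, one per segmented object.
        # text_emb: (1, T, d_model) query tokens, e.g. from BERT + projection.
        feats = torch.stack(
            [self.point_mlp(pts).max(dim=0).values for pts in instances]
        ).unsqueeze(0)                                     # (1, K, d_model)
        fused = self.decoder(tgt=feats, memory=text_emb)   # instances attend to query
        return self.score_head(fused).squeeze(-1)          # (1, K)

# Usage: the highest-scoring instance is taken as the grounded object.
model = InstanceTextMatcher()
instances = [torch.randn(100, 3), torch.randn(80, 3), torch.randn(50, 3)]
text_emb = torch.randn(1, 12, 256)
scores = model(instances, text_emb)   # shape (1, 3)
best = scores.argmax(dim=-1)          # index of the referred instance

In a real system, the text embedding would come from a pretrained language model projected to d_model, and the per-instance point sets from the instance-segmentation stage; the sketch only shows the shape of the cross-modal matching step.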



    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2024


    Author Tags

    1. 2D Visual Grounding
    2. 3D Visual Grounding
3. Multimodal Learning
    4. Vision-Language Model

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

MMAsia '23: ACM Multimedia Asia
December 6–8, 2023
Tainan, Taiwan

    Acceptance Rates

Overall acceptance rate: 59 of 204 submissions, 29%

