DOI: 10.1145/3474085.3475397

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Published: 17 October 2021

Abstract

Recently proposed fine-grained 3D visual grounding is an essential and challenging task whose goal is to identify the 3D object referred to by a natural language sentence among distracting objects of the same category. Existing works usually adopt dynamic graph networks to model intra- and inter-modal interactions indirectly, which makes it difficult for the model to distinguish the referred object from distractors because visual and linguistic contents are represented monolithically. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose a TransRefer3D network that extracts entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by co-attention, our EA module matches visual entity features with linguistic entity features, while our RA module matches pairwise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that our proposed model significantly outperforms existing approaches by up to 10.6% and achieves new state-of-the-art performance. To the best of our knowledge, this is the first work to investigate a Transformer architecture for the fine-grained 3D visual grounding task.
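To make the EA/RA design concrete, below is a minimal PyTorch sketch of one Entity-and-Relation aware Contextual Block. It illustrates only the co-attention idea the abstract describes, not the authors' implementation: all module names, feature dimensions, the depth of the stack, and the simple pairwise construction of visual relation features are assumptions.

# Minimal sketch of one Entity-and-Relation aware Contextual Block (ERCB).
# Illustrative only: names, dimensions, and the pairwise relation
# construction are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Co-attention: queries from one modality attend to the other."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(q, kv, kv)  # attend to the other modality
        return self.norm(q + out)      # residual + LayerNorm


class ERCB(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.entity_attn = CrossAttention(dim)    # EA: objects <-> entity words
        self.relation_attn = CrossAttention(dim)  # RA: object pairs <-> relation words
        self.rel_proj = nn.Linear(2 * dim, dim)   # fuse pairwise object features
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, obj: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # obj: (B, N, D) per-object point-cloud features
        # lang: (B, L, D) token features of the referring sentence
        ent = self.entity_attn(obj, lang)

        # Pairwise visual relation features: concatenate every object pair,
        # project, then pool over partners (an assumed, simple construction).
        B, N, D = ent.shape
        a = ent.unsqueeze(2).expand(B, N, N, D)
        b = ent.unsqueeze(1).expand(B, N, N, D)
        rel = self.rel_proj(torch.cat([a, b], dim=-1)).mean(dim=2)  # (B, N, D)

        rel = self.relation_attn(rel, lang)
        return self.ffn(ent + rel)


# Stacking several ERCBs gives the hierarchical multimodal context modeling
# the abstract describes; batch size, object/token counts, and depth here
# are placeholders.
blocks = nn.ModuleList([ERCB(256) for _ in range(3)])
obj, lang = torch.randn(2, 8, 256), torch.randn(2, 12, 256)
for blk in blocks:
    obj = blk(obj, lang)

In the full model, a grounding head would score each of the N objects from the final object features; that head is omitted from this sketch.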




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085


    Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. 3D visual grounding
    2. entity attention
    3. relation attention
    4. transformer

    Qualifiers

    • Research-article


    Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

    Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%



    Cited By

• (2024) Comprehensive survey on 3D visual-language understanding techniques. Journal of Image and Graphics 29(6), 1747-1764. DOI: 10.11834/jig.240029
• (2024) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3512-3522. DOI: 10.1109/WACV57701.2024.00349
• (2024) Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning. IEEE Transactions on Visualization and Computer Graphics 30(8), 4867-4880. DOI: 10.1109/TVCG.2023.3279204
• (2024) Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(11), 7331-7347. DOI: 10.1109/TPAMI.2024.3387838
• (2024) A Survey of Visual Transformers. IEEE Transactions on Neural Networks and Learning Systems 35(6), 7478-7498. DOI: 10.1109/TNNLS.2022.3227717
• (2024) Addressing Predicate Overlap in Scene Graph Generation with Semantics-prototype Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1-7. DOI: 10.1109/IJCNN60899.2024.10650600
• (2024) Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding. 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6. DOI: 10.1109/ICMEW63481.2024.10645436
• (2024) CKT-RCM: Clip-Based Knowledge Transfer and Relational Context Mining for Unbiased Panoptic Scene Graph Generation. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3570-3574. DOI: 10.1109/ICASSP48485.2024.10446810
• (2024) LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 26418-26428. DOI: 10.1109/CVPR52733.2024.02496
• (2024) Multi-Attribute Interactions Matter for 3D Visual Grounding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17253-17262. DOI: 10.1109/CVPR52733.2024.01633
