Transformer-Based Visual Grounding with Cross-Modality Interaction

Published: 30 May 2023

Abstract

This article tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that is referred to by a natural language query. Existing efforts on the VG task are twofold: (1) two-stage methods first extract region proposals and then rank them according to their similarity with the referring expression, which usually leads to suboptimal results due to the limited quality of the region proposals; (2) one-stage methods directly predict the coordinates of the target region by leveraging modern object detection architectures, but they pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach that focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to accurately reason about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures intra-modality correlations, modeling the linguistic context in the query and the spatial dependencies in the image, respectively. The cross-modality interactor, which plays a key role in our model, enables the model to highlight localization-relevant visual and textual cues through mutual verification of vision and language. The decoder learns a consolidated token representation enriched by multi-modal context and directly predicts the box coordinates. Extensive quantitative and qualitative experiments on five public benchmark datasets clearly demonstrate the effectiveness and rationale of the proposed method.
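The pipeline described above can be summarized in a compact sketch. The following PyTorch snippet is a minimal, hypothetical illustration of the three components named in the abstract: intra-modality feature encoders, a cross-modality interactor built from bidirectional cross-attention, and a decoder that regresses box coordinates from a learnable token. All module names, dimensions, and design details are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a transformer-based visual grounding pipeline:
# intra-modality encoding -> cross-modality interaction -> box regression.
import torch
import torch.nn as nn


class CrossModalityInteractor(nn.Module):
    """Mutual vision-language verification via bidirectional cross-attention (illustrative)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # Visual tokens attend to the query to highlight localization-relevant regions;
        # textual tokens attend to the image to ground the relevant words.
        v, _ = self.vis_from_txt(vis, txt, txt)
        t, _ = self.txt_from_vis(txt, vis, vis)
        return self.norm_v(vis + v), self.norm_t(txt + t)


class VisualGroundingSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Intra-modality encoders: spatial dependencies (image) and linguistic context (query).
        self.vis_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.txt_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.interactor = CrossModalityInteractor(dim, heads)
        # Modality-agnostic decoder: a learnable token aggregates multi-modal context.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, vis_feats, txt_feats):
        vis = self.vis_encoder(vis_feats)              # (B, N_v, dim) image patch features
        txt = self.txt_encoder(txt_feats)              # (B, N_t, dim) query token features
        vis, txt = self.interactor(vis, txt)
        tok = self.reg_token.expand(vis.size(0), -1, -1)
        fused = self.decoder(torch.cat([tok, vis, txt], dim=1))
        return self.box_head(fused[:, 0]).sigmoid()    # normalized (cx, cy, w, h)


# Toy usage with pre-extracted features (e.g., CNN patch features and BERT token embeddings).
boxes = VisualGroundingSketch()(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

The key design point reflected in the sketch is that no region proposals are generated: a single learnable token is enriched with both visual and textual context and mapped directly to box coordinates, which is the proposal-free, end-to-end behavior the abstract describes.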




    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6
November 2023, 858 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599695
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2023
    Online AM: 09 March 2023
    Accepted: 07 March 2023
    Revised: 23 January 2023
    Received: 04 July 2022
    Published in TOMM Volume 19, Issue 6


    Author Tags

    1. Visual Grounding
    2. referring expression
    3. cross-modality interaction

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Major Project of Anhui Province

    Cited By

• (2024) Artificial intelligence in ischemic stroke images: current applications and future directions. Frontiers in Neurology, 15. DOI: 10.3389/fneur.2024.1418060. Online publication date: 10-Jul-2024.
• (2024) SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3664816. Online publication date: 11-May-2024.
• (2024) Universal Relocalizer for Weakly Supervised Referring Expression Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(7), 1–23. DOI: 10.1145/3656045. Online publication date: 16-May-2024.
• (2024) Context-detail-aware United Network for Single Image Deraining. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1–18. DOI: 10.1145/3639407. Online publication date: 22-Jan-2024.
• (2024) Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing, 33, 1122–1135. DOI: 10.1109/TIP.2024.3359045. Online publication date: 2024.
• (2024) Language conditioned multi-scale visual attention networks for visual grounding. Image and Vision Computing, 150, 105242. DOI: 10.1016/j.imavis.2024.105242. Online publication date: Oct-2024.
• (2024) EPK-CLIP. Expert Systems with Applications, 252(PA). DOI: 10.1016/j.eswa.2024.124183. Online publication date: 24-Jul-2024.
• (2024) Leveraging chaos for enhancing encryption and compression in large cloud data transfers. The Journal of Supercomputing, 80(9), 11923–11957. DOI: 10.1007/s11227-024-05906-3. Online publication date: 4-Feb-2024.
• (2024) Dependability of Network Services in the Context of NFV: A Taxonomy and State of the Art Classification. Journal of Network and Systems Management, 32(2). DOI: 10.1007/s10922-024-09810-2. Online publication date: 26-Mar-2024.
• (2024) Dual-path temporal map optimization for make-up temporal video grounding. Multimedia Systems, 30(3). DOI: 10.1007/s00530-024-01340-w. Online publication date: 1-Jun-2024.
