Transformer-Based Visual Grounding with Cross-Modality Interaction

Published: 30 May 2023

Abstract

This article tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that is referred to by a natural language query. Existing efforts on the VG task are twofold: (1) two-stage methods first extract region proposals and then rank them according to their similarity with the referring expression, which usually leads to suboptimal results due to the limited quality of the region proposals; (2) one-stage methods directly predict the coordinates of the target region by leveraging modern object detection architectures, but they pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach that focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to accurately reason about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures intra-modality correlations, modeling the linguistic context in the query and the spatial dependencies in the image, respectively. The cross-modality interactor, which plays a key role in our model, enables the model to highlight localization-relevant visual and textual cues through mutual verification of vision and language. The decoder learns a consolidated token representation enriched by multi-modal context and directly predicts the box coordinates. Extensive quantitative and qualitative experiments on five public benchmark datasets clearly demonstrate the effectiveness and rationale of the proposed method.
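The pipeline described above can be summarized in a compact sketch. The following PyTorch snippet is a minimal, hypothetical illustration of the three components named in the abstract: intra-modality feature encoders, a cross-modality interactor built from bidirectional cross-attention, and a decoder that regresses box coordinates from a learnable token. All module names, dimensions, and design details are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a transformer-based visual grounding pipeline:
# intra-modality encoding -> cross-modality interaction -> box regression.
import torch
import torch.nn as nn


class CrossModalityInteractor(nn.Module):
    """Mutual vision-language verification via bidirectional cross-attention (illustrative)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # Visual tokens attend to the query to highlight localization-relevant regions;
        # textual tokens attend to the image to ground the relevant words.
        v, _ = self.vis_from_txt(vis, txt, txt)
        t, _ = self.txt_from_vis(txt, vis, vis)
        return self.norm_v(vis + v), self.norm_t(txt + t)


class VisualGroundingSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Intra-modality encoders: spatial dependencies (image) and linguistic context (query).
        self.vis_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.txt_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.interactor = CrossModalityInteractor(dim, heads)
        # Modality-agnostic decoder: a learnable token aggregates multi-modal context.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, vis_feats, txt_feats):
        vis = self.vis_encoder(vis_feats)              # (B, N_v, dim) image patch features
        txt = self.txt_encoder(txt_feats)              # (B, N_t, dim) query token features
        vis, txt = self.interactor(vis, txt)
        tok = self.reg_token.expand(vis.size(0), -1, -1)
        fused = self.decoder(torch.cat([tok, vis, txt], dim=1))
        return self.box_head(fused[:, 0]).sigmoid()    # normalized (cx, cy, w, h)


# Toy usage with pre-extracted features (e.g., CNN patch features and BERT token embeddings).
boxes = VisualGroundingSketch()(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

The key design point reflected in the sketch is that no region proposals are generated: a single learnable token is enriched with both visual and textual context and mapped directly to box coordinates, which is the proposal-free, end-to-end behavior the abstract describes.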




    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6
November 2023, 858 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599695
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2023
    Online AM: 09 March 2023
    Accepted: 07 March 2023
    Revised: 23 January 2023
    Received: 04 July 2022
    Published in TOMM Volume 19, Issue 6


    Author Tags

    1. Visual Grounding
    2. referring expression
    3. cross-modality interaction

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Major Project of Anhui Province

    Cited By

• (2024) Artificial intelligence in ischemic stroke images: current applications and future directions. Frontiers in Neurology, 15. DOI: 10.3389/fneur.2024.1418060. Online publication date: 10-Jul-2024.
• (2024) SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3664816. Online publication date: 11-May-2024.
• (2024) Universal Relocalizer for Weakly Supervised Referring Expression Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(7), 1–23. DOI: 10.1145/3656045. Online publication date: 16-May-2024.
• (2024) Context-detail-aware United Network for Single Image Deraining. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1–18. DOI: 10.1145/3639407. Online publication date: 22-Jan-2024.
• (2024) Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing, 33, 1122–1135. DOI: 10.1109/TIP.2024.3359045. Online publication date: 2024.
• (2024) Language conditioned multi-scale visual attention networks for visual grounding. Image and Vision Computing, 150, 105242. DOI: 10.1016/j.imavis.2024.105242. Online publication date: Oct-2024.
• (2024) EPK-CLIP. Expert Systems with Applications, 252(PA). DOI: 10.1016/j.eswa.2024.124183. Online publication date: 24-Jul-2024.
• (2024) Leveraging chaos for enhancing encryption and compression in large cloud data transfers. The Journal of Supercomputing, 80(9), 11923–11957. DOI: 10.1007/s11227-024-05906-3. Online publication date: 4-Feb-2024.
• (2024) Dependability of Network Services in the Context of NFV: A Taxonomy and State of the Art Classification. Journal of Network and Systems Management, 32(2). DOI: 10.1007/s10922-024-09810-2. Online publication date: 26-Mar-2024.
• (2024) Dual-path temporal map optimization for make-up temporal video grounding. Multimedia Systems, 30(3). DOI: 10.1007/s00530-024-01340-w. Online publication date: 1-Jun-2024.
