research-article
DOI: 10.1145/3469877.3490592

Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension

Published: 10 January 2022

Abstract

Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually represent the whole query with a single language feature vector and perform no reasoning between different objects, despite the rich relational cues about objects contained in the expression, which limits their grounding accuracy. Additionally, these methods mostly extract multi-scale visual object features with feature pyramid networks but ground on each feature layer separately, neglecting connections between objects of different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), which locates the referred object through relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model that uses the language to guide the fusion of object representations at different scales into a single feature map. To model connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities from the language expression to enhance the referred entity's features in the visual object feature map, and further extracts relations to guide object feature fusion via the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods while maintaining real-time inference.
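To make the two mechanisms named in the abstract concrete, the following is a minimal NumPy sketch of (a) language-guided weighting of multi-scale feature maps and (b) single-head self-attention over object features. This is an illustrative toy, not the paper's ERFN implementation: the function names, the dot-product scale-scoring scheme, and the assumption that all maps share one spatial resolution are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_fusion(feature_maps, lang_vec, proj):
    """Fuse S multi-scale feature maps (each H x W x C) into one map,
    weighting each scale by its similarity to the language vector.
    `proj` (C x D) maps visual features into the language space;
    this scoring scheme is an illustrative assumption."""
    pooled = np.stack([fm.mean(axis=(0, 1)) for fm in feature_maps])  # (S, C)
    scores = pooled @ proj @ lang_vec                                 # (S,)
    weights = softmax(scores)                                         # sums to 1
    # Assumes maps were already resized to a common H x W.
    return sum(w * fm for w, fm in zip(weights, feature_maps))

def self_attention(obj_feats):
    """Single-head self-attention over N object features (N x C),
    modeling pairwise relations between objects."""
    n, c = obj_feats.shape
    attn = softmax(obj_feats @ obj_feats.T / np.sqrt(c), axis=-1)     # (N, N)
    return attn @ obj_feats                                           # (N, C)
```

Because the scale weights form a convex combination, fusing identical maps returns the same map unchanged; the self-attention output keeps the (N, C) shape of its input.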



      Published In

      MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
      December 2021
      508 pages
ISBN: 9781450386074
DOI: 10.1145/3469877


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. Entity relation fusion networks (ERFN)
      2. Language guided feature fusion (LGFF)
      3. Language guided multi-scale fusion (LGMSF)
      4. Referring expression comprehension

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

MMAsia '21: ACM Multimedia Asia
Sponsor: ACM Multimedia Asia
December 1 - 3, 2021
Gold Coast, Australia

      Acceptance Rates

      Overall Acceptance Rate 59 of 204 submissions, 29%


