research-article

Exploring Logical Reasoning for Referring Expression Comprehension

Authors:

Rui Feng

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 5047 - 5055

https://doi.org/10.1145/3474085.3475677

Published: 17 October 2021

Abstract

Referring expression comprehension aims to localize the target object in an image referred to by a natural language expression. Most existing approaches neglect the implicit logical correlations among fine-grained cues (e.g., categories and attributes), which are beneficial for distinguishing objects. In this paper, we propose a logic-guided approach that exploits logical knowledge for referring expression comprehension within a hierarchical modular framework. Specifically, we extract fine-grained cues in the visual and textual domains and perform logical reasoning over them with explicit logical expressions, which regularizes the matching process without extra parameters. In addition, we improve existing modular methods by introducing object context information into the relationship module. Extensive experiments on three referring expression datasets demonstrate that our model produces more consistent predictions and achieves superior performance compared with previous methods.
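The abstract gives no formula for the logical regularization, so the following is only a minimal sketch of how a parameter-free logic constraint could be attached to a matching score. It assumes a single hypothetical rule, "if object i is the referred target, then object i agrees with a fine-grained cue in the expression", and penalizes violations with a hinge on log-probabilities. The function names, the rule, and the example numbers are all illustrative assumptions, not the authors' formulation.

```python
import math


def implication_penalty(p_antecedent, p_consequent, eps=1e-8):
    """Soft penalty for violating the rule: antecedent -> consequent.

    Assumption: a hinge on log-probabilities. The penalty is zero when the
    consequent is at least as probable as the antecedent, positive otherwise.
    It introduces no learnable parameters.
    """
    gap = math.log(p_antecedent + eps) - math.log(p_consequent + eps)
    return max(0.0, gap)


def logic_regularizer(match_probs, cue_probs):
    """Average implication penalty over all candidate objects.

    match_probs[i]: probability that object i is the referred target.
    cue_probs[i]:   probability that object i agrees with a fine-grained cue
                    from the expression (hypothetical, e.g. its category word).
    """
    penalties = [implication_penalty(m, c) for m, c in zip(match_probs, cue_probs)]
    return sum(penalties) / len(penalties)


# Hypothetical scores for three candidate objects.
match_probs = [0.7, 0.2, 0.1]
cue_probs = [0.9, 0.8, 0.05]

# Only the third object violates the rule (matched more strongly than its
# cue agreement allows), so it alone contributes to the regularizer.
reg = logic_regularizer(match_probs, cue_probs)
```

In a training setup, a term like `reg` would typically be added to the matching loss with a fixed weight, which is one way a constraint can shape predictions without adding parameters to the model.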

Index Terms

Exploring Logical Reasoning for Referring Expression Comprehension
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
    2. Natural language processing

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the Science and Technology Major Project of Commission of Science and Technology of Shanghai
National Natural Science Foundation of China
the Science and Technology Commission of Shanghai Municipality

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

Zhang P, Liu M, Song X, Cao D, Gao Z, Nie L (2024). Universal Relocalizer for Weakly Supervised Referring Expression Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7, 1-23. Online publication date: 16-May-2024
https://dl.acm.org/doi/10.1145/3656045
Xue D, Qian S, Xu C (2024). Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12, 7893-7908. Online publication date: Dec-2024
https://doi.org/10.1109/TPAMI.2024.3398012
Li M, Bai Z, Deng J (2024). Multi-view Attention Networks for Visual Question Answering. 2024 6th International Conference on Natural Language Processing (ICNLP), 788-794. Online publication date: 22-Mar-2024
https://doi.org/10.1109/ICNLP60986.2024.10692598
Qiu H, Wang L, Zhao T, Meng F, Li H (2024). HumanFormer: Human-centric Prompting Multi-modal Perception Transformer for Referring Crowd Detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5530-5540. Online publication date: 17-Jun-2024
https://doi.org/10.1109/CVPRW63382.2024.00562
Yang D, Ji J, Sun X, Wang H, Li Y, Ma Y, Ji R, El Saddik A, Mei T, Cucchiara R, Bertini M, Tobon Vallejo D, Atrey P, Hossain M (2023). Semi-Supervised Panoptic Narrative Grounding. Proceedings of the 31st ACM International Conference on Multimedia, 7164-7174. Online publication date: 26-Oct-2023
https://dl.acm.org/doi/10.1145/3581783.3612259
Mi J, Chen Z, Zhang J (2023). Weakly Supervised Referring Expression Grounding via Dynamic Self-Knowledge Distillation. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1254-1260. Online publication date: 1-Oct-2023
https://doi.org/10.1109/IROS55552.2023.10341909
Ding Z, Ding Z, Hui T, Huang J, Wei X, Wei X, Liu S, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L (2022). PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding. Proceedings of the 30th ACM International Conference on Multimedia, 5537-5546. Online publication date: 10-Oct-2022
https://dl.acm.org/doi/10.1145/3503161.3548086
Qiu H, Li H, Zhao T, Wang L, Wu Q, Meng F, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L (2022). RefCrowd: Grounding the Target in Crowd with Referring Expressions. Proceedings of the 30th ACM International Conference on Multimedia, 4435-4444. Online publication date: 10-Oct-2022
https://dl.acm.org/doi/10.1145/3503161.3547765
