
Recurrent Attention Network with Reinforced Generator for Visual Dialog

Published: 05 July 2020

Abstract

In Visual Dialog, an agent must parse the temporal context in the dialog history and the spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually complex, making it difficult to ground the question with a single glimpse, the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm and therefore lacks sentence-level supervision. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a set of candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
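
The following is a minimal, hypothetical sketch (not the authors' released implementation) of the two ideas the abstract describes: an attention processor that attends to image region features over several glimpses while refining its query, and a sentence-level REINFORCE-style loss for the generator G that uses a score from the discriminative model D as reward. Module names, tensor shapes, and the single-layer scoring functions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGlimpseAttention(nn.Module):
    """Attend to image region features several times, refining the query each step."""

    def __init__(self, dim, num_glimpses=3):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.score = nn.Linear(dim, 1)        # scores each region against the current query
        self.fuse = nn.Linear(2 * dim, dim)   # merges the query with the new glimpse

    def forward(self, query, regions):
        # query:   (B, D)    encoding of the question plus dialog history
        # regions: (B, R, D) image region features
        for _ in range(self.num_glimpses):
            logits = self.score(torch.tanh(regions + query.unsqueeze(1))).squeeze(-1)  # (B, R)
            attn = F.softmax(logits, dim=-1)                                           # attention over regions
            glimpse = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)                 # (B, D) attended feature
            query = torch.tanh(self.fuse(torch.cat([query, glimpse], dim=-1)))         # refine the query
        return query


def sentence_level_reinforce_loss(sampled_logprob, d_score, baseline):
    """REINFORCE-style loss for G: sampled_logprob is the summed log-probability of a
    sampled answer sentence; d_score is D's score for that answer, used as the reward."""
    advantage = (d_score - baseline).detach()  # reward signal; no gradient flows back through D
    return -(advantage * sampled_logprob).mean()
```

In this sketch, D's score for the sampled answer serves as a sentence-level reward that complements the word-by-word maximum-likelihood training of G.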




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 3
    August 2020
    364 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3409646
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 July 2020
    Online AM: 07 May 2020
    Accepted: 01 March 2020
    Revised: 01 October 2019
    Received: 01 November 2018
    Published in TOMM Volume 16, Issue 3


    Author Tags

    1. Visual Dialog
    2. deep learning
    3. reinforcement learning
    4. vision and language

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2025) Ordinal and Position Enhance the Framework of the Multimodal Dialogue System. Intelligent Robotics, 135-151. https://doi.org/10.1007/978-981-96-1614-5_9. Online publication date: 15-Feb-2025.
    • (2024) Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10, 1-19. https://doi.org/10.1145/3672396. Online publication date: 8-Jul-2024.
    • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology 15, 3, 1-25. https://doi.org/10.1145/3645099. Online publication date: 12-Mar-2024.
    • (2024) Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training. IEEE Transactions on Multimedia 26, 1639-1651. https://doi.org/10.1109/TMM.2023.3284594. Online publication date: 1-Jan-2024.
    • (2024) Attention-Aware Meta-Reweighted Optimization for Enhanced Intelligent Fault Diagnosis. IEEE Access 12, 64672-64685. https://doi.org/10.1109/ACCESS.2024.3397184. Online publication date: 2024.
    • (2023) Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 3, 1-22. https://doi.org/10.1145/3618301. Online publication date: 23-Oct-2023.
    • (2023) Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering. IEEE Transactions on Multimedia 26, 6131-6141. https://doi.org/10.1109/TMM.2023.3345172. Online publication date: 20-Dec-2023.
    • (2023) Disentangled Multimodal Representation Learning for Recommendation. IEEE Transactions on Multimedia 25, 7149-7159. https://doi.org/10.1109/TMM.2022.3217449. Online publication date: 1-Jan-2023.
    • (2023) Stay in Grid: Improving Video Captioning via Fully Grid-Level Representation. IEEE Transactions on Circuits and Systems for Video Technology 33, 7, 3319-3332. https://doi.org/10.1109/TCSVT.2022.3232634. Online publication date: 1-Jul-2023.
    • (2023) Combined visual and spatial-temporal information for appearance change person re-identification. Cogent Engineering 10, 1. https://doi.org/10.1080/23311916.2023.2197695. Online publication date: 25-Apr-2023.
