
Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Published: 03 July 2019

Abstract

Visual Question Answering (VQA) is a challenging task that has gained increasing attention from both the computer vision and natural language processing communities in recent years. Given a question in natural language, a VQA system automatically generates the answer according to the referenced visual content. Although there has recently been much interest in this topic, existing work on visual question answering focuses mainly on single static images, which represent only a small part of the dynamic, sequential visual data in the real world. Its natural extension, video question answering (VideoQA), remains less explored. Because of the inherent temporal structure of video, ImageQA approaches cannot be applied effectively to video question answering. In this article, we not only take the spatial and temporal dimensions of video content into account but also employ an external knowledge base to improve the answering ability of the network. Specifically, we propose a knowledge-based progressive spatial-temporal attention network to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism over the comment information of each object extracted from DBpedia. We then develop a question-knowledge-guided progressive spatial-temporal attention network that learns a joint video representation for the video question answering task. We also construct a large-scale video question answering dataset. Extensive experiments on two different datasets validate the effectiveness of our method.
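The spatial-temporal attention idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`spatial_temporal_attention`, `frames`, `query`) are assumptions, and the query here stands in for the fused question-knowledge representation. A query vector first attends over region features within each frame (spatial attention), and the resulting frame vectors are then attended over time (temporal attention) to produce a single video representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(frames, query):
    """Sketch of question-guided spatial-then-temporal attention.

    frames: (T, R, D) array of R region features per frame for T frames.
    query:  (D,) fused question/knowledge vector (hypothetical; the paper's
            actual scoring functions are learned, not plain dot products).
    Returns the pooled video representation and both attention maps.
    """
    spatial_scores = frames @ query                 # (T, R) region relevance
    alpha = softmax(spatial_scores, axis=1)         # spatial attention per frame
    frame_repr = (alpha[..., None] * frames).sum(axis=1)  # (T, D) frame vectors
    temporal_scores = frame_repr @ query            # (T,) frame relevance
    beta = softmax(temporal_scores, axis=0)         # temporal attention
    video_repr = (beta[:, None] * frame_repr).sum(axis=0)  # (D,) video vector
    return video_repr, alpha, beta
```

In the paper's "progressive" design the question and knowledge representations guide several such attention passes in sequence; the single pass above only illustrates the spatial-then-temporal ordering.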




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2s
Special Section on Cross-Media Analysis for Visual Question Answering, Special Section on Big Data, Machine Learning and AI Technologies for Art and Design and Special Section on MMSys/NOSSDAV 2018
April 2019
381 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3343360

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 03 July 2019
Accepted: 01 February 2019
Revised: 01 January 2019
Received: 01 June 2018
Published in TOMM Volume 15, Issue 2s


Author Tags

  1. Video question answering
  2. attention
  3. knowledge
  4. spatial-temporal

Qualifiers

  • Research-article
  • Research
  • Refereed


Article Metrics

  • Downloads (last 12 months): 21
  • Downloads (last 6 weeks): 4
Reflects downloads up to 04 Oct 2024


Cited By

  • (2024) Video question answering via traffic knowledge database and question classification. Multimedia Systems 30:1. DOI: 10.1007/s00530-023-01240-5. Online: 16 Jan 2024.
  • (2023) Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering. ACM Trans. Multimedia Comput. Commun. Appl. 20:4, 1-22. DOI: 10.1145/3630101. Online: 11 Dec 2023.
  • (2023) Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering. ACM Trans. Multimedia Comput. Commun. Appl. 20:3, 1-22. DOI: 10.1145/3618301. Online: 23 Oct 2023.
  • (2023) Transformer-Based Visual Grounding with Cross-Modality Interaction. ACM Trans. Multimedia Comput. Commun. Appl. 19:6, 1-19. DOI: 10.1145/3587251. Online: 9 Mar 2023.
  • (2023) Visual Paraphrase Generation with Key Information Retained. ACM Trans. Multimedia Comput. Commun. Appl. 19:6, 1-19. DOI: 10.1145/3585010. Online: 30 May 2023.
  • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Trans. Multimedia Comput. Commun. Appl. 19:5, 1-21. DOI: 10.1145/3579825. Online: 16 Mar 2023.
  • (2022) Advanced Models for Video Question Answering. In Visual Question Answering, 135-143. DOI: 10.1007/978-981-19-0964-1_9. Online: 13 May 2022.
  • (2022) Spatio-temporal Data Sources Integration with Ontology for Road Accidents Analysis. In Business Information Systems Workshops, 251-262. DOI: 10.1007/978-3-031-04216-4_23. Online: 6 Apr 2022.
  • (2022) Estimation and Aggregation Method of Open Data Sources for Road Accident Analysis. In Intelligent Systems Design and Applications, 1025-1034. DOI: 10.1007/978-3-030-96308-8_95. Online: 27 Mar 2022.
  • (2021) Assessment Formation of Open Data Sources During Their Aggregation For Analyzing Road Accidents. In Proceedings of the 30th Conference of Open Innovations Association FRUCT, 239-245. DOI: 10.23919/FRUCT53335.2021.9599968. Online: 27 Oct 2021.
