DOI: 10.1145/3503161.3548291
research-article
Open access

AVQA: A Dataset for Audio-Visual Question Answering on Videos

Published: 10 October 2022

Abstract

Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video, and has drawn increasing research interest in recent years. However, no appropriate dataset has been available for this challenging task on videos of real-life scenarios. Existing datasets either pose questions that rely on visual clues alone, ignoring audio information, or consider audio only in restricted settings such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. We collect 57,015 videos of daily audio-visual activities and 57,335 specially designed question-answer pairs whose answers rely on clues from both modalities, so that the information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the audio, visual, and text modalities, and conduct ablation studies to analyze the role of each modality on our dataset. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. AVQA can therefore serve as an adequate testbed for developing models with a deeper understanding of multimodal information for audio-visual question answering in real-life scenarios. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)
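
As an illustration of the kind of hierarchical fusion the abstract describes, below is a minimal sketch of a module in which the question first attends to the audio and visual streams separately, and the two question-conditioned streams then attend to each other. This is only an assumed illustration, not the authors' Hierarchical Audio-Visual Fusing implementation: the layer choices, dimensions, and names (HierarchicalAVFusion, d_model, num_answers) are hypothetical.

import torch
import torch.nn as nn

class HierarchicalAVFusion(nn.Module):
    """Toy hierarchical audio-visual-text fusion (illustrative sketch only)."""

    def __init__(self, d_model=512, n_heads=8, num_answers=1000):
        super().__init__()
        # Stage 1: the question attends to each modality separately.
        self.q_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: the question-conditioned audio and visual streams attend to each other.
        self.a_to_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v_to_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_answers),
        )

    def forward(self, audio, visual, question):
        # audio: (B, Ta, d), visual: (B, Tv, d), question: (B, Tq, d)
        a, _ = self.q_to_audio(question, audio, audio)     # question-guided audio
        v, _ = self.q_to_visual(question, visual, visual)  # question-guided visual
        av, _ = self.a_to_v(a, v, v)                       # audio stream attends to visual
        va, _ = self.v_to_a(v, a, a)                       # visual stream attends to audio
        fused = torch.cat([av.mean(dim=1), va.mean(dim=1)], dim=-1)
        return self.classifier(fused)                      # answer logits

# Usage with random tensors standing in for pre-extracted audio/visual/question features:
model = HierarchicalAVFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 16, 512), torch.randn(2, 12, 512))
print(logits.shape)  # torch.Size([2, 1000])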

Supplementary Material

MP4 File (MM22-fp2343.mp4)
Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video. However, no appropriate dataset has been available for this challenging task on videos of real-life scenarios. Existing datasets either pose questions that rely on visual clues alone or consider audio only in restricted settings, such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the three modalities, and conduct ablation studies to analyze the role of each modality. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)



Information

Published In

cover image ACM Conferences
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 978-1-4503-9203-7
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. audio-visual question answering
  2. dataset
  3. multimodal

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Upcoming Conference

MM '24
The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne , VIC , Australia

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)818
  • Downloads (Last 6 weeks)118
Reflects downloads up to 10 Oct 2024


Cited By

  • (2024) Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4466-4475. DOI: 10.1109/WACV57701.2024.00442. Online publication date: 3-Jan-2024.
  • (2024) Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, 5, pp. 4109-4119. DOI: 10.1109/TCSVT.2023.3318220. Online publication date: May-2024.
  • (2024) Semantic Enrichment for Video Question Answering with Gated Graph Neural Networks. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11616-11620. DOI: 10.1109/ICASSP48485.2024.10447275. Online publication date: 14-Apr-2024.
  • (2024) Enhancing Audio-Visual Question Answering with Missing Modality via Trans-Modal Associative Learning. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5755-5759. DOI: 10.1109/ICASSP48485.2024.10446292. Online publication date: 14-Apr-2024.
  • (2024) CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26711-26721. DOI: 10.1109/CVPR52733.2024.02523. Online publication date: 16-Jun-2024.
  • (2024) Heterogeneous Interactive Graph Network for Audio-Visual Question Answering. Knowledge-Based Systems, Vol. 300, 112165. DOI: 10.1016/j.knosys.2024.112165. Online publication date: Sep-2024.
  • (2024) From image to language. Information Fusion, Vol. 106, C. DOI: 10.1016/j.inffus.2024.102270. Online publication date: 25-Jun-2024.
  • (2024) Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling. International Journal of Computer Vision. DOI: 10.1007/s11263-024-02142-3. Online publication date: 9-Jun-2024.
  • (2024) FunQA: Towards Surprising Video Comprehension. Computer Vision - ECCV 2024, pp. 39-57. DOI: 10.1007/978-3-031-73232-4_3. Online publication date: 30-Sep-2024.
  • (2023) Progressive Spatio-temporal Perception for Audio-Visual Question Answering. Proceedings of the 31st ACM International Conference on Multimedia, pp. 7808-7816. DOI: 10.1145/3581783.3612293. Online publication date: 26-Oct-2023.
