DOI: 10.1145/3503161.3548291
research-article
Open access

AVQA: A Dataset for Audio-Visual Question Answering on Videos

Published: 10 October 2022

Abstract

Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video, and has drawn increasing research interest in recent years. However, no appropriate dataset has been available for this challenging task on videos of real-life scenarios. Existing datasets either pose questions that rely on visual clues alone, ignoring audio information, or consider audio only in restricted settings such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. We collect 57,015 videos of daily audio-visual activities and 57,335 specially designed question-answer pairs whose answers rely on clues from both modalities, so that the information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the audio, visual, and text modalities, and conduct ablation studies to analyze the role of each modality on our dataset. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. AVQA can therefore serve as an adequate testbed for developing models with a deeper understanding of multimodal information for audio-visual question answering in real-life scenarios. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)
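
As an illustration of the kind of hierarchical fusion the abstract describes, below is a minimal sketch of a module in which the question first attends to the audio and visual streams separately, and the two question-conditioned streams then attend to each other. This is only an assumed illustration, not the authors' Hierarchical Audio-Visual Fusing implementation: the layer choices, dimensions, and names (HierarchicalAVFusion, d_model, num_answers) are hypothetical.

import torch
import torch.nn as nn

class HierarchicalAVFusion(nn.Module):
    """Toy hierarchical audio-visual-text fusion (illustrative sketch only)."""

    def __init__(self, d_model=512, n_heads=8, num_answers=1000):
        super().__init__()
        # Stage 1: the question attends to each modality separately.
        self.q_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: the question-conditioned audio and visual streams attend to each other.
        self.a_to_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v_to_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_answers),
        )

    def forward(self, audio, visual, question):
        # audio: (B, Ta, d), visual: (B, Tv, d), question: (B, Tq, d)
        a, _ = self.q_to_audio(question, audio, audio)     # question-guided audio
        v, _ = self.q_to_visual(question, visual, visual)  # question-guided visual
        av, _ = self.a_to_v(a, v, v)                       # audio stream attends to visual
        va, _ = self.v_to_a(v, a, a)                       # visual stream attends to audio
        fused = torch.cat([av.mean(dim=1), va.mean(dim=1)], dim=-1)
        return self.classifier(fused)                      # answer logits

# Usage with random tensors standing in for pre-extracted audio/visual/question features:
model = HierarchicalAVFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 16, 512), torch.randn(2, 12, 512))
print(logits.shape)  # torch.Size([2, 1000])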

Supplementary Material

MP4 File (MM22-fp2343.mp4)
Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video. However, no appropriate dataset has been available for this challenging task on videos of real-life scenarios. Existing datasets either pose questions that rely on visual clues alone or consider audio only in restricted settings, such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the three modalities, and conduct ablation studies to analyze the role of each modality. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)



Information

Published In

cover image ACM Conferences
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 978-1-4503-9203-7
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. audio-visual question answering
  2. dataset
  3. multimodal

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Upcoming Conference

MM '24
The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne , VIC , Australia

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)818
  • Downloads (Last 6 weeks)118
Reflects downloads up to 10 Oct 2024


Cited By

  • (2024) Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4466-4475. DOI: 10.1109/WACV57701.2024.00442. Online publication date: 3-Jan-2024.
  • (2024) Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, 5, pp. 4109-4119. DOI: 10.1109/TCSVT.2023.3318220. Online publication date: May-2024.
  • (2024) Semantic Enrichment for Video Question Answering with Gated Graph Neural Networks. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11616-11620. DOI: 10.1109/ICASSP48485.2024.10447275. Online publication date: 14-Apr-2024.
  • (2024) Enhancing Audio-Visual Question Answering with Missing Modality via Trans-Modal Associative Learning. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5755-5759. DOI: 10.1109/ICASSP48485.2024.10446292. Online publication date: 14-Apr-2024.
  • (2024) CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26711-26721. DOI: 10.1109/CVPR52733.2024.02523. Online publication date: 16-Jun-2024.
  • (2024) Heterogeneous Interactive Graph Network for Audio-Visual Question Answering. Knowledge-Based Systems, Vol. 300, 112165. DOI: 10.1016/j.knosys.2024.112165. Online publication date: Sep-2024.
  • (2024) From image to language. Information Fusion, Vol. 106, C. DOI: 10.1016/j.inffus.2024.102270. Online publication date: 25-Jun-2024.
  • (2024) Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling. International Journal of Computer Vision. DOI: 10.1007/s11263-024-02142-3. Online publication date: 9-Jun-2024.
  • (2024) FunQA: Towards Surprising Video Comprehension. Computer Vision - ECCV 2024, pp. 39-57. DOI: 10.1007/978-3-031-73232-4_3. Online publication date: 30-Sep-2024.
  • (2023) Progressive Spatio-temporal Perception for Audio-Visual Question Answering. Proceedings of the 31st ACM International Conference on Multimedia, pp. 7808-7816. DOI: 10.1145/3581783.3612293. Online publication date: 26-Oct-2023.
