Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Published: 27 October 2023
DOI: 10.1145/3581783.3612293

Abstract

The Audio-Visual Question Answering (AVQA) task aims to answer questions about visual objects, sounds, and their associations in videos. Such naturally multi-modal videos contain rich and complex dynamic audio-visual components, most of which can be unrelated to the given question or may even interfere with answering it. Conversely, focusing only on question-relevant audio-visual content removes this interference and allows the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions with respect to the question. Specifically, a temporal segment selection module is first introduced to select the audio-visual segments most relevant to the given question. Then, a spatial region selection module chooses the most question-relevant regions from the selected segments. To further refine the selected features, an audio-guided visual attention module is employed to perceive the association between the audio and the selected spatial regions. Finally, the spatio-temporal features from these modules are integrated to answer the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net.
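The abstract describes a three-stage pipeline: question-guided temporal segment selection, question-guided spatial region selection, and audio-guided attention over the selected visual content. As a rough illustration of how such progressive selection could be wired together, below is a minimal PyTorch sketch. Every module name, feature dimension, the top-k value, the single-query attention design, and the 42-way answer head are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the progressive selection idea (not the authors' code).
import torch
import torch.nn as nn


class TemporalSegmentSelector(nn.Module):
    """Scores each audio-visual segment against the question; keeps the top-k."""

    def __init__(self, dim: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(dim, 1)

    def forward(self, seg: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # seg: (B, T, D) fused audio-visual segment features; q: (B, D) question.
        logits = self.score(seg * q.unsqueeze(1)).squeeze(-1)        # (B, T)
        idx = logits.topk(self.top_k, dim=1).indices                 # (B, k)
        return seg.gather(1, idx.unsqueeze(-1).expand(-1, -1, seg.size(-1)))


class QueryAttention(nn.Module):
    """Single-query cross-attention, reused for the question-guided spatial
    selection step and for the audio-guided visual attention step."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
        # query: (B, D) pooled query; keys: (B, N, D) -> pooled output (B, D).
        out, _ = self.attn(query.unsqueeze(1), keys, keys)
        return out.squeeze(1)


class PSTPSketch(nn.Module):
    """Temporal selection -> spatial selection -> audio-guided attention -> answer."""

    def __init__(self, dim: int = 512, num_answers: int = 42, top_k: int = 2):
        super().__init__()
        self.temporal = TemporalSegmentSelector(dim, top_k)
        self.spatial = QueryAttention(dim)       # question attends over regions
        self.audio_guided = QueryAttention(dim)  # audio attends over kept segments
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, seg_av, regions, audio, question):
        # seg_av: (B, T, D) per-segment audio-visual features
        # regions: (B, R, D) region features (assumed drawn from kept segments)
        # audio: (B, D) pooled audio feature; question: (B, D) question embedding
        kept = self.temporal(seg_av, question)              # (B, k, D)
        spatial = self.spatial(question, regions)           # (B, D)
        av = self.audio_guided(audio, kept)                 # (B, D)
        fused = torch.cat([spatial, av, question], dim=-1)  # (B, 3D)
        return self.classifier(fused)                       # answer logits


# Toy forward pass with random features.
B, T, R, D = 2, 10, 36, 512
model = PSTPSketch(dim=D)
logits = model(torch.randn(B, T, D), torch.randn(B, R, D),
               torch.randn(B, D), torch.randn(B, D))
print(logits.shape)  # torch.Size([2, 42])
```

The hard top-k gather keeps only a few segments, which is what makes the later spatial and audio-guided steps cheaper than attending over the full video; a soft-attention variant would instead reweight all segments without discarding any.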


Cited By

  • Towards Long Form Audio-visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3672079
  • Heterogeneous Interactive Graph Network for Audio-Visual Question Answering. Knowledge-Based Systems, Vol. 300 (2024), 112165. DOI: 10.1016/j.knosys.2024.112165

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023, 9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. audio-visual
    2. question answering
    3. scene understanding

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Young Elite Scientists Sponsorship Program by CAST
    • Research Funds of Renmin University of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 995 of 4,171 submissions (24%)

