research-article

Toward Long Form Audio-Visual Video Understanding

Authors:

Di Hu

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 9

Article No.: 284, Pages 1 - 26

https://doi.org/10.1145/3672079

Published: 23 September 2024

Abstract

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of real-world scenarios, long form audio-visual videos (LFAVs) are expected to serve as an important bridge for better exploring and understanding the world. In this article, we propose the task of multisensory temporal event localization in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale LFAV dataset of 5,175 videos with an average video length of 210 seconds. Each collected video is carefully annotated with diverse modality-aware events in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events and understanding their relations in long form videos. It comprises three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments on the new LFAV dataset demonstrate that the proposed method is effective at localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for further research on LFAV understanding. Project page: https://gewu-lab.github.io/LFAV/.
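To make the three-level data flow described in the abstract concrete, the following is a minimal toy sketch, not the model proposed in the article. It assumes that per-snippet scores are already available and replaces the learned components with simple rules: thresholding stands in for snippet prediction, merging consecutive active snippets stands in for event extraction, and a temporal-overlap check stands in for event interaction. All names, labels, scores, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    label: str    # hypothetical modality-aware label, e.g. "audio:speech"
    start: int    # index of the first snippet in the event (inclusive)
    end: int      # index of the last snippet in the event (inclusive)
    score: float  # mean snippet score over the event span


def snippet_prediction(scores: Dict[str, List[float]],
                       threshold: float = 0.5) -> Dict[str, List[bool]]:
    """Snippet level (toy): threshold each per-snippet score."""
    return {label: [s >= threshold for s in seq] for label, seq in scores.items()}


def event_extraction(scores: Dict[str, List[float]],
                     active: Dict[str, List[bool]]) -> List[Event]:
    """Event level (toy): merge runs of consecutive active snippets."""
    events: List[Event] = []
    for label, flags in active.items():
        start = None
        # The trailing False is a sentinel that closes a run ending at the last snippet.
        for i, on in enumerate(flags + [False]):
            if on and start is None:
                start = i
            elif not on and start is not None:
                span = scores[label][start:i]
                events.append(Event(label, start, i - 1, sum(span) / len(span)))
                start = None
    return sorted(events, key=lambda e: (e.start, e.label))


def event_interaction(events: List[Event]) -> List[Tuple[str, str]]:
    """Relation level (toy): report pairs of events that overlap in time."""
    pairs = []
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            if a.start <= b.end and b.start <= a.end:
                pairs.append((a.label, b.label))
    return pairs


# Hypothetical scores for five snippets and two modality-aware labels.
scores = {
    "audio:speech": [0.9, 0.8, 0.1, 0.2, 0.7],
    "visual:guitar": [0.1, 0.6, 0.7, 0.1, 0.1],
}
active = snippet_prediction(scores)
events = event_extraction(scores, active)
pairs = event_interaction(events)
print(events)
print(pairs)
```

In this toy setting the relation step only detects temporal overlap; the article's event interaction phase studies event relations with learned features, which this sketch does not attempt to reproduce.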

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 9

September 2024

780 pages

EISSN:1551-6865

DOI:10.1145/3613681

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 September 2024

Online AM: 07 June 2024

Accepted: 26 May 2024

Revised: 21 April 2024

Received: 10 December 2023

Published in TOMM Volume 20, Issue 9

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Public Computing Cloud, Renmin University of China
