research-article

Free access

Just Accepted

Towards Long Form Audio-visual Video Understanding

Authors:

Di HuAuthors Info & Claims

ACM Transactions on Multimedia Computing, Communications and Applications

Accepted on 26 May 2024

https://doi.org/10.1145/3672079

Online AM: 07 June 2024 Publication History

Abstract

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diversified modality-aware events, in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases in different levels: snippet prediction phase to learn snippet features, event extraction phase to extract event-level features, and event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.

References

[1]

Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, and Ravi Kiran Sarvadevabhatla. 2021. Hear me out: Fusional approaches for audio augmented temporal action localization. arXiv preprint arXiv:2106.14118 (2021).

[2]

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).

[3]

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.

[4]

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020. Vggsound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 721–725.

[5]

Jiawei Chen and Chiu Man Ho. 2022. MM-ViT: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1910–1921.

[6]

Tianshui Chen, Muxin Xu, Xiaolu Hui, Hefeng Wu, and Liang Lin. 2019. Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 522–531.

[7]

Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, and Wenwu Wang. 2024. CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8421–8425.

[8]

Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5177–5186.

[9]

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2022. Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision 130, 1 (2022), 33–55.

Digital Library

[10]

Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. 2015. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961–970.

[11]

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6824–6835.

[12]

Alireza Fathi, Xiaofeng Ren, and James M Rehg. 2011. Learning to recognize objects in egocentric activities. In 2011 IEEE Conference on Computer Vision and Pattern Recognition. 3281–3288.

Digital Library

[13]

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6202–6211.

[14]

Edward Fish, Jon Weinbren, and Andrew Gilbert. 2022. Two-Stream Transformer Architecture for Long Video Understanding. arXiv preprint arXiv:2208.01753 (2022).

[15]

Junyu Gao, Mengyuan Chen, and Changsheng Xu. 2023. Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18827–18836.

[16]

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 776–780.

Digital Library

[17]

Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, and Feng Zheng. 2023. Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22942–22951.

[18]

Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. 2023. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5961–5971.

[19]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 9 (2015), 1904–1916.

Digital Library

[20]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

[21]

Linjiang Huang, Liang Wang, and Hongsheng Li. 2022. Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3272–3281.

[22]

Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. 2023. EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1–5.

[23]

Xun Jiang, Xing Xu, Zhiguo Chen, Jingran Zhang, Jingkuan Song, Fumin Shen, Huimin Lu, and Heng Tao Shen. 2022. DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing. In Proceedings of the 30th ACM International Conference on Multimedia. 719–727.

Digital Library

[24]

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 780–787.

Digital Library

[25]

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, Compositional Video Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1369–1379.

[26]

Guangyao Li, Wenxuan Hou, and Di Hu. 2023. Progressive Spatio-temporal Perception for Audio-Visual Question Answering. In Proceedings of the 31st ACM International Conference on Multimedia. 7808–7816.

Digital Library

[27]

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, and Di Hu. 2022. Learning to Answer Questions in Dynamic Audio-Visual Scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19108–19118.

[28]

Ji Lin, Chuang Gan, and Song Han. 2019. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7083–7093.

[29]

Xinfang Liu, Xiushan Nie, Junya Teng, Li Lian, and Yilong Yin. 2021. Single-shot semantic matching network for moment localization in videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 3 (2021), 1–14.

Digital Library

[30]

Quanling Meng, Heyan Zhu, Weigang Zhang, Xuefeng Piao, and Aijie Zhang. 2020. Action recognition using form and motion modalities. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 1s (2020), 1–16.

Digital Library

[31]

Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6752–6761.

[32]

Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. 2022. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8238–8247.

[33]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.

[34]

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.

[35]

Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 3 (2022), 1–18.

Digital Library

[36]

Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.

[37]

Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. 510–526.

[38]

Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba, Chen Zhao, Silvio Giancola, and Bernard Ghanem. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5026–5035.

[39]

Sebastian Stein and Stephen J McKenna. 2013. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 729–738.

Digital Library

[40]

Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, and Jianlong Fu. 2022. Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. arXiv preprint arXiv:2210.06031 (2022).

[41]

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1207–1216.

[42]

Yapeng Tian, Dingzeyu Li, and Chenliang Xu. 2020. Unified multisensory perception: Weakly-supervised audio-visual video parsing. In European Conference on Computer Vision. 436–454.

Digital Library

[43]

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audio-visual event localization in unconstrained videos. In European Conference on Computer Vision. 247–263.

Digital Library

[44]

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 6450–6459.

[45]

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).

[46]

Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, and Jiebo Luo. 2022. Semantic and relation modulation for audio-visual event localization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

[47]

Jianyu Wang, Bing-Kun Bao, and Changsheng Xu. 2021. Dualvgr: A dual-visual graph reasoning unit for video question answering. IEEE Transactions on Multimedia 24 (2021), 3369–3380.

Digital Library

[48]

Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. 2017. Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 4325–4334.

[49]

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer vision. 20–36.

[50]

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 568–578.

[51]

Yake Wei, Di Hu, Yapeng Tian, and Xuelong Li. 2022. Learning in audio-visual context: A review, analysis, and new perspective. arXiv preprint arXiv:2208.09579 (2022).

[52]

Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2015. HCP: A flexible CNN framework for multi-label image classification. IEEE transactions on Pattern Analysis and Machine Intelligence 38, 9 (2015), 1901–1907.

[53]

Chao-Yuan Wu and Philipp Krahenbuhl. 2021. Towards long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1884–1894.

[54]

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. 2022. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13587–13597.

[55]

Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. 2020. Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740 (2020).

[56]

Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, and Stefano Soatto. 2021. Long short-term transformer for online action detection. Advances in Neural Information Processing Systems (2021), 1086–1099.

[57]

Zichen Yang, Jie Qin, and Di Huang. 2022. Acgnet: Action complement graph network for weakly-supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 3090–3098.

[58]

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, and Gunhee Kim. 2021. Pano-avqa: Grounded audio-visual question answering on 360deg videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2031–2041.

[59]

Chen-Lin Zhang, Jianxin Wu, and Yin Li. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision. 492–510.

Digital Library

[60]

Shiyi Zhang, Wenxun Dai, Sujia Wang, Xiangwei Shen, Jiwen Lu, Jie Zhou, and Yansong Tang. 2023. LOGO: A Long-Form Video Dataset for Group Action Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2405–2414.

[61]

Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhenxin Xiao, Xiaohui Yan, Jun Yu, Deng Cai, and Fei Wu. 2019. Long-form video question answering via dynamic hierarchical reinforced networks. IEEE Transactions on Image Processing 28, 12 (2019), 5939–5952.

[62]

Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. 2022. Audio-visual segmentation. arXiv preprint arXiv:2207.05042 (2022).

[63]

Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, and Meng Wang. 2021. Positive sample propagation along the audio-visual event line. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8436–8444.

[64]

Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, and Bryan Catanzaro. 2021. Long-short transformer: Efficient transformers for language and vision. Advances in Neural Information Processing Systems (2021), 17723–17736.

[65]

Yueting Zhuang, Dejing Xu, Xin Yan, Wenzhuo Cheng, Zhou Zhao, Shiliang Pu, and Jun Xiao. 2020. Multichannel attention refinement for video question answering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 1s (2020), 1–23.

Digital Library

Index Terms

Towards Long Form Audio-visual Video Understanding
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding

Recommendations

Harmony across Music, Visuals and Movement in a New Audio-visual Gestural Performance
TEI '20: Proceedings of the Fourteenth International Conference on Tangible, Embedded, and Embodied Interaction

This paper describes the technology, concepts and development of Computer Storm, a live audio-visual piece created for a gestural instrument, the 'AirSticks'. The AirSticks allow the composition, performance and improvisation of live electronic music ...
The DIRAC AWEAR audio-visual platform for detection of unexpected and incongruent events
ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces

It is of prime importance in everyday human life to cope with and respond appropriately to events that are not foreseen by prior experience. Machines to a large extent lack the ability to respond appropriately to such inputs. An important class of ...
Event-centric multi-modal fusion method for dense video captioning
Abstract
Dense video captioning aims to automatically describe several events that occur in a given video, which most state-of-the-art models accomplish by locating and describing multiple events in an untrimmed video. Despite much progress in ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Multimedia Computing, Communications, and Applications

ACM Transactions on Multimedia Computing, Communications, and Applications Just Accepted

EISSN:1551-6865

Table of Contents

Copyright © 2024 Copyright held by the owner/author(s).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 07 June 2024

Accepted: 26 May 2024

Revised: 21 April 2024

Received: 10 December 2023

Check for updates

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
208
Total Downloads

Downloads (Last 12 months)208
Downloads (Last 6 weeks)80

Reflects downloads up to 10 Sep 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables