DOI: 10.1145/3503161.3548309

DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing

Published: 10 October 2022

Abstract

The Weakly-Supervised Audio-Visual Video Parsing (AVVP) task aims to parse a video into temporal segments and predict their event categories per modality, labeling each event as audible, visible, or both. Since temporal boundaries and modality annotations are not provided and only video-level event labels are available, this task is more challenging than conventional video understanding tasks. Most previous works analyze videos by jointly modeling the audio and video data and then learning information from segment-level features with fixed lengths. However, such a design has two defects: 1) the varied semantic information hidden in different temporal lengths is neglected, which may lead models to learn incorrect information; 2) due to the joint context modeling, the unique features of the different modalities are not fully explored. In this paper, we propose a novel AVVP framework termed Dual Hierarchical Hybrid Network (DHHN) to tackle these two problems. Our DHHN method consists of three components: 1) a hierarchical context modeling network that extracts different semantics at multiple temporal lengths; 2) a modality-wise guiding network that learns unique information from each modality; 3) a dual-stream framework that generates audio and visual predictions separately. This design maintains the best adaptation to each modality, further boosting video parsing performance. Extensive quantitative and qualitative experiments demonstrate that our proposed method establishes new state-of-the-art performance on the AVVP task.
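
To make the dual-stream idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch. The module names (MultiScaleBlock, DualStreamParser), layer sizes, kernel sizes, and the choice of parallel 1-D convolutions for multi-scale temporal context are illustrative assumptions, not the authors' released implementation; only the high-level structure (multi-scale temporal modeling, cross-modal guidance with modality-specific streams, and separate audio/visual predictions) follows the abstract.

```python
# Hypothetical sketch of the dual-stream, multi-scale design described in the
# abstract. Names, dimensions, and layer choices are assumptions for
# illustration, NOT the authors' implementation.
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    """Temporal context at several window lengths: parallel 1-D convolutions
    with different kernel sizes whose outputs are summed."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, dim, time)
        x = sum(branch(x) for branch in self.branches)
        return x.transpose(1, 2)


class DualStreamParser(nn.Module):
    """Two modality-specific streams guided by cross-modal attention, with
    separate segment-level audio and visual event predictions."""

    def __init__(self, dim: int = 512, num_classes: int = 25):
        super().__init__()
        self.audio_ctx = MultiScaleBlock(dim)
        self.video_ctx = MultiScaleBlock(dim)
        self.v2a = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.audio_head = nn.Linear(dim, num_classes)
        self.video_head = nn.Linear(dim, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, time, dim) segment-level features
        a = self.audio_ctx(audio)
        v = self.video_ctx(video)
        # Each stream attends to the other for guidance, but keeps its own
        # residual path so modality-specific features are preserved.
        a = a + self.v2a(a, v, v)[0]
        v = v + self.a2v(v, a, a)[0]
        # Separate per-segment predictions for audible and visible events.
        return self.audio_head(a), self.video_head(v)


if __name__ == "__main__":
    model = DualStreamParser()
    audio = torch.randn(2, 10, 512)   # 2 videos, 10 one-second segments
    video = torch.randn(2, 10, 512)
    a_logits, v_logits = model(audio, video)
    print(a_logits.shape, v_logits.shape)  # torch.Size([2, 10, 25]) twice
```

Under weak supervision, the per-segment logits would typically be aggregated over time (e.g., with attentive MIL pooling) into a video-level prediction trained against the video-level labels; that pooling and loss are omitted from the sketch.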

Supplementary Material

MP4 File (MM22-fp2446.mp4)
Presentation video for the ACM MM 2022 poster paper "DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing"




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audio-visual comprehension
  2. multimodality
  3. video understanding
  4. weakly-supervised learning

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%


Article Metrics

  • Downloads (last 12 months): 76
  • Downloads (last 6 weeks): 4
Reflects downloads up to 10 Oct 2024

Cited By
  • (2024) Exploring Event Misalignment Bias and Segment Focus Bias for Weakly-Supervised Audio-Visual Video Parsing. Proceedings of the 2024 6th International Conference on Big-data Service and Intelligent Computation, 48-56. DOI: 10.1145/3686540.3686547. Online publication date: 29-May-2024.
  • (2024) Toward Long Form Audio-Visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-26. DOI: 10.1145/3672079. Online publication date: 7-Jun-2024.
  • (2024) PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding. Proceedings of the 2024 International Conference on Multimedia Retrieval, 618-627. DOI: 10.1145/3652583.3658113. Online publication date: 30-May-2024.
  • (2024) Unsupervised Cross-Domain Image Retrieval with Semantic-Attended Mixture-of-Experts. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 197-207. DOI: 10.1145/3626772.3657826. Online publication date: 10-Jul-2024.
  • (2024) Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings. IEEE Transactions on Multimedia 26, 9657-9670. DOI: 10.1109/TMM.2024.3396272. Online publication date: 2024.
  • (2024) Fuzzy Multimodal Graph Reasoning for Human-Centric Instructional Video Grounding. IEEE Transactions on Fuzzy Systems 32(9), 5046-5059. DOI: 10.1109/TFUZZ.2024.3436030. Online publication date: Sep-2024.
  • (2024) Uncertainty-Debiased Multimodal Fusion: Learning Deterministic Joint Representation for Multimodal Sentiment Analysis. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10688376. Online publication date: 15-Jul-2024.
  • (2024) Temporal Self-Paced Proposal Learning for Weakly-Supervised Video Moment Retrieval and Highlight Detection. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687638. Online publication date: 15-Jul-2024.
  • (2024) SADA: Self-Adaptive Domain Adaptation From Black-Box Predictors. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687543. Online publication date: 15-Jul-2024.
  • (2024) FCC-MF: Detecting Violence in Audio-Visual Context with Frame-Wise Cluster Contrast and Modality-Stage Flooding. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8346-8350. DOI: 10.1109/ICASSP48485.2024.10447086. Online publication date: 14-Apr-2024.
