Audio-Visual Event Localization in Unconstrained Videos

Published: 08 September 2018


In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised audio-visual event localization, weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle cross-modality localization. Our experiments support the following findings: joint modeling of the auditory and visual modalities outperforms independent modeling, the learned attention can capture the semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
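The idea of audio-guided visual attention can be illustrated with a minimal sketch. The abstract does not specify how attention scores are computed, so everything here is an assumption for illustration: the function names `softmax` and `audio_guided_attention` are hypothetical, the dot-product score is a stand-in for whatever learned scoring function the authors use, and the tiny two-dimensional features are made up. The sketch only shows the general pattern: an audio feature scores each visual region, and the scores weight a pooled visual feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_vec, region_feats):
    """Pool visual region features with weights derived from the audio feature.

    Assumption: similarity is a plain dot product. A real model would
    typically learn projections for both modalities before scoring.
    """
    # One relevance score per visual region, driven by the audio feature.
    scores = [
        sum(a * v for a, v in zip(audio_vec, region))
        for region in region_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the region features, one value per dimension.
    dim = len(region_feats[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended


# Toy inputs (hypothetical): one audio feature and three visual regions.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

In this toy case the first region is the only one aligned with the audio feature, so it receives the largest weight and dominates the pooled visual feature.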


