DOI: 10.1007/978-3-030-01216-8_16

Audio-Visual Event Localization in Unconstrained Videos

Published: 08 September 2018

Abstract

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling; the learned attention can capture semantics of sounding objects; temporal alignment is important for audio-visual fusion; the proposed DMRN is effective in fusing audio-visual features; and strong correlations between the two modalities enable cross-modality localization.
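
The two fusion components named above lend themselves to a compact illustration. Below is a minimal PyTorch sketch of an audio-guided spatial attention layer and a DMRN-style dual residual fusion block; all dimensions (a 512-d visual feature grid over 49 locations, 128-d audio embeddings, a 256-d hidden layer) and layer shapes are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedAttention(nn.Module):
    # Pools a flattened visual feature grid into one vector, with spatial
    # weights conditioned on the audio embedding of the same video segment.
    def __init__(self, v_dim=512, a_dim=128, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden)
        self.proj_a = nn.Linear(a_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, v, a):
        # v: (B, N, v_dim) visual features at N spatial locations
        # a: (B, a_dim) audio feature for the same segment
        s = self.score(torch.tanh(self.proj_v(v) + self.proj_a(a).unsqueeze(1)))
        w = F.softmax(s, dim=1)    # (B, N, 1): attention map over locations
        return (w * v).sum(dim=1)  # (B, v_dim): audio-attended visual vector

class DMRNBlock(nn.Module):
    # Dual residual fusion: each stream is updated with a residual
    # computed from the merged audio-visual representation.
    def __init__(self, dim=128):
        super().__init__()
        self.fuse_v = nn.Linear(2 * dim, dim)
        self.fuse_a = nn.Linear(2 * dim, dim)

    def forward(self, h_v, h_a):
        m = torch.cat([h_v, h_a], dim=-1)         # joint representation
        out_v = torch.tanh(h_v + self.fuse_v(m))  # residual update, visual
        out_a = torch.tanh(h_a + self.fuse_a(m))  # residual update, audio
        return out_v, out_a

# Hypothetical usage: 2 one-second segments, a 7x7 conv map flattened to 49 cells.
v_att = AudioGuidedAttention()(torch.randn(2, 49, 512), torch.randn(2, 128))
h_v, h_a = DMRNBlock(dim=128)(torch.randn(2, 128), torch.randn(2, 128))

In the paper's full model the per-segment streams are additionally encoded with recurrent networks around the fusion step; the sketch above covers only the attention pooling and one residual fusion update.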




          Published In

          Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II
          September 2018, 778 pages
          ISBN: 978-3-030-01215-1
          DOI: 10.1007/978-3-030-01216-8
          Publisher: Springer-Verlag, Berlin, Heidelberg


          Author Tags

          1. Audio-visual event
          2. Temporal localization
          3. Attention
          4. Fusion


          Cited By

          • (2024) Exploring Event Misalignment Bias and Segment Focus Bias for Weakly-Supervised Audio-Visual Video Parsing. Proceedings of the 2024 6th International Conference on Big-data Service and Intelligent Computation, 48-56. DOI: 10.1145/3686540.3686547
          • (2024) Toward Long Form Audio-Visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-26. DOI: 10.1145/3672079
          • (2024) From CNNs to Transformers in Multimodal Human Action Recognition: A Survey. ACM Transactions on Multimedia Computing, Communications, and Applications 20(8), 1-24. DOI: 10.1145/3664815
          • (2024) Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56(10), 1-42. DOI: 10.1145/3656580
          • (2024) SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-18. DOI: 10.1145/3613904.3642632
          • (2024) Enhanced video clustering using multiple Riemannian manifold-valued descriptors and audio-visual information. Expert Systems with Applications 246(C). DOI: 10.1016/j.eswa.2023.123099
          • (2024) Integrating Audio-Visual Contexts with Refinement for Segmentation. Artificial Neural Networks and Machine Learning – ICANN 2024, 31-44. DOI: 10.1007/978-3-031-72338-4_3
          • (2023) Modality-independent teachers meet weakly-supervised audio-visual event parser. Proceedings of the 37th International Conference on Neural Information Processing Systems, 73633-73651. DOI: 10.5555/3666122.3669343
          • (2023) Achieving cross modal generalization with multimodal unified representation. Proceedings of the 37th International Conference on Neural Information Processing Systems, 63529-63541. DOI: 10.5555/3666122.3668896
          • (2023) Revisit weakly-supervised audio-visual video parsing from the language perspective. Proceedings of the 37th International Conference on Neural Information Processing Systems, 40610-40622. DOI: 10.5555/3666122.3667889
