DOI: 10.1007/978-3-030-01216-8_16

Audio-Visual Event Localization in Unconstrained Videos

Published: 08 September 2018

Abstract

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling; the learned attention can capture semantics of sounding objects; temporal alignment is important for audio-visual fusion; the proposed DMRN is effective in fusing audio-visual features; and strong correlations between the two modalities enable cross-modality localization.
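
The two fusion components named above lend themselves to a compact illustration. Below is a minimal PyTorch sketch of an audio-guided spatial attention layer and a DMRN-style dual residual fusion block; all dimensions (a 512-d visual feature grid over 49 locations, 128-d audio embeddings, a 256-d hidden layer) and layer shapes are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedAttention(nn.Module):
    # Pools a flattened visual feature grid into one vector, with spatial
    # weights conditioned on the audio embedding of the same video segment.
    def __init__(self, v_dim=512, a_dim=128, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden)
        self.proj_a = nn.Linear(a_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, v, a):
        # v: (B, N, v_dim) visual features at N spatial locations
        # a: (B, a_dim) audio feature for the same segment
        s = self.score(torch.tanh(self.proj_v(v) + self.proj_a(a).unsqueeze(1)))
        w = F.softmax(s, dim=1)    # (B, N, 1): attention map over locations
        return (w * v).sum(dim=1)  # (B, v_dim): audio-attended visual vector

class DMRNBlock(nn.Module):
    # Dual residual fusion: each stream is updated with a residual
    # computed from the merged audio-visual representation.
    def __init__(self, dim=128):
        super().__init__()
        self.fuse_v = nn.Linear(2 * dim, dim)
        self.fuse_a = nn.Linear(2 * dim, dim)

    def forward(self, h_v, h_a):
        m = torch.cat([h_v, h_a], dim=-1)         # joint representation
        out_v = torch.tanh(h_v + self.fuse_v(m))  # residual update, visual
        out_a = torch.tanh(h_a + self.fuse_a(m))  # residual update, audio
        return out_v, out_a

# Hypothetical usage: 2 one-second segments, a 7x7 conv map flattened to 49 cells.
v_att = AudioGuidedAttention()(torch.randn(2, 49, 512), torch.randn(2, 128))
h_v, h_a = DMRNBlock(dim=128)(torch.randn(2, 128), torch.randn(2, 128))

In the paper's full model the per-segment streams are additionally encoded with recurrent networks around the fusion step; the sketch above covers only the attention pooling and one residual fusion update.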




          Published In

          Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II
          September 2018, 778 pages
          ISBN: 978-3-030-01215-1
          DOI: 10.1007/978-3-030-01216-8
          Publisher: Springer-Verlag, Berlin, Heidelberg


          Author Tags

          1. Audio-visual event
          2. Temporal localization
          3. Attention
          4. Fusion


          Cited By

          • (2024) Exploring Event Misalignment Bias and Segment Focus Bias for Weakly-Supervised Audio-Visual Video Parsing. Proceedings of the 2024 6th International Conference on Big-data Service and Intelligent Computation, 48-56. DOI: 10.1145/3686540.3686547
          • (2024) Toward Long Form Audio-Visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-26. DOI: 10.1145/3672079
          • (2024) From CNNs to Transformers in Multimodal Human Action Recognition: A Survey. ACM Transactions on Multimedia Computing, Communications, and Applications 20(8), 1-24. DOI: 10.1145/3664815
          • (2024) Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56(10), 1-42. DOI: 10.1145/3656580
          • (2024) SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-18. DOI: 10.1145/3613904.3642632
          • (2024) Enhanced video clustering using multiple Riemannian manifold-valued descriptors and audio-visual information. Expert Systems with Applications 246(C). DOI: 10.1016/j.eswa.2023.123099
          • (2024) Integrating Audio-Visual Contexts with Refinement for Segmentation. Artificial Neural Networks and Machine Learning – ICANN 2024, 31-44. DOI: 10.1007/978-3-031-72338-4_3
          • (2023) Modality-independent teachers meet weakly-supervised audio-visual event parser. Proceedings of the 37th International Conference on Neural Information Processing Systems, 73633-73651. DOI: 10.5555/3666122.3669343
          • (2023) Achieving cross modal generalization with multimodal unified representation. Proceedings of the 37th International Conference on Neural Information Processing Systems, 63529-63541. DOI: 10.5555/3666122.3668896
          • (2023) Revisit weakly-supervised audio-visual video parsing from the language perspective. Proceedings of the 37th International Conference on Neural Information Processing Systems, 40610-40622. DOI: 10.5555/3666122.3667889
