Abstract
In this paper, we propose the use of a new modality with richer information content, namely acoustic images, for audio-visual scene understanding. Each pixel in such an image carries a spectral signature, associated with a specific direction in space and obtained by processing the audio signals coming from an array of microphones. By coupling such an array with a video camera, we obtain acoustic images spatially and temporally aligned with video frames. This constitutes a powerful source of self-supervision, which the proposed learning pipeline can exploit without resorting to expensive data annotations. However, since 2D planar arrays are cumbersome and not as widespread as ordinary microphones, we propose to distill the richer information content of acoustic images, through a self-supervised learning scheme, into more powerful audio and visual feature representations. The learnt representations can then be employed for downstream tasks such as classification and cross-modal retrieval, without the need for a microphone array. To validate this, we introduce a novel multimodal dataset consisting of RGB videos, raw audio signals, and acoustic images, aligned in space and synchronized in time. Experimental results demonstrate the validity of our hypothesis and the effectiveness of the proposed pipeline, also when tested on tasks and datasets different from those used for training.
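To make the notion of an acoustic image and the teacher-student distillation idea more concrete, the sketch below treats an acoustic image as a tensor holding one spectrum per spatial direction, encodes it with a frozen "teacher" network, and regresses a single-microphone "student" onto the teacher features. This is a minimal PyTorch illustration under assumed shapes and architectures: the encoder definitions, tensor sizes (FREQ_BINS, H, W), and the MSE distillation loss are illustrative assumptions, not the networks or objectives described in the paper.

# Minimal sketch (illustrative, not the authors' exact pipeline): distilling
# an acoustic-image "teacher" into a single-microphone audio "student".
import torch
import torch.nn as nn

# Assumed acoustic-image resolution: one FREQ_BINS-dim spectral signature per pixel.
FREQ_BINS, H, W = 512, 36, 48

class AcousticImageEncoder(nn.Module):
    """Teacher: encodes a (FREQ_BINS, H, W) acoustic image into a D-dim embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(FREQ_BINS, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class AudioEncoder(nn.Module):
    """Student: encodes a single-mic spectrogram (1, T, F) into the same space."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

teacher, student = AcousticImageEncoder(), AudioEncoder()
acoustic_img = torch.randn(8, FREQ_BINS, H, W)   # batch of acoustic images
spectrogram = torch.randn(8, 1, 200, 257)        # time-aligned single-mic spectrograms

# The spatio-temporally aligned acoustic-image features act as self-supervision:
# no manual labels are required, only the teacher's output.
with torch.no_grad():
    target = teacher(acoustic_img)
loss = nn.functional.mse_loss(student(spectrogram), target)
loss.backward()

In practice, a contrastive or triplet-style objective over time-aligned audio, video, and acoustic-image embeddings could replace the plain regression above; the MSE is used here only to keep the sketch short.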
Electronic supplementary material
Below are the links to the electronic supplementary material.
Supplementary material 2 (mp4 214 KB)
Supplementary material 3 (mp4 122 KB)
Supplementary material 4 (mp4 447 KB)
Supplementary material 5 (mp4 461 KB)
Supplementary material 6 (mp4 460 KB)
Supplementary material 7 (mp4 344 KB)
Supplementary material 8 (mp4 306 KB)
Supplementary material 9 (mp4 307 KB)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sanguineti, V., Morerio, P., Pozzetti, N., Greco, D., Cristani, M., Murino, V. (2020). Leveraging Acoustic Images for Effective Self-supervised Audio Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol 12367. Springer, Cham. https://doi.org/10.1007/978-3-030-58542-6_8
DOI: https://doi.org/10.1007/978-3-030-58542-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58541-9
Online ISBN: 978-3-030-58542-6