DOI: 10.5555/3495724.3496542

Self-supervised learning by cross-modal audio-video clustering

Published: 06 December 2020

Abstract

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
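The mechanism the abstract describes is easy to sketch in code. Below is a minimal, illustrative rendering of the cross-modal pseudo-labeling loop in PyTorch with scikit-learn's k-means, not the authors' implementation: k-means clusters computed from one modality's features serve as classification targets for the other modality's encoder, and the clusters are recomputed before each training pass. The tiny MLP encoders, feature dimensions, cluster count K, and random stand-in features are all assumptions made for the sketch; the paper trains deep video and audio networks on real clips.

```python
# A minimal sketch of cross-modal deep clustering (XDC). NOT the authors'
# code: encoders, dimensions, and hyperparameters are placeholder
# assumptions made purely for illustration.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 16  # number of k-means pseudo-label clusters (an assumed value)

# Placeholder encoders; the final Linear acts as a K-way classifier head.
video_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, K))
audio_enc = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, K))
opt = torch.optim.SGD(
    list(video_enc.parameters()) + list(audio_enc.parameters()), lr=0.01
)
ce = nn.CrossEntropyLoss()

def pseudo_labels(encoder, feats):
    """Cluster an encoder's pre-classifier features into K pseudo-labels."""
    with torch.no_grad():
        z = encoder[:-1](feats)  # features just before the classifier head
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(z.numpy())
    return torch.as_tensor(labels, dtype=torch.long)

# Toy paired "dataset": one video and one audio feature vector per clip.
video_feats = torch.randn(256, 512)
audio_feats = torch.randn(256, 128)

for epoch in range(5):
    # Cross-modal supervision: audio clusters label the video model,
    # video clusters label the audio model; labels refresh each epoch.
    video_targets = pseudo_labels(audio_enc, audio_feats)
    audio_targets = pseudo_labels(video_enc, video_feats)
    loss = ce(video_enc(video_feats), video_targets) \
         + ce(audio_enc(audio_feats), audio_targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

Recomputing the pseudo-labels as the encoders improve is what makes this a deep-clustering method; with fixed labels it would collapse into ordinary classification against an arbitrary initial partition.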

Supplementary Material

Supplemental material: 3495724.3496542_supp.pdf



Published In

NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems
December 2020
22651 pages
ISBN: 9781713829546

Publisher

Curran Associates Inc.

Red Hook, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited


Article Metrics

  • Total Citations: 0
  • Total Downloads: 36
  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 1

Reflects downloads up to 13 Jan 2025.
