DOI: 10.1145/3474085.3481541

Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba

Published: 17 October 2021

Abstract

Video has grown to be one of the largest mediums on the Internet. E-commerce platforms like Alibaba need to process millions of videos across multiple modalities (e.g., visual, audio, image, and text) and on a variety of tasks (e.g., retrieval, tagging, and summarization) every day. In this work, we aim to develop a once-and-for-all pretraining technique for diverse modalities and downstream tasks. To achieve this, we make the following contributions: (1) We propose a self-supervised multi-modal co-training framework. It takes cross-modal pseudo-label consistency as the supervision and can jointly learn representations of multiple modalities. (2) We introduce several novel techniques (e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal convolution, and parallel data transmission and processing) to optimize the training process, making stable billion-scale training feasible. (3) We construct a large-scale multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework on it. The training takes only 4.6 days on an in-house 256-GPU cluster and simultaneously produces pretrained video, audio, image, motion, and text networks. (4) Finetuning from our pretrained models, we obtain significant performance gains and faster convergence on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned representations on public datasets. Despite the domain gap between our commodity-centric pretraining data and the action-centric evaluation data, we show superior results against state-of-the-art methods.
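The paper itself contains the exact training objective; as a rough illustration of what "cross-modal pseudo-label consistency" can look like in practice, the PyTorch sketch below clusters each modality's features into pseudo-labels and trains the other modality's classifier head to predict them. Everything in the sketch is an assumption made for illustration: the branch architectures, feature and cluster sizes, and the plain k-means step are placeholders, and the paper's coarse-to-fine clustering, sliding-window subset sampling, and billion-scale engineering are not reproduced here.

# Minimal sketch of cross-modal pseudo-label co-training (illustrative only).
# Encoders, dimensions, and the k-means step are placeholder choices, not the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kmeans_pseudo_labels(feats: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Assign each feature vector to one of k clusters with plain k-means."""
    centers = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centers).argmin(dim=1)   # (N,)
        for c in range(k):
            members = feats[assign == c]
            if members.numel() > 0:
                centers[c] = members.mean(dim=0)
    return assign


class ModalityBranch(nn.Module):
    """Toy encoder + classifier head for one modality (e.g., video or audio)."""
    def __init__(self, in_dim: int, feat_dim: int = 128, num_clusters: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.head = nn.Linear(feat_dim, num_clusters)

    def forward(self, x):
        z = F.normalize(self.encoder(x), dim=1)
        return z, self.head(z)


def co_training_step(video_x, audio_x, video_branch, audio_branch, k=256):
    """One step: each modality predicts pseudo-labels derived from the other
    modality's clustered features (cross-modal pseudo-label consistency)."""
    with torch.no_grad():
        zv, _ = video_branch(video_x)
        za, _ = audio_branch(audio_x)
        video_targets = kmeans_pseudo_labels(zv, k)   # supervises the audio branch
        audio_targets = kmeans_pseudo_labels(za, k)   # supervises the video branch
    _, video_logits = video_branch(video_x)
    _, audio_logits = audio_branch(audio_x)
    return (F.cross_entropy(video_logits, audio_targets) +
            F.cross_entropy(audio_logits, video_targets))


if __name__ == "__main__":
    video_branch = ModalityBranch(in_dim=512)
    audio_branch = ModalityBranch(in_dim=128)
    loss = co_training_step(torch.randn(1024, 512), torch.randn(1024, 128),
                            video_branch, audio_branch)
    loss.backward()
    print(f"toy co-training loss: {loss.item():.3f}")

The cross-entropy terms encourage agreement between modalities: if the audio clustering groups two clips together, the video branch is pushed to place them in the same pseudo-class, and vice versa, so both encoders converge toward a shared notion of content.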


Cited By

  • (2023) Boosting Generic Visual-Linguistic Representation With Dynamic Contexts. IEEE Transactions on Multimedia, 25, 8445-8457. DOI: 10.1109/TMM.2023.3237164. Online publication date: 1-Jan-2023.
  • (2023) Self-Supervised Learning for Recommender Systems: A Survey. IEEE Transactions on Knowledge and Data Engineering, 36(1), 335-355. DOI: 10.1109/TKDE.2023.3282907. Online publication date: 5-Jun-2023.
  • (2023) FALCoN. Information Processing and Management, 60(4). DOI: 10.1016/j.ipm.2023.103381. Online publication date: 1-Jul-2023.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Author Tags

  1. co-training
  2. multi-modal
  3. once and for all
  4. self-supervised learning

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 02 Sep 2024
