DOI: 10.1145/3474085.3481541

Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba

Published: 17 October 2021

Abstract

Video has grown to be one of the largest mediums on the Internet. E-commerce platforms like Alibaba need to process millions of videos across multiple modalities (e.g., visual, audio, image, and text) and on a variety of tasks (e.g., retrieval, tagging, and summarization) every day. In this work, we aim to develop a once-and-for-all pretraining technique for diverse modalities and downstream tasks. To achieve this, we make the following contributions: (1) We propose a self-supervised multi-modal co-training framework. It takes cross-modal pseudo-label consistency as the supervision and can jointly learn representations of multiple modalities. (2) We introduce several novel techniques (e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal convolution, and parallel data transmission and processing) to optimize the training process, making stable billion-scale training feasible. (3) We construct a large-scale multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework on it. The training takes only 4.6 days on an in-house 256-GPU cluster and simultaneously produces pretrained video, audio, image, motion, and text networks. (4) Finetuning from our pretrained models, we obtain significant performance gains and faster convergence on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned representations on public datasets. Despite the domain gap between our commodity-centric pretraining data and the action-centric evaluation data, we show superior results against state-of-the-art methods.
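The paper itself contains the exact training objective; as a rough illustration of what "cross-modal pseudo-label consistency" can look like in practice, the PyTorch sketch below clusters each modality's features into pseudo-labels and trains the other modality's classifier head to predict them. Everything in the sketch is an assumption made for illustration: the branch architectures, feature and cluster sizes, and the plain k-means step are placeholders, and the paper's coarse-to-fine clustering, sliding-window subset sampling, and billion-scale engineering are not reproduced here.

# Minimal sketch of cross-modal pseudo-label co-training (illustrative only).
# Encoders, dimensions, and the k-means step are placeholder choices, not the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kmeans_pseudo_labels(feats: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Assign each feature vector to one of k clusters with plain k-means."""
    centers = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centers).argmin(dim=1)   # (N,)
        for c in range(k):
            members = feats[assign == c]
            if members.numel() > 0:
                centers[c] = members.mean(dim=0)
    return assign


class ModalityBranch(nn.Module):
    """Toy encoder + classifier head for one modality (e.g., video or audio)."""
    def __init__(self, in_dim: int, feat_dim: int = 128, num_clusters: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.head = nn.Linear(feat_dim, num_clusters)

    def forward(self, x):
        z = F.normalize(self.encoder(x), dim=1)
        return z, self.head(z)


def co_training_step(video_x, audio_x, video_branch, audio_branch, k=256):
    """One step: each modality predicts pseudo-labels derived from the other
    modality's clustered features (cross-modal pseudo-label consistency)."""
    with torch.no_grad():
        zv, _ = video_branch(video_x)
        za, _ = audio_branch(audio_x)
        video_targets = kmeans_pseudo_labels(zv, k)   # supervises the audio branch
        audio_targets = kmeans_pseudo_labels(za, k)   # supervises the video branch
    _, video_logits = video_branch(video_x)
    _, audio_logits = audio_branch(audio_x)
    return (F.cross_entropy(video_logits, audio_targets) +
            F.cross_entropy(audio_logits, video_targets))


if __name__ == "__main__":
    video_branch = ModalityBranch(in_dim=512)
    audio_branch = ModalityBranch(in_dim=128)
    loss = co_training_step(torch.randn(1024, 512), torch.randn(1024, 128),
                            video_branch, audio_branch)
    loss.backward()
    print(f"toy co-training loss: {loss.item():.3f}")

The cross-entropy terms encourage agreement between modalities: if the audio clustering groups two clips together, the video branch is pushed to place them in the same pseudo-class, and vice versa, so both encoders converge toward a shared notion of content.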


Cited By

  • (2023) Boosting Generic Visual-Linguistic Representation With Dynamic Contexts. IEEE Transactions on Multimedia, 25, 8445-8457. DOI: 10.1109/TMM.2023.3237164. Online publication date: 1-Jan-2023.
  • (2023) Self-Supervised Learning for Recommender Systems: A Survey. IEEE Transactions on Knowledge and Data Engineering, 36(1), 335-355. DOI: 10.1109/TKDE.2023.3282907. Online publication date: 5-Jun-2023.
  • (2023) FALCoN. Information Processing and Management, 60(4). DOI: 10.1016/j.ipm.2023.103381. Online publication date: 1-Jul-2023.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Author Tags

  1. co-training
  2. multi-modal
  3. once and for all
  4. self-supervised learning

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 02 Sep 2024
