Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Published: 17 October 2021
    Abstract

    The pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, it is still very challenging to pre-train neural models for video and language understanding, especially for Chinese video-language data, for the following reasons. Firstly, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames but ignore other valuable semantic and structural information in video-language content, e.g., sequential order and spatiotemporal relationships. Secondly, there exist conflicts between video-sentence alignment and other proxy tasks. Thirdly, there is a lack of large-scale, high-quality Chinese video-language datasets (e.g., containing 10 million unique videos), which are fundamental to the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named Victor, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, Victor constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture more complex multimodal semantic and structural relationships from different perspectives. Victor is trained on a large-scale Chinese video-language dataset containing over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained Victor model to a series of downstream applications and demonstrate its superior performance compared with state-of-the-art pre-training methods such as VideoBERT and UniVL.
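
    A minimal sketch of the contrastive paradigm the abstract refers to, assuming a generic symmetric InfoNCE objective over paired video/text embeddings in the spirit of CPC [46] and MoCo [14]; the function name, encoder outputs, batch size, and temperature below are illustrative assumptions, not Victor's actual proxy tasks as defined in the paper:

        # Illustrative sketch: a generic video-text InfoNCE contrastive loss.
        # Encoder outputs, batch size, and temperature are assumptions for
        # illustration; Victor's actual proxy tasks are defined in the paper.
        import torch
        import torch.nn.functional as F

        def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
            """Symmetric InfoNCE over a batch of paired video/text embeddings.

            Row i of video_emb and text_emb is a positive pair; every other row
            in the batch acts as an in-batch negative.
            """
            video_emb = F.normalize(video_emb, dim=-1)
            text_emb = F.normalize(text_emb, dim=-1)
            logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
            targets = torch.arange(logits.size(0), device=logits.device)
            loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
            loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
            return (loss_v2t + loss_t2v) / 2

        # Example with random tensors standing in for encoder outputs.
        video_emb = torch.randn(8, 256)
        text_emb = torch.randn(8, 256)
        print(video_text_contrastive_loss(video_emb, text_emb))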

    References

    [1]
    Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arxiv: 1503.01070
    [2]
    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
    [3]
    Xusong Chen, Chenyi Lei, Dong Liu, Guoxin Wang, Haihong Tang, Zheng-Jun Zha, and Houqiang Li. 2021. E-Commerce Storytelling Recommendation Using Attentional Domain-Transfer Network and Adversarial Pre-Training. IEEE Transactions on Multimedia.
    [4]
    Xusong Chen, Dong Liu, Chenyi Lei, Rui Li, Zheng-Jun Zha, and Zhiwei Xiong. 2019. BERT4SessRec: Content-Based Video Relevance Prediction with Bidirectional Encoder Representations from Transformer. In MM. ACM, 2597--2601.
    [5]
    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020b. UNITER: UNiversal Image-TExt Representation Learning. In ECCV.
    [6]
    Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. 2013. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In CVPR. IEEE.
    [7]
    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE.
    [8]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv: 1810.04805
    [9]
    Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal Transformer for Video Retrieval. In ECCV.
    [10]
    Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. In NIPS.
    [11]
    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In ICML.
    [12]
    Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arxiv: 2004.00849
    [13]
    Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, et al. 2021. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. arxiv: 2103.06561
    [14]
    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR. IEEE.
    [15]
    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, et al. 2017. The Kinetics Human Action Video Dataset. arxiv: 1705.06950
    [16]
    Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arxiv: 1412.6980
    [17]
    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arxiv: 1602.07332
    [18]
    Chenyi Lei, Yong Liu, Lingzi Zhang, Guoxin Wang, Haihong Tang, Houqiang Li, and Chunyan Miao. 2021b. SEMI: A Sequential Multi-Modal Information Transfer Network for E-Commerce Micro-Video Recommendations. In KDD. ACM.
    [19]
    Chenyi Lei, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. 2020. Multi-Question Learning for Visual Question Answering. In AAAI.
    [20]
    Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. 2021a. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In CVPR. IEEE.
    [21]
    Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. 2020b. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. In AAAI.
    [22]
    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020a. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In EMNLP.
    [23]
    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020c. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV.
    [24]
    Yehao Li, Yingwei Pan, Ting Yao, Jingwen Chen, and Tao Mei. 2021. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network. In AAAI.
    [25]
    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. 2016. TGIF: A New Dataset and Benchmark on Animated GIF Description. arxiv: 1604.02748
    [26]
    Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, et al. 2021. M6: A Chinese Multimodal Pretrainer. arxiv: 2103.00823
    [27]
    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: Common Objects in Context. arxiv: 1405.0312
    [28]
    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv: 1907.11692
    [29]
    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NIPS.
    [30]
    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arxiv: 2002.06353
    [31]
    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV. IEEE.
    [32]
    Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, and Tao Mei. 2020. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training. arxiv: 2007.02375
    [33]
    Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, and Andrea Vedaldi. 2021. Support-Set Bottlenecks for Video-Text Representation Learning. In ICLR.
    [34]
    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. Learning Transferable Visual Models From Natural Language Supervision. arxiv: 2103.00020
    [35]
    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arxiv: 1910.10683
    [36]
    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arxiv: 2102.12092
    [37]
    Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A Dataset for Movie Description. arxiv: 1501.02530
    [38]
    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In ECCV.
    [39]
    Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.
    [40]
    Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Learning Video Representations using Contrastive Bidirectional Transformer. arxiv: 1906.05743
    [41]
    Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A Joint Model for Video and Language Representation Learning. In ICCV. IEEE.
    [42]
    Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, and Jingjing Liu. 2021. LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. In NAACL.
    [43]
    Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In AAAI.
    [44]
    Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arxiv: 1602.07261
    [45]
    Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP.
    [46]
    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arxiv: 1807.03748
    [47]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NIPS. ACM, 6000--6010.
    [48]
    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.
    [49]
    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arxiv: 1609.08144
    [50]
    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR. IEEE.
    [51]
    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NIPS.
    [52]
    Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2018a. Deep Interest Evolution Network for Click-Through Rate Prediction. In KDD. ACM.
    [53]
    Luowei Zhou, Jingjing Liu, Yu Cheng, Zhe Gan, and Lei Zhang. 2021a. CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning. arxiv: 2104.00285
    [54]
    Luowei Zhou, Chenliang Xu, and Jason J Corso. 2018b. Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI.
    [55]
    Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018c. End-to-End Dense Video Captioning with Masked Transformer. In CVPR. IEEE.
    [56]
    Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, and Jingjing Liu. 2021b. UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training. In CVPR.

        Published In

        MM '21: Proceedings of the 29th ACM International Conference on Multimedia
        October 2021
        5796 pages
        ISBN: 9781450386517
        DOI: 10.1145/3474085
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 17 October 2021

        Author Tags

        1. contrastive learning
        2. multimodal pre-training
        3. video and language analysis

        Qualifiers

        • Research-article

        Conference

        MM '21: ACM Multimedia Conference
        October 20 - 24, 2021
        Virtual Event, China

        Acceptance Rates

        Overall acceptance rate: 995 of 4,171 submissions, 24%
