TEVL: Trilinear Encoder for Video-language Representation Learning

Published: 07 June 2023

Abstract

Pre-training a model on large-scale unlabeled web videos and then fine-tuning it on specific tasks is a canonical approach to learning video and language representations. However, the Automatic Speech Recognition (ASR) transcripts that accompany these videos are transcribed directly from audio, may be inconsistent with the visual content, and can impair the model's language modeling ability. Meanwhile, previous V-L models fuse visual and language features with single- or dual-stream architectures, which are not well suited to this setting. Moreover, traditional V-L research focuses mainly on the interaction between the vision and language modalities and leaves the modeling of relationships within each modality untouched. To address these issues while keeping manual labeling cost low, we add automatically extracted dense captions as supplementary text and propose TEVL (Trilinear Encoder for Video-Language representation learning), a new trilinear video-language interaction framework. TEVL consists of three unimodal encoders, a TRIlinear encOder (TRIO) block, and a temporal Transformer. TRIO is designed to support effective text-vision-text interaction, encouraging inter-modal cooperation while maintaining intra-modal dependencies. We pre-train TEVL on the HowTo100M and TV datasets with four task objectives. Experimental results demonstrate that TEVL learns powerful video-text representations and achieves competitive performance on three downstream tasks: multimodal video captioning, video Question Answering (QA), and video-and-language inference. Implementation code is available at https://github.com/Gufrannn/TEVL.
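
The abstract describes the fusion architecture only at a high level. Below is a minimal PyTorch sketch of one plausible TRIO-style text-vision-text fusion block, written purely for illustration: the module names, hidden size, head count, and the exact attention wiring are assumptions and are not taken from the paper; the authors' actual implementation is in the linked repository (https://github.com/Gufrannn/TEVL).

# NOTE: illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn


class TrioBlock(nn.Module):
    """Hypothetical text-vision-text interaction block (illustrative only)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        # One self-attention layer per modality: ASR subtitles, video frames, dense captions.
        self.self_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True)
            for m in ("sub", "vis", "cap")
        })
        # The visual stream cross-attends to both text streams.
        self.cross_sub = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_cap = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("sub", "vis", "cap")})
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, sub, vis, cap):
        # Intra-modal dependencies: per-modality self-attention with residual connections.
        sub = self.norm["sub"](sub + self.self_attn["sub"](sub, sub, sub)[0])
        vis = self.norm["vis"](vis + self.self_attn["vis"](vis, vis, vis)[0])
        cap = self.norm["cap"](cap + self.self_attn["cap"](cap, cap, cap)[0])
        # Inter-modal cooperation: visual tokens query the subtitle and caption tokens.
        vis = vis + self.cross_sub(vis, sub, sub)[0] + self.cross_cap(vis, cap, cap)[0]
        vis = vis + self.ffn(vis)
        return sub, vis, cap


if __name__ == "__main__":
    sub = torch.randn(2, 32, 768)  # ASR subtitle token features
    vis = torch.randn(2, 16, 768)  # frame/clip features from the visual encoder
    cap = torch.randn(2, 32, 768)  # dense-caption token features
    _, fused_vis, _ = TrioBlock()(sub, vis, cap)
    print(fused_vis.shape)  # torch.Size([2, 16, 768])

In this reading, each modality first runs self-attention to preserve intra-modal dependencies, and the visual stream then cross-attends to both the ASR subtitles and the dense captions; a temporal Transformer (not shown) would aggregate the fused clip-level features downstream.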




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 5s
    October 2023
    280 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3599694
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 June 2023
    Online AM: 24 February 2023
    Accepted: 21 February 2023
    Revised: 13 February 2023
    Received: 19 September 2022
    Published in TOMM Volume 19, Issue 5s


    Author Tags

    1. Self-supervised learning
    2. vision and language (V-L) representation learning
    3. pre-training techniques
    4. trilinear encoder

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province
    • Sichuan Science and Technology Program

    Article Metrics

    • Downloads (Last 12 months)97
    • Downloads (Last 6 weeks)17
    Reflects downloads up to 10 Nov 2024

    Other Metrics

    Citations

    Cited By

    • CrossFormer: Cross-modal Representation Learning via Heterogeneous Graph Transformer. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3688801. Online publication date: 20 Sep 2024.
    • Decoupling Deep Learning for Enhanced Image Recognition Interpretability. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10 (2024), 1–24. DOI: 10.1145/3674837. Online publication date: 10 Jul 2024.
    • Transferable dual multi-granularity semantic excavating for partially relevant video retrieval. Image and Vision Computing 149 (2024), 105168. DOI: 10.1016/j.imavis.2024.105168. Online publication date: Sep 2024.
    • Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. Computer Vision – ECCV 2024 (2024), 77–98. DOI: 10.1007/978-3-031-72989-8_5. Online publication date: 26 Oct 2024.
    • SNP-S3: Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks. IEEE Transactions on Circuits and Systems for Video Technology 34, 4 (2023), 2525–2535. DOI: 10.1109/TCSVT.2023.3303945. Online publication date: 10 Aug 2023.
