DOI: 10.1145/3581783.3611992
research-article
Open access

Cross-Lingual Transfer of Large Language Model by Visually-Derived Supervision Toward Low-Resource Languages

Published: 27 October 2023

Abstract

Recent progress in vision-and-language research has shown that visual supervision improves the performance of large language models (LLMs) in various natural language processing (NLP) tasks. In particular, the Vokenization approach [65] initiated a new way of incorporating visual information into LLM training, demonstrating the potential of visual supervision for NLP tasks in a monolingual (i.e., English) setting. Given the effectiveness of visual information in human communication among people who speak different languages, we tackle an ambitious question in this paper: can visual supervision contribute to cross-lingual transfer learning from a high-resource language to low-resource languages in NLP tasks? To study this hypothesis, we build a cross-lingual Vokenization model and train a cross-lingual LLM on three languages, English, Urdu, and Swahili, of which the last two are considered low-resource languages. The experimental results demonstrate that our visually-supervised cross-lingual transfer learning method significantly improves LLM performance on multiple cross-lingual NLP tasks, such as XNLI, NER, and TyDiQA, for low-resource languages. We also demonstrate, both qualitatively and quantitatively, that the benefit of our approach increases as the linguistic distance between the low- and high-resource languages grows.
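
To make the visually-derived supervision concrete, the sketch below is a minimal illustration, under our own assumptions rather than the authors' released code, of how a Vokenization-style objective [65] can be attached to masked language model training: each token additionally predicts the index of a retrieved image (its "voken"), and this voken-classification loss is added to the usual masked-LM loss. All module names, layer sizes, and the loss weight alpha are hypothetical; in the actual approach, voken labels come from a token-image retrieval model, and the cross-lingual extension supplies such labels for low-resource-language tokens as well.

# Minimal PyTorch sketch of Vokenization-style visual supervision (illustrative only).
import torch
import torch.nn as nn

class VisuallySupervisedMLM(nn.Module):
    def __init__(self, vocab_size=30000, num_vokens=50000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(hidden, vocab_size)    # standard masked-LM head
        self.voken_head = nn.Linear(hidden, num_vokens)  # auxiliary visual (voken) head

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.voken_head(h)

def joint_loss(mlm_logits, voken_logits, mlm_labels, voken_labels, alpha=1.0):
    # Positions labeled -100 are ignored, as in common masked-LM setups.
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_mlm = ce(mlm_logits.flatten(0, 1), mlm_labels.flatten())
    loss_voken = ce(voken_logits.flatten(0, 1), voken_labels.flatten())
    return loss_mlm + alpha * loss_voken

# Toy usage with random ids standing in for real tokens and retrieved voken indices.
model = VisuallySupervisedMLM()
tokens = torch.randint(0, 30000, (2, 8))
mlm_labels = torch.randint(0, 30000, (2, 8))
voken_labels = torch.randint(0, 50000, (2, 8))
mlm_logits, voken_logits = model(tokens)
joint_loss(mlm_logits, voken_logits, mlm_labels, voken_labels).backward()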

References

[1]
Gerry TM Altmann and Yuki Kamide. 2004. Now you see it, now you don't: mediating the mapping between language and the visual world. In The interface of language, vision, and action: eye movements and the visual world. 347--368.
[2]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450 [stat.ML]
[3]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR. http://arxiv.org/abs/1409.0473
[4]
Paul Bloom. 2000. How children learn the meanings of words. (2000).
[5]
Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, and Patrick Gallinari. 2019. Incorporating Visual Semantics into Sentence Representations within a Grounded Space. In EMNLP-IJCNLP. 696--707. https://doi.org/10.18653/v1/D19-1064
[6]
Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, and Lucia Specia. 2021. Cross-lingual Visual Pretraining for Multimodal Machine Translation. In EACL. 1317--1324. https://doi.org/10.18653/v1/2021.eacl-main.112
[7]
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. 2023. PaLI: A Jointly-Scaled Multilingual Language-Image Model. In ICLR. https://openreview.net/forum?id=mWVoBz4W0u
[8]
Gordon Christie, Ankit Laddha, Aishwarya Agrawal, Stanislaw Antol, Yash Goyal, Kevin Kochersberger, and Dhruv Batra. 2016. Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes. In EMNLP. 1493--1503. https://doi.org/10.18653/v1/D16-1156
[9]
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL 8 (2020), 454--470. https://doi.org/10.1162/tacl_a_00317
[10]
Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined Visual Representations as Multimodal Embeddings. In AAAI, Vol. 31. https://doi.org/10.1609/aaai.v31i1.11155
[11]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In ACL. 8440--8451. https://doi.org/10.18653/v1/2020.acl-main.747
[12]
Alexis Conneau and Guillaume Lample. 2019. Cross-Lingual Language Model Pretraining. In NeurIPS. https://dl.acm.org/doi/abs/10.5555/3454287.3454921
[13]
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In EMNLP. 2475--2485. https://doi.org/10.18653/v1/D18-1269
[14]
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging Cross-lingual Structure in Pretrained Language Models. In ACL. 6022--6034. https://doi.org/10.18653/v1/2020.acl-main.536
[15]
Banchiamlack Dessalegn and Barbara Landau. 2013. Interaction between language and vision: It's momentary, abstract, and it develops. Cognition 127, 3 (2013), 331--344. https://doi.org/10.1016/j.cognition.2013.02.003
[16]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171--4186. https://doi.org/10.18653/v1/N19-1423
[17]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Multilingual BERT readme document. Retrieved April 1, 2023 from https://github.com/google-research/bert/blob/master/multilingual.md
[18]
Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language. 70--74. https://doi.org/10.18653/v1/W16-3210
[19]
Hongliang Fei, Tan Yu, and Ping Li. 2021. Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. In NAACL. 3644--3650. https://doi.org/10.18653/v1/2021.naacl-main.285
[20]
Edward W Forgy. 1965. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics 21 (1965), 768--769.
[21]
Ruka Funaki and Hideki Nakayama. 2015. Image-Mediated Learning for Zero-Shot Cross-Lingual Document Retrieval. In EMNLP. 585--590. https://doi.org/10.18653/v1/D15-1070
[22]
Susan Goldin-Meadow. 1999. The role of gesture in communication and thinking. Trends in Cognitive Sciences 3, 11 (1999), 419--429. https://doi.org/10.1016/S1364-6613(99)01397-2
[23]
Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. 2006. The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems. In OntoImage 2006 Workshop on Language Resources for Content-based Image Retrieval during LREC 2006 Final Programme.
[24]
Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, and Gang Wang. 2018. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval With Generative Models. In CVPR. https://openaccess.thecvf.com/content_cvpr_2018/html/Gu_Look_Imagine_and_CVPR_2018_paper.html
[25]
Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Qinyu Zhang, and Ji-Rong Wen. 2023. Visually-augmented pretrained language models for NLP tasks without images. In ACL. 14912--14929. https://aclanthology.org/2023.acl-long.833
[26]
Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR. 1735--1742. https://doi.org/10.1109/CVPR.2006.100
[27]
Dan Hendrycks and Kevin Gimpel. 2020. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs.LG]
[28]
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation. In ICML, Vol. 119. 4411--4421. https://proceedings.mlr.press/v119/hu20b.html
[29]
Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann. 2021. Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. In NAACL. 2443--2459. https://doi.org/10.18653/v1/2021.naacl-main.195
[30]
Julia Ive, Pranava Madhyastha, and Lucia Specia. 2019. Distilling Translations with Visual Awareness. In ACL. 6525--6538. https://doi.org/10.18653/v1/P19-1653
[31]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data 7, 3 (2021), 535--547. https://doi.org/10.1109/TBDATA.2019.2921572
[32]
Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2018. Learning Visually Grounded Sentence Representations. In NAACL. 408--418. https://doi.org/10.18653/v1/N18-1038
[33]
Douwe Kiela, Ivan Vulić, and Stephen Clark. 2015. Visual Bilingual Lexicon Induction with Transferred ConvNet Features. In EMNLP. 148--158. https://doi.org/10.18653/v1/D15-1015
[34]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR. http://arxiv.org/abs/1412.6980
[35]
Donald Ervin Knuth. 1998. The Art of Computer Programming. Vol. 3: Sorting and Searching. Addison-Wesley.
[36]
Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush, and Yoav Artzi. 2020. What is Learned in Visually Grounded Neural Syntax Acquisition. In ACL. 2615--2635. https://doi.org/10.18653/v1/2020.acl-main.234
[37]
Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are You Talking About? Text-to-Image Coreference. In CVPR. https://openaccess.thecvf.com/content_cvpr_2014/html/Kong_What_are_You_2014_CVPR_paper.html
[38]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV 123, 1 (May 2017), 32--73. https://doi.org/10.1007/s11263-016-0981-7
[39]
Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining Language and Vision with a Multimodal Skip-gram Model. In NAACL. 153--163. https://doi.org/10.3115/v1/N15-1016
[40]
Jialu Li, Hao Tan, and Mohit Bansal. 2022. CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations. In Findings of NAACL. 633--649. https://doi.org/10.18653/v1/2022.findings-naacl.48
[41]
Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang. 2021. Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions. In NAACL. 5339--5350. https://doi.org/10.18653/v1/2021.naacl-main.420
[42]
Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2021. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. In ACL. 2592--2607. https://doi.org/10.18653/v1/2021.acl-long.202
[43]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV. 740--755. https://doi.org/10.1007/978-3-319-10602-1_48
[44]
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing Transfer Languages for Cross-Lingual Learning. In ACL. 3125--3135. https://doi.org/10.18653/v1/P19-1301
[45]
Xiao Liu, Da Yin, Yansong Feng, and Dongyan Zhao. 2022. Things not Written in Text: Exploring Spatial Commonsense from Visual Signals. In ACL. 2365--2376. https://doi.org/10.18653/v1/2022.acl-long.168
[46]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
[47]
S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129--137. https://doi.org/10.1109/TIT.1982.1056489
[48]
Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In ICLR. https://openreview.net/forum?id=Bkg6RiCqY7
[49]
Yujie Lu, Wanrong Zhu, Xin Wang, Miguel Eckstein, and William Yang Wang. 2022. Imagination-Augmented Natural Language Understanding. In NAACL. 4392--4402. https://doi.org/10.18653/v1/2022.naacl-main.326
[50]
TorchVision maintainers and contributors. 2016. TorchVision: PyTorch's Computer Vision library. https://github.com/pytorch/vision
[51]
Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. 2018. UMAP: Uniform Manifold Approximation and Projection. JOSS 3, 29 (2018), 861. https://doi.org/10.21105/joss.00861
[52]
Masayasu Muraoka, Tetsuya Nasukawa, and Bishwaranjan Bhattacharjee. 2020. Visual Objects As Context: Exploiting Visual Objects for Lexical Entailment. In Findings of EMNLP. 2723--2735. https://doi.org/10.18653/v1/2020.findings-emnlp.246
[53]
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML. 807--814. https://dl.acm.org/doi/10.5555/3104322.3104425
[54]
Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual Name Tagging and Linking for 282 Languages. In ACL. 1946--1958. https://doi.org/10.18653/v1/P17-1178
[55]
Johanne Paradis, Fred Genesee, and Martha B Crago. 2011. Dual Language Development and Disorders: A Handbook on Bilingualism and Second Language Learning. Brookes Publishing Company (2011).
[56]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS. 8024--8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[57]
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT?. In ACL. 4996--5001. https://doi.org/10.18653/v1/P19-1493
[58]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019). https://openai.com/research/better-language-models
[59]
Hyeonggon Ryu, Arda Senocak, In So Kweon, and Joon Son Chung. 2023. Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples. arXiv:2303.17517 [cs.CL] https://arxiv.org/abs/2303.17517
[60]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Schroff_FaceNet_A_Unified_2015_CVPR_paper.html
[61]
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually Grounded Neural Syntax Acquisition. In ACL. 1842--1861. https://doi.org/10.18653/v1/P19-1180
[62]
Richard Sinatra. 1981. Using Visuals to Help the Second Language Learner. The Reading Teacher 34, 5 (1981), 539--546. http://www.jstor.org/stable/20195283
[63]
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning. In SIGIR. 2443--2449. https://doi.org/10.1145/3404835.3463257
[64]
Dídac Surís, Dave Epstein, and Carl Vondrick. 2022. Globetrotter: Connecting Languages by Connecting Images. In CVPR. 16474--16484. https://openaccess.thecvf.com/content/CVPR2022/html/Suris_Globetrotter_Connecting_Languages_by_Connecting_Images_CVPR_2022_paper.html
[65]
Hao Tan and Mohit Bansal. 2020. Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision. In EMNLP. 2066--2080. https://doi.org/10.18653/v1/2020.emnlp-main.162
[66]
Zineng Tang, Jaemin Cho, Hao Tan, and Mohit Bansal. 2021. VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. In NeurIPS, Vol. 34. 24468--24481. https://proceedings.neurips.cc/paper_files/paper/2021/file/ccdf3864e2fa9089f9eca4fc7a48ea0a-Paper.pdf
[67]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS, Vol. 30. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[68]
Ivan Vulić, Douwe Kiela, Stephen Clark, and Marie-Francine Moens. 2016. Multi-Modal Representations for Improved Bilingual Lexicon Learning. In ACL. 188--194. https://doi.org/10.18653/v1/P16-2031
[69]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR. https://openreview.net/forum?id=rJ4km2R5t7
[70]
Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023. Visually-Augmented Language Modeling. In ICLR. https://openreview.net/forum?id=8IN-qLkl215
[71]
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In LREC. 4003--4012. https://aclanthology.org/2020.lrec-1.494
[72]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP: System Demonstrations. 38--45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
[73]
Zixiu Wu, Julia Ive, Josiah Wang, Pranava Madhyastha, and Lucia Specia. 2019. Predicting Actions to Help Predict Translations. In ICML Workshop on The How2 Challenge: New Tasks for Vision and Language. https://srvk.github.io/how2-challenge/assets/authors/1908.01665.pdf
[74]
Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual Entailment: A Novel Task for Fine-Grained Image Understanding. arXiv:1901.06706 [cs.CV]
[75]
Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In CVPR. https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html
[76]
Yue Yang, Wenlin Yao, Hongming Zhang, Xiaoyang Wang, Dong Yu, and Jianshu Chen. 2022. Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination. In EMNLP. 1186--1203. https://aclanthology.org/2022.emnlp-main.78
[77]
Thomas Zenkel, Joern Wuebker, and John DeNero. 2020. End-to-End Neural Word Alignment Outperforms GIZA++. In ACL. 1605--1617. https://doi.org/10.18653/v1/2020.acl-main.146
[78]
Miaoran Zhang, Marius Mosbach, David Adelani, Michael Hedderich, and Dietrich Klakow. 2022. MCSE: Multimodal Contrastive Learning of Sentence Embeddings. In NAACL. 5959--5969. https://doi.org/10.18653/v1/2022.naacl-main.436
[79]
Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, and Hai Zhao. 2020. Neural Machine Translation with Universal Visual Representation. In ICLR. https://openreview.net/forum?id=Byl8hhNYPS
[80]
Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, and Ning Zhang. 2022. Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment. In CVPR. 16485--16494. https://openaccess.thecvf.com/content/CVPR2022/html/Zhou_Unsupervised_Vision-and-Language_Pre-Training_via_Retrieval-Based_Multi-Granular_Alignment_CVPR_2022_paper.html
[81]
Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, and Jingjing Liu. 2021. UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training. In CVPR. 4155--4165. https://openaccess.thecvf.com/content/CVPR2021/html/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.html
[82]
Wanrong Zhu, Xin Eric Wang, An Yan, Miguel Eckstein, and William Yang Wang. 2023. ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation. arXiv:2106.05970 [cs.CL]

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-lingual transfer
  2. large language model training
  3. low-resource languages
  4. visual supervision

Qualifiers

  • Research-article

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
