M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Guo, Qingpei; Xu, Furong; Zhang, Hanxiao; Ren, Wang; Ma, Ziping; Ju, Lin; Wang, Jian; Chen, Jingdong; Yang, Ming

arXiv:2401.15896 (cs)

[Submitted on 29 Jan 2024 (v1), last revised 4 Feb 2024 (this version, v2)]

Abstract: Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at helping multimodal foundation models understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest model, $M^2$-Encoder-10B, achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2401.15896 [cs.CV]
	(or arXiv:2401.15896v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.15896

Submission history

From: Qingpei Guo
[v1] Mon, 29 Jan 2024 05:43:33 UTC (8,831 KB)
[v2] Sun, 4 Feb 2024 04:30:07 UTC (8,795 KB)
