MoDE: CLIP Data Experts via Clustering

Ma, Jiawei; Huang, Po-Yao; Xie, Saining; Li, Shang-Wen; Zettlemoyer, Luke; Chang, Shih-Fu; Yih, Wen-Tau; Xu, Hu


arXiv:2404.16030 (cs)

[Submitted on 24 Apr 2024]



Abstract: The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing of images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, which makes it less sensitive to false-negative noise from other clusters. At inference time, we ensemble the experts' outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification, at lower (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
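
The inference-time ensembling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that the weights come from the correlation between task metadata and cluster conditions. The softmax over dot-product similarities, the `temperature` value, the single center per expert, and every function and variable name below are assumptions made for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def routing_weights(metadata_emb, cluster_centers, temperature=0.1):
    """One weight per data expert (hypothetical rule).

    Assumption: each expert is summarized by one cluster center, and its
    weight grows with the similarity between that center and an embedding
    of the task metadata (e.g. class names).
    """
    sims = [dot(metadata_emb, center) for center in cluster_centers]
    return softmax(sims, temperature)

def ensemble_logits(expert_logits, weights):
    """Weighted sum of per-expert class logits."""
    n_classes = len(expert_logits[0])
    return [
        sum(w * logits[k] for w, logits in zip(weights, expert_logits))
        for k in range(n_classes)
    ]

# Toy data: two experts, three classes, 2-D embeddings (all made up).
cluster_centers = [[1.0, 0.0], [0.0, 1.0]]
metadata_emb = [0.9, 0.1]            # task metadata sits close to expert 0
expert_logits = [[2.0, 0.5, 0.1],    # expert 0 favors class 0
                 [0.2, 0.3, 3.0]]    # expert 1 favors class 2

weights = routing_weights(metadata_emb, cluster_centers)
combined = ensemble_logits(expert_logits, weights)
prediction = max(range(len(combined)), key=combined.__getitem__)
```

In this toy setup the task metadata lies near the first cluster center, so the first expert dominates the ensemble and its preferred class is selected. A new expert could be added by appending one more center and one more logit vector, with no change to the existing experts.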

Comments:	IEEE CVPR 2024 Camera Ready. Code Link: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2404.16030 [cs.CV]
	(or arXiv:2404.16030v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.16030

Submission history

From: Jiawei Ma
[v1] Wed, 24 Apr 2024 17:59:24 UTC (951 KB)
