Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Radenovic, Filip; Dubey, Abhimanyu; Kadian, Abhishek; Mihaylov, Todor; Vandenhende, Simon; Patel, Yash; Wen, Yi; Ramanathan, Vignesh; Mahajan, Dhruv

arXiv:2301.02280 (cs)

[Submitted on 5 Jan 2023 (v1), last revised 29 Mar 2023 (this version, v2)]

Abstract: Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size while improving performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation that leverages strong unimodal representations for contrastive training without increasing training complexity, while outperforming prior work. Finally, we modify the traditional contrastive alignment objective and propose an importance-sampling approach that up-samples the importance of hard negatives without adding complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at this https URL.

Comments:	CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2301.02280 [cs.CV]
	(or arXiv:2301.02280v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2301.02280

Submission history

From: Filip Radenovic
[v1] Thu, 5 Jan 2023 19:48:01 UTC (2,718 KB)
[v2] Wed, 29 Mar 2023 19:05:14 UTC (2,712 KB)
