Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

Cheng, Jiacheng; Shin, Hijung Valentina; Vasconcelos, Nuno; Russell, Bryan; Heilbron, Fabian Caba

arXiv:2405.03190 (cs)

[Submitted on 6 May 2024]

Abstract: In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different results for a pair of paraphrased queries. Such behavior can make the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation of this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
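As an illustration of what "ranking similarity for paraphrased queries" could mean in practice, the sketch below ranks a small image gallery by cosine similarity for two query embeddings and compares the two top-k result sets with Jaccard overlap. This is only an assumed, simplified measure: the abstract does not specify the metric used in the paper, and the function names (`cosine`, `top_k`, `top_k_jaccard`) and the toy 2-D embeddings are hypothetical stand-ins for the outputs of a real text tower and image tower.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_emb, image_embs, k):
    """Indices of the k gallery images most similar to the query."""
    scores = [cosine(query_emb, e) for e in image_embs]
    order = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_jaccard(query_a, query_b, image_embs, k):
    """Jaccard overlap of the top-k result sets of two queries (assumed metric)."""
    a = set(top_k(query_a, image_embs, k))
    b = set(top_k(query_b, image_embs, k))
    return len(a & b) / len(a | b)


# Toy gallery and query embeddings (hypothetical values, not from the paper).
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]        # e.g. embedding of an original description
paraphrase = [0.9, 0.2]   # a paraphrase embedded close to the original
unrelated = [0.0, 1.0]    # a query embedded far from the original

print(top_k_jaccard(query, paraphrase, gallery, k=2))
print(top_k_jaccard(query, unrelated, gallery, k=2))
```

Under this assumed measure, a text tower that embeds paraphrases close together yields a high overlap, while one that scatters them yields a low overlap even when each individual retrieval looks reasonable.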

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.03190 [cs.CV]
	(or arXiv:2405.03190v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.03190

Submission history

From: Jiacheng Cheng
[v1] Mon, 6 May 2024 06:30:17 UTC (5,018 KB)
