Linear Alignment of Vision-language Models for Image Captioning

Paischer, Fabian; Hofmarcher, Markus; Hochreiter, Sepp; Adler, Thomas

arXiv:2307.05591 (cs)

[Submitted on 10 Jul 2023 (v1), last revised 6 Feb 2024 (this version, v3)]

Abstract: Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIP score along with our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
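
To make the idea of fitting a linear mapping "via a closed-form solution" concrete, here is a minimal sketch. The abstract does not say which closed-form estimator the authors use, so this example assumes ridge-regularized ordinary least squares, and it uses synthetic random vectors in place of real CLIP image and text embeddings. The names fit_linear_map, cosine_similarity, ridge, and true_map are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_linear_map(image_emb, text_emb, ridge=1e-6):
    """Solve min_W ||image_emb @ W - text_emb||_F^2 + ridge * ||W||_F^2.

    Assumption: ridge-regularized least squares stands in for the
    paper's (unspecified here) closed-form solution.
    """
    d = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    # Normal equations: (X^T X + ridge * I) W = X^T Y, no gradients needed.
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Synthetic stand-ins for paired image/text embeddings (not real CLIP outputs).
rng = np.random.default_rng(0)
n, d = 200, 16
image_emb = rng.normal(size=(n, d))
true_map = rng.normal(size=(d, d))  # hypothetical ground-truth relation
text_emb = image_emb @ true_map + 0.01 * rng.normal(size=(n, d))

W = fit_linear_map(image_emb, text_emb)

# Map images into the text space and retrieve the closest caption embedding.
aligned = image_emb @ W
sims = cosine_similarity(aligned, text_emb)
retrieval_acc = float((sims.argmax(axis=1) == np.arange(n)).mean())
print(f"top-1 retrieval accuracy on synthetic pairs: {retrieval_acc:.2f}")
```

The fit is a single linear solve over a d-by-d system, which illustrates why such a mapping can be computed without backpropagating through a large model. How the mapped embeddings are then turned into captions is specific to ReCap and is not reproduced here.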

Comments:	8 pages (+ references and appendix)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2307.05591 [cs.CV]
	(or arXiv:2307.05591v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.05591

Submission history

From: Fabian Paischer
[v1] Mon, 10 Jul 2023 17:59:21 UTC (2,286 KB)
[v2] Mon, 5 Feb 2024 09:10:15 UTC (11,476 KB)
[v3] Tue, 6 Feb 2024 09:33:48 UTC (11,478 KB)
