Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sharifzadeh, Sahand; Kaplanis, Christos; Pathak, Shreya; Kumaran, Dharshan; Ilic, Anastasija; Mitrovic, Jovana; Blundell, Charles; Banino, Andrea

arXiv:2403.07750 (cs)

[Submitted on 12 Mar 2024 (v1), last revised 7 Jun 2024 (this version, v2)]

Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions, which reveal that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
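The data flow described in the abstract (an LLM writes captions, a text-to-image model turns each caption into an image embedding, and the resulting pairs are used for VLM training) can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the authors' implementation: `generate_captions`, `caption_to_image_embedding`, `SyntheticPair`, the `topics` input, and `EMBED_DIM` are all hypothetical names, and both models are replaced by trivial stand-ins so the sketch runs without any external dependencies.

```python
import hashlib
from dataclasses import dataclass
from typing import List

EMBED_DIM = 8  # hypothetical embedding size, chosen only for the demo


@dataclass
class SyntheticPair:
    """One synthetic training example: a caption and its image embedding."""
    caption: str
    image_embedding: List[float]


def generate_captions(topics: List[str]) -> List[str]:
    # Stand-in for the LLM: a real system would prompt a language model here.
    return [f"a photo of a {topic}" for topic in topics]


def caption_to_image_embedding(caption: str) -> List[float]:
    # Stand-in for the text-to-image model operating in embedding space.
    # A hash gives a deterministic pseudo-embedding; no pixels are produced.
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:EMBED_DIM]]


def build_synthetic_dataset(topics: List[str]) -> List[SyntheticPair]:
    # Chain the two stages: captions first, then one embedding per caption.
    return [
        SyntheticPair(caption, caption_to_image_embedding(caption))
        for caption in generate_captions(topics)
    ]


pairs = build_synthetic_dataset(["dog", "red bicycle"])
for pair in pairs:
    print(pair.caption, len(pair.image_embedding))
```

In this sketch the VLM would consume `pairs` directly as (caption, embedding) training examples, which mirrors the abstract's point that images are synthesized in embedding space rather than pixel space.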

Comments:	9 pages, 6 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.07750 [cs.CV]
	(or arXiv:2403.07750v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.07750

Submission history

From: Andrea Banino
[v1] Tue, 12 Mar 2024 15:36:42 UTC (5,027 KB)
[v2] Fri, 7 Jun 2024 12:10:47 UTC (2,248 KB)
