PromptCap: Prompt-Guided Task-Aware Image Captioning

Hu, Yushi; Hua, Hang; Yang, Zhengyuan; Shi, Weijia; Smith, Noah A; Luo, Jiebo

Computer Science > Computer Vision and Pattern Recognition

arXiv:2211.09699 (cs)

[Submitted on 15 Nov 2022 (v1), last revised 17 Aug 2023 (this version, v4)]

Abstract: Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable an LM to understand images, prior work uses a captioning model to convert images into text. However, when an image is summarized in a single caption sentence, which visual entities to describe is often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captioning models, PromptCap takes a natural-language prompt to control which visual entities are described in the generated caption. The prompt contains a question that the caption should help answer. To avoid extra annotation, PromptCap is trained on examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.
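
The caption-then-answer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Captioner` and `LanguageModel` interfaces, the function names, and both prompt templates are hypothetical stand-ins chosen for readability, and the real model and LM calls are replaced by plain callables.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
Captioner = Callable[[str, str], str]      # (image_path, prompt) -> caption
LanguageModel = Callable[[str], str]       # prompt -> text completion
Example = Tuple[str, str, str]             # (caption, question, answer)


def build_caption_prompt(question: str) -> str:
    """Wrap the question in an instruction for the captioner.

    The wording is illustrative only; the exact prompt format is an assumption.
    """
    return f"Describe the image so that this question can be answered: {question}"


def build_qa_prompt(examples: Sequence[Example], caption: str, question: str) -> str:
    """Build a few-shot text prompt in which the caption stands in for the image."""
    blocks = [f"Context: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in examples]
    blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def answer_question(
    image_path: str,
    question: str,
    captioner: Captioner,
    lm: LanguageModel,
    examples: Sequence[Example] = (),
) -> str:
    """Caption the image with the question in the prompt, then query the LM."""
    caption = captioner(image_path, build_caption_prompt(question))
    return lm(build_qa_prompt(examples, caption, question)).strip()


if __name__ == "__main__":
    # Stub models so the sketch runs without any external service.
    def stub_captioner(image_path: str, prompt: str) -> str:
        return "a man riding a red longboard down a street"

    def stub_lm(prompt: str) -> str:
        return " longboard"

    print(answer_question("photo.jpg", "What is he riding?", stub_captioner, stub_lm))
```

Passing the models in as callables keeps the LM a black box, in line with the abstract's framing: only the text of the caption and the question crosses the boundary between the two models.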

Comments:	Accepted to ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2211.09699 [cs.CV]
	(or arXiv:2211.09699v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.09699

Submission history

From: Yushi Hu
[v1] Tue, 15 Nov 2022 19:07:53 UTC (15,510 KB)
[v2] Tue, 21 Mar 2023 12:10:18 UTC (5,838 KB)
[v3] Tue, 28 Mar 2023 11:14:23 UTC (5,838 KB)
[v4] Thu, 17 Aug 2023 21:43:24 UTC (6,155 KB)