Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan Jaemin Cho Elias Stengel-Eskin Mohit Bansal
UNC Chapel Hill
{davidwan, jmincho, esteng, mbansal}@cs.unc.edu

https://contrastive-region-guidance.github.io/

Abstract

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a “visual prompt”, where visual markers such as bounding boxes delineate key image regions; this approach has become popular due to the improvement it provides in tasks requiring region-level information. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model’s prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to $11.1\%$ on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG’s applicability to spatial reasoning, where we obtain up to $10\%$ improvement on the hardest setting of What’sUp, as well as to compositional generalization – improving accuracy by $11.5\%$ and $7.5\%$ on two challenging splits from SugarCrepe – and to image-text alignment for generated images, where we improve by up to $8.4$ AUROC and $6.7$ F1 points on SeeTRUE. 
For cases that do not have reference regions for the prompt, we also show that CRG allows us to re-rank regions proposed by an object detection model in referring expression comprehension and phrase grounding benchmarks like RefCOCO/RefCOCO+/RefCOCOg and Flickr30K Entities, with an average improvement of $3.2\%$ in accuracy when multiple proposals are available. In our analysis, we explore alternative masking strategies for CRG, demonstrate how CRG impacts the model’s probability over relevant text phrases, and evaluate the role of the region guidance strength, empirically validating CRG’s design choices.

1 Introduction

Recent progress in large vision-language models (VLMs) has led to significant advances in tackling multimodal tasks by marrying the language-based reasoning strength of large language models (LLMs) with a visual encoder such as ViT [12]. While large VLMs (e.g., LLaVA [31; 30], BLIP [25], PaLI [9], etc.) have increasingly strong performance on tasks involving a whole image (e.g., answering questions about images [1; 14] or describing them [57; 20]), they often struggle with grounding specific regions, making errors on inter-object spatial relations [19] and compositional reasoning [17]. This inability to ground also prevents models from following “visual prompts” [8; 22; 63; 47; 6; 53], where visual markers like bounding boxes are overlaid onto the image to help the model focus on important regions. Improving models’ visual prompt following ability has the potential to increase performance across a wide variety of VL domains where fine-grained reasoning is key, including visual question answering, image-text alignment, spatial reasoning, and referring expression comprehension.

For example, in Fig. 1 (a), the base VLM struggles with a question that requires spatial reasoning, “Where is the bowl?”, mistakenly answering that the bowl is under the chair (while the bowl is to the right of the chair). The failure can in part be attributed to the model’s prior, which biases the output towards certain answers even in the absence of relevant information; for example, in Fig. 1 (d), we see that even when the objects are blacked out, the model still tends to answer “under”, despite the fact that the question is unanswerable from the masked image, as the region under the chair is masked.

Figure 1: Comparison of different methods for visual grounding. (a) Predicting the answer with a base VLM fails. (b) Even when bounding boxes are added, open-source VLMs produce the wrong answer. (c) The VLM can be trained to recognize overlays like bounding boxes, but this process involves updating the VLM and is costly. (d) Our method, CRG, offers a way to correct predictions without training. The right image has relevant object regions blacked out. Here, the model’s distribution reflects its prior on answering *“under”* and *“left”* even without visual evidence. By factoring this distribution out, we reduce the prior, leading to the correct answer.

Several approaches for correcting these errors and improving fine-grained region grounding have been attempted, but require costly, proprietary models or additional data and training. Yang et al. [53] introduce Set-of-Mark (SoM) prompting, a method that overlays visual markers directly onto an image at test time, helping the model to generate answers that are grounded to specific image regions. However, SoM is tested only on GPT-4V, and our results in Table 1 and Table 5 indicate that SoM with segmentation marks does not transfer well to open-source VLMs. Furthermore, as illustrated in Fig. 1 (b), when the model is given an image overlaid with bounding boxes as marks, it predicts probabilities for the question that are similar to those predicted when using the original model depicted in Fig. 1 (a). SoM’s reliance on GPT-4V leads to a number of limitations: firstly, the model used is financially costly and large, making it impractical for many applications. In fact, the authors only present results on a small subset of the data due to “limited quota and absence of GPT-4V API”. Secondly, the model’s training data and details are unknown, meaning that it may in fact have been finetuned using additional data to supervise grounding. This kind of finetuning has been shown to improve open-source VLMs’ ability to follow visual prompts: Cai et al. [6] synthesize large amounts of fine-tuning data for adding visual markings like arrows and bounding boxes to images to enable open-source VLMs to follow visual prompts. While finetuning is effective, as illustrated in Fig. 1 (c) where the fine-tuned model predicts the correct preposition with high confidence, it incurs a substantial training cost, especially as models grow in size. To address the shortcomings of existing methods (i.e., reliance on costly training or proprietary models), we propose a training-free method that is compatible with a variety of existing models. 
We additionally show that our method is complementary to models finetuned with region grounding supervision, i.e., it can further increase a model’s performance when using visual prompts.

Specifically, we propose Contrastive Region Guidance (CRG), a novel strategy that leverages classifier-free guidance (CFG) [15; 44] to help open-source VLMs focus on specific regions and understand visual markers without additional training. CRG reduces a given model’s bias towards certain answers (e.g., towards “under” in Fig. 1) by factoring out what the model’s response would be without visual evidence from the key region. Intuitively, after factorization, the final answer will be the one that changed the most when the key visual information is removed (i.e., the answer that relies most heavily on the visual information), whereas all answers that do not rely on visual information from the key region will be down-weighted. As shown in Fig. 1 (d), by blacking out the relevant objects, CRG reveals a prior that biases the model towards the incorrect answer, “under”; in other words, the model answers “under” even in the absence of the relevant visual evidence it would need to determine the relationship between objects. CRG then factors this prior out, amending the answer distribution, and providing the correct answer, “right”. Crucially, CRG relies only on either visual prompts or – if such prompts are not provided – access to an object detection module for proposing bounding boxes; such modules are readily available across many domains [59; 32].

We evaluate CRG on a variety of datasets in 5 different domains and on 2 different models, described in more detail below.

• Visual Prompt Following. To measure CRG’s ability to incorporate visual prompts, we test on ViP-Bench [6], which contains 6 diverse task types each requiring granular, region-level reasoning: Object Recognition (Rec), OCR, commonsense knowledge (Know), Math, relations (Rel), and language generation (Lang). For example, the Math subset of ViP-Bench requires solving math equations based on images of multiple equations, where one is highlighted or circled, like in Fig. 2 (a). Here, CRG improves accuracy by $11.1\%$ on average over the LLaVA-1.6-34B model, performing competitively with the strongest baseline of GPT-4V. When applied to ViP-LLaVA, CRG also provides improvements, indicating it is complementary to supervised methods.

• Spatial Reasoning. We also measure the role CRG plays in improving spatial reasoning by highlighting relevant image regions (see Fig. 1); on the hardest setting of the What’sUp spatial reasoning benchmark [19], LLaVA-1.5-13B with CRG outperforms the baselines by up to $8.3\%$, and in fact also surpasses training-based methods relying on large amounts of pre-training with the same model by $15.4\%$. Furthermore, LLaVA-1.6-34B with CRG improves by $10\%$ over a LLaVA-1.6-34B baseline on What’sUp’s hardest setting.

• Compositional Generalization. Furthermore, we show that CRG’s better grounding leads to improvements in visual understanding and reasoning. We find that CRG helps to address a major limitation of current vision-language methods: a poor ability to analyze language compositionally. Models often cannot differentiate between two similar sentences like “a plant on a house” and “a house on a plant” [50; 34; 40]. We show that with CRG, LLaVA-1.6-34B improves its performance on SugarCrepe [17] – a challenging benchmark dataset for compositionality in visual-language tasks – by $11.5\%$ and $7.5\%$ over the model without CRG on SugarCrepe’s two challenging settings, and surpasses the strongest GPT-4V baseline on those settings by $4.7\%$ and $3.6\%$.

• Evaluation on Images from Text-to-Image Generation Models. We show that CRG can also evaluate generated images; when applied to Yarom et al. [55]’s DrawBench, EditBench, and COCO-t2i splits, CRG improves a model’s ability to identify matching image-text pairs by $8.4$ AUROC and $6.7$ F1 points on average.

• Reranking for Referring Expression Comprehension and Phrase Grounding. Because of its granularity, CRG can be used to rerank bounding box proposals from an object detector to find ones relevant to a given text (see Fig. 2 (d) for an example); on the RefCOCO, RefCOCO+, and RefCOCOg [20; 35] referring expression comprehension tasks and the Flickr30K Entities phrase grounding task [38], CRG applied to LLaVA-1.5-13B improves performance by up to $3.2\%$ over a baseline LLaVA-1.5-13B reranker on cases with multiple bounding boxes.

We further conduct a detailed analysis of CRG, first by ablating each of its components. Our findings highlight that CRG’s masking strategy, i.e., blacking out each object separately, proves to be the most effective and outperforms alternative contrastive approaches that vary in granularity of the black-out regions, such as blacking out the entire image or blacking out the objects with segmentation masks. Our analysis also reveals that current models fail to follow the prompts using other popular visual prompting strategies that do not use contrast, such as only overlaying bounding boxes and segmentation masks. Furthermore, we examine the impact of CRG on the probability of grounded text that is aligned with specific regions, confirming that it increases the likelihood of correct text and penalizes incorrect text. This underscores CRG’s precision in enhancing model interpretability. Finally, our experiments demonstrate that the default value of the guidance strength, i.e., how much the model should rely on the contrast, consistently achieves high performance across different tasks, validating the robustness of our configuration.

2 Related Work

Visual prompting for VLMs.

Several recent research directions have studied prompting VLMs by manipulating visual inputs in different ways: (i) incorporating learnable soft tokens in visual inputs for parameter-efficient finetuning [2; 21], (ii) concatenating an image sequence as a demonstration of a new task [5; 4], and (iii) grounding regions by overlaying visual markers (e.g., masks/boxes/circles, etc.) onto visual inputs [54; 58; 47]. Our work falls into the third category, using visual guidance for grounding. Yang et al. [53] present set-of-mark (SoM) prompts, where an image is partitioned into regions with segmentation models and each region is marked with a number marker, which improves visual grounding for GPT-4V [37]. However, in our experiments detailed in Secs. 4.2 and 4.5.1 we confirm past findings [6] that such visual prompts do not work well with public VLMs such as LLaVA. Cai et al. [6] instruction-tune with diverse visual markers so that VLMs can better follow visual prompts from user input. Instead of relying on proprietary models or finetuning, our work elicits visual grounding in VLMs by masking image regions and contrasting model distributions, i.e., with no additional training or data. Moreover, we show that our work is complementary to finetuning methods such as those used by Cai et al. [6], obtaining additional improvements when combined.

Context-guided sampling for autoregressive models.

Several works in different domains have proposed context-guided sampling for autoregressive models to incorporate additional context. Guided models can be thought of as sampling tokens from the logit difference of conditional and unconditional models: $\text{logit}(y|c,x)-\text{logit}(y|x)$ where $x$ is the input, $y$ is the output, and $c$ is the context (see Sec. 3 for additional details). For text generation, Shi et al. [46] extend contrastive decoding [27] by contrasting the logits of conditional and unconditional language models. Classifier-free guidance (CFG) [15] has also been applied in multimodal settings: for autoregressive image generation, Gafni et al. [13] use CFG to incorporate contextual inputs (i.e., text and segmentation map). For image captioning, Kornblith et al. [23] use CFG, contrasting the logits of an image captioner and a language model. Concurrently, Leng et al. [24] and Zhao et al. [62] use CFG to improve the faithfulness of VLMs by adding Gaussian noise to the entire image or by adding object detection results to the text input. While all of these existing methods combining CFG with images manipulate the entire image (via dropping [23] or adding noise [24]), our work, CRG, differs in focusing on fine-grained guidance, explicitly grounding to specific image regions, i.e., operating at the sub-image level.

Biases and lack of grounding in visual models.

CRG’s benefit comes from factoring out the biases present in VL models and tasks, whereby the correct response can be obtained without considering the relevant image regions, or, in some cases, without the image entirely. Such biases have been well documented in past work [60; 14; 10]. Other work has noted that VQA models often focus on non-relevant regions of images even when correctly answering questions and has attempted to regularize models towards focusing on relevant regions [45; 52; 33]. Along these lines, Ying et al. [56] introduce a series of losses for mitigating cases where models are “right for the wrong reasons”, i.e., the answer is right but based on unrelated regions of the image, with some losses using human-drawn bounding boxes. CRG also aims to draw attention to relevant image regions, but does so in a gradient-free manner and can operate using automatically detected bounding boxes.

3 Method

3.1 Background: Visual Prompting for VLMs

In our setting, a vision-language model (VLM) with parameters $\theta$ takes as input an image $I\in\mathbb{R}^{H\times W\times 3}$ and a text $X=[x_{1},...,x_{n}]$ of $n$ tokens, and outputs a text $Y=[y_{1},...,y_{m}]$ with $m$ tokens. When generating the output text $Y$ , we generate tokens autoregressively from a probability distribution conditioned on the input $I$ and $X$ . At time $t$ , the probability of the token $y_{t}$ is:

$$y_{t}\sim p_{\theta}(y_{t}|I,X,y_{<t})\propto\exp\left(\text{logit}_{\theta}(y_{t}|I,X,y_{<t})\right) \tag{1}$$

where $\text{logit}_{\theta}$ is the unnormalized log probability of token $y_{t}$, i.e., before softmax. Recent work [47; 53; 6] has introduced visual prompting methods that augment images by overlaying visual markers (e.g., bounding boxes, masks, and arrows) to highlight specific regions. While past work has found that visual prompting improves visual grounding of GPT-4V [53] or VLMs specifically trained on images with visual prompts [6], we find in our experiments that publicly available base VLMs usually ignore such visual prompts (Tables 1, 2, and 3).

3.2 Contrastive Region Guidance (CRG) for Visual Grounding in VLMs

We introduce Contrastive Region Guidance (CRG), a training-free visual grounding method that guides any VLM to focus on specific regions in an image by extending classifier-free guidance (CFG) [15]. Inspired by work on visual feature importance [42; 56], we measure the importance of an image region by how a VLM’s output distribution changes when the region is removed, and use the contrast between distributions to guide a VLM to focus on a particular region, as illustrated on the left side of Fig. 2. Concretely, we sample outputs from a probability distribution derived by contrasting an image $I$ with another image $I^{\prime}=\texttt{mask}(I,b)$ , where the pixels in a specific region $b$ are masked out with black pixels:

$$\begin{aligned}
y_{t} &\propto p_{\theta}(y_{t}|I,X,y_{<t})\left(\frac{p_{\theta}(y_{t}|I,X,y_{<t})}{p_{\theta}(y_{t}|\texttt{mask}(I,b),X,y_{<t})}\right)^{\alpha} && \text{(2)}\\
&\sim\texttt{softmax}\left[(1+\alpha)\cdot\text{logit}_{\theta}(y_{t}|I,X,y_{<t})-\alpha\cdot\text{logit}_{\theta}(y_{t}|\texttt{mask}(I,b),X,y_{<t})\right]. && \text{(3)}
\end{aligned}$$

Here, $\alpha$ is the region guidance strength parameter that controls the strength of the focus on the region $b$. A larger $\alpha$ amplifies the region guidance; for example, $\alpha=1$ puts a high weight on the region and $\alpha=0$ reduces the equation to standard decoding. Following prior work [46; 36], we use $\alpha=1$ for all settings.
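To make Eq. 3 concrete, the following is a minimal sketch of a single decoding step, not the authors' implementation. Taking the log of Eq. 2 gives $(1+\alpha)\log p_{\theta}(y_{t}|I,\cdot)-\alpha\log p_{\theta}(y_{t}|\texttt{mask}(I,b),\cdot)$, and since log-probabilities differ from logits only by a per-context constant that softmax cancels, the same combination can be applied to logits directly. The three-word vocabulary and the logit values are made up for illustration and loosely mirror the example in Fig. 1; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def crg_distribution(logits_full, logits_masked, alpha=1.0):
    """Eq. 3: softmax[(1 + alpha) * logit(I) - alpha * logit(mask(I, b))]."""
    if len(logits_full) != len(logits_masked):
        raise ValueError("both logit vectors must cover the same vocabulary")
    guided = [(1.0 + alpha) * f - alpha * m
              for f, m in zip(logits_full, logits_masked)]
    return softmax(guided)

# Toy vocabulary and made-up logits (illustrative only, not from the paper).
vocab = ["under", "right", "left"]
logits_full = [2.0, 1.8, 0.1]    # original image: "under" narrowly wins
logits_masked = [2.0, 0.2, 0.1]  # objects blacked out: "under" stays high (the prior)

base = softmax(logits_full)
guided = crg_distribution(logits_full, logits_masked, alpha=1.0)
base_answer = vocab[base.index(max(base))]
guided_answer = vocab[guided.index(max(guided))]
print(base_answer, guided_answer)
```

In this toy case the logit for “under” is unchanged by masking, so the contrast leaves it where it was, while “right” depends on the masked region and is boosted. Setting `alpha=0.0` recovers the ordinary softmax over `logits_full`.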

As shown in Fig. 2, CRG is applicable to many VL tasks, including image-conditional text generation as well as image-text and region-text alignment tasks. When a region of interest is given as in Fig. 2 (a), we can guide VLMs to focus on that region while generating answers. When no specific region is given as in Fig. 2 (b) and (c), we can use region proposals from text-conditional object detectors such as GroundingDINO [32] and guide VLMs to focus on proposed regions. We do so by taking all noun phrases (“dog” and “table” in Fig. 2 (b), and “dog” and “car” in Fig. 2 (c)), finding their corresponding bounding boxes, and blacking out the objects in the image. We then either generate the answers in VQA (e.g., Fig. 2 (b)) or forced-decode the sentence and retrieve its probability (e.g., Fig. 2 (c)). For cases where there are multiple bounding box candidates for a noun phrase, we apply the following re-ranking strategy. For each bounding box proposal, we black out the corresponding image region and calculate the score of the given text or phrase with Eq. 3. We select the region that achieves the highest contrast when blacked out. As illustrated in Fig. 2 (d), removing the dog on the left causes the most drastic change in the probability of the sentence “a dog with mouth closed”, thereby indicating a strong association with the described text.
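The masking operation $\texttt{mask}(I,b)$ and the re-ranking strategy can be sketched as follows. This is an illustration under stated assumptions rather than the released code: images are nested lists of RGB tuples, boxes are `(x0, y0, x1, y1)` with exclusive upper bounds, and `log_prob` is a hypothetical callable standing in for the VLM's forced-decoding log-probability of a text. The sketch applies the contrast of Eq. 3 at the sequence level, which is a simplification of the token-level formulation.

```python
def mask_region(image, box):
    """Return a copy of `image` with pixels inside `box` set to black.

    `image` is a list of rows of (r, g, b) tuples; `box` is (x0, y0, x1, y1)
    in pixel coordinates, with x1 and y1 exclusive (an assumed convention).
    """
    x0, y0, x1, y1 = box
    return [
        [(0, 0, 0) if (x0 <= x < x1 and y0 <= y < y1) else pixel
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]

def rerank_boxes(image, text, boxes, log_prob, alpha=1.0):
    """Sort candidate boxes by contrast score, highest first.

    `log_prob(image, text)` is a hypothetical callable returning the model's
    log-probability of `text` given `image`.
    """
    full = log_prob(image, text)
    scored = []
    for box in boxes:
        masked = log_prob(mask_region(image, box), text)
        scored.append(((1.0 + alpha) * full - alpha * masked, box))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy stand-in for a VLM: the text is more likely the more white pixels remain.
def toy_log_prob(image, text):
    white = sum(pixel == (255, 255, 255) for row in image for pixel in row)
    return -1.0 / (1 + white)

# 4x2 toy image whose left half is white (the "object") and right half is dark.
image = [[(255, 255, 255) if x < 2 else (10, 10, 10) for x in range(4)]
         for _ in range(2)]
boxes = [(0, 0, 2, 2), (2, 0, 4, 2)]
ranking = rerank_boxes(image, "a dog with mouth closed", boxes, toy_log_prob)
best_box = ranking[0][1]
print(best_box)
```

Because the first term of the score is the same for every candidate, the top-ranked box is simply the one whose removal lowers the text's log-probability the most, which matches the “highest contrast” criterion described above.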

4 Experiments and Results

We demonstrate the usefulness of CRG in diverse vision-language tasks. First, we show that CRG can unlock the visual prompt following capabilities of VLMs on ViP-Bench [6] (Sec. 4.2). Next, we demonstrate the effectiveness of CRG for improving image-text alignment in VLMs on three datasets (Sec. 4.3): What’sUp [19], which measures spatial understanding; SugarCrepe [17], which measures compositional generalization; and SeeTRUE [55], whose images are from text-to-image generation models. Moreover, we show that CRG can also be used as a re-ranker for visual grounding tasks on four datasets: RefCOCO, RefCOCO+ [20], RefCOCOg [35], and Flickr30K Entities [38] (Sec. 4.4). Lastly, we provide three ablation studies in Sec. 4.5, comparing different methods for contrasting regions, evaluating the probability shifts for correct and incorrect texts, and analyzing the impact of region guidance strength $\alpha$.

4.1 Experimental Setup

We use the LLaVA-1.5-13B [29] and LLaVA-1.6-34B [30] models. For visual prompt following with ViP-Bench, we use the provided visual prompts in the dataset; for other tasks where no region is given, we first extract noun phrases with spaCy [16], then generate region proposals for each phrase with GroundingDINO-B [32], and filter the resulting bounding boxes to those with scores above a threshold of 0.3. For text generation, we opt for greedy decoding to ensure reproducibility. We use CRG strength $\alpha=1$ for all settings. We refer the readers to Appendix A for more details on the experimental setup and dataset statistics.
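The proposal-filtering step can be sketched as below. Only the threshold of 0.3 comes from the setup above; the dictionary layout of a proposal (`phrase`, `box`, `score`) and the example values are assumptions made for illustration and do not reflect the actual output format of spaCy or GroundingDINO.

```python
def filter_proposals(proposals, threshold=0.3):
    """Group detector proposals by phrase, keeping boxes scoring above `threshold`."""
    kept = {}
    for proposal in proposals:
        if proposal["score"] > threshold:
            kept.setdefault(proposal["phrase"], []).append(proposal["box"])
    return kept

# Hypothetical detector output for the noun phrases "dog" and "table".
proposals = [
    {"phrase": "dog", "box": (12, 40, 96, 130), "score": 0.81},
    {"phrase": "dog", "box": (150, 38, 230, 128), "score": 0.42},
    {"phrase": "table", "box": (0, 100, 256, 200), "score": 0.27},
]
regions = filter_proposals(proposals)
print(regions)
```

In this sketch a phrase with no surviving box (here “table”) is simply absent from the result, so nothing would be masked for it; a phrase with several surviving boxes (here “dog”) is the case where re-ranking applies.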

4.2 Evaluation on Visual Prompt Following

ViP-Bench [6] comprises 303 image-question pairs specifically designed to comprehensively evaluate visual prompt following capabilities, with six categories: Object Recognition (Rec), OCR (OCR), Knowledge (Know), Math (Math), Object Relationship Reasoning (Rel), and Language Generation (Lang). We report the performance on the default split (synthesized visual prompts), which consists of tight bounding boxes. In addition to the baselines provided in the paper, we also apply the Set-of-Mark (SoM) [53] approach to our models. Specifically, we use the reference bounding box to generate the segmentation mask with SAM [22], and subsequently overlay the mask and attach numbers to the image, as described in [53]. To make a fair comparison, we transform the questions asking about the bounding boxes into questions asking about the numbers, which SoM expects. Details can be found in Sec. A.3.

Table 1: ViP-Bench [6] results. * indicates results from [6] and ${\dagger}$ indicates models fine-tuned with visual prompt data. For each model we run, we report the average and standard deviation of 5 runs using ViP-Bench [6]’s GPT-4-based evaluation, and bold the best prompting or guidance strategy.

Model	Rec	OCR	Know	Math	Rel	Lang	Avg
Shikra 7B ${\dagger}$ * [7]	40.2	10.0	28.0	3.5	18.9	20.6	33.7
GPT4ROI 7B ${\dagger}$ * [61]	35.6	16.7	29.7	9.7	32.5	13.8	35.1
Qwen-VL-Chat* [3]	43.0	30.4	40.2	9.7	25.7	28.7	39.2
InstructBLIP-13B* [11]	42.5	12.2	37.5	3.2	33.2	12.5	35.8
GPT-4V* [37]	58.1	48.5	69.5	63.3	82.9	68.1	55.9
LLaVA-1.5-13B [29]	45.9 ${}_{\pm 0.4}$	18.7 ${}_{\pm 0.4}$	46.4 ${}_{\pm 0.9}$	6.5 ${}_{\pm 0.0}$	50.5 ${}_{\pm 2.7}$	40.5 ${}_{\pm 1.9}$	40.2 ${}_{\pm 0.4}$
+ SoM [53]	48.9 ${}_{\pm 0.2}$	14.4 ${}_{\pm 0.1}$	46.0 ${}_{\pm 0.3}$	3.2 ${}_{\pm 0.0}$	34.3 ${}_{\pm 0.0}$	30.9 ${}_{\pm 0.8}$	41.2 ${}_{\pm 0.2}$
+ CRG (Ours)	48.0 ${}_{\pm 0.3}$	20.0 ${}_{\pm 0.5}$	45.1 ${}_{\pm 1.0}$	10.3 ${}_{\pm 1.4}$	46.1 ${}_{\pm 2.0}$	32.6 ${}_{\pm 3.6}$	41.8 ${}_{\pm 0.2}$
ViP-LLaVA-13B ${\dagger}$ [6]	56.9 ${}_{\pm 0.1}$	22.7 ${}_{\pm 1.9}$	54.0 ${}_{\pm 1.1}$	11.7 ${}_{\pm 1.1}$	49.3 ${}_{\pm 1.6}$	53.1 ${}_{\pm 4.2}$	48.6 ${}_{\pm 0.2}$
+ SoM [53]	49.3 ${}_{\pm 0.1}$	19.1 ${}_{\pm 0.1}$	43.8 ${}_{\pm 0.5}$	5.8 ${}_{\pm 0.0}$	41.1 ${}_{\pm 0.0}$	35.1 ${}_{\pm 1.9}$	41.8 ${}_{\pm 0.1}$
+ CRG (Ours)	57.4 ${}_{\pm 0.3}$	20.6 ${}_{\pm 0.5}$	54.0 ${}_{\pm 1.0}$	12.1 ${}_{\pm 0.2}$	49.4 ${}_{\pm 1.8}$	45.0 ${}_{\pm 1.3}$	49.4 ${}_{\pm 0.2}$
LLaVA-1.6-34B [30]	49.1 ${}_{\pm 0.3}$	28.7 ${}_{\pm 0.4}$	48.1 ${}_{\pm 0.4}$	41.1 ${}_{\pm 1.0}$	45.6 ${}_{\pm 0.7}$	49.5 ${}_{\pm 2.9}$	45.0 ${}_{\pm 0.2}$
+ SoM [53]	42.0 ${}_{\pm 0.7}$	41.1 ${}_{\pm 1.3}$	48.5 ${}_{\pm 1.0}$	39.7 ${}_{\pm 2.7}$	31.1 ${}_{\pm 2.4}$	47.0 ${}_{\pm 2.5}$	42.0 ${}_{\pm 0.8}$
+ CRG (Ours)	58.4 ${}_{\pm 0.3}$	47.5 ${}_{\pm 0.5}$	61.1 ${}_{\pm 0.3}$	50.3 ${}_{\pm 1.8}$	61.6 ${}_{\pm 0.9}$	46.5 ${}_{\pm 1.1}$	56.1 ${}_{\pm 0.3}$

CRG unlocks visual prompt following, matching fine-tuned models. We present the results in Table 1. We note that the base model LLaVA-1.5-13B already surpasses several baselines including fine-tuned visual prompting models like Shikra [7] and GPT4ROI [61], as well as other notable VLMs such as Qwen-VL-Chat [3] and InstructBLIP [11]. Nevertheless, applying CRG to the LLaVA-1.5-13B model results in further improvements of $2.1\%$, $1.3\%$, and $3.8\%$ in the Rec, OCR, and Math categories, respectively, and of $1.6\%$ on average. While LLaVA-1.5-13B with CRG lags behind ViP-LLaVA-13B [6], which uses LLaVA-1.5-13B as its backbone but is trained with curated visual prompting data, the performance gap between LLaVA-1.5-13B+CRG and ViP-LLaVA narrows significantly in the OCR and Math categories. This indicates that CRG can help models follow visual prompts by contrasting between an image and a version of it where the visual prompt region is removed.

CRG can also help models fine-tuned with visual prompts. Our findings also indicate that CRG is complementary to fine-tuned models for visual prompting, i.e., ViP-LLaVA, via its contrast between distributions, further improving the performance on categories like Rec, Math, and Rel by $0.5\%$ , $0.4\%$ , and $0.1\%$ , respectively, with an average improvement of $0.8\%$ .

CRG is more helpful to a stronger VLM backbone. The improvement is more pronounced when we apply CRG to the LLaVA-1.6-34B model, achieving on average an $11.1\%$ increase in performance. Despite a $3\%$ decrease in Lang, the improvements in other categories range from $9.2\%$ to $18.8\%$ , surpassing all previous models. Notably, LLaVA-1.6-34B+CRG also surpasses ViP-LLaVA-13B in all categories except Lang, although it was never trained with any visual prompt data. This underscores the efficiency of CRG in scaling up models without the need for additional training.

Set-of-Mark prompting [53] is not effective on LLaVA-based models. Finally, we observe that Set-of-Mark (SoM) generally decreases the performance of LLaVA-based models, indicating that this visual prompting strategy, which works on proprietary models, does not transfer well to the open-source VLMs we study. One potential reason is that SoM requires OCR capability, a domain where LLaVA-based models perform poorly compared to GPT-4V ( $28.7\%$ for LLaVA-1.6-34B without CRG versus $48.5\%$ for GPT-4V). While we observe improved overall performance when SoM is applied to LLaVA-1.5-13B, this is driven solely by increased recognition performance, which improves by $3\%$ . In all other categories, SoM decreases performance, sometimes drastically (e.g., a $16.2\%$ drop on Rel). Similarly, we observe that applying SoM to ViP-LLaVA-13B and LLaVA-1.6-34B decreases accuracy in all categories except OCR and Know for LLaVA-1.6-34B, lowering performance by an average of $6.8\%$ and $3\%$ for ViP-LLaVA-13B and LLaVA-1.6-34B, respectively.

4.3 Evaluation on Image-Text Alignment (Spatial Understanding, Compositionality, Generated Image Evaluation)

The region-text scores CRG produces can be used to measure the alignment between an image and a piece of text. We apply this to answering questions about spatial understanding by scoring answers, to compositional reasoning, where we use CRG’s score to decide between possible descriptions, and to evaluating generated images, where we score the match between an image description and a model-generated image.

4.3.1 Evaluation on Spatial Understanding

What’sUp [19] is a benchmark for assessing the spatial understanding capabilities of VLMs, with 820 images of unambiguous spatial relations between two household objects (e.g., a chair and a bowl, etc.), where the images contain only the two objects in four distinct spatial relations (see Fig. 1 and 2 (b)). As illustrated in Fig. 2 (b), we extract bounding boxes for the two objects. We compare our method with the best-performing baselines, including FLAVA [48], CLIP [39], and GPT-4V [37].

Table 2: Results on spatial understanding (What’sUp) and compositionality (SugarCrepe) benchmark tasks. * indicates results reported in [19; 17].

	What’sUp			SugarCrepe
Model	Indiv.	Pairs	Set of 4	swap-Obj	Swap-Att
CLIP ViT-L-14 [39]	26.1*	1.5*	0.0*	60.1*	62.3*
CLIP RN50x64 [39]	26.2	2.0	0.0	61.8*	66.7*
FLAVA [48]	30.4*	10.9*	0.0*	-	-
ViP-LLaVA-13B [6]	70.9	57.5	21.8	74.3	84.2
GPT-4V [37]	-	-	-	83.1*	90.1*
LLaVA-1.5-13B [29]	73.1	60.6	28.9	78.0	83.9
+ bbox overlay	71.0	57.1	22.0	75.1	83.8
+ CRG (Ours)	76.7	64.2	37.2	84.5	91.3
LLaVA-1.6-34B [30]	86.8	76.8	54.0	76.3	86.2
+ bbox overlay	82.2	69.4	39.7	76.7	85.0
+ CRG (Ours)	87.6	80.3	64.0	87.8	93.7

CRG improves spatial understanding in VLMs. The results, as shown in Table 2, demonstrate that CRG consistently improves accuracy across all settings when applied to LLaVA-1.5-13B, improving accuracy by $3.6\%$ for both the individual and pairs settings, and by $8.3\%$ for the set of 4 setting. Notably, CRG with LLaVA-1.5-13B again outperforms ViP-LLaVA-13B when prompted with the bounding boxes, despite the latter’s extensive additional training. On LLaVA-1.6-34B, CRG also increases accuracy on all settings. For the hardest ‘Set of 4’ setting, which involves accurately linking four prepositions to their corresponding images, CRG improves the accuracy by $10\%$. Interestingly, we see that applying bounding boxes as visual markers on the images (‘+ bbox overlay’) does not improve the performance of either model, indicating that visual prompting is not something the base VLM can already do and thus illustrating the effectiveness of CRG.

4.3.2 Evaluation on Vision-Language Compositionality

The SugarCrepe [17] dataset evaluates the compositional reasoning capability of VLMs, highlighting the fact that they generally struggle in correctly identifying instances when objects or attributes are swapped. Focusing on the subsets swap-Obj and Swap-Att – the two subsets that current models have the most difficulty with [17] – we include best-performing models in [17], including CLIP [39] and GPT-4V [37].

CRG improves compositional generalization of VLMs. As depicted in Table 2, our observations reveal a consistent pattern where CRG enhances the performance of the original models. Applying CRG to LLaVA-1.5-13B results in improvements of $6.5\%$ and $7.4\%$ in the swap-Obj and Swap-Att subsets, respectively, while for LLaVA-1.6-34B it improves the two tasks by $11.5\%$ and $7.5\%$ . Notably, applying CRG to LLaVA-1.5-13B surpasses the performance of GPT-4V by $1.3\%$ on average, indicating CRG’s effectiveness in improving models’ compositional understanding.

4.3.3 Evaluation on Images from Text-to-Image Generation Models

Next, we show how CRG can also be applied for text-to-image scenarios where we have generated images. For this, we employ SeeTRUE [55], a meta-evaluation benchmark to assess the model’s ability to determine whether the given image-text pair is aligned or not. The dataset contains real text and synthetic images, encompassing examples from DrawBench [43], EditBench [51], and COCO [28], containing 1,311, 3,827, and 1,791 image-text pairs, respectively. The authors collected 3 human annotations of binary judgments per example for the three benchmarks. We follow the authors in measuring performance with Area Under the ROC Curve (AUROC) and additionally include F1, which we compute by taking a threshold on the score and labeling instances above the threshold as positive. Since no validation or training data for the three sets are available for threshold tuning, we set the threshold to the average score the model assigns to all examples for each dataset.
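The two metrics can be sketched in a few lines. This is an illustrative implementation, not the evaluation script used for the reported numbers: F1 uses the per-dataset mean score as the threshold, as described above, and AUROC is computed with the standard pairwise definition (the fraction of positive-negative pairs ranked correctly, counting ties as one half). The scores and labels below are made up.

```python
def f1_with_mean_threshold(scores, labels):
    """F1 where an example is predicted positive if its score exceeds the mean score."""
    threshold = sum(scores) / len(scores)
    preds = [s > threshold for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(scores, labels):
    """Pairwise AUROC: probability a positive outscores a negative (ties count 0.5)."""
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up alignment scores and binary human judgments (1 = aligned).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
f1 = f1_with_mean_threshold(scores, labels)
auc = auroc(scores, labels)
print(f1, auc)
```

The pairwise AUROC is quadratic in the number of examples, which is fine for a sketch; a rank-based implementation would be preferable at the scale of the SeeTRUE splits.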

CRG helps measure the alignments between text and generated images. We observe a similar trend that adding CRG improves the performance greatly, increasing AUROC by $7.3$ and F1 points by $5.4$ on average when applied to LLaVA-1.5-13B and $8.4$ AUROC and $6.7$ F1 points for the 34B model. CRG can also be combined with ViP-LLaVA resulting in an increase of $8.7$ points on AUROC and $6.2$ points on F1, again complementing the learned visual prompt-following. We also observe that directly visually prompting the models (‘+ bbox overlay’) does not improve the results, even when using the fine-tuned ViP-LLaVA model. This verifies the effectiveness and robustness of CRG for evaluating model-generated images.

Table 3: Results on text-to-image evaluation benchmarks from SeeTRUE.

Model	DrawBench		EditBench		COCO-t2i		Average of 3 datasets
	AUROC	F1	AUROC	F1	AUROC	F1	AUROC	F1
CLIP ViT-L14 [39]	61.4	51.5	62.1	60.4	59.3	61.1	60.9	57.7
CLIP RN50x64 [39]	60.7	50.8	67.1	65.3	58.6	59.6	62.1	58.6
VQ²[55]	70.4	59.6	60.8	63.3	67.7	66.1	66.3	63.0
LLaVA-1.5-13B [29]	62.9	53.3	62.8	63.3	60.4	53.3	62.1	56.6
+ bbox overlay	63.1	52.5	63.0	63.2	60.0	52.5	62.1	56.1
+ CRG (Ours)	68.3	58.2	71.7	69.6	68.3	58.2	69.4	62.0
ViP-LLaVA-13B [6]	60.3	52.2	63.0	62.5	60.3	64.1	61.1	59.6
+ bbox overlay	60.3	52.0	63.4	62.4	60.3	63.4	61.3	59.3
+ CRG (Ours)	68.5	58.6	71.8	69.6	69.2	69.2	69.8	65.8
LLaVA-1.6-34B [30]	70.9	60.0	66.4	63.8	59.9	61.8	65.7	61.9
+ bbox overlay	70.6	59.9	66.5	64.1	59.4	62.3	65.5	62.0
+ CRG (Ours)	77.6	63.1	75.7	72.9	69.0	69.9	74.1	68.6

4.4 Evaluation on Referring Expression Comprehension and Phrase Grounding

Finally, we evaluate CRG’s abilities in referring expression comprehension (REC), i.e., locating the object referred to by a sentence, and phrase grounding, i.e., locating multiple objects referred to by a phrase. Specifically, we test whether the model can assign the correct bounding box given a textual description by re-ranking bounding box proposals such that the top bounding box matches a given phrase. We include three classic grounding benchmarks, RefCOCO, RefCOCO+ [20], and RefCOCOg [35] for REC, and Flickr30K Entities [38] for phrase grounding. We score each proposed bounding box by the model’s probability of producing the phrase when the box is overlaid on the image. Following prior work [32; 26], we evaluate the methods using accuracy@0.5, where we consider a predicted box correct if it has an IoU greater than 0.5 with the reference box.
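The re-ranking and accuracy@0.5 check can be sketched as below. This is a hedged illustration: the `(x1, y1, x2, y2)` box format, the helper names `iou` and `top1_correct`, and the toy scores are assumptions for the example, and the per-box scores stand in for the model probabilities described above.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_correct(proposals: Sequence[Box], scores: Sequence[float],
                 reference: Box, thresh: float = 0.5) -> bool:
    """Pick the highest-scoring proposal and test IoU > thresh against the reference."""
    best = max(range(len(proposals)), key=lambda i: scores[i])
    return iou(proposals[best], reference) > thresh


# Toy example: two proposals, the second one overlaps the reference box.
proposals = [(0, 0, 10, 10), (20, 20, 30, 30)]
reference = (21, 21, 30, 30)
hit = top1_correct(proposals, [-5.0, -2.0], reference)   # picks the second box
miss = top1_correct(proposals, [-1.0, -2.0], reference)  # picks the first box
```

When only one proposal exists, `max` trivially returns it, which matches the default behavior of selecting the single available box.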

Table 4: Results on region re-ranking experiments for referring expression comprehension and phrase grounding benchmarks. We report Top-1 accuracy@0.5. For each dataset, we include the percentage of the examples where multiple proposals are available in the parentheses.

	RefCOCO		RefCOCO+		RefCOCOg	Flickr30K Entities
Model	testA ( $23.3\%$ )	testB ( $26.9\%$ )	testA ( $28.2\%$ )	testB ( $38.9\%$ )	test ( $27.2\%$ )	test ( $26.9\%$ )
Full Dataset
GroundingDINO-B	77.3	72.5	72.0	59.3	66.3	70.4
+ LLaVA-1.5-13B	80.4	72.8	75.8	60.4	68.6	71.2
+ LLaVA-1.5-13B + CRG (Ours)	81.6	73.2	77.0	60.0	69.6	72.8
Multiple Proposals Only
GroundingDINO-B	48.5	48.9	47.1	45.1	49.1	51.5
+ LLaVA-1.5-13B	62.0	50.0	60.5	47.9	57.2	54.6
+ LLaVA-1.5-13B + CRG (Ours)	66.9	51.4	64.8	47.0	61.1	60.4

CRG improves region-text alignment in VLMs. In Table 4, CRG surpasses GroundingDINO in terms of top prediction accuracy and outperforms re-ranking with LLaVA’s probabilities in all scenarios except RefCOCO+ testB. We improve by $2.73\%$ on average compared to the top prediction by GroundingDINO, and by $0.8\%$ on average over re-ranking with LLaVA-1.5-13B. It is important to note that in scenarios where only a single bounding box is available, re-ranking is infeasible, and we select the single box by default. Thus, we additionally show the results on the subset of data where there are multiple proposals, which on average accounts for $28.6\%$ of the data. As illustrated in the bottom section of Table 4, CRG yields a much larger improvement over re-ranking with LLaVA-1.5-13B on this subset, e.g., $3.2\%$, $1.7\%$, and $3.9\%$ on average for the RefCOCO, RefCOCO+, and RefCOCOg test splits, and a $5.8\%$ improvement for the Flickr30K Entities test set. This indicates the value of our approach in linking phrases to the most pertinent image regions.

4.5 Analysis and Ablation Studies

In the following, we analyze design choices for CRG, including the comparison of different region guidance strategies (Sec. 4.5.1), analyzing probability shift for grounded text to understand why CRG works (Sec. 4.5.2), and the impact of region guidance strength $\alpha$ (Sec. 4.5.3).

Table 5: What’sUp results with different region guidance strategies on LLaVA-1.6-34B.

Contrasted with Original Image
Method	Indiv.	Pairs	Set of 4
Original Image	86.8	76.8	54.0
(a) Blackout Single Object	86.7	78.1	59.0
(b) Blackout Combined	85.8	78.6	60.1
(c) Blackout w. Mask	85.4	75.4	55.7
(d) Blackout All	82.6	75.2	53.8
(e) Blackout CRG	87.6	80.3	64.0
Overlay on Original Image
(f) Overlay with Boxes	82.2	69.4	39.7
(g) Overlay with Masks	84.2	82.7	49.6
(h) Overlay with SoM	77.9	63.9	33.0

4.5.1 Different Region Guidance Strategies

We investigate the impact of different region guidance strategies, including contrasting original images with another image (e.g., images where different regions are blacked out) and overlaying visual markers (e.g., bounding boxes and segmentation masks). We use the What’sUp benchmark, since each scene in the dataset contains exactly two distinct objects, which allows for reliable analysis. Here, we use LLaVA-1.6-34B. As depicted in the top section of Fig. 3, apart from our method of applying bounding boxes for each object separately shown in (e), we apply four distinct masking approaches. First, considering the different combinations of the objects, we black out one of the objects (we take the average of the results of masking each object) in (a), and apply a combined mask over both objects in (b). We also consider blacking out with a segmentation mask using Grounded-SAM [41] in Fig. 3 (c), motivated by the success of overlaying such masks as visual prompts [53], and blacking out the entire image in (d) as ablations, which has been previously applied to CFG [23]. We are primarily interested in finding the best masking strategy, especially in the presence of multiple objects, as well as the effect of the granularity of the mask.
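At the pixel level, the box-based blackout variants reduce to zeroing out rectangular regions. The sketch below is an illustrative assumption rather than the code behind Table 5: it assumes an `(H, W, C)` NumPy image and integer `(x1, y1, x2, y2)` boxes, and the helper name `blackout` is hypothetical. It only shows how the contrast images could be built, not how their results are combined.

```python
import numpy as np


def blackout(image: np.ndarray, boxes) -> np.ndarray:
    """Return a copy of an (H, W, C) image with every (x1, y1, x2, y2) box set to zero."""
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = 0  # rows index y, columns index x
    return out


# Toy all-white image with two (hypothetical) object boxes.
image = np.full((6, 8, 3), 255, dtype=np.uint8)
box_a, box_b = (0, 0, 2, 2), (5, 3, 8, 6)

# One contrast image per object, each hiding only that object's box.
per_object = [blackout(image, [box]) for box in (box_a, box_b)]

# Blacking out the entire image, as in ablation (d).
everything = np.zeros_like(image)
```

Keeping the mask limited to one box at a time leaves the rest of the scene intact in the contrast image, which is the property the comparison between these strategies is probing.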

Blacking out only the relevant regions is important. The findings are detailed in Table 5. Our method of masking each object separately achieves superior performance compared to the other masking strategies. In particular, our method (e) performs better than blacking out a combined mask (b) and blacking out the entire image (d), indicating the importance of precisely targeting necessary regions for removal to prevent the accidental exclusion of additional information. When blacking out with a segmentation mask in (c), we observe that while this approach yields competitive results among the other masking strategies, it is still worse than our main method (e). This indicates that the model may be using the object’s shape, which segmentation masks preserve.

Simply overlaying visual markers without CRG is ineffective for pre-trained VLMs. Finally, as shown in Fig. 3, we explore different visual markers for direct visual prompting, including bounding boxes (f) and segmentation masks (g), which have been shown to be effective for trained models [49; 6]. We also experiment with SoM [53] in (h). The results in Table 5 show that the LLaVA-1.6-34B model does not follow such visual prompts, as the three overlaying methods show worse performance than using the original image. The decrease in performance with SoM also echoes the result in Table 1. As mentioned in Sec. 3, overlaying does not guarantee that the model will focus on the region of interest – or contrast with it – as it can still attend to any spurious information outside that region, as in the original image. Contrasting with a blacked-out image reduces this kind of spurious information by factoring it out when the logits are subtracted. Thus, we show the usefulness of our chosen blackout strategy in aiding the model to follow visual prompts.

4.5.2 Quantifying the Intuition Behind CRG: Probability Contrasts for Grounded Text

To better understand why CRG improves visual grounding, we analyze CRG’s behavior on SugarCrepe, which measures compositional generalization. Here, we examine how the probability distribution on key words tied to the proposed regions changes when applying CRG. In SugarCrepe, each image has one correct caption and an incorrect distractor caption. In Swap-Att, the distractors are formed by swapping the visual attributes of objects in the scene; for example in Fig. 4 (a), “grey dog” is changed to “black dog”. We treat the attribute phrases from the correct captions as correct words $W_{C}$ (e.g., “grey dog”) and the phrase from incorrect captions as incorrect words $W_{I}$ (e.g., “black dog”). If CRG operates as expected, CRG should increase the model’s probability on correct words $W_{C}$ , as CRG emphasizes the correct object in the image while lowering the probability of incorrect words $W_{I}$ , which cannot be inferred from the regions.

CRG amplifies the correct text probability and reduces incorrect text probability. We use LLaVA-1.6-34B and calculate the average probabilities for $W_{C}$ and $W_{I}$. We compare the probabilities obtained from LLaVA-1.6-34B alone to those obtained by applying CRG in Fig. 4 (b). We see that the probability of the correct phrases increases slightly, while the probability of the incorrect phrases decreases with CRG. This indicates that the model is following the visual prompts and paying attention to the correct image regions to better differentiate the positive text from the negative, and is able to ground these regions in the image to the relevant part of the text. Thus, CRG improves performance in an interpretable way, by improving the matching between the image and the relevant tokens in the text.

4.5.3 Impact of Region Guidance Strength $\alpha$

We analyze the impact of the region guidance strength $\alpha$ on various tasks. As explained in Sec. 3, $\alpha$ in Eq. 3, $\texttt{softmax}[(1+\alpha)\cdot\text{logit}_{\theta}(y_{t}|I,X,y_{<t})-\alpha\cdot\text{logit}_{\theta}(y_{t}|\texttt{mask}(I,b),X,y_{<t})]$, controls how strongly CRG guides VLMs to focus on the region $b$. We illustrate the effect of adjusting $\alpha$ from 0 (regular decoding from the original image) to 1 with a step of 0.1. Fig. 5 shows the accuracy (and AUROC for SeeTRUE) across different datasets with various $\alpha$. We observe a clear trend that increasing $\alpha$ from 0 to 1 improves performance, indicating that focusing on the provided regions is beneficial to the tasks. Based on this finding, we further experiment with increasing the $\alpha$ value up to 10, which is shown in the bottom section of Fig. 5. While for some tasks, such as SugarCrepe, setting a more aggressive weight improves performance, there is no clear trend where a single value can achieve the best performance. Thus, we advocate for $\alpha=1$ as a default value that can be optionally tuned if a validation set is present.
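The role of $\alpha$ in Eq. 3 can be illustrated with a small numerical sketch. The function name `crg_probs` and the toy three-token logits are assumptions for illustration; in practice the two logit vectors would come from running the VLM on the original image and on the masked image.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def crg_probs(logits_full: np.ndarray, logits_masked: np.ndarray,
              alpha: float = 1.0) -> np.ndarray:
    """softmax[(1 + alpha) * logit(y | I, X) - alpha * logit(y | mask(I, b), X)]."""
    guided = (1.0 + alpha) * logits_full - alpha * logits_masked
    return softmax(guided)


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
# Token 0 loses support once the region is masked, so it depends on the region.
logits_full = np.array([2.0, 1.0, 0.0])
logits_masked = np.array([1.0, 1.0, 0.0])

p_alpha0 = crg_probs(logits_full, logits_masked, alpha=0.0)  # regular decoding
p_alpha1 = crg_probs(logits_full, logits_masked, alpha=1.0)  # default guidance
```

With $\alpha=0$ the masked logits drop out and the result equals regular decoding; with $\alpha=1$ the guided logits for this toy input are $[3, 1, 0]$, so the token whose evidence comes from the region gains probability mass.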

5 Conclusion

We present CRG, an accessible and easy-to-use training-free approach for improving the visual prompt following capability of VLMs. CRG provides significant improvement in visual prompt following and is effective across a broad spectrum of vision-language tasks that lack ground truth for region annotations: CRG improves spatial reasoning, compositional generalization, and image-text alignment for generated images by employing a re-ranking strategy for regions identified by an object detection model. We further explore different region guidance strategies, aiming to set a foundation for future advancements in visual prompting techniques. One direction for future work includes the integration of visual and textual contexts: our research focuses on guiding the model with visual inputs, and a concurrent work [62] proposes guidance through the added textual context in the caption. We believe that these directions complement each other and suggest a combined approach for an enhanced multimodal prompt following strategy.

Limitations

CRG shows strong performance across a variety of VL tasks; however, like all CFG methods, it comes with an additional computational cost due to the necessity of running the model twice (on the original image and the blacked-out image). This cost is offset by the fact that CRG is broadly applicable across various models and datasets, and does not require fine-tuning – although it can also be complementary to fine-tuned models, as shown in Sec. 4.2. If visual markers are absent, CRG also relies on an object detection model; however, such models are available across many domains. Future work could incorporate better visual encoders that can yield direct identification of relevant regions without additional object detectors.

Acknowledgement

We thank Peter Hase for the thoughtful discussion. This work was supported by DARPA ECOLE Program No. HR00112390060, NSF-AI Engage Institute DRL-2112635, DARPA Machine Commonsense (MCS) Grant N66001-19-2-4031, ARO Award W911NF2110220, ONR Grant N00014-23-1-2356, and a Bloomberg Data Science Ph.D. Fellowship. The views contained in this article are those of the authors and not of the funding agency.

References

Antol et al. [2015] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
Bahng et al. [2022] H. Bahng, A. Jahanian, S. Sankaranarayanan, and P. Isola. Exploring visual prompts for adapting large-scale models, 2022.
Bai et al. [2023a] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023a.
Bai et al. [2023b] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. Yuille, T. Darrell, J. Malik, and A. A. Efros. Sequential modeling enables scalable learning for large vision models, 2023b.
Bar et al. [2022] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. A. Efros. Visual Prompting via Image Inpainting. In NeurIPS, 2022. ISBN 9781713871088.
Cai et al. [2023] M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y. Chai, D. Park, and Y. J. Lee. Making Large Multimodal Models Understand Arbitrary Visual Prompts, 2023. URL http://arxiv.org/abs/2312.00784.
Chen et al. [2023a] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023a.
Chen et al. [2022] X. Chen, Z. Zhao, Y. Zhang, M. Duan, D. Qi, and H. Zhao. Focalclick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1300–1309, June 2022.
Chen et al. [2023b] X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023b.
Cho et al. [2023] J. W. Cho, D.-J. Kim, H. Ryu, and I. S. Kweon. Generative bias for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11681–11690, 2023.
Dai et al. [2023] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vvoWPYqZJA.
Dosovitskiy et al. [2020] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Gafni et al. [2022] O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In ECCV, 2022. URL http://arxiv.org/abs/2203.13131.
Goyal et al. [2017] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
Ho and Salimans [2021] J. Ho and T. Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI.
Honnibal et al. [2020] M. Honnibal, I. Montani, S. V. Landeghem, and A. Boyd. spacy: Industrial-strength natural language processing in python, 2020. URL https://spacy.io.
Hsieh et al. [2023] C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
Kamath et al. [2021] A. Kamath, M. Singh, Y. LeCun, I. Misra, G. Synnaeve, and N. Carion. Mdetr–modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763, 2021.
Kamath et al. [2023] A. Kamath, J. Hessel, and K.-W. Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, 2023.
Kazemzadeh et al. [2014] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
Khattak et al. [2023] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan. MaPLe: Multi-modal Prompt Learning. In CVPR, 2023. ISBN 9798350301298. doi: 10.1109/CVPR52729.2023.01832.
Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. arXiv:2304.02643, 2023.
Kornblith et al. [2023] S. Kornblith, L. Li, Z. Wang, and T. Nguyen. Guiding Image Captioning Models Toward More Specific Captions. In ICCV, 2023. URL http://arxiv.org/abs/2307.16686.
Leng et al. [2023] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, 2023. URL http://arxiv.org/abs/2311.16922.
Li et al. [2023a] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
Li* et al. [2022] L. H. Li*, P. Zhang*, H. Zhang*, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao. Grounded language-image pre-training. In CVPR, 2022.
Li et al. [2023b] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023b.
Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Liu et al. [2023a] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=w0H2xGHlkw.
Liu et al. [2024a] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
Liu et al. [2024b] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
Liu et al. [2023b] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.
Liu et al. [2022] Y. Liu, Y. Guo, J. Yin, X. Song, W. Liu, L. Nie, and M. Zhang. Answer questions with right image regions: A visual attention regularization approach. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(4):1–18, 2022.
Ma et al. [2023] Z. Ma, J. Hong, M. O. Gul, M. Gandhi, I. Gao, and R. Krishna. Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10910–10921, 2023.
Mao et al. [2015] J. Mao, J. Huang, A. Toshev, O.-M. Camburu, A. L. Yuille, and K. P. Murphy. Generation and comprehension of unambiguous object descriptions. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11–20, 2015. URL https://api.semanticscholar.org/CorpusID:8745888.
O’Brien and Lewis [2023] S. O’Brien and M. Lewis. Contrastive decoding improves reasoning in large language models, 2023.
OpenAI et al. [2023] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, et al. Gpt-4 technical report, 2023.
Plummer et al. [2017] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 123(1):74–93, 2017.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.
Ray et al. [2024] A. Ray, F. Radenovic, A. Dubey, B. Plummer, R. Krishna, and K. Saenko. cola: A benchmark for compositional text-to-image retrieval. Advances in Neural Information Processing Systems, 36, 2024.
Ren et al. [2024] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024.
Ribeiro et al. [2016] M. Ribeiro, S. Singh, and C. Guestrin. “why should I trust you?”: Explaining the predictions of any classifier. In J. DeNero, M. Finlayson, and S. Reddy, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-3020. URL https://aclanthology.org/N16-3020.
Saharia et al. [2022] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, R. Gontijo-Lopes, B. K. Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=08Yk-n5l2Al.
Sanchez et al. [2023] G. Sanchez, H. Fan, A. Spangher, E. Levi, P. S. Ammanamanchi, and S. Biderman. Stay on topic with classifier-free guidance. arXiv preprint arXiv:2306.17806, 2023.
Selvaraju et al. [2019] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, and D. Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2591–2600, 2019.
Shi et al. [2023] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, and S. W. tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding, 2023.
Shtedritski et al. [2023] A. Shtedritski, C. Rupprecht, and A. Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11987–11997, October 2023.
Singh et al. [2022] A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela. FLAVA: A foundational language and vision alignment model. In CVPR, 2022.
Sun et al. [2023] Z. Sun, Y. Fang, T. Wu, P. Zhang, Y. Zang, S. Kong, Y. Xiong, D. Lin, and J. Wang. Alpha-clip: A clip model focusing on wherever you want, 2023.
Thrush et al. [2022] T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, and C. Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
Wang et al. [2023] S. Wang, C. Saharia, C. Montgomery, J. Pont-Tuset, S. Noy, S. Pellegrini, Y. Onoe, S. Laszlo, D. J. Fleet, R. Soricut, J. Baldridge, M. Norouzi, P. Anderson, and W. Chan. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18359–18369, Los Alamitos, CA, USA, jun 2023. IEEE Computer Society. doi: 10.1109/CVPR52729.2023.01761. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.01761.
Wu and Mooney [2019] J. Wu and R. Mooney. Self-critical reasoning for robust visual question answering. Advances in Neural Information Processing Systems, 32, 2019.
Yang et al. [2023] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, 2023. URL http://arxiv.org/abs/2310.11441.
Yao et al. [2021] Y. Yao, A. Zhang, Z. Zhang, Z. Liu, T.-S. Chua, and M. Sun. CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models, 2021. URL http://arxiv.org/abs/2109.11797.
Yarom et al. [2023] M. Yarom, Y. Bitton, S. Changpinyo, R. Aharoni, J. Herzig, O. Lang, E. Ofek, and I. Szpektor. What you see is what you read? improving text-image alignment evaluation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=j5AoleAIru.
Ying et al. [2022] Z. Ying, P. Hase, and M. Bansal. Visfis: Visual feature importance supervision with right-for-the-right-reason objectives. Advances in Neural Information Processing Systems, 35:17057–17072, 2022.
Young et al. [2014] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
Zellers et al. [2021] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi. MERLOT: Multimodal Neural Script Knowledge Models. In NeurIPS, 2021. URL http://arxiv.org/abs/2106.02636.
Zhang et al. [2022] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao. Glipv2: Unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836, 2022.
Zhang et al. [2016] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5014–5022, 2016.
Zhang et al. [2023] S. Zhang, P. Sun, S. Chen, M. Xiao, W. Shao, W. Zhang, Y. Liu, K. Chen, and P. Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023.
Zhao et al. [2024] L. Zhao, Y. Deng, W. Zhang, and Q. Gu. Mitigating object hallucination in large vision-language models via classifier-free guidance, 2024.
Zou et al. [2023] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee. Segment everything everywhere all at once. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=UHBrWeFWlL.

Appendix

In this appendix, we include details of the experiment setup (Appendix A) and qualitative examples (Appendix B).

Appendix A Experimental Setup Details

A.1 Dataset Statistics

In Table 6, we show the statistics of the datasets used in our experiments.

Table 6: Dataset Statistics. REC denotes Referring Expression Comprehension.

Dataset	# examples
ViP-Bench	303
Image-Text Alignment
What’sUp	3,280
SugarCrepe swap-att	1,332
SugarCrepe swap-obj	490
SeeTRUE DrawBench	1,312
SeeTRUE EditBench	3,827
SeeTRUE COCO-t2i	1,791
REC and Phrase Grounding
RefCOCO testA	5,657
RefCOCO testB	5,095
RefCOCO+ testA	5,726
RefCOCO+ testB	4,889
RefCOCOg test	9,602
Flickr30K Entities test	4,969

A.2 Details on CRG Prompting, Region Proposals, and Probability Extraction

To generate answers for VQA-style tasks, such as ViP-Bench (Sec. 4.2) and What’sUp (Sec. 4.3.1), we provide the model with the question directly and let it generate the response. Following Kamath et al. [2023], we use the probability of Yes as the score of each image-text pair for What’sUp. For other tasks, we use the prompt “Provide a one-sentence caption for the provided image”, which the tested models saw during pre-training (Liu et al. [2023a, 2024a]). We then force the model to decode the target sentence or phrase and retrieve the probabilities of its tokens.
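The probability extraction described above can be illustrated with a minimal sketch. It assumes the per-position vocabulary logits have already been obtained from a VLM by teacher-forcing the target text; the function names (`forced_decode_log_probs`, `yes_probability`) and the toy logits are hypothetical and are not taken from our implementation. How the per-token values are aggregated into a single score is left open here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forced_decode_log_probs(step_logits, target_ids):
    """Log-probability of each target token under forced decoding.

    step_logits[t] is assumed to hold the logits at position t, computed with
    the prompt and target_ids[:t] as context (teacher forcing).
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, target_ids)]


def yes_probability(first_step_logits, yes_id):
    """Probability assigned to the 'Yes' token at the first answer position."""
    return math.exp(log_softmax(first_step_logits)[yes_id])


# Toy example with a two-token vocabulary (values are illustrative only).
toy_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
token_log_probs = forced_decode_log_probs(toy_logits, [0, 1])
phrase_log_prob = sum(token_log_probs)  # one possible aggregation (an assumption)
```

In this sketch the log-probabilities are kept per token so that they can be summed, averaged, or compared across different image inputs as needed.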

For bounding box extraction on Referring Expression Comprehension (REC), we use the positive tokens from the pre-processed data provided by Kamath et al. [2021] to generate bounding box candidates. For phrase grounding, we directly use the phrase to extract bounding box proposals.

A.3 Details on Set-of-Mark Prompting

To implement SoM prompting, we transform any mentions of “within the {color} rectangle” to “in {number}” in the text prompt via regular expression. For example, the question “Are the numbers within the red rectangle and within the purple rectangle the same?” becomes “Are the numbers in 0 and in 1 the same?”, where the image for SoM removes the bounding box and instead contains an overlay on the object with numbering. We apply the inverse process to the outputs, transforming mentions of the numbers back to bounding boxes with colors in the answer, to be compatible with the original scoring methods. For example, the answer “Yes, the numbers in 0 and in 1 are the same.” is transformed into “Yes, the numbers within the red rectangle and within the purple rectangle are the same.”
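A minimal sketch of this text transformation is given below. The exact regular expressions are not specified above, so the patterns and the function names (`to_som_prompt`, `from_som_answer`) are hypothetical; the sketch also assumes that marks are numbered from 0 in order of each color's first appearance, which matches the example but may differ from the actual numbering scheme.

```python
import re

# Hypothetical patterns; the expressions used in practice may differ.
COLOR_PATTERN = re.compile(r"within the (\w+) rectangle")
MARK_PATTERN = re.compile(r"\bin (\d+)\b")


def to_som_prompt(question):
    """Rewrite colored-rectangle mentions as numeric marks.

    Returns the rewritten text and the list of colors, where the list index
    of a color is the mark number assigned to it.
    """
    colors = []

    def repl(match):
        color = match.group(1)
        if color not in colors:
            colors.append(color)
        return f"in {colors.index(color)}"

    return COLOR_PATTERN.sub(repl, question), colors


def from_som_answer(answer, colors):
    """Inverse step: map numeric marks in an answer back to colored rectangles."""

    def repl(match):
        idx = int(match.group(1))
        if idx < len(colors):
            return f"within the {colors[idx]} rectangle"
        return match.group(0)  # leave numbers that are not known marks untouched

    return MARK_PATTERN.sub(repl, answer)


question = ("Are the numbers within the red rectangle and "
            "within the purple rectangle the same?")
som_question, colors = to_som_prompt(question)
restored = from_som_answer("Yes, the numbers in 0 and in 1 are the same.", colors)
```

Keeping the color list returned by the forward step makes the inverse step unambiguous, and the index check avoids rewriting unrelated phrases such as “in 2023” when the number is not an assigned mark.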

Appendix B Qualitative Examples

In Fig. 6, Fig. 7, Fig. 8, and Fig. 9, we include qualitative examples from ViP-Bench, where we show that CRG can direct the model to generate the correct answer to the given question. In Fig. 10, we include examples of the correct text and its associated regions. In Fig. 11, we include RefCOCO examples from the validation data with the bounding box proposals generated by GroundingDINO.