
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Zaid Khan  Yun Fu
Northeastern University
Abstract

The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model’s responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.

1 Introduction

Figure 1: Identifying unreliable responses from an API-only black-box vision-language model (VLM) can be challenging because confidence scores are not always trustworthy, and more sophisticated methods for selective prediction require a level of access to the model that is unavailable. We explore the idea of model consistency to identify unreliable model responses in this realistic scenario: a reliable response is one that is consistent across questions that are semantically equivalent but different on the surface.

Powerful commercial frontier models are sometimes only available as black boxes accessible through an API [30, 2]. When using these models in high-risk scenarios, it is preferable that the model defers to an expert or abstains from answering rather than deliver an incorrect answer [8]. Many approaches for selective prediction [35, 8] or improving the predictive uncertainty of a model exist, such as ensembling [15], gradient-guided sampling in feature space [12], retraining the model [32], or training an auxiliary module using model predictions [24]. Selective prediction has typically been studied in unimodal settings and/or for tasks with a closed-world assumption, such as image classification, and has only recently been studied for multimodal, open-ended tasks such as visual question answering (VQA) [34, 7].

In existing deployments, training data is private, model features and gradients are unavailable, retraining is not possible, the number of predictions may be limited by the API, training on model outputs is often prohibited, and queries are open-ended. In a black-box setting with realistic constraints, how do we identify unreliable predictions from a vision-language model?

An intuitive approach is to consider self-consistency: if a human subject is given two semantically equivalent questions, we expect the human subject’s answers to the questions to be identical. A more formal notion of consistency is that given a classifier $f(\cdot)$ and a point $\mathbf{x}\in\mathbb{R}^{N}$ in feature space, the classifier’s predictions over an $\epsilon$-neighborhood of $\mathbf{x}$ should be consistent with $f(\mathbf{x})$ for a small enough $\epsilon$ [12]. It is not straightforward to operationalize either of these notions. How can we scalably obtain “semantically equivalent” visual questions to an input visual question? Since we can’t access the internal representations of a black-box model, how can we sample from the neighborhood of an input visual question?

First, we study selective prediction on VQA across in-distribution, out-of-distribution, and adversarial inputs using a large VLM. Next, we describe how rephrasings of a question can be viewed as samples from the $\epsilon$-neighborhood of a visual question pair. We propose training a visual question generation model as a probing model to scalably and cheaply produce rephrasings of visual questions given answers and an image, allowing us to approximately sample from the neighborhood of a visual question pair. To quantify uncertainty in the answer to a visual question pair, we feed the rephrasings of the question to the black-box VLM, and count the number of rephrasings for which the answer of the VLM remains the same. Surprisingly, we show that consistency over model-generated “approximate rephrasings” is effective at identifying unreliable predictions of a black-box vision-language model, even when the rephrasings are not semantically equivalent and the probing model is an order of magnitude smaller than the black-box model.

Our approach is analogous to consistency over samples taken from the neighborhood of an input sample in feature space, but this method does not require access to the features of the vision-language model. Furthermore, it does not require a held-out validation set, access to the original training data, or retraining the vision-language model, making it appropriate for black-box uncertainty estimates of a vision-language model. We conduct a series of experiments testing the effectiveness of consistency over rephrasings for assessing predictive uncertainty using the task of selective visual question answering in a number of settings, including adversarial visual questions, distribution shift, and out of distribution detection.

Our contributions are:

  • We study the problem of black-box selective prediction for a large vision-language model, using the setting of selective visual question answering.

  • We show that on in-distribution data, a state-of-the-art large vision-language model is capable of identifying when it does not know the answer to a question, but this ability is severely degraded for out-of-distribution and adversarial visual questions.

  • We propose identifying high-risk inputs for visual question answering based on consistency over samples in the neighborhood of a visual question.

  • We show that consistency defines a different ordering than model confidence / uncertainty over instances in a dataset.

  • We conduct a series of experiments validating the proposed method on in-distribution, out-of-distribution and adversarial visual questions, and show that our approach even works in the likely setting that the black box model being probed is substantially larger than the probing model.

We show that consistency over the rephrasings of a question is correlated with model accuracy on a question and can select slices of a test dataset on which a model can achieve lower risk, reject out-of-distribution samples, and works well to separate right from wrong answers, even on adversarial and out-of-distribution inputs. Surprisingly, this technique works even though many rephrasings are not literally valid rephrasings of a question. Our proposed method is a step towards reliable usage of vision-language models as an API. Limitations: Due to resource constraints, we study models that might now be considered relatively small ($\leq$ 13B parameters), and our VQG model is “small” ($<$ 700M parameters).

2 Motivating Experiment

Figure 2: For out of distribution (OKVQA) and adversarial visual (AdVQA) questions, confidence scores alone do not work well to separate right from wrong answers — many correct answers are low confidence for OOD data, and many wrong answers are high confidence for adversarial data. Note: Displayed confidence scores are raw. See Appendix for discussion on calibration.
Figure 3: Selective VQA performance of a VLM (BLIP) on three datasets: adversarial (AdVQA), out-of-distribution (OKVQA), and in-distribution (VQAv2). On OOD and adversarial questions, the model has a harder time identifying which questions it should abstain from.

We empirically examine the predictive uncertainty of a large VLM through the lens of selective visual question answering. In contrast to the classical VQA setting where a model is forced to answer, a model is allowed to abstain from answering in selective VQA. For safety and reliability, it is important to examine both out-of-distribution and adversarial inputs, on which we expect that the VLM will have a high error rate if forced to answer every out-of-distribution or adversarial question posed to the model. However, because the VLM is allowed to abstain, in principle the model can achieve low risk (low error rate) on a slice of the dataset corresponding to questions that it knows the answer to. In a black-box setting, only the raw confidence scores for the answer candidates are likely to be available, so we use the confidence of the most likely answer as the uncertainty.

In Fig. 3, we plot the selective visual question answering performance of BLIP [17] finetuned on VQAv2 using the confidence scores on the validation sets of adversarial (AdVQA, Sheng et al. [29]), out-of-distribution (OKVQA, Marino et al. [23]) and in-distribution (VQAv2) datasets. For the in-distribution dataset (VQAv2), the model is quickly able to identify which questions it is likely to know the answer to, achieving nearly perfect accuracy by rejecting the most uncertain 40% of the dataset. However, for out-of-distribution and adversarial datasets, the model has a harder time: after rejecting 50% of the questions, the model still has an error rate of $\approx 40\%$. The reason for this is evident in Fig. 2, where we plot the distribution of confidence scores for incorrect and correct answers for OOD, in-distribution, and adversarial visual questions. For in-distribution visual questions, the confidence distribution is bimodal, and incorrect and correct answers are clearly separated by confidence. For OOD visual questions, many correctly answered questions are low confidence and difficult to distinguish from incorrectly answered questions. A similar situation occurs for adversarial visual questions, in which many questions are incorrectly answered with high confidence.

Although the strategy of using model confidence alone to detect questions the model cannot answer is effective for in-distribution visual questions, this strategy fails on out-of-distribution and adversarial visual questions.
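The risk-coverage analysis in this section can be illustrated with a short sketch. It is not the evaluation code behind Fig. 3; it assumes each answer is scored as simply right or wrong (VQA accuracy is usually a soft score), and the function name `risk_coverage_curve` and the toy data are hypothetical.

```python
from typing import List, Tuple


def risk_coverage_curve(confidences: List[float],
                        correct: List[bool]) -> List[Tuple[float, float]]:
    """Return (coverage, risk) pairs, answering the most confident questions first."""
    # Rank question indices from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for answered, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        coverage = answered / len(order)  # fraction of questions answered
        risk = errors / answered          # error rate on the answered questions
        curve.append((coverage, risk))
    return curve


# Toy data (hypothetical): raw confidence scores and binary correctness.
conf = [0.9, 0.8, 0.6, 0.3]
is_correct = [True, True, False, True]
curve = risk_coverage_curve(conf, is_correct)
```

When confidence separates right from wrong answers well, risk stays near zero until coverage is high; when it does not, errors appear early in the ranking and the curve rises quickly.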

3 Method

3.1 Task Definition and Background

Given an image $v$ and question $q$, the task of selective visual question answering is to decide whether a model $f_{VQA}(v,q)$ should predict an answer $a$, or abstain from making a prediction. A typical solution to this problem is to train a selection function $g(\cdot)$ that produces an abstention score $p_{\mathrm{rej}}\in[0,1]$. The simplest selection function would be to take the rejection probability $p_{\mathrm{rej}}=1-p(a|q,v)$, where $p(a|q,v)$ is the model confidence that $a$ is the answer, and then use a threshold $\tau$ so that the model abstains when $p_{\mathrm{rej}}>\tau$ and predicts otherwise. A more complex approach taken by Whitehead et al. [34] is to train a parametric selection function $g(\mathbf{z}_{v},\mathbf{z}_{q};\theta)$, where $\mathbf{z}_{v}$ and $\mathbf{z}_{q}$ are the model’s dense representations of the image and question respectively. The parameters $\theta$ are optimized on a held-out validation set, effectively training a classifier to predict when $f_{VQA}$ will predict incorrectly on an input visual question $v,q$.
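As a minimal sketch of the simplest selection function described above, the snippet below thresholds $p_{\mathrm{rej}}=1-p(a|q,v)$. The function name `selective_answer`, the use of `None` to signal abstention, and the example values are illustrative assumptions.

```python
from typing import Optional


def selective_answer(answer: str, confidence: float, tau: float) -> Optional[str]:
    """Return the answer, or None to abstain, using p_rej = 1 - p(a|q,v)."""
    p_rej = 1.0 - confidence
    # Abstain when the rejection probability exceeds the threshold tau.
    return None if p_rej > tau else answer


# Hypothetical example: the same answer at two confidence levels.
kept = selective_answer("umbrella", confidence=0.85, tau=0.5)
dropped = selective_answer("umbrella", confidence=0.20, tau=0.5)
```

Raising $\tau$ makes the model answer more often (higher coverage) at the cost of accepting less confident answers.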

In the black-box setting, access to the dense representations $\mathbf{z}_{v},\mathbf{z}_{q}$ of the image $v$ and question $q$ is typically forbidden. Furthermore, even if access to the representations is allowed, a large number of evaluations of $f_{VQA}$ would be needed to obtain the training data for the selection function. Existing methods for selective prediction typically assume and evaluate a fixed set of classes, but for VQA, the label space can shift for each task (differing sets of acceptable answers for different types of questions) or be open-set. These constraints suggest the following requirements for a black-box approach:

  1. The approach should not require access to the black-box model’s internal representations of $v,q$.

  2. The approach should be model agnostic, as the architecture of the black-box model is unknown.

  3. The approach should not require a large number of predictions from the black-box model to train a selection function, because each usage of the black-box model incurs a financial cost, which can be substantial if a large number of predictions are needed to train an auxiliary model.

  4. Similarly, the approach should not require a held-out validation set for calibrating predictions, because this potentially requires a large number of evaluations of the black-box model.

3.2 Deep Structure and Surface Forms

Within the field of linguistics, a popular view first espoused by Chomsky [4] is that every natural language sentence has both a surface form and a deep structure. Multiple surface forms can be instances of the same deep structure. Simply put, multiple sentences that have different words arranged in different orders can mean the same thing. A rephrasing of a question corresponds to an alternate surface form, but the same deep structure. We thus expect that the answer to a rephrasing of a question should be the same as the answer to the original question. If the answer to a rephrasing is inconsistent with the answer to the original question, it indicates the model is sensitive to variations in the surface form of the original question. This in turn indicates that the model’s understanding of the question is highly dependent on superficial characteristics, making the question a good candidate for abstention. We hypothesize that inconsistency on the rephrasings can be used to better quantify predictive uncertainty and reject questions a model has not understood.

3.3 Rephrasing Generation as Neighborhood Sampling

The idea behind many methods for representation learning is that a good representation should map multiple surface forms close together in feature space. For example, in contrastive learning, variations in surface form are generated by applying augmentations to an input, and the distance between multiple surface forms is minimized. In general, a characteristic of deep representations is that surface forms of an input should be mapped close together in feature space. Previous work, such as Attribution-Based Confidence [12] and Implicit Semantic Data Augmentation [33], exploits this by perturbing input samples in feature space to explore the neighborhood of an input. In a black-box setting, we don’t have access to the features of the model, so there is no direct way to explore the neighborhood of an input in feature space. However, an alternate surface form of the input should be mapped close to the original input in feature space. Thus, a surface form variation of an input should be a neighbor of the input in feature space. Generating a surface form variation of a natural language sentence corresponds to a rephrasing of the natural language sentence. Since a rephrasing of a question is a surface form variation of a question, and surface form variations of an input should be mapped close to the original input in feature space, a rephrasing of a question is analogous to a sample from the neighborhood of a question. We discuss this further in the appendix.

3.4 Cyclic Generation of Rephrasings

Figure 4: Examples showing the use of model-generated rephrasings to identify errors in model predictions with BLIP as the black-box model $f_{BB}$. In the left panel, we show high-confidence answers that are wrong, identified by their low consistency across rephrasings. In the right panel, we show low-confidence answers that are actually correct, identified by their high consistency across rephrasings.

A straightforward way to generate a rephrasing of a question is to invert the visual question answering problem, as is done in visual question generation. Let $p(V), p(Q), p(A)$ be the distributions of images, questions, and answers respectively. Visual question generation can be framed as approximating $p(Q|A,V)$, in contrast to visual question answering, which approximates $p(A|Q,V)$. We want to probe the predictive uncertainty of a black-box visual question answering model $f_{BB}(\cdot)$ on an input visual question pair $v,q$, where $v\sim p(V)$ is an image and $q\sim p(Q)$ is a question. The VQA model $f_{BB}$ approximates $p(A|Q,V)$. Let the answer $a$ assigned the highest probability by the VQA model $f_{BB}(\cdot)$ be taken as the prospective answer. A VQG model $f_{VQG}\approx p(Q|A,V)$ can then be used to generate a rephrasing of an input question $q$. To see how, consider feeding the highest probability answer $a$ from $f_{BB}(\cdot)\approx p(A|Q,V)$ into $f_{VQG}(\cdot)\approx p(Q|A,V)$ and then sampling a sentence $q^{\prime}\sim f_{VQG}\approx p(Q|A,V)$ from the visual question generation model. In the case of an ideal $f_{VQG}(\cdot)$ and perfectly consistent $f_{BB}(\cdot)$, $q^{\prime}$ should be a generated question for which $p(a|q^{\prime},v)\geq p(a_{i}|q^{\prime},v)\ \forall a_{i}\in A$, with equality occurring in the case that $a_{i}=a$. So, $q^{\prime}$ is a question having the same answer as $q$, which is, practically speaking, a rephrasing. We provide an algorithm listing in Algorithm 1.

To summarize, we ask the black-box model for an answer to a visual question, then give the predicted answer to a visual question generation model to produce a question $q^{\prime}$ conditioned on the image $v$ and the answer $a$ given by the black-box model. The generated $q^{\prime}$ corresponds to a question the VQG model thinks should lead to the predicted answer $a$. We assume that if the rephrasings generated by $f_{VQG}$ are good enough, $f_{BB}$ should be consistent on the rephrasings, and inconsistency indicates a problem with $f_{BB}$. In practice, each $q^{\prime}$ is not guaranteed to be a rephrasing (see Fig. 4) due to the probabilistic nature of the sampling process and because the VQG model is not perfect. The VQG model can be trained by following any procedure that results in an autoregressive model approximating $p(q|a,v)$, i.e., one capable of text generation conditioned on multimodal image-text input. The training procedure of the VQG model is an implementation detail we discuss in Sec. 3.5.

3.5 Implementation Details

We initialize the VQG model $f_{VQG}$ from a BLIP checkpoint pretrained on 129M image-text pairs, and train it to maximize $p(q|a,v)$ using a standard language modeling loss. Specifically, we use

$\mathcal{L}_{\mathrm{VQG}}=-\sum_{n=1}^{N}\log P_{\theta}\left(y_{n}\mid y_{<n},a,v\right)$ (1)

where $y_{1},y_{2},\ldots,y_{N}$ are the tokens of a question $q$ and $a,v$ are the ground-truth answer and image, respectively, from a VQA triplet $(v,q,a)$. We train for 10 epochs, using an AdamW [21] optimizer with a weight decay of 0.05, and decay the learning rate linearly to 0 from 2e-5. We use a batch size of 64 with an image size of $480\times 480$, and train the model on the VQAv2 training set [10]. To sample questions from the VQG model, we use nucleus sampling [11] with a top-$p$ of 0.9.
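To make the sampling step concrete, here is a minimal sketch of nucleus (top-$p$) sampling for a single decoding step. It operates on a toy next-token distribution rather than a real VQG model; the function name `nucleus_sample`, the dictionary representation of the distribution, and the example tokens are assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def nucleus_sample(probs: Dict[str, float], top_p: float = 0.9,
                   rng: Optional[random.Random] = None) -> str:
    """Sample a token from the smallest set of tokens whose mass reaches top_p."""
    rng = rng or random.Random()
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break
    tokens = [t for t, _ in nucleus]
    weights = [p for _, p in nucleus]
    # random.choices renormalizes the weights over the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution at one decoding step.
next_token_probs = {"what": 0.5, "which": 0.3, "is": 0.15, "zebra": 0.05}
samples = {nucleus_sample(next_token_probs, 0.9, random.Random(seed))
           for seed in range(200)}
```

With a top-$p$ of 0.9, the low-probability tail (here, the token "zebra") is never sampled, while the remaining tokens can still vary from draw to draw, which is what produces diverse rephrasings.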

Input: $v, q, k$
Data: $f_{BB}, f_{VQG}$
Result: $c\in\mathbb{Q}$: the consistency of $f_{BB}$ over $k$ rephrasings of $v,q$
$a_{0}\longleftarrow f_{BB}(q,v)$;
$c\longleftarrow 0$;
for $i\leftarrow 1$ to $k$ do
       $q^{\prime}\longleftarrow \mathrm{nucleus\_sample}(f_{VQG}(v,a_{0}))$;
       $a^{\prime}\longleftarrow f_{BB}(q^{\prime},v)$;
       if $a^{\prime}=a_{0}$ then
              $c\longleftarrow c+1$;
       end if
end for
return $c\div k$
Algorithm 1 Probing Predictive Uncertainty of a Black-Box Vision-Language Model
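Algorithm 1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the two models are passed in as plain callables, answers are compared by exact string match (an assumption), and `toy_bb` and `toy_vqg` are hypothetical stand-ins for a real black-box VQA model and a real VQG model.

```python
import itertools
from typing import Callable


def consistency(v: object, q: str, k: int,
                f_bb: Callable[[str, object], str],
                f_vqg: Callable[[object, str], str]) -> float:
    """Fraction of k sampled rephrasings on which f_bb repeats its original answer."""
    a0 = f_bb(q, v)                # original answer from the black-box model
    agree = 0
    for _ in range(k):             # exactly k rephrasings
        q_prime = f_vqg(v, a0)     # sample a question conditioned on (v, a0)
        if f_bb(q_prime, v) == a0:
            agree += 1
    return agree / k


# Hypothetical stand-ins: a fixed cycle of "sampled" rephrasings and a
# black-box model that is thrown off by the word "shade".
_rephrasings = itertools.cycle([
    "what color is the car?",
    "what shade is the car?",
    "the car is what color?",
    "which color is the vehicle?",
])


def toy_vqg(v: object, a: str) -> str:
    return next(_rephrasings)


def toy_bb(q: str, v: object) -> str:
    return "blue" if "shade" in q else "red"


score = consistency("image.jpg", "what is the color of the car?", 4, toy_bb, toy_vqg)
```

Each consistency estimate costs $k+1$ calls to the black-box model, so $k$ trades estimate resolution against API cost.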
Figure 5: The distribution of confidence scores of $f_{BB}$ at each level of consistency. While higher levels of consistency have a larger proportion of high confidence answers, they also retain a large number of low confidence answers, showing that consistency defines a different ordering over questions than confidence scores alone. BLIP is used as the black-box model $f_{BB}$.

4 Experiments

We conduct a series of experiments probing predictive uncertainty in a black-box visual question answering setting over two large vision-language models and three datasets. The primary task we use to probe predictive uncertainty is selective visual question answering, which we describe in detail in Sec. 3.1. Further qualitative examples and results can be found in the appendix.

4.1 Experimental Setup

Black-box Models The experimental setup requires a black-box VQA model $f_{BB}$ and a rephrasing generator $f_{VQG}$. We describe the training of the rephrasing generator $f_{VQG}$ in Sec. 3.5. We choose ALBEF [16], BLIP [17], and BLIP-2 [18] as our black-box models. ALBEF and BLIP have $\approx 200$M parameters, while the version of BLIP-2 we use is based on the 11B parameter FLAN-T5 [5] model. ALBEF has been pretrained on 14M image-text pairs, while BLIP has been pretrained on over 100M image-text pairs, and BLIP-2 is aligned on 4M images. We use the official checkpoints provided by the authors, finetuned on Visual Genome [14] and VQAv2 [10] with 1.4M and 440K training triplets respectively.

Datasets We evaluate in three settings: in-distribution, out-of-distribution, and adversarial. For the in-distribution setting, we use pairs from the VQAv2 validation set following the selection of [28]. For the out-of-distribution setting, we use OK-VQA [23], a dataset for question answering on natural images that requires outside knowledge. OK-VQA is a natural choice for an out-of-distribution selective prediction task, because many of the questions require external knowledge that a VLM may not have acquired, even through large-scale pretraining. On such questions, a model that knows what it doesn’t know should abstain due to lack of requisite knowledge. Finally, we consider adversarial visual questions in the AdVQA dataset [29]. We use the official validation splits provided by the authors. The OK-VQA, AdVQA, and VQAv2 validation sets contain 5k, 10k, and 40k questions respectively.

Figure 6: The percentage of each dataset at a given level of consistency. On a well-understood, in-distribution dataset (VQAv2), a large percentage of the questions are at a high consistency level.
Figure 7: The accuracy of the answers of a VQA model (BLIP) plotted as a function of how consistent each answer was over up to 5 rephrasings of an original question.

4.2 Properties of Consistency

In this section, we analyze properties of consistency. We are interested in the following questions:

  1. Is increased consistency on rephrasings correlated with model accuracy on the original question?

  2. What does the confidence distribution look like for different levels of consistency?

  3. What is the distribution of consistency across different datasets?

In Fig. 7 we plot the accuracy of the answers when $f_{BB}$ is BLIP by how consistent each answer was over up to 5 rephrasings of an original question. We find that consistency over rephrasings is correlated with accuracy across all three datasets, though the correlation is weakest on adversarial data. Increased consistency on the rephrasings of a question implies lower risk on the original answer to the original question. Next, we examine how the distribution of model confidence varies across consistency levels in Fig. 5. Across all datasets, slices of a dataset at higher consistency levels also have a greater proportion of high-confidence answers, but retain a substantial proportion of low-confidence answers. This shows that consistency and confidence are not equivalent, and define different orderings on a set of questions and answers. Put another way, low confidence on a question does not preclude high consistency on a question, and similarly, high confidence on a question does not guarantee the model will be highly consistent on rephrasings of a question. Finally, we plot the percentage of each dataset at a given level of consistency in Fig. 6. The in-distribution dataset, VQAv2, has the highest proportion of questions with 5 agreeing neighbors, with all other consistency levels making up the rest of the dataset. For the out-of-distribution dataset (OKVQA), a substantial proportion of questions ($\approx 40\%$) have five agreeing neighbors, with the rest of the dataset shared roughly equally between the other consistency levels. On the adversarial dataset (AdVQA), the distribution is nearly flat, with equal slices of the dataset at each consistency level. One conclusion from this is that higher consistency is not necessarily rarer, and is highly dependent on how well a model understands the data distribution the question is drawn from.
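The accuracy-by-consistency analysis amounts to grouping questions by their number of agreeing rephrasings and averaging correctness within each group. The sketch below illustrates this on toy data; the function name `accuracy_by_consistency`, the binary correctness labels, and the example values are assumptions, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_consistency(agreeing: List[int],
                            correct: List[bool]) -> Dict[int, float]:
    """Map each consistency level (number of agreeing rephrasings) to accuracy."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for n, ok in zip(agreeing, correct):
        totals[n] += 1
        hits[n] += int(ok)
    # Report levels in increasing order of consistency.
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy data (hypothetical): agreeing-rephrasing counts and correctness
# of the original answer for six questions.
levels = [5, 5, 5, 2, 2, 0]
right = [True, True, False, True, False, False]
acc = accuracy_by_consistency(levels, right)
```

If consistency is informative, accuracy should tend to increase with the consistency level, as it does in this toy example.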

4.3 Selective VQA with Neighborhood Consistency

Figure 8: Risk-coverage curves on slices of test datasets at different levels of consistency. A curve labeled $n \geq k$ shows the risk-coverage tradeoff for a slice of the target dataset where the answers of the model are consistent over at least $k$ rephrasings of an original question. The $n \geq 0$ curve is the baseline. Higher consistency levels identify questions on which a model can achieve lower risk across all datasets.

$f_{BB}$      BLIP                             ALBEF
Risk (%)      10.0  15.0  20.0  30.0  40.0     10.0  15.0  20.0  30.0  40.0
Consistency
$n \geq 0$    0.11  0.18  0.25  0.4   0.61     0.08  0.14  0.21  0.41  0.68
$n \geq 1$    0.13  0.22  0.3   0.47  0.74     0.1   0.18  0.29  0.52  0.83
$n \geq 2$    0.14  0.23  0.33  0.51  0.78     0.1   0.21  0.32  0.59  0.89
$n \geq 3$    0.16  0.26  0.37  0.56  0.84     0.12  0.23  0.37  0.66  0.97
$n \geq 4$    0.18  0.28  0.38  0.59  0.88     0.13  0.26  0.42  0.71  1.0
$n \geq 5$    0.19  0.31  0.44  0.65  0.95     0.11  0.33  0.47  0.8   1.0

Table 1: OK-VQA coverage at specified risk levels, stratified by consistency level. $n \geq k$ indicates that the prediction of the model was consistent over at least $k$ rephrasings of the question.

$f_{BB}$      BLIP                             ALBEF
Risk (%)      20.0  30.0  40.0  50.0  56.0     20.0  30.0  40.0  50.0  60.0
Consistency
$n \geq 0$    0.01  0.09  0.51  0.83  0.98     0.0   0.07  0.24  0.75  1.0
$n \geq 1$    0.01  0.11  0.58  0.9   1.0      0.01  0.09  0.29  0.86  1.0
$n \geq 2$    0.01  0.1   0.61  0.93  1.0      0.01  0.09  0.3   0.89  1.0
$n \geq 3$    0.01  0.1   0.58  0.93  1.0      0.02  0.11  0.3   0.89  1.0
$n \geq 4$    0.01  0.08  0.55  0.92  1.0      0.02  0.11  0.3   0.87  1.0
$n \geq 5$    0.01  0.04  0.53  0.87  1.0      0.04  0.12  0.27  0.84  1.0

Table 2: AdVQA coverage at specified risk levels, stratified by consistency level. $n \geq k$ indicates that the prediction of the model was consistent over at least $k$ rephrasings of the question.

Figure 9: Risk-coverage curves when $f_{VQG}$ (200M parameters) is substantially smaller than $f_{BB}$ (11B). Even in this scenario, $f_{VQG}$ can reliably identify high-risk questions based on consistency.

We turn to the question of whether consistency over rephrasings is useful in the setting of selective visual question answering.

  1. Can consistency select slices of a test dataset which a model understands well (achieves lower risk), or alternatively, identify questions the model doesn’t understand, and should reject (high risk)?

  2. How well does consistency over rephrasings work to identify low / high risk questions in out-of-distribution and adversarial settings?

  3. What happens when the question generator is much smaller than the black-box model?

To analyze how useful consistency is for separating low-risk from high-risk inputs, we use the task of selective visual question answering. In Fig. 8 we plot risk-coverage curves for in-distribution, out-of-distribution, and adversarial visual questions. Each curve shows the risk-coverage tradeoff for questions at a level of consistency. For example, a curve labeled $n \geq 3$ shows the risk-coverage tradeoff for questions on which 3 or more neighbors (rephrasings) were consistent with the original answer. Hence, the $n \geq 0$ curve is a baseline representing the risk-coverage curve for any question, regardless of consistency. If greater consistency over rephrasings is indicative of lower risk (and a higher probability the model knows the answer), we expect the model to achieve lower risk on slices of a dataset on which it is more consistent. On in-distribution visual questions (VQAv2), the model achieves lower risk at equivalent coverage for slices of the dataset that have higher consistency levels. A similar situation holds for the out-of-distribution dataset, OKVQA, and the adversarial dataset, AdVQA. In general, the model is able to achieve lower risk on slices of a dataset on which the consistency of the model over rephrasings is higher. In Tab. 2 and Tab. 1, we show risk-coverage information in tabular form for AdVQA and OK-VQA, respectively, at specific risk levels. Finally, in Fig. 9, we show that our approach works even when there is a large size difference between the black-box model and the question generator.
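The stratified risk-coverage analysis described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, questions are assumed to be answered in order of decreasing confidence, and coverage for an $n \geq k$ curve is assumed to be measured relative to the slice rather than the full dataset.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) when answering the most confident questions first."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order]
    answered = np.arange(1, len(errors) + 1)
    coverage = answered / len(errors)
    risk = np.cumsum(errors) / answered  # mean error over the answered questions
    return coverage, risk


def coverage_at_risk(confidence, correct, max_risk):
    """Largest coverage whose risk does not exceed max_risk (0.0 if there is none)."""
    coverage, risk = risk_coverage_curve(confidence, correct)
    ok = risk <= max_risk
    return float(coverage[ok].max()) if ok.any() else 0.0


def consistency_slice(n_consistent, k):
    """Boolean mask selecting questions with at least k agreeing rephrasings."""
    return np.asarray(n_consistent) >= k


# Toy data (made up for illustration): confidence, correctness, consistency level.
confidence = np.array([0.9, 0.8, 0.7, 0.6])
correct = np.array([1.0, 1.0, 0.0, 1.0])
n_consistent = np.array([5, 3, 0, 4])

mask = consistency_slice(n_consistent, 3)
slice_coverage = coverage_at_risk(confidence[mask], correct[mask], max_risk=0.1)
```

Repeating the last two lines for each $k$ and each target risk yields a table with the same layout as Tabs. 1 and 2.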

5 Related Work

5.1 Selective Prediction

Deep models with a reject option have been studied in the context of unimodal classification and regression [8, 9, 35] for some time, and more recently for the open-ended task of question answering [13]. Deep models with a reject option in the context of visual question answering were first explored by Whitehead et al. [34]. They take the approach of training a selection function using features from the model and a held-out validation set to make the decision of whether to predict or abstain. Dancette et al. [7] take an alternate approach by training models on different dataset slices. The problem of eliciting truthful information from a language model [19] is closely related to selective prediction for VQA. In both settings, the model must avoid providing false information in response to a question.

5.2 Self-Consistency

Jha et al. [12] introduced the idea of using consistency over the predictions of a model to quantify the predictive uncertainty of the model. Their Attribution Based Confidence (ABC) metric is based on using guidance from feature attributions, specifically Integrated Gradients [31], to perturb samples in feature space, then using consistency over the perturbed samples to quantify predictive uncertainty. Shah et al. [28] show that VQA models are not robust to linguistic variations in a sentence by demonstrating inconsistency of the answers of multiple VQA models over human-generated rephrasings of a sentence. Similarly, Selvaraju et al. [27] show that the answers of VQA models to more complex reasoning questions are inconsistent with the answers to simpler perceptual questions whose answers should entail the answer to the reasoning question. We connect these ideas to hypothesize that inconsistency on linguistic variations of a visual question is indicative of a more superficial understanding of the content of the question, and therefore a higher chance of being wrong when answering the question.

5.3 Robustness of VQA Models

VQA models have been shown to lack robustness, and to be severely prone to overfitting on dataset-specific correlations rather than learning to answer questions. The VQA-CP [1] task showed that VQA models often use linguistic priors to answer questions (e.g. the sky is usually blue), rather than looking at the image. Dancette et al. [6] showed that VQA models often use simple rules based on co-occurrences of objects with noun phrases to answer questions. The existence of adversarial visual questions has also been demonstrated by Sheng et al. [29], who used an iterative model-in-the-loop process to allow human annotators to attack state-of-the-art VQA models. While VQA models are approaching human-level performance on the VQAv2 benchmark [10], their performance on more complex VQA tasks such as OK-VQA [23] lags far behind human performance.

6 Conclusion

The capital investment required to train large, powerful models on massive amounts of data means that there is a strong commercial incentive to keep the weights and features of a model private while making the model accessible through an API. Using these models in low-risk situations is not problematic, but using black-box models in situations where mistakes can have serious consequences is dangerous. At the same time, the power of these black-box models makes using them very appealing.

In this paper, we explore a way to judge the reliability of the answer of a black-box visual question answering model by assessing the consistency of the model’s answer over rephrasings of the original question, which we generate dynamically using a VQG model. We show that this is analogous to the technique of consistency over neighborhood samples, which has been used in white-box settings for self-training as well as predictive uncertainty. We conduct experiments on in-distribution, out-of-distribution, and adversarial settings, and show that consistency over rephrasings is correlated with model accuracy, and predictions of a model that are highly consistent over rephrasings are more likely to be correct. Hence, consistency over rephrasings constitutes an effective first step for using a black-box visual question answering model reliably by identifying queries that a black-box model may not know the answer to.

References

  • Agrawal et al. [2017] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4971–4980, 2017.
  • Brando et al. [2020] Axel Brando, Damià Torres, Jose A. Rodríguez-Serrano, and Jordi Vitrià. Building uncertainty models on top of black-box predictive apis. IEEE Access, 8:121344–121356, 2020.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.
  • Chomsky [1975] N. Chomsky. The Logical Structure of Linguistic Theory. Springer, 1975.
  • Chung et al. [2022] Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. ArXiv, abs/2210.11416, 2022.
  • Dancette et al. [2021] Corentin Dancette, Rémi Cadène, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1554–1563, 2021.
  • Dancette et al. [2023] Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, and Marcus Rohrbach. Improving selective visual question answering by learning from your peers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24049–24059, 2023.
  • Geifman and El-Yaniv [2017] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In NIPS, 2017.
  • Geifman and El-Yaniv [2019] Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, 2019.
  • Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Holtzman et al. [2019] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2019.
  • Jha et al. [2019] Susmit Jha, Sunny Raj, Steven Fernandes, Sumit K Jha, Somesh Jha, Brian Jalaian, Gunjan Verma, and Ananthram Swami. Attribution-Based Confidence Metric For Deep Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  • Kamath et al. [2020] Amita Kamath, Robin Jia, and Percy Liang. Selective question answering under domain shift. In Annual Meeting of the Association for Computational Linguistics, 2020.
  • Krishna et al. [2016] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2016.
  • Lakshminarayanan et al. [2016] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2016.
  • Li et al. [2021] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Neural Information Processing Systems, 2021.
  • Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
  • Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023.
  • Lin et al. [2021] Stephanie C. Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Annual Meeting of the Association for Computational Linguistics, 2021.
  • Liu et al. [2021] Hong Liu, Jianmin Wang, and Mingsheng Long. Cycle self-training for domain adaptation. In Neural Information Processing Systems, 2021.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101, 2017.
  • Lu et al. [2019] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Neural Information Processing Systems, 2019.
  • Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3190–3199, 2019.
  • Mozannar and Sontag [2020] Hussein Mozannar and David A. Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, 2020.
  • Niculescu-Mizil and Caruana [2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, page 625–632, New York, NY, USA, 2005. Association for Computing Machinery.
  • Nixon et al. [2019] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. ArXiv, abs/1904.01685, 2019.
  • Selvaraju et al. [2020] Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10000–10008, 2020.
  • Shah et al. [2019] Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-consistency for robust visual question answering. In 2019 Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
  • Sheng et al. [2021] Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, and Douwe Kiela. Human-adversarial visual question answering. 2021.
  • Sun et al. [2022] Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for language-model-as-a-service. ArXiv, abs/2201.03514, 2022.
  • Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. ArXiv, abs/1703.01365, 2017.
  • van Amersfoort et al. [2020] Joost van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. 2020.
  • Wang et al. [2019] Yulin Wang, Xuran Pan, Shiji Song, Hong Zhang, Cheng Wu, and Gao Huang. Implicit semantic data augmentation for deep networks. In Neural Information Processing Systems, 2019.
  • Whitehead et al. [2022] Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, and Marcus Rohrbach. Reliable visual question answering: Abstain rather than answer incorrectly. In European Conference on Computer Vision, 2022.
  • Ziyin et al. [2019] Liu Ziyin, Zhikang T. Wang, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory. ArXiv, abs/1907.00208, 2019.

Appendix A Detailed Risk-Coverage Data

In Tabs. 3, 4, 5 and 6, we show more granular risk-coverage data across all three evaluated datasets and both black-box models.

$f_{BB}$ = BLIP
Risk (%)      0.0   5.0   10.0  15.0  20.0  25.0  30.0  35.0  40.0  45.0
$n \geq 0$    0.0   0.0   0.11  0.18  0.25  0.32  0.4   0.49  0.61  0.77
$n \geq 1$    0.0   0.0   0.13  0.22  0.3   0.38  0.47  0.59  0.74  0.89
$n \geq 2$    0.0   0.0   0.14  0.23  0.33  0.42  0.51  0.63  0.78  0.94
$n \geq 3$    0.0   0.0   0.16  0.26  0.37  0.45  0.56  0.68  0.84  1.0
$n \geq 4$    0.0   0.0   0.18  0.28  0.38  0.48  0.59  0.74  0.88  1.0
$n \geq 5$    0.0   0.0   0.19  0.31  0.44  0.54  0.65  0.8   0.95  1.0

$f_{BB}$ = ALBEF
Risk (%)      0.0   5.0   10.0  15.0  20.0  25.0  30.0  35.0  40.0  45.0
$n \geq 0$    0.02  0.03  0.08  0.14  0.21  0.3   0.41  0.53  0.68  0.85
$n \geq 1$    0.02  0.04  0.1   0.18  0.29  0.4   0.52  0.66  0.83  0.97
$n \geq 2$    0.03  0.04  0.1   0.21  0.32  0.45  0.59  0.73  0.89  1.0
$n \geq 3$    0.03  0.05  0.12  0.23  0.37  0.51  0.66  0.83  0.97  1.0
$n \geq 4$    0.04  0.06  0.13  0.26  0.42  0.55  0.71  0.88  1.0   1.0
$n \geq 5$    0.04  0.06  0.11  0.33  0.47  0.63  0.8   0.93  1.0   1.0

Table 3: More granular risk-coverage data for OK-VQA.

$f_{BB}$ = BLIP
Risk (%)      20.0  25.0  30.0  35.0  40.0  45.0  50.0  55.0  56.0
$n \geq 0$    0.01  0.04  0.09  0.23  0.51  0.69  0.83  0.95  0.98
$n \geq 1$    0.01  0.04  0.11  0.27  0.58  0.76  0.9   1.0   1.0
$n \geq 2$    0.01  0.04  0.1   0.25  0.61  0.79  0.93  1.0   1.0
$n \geq 3$    0.01  0.04  0.1   0.25  0.58  0.8   0.93  1.0   1.0
$n \geq 4$    0.01  0.02  0.08  0.24  0.55  0.77  0.92  1.0   1.0
$n \geq 5$    0.01  0.01  0.04  0.27  0.53  0.72  0.87  1.0   1.0

$f_{BB}$ = ALBEF
Risk (%)      20.0  25.0  30.0  35.0  40.0  45.0  50.0  55.0  60.0
$n \geq 0$    0.0   0.04  0.07  0.12  0.24  0.46  0.75  0.92  1.0
$n \geq 1$    0.01  0.05  0.09  0.15  0.29  0.55  0.86  1.0   1.0
$n \geq 2$    0.01  0.05  0.09  0.15  0.3   0.59  0.89  1.0   1.0
$n \geq 3$    0.02  0.06  0.11  0.17  0.3   0.6   0.89  1.0   1.0
$n \geq 4$    0.02  0.06  0.11  0.16  0.3   0.6   0.87  1.0   1.0
$n \geq 5$    0.04  0.07  0.12  0.18  0.27  0.53  0.84  1.0   1.0

Table 4: More granular risk-coverage data for AdVQA.

$f_{BB}$ = BLIP
Risk (%)      0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0   10.0  11.0  12.0  13.0  14.0
$n \geq 0$    0.01  0.55  0.63  0.69  0.74  0.77  0.8   0.82  0.85  0.88  0.9   0.91  0.93  0.95  0.97
$n \geq 1$    0.01  0.6   0.69  0.76  0.8   0.83  0.86  0.9   0.92  0.94  0.96  0.98  0.99  1.0   1.0
$n \geq 2$    0.01  0.63  0.72  0.78  0.83  0.86  0.89  0.92  0.94  0.96  0.98  1.0   1.0   1.0   1.0
$n \geq 3$    0.01  0.66  0.75  0.81  0.85  0.88  0.92  0.94  0.96  0.98  1.0   1.0   1.0   1.0   1.0
$n \geq 4$    0.01  0.68  0.77  0.83  0.87  0.91  0.93  0.96  0.98  0.99  1.0   1.0   1.0   1.0   1.0
$n \geq 5$    0.01  0.7   0.79  0.84  0.88  0.92  0.94  0.96  0.99  1.0   1.0   1.0   1.0   1.0   1.0

Table 5: Granular risk-coverage data for VQAv2 with BLIP as $f_{BB}$.

$f_{BB}$ = ALBEF
Risk (%)      0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0   10.0  11.0  12.0  13.0  14.0
$n \geq 0$    0.01  0.55  0.63  0.69  0.74  0.77  0.8   0.82  0.85  0.88  0.9   0.91  0.93  0.95  0.97
$n \geq 1$    0.01  0.6   0.69  0.76  0.8   0.83  0.86  0.9   0.92  0.94  0.96  0.98  0.99  1.0   1.0
$n \geq 2$    0.01  0.63  0.72  0.78  0.83  0.86  0.89  0.92  0.94  0.96  0.98  1.0   1.0   1.0   1.0
$n \geq 3$    0.01  0.66  0.75  0.81  0.85  0.88  0.92  0.94  0.96  0.98  1.0   1.0   1.0   1.0   1.0
$n \geq 4$    0.01  0.68  0.77  0.83  0.87  0.91  0.93  0.96  0.98  0.99  1.0   1.0   1.0   1.0   1.0
$n \geq 5$    0.01  0.7   0.79  0.84  0.88  0.92  0.94  0.96  0.99  1.0   1.0   1.0   1.0   1.0   1.0

Table 6: Granular risk-coverage data for VQAv2 with ALBEF as $f_{BB}$.

Appendix B Inference Details

For both BLIP and ALBEF, we follow the original inference procedures. Both models have an encoder-decoder architecture and VQA is treated as a text-to-text task. We use the rank-classification approach [3] to allow the autoregressive decoder of the VLM to predict an answer for a visual question. Concretely, let $\mathcal{A}=\{a_1,a_2,a_3,\ldots,a_k\}$ be a list of length $k$ for a dataset consisting of the most frequent ground-truth answers. These answer lists are standardized and distributed by the authors of the datasets themselves. We use the standard answer lists for each dataset. Next, let $v,q$ be a visual question pair and let $f_{BB}$ be a VQA model. Recall that $f_{BB}$ is a language model defining a distribution $p(a|q,v)$, and is thus able to assign a score to each $a_i\in\mathcal{A}$. We take the highest probability $a_k$

$$\underset{a_k\in\mathcal{A}}{\operatorname{arg\,max}}~f_{BB}(v,q,a_k)=\underset{a_k\in\mathcal{A}}{\operatorname{arg\,max}}~p(a_k|v,q) \tag{2}$$

as the predicted answer for a question. This is effectively asking the model to rank each of the possible answer candidates, turning the open-ended VQA task into a very large multiple-choice problem. Note that the highest-probability $a_k\in\mathcal{A}$ is not necessarily the answer that would be produced by $f_{BB}\sim p(a|v,q)$ in an unconstrained setting such as stochastic decoding. However, for consistency with previous work, we use the rank-classification approach.
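The rank-classification step can be sketched in a few lines. This is an illustration under stated assumptions rather than the BLIP or ALBEF inference code: `score_answer` is a hypothetical callable standing in for the model's score $p(a|v,q)$ (or its log) for a fixed image and question, and the toy scores are made up.

```python
from typing import Callable, Sequence, Tuple


def rank_classify(score_answer: Callable[[str], float],
                  answer_list: Sequence[str]) -> Tuple[str, float]:
    """Score every candidate in the answer list and return the best one with its score."""
    scored = [(score_answer(a), a) for a in answer_list]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Hypothetical log-probabilities for one (image, question) pair.
toy_log_probs = {"blue": -0.5, "green": -1.2, "red": -2.0}
answer, score = rank_classify(toy_log_probs.__getitem__, list(toy_log_probs))
```

Because only candidates in the answer list are scored, the returned answer is always a member of $\mathcal{A}$, which is the sense in which the open-ended task becomes a multiple-choice problem.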

Visual question answering is thus treated differently when using large autoregressive vision-language models compared to non-autoregressive models. In traditional approaches, VQA is treated as a classification task, and a standard approach used in older, non-autoregressive vision-language models such as ViLBERT [22] is to train an MLP with a cross-entropy loss with each of the possible answers as a class.

Appendix C Hallucinations

Figure 10: The rephrasing generator $f_{VQG}$ can hallucinate questions that imagine things not present in the context of the image.

We describe a peculiar mode of the rephrasing generator $f_{VQG}$ in this section. When an answer is out-of-context for a given image, the rephrasing generator $f_{VQG}$ will generate questions premised on the out-of-context answer. For example, in Fig. 10, we show that if an out-of-context answer such as “unicorn” is provided to $f_{VQG}$ for cycle-consistent rephrasing generation on the surfing image, $f_{VQG}$ will generate questions such as “what animals are in the water”, assuming that there are unicorns in the water, though this is implausible. A more correct question would have been something such as “what animals are not present?” A likely reason $f_{VQG}$ cannot handle these cases well is that $f_{VQG}$ is trained on a VQA dataset to approximate $p(q|v,a)$, and traditional VQA datasets have very few counterfactual questions such as these.

This is not specific to the $f_{VQG}$ used in our framework, and should apply to any question generator trained in this manner. It does reveal that even large VLMs pretrained on a massive amount of image-text pairs have a superficial understanding of counterfactuals, and possibly other properties of language.

Appendix D Are the rephrasings really rephrasings?

As visible in Fig. 4, some of the rephrasings are not literally rephrasings of the original question. It may be more correct to call the rephrasings pseudo-rephrasings, in the same way that generated labels are referred to as pseudolabels in the semi-supervised learning literature [20]. However, the pseudo-rephrasings seem to be good enough that inconsistency over the pseudo-rephrasings indicates potentially unreliable predictions from $f_{BB}$.

Figure 11: See App. D for an explanation of the figure.

Why does this work? Decompose $f_{BB}$ as $f_{BB}=f_D(f_E(v,q))$, where $f_E(v,q)=\mathbf{z}$ is the encoder that maps a visual question pair $v,q$ to a dense representation $\mathbf{z}$, and $f_D(\mathbf{z})=a$ is the decoder that maps the dense representation $\mathbf{z}$ to an answer. For two rephrasings $q_\alpha,q_\beta$ of a question $q$, the model will be consistent over the rephrasings if all the rephrasings are embedded onto a subset of the embedding space to which $f_D$ assigns the same answer $a$. This is the situation we depict on the left side of Fig. 11.

On the other hand, if $q_\alpha$ and $q_\beta$ are embedded into parts of the embedding space to which $f_D$ assigns different answers, the answers will not be consistent (right side of Fig. 11). Thus, whether $q_\alpha,q_\beta$ are linguistically valid rephrasings does not matter so much as whether $q_\alpha,q_\beta$ should technically have the same answer as the original question $q$. Of course, it is true that the answer to a linguistically valid rephrasing should be the same as the answer to the question being rephrased. However, for any question, there are many other questions that have the same answer but are not rephrasings of the original question.
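One compact way to state this condition, using the encoder-decoder decomposition from this appendix, is given below. The region $R_a$ is notation introduced here purely for illustration and is not used elsewhere in the paper.

```latex
R_a = \{\mathbf{z} : f_D(\mathbf{z}) = a\}, \qquad
f_{BB}(v, q_\alpha) = f_{BB}(v, q_\beta) = a
\iff
f_E(v, q_\alpha) \in R_a \ \text{and}\ f_E(v, q_\beta) \in R_a .
```

Read this way, consistency depends only on whether both questions land in the same decision region of $f_D$, not on whether they are paraphrases of each other.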

Appendix E Calibration

The confidence scores in Figs. 2 and 5 are the raw scores from the logits of the VQA model, in this case BLIP. Recall that the models under consideration are autoregressive models that approximate a probability distribution $p(a|v,q)$, where $a$ can take on an infinite number of values, since the model must be able to assign a score to any natural language sentence. The raw distribution of confidence scores is clearly truncated in the sense that all scores appear to lie in the interval $[0,0.07]$. We apply temperature scaling [25] to assess how well the confidence scores are calibrated. In temperature scaling, the logits of a model are multiplied by a parameter $\tau$. This is rank-preserving, and yields confidence scores that are more directly interpretable. In our case, we can use it to rescale the model logits into the interval $[0,1]$ and analyze the Adaptive Calibration Error [26] of the model’s predictions. We grid search the $\tau$ that minimizes the Adaptive ECE directly on the model predictions, and show the results in Tabs. 7, 8 and 9. The Adaptive Calibration Error is lowest on the in-distribution dataset, highest on the adversarial dataset, and second highest on the out-of-distribution dataset. Notably, the model is systematically overconfident on adversarial samples, but not on out-of-distribution samples. This suggests that calibration is not the only problem in selective prediction.
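The grid search over $\tau$ can be sketched as below. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, scaling is assumed to mean multiplying the raw score by $\tau$ and clipping to $[0,1]$, and the adaptive calibration error is simplified to the mean absolute gap between accuracy and confidence over equal-mass bins.

```python
import numpy as np


def adaptive_calibration_error(conf, correct, n_bins=10):
    """Mean |accuracy - confidence| over equal-mass bins sorted by confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf, kind="stable")
    conf_bins = np.array_split(conf[order], n_bins)
    acc_bins = np.array_split(correct[order], n_bins)
    gaps = [abs(a.mean() - c.mean()) for c, a in zip(conf_bins, acc_bins) if len(c) > 0]
    return float(np.mean(gaps))


def fit_temperature(raw_conf, correct, taus, n_bins=10):
    """Grid search for the multiplier tau that minimizes the calibration error."""
    raw_conf = np.asarray(raw_conf, dtype=float)
    best_tau, best_err = None, float("inf")
    for tau in taus:
        scaled = np.clip(tau * raw_conf, 0.0, 1.0)  # rank-preserving until clipping
        err = adaptive_calibration_error(scaled, correct, n_bins)
        if err < best_err:
            best_tau, best_err = float(tau), err
    return best_tau, best_err
```

A call such as `fit_temperature(raw_conf, correct, np.arange(1.0, 40.5, 0.5))` returns the selected multiplier and its error; the per-bin accuracy and scaled confidence then correspond to the rows of a table like Tabs. 7 to 9.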

Percentile   Raw Confidence   Accuracy   Scaled Confidence   Error
0            0.020            0.477      0.390               0.087
10           0.022            0.507      0.430               0.077
20           0.024            0.540      0.473               0.067
30           0.026            0.573      0.522               0.051
40           0.029            0.604      0.577               0.026
50           0.032            0.647      0.643               0.004
60           0.036            0.699      0.723               0.024
70           0.041            0.766      0.819               0.053
80           0.047            0.831      0.934               0.104
90           0.054            0.909      1.000               0.091
Table 7: Calibration of BLIP on OK-VQA. For scaling, a temperature of $19.9$ is used.

Percentile   Raw Confidence   Accuracy   Scaled Confidence   Error
0            0.042            0.837      0.841               0.004
10           0.047            0.898      0.926               0.028
20           0.051            0.938      1.000               0.062
30           0.055            0.968      1.000               0.032
40           0.058            0.984      1.000               0.016
50           0.060            0.994      1.000               0.006
60           0.062            0.998      1.000               0.002
70           0.064            0.999      1.000               0.001
80           0.065            1.000      1.000               0.000
90           0.065            0.999      1.000               0.001
Table 8: Calibration of BLIP on VQAv2. For scaling, a temperature of $19.3$ is used.

Percentile   Raw Confidence   Accuracy   Scaled Confidence   Error
0            0.032            0.430      0.637               0.206
10           0.035            0.472      0.703               0.231
20           0.039            0.510      0.769               0.259
30           0.042            0.547      0.834               0.287
40           0.045            0.580      0.897               0.317
50           0.048            0.601      0.956               0.355
60           0.051            0.618      1.000               0.382
70           0.055            0.636      1.000               0.364
80           0.058            0.655      1.000               0.345
90           0.062            0.693      1.000               0.307
Table 9: Calibration of BLIP on AdVQA. For scaling, a temperature of $12.5$ is used.

Appendix F More Rephrasings Examples

We show more examples of generated rephrasings in Fig. 12.

Figure 12: More examples of generated rephrasings.