
CPL: Counterfactual Prompt Learning for Vision and Language Models

Xuehai He1 Diji Yang1 Weixi Feng2 Tsu-Jui Fu2 Arjun Akula3 Varun Jampani3
Pradyumna Narayana3 Sugato Basu3 William Yang Wang2 Xin Eric Wang1
1 UC Santa Cruz, 2 UC Santa Barbara, 3 Google
{xhe89,dyang39,xwang366}@ucsc.edu
{weixifeng,tsu-juifu,william}@ucsb.edu
{arjunakula,varunjampani,pradyn,sugato}@google.com

Abstract

Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically-similar positive and negative samples that causes concept change, and learns a more generalizable prompt representation from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains superior few-shot performance on different vision and language tasks compared with previous prompt tuning methods on CLIP. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements across three few-shot scenarios on unseen test sets, respectively.1

Figure 1: A conceptual overview of counterfactual prompt learning. Prompt A reads "A large long train on a steel track"; Prompt B reads "A large long train on a steel track near a barn". What if we add a barn to image A (or remove the barn from image B)? Will the prompt be changed? CPL constructs counterfactuals by identifying the non-spurious feature change that causally causes the prompt change. In this case, the "barn" feature is the essential cause of the difference between Prompt A and Prompt B.

1 Introduction

Pre-trained vision and language foundation models (Radford et al., 2021; Jia et al., 2021) have shown encouraging results toward open-domain visual-concept matching. Benefiting from prompt engineering (Song et al., 2022a; Liu et al., 2022), where free-form text prompts are designed for specific task goals, those foundation models can be easily transferred to a wide array of tasks under zero-shot and few-shot scenarios, including image classification (Deng et al., 2009), visual question answering (Shen et al., 2021), image-text retrieval (Jia et al., 2021), etc. But manually constructing prompts for vision and language models such as CLIP is a tedious, time-consuming process, which usually requires prior domain knowledge and leads to suboptimal solutions.

Prompt tuning (Lester et al., 2021), on the other hand, liberates us from manual prompt engineering and automates this process. Prompt tuning methods (Ju et al., 2021; Lin et al., 2014; Zhou et al., 2022) have been proposed to effectively transfer CLIP to image recognition tasks after tuning a learnable prompt with a few examples of the classes. However, those methods purely conduct empirical risk minimization (ERM) and optimize for predictive accuracy, which often produces spurious, inefficient, or entangled representations (Wang and Jordan, 2021). Therefore, the generalization ability of existing prompt tuning methods for vision and language models is limited, and they often fail to transfer well to unseen classes or concepts.

1 Our code is released at https://github.com/eric-ai-lab/CPL.
For example, the image classification performance of the SOTA method CoCoOp (Zhou et al., 2022) is similar or even degrades on unseen classes when compared with zero-shot CLIP.

Learning non-spurious representations for better generalization requires disentangling the features that causally determine the prompts. One solution is counterfactual reasoning. A counterfactual ("counter to the facts") is a concept that describes the human capacity to learn from limited prior experiences by imagining the outcome of an alternative action that could have been taken. So we can perform counterfactual intervention by asking "what if ..." questions in prompt learning. For example, as shown in Figure 1, a change in the visual feature of the barn would cause the label to change (if we view the two prompts as two labels).

Therefore, we introduce a new causality-based approach, Counterfactual Prompt Learning (CPL), for non-spurious and efficient prompt learning. First, we introduce a text-based negative sampling strategy to discover the most semantically-similar negative sample based on text similarity. Then we generate a counterfactual example by identifying the minimal non-spurious feature change between semantically-similar positive and negative samples that causally causes the prompt change. Finally, we adopt contrastive learning in a joint optimization framework (with counterfactual construction) to tune the learnable prompts using both factual and counterfactual examples. The causally fine-tuned prompts eventually guide vision-and-language foundation models to distinguish images from unseen concepts, thereby improving the generalization ability of prompt learning.

We extensively evaluate CPL using seven standard datasets for image classification, two for image-text retrieval, and one for visual question answering (VQA). We show that CPL outperforms the baseline on all three tasks: on image classification, our method achieves a 3.55% average relative improvement on unseen classes across the seven datasets in terms of accuracy; on image-text retrieval, our method improves the most (4.09% relative improvement in terms of Recall@1) when using 0.5% of the total training instances on MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015); on VQA, we gain up to 25.08% relative improvement on the VQAv2 (Goyal et al., 2017a) dataset.

Our main contributions are summarized below:

• We introduce Counterfactual Prompt Learning (CPL), a task-agnostic, causality-based prompt learning method to effectively transfer CLIP to unseen concepts for different downstream tasks.

• We propose a text-based negative sampling strategy, where we compute BERTScore (Zhang et al., 2019) between text prompts and, based on it, sample the most semantically-similar negative images.

• We introduce an optimization framework that simultaneously constructs counterfactuals by identifying the minimal non-spurious feature change and learns the generalized prompt representation from both factual and counterfactual examples.

• We conduct extensive experiments on image classification, image-text retrieval, and visual question answering, and validate the superiority of CPL over existing prompt tuning methods in transferring to unseen concepts.

2 Related Work

Vision-and-Language Models. Vision-and-language models pre-trained on large-scale image-text pairs have demonstrated great potential in multimodal representation learning (Jia et al., 2021; Yao et al., 2021; Yuan et al., 2021). Among them, the representative CLIP (Radford et al., 2021) benefits from 400M curated data and defines various prompt templates to carry out zero-shot image classification. However, those prompts still require hand-crafted designs. In this work, we automatically learn task-agnostic and task-relevant prompts without human priors. In addition, by considering counterfactual examples, we can further improve various vision-and-language tasks, including visual question answering and image-text retrieval, in the few-shot scenario.

Prompt Tuning. Many works focus on learning from discrete natural language prompts, e.g., AutoPrompt (Shin et al., 2020) elicits knowledge from language models with automatically generated discrete prompts. Lately, many other works (Zhou et al., 2021, 2022) directly tune prompts in continuous vector forms. Guo et al. (2021) introduce Q-Learning to optimize the soft prompt. P-Tuning v2 (Liu et al., 2021) shows that continuous prompt tuning achieves the same performance as fine-tuning in various settings.
Prompt tuning has also received great interest in the computer vision domain. For example, CoOp proposes a continuous prompt optimization strategy to avoid prompt design. CoCoOp (Zhou et al., 2022) extends CoOp by further learning an instance-conditional network to generate an input-conditional token for each image. However, these methods trained with empirical risk minimization (ERM) may learn to rely on correlations between class labels and spurious attributes by minimizing average training error (Zhang et al., 2022). They usually learn spurious, inefficient, and entangled representations, lacking generalization ability to unseen scenarios.

Counterfactual Reasoning. A number of recent works have investigated generating counterfactual images (Besserve et al., 2020) or counterfactual text in specific language domains, e.g., court view (Wu et al., 2020), dialogue generation (Zhu et al., 2020), natural language inference (Kaushik et al., 2019; Gokhale et al., 2021), and named entity recognition (Zeng et al., 2020). On the vision end, Zhang et al. (2021) propose to add intervention over the changed domain on images during the data-generation process and steer the generative model to produce counterfactual features that augment the training process. Agarwal et al. (2020) use automated semantic image manipulations to generate synthetic data that makes models more robust against spurious correlations. On the vision-and-language end, Chen et al. (2020) propose to generate counterfactual VQA samples by masking critical objects in images or words in questions to augment the training data, gaining a large improvement on the VQAv2 dataset. Gokhale et al. (2020) propose template-based counterfactual image augmentation methods. Fu et al. (2020) propose a novel training strategy for vision-and-language navigation that dynamically generates counterfactuals to account for unseen scenarios. To the best of our knowledge, CPL is the first to apply counterfactual generation to prompt-based few-shot learning for vision and language models.

Few-shot Learning. Recently, several few-shot efficient learners on vision (He et al., 2022) and language (Brown et al., 2020) tasks have been proposed, including CLIP. GPT (Brown et al., 2020), as a strong few-shot learner, is capable of performing a new language task by learning from only a few training instances. Frozen (Tsimpoukelli et al., 2021) is developed on top of GPT and made into a multimodal few-shot learner by expanding the soft prompting to include a collection of images and text; it demonstrates strong few-shot capabilities on visual question answering and image classification tasks. Similarly, CoCa (Yu et al., 2022) is pre-trained from scratch and end-to-end using both web-scale data and annotated images by treating all labels as text, thereby unifying supervision for representation learning through natural language. It achieves state-of-the-art performance with few-shot transfer or minimal task-specific adaptation on a wide range of downstream vision-and-language tasks, including visual recognition, multimodal understanding, crossmodal retrieval, and image captioning. SimVLM (Wang et al., 2021b) is pre-trained with prefix language modeling on weakly supervised datasets and exhibits its efficacy on few-shot captioning tasks. Even though all these models can already achieve improvements on some few-shot tasks, how to exploit their few-shot reasoning ability using limited training examples still deserves effort. In this work, we study this direction through the lens of prompt learning, using CLIP as a starting point.

3 Counterfactual Prompt Learning

3.1 Problem Formulation

Our goal is to learn a generalizable prompt representation with limited data. The prompt in CLIP is divided into two parts: a task-agnostic prompt p and a task-relevant prompt h. The task-agnostic prompt p is learned end-to-end automatically. The set of task-relevant prompts H = {h_0, h_1, ..., h_C} is mapped from the label space Y with some predefined rules hinging on the task type, where C is the total number of classes. The final prompt t_c is the concatenation of the task-agnostic prompt and the task-relevant prompt fed into CLIP's text encoder: t_c = [p, h_c].

Existing works on this problem (Zhou et al., 2021, 2022) propose to first extract the visual feature v of each input image by feeding it into CLIP's vision encoder F; text embeddings are generated by feeding {t_c}_{c=1}^{C} into CLIP's text encoder G. The probability of the i-th class is computed as

$$ p(t_i \mid x) = \frac{\exp\left(\langle G(t_i), v \rangle / \tau\right)}{\sum_{c=1}^{C} \exp\left(\langle G(t_c), v \rangle / \tau\right)}, \qquad (1) $$

where τ is the temperature parameter and ⟨·, ·⟩ denotes cosine similarity. The cross-entropy loss is then minimized, and the gradients can be back-propagated via the text encoder G to update the learnable prompt representation p. During training, the weights of CLIP always remain frozen. During inference, Eq. 1 is used to compute the probability of each class.
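To make Eq. 1 concrete, the sketch below computes the class distribution from pre-extracted CLIP features. It is a minimal illustration rather than the authors' released code; the feature dimension, the batch-free shapes, and the temperature value are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def class_probabilities(v, text_embeds, tau=0.01):
    """Eq. 1: softmax over cosine similarities between the image feature v
    and the prompt embeddings G(t_c), scaled by the temperature tau.

    v:           (d,) visual feature from the frozen vision encoder F.
    text_embeds: (C, d) embeddings of the C prompts t_c = [p, h_c] from G.
    """
    v = F.normalize(v, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = text_embeds @ v / tau      # cosine similarity / tau, shape (C,)
    return logits.softmax(dim=-1)       # p(t_i | x) for every class i

# Toy call with random features; d = 512 and C = 10 are illustrative only.
probs = class_probabilities(torch.randn(512), torch.randn(10, 512))
```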
Figure 2: The counterfactual prompt learning framework. We freeze the vision encoder F and the text encoder G, and only optimize the task-agnostic prompts and the instance-conditioned net M. Please refer to Section 3.2 for the explanation.

3.2 Method Overview

An overview of the Counterfactual Prompt Learning (CPL) framework is shown in Figure 2. For pre-processing, we construct task-relevant prompts for all training samples. The goal is to optimize the task-agnostic prompt p.2 During training, given a positive image-prompt pair, we first perform text-based negative sampling to find the most semantically-similar negative sample based on text similarity scores. Then we adopt a controllable counterfactual generation strategy to construct the counterfactual from the positive and negative samples in the visual feature space. Finally, we perform contrastive learning using both the generated counterfactual image features and the factual image features in a joint optimization framework to fine-tune the task-agnostic prompt p, allowing the model to understand non-spurious semantic information and learn generalized prompt representations.

2 Together with the instance-conditional net M introduced in Zhou et al. (2022). For simplicity, we only use p hereafter, as p and M are always optimized together.

3.3 Controllable Counterfactual Generation

By viewing the image feature v as a potential cause of the label, a non-spurious feature shall be a sufficient cause of the label. We would therefore like to generate counterfactuals by identifying the minimal non-spurious feature change that causes the label change. The counterfactual construction process is illustrated in Figure 3. Given positive image features v and negative image features v−, we can generate negative counterfactual image features v′ as

$$ v' = (1 - u) \circ v + u \circ v^{-}, \qquad (2) $$

where ◦ is element-wise multiplication and u is the parameter controlling the amount of negative image feature that replaces the positive image feature. The negative image features are extracted from images that are similar to the original image at the semantic level, which we will introduce in Section 3.4.
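The mixing step in Eq. 2 can be sketched as follows. This is an illustrative reading of the equation, not the released implementation; in particular, treating u as a per-dimension gate clamped to [0, 1] is an assumption.

```python
import torch

def mix_counterfactual(v, v_neg, u):
    """Eq. 2: v' = (1 - u) * v + u * v_neg, applied element-wise.

    v, v_neg: (batch, d) factual and negative image features.
    u:        (d,) learnable gate deciding, per feature dimension, how much of
              the negative feature replaces the positive one. Keeping u sparse
              (small L1 norm) keeps the counterfactual change minimal.
    """
    u = u.clamp(0.0, 1.0)
    return (1.0 - u) * v + u * v_neg

# Illustrative usage; the gate is assumed to be initialized near zero.
v, v_neg = torch.randn(4, 512), torch.randn(4, 512)
u = torch.zeros(512, requires_grad=True)
v_prime = mix_counterfactual(v, v_neg, u)
```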
To capture the non-spuriousness, we would like to construct counterfactuals by replacing only essential non-spurious features. This can be achieved by minimizing the amount of feature change u* to the original image that can causally incur the label change:

$$ \min_{u} \ \lVert u^{*} \rVert_{1} \quad \text{s.t.} \quad u^{*} = \arg\max_{u} D_{c^{-}}(v'), \qquad (3) $$

where D_{c^-}(v') is the discriminator's probability of predicting the negative label c^- from the counterfactual feature v' (see Figure 3).

Figure 3: Counterfactual generation process. v and c are the positive image feature and label, while v− and c− are the negative image feature and label. ◦ is element-wise multiplication. By mixing v and v−, the counterfactual image feature v′ is predicted as the negative label c− by the discriminator D. u is minimized so that a minimal change to the positive image feature is captured to causally change the label.

Given the factual and counterfactual features v and v′, we aim to learn the prompt that helps CLIP better align visual features v and textual features G(t) with the same semantic meaning. This can be achieved by maximizing the mutual information (MI) between v and G(t). Therefore, by minimizing the InfoNCE loss (Hjelm et al., 2018), we can maximize the lower bound on MI(v, G(t)). To this end, we define the contrastive objective based on the InfoNCE estimator following Khosla et al. (2020):

$$ \mathcal{L}_{CL}(p, u^{*}) = -\log \frac{e^{S(v, G(t))/\tau}}{e^{S(v, G(t))/\tau} + e^{S(v', G(t))/\tau}}, \qquad (4) $$

where S(·, ·) is normally the cosine similarity function and τ is the temperature value.

3.4 Text-based Negative Sampling

We now discuss how to perform negative sampling for constructing counterfactual features. As suggested in Robinson et al. (2020), good negative samples have different labels and are difficult to distinguish from an anchor point, while their semantic representations are close (Suresh and Ong, 2021). Since not all negative samples can serve as useful negatives (Chuang et al., 2020), indiscriminate use of these data may harm model robustness and algorithm efficiency. Therefore, during training, in each batch we only utilize the most semantically-similar sample to generate counterfactual image features; other image samples are filtered out.

Semantic concepts may be highly complex in visual representations, so it is hard to directly measure semantic similarity in the visual space, whereas language is more expressive and naturally preserves semantic meaning. Therefore, we propose a text-based negative sampling method. We first measure the text similarity between prompts with BERTScore (Zhang et al., 2019), which computes pairwise cosine similarity between reference sentences and candidate sentences using BERT contextual embeddings (Devlin et al., 2019). We compute a similarity matrix with the value of each element being

$$ \mathrm{sim}(i, j) = \mathrm{BERTScore}(h_i, h_j). \qquad (5) $$

Denote by B the collection of sampled instances. During training, each prompt h_c ∈ B (1 ≤ c ≤ C, where C is the number of sampled instances) can be treated as a query. Given a query prompt h_q, its most semantically similar prompt (the one with the highest BERTScore) h_k is retrieved from B. Then we use the CLIP vision encoder to obtain the features of the corresponding positive and negative images v and v−.
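One possible implementation of the text-based sampling in Eq. 5 is sketched below, using the `bert-score` package as an assumed dependency; the batch size and the looping strategy are illustrative rather than taken from the paper.

```python
import torch
from bert_score import score  # pip install bert-score (assumed dependency)

def hardest_negative_indices(prompts):
    """For each prompt in the sampled batch B, return the index of its most
    semantically-similar other prompt (highest BERTScore F1, Eq. 5)."""
    n = len(prompts)
    sim = torch.zeros(n, n)
    for i in range(n):
        # BERTScore between prompt i (candidate) and every prompt in B (references).
        _, _, f1 = score([prompts[i]] * n, prompts, lang="en", verbose=False)
        sim[i] = f1
    sim.fill_diagonal_(-1.0)      # a prompt cannot be its own negative
    return sim.argmax(dim=1)      # index k of the hardest negative per query q

# The returned indices select which image features serve as v- for each query.
```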
3.5 Joint Optimization

In addition to the contrastive learning loss introduced in Eq. 4, we also adopt the standard cross-entropy loss for training:

$$ \mathcal{L}_{CE}(p) = -\sum_{c} y_{c} \log p(t_c \mid x), \qquad (6) $$

where y_c denotes the one-hot ground-truth annotation of the label. We treat all downstream tasks in this work as classification tasks, where the model predicts whether an image and text prompt pair is matched or not.

The task-agnostic prompt p is then learned by minimizing the weighted combination of the contrastive learning loss and the cross-entropy loss:

$$ \mathcal{L}(p) = \mathcal{L}_{CE}(p) + \lambda \cdot \mathcal{L}_{CL}(p, u^{*}), \qquad (7) $$

where λ determines the weight of L_CL.

In fact, we can put Eq. 3 and Eq. 7 into a single-stage optimization framework. The intuition is that we generate counterfactual image features with the minimal feature change that can maximize the negative prediction probability, and at the same time utilize contrastive learning to learn the prompt that guides CLIP to explicitly distinguish between factual and counterfactual images. Putting all the pieces together, we have:

$$ \min_{p, u} \ \mathcal{L}_{CE}(p) + \lambda \cdot \mathcal{L}_{CL}(p, u^{*}) + \lVert u^{*} \rVert_{1} \quad \text{s.t.} \quad u^{*} = \arg\max_{u} D_{c^{-}}(v'), \ \ v' = (1 - u) \circ v + u \circ v^{-}. \qquad (8) $$

In Eq. 8, the gradients can be back-propagated all the way through the text encoder G to the task-agnostic prompt, making use of the rich knowledge encoded in the pre-trained CLIP model to optimize the prompt.

Algorithm 1 Counterfactual Prompt Learning
 1: X: image space
 2: Y: label space
 3: h_c: task-relevant prompt for the c-th class
 4: H: the set of task-relevant prompts
 5: p: the task-agnostic prompt
 6: v: image features
 7: v−: negative image features
 8: u: parameter controlling the generation of counterfactual image features
 9: function CPL(X, Y)
10:   H ← Y
11:   t_c ← [p, h_c]
12:   for each i, j do
13:     sim(i, j) = BERTScore(h_i, h_j)                    ▷ Eq. 5
14:   end for
15:   for q in the batch do
16:     v ← v_q
17:     Find the index k that maximizes sim(q, k) for the given index q
18:     v− ← v_k
19:     Generate counterfactual image features             ▷ Eq. 2
20:     L_CE ← cross-entropy loss                          ▷ Eq. 6
21:     L_CL ← contrastive loss                            ▷ Eq. 4
22:     Update p and u with the joint optimization loss    ▷ Eq. 7
23:   end for
24: end function

Algorithm 1 presents the learning algorithm of CPL. In summary, given a few input training samples {(x_1, y_1), ..., (x_n, y_n)}, CPL consists of three main steps: (1) compute the similarity matrix between the text prompts within the sampled batch; (2) generate counterfactual image features; (3) optimize p and u with the contrastive learning loss and the cross-entropy loss.
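The sketch below shows one way Eq. 4, Eq. 6, and the joint objective of Eq. 8 could be wired into a single training step. It is a simplified reading under stated assumptions (PyTorch, pre-extracted features, a per-dimension gate u, illustrative temperatures and λ); the arg max over the discriminator in Eq. 3 is approximated here by optimizing u jointly against the full loss rather than in a separate inner step.

```python
import torch
import torch.nn.functional as F

def cpl_training_step(v, v_neg, text_embeds, labels, u, optimizer,
                      lam=1.0, tau_clip=0.01, tau_cl=0.5):
    """One joint update of the prompt parameters and the gate u (Eq. 8, sketched).

    v, v_neg:    (batch, d) factual and BERTScore-sampled negative image features.
    text_embeds: (C, d) prompt embeddings G(t_c); they must be differentiable
                 with respect to the task-agnostic prompt p.
    labels:      (batch,) ground-truth class indices.
    u:           (d,) learnable counterfactual gate, registered in `optimizer`
                 together with the prompt parameters.
    """
    # Eq. 1 / Eq. 6: cross-entropy over cosine-similarity logits.
    logits = F.normalize(v, dim=-1) @ F.normalize(text_embeds, dim=-1).t() / tau_clip
    ce = F.cross_entropy(logits, labels)

    # Eq. 2: counterfactual features from a clamped, per-dimension gate.
    u_c = u.clamp(0.0, 1.0)
    v_prime = (1.0 - u_c) * v + u_c * v_neg

    # Eq. 4: InfoNCE with the counterfactual as the single negative.
    matched = text_embeds[labels]                     # G(t) of the paired prompt
    pos = F.cosine_similarity(v, matched, dim=-1) / tau_cl
    neg = F.cosine_similarity(v_prime, matched, dim=-1) / tau_cl
    cl = (torch.logsumexp(torch.stack([pos, neg], dim=-1), dim=-1) - pos).mean()

    # Eq. 8: weighted combination plus the L1 penalty that keeps u minimal.
    loss = ce + lam * cl + u_c.abs().sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The λ weighting used here is ablated in Section 4.4, where λ = 1 performs best on SUN397.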
3.6 Task-relevant Prompt Construction

We construct task-relevant prompts H for image classification, image-text retrieval, and visual question answering, respectively. For image classification, the prompts are the class labels of each task; for image-text retrieval, the captions of each image are adopted as prompts; for visual question answering, we first use a pre-trained generative T5 model (Raffel et al., 2019) to convert the question-answer pairs into declarative sentences, following the VQA prompt generation method proposed in Song et al. (2022b). Then, motivated by Wei et al. (2022), we add additional category information to the template-generated prompt based on the question type, to help the model perform intermediate reasoning steps. Specifically, we add "The question is asking about others" before the generated declarative sentence for Other questions. In a similar vein, "The question is asking about yes or no" and "The question is asking about numbers" are added for Yes/No and Number questions.
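A small sketch of the VQA prompt construction described above; the function name and the example declarative sentence are illustrative, and the T5-based conversion itself is assumed to happen upstream.

```python
def build_vqa_prompt(declarative_sentence, question_type):
    """Prepend the question-type hint to the T5-converted declarative sentence.

    question_type: one of "Yes/No", "Number", "Other" (the VQAv2 categories).
    """
    prefixes = {
        "Yes/No": "The question is asking about yes or no.",
        "Number": "The question is asking about numbers.",
        "Other": "The question is asking about others.",
    }
    return f"{prefixes[question_type]} {declarative_sentence}"

# e.g. build_vqa_prompt("There are two dogs on the sofa.", "Number")
```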
4 Experiments

4.1 Tasks and Datasets

Image Classification. We employ seven publicly available image classification datasets used in CLIP: SUN397 (Xiao et al., 2010), Caltech101 (Griffin et al., 2007), ImageNet (Deng et al., 2009), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback and Zisserman, 2008), and Food101 (Bossard et al., 2014). These datasets constitute a comprehensive benchmark covering a diverse set of vision tasks, including the classification of generic objects, fine-grained image recognition, action classification, etc. To evaluate the generalization ability of the methods, we split those datasets into seen and unseen classes. Only images in the seen classes are used for training. The setting follows the few-shot evaluation protocol in CLIP, where we use 16 shots for training and the full test sets for testing.

Image-Text Retrieval. We consider two datasets for image-text retrieval: MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015). We adopt the widely used Karpathy split (Karpathy and Fei-Fei, 2015) for both datasets, where MSCOCO contains 113K/5K/5K images for train/validation/test and Flickr30K contains 29K/1K/1K images for train/validation/test. We construct few-shot subsets for both CoCoOp and CPL by taking 0.5%, 1%, and 3% of the training instances. We train the model on the subsets and evaluate its performance on the complete test set. We use Recall at 1 (R@1) as the default evaluation metric.

Visual Question Answering. VQAv2 (Goyal et al., 2017b) is an extended dataset from the VQA (Antol et al., 2015) dataset. The questions are categorized into three types: Number, Yes/No, and Other. We set up the experiments following Anderson et al. (2018), which treats visual question answering as a classification problem: for each question, the model picks the corresponding answer from a given set of the most frequent candidate answers and matches it with the image. The questions are first converted into a masked template using the pre-trained T5 model and predefined rules. The infilled template, together with the question, is turned into a prompt that naturally connects question and answer. The model then predicts whether the given prompt and image pair is matched. We construct the few-shot setting by taking 0.5%, 1%, and 3% of instances for training.

Classes | Method     | SUN397         | Caltech101    | ImageNet      | OxfordPets    | StanfordCars  | Flowers102     | Food101       | Average
Seen    | CLIP       | 69.40          | 96.51         | 72.46         | 91.33         | 74.85         | 72.17          | 90.12         | 80.98
Seen    | CoCoOp     | 79.08 [+13.95] | 97.66 [+1.19] | 76.01 [+4.90] | 95.18 [+4.22] | 70.91 [-5.26] | 94.65 [+31.15] | 90.67 [+0.61] | 86.31 [+6.58]
Seen    | CPL (ours) | 81.05 [+16.79] | 97.70 [+1.23] | 78.81 [+8.76] | 96.69 [+5.87] | 75.51 [+0.88] | 93.91 [+30.12] | 93.01 [+3.21] | 88.10 [+8.79]
Unseen  | CLIP       | 75.40          | 94.10         | 68.09         | 97.04         | 74.95         | 77.87          | 91.30         | 82.68
Unseen  | CoCoOp     | 76.83 [+1.90]  | 93.92 [-0.19] | 70.44 [+3.45] | 97.78 [+0.76] | 73.09 [-2.48] | 69.24 [-11.08] | 91.53 [+0.25] | 81.83 [-1.02]
Unseen  | CPL (ours) | 80.19 [+6.35]  | 94.94 [+0.89] | 73.17 [+7.46] | 98.81 [+1.82] | 78.90 [+5.27] | 72.30 [-7.15]  | 93.44 [+2.34] | 84.54 [+2.25]

Table 1: Result comparison between CPL and CoCoOp (Zhou et al., 2022) on seen and unseen classes across seven image classification datasets in terms of accuracy (%) under the few-shot setting. The relative difference (%) compared with CLIP is reported in brackets.

Training data used | Method     | Flickr30K     | MSCOCO         | Average
0                  | CLIP       | 83.00         | 53.35          | 68.18
0.5%               | CoCoOp     | 82.40 [-0.72] | 55.55 [+4.12]  | 68.98 [+1.17]
0.5%               | CPL (ours) | 85.64 [+3.18] | 57.91 [+8.55]  | 71.78 [+5.28]
1%                 | CoCoOp     | 84.80 [+2.17] | 56.62 [+6.13]  | 70.71 [+3.71]
1%                 | CPL (ours) | 86.91 [+4.71] | 58.43 [+9.52]  | 72.67 [+6.59]
3%                 | CoCoOp     | 85.90 [+3.49] | 58.08 [+8.87]  | 71.99 [+5.59]
3%                 | CPL (ours) | 87.74 [+5.71] | 59.96 [+12.39] | 73.85 [+8.32]

Table 2: Result comparison between CPL and CoCoOp on two image-text retrieval datasets, Flickr30K (Plummer et al., 2015) and MSCOCO (Lin et al., 2014), on the unseen test sets in terms of Recall@1 (%). The relative difference (%) over CLIP is reported in brackets.
Training data used | Method                        | VQAv2
0                  | CLIP                          | 11.83
0.5%               | CoCoOp                        | 27.98 [+136.52]
0.5%               | CPL w/o. Category Information | 31.68 [+167.79]
0.5%               | CPL                           | 33.39 [+182.25]
1%                 | CoCoOp                        | 28.51 [+141.00]
1%                 | CPL w/o. Category Information | 34.70 [+193.32]
1%                 | CPL                           | 35.66 [+201.44]
3%                 | CoCoOp                        | 30.18 [+155.11]
3%                 | CPL w/o. Category Information | 35.41 [+199.32]
3%                 | CPL                           | 36.32 [+207.02]

Table 3: Result comparison on the VQAv2 dataset (Goyal et al., 2017a) in terms of accuracy (%). The relative improvements over CLIP are reported in brackets. Incorporating category information into task-relevant prompts further improves the performance.

4.2 Implementation Details

Baselines. We mainly compare CPL with CoCoOp (Zhou et al., 2022), one of the earliest prompt tuning methods proposed for vision-and-language pre-trained models. CoCoOp considers each input image and injects learnable instance-aware tokens into the context vectors as the final prompt. For a fair comparison, both CPL and CoCoOp adopt CLIP (Radford et al., 2021) as the pre-trained vision-and-language backbone and are compared with respect to their relative improvements over zero-shot CLIP.

Prompt Tuning. The task-agnostic prompt is randomly initialized from a zero-mean Gaussian distribution with standard deviation 0.02, and we set the length L = 4 by default. For vision and language tasks, in contrast to image classification, where an image is labeled by a category, the task-relevant prompts comprise more fine-grained details, usually a sentence. We similarly tokenize the whole sentence using the CLIP word embedding (Radford et al., 2021) and feed the tokenized results, together with the task-agnostic prompt vectors, to the text encoder to generate the language embedding for each prompt. In both image-text retrieval and visual question answering, all data in the test set can be treated as belonging to unseen classes.
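For concreteness, the initialization described above might look like the following sketch; the embedding width is a placeholder that depends on the CLIP text encoder variant used.

```python
import torch

# L = 4 task-agnostic context vectors drawn from N(0, 0.02^2); these are the
# only prompt parameters updated during training (the CLIP encoders stay frozen).
L, embed_dim = 4, 512  # embed_dim is an assumption, not specified here
task_agnostic_prompt = torch.nn.Parameter(torch.randn(L, embed_dim) * 0.02)
```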
4.3 Main Results

Image Classification. The experimental results for image classification are shown in Table 1. With better prompts learned from counterfactual examples, our CPL method achieves clear advantages over CoCoOp for both seen and unseen classes across almost all datasets. In particular, on unseen classes we gain an average relative improvement of 3.55%.

Meanwhile, CoCoOp shows poor generalization ability. Specifically, we found that CoCoOp performs worse than CLIP on StanfordCars for both seen and unseen classes, and on Caltech101 and Flowers102 for unseen classes, indicating that it tends to learn and leverage spurious relations and cannot generalize well to unseen classes in some cases. We believe this provides sufficient evidence that the main idea of CPL, namely that learning non-spurious prompt representations can aid CLIP in adapting at test time, is practical.

Image-Text Retrieval. Table 2 reports results on image-text retrieval on the unseen test set. CPL beats zero-shot CLIP consistently across the three different settings, demonstrating that CPL can also learn better prompt representations and more effectively exploit the limited amount of data for image-text retrieval. Meanwhile, CoCoOp performs even worse than CLIP on Flickr30K when using 0.5% of the training data, which suggests that a tiny quantity of training data for image-text retrieval can lead to spurious prompt representations if a naïve instance-conditional prompt tuning method is used.

Visual Question Answering. For visual question answering, the results are shown in Table 3. As can be seen, CPL surpasses the baseline CoCoOp with a relative improvement of up to 25.08% when using 1% of the instances for training. This proves that CPL can be effective on more complicated vision-and-language tasks. In fact, visual question answering is more challenging for zero-shot CLIP, which is pre-trained for image-text matching. During pre-training, CLIP mostly sees sentences similar to the captions used in image-text retrieval, and those captions can be directly used as prompts; for VQA, on the other hand, question-answer pairs have to be adapted into declarative prompts. Therefore, zero-shot CLIP has poor performance on VQA, but few-shot prompt tuning via CPL can help reduce the prompt domain gap significantly. Apart from the vanilla CPL method, we examined another variant of CPL in which we do not add additional category information into the prompt (denoted as CPL w/o. Category Information); the results indicate that constructing task-relevant prompts by adding categorical information contributes to the improvement.

Figure 4: Visualization of the weights of the controller parameter u on images, for an image classification example (positive "Tabby cat"; BERTScore-sampled "Tiger cat", BERTScore = 0.9126; randomly-sampled "Jeep", BERTScore = 0.8556) and an image-text retrieval example (positive "A big bunch of ripe yellow bananas on display"; BERTScore-sampled "Bunches of bananas are neatly arranged on a display", BERTScore = 0.9313; randomly-sampled "The plate is empty on the table", BERTScore = 0.8908). The first column shows the original positive examples, the second column the BERT-sampled negative examples, and the third column randomly-sampled negative examples for comparison. The BERTScore between the text prompts of the positive and sampled examples is given for each pair.

4.4 Ablation Analysis

Negative Sampling. We compare random sampling vs. BERTScore sampling over ImageNet for image classification, MSCOCO for image-text retrieval, and VQAv2 for visual question answering in Table 4. With more challenging negative examples, BERTScore sampling leads to more effective prompt tuning and outperforms random sampling on all three tasks. Qualitative visualizations of the two sampling strategies are shown in Figure 4, from which it can be seen that BERTScore-sampled images are much more semantically similar to the original images.

Non-spurious Feature Visualization. We visualize the heatmap of the learned non-spurious feature weights at the image level in Figure 4. The weights are mainly concentrated on the semantically meaningful regions that are aligned with the text prompts.

Number of Shots in Image Classification. We then study the effect of the number of shots on CPL for image classification. Following the few-shot evaluation protocol adopted in CLIP, we use 4, 8, and 16 shots for training on ImageNet. As shown in Figure 5, increasing the number of shots keeps improving the performance of both methods on unseen classes.
Meanwhile, CPL outperforms CoCoOp under the three different settings and has lower standard errors.

Method             | ImageNet | MSCOCO | VQAv2
Random sampling    | 75.28    | 57.78  | 33.01
BERTScore sampling | 76.02    | 58.43  | 35.66

Table 4: Random sampling vs. BERTScore sampling for CPL over three tasks. On ImageNet, we measure the average accuracy across seen and unseen classes. On MSCOCO and VQAv2, we use 1% of instances for few-shot learning.

Figure 5: Accuracy comparison on ImageNet (Deng et al., 2009) unseen classes under three different numbers of shots. CPL performs better than CoCoOp consistently and has lower standard errors.

Contribution of Contrastive Learning. In Section 3, we use the coefficient λ to weigh the contrastive learning loss and combine it with the cross-entropy loss. We observed that the scale of the contrastive learning loss is smaller, hence we try a larger λ to balance the two loss terms. Figure 6 shows the average accuracy across seen and unseen classes on the SUN397 dataset under four different λ values. Note that when λ is zero, there is no contribution from the contrastive loss and the method simply learns the prompt with the standard cross-entropy loss. From the experimental results obtained on the SUN397 dataset, we observe that λ = 1 leads to the best performance.

Figure 6: Ablation of four different λ values on the SUN397 dataset in terms of average accuracy (%). The performance of CPL peaks at λ = 1.

5 Conclusion

In this paper, we propose a Counterfactual Prompt Learning (CPL) framework to avoid time-consuming prompt engineering and to learn more generalizable prompt representations for vision and language models. We conduct abundant experiments on seven widely used image classification datasets, two image-text retrieval datasets, and one visual question answering dataset. Our proposed CPL method outperforms the previous prompt tuning baseline and zero-shot CLIP across the three tasks. In the future, we plan to develop more sophisticated methods based on CPL and extend CPL to other vision and language tasks.

Limitations

There are fairness issues in large pre-trained vision and language models such as CLIP. The prompt learning method proposed in this study automatically learns the prompt and does not address those issues in the pre-trained model. Considering that the method is proposed for the few-shot setting, careful inspection and tuning are also needed when testing our method on other biased datasets. The methodologies proposed in Booth et al. (2021) and Wang et al. (2021a) may possibly be paired with CPL to address these issues. Another limitation is the absence of explainability in CPL, which is a common problem with existing soft prompt tuning methods. Back-mapping tuned soft prompt representations to natural language is one way to obtain interpretations; however, due to the limited size of the vocabulary used by CLIP during training, prior methods such as searching for the nearest words in the embedding space cannot accurately match the vectors to natural language. Expanding the dictionary size for the CLIP embedding or developing more advanced back-mapping techniques can possibly address this limitation.

Acknowledgments

We would like to thank the support of the Google Ads Faculty Research Award. We also thank the anonymous reviewers for their thought-provoking comments. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the sponsor.
References

Vedika Agarwal, Rakshith Shetty, and Mario Fritz. 2020. Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9690–9698.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.

M Besserve, A Mehrjou, R Sun, and B Schölkopf. 2020. Counterfactuals uncover the modular structure of deep generative models. In Eighth International Conference on Learning Representations (ICLR 2020).

Brandon M Booth, Louis Hickman, Shree Krishna Subburaj, Louis Tay, Sang Eun Woo, and Sidney K D'Mello. 2021. Bias and fairness in multimodal machine learning: A case study of automated video interviews. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 268–277.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, and Yueting Zhuang. 2020. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10800–10809.

Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased contrastive learning. Advances in Neural Information Processing Systems, 33:8765–8775.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. 2020. Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision, pages 71–86. Springer.

Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. 2020. Mutant: A training paradigm for out-of-distribution generalization in visual question answering. arXiv preprint arXiv:2009.08566.

Tejas Gokhale, Abhishek Chaudhary, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. 2021. Semantically distributed robust optimization for vision-and-language inference. arXiv preprint arXiv:2110.07165.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017a. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017b. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR.

Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 object category dataset.

Han Guo, Bowen Tan, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. 2021. Text generation with efficient (soft) q-learning. arXiv preprint arXiv:2106.07704.

Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. 2022. Parameter-efficient fine-tuning for vision transformers. arXiv preprint arXiv:2203.16329.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. 2021. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV.

Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2021. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602.

Yuhang Liu, Wei Wei, Daowan Peng, and Feida Zhu. 2022. Declaration-based prompt tuning for visual question answering. arXiv preprint arXiv:2205.02456.

Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. 2012. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. 2020. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.

Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, and Furu Wei. 2022a. Clip models are few-shot learners: Empirical studies on vqa and visual entailment. arXiv preprint arXiv:2203.07190.

Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, and Furu Wei. 2022b. Clip models are few-shot learners: Empirical studies on vqa and visual entailment. arXiv preprint arXiv:2203.07190.

Varsha Suresh and Desmond C Ong. 2021. Not all negatives are equal: Label-aware contrastive loss for fine-grained text classification. arXiv preprint arXiv:2109.05427.

Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212.

Jialu Wang, Yang Liu, and Xin Eric Wang. 2021a. Assessing multilingual fairness in pre-trained multimodal representations. arXiv preprint arXiv:2106.06683.

Yixin Wang and Michael I Jordan. 2021. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021b. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, Yueting Zhuang, Luo Si, and Fei Wu. 2020. De-biased court's view generation with causality. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 763–780.
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE.

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. 2021. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.

Xiangji Zeng, Yunliang Li, Yuchen Zhai, and Yin Zhang. 2020. Counterfactual generator: A weakly-supervised method for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7270–7280.

Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Ré. 2022. Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. arXiv preprint arXiv:2203.01517.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.

Xiheng Zhang, Yongkang Wong, Xiaofei Wu, Juwei Lu, Mohan Kankanhalli, Xiangdong Li, and Weidong Geng. 2021. Learning causal representation for training cross-domain pose estimator via generative interventions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11270–11280.

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2021. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134.

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Conditional prompt learning for vision-language models. arXiv preprint arXiv:2203.05557.

Qingfu Zhu, Weinan Zhang, Ting Liu, and William Yang Wang. 2020. Counterfactual off-policy training for neural dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3438–3448.
