
Benchmarking VLMs’ Reasoning About Persuasive Atypical Images

Sina Malakouti
University of Pittsburgh
Pittsburgh, PA
   Aysan Aghazadeh
University of Pittsburgh
Pittsburgh, PA
   Ashmit Khandelwal
BITS Pilani
Goa, India
   Adriana Kovashka
University of Pittsburgh
Pittsburgh, PA
Abstract

Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer’s lightness.

We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs’ understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad’s message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements.

Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.

* These authors contributed equally to this work. Contact sem238@pitt.edu, aya34@pitt.edu for questions and suggestions.

1 Introduction

In visual media, particularly advertisements, creators employ creative visual rhetoric to capture attention and convey memorable, powerful messages. They intentionally deviate from realism, depicting objects in unique and atypical ways [33, 28]. Creative ads that are “out of the ordinary” or “connect objects that are usually unrelated” can generate twice as much revenue as non-creative ads [33].

Atypical imagery in ads often involves transforming objects metaphorically [42, 45]. These creative transformations are not random; they are carefully chosen to convey specific ideas  [42]. For example, Fig. 1(a) depicts a text box as tape to suggest silencing, while in (d), potato chips are shown as flames to metaphorically represent spiciness, borrowing properties from fire (hotness symbolizing flavor). Understanding these atypical images requires more than just recognizing objects. It requires advanced reasoning skills, including knowledge of cultural contexts and social norms, posing a significant challenge for AI systems.

Figure 1: Atypicality categories. We study four types of atypicality from [45]: Texture Replacement 1, Texture Replacement 2, Object Inside Objects, Object Replacement (defined in Sec. 3.1).

Modern pretrained vision-language models (VLMs), such as LLaVA [26, 25], demonstrate strong visual understanding across various tasks such as recognition [24], and capabilities like zero-shot generalizability. However, there is a lack of in-depth study on VLMs’ ability to understand complex persuasive images, such as advertisements.

We address this gap by introducing three novel tasks over PittAds [18] to evaluate VLMs’ understanding of atypicality: (1) multi-label atypicality classification, where the model predicts the type of atypicalities in the image; (2) atypicality statement retrieval, where the model retrieves correct atypicality statements describing the atypicality relation among objects; (3) atypical object recognition, where the model generates objects to complete an atypicality statement based on a given relation. These tasks are essential as prior works’ binary classification oversimplifies atypicality’s nuanced nature. Our evaluation shows that although VLMs struggle with direct atypicality inference, they can extract valuable information about atypical aspects.

Next, we investigate how atypicality influences understanding an ad’s message. We use the action-reason retrieval (ARR) task [18, 45], which requires models to identify the suggested action (e.g., “buy these chips”) and its rationale (e.g., “because they are spicy”). However, to rigorously test the model’s reasoning, we introduce semantically challenging negative options, rather than mining hard negatives from other images [44, 19]. For example, we generate statements that include a wrong action (e.g., “don’t buy these chips”) or a wrong rationale (e.g., “because they are sweet”). This prevents VLMs from ruling out negatives by merely comparing objects in the image and options. Our evaluation shows a significant performance drop when a VLM is faced with hard negatives (e.g., LLaVA drops by 67.51%).

Finally, we hypothesize that deep atypicality understanding improves action-reason retrieval performance. We test this by proposing an atypicality-aware verbalization. We explore a few simple prompting strategies to construct a comprehensive, atypicality-sensitive ad verbalization, which is used to predict the atypicality statement. This statement, combined with the atypicality-aware verbalization, serves as input to an LLM classifier. This LLM integrates all the information to retrieve the final action-reason, effectively leveraging both visual understanding and high-level reasoning capabilities.

Our proposed framework achieves state-of-the-art performance on the ARR task. Interestingly, when a VLM is given both the image and our atypicality-aware verbalization, its performance on the ARR task actually declines (e.g., LLaVA($I+\mathcal{T}_{\mathcal{V}}$) shows a 1.71 point drop in performance compared to LLaVA($I$)), and it significantly underperforms the LLM. This stark contrast highlights a critical gap: VLMs lack the advanced reasoning capabilities of LLMs when interpreting complex, atypical visual media. To summarize, our contributions are:

  1. We introduce three novel tasks for understanding atypicality in persuasive media.

  2. We pioneer the use of atypicality inference in action-reason retrieval and are the first to benchmark VLMs/LLMs for advertisement understanding.

  3. We generate semantically challenging negatives using GPT-4 for action-reason retrieval, revealing VLMs’ reasoning limitations in interpreting atypical ads.

We hope this work inspires the inclusion of persuasive ads in VLM benchmarks, fosters the development of robust models for complex visual media, and offers insights for creating more effective advertisements.

2 Related Works

Creativity in advertising has long been of interest in advertising research. It has been broken down into categories, and its impact on the effectiveness of ads has been measured. Both [33] and [34] define the categories as originality, flexibility, synthesis, elaboration, and artistic value, which capture different shades of divergence from the ordinary. Atypicality most directly maps to synthesis. However, these creativity strategies have not been explored in computer vision for predicting the message of an ad.

Advertisement image understanding. The PittAds Dataset [18] introduced the action-reason retrieval task, establishing a baseline for automatic ad understanding. However, most studies have not explicitly captured advertising-specific strategies for this task, nor have they addressed atypicality. [44] incorporated symbolism, but the gains were minimal. Others utilized scene-text [14, 20], graph-based methods to incorporate external knowledge [46], and CLIP [32] for brand name understanding [19]. [5] used Automatic Speech Recognition, OCR, WikiData and BLIP-2 [23] to describe the stories of video ads. [2] analyzed metaphors in ads. Yet, the impact of atypicality on ad image understanding remains unexplored. The only exception is [17], which proposed a self-supervised approach to classify images as typical or atypical but did not classify the type of atypicality nor use them for action-reason prediction.

Vision-language models. We benchmark pre-trained VLMs and LLMs on tasks involving atypicality and advertisement image understanding, focusing on their zero-shot reasoning capabilities. We also use pre-trained VLMs to verbalize advertisement images for LLMs. Given the substantial computational power required for training and fine-tuning large models, off-the-shelf, frozen models are typically used [38, 3]. Techniques to align visual and textual features without parameter updates include optimizing image encoders [38], inserting cross-attention layers [3], prompt learning [53, 52, 9, 21], and employing external transformers [31, 23, 13, 25]. However, direct application of models like [23, 32] may miss hidden messages by focusing more on visual context than semantics. Note that we restrict our experiments to the zero-shot inference setting.

Language models for multi-modal reasoning. Recent studies [15, 49, 5, 43] have explored LLMs for reasoning tasks, including chain-of-thought reasoning [41]. Some works leverage LLMs in multi-modal reasoning. [49, 40, 50] extended chain-of-thought to a multimodal context. [22] uses an image-captioning model followed by ‘reasoning questions’ to aid an LLM in answering the main question. Related/concurrent works like [47] improve zero-shot reasoning by iteratively asking and answering questions with three VLMs/LLMs, and [29] uses scene-graphs to enhance compositional reasoning. [27, 49] devised sophisticated LLM-augmented tools for task subdivision and external tool selection. In contrast, this paper challenges the reasoning ability of VLMs on complex persuasive images through novel atypicality tasks and action-reason retrieval. It improves performance with a more lightweight atypicality-aware verbalization, and no external tools are needed.

Figure 2: Atypicality Understanding and Action-Reason Retrieval Tasks. We introduce three tasks: Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition. Incorrect/correct phrases/statements are in red/green.

VLM evaluation. Various vision-language benchmarks have been proposed, including tasks like recognition [16], captioning [10], commonsense reasoning [39, 6], VCR [48], and compositional reasoning [36]. However, these benchmarks feature straightforward scenes that do not challenge the model’s reasoning abilities beyond literal imagery. Recently, ROME [51] generated counter-intuitive images focusing on 5 primitive common sense types (e.g., color and size) to challenge models’ ability in object, attribute, and spatial relation recognition. Similarly, WHOOPS! [7] addressed this by creating 500 synthetic scenes by placing ‘normal’ objects in unusual contexts (e.g., a snowplow in a desert and Einstein holding a smartphone). Atypical advertising images differ from those in WHOOPS! and ROME in two important ways: First, they are real ads created by human designers to intentionally convey a particular message (e.g., cigar replacing a bullet, to highlight the dangers of smoking), rather than simply to be unusual. This requires models to detect atypicality and atypical objects and reason about their impact on the ad’s meaning, making ad image understanding a more realistic and challenging benchmark for evaluating VLMs’ reasoning abilities. Second, atypical ads feature more categories of atypicality than placing an object out of context or altering primitive attributes, as WHOOPS! and ROME do, respectively.

3 Methodology

This paper evaluates VLM/LLM understanding of atypical advertisements. We address two key questions: (1) Are current VLMs capable of reasoning about atypicality and understanding advertisements? (2) What is the impact of atypicality on understanding ad images?

Unlike prior works [17] that only classify images as typical or atypical, we propose three new tasks: Multi-label Atypicality Classification (MAC), Atypicality Statement Retrieval (ASR), and Atypical Object Recognition (AOR). MAC predicts multiple categories of atypicality in the image. ASR uses additional annotations to identify objects involved in the atypical portrayal (e.g., “The surface of the bottle mimics the texture of feather”). AOR evaluates VLMs’ visual reasoning by identifying the primary and secondary objects in an atypical relation.

Our analysis shows that while VLMs initially struggle with the MAC and ASR tasks, they can extract valuable insights about atypical aspects of images. Leveraging these insights, we develop an atypicality-aware image verbalization. To detect atypicality, we use the prompt $UH$: “What is unusual about this image?” Atypicality adds depth to the content of the advertisement, complementing surface-level content like objects and scene descriptions. Thus, we combine $UH$ and surface-level content to construct the final verbalization. It is then passed to an LLM for the action-reason inference task. We elaborate on the tasks and pipeline below.

Figure 3: Atypicality definitions and atypicality relation statements.

3.1 Proposed Atypicality Understanding Tasks

Ye et al. [45] devised a taxonomy of atypicality based on object transformations. In this work, we focus on the subset of atypicality categories that entail two objects (examples in Fig. 1): (1) Texture Replacement 1 (TR1): Objects’ texture borrowed from another object, (2) Texture Replacement 2 (TR2): Texture created by combining several small objects, (3) Object Inside Object (OIO): An object is completely or partially inside of another object, and (4) Object Replacement (OR): The whole object appearing in the context normally associated with another. We define the following new atypicality understanding tasks, shown in Fig. 2.

Multi-label Atypicality Classification (MAC). Unlike prior works [17] that only detect the presence of atypicality, we formulate atypicality detection as a multi-label classification task. The PittAds dataset provides three annotations of atypicality per image from different annotators, which may vary by type. For example, Fig. 1(c) can be classified as ‘Object Inside Object’ (Earth inside a cup sleeve) and ‘Object Replacement’ (Earth replaces coffee cup). Some annotators may even label it as ‘Not Atypical’ (NA), creating five possible labels. MAC challenges VLMs to predict the relevant atypicality categories for an image based on atypicality definitions denoted as $\mathcal{D}_{\mathcal{A}}$ (prompts in supp). Due to the complexity of differentiating between these categories, we extend the definitions provided by [45] as shown in Fig. 3: they not only distinguish different atypicality categories but also hint at how atypicality impacts the image’s interpretation (e.g., Fig. 1(e) is TR1, where the beer’s texture is replaced by a feather to advertise its lightness).
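To make the multi-label setup concrete, the following is a minimal sketch of how per-image labels could be assembled and scored. Taking the union of the three annotators' labels as the gold set, and scoring one image at a time, are assumptions made for illustration; the text above does not specify the aggregation rule. The function names are hypothetical.

```python
LABELS = ("TR1", "TR2", "OIO", "OR", "NA")  # four atypicality types plus Not Atypical


def gold_label_set(annotations, include_na=True):
    """Merge the annotators' labels into one set (assumed rule: set union)."""
    labels = set(annotations)
    if not include_na:
        labels.discard("NA")
    return labels


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for a single image."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example in the spirit of Fig. 1(c): annotators disagree on the category.
annotations = ["OIO", "OR", "NA"]
gold = gold_label_set(annotations)
gold_without_na = gold_label_set(annotations, include_na=False)
predicted = {"OIO", "TR1"}  # hypothetical model output
p, r, f1 = precision_recall_f1(predicted, gold)
```

Dropping NA from the gold set corresponds to the "without NA" evaluation setting, under the same union assumption.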

(a) Image Verbalization
(b) Atypicality Statement Detection
(c) Action-Reason Retrieval
Figure 4: Our approach consists of three steps: (a) Image verbalization: We first prompt LLaVA to obtain top-5 objects ($V$), scene-text ($T$), scene description $IN$, and unusualness $UH$. Then we combine all the information to obtain a uniform description $\mathcal{T}_{\mathcal{V}}$. (b) Atypicality Statement Detection: We utilize $V$ and atypicality statement templates $\mathcal{S}_{\mathcal{A}}$ to generate the options, which are then used along with $IN$ to retrieve the atypicality statement $\hat{s}$. (c) Action-Reason Retrieval: We input $\hat{s}$ along with verbalization $\mathcal{T}_{\mathcal{V}}$ to retrieve action-reason.

Atypicality Statement Retrieval (ASR). ASR formulates atypicality inference as retrieving a statement describing relations between two objects. We represent atypicality $\mathcal{A}$ with templates $\mathcal{S}_{\mathcal{A}}$, defined in Fig. 3. A statement $s=(a,o^{p},o^{s})$ includes an atypicality type $a$, primary object $o^{p}$, and secondary object $o^{s}$ as described in Sec. 3.6 of [45]. In TR1 and TR2, $o^{p}$ is the object with a new texture, and $o^{s}$ is the source of that texture. In OIO, $o^{p}$ is the object inside, and $o^{s}$ is the object outside. In OR, $o^{p}$ is the object placed in another’s context, and $o^{s}$ is the expected object.

Given an ad, ASR distinguishes the correct atypicality statement $s^{+}=(a^{+},o^{+p},o^{+s})$ from a set of negative statements $\{\bar{s}_{i}\}\in S^{-}$. We generate negatives as follows: (1) Wrong object: replacing $o^{+p}$ and $o^{+s}$ with objects from $K$ random images, producing $2K$ negatives (e.g., $\bar{s}_{1}=(a^{+},o^{p}_{1},o^{s}_{1})$ and $\bar{s}_{2}=(a^{+},o^{s}_{1},o^{p}_{1})$, where $o^{p}_{1}/o^{s}_{1}$ are from a random image); (2) Wrong atypicality relation: altering the relation with one not in the ground-truths (objects remain the same) to create up to 3 negatives; (3) Swapped primary/secondary objects: we create $(a^{+},o^{+s},o^{+p})$. Thus, ASR tests the model’s understanding of objects, their atypicality relation, and primary/secondary object roles. It bridges the gap between MAC and action-reason retrieval by detecting atypical statements and enhancing action-reason retrieval. We use $K=2$.
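The three negative types can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: statements are kept as `(type, primary, secondary)` triples instead of rendered template sentences, and the function name, argument names, and example objects are hypothetical.

```python
import random

ATYPICALITY_TYPES = ("TR1", "TR2", "OIO", "OR")


def make_asr_negatives(positive, gold_types, other_pairs, k=2, seed=0):
    """Build negative statements for one positive (a, o_p, o_s) triple.

    gold_types: every atypicality type annotated for this image.
    other_pairs: (primary, secondary) object pairs taken from other images.
    """
    a, o_p, o_s = positive
    rng = random.Random(seed)
    negatives = []

    # (1) Wrong object: objects from k random images, in both orders -> 2k negatives.
    for p, s in rng.sample(other_pairs, k):
        negatives.append((a, p, s))
        negatives.append((a, s, p))

    # (2) Wrong relation: same objects, a type not in the ground truth -> up to 3.
    for t in ATYPICALITY_TYPES:
        if t not in gold_types:
            negatives.append((t, o_p, o_s))

    # (3) Swapped roles: primary and secondary objects exchanged.
    negatives.append((a, o_s, o_p))
    return negatives


positive = ("TR1", "beer", "feather")
others = [("chips", "flames"), ("earth", "coffee cup"), ("cigar", "bullet")]
negatives = make_asr_negatives(positive, {"TR1"}, others, k=2)
```

With $K=2$ and a single ground-truth type, this yields 4 wrong-object, 3 wrong-relation, and 1 swapped negative, i.e. 8 distractors alongside the positive.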

Atypical Object Recognition (AOR). To assess the VLMs’ fine-grained visual understanding, we introduce the task of recognizing the primary and secondary objects in an atypical image. Given an ad and the true atypicality, the model must generate the correct primary/secondary objects to complete the atypicality statement. AOR can be viewed as a fill-in-the-blank task (i.e., generative) based on the statement templates.

3.2 Proposed Approach

To explore the impact of atypicality in action-reason retrieval and compare the reasoning ability of VLMs and LLMs, we propose a three-step in-context learning method: (i) Atypicality-aware image verbalization: Using LLaVA [26] and an LLM, we generate a coherent verbalization $\mathcal{T}_{\mathcal{V}}$ sensitive to atypicality; (ii) Atypicality statement detection: We detect the atypicality statement $\hat{s}=(\hat{a},\hat{o}^{p},\hat{o}^{s})$; (iii) Action-reason retrieval: We combine steps (i) and (ii) to retrieve the final action-reason statement. Fig. 4 illustrates the method.

(i) Verbalize the image in an atypicality-aware manner. Each ad image $I=\left(V,T\right)$ is composed of visual content ($V$) (objects) and textual content ($T$) (scene-text). We obtain $V$ and $T$ by querying LLaVA for up to 5 objects and a list of scene-texts visible in the image. However, this information is insufficient to fully comprehend the image. Hence, we additionally obtain (1) $ImageNarration$ ($IN$): LLaVA’s response when prompted “Describe the image in detail.” and (2) $UnusualHighlighter$ ($UH$): LLaVA’s response when asked “What is unusual about the image?” MAC results (Table 1) show that $UH$ effectively captures image unusualness (closely related to atypicality), while $IN$ provides scene and object information useful for retrieving the atypicality statement $\hat{s}$. Thus, we construct the final verbalization $\mathcal{T}_{\mathcal{V}}$ by combining $UH$, $V$, $T$, and $IN$ using an LLM.
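A minimal sketch of this verbalization step is given below. Only the two prompts for $IN$ and $UH$ are quoted from the text; the object and scene-text prompts, the combination prompt, and the `vlm`/`llm` callables are hypothetical placeholders, since the exact prompts are in the supplement and any model client could be plugged in.

```python
from typing import Callable, Dict

IN_PROMPT = "Describe the image in detail."       # quoted from the text
UH_PROMPT = "What is unusual about the image?"    # quoted from the text
# Assumed wording; the paper's actual prompts are in its supplement.
OBJECT_PROMPT = "List up to 5 objects visible in the image."
TEXT_PROMPT = "List the text visible in the image."


def verbalize(
    image: object,
    vlm: Callable[[object, str], str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Query the VLM four times, then ask an LLM to merge the answers into T_V."""
    parts = {
        "V": vlm(image, OBJECT_PROMPT),
        "T": vlm(image, TEXT_PROMPT),
        "IN": vlm(image, IN_PROMPT),
        "UH": vlm(image, UH_PROMPT),
    }
    # Hypothetical combination prompt for the LLM.
    combine_prompt = (
        "Combine the following into one coherent description of the ad image.\n"
        + "\n".join(f"{name}: {text}" for name, text in parts.items())
    )
    parts["T_V"] = llm(combine_prompt)
    return parts
```

Passing the models in as callables keeps the sketch independent of any particular LLaVA or GPT client library.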

(ii) Detect atypicality statement. We construct all possible statements to predict the atypicality statement $\hat{s}$. The Option Generator (teal module in Fig. 4(b)) combines $V$ and $\mathcal{S}_{\mathcal{A}}$ to generate all possible statements $S_{I}$. Specifically, each object pair $(o_{i},o_{j})\in V$ is combined with every atypicality statement template $s\in\mathcal{S}_{\mathcal{A}}$ to create atypicality statements $(a_{k},o_{i},o_{j})$ and $(a_{k},o_{j},o_{i})$ for all $a_{k}\in\mathcal{A}$. We then pass these atypicality statements $S_{I}$ along with the verbalization $\hat{S}_{IN}$ into the classifier to predict $\hat{s}$ (no ground-truth is used).
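The enumeration performed by the Option Generator can be sketched as below. The sketch stops at `(type, primary, secondary)` triples because the statement templates themselves are given only in Fig. 3; the function name and example objects are hypothetical.

```python
from itertools import permutations

ATYPICALITY_TYPES = ("TR1", "TR2", "OIO", "OR")


def generate_options(objects, types=ATYPICALITY_TYPES):
    """Every (a_k, o_i, o_j) triple over ordered pairs of distinct detected objects.

    Using ordered pairs covers both (a_k, o_i, o_j) and (a_k, o_j, o_i).
    """
    return [
        (a, o_i, o_j)
        for o_i, o_j in permutations(objects, 2)
        for a in types
    ]


options = generate_options(["beer", "feather", "glass", "table", "logo"])
print(len(options))  # 5 * 4 ordered pairs * 4 types = 80
```

With up to 5 detected objects and 4 atypicality types, the classifier therefore chooses among at most 80 candidate statements per image.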

(iii) Retrieve action-reason statements. [18] provided three action-reason statements for each image, each offering different plausible reasons for the same action. Given these three plausible (i.e., positive) and many implausible (i.e., negative) action-reason statements, the task is to detect all plausible action-reason statements $\{\hat{ar}_{I}\}_{i=1}^{3}$. We proposed various verbalization strategies, including concatenation and LLM-based combinations (i.e., $\mathcal{T}_{\mathcal{V}}$) of $V, T, IN, UH$, as well as concatenation of $\hat{s}$ with $\mathcal{T}_{\mathcal{V}}$, to be utilized by an LLM for retrieving the final action-reason statement.

4 Experimental Setup

Datasets. PittAds [18] includes 64,832 ad images, with 3,928 annotated for atypicality, action-reason, and primary/secondary objects. For MAC, we use atypicality categories, while for AOR and ASR, we use primary/secondary objects along with our atypicality statement templates $\mathcal{S}_{\mathcal{A}}$ (Fig. 3) to generate ground truth for evaluation. We utilized the train set with 1,168 samples for the main results, as it includes at least one annotation of the atypicality categories studied and is larger than the test set. No training was performed. Ablation studies are reported on a smaller subset of 250 images due to the high cost.

Baselines. We use LLaVA (‘vicuna-13b-v1.5’) [26], InstructBLIP (‘vicuna-13b-v0’) [13], MiniGPT4 [54] (‘vicuna-13b-v0’), CLIP [32] (‘ViT-L/14@336px’ following [19]), and InternVL-Chat-V1-1 [11] as VLM baselines (and InternVL2-8B, LLaVA 1.6 in supp). LLaVA is our multimodal component due to its GPT-4-informed instruction tuning, state-of-the-art reasoning performance, and promptability [26]. We evaluate GPT-4V on a limited 250 examples, constrained by cost. We report BLIP-2 [23] (‘blip2-flan-t5-xl’) only for AOR as it failed to produce meaningful output when asked to return multiple options, i.e., MAC and multi-ARR (details in supp).

Our analysis spans recent public LLMs, such as ‘vicuna-13b-v1.5’ (Vicuna) [37], ‘InternLM2-5-7b-chat’ (InternLM; see supp), and leading commercial models like GPT-3.5/4. We chose Vicuna as it is used in all VLMs (LLaVA, MiniGPT4, InstructBLIP) and InternLM as InternVL2-8B’s LM. We also compare GPT-4 and GPT-4V. We introduce CLIP ($I+T$) as a zero-shot baseline aligned with KAFA [19] but avoid direct comparison with KAFA as it is not publicly available. To assess our atypicality method, we compare against $V+T$ (verb. baseline), which includes basic image information (up to 5 objects, scene text).

Metrics. We use Precision, Recall, and F1-score to evaluate MAC (additional metrics in supp). For AOR, we assess sentence similarity between $s^{+}$ and $\hat{s}$ using ‘all-mpnet-base-v2’ [35], a state-of-the-art sentence embedding method. Common text-matching and accuracy metrics are not suitable for AOR since it is a generative task, and annotations can vary widely (e.g., ‘beer,’ ‘glass of beer,’ ‘beer glass’) due to human inconsistencies and typos. We report accuracy (denoted ‘Acc’) for single statement retrieval tasks, where the model returns only one statement per query (i.e., ASR and Single ARR). Top-k Acc and unranked Precision@k are the metrics for the multi-option ARR, with $Precision@k=\frac{\min(k,\ \text{\# of relevant statements in top k predictions})}{k}$. Note top-3 acc and prec@1 are the same.
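The unranked Precision@k formula can be sketched as follows. One reading is assumed here: the model returns a fixed, unordered set of predictions (three in multi-option ARR), and the count of relevant statements is taken over that whole set. Under this assumption, Precision@1 equals 1 exactly when at least one returned statement is relevant, which matches the remark that top-3 accuracy and Precision@1 coincide. Function names and statement identifiers are hypothetical.

```python
def precision_at_k(predictions, relevant, k):
    """Unranked Precision@k: min(k, #relevant among returned predictions) / k."""
    hits = sum(1 for statement in set(predictions) if statement in relevant)
    return min(k, hits) / k


def hit_accuracy(predictions, relevant):
    """1.0 if any returned statement is relevant, else 0.0 (per-image top-n hit)."""
    return 1.0 if any(statement in relevant for statement in predictions) else 0.0


relevant = {"s1", "s2", "s3"}      # the three ground-truth action-reason statements
predictions = ["s1", "s7", "s3"]   # hypothetical model output, two of three correct

p_at_1 = precision_at_k(predictions, relevant, 1)
p_at_2 = precision_at_k(predictions, relevant, 2)
p_at_3 = precision_at_k(predictions, relevant, 3)
```

In this example two of the three returned statements are relevant, so Precision@1 and Precision@2 saturate at 1 while Precision@3 is 2/3.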

Hard Negative (HN) Generation. To measure ARR performance, [44] mined hard negatives from images within the same topic, while [19] expanded the negative options. These negatives often include irrelevant objects, allowing VLMs to easily disregard them by comparing objects. This hinders accurate measurement of models’ reasoning ability. In contrast, a concurrent work [4] used annotators to write implausible statements based on visible objects/texts. Our approach differs in three key ways: (1) it is LLM-based, image-agnostic, and scalable; (2) it generates a wider variety of negatives while focusing on semantics (e.g., altering actions, reasons, adjectives, or swapping objects not visible such as ‘lipstick’ instead of ‘lip balm’), and (3) they evaluate contrastive-based VLMs, whereas we focus on generative VLM/LLMs with stronger reasoning ability.

Specifically, for each ground-truth action-reason statement, we ask GPT-4 to generate a negative action-reason statement by (1) Action alter: changing the action to an unrelated or opposite action while preserving the reason; (2) Reason alter: changing the reason to an unrelated or opposite reason; (3) Adjective alter: negating or modifying adjectives to make the statement incorrect; (4) Object swap: substituting at least one object in the statement with an unrelated object; (5) Statement alter: generating a completely unrelated action-reason statement.
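The five alteration strategies can be organized as a small prompt builder, sketched below. The strategy instructions paraphrase the list above; the surrounding prompt wording, the dictionary keys, and the function name are assumptions for illustration and are not the prompts actually sent to GPT-4 (those are in the supplement).

```python
# Instructions paraphrase the five strategies described in the text.
HARD_NEGATIVE_STRATEGIES = {
    "action_alter": "Change the action to an unrelated or opposite action while preserving the reason.",
    "reason_alter": "Change the reason to an unrelated or opposite reason.",
    "adjective_alter": "Negate or modify adjectives to make the statement incorrect.",
    "object_swap": "Substitute at least one object in the statement with an unrelated object.",
    "statement_alter": "Generate a completely unrelated action-reason statement.",
}


def build_hard_negative_prompts(statement):
    """Return one LLM prompt per strategy for a ground-truth action-reason statement."""
    prompts = {}
    for name, instruction in HARD_NEGATIVE_STRATEGIES.items():
        # Hypothetical framing around the strategy instruction.
        prompts[name] = (
            "You are given a correct action-reason statement for an ad.\n"
            f"Statement: {statement}\n"
            f"Task: {instruction}\n"
            "Return only the new, incorrect statement."
        )
    return prompts


prompts = build_hard_negative_prompts(
    "I should buy these chips because they are spicy."
)
```

Because the prompts depend only on the statement text, generation stays image-agnostic, in line with the scalability argument made for this approach.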

We validated our hard negatives by sampling 100 images and asking annotators to select correct statements (in supp).

Implementation. For GPT-4, GPT-3.5, and Vicuna, the temperature is set to 0. For LLaVA [25], BLIP-2 [23], InternVL models [11], and InternLM [8], we used the original settings. We applied 8-bit quantization for LLaVA, MiniGPT-4, Vicuna, InternVL, and InternLM. All experiments were zero-shot and conducted on an NVIDIA Quadro RTX A5000 GPU with 24 GB memory. Example prompts are in supp.

5 Results

The key goal is to benchmark VLMs and evaluate their understanding of persuasive ads. This section first presents results on VLMs/LLMs’ atypicality understanding tasks, forming the foundation for our atypicality-aware verbalization. Then, we explore whether atypicality can help ad understanding on Action-Reason Retrieval (ARR).

5.1 Atypicality Understanding Results

We assess the atypicality understanding of open-source VLMs (LLaVA [25], InstructBLIP [13]) and GPT-4V. Additionally, we evaluate the effectiveness of two prompting strategies for capturing atypicality: (1) $IN$ (Describe the image in detail.) and (2) $UH$ (What is unusual about the image?). These strategies are compared to the VLMs and the $V+T$ baseline across three LLMs (GPT-4, GPT-3.5, Vicuna). Table 1 summarizes the results on MAC and ASR, while AOR results are in Table 2.

| Classifier | Method | MAC Precision (✓) | MAC Precision (×) | MAC Recall (✓) | MAC Recall (×) | MAC F1-score (✓) | MAC F1-score (×) | ASR Acc |
|---|---|---|---|---|---|---|---|---|
| LLaVA [26] | $I$ | 27.75 | 27.75 | 42.38 | 52.71 | 21.24 | 26.03 | 18.83 |
| LLaVA [26] | $IN$ | 25.12 | 31.40 | 42.44 | 53.04 | 25.06 | 31.32 | 20.90 |
| LLaVA [26] | $UH$ | 44.35 | 30.44 | 42.04 | 52.44 | 24.16 | 29.98 | 17.90 |
| InstructBLIP [13] | $I$ | 34.81 | 27.60 | 41.43 | 50.73 | 17.72 | 20.18 | 19.76 |
| Vicuna [12] | $V+T$ | 36.70 | 30.64 | 41.73 | 45.78 | 32.52 | 31.66 | 14.30 |
| Vicuna [12] | $IN$ | 37.71 | 32.04 | 43.70 | 45.91 | 34.51 | 32.09 | 23.29 |
| Vicuna [12] | $UH$ | 39.41 | 33.33 | 36.05 | 42.88 | 27.35 | 30.36 | 14.74 |
| GPT-3.5 | $V+T$ | 41.46 | 35.36 | 23.21 | 21.54 | 28.18 | 24.95 | 50.00 |
| GPT-3.5 | $IN$ | 46.28 | 42.50 | 25.13 | 14.75 | 28.49 | 19.64 | 50.55 |
| GPT-3.5 | $UH$ | 49.10 | 43.34 | 27.38 | 30.92 | 27.06 | 28.24 | 50.05 |
| GPT-4 | $V+T$ | 40.38 | 35.95 | 22.56 | 6.69 | 22.66 | 10.99 | 52.44 |
| GPT-4 | $IN$ | 54.78 | 53.40 | 27.19 | 13.64 | 30.58 | 20.91 | 57.70 |
| GPT-4 | $UH$ | 53.49 | 51.01 | 29.15 | 28.89 | 34.62 | 33.05 | 56.89 |

Table 1: Atypicality Understanding Tasks (MAC & ASR) on Full-set. $UH$ is very effective on MAC. $IN$ slightly outperforms $UH$ on ASR; ASR also requires identifying the objects, which may not be well-represented in $UH$. ✓/× denotes performance with/without the No Atypicality (NA) class. Best result per LLM bolded.

VLMs struggle with direct atypicality inference. Table 1 shows VLMs consistently underperform verbalization approaches ($UH$ and $IN$) in both MAC (F1-score) and ASR. For instance, LLaVA shows high recall (52.71) but low precision (27.75) in the MAC task, indicating it over-predicts atypicalities. This trend is particularly evident in the OIO and TR2 categories, where LLaVA achieves perfect and near-perfect recall (1.0 and 0.79) but low precision (0.18 and 0.23). Conversely, recall for other categories doesn’t exceed 0.24, suggesting a bias towards OIO and TR2 predictions (category-wise metrics not shown in table). InstructBLIP exhibits similar issues. This suggests VLMs lack the reasoning ability to accurately infer atypicality.
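To illustrate why over-prediction yields high recall with low precision, the sketch below computes per-category precision and recall for a multi-label classifier that assigns the same category to every image. The function name and the toy labels are hypothetical; they are not taken from the dataset.

```python
from typing import Dict, List, Set, Tuple


def per_class_precision_recall(
    y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Compute (precision, recall) for each label in a multi-label setting."""
    results = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[label] = (precision, recall)
    return results


# Toy labels (hypothetical): five images, OIO is correct for only one of them.
y_true = [{"OIO"}, {"TR1"}, {"OR"}, {"TR2"}, {"TR1"}]
# A classifier that predicts OIO for every image and nothing else.
y_pred = [{"OIO"} for _ in y_true]

metrics = per_class_precision_recall(y_true, y_pred, ["OIO", "TR1"])
```

In this toy case the over-predicted category reaches perfect recall while its precision equals the category's prevalence, and the categories that are never predicted get zero recall, which is the same pattern the category-wise numbers above describe.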

$V+T$ lacks sufficient context to fully understand the image. $V+T$ provides inadequate context for extracting atypicality, as evidenced by GPT-4’s low recall/F1: it favors the NA class (precision/recall scores of 58.11/86.04 for NA) and has low recall for atypical categories (6.00, 0.79, 6.10, and 13.86 for TR1, TR2, OIO, and OR; not shown in table). Smaller LLMs (Vicuna and GPT-3.5) perform better than GPT-4 with $V+T$, but their improvement is likely due to hallucination rather than extracting useful information. In contrast, $UH$ and $IN$ provide richer descriptions that better capture atypicality.

VLMs are effective for verbalization. In MAC, $UH$ emerges as the top-performing strategy, effectively extracting unusualness and atypicality from images. It significantly improves the $V+T$ F1 score without NA (details in Section 3.1) by 3.29 and 22.06 for GPT-3.5 and GPT-4, respectively, and is only slightly lower than Vicuna’s $V+T$. $UH$ also outperforms $IN$ overall. In ASR, $UH$ is more effective than $V+T$ and outperforms VLMs by a significant margin when used in GPT models (e.g., by 37.13 with GPT-4 compared to the strongest VLM). However, $IN$ slightly outperforms $UH$ on GPT-4/GPT-3.5 and significantly on Vicuna. These results indicate that identifying atypicality requires both image understanding and reasoning capabilities. ASR requires the model to identify the correct atypicality statement, which includes both objects and their relation (e.g., search bar completely replaces mouth in its usual context, assuming its function or position in Fig. 1 (a)). Thus, $IN$ may be a better verbalization as it provides more detailed information about the image, objects, and their relations, whereas $UH$ offers implicit information about the image’s unusualness. Therefore, we use $UH$ (best on MAC) and $IN$ (best on ASR) along with $V+T$ to create atypicality-aware verbalization. Inspired by ASR results, we also adopt $IN$ for detecting the atypicality statement $\hat{s}_{IN}$, which is combined with our verbalization for ARR.

| Model | Avg. similarity score ($\hat{s}$ to $s^{+}$) | % of scores > 0.7 | % of scores > 0.6 | % of scores > 0.5 |
|---|---|---|---|---|
| BLIP2 [23] | 0.45 | 8.77 | 19.78 | 35.43 |
| InstructBLIP [13] | 0.46 | 9.54 | 21.24 | 40.76 |
| MiniGPT4 [54] | 0.51 | 15.24 | 32.28 | 51.71 |
| LLaVA [26] | 0.59 | 31.41 | 51.35 | 65.16 |
| GPT-4V † [1] | 0.67 | 46.94 | 61.63 | 77.14 |

Table 2: AOR on Full-set. † on Small-set.

VLMs’ limitations in Atypical Object Recognition. Table 2 explores VLMs’ performance on the AOR task, which involves generating primary and secondary objects given only the GT atypicality relation to complete its atypicality statement from Fig. 3. Scores above 0.7 indicate strong semantic overlap, while scores between 0.5 and 0.7 indicate moderate overlap. The results underscore current VLMs’ difficulty in reasoning and recognizing atypical objects. BLIP2 and InstructBLIP perform poorly, with most predictions scoring below 0.5, highlighting their struggle to recognize primary and secondary objects in atypical contexts. GPT-4V emerges as the most proficient model, yet only about half of its predictions surpass the 0.7 mark. These results highlight the need to improve VLMs’ reasoning in complex visual scenes, such as atypical objects.
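The summary statistics reported for AOR (average similarity and the share of scores above each threshold) can be computed as in the sketch below. In practice the scores would be cosine similarities between sentence embeddings of the predicted and annotated statements; here they are hard-coded, hypothetical values so that the example runs without a model, and the function name is an assumption.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_similarities(
    scores: Sequence[float], thresholds: Sequence[float] = (0.7, 0.6, 0.5)
) -> Dict[str, float]:
    """Average similarity plus the percentage of scores strictly above each threshold."""
    summary = {"avg": mean(scores)}
    for t in thresholds:
        summary[f">{t}"] = 100.0 * sum(s > t for s in scores) / len(scores)
    return summary


# Hypothetical similarities between predicted and annotated statements.
scores = [0.82, 0.65, 0.41, 0.73, 0.55, 0.30, 0.91, 0.48]
summary = summarize_similarities(scores)
```

Reporting several thresholds alongside the mean is useful because two models with similar averages can differ considerably in how many predictions reach strong overlap.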

5.2 Action-Reason Retrieval Results

We evaluate using 4 LLMs (GPT-4, GPT-3.5, Vicuna, InternLM in supp) and compare them to VLMs (CLIP, LLaVA, InstructBLIP, InternVL-Chat-V1-1, GPT-4V, and LLaVA1.6 and InternVL2-8B in supp). Table 3 presents the multi-option action-reason retrieval results (ARR) and their comparison with VLMs on full-set. Table 4 exhibits ARR results on small-set when GPT-4V is the VLM in our pipeline. We adopted Vicuna as the public LLM because it’s used in LLaVA, InstructBLIP, and MiniGPT4. GPT-4 and GPT-4V were used as powerful LLM/VLM pairs.

| Classifier | Verb. | Precision@1 | Precision@2 | Precision@3 | Top-1 Acc | Top-2 Acc | Avg |
|---|---|---|---|---|---|---|---|
| CLIP [32] | $I$ | 61.04 | 33.86 | 22.66 | 23.72 | 44.61 | 37.18 |
| CLIP [32] | $I+T$ | 46.15 | 24.36 | 16.24 | 15.15 | 31.25 | 26.63 |
| LLaVA [26] | $I$ | 59.67 | 38.27 | 26.06 | 32.92 | 48.14 | 41.01 |
| LLaVA [26] | $I+\mathcal{T}_{\mathcal{V}}$ | 59.45 | 37.37 | 25.14 | 27.49 | 47.07 | 39.30 |
| InstructBLIP [13] | $I$ | 15.05 | 10.03 | 7.80 | 13.04 | 13.04 | 11.79 |
| InternVL-Chat-V1-1 [11] | $I$ | 52.22 | 32.79 | 22.17 | 22.51 | 40.66 | 34.07 |
| Vicuna [12] | $V+T$ | 64.13 | 40.71 | 27.57 | 21.49 | 43.41 | 39.46 |
| Vicuna [12] | $\mathcal{T}_{\mathcal{V}}$ (Ours) | 67.38 | 44.01 | 29.94 | 23.20 | 41.95 | 41.30 |
| Vicuna [12] | $\mathcal{T}_{\mathcal{V}}+\hat{s}_{IN}$ (Ours) | 68.32 | 44.52 | 30.25 | 22.95 | 43.24 | 41.86 |
| Vicuna [12] | $\mathcal{T}_{\mathcal{V}}$ (GPT-4 Verb.) (Ours) | 68.49 | 44.52 | 30.37 | 24.06 | 43.24 | 42.14 |
| GPT-4 | $V+T$ | 93.73 | 84.42 | 70.50 | 71.50 | 89.87 | 82.00 |
| GPT-4 | $\mathcal{T}_{\mathcal{V}}$ (Ours) | 93.99 | 86.35 | 72.96 | 74.94 | 91.16 | 83.88 |
| GPT-4 | $\mathcal{T}_{\mathcal{V}}+\hat{s}_{IN}$ (Ours) | 95.54 | 87.55 | 74.62 | 88.42 | 93.40 | 87.91 |

Table 3: ARR on Full-set. Predicted atypicality statement $\hat{s}$ uses the respective prompting ($IN$) and LLM (Vicuna, GPT-4), with all LLMs using LLaVA verbalization. $\mathcal{T}_{\mathcal{V}}$ combines $V+T+IN+UH$ using Vicuna or GPT-4. Best number per block and column is bolded. Precision@k is unranked. Task is Multi-ARR.

LLMs are more powerful than VLMs. In Tables 3 and 4, LLMs with atypicality-aware verbalization ($\mathcal{T}_{\mathcal{V}}$) consistently outperform VLMs. For example, GPT-4 surpasses LLaVA by 42.87 points. GPT-4 also outperforms a strong VLM, GPT-4V, in all metrics (Table 4). Results in supp confirm this trend across InternVL2-8B and LLaVA 1.6. This highlights the superior reasoning ability of LLMs in understanding atypicality and action-reason statements.
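The 42.87-point gap can be reproduced from the Table 3 rows for LLaVA and GPT-4. The short check below assumes that the ‘Avg’ column is the unweighted mean of the five reported metrics; this is an inference from the numbers rather than something stated in the table caption.

```python
from statistics import mean

# Rows copied from Table 3: prec@1, prec@2, prec@3, top-1 acc, top-2 acc.
llava_i = [59.67, 38.27, 26.06, 32.92, 48.14]
gpt4_tv = [93.99, 86.35, 72.96, 74.94, 91.16]

# Assumption: 'Avg' is the unweighted mean of the five metrics.
llava_avg = round(mean(llava_i), 2)
gpt4_avg = round(mean(gpt4_tv), 2)
gap = round(gpt4_avg - llava_avg, 2)
```

Under that assumption the two averages match the reported 41.01 and 83.88, and their difference matches the gap quoted in the text.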

Interestingly, when LLaVA is provided with both image and $\mathcal{T}_{\mathcal{V}}$ (LLaVA($I+\mathcal{T}_{\mathcal{V}}$)), it underperforms Vicuna by 7.93, 6.64, and 4.8 points on prec@1, prec@2, and prec@3, respectively. Also, its performance drops by 1.71 points compared to LLaVA($I$). This reveals that despite using Vicuna as its LLM, LLaVA’s reasoning ability is limited, hindering its effective use of $\mathcal{T}_{\mathcal{V}}$ for action-reason retrieval.

Figure 5: ARR error analysis on Full-set.
| Classifier | Precision@1 | Precision@2 | Precision@3 | Top-1 Acc | Top-2 Acc |
|---|---|---|---|---|---|
| LLaVA | 59.67 | 38.27 | 26.06 | 32.92 | 48.14 |
| GPT-4V | 97.17 | 89.91 | 74.86 | 77.01 | 90.32 |
| GPT-4 ($\mathcal{T}_{\mathcal{V}}$) | 97.58 | 90.72 | 76.61 | 81.04 | 92.74 |

Table 4: GPT-4V verb. in Multi-ARR on Small-set.

Atypicality boosts persuasive visual media understanding. Table 3 demonstrates the effectiveness of our atypicality-aware verbalization compared to VLMs and the $V+T$ verbalization baseline. While Vicuna($V+T$) lags behind LLaVA by 1.55 points on avg, indicating insufficient context from basic verbalization, our $\mathcal{T}_{\mathcal{V}}$ consistently improves $V+T$ performance. We observe gains of 3.8/2.8 and 1.93/2.46 on prec@2/prec@3 for Vicuna and GPT-4, respectively. Combining $\mathcal{T}_{\mathcal{V}}$ with $\hat{s}_{IN}$ yields the best results across all LLMs (GPT-3.5 in supp), confirming the benefit of incorporating atypicality-aware verbalization and the atypicality statement for understanding persuasive ads. Interestingly, Vicuna shows minimal gains when using GPT-4’s verbalization, suggesting that our strategy does not depend heavily on large LLMs like GPT-4.

Table 5 investigates how the addition of atypicality statements impacts VLM performance on ARR using three methods: (1) $I+s^{+}$, with true atypicality statements $s^{+}=(a^{+},o^{+p},o^{+s})$; (2) $I+\hat{s}$, with predicted atypicality statements $\hat{s}=(\hat{a},\hat{o}^{p},\hat{o}^{s})$ by GPT-4 and $\mathcal{T}_{\mathcal{V}}$; and (3) $I+\bar{s}$, with incorrect atypicality statements using correct objects but incorrect relations, $\bar{s}=(\bar{a},o^{+p},o^{+s})$. The results reveal that VLMs benefit from atypicality statements. However, for GPT-4 (an LLM), adding the atypicality statement to $\mathcal{T}_{\mathcal{V}}$ is not as useful, since $\mathcal{T}_{\mathcal{V}}$ already contains correct atypicality information, and incorrect statements reduce the performance. These findings highlight the importance of incorporating atypicality to improve VLM performance on ARR tasks.
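The four input conditions in this ablation can be sketched with a small data structure for an atypicality statement (relation, primary object, secondary object). Everything below is illustrative: the class and function names are hypothetical, the relation assigned to the example is chosen arbitrarily, and picking the first differing relation is an assumed way to build the incorrect statement, not the documented procedure.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

# Relation labels mentioned in Section 5.1; their definitions are given elsewhere.
RELATIONS = ("TR1", "TR2", "OIO", "OR")


@dataclass(frozen=True)
class AtypicalityStatement:
    relation: str   # a
    primary: str    # o^p
    secondary: str  # o^s


def make_incorrect(gt: AtypicalityStatement) -> AtypicalityStatement:
    """Keep the ground-truth objects but swap in a different relation."""
    wrong = next(r for r in RELATIONS if r != gt.relation)
    return replace(gt, relation=wrong)


def build_inputs(
    image_id: str, gt: AtypicalityStatement, predicted: AtypicalityStatement
) -> Dict[str, Tuple[str, Optional[AtypicalityStatement]]]:
    """Map each condition name to the (image, statement) pair given to the model."""
    return {
        "I": (image_id, None),
        "I+s+": (image_id, gt),
        "I+s_hat": (image_id, predicted),
        "I+s_bar": (image_id, make_incorrect(gt)),
    }


# Hypothetical example; the relation label is arbitrary, not an annotation.
gt = AtypicalityStatement("TR1", "beer", "feather")
predicted = AtypicalityStatement("TR1", "glass of beer", "feather")
inputs = build_inputs("ad_0001", gt, predicted)
```

Keeping the objects fixed in the incorrect condition isolates the effect of the relation, so any drop relative to the ground-truth condition can be attributed to the wrong relation rather than to wrong objects.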

Generalization to typical images. We compared our pipeline with Vicuna ($\mathcal{T}_{\mathcal{V}}$) on typical images (i.e., no atypicality) against LLaVA on small-set ARR (details in supp). Vicuna ($\mathcal{T}_{\mathcal{V}}$) achieved 71.2%/48.6%/33.2% vs. 66.4%/42.2%/28.3% for LLaVA on prec@1/2/3, demonstrating our approach’s effectiveness beyond atypical ads.

Figure 6: ASR error analysis on Full-set
| Model | $I$ | $I+s^{+}$ | $I+\hat{s}$ | $I+\bar{s}$ |
|---|---|---|---|---|
| LLaVA | 26.00 | 35.18 | 54.28 | 28.16 |
| InstructBLIP | 20.44 | 23.25 | 23.40 | 19.69 |
| GPT-4V | 86.87 | 89.35 | 87.24 | 86.96 |
| GPT-4 ($\mathcal{T}_{\mathcal{V}}$) | 96.77 | 91.42 | 96.76 | 90.20 |

Table 5: Atypicality ablation on ARR Small-set. $I$: image, $s^{+}$: GT atyp., $\hat{s}$: predicted atyp. based on $\mathcal{T}_{\mathcal{V}}$, $\bar{s}$: incorrect atyp. Acc on Single ARR. Best result per row in bold.

5.3 Further Analysis & Ablation

| Neg. Strategy | Model | Multi Precision@1 | Multi Precision@2 | Multi Precision@3 | Single Acc |
|---|---|---|---|---|---|
| 12 Neg. [18, 19] | CLIP ($I$) | 98.79 | 97.58 | 92.20 | 96.77 |
| 12 Neg. [18, 19] | CLIP ($I+T$) | 97.58 | 97.58 | 87.10 | 90.32 |
| 12 Neg. [18, 19] | LLaVA ($I$) | 93.47 | 74.08 | 56.33 | 94.31 |
| 12 Neg. [18, 19] | GPT4 ($\mathcal{T}_{\mathcal{V}}$) | 99.60 | 96.98 | 91.13 | 93.52 |
| 18 Hard Neg. | CLIP ($I$) | 64.52 | 34.48 | 22.98 | 20.97 |
| 18 Hard Neg. | CLIP ($I+T$) | 47.18 | 25.40 | 16.94 | 15.73 |
| 18 Hard Neg. | LLaVA ($I$) | 59.67 | 38.27 | 26.06 | 26.80 |
| 18 Hard Neg. | GPT4 ($\mathcal{T}_{\mathcal{V}}$) | 96.77 | 87.30 | 74.60 | 96.77 |

Table 6: Impact of semantically hard negatives on Small-set.

Hard Negatives ablation. Table 6 compares our semantically hard negatives against those used in prior work. VLM performance drops substantially when faced with our hard negatives, as evidenced by drops of 69.22/30.27 for CLIP($I$)/LLaVA($I$) on prec@3. Conversely, our method exhibits a decrease of no more than 17 across all metrics, demonstrating robustness in reasoning compared to VLMs.

Error analysis on ARR. In Fig. 5, VLMs (i.e., LLaVA and CLIP) do not make considerably more errors than GPT-4 ($\mathcal{T}_{\mathcal{V}}$) on ‘object swap’ since these negative options contain different objects from the ground-truth, making them easier for VLMs to identify. However, VLMs make far more errors than GPT-4 ($\mathcal{T}_{\mathcal{V}}$) on semantically incorrect negatives (‘action alter,’ ‘reason alter,’ ‘statement alter,’ and ‘adjective alter’). This shows VLMs mainly rely on visual elements rather than deeper reasoning (examples in supp).
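An error breakdown of this kind can be obtained by recording, for each query, the type of the option the model selected and counting the wrong selections per negative type. The sketch below assumes each selected option is tagged either ‘correct’ or with the strategy that produced it; the function name and the sample selections are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

NEGATIVE_TYPES = (
    "action alter",
    "reason alter",
    "adjective alter",
    "object swap",
    "statement alter",
)


def errors_by_negative_type(selected_types: Iterable[str]) -> Dict[str, int]:
    """Count how often each type of hard negative was wrongly selected."""
    counts = Counter(t for t in selected_types if t != "correct")
    return {t: counts.get(t, 0) for t in NEGATIVE_TYPES}


# Hypothetical selections made by a model over six queries.
selected = [
    "correct",
    "reason alter",
    "action alter",
    "correct",
    "reason alter",
    "object swap",
]
tally = errors_by_negative_type(selected)
```

Comparing such tallies across models shows whether errors concentrate on negatives that change objects or on those that change only the semantics of the statement.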

Error analysis on ASR. In Fig. 6, LLaVA makes notably more errors, particularly on ‘Wrong Relation’ options, which demand deeper reasoning than other negative types.

Effectiveness of each component in atypicality-aware verbalization. We ablate the effectiveness of each step in atypicality-aware verbalization on the ARR small-set (details in supp). The performance of $\mathcal{T}_{\mathcal{V}}$ shows the advantage of atypicality-aware verbalization over basic concatenation ($V+T+IN+UH$). Specifically, $\mathcal{T}_{\mathcal{V}}$ improves top-1 acc by 14.76 on GPT-3.5 and 12.78 on GPT-4. This is because LLaVA-generated descriptions are inherently noisy, and their naive concatenation can mislead the models. A further issue is increased prompt length. Additionally, $UH$ alone is less effective than $IN$, but $\mathcal{T}_{\mathcal{V}}$ outperforms $\mathcal{T}_{\mathcal{V}} \setminus UH$, showing that atypicality is important yet complementary to basic image information captured in $V$, $T$, and $IN$.
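The contrast between basic concatenation and an LLM-combined verbalization can be sketched as follows. The two VLM prompts are the ones quoted in Section 5.1; the merge instruction, the function names, and the stand-in `fake_llm` are assumptions made so that the example runs offline, and they do not reproduce the actual prompts or models.

```python
from typing import Callable, Dict

IN_PROMPT = "Describe the image in detail."
UH_PROMPT = "What is unusual about the image?"


def naive_concatenation(parts: Dict[str, str]) -> str:
    """Baseline: join the V, T, IN, and UH outputs without any filtering."""
    return " ".join(parts[key] for key in ("V", "T", "IN", "UH"))


def atypicality_aware_verbalization(
    parts: Dict[str, str], llm: Callable[[str], str]
) -> str:
    """Ask an LLM to merge the four sources into one description.

    The merge instruction is an assumed wording, not the authors' prompt.
    """
    prompt = (
        "Combine the following information about an advertisement image into "
        "one coherent description. Keep details about anything unusual.\n"
        f"V: {parts['V']}\n"
        f"T: {parts['T']}\n"
        f"IN ({IN_PROMPT}): {parts['IN']}\n"
        f"UH ({UH_PROMPT}): {parts['UH']}"
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call: simply echoes the last prompt line.
    return prompt.splitlines()[-1]


parts = {"V": "a", "T": "b", "IN": "c", "UH": "d"}
concatenated = naive_concatenation(parts)
merged = atypicality_aware_verbalization(parts, fake_llm)
```

Passing the LLM as a callable keeps the sketch independent of any particular model or API, so the same function could be used with Vicuna or GPT-4 behind the call.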

Generalization beyond ads. We test our atypicality-aware verbalization pipeline on WHOOPS! [7] in supp.

6 Conclusion

This work challenged VLMs on complex rhetorical visual media, focusing on atypicality in advertisements. We introduced three novel atypicality tasks and benchmarked VLMs on them and on ARR, revealing their limitations in advanced reasoning. Despite these limitations, VLMs showed potential in extracting relevant information for understanding atypical images. Our atypicality-aware verbalization strategy significantly enhances LLM performance on ARR tasks. Extensive experiments demonstrate that our approach outperforms VLM baselines, demonstrating the effectiveness of incorporating atypicality inference for understanding ad images. These findings highlight the importance of atypicality in interpreting complex visual media and the superior reasoning abilities of LLMs over VLMs.

Limitations. The PittAd dataset [18] is widely used for understanding visual media (e.g., [44, 2, 19, 17, 30]), but it contains many annotations reflecting human biases and some images with sensitive content. Exploring these biases is beyond the scope of this paper.

References

  • [1] Gpt-4v(ision) system card. 2023.
  • [2] Arjun R Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T Freeman, et al. Metaclue: Towards comprehensive visual metaphors research. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23201–23211, 2023.
  • [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • [4] Anna Bavaresco, Alberto Testoni, and Raquel Fernández. Don’t buy it! Reassessing the ad understanding abilities of contrastive multimodal models. In ACL (Short), Aug. 2024.
  • [5] Aanisha Bhattacharya, Yaman K Singla, Balaji Krishnamurthy, Rajiv Ratn Shah, and Changyou Chen. A video is worth 4096 tokens: Verbalize story videos to understand them in zero shot. In EMNLP, 2023.
  • [6] Yonatan Bitton, Ron Yosef, Eliyahu Strugo, Dafna Shahaf, Roy Schwartz, and Gabriel Stanovsky. Vasr: Visual analogies of situation recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 241–249, 2023.
  • [7] Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and compositional images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2616–2627, 2023.
  • [8] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
  • [9] Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, and Sijia Liu. Understanding and improving visual prompting: A label-mapping perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19133–19143, 2023.
  • [10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24185–24198, June 2024.
  • [12] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  • [13] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [14] Arka Ujjal Dey, Suman K Ghosh, Ernest Valveny, and Gaurav Harit. Beyond visual semantics: Exploring the role of scene text in image understanding. Pattern Recognition Letters, 149:164–171, 2021.
  • [15] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.
  • [16] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [17] Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. Detecting persuasive atypicality by modeling contextual compatibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 972–982, 2021.
  • [18] Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. Automatic understanding of image and video advertisements. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1705–1715, 2017.
  • [19] Zhiwei Jia, Pradyumna Narayana, Arjun Akula, Garima Pruthi, Hao Su, Sugato Basu, and Varun Jampani. KAFA: Rethinking image ad understanding with knowledge-augmented feature adaptation of vision-language models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 772–785, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [20] Kanika Kalra, Bhargav Kurma, Silpa Vadakkeeveetil Sreelatha, Manasi Patwardhan, and Shirish Karande. Understanding advertisements with bert. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7542–7547, 2020.
  • [21] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
  • [22] Yunshi Lan, Xiang Li, Xin Liu, Yang Li, Wei Qin, and Weining Qian. Improving zero-shot visual question answering via large language models with reasoning question prompts. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4389–4400, 2023.
  • [23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
  • [24] Liunian Li, Zi-Yi Dou, Nanyun Peng, and Kai-Wei Chang. Desco: Learning object recognition with rich language descriptions. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 37511–37526. Curran Associates, Inc., 2023.
  • [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023.
  • [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  • [27] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [28] Edward F McQuarrie and David Glen Mick. Visual rhetoric in advertising: Text-interpretive, experimental, and reader-response analyses. Journal of consumer research, 26(1):37–54, 1999.
  • [29] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.
  • [30] Trisha Mittal, Puneet Mathur, Aniket Bera, and Dinesh Manocha. Affect2mm: Affective analysis of multimedia content using emotion causality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5661–5671, 2021.
  • [31] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
  • [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [33] Werner Reinartz and Peter Saffert. Creativity in advertising: When it works and when it doesn’t. Harvard Business Review, 91(6):106–111, 2013.
  • [34] Robert E Smith, Scott B MacKenzie, Xiaojing Yang, Laura M Buchholz, and William K Darley. Modeling the determinants and effects of creativity in advertising. Marketing science, 26(6):819–833, 2007.
  • [35] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297, 2020.
  • [36] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
  • [37] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
  • [38] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
  • [39] Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C Lawrence Zitnick, and Devi Parikh. Learning common sense through visual abstraction. In Proceedings of the IEEE international conference on computer vision, pages 2542–2550, 2015.
  • [40] Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. T-sciq: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19162–19170, 2024.
  • [41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • [42] Judith Williamson. Decoding advertisements, volume 4. Marion Boyars London, 1978.
  • [43] Yueting Yang, Xintong Zhang, and Wenjuan Han. Enhance reasoning ability of visual-language models via large language models. arXiv preprint arXiv:2305.13267, 2023.
  • [44] Keren Ye and Adriana Kovashka. ADVISE: Symbolism and external knowledge for decoding advertisements. In Proceedings of the European Conference on Computer Vision (ECCV), pages 837–855, 2018.
  • [45] Keren Ye, Narges Honarvar Nazari, James Hahn, Zaeem Hussain, Mingda Zhang, and Adriana Kovashka. Interpreting the rhetoric of visual advertisements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(4):1308–1323, 2019.
  • [46] Keren Ye, Mingda Zhang, and Adriana Kovashka. Breaking shortcuts by masking for robust visual reasoning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3520–3530, 2021.
  • [47] Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A Ayyubi, Kai-Wei Chang, and Shih-Fu Chang. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. arXiv preprint arXiv:2305.14985, 2023.
  • [48] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6720–6731, 2019.
  • [49] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
  • [50] Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.
  • [51] Kankan Zhou, Eason Lai, Wei Bin Au Yeong, Kyriakos Mouratidis, and Jing Jiang. ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense. arXiv preprint arXiv:2310.19301, 2023.
  • [52] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
  • [53] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
  • [54] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024.