Benchmarking VLMs’ Reasoning About Persuasive Atypical Images
Abstract
Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer’s lightness.
We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs’ understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad’s message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements.
Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
1 Introduction
In visual media, particularly advertisements, creators employ creative visual rhetoric to capture attention and convey memorable, powerful messages. They intentionally deviate from realism, depicting objects in unique and atypical ways [33, 28]. Creative ads that are “out of the ordinary” or “connect objects that are usually unrelated” can generate twice as much revenue as non-creative ads [33].
Atypical imagery in ads often involves transforming objects metaphorically [42, 45]. These creative transformations are not random; they are carefully chosen to convey specific ideas [42]. For example, Fig. 1(a) depicts a text box as tape to suggest silencing, while in (d), potato chips are shown as flames to metaphorically represent spiciness, borrowing properties from fire (hotness symbolizing flavor). Understanding these atypical images requires more than just recognizing objects. It requires advanced reasoning skills, including knowledge of cultural contexts and social norms, posing a significant challenge for AI systems.
Modern pretrained vision-language models (VLMs), such as LLaVA [26, 25], demonstrate strong visual understanding across various tasks such as recognition [24], and capabilities like zero-shot generalizability. However, there is a lack of in-depth study on VLMs’ ability to understand complex persuasive images, such as advertisements.
We address this gap by introducing three novel tasks over PittAds [18] to evaluate VLMs’ understanding of atypicality: (1) multi-label atypicality classification, where the model predicts the type of atypicalities in the image; (2) atypicality statement retrieval, where the model retrieves correct atypicality statements describing the atypicality relation among objects; (3) atypical object recognition, where the model generates objects to complete an atypicality statement based on a given relation. These tasks are essential as prior works’ binary classification oversimplifies atypicality’s nuanced nature. Our evaluation shows that although VLMs struggle with direct atypicality inference, they can extract valuable information about atypical aspects.
Next, we investigate how atypicality influences understanding an ad’s message. We use the action-reason retrieval (ARR) task [18, 45], which requires models to identify the suggested action (e.g., “buy these chips”) and its rationale (e.g., “because they are spicy”). To rigorously test the model’s reasoning, we introduce semantically challenging negative options rather than mining hard negatives from other images [44, 19]. For example, we generate statements that include a wrong action (e.g., “don’t buy these chips”) or a wrong rationale (e.g., “because they are sweet”). This prevents VLMs from ruling out negatives by merely comparing objects in the image and options. Our evaluation shows a significant performance drop when a VLM is faced with hard negatives (e.g., LLaVA drops by 67.51%).
Finally, we hypothesize that deep atypicality understanding improves action-reason retrieval performance. We test this by proposing an atypicality-aware verbalization. We explore a few simple prompting strategies to construct a comprehensive, atypicality-sensitive ad verbalization, which is used to predict the atypicality statement. This statement, combined with the atypicality-aware verbalization, serves as input to an LLM classifier. This LLM integrates all the information to retrieve the final action-reason, effectively leveraging both visual understanding and high-level reasoning capabilities.
Our proposed framework achieves state-of-the-art performance on the ARR task. Interestingly, when a VLM is given both the image and our atypicality-aware verbalization, its performance on the ARR task actually declines (e.g., LLaVA with both inputs scores 1.71 points lower than LLaVA with the image alone), and it significantly underperforms the LLM. This stark contrast highlights a critical gap: VLMs lack the advanced reasoning capabilities of LLMs when interpreting complex, atypical visual media. To summarize, our contributions are:
1. We introduce three novel tasks for understanding atypicality in persuasive media.
2. We pioneer the use of atypicality inference in action-reason retrieval and are the first to benchmark VLMs/LLMs for advertisement understanding.
3. We generate semantically challenging negatives using GPT-4 for action-reason retrieval, revealing VLMs’ reasoning limitations in interpreting atypical ads.
We hope this work inspires the inclusion of persuasive ads in VLM benchmarks, fosters the development of robust models for complex visual media, and offers insights for creating more effective advertisements.
2 Related Works
Creativity in advertising has long been of interest in advertising research. It has been broken down into categories, and its impact on the effectiveness of ads has been measured. Both [33] and [34] define the categories as originality, flexibility, synthesis, elaboration, and artistic value, which capture different shades of divergence from the ordinary. Atypicality most directly maps to synthesis. However, these creativity strategies have not been explored in computer vision for predicting the message of an ad.
Advertisement image understanding. The PittAds Dataset [18] introduced the action-reason retrieval task, establishing a baseline for automatic ad understanding. However, most studies have not explicitly captured advertising-specific strategies for this task, nor have they addressed atypicality. [44] incorporated symbolism, but the gains were minimal. Others utilized scene-text [14, 20], graph-based methods to incorporate external knowledge [46], and CLIP [32] for brand name understanding [19]. [5] used Automatic Speech Recognition, OCR, WikiData and BLIP-2 [23] to describe the stories of video ads. [2] analyzed metaphors in ads. Yet, the impact of atypicality on ad image understanding remains unexplored. The only exception is [17], which proposed a self-supervised approach to classify images as typical or atypical but did not classify the type of atypicality nor use them for action-reason prediction.
Vision-language models. We benchmark pre-trained VLMs and LLMs on tasks involving atypicality and advertisement image understanding, focusing on their zero-shot reasoning capabilities. We also use pre-trained VLMs to verbalize advertisement images for LLMs. Given the substantial computational power required for training and fine-tuning large models, off-the-shelf, frozen models are typically used [38, 3]. Techniques to align visual and textual features without parameter updates include optimizing image encoders [38], inserting cross-attention layers [3], prompt learning [53, 52, 9, 21], and employing external transformers [31, 23, 13, 25]. However, direct application of models like [23, 32] may miss hidden messages by focusing more on visual context than semantics. Note that we restrict our experiments to the zero-shot inference setting.
Language models for multi-modal reasoning. Recent studies [15, 49, 5, 43] have explored LLMs for reasoning tasks, including chain-of-thought reasoning [41]. Some works leverage LLMs in multi-modal reasoning. [49, 40, 50] extended chain-of-thought to a multimodal context. [22] uses an image-captioning model followed by ‘reasoning questions’ to aid an LLM in answering the main question. Related/concurrent works like [47] improve zero-shot reasoning by iteratively asking and answering questions with 3 VLMs/LLMs, and [29] uses scene-graphs to enhance compositional reasoning. [27, 49] devised sophisticated LLM-augmented tools for task subdivision and external tool selection. In contrast, this paper challenges the reasoning ability of VLMs on complex persuasive images through novel atypicality tasks and action-reason retrieval. It improves performance with a more lightweight atypicality-aware verbalization, and no external tools are needed.
VLM evaluation. Various vision-language benchmarks have been proposed, including tasks like recognition [16], captioning [10], commonsense reasoning [39, 6], VCR [48], and compositional reasoning [36]. However, these benchmarks feature straightforward scenes that do not challenge the model’s reasoning abilities beyond literal imagery. Recently, ROME [51] generated counter-intuitive images focusing on 5 primitive common sense types (e.g., color and size) to challenge models’ ability in object, attribute, and spatial relation recognition. Similarly, WHOOPS! [7] addressed this by creating 500 synthetic scenes by placing ‘normal’ objects in unusual contexts (e.g., a snowplow in a desert and Einstein holding a smartphone). Atypical advertising images differ from those in WHOOPS! and ROME in two important ways: First, they are real ads created by human designers to intentionally convey a particular message (e.g., cigar replacing a bullet, to highlight the dangers of smoking), rather than simply to be unusual. This requires models to detect atypicality and atypical objects and reason about their impact on the ad’s meaning, making ad image understanding a more realistic and challenging benchmark for evaluating VLMs’ reasoning abilities. Second, atypical ads feature more categories of atypicality than placing an object out of context or altering primitive attributes, as WHOOPS! and ROME do, respectively.
3 Methodology
This paper evaluates VLM/LLM understanding of atypical advertisements. We address two key questions: (1) Are current VLMs capable of reasoning about atypicality and understanding advertisements? (2) What is the impact of atypicality on understanding ad images?
Unlike prior works [17] that only classify images as typical or atypical, we propose three new tasks: Multi-label Atypicality Classification (MAC), Atypicality Statement Retrieval (ASR), and Atypical Object Recognition (AOR). MAC predicts multiple categories of atypicality in the image. ASR uses additional annotations to identify objects involved in the atypical portrayal (e.g., “The surface of the bottle mimics the texture of feather”). AOR evaluates VLMs’ visual reasoning by identifying primary and secondary objects in an atypical relation.
Our analysis shows that while VLMs initially struggle with the MAC and ASR tasks, they can extract valuable insights about atypical aspects of images. Leveraging these insights, we develop an atypicality-aware image verbalization. To detect atypicality, we use the prompt UH: “What is unusual about this image?” Atypicality adds depth to the content of the advertisement, complementing surface-level content like objects and scene descriptions. Thus, we combine UH and surface-level content to construct the final verbalization, which is then passed to an LLM for the action-reason inference task. We elaborate on the tasks and pipeline below.
3.1 Proposed Atypicality Understanding Tasks
Ye et al. [45] devised a taxonomy of atypicality based on object transformations. In this work, we focus on the subset of atypicality categories that entail two objects (examples in Fig. 1): (1) Texture Replacement 1 (TR1): Objects’ texture borrowed from another object, (2) Texture Replacement 2 (TR2): Texture created by combining several small objects, (3) Object Inside Object (OIO): An object is completely or partially inside of another object, and (4) Object Replacement (OR): The whole object appearing in the context normally associated with another. We define the following new atypicality understanding tasks, shown in Fig. 2.
Multi-label Atypicality Classification (MAC). Unlike prior works [17] that only detect the presence of atypicality, we formulate atypicality detection as a multi-label classification task. The PittAds dataset provides three annotations of atypicality per image from different annotators, which may vary by type. For example, Fig. 1(c) can be classified as ‘Object Inside Object’ (Earth inside a cup sleeve) and ‘Object Replacement’ (Earth replaces coffee cup). Some annotators may even label it as ‘Not Atypical’ (NA), creating five possible labels. MAC challenges VLMs to predict the relevant atypicality categories for an image based on atypicality definitions (prompts in supp). Due to the complexity of differentiating between these categories, we extend the definitions provided by [45] as shown in Fig. 3: they not only distinguish different atypicality categories but also hint at how atypicality impacts the image’s interpretation (e.g., Fig. 1(e) is TR1, where the beer’s texture is replaced by a feather to advertise its lightness).
Atypicality Statement Retrieval (ASR). ASR formulates atypicality inference as retrieving a statement describing relations between two objects. We represent atypicality with statement templates, defined in Fig. 3. A statement includes an atypicality type, a primary object, and a secondary object, as described in Sec. 3.6 of [45]. In TR1 and TR2, the primary object is the object with a new texture, and the secondary object is the source of that texture. In OIO, the primary object is the object inside, and the secondary object is the object outside. In OR, the primary object is the object placed in another’s context, and the secondary object is the expected object.
Given an ad, ASR distinguishes the correct atypicality statement , from a set of negative statements . We generate negatives as follows: (1) Wrong object: replacing and with objects from random images, producing negatives (e.g., and where are from a random image); (2) Wrong atypicality relation: altering the relation with one not in the ground-truths (objects remain the same) to create up to 3 negatives. (3) Swapped primary/secondary objects: we create . Thus, ASR tests the model’s understanding of objects, their atypicality relation, and primary/secondary object roles. It bridges the gap between MAC and action-reason retrieval by detecting atypical statements and enhancing action-reason retrieval. We use .
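The three negative types can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the template strings are placeholders standing in for the templates of Fig. 3, the names `Statement` and `build_negatives` and the number of wrong-object negatives are hypothetical, and for simplicity a wrong-object negative replaces both objects.

```python
import random
from dataclasses import dataclass

# Placeholder templates (assumption); the paper's actual wording is in its Fig. 3.
TEMPLATES = {
    "TR1": "The surface of {p} mimics the texture of {s}.",
    "TR2": "The texture of {p} is made up of many small {s}.",
    "OIO": "{p} is completely or partially inside {s}.",
    "OR": "{p} replaces {s} in its usual context.",
}


@dataclass(frozen=True)
class Statement:
    relation: str   # one of TR1, TR2, OIO, OR
    primary: str    # primary object
    secondary: str  # secondary object

    def text(self) -> str:
        return TEMPLATES[self.relation].format(p=self.primary, s=self.secondary)


def build_negatives(truth, true_relations, other_objects, n_wrong_object=2, rng=None):
    """Return negatives for one ground-truth statement.

    other_objects is a list of object names taken from other images.
    """
    rng = rng or random.Random(0)
    negatives = []
    # (1) Wrong object: keep the relation, draw both objects from other images.
    for _ in range(n_wrong_object):
        p, s = rng.sample(other_objects, 2)
        negatives.append(Statement(truth.relation, p, s))
    # (2) Wrong relation: keep the objects, use relations absent from the ground truth.
    for relation in TEMPLATES:
        if relation not in true_relations:
            negatives.append(Statement(relation, truth.primary, truth.secondary))
    # (3) Swapped roles: exchange primary and secondary objects.
    negatives.append(Statement(truth.relation, truth.secondary, truth.primary))
    return negatives


truth = Statement("TR1", "the bottle", "feather")
for negative in build_negatives(truth, {"TR1"}, ["car", "apple", "shoe"]):
    print(negative.text())
```

With four relation types and at least one ground-truth relation, step (2) yields at most three negatives, which matches the "up to 3" stated above.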
Atypical Object Recognition (AOR). To assess the VLMs’ fine-grained visual understanding, we introduce the task of recognizing the primary and secondary objects in an atypical image. Given an ad and the true atypicality, the model must generate the correct primary/secondary objects to complete the atypicality statement. AOR can be viewed as a fill-in-the-blank task (i.e., generative) based on the statement templates.
3.2 Proposed Approach
To explore the impact of atypicality in action-reason retrieval and compare the reasoning ability of VLMs and LLMs, we propose a three-step in-context learning method: (i) Atypicality-aware image verbalization: Using LLaVA [26] and an LLM, we generate a coherent verbalization sensitive to atypicality; (ii) Atypicality statement detection: We detect the atypicality statement; (iii) Action-reason retrieval: We combine steps (i) and (ii) to retrieve the final action-reason statement. Fig. 4 illustrates the method.
(i) Verbalize image in atypicality-aware manner. Each ad image is composed of visual content (V) (objects) and textual content (T) (scene-text). We obtain V and T by querying LLaVA for up to 5 objects and a list of scene-texts visible in the image. However, this information is insufficient to fully comprehend the image. Hence, we additionally prompt LLaVA to obtain (1) IN: LLaVA’s response when prompted “Describe the image in detail.” and (2) UH: LLaVA’s response when asked “What is unusual about the image?”. MAC results (Table 1) show that UH effectively captures image unusualness (closely related to atypicality), while IN provides scene and object information useful for retrieving the atypicality statement. Thus, we construct the final verbalization by combining V, T, IN, and UH using an LLM.
(ii) Detect atypicality statement. We construct all possible statements to predict the atypicality statement . Option Generator (teal module in Fig 4(b)) combines and to generate all possible statements . Specifically, each object pair is combined with all atypicality statement template to create atypicality statements and for all . We then pass these atypical statements along with verbalization into the classifier to predict (no ground-truth is used).
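The enumeration performed by the option generator can be pictured with a short sketch. It assumes placeholder template strings (the real ones are in Fig. 3) and a hypothetical function name `generate_options`; it only illustrates that every ordered object pair is combined with every template, so both role assignments of a pair are covered.

```python
from itertools import permutations

# Placeholder templates (assumption); the paper's actual wording is in its Fig. 3.
TEMPLATES = {
    "TR1": "The surface of {p} mimics the texture of {s}.",
    "TR2": "The texture of {p} is made up of many small {s}.",
    "OIO": "{p} is completely or partially inside {s}.",
    "OR": "{p} replaces {s} in its usual context.",
}


def generate_options(objects):
    """Combine every ordered (primary, secondary) pair with every template."""
    options = []
    # permutations(objects, 2) yields both (a, b) and (b, a) for each pair.
    for primary, secondary in permutations(objects, 2):
        for relation, template in TEMPLATES.items():
            options.append((relation, template.format(p=primary, s=secondary)))
    return options


options = generate_options(["beer", "feather", "glass"])
print(len(options))  # 3 objects -> 6 ordered pairs x 4 templates
```

Under these assumptions, five detected objects would give 5 × 4 ordered pairs and 80 candidate statements, all of which are passed to the classifier together with the verbalization.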
(iii) Retrieve action-reason statements. [18] provided three action-reason statements for each image, each offering different plausible reasons for the same action. Given these three plausible (i.e., positive) and many implausible (i.e., negative) action-reason statements, the task is to detect all plausible action-reason statements . We proposed various verbalization strategies, including concatenation and LLM-based combinations (i.e., ) of , as well as concatenation of with , to be utilized by an LLM for retrieving the final action-reason statement.
4 Experimental Setup
Datasets. PittAds [18] includes 64,832 ad images, with 3,928 annotated for atypicality, action-reason, and primary/secondary objects. For MAC, we use atypicality categories, while for AOR and ASR, we use primary/secondary objects along with our atypicality statement templates (Fig. 3) to generate ground truth for evaluation. We utilized the train set with 1,168 samples for the main results, as it includes at least one annotation of the atypicality categories studied and is larger than the test set. No training was performed. Ablation studies are reported on a smaller subset of 250 images due to the high cost.
Baselines. We use LLaVA (‘vicuna-13b-v1.5’) [26], InstructBLIP (‘vicuna-13b-v0’) [13], MiniGPT4 [54] (‘vicuna-13b-v0’), and CLIP [32] (‘ViT-L/14@336px’ following [19]), and InternVL-Chat-V1-1 [11] as VLM baselines (and InternVL2-8B, LLaVA 1.6 in supp). LLaVA is our multimodal component due to its GPT-4-informed instruction tuning, state-of-the-art reasoning performance, and promptability [26]. We evaluate GPT-4V on a limited 250 examples, constrained by cost. We report BLIP-2 [23] (‘blip2-flan-t5-xl’) only for AOR as it failed to produce meaningful output when asked to return multiple options, i.e., MAC and multi-ARR (detail in supp).
Our analysis spans recent public LLMs, such as ‘vicuna-13b-v1.5’ (Vicuna) [37], ‘InternLM2-5‑7b‑chat’ (InternLM; see supp), and leading commercial models like GPT-3.5/4. We chose Vicuna as it is used in all VLMs (LLaVA, MiniGPT4, InstructBLIP) and InternLM as InternVL2-8B’s LM. We also compare GPT-4 and GPT-4V. We introduce CLIP () as a zero-shot baseline aligned with KAFA [19] but avoid direct comparison with KAFA as it is not publicly available. To assess our atypicality method, we compare against (verb. baseline), which includes basic image information (up to 5 objects, scene text).
Metrics. We use Precision, Recall, and F1-score to evaluate MAC (additional metrics in supp). For AOR, we assess sentence similarity between and using ‘all-mpnet-base-v2’ [35], a state-of-the-art sentence embedding method. Common text-matching and accuracy metrics aren’t suitable for AOR since it is a generative task, and annotations can vary widely (e.g., ‘beer,’ ‘glass of beer,’ ‘beer glass’) due to human inconsistencies and typos. We report accuracy (denoted ‘Acc’) for single statement retrieval tasks, where the model returns only one statement per query (i.e., ASR and Single ARR). Top-k Acc and unranked Precision@k are the metrics for the multi-option ARR, with . Note top-3 acc and prec@1 are the same.
Hard Negative (HN) Generation. To measure ARR performance, [44] mined hard negatives from images within the same topic, while [19] expanded the negative options. These negatives often include irrelevant objects, allowing VLMs to easily disregard them by comparing objects. This hinders accurate measurement of models’ reasoning ability. In contrast, a concurrent work [4] used annotators to write implausible statements based on visible objects/texts. Our approach differs in three key ways: (1) it is LLM-based, image-agnostic, and scalable; (2) it generates a wider variety of negatives while focusing on semantics (e.g., altering actions, reasons, adjectives, or swapping objects not visible such as ‘lipstick’ instead of ‘lip balm’), and (3) they evaluate contrastive-based VLMs, whereas we focus on generative VLM/LLMs with stronger reasoning ability.
Specifically, for each ground-truth action-reason statement, we ask GPT-4 to generate a negative action-reason statement by (1) Action alter: changing the action to an unrelated or opposite action while preserving the reason; (2) Reason alter: changing the reason to an unrelated or opposite reason; (3) Adjective alter: negating or modifying adjectives to make the statement incorrect; (4) Object swap: substituting at least one object in the statement with an unrelated object; (5) Statement alter: generating a completely unrelated action-reason statement.
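The five strategies lend themselves to a simple prompt-construction loop. The sketch below only builds the prompts; the instruction wording, the `STRATEGIES` keys, and `build_hard_negative_prompts` are hypothetical and are not the authors' actual prompts (those are in the supplementary material), and the call to GPT-4 is omitted.

```python
# Instruction texts paraphrase the five alteration types described above.
STRATEGIES = {
    "action_alter": "Change the action to an unrelated or opposite action while preserving the reason.",
    "reason_alter": "Change the reason to an unrelated or opposite reason.",
    "adjective_alter": "Negate or modify adjectives so that the statement becomes incorrect.",
    "object_swap": "Substitute at least one object in the statement with an unrelated object.",
    "statement_alter": "Write a completely unrelated action-reason statement.",
}


def build_hard_negative_prompts(statement: str) -> dict:
    """Return one prompt per strategy for a ground-truth action-reason statement."""
    prompts = {}
    for name, instruction in STRATEGIES.items():
        prompts[name] = (
            "You are given a correct action-reason statement for an advertisement.\n"
            f"Statement: {statement}\n"
            f"Task: {instruction}\n"
            "Return only the rewritten statement."
        )
    return prompts


prompts = build_hard_negative_prompts("I should buy these chips because they are spicy.")
print(prompts["action_alter"])
```

Because the prompts depend only on the ground-truth text, this construction is image-agnostic, which is the property the approach relies on for scalability.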
We validated our hard negatives by sampling 100 images and asking annotators to select correct statements (in supp).
Implementation. For GPT-4, GPT-3.5, and Vicuna, the temperature is set to 0. For LLaVA [25], BLIP-2 [23], InternVL models [11] and InternLM [8] we used the original settings. We applied 8-bit quantization for LLaVA, MiniGPT-4, Vicuna, InternVL and InternLM. All experiments were zero-shot and conducted on an NVIDIA Quadro RTX A5000 GPU with 24 GB memory. Example prompts are in supp.
5 Results
The key goal is to benchmark VLMs and evaluate their understanding of persuasive ads. This section first presents results on VLMs/LLMs’ atypicality understanding tasks, forming the foundation for our atypicality-aware verbalization. Then, we explore whether atypicality can help ad understanding on Action-Reason Retrieval (ARR).
5.1 Atypicality Understanding Results
We assess the atypicality understanding of open-source VLMs (LLaVA [25], InstructBLIP [13]) and GPT-4V. Additionally, we evaluate the effectiveness of two prompting strategies for capturing atypicality: (1) IN (Describe the image in detail.) and (2) UH (What is unusual about the image?). These strategies are compared to the VLMs and the V+T baseline across three LLMs (GPT-4, GPT-3.5, Vicuna). Table 1 summarizes the results on MAC and ASR, while AOR results are in Table 2.
Classifier | Method | MAC Precision (w/ NA) | MAC Precision (w/o NA) | MAC Recall (w/ NA) | MAC Recall (w/o NA) | MAC F1-score (w/ NA) | MAC F1-score (w/o NA) | ASR Acc
---|---|---|---|---|---|---|---|---
LLaVA [26] | I | 27.75 | 27.75 | 42.38 | 52.71 | 21.24 | 26.03 | 18.83
LLaVA [26] | IN | 25.12 | 31.40 | 42.44 | 53.04 | 25.06 | 31.32 | 20.90
LLaVA [26] | UH | 44.35 | 30.44 | 42.04 | 52.44 | 24.16 | 29.98 | 17.90
InstructBLIP [23] | I | 34.81 | 27.60 | 41.43 | 50.73 | 17.72 | 20.18 | 19.76
Vicuna [12] | V+T | 36.70 | 30.64 | 41.73 | 45.78 | 32.52 | 31.66 | 14.30
Vicuna [12] | IN | 37.71 | 32.04 | 43.70 | 45.91 | 34.51 | 32.09 | 23.29
Vicuna [12] | UH | 39.41 | 33.33 | 36.05 | 42.88 | 27.35 | 30.36 | 14.74
GPT 3.5 | V+T | 41.46 | 35.36 | 23.21 | 21.54 | 28.18 | 24.95 | 50.00
GPT 3.5 | IN | 46.28 | 42.50 | 25.13 | 14.75 | 28.49 | 19.64 | 50.55
GPT 3.5 | UH | 49.10 | 43.34 | 27.38 | 30.92 | 27.06 | 28.24 | 50.05
GPT 4 | V+T | 40.38 | 35.95 | 22.56 | 6.69 | 22.66 | 10.99 | 52.44
GPT 4 | IN | 54.78 | 53.40 | 27.19 | 13.64 | 30.58 | 20.91 | 57.70
GPT 4 | UH | 53.49 | 51.01 | 29.15 | 28.89 | 34.62 | 33.05 | 56.89
VLMs struggle with direct atypicality inference. Table 1 shows VLMs consistently underperform verbalization approaches ( and ) in both MAC (F1-score) and ASR. For instance, LLaVA shows high recall (52.71) but low precision (27.75) in the MAC task, indicating it over-predicts atypicalities. This trend is particularly evident in the OIO and TR2 categories, where LLaVA achieves perfect and near-perfect recall (1.0 and 0.79) but low precision (0.18 and 0.23). Conversely, recall for other categories doesn’t exceed 0.24, suggesting a bias towards OIO and TR2 predictions (category-wise metrics not shown in table). InstructBLIP exhibits similar issues. This suggests VLMs lack the reasoning ability to accurately infer atypicality.
V+T lacks sufficient context to fully understand the image. provides inadequate context for extracting atypicality, as evidenced by GPT-4’s low recall/F1 (precision/recall scores of 58.11/86.04 for NA) and low recall for atypical categories (6.00, 0.79, 6.10, and 13.86 for TR1, TR2, OIO, and OR; not shown in table). Smaller LLMs (Vicuna and GPT-3.5) perform better than GPT-4 with V+T, but their improvement is likely due to hallucination rather than extracting useful information. In contrast, and provide richer descriptions that better capture atypicality.
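The per-category recalls quoted above can be tied back to the aggregate recall entries with a small worked computation. The sketch assumes the aggregates are unweighted (macro) averages over categories, which is an assumption on our part; the helper name `macro_average` is hypothetical.

```python
# Per-category recall for GPT-4 with the V+T verbalization, as quoted in the text.
recall = {"TR1": 6.00, "TR2": 0.79, "OIO": 6.10, "OR": 13.86, "NA": 86.04}


def macro_average(scores, exclude=()):
    """Unweighted mean over categories, optionally leaving some out."""
    kept = [value for name, value in scores.items() if name not in exclude]
    return sum(kept) / len(kept)


with_na = macro_average(recall)                      # all five labels
without_na = macro_average(recall, exclude={"NA"})   # atypical categories only
print(f"with NA: {with_na:.3f}, without NA: {without_na:.3f}")
```

Under this assumption the two averages agree, up to rounding, with the 22.56 and 6.69 recall values reported for GPT-4 in Table 1, which shows how a high NA recall can mask near-zero recall on the atypical categories.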
VLMs are effective for verbalization. In MAC, emerges as the top-performing strategy, effectively extracting unusualness and atypicality from images. It significantly improves F1 score without NA (details in section 3.1) by 3.29 and 22.06 for GPT-3.5 and GPT-4, respectively, and is only slightly lower than Vicuna’s . also outperforms overall. In ASR, is more effective than and outperforms VLMs by a significant margin when used in GPT models (e.g., by 37.13 with GPT-4 compared to the strongest VLM). However, slightly outperforms on GPT-4/GPT-3.5 and significantly on Vicuna. These results indicate that identifying atypicality requires both image understanding and reasoning capabilities. ASR necessitates the model to identify the correct atypicality statement, which includes both objects and their relation (e.g., search bar completely replaces mouth in its usual context, assuming its function or position in Fig. 1 (a)). Thus, may be a better verbalization as it provides more detailed information about the image, objects, and their relations, whereas offers implicit information about the image’s unusualness. Therefore, we use (best on MAC) and (best on ASR) along with to create atypicality-aware verbalization. Inspired by ASR results we also adopt for detecting atypicality statement which is combined with our verbalization for ARR.
Model | Avg. similarity score | % of scores > 0.7 | % of scores > 0.6 | % of scores > 0.5
---|---|---|---|---
BLIP2 [23] | 0.45 | 8.77 | 19.78 | 35.43
InstructBLIP [13] | 0.46 | 9.54 | 21.24 | 40.76
MiniGPT4 [54] | 0.51 | 15.24 | 32.28 | 51.71
LLaVA [26] | 0.59 | 31.41 | 51.35 | 65.16
GPT-4V [1] | 0.67 | 46.94 | 61.63 | 77.14
VLMs’ limitations in Atypical Object Recognition. Table 2 explores VLMs performance on the AOR task, which involves generating primary and secondary objects given only GT atypicality relation to complete its atypicality statement from Fig. 3. Scores above 0.7 indicate strong semantic overlap, while scores between 0.5 and 0.7 indicate moderate overlap. The results underscore current VLMs’ difficulty in reasoning and recognizing atypical objects. InstructBLIP and MiniGPT4 perform poorly, with most predictions scoring below 0.5, highlighting their struggle to recognize primary and secondary objects in atypical contexts. GPT-4V emerges as the most proficient model, yet only about half of its predictions surpass the 0.7 mark. These results highlight the need to improve VLMs’ reasoning in complex visual scenes, such as atypical objects.
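The two kinds of numbers in Table 2 (average similarity and the share of predictions above each threshold) can be computed from per-example similarity scores as sketched below. The scores here are made-up illustrative values, and `summarize_similarities` is a hypothetical helper; in the paper the scores come from ‘all-mpnet-base-v2’ sentence embeddings.

```python
def summarize_similarities(scores, thresholds=(0.7, 0.6, 0.5)):
    """Return the mean score and the percentage of scores strictly above each threshold."""
    n = len(scores)
    mean = sum(scores) / n
    pct_above = {t: 100.0 * sum(s > t for s in scores) / n for t in thresholds}
    return mean, pct_above


# Illustrative similarity scores between generated and ground-truth statements.
example_scores = [0.9, 0.65, 0.55, 0.3]
mean, pct_above = summarize_similarities(example_scores)
print(mean, pct_above)
```

Because the thresholds are nested, the percentages are cumulative: every prediction counted above 0.7 is also counted above 0.6 and 0.5, which is why the columns of Table 2 increase from left to right.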
5.2 Action-Reason Retrieval Results
We evaluate using 4 LLMs (GPT-4, GPT-3.5, Vicuna, InternLM in supp) and compare them to VLMs (CLIP, LLaVA, InstructBLIP, InternVL-Chat-V1-1, GPT-4V, and LLaVA1.6 and InternVL2-8B in supp). Table 3 presents the multi-option action-reason retrieval results (ARR) and their comparison with VLMs on full-set. Table 4 exhibits ARR results on small-set when GPT-4V is the VLM in our pipeline. We adopted Vicuna as the public LLM because it’s used in LLaVA, InstructBLIP, and MiniGPT4. GPT-4 and GPT-4V were used as powerful LLM/VLM pairs.
Classifier | Verb. | Precision@1 | Precision@2 | Precision@3 | Top-1 Acc | Top-2 Acc | Avg
---|---|---|---|---|---|---|---
CLIP [32] |  | 61.04 | 33.86 | 22.66 | 23.72 | 44.61 | 37.18
CLIP [32] |  | 46.15 | 24.36 | 16.24 | 15.15 | 31.25 | 26.63
LLaVA [26] |  | 59.67 | 38.27 | 26.06 | 32.92 | 48.14 | 41.01
LLaVA [26] |  | 59.45 | 37.37 | 25.14 | 27.49 | 47.07 | 39.30
InstructBLIP [13] |  | 15.05 | 10.03 | 7.80 | 13.04 | 13.04 | 11.79
InternVL-Chat-V1-1 [11] |  | 52.22 | 32.79 | 22.17 | 22.51 | 40.66 | 30.07
Vicuna [12] |  | 64.13 | 40.71 | 27.57 | 21.49 | 43.41 | 39.46
Vicuna [12] | (Ours) | 67.38 | 44.01 | 29.94 | 23.20 | 41.95 | 41.30
Vicuna [12] | (Ours) | 68.32 | 44.52 | 30.25 | 22.95 | 43.24 | 41.86
Vicuna [12] | (GPT-4 Verb.) (Ours) | 68.49 | 44.52 | 30.37 | 24.06 | 43.24 | 42.14
GPT-4 |  | 93.73 | 84.42 | 70.50 | 71.50 | 89.87 | 82.00
GPT-4 | (Ours) | 93.99 | 86.35 | 72.96 | 74.94 | 91.16 | 83.88
GPT-4 | (Ours) | 95.54 | 87.55 | 74.62 | 88.42 | 93.40 | 87.91
LLMs are more powerful than VLMs. In Table 3 and 4, LLMs with atypicality-aware verbalization () consistently outperform VLMs. For example, GPT-4 surpasses LLaVA by 42.87 points. GPT-4 also outperforms a strong VLM, GPT-4V, in all metrics (Table 4). Results in supp confirm this trend across InternVL2-8B and LLaVA 1.6. This highlights the superior reasoning ability of LLMs in understanding atypicality and action-reason statements.
Interestingly, when LLaVA is provided with both the image and our verbalization, it underperforms Vicuna by 7.93, 6.64, and 4.8 points on prec@1, prec@2, and prec@3, respectively. Its performance also drops by 1.71 points compared to LLaVA with the image alone. This reveals that despite using Vicuna as its LLM, LLaVA’s reasoning ability is limited, hindering its effective use of the verbalization for action-reason retrieval.
Classifier | precision@k | Top-k acc | |||
---|---|---|---|---|---|
k=1 | k=2 | k=3 | k=1 | k=2 | |
LLaVA | 59.67 | 38.27 | 26.06 | 32.92 | 48.14 |
GPT-4V | 97.17 | 89.91 | 74.86 | 77.01 | 90.32 |
GPT-4 () | 97.58 | 90.72 | 76.61 | 81.04 | 92.74 |
Atypicality boosts persuasive visual media understanding. Table 3 demonstrates the effectiveness of our atypicality-aware verbalization compared to VLMs and the verbalization baseline. While Vicuna() lags behind LLaVA by 1.55 points on avg, indicating insufficient context from basic verbalization, our consistently improves performance. We observe gains of 3.8/2.8 and 1.93/2.46 on prec@2/prec@3 for Vicuna and GPT-4, respectively. Combining with yields the best results across all LLMs (GPT3.5 in supp), confirming the benefit of incorporating atypicality-aware verbalization and atypicality statement for understanding persuasive ads. Interestingly, Vicuna shows minimal gains when using GPT-4’s verbalization, suggesting that our strategy does not depend heavily on extensive LLMs like GPT-4.
Table 5 investigates how the addition of atypicality statements impacts VLM performance on ARR using three methods: (1) , with true atypicality statements ; (2) , with predicted atypicality statements by GPT-4 and ; and (3) , with incorrect atypicality statements using correct objects but incorrect relations, . The results reveal that VLMs benefit from atypicality statements. However, for GPT-4 (an LLM), adding the atypicality to is not as useful, since already contains correct atypicality information, and incorrect statement reduce the performance. These findings highlight the importance of incorporating atypicality to improve VLM performance on ARR tasks.
Generalization to typical images. We compared our pipeline with Vicuna () on typical images (i.e., no atypicality) against LLaVA on small-set ARR (details in supp). Vicuna () achieved 71.2%/48.6%/33.2% vs. 66.4%/42.2%/28.3% for LLaVA on prec@1/2/3, demonstrating our approach’s effectiveness beyond atypical ads.
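The margins implied by these numbers can be checked with a short calculation. The sketch below is only illustrative: the dictionary and variable names are hypothetical, and the values are copied from the paragraph above.

```python
# Precision@k (in %) on typical images, copied from the text above.
# Variable names are illustrative and not part of the original pipeline.
vicuna = {"prec@1": 71.2, "prec@2": 48.6, "prec@3": 33.2}
llava = {"prec@1": 66.4, "prec@2": 42.2, "prec@3": 28.3}

# Absolute gain of the verbalization pipeline over LLaVA, per metric.
gains = {k: round(vicuna[k] - llava[k], 1) for k in vicuna}

for metric, gain in gains.items():
    print(f"{metric}: +{gain} points")
```

The gain is 4.8, 6.4, and 4.9 points on prec@1, prec@2, and prec@3, respectively.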
Model | ||||
---|---|---|---|---|
LLaVA | 26.00 | 35.18 | 54.28 | 28.16 |
InstructBLIP | 20.44 | 23.25 | 23.40 | 19.69 |
GPT-4V | 86.87 | 89.35 | 87.24 | 86.96 |
GPT-4 () | 96.77 | 91.42 | 96.76 | 90.20 |
5.3 Further Analysis & Ablation
Neg. Strategy | Model | Multi prec@1 | Multi prec@2 | Multi prec@3 | Single Acc |
---|---|---|---|---|---|
12 Neg. [18, 19] | CLIP | 98.79 | 97.58 | 92.20 | 96.77 |
12 Neg. [18, 19] | CLIP | 97.58 | 97.58 | 87.10 | 90.32 |
12 Neg. [18, 19] | LLaVA | 93.47 | 74.08 | 56.33 | 94.31 |
12 Neg. [18, 19] | GPT4 | 99.60 | 96.98 | 91.13 | 93.52 |
18 Hard Neg. | CLIP | 64.52 | 34.48 | 22.98 | 20.97 |
18 Hard Neg. | CLIP (I+T) | 47.18 | 25.40 | 16.94 | 15.73 |
18 Hard Neg. | LLaVA | 59.67 | 38.27 | 26.06 | 26.80 |
18 Hard Neg. | GPT4 | 96.77 | 87.30 | 74.60 | 96.77 |
Hard Negatives ablation. Table 6 compares our semantically hard negatives against those used in prior work. VLM performance drops substantially when faced with our hard negatives, as evidenced by drops of 69.22/30.27 points for CLIP(I)/LLaVA(I) on prec@3. Conversely, our method exhibits a decrease of no more than 17 points across all metrics, demonstrating more robust reasoning than VLMs.
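The drops quoted here follow directly from Table 6. As a minimal sketch, assuming the first CLIP row of each group is the image-only variant referred to as CLIP(I), the per-metric differences can be recomputed as follows (the data layout and the helper name `drops` are illustrative, not taken from the released code):

```python
# Scores copied from Table 6 (multi-option precision@k and single-option accuracy).
# The dictionary layout and names are illustrative assumptions.
METRICS = ("prec@1", "prec@2", "prec@3", "acc")

easy = {  # 12 negatives from prior work [18, 19]; first CLIP row of the group
    "CLIP": (98.79, 97.58, 92.20, 96.77),
    "LLaVA": (93.47, 74.08, 56.33, 94.31),
    "GPT4": (99.60, 96.98, 91.13, 93.52),
}
hard = {  # 18 semantically hard negatives; first CLIP row of the group
    "CLIP": (64.52, 34.48, 22.98, 20.97),
    "LLaVA": (59.67, 38.27, 26.06, 26.80),
    "GPT4": (96.77, 87.30, 74.60, 96.77),
}

def drops(model):
    """Return easy-minus-hard score per metric (positive = performance drop)."""
    return {m: round(e - h, 2) for m, e, h in zip(METRICS, easy[model], hard[model])}

for model in easy:
    print(model, drops(model))
```

The prec@3 entries come out to 69.22 (CLIP) and 30.27 (LLaVA), matching the text, and the largest GPT4 drop is 16.53.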
Error analysis on ARR. In Fig. 5, VLMs (i.e., LLaVA and CLIP) do not make considerably more errors than GPT-4 () on ‘object swap’ since these negative options contain different objects from the ground truth, making them easier for VLMs to identify. However, VLMs make far more errors than GPT-4() on semantically incorrect negatives (‘action alter,’ ‘reason alter,’ ‘statement alter,’ and ‘adjective alter’). This shows that VLMs mainly rely on visual elements rather than deeper reasoning (examples in supp).
Error analysis on ASR. In Fig. 6, LLaVA makes notably more errors, particularly on ‘Wrong Relation’ options, which demand deeper reasoning than other negative types.
Effectiveness of each component in atypicality-aware verbalization. We ablate the effectiveness of each step in atypicality-aware verbalization on the ARR small-set (details in supp). The performance of shows the advantage of atypicality-aware verbalization over basic concatenation (). Specifically, improves top-1 acc by 14.76 on GPT-3.5 and 12.78 on GPT-4. This is because LLaVA-generated descriptions are inherently noisy, and their naive concatenation can mislead the models. A further issue is increased prompt length. Additionally, alone is less effective than , but outperforms , showing that atypicality is important yet complementary to basic image information captured in , , and .
Generalization beyond ads. We test our atypicality-aware verbalization pipeline on WHOOPS! [7] in supp.
6 Conclusion
This work challenged VLMs on complex rhetorical visual media, focusing on atypicality in advertisements. We introduced three novel atypicality tasks and benchmarked VLMs on and ARR, revealing their limitations in advanced reasoning. Despite these limitations, VLMs showed potential in extracting relevant information for understanding the atypical images. Our atypicality-aware verbalization strategy significantly enhances LLM performance on ARR tasks. Extensive experiments demonstrate that our approach outperforms VLM baselines, proving the effectiveness of incorporating atypicality inference for understanding ad images. These findings highlight the importance of atypicality in interpreting complex visual media and the superior reasoning abilities of LLMs over VLMs.
References
- [1] GPT-4V(ision) system card. 2023.
- [2] Arjun R Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T Freeman, et al. MetaCLUE: Towards comprehensive visual metaphors research. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23201–23211, 2023.
- [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [4] Anna Bavaresco, Alberto Testoni, and Raquel Fernández. Don’t buy it! Reassessing the ad understanding abilities of contrastive multimodal models. In ACL (Short), Aug. 2024.
- [5] Aanisha Bhattacharya, Yaman K Singla, Balaji Krishnamurthy, Rajiv Ratn Shah, and Changyou Chen. A video is worth 4096 tokens: Verbalize story videos to understand them in zero shot. In EMNLP, 2023.
- [6] Yonatan Bitton, Ron Yosef, Eliyahu Strugo, Dafna Shahaf, Roy Schwartz, and Gabriel Stanovsky. VASR: Visual analogies of situation recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 241–249, 2023.
- [7] Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2616–2627, 2023.
- [8] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. InternLM2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- [9] Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, and Sijia Liu. Understanding and improving visual prompting: A label-mapping perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19133–19143, 2023.
- [10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24185–24198, June 2024.
- [12] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- [13] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [14] Arka Ujjal Dey, Suman K Ghosh, Ernest Valveny, and Gaurav Harit. Beyond visual semantics: Exploring the role of scene text in image understanding. Pattern Recognition Letters, 149:164–171, 2021.
- [15] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.
- [16] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- [17] Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. Detecting persuasive atypicality by modeling contextual compatibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 972–982, 2021.
- [18] Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. Automatic understanding of image and video advertisements. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1705–1715, 2017.
- [19] Zhiwei Jia, Pradyumna Narayana, Arjun Akula, Garima Pruthi, Hao Su, Sugato Basu, and Varun Jampani. KAFA: Rethinking image ad understanding with knowledge-augmented feature adaptation of vision-language models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 772–785, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [20] Kanika Kalra, Bhargav Kurma, Silpa Vadakkeeveetil Sreelatha, Manasi Patwardhan, and Shirish Karande. Understanding advertisements with BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7542–7547, 2020.
- [21] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
- [22] Yunshi Lan, Xiang Li, Xin Liu, Yang Li, Wei Qin, and Weining Qian. Improving zero-shot visual question answering via large language models with reasoning question prompts. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4389–4400, 2023.
- [23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- [24] Liunian Li, Zi-Yi Dou, Nanyun Peng, and Kai-Wei Chang. DesCo: Learning object recognition with rich language descriptions. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 37511–37526. Curran Associates, Inc., 2023.
- [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023.
- [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [27] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
- [28] Edward F McQuarrie and David Glen Mick. Visual rhetoric in advertising: Text-interpretive, experimental, and reader-response analyses. Journal of consumer research, 26(1):37–54, 1999.
- [29] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.
- [30] Trisha Mittal, Puneet Mathur, Aniket Bera, and Dinesh Manocha. Affect2MM: Affective analysis of multimedia content using emotion causality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5661–5671, 2021.
- [31] Ron Mokady, Amir Hertz, and Amit H Bermano. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [33] Werner Reinartz and Peter Saffert. Creativity in advertising: When it works and when it doesn’t. Harvard Business Review, 91(6):106–111, 2013.
- [34] Robert E Smith, Scott B MacKenzie, Xiaojing Yang, Laura M Buchholz, and William K Darley. Modeling the determinants and effects of creativity in advertising. Marketing science, 26(6):819–833, 2007.
- [35] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297, 2020.
- [36] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
- [37] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
- [38] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
- [39] Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C Lawrence Zitnick, and Devi Parikh. Learning common sense through visual abstraction. In Proceedings of the IEEE international conference on computer vision, pages 2542–2550, 2015.
- [40] Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. T-SciQ: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19162–19170, 2024.
- [41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [42] Judith Williamson. Decoding advertisements, volume 4. Marion Boyars London, 1978.
- [43] Yueting Yang, Xintong Zhang, and Wenjuan Han. Enhance reasoning ability of visual-language models via large language models. arXiv preprint arXiv:2305.13267, 2023.
- [44] Keren Ye and Adriana Kovashka. ADVISE: Symbolism and external knowledge for decoding advertisements. In Proceedings of the European Conference on Computer Vision (ECCV), pages 837–855, 2018.
- [45] Keren Ye, Narges Honarvar Nazari, James Hahn, Zaeem Hussain, Mingda Zhang, and Adriana Kovashka. Interpreting the rhetoric of visual advertisements. IEEE transactions on pattern analysis and machine intelligence, 43(4):1308–1323, 2019.
- [46] Keren Ye, Mingda Zhang, and Adriana Kovashka. Breaking shortcuts by masking for robust visual reasoning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3520–3530, January 2021.
- [47] Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A Ayyubi, Kai-Wei Chang, and Shih-Fu Chang. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. arXiv preprint arXiv:2305.14985, 2023.
- [48] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019.
- [49] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- [50] Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.
- [51] Kankan Zhou, Eason Lai, Wei Bin Au Yeong, Kyriakos Mouratidis, and Jing Jiang. ROME: Evaluating pre-trained vision-language models on reasoning beyond visual common sense. arXiv preprint arXiv:2310.19301, 2023.
- [52] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
- [53] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
- [54] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024.