
Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Jie Zhang∗123, Zhongqi Wang∗123, Mengqi Lei∗4, Zheng Yuan123,
Bei Yan123, Shiguang Shan123, Xilin Chen123
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of AI Safety, Chinese Academy of Sciences, Beijing 100190, China
4 China University of Geosciences
Abstract

Currently, many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct questions by selecting images from existing datasets, resulting in potential data leakage. Besides, these benchmarks merely evaluate LVLMs on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Benchmark-Dysca/Dysca.

∗ Equal contribution.
Corresponding author: jiezhang@ict.ac.cn

1 Introduction

Recent years have witnessed the great success of Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. These models leverage powerful Large Language Models (LLMs) [11, 12, 13, 14, 15] as their brain and incorporate state-of-the-art visual encoders [16, 17, 18] as their eyes. Thanks to the alignment of visual features with the textual space and the development of visual instruction tuning techniques [4], LVLMs showcase impressive capabilities in visual scene comprehension and multimodal instruction following.

In order to comprehensively evaluate the capabilities of LVLMs, many benchmarks have been proposed [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], which we categorize into three types [25]. The first type is the classical benchmarks, such as COCO Caption [30] and VQA [19, 31, 32]. Although these benchmarks provide high-quality evaluation data, they also have notable limitations. On the one hand, they are inadequate for measuring the fine-grained capabilities of current LVLMs, offering limited insightful feedback for future improvement. On the other hand, since these classical benchmarks have been available as open-source test data for a long time, it is hard to prevent the data leakage problem. The second type of benchmarks evaluates LVLMs in a subjective manner [28, 33]. Although these benchmarks reveal insightful drawbacks of current models, their data scale is limited (i.e., less than 200 annotations) and they require manual evaluation by experts. The third type is built for objectively evaluating current LVLMs, and the comparison between them is shown in Tab. 1. They provide an objective and automatic evaluation manner, giving fine-grained evaluations of LVLMs. However, these benchmarks construct vision-language QAs by selecting images from existing datasets. Although they claim that the questions are re-annotated, previous work [29] has demonstrated that these benchmarks have unintentionally leaked into the training data of LLMs and LVLMs. Besides, most benchmarks focus on evaluating LVLMs on realistic images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. While some works like MMCBench [34] and Typographic Dataset [35] have investigated the robustness of LVLMs with corrupted and print-attacked images, respectively, they have not explored the effect of these noisy images on various perceptual tasks.

Figure 1: Overview of the automatic pipeline for generating vision-language QAs, cleaning vision-language QAs and evaluating LVLMs. (a) We first construct prompts in terms of content, style and background, leveraging a Text-to-Image (T2I) diffusion model (e.g., SDXL [36]) to synthesize the images to be asked about. Then, based on the scenario and the question type, we post-process the synthesized images and generate the corresponding textual questions, respectively. (b) We further filter out low-quality vision-language QAs by utilizing trained models to form the final Dysca. (c) Finally, we evaluate LVLMs on Dysca and report the fine-grained evaluation results.

In this paper, aiming to address the challenges above, we propose Dysca, a dynamic and scalable benchmark for evaluating the perception ability of LVLMs via various subtasks and scenarios. Inspired by prior evaluation works for LLMs [37], we investigate whether we can leverage large-scale synthesized images for evaluating LVLMs. We display the overview of our pipeline in Fig. 1. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and corresponding answers. We decouple the prompt into 4 parts, i.e., attribute, foreground, style and background, and design pre-defined templates to dynamically generate prompts, as displayed in Fig. 3. Then we utilize state-of-the-art text-to-image diffusion models (e.g., SDXL [36]) to generate the corresponding images. Since we already know the main information of the images through the prompts, we easily generate question-answer textual pairs via a rule-based method. After that, in order to obtain high-quality vision-language QAs, we employ CLIP [16] to perform data cleaning on the generated vision-language QA pairs. Dysca focuses on assessing fine-grained perception abilities, including recognizing humans, animals, objects, landmarks, etc. Dysca evaluates LVLMs with 20 perceptual subtasks, containing a total of 51 different artistic styles. Besides, to evaluate the robustness of the models across different scenarios and question types, we construct 4 testing scenarios (clean, corruption, print attacking and adversarial attacking) and 3 question types (multi-choices, true-or-false and free-form questions). In the end, Dysca consists of 617K vision-language QA pairs (20× larger than MM-BigBench [38] and 25× larger than SEED-Bench2 [24], as shown in Tab. 1). Thanks to the generative paradigm, Dysca is easily scalable to new subtasks and scenarios and can dynamically generate unlimited vision-language QAs for evaluation.
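To make the decoupled prompt construction concrete, the following is a minimal sketch of the template-filling step; the metadata entries, the template string and the function names are illustrative placeholders, not the released Dysca lists or code.

```python
import random

# Hypothetical metadata pools; Dysca draws these from curated lists (see Appendix C).
METADATA = {
    "attribute":  ["brown", "red", "white"],
    "foreground": ["golden retriever", "acoustic guitar", "lighthouse"],
    "background": ["on a beach", "in a forest", "on a city street"],
    "style":      ["oil painting", "iPhone photo", "pixel art"],
}

# A pre-defined template combining the four decoupled parts into one prompt.
PROMPT_TEMPLATE = "A {attribute} {foreground} {background}, {style} style"

def sample_prompt():
    """Sample one value per dimension and fill the template."""
    meta = {key: random.choice(values) for key, values in METADATA.items()}
    return meta, PROMPT_TEMPLATE.format(**meta)

meta, prompt = sample_prompt()
print(prompt)  # text-to-image prompt fed to the diffusion model
print(meta)    # kept as the ground truth for rule-based question generation
```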

In summary, our work makes the following key contributions:

  • Dynamic and Scalable Benchmark: We propose Dysca, a benchmark that dynamically generates the test data that users need and easily scales up to new subtasks and scenarios.

  • Multi-grained Perceptual Subtasks and Multi-scenarios: Dysca evaluates 8 mainstream LVLMs with 10 checkpoints on 20 perceptual subtasks under 4 image scenarios (i.e., clean, corruption, print attacking and adversarial attacking) and 3 question types (i.e., multi-choices, true-or-false and free-form questions).

  • Analysis and Observations: We demonstrate for the first time that evaluating LVLMs using large-scale synthetic data is valid. Experiments show a strong correlation between our evaluation rankings and the rankings obtained from non-synthetic benchmarks. The evaluation results also reveal the weaknesses of current LVLMs when facing different question types, image styles and image scenarios.

Table 1: Comparisons between existing LVLM benchmarks. "✓" indicates that the benchmark includes both newly collected images/annotations and images/annotations gathered from existing datasets. "*": the scale of our released benchmark is 617K; however, Dysca is able to generate unlimited data for testing.
Benchmark | #Evaluation Data Scale | #Perceptual Tasks | Question Type
LLaVA-Bench [4] | 0.15K | - | Free-form
MME [25] | 2.3K | 10 | True-or-false
LVLM-eHub [21] | - | 3 | Free-form
tiny-LVLM-eHub [22] | 2.1K | 3 | Free-form
SEED-Bench [23] | 19K | 8 | Multi-choices
MMBench [39] | 2.9K | 12 | Multi-choices
TouchStone [26] | 0.9K | 10 | Free-form
REFORM-EVAL [40] | 50K | 7 | Multi-choices
MM-BigBench [38] | 30K | 6 | Multi-choices
MM-VET [27] | 0.2K | 4 | Free-form
MLLM-Bench [41] | 0.42K | 7 | Free-form
SEED-Bench2 [24] | 24K | 10 | Multi-choices
BenchLMM [42] | 2.4K | 15 | Free-form
JourneyDB [43] | 5.4K | 2 | Free-form, Multi-choices
Dysca (Ours) | 617K* | 20 | Free-form, Multi-choices, True-or-false

2 Related Works

2.1 Large Vision-Language Models

The landscape of Large Vision-Language Models (LVLMs) has been significantly shaped by the pioneering success of Large Language Models (LLMs) such as GPTs [44, 45, 46] and LLaMA [13], catalyzing advancements in multimodal content understanding and generation [47], including intricate tasks like image-text comprehension. At the forefront of these developments, BLIP-2 [1] introduces a lightweight Q-Former [1] that facilitates alignment between textual and visual representations through a cross-attention mechanism [1]. InstructBLIP [3] takes a step further by incorporating textual instructions into the Q-Former, which significantly improves performance. LLaVA [4] employs GPT-4 [14] to transform data into multimodal instruction-following data and uses CLIP [16] and LLaMA [13] for instruction fine-tuning, achieving advanced performance. LLaVA-1.5 [48] extends this paradigm by integrating an MLP projection and introducing academic task-specific vision-language QA data. Recently, models like Otter [5], MiniGPT-4 [2], Qwen-VL-Chat [49] and XComposer-VL [7] further unleash the cross-modal understanding capabilities of LVLMs.

2.2 Benchmarks for LVLMs

The great progress of LVLMs has triggered the development of benchmarks for evaluating these models, which we divide into three categories. The first type is the classical benchmarks, which focus on evaluating LVLM abilities via image captioning [50] and VQA [51, 19]. However, these benchmarks cannot provide fine-grained feedback on how to improve the models. Besides, since these benchmarks have been public resources for a long time, it is hard to guarantee that LVLMs have not used them for training. The second type subjectively evaluates LVLMs by experts [28, 33]. Although these benchmarks provide insightful feedback on LVLMs, their scale is limited (i.e., less than 200 annotations). The subjective manner also makes the evaluation expensive and difficult to scale up.

The third type [4, 25, 21, 22, 23, 24, 39, 26, 40, 38, 27, 41, 42, 29, 52] focuses on evaluating LVLMs in an objective and large-scale manner; we list their detailed information in Tab. 1. Some of them have been adopted by the community [53] as standard benchmarks for evaluating LVLMs [12, 5, 7], such as MME [25] and MMBench [39]. These benchmarks evaluate models through objective answer types, and most of them leverage automatic annotation and evaluation to reveal the fine-grained drawbacks of current LVLMs. However, previous benchmarks primarily concentrate on evaluating LVLMs using realistic images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. Moreover, many of them construct QAs by selecting images from publicly available datasets (e.g., [50, 54]). While they state that the questions have been re-annotated, they cannot guarantee that the LVLMs have not seen the images during the training stage. Previous work [29] has shown that these benchmarks have unintentionally leaked into the training data of LLMs and LVLMs. One possible way to solve data leakage is to use novel synthesized images, and JourneyDB [43] is the first work aiming to leverage synthesized images to evaluate current LVLMs. Its prompts and corresponding images are downloaded from Midjourney [55], and ChatGPT [12] is leveraged to label the images. However, JourneyDB is a top-down framework where the number of images is fixed. Besides, ChatGPT labeling may produce hallucinated annotations, leading to unreliable evaluation results. Although 40 annotators were involved in cleaning the data, the cleaning cost is high and it limits the data scale. In contrast, our Dysca serves as a bottom-up framework, allowing dynamic and scalable generation of both images and evaluation questions. The rule-based question generation method also makes the annotations more accurate. Besides, Dysca contains 20 subtasks, which is more comprehensive than JourneyDB.

3 Dysca

3.1 Overview of Our Pipeline

The overview of our pipeline is shown in Fig. 1, containing data generation, data cleaning and LVLM evaluation. For the data generation, our Dysca benchmark consists of four dimensions, i.e., (M, P, I, Q), where M denotes "Metadata", P denotes "Prompt", I denotes "Image" and Q denotes "Question-answer pair". We further decouple the metadata M into 4 parts, i.e., "style", "attribute", "foreground" and "background", and the combination of the four parts constitutes the image prompt P. Then, given the prompt P and the selected scenario, we leverage a Text-to-Image (T2I) diffusion model (e.g., SDXL [36]) to synthesize the image I and add the specific perturbation to the image I. After that, since the prompt already includes the question angle and the corresponding answer, we construct a rule-based approach to generate Q. Three types of questions are considered, i.e., multi-choice, true-or-false and free-form. Multi-choice and true-or-false questions assess LVLMs in a closed-ended manner, while free-form questions employ an open-ended manner through image captioning. For the data cleaning, considering that the T2I diffusion model may generate unsuccessful outcomes, we use CLIP [16] and PP-OCRv3 [56] to automatically clean the whole dataset and obtain the final Dysca. Finally, we evaluate 8 open-source LVLMs with 10 checkpoints on our proposed Dysca.
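For the image-generation step, a minimal sketch using the public SDXL checkpoint through the diffusers library is shown below; the prompt, output path and sampler defaults are illustrative and not the exact Dysca configuration.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint (assumes a CUDA GPU with sufficient memory).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# One prompt P assembled from the metadata M (illustrative example).
prompt = "A brown golden retriever on a beach, oil painting style"

# Dysca images are 1024x1024 (Tab. 2), SDXL's native resolution.
image = pipe(prompt, height=1024, width=1024).images[0]
image.save("dysca_sample.png")
```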

Statistic #Number
Total questions 617K
- Clean 156K (25.2%)
- Print attacking 149K (24.1%)
- Adversarial attacking 156K (25.2%)
- Corruption 156K (25.2%)
Question type
- Multi-choices 251K (40.6%)
- True-or-false 250K (40.5%)
- Free-form 116K (18.8%)
Image resolution 1024×1024
Unique number of images 289K
Unique number of questions 162K
Unique number of answers 31K
Average question length 37.8
Average answer length 2.7
Average choice number 3.0
Table 2: Key statistics of Dysca.
Figure 2: Overview of the dataset distribution across the 20 perceptual tasks. The number in each subtask indicates the corresponding amount of annotations.

3.2 Perceptual Tasks

Evaluation dimensions. Perception is one of the most fundamental capabilities of LVLMs, and previous works [25] have shown that the lack of perceptual ability may result in hallucination [57]. In order to comprehensively evaluate LVLMs' perception capability, we design 20 perceptual subtasks, where we show all subtasks and the corresponding amount of annotations in Fig. 2. We investigate two types of perception dimensions, i.e., coarse-grained and fine-grained perception. Coarse-grained perception involves recognizing the style, background and color of images. Fine-grained perception involves recognizing the animal, object, plant, food, age, gender, expression, race, celebrity, action, text, clothes, movie, anime, landmark, profession and TV shows.

Data sources. For each perceptual subtask, we first collect the textual data to construct the metadata M. For TV shows, anime and movies, we select the titles from the IMDb rating list (https://www.imdb.com/) based on the number of user reviews. For the styles, we utilize the style lists collected from the community (https://stable-diffusion-art.com/sdxl-styles/) and remove those that strongly influence the image content, such as "architectural style" and "Pokemon style". Note that the style list does not include style prompts associated with a particular artist's name. For the remaining contents, we select them from the labels of existing datasets (e.g., ImageNet [54]). All the selected textual data above constitute the metadata M. We provide detailed information on the metadata in Appendix C.

3.3 Construction of Questions & Answers

Recall that the data generation for the Dysca benchmark consists of four dimensions, i.e., (M, P, I, Q), denoting the metadata (M), prompt (P), image (I) and question-answer pairs (Q), respectively. The relationships between these parts and the process of constructing Dysca are shown in Fig. 3. The metadata M is the core of the whole Dysca, containing all the information for generating P, I and Q. The metadata M consists of foreground, attribute, background and style, and this information guides the generation of the prompt P through pre-designed templates. Then, we utilize the T2I diffusion model to generate the corresponding image from the prompt P. For generating images with specific text on them for the OCR subtask, we leverage TextDiffuser-2 [58], the state-of-the-art text rendering method. For the rest of the images, we leverage Stable Diffusion XL [36]. Subsequently, based on the selected question type, i.e., multi-choices, true-or-false or free-form, we generate the corresponding vision-language QA pairs in Dysca.
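Because the ground truth of every image is fully determined by its metadata, the question-answer pairs can be produced by simple rules. Below is a minimal sketch for the multi-choice and true-or-false types; the option pool, question wording and answer format are illustrative assumptions rather than the exact Dysca templates.

```python
import random

FOREGROUND_POOL = ["golden retriever", "acoustic guitar", "lighthouse", "bicycle"]

def multi_choice_qa(meta, n_options=4):
    """Ask about one metadata dimension; distractors come from the rest of the pool."""
    answer = meta["foreground"]
    distractors = random.sample(
        [x for x in FOREGROUND_POOL if x != answer], n_options - 1)
    options = distractors + [answer]
    random.shuffle(options)  # shuffle to avoid option-position bias
    letters = "ABCD"
    stem = "What is the main object in the image?\n" + "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return stem, letters[options.index(answer)]

def true_or_false_qa(meta, p_true=0.5):
    """State either the true foreground or a randomly chosen wrong one."""
    if random.random() < p_true:
        claim, answer = meta["foreground"], "Yes"
    else:
        claim = random.choice([x for x in FOREGROUND_POOL if x != meta["foreground"]])
        answer = "No"
    return f"Is the main object in the image a {claim}?", answer

meta = {"foreground": "golden retriever"}
print(multi_choice_qa(meta))
print(true_or_false_qa(meta))
```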

Besides, in order to evaluate model performance under various scenarios, we conduct experiments on 4 scenarios, i.e., clean, corruption, print attacking and adversarial attacking. For print attacking, following [35], we add deceptive text to the image, where the text is a wrong option. Besides, to comprehensively evaluate the performance of LVLMs under the print attacking scenario, we add more typographic factors to the original settings (i.e., different font orientations and font positions). For adversarial attacking, we leverage PGD [59] to generate adversarial images. We use InstructBLIP [3] as the proxy model and regard the others as black-box models. The reason why we choose InstructBLIP is that it has shown superior performance in the clean scenario. Besides, the black-box setting better reflects the robustness of the models when facing real-world adversarial attacks. For corruption, we leverage the image corruption methods collected from [34]. We remove some hard corruptions as they significantly impact the quality of the image, making even humans fail to judge the style and content of the image. Detailed examples are shown in Appendix D.
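For the adversarial attacking scenario, a minimal L∞ PGD sketch against an arbitrary differentiable image encoder is given below; the loss choice (pushing the adversarial embedding away from the clean one), the perturbation budget and the step count are illustrative assumptions, not the exact settings used for attacking InstructBLIP's encoder.

```python
import torch
import torch.nn.functional as F

def pgd_attack(encoder, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """L_inf PGD that pushes an image's embedding away from its clean embedding.

    `encoder` is any differentiable module mapping images in [0, 1] to features
    (e.g., a CLIP-style vision tower used as the white-box proxy model).
    """
    clean_feat = encoder(image).detach()
    adv = image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start

    for _ in range(steps):
        adv.requires_grad_(True)
        # Maximize the feature distance, i.e., minimize cosine similarity.
        loss = -F.cosine_similarity(encoder(adv).flatten(1),
                                    clean_feat.flatten(1)).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)  # project to the eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()
```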

Considering that the text-to-image diffusion model may produce failure cases that affect the quality of the proposed benchmark, we leverage off-the-shelf models, i.e., PP-OCRv3 [56] and CLIP-L-14 [16], to clean the data. PP-OCRv3 [56] is leveraged as a filter to exclude failure cases in which TextDiffuser-2 [58] renders the wrong text on the image. For the other images, we use CLIP-L-14 [16] to filter out images with low text-image consistency. In the end, we filter out nearly 15% of low-quality samples. The final statistics of our released Dysca are shown in Tab. 2. Note that the OCR subtask does not involve the print attacking scenario, as misidentifying adversarial text does not indicate poor OCR robustness of the LVLMs. Therefore, there are 7K fewer questions in the print attacking scenario. Besides, since the free-form question type allows assessing the model's perception abilities across multiple subtasks at the same time, we reduce the number of free-form questions to achieve a balanced data distribution.
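A minimal sketch of the CLIP-based consistency filter is shown below, using the Hugging Face CLIP ViT-L/14 checkpoint; the similarity threshold is an illustrative assumption, and the PP-OCRv3 check for the OCR subtask is omitted.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def keep_sample(image_path: str, prompt: str, threshold: float = 0.25) -> bool:
    """Keep a synthesized image only if it is consistent with its prompt,
    measured by the cosine similarity of CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() > threshold
```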

Figure 3: The process of generating the prompt (P), image (I) and question-answer pairs (Q) from the metadata (M).

3.4 Evaluation Strategy

Instruction Design. We design two types of instructions to improve the instruction-following results of LVLMs. For the multi-choices and true-or-false questions, we follow each question with the description "Please answer the question and provide the correct option letter, e.g., (A), (B), (C), (D), at the end. Do not contain the analysis progress. Your answer is: ". For the free-form questions, recalling that the prompt P contains four parts, i.e., the style, attribute, foreground and background, we instruct the model to caption these four dimensions with "Please describe the image. You can describe it from these aspects: {}", where "{}" includes the specific template we design for each part. We display a sample in Fig. 3, and more examples can be found in Appendix E.
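A minimal sketch of how these two instruction wrappers could be assembled is given below; the list of caption aspects passed to the free-form template paraphrases the per-dimension templates in Appendix E.

```python
MC_TF_SUFFIX = ("Please answer the question and provide the correct option letter, "
                "e.g., (A), (B), (C), (D), at the end. Do not contain the analysis "
                "progress. Your answer is: ")

def build_closed_ended(question: str) -> str:
    """Append the answer-format suffix to a multi-choice / true-or-false question."""
    return f"{question}\n{MC_TF_SUFFIX}"

def build_free_form(aspects=("style", "attribute", "foreground", "background")) -> str:
    """Ask the model to caption the image along the four prompt dimensions."""
    return ("Please describe the image. You can describe it from these aspects: "
            + ", ".join(aspects))
```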

Table 3: Evaluation results on the 20 perceptual subtasks. The top two results on each subtask are bolded and underlined, respectively. "MC" and "TF" indicate the accuracy (%) of "Multi-choices" and "True-or-false", respectively.
Movie Action TV Show Profession Landmark
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 71.46 68.61 96.24 93.04 56.29 61.99 77.88 75.87 98.70 94.58
InstructBLIP [3] Flan-T5-XL 77.35 67.38 97.28 93.61 65.89 60.96 79.17 76.51 98.04 95.75
XComposer-VL [7] InternLM-7B 81.90 78.36 97.03 94.75 77.81 73.29 83.97 75.87 98.04 94.81
LLava-1.5 [48] Vicuna-7B 55.95 51.68 74.78 50.75 56.29 50.68 58.65 56.83 85.00 54.48
LLava-1.5 [48] Vicuna-13B 66.29 54.04 77.93 59.71 59.27 54.11 66.67 59.05 94.35 55.90
MiniGPT-4 [2] Vicuna-7B 32.68 52.35 42.96 51.00 30.46 50.00 33.97 50.16 53.26 52.36
Otter [5] LLaMA-7B 65.36 59.08 65.85 72.36 68.54 57.53 66.99 54.92 58.48 66.51
Qwen-VL-Chat [49] Qwen-7B 71.25 51.01 96.35 63.14 67.22 49.32 75.64 54.60 96.96 54.72
Shikra [6] Vicuna-7B 66.29 59.64 78.00 77.29 60.26 56.85 78.85 68.25 89.35 70.99
Shikra-VQA [6] Vicuna-7B 66.39 61.66 96.17 80.32 59.93 60.27 78.21 67.94 92.83 72.88
Anime Clothes Celebrity Food Plant
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 57.07 61.94 82.38 73.79 81.98 76.48 91.52 88.04 92.50 91.19
InstructBLIP [3] Flan-T5-XL 61.28 64.04 86.51 81.08 83.69 76.14 91.71 88.85 93.31 91.92
XComposer-VL [7] InternLM-7B 75.41 74.58 86.18 85.73 88.53 87.28 92.33 90.21 93.25 93.14
LLava-1.5 [48] Vicuna-7B 47.15 48.46 47.19 50.41 57.36 54.72 54.63 51.10 49.08 48.59
LLava-1.5 [48] Vicuna-13B 57.61 50.14 65.20 57.20 60.14 57.94 78.87 57.45 79.72 55.51
MiniGPT-4 [2] Vicuna-7B 29.21 48.74 31.25 50.47 28.62 49.64 44.66 50.78 45.38 50.53
Otter [5] LLaMA-7B 61.82 57.30 47.13 69.52 42.41 63.52 46.81 79.62 66.41 81.51
Qwen-VL-Chat [49] Qwen-7B 71.60 54.63 78.44 60.72 87.37 50.05 89.37 53.37 92.32 59.91
Shikra [6] Vicuna-7B 47.96 57.72 75.21 59.65 63.76 59.86 83.30 68.37 88.55 66.60
Shikra-VQA [6] Vicuna-7B 49.05 57.58 76.57 63.10 63.73 62.13 89.25 71.22 88.34 71.76
Age Gender Expression Race Animal
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 62.61 59.65 99.37 94.86 89.27 75.92 74.38 71.95 96.64 95.27
InstructBLIP [3] Flan-T5-XL 65.14 59.55 99.51 88.21 91.85 79.34 77.99 74.62 97.21 94.98
XComposer-VL [7] InternLM-7B 67.89 78.35 99.60 98.06 89.64 82.52 80.42 76.69 97.83 96.70
LLava-1.5 [48] Vicuna-7B 38.35 55.15 54.14 49.78 63.86 49.76 43.98 50.56 49.26 50.90
LLava-1.5 [48] Vicuna-13B 49.55 59.59 98.71 83.53 71.29 58.33 70.16 62.84 85.99 58.23
MiniGPT-4 [2] Vicuna-7B 31.75 51.71 56.49 49.22 42.01 50.91 27.54 50.41 45.97 49.57
Otter [5] LLaMA-7B 37.98 51.19 78.23 77.96 73.86 60.36 43.00 57.88 81.32 83.63
Qwen-VL-Chat [49] Qwen-7B 53.48 49.09 97.72 58.55 85.22 64.30 73.50 56.88 95.60 64.37
Shikra [6] Vicuna-7B 65.37 56.71 97.85 73.16 90.47 70.55 74.17 54.88 89.82 69.06
Shikra-VQA [6] Vicuna-7B 65.50 57.34 99.39 81.41 92.36 74.71 73.90 55.24 91.18 74.26
Object OCR Style Background Color
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 89.32 87.75 71.89 62.07 99.86 88.62 63.97 68.37 88.01 85.55
InstructBLIP [3] Flan-T5-XL 89.87 90.13 72.61 60.93 99.96 81.23 66.50 69.19 90.93 85.55
XComposer-VL [7] InternLM-7B 90.23 91.72 69.22 78.66 95.83 81.32 71.26 75.27 87.97 89.02
LLava-1.5 [48] Vicuna-7B 58.63 49.99 47.08 51.02 46.35 50.49 44.33 50.95 41.91 51.90
LLava-1.5 [48] Vicuna-13B 77.86 57.93 58.19 53.17 85.80 51.56 62.67 55.33 63.81 52.90
MiniGPT-4 [2] Vicuna-7B 51.75 51.55 31.37 51.56 43.41 48.64 31.94 50.10 35.29 50.51
Otter [5] LLaMA-7B 47.31 82.51 66.23 61.41 98.04 61.58 63.06 63.40 48.28 57.97
Qwen-VL-Chat [49] Qwen-7B 88.32 63.14 71.74 51.86 98.21 52.43 69.42 51.59 86.39 53.05
Shikra [6] Vicuna-7B 70.91 69.22 60.79 53.63 95.13 58.95 71.24 62.66 84.91 61.83
Shikra-VQA [6] Vicuna-7B 89.43 76.00 59.63 54.14 99.51 61.48 70.98 65.96 83.79 64.49

Evaluation Metric. For the multi-choices and true-or-false questions, we use accuracy as the evaluation metric. We randomly shuffle the order of choices to prevent the evaluation results from being influenced by the model's tendency towards specific choices [60]. The random accuracies of the two types are 25% and 50%, respectively. We use regular expressions to extract the model's answer choices. For cases where the extraction fails, we calculate the Levenshtein distance between the answer string and each choice string, and select the option with the minimum distance as the model's answer. For the free-form questions, we test the model's image captioning capability, where the ground truth is the prompt of the image. Following [21], we use SentenceTransformer [61] to compute the text similarity between the prompt P and the caption output by the LVLM. The final score of each question type is the average score over subtasks.
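The scoring pipeline can be sketched as follows; here difflib's sequence matcher stands in for the Levenshtein-distance fallback, and the SentenceTransformer checkpoint name is an illustrative choice rather than the one used in the paper.

```python
import re
import difflib
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def extract_choice(response: str, options: dict) -> str:
    """Extract the predicted option letter, falling back to string matching."""
    match = re.search(r"\(([A-D])\)", response)
    if match and match.group(1) in options:
        return match.group(1)
    # Fallback: choose the option whose text is closest to the raw response.
    return max(options, key=lambda k: difflib.SequenceMatcher(
        None, response.lower(), options[k].lower()).ratio())

def free_form_score(caption: str, prompt: str) -> float:
    """Similarity between the model's caption and the image-generation prompt."""
    emb = st_model.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Example usage
options = {"A": "golden retriever", "B": "lighthouse", "C": "bicycle"}
print(extract_choice("The answer is (A).", options))           # -> "A"
print(extract_choice("It looks like a lighthouse.", options))  # -> "B"
```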

4 Results and Analysis

In this section, we report the evaluation results and provide insightful analysis. A total of 8 LVLMs with 10 checkpoints are evaluated on the Dysca benchmark, including BLIP2 [1], InstructBLIP [3], LLaVA [48], MiniGPT-4 [2], Otter [5], XComposer-VL [7], Qwen-VL-Chat [49] and Shikra [6]. Each model is evaluated on all 20 perception subtasks under 4 scenarios. The detailed rankings for each subtask can be found in Appendix A.

4.1 Main Results

Clean Scenario. The evaluation results of various LVLMs on different perceptual subtasks under the clean scenario are presented in Tab. 3. Since the evaluation for the free-form question type usually involves multiple subtasks, we cannot calculate the free-form results for each subtask individually. Instead, we display the overall free-form score in the first row of Tab. 4. As can be seen, XComposer-VL [7] outperforms other LVLMs, achieving top-1 or top-2 results in most subtasks, but InstructBLIP [3], Qwen-VL-Chat [49] and BLIP2 also take the lead in a few subtasks.

Noisy Scenarios. The evaluation results of various LVLMs under noisy scenarios (i.e., corruption, print attacking and adversarial attacking) are presented in Tab. 4. As can be seen, for the multi-choices and true-or-false question types, XComposer-VL [7] still takes the lead on all 4 scenarios. For the free-form questions, LLaVA-1.5-7b [48] achieves the best results.

Table 4: Evaluation results on 4 scenarios. "MC", "TF" and "FF" indicate "Multi-choices", "True-or-false" and "Free-form", respectively. "PrintAtt" and "AdverAtt" mean "Print Attacking" and "Adversarial Attacking", respectively. "*": the model is under the white-box setting.
Scenarios Blip2 [1] InstructBLIP [3] XComposer-VL [7] LLava-1.5-13b [48] LLava-1.5-7b [48]
MC TF FF MC TF FF MC TF FF MC TF FF MC TF FF
Clean 82.07 78.78 37.56 84.29 79.00 38.94 86.22 84.82 46.03 71.50 57.72 48.61 53.70 51.41 50.15
Corruption 81.17 77.98 37.32 83.77 78.57 38.72 85.20 84.16 45.52 70.34 57.57 48.28 53.21 51.40 49.54
PrintAtt 64.78 68.01 35.82 59.18 58.80 35.71 74.08 72.71 44.46 48.91 54.72 48.47 40.02 50.94 48.38
AdverAtt 23.28 48.98 25.70 21.49* 48.74* 29.15* 20.85 49.54 21.17 66.86 56.56 47.11 50.70 51.36 48.17
Scenarios MiniGPT-4 [2] Otter [5] Qwen-VL-Chat [49] Shikra [6] Shikra-VQA [6]
MC TF FF MC TF FF MC TF FF MC TF FF MC TF FF
Clean 38.50 50.51 38.90 61.36 65.99 39.12 82.31 55.84 49.90 76.61 63.79 47.17 79.31 66.69 29.93
Corruption 38.70 49.99 40.35 62.19 65.36 38.71 79.26 52.74 49.74 76.42 64.31 46.43 79.06 66.95 29.40
PrintAtt 36.16 50.00 42.45 38.11 36.27 36.93 59.68 46.40 48.65 60.02 40.18 47.03 62.20 41.05 27.91
AdverAtt 26.75 49.52 22.59 57.78 62.27 35.76 78.47 58.04 44.67 72.20 62.16 43.56 74.04 64.64 26.81

4.2 Analysis

4.2.1 Key Observations

(1) For individual models, perceptual performance varies across different subtasks. For example, Qwen-VL-Chat [49] achieves an accuracy of 96.96% on the landmark recognition task for multi-choices questions (2% below the first-place score), but only 53.48% on the age recognition task (12% below the first-place score). The results suggest that Qwen-VL-Chat [49] may require more fine-tuning on age perception data. Analyzing the performance of models across various subtasks helps guide targeted improvements.

Figure 4: Models exhibit different performance when facing the same image but different question types.

(2) Models exhibit performance inconsistency when facing multi-choices and true-or-false question types. As can be seen, in the object recognition subtask, Otter [5] achieves an accuracy of 47.31% on the multi-choices question type (22% higher than random guessing), but obtains an accuracy of 82.51% on the true-or-false question type (32% higher than random guessing). Interestingly, we observe the opposite results for Qwen-VL-Chat [49]. In the object recognition subtask, it achieves an accuracy of 88.32% on multi-choices but only 63.14% on true-or-false. We observe the same problem in other models as well and display two examples in Fig. 4. We speculate that the inconsistency may be attributed to a bias in the training dataset towards particular question types, such as using more multi-choices or true-or-false questions.

(3) Each model exhibits robustness in the corruption scenario, but suffers degradation in the two attacking scenarios. As shown, all models exhibit minor score variations of less than 1% under the corruption scenario. However, they exhibit degradation when facing print attacking (e.g., 84.29% vs. 59.18% for InstructBLIP [3] in multi-choices accuracy). XComposer-VL [7] shows the strongest robustness, maintaining over 70% accuracy for both multi-choices and true-or-false. Besides, since our adversarial algorithm specifically targets the image encoder, the LVLMs that share the same encoder architecture (i.e., BLIP2, InstructBLIP and XComposer-VL, all using EVA-CLIP [62] as the image encoder) exhibit significant performance degradation, with accuracy even falling below random selection. Models utilizing alternative image encoders also experience a performance decrease of approximately 5% to 10%. More detailed results can be found in Appendix D.

4.2.2 The Validity of Dysca

Table 5: The correlation results on three benchmarks, where ρ ∈ [-1, 1] and τ ∈ [-1, 1].
Style | Method | MMBench | OCRBench | SeedBench-2
All | ρ | 0.70 | 0.90 | 0.46
All | τ | 0.60 | 0.80 | 0.43
Realistic | ρ | 0.70 | 1.00 | 0.64
Realistic | τ | 0.60 | 1.00 | 0.62

In this section, we investigate the evaluation gap between Dysca and non-synthetic benchmarks. We calculate the Spearman rank correlation coefficient [63] ρ and the Kendall rank correlation coefficient [64] τ between the evaluation ranking of Dysca under the clean scenario and the evaluation rankings of non-synthetic benchmarks, i.e., MMBench [39], OCRBench [52] and SeedBench-2 [24]. Both coefficients produce a score in the range [-1, 1], where 1 represents a perfect positive correlation, -1 represents a perfect negative correlation, and 0 represents no correlation. Specifically, we intersect our Dysca with current benchmarks based on the perceptual subtasks, evaluated models and question types. We then calculate the correlation of model evaluation rankings within this intersection. The results are shown in the first row of Tab. 5. For MMBench [39] and OCRBench [52], both metrics show high correlation, with ρ and τ higher than 0.6. However, the correlation with SeedBench-2 [24] is not as strong. Considering that SeedBench-2 only contains realistic images, we conduct additional experiments using the evaluation ranks on our realistic-style images only. As shown in the second row of Tab. 5, the correlation results for SeedBench-2 improve significantly (i.e., 0.46 vs. 0.64 for ρ and 0.43 vs. 0.62 for τ). The correlation with OCRBench also improves to 1, demonstrating the validity of using synthetic datasets for evaluating LVLMs.
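The two rank-correlation coefficients can be computed directly with SciPy; the rankings below are made-up placeholders for illustration, not the paper's actual model rankings.

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical rankings of the same models on Dysca and on a non-synthetic
# benchmark (1 = best); replace with the real per-benchmark rankings.
dysca_rank = [1, 2, 3, 4, 5, 6, 7, 8]
other_rank = [1, 3, 2, 4, 6, 5, 7, 8]

rho, _ = spearmanr(dysca_rank, other_rank)
tau, _ = kendalltau(dysca_rank, other_rank)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```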

To further explore the impact of image styles on the evaluation results, we present the average scores across all subtasks for each of the 51 styles in Fig. 5. We observe slight score differences across styles. In the case of realistic styles such as "iPhone photo", all LVLMs perform better compared to other image styles. The LVLMs also exhibit better performance on unrealistic but common styles like "expressionist". However, for unrealistic and less common styles such as "gothic", all models show relatively poor performance. The results reveal that the gap between Dysca and non-synthetic benchmarks primarily stems from the more diverse range of image styles, making Dysca a more comprehensive benchmark for assessing perception ability compared to previous benchmarks.

Figure 5: Illustration of each model’s performance across 51 image styles, where the darker colors represent higher scores. The representative styles are colored with non-black font. Realistic styles are shown in red font, unrealistic but common styles are displayed in yellow font, and unrealistic and less common styles are represented in blue font.

5 Conclusion

In this paper, we propose Dysca, a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs). Dysca consists of 617K vision-language QA pairs, covering 20 perceptual subtasks, 4 image scenarios and 3 question types. We conduct experiments on 8 advanced open-source LVLMs with 10 checkpoints, revealing insightful weaknesses of current LVLMs when facing different question types, image styles and image scenarios. Experiments demonstrate the validity of evaluating LVLMs using synthesized images.

References

  • [1] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
  • [2] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • [3] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint:2305.06500, 2023.
  • [4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [5] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint:2305.03726, 2023.
  • [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  • [7] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
  • [8] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
  • [9] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  • [10] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. 2023.
  • [11] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • [12] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
  • [13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [14] OpenAI. Gpt-4 technical report, 2023.
  • [15] FastChat. Vicuna. https://github.com/lm-sys/FastChat, 2023.
  • [16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning (ICML), pages 8748–8763. PMLR, 2021.
  • [17] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, June 2023.
  • [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [19] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  • [20] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8309–8318, 2019.
  • [21] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  • [22] Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, and Ping Luo. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729, 2023.
  • [23] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  • [24] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023.
  • [25] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • [26] Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890, 2023.
  • [27] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • [28] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023.
  • [29] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
  • [30] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [31] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
  • [32] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3190–3199, 2019.
  • [33] Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v(ision). arXiv preprint arXiv:2310.16534, 2023.
  • [34] Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, and Min Lin. Benchmarking large multimodal models against common corruptions. arXiv preprint arXiv:2401.11943, 2024.
  • [35] Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language model. arXiv preprint arXiv:2402.19150, 2024.
  • [36] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [37] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2023.
  • [38] Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, and Soujanya Poria. Mm-bigbench: Evaluating multimodal models on multimodal content comprehension tasks. arXiv preprint arXiv:2310.09036, 2023.
  • [39] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  • [40] Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen, Xuanjing Huang, and Zhongyu Wei. Reform-eval: Evaluating large vision language models via unified re-formulation of task-oriented benchmarks. arXiv preprint arXiv:2310.02569, 2023.
  • [41] Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, and Benyou Wang. Mllm-bench, evaluating multi-modal llms using gpt-4v. arXiv preprint arXiv:2311.13951, 2023.
  • [42] Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, and Alex Kot. Benchlmm: Benchmarking cross-style visual capability of large multimodal models. arXiv preprint arXiv:2312.02896, 2023.
  • [43] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 49659–49678. Curran Associates, Inc., 2023.
  • [44] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019.
  • [45] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeurIPS), 33:1877–1901, 2020.
  • [46] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems (NeurIPS), 35:27730–27744, 2022.
  • [47] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024.
  • [48] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [49] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
  • [50] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and Larry Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), September 2014.
  • [51] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  • [52] Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2024.
  • [53] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. In 2014 IEEE International Conference on Computer Vision (ICCV), 2014.
  • [55] Midjourney. https://discord.com/invite/midjourney.
  • [56] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. Pp-ocrv3: More attempts for the improvement of ultra lightweight ocr system, 2022.
  • [57] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • [58] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. arXiv preprint arXiv:2311.16465, 2023.
  • [59] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [60] Yongshuo Zong, Tingyang Yu, Bingchen Zhao, Ruchika Chavhan, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
  • [61] Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 296–310. Association for Computational Linguistics, June 2021.
  • [62] Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2022.
  • [63] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.
  • [64] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 06 1938.
  • [65] Meta. llama3. https://github.com/meta-llama/llama3, 2024.
  • [66] Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
  • [67] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2021.

Appendix

Appendix A The Leaderboards

Rank Model Score
  1 XComposer-VL 80.13
2 InstructBlip 72.36
3 Blip2 70.03
4 Shikra-VQA 64.03
5 Shikra 62.97
6 Otter 62.22
7 Qwen-VL-Chat 61.13
8 LLAVA-1.5-13B 60.17
9 LLAVA-1.5-7B 53.81
10 MiniGPT-4 42.52
(a) Clean-Movie
Rank Model Score
  1 XComposer-VL 95.89
2 InstructBlip 95.44
3 Blip2 94.64
4 Shikra-VQA 88.25
5 Qwen-VL-Chat 79.75
6 Shikra 77.65
7 Otter 69.10
8 LLAVA-1.5-13B 68.82
9 LLAVA-1.5-7B 62.77
10 MiniGPT-4 46.98
(b) Clean-Action
Rank Model Score
  1 XComposer-VL 75.55
2 InstructBlip 63.42
3 Otter 63.04
4 Shikra-VQA 60.10
5 Blip2 59.14
6 Shikra 58.55
7 Qwen-VL-Chat 58.27
8 LLAVA-1.5-13B 56.69
9 LLAVA-1.5-7B 53.48
10 MiniGPT-4 40.23
(c) Clean-TV Show
Rank Model Score
  1 XComposer-VL 79.92
2 InstructBlip 77.84
3 Blip2 76.88
4 Shikra 73.55
5 Shikra-VQA 73.07
6 Qwen-VL-Chat 65.12
7 LLAVA-1.5-13B 62.86
8 Otter 60.95
9 LLAVA-1.5-7B 57.74
10 MiniGPT-4 42.06
(d) Clean-Profession
Rank Model Score
  1 InstructBlip 96.90
2 Blip2 96.64
3 XComposer-VL 96.43
4 Shikra-VQA 82.85
5 Shikra 80.17
6 Qwen-VL-Chat 75.84
7 LLAVA-1.5-13B 75.12
8 LLAVA-1.5-7B 69.74
9 Otter 62.50
10 MiniGPT-4 52.81
(e) Clean-Landmark
Rank Model Score
  1 XComposer-VL 75.00
2 Qwen-VL-Chat 63.11
3 InstructBlip 62.66
4 Otter 59.56
5 Blip2 59.50
6 LLAVA-1.5-13B 53.88
7 Shikra-VQA 53.31
8 Shikra 52.84
9 LLAVA-1.5-7B 47.80
10 MiniGPT-4 38.98
(f) Clean-Anime
Rank Model Score
  1 XComposer-VL 85.96
2 InstructBlip 83.80
3 Blip2 78.09
4 Shikra-VQA 69.83
5 Qwen-VL-Chat 69.58
6 Shikra 67.43
7 LLAVA-1.5-13B 61.20
8 Otter 58.33
9 LLAVA-1.5-7B 48.80
10 MiniGPT-4 40.86
(g) Clean-Clothes
Rank Model Score
  1 XComposer-VL 87.91
2 InstructBlip 79.91
3 Blip2 79.23
4 Qwen-VL-Chat 68.71
5 Shikra-VQA 62.93
6 Shikra 61.81
7 LLAVA-1.5-13B 59.04
8 LLAVA-1.5-7B 56.04
9 Otter 52.97
10 MiniGPT-4 39.13
(h) Clean-Celebrity
Rank Model Score
  1 XComposer-VL 91.27
2 InstructBlip 90.28
3 Blip2 89.78
4 Shikra-VQA 80.23
5 Shikra 75.84
6 Qwen-VL-Chat 71.37
7 LLAVA-1.5-13B 68.16
8 Otter 63.22
9 LLAVA-1.5-7B 52.87
10 MiniGPT-4 47.72
(i) Clean-Food
Rank Model Score
  1 XComposer-VL 93.19
2 InstructBlip 92.62
3 Blip2 91.84
4 Shikra-VQA 80.05
5 Shikra 77.57
6 Qwen-VL-Chat 76.11
7 Otter 73.96
8 LLAVA-1.5-13B 67.61
9 LLAVA-1.5-7B 48.84
10 MiniGPT-4 47.95
(j) Clean-Plant
Rank Model Score
  1 XComposer-VL 73.12
2 InstructBlip 62.34
3 Shikra-VQA 61.42
4 Blip2 61.13
5 Shikra 61.04
6 LLAVA-1.5-13B 54.57
7 Qwen-VL-Chat 51.28
8 LLAVA-1.5-7B 46.75
9 Otter 44.58
10 MiniGPT-4 41.73
(k) Clean-Age
Rank Model Score
  1 XComposer-VL 98.83
2 Blip2 97.12
3 InstructBlip 93.86
4 LLAVA-1.5-13B 91.12
5 Shikra-VQA 90.40
6 Shikra 85.50
7 Qwen-VL-Chat 78.13
8 Otter 78.09
9 MiniGPT-4 52.86
10 LLAVA-1.5-7B 51.96
(l) Clean-Gender
Rank Model Score
  1 XComposer-VL 86.08
2 InstructBlip 85.59
3 Shikra-VQA 83.53
4 Blip2 82.59
5 Shikra 80.51
6 Qwen-VL-Chat 74.76
7 Otter 67.11
8 LLAVA-1.5-13B 64.81
9 LLAVA-1.5-7B 56.81
10 MiniGPT-4 46.46
(m) Clean-Expression
Rank Model Score
  1 XComposer-VL 78.56
2 InstructBlip 76.31
3 Blip2 73.16
4 LLAVA-1.5-13B 66.50
5 Qwen-VL-Chat 65.19
6 Shikra-VQA 64.56
7 Shikra 64.53
8 Otter 50.44
9 LLAVA-1.5-7B 47.27
10 MiniGPT-4 38.98
(n) Clean-Race
Rank Model Score
  1 XComposer-VL 97.27
2 InstructBlip 96.09
3 Blip2 95.95
4 Shikra-VQA 82.72
5 Otter 82.47
6 Qwen-VL-Chat 79.98
7 Shikra 79.44
8 LLAVA-1.5-13B 72.11
9 LLAVA-1.5-7B 50.08
10 MiniGPT-4 47.77
(o) Clean-Animal
Rank Model Score
  1 XComposer-VL 90.97
2 InstructBlip 90.00
3 Blip2 88.53
4 Shikra-VQA 82.72
5 Qwen-VL-Chat 75.73
6 Shikra 70.06
7 LLAVA-1.5-13B 67.89
8 Otter 64.91
9 LLAVA-1.5-7B 54.31
10 MiniGPT-4 51.65
(p) Clean-Object
Rank Model Score
  1 Blip2 68.38
2 InstructBlip 67.50
3 XComposer-VL 66.06
4 Shikra-VQA 64.86
5 Otter 64.37
6 Qwen-VL-Chat 63.28
7 Shikra 63.02
8 LLAVA-1.5-13B 58.84
9 LLAVA-1.5-7B 44.96
10 MiniGPT-4 43.70
(q) Clean-OCR
Rank Model Score
  1 XComposer-VL 73.94
2 Blip2 66.98
3 InstructBlip 66.77
4 Otter 63.82
5 Qwen-VL-Chat 61.80
6 Shikra 57.21
7 Shikra-VQA 56.89
8 LLAVA-1.5-13B 55.68
9 LLAVA-1.5-7B 49.05
10 MiniGPT-4 41.47
(r) Clean-Style
Rank Model Score
  1 XComposer-VL 73.27
2 Shikra-VQA 68.47
3 InstructBlip 67.84
4 Shikra 66.95
5 Blip2 66.17
6 Otter 63.23
7 Qwen-VL-Chat 60.51
8 LLAVA-1.5-13B 59.00
9 LLAVA-1.5-7B 47.64
10 MiniGPT-4 41.02
(s) Clean-Background
Rank Model Score
  1 XComposer-VL 88.50
2 InstructBlip 88.24
3 Blip2 86.78
4 Shikra-VQA 74.14
5 Shikra 73.37
6 Qwen-VL-Chat 69.72
7 LLAVA-1.5-13B 58.36
8 Otter 53.12
9 LLAVA-1.5-7B 46.91
10 MiniGPT-4 42.90
(t) Clean-Color
Table 6: Leaderboards of all perceptual tasks under the clean scenario.
Rank Model Score
  1 XComposer-VL 78.20
2 InstructBlip 71.30
3 Blip2 68.69
4 Shikra-VQA 64.19
5 Shikra 63.47
6 Otter 61.39
7 Qwen-VL 59.27
8 LLAVA-1.5-13B 58.67
9 LLAVA-1.5-7B 52.36
10 MiniGPT-4 43.02
(a) Corru.-Movie
Rank Model Score
  1 XComposer-VL 95.86
2 InstructBlip 95.09
3 Blip2 93.87
4 Shikra-VQA 88.72
5 Shikra 77.77
6 Qwen-VL 75.09
7 Otter 70.09
8 LLAVA-1.5-13B 68.34
9 LLAVA-1.5-7B 63.00
10 MiniGPT-4 47.22
(b) Corru.-Action
Rank Model Score
  1 XComposer-VL 72.86
2 InstructBlip 62.42
3 Otter 60.71
4 Shikra 60.09
5 Shikra-VQA 59.58
6 Blip2 58.16
7 Qwen-VL 57.95
8 LLAVA-1.5-13B 55.18
9 LLAVA-1.5-7B 52.34
10 MiniGPT-4 38.70
(c) Corru.-TV Show
Rank Model Score
  1 XComposer-VL 78.48
2 InstructBlip 77.84
3 Blip2 75.77
4 Shikra 73.72
5 Shikra-VQA 73.38
6 Qwen-VL 64.97
7 Otter 62.71
8 LLAVA-1.5-13B 62.23
9 LLAVA-1.5-7B 56.95
10 MiniGPT-4 43.22
(d) Corru.-Profession
Rank Model Score
  1 InstructBlip 96.33
2 Blip2 96.28
3 XComposer-VL 95.97
4 Shikra-VQA 83.01
5 Shikra 82.38
6 LLAVA-1.5-13B 74.63
7 Qwen-VL 73.55
8 LLAVA-1.5-7B 68.78
9 Otter 61.65
10 MiniGPT-4 49.55
(e) Corru.-Landmark
Rank Model Score
  1 XComposer-VL 72.28
2 InstructBlip 62.79
3 Qwen-VL 60.44
4 Blip2 58.67
5 Otter 57.24
6 LLAVA-1.5-13B 53.67
7 Shikra-VQA 52.28
8 Shikra 52.00
9 LLAVA-1.5-7B 47.87
10 MiniGPT-4 40.91
(f) Corru.-Anime
Rank Model Score
  1 XComposer-VL 84.87
2 InstructBlip 82.84
3 Blip2 76.69
4 Shikra-VQA 70.59
5 Shikra 67.05
6 Qwen-VL 64.26
7 LLAVA-1.5-13B 60.88
8 Otter 59.87
9 LLAVA-1.5-7B 48.12
10 MiniGPT-4 41.38
(g) Corru.-Clothes
Rank Model Score
  1 XComposer-VL 87.25
2 InstructBlip 79.23
3 Blip2 78.36
4 Qwen-VL 64.89
5 Shikra-VQA 62.55
6 Shikra 61.68
7 LLAVA-1.5-13B 58.64
8 LLAVA-1.5-7B 56.36
9 Otter 53.59
10 MiniGPT-4 38.80
(h) Corru.-Celebrity
Rank Model Score
  1 XComposer-VL 90.61
2 InstructBlip 90.08
3 Blip2 89.19
4 Shikra-VQA 80.55
5 Shikra 75.63
6 Qwen-VL 70.71
7 LLAVA-1.5-13B 67.46
8 Otter 65.06
9 LLAVA-1.5-7B 52.94
10 MiniGPT-4 47.76
(i) Corru.-Food
Rank Model Score
  1 XComposer-VL 92.45
2 InstructBlip 92.23
3 Blip2 91.15
4 Shikra-VQA 80.80
5 Shikra 77.69
6 Otter 73.42
7 Qwen-VL 72.01
8 LLAVA-1.5-13B 67.22
9 LLAVA-1.5-7B 48.84
10 MiniGPT-4 48.04
(j) Corru.-Plant
Rank Model Score
  1 XComposer-VL 73.04
2 InstructBlip 61.59
3 Shikra-VQA 61.09
4 Shikra 60.40
5 Blip2 60.03
6 Qwen-VL 55.26
7 LLAVA-1.5-13B 53.67
8 LLAVA-1.5-7B 46.76
9 Otter 45.09
10 MiniGPT-4 40.84
(k) Corru.-Age
Rank Model Score
  1 XComposer-VL 98.53
2 Blip2 97.76
3 InstructBlip 93.75
4 LLAVA-1.5-13B 91.09
5 Shikra-VQA 90.17
6 Shikra 85.91
7 Otter 80.22
8 Qwen-VL 67.00
9 MiniGPT-4 52.75
10 LLAVA-1.5-7B 52.27
(l) Corru.-Gender
Rank Model Score
  1 XComposer-VL 86.24
2 InstructBlip 84.95
3 Shikra-VQA 83.56
4 Blip2 81.56
5 Shikra 80.56
6 Qwen-VL 70.55
7 Otter 66.64
8 LLAVA-1.5-13B 64.94
9 LLAVA-1.5-7B 56.16
10 MiniGPT-4 46.55
(m) Corru.-Expression
Rank Model Score
  1 XComposer-VL 78.80
2 InstructBlip 76.19
3 Blip2 71.72
4 LLAVA-1.5-13B 65.33
5 Shikra-VQA 64.73
6 Shikra 63.94
7 Qwen-VL 63.55
8 Otter 49.73
9 LLAVA-1.5-7B 47.92
10 MiniGPT-4 39.65
(n) Corru.-Race
Rank Model Score
  1 XComposer-VL 97.06
2 InstructBlip 95.86
3 Blip2 95.69
4 Shikra-VQA 83.03
5 Otter 82.84
6 Shikra 79.44
7 LLAVA-1.5-13B 71.47
8 Qwen-VL 71.43
9 LLAVA-1.5-7B 50.67
10 MiniGPT-4 48.66
(o) Corru.-Animal
Rank Model Score
  1 XComposer-VL 90.81
2 InstructBlip 89.76
3 Blip2 88.05
4 Shikra-VQA 82.76
5 Shikra 71.49
6 Qwen-VL 67.78
7 LLAVA-1.5-13B 67.23
8 Otter 66.18
9 LLAVA-1.5-7B 54.09
10 MiniGPT-4 51.02
(p) Corru.-Object
Rank Model Score
  1 Blip2 94.06
2 InstructBlip 90.54
3 XComposer-VL 87.19
4 Shikra-VQA 80.80
5 Otter 78.70
6 Shikra 77.17
7 Qwen-VL 69.66
8 LLAVA-1.5-13B 68.58
9 LLAVA-1.5-7B 49.10
10 MiniGPT-4 44.63
(q) Corru.-OCR
Rank Model Score
  1 XComposer-VL 72.34
2 InstructBlip 65.06
3 Blip2 64.86
4 Otter 61.14
5 Qwen-VL 60.34
6 Shikra 56.61
7 Shikra-VQA 55.81
8 LLAVA-1.5-13B 53.65
9 LLAVA-1.5-7B 47.67
10 MiniGPT-4 40.81
(r) Corru.-Style
Rank Model Score
  1 XComposer-VL 72.53
2 Shikra-VQA 68.25
3 InstructBlip 67.48
4 Shikra 67.19
5 Blip2 64.69
6 Otter 62.99
7 Qwen-VL 60.50
8 LLAVA-1.5-13B 58.50
9 LLAVA-1.5-7B 47.12
10 MiniGPT-4 40.53
(s) Corru.-Background
Rank Model Score
  1 XComposer-VL 88.24
2 InstructBlip 88.09
3 Blip2 86.25
4 Shikra-VQA 74.14
5 Shikra 73.07
6 Qwen-VL 70.78
7 LLAVA-1.5-13B 57.70
8 Otter 56.14
9 LLAVA-1.5-7B 46.76
10 MiniGPT-4 42.86
(t) Corru.-Color
Table 7: Leaderboards of all Perceptual tasks under corruption scenario.
Rank Model Score
  1 XComposer-VL 67.09
2 Blip2 52.55
3 LLAVA-1.5-13B 46.74
4 LLAVA-1.5-7B 45.24
5 Shikra 44.84
6 Shikra-VQA 44.36
7 Qwen-VL 44.30
8 InstructBlip 43.06
9 MiniGPT-4 40.26
10 Otter 37.97
(a) Pr.Att.-Movie
Rank Model Score
  1 XComposer-VL 86.24
2 Blip2 79.13
3 InstructBlip 75.94
4 Shikra-VQA 71.36
5 Qwen-VL 65.26
6 Shikra 58.82
7 LLAVA-1.5-13B 57.66
8 LLAVA-1.5-7B 51.62
9 MiniGPT-4 47.07
10 Otter 39.77
(b) Pr.Att.-Action
Rank Model Score
  1 XComposer-VL 55.08
2 LLAVA-1.5-7B 43.55
3 Qwen-VL 40.72
4 LLAVA-1.5-13B 40.09
5 Blip2 39.71
6 Shikra 39.01
7 MiniGPT-4 37.40
8 Shikra-VQA 36.50
9 InstructBlip 32.17
10 Otter 31.48
(c) Pr.Att.-TV Show
Rank Model Score
  1 XComposer-VL 67.14
2 Blip2 59.62
3 Shikra 55.70
4 InstructBlip 54.52
5 Shikra-VQA 54.25
6 Qwen-VL 48.66
7 LLAVA-1.5-13B 47.49
8 LLAVA-1.5-7B 42.86
9 MiniGPT-4 42.22
10 Otter 38.12
(d) Pr.Att.-Profession
Rank Model Score
  1 XComposer-VL 89.81
2 Blip2 84.85
3 InstructBlip 76.78
4 Shikra-VQA 68.28
5 LLAVA-1.5-13B 65.98
6 Shikra 64.47
7 Qwen-VL 64.29
8 LLAVA-1.5-7B 56.39
9 MiniGPT-4 49.58
10 Otter 43.89
(e) Pr.Att.-Landmark
Rank Model Score
  1 XComposer-VL 59.72
2 Blip2 45.05
3 Qwen-VL 44.95
4 LLAVA-1.5-7B 42.23
5 LLAVA-1.5-13B 40.95
6 InstructBlip 38.26
7 Shikra-VQA 38.25
8 MiniGPT-4 37.27
9 Shikra 37.02
10 Otter 36.16
(f) Pr.Att.-Anime
Rank Model Score
  1 XComposer-VL 76.43
2 Blip2 66.53
3 InstructBlip 62.14
4 Qwen-VL 51.56
5 LLAVA-1.5-13B 49.63
6 Shikra-VQA 49.02
7 Shikra 47.23
8 LLAVA-1.5-7B 42.22
9 MiniGPT-4 39.66
10 Otter 34.67
(g) Pr.Att.-Clothes
Rank Model Score
  1 XComposer-VL 73.48
2 Blip2 57.30
3 Qwen-VL 52.34
4 LLAVA-1.5-7B 45.11
5 InstructBlip 43.36
6 Shikra 43.13
7 Shikra-VQA 42.01
8 LLAVA-1.5-13B 41.59
9 MiniGPT-4 38.64
10 Otter 33.95
(h) Pr.Att.-Celebrity
Rank Model Score
  1 XComposer-VL 83.60
2 Blip2 81.67
3 InstructBlip 75.94
4 Qwen-VL 60.43
5 Shikra-VQA 58.61
6 LLAVA-1.5-13B 56.20
7 Shikra 56.08
8 MiniGPT-4 46.70
9 LLAVA-1.5-7B 45.20
10 Otter 38.39
(i) Pr.Att.-Food
Rank Model Score
  1 XComposer-VL 82.18
2 Blip2 82.16
3 InstructBlip 74.20
4 Qwen-VL 60.29
5 Shikra 54.03
6 LLAVA-1.5-13B 53.77
7 Shikra-VQA 52.86
8 MiniGPT-4 46.45
9 LLAVA-1.5-7B 42.07
10 Otter 39.52
(j) Pr.Att.-Plant
Rank Model Score
  1 XComposer-VL 60.37
2 LLAVA-1.5-7B 42.53
3 Blip2 42.20
4 LLAVA-1.5-13B 41.30
5 MiniGPT-4 39.18
6 Shikra 38.98
7 Qwen-VL 37.56
8 Shikra-VQA 37.47
9 InstructBlip 35.75
10 Otter 32.18
(k) Pr.Att.-Age
Rank Model Score
  1 Blip2 93.97
2 InstructBlip 80.44
3 XComposer-VL 79.56
4 LLAVA-1.5-13B 74.03
5 Shikra-VQA 63.62
6 Shikra 62.06
7 Qwen-VL 56.05
8 MiniGPT-4 51.87
9 LLAVA-1.5-7B 51.33
10 Otter 39.31
(l) Pr.Att.-Gender
Rank Model Score
  1 XComposer-VL 78.59
2 Blip2 67.94
3 InstructBlip 67.72
4 Shikra-VQA 65.00
5 Shikra 61.51
6 Qwen-VL 59.16
7 LLAVA-1.5-13B 58.33
8 LLAVA-1.5-7B 53.00
9 MiniGPT-4 45.37
10 Otter 39.91
(m) Pr.Att.-Expression
Rank Model Score
  1 XComposer-VL 62.21
2 Blip2 52.98
3 InstructBlip 48.42
4 LLAVA-1.5-13B 47.80
5 Qwen-VL 44.81
6 Shikra-VQA 42.09
7 LLAVA-1.5-7B 41.75
8 Shikra 41.07
9 MiniGPT-4 38.03
10 Otter 28.77
(n) Pr.Att.-Race
Rank Model Score
  1 Blip2 90.07
2 XComposer-VL 87.80
3 InstructBlip 82.43
4 Qwen-VL 65.20
5 LLAVA-1.5-13B 60.13
6 Shikra-VQA 55.53
7 Shikra 53.98
8 MiniGPT-4 48.02
9 Otter 46.92
10 LLAVA-1.5-7B 44.73
(o) Pr.Att.-Animal
Rank Model Score
  1 Blip2 83.55
2 XComposer-VL 81.03
3 InstructBlip 76.06
4 Qwen-VL 61.06
5 LLAVA-1.5-13B 56.84
6 Shikra-VQA 55.02
7 MiniGPT-4 49.61
8 Shikra 46.96
9 LLAVA-1.5-7B 44.26
10 Otter 38.32
(p) Pr.Att.-Object
Rank Model Score
  1 XComposer-VL 60.60
2 Blip2 52.84
3 LLAVA-1.5-13B 45.26
4 Qwen-VL 43.55
5 LLAVA-1.5-7B 42.23
6 Shikra 41.06
7 MiniGPT-4 40.32
8 Shikra-VQA 40.12
9 InstructBlip 40.03
10 Otter 33.86
(q) Pr.Att.-Style
Rank Model Score
  1 XComposer-VL 64.99
2 Shikra-VQA 55.50
3 Blip2 54.94
4 Shikra 54.16
5 LLAVA-1.5-13B 51.16
6 InstructBlip 49.98
7 Qwen-VL 49.17
8 LLAVA-1.5-7B 44.34
9 MiniGPT-4 39.23
10 Otter 37.94
(r) Pr.Att.-Background
Rank Model Score
  1 XComposer-VL 78.66
2 Blip2 74.47
3 InstructBlip 63.59
4 Qwen-VL 58.40
5 Shikra 51.75
6 Shikra-VQA 51.10
7 LLAVA-1.5-13B 49.48
8 LLAVA-1.5-7B 43.40
9 MiniGPT-4 41.70
10 Otter 35.56
(s) Pr.Att.-Color
Table 8: Leaderboards of all Perceptual tasks under print attacking scenario.
Rank Model Score
  1 Qwen-VL 58.17
2 Shikra 57.41
3 Shikra-VQA 57.26
4 LLAVA-1.5-13B 57.17
5 Otter 56.61
6 LLAVA-1.5-7B 51.91
7 MiniGPT-4 36.95
8 Blip2 33.33
9 InstructBlip 31.97
10 XComposer-VL 31.75
(a) Ad.Att.-Movie
Rank Model Score
  1 Shikra-VQA 85.84
2 Qwen-VL 79.11
3 Shikra 77.17
4 LLAVA-1.5-13B 66.37
5 Otter 65.81
6 LLAVA-1.5-7B 60.84
7 MiniGPT-4 37.59
8 Blip2 34.64
9 XComposer-VL 33.88
10 InstructBlip 32.85
(b) Ad.Att.-Action
Rank Model Score
  1 Qwen-VL 55.79
2 Shikra-VQA 53.57
3 Otter 53.05
4 LLAVA-1.5-13B 53.03
5 Shikra 52.61
6 LLAVA-1.5-7B 48.36
7 MiniGPT-4 38.22
8 XComposer-VL 34.45
9 Blip2 32.04
10 InstructBlip 30.93
(c) Ad.Att.-TV Show
Rank Model Score
  1 Shikra-VQA 71.47
2 Shikra 67.66
3 Qwen-VL 64.80
4 Otter 61.44
5 LLAVA-1.5-13B 59.83
6 LLAVA-1.5-7B 56.15
7 MiniGPT-4 36.95
8 Blip2 36.14
9 InstructBlip 35.02
10 XComposer-VL 30.86
(d) Ad.Att.-Profession
Rank Model Score
  1 Shikra-VQA 78.40
2 Shikra 77.85
3 Qwen-VL 76.34
4 LLAVA-1.5-13B 71.92
5 LLAVA-1.5-7B 68.62
6 Otter 56.87
7 MiniGPT-4 35.73
8 XComposer-VL 35.33
9 Blip2 34.53
10 InstructBlip 32.03
(e) Ad.Att.-Landmark
Rank Model Score
  1 Qwen-VL 60.07
2 Otter 51.86
3 LLAVA-1.5-13B 51.81
4 Shikra-VQA 48.45
5 Shikra 47.75
6 LLAVA-1.5-7B 47.32
7 MiniGPT-4 38.38
8 Blip2 37.95
9 InstructBlip 34.29
10 XComposer-VL 30.89
(f) Ad.Att.-Anime
Rank Model Score
  1 Qwen-VL 69.92
2 Shikra-VQA 68.32
3 Shikra 65.73
4 LLAVA-1.5-13B 58.83
5 Otter 57.58
6 LLAVA-1.5-7B 47.05
7 MiniGPT-4 38.05
8 InstructBlip 36.65
9 XComposer-VL 36.41
10 Blip2 36.18
(g) Ad.Att.-Clothes
Rank Model Score
  1 Qwen-VL 64.77
2 Shikra 60.02
3 Shikra-VQA 59.70
4 LLAVA-1.5-13B 55.53
5 LLAVA-1.5-7B 54.95
6 Otter 52.26
7 MiniGPT-4 37.48
8 Blip2 34.72
9 InstructBlip 33.52
10 XComposer-VL 33.09
(h) Ad.Att.-Celebrity
Rank Model Score
  1 Shikra-VQA 75.26
2 Qwen-VL 73.72
3 Shikra 71.07
4 LLAVA-1.5-13B 64.95
5 Otter 64.39
6 LLAVA-1.5-7B 52.30
7 MiniGPT-4 36.94
8 XComposer-VL 33.16
9 Blip2 32.48
10 InstructBlip 32.13
(i) Ad.Att.-Food
Rank Model Score
  1 Qwen-VL 79.59
2 Shikra-VQA 76.06
3 Shikra 73.53
4 Otter 67.16
5 LLAVA-1.5-13B 64.44
6 LLAVA-1.5-7B 47.41
7 MiniGPT-4 37.74
8 Blip2 35.84
9 InstructBlip 34.02
10 XComposer-VL 34.00
(j) Ad.Att.-Plant
Rank Model Score
  1 Shikra 56.95
2 Shikra-VQA 56.80
3 LLAVA-1.5-13B 51.43
4 Qwen-VL 48.11
5 LLAVA-1.5-7B 45.89
6 Otter 42.82
7 MiniGPT-4 37.14
8 Blip2 34.16
9 InstructBlip 33.14
10 XComposer-VL 30.78
(k) Ad.Att.-Age
Rank Model Score
  1 LLAVA-1.5-13B 89.35
2 Shikra-VQA 89.06
3 Shikra 85.34
4 Otter 82.25
5 Qwen-VL 79.18
6 LLAVA-1.5-7B 51.95
7 MiniGPT-4 48.38
8 Blip2 39.48
9 InstructBlip 38.51
10 XComposer-VL 38.09
(l) Ad.Att.-Gender
Rank Model Score
  1 Shikra-VQA 82.04
2 Shikra 79.86
3 Qwen-VL 73.91
4 LLAVA-1.5-13B 63.13
5 Otter 60.44
6 LLAVA-1.5-7B 55.70
7 MiniGPT-4 41.07
8 Blip2 35.95
9 InstructBlip 33.52
10 XComposer-VL 32.80
(m) Ad.Att.-Expression
Rank Model Score
  1 LLAVA-1.5-13B 63.06
2 Shikra-VQA 61.19
3 Shikra 60.24
4 Qwen-VL 59.50
5 LLAVA-1.5-7B 45.97
6 Otter 45.50
7 MiniGPT-4 37.47
8 Blip2 36.81
9 InstructBlip 36.36
10 XComposer-VL 34.77
(n) Ad.Att.-Race
Rank Model Score
  1 Qwen-VL 81.78
2 Shikra-VQA 80.44
3 Otter 78.00
4 Shikra 77.66
5 LLAVA-1.5-13B 70.94
6 LLAVA-1.5-7B 49.30
7 MiniGPT-4 37.61
8 Blip2 35.23
9 XComposer-VL 34.38
10 InstructBlip 34.01
(o) Ad.Att.-Animal
Rank Model Score
  1 Shikra-VQA 79.05
2 Qwen-VL 77.26
3 Shikra 69.98
4 LLAVA-1.5-13B 65.33
5 Otter 62.69
6 LLAVA-1.5-7B 52.62
7 MiniGPT-4 37.84
8 Blip2 37.62
9 XComposer-VL 36.97
10 InstructBlip 36.88
(p) Ad.Att.-Object
Rank Model Score
  1 Shikra-VQA 79.12
2 Shikra 77.62
3 Qwen-VL 76.50
4 Otter 74.91
5 LLAVA-1.5-13B 66.43
6 XComposer-VL 53.21
7 InstructBlip 50.03
8 Blip2 47.80
9 LLAVA-1.5-7B 46.78
10 MiniGPT-4 36.78
(q) Ad.Att.-OCR
Rank Model Score
  1 Qwen-VL 56.80
2 Otter 53.47
3 Shikra 51.05
4 Shikra-VQA 50.30
5 LLAVA-1.5-13B 50.01
6 LLAVA-1.5-7B 44.85
7 MiniGPT-4 37.98
8 XComposer-VL 37.20
9 InstructBlip 36.20
10 Blip2 35.87
(r) Ad.Att.-Style
Rank Model Score
  1 Shikra-VQA 65.41
2 Shikra 64.50
3 Otter 60.30
4 Qwen-VL 60.09
5 LLAVA-1.5-13B 56.28
6 LLAVA-1.5-7B 46.53
7 MiniGPT-4 36.91
8 Blip2 35.05
9 XComposer-VL 34.71
10 InstructBlip 34.45
(s) Ad.Att.-Background
Rank Model Score
  1 Qwen-VL 69.63
2 Shikra 69.57
3 Shikra-VQA 69.05
4 LLAVA-1.5-13B 54.28
5 Otter 53.12
6 LLAVA-1.5-7B 46.08
7 MiniGPT-4 37.54
8 XComposer-VL 37.23
9 Blip2 36.83
10 InstructBlip 35.81
(t) Ad.Att.-Color
Table 9: Leaderboards of all Perceptual tasks under adversarial attacking scenario.
Rank Model Score
  1 XComposer-VL 72.37
2 InstructBlip 67.41
3 Blip2 66.13
4 Qwen-VL-Chat 62.68
5 Shikra 62.52
6 LLAVA-1.5-13B 59.28
7 Shikra-VQA 58.64
8 Otter 55.49
9 LLAVA-1.5-7B 51.75
10 MiniGPT-4 42.64
(a) Clean
Rank Model Score
  1 XComposer-VL 71.63
2 InstructBlip 67.62
3 Blip2 65.49
4 Shikra 62.37
5 Qwen-VL-Chat 60.58
6 LLAVA-1.5-13B 58.73
7 Shikra-VQA 58.47
8 Otter 55.42
9 LLAVA-1.5-7B 46.44
10 MiniGPT-4 42.87
(b) Corru.
Rank Model Score
  1 XComposer-VL 63.75
2 Blip2 56.20
3 Qwen-VL-Chat 51.57
4 InstructBlip 51.23
5 LLAVA-1.5-13B 50.70
6 Shikra 49.08
7 LLAVA-1.5-7B 46.47
8 Shikra-VQA 43.72
9 MiniGPT-4 42.87
10 Otter 47.10
(c) Print Attack
Rank Model Score
  1 Qwen-VL-Chat 60.39
2 Shikra 59.31
3 LLAVA-1.5-13B 56.84
4 Shikra-VQA 55.16
5 Otter 51.94
6 LLAVA-1.5-7B 50.07
7 InstructBlip 33.13
8 MiniGPT-4 32.95
9 Blip2 32.65
10 XComposer-VL 30.52
(d) Adversarial Attack
Table 10: Leaderboards of comprehensive performance in the clean, corruption, print attack and adversarial attack scenarios.

The model performance leaderboards for each subtask under each scenario are shown in Tab. 6, Tab. 7, Tab. 8, and Tab. 9. For each subtask, we report the average of the multi-choice and true-or-false evaluation results. Since a free-form question assesses the model’s perception abilities across multiple subtasks at the same time, the free-form results are not included in the per-subtask leaderboards.

The overall performance in each scenario is displayed in Tab. 10. We calculate the final score as the average of the scores of the three question types (i.e., multi-choice, true-or-false and free-form).
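
For clarity, the aggregation above can be written as the following minimal sketch. The per-question record keys ("model", "subtask", "question_type", "score") are illustrative placeholders, not the released file format.

```python
# Minimal sketch of the score aggregation described above.
# Assumes per-question results as dicts with illustrative keys
# "model", "subtask", "question_type" and a numeric "score".
from collections import defaultdict
from statistics import mean

def subtask_leaderboard(results, subtask):
    """Per-subtask score (Tab. 6-9): average over multi-choice and true-or-false only."""
    per_model = defaultdict(list)
    for r in results:
        if r["subtask"] == subtask and r["question_type"] in ("multi-choice", "true-or-false"):
            per_model[r["model"]].append(r["score"])
    return sorted(((round(mean(v), 2), m) for m, v in per_model.items()), reverse=True)

def overall_score(results, model):
    """Scenario-level score (Tab. 10): average of the three question-type averages."""
    by_type = defaultdict(list)
    for r in results:
        if r["model"] == model:
            by_type[r["question_type"]].append(r["score"])
    return mean(mean(v) for v in by_type.values())
```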

Appendix B Discussion

B.1 General Discussion

Limitation. Dysca is a dynamic and scalable benchmark, offering evaluation of 20 perceptual subtasks under 51 image styles and 4 scenarios. However, generating data for evaluating cognition abilities (e.g., commonsense reasoning) presents a challenge within the existing framework. This limitation arises from the reliance on predefined rules for prompt and question generation, which may not adequately capture the complexity of cognition-level questions.

Synthesis Data for Training / Fine-tuning. The use of synthetic data for model training / fine-tuning has been adopted in the field of Natural Language Processing (NLP) [65]. In this work, we do not explore the possibility of utilizing our benchmark for model training. Our primary goal in this paper is to provide a large-scale evaluation benchmark that addresses the issue of data leakage in current multimodal evaluation benchmarks and offers evaluation results across multiple subtasks, scenarios, question types and styles. Nevertheless, considering that Dysca has the capability to synthesize high-resolution and unlimited amounts of annotated multimodal data, we believe that Dysca also holds potential as a training data synthesis tool for LVLMs.

Reproducibility and Licence. All experiments are conducted on 8 × RTX 4090 GPUs. All the data and the code for generation and evaluation are released at https://github.com/Benchmark-Dysca/Dysca. The licence of Dysca is "CreativeML Open RAIL++-M", which follows the licence set by Stable Diffusion XL.

Ethical Concerns. Our Dysca leverages Stable Diffusion XL [36] to generate images. We make considerable efforts to prevent the model from generating unsafe images, e.g., NSFW and offensive content. First, we use the safety checker [66] to post-filter unsafe images: when an image is recognized as unsafe by the safety checker, the output is replaced with a blank image. Besides, we manually exclude from the metadata M the specific styles or words that may trigger unsafe image generation. We therefore believe that Dysca raises few ethical concerns.
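
As a schematic illustration of this post-filtering step (not the exact implementation), the logic can be sketched as follows, where safety_checker is a hypothetical callable standing in for the checker of [66] and is assumed to return True for unsafe images.

```python
# Schematic sketch of the unsafe-image post-filtering described above.
# `safety_checker(image)` is a placeholder for the checker of [66];
# it is assumed to return True when the image is flagged as unsafe.
from PIL import Image

def post_filter(image: Image.Image, safety_checker) -> Image.Image:
    """Replace any image flagged by the safety checker with a blank image."""
    if safety_checker(image):
        return Image.new("RGB", image.size, (0, 0, 0))  # blank output
    return image
```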

B.2 The Stability of Dysca

Refer to caption
Figure 6: The tendency of model’s overall performance under clean scenario with different scale of evaluation data.

In this section, we examine the stability of Dysca. We partition Dysca into 11 different scales: 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%, and compute the evaluation scores at each scale. The score is calculated as the sum of the scores obtained from multi-choice, true-or-false and free-form questions. As can be seen in Fig. 6, when the evaluation data scale is less than 30% of Dysca (i.e., fewer than 46.8K samples), the evaluation scores show significant fluctuations. When the data scale exceeds 40%, the results become stable, indicating that the current scale of Dysca achieves stable and reliable evaluation. Although 40% of Dysca is already sufficient for stable scores, Dysca aims to provide more than stable rankings: it draws on a large amount of data to provide in-depth feedback across different image styles and perceptual subtasks.
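
This stability check can be reproduced with a simple subsampling loop such as the sketch below, where score_fn is a placeholder for the full Dysca scoring pipeline and the fixed random seed is an assumption for reproducibility.

```python
# Sketch of the stability check: re-score random subsets of Dysca at
# increasing fractions and observe when the overall score converges.
# `score_fn(subset)` is a placeholder for the full Dysca scoring pipeline.
import random

def stability_curve(samples, score_fn,
                    fractions=(0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
                               0.6, 0.7, 0.8, 0.9, 1.0),
                    seed=0):
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        subset = rng.sample(samples, max(1, int(len(samples) * frac)))
        curve.append((frac, score_fn(subset)))
    return curve
```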

Appendix C The Metadata (M) of Dysca

Metadata (M) is the core of Dysca. It is randomly assembled from our collected source material and contains all the information needed to generate the prompt (P), the image (I), and the question-answer pairs (Q). Specifically, the metadata is a data container holding multiple dimensions of information: the foreground, the attributes corresponding to the foreground, the background, and the artistic style required to generate an image. Therefore, each instance of M maps one-to-one to a prompt, an image, and a set of question-answer pairs.
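
To make this mapping concrete, the sketch below shows one illustrative metadata instance and a rule-based mapping to a prompt and a question-answer pair. All field names, templates and option texts here are assumptions for illustration; the released metadata may be organised differently.

```python
# Illustrative metadata instance M and its rule-based mapping to a prompt P
# and a question-answer pair Q. Field names and templates are assumptions.
metadata = {
    "foreground": "a golden retriever",
    "attributes": {"action": "running"},
    "background": "on a beach",
    "style": "oil painting",
}

def to_prompt(m):
    # P: the textual prompt fed to the text-to-image model.
    return f"{m['foreground']} {m['attributes']['action']} {m['background']}, in {m['style']} style"

def to_qa(m):
    # Q: a rule-based multi-choice question with its ground-truth answer.
    question = "What is the animal in the image doing? (A) running (B) sleeping"
    answer = "(A) running"
    return question, answer

prompt = to_prompt(metadata)   # used to generate the image I
qa_pair = to_qa(metadata)      # paired with I for evaluation
```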

In order to ensure the quality and stability of the generated images, we carefully select the source material. First, for each perceptual subtask, we collect rich annotation material as described in Section 3.2. However, the metadata composed of these raw annotations is not always usable. On the one hand, some of the content is polysemous and can easily be misinterpreted by the model when generating images. On the other hand, some backgrounds or artistic styles (e.g., “Pokemon Style", “architectural style", etc.) negatively affect the quality of the image and do not accurately generate the desired content. To test the usability of these source materials, we ran several small-scale pre-generations covering all of them. After careful selection, we retain the entries that consistently produce high-quality images. The detailed information of the source materials is shown in Tab. 11.

Table 11: Detailed information of the source materials.
Category Data Description #Numbers
Style We collected artistic styles from the community that can be well rendered by the Stable Diffusion model. We removed those that strongly affect the image content or may generate unsafe images. 51
Background We selected 20 rich backgrounds that can be accurately generated by the Stable Diffusion model. 20
Age We chose four well-characterized age nodes: 14, 25, 40, and 80. 4
Expression We chose three characteristic expressions: smiling happily, shouting furiously, calm and placid. 3
Gender Male and Female. 2
Race We identified five races based on the ethnicities that can be generated by Stable Diffusion: Caucasian, Asian, African, Indian, and Middle Eastern. 5
Profession After pre-generation and careful selection, we chose 20 occupations with distinctive characteristics that are easy to generate. 20
Action After pre-generation and careful selection, we chose 20 actions with distinctive characteristics that are easy to generate. 20
Celebrity After pre-generation and careful selection, we chose 50 well-known celebrities. 50
Animal We selected a rich variety of animals, including mammals, birds, reptiles, insects, and aquatic animals, and they can be generated accurately by the Stable Diffusion model. 67
Plant We selected a rich variety of plants, including flowers, trees, fruits, and vegetables, and they can be generated by the Stable Diffusion model accurately. 37
Clothes We selected 16 common types of clothing that are highly distinguishable from each other. 16
Object We took the annotations from the MSCOCO [50] dataset after removing people, animals, and plants, and added some additional common objects. 80
Landmark We chose 23 characteristic landmarks from around the globe and they can be generated by the Stable Diffusion model accurately. 23
Food We collected 29 special dishes from around the globe and they can be generated by the Stable Diffusion model accurately. 29
Movie We selected 106 movie titles from the rating list of IMDb based on the number of user reviews. 106
Anime We selected 44 anime titles from the rating list of IMDb based on the number of user reviews. 44
TV shows We selected 20 TV show titles from the rating list of IMDb based on the number of user reviews. 20
OCR We randomly selected 5000 words from the IELTS vocabulary for the text material. Among them, words with length less than 3 were removed. 5000
Color We selected 8 easily distinguishable colors: red, orange, yellow, green, blue, purple, white, black. 8

Appendix D Scenarios Details

D.1 Print Attack Scenario

Following the settings in [35], we add the attack text to the images. Considering that the image resolution in Dysca is much higher than that in [35], we introduce more font variations in terms of font position and font orientation. Fig. 7 to Fig. 11 show detailed examples, and a minimal sketch of this overlay is given after Fig. 11.

Refer to caption
Figure 7: Images with different font size. ’112px’ means that the typos are 112 pixels in size.
Refer to caption
Figure 8: Images with different font color. ’Blue’ means that the typos are rendered in blue.
Refer to caption
Figure 9: Images with different font opacity. ’40%’ means that the opacity of the typos is 40%, and ’100%’ implies fully opaque typos.
Refer to caption
Figure 10: Images with different font orientation. ’15°’ means that the typos are rotated by 15°, and 0° implies the typos are horizontal.
Refer to caption
Figure 11: Images with different font position. An image is divided into a grid of 5 rows and 5 columns, leading to 25 sections. ’R1C1’ means the typo is located in row 1, column 1, which is the top left corner of the image.
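
The sketch below shows one way such typographic attack text can be overlaid with PIL, varying the size, color, opacity, position and orientation of the typos. The parameter defaults (e.g., 112 px red text), the font fallback and the function name are assumptions for illustration rather than the exact Dysca implementation.

```python
# Minimal sketch of overlaying typographic attack text with PIL.
# Default values and the font fallback are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def add_typo(image, text, position=(0, 0), font_size=112,
             color=(255, 0, 0), opacity=255, angle=0):
    """Overlay attack text with a given size, color, opacity, position and orientation."""
    base = image.convert("RGBA")
    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))        # transparent text layer
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # any available TTF works
    except OSError:
        font = ImageFont.load_default()
    ImageDraw.Draw(layer).text(position, text, font=font, fill=color + (opacity,))
    layer = layer.rotate(angle, center=position)               # font orientation
    return Image.alpha_composite(base, layer).convert("RGB")
```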

D.2 Corrupted Scenario

Examples of the 11 image corruptions are shown in Fig. 12, and a sketch of one such corruption is given after Fig. 12.

Refer to caption
Figure 12: Examples of 11 image corruptions applied to a single clean image.
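
As one hedged example of this kind of perturbation (the exact 11 corruption types and their parameters are defined in the released generation code), additive Gaussian noise could be applied as in the sketch below; the severity-to-noise mapping is an assumption for illustration.

```python
# Sketch of a single corruption (Gaussian noise) as one example of the
# perturbations shown in Fig. 12; the severity scale is an assumption.
import numpy as np
from PIL import Image

def gaussian_noise(image: Image.Image, severity: int = 3) -> Image.Image:
    """Apply additive Gaussian noise at one of five severity levels."""
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]   # assumed severity-to-std mapping
    x = np.asarray(image).astype(np.float32) / 255.0
    noisy = np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))
```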

Appendix E More Examples of Dysca

For each subject collected in the metadata (M), we display one example of its prompt (P), generated image (I) and corresponding question-answer pairs (Q).

Refer to caption
Figure 13: Plant
Refer to caption
Figure 14: Profession
Refer to caption
Figure 15: Celebrity
Refer to caption
Figure 16: Action
Refer to caption
Figure 17: Landmark
Refer to caption
Figure 18: Face
Refer to caption
Figure 19: Animal
Refer to caption
Figure 20: Object
Refer to caption
Figure 21: Clothes
Refer to caption
Figure 22: Food
Refer to caption
Figure 23: Movie
Refer to caption
Figure 24: TV
Refer to caption
Figure 25: Anime
Refer to caption
Figure 26: OCR

Appendix F Data Sheet

We follow the documentation framework provided by Gebru et al. [67].

F.1 Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

  • The proposed dataset is used for evaluating the perception ability of current LVLMs. We use synthesized images to prevent the potential data leakage problem of current benchmarks. The dataset tests LVLMs on 20 subtasks under 4 scenarios and 3 question types, revealing the existing drawbacks of current LVLMs.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

F.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

  • Each instance is a synthesized image together with its generation prompt and the corresponding question-answer pairs. There is only one type of instance.

How many instances are there in total (of each type, if appropriate)?

  • There are a total of 20 subtasks in our Dysca. For details of each subtask, please refer to Fig. 2.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

  • No. The images in Dysca are completely generated from scratch.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

  • Each instance consists of a prompt, the image generated by Stable Diffusion, a question and the corresponding answer.

Is there a label or target associated with each instance? If so, please provide a description.

  • Yes, Dysca provides the ground truth for each instance.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

  • No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

  • There are no relationships between individual instances.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

  • Following our motivation, the entire proposed dataset is used for testing purposes.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

  • Errors in image generation resulting from Stable Diffusion are unavoidable. However, we have performed dataset cleaning to minimize these errors. Furthermore, the stability experiment in Appendix B demonstrates that these errors do not affect the overall evaluation results of the dataset.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

  • The proposed Dysca does not rely on any external resources.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

  • No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

  • No. To ensure that the generated images do not contain offensive, insulting, threatening, or anxiety-inducing content, we manually filter out words from the metadata M that could potentially trigger the diffusion model to generate such images. A safety checker is also used to further avoid unsafe image generation.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

  • Yes.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

  • Yes. There are age, gender and race recognition subtasks in Dysca. Each of them is divided into several subpopulations, and the selection of these subpopulations is based on the ability of Stable Diffusion to generate representative images of them.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

  • Yes. There is a celebrity recognition task in our dataset, where 50 well-known celebrities are chosen. We choose celebrities who can be generated well by Stable Diffusion XL.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

  • No, our benchmark does not contain any sensitive data.

F.3 Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

  • We display the detailed explanation in Tab. 11.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

  • We collect the data by manual human curation.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

  • No.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

  • The metadata in Tab. 11 was collected by the authors. The images are generated by Stable Diffusion, and the labels of each image are also automatically generated.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

  • Our dataset was constructed in April 2024, but the results do not depend on the date of data collection.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

  • No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

  • No.

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

  • N/A. Dysca does not involve collecting data from individuals.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

  • N/A. Dysca does not involve collecting data from individuals.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

  • N/A. Dysca does not involve collecting data from individuals.

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

  • No.

F.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.

  • Yes. We leverage off-the-shelf models, i.e., PP-OCRv3 [56] and CLIP-L-14 [16], to clean the data. PP-OCRv3 [56] is used as a filter to exclude failure cases where TextDiffusion2 [58] renders the wrong text on the image. For the other images, we use CLIP-L-14 [16] to filter out images with low text-image consistency; a sketch of this filter is given below.
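
The following is a hedged sketch of the CLIP-based consistency filter, assuming the Hugging Face openai/clip-vit-large-patch14 checkpoint corresponds to CLIP-L-14; the similarity threshold is illustrative, not the value used to build Dysca.

```python
# Sketch of CLIP-based text-image consistency filtering.
# The checkpoint id and the threshold below are assumptions for illustration.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def text_image_consistency(image, prompt):
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()

def keep_image(image, prompt, threshold=0.25):
    """Keep only images whose similarity to the prompt exceeds the threshold."""
    return text_image_consistency(image, prompt) >= threshold
```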

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

  • Yes. We have saved all the data. However, most of these raw images were filtered out and are considered to be of little use.

Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.

  • Yes. The code for data generation, cleaning and evaluation is released at https://github.com/Benchmark-Dysca/Dysca.

F.5 Uses

Has the dataset been used for any tasks already? If so, please provide a description.

  • No. The proposed dataset is a novel one, used for evaluating the perception ability of current LVLMs.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

  • Yes. We plan to create a section on the project homepage to keep track of LVLMs papers for researchers to analyze and compare.

What (other) tasks could the dataset be used for?

  • In this work, we do not explore the possibility of utilizing our benchmark for model training / fine-tuning. Our primary goal in this paper is to provide a large-scale evaluation benchmark that addresses the issue of data leakage in current multimodal evaluation benchmarks and offers evaluation results across multiple subtasks, scenarios, question types and styles. Nevertheless, considering that Dysca has the capability to synthesize high-resolution and unlimited amounts of annotated multimodal data, we believe that Dysca also holds potential as a training data synthesis tool for LVLMs.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?

  • Yes.

Are there tasks for which the dataset should not be used? If so, please provide a description.

  • The proposed dataset should not be used to generate offensive data.

F.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

  • Yes.

How will the dataset will be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

  • We will open-source our dataset on our GitHub project homepage. At the moment, we do not have a DOI number.

When will the dataset be distributed?

  • The dataset can be downloaded right now.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

  • Yes. Dysca is distributed under the "CreativeML Open RAIL++-M" licence, which follows the licence set by Stable Diffusion XL.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

  • No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

  • Not yet.

F.7 Maintenance

Who will be supporting/hosting/maintaining the dataset?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

Is there an erratum? If so, please provide a link or other access point.

  • No.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

  • There are no plans at the moment, but if there are updates, they will be announced, and the download source will be updated on the project homepage.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

  • No.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

  • Yes. If there are any updates, the previous version of the dataset will also be shared on the website for download.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

  • Yes. We welcome and encourage researchers to extend/augment/build on/contribute to our dataset for non-profit purposes without the need for prior notification.