
Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Jie Zhang∗123, Zhongqi Wang∗123, Mengqi Lei∗4, Zheng Yuan123,
Bei Yan123, Shiguang Shan123, Xilin Chen123
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of AI Safety, Chinese Academy of Sciences, Beijing 100190, China
4 China University of Geosciences
Abstract

Currently, many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct questions by selecting images from existing datasets, resulting in potential data leakage. Besides, these benchmarks merely evaluate LVLMs on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Benchmark-Dysca/Dysca.

∗ Equal contribution.
Corresponding author: jiezhang@ict.ac.cn

1 Introduction

Recent years have witnessed the great success of Large Vision-Language Models (LVLMs) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. These models leverage powerful Large Language Models (LLMs) [11, 12, 13, 14, 15] as their brain and incorporate state-of-the-art visual encoders [16, 17, 18] as their eyes. Thanks to the alignment of visual features with the textual space and the development of visual instruction tuning techniques [4], LVLMs showcase impressive capabilities in visual scene comprehension and multimodal instruction following.

In order to comprehensively evaluate the capabilities of LVLMs, many benchmarks have been proposed [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], which we categorize into three types [25]. The first type is the classical benchmarks, such as COCO Caption [30] and VQA [19, 31, 32]. Although these benchmarks provide high-quality evaluation data, they also have notable limitations. On the one hand, they are inadequate for measuring the fine-grained capabilities of current LVLMs, offering limited insightful feedback for future improvement. On the other hand, since these classical benchmarks have been available as open-source test data for a long time, it is hard to prevent the data leakage problem. The second type of benchmarks evaluates LVLMs in a subjective manner [28, 33]. Although these benchmarks reveal insightful drawbacks of current models, their data scale is limited (i.e., less than 200 annotations) and they require manual evaluation by experts. The third type is built for objectively evaluating current LVLMs, and the comparison between them is shown in Tab. 1. They provide an objective and automatic evaluation manner, giving fine-grained evaluations of LVLMs. However, these benchmarks construct vision-language QAs by selecting images from existing datasets. Although they claim that the questions are re-annotated, previous work [29] has demonstrated that these benchmarks have unintentionally leaked into the training data of LLMs and LVLMs. Besides, most benchmarks focus on evaluating LVLMs on realistic images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. While some works like MMCBench [34] and Typographic Dataset [35] have investigated the robustness of LVLMs with corrupted and print-attacked images, respectively, they have not explored the effect of these noisy images on various perceptual tasks.

Figure 1: Overview of the automatic pipeline for generating vision-language QAs, cleaning vision-language QAs and evaluating LVLMs. (a) We first construct prompts in terms of content, style and background, leveraging a Text-to-Image (T2I) diffusion model (e.g., SDXL [36]) to synthesize the images to be asked about. Then, based on the scenario and the question type, we post-process the synthesized images and generate the corresponding textual questions, respectively. (b) We further filter out low-quality vision-language QAs by utilizing trained models to form the final Dysca. (c) Finally, we evaluate LVLMs on Dysca and report the fine-grained evaluation results.

In this paper, aiming to address the challenges above, we propose Dysca, a dynamic and scalable benchmark for evaluating the perception ability of LVLMs via various subtasks and scenarios. Inspired by prior evaluation works for LLMs [37], we investigate whether we can leverage large-scale synthesized images for evaluating LVLMs. We display the overview of our pipeline in Fig. 1. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and corresponding answers. We decouple the prompt into 4 parts, i.e., attribute, foreground, style and background, and design pre-defined templates to dynamically generate prompts, as displayed in Fig. 3. Then we utilize state-of-the-art text-to-image diffusion models (e.g., SDXL [36]) to generate the corresponding images. Since we already know the main information of the images through the prompts, we easily generate question-answer textual pairs via a rule-based method. After that, in order to obtain high-quality vision-language QAs, we employ CLIP [16] to perform data cleaning on the generated vision-language QA pairs. Dysca focuses on assessing fine-grained perception abilities, including recognizing humans, animals, objects, landmarks, etc. Dysca evaluates LVLMs with 20 perceptual subtasks, containing a total of 51 different artistic styles. Besides, to evaluate the robustness of the models across different scenarios and question types, we construct 4 testing scenarios (clean, corruption, print attacking and adversarial attacking) and 3 question types (multi-choices, true-or-false and free-form questions). In the end, Dysca consists of 617K vision-language QA pairs (20× larger than MM-BigBench [38] and 25× larger than SEED-Bench2 [24], as shown in Tab. 1). Thanks to the generative paradigm, Dysca is easily scalable to new subtasks and scenarios and can dynamically generate unlimited vision-language QAs for evaluation.
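To make the decoupled prompt construction concrete, the following is a minimal sketch of the template-filling step; the metadata entries, the template string and the function names are illustrative placeholders, not the released Dysca lists or code.

```python
import random

# Hypothetical metadata pools; Dysca draws these from curated lists (see Appendix C).
METADATA = {
    "attribute":  ["brown", "red", "white"],
    "foreground": ["golden retriever", "acoustic guitar", "lighthouse"],
    "background": ["on a beach", "in a forest", "on a city street"],
    "style":      ["oil painting", "iPhone photo", "pixel art"],
}

# A pre-defined template combining the four decoupled parts into one prompt.
PROMPT_TEMPLATE = "A {attribute} {foreground} {background}, {style} style"

def sample_prompt():
    """Sample one value per dimension and fill the template."""
    meta = {key: random.choice(values) for key, values in METADATA.items()}
    return meta, PROMPT_TEMPLATE.format(**meta)

meta, prompt = sample_prompt()
print(prompt)  # text-to-image prompt fed to the diffusion model
print(meta)    # kept as the ground truth for rule-based question generation
```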

In summary, our work makes the following key contributions:

  • Dynamic and Scalable Benchmark: We propose Dysca, a benchmark that dynamically generates the test data that users need and easily scales up to new subtasks and scenarios.

  • Multi-grained Perceptual Subtasks and Multi-scenarios: Dysca evaluates 8 mainstream LVLMs with 10 checkpoints on 20 perceptual subtasks under 4 image scenarios (i.e., clean, corruption, print attacking and adversarial attacking) and 3 question types (i.e., multi-choices, true-or-false and free-form questions).

  • Analysis and Observations: We demonstrate for the first time that evaluating LVLMs using large-scale synthetic data is valid. Experiments show a strong correlation between our evaluation rankings and the rankings obtained from non-synthetic benchmarks. The evaluation results also reveal the weaknesses of current LVLMs when facing different question types, image styles and image scenarios.

Table 1: Comparisons between existing LVLM benchmarks. "✓" indicates that the benchmark includes both newly collected images/annotations and images/annotations gathered from existing datasets. "*": the scale of our released benchmark is 617K; however, Dysca is able to generate unlimited data for testing.
Benchmark | #Evaluation Data Scale | #Perceptual Tasks | Question Type
LLaVA-Bench [4] | 0.15K | - | Free-form
MME [25] | 2.3K | 10 | True-or-false
LVLM-eHub [21] | - | 3 | Free-form
tiny-LVLM-eHub [22] | 2.1K | 3 | Free-form
SEED-Bench [23] | 19K | 8 | Multi-choices
MMBench [39] | 2.9K | 12 | Multi-choices
TouchStone [26] | 0.9K | 10 | Free-form
REFORM-EVAL [40] | 50K | 7 | Multi-choices
MM-BigBench [38] | 30K | 6 | Multi-choices
MM-VET [27] | 0.2K | 4 | Free-form
MLLM-Bench [41] | 0.42K | 7 | Free-form
SEED-Bench2 [24] | 24K | 10 | Multi-choices
BenchLMM [42] | 2.4K | 15 | Free-form
JourneyDB [43] | 5.4K | 2 | Free-form, Multi-choices
Dysca (Ours) | 617K* | 20 | Free-form, Multi-choices, True-or-false

2 Related Works

2.1 Large Vision-Language Models

The landscape of Large Vision-Language Models (LVLMs) has been significantly shaped by the pioneering success of Large Language Models (LLMs) such as GPTs [44, 45, 46] and LLaMA [13], catalyzing advancements in multimodal content understanding and generation [47], including intricate tasks like image-text comprehension. At the forefront of these developments, BLIP-2 [1] introduces a lightweight Q-Former [1] that facilitates alignment between textual and visual representations through a cross-attention mechanism [1]. InstructBLIP [3] takes a step further by incorporating textual instructions into the Q-Former, which significantly improves performance. LLaVA [4] employs GPT-4 [14] to transform data into multimodal instruction-following data and uses CLIP [16] and LLaMA [13] for instruction fine-tuning, achieving advanced performance. LLaVA-1.5 [48] extends this paradigm by integrating an MLP projection and introducing academic task-specific vision-language QA data. Recently, models like Otter [5], MiniGPT-4 [2], Qwen-VL-Chat [49] and XComposer-VL [7] further unleash the cross-modal understanding capabilities of LVLMs.

2.2 Benchmarks for LVLMs

The great progress of LVLMs has triggered the development of benchmarks for evaluating these models, which we divide into three categories. The first type is the classical benchmarks, which focus on evaluating LVLM abilities via image captioning [50] and VQA [51, 19]. However, these benchmarks cannot provide fine-grained feedback on how to improve the models. Besides, since these benchmarks have been public resources for a long time, it is hard to guarantee that LVLMs have not used them for training. The second type subjectively evaluates LVLMs by experts [28, 33]. Although these benchmarks provide insightful feedback on LVLMs, their scale is limited (i.e., less than 200 annotations). The subjective manner also makes the evaluation expensive and difficult to scale up.

The third type [4, 25, 21, 22, 23, 24, 39, 26, 40, 38, 27, 41, 42, 29, 52] focuses on evaluating LVLMs in an objective and large-scale manner; we list their detailed information in Tab. 1. Some of them have been adopted by the community [53] as standard benchmarks for evaluating LVLMs [12, 5, 7], such as MME [25] and MMBench [39]. These benchmarks evaluate models through objective answer types, and most of them leverage automatic annotation and evaluation to reveal the fine-grained drawbacks of current LVLMs. However, previous benchmarks primarily concentrate on evaluating LVLMs using realistic images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. Moreover, many of them construct QAs by selecting images from publicly available datasets (e.g., [50, 54]). While they state that the questions have been re-annotated, they cannot guarantee that the LVLMs have not seen the images during the training stage. Previous work [29] has shown that these benchmarks have unintentionally leaked into the training data of LLMs and LVLMs. One possible way to solve data leakage is to use novel synthesized images, and JourneyDB [43] is the first work aiming to leverage synthesized images to evaluate current LVLMs. Its prompts and corresponding images are downloaded from Midjourney [55], and ChatGPT [12] is leveraged to label the images. However, JourneyDB is a top-down framework where the number of images is fixed. Besides, ChatGPT labeling may produce hallucinated annotations, leading to unreliable evaluation results. Although 40 annotators were involved in cleaning the data, the cleaning cost is high and it limits the data scale. In contrast, our Dysca serves as a bottom-up framework, allowing dynamic and scalable generation of both images and evaluation questions. The rule-based question generation method also makes the annotations more accurate. Besides, Dysca contains 20 subtasks, which is more comprehensive than JourneyDB.

3 Dysca

3.1 Overview of Our Pipeline

The overview of our pipeline is shown in Fig. 1, containing data generation, data cleaning and LVLM evaluation. For the data generation, our Dysca benchmark consists of four dimensions, i.e., (M, P, I, Q), where M denotes "Metadata", P denotes "Prompt", I denotes "Image" and Q denotes "Question-answer pair". We further decouple the metadata M into 4 parts, i.e., "style", "attribute", "foreground" and "background", and the combination of the four parts constitutes the image prompt P. Then, given the prompt P and the selected scenario, we leverage a Text-to-Image (T2I) diffusion model (e.g., SDXL [36]) to synthesize the image I and add the specific perturbation to the image I. After that, since the prompt already includes the question angle and the corresponding answer, we construct a rule-based approach to generate Q. Three types of questions are considered, i.e., multi-choice, true-or-false and free-form. Multi-choice and true-or-false questions assess LVLMs in a closed-ended manner, while free-form questions employ an open-ended manner through image captioning. For the data cleaning, considering that the T2I diffusion model may generate unsuccessful outcomes, we use CLIP [16] and PP-OCRv3 [56] to automatically clean the whole dataset and obtain the final Dysca. Finally, we evaluate 8 open-source LVLMs with 10 checkpoints on our proposed Dysca.
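For the image-generation step, a minimal sketch using the public SDXL checkpoint through the diffusers library is shown below; the prompt, output path and sampler defaults are illustrative and not the exact Dysca configuration.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint (assumes a CUDA GPU with sufficient memory).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# One prompt P assembled from the metadata M (illustrative example).
prompt = "A brown golden retriever on a beach, oil painting style"

# Dysca images are 1024x1024 (Tab. 2), SDXL's native resolution.
image = pipe(prompt, height=1024, width=1024).images[0]
image.save("dysca_sample.png")
```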

Statistic #Number
Total questions 617K
- Clean 156K (25.2%)
- Print attacking 149K (24.1%)
- Adversarial attacking 156K (25.2%)
- Corruption 156K (25.2%)
Question type
- Multi-choices 251K (40.6%)
- True-or-false 250K (40.5%)
- Free-form 116K (18.8%)
Image resolution 1024×1024
Unique number of images 289K
Unique number of questions 162K
Unique number of answers 31K
Average question length 37.8
Average answer length 2.7
Average choice number 3.0
Table 2: Key statistics of Dysca.
Figure 2: Overview of the dataset distribution across the 20 perceptual tasks. The number in each subtask indicates the corresponding amount of annotations.

3.2 Perceptual Tasks

Evaluation dimensions. Perception is one of the most fundamental capabilities of LVLMs, and previous works [25] have shown that the lack of perceptual ability may result in hallucination [57]. In order to comprehensively evaluate LVLMs' perception capability, we design 20 perceptual subtasks, where we show all subtasks and the corresponding amount of annotations in Fig. 2. We investigate two types of perception dimensions, i.e., coarse-grained and fine-grained perception. Coarse-grained perception involves recognizing the style, background and color of images. Fine-grained perception involves recognizing the animal, object, plant, food, age, gender, expression, race, celebrity, action, text, clothes, movie, anime, landmark, profession and TV shows.

Data sources. For each perceptual subtask, we first collect the textual data to construct the metadata M. For TV shows, anime and movies, we select the titles from the IMDb rating list (https://www.imdb.com/) based on the number of user reviews. For the styles, we utilize the style lists collected from the community (https://stable-diffusion-art.com/sdxl-styles/) and remove those that strongly influence the image content, such as "architectural style" and "Pokemon style". Note that the style list does not include style prompts associated with a particular artist's name. For the remaining contents, we select them from the labels of existing datasets (e.g., ImageNet [54]). All the selected textual data above constitute the metadata M. We provide detailed information on the metadata in Appendix C.

3.3 Construction of Questions & Answers

Recall that the data generation for the Dysca benchmark consists of four dimensions, i.e., (M, P, I, Q), denoting the metadata (M), prompt (P), image (I) and question-answer pairs (Q), respectively. The relationships between these parts and the process of constructing Dysca are shown in Fig. 3. The metadata M is the core of the whole Dysca, containing all the information for generating P, I and Q. The metadata M consists of foreground, attribute, background and style, and this information guides the generation of the prompt P through pre-designed templates. Then, we utilize the T2I diffusion model to generate the corresponding image from the prompt P. For generating images with specific text on them for the OCR subtask, we leverage TextDiffuser-2 [58], the state-of-the-art text rendering method. For the rest of the images, we leverage Stable Diffusion XL [36]. Subsequently, based on the selected question type, i.e., multi-choices, true-or-false or free-form, we generate the corresponding vision-language QA pairs in Dysca.
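Because the ground truth of every image is fully determined by its metadata, the question-answer pairs can be produced by simple rules. Below is a minimal sketch for the multi-choice and true-or-false types; the option pool, question wording and answer format are illustrative assumptions rather than the exact Dysca templates.

```python
import random

FOREGROUND_POOL = ["golden retriever", "acoustic guitar", "lighthouse", "bicycle"]

def multi_choice_qa(meta, n_options=4):
    """Ask about one metadata dimension; distractors come from the rest of the pool."""
    answer = meta["foreground"]
    distractors = random.sample(
        [x for x in FOREGROUND_POOL if x != answer], n_options - 1)
    options = distractors + [answer]
    random.shuffle(options)  # shuffle to avoid option-position bias
    letters = "ABCD"
    stem = "What is the main object in the image?\n" + "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return stem, letters[options.index(answer)]

def true_or_false_qa(meta, p_true=0.5):
    """State either the true foreground or a randomly chosen wrong one."""
    if random.random() < p_true:
        claim, answer = meta["foreground"], "Yes"
    else:
        claim = random.choice([x for x in FOREGROUND_POOL if x != meta["foreground"]])
        answer = "No"
    return f"Is the main object in the image a {claim}?", answer

meta = {"foreground": "golden retriever"}
print(multi_choice_qa(meta))
print(true_or_false_qa(meta))
```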

Besides, in order to evaluate model performance under various scenarios, we conduct experiments on 4 scenarios, i.e., clean, corruption, print attacking and adversarial attacking. For print attacking, following [35], we add deceptive text to the image, where the text is a wrong option. Besides, to comprehensively evaluate the performance of LVLMs under the print attacking scenario, we add more typographic factors to the original settings (i.e., different font orientations and font positions). For adversarial attacking, we leverage PGD [59] to generate adversarial images. We use InstructBLIP [3] as the proxy model and regard the others as black-box models. The reason why we choose InstructBLIP is that it has shown superior performance in the clean scenario. Besides, the black-box setting better reflects the robustness of the models when facing real-world adversarial attacks. For corruption, we leverage the image corruption methods collected from [34]. We remove some hard corruptions as they significantly impact the quality of the image, making even humans fail to judge the style and content of the image. Detailed examples are shown in Appendix D.
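For the adversarial attacking scenario, a minimal L∞ PGD sketch against an arbitrary differentiable image encoder is given below; the loss choice (pushing the adversarial embedding away from the clean one), the perturbation budget and the step count are illustrative assumptions, not the exact settings used for attacking InstructBLIP's encoder.

```python
import torch
import torch.nn.functional as F

def pgd_attack(encoder, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """L_inf PGD that pushes an image's embedding away from its clean embedding.

    `encoder` is any differentiable module mapping images in [0, 1] to features
    (e.g., a CLIP-style vision tower used as the white-box proxy model).
    """
    clean_feat = encoder(image).detach()
    adv = image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start

    for _ in range(steps):
        adv.requires_grad_(True)
        # Maximize the feature distance, i.e., minimize cosine similarity.
        loss = -F.cosine_similarity(encoder(adv).flatten(1),
                                    clean_feat.flatten(1)).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)  # project to the eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()
```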

Considering that the text-to-image diffusion model may produce failure cases that affect the quality of the proposed benchmark, we leverage off-the-shelf models, i.e., PP-OCRv3 [56] and CLIP-L-14 [16], to clean the data. PP-OCRv3 [56] is leveraged as a filter to exclude failure cases in which TextDiffuser-2 [58] renders the wrong text on the image. For the other images, we use CLIP-L-14 [16] to filter out images with low text-image consistency. In the end, we filter out nearly 15% of low-quality samples. The final statistics of our released Dysca are shown in Tab. 2. Note that the OCR subtask does not involve the print attacking scenario, as misidentifying adversarial text does not indicate poor OCR robustness of the LVLMs. Therefore, there are 7K fewer questions in the print attacking scenario. Besides, since the free-form question type allows assessing the model's perception abilities across multiple subtasks at the same time, we reduce the number of free-form questions to achieve a balanced data distribution.
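A minimal sketch of the CLIP-based consistency filter is shown below, using the Hugging Face CLIP ViT-L/14 checkpoint; the similarity threshold is an illustrative assumption, and the PP-OCRv3 check for the OCR subtask is omitted.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def keep_sample(image_path: str, prompt: str, threshold: float = 0.25) -> bool:
    """Keep a synthesized image only if it is consistent with its prompt,
    measured by the cosine similarity of CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() > threshold
```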

Figure 3: The process of generating the prompt (P), image (I) and question-answer pairs (Q) from the metadata (M).

3.4 Evaluation Strategy

Instruction Design. We design two types of instructions to improve the instruction-following results of LVLMs. For the multi-choices and true-or-false questions, we follow each question with the description "Please answer the question and provide the correct option letter, e.g., (A), (B), (C), (D), at the end. Do not contain the analysis progress. Your answer is: ". For the free-form questions, recalling that the prompt P contains four parts, i.e., the style, attribute, foreground and background, we instruct the model to caption these four dimensions with "Please describe the image. You can describe it from these aspects: {}", where "{}" includes the specific template we design for each part. We display a sample in Fig. 3, and more examples can be found in Appendix E.
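A minimal sketch of how these two instruction wrappers could be assembled is given below; the list of caption aspects passed to the free-form template paraphrases the per-dimension templates in Appendix E.

```python
MC_TF_SUFFIX = ("Please answer the question and provide the correct option letter, "
                "e.g., (A), (B), (C), (D), at the end. Do not contain the analysis "
                "progress. Your answer is: ")

def build_closed_ended(question: str) -> str:
    """Append the answer-format suffix to a multi-choice / true-or-false question."""
    return f"{question}\n{MC_TF_SUFFIX}"

def build_free_form(aspects=("style", "attribute", "foreground", "background")) -> str:
    """Ask the model to caption the image along the four prompt dimensions."""
    return ("Please describe the image. You can describe it from these aspects: "
            + ", ".join(aspects))
```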

Table 3: Evaluation results on the 20 perceptual subtasks. The top two results on each subtask are bolded and underlined, respectively. "MC" and "TF" indicate the accuracy (%) of "Multi-choices" and "True-or-false", respectively.
Movie Action TV Show Profession Landmark
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 71.46 68.61 96.24 93.04 56.29 61.99 77.88 75.87 98.70 94.58
InstructBLIP [3] Flan-T5-XL 77.35 67.38 97.28 93.61 65.89 60.96 79.17 76.51 98.04 95.75
XComposer-VL [7] InternLM-7B 81.90 78.36 97.03 94.75 77.81 73.29 83.97 75.87 98.04 94.81
LLava-1.5 [48] Vicuna-7B 55.95 51.68 74.78 50.75 56.29 50.68 58.65 56.83 85.00 54.48
LLava-1.5 [48] Vicuna-13B 66.29 54.04 77.93 59.71 59.27 54.11 66.67 59.05 94.35 55.90
MiniGPT-4 [2] Vicuna-7B 32.68 52.35 42.96 51.00 30.46 50.00 33.97 50.16 53.26 52.36
Otter [5] LLaMA-7B 65.36 59.08 65.85 72.36 68.54 57.53 66.99 54.92 58.48 66.51
Qwen-VL-Chat [49] Qwen-7B 71.25 51.01 96.35 63.14 67.22 49.32 75.64 54.60 96.96 54.72
Shikra [6] Vicuna-7B 66.29 59.64 78.00 77.29 60.26 56.85 78.85 68.25 89.35 70.99
Shikra-VQA [6] Vicuna-7B 66.39 61.66 96.17 80.32 59.93 60.27 78.21 67.94 92.83 72.88
Anime Clothes Celebrity Food Plant
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 57.07 61.94 82.38 73.79 81.98 76.48 91.52 88.04 92.50 91.19
InstructBLIP [3] Flan-T5-XL 61.28 64.04 86.51 81.08 83.69 76.14 91.71 88.85 93.31 91.92
XComposer-VL [7] InternLM-7B 75.41 74.58 86.18 85.73 88.53 87.28 92.33 90.21 93.25 93.14
LLava-1.5 [48] Vicuna-7B 47.15 48.46 47.19 50.41 57.36 54.72 54.63 51.10 49.08 48.59
LLava-1.5 [48] Vicuna-13B 57.61 50.14 65.20 57.20 60.14 57.94 78.87 57.45 79.72 55.51
MiniGPT-4 [2] Vicuna-7B 29.21 48.74 31.25 50.47 28.62 49.64 44.66 50.78 45.38 50.53
Otter [5] LLaMA-7B 61.82 57.30 47.13 69.52 42.41 63.52 46.81 79.62 66.41 81.51
Qwen-VL-Chat [49] Qwen-7B 71.60 54.63 78.44 60.72 87.37 50.05 89.37 53.37 92.32 59.91
Shikra [6] Vicuna-7B 47.96 57.72 75.21 59.65 63.76 59.86 83.30 68.37 88.55 66.60
Shikra-VQA [6] Vicuna-7B 49.05 57.58 76.57 63.10 63.73 62.13 89.25 71.22 88.34 71.76
Age Gender Expression Race Animal
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 62.61 59.65 99.37 94.86 89.27 75.92 74.38 71.95 96.64 95.27
InstructBLIP [3] Flan-T5-XL 65.14 59.55 99.51 88.21 91.85 79.34 77.99 74.62 97.21 94.98
XComposer-VL [7] InternLM-7B 67.89 78.35 99.60 98.06 89.64 82.52 80.42 76.69 97.83 96.70
LLava-1.5 [48] Vicuna-7B 38.35 55.15 54.14 49.78 63.86 49.76 43.98 50.56 49.26 50.90
LLava-1.5 [48] Vicuna-13B 49.55 59.59 98.71 83.53 71.29 58.33 70.16 62.84 85.99 58.23
MiniGPT-4 [2] Vicuna-7B 31.75 51.71 56.49 49.22 42.01 50.91 27.54 50.41 45.97 49.57
Otter [5] LLaMA-7B 37.98 51.19 78.23 77.96 73.86 60.36 43.00 57.88 81.32 83.63
Qwen-VL-Chat [49] Qwen-7B 53.48 49.09 97.72 58.55 85.22 64.30 73.50 56.88 95.60 64.37
Shikra [6] Vicuna-7B 65.37 56.71 97.85 73.16 90.47 70.55 74.17 54.88 89.82 69.06
Shikra-VQA [6] Vicuna-7B 65.50 57.34 99.39 81.41 92.36 74.71 73.90 55.24 91.18 74.26
Object OCR Style Background Color
Model Language Model MC TF MC TF MC TF MC TF MC TF
Blip2 [1] Flan-T5-XL 89.32 87.75 71.89 62.07 99.86 88.62 63.97 68.37 88.01 85.55
InstructBLIP [3] Flan-T5-XL 89.87 90.13 72.61 60.93 99.96 81.23 66.50 69.19 90.93 85.55
XComposer-VL [7] InternLM-7B 90.23 91.72 69.22 78.66 95.83 81.32 71.26 75.27 87.97 89.02
LLava-1.5 [48] Vicuna-7B 58.63 49.99 47.08 51.02 46.35 50.49 44.33 50.95 41.91 51.90
LLava-1.5 [48] Vicuna-13B 77.86 57.93 58.19 53.17 85.80 51.56 62.67 55.33 63.81 52.90
MiniGPT-4 [2] Vicuna-7B 51.75 51.55 31.37 51.56 43.41 48.64 31.94 50.10 35.29 50.51
Otter [5] LLaMA-7B 47.31 82.51 66.23 61.41 98.04 61.58 63.06 63.40 48.28 57.97
Qwen-VL-Chat [49] Qwen-7B 88.32 63.14 71.74 51.86 98.21 52.43 69.42 51.59 86.39 53.05
Shikra [6] Vicuna-7B 70.91 69.22 60.79 53.63 95.13 58.95 71.24 62.66 84.91 61.83
Shikra-VQA [6] Vicuna-7B 89.43 76.00 59.63 54.14 99.51 61.48 70.98 65.96 83.79 64.49

Evaluation Metric. For the multi-choices and true-or-false questions, we use accuracy as the evaluation metric. We randomly shuffle the order of choices to prevent the evaluation results from being influenced by the model's tendency towards specific choices [60]. The random accuracies of the two types are 25% and 50%, respectively. We use regular expressions to extract the model's answer choices. For cases where the extraction fails, we calculate the Levenshtein distance between the answer string and each choice string, and select the option with the minimum distance as the model's answer. For the free-form questions, we test the model's image captioning capability, where the ground truth is the prompt of the image. Following [21], we use SentenceTransformer [61] to compute the text similarity between the prompt P and the caption output by the LVLM. The final score of each question type is the average score over subtasks.
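The scoring pipeline can be sketched as follows; here difflib's sequence matcher stands in for the Levenshtein-distance fallback, and the SentenceTransformer checkpoint name is an illustrative choice rather than the one used in the paper.

```python
import re
import difflib
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def extract_choice(response: str, options: dict) -> str:
    """Extract the predicted option letter, falling back to string matching."""
    match = re.search(r"\(([A-D])\)", response)
    if match and match.group(1) in options:
        return match.group(1)
    # Fallback: choose the option whose text is closest to the raw response.
    return max(options, key=lambda k: difflib.SequenceMatcher(
        None, response.lower(), options[k].lower()).ratio())

def free_form_score(caption: str, prompt: str) -> float:
    """Similarity between the model's caption and the image-generation prompt."""
    emb = st_model.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Example usage
options = {"A": "golden retriever", "B": "lighthouse", "C": "bicycle"}
print(extract_choice("The answer is (A).", options))           # -> "A"
print(extract_choice("It looks like a lighthouse.", options))  # -> "B"
```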

4 Results and Analysis

In this section, we report the evaluation results and provide insightful analysis. A total of 8 LVLMs with 10 checkpoints are evaluated on the Dysca benchmark, including BLIP2 [1], InstructBLIP [3], LLaVA [48], MiniGPT-4 [2], Otter [5], XComposer-VL [7], Qwen-VL-Chat [49] and Shikra [6]. Each model is evaluated on all 20 perception subtasks under 4 scenarios. The detailed rankings for each subtask can be found in Appendix A.

4.1 Main Results

Clean Scenario. The evaluation results of various LVLMs on different perceptual subtasks under the clean scenario are presented in Tab. 3. Since the evaluation for the free-form question type usually involves multiple subtasks, we cannot calculate the free-form results for each subtask individually. Instead, we display the overall free-form score in the first row of Tab. 4. As can be seen, XComposer-VL [7] outperforms other LVLMs, achieving top-1 or top-2 results in most subtasks, but InstructBLIP [3], Qwen-VL-Chat [49] and BLIP2 also take the lead in a few subtasks.

Noisy Scenarios. The evaluation results of various LVLMs under noisy scenarios (i.e., corruption, print attacking and adversarial attacking) are presented in Tab. 4. As can be seen, for the multi-choices and true-or-false question types, XComposer-VL [7] still takes the lead on all 4 scenarios. For the free-form questions, LLaVA-1.5-7b [48] achieves the best results.

Table 4: Evaluation results on 4 scenarios. "MC", "TF" and "FF" indicate "Multi-choices", "True-or-false" and "Free-form", respectively. "PrintAtt" and "AdverAtt" mean "Print Attacking" and "Adversarial Attacking", respectively. "*": the model is under the white-box setting.
Scenarios Blip2 [1] InstructBLIP [3] XComposer-VL [7] LLava-1.5-13b [48] LLava-1.5-7b [48]
MC TF FF MC TF FF MC TF FF MC TF FF MC TF FF
Clean 82.07 78.78 37.56 84.29 79.00 38.94 86.22 84.82 46.03 71.50 57.72 48.61 53.70 51.41 50.15
Corruption 81.17 77.98 37.32 83.77 78.57 38.72 85.20 84.16 45.52 70.34 57.57 48.28 53.21 51.40 49.54
PrintAtt 64.78 68.01 35.82 59.18 58.80 35.71 74.08 72.71 44.46 48.91 54.72 48.47 40.02 50.94 48.38
AdverAtt 23.28 48.98 25.70 21.49* 48.74* 29.15* 20.85 49.54 21.17 66.86 56.56 47.11 50.70 51.36 48.17
Scenarios MiniGPT-4 [2] Otter [5] Qwen-VL-Chat [49] Shikra [6] Shikra-VQA [6]
MC TF FF MC TF FF MC TF FF MC TF FF MC TF FF
Clean 38.50 50.51 38.90 61.36 65.99 39.12 82.31 55.84 49.90 76.61 63.79 47.17 79.31 66.69 29.93
Corruption 38.70 49.99 40.35 62.19 65.36 38.71 79.26 52.74 49.74 76.42 64.31 46.43 79.06 66.95 29.40
PrintAtt 36.16 50.00 42.45 38.11 36.27 36.93 59.68 46.40 48.65 60.02 40.18 47.03 62.20 41.05 27.91
AdverAtt 26.75 49.52 22.59 57.78 62.27 35.76 78.47 58.04 44.67 72.20 62.16 43.56 74.04 64.64 26.81

4.2 Analysis

4.2.1 Key Observations

(1) For individual models, perceptual performance varies across different subtasks. For example, Qwen-VL-Chat [49] achieves an accuracy of 96.96% on the landmark recognition task for multi-choices questions (2% below the first-place score), but only 53.48% on the age recognition task (12% below the first-place score). The results suggest that Qwen-VL-Chat [49] may require more fine-tuning on age perception data. Analyzing the performance of models across various subtasks helps guide targeted improvements.

Figure 4: Models exhibit different performance when facing the same image but different question types.

(2) Models exhibit performance inconsistency when facing multi-choices and true-or-false question types. As can be seen, in the object recognition subtask, Otter [5] achieves an accuracy of 47.31% on the multi-choices question type (22% higher than random guessing), but obtains an accuracy of 82.51% on the true-or-false question type (32% higher than random guessing). Interestingly, we observe the opposite results for Qwen-VL-Chat [49]. In the object recognition subtask, it achieves an accuracy of 88.32% on multi-choices but only 63.14% on true-or-false. We observe the same problem in other models as well and display two examples in Fig. 4. We speculate that the inconsistency may be attributed to a bias in the training dataset towards particular question types, such as using more multi-choices or true-or-false questions.

(3) Each model exhibits robustness in the corruption scenario, but suffers degradation in the two attacking scenarios. As shown, all models exhibit minor score variations of less than 1% under the corruption scenario. However, they exhibit degradation when facing print attacking (e.g., 84.29% vs. 59.18% for InstructBLIP [3] in multi-choices accuracy). XComposer-VL [7] shows the strongest robustness, maintaining over 70% accuracy for both multi-choices and true-or-false. Besides, since our adversarial algorithm specifically targets the image encoder, the LVLMs that share the same encoder architecture (i.e., BLIP2, InstructBLIP and XComposer-VL, all using EVA-CLIP [62] as the image encoder) exhibit significant performance degradation, with accuracy even falling below random selection. Models utilizing alternative image encoders also experience a performance decrease of approximately 5% to 10%. More detailed results can be found in Appendix D.

4.2.2 The Validity of Dysca

Table 5: The correlation results on three benchmarks, where ρ ∈ [-1, 1] and τ ∈ [-1, 1].
Style | Method | MMBench | OCRBench | SeedBench-2
All | ρ | 0.70 | 0.90 | 0.46
All | τ | 0.60 | 0.80 | 0.43
Realistic | ρ | 0.70 | 1.00 | 0.64
Realistic | τ | 0.60 | 1.00 | 0.62

In this section, we investigate the evaluation gap between Dysca and non-synthetic benchmarks. We calculate the Spearman rank correlation coefficient [63] ρ and the Kendall rank correlation coefficient [64] τ between the evaluation ranking of Dysca under the clean scenario and the evaluation rankings of non-synthetic benchmarks, i.e., MMBench [39], OCRBench [52] and SeedBench-2 [24]. Both coefficients produce a score in the range [-1, 1], where 1 represents a perfect positive correlation, -1 represents a perfect negative correlation, and 0 represents no correlation. Specifically, we intersect our Dysca with current benchmarks based on the perceptual subtasks, evaluated models and question types. We then calculate the correlation of model evaluation rankings within this intersection. The results are shown in the first row of Tab. 5. For MMBench [39] and OCRBench [52], both metrics show high correlation, with ρ and τ higher than 0.6. However, the correlation with SeedBench-2 [24] is not as strong. Considering that SeedBench-2 only contains realistic images, we conduct additional experiments using the evaluation ranks on our realistic-style images only. As shown in the second row of Tab. 5, the correlation results for SeedBench-2 improve significantly (i.e., 0.46 vs. 0.64 for ρ and 0.43 vs. 0.62 for τ). The correlation with OCRBench also improves to 1, demonstrating the validity of using synthetic datasets for evaluating LVLMs.
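The two rank-correlation coefficients can be computed directly with SciPy; the rankings below are made-up placeholders for illustration, not the paper's actual model rankings.

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical rankings of the same models on Dysca and on a non-synthetic
# benchmark (1 = best); replace with the real per-benchmark rankings.
dysca_rank = [1, 2, 3, 4, 5, 6, 7, 8]
other_rank = [1, 3, 2, 4, 6, 5, 7, 8]

rho, _ = spearmanr(dysca_rank, other_rank)
tau, _ = kendalltau(dysca_rank, other_rank)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```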

To further explore the impact of image styles on the evaluation results, we present the average scores across all subtasks for each of the 51 styles in Fig. 5. We observe slight score differences across styles. In the case of realistic styles such as "iPhone photo", all LVLMs perform better compared to other image styles. The LVLMs also exhibit better performance on unrealistic but common styles like "expressionist". However, for unrealistic and less common styles such as "gothic", all models show relatively poor performance. The results reveal that the gap between Dysca and non-synthetic benchmarks primarily stems from the more diverse range of image styles, making Dysca a more comprehensive benchmark for assessing perception ability compared to previous benchmarks.

Figure 5: Illustration of each model’s performance across 51 image styles, where the darker colors represent higher scores. The representative styles are colored with non-black font. Realistic styles are shown in red font, unrealistic but common styles are displayed in yellow font, and unrealistic and less common styles are represented in blue font.

5 Conclusion

In this paper, we propose Dysca, a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs). Dysca consists of 617K vision-language QA pairs, covering 20 perceptual subtasks, 4 image scenarios and 3 question types. We conduct experiments on 8 advanced open-source LVLMs with 10 checkpoints, revealing insightful weaknesses of current LVLMs when facing different question types, image styles and image scenarios. Experiments demonstrate the validity of evaluating LVLMs using synthesized images.

References

  • [1] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
  • [2] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • [3] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint:2305.06500, 2023.
  • [4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [5] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint:2305.03726, 2023.
  • [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  • [7] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
  • [8] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
  • [9] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  • [10] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. 2023.
  • [11] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • [12] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
  • [13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [14] OpenAI. Gpt-4 technical report, 2023.
  • [15] FastChat. Vicuna. https://github.com/lm-sys/FastChat, 2023.
  • [16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning (ICML), pages 8748–8763. PMLR, 2021.
  • [17] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, June 2023.
  • [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [19] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  • [20] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8309–8318, 2019.
  • [21] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  • [22] Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, and Ping Luo. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729, 2023.
  • [23] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  • [24] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023.
  • [25] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • [26] Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890, 2023.
  • [27] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • [28] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023.
  • [29] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
  • [30] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [31] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
  • [32] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3190–3199, 2019.
  • [33] Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v(ision). arXiv preprint arXiv:2310.16534, 2023.
  • [34] Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, and Min Lin. Benchmarking large multimodal models against common corruptions. arXiv preprint arXiv:2401.11943, 2024.
  • [35] Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language model. arXiv preprint arXiv:2402.19150, 2024.
  • [36] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [37] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2023.
  • [38] Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, and Soujanya Poria. Mm-bigbench: Evaluating multimodal models on multimodal content comprehension tasks. arXiv preprint arXiv:2310.09036, 2023.
  • [39] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  • [40] Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen, Xuanjing Huang, and Zhongyu Wei. Reform-eval: Evaluating large vision language models via unified re-formulation of task-oriented benchmarks. arXiv preprint arXiv:2310.02569, 2023.
  • [41] Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, and Benyou Wang. Mllm-bench, evaluating multi-modal llms using gpt-4v. arXiv preprint arXiv:2311.13951, 2023.
  • [42] Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, and Alex Kot. Benchlmm: Benchmarking cross-style visual capability of large multimodal models. arXiv preprint arXiv:2312.02896, 2023.
  • [43] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 49659–49678. Curran Associates, Inc., 2023.
  • [44] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019.
  • [45] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeurIPS), 33:1877–1901, 2020.
  • [46] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems (NeurIPS), 35:27730–27744, 2022.
  • [47] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024.
  • [48] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [49] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
  • [50] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and Larry Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), September 2014.
  • [51] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  • [52] Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2024.
  • [53] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. In 2014 IEEE International Conference on Computer Vision (ICCV), 2014.
  • [55] Midjourney. https://discord.com/invite/midjourney.
  • [56] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. Pp-ocrv3: More attempts for the improvement of ultra lightweight ocr system, 2022.
  • [57] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • [58] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. arXiv preprint arXiv:2311.16465, 2023.
  • [59] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [60] Yongshuo Zong, Tingyang Yu, Bingchen Zhao, Ruchika Chavhan, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
  • [61] Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 296–310. Association for Computational Linguistics, June 2021.
  • [62] Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2022.
  • [63] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.
  • [64] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 06 1938.
  • [65] Meta. llama3. https://github.com/meta-llama/llama3, 2024.
  • [66] Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
  • [67] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2021.

Appendix

Appendix A The Leaderboards

Rank Model Score
  1 XComposer-VL 80.13
2 InstructBlip 72.36
3 Blip2 70.03
4 Shikra-VQA 64.03
5 Shikra 62.97
6 Otter 62.22
7 Qwen-VL-Chat 61.13
8 LLAVA-1.5-13B 60.17
9 LLAVA-1.5-7B 53.81
10 MiniGPT-4 42.52
(a) Clean-Movie
Rank Model Score
  1 XComposer-VL 95.89
2 InstructBlip 95.44
3 Blip2 94.64
4 Shikra-VQA 88.25
5 Qwen-VL-Chat 79.75
6 Shikra 77.65
7 Otter 69.10
8 LLAVA-1.5-13B 68.82
9 LLAVA-1.5-7B 62.77
10 MiniGPT-4 46.98
(b) Clean-Action
Rank Model Score
  1 XComposer-VL 75.55
2 InstructBlip 63.42
3 Otter 63.04
4 Shikra-VQA 60.10
5 Blip2 59.14
6 Shikra 58.55
7 Qwen-VL-Chat 58.27
8 LLAVA-1.5-13B 56.69
9 LLAVA-1.5-7B 53.48
10 MiniGPT-4 40.23
(c) Clean-TV Show
Rank Model Score
  1 XComposer-VL 79.92
2 InstructBlip 77.84
3 Blip2 76.88
4 Shikra 73.55
5 Shikra-VQA 73.07
6 Qwen-VL-Chat 65.12
7 LLAVA-1.5-13B 62.86
8 Otter 60.95
9 LLAVA-1.5-7B 57.74
10 MiniGPT-4 42.06
(d) Clean-Profession
Rank Model Score
  1 InstructBlip 96.90
2 Blip2 96.64
3 XComposer-VL 96.43
4 Shikra-VQA 82.85
5 Shikra 80.17
6 Qwen-VL-Chat 75.84
7 LLAVA-1.5-13B 75.12
8 LLAVA-1.5-7B 69.74
9 Otter 62.50
10 MiniGPT-4 52.81
(e) Clean-Landmark
Rank Model Score
  1 XComposer-VL 75.00
2 Qwen-VL-Chat 63.11
3 InstructBlip 62.66
4 Otter 59.56
5 Blip2 59.50
6 LLAVA-1.5-13B 53.88
7 Shikra-VQA 53.31
8 Shikra 52.84
9 LLAVA-1.5-7B 47.80
10 MiniGPT-4 38.98
(f) Clean-Anime
Rank Model Score
  1 XComposer-VL 85.96
2 InstructBlip 83.80
3 Blip2 78.09
4 Shikra-VQA 69.83
5 Qwen-VL-Chat 69.58
6 Shikra 67.43
7 LLAVA-1.5-13B 61.20
8 Otter 58.33
9 LLAVA-1.5-7B 48.80
10 MiniGPT-4 40.86
(g) Clean-Clothes
Rank Model Score
  1 XComposer-VL 87.91
2 InstructBlip 79.91
3 Blip2 79.23
4 Qwen-VL-Chat 68.71
5 Shikra-VQA 62.93
6 Shikra 61.81
7 LLAVA-1.5-13B 59.04
8 LLAVA-1.5-7B 56.04
9 Otter 52.97
10 MiniGPT-4 39.13
(h) Clean-Celebrity
Rank Model Score
  1 XComposer-VL 91.27
2 InstructBlip 90.28
3 Blip2 89.78
4 Shikra-VQA 80.23
5 Shikra 75.84
6 Qwen-VL-Chat 71.37
7 LLAVA-1.5-13B 68.16
8 Otter 63.22
9 LLAVA-1.5-7B 52.87
10 MiniGPT-4 47.72
(i) Clean-Food
Rank Model Score
  1 XComposer-VL 93.19
2 InstructBlip 92.62
3 Blip2 91.84
4 Shikra-VQA 80.05
5 Shikra 77.57
6 Qwen-VL-Chat 76.11
7 Otter 73.96
8 LLAVA-1.5-13B 67.61
9 LLAVA-1.5-7B 48.84
10 MiniGPT-4 47.95
(j) Clean-Plant
Rank Model Score
  1 XComposer-VL 73.12
2 InstructBlip 62.34
3 Shikra-VQA 61.42
4 Blip2 61.13
5 Shikra 61.04
6 LLAVA-1.5-13B 54.57
7 Qwen-VL-Chat 51.28
8 LLAVA-1.5-7B 46.75
9 Otter 44.58
10 MiniGPT-4 41.73
(k) Clean-Age
Rank Model Score
  1 XComposer-VL 98.83
2 Blip2 97.12
3 InstructBlip 93.86
4 LLAVA-1.5-13B 91.12
5 Shikra-VQA 90.40
6 Shikra 85.50
7 Qwen-VL-Chat 78.13
8 Otter 78.09
9 MiniGPT-4 52.86
10 LLAVA-1.5-7B 51.96
(l) Clean-Gender
Rank Model Score
  1 XComposer-VL 86.08
2 InstructBlip 85.59
3 Shikra-VQA 83.53
4 Blip2 82.59
5 Shikra 80.51
6 Qwen-VL-Chat 74.76
7 Otter 67.11
8 LLAVA-1.5-13B 64.81
9 LLAVA-1.5-7B 56.81
10 MiniGPT-4 46.46
(m) Clean-Expression
Rank Model Score
  1 XComposer-VL 78.56
2 InstructBlip 76.31
3 Blip2 73.16
4 LLAVA-1.5-13B 66.50
5 Qwen-VL-Chat 65.19
6 Shikra-VQA 64.56
7 Shikra 64.53
8 Otter 50.44
9 LLAVA-1.5-7B 47.27
10 MiniGPT-4 38.98
(n) Clean-Race
Rank Model Score
  1 XComposer-VL 97.27
2 InstructBlip 96.09
3 Blip2 95.95
4 Shikra-VQA 82.72
5 Otter 82.47
6 Qwen-VL-Chat 79.98
7 Shikra 79.44
8 LLAVA-1.5-13B 72.11
9 LLAVA-1.5-7B 50.08
10 MiniGPT-4 47.77
(o) Clean-Animal
Rank Model Score
  1 XComposer-VL 90.97
2 InstructBlip 90.00
3 Blip2 88.53
4 Shikra-VQA 82.72
5 Qwen-VL-Chat 75.73
6 Shikra 70.06
7 LLAVA-1.5-13B 67.89
8 Otter 64.91
9 LLAVA-1.5-7B 54.31
10 MiniGPT-4 51.65
(p) Clean-Object
Rank Model Score
  1 Blip2 68.38
2 InstructBlip 67.50
3 XComposer-VL 66.06
4 Shikra-VQA 64.86
5 Otter 64.37
6 Qwen-VL-Chat 63.28
7 Shikra 63.02
8 LLAVA-1.5-13B 58.84
9 LLAVA-1.5-7B 44.96
10 MiniGPT-4 43.70
(q) Clean-OCR
Rank Model Score
  1 XComposer-VL 73.94
2 Blip2 66.98
3 InstructBlip 66.77
4 Otter 63.82
5 Qwen-VL-Chat 61.80
6 Shikra 57.21
7 Shikra-VQA 56.89
8 LLAVA-1.5-13B 55.68
9 LLAVA-1.5-7B 49.05
10 MiniGPT-4 41.47
(r) Clean-Style
Rank Model Score
  1 XComposer-VL 73.27
2 Shikra-VQA 68.47
3 InstructBlip 67.84
4 Shikra 66.95
5 Blip2 66.17
6 Otter 63.23
7 Qwen-VL-Chat 60.51
8 LLAVA-1.5-13B 59.00
9 LLAVA-1.5-7B 47.64
10 MiniGPT-4 41.02
(s) Clean-Background
Rank Model Score
  1 XComposer-VL 88.50
2 InstructBlip 88.24
3 Blip2 86.78
4 Shikra-VQA 74.14
5 Shikra 73.37
6 Qwen-VL-Chat 69.72
7 LLAVA-1.5-13B 58.36
8 Otter 53.12
9 LLAVA-1.5-7B 46.91
10 MiniGPT-4 42.90
(t) Clean-Color
Table 6: Leaderboards of all perceptual tasks under the clean scenario.
Rank Model Score
  1 XComposer-VL 78.20
2 InstructBlip 71.30
3 Blip2 68.69
4 Shikra-VQA 64.19
5 Shikra 63.47
6 Otter 61.39
7 Qwen-VL 59.27
8 LLAVA-1.5-13B 58.67
9 LLAVA-1.5-7B 52.36
10 MiniGPT-4 43.02
(a) Corru.-Movie
Rank Model Score
  1 XComposer-VL 95.86
2 InstructBlip 95.09
3 Blip2 93.87
4 Shikra-VQA 88.72
5 Shikra 77.77
6 Qwen-VL 75.09
7 Otter 70.09
8 LLAVA-1.5-13B 68.34
9 LLAVA-1.5-7B 63.00
10 MiniGPT-4 47.22
(b) Corru.-Action
Rank Model Score
  1 XComposer-VL 72.86
2 InstructBlip 62.42
3 Otter 60.71
4 Shikra 60.09
5 Shikra-VQA 59.58
6 Blip2 58.16
7 Qwen-VL 57.95
8 LLAVA-1.5-13B 55.18
9 LLAVA-1.5-7B 52.34
10 MiniGPT-4 38.70
(c) Corru.-TV Show
Rank Model Score
  1 XComposer-VL 78.48
2 InstructBlip 77.84
3 Blip2 75.77
4 Shikra 73.72
5 Shikra-VQA 73.38
6 Qwen-VL 64.97
7 Otter 62.71
8 LLAVA-1.5-13B 62.23
9 LLAVA-1.5-7B 56.95
10 MiniGPT-4 43.22
(d) Corru.-Profession
Rank Model Score
  1 InstructBlip 96.33
2 Blip2 96.28
3 XComposer-VL 95.97
4 Shikra-VQA 83.01
5 Shikra 82.38
6 LLAVA-1.5-13B 74.63
7 Qwen-VL 73.55
8 LLAVA-1.5-7B 68.78
9 Otter 61.65
10 MiniGPT-4 49.55
(e) Corru.-Landmark
Rank Model Score
  1 XComposer-VL 72.28
2 InstructBlip 62.79
3 Qwen-VL 60.44
4 Blip2 58.67
5 Otter 57.24
6 LLAVA-1.5-13B 53.67
7 Shikra-VQA 52.28
8 Shikra 52.00
9 LLAVA-1.5-7B 47.87
10 MiniGPT-4 40.91
(f) Corru.-Anime
Rank Model Score
  1 XComposer-VL 84.87
2 InstructBlip 82.84
3 Blip2 76.69
4 Shikra-VQA 70.59
5 Shikra 67.05
6 Qwen-VL 64.26
7 LLAVA-1.5-13B 60.88
8 Otter 59.87
9 LLAVA-1.5-7B 48.12
10 MiniGPT-4 41.38
(g) Corru.-Clothes
Rank Model Score
  1 XComposer-VL 87.25
2 InstructBlip 79.23
3 Blip2 78.36
4 Qwen-VL 64.89
5 Shikra-VQA 62.55
6 Shikra 61.68
7 LLAVA-1.5-13B 58.64
8 LLAVA-1.5-7B 56.36
9 Otter 53.59
10 MiniGPT-4 38.80
(h) Corru.-Celebrity
Rank Model Score
  1 XComposer-VL 90.61
2 InstructBlip 90.08
3 Blip2 89.19
4 Shikra-VQA 80.55
5 Shikra 75.63
6 Qwen-VL 70.71
7 LLAVA-1.5-13B 67.46
8 Otter 65.06
9 LLAVA-1.5-7B 52.94
10 MiniGPT-4 47.76
(i) Corru.-Food
Rank Model Score
  1 XComposer-VL 92.45
2 InstructBlip 92.23
3 Blip2 91.15
4 Shikra-VQA 80.80
5 Shikra 77.69
6 Otter 73.42
7 Qwen-VL 72.01
8 LLAVA-1.5-13B 67.22
9 LLAVA-1.5-7B 48.84
10 MiniGPT-4 48.04
(j) Corru.-Plant
Rank Model Score
  1 XComposer-VL 73.04
2 InstructBlip 61.59
3 Shikra-VQA 61.09
4 Shikra 60.40
5 Blip2 60.03
6 Qwen-VL 55.26
7 LLAVA-1.5-13B 53.67
8 LLAVA-1.5-7B 46.76
9 Otter 45.09
10 MiniGPT-4 40.84
(k) Corru.-Age
Rank Model Score
  1 XComposer-VL 98.53
2 Blip2 97.76
3 InstructBlip 93.75
4 LLAVA-1.5-13B 91.09
5 Shikra-VQA 90.17
6 Shikra 85.91
7 Otter 80.22
8 Qwen-VL 67.00
9 MiniGPT-4 52.75
10 LLAVA-1.5-7B 52.27
(l) Corru.-Gender
Rank Model Score
  1 XComposer-VL 86.24
2 InstructBlip 84.95
3 Shikra-VQA 83.56
4 Blip2 81.56
5 Shikra 80.56
6 Qwen-VL 70.55
7 Otter 66.64
8 LLAVA-1.5-13B 64.94
9 LLAVA-1.5-7B 56.16
10 MiniGPT-4 46.55
(m) Corru.-Expression
Rank Model Score
  1 XComposer-VL 78.80
2 InstructBlip 76.19
3 Blip2 71.72
4 LLAVA-1.5-13B 65.33
5 Shikra-VQA 64.73
6 Shikra 63.94
7 Qwen-VL 63.55
8 Otter 49.73
9 LLAVA-1.5-7B 47.92
10 MiniGPT-4 39.65
(n) Corru.-Race
Rank Model Score
  1 XComposer-VL 97.06
2 InstructBlip 95.86
3 Blip2 95.69
4 Shikra-VQA 83.03
5 Otter 82.84
6 Shikra 79.44
7 LLAVA-1.5-13B 71.47
8 Qwen-VL 71.43
9 LLAVA-1.5-7B 50.67
10 MiniGPT-4 48.66
(o) Corru.-Animal
Rank Model Score
  1 XComposer-VL 90.81
2 InstructBlip 89.76
3 Blip2 88.05
4 Shikra-VQA 82.76
5 Shikra 71.49
6 Qwen-VL 67.78
7 LLAVA-1.5-13B 67.23
8 Otter 66.18
9 LLAVA-1.5-7B 54.09
10 MiniGPT-4 51.02
(p) Corru.-Object
Rank Model Score
  1 Blip2 94.06
2 InstructBlip 90.54
3 XComposer-VL 87.19
4 Shikra-VQA 80.80
5 Otter 78.70
6 Shikra 77.17
7 Qwen-VL 69.66
8 LLAVA-1.5-13B 68.58
9 LLAVA-1.5-7B 49.10
10 MiniGPT-4 44.63
(q) Corru.-OCR
Rank Model Score
  1 XComposer-VL 72.34
2 InstructBlip 65.06
3 Blip2 64.86
4 Otter 61.14
5 Qwen-VL 60.34
6 Shikra 56.61
7 Shikra-VQA 55.81
8 LLAVA-1.5-13B 53.65
9 LLAVA-1.5-7B 47.67
10 MiniGPT-4 40.81
(r) Corru.-Style
Rank Model Score
  1 XComposer-VL 72.53
2 Shikra-VQA 68.25
3 InstructBlip 67.48
4 Shikra 67.19
5 Blip2 64.69
6 Otter 62.99
7 Qwen-VL 60.50
8 LLAVA-1.5-13B 58.50
9 LLAVA-1.5-7B 47.12
10 MiniGPT-4 40.53
(s) Corru.-Background
Rank Model Score
  1 XComposer-VL 88.24
2 InstructBlip 88.09
3 Blip2 86.25
4 Shikra-VQA 74.14
5 Shikra 73.07
6 Qwen-VL 70.78
7 LLAVA-1.5-13B 57.70
8 Otter 56.14
9 LLAVA-1.5-7B 46.76
10 MiniGPT-4 42.86
(t) Corru.-Color
Table 7: Leaderboards of all Perceptual tasks under corruption scenario.
Rank Model Score
  1 XComposer-VL 67.09
2 Blip2 52.55
3 LLAVA-1.5-13B 46.74
4 LLAVA-1.5-7B 45.24
5 Shikra 44.84
6 Shikra-VQA 44.36
7 Qwen-VL 44.30
8 InstructBlip 43.06
9 MiniGPT-4 40.26
10 Otter 37.97
(a) Pr.Att.-Movie
Rank Model Score
  1 XComposer-VL 86.24
2 Blip2 79.13
3 InstructBlip 75.94
4 Shikra-VQA 71.36
5 Qwen-VL 65.26
6 Shikra 58.82
7 LLAVA-1.5-13B 57.66
8 LLAVA-1.5-7B 51.62
9 MiniGPT-4 47.07
10 Otter 39.77
(b) Pr.Att.-Action
Rank Model Score
  1 XComposer-VL 55.08
2 LLAVA-1.5-7B 43.55
3 Qwen-VL 40.72
4 LLAVA-1.5-13B 40.09
5 Blip2 39.71
6 Shikra 39.01
7 MiniGPT-4 37.40
8 Shikra-VQA 36.50
9 InstructBlip 32.17
10 Otter 31.48
(c) Pr.Att.-TV Show
Rank Model Score
  1 XComposer-VL 67.14
2 Blip2 59.62
3 Shikra 55.70
4 InstructBlip 54.52
5 Shikra-VQA 54.25
6 Qwen-VL 48.66
7 LLAVA-1.5-13B 47.49
8 LLAVA-1.5-7B 42.86
9 MiniGPT-4 42.22
10 Otter 38.12
(d) Pr.Att.-Profession
Rank Model Score
  1 XComposer-VL 89.81
2 Blip2 84.85
3 InstructBlip 76.78
4 Shikra-VQA 68.28
5 LLAVA-1.5-13B 65.98
6 Shikra 64.47
7 Qwen-VL 64.29
8 LLAVA-1.5-7B 56.39
9 MiniGPT-4 49.58
10 Otter 43.89
(e) Pr.Att.-Landmark
Rank Model Score
  1 XComposer-VL 59.72
2 Blip2 45.05
3 Qwen-VL 44.95
4 LLAVA-1.5-7B 42.23
5 LLAVA-1.5-13B 40.95
6 InstructBlip 38.26
7 Shikra-VQA 38.25
8 MiniGPT-4 37.27
9 Shikra 37.02
10 Otter 36.16
(f) Pr.Att.-Anime
Rank Model Score
  1 XComposer-VL 76.43
2 Blip2 66.53
3 InstructBlip 62.14
4 Qwen-VL 51.56
5 LLAVA-1.5-13B 49.63
6 Shikra-VQA 49.02
7 Shikra 47.23
8 LLAVA-1.5-7B 42.22
9 MiniGPT-4 39.66
10 Otter 34.67
(g) Pr.Att.-Clothes
Rank Model Score
  1 XComposer-VL 73.48
2 Blip2 57.30
3 Qwen-VL 52.34
4 LLAVA-1.5-7B 45.11
5 InstructBlip 43.36
6 Shikra 43.13
7 Shikra-VQA 42.01
8 LLAVA-1.5-13B 41.59
9 MiniGPT-4 38.64
10 Otter 33.95
(h) Pr.Att.-Celebrity
Rank Model Score
  1 XComposer-VL 83.60
2 Blip2 81.67
3 InstructBlip 75.94
4 Qwen-VL 60.43
5 Shikra-VQA 58.61
6 LLAVA-1.5-13B 56.20
7 Shikra 56.08
8 MiniGPT-4 46.70
9 LLAVA-1.5-7B 45.20
10 Otter 38.39
(i) Pr.Att.-Food
Rank Model Score
  1 XComposer-VL 82.18
2 Blip2 82.16
3 InstructBlip 74.20
4 Qwen-VL 60.29
5 Shikra 54.03
6 LLAVA-1.5-13B 53.77
7 Shikra-VQA 52.86
8 MiniGPT-4 46.45
9 LLAVA-1.5-7B 42.07
10 Otter 39.52
(j) Pr.Att.-Plant
Rank Model Score
  1 XComposer-VL 60.37
2 LLAVA-1.5-7B 42.53
3 Blip2 42.20
4 LLAVA-1.5-13B 41.30
5 MiniGPT-4 39.18
6 Shikra 38.98
7 Qwen-VL 37.56
8 Shikra-VQA 37.47
9 InstructBlip 35.75
10 Otter 32.18
(k) Pr.Att.-Age
Rank Model Score
  1 Blip2 93.97
2 InstructBlip 80.44
3 XComposer-VL 79.56
4 LLAVA-1.5-13B 74.03
5 Shikra-VQA 63.62
6 Shikra 62.06
7 Qwen-VL 56.05
8 MiniGPT-4 51.87
9 LLAVA-1.5-7B 51.33
10 Otter 39.31
(l) Pr.Att.-Gender
Rank Model Score
  1 XComposer-VL 78.59
2 Blip2 67.94
3 InstructBlip 67.72
4 Shikra-VQA 65.00
5 Shikra 61.51
6 Qwen-VL 59.16
7 LLAVA-1.5-13B 58.33
8 LLAVA-1.5-7B 53.00
9 MiniGPT-4 45.37
10 Otter 39.91
(m) Pr.Att.-Expression
Rank Model Score
  1 XComposer-VL 62.21
2 Blip2 52.98
3 InstructBlip 48.42
4 LLAVA-1.5-13B 47.80
5 Qwen-VL 44.81
6 Shikra-VQA 42.09
7 LLAVA-1.5-7B 41.75
8 Shikra 41.07
9 MiniGPT-4 38.03
10 Otter 28.77
(n) Pr.Att.-Race
Rank Model Score
  1 Blip2 90.07
2 XComposer-VL 87.80
3 InstructBlip 82.43
4 Qwen-VL 65.20
5 LLAVA-1.5-13B 60.13
6 Shikra-VQA 55.53
7 Shikra 53.98
8 MiniGPT-4 48.02
9 Otter 46.92
10 LLAVA-1.5-7B 44.73
(o) Pr.Att.-Animal
Rank Model Score
  1 Blip2 83.55
2 XComposer-VL 81.03
3 InstructBlip 76.06
4 Qwen-VL 61.06
5 LLAVA-1.5-13B 56.84
6 Shikra-VQA 55.02
7 MiniGPT-4 49.61
8 Shikra 46.96
9 LLAVA-1.5-7B 44.26
10 Otter 38.32
(p) Pr.Att.-Object
Rank Model Score
  1 XComposer-VL 60.60
2 Blip2 52.84
3 LLAVA-1.5-13B 45.26
4 Qwen-VL 43.55
5 LLAVA-1.5-7B 42.23
6 Shikra 41.06
7 MiniGPT-4 40.32
8 Shikra-VQA 40.12
9 InstructBlip 40.03
10 Otter 33.86
(q) Pr.Att.-Style
Rank Model Score
  1 XComposer-VL 64.99
2 Shikra-VQA 55.50
3 Blip2 54.94
4 Shikra 54.16
5 LLAVA-1.5-13B 51.16
6 InstructBlip 49.98
7 Qwen-VL 49.17
8 LLAVA-1.5-7B 44.34
9 MiniGPT-4 39.23
10 Otter 37.94
(r) Pr.Att.-Background
Rank Model Score
  1 XComposer-VL 78.66
2 Blip2 74.47
3 InstructBlip 63.59
4 Qwen-VL 58.40
5 Shikra 51.75
6 Shikra-VQA 51.10
7 LLAVA-1.5-13B 49.48
8 LLAVA-1.5-7B 43.40
9 MiniGPT-4 41.70
10 Otter 35.56
(s) Pr.Att.-Color
Table 8: Leaderboards of all Perceptual tasks under print attacking scenario.
Rank Model Score
  1 Qwen-VL 58.17
2 Shikra 57.41
3 Shikra-VQA 57.26
4 LLAVA-1.5-13B 57.17
5 Otter 56.61
6 LLAVA-1.5-7B 51.91
7 MiniGPT-4 36.95
8 Blip2 33.33
9 InstructBlip 31.97
10 XComposer-VL 31.75
(a) Ad.Att.-Movie
Rank Model Score
  1 Shikra-VQA 85.84
2 Qwen-VL 79.11
3 Shikra 77.17
4 LLAVA-1.5-13B 66.37
5 Otter 65.81
6 LLAVA-1.5-7B 60.84
7 MiniGPT-4 37.59
8 Blip2 34.64
9 XComposer-VL 33.88
10 InstructBlip 32.85
(b) Ad.Att.-Action
Rank Model Score
  1 Qwen-VL 55.79
2 Shikra-VQA 53.57
3 Otter 53.05
4 LLAVA-1.5-13B 53.03
5 Shikra 52.61
6 LLAVA-1.5-7B 48.36
7 MiniGPT-4 38.22
8 XComposer-VL 34.45
9 Blip2 32.04
10 InstructBlip 30.93
(c) Ad.Att.-TV Show
Rank Model Score
  1 Shikra-VQA 71.47
2 Shikra 67.66
3 Qwen-VL 64.80
4 Otter 61.44
5 LLAVA-1.5-13B 59.83
6 LLAVA-1.5-7B 56.15
7 MiniGPT-4 36.95
8 Blip2 36.14
9 InstructBlip 35.02
10 XComposer-VL 30.86
(d) Ad.Att.-Profession
Rank Model Score
  1 Shikra-VQA 78.40
2 Shikra 77.85
3 Qwen-VL 76.34
4 LLAVA-1.5-13B 71.92
5 LLAVA-1.5-7B 68.62
6 Otter 56.87
7 MiniGPT-4 35.73
8 XComposer-VL 35.33
9 Blip2 34.53
10 InstructBlip 32.03
(e) Ad.Att.-Landmark
Rank Model Score
  1 Qwen-VL 60.07
2 Otter 51.86
3 LLAVA-1.5-13B 51.81
4 Shikra-VQA 48.45
5 Shikra 47.75
6 LLAVA-1.5-7B 47.32
7 MiniGPT-4 38.38
8 Blip2 37.95
9 InstructBlip 34.29
10 XComposer-VL 30.89
(f) Ad.Att.-Anime
Rank Model Score
  1 Qwen-VL 69.92
2 Shikra-VQA 68.32
3 Shikra 65.73
4 LLAVA-1.5-13B 58.83
5 Otter 57.58
6 LLAVA-1.5-7B 47.05
7 MiniGPT-4 38.05
8 InstructBlip 36.65
9 XComposer-VL 36.41
10 Blip2 36.18
(g) Ad.Att.-Clothes
Rank Model Score
  1 Qwen-VL 64.77
2 Shikra 60.02
3 Shikra-VQA 59.70
4 LLAVA-1.5-13B 55.53
5 LLAVA-1.5-7B 54.95
6 Otter 52.26
7 MiniGPT-4 37.48
8 Blip2 34.72
9 InstructBlip 33.52
10 XComposer-VL 33.09
(h) Ad.Att.-Celebrity
Rank Model Score
  1 Shikra-VQA 75.26
2 Qwen-VL 73.72
3 Shikra 71.07
4 LLAVA-1.5-13B 64.95
5 Otter 64.39
6 LLAVA-1.5-7B 52.30
7 MiniGPT-4 36.94
8 XComposer-VL 33.16
9 Blip2 32.48
10 InstructBlip 32.13
(i) Ad.Att.-Food
Rank Model Score
  1 Qwen-VL 79.59
2 Shikra-VQA 76.06
3 Shikra 73.53
4 Otter 67.16
5 LLAVA-1.5-13B 64.44
6 LLAVA-1.5-7B 47.41
7 MiniGPT-4 37.74
8 Blip2 35.84
9 InstructBlip 34.02
10 XComposer-VL 34.00
(j) Ad.Att.-Plant
Rank Model Score
  1 Shikra 56.95
2 Shikra-VQA 56.80
3 LLAVA-1.5-13B 51.43
4 Qwen-VL 48.11
5 LLAVA-1.5-7B 45.89
6 Otter 42.82
7 MiniGPT-4 37.14
8 Blip2 34.16
9 InstructBlip 33.14
10 XComposer-VL 30.78
(k) Ad.Att.-Age
Rank Model Score
  1 LLAVA-1.5-13B 89.35
2 Shikra-VQA 89.06
3 Shikra 85.34
4 Otter 82.25
5 Qwen-VL 79.18
6 LLAVA-1.5-7B 51.95
7 MiniGPT-4 48.38
8 Blip2 39.48
9 InstructBlip 38.51
10 XComposer-VL 38.09
(l) Ad.Att.-Gender
Rank Model Score
  1 Shikra-VQA 82.04
2 Shikra 79.86
3 Qwen-VL 73.91
4 LLAVA-1.5-13B 63.13
5 Otter 60.44
6 LLAVA-1.5-7B 55.70
7 MiniGPT-4 41.07
8 Blip2 35.95
9 InstructBlip 33.52
10 XComposer-VL 32.80
(m) Ad.Att.-Expression
Rank Model Score
  1 LLAVA-1.5-13B 63.06
2 Shikra-VQA 61.19
3 Shikra 60.24
4 Qwen-VL 59.50
5 LLAVA-1.5-7B 45.97
6 Otter 45.50
7 MiniGPT-4 37.47
8 Blip2 36.81
9 InstructBlip 36.36
10 XComposer-VL 34.77
(n) Ad.Att.-Race
Rank Model Score
  1 Qwen-VL 81.78
2 Shikra-VQA 80.44
3 Otter 78.00
4 Shikra 77.66
5 LLAVA-1.5-13B 70.94
6 LLAVA-1.5-7B 49.30
7 MiniGPT-4 37.61
8 Blip2 35.23
9 XComposer-VL 34.38
10 InstructBlip 34.01
(o) Ad.Att.-Animal
Rank Model Score
  1 Shikra-VQA 79.05
2 Qwen-VL 77.26
3 Shikra 69.98
4 LLAVA-1.5-13B 65.33
5 Otter 62.69
6 LLAVA-1.5-7B 52.62
7 MiniGPT-4 37.84
8 Blip2 37.62
9 XComposer-VL 36.97
10 InstructBlip 36.88
(p) Ad.Att.-Object
Rank Model Score
  1 Shikra-VQA 79.12
2 Shikra 77.62
3 Qwen-VL 76.50
4 Otter 74.91
5 LLAVA-1.5-13B 66.43
6 XComposer-VL 53.21
7 InstructBlip 50.03
8 Blip2 47.80
9 LLAVA-1.5-7B 46.78
10 MiniGPT-4 36.78
(q) Ad.Att.-OCR
Rank Model Score
  1 Qwen-VL 56.80
2 Otter 53.47
3 Shikra 51.05
4 Shikra-VQA 50.30
5 LLAVA-1.5-13B 50.01
6 LLAVA-1.5-7B 44.85
7 MiniGPT-4 37.98
8 XComposer-VL 37.20
9 InstructBlip 36.20
10 Blip2 35.87
(r) Ad.Att.-Style
Rank Model Score
  1 Shikra-VQA 65.41
2 Shikra 64.50
3 Otter 60.30
4 Qwen-VL 60.09
5 LLAVA-1.5-13B 56.28
6 LLAVA-1.5-7B 46.53
7 MiniGPT-4 36.91
8 Blip2 35.05
9 XComposer-VL 34.71
10 InstructBlip 34.45
(s) Ad.Att.-Background
Rank Model Score
  1 Qwen-VL 69.63
2 Shikra 69.57
3 Shikra-VQA 69.05
4 LLAVA-1.5-13B 54.28
5 Otter 53.12
6 LLAVA-1.5-7B 46.08
7 MiniGPT-4 37.54
8 XComposer-VL 37.23
9 Blip2 36.83
10 InstructBlip 35.81
(t) Ad.Att.-Color
Table 9: Leaderboards of all Perceptual tasks under adversarial attacking scenario.
Rank Model Score
  1 XComposer-VL 72.37
2 InstructBlip 67.41
3 Blip2 66.13
4 Qwen-VL-Chat 62.68
5 Shikra 62.52
6 LLAVA-1.5-13B 59.28
7 Shikra-VQA 58.64
8 Otter 55.49
9 LLAVA-1.5-7B 51.75
10 MiniGPT-4 42.64
(a) Clean
Rank Model Score
  1 XComposer-VL 71.63
2 InstructBlip 67.62
3 Blip2 65.49
4 Shikra 62.37
5 Qwen-VL-Chat 60.58
6 LLAVA-1.5-13B 58.73
7 Shikra-VQA 58.47
8 Otter 55.42
9 LLAVA-1.5-7B 46.44
10 MiniGPT-4 42.87
(b) Corru.
Rank Model Score
  1 XComposer-VL 63.75
2 Blip2 56.20
3 Qwen-VL-Chat 51.57
4 InstructBlip 51.23
5 LLAVA-1.5-13B 50.70
6 Shikra 49.08
7 LLAVA-1.5-7B 46.47
8 Shikra-VQA 43.72
9 MiniGPT-4 42.87
10 Otter 47.10
(c) Print Attack
Rank Model Score
  1 Qwen-VL-Chat 60.39
2 Shikra 59.31
3 LLAVA-1.5-13B 56.84
4 Shikra-VQA 55.16
5 Otter 51.94
6 LLAVA-1.5-7B 50.07
7 InstructBlip 33.13
8 MiniGPT-4 32.95
9 Blip2 32.65
10 XComposer-VL 30.52
(d) Adversarial Attack
Table 10: Leaderboards of comprehensive performance in the clean, corruption, print attack and adversarial attack scenarios.

The model performance leaderboards for each subtask under each scenario are shown in Tab. 6, Tab. 7, Tab. 8, and Tab. 9. For each subtask, we report the average of the multi-choice and true-or-false evaluation results. Since a free-form question assesses the model’s perception abilities across multiple subtasks at the same time, the free-form results are not included in the per-subtask leaderboards.

The overall performance in each scenario is displayed in Tab. 10. We calculate the final score as the average of the scores of the three question types (i.e., multi-choice, true-or-false and free-form).
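
For clarity, the aggregation above can be written as the following minimal sketch. The per-question record keys ("model", "subtask", "question_type", "score") are illustrative placeholders, not the released file format.

```python
# Minimal sketch of the score aggregation described above.
# Assumes per-question results as dicts with illustrative keys
# "model", "subtask", "question_type" and a numeric "score".
from collections import defaultdict
from statistics import mean

def subtask_leaderboard(results, subtask):
    """Per-subtask score (Tab. 6-9): average over multi-choice and true-or-false only."""
    per_model = defaultdict(list)
    for r in results:
        if r["subtask"] == subtask and r["question_type"] in ("multi-choice", "true-or-false"):
            per_model[r["model"]].append(r["score"])
    return sorted(((round(mean(v), 2), m) for m, v in per_model.items()), reverse=True)

def overall_score(results, model):
    """Scenario-level score (Tab. 10): average of the three question-type averages."""
    by_type = defaultdict(list)
    for r in results:
        if r["model"] == model:
            by_type[r["question_type"]].append(r["score"])
    return mean(mean(v) for v in by_type.values())
```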

Appendix B Discussion

B.1 General Discussion

Limitation. Dysca is a dynamic and scalable benchmark, offering evaluation of 20 perceptual subtasks under 51 image styles and 4 scenarios. However, generating data for evaluating cognition abilities (e.g., commonsense reasoning) presents a challenge within the existing framework. This limitation arises from the reliance on predefined rules for prompt and question generation, which may not adequately capture the complexity of cognition-level questions.

Synthesis Data for Training / Fine-tuning. The use of synthetic data for model training / fine-tuning has been adopted in the field of Natural Language Processing (NLP) [65]. In this work, we do not explore the possibility of utilizing our benchmark for model training. Our primary goal in this paper is to provide a large-scale evaluation benchmark that addresses the issue of data leakage in current multimodal evaluation benchmarks and offers evaluation results across multiple subtasks, scenarios, question types and styles. Nevertheless, considering that Dysca has the capability to synthesize high-resolution and unlimited amounts of annotated multimodal data, we believe that Dysca also holds potential as a training data synthesis tool for LVLMs.

Reproducibility and Licence. All experiments are conducted on 8 × RTX 4090 GPUs. All the data and the code for generation and evaluation are released at https://github.com/Benchmark-Dysca/Dysca. The licence of Dysca is "CreativeML Open RAIL++-M", which follows the licence set by Stable Diffusion XL.

Ethical Concerns. Our Dysca leverages Stable Diffusion XL [36] to generate images. We make considerable efforts to prevent the model from generating unsafe images, e.g., NSFW and offensive content. First, we use the safety checker [66] to post-filter unsafe images: when an image is recognized as unsafe by the safety checker, the output is replaced with a blank image. Besides, we manually exclude from the metadata M the specific styles or words that may trigger unsafe image generation. We therefore believe that Dysca raises few ethical concerns.
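
As a schematic illustration of this post-filtering step (not the exact implementation), the logic can be sketched as follows, where safety_checker is a hypothetical callable standing in for the checker of [66] and is assumed to return True for unsafe images.

```python
# Schematic sketch of the unsafe-image post-filtering described above.
# `safety_checker(image)` is a placeholder for the checker of [66];
# it is assumed to return True when the image is flagged as unsafe.
from PIL import Image

def post_filter(image: Image.Image, safety_checker) -> Image.Image:
    """Replace any image flagged by the safety checker with a blank image."""
    if safety_checker(image):
        return Image.new("RGB", image.size, (0, 0, 0))  # blank output
    return image
```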

B.2 The Stability of Dysca

Refer to caption
Figure 6: The tendency of model’s overall performance under clean scenario with different scale of evaluation data.

In this section, we examine the stability of Dysca. We partition Dysca into 11 different scales: 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%, and compute the evaluation scores at each scale. The score is calculated as the sum of the scores obtained from multi-choice, true-or-false and free-form questions. As can be seen in Fig. 6, when the evaluation data scale is less than 30% of Dysca (i.e., fewer than 46.8K samples), the evaluation scores show significant fluctuations. When the data scale exceeds 40%, the results become stable, indicating that the current scale of Dysca achieves stable and reliable evaluation. Although 40% of Dysca is already sufficient for stable scores, Dysca aims to provide more than stable rankings: it draws on a large amount of data to provide in-depth feedback across different image styles and perceptual subtasks.
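
This stability check can be reproduced with a simple subsampling loop such as the sketch below, where score_fn is a placeholder for the full Dysca scoring pipeline and the fixed random seed is an assumption for reproducibility.

```python
# Sketch of the stability check: re-score random subsets of Dysca at
# increasing fractions and observe when the overall score converges.
# `score_fn(subset)` is a placeholder for the full Dysca scoring pipeline.
import random

def stability_curve(samples, score_fn,
                    fractions=(0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
                               0.6, 0.7, 0.8, 0.9, 1.0),
                    seed=0):
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        subset = rng.sample(samples, max(1, int(len(samples) * frac)))
        curve.append((frac, score_fn(subset)))
    return curve
```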

Appendix C The Metadata (M) of Dysca

Metadata (M) is the core of Dysca. It is randomly assembled from our collected source material and contains all the information needed to generate the prompt (P), the image (I), and the question-answer pairs (Q). Specifically, the metadata is a data container holding multiple dimensions of information: the foreground, the attributes corresponding to the foreground, the background, and the artistic style required to generate an image. Therefore, each instance of M maps one-to-one to a prompt, an image, and a set of question-answer pairs.
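
To make this mapping concrete, the sketch below shows one illustrative metadata instance and a rule-based mapping to a prompt and a question-answer pair. All field names, templates and option texts here are assumptions for illustration; the released metadata may be organised differently.

```python
# Illustrative metadata instance M and its rule-based mapping to a prompt P
# and a question-answer pair Q. Field names and templates are assumptions.
metadata = {
    "foreground": "a golden retriever",
    "attributes": {"action": "running"},
    "background": "on a beach",
    "style": "oil painting",
}

def to_prompt(m):
    # P: the textual prompt fed to the text-to-image model.
    return f"{m['foreground']} {m['attributes']['action']} {m['background']}, in {m['style']} style"

def to_qa(m):
    # Q: a rule-based multi-choice question with its ground-truth answer.
    question = "What is the animal in the image doing? (A) running (B) sleeping"
    answer = "(A) running"
    return question, answer

prompt = to_prompt(metadata)   # used to generate the image I
qa_pair = to_qa(metadata)      # paired with I for evaluation
```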

In order to ensure the quality and stability of the generated images, we carefully select the source material. First, for each perceptual subtask, we collect rich annotation material as described in Section 3.2. However, the metadata composed of these raw annotations is not always usable. On the one hand, some of the content is polysemous and can easily be misinterpreted by the model when generating images. On the other hand, some backgrounds or artistic styles (e.g., “Pokemon Style", “architectural style", etc.) negatively affect the quality of the image and do not accurately generate the desired content. To test the usability of these source materials, we ran several small-scale pre-generations covering all of them. After careful selection, we retain the entries that consistently produce high-quality images. The detailed information of the source materials is shown in Tab. 11.

Table 11: Detailed information of the source materials.
Category Data Description #Numbers
Style We collected artistic styles from the community that can be well rendered by the Stable Diffusion model. We removed those that strongly affect the image content or may generate unsafe images. 51
Background We selected 20 rich backgrounds that can be accurately generated by the Stable Diffusion model. 20
Age We chose four well-characterized age nodes: 14, 25, 40, and 80. 4
Expression We chose three characteristic expressions: smiling happily, shouting furiously, calm and placid. 3
Gender Male and Female. 2
Race We identified five races based on the ethnicities that can be generated by Stable Diffusion: Caucasian, Asian, African, Indian, and Middle Eastern. 5
Profession After pre-generation and careful selection, we chose 20 occupations with distinctive characteristics that are easy to generate. 20
Action After pre-generation and careful selection, we chose 20 actions with distinctive characteristics that are easy to generate. 20
Celebrity After pre-generation and careful selection, we chose 50 well-known celebrities. 50
Animal We selected a rich variety of animals, including mammals, birds, reptiles, insects, and aquatic animals, and they can be generated accurately by the Stable Diffusion model. 67
Plant We selected a rich variety of plants, including flowers, trees, fruits, and vegetables, and they can be generated by the Stable Diffusion model accurately. 37
Clothes We selected 16 common types of clothing that are highly distinguishable from each other. 16
Object We took the annotations from the MSCOCO [50] dataset after removing people, animals, and plants, and added some additional common objects. 80
Landmark We chose 23 characteristic landmarks from around the globe and they can be generated by the Stable Diffusion model accurately. 23
Food We collected 29 special dishes from around the globe and they can be generated by the Stable Diffusion model accurately. 29
Movie We selected 106 movie titles from the rating list of IMDb based on the number of user reviews. 106
Anime We selected 44 anime titles from the rating list of IMDb based on the number of user reviews. 44
TV shows We selected 20 TV show titles from the rating list of IMDb based on the number of user reviews. 20
OCR We randomly selected 5000 words from the IELTS vocabulary for the text material. Among them, words with length less than 3 were removed. 5000
Color We selected 8 easily distinguishable colors: red, orange, yellow, green, blue, purple, white, black. 8

Appendix D Scenarios Details

D.1 Print Attack Scenario

Following the settings in [35], we add the attack text to the images. Considering that the image resolution in Dysca is much higher than that in [35], we introduce more font variations in terms of font position and font orientation. Fig. 7 to Fig. 11 show detailed examples, and a minimal sketch of this overlay is given after Fig. 11.

Refer to caption
Figure 7: Images with different font size. ’112px’ means that the typos are 112 pixels in size.
Refer to caption
Figure 8: Images with different font color. ’Blue’ means that the typos are rendered in blue.
Refer to caption
Figure 9: Images with different font opacity. ’40%’ means that the opacity of the typos is 40%, and ’100%’ implies fully opaque typos.
Refer to caption
Figure 10: Images with different font orientation. ’15°’ means that the typos are rotated by 15°, and 0° implies the typos are horizontal.
Refer to caption
Figure 11: Images with different font position. An image is divided into a grid of 5 rows and 5 columns, leading to 25 sections. ’R1C1’ means the typo is located in row 1, column 1, which is the top left corner of the image.
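
The sketch below shows one way such typographic attack text can be overlaid with PIL, varying the size, color, opacity, position and orientation of the typos. The parameter defaults (e.g., 112 px red text), the font fallback and the function name are assumptions for illustration rather than the exact Dysca implementation.

```python
# Minimal sketch of overlaying typographic attack text with PIL.
# Default values and the font fallback are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def add_typo(image, text, position=(0, 0), font_size=112,
             color=(255, 0, 0), opacity=255, angle=0):
    """Overlay attack text with a given size, color, opacity, position and orientation."""
    base = image.convert("RGBA")
    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))        # transparent text layer
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # any available TTF works
    except OSError:
        font = ImageFont.load_default()
    ImageDraw.Draw(layer).text(position, text, font=font, fill=color + (opacity,))
    layer = layer.rotate(angle, center=position)               # font orientation
    return Image.alpha_composite(base, layer).convert("RGB")
```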

D.2 Corrupted Scenario

Examples of the 11 image corruptions are shown in Fig. 12, and a sketch of one such corruption is given after Fig. 12.

Refer to caption
Figure 12: Examples of 11 image corruptions applied to a single clean image.
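
As one hedged example of this kind of perturbation (the exact 11 corruption types and their parameters are defined in the released generation code), additive Gaussian noise could be applied as in the sketch below; the severity-to-noise mapping is an assumption for illustration.

```python
# Sketch of a single corruption (Gaussian noise) as one example of the
# perturbations shown in Fig. 12; the severity scale is an assumption.
import numpy as np
from PIL import Image

def gaussian_noise(image: Image.Image, severity: int = 3) -> Image.Image:
    """Apply additive Gaussian noise at one of five severity levels."""
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]   # assumed severity-to-std mapping
    x = np.asarray(image).astype(np.float32) / 255.0
    noisy = np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))
```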

Appendix E More Examples of Dysca

For each subject collected in the metadata (M), we display one example of its prompt (P), generated image (I) and corresponding question-answer pairs (Q).

Refer to caption
Figure 13: Plant
Refer to caption
Figure 14: Profession
Refer to caption
Figure 15: Celebrity
Refer to caption
Figure 16: Action
Refer to caption
Figure 17: Landmark
Refer to caption
Figure 18: Face
Refer to caption
Figure 19: Animal
Refer to caption
Figure 20: Object
Refer to caption
Figure 21: Clothes
Refer to caption
Figure 22: Food
Refer to caption
Figure 23: Movie
Refer to caption
Figure 24: TV
Refer to caption
Figure 25: Anime
Refer to caption
Figure 26: OCR

Appendix F Data Sheet

We follow the documentation framework provided by Gebru et al. [67].

F.1 Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

  • The proposed dataset is used for evaluating the perception ability of current LVLMs. We use synthesized images to prevent the potential data leakage problem of current benchmarks. The dataset tests LVLMs on 20 subtasks under 4 scenarios and 3 question types, revealing the existing drawbacks of current LVLMs.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

F.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

  • Each instance is a synthesized image together with its generation prompt and the corresponding question-answer pairs. There is only one type of instance.

How many instances are there in total (of each type, if appropriate)?

  • There are a total of 20 subtasks in our Dysca. For details of each subtask, please refer to Fig. 2.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

  • No. The images in Dysca are completely generated from scratch.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

  • Each instance consists of a prompt, the image generated by Stable Diffusion, a question and the corresponding answer.

Is there a label or target associated with each instance? If so, please provide a description.

  • Yes, Dysca provides the ground truth for each instance.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

  • No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

  • There are no relationships between individual instances.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

  • Following our motivation, the entire proposed dataset is used for testing purposes.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

  • Errors in image generation resulting from Stable Diffusion are unavoidable. However, we have performed dataset cleaning to minimize these errors. Furthermore, the stability experiment in Appendix B demonstrates that these errors do not affect the overall evaluation results of the dataset.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

  • The proposed Dysca does not rely on any external resources.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

  • No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

  • No. To ensure that the generated images do not contain offensive, insulting, threatening, or anxiety-inducing content, we manually filter out words from the metadata M that could potentially trigger the diffusion model to generate such images. A safety checker is also used to further avoid unsafe image generation.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

  • Yes.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

  • Yes. There are age, gender and race recognition subtasks in Dysca. Each of them is divided into several subpopulations, and the selection of these subpopulations is based on the ability of Stable Diffusion to generate representative images of them.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

  • Yes. There is a celebrity recognition task in our dataset, where 50 well-known celebrities are chosen. We choose celebrities who can be generated well by Stable Diffusion XL.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

  • No, our benchmark does not contain any sensitive data.

F.3 Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

  • We display the detailed explanation in Tab. 11.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

  • We collect the data by manual human curation.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

  • No.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

  • The metadata in Tab. 11 was collected by the authors. The images are generated by Stable Diffusion, and the labels of each image are also automatically generated.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

  • Our dataset was constructed in April 2024, but the results do not depend on the date of data collection.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

  • No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

  • No.

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

  • N/A. Dysca does not involve collecting data from individuals.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

  • N/A. Dysca does not involve collecting data from individuals.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

  • N/A. Dysca does not involve collecting data from individuals.

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

  • No.

F.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.

  • Yes. We leverage off-the-shelf models, i.e., PP-OCRv3 [56] and CLIP-L-14 [16], to clean the data. PP-OCRv3 [56] is used as a filter to exclude failure cases where TextDiffusion2 [58] renders the wrong text on the image. For the other images, we use CLIP-L-14 [16] to filter out images with low text-image consistency; a sketch of this filter is given below.
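
The following is a hedged sketch of the CLIP-based consistency filter, assuming the Hugging Face openai/clip-vit-large-patch14 checkpoint corresponds to CLIP-L-14; the similarity threshold is illustrative, not the value used to build Dysca.

```python
# Sketch of CLIP-based text-image consistency filtering.
# The checkpoint id and the threshold below are assumptions for illustration.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def text_image_consistency(image, prompt):
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()

def keep_image(image, prompt, threshold=0.25):
    """Keep only images whose similarity to the prompt exceeds the threshold."""
    return text_image_consistency(image, prompt) >= threshold
```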

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

  • Yes. We have saved all the data. However, most of these raw images were filtered out and are considered to be of little use.

Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.

  • Yes. The code for data generation, cleaning and evaluation is released at https://github.com/Benchmark-Dysca/Dysca.

F.5 Uses

Has the dataset been used for any tasks already? If so, please provide a description.

  • No. The proposed dataset is a novel one, used for evaluating the perception ability of current LVLMs.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

  • Yes. We plan to create a section on the project homepage to keep track of LVLMs papers for researchers to analyze and compare.

What (other) tasks could the dataset be used for?

  • In this work, we do not explore the possibility of utilizing our benchmark for model training / fine-tuning. Our primary goal in this paper is to provide a large-scale evaluation benchmark that addresses the issue of data leakage in current multimodal evaluation benchmarks and offers evaluation results across multiple subtasks, scenarios, question types and styles. Nevertheless, considering that Dysca has the capability to synthesize high-resolution and unlimited amounts of annotated multimodal data, we believe that Dysca also holds potential as a training data synthesis tool for LVLMs.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?

  • Yes.

Are there tasks for which the dataset should not be used? If so, please provide a description.

  • The proposed dataset should not be used to generate offensive data.

F.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

  • Yes.

How will the dataset will be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

  • We will open-source our dataset on our GitHub project homepage. At the moment, we do not have a DOI number.

When will the dataset be distributed?

  • The dataset can be downloaded right now.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

  • Yes. Dysca is distributed under the "CreativeML Open RAIL++-M" licence, which follows the licence set by Stable Diffusion XL.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

  • No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

  • Not yet.

F.7 Maintenance

Who will be supporting/hosting/maintaining the dataset?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

  • Following the double-blind rule, we will release the detailed information about this part once our paper is accepted.

Is there an erratum? If so, please provide a link or other access point.

  • No.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

  • There are no plans at the moment, but if there are updates, they will be announced, and the download source will be updated on the project homepage.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

  • No.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

  • Yes. If there are any updates, the previous version of the dataset will also be shared on the website for download.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

  • Yes. We welcome and encourage researchers to extend/augment/build on/contribute to our dataset for non-profit purposes without the need for prior notification.