# 🎨Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

🎉If you like our project, please give us a star ⭐ on GitHub for the latest updates.

## Overview🔍

*Figure 1. Overview of the automatic pipeline in Dysca for generating VQAs, cleaning VQAs, and evaluating LVLMs.*

*Figure 2. The available subtasks of Dysca.*

**Abstract.** Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most of them construct questions from images drawn from existing datasets, which introduces a risk of data leakage. Moreover, they evaluate LVLMs only on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose Dysca, a dynamic and scalable benchmark that evaluates LVLMs on synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions, and the corresponding answers. We consider 51 image styles and evaluate perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking, and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false, and Free-form). Thanks to its generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios are easily added. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs.
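To make the generative paradigm concrete, here is a minimal sketch of how a diffusion model can be paired with a rule-based question template, assuming the Hugging Face `diffusers` library; the attribute values, prompt template, question template, and checkpoint name are illustrative assumptions, not the actual Dysca pipeline:

```python
# Minimal sketch: synthesize a novel image with Stable Diffusion and
# derive the question and ground-truth answer from the prompt attributes.
# The attributes and templates below are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

style, animal, color = "oil painting", "dog", "brown"
prompt = f"an {style} of a {color} {animal}"
image = pipe(prompt).images[0]  # a newly generated image, so no data leakage

# Because the prompt is assembled from known attributes, the ground truth
# for the question comes for free.
question = f"What is the color of the {animal} in the image?"
choices = ["brown", "black", "white", "gray"]
answer = color
image.save("sample.png")
```

Because every question is derived from attributes the generator controls, new subtasks or styles only require extending the attribute lists, which is what makes such a benchmark scalable.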

## Download 📩

We provide two ways to download the benchmark:

## Comparison with Existing Benchmarks📊

Comparisons between existing LVLM benchmarks. '⍻' indicates that the benchmark includes both newly collected images/annotations and images/annotations gathered from existing datasets. '*' The scale of our released benchmark is 617K; however, Dysca can generate an unlimited amount of data for testing.

| Benchmark | #Evaluation Data Scale | #Perceptual Tasks | Automatic Annotation | Collecting from Existing Datasets | Question Type | Automatic Evaluation |
|-----------|------------------------|-------------------|----------------------|-----------------------------------|---------------|----------------------|
| LLaVA-Bench | 0.15K | - | × | | Free-form | |
| MME | 2.3K | 10 | × | | True-or-false | |
| LVLM-eHub | - | 3 | × | | Free-form | × |
| tiny-LVLM-eHub | 2.1K | 3 | × | | Free-form | |
| SEED-Bench | 19K | 8 | × | | Multi-choices | |
| MMBench | 2.9K | 12 | × | | Multi-choices | |
| TouchStone | 0.9K | 10 | × | | Free-form | |
| REFORM-EVAL | 50K | 7 | × | | Multi-choices | |
| MM-BigBench | 30K | 6 | × | | Multi-choices | |
| MM-VET | 0.2K | 4 | | | Free-form | |
| MLLM-Bench | 0.42K | 7 | × | | Free-form | |
| SEED-Bench2 | 24K | 10 | × | | Multi-choices | |
| BenchLMM | 2.4K | 15 | × | × | Free-form | |
| JourneyDB | 5.4K | 2 | | | Free-form, Multi-choices | |
| Dysca (Ours) | 617K* | 20 | ✓ | × | Free-form, Multi-choices, True-or-false | ✓ |

## Examples of Dysca📸

Here are some examples of the images, prompts, questions, and ground-truth answers in Dysca. All images are generated by diffusion models.
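As a rough illustration only, a single generated item could be represented by a record like the one below; the field names and values are hypothetical and do not reflect the released file format:

```python
# Hypothetical shape of one Dysca-style VQA item; the schema is an
# illustrative assumption, not the repository's actual data format.
sample = {
    "image": "images/00001.png",       # synthesized by the diffusion model
    "prompt": "an oil painting of a brown dog",
    "subtask": "Color",                # one of the 20 perception subtasks
    "scenario": "Clean",               # or Corruption / Print / Adversarial Attacking
    "question_type": "Multi-choices",  # or True-or-false / Free-form
    "question": "What is the color of the dog in the image?",
    "choices": ["brown", "black", "white", "gray"],
    "answer": "brown",
}
```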









## Leaderboards🏆

### A. Comprehensive Performance Leaderboards under each scenario

**Clean**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 217.07 |
| 🥈 | InstructBlip | 202.23 |
| 🥉 | Blip2 | 198.41 |
| 4 | Qwen-VL-Chat | 188.05 |
| 5 | Shikra | 187.57 |
| 6 | LLAVA-1.5-13B | 177.83 |
| 7 | Shikra-VQA | 175.93 |
| 8 | Otter | 166.47 |
| 9 | LLAVA-1.5-7B | 155.26 |
| 10 | MiniGPT-4 | 127.91 |

**Corruption**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 214.88 |
| 🥈 | InstructBlip | 201.06 |
| 🥉 | Blip2 | 196.47 |
| 4 | Shikra | 187.16 |
| 5 | Qwen-VL-Chat | 181.74 |
| 6 | LLAVA-1.5-13B | 176.19 |
| 7 | Shikra-VQA | 175.41 |
| 8 | Otter | 166.26 |
| 9 | LLAVA-1.5-7B | 154.15 |
| 10 | MiniGPT-4 | 129.04 |

**Print Attacking**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 191.25 |
| 🥈 | Blip2 | 168.61 |
| 🥉 | Qwen-VL-Chat | 154.73 |
| 4 | InstructBlip | 153.69 |
| 5 | Shikra | 147.23 |
| 6 | LLAVA-1.5-13B | 152.10 |
| 7 | LLAVA-1.5-7B | 139.34 |
| 8 | Shikra-VQA | 131.16 |
| 9 | MiniGPT-4 | 128.61 |
| 10 | Otter | 111.31 |

**Adversarial Attacking**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | Qwen-VL-Chat | 181.18 |
| 🥈 | Shikra | 177.92 |
| 🥉 | LLAVA-1.5-13B | 170.53 |
| 4 | Shikra-VQA | 165.49 |
| 5 | Otter | 155.81 |
| 6 | LLAVA-1.5-7B | 150.23 |
| 7 | InstructBlip | 99.38 |
| 8 | MiniGPT-4 | 98.86 |
| 9 | Blip2 | 97.96 |
| 10 | XComposer-VL | 91.56 |

### B. Fine-grained Leaderboards under each scenario

#### 1. Clean

**Clean-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 80.13 |
| 2 | InstructBlip | 72.36 |
| 3 | Blip2 | 70.03 |
| 4 | Shikra-VQA | 64.03 |
| 5 | Shikra | 62.97 |
| 6 | Otter | 62.22 |
| 7 | Qwen-VL-Chat | 61.13 |
| 8 | LLAVA-1.5-13B | 60.17 |
| 9 | LLAVA-1.5-7B | 53.81 |
| 10 | MiniGPT-4 | 42.52 |

**Clean-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 95.89 |
| 2 | InstructBlip | 95.44 |
| 3 | Blip2 | 94.64 |
| 4 | Shikra-VQA | 88.25 |
| 5 | Qwen-VL-Chat | 79.75 |
| 6 | Shikra | 77.65 |
| 7 | Otter | 69.10 |
| 8 | LLAVA-1.5-13B | 68.82 |
| 9 | LLAVA-1.5-7B | 62.77 |
| 10 | MiniGPT-4 | 46.98 |

**Clean-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 75.55 |
| 2 | InstructBlip | 63.42 |
| 3 | Otter | 63.04 |
| 4 | Shikra-VQA | 60.10 |
| 5 | Blip2 | 59.14 |
| 6 | Shikra | 58.55 |
| 7 | Qwen-VL-Chat | 58.27 |
| 8 | LLAVA-1.5-13B | 56.69 |
| 9 | LLAVA-1.5-7B | 53.48 |
| 10 | MiniGPT-4 | 40.23 |

**Clean-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 79.92 |
| 2 | InstructBlip | 77.84 |
| 3 | Blip2 | 76.88 |
| 4 | Shikra | 73.55 |
| 5 | Shikra-VQA | 73.07 |
| 6 | Qwen-VL-Chat | 65.12 |
| 7 | LLAVA-1.5-13B | 62.86 |
| 8 | Otter | 60.95 |
| 9 | LLAVA-1.5-7B | 57.74 |
| 10 | MiniGPT-4 | 42.06 |

**Clean-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | InstructBlip | 96.90 |
| 2 | Blip2 | 96.64 |
| 3 | XComposer-VL | 96.43 |
| 4 | Shikra-VQA | 82.85 |
| 5 | Shikra | 80.17 |
| 6 | Qwen-VL-Chat | 75.84 |
| 7 | LLAVA-1.5-13B | 75.12 |
| 8 | LLAVA-1.5-7B | 69.74 |
| 9 | Otter | 62.50 |
| 10 | MiniGPT-4 | 52.81 |

**Clean-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 75.00 |
| 2 | Qwen-VL-Chat | 63.11 |
| 3 | InstructBlip | 62.66 |
| 4 | Otter | 59.56 |
| 5 | Blip2 | 59.50 |
| 6 | LLAVA-1.5-13B | 53.88 |
| 7 | Shikra-VQA | 53.31 |
| 8 | Shikra | 52.84 |
| 9 | LLAVA-1.5-7B | 47.80 |
| 10 | MiniGPT-4 | 38.98 |

**Clean-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 85.96 |
| 2 | InstructBlip | 83.80 |
| 3 | Blip2 | 78.09 |
| 4 | Shikra-VQA | 69.83 |
| 5 | Qwen-VL-Chat | 69.58 |
| 6 | Shikra | 67.43 |
| 7 | LLAVA-1.5-13B | 61.20 |
| 8 | Otter | 58.33 |
| 9 | LLAVA-1.5-7B | 48.80 |
| 10 | MiniGPT-4 | 40.86 |

**Clean-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 87.91 |
| 2 | InstructBlip | 79.91 |
| 3 | Blip2 | 79.23 |
| 4 | Qwen-VL-Chat | 68.71 |
| 5 | Shikra-VQA | 62.93 |
| 6 | Shikra | 61.81 |
| 7 | LLAVA-1.5-13B | 59.04 |
| 8 | LLAVA-1.5-7B | 56.04 |
| 9 | Otter | 52.97 |
| 10 | MiniGPT-4 | 39.13 |

**Clean-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 91.27 |
| 2 | InstructBlip | 90.28 |
| 3 | Blip2 | 89.78 |
| 4 | Shikra-VQA | 80.23 |
| 5 | Shikra | 75.84 |
| 6 | Qwen-VL-Chat | 71.37 |
| 7 | LLAVA-1.5-13B | 68.16 |
| 8 | Otter | 63.22 |
| 9 | LLAVA-1.5-7B | 52.87 |
| 10 | MiniGPT-4 | 47.72 |

**Clean-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 93.19 |
| 2 | InstructBlip | 92.62 |
| 3 | Blip2 | 91.84 |
| 4 | Shikra-VQA | 80.05 |
| 5 | Shikra | 77.57 |
| 6 | Qwen-VL-Chat | 76.11 |
| 7 | Otter | 73.96 |
| 8 | LLAVA-1.5-13B | 67.61 |
| 9 | LLAVA-1.5-7B | 48.84 |
| 10 | MiniGPT-4 | 47.95 |

**Clean-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.12 |
| 2 | InstructBlip | 62.34 |
| 3 | Shikra-VQA | 61.42 |
| 4 | Blip2 | 61.13 |
| 5 | Shikra | 61.04 |
| 6 | LLAVA-1.5-13B | 54.57 |
| 7 | Qwen-VL-Chat | 51.28 |
| 8 | LLAVA-1.5-7B | 46.75 |
| 9 | Otter | 44.58 |
| 10 | MiniGPT-4 | 41.73 |

**Clean-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 98.83 |
| 2 | Blip2 | 97.12 |
| 3 | InstructBlip | 93.86 |
| 4 | LLAVA-1.5-13B | 91.12 |
| 5 | Shikra-VQA | 90.40 |
| 6 | Shikra | 85.50 |
| 7 | Qwen-VL-Chat | 78.13 |
| 8 | Otter | 78.09 |
| 9 | MiniGPT-4 | 52.86 |
| 10 | LLAVA-1.5-7B | 51.96 |

**Clean-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.08 |
| 2 | InstructBlip | 85.59 |
| 3 | Shikra-VQA | 83.53 |
| 4 | Blip2 | 82.59 |
| 5 | Shikra | 80.51 |
| 6 | Qwen-VL-Chat | 74.76 |
| 7 | Otter | 67.11 |
| 8 | LLAVA-1.5-13B | 64.81 |
| 9 | LLAVA-1.5-7B | 56.81 |
| 10 | MiniGPT-4 | 46.46 |

**Clean-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.56 |
| 2 | InstructBlip | 76.31 |
| 3 | Blip2 | 73.16 |
| 4 | LLAVA-1.5-13B | 66.50 |
| 5 | Qwen-VL-Chat | 65.19 |
| 6 | Shikra-VQA | 64.56 |
| 7 | Shikra | 64.53 |
| 8 | Otter | 50.44 |
| 9 | LLAVA-1.5-7B | 47.27 |
| 10 | MiniGPT-4 | 38.98 |

**Clean-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 97.27 |
| 2 | InstructBlip | 96.09 |
| 3 | Blip2 | 95.95 |
| 4 | Shikra-VQA | 82.72 |
| 5 | Otter | 82.47 |
| 6 | Qwen-VL-Chat | 79.98 |
| 7 | Shikra | 79.44 |
| 8 | LLAVA-1.5-13B | 72.11 |
| 9 | LLAVA-1.5-7B | 50.08 |
| 10 | MiniGPT-4 | 47.77 |

**Clean-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.97 |
| 2 | InstructBlip | 90.00 |
| 3 | Blip2 | 88.53 |
| 4 | Shikra-VQA | 82.72 |
| 5 | Qwen-VL-Chat | 75.73 |
| 6 | Shikra | 70.06 |
| 7 | LLAVA-1.5-13B | 67.89 |
| 8 | Otter | 64.91 |
| 9 | LLAVA-1.5-7B | 54.31 |
| 10 | MiniGPT-4 | 51.65 |

**Clean-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 68.38 |
| 2 | InstructBlip | 67.50 |
| 3 | XComposer-VL | 66.06 |
| 4 | Shikra-VQA | 64.86 |
| 5 | Otter | 64.37 |
| 6 | Qwen-VL-Chat | 63.28 |
| 7 | Shikra | 63.02 |
| 8 | LLAVA-1.5-13B | 58.84 |
| 9 | LLAVA-1.5-7B | 44.96 |
| 10 | MiniGPT-4 | 43.70 |

**Clean-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.94 |
| 2 | Blip2 | 66.98 |
| 3 | InstructBlip | 66.77 |
| 4 | Otter | 63.82 |
| 5 | Qwen-VL-Chat | 61.80 |
| 6 | Shikra | 57.21 |
| 7 | Shikra-VQA | 56.89 |
| 8 | LLAVA-1.5-13B | 55.68 |
| 9 | LLAVA-1.5-7B | 49.05 |
| 10 | MiniGPT-4 | 41.47 |

**Clean-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.27 |
| 2 | Shikra-VQA | 68.47 |
| 3 | InstructBlip | 67.84 |
| 4 | Shikra | 66.95 |
| 5 | Blip2 | 66.17 |
| 6 | Otter | 63.23 |
| 7 | Qwen-VL-Chat | 60.51 |
| 8 | LLAVA-1.5-13B | 59.00 |
| 9 | LLAVA-1.5-7B | 47.64 |
| 10 | MiniGPT-4 | 41.02 |

**Clean-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 88.50 |
| 2 | InstructBlip | 88.24 |
| 3 | Blip2 | 86.78 |
| 4 | Shikra-VQA | 74.14 |
| 5 | Shikra | 73.37 |
| 6 | Qwen-VL-Chat | 69.72 |
| 7 | LLAVA-1.5-13B | 58.36 |
| 8 | Otter | 53.12 |
| 9 | LLAVA-1.5-7B | 46.91 |
| 10 | MiniGPT-4 | 42.90 |

#### 2. Corruption

**Corru.-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.20 |
| 2 | InstructBlip | 71.30 |
| 3 | Blip2 | 68.69 |
| 4 | Shikra-VQA | 64.19 |
| 5 | Shikra | 63.47 |
| 6 | Otter | 61.39 |
| 7 | Qwen-VL | 59.27 |
| 8 | LLAVA-1.5-13B | 58.67 |
| 9 | LLAVA-1.5-7B | 52.36 |
| 10 | MiniGPT-4 | 43.02 |

**Corru.-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 95.86 |
| 2 | InstructBlip | 95.09 |
| 3 | Blip2 | 93.87 |
| 4 | Shikra-VQA | 88.72 |
| 5 | Shikra | 77.77 |
| 6 | Qwen-VL | 75.09 |
| 7 | Otter | 70.09 |
| 8 | LLAVA-1.5-13B | 68.34 |
| 9 | LLAVA-1.5-7B | 63.00 |
| 10 | MiniGPT-4 | 47.22 |

**Corru.-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.86 |
| 2 | InstructBlip | 62.42 |
| 3 | Otter | 60.71 |
| 4 | Shikra | 60.09 |
| 5 | Shikra-VQA | 59.58 |
| 6 | Blip2 | 58.16 |
| 7 | Qwen-VL | 57.95 |
| 8 | LLAVA-1.5-13B | 55.18 |
| 9 | LLAVA-1.5-7B | 52.34 |
| 10 | MiniGPT-4 | 38.70 |

**Corru.-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.48 |
| 2 | InstructBlip | 77.84 |
| 3 | Blip2 | 75.77 |
| 4 | Shikra | 73.72 |
| 5 | Shikra-VQA | 73.38 |
| 6 | Qwen-VL | 64.97 |
| 7 | Otter | 62.71 |
| 8 | LLAVA-1.5-13B | 62.23 |
| 9 | LLAVA-1.5-7B | 56.95 |
| 10 | MiniGPT-4 | 43.22 |

**Corru.-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | InstructBlip | 96.33 |
| 2 | Blip2 | 96.28 |
| 3 | XComposer-VL | 95.97 |
| 4 | Shikra-VQA | 83.01 |
| 5 | Shikra | 82.38 |
| 6 | LLAVA-1.5-13B | 74.63 |
| 7 | Qwen-VL | 73.55 |
| 8 | LLAVA-1.5-7B | 68.78 |
| 9 | Otter | 61.65 |
| 10 | MiniGPT-4 | 49.55 |

**Corru.-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.28 |
| 2 | InstructBlip | 62.79 |
| 3 | Qwen-VL | 60.44 |
| 4 | Blip2 | 58.67 |
| 5 | Otter | 57.24 |
| 6 | LLAVA-1.5-13B | 53.67 |
| 7 | Shikra-VQA | 52.28 |
| 8 | Shikra | 52.00 |
| 9 | LLAVA-1.5-7B | 47.87 |
| 10 | MiniGPT-4 | 40.91 |

**Corru.-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 84.87 |
| 2 | InstructBlip | 82.84 |
| 3 | Blip2 | 76.69 |
| 4 | Shikra-VQA | 70.59 |
| 5 | Shikra | 67.05 |
| 6 | Qwen-VL | 64.26 |
| 7 | LLAVA-1.5-13B | 60.88 |
| 8 | Otter | 59.87 |
| 9 | LLAVA-1.5-7B | 48.12 |
| 10 | MiniGPT-4 | 41.38 |

**Corru.-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 87.25 |
| 2 | InstructBlip | 79.23 |
| 3 | Blip2 | 78.36 |
| 4 | Qwen-VL | 64.89 |
| 5 | Shikra-VQA | 62.55 |
| 6 | Shikra | 61.68 |
| 7 | LLAVA-1.5-13B | 58.64 |
| 8 | LLAVA-1.5-7B | 56.36 |
| 9 | Otter | 53.59 |
| 10 | MiniGPT-4 | 38.80 |

**Corru.-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.61 |
| 2 | InstructBlip | 90.08 |
| 3 | Blip2 | 89.19 |
| 4 | Shikra-VQA | 80.55 |
| 5 | Shikra | 75.63 |
| 6 | Qwen-VL | 70.71 |
| 7 | LLAVA-1.5-13B | 67.46 |
| 8 | Otter | 65.06 |
| 9 | LLAVA-1.5-7B | 52.94 |
| 10 | MiniGPT-4 | 47.76 |

**Corru.-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 92.45 |
| 2 | InstructBlip | 92.23 |
| 3 | Blip2 | 91.15 |
| 4 | Shikra-VQA | 80.80 |
| 5 | Shikra | 77.69 |
| 6 | Otter | 73.42 |
| 7 | Qwen-VL | 72.01 |
| 8 | LLAVA-1.5-13B | 67.22 |
| 9 | LLAVA-1.5-7B | 48.84 |
| 10 | MiniGPT-4 | 48.04 |

**Corru.-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.04 |
| 2 | InstructBlip | 61.59 |
| 3 | Shikra-VQA | 61.09 |
| 4 | Shikra | 60.40 |
| 5 | Blip2 | 60.03 |
| 6 | Qwen-VL | 55.26 |
| 7 | LLAVA-1.5-13B | 53.67 |
| 8 | LLAVA-1.5-7B | 46.76 |
| 9 | Otter | 45.09 |
| 10 | MiniGPT-4 | 40.84 |

**Corru.-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 98.53 |
| 2 | Blip2 | 97.76 |
| 3 | InstructBlip | 93.75 |
| 4 | LLAVA-1.5-13B | 91.09 |
| 5 | Shikra-VQA | 90.17 |
| 6 | Shikra | 85.91 |
| 7 | Otter | 80.22 |
| 8 | Qwen-VL | 67.00 |
| 9 | MiniGPT-4 | 52.75 |
| 10 | LLAVA-1.5-7B | 52.27 |

**Corru.-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.24 |
| 2 | InstructBlip | 84.95 |
| 3 | Shikra-VQA | 83.56 |
| 4 | Blip2 | 81.56 |
| 5 | Shikra | 80.56 |
| 6 | Qwen-VL | 70.55 |
| 7 | Otter | 66.64 |
| 8 | LLAVA-1.5-13B | 64.94 |
| 9 | LLAVA-1.5-7B | 56.16 |
| 10 | MiniGPT-4 | 46.55 |

**Corru.-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.80 |
| 2 | InstructBlip | 76.19 |
| 3 | Blip2 | 71.72 |
| 4 | LLAVA-1.5-13B | 65.33 |
| 5 | Shikra-VQA | 64.73 |
| 6 | Shikra | 63.94 |
| 7 | Qwen-VL | 63.55 |
| 8 | Otter | 49.73 |
| 9 | LLAVA-1.5-7B | 47.92 |
| 10 | MiniGPT-4 | 39.65 |

**Corru.-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 97.06 |
| 2 | InstructBlip | 95.86 |
| 3 | Blip2 | 95.69 |
| 4 | Shikra-VQA | 83.03 |
| 5 | Otter | 82.84 |
| 6 | Shikra | 79.44 |
| 7 | LLAVA-1.5-13B | 71.47 |
| 8 | Qwen-VL | 71.43 |
| 9 | LLAVA-1.5-7B | 50.67 |
| 10 | MiniGPT-4 | 48.66 |

**Corru.-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.81 |
| 2 | InstructBlip | 89.76 |
| 3 | Blip2 | 88.05 |
| 4 | Shikra-VQA | 82.76 |
| 5 | Shikra | 71.49 |
| 6 | Qwen-VL | 67.78 |
| 7 | LLAVA-1.5-13B | 67.23 |
| 8 | Otter | 66.18 |
| 9 | LLAVA-1.5-7B | 54.09 |
| 10 | MiniGPT-4 | 51.02 |

**Corru.-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 94.06 |
| 2 | InstructBlip | 90.54 |
| 3 | XComposer-VL | 87.19 |
| 4 | Shikra-VQA | 80.80 |
| 5 | Otter | 78.70 |
| 6 | Shikra | 77.17 |
| 7 | Qwen-VL | 69.66 |
| 8 | LLAVA-1.5-13B | 68.58 |
| 9 | LLAVA-1.5-7B | 49.10 |
| 10 | MiniGPT-4 | 44.63 |

**Corru.-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.34 |
| 2 | InstructBlip | 65.06 |
| 3 | Blip2 | 64.86 |
| 4 | Otter | 61.14 |
| 5 | Qwen-VL | 60.34 |
| 6 | Shikra | 56.61 |
| 7 | Shikra-VQA | 55.81 |
| 8 | LLAVA-1.5-13B | 53.65 |
| 9 | LLAVA-1.5-7B | 47.67 |
| 10 | MiniGPT-4 | 40.81 |

**Corru.-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.53 |
| 2 | Shikra-VQA | 68.25 |
| 3 | InstructBlip | 67.48 |
| 4 | Shikra | 67.19 |
| 5 | Blip2 | 64.69 |
| 6 | Otter | 62.99 |
| 7 | Qwen-VL | 60.50 |
| 8 | LLAVA-1.5-13B | 58.50 |
| 9 | LLAVA-1.5-7B | 47.12 |
| 10 | MiniGPT-4 | 40.53 |

**Corru.-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 88.24 |
| 2 | InstructBlip | 88.09 |
| 3 | Blip2 | 86.25 |
| 4 | Shikra-VQA | 74.14 |
| 5 | Shikra | 73.07 |
| 6 | Qwen-VL | 70.78 |
| 7 | LLAVA-1.5-13B | 57.70 |
| 8 | Otter | 56.14 |
| 9 | LLAVA-1.5-7B | 46.76 |
| 10 | MiniGPT-4 | 42.86 |

#### 3. Print Attacking

**Print Attacking-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 67.09 |
| 2 | Blip2 | 52.55 |
| 3 | LLAVA-1.5-13B | 46.74 |
| 4 | LLAVA-1.5-7B | 45.24 |
| 5 | Shikra | 44.84 |
| 6 | Shikra-VQA | 44.36 |
| 7 | Qwen-VL | 44.30 |
| 8 | InstructBlip | 43.06 |
| 9 | MiniGPT-4 | 40.26 |
| 10 | Otter | 37.97 |

**Print Attacking-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.24 |
| 2 | Blip2 | 79.13 |
| 3 | InstructBlip | 75.94 |
| 4 | Shikra-VQA | 71.36 |
| 5 | Qwen-VL | 65.26 |
| 6 | Shikra | 58.82 |
| 7 | LLAVA-1.5-13B | 57.66 |
| 8 | LLAVA-1.5-7B | 51.62 |
| 9 | MiniGPT-4 | 47.07 |
| 10 | Otter | 39.77 |

**Print Attacking-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 55.08 |
| 2 | LLAVA-1.5-7B | 43.55 |
| 3 | Qwen-VL | 40.72 |
| 4 | LLAVA-1.5-13B | 40.09 |
| 5 | Blip2 | 39.71 |
| 6 | Shikra | 39.01 |
| 7 | MiniGPT-4 | 37.40 |
| 8 | Shikra-VQA | 36.50 |
| 9 | InstructBlip | 32.17 |
| 10 | Otter | 31.48 |

**Print Attacking-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 67.14 |
| 2 | Blip2 | 59.62 |
| 3 | Shikra | 55.70 |
| 4 | InstructBlip | 54.52 |
| 5 | Shikra-VQA | 54.25 |
| 6 | Qwen-VL | 48.66 |
| 7 | LLAVA-1.5-13B | 47.49 |
| 8 | LLAVA-1.5-7B | 42.86 |
| 9 | MiniGPT-4 | 42.22 |
| 10 | Otter | 38.12 |

**Print Attacking-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 89.81 |
| 2 | Blip2 | 84.85 |
| 3 | InstructBlip | 76.78 |
| 4 | Shikra-VQA | 68.28 |
| 5 | LLAVA-1.5-13B | 65.98 |
| 6 | Shikra | 64.47 |
| 7 | Qwen-VL | 64.29 |
| 8 | LLAVA-1.5-7B | 56.39 |
| 9 | MiniGPT-4 | 49.58 |
| 10 | Otter | 43.89 |

**Print Attacking-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 59.72 |
| 2 | Blip2 | 45.05 |
| 3 | Qwen-VL | 44.95 |
| 4 | LLAVA-1.5-7B | 42.23 |
| 5 | LLAVA-1.5-13B | 40.95 |
| 6 | InstructBlip | 38.26 |
| 7 | Shikra-VQA | 38.25 |
| 8 | MiniGPT-4 | 37.27 |
| 9 | Shikra | 37.02 |
| 10 | Otter | 36.16 |

**Print Attacking-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 76.43 |
| 2 | Blip2 | 66.53 |
| 3 | InstructBlip | 62.14 |
| 4 | Qwen-VL | 51.56 |
| 5 | LLAVA-1.5-13B | 49.63 |
| 6 | Shikra-VQA | 49.02 |
| 7 | Shikra | 47.23 |
| 8 | LLAVA-1.5-7B | 42.22 |
| 9 | MiniGPT-4 | 39.66 |
| 10 | Otter | 34.67 |

**Print Attacking-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.48 |
| 2 | Blip2 | 57.30 |
| 3 | Qwen-VL | 52.34 |
| 4 | LLAVA-1.5-7B | 45.11 |
| 5 | InstructBlip | 43.36 |
| 6 | Shikra | 43.13 |
| 7 | Shikra-VQA | 42.01 |
| 8 | LLAVA-1.5-13B | 41.59 |
| 9 | MiniGPT-4 | 38.64 |
| 10 | Otter | 33.95 |

**Print Attacking-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 83.60 |
| 2 | Blip2 | 81.67 |
| 3 | InstructBlip | 75.94 |
| 4 | Qwen-VL | 60.43 |
| 5 | Shikra-VQA | 58.61 |
| 6 | LLAVA-1.5-13B | 56.20 |
| 7 | Shikra | 56.08 |
| 8 | MiniGPT-4 | 46.70 |
| 9 | LLAVA-1.5-7B | 45.20 |
| 10 | Otter | 38.39 |

**Print Attacking-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 82.18 |
| 2 | Blip2 | 82.16 |
| 3 | InstructBlip | 74.20 |
| 4 | Qwen-VL | 60.29 |
| 5 | Shikra | 54.03 |
| 6 | LLAVA-1.5-13B | 53.77 |
| 7 | Shikra-VQA | 52.86 |
| 8 | MiniGPT-4 | 46.45 |
| 9 | LLAVA-1.5-7B | 42.07 |
| 10 | Otter | 39.52 |

**Print Attacking-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 60.37 |
| 2 | LLAVA-1.5-7B | 42.53 |
| 3 | Blip2 | 42.20 |
| 4 | LLAVA-1.5-13B | 41.30 |
| 5 | MiniGPT-4 | 39.18 |
| 6 | Shikra | 38.98 |
| 7 | Qwen-VL | 37.56 |
| 8 | Shikra-VQA | 37.47 |
| 9 | InstructBlip | 35.75 |
| 10 | Otter | 32.18 |

**Print Attacking-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 93.97 |
| 2 | InstructBlip | 80.44 |
| 3 | XComposer-VL | 79.56 |
| 4 | LLAVA-1.5-13B | 74.03 |
| 5 | Shikra-VQA | 63.62 |
| 6 | Shikra | 62.06 |
| 7 | Qwen-VL | 56.05 |
| 8 | MiniGPT-4 | 51.87 |
| 9 | LLAVA-1.5-7B | 51.33 |
| 10 | Otter | 39.31 |

**Print Attacking-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.59 |
| 2 | Blip2 | 67.94 |
| 3 | InstructBlip | 67.72 |
| 4 | Shikra-VQA | 65.00 |
| 5 | Shikra | 61.51 |
| 6 | Qwen-VL | 59.16 |
| 7 | LLAVA-1.5-13B | 58.33 |
| 8 | LLAVA-1.5-7B | 53.00 |
| 9 | MiniGPT-4 | 45.37 |
| 10 | Otter | 39.91 |

**Print Attacking-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 62.21 |
| 2 | Blip2 | 52.98 |
| 3 | InstructBlip | 48.42 |
| 4 | LLAVA-1.5-13B | 47.80 |
| 5 | Qwen-VL | 44.81 |
| 6 | Shikra-VQA | 42.09 |
| 7 | LLAVA-1.5-7B | 41.75 |
| 8 | Shikra | 41.07 |
| 9 | MiniGPT-4 | 38.03 |
| 10 | Otter | 28.77 |

**Print Attacking-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 90.07 |
| 2 | XComposer-VL | 87.80 |
| 3 | InstructBlip | 82.43 |
| 4 | Qwen-VL | 65.20 |
| 5 | LLAVA-1.5-13B | 60.13 |
| 6 | Shikra-VQA | 55.53 |
| 7 | Shikra | 53.98 |
| 8 | MiniGPT-4 | 48.02 |
| 9 | Otter | 46.92 |
| 10 | LLAVA-1.5-7B | 44.73 |

**Print Attacking-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 83.55 |
| 2 | XComposer-VL | 81.03 |
| 3 | InstructBlip | 76.06 |
| 4 | Qwen-VL | 61.06 |
| 5 | LLAVA-1.5-13B | 56.84 |
| 6 | Shikra-VQA | 55.02 |
| 7 | MiniGPT-4 | 49.61 |
| 8 | Shikra | 46.96 |
| 9 | LLAVA-1.5-7B | 44.26 |
| 10 | Otter | 38.32 |

**Print Attacking-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 60.60 |
| 2 | Blip2 | 52.84 |
| 3 | LLAVA-1.5-13B | 45.26 |
| 4 | Qwen-VL | 43.55 |
| 5 | LLAVA-1.5-7B | 42.23 |
| 6 | Shikra | 41.06 |
| 7 | MiniGPT-4 | 40.32 |
| 8 | Shikra-VQA | 40.12 |
| 9 | InstructBlip | 40.03 |
| 10 | Otter | 33.86 |

**Print Attacking-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 64.99 |
| 2 | Shikra-VQA | 55.50 |
| 3 | Blip2 | 54.94 |
| 4 | Shikra | 54.16 |
| 5 | LLAVA-1.5-13B | 51.16 |
| 6 | InstructBlip | 49.98 |
| 7 | Qwen-VL | 49.17 |
| 8 | LLAVA-1.5-7B | 44.34 |
| 9 | MiniGPT-4 | 39.23 |
| 10 | Otter | 37.94 |

**Print Attacking-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.66 |
| 2 | Blip2 | 74.47 |
| 3 | InstructBlip | 63.59 |
| 4 | Qwen-VL | 58.40 |
| 5 | Shikra | 51.75 |
| 6 | Shikra-VQA | 51.10 |
| 7 | LLAVA-1.5-13B | 49.48 |
| 8 | LLAVA-1.5-7B | 43.40 |
| 9 | MiniGPT-4 | 41.70 |
| 10 | Otter | 35.56 |

#### 4. Adversarial Attacking

**Adversarial Attacking-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 58.17 |
| 2 | Shikra | 57.41 |
| 3 | Shikra-VQA | 57.26 |
| 4 | LLAVA-1.5-13B | 57.17 |
| 5 | Otter | 56.61 |
| 6 | LLAVA-1.5-7B | 51.91 |
| 7 | MiniGPT-4 | 36.95 |
| 8 | Blip2 | 33.33 |
| 9 | InstructBlip | 31.97 |
| 10 | XComposer-VL | 31.75 |

**Adversarial Attacking-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 85.84 |
| 2 | Qwen-VL | 79.11 |
| 3 | Shikra | 77.17 |
| 4 | LLAVA-1.5-13B | 66.37 |
| 5 | Otter | 65.81 |
| 6 | LLAVA-1.5-7B | 60.84 |
| 7 | MiniGPT-4 | 37.59 |
| 8 | Blip2 | 34.64 |
| 9 | XComposer-VL | 33.88 |
| 10 | InstructBlip | 32.85 |

**Adversarial Attacking-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 55.79 |
| 2 | Shikra-VQA | 53.57 |
| 3 | Otter | 53.05 |
| 4 | LLAVA-1.5-13B | 53.03 |
| 5 | Shikra | 52.61 |
| 6 | LLAVA-1.5-7B | 48.36 |
| 7 | MiniGPT-4 | 38.22 |
| 8 | XComposer-VL | 34.45 |
| 9 | Blip2 | 32.04 |
| 10 | InstructBlip | 30.93 |

**Adversarial Attacking-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 71.47 |
| 2 | Shikra | 67.66 |
| 3 | Qwen-VL | 64.80 |
| 4 | Otter | 61.44 |
| 5 | LLAVA-1.5-13B | 59.83 |
| 6 | LLAVA-1.5-7B | 56.15 |
| 7 | MiniGPT-4 | 36.95 |
| 8 | Blip2 | 36.14 |
| 9 | InstructBlip | 35.02 |
| 10 | XComposer-VL | 30.86 |

**Adversarial Attacking-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 78.40 |
| 2 | Shikra | 77.85 |
| 3 | Qwen-VL | 76.34 |
| 4 | LLAVA-1.5-13B | 71.92 |
| 5 | LLAVA-1.5-7B | 68.62 |
| 6 | Otter | 56.87 |
| 7 | MiniGPT-4 | 35.73 |
| 8 | XComposer-VL | 35.33 |
| 9 | Blip2 | 34.53 |
| 10 | InstructBlip | 32.03 |

**Adversarial Attacking-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 60.07 |
| 2 | Otter | 51.86 |
| 3 | LLAVA-1.5-13B | 51.81 |
| 4 | Shikra-VQA | 48.45 |
| 5 | Shikra | 47.75 |
| 6 | LLAVA-1.5-7B | 47.32 |
| 7 | MiniGPT-4 | 38.38 |
| 8 | Blip2 | 37.95 |
| 9 | InstructBlip | 34.29 |
| 10 | XComposer-VL | 30.89 |

**Adversarial Attacking-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 69.92 |
| 2 | Shikra-VQA | 68.32 |
| 3 | Shikra | 65.73 |
| 4 | LLAVA-1.5-13B | 58.83 |
| 5 | Otter | 57.58 |
| 6 | LLAVA-1.5-7B | 47.05 |
| 7 | MiniGPT-4 | 38.05 |
| 8 | InstructBlip | 36.65 |
| 9 | XComposer-VL | 36.41 |
| 10 | Blip2 | 36.18 |

**Adversarial Attacking-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 64.77 |
| 2 | Shikra | 60.02 |
| 3 | Shikra-VQA | 59.70 |
| 4 | LLAVA-1.5-13B | 55.53 |
| 5 | LLAVA-1.5-7B | 54.95 |
| 6 | Otter | 52.26 |
| 7 | MiniGPT-4 | 37.48 |
| 8 | Blip2 | 34.72 |
| 9 | InstructBlip | 33.52 |
| 10 | XComposer-VL | 33.09 |

**Adversarial Attacking-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 75.26 |
| 2 | Qwen-VL | 73.72 |
| 3 | Shikra | 71.07 |
| 4 | LLAVA-1.5-13B | 64.95 |
| 5 | Otter | 64.39 |
| 6 | LLAVA-1.5-7B | 52.30 |
| 7 | MiniGPT-4 | 36.94 |
| 8 | XComposer-VL | 33.16 |
| 9 | Blip2 | 32.48 |
| 10 | InstructBlip | 32.13 |

**Adversarial Attacking-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 79.59 |
| 2 | Shikra-VQA | 76.06 |
| 3 | Shikra | 73.53 |
| 4 | Otter | 67.16 |
| 5 | LLAVA-1.5-13B | 64.44 |
| 6 | LLAVA-1.5-7B | 47.41 |
| 7 | MiniGPT-4 | 37.74 |
| 8 | Blip2 | 35.84 |
| 9 | InstructBlip | 34.02 |
| 10 | XComposer-VL | 34.00 |

**Adversarial Attacking-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra | 56.95 |
| 2 | Shikra-VQA | 56.80 |
| 3 | LLAVA-1.5-13B | 51.43 |
| 4 | Qwen-VL | 48.11 |
| 5 | LLAVA-1.5-7B | 45.89 |
| 6 | Otter | 42.82 |
| 7 | MiniGPT-4 | 37.14 |
| 8 | Blip2 | 34.16 |
| 9 | InstructBlip | 33.14 |
| 10 | XComposer-VL | 30.78 |

**Adversarial Attacking-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | LLAVA-1.5-13B | 89.35 |
| 2 | Shikra-VQA | 89.06 |
| 3 | Shikra | 85.34 |
| 4 | Otter | 82.25 |
| 5 | Qwen-VL | 79.18 |
| 6 | LLAVA-1.5-7B | 51.95 |
| 7 | MiniGPT-4 | 48.38 |
| 8 | Blip2 | 39.48 |
| 9 | InstructBlip | 38.51 |
| 10 | XComposer-VL | 38.09 |

**Adversarial Attacking-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 82.04 |
| 2 | Shikra | 79.86 |
| 3 | Qwen-VL | 73.91 |
| 4 | LLAVA-1.5-13B | 63.13 |
| 5 | Otter | 60.44 |
| 6 | LLAVA-1.5-7B | 55.70 |
| 7 | MiniGPT-4 | 41.07 |
| 8 | Blip2 | 35.95 |
| 9 | InstructBlip | 33.52 |
| 10 | XComposer-VL | 32.80 |

**Adversarial Attacking-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | LLAVA-1.5-13B | 63.06 |
| 2 | Shikra-VQA | 61.19 |
| 3 | Shikra | 60.24 |
| 4 | Qwen-VL | 59.50 |
| 5 | LLAVA-1.5-7B | 45.97 |
| 6 | Otter | 45.50 |
| 7 | MiniGPT-4 | 37.47 |
| 8 | Blip2 | 36.81 |
| 9 | InstructBlip | 36.36 |
| 10 | XComposer-VL | 34.77 |

**Adversarial Attacking-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 81.78 |
| 2 | Shikra-VQA | 80.44 |
| 3 | Otter | 78.00 |
| 4 | Shikra | 77.66 |
| 5 | LLAVA-1.5-13B | 70.94 |
| 6 | LLAVA-1.5-7B | 49.30 |
| 7 | MiniGPT-4 | 37.61 |
| 8 | Blip2 | 35.23 |
| 9 | XComposer-VL | 34.38 |
| 10 | InstructBlip | 34.01 |

**Adversarial Attacking-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 79.05 |
| 2 | Qwen-VL | 77.26 |
| 3 | Shikra | 69.98 |
| 4 | LLAVA-1.5-13B | 65.33 |
| 5 | Otter | 62.69 |
| 6 | LLAVA-1.5-7B | 52.62 |
| 7 | MiniGPT-4 | 37.84 |
| 8 | Blip2 | 37.62 |
| 9 | XComposer-VL | 36.97 |
| 10 | InstructBlip | 36.88 |

**Adversarial Attacking-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 79.12 |
| 2 | Shikra | 77.62 |
| 3 | Qwen-VL | 76.50 |
| 4 | Otter | 74.91 |
| 5 | LLAVA-1.5-13B | 66.43 |
| 6 | XComposer-VL | 53.21 |
| 7 | InstructBlip | 50.03 |
| 8 | Blip2 | 47.80 |
| 9 | LLAVA-1.5-7B | 46.78 |
| 10 | MiniGPT-4 | 36.78 |

**Adversarial Attacking-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 56.80 |
| 2 | Otter | 53.47 |
| 3 | Shikra | 51.05 |
| 4 | Shikra-VQA | 50.30 |
| 5 | LLAVA-1.5-13B | 50.01 |
| 6 | LLAVA-1.5-7B | 44.85 |
| 7 | MiniGPT-4 | 37.98 |
| 8 | XComposer-VL | 37.20 |
| 9 | InstructBlip | 36.20 |
| 10 | Blip2 | 35.87 |

**Adversarial Attacking-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 65.41 |
| 2 | Shikra | 64.50 |
| 3 | Otter | 60.30 |
| 4 | Qwen-VL | 60.09 |
| 5 | LLAVA-1.5-13B | 56.28 |
| 6 | LLAVA-1.5-7B | 46.53 |
| 7 | MiniGPT-4 | 36.91 |
| 8 | Blip2 | 35.05 |
| 9 | XComposer-VL | 34.71 |
| 10 | InstructBlip | 34.45 |

**Adversarial Attacking-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 69.63 |
| 2 | Shikra | 69.57 |
| 3 | Shikra-VQA | 69.05 |
| 4 | LLAVA-1.5-13B | 54.28 |
| 5 | Otter | 53.12 |
| 6 | LLAVA-1.5-7B | 46.08 |
| 7 | MiniGPT-4 | 37.54 |
| 8 | XComposer-VL | 37.23 |
| 9 | Blip2 | 36.83 |
| 10 | InstructBlip | 35.81 |

## Related projects🔗
