# 🎨Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

🎉If you like our project, please give us a star ⭐ on GitHub for the latest updates.

## Overview🔍

*Figure 1. Overview of the automatic pipeline in Dysca for generating VQAs, cleaning VQAs, and evaluating LVLMs.*

*Figure 2. The available subtasks of Dysca.*

**Abstract.** Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most of them construct questions from images drawn from existing datasets, which introduces a risk of data leakage. Moreover, they evaluate LVLMs only on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose Dysca, a dynamic and scalable benchmark that evaluates LVLMs on synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions, and the corresponding answers. We consider 51 image styles and evaluate perception capability on 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking, and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false, and Free-form). Thanks to its generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios are easily added. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs.
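To make the generative paradigm concrete, here is a minimal sketch of how a diffusion model can be paired with a rule-based question template, assuming the Hugging Face `diffusers` library; the attribute values, prompt template, question template, and checkpoint name are illustrative assumptions, not the actual Dysca pipeline:

```python
# Minimal sketch: synthesize a novel image with Stable Diffusion and
# derive the question and ground-truth answer from the prompt attributes.
# The attributes and templates below are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

style, animal, color = "oil painting", "dog", "brown"
prompt = f"an {style} of a {color} {animal}"
image = pipe(prompt).images[0]  # a newly generated image, so no data leakage

# Because the prompt is assembled from known attributes, the ground truth
# for the question comes for free.
question = f"What is the color of the {animal} in the image?"
choices = ["brown", "black", "white", "gray"]
answer = color
image.save("sample.png")
```

Because every question is derived from attributes the generator controls, new subtasks or styles only require extending the attribute lists, which is what makes such a benchmark scalable.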

## Download 📩

We provide two ways to download the benchmark:

## Comparison with Existing Benchmarks📊

Comparisons between existing LVLM benchmarks. '⍻' indicates that the benchmark includes both newly collected images/annotations and images/annotations gathered from existing datasets. '*' The scale of our released benchmark is 617K; however, Dysca can generate an unlimited amount of data for testing.

| Benchmark | #Evaluation Data Scale | #Perceptual Tasks | Automatic Annotation | Collecting from Existing Datasets | Question Type | Automatic Evaluation |
|-----------|------------------------|-------------------|----------------------|-----------------------------------|---------------|----------------------|
| LLaVA-Bench | 0.15K | - | × | | Free-form | |
| MME | 2.3K | 10 | × | | True-or-false | |
| LVLM-eHub | - | 3 | × | | Free-form | × |
| tiny-LVLM-eHub | 2.1K | 3 | × | | Free-form | |
| SEED-Bench | 19K | 8 | × | | Multi-choices | |
| MMBench | 2.9K | 12 | × | | Multi-choices | |
| TouchStone | 0.9K | 10 | × | | Free-form | |
| REFORM-EVAL | 50K | 7 | × | | Multi-choices | |
| MM-BigBench | 30K | 6 | × | | Multi-choices | |
| MM-VET | 0.2K | 4 | | | Free-form | |
| MLLM-Bench | 0.42K | 7 | × | | Free-form | |
| SEED-Bench2 | 24K | 10 | × | | Multi-choices | |
| BenchLMM | 2.4K | 15 | × | × | Free-form | |
| JourneyDB | 5.4K | 2 | | | Free-form, Multi-choices | |
| Dysca (Ours) | 617K* | 20 | ✓ | × | Free-form, Multi-choices, True-or-false | ✓ |

## Examples of Dysca📸

Here are some examples of the images, prompts, questions, and ground-truth answers in Dysca. All images are generated by diffusion models.
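As a rough illustration only, a single generated item could be represented by a record like the one below; the field names and values are hypothetical and do not reflect the released file format:

```python
# Hypothetical shape of one Dysca-style VQA item; the schema is an
# illustrative assumption, not the repository's actual data format.
sample = {
    "image": "images/00001.png",       # synthesized by the diffusion model
    "prompt": "an oil painting of a brown dog",
    "subtask": "Color",                # one of the 20 perception subtasks
    "scenario": "Clean",               # or Corruption / Print / Adversarial Attacking
    "question_type": "Multi-choices",  # or True-or-false / Free-form
    "question": "What is the color of the dog in the image?",
    "choices": ["brown", "black", "white", "gray"],
    "answer": "brown",
}
```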









## Leaderboards🏆

### A. Comprehensive Performance Leaderboards under each scenario

**Clean**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 217.07 |
| 🥈 | InstructBlip | 202.23 |
| 🥉 | Blip2 | 198.41 |
| 4 | Qwen-VL-Chat | 188.05 |
| 5 | Shikra | 187.57 |
| 6 | LLAVA-1.5-13B | 177.83 |
| 7 | Shikra-VQA | 175.93 |
| 8 | Otter | 166.47 |
| 9 | LLAVA-1.5-7B | 155.26 |
| 10 | MiniGPT-4 | 127.91 |

**Corruption**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 214.88 |
| 🥈 | InstructBlip | 201.06 |
| 🥉 | Blip2 | 196.47 |
| 4 | Shikra | 187.16 |
| 5 | Qwen-VL-Chat | 181.74 |
| 6 | LLAVA-1.5-13B | 176.19 |
| 7 | Shikra-VQA | 175.41 |
| 8 | Otter | 166.26 |
| 9 | LLAVA-1.5-7B | 154.15 |
| 10 | MiniGPT-4 | 129.04 |

**Print Attacking**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | XComposer-VL | 191.25 |
| 🥈 | Blip2 | 168.61 |
| 🥉 | Qwen-VL-Chat | 154.73 |
| 4 | InstructBlip | 153.69 |
| 5 | Shikra | 147.23 |
| 6 | LLAVA-1.5-13B | 152.10 |
| 7 | LLAVA-1.5-7B | 139.34 |
| 8 | Shikra-VQA | 131.16 |
| 9 | MiniGPT-4 | 128.61 |
| 10 | Otter | 111.31 |

**Adversarial Attacking**

| Rank | Model | Score |
|------|-------|-------|
| 🥇 | Qwen-VL-Chat | 181.18 |
| 🥈 | Shikra | 177.92 |
| 🥉 | LLAVA-1.5-13B | 170.53 |
| 4 | Shikra-VQA | 165.49 |
| 5 | Otter | 155.81 |
| 6 | LLAVA-1.5-7B | 150.23 |
| 7 | InstructBlip | 99.38 |
| 8 | MiniGPT-4 | 98.86 |
| 9 | Blip2 | 97.96 |
| 10 | XComposer-VL | 91.56 |

### B. Fine-grained Leaderboards under each scenario

#### 1. Clean

**Clean-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 80.13 |
| 2 | InstructBlip | 72.36 |
| 3 | Blip2 | 70.03 |
| 4 | Shikra-VQA | 64.03 |
| 5 | Shikra | 62.97 |
| 6 | Otter | 62.22 |
| 7 | Qwen-VL-Chat | 61.13 |
| 8 | LLAVA-1.5-13B | 60.17 |
| 9 | LLAVA-1.5-7B | 53.81 |
| 10 | MiniGPT-4 | 42.52 |

**Clean-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 95.89 |
| 2 | InstructBlip | 95.44 |
| 3 | Blip2 | 94.64 |
| 4 | Shikra-VQA | 88.25 |
| 5 | Qwen-VL-Chat | 79.75 |
| 6 | Shikra | 77.65 |
| 7 | Otter | 69.10 |
| 8 | LLAVA-1.5-13B | 68.82 |
| 9 | LLAVA-1.5-7B | 62.77 |
| 10 | MiniGPT-4 | 46.98 |

**Clean-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 75.55 |
| 2 | InstructBlip | 63.42 |
| 3 | Otter | 63.04 |
| 4 | Shikra-VQA | 60.10 |
| 5 | Blip2 | 59.14 |
| 6 | Shikra | 58.55 |
| 7 | Qwen-VL-Chat | 58.27 |
| 8 | LLAVA-1.5-13B | 56.69 |
| 9 | LLAVA-1.5-7B | 53.48 |
| 10 | MiniGPT-4 | 40.23 |

**Clean-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 79.92 |
| 2 | InstructBlip | 77.84 |
| 3 | Blip2 | 76.88 |
| 4 | Shikra | 73.55 |
| 5 | Shikra-VQA | 73.07 |
| 6 | Qwen-VL-Chat | 65.12 |
| 7 | LLAVA-1.5-13B | 62.86 |
| 8 | Otter | 60.95 |
| 9 | LLAVA-1.5-7B | 57.74 |
| 10 | MiniGPT-4 | 42.06 |

**Clean-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | InstructBlip | 96.90 |
| 2 | Blip2 | 96.64 |
| 3 | XComposer-VL | 96.43 |
| 4 | Shikra-VQA | 82.85 |
| 5 | Shikra | 80.17 |
| 6 | Qwen-VL-Chat | 75.84 |
| 7 | LLAVA-1.5-13B | 75.12 |
| 8 | LLAVA-1.5-7B | 69.74 |
| 9 | Otter | 62.50 |
| 10 | MiniGPT-4 | 52.81 |

**Clean-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 75.00 |
| 2 | Qwen-VL-Chat | 63.11 |
| 3 | InstructBlip | 62.66 |
| 4 | Otter | 59.56 |
| 5 | Blip2 | 59.50 |
| 6 | LLAVA-1.5-13B | 53.88 |
| 7 | Shikra-VQA | 53.31 |
| 8 | Shikra | 52.84 |
| 9 | LLAVA-1.5-7B | 47.80 |
| 10 | MiniGPT-4 | 38.98 |

**Clean-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 85.96 |
| 2 | InstructBlip | 83.80 |
| 3 | Blip2 | 78.09 |
| 4 | Shikra-VQA | 69.83 |
| 5 | Qwen-VL-Chat | 69.58 |
| 6 | Shikra | 67.43 |
| 7 | LLAVA-1.5-13B | 61.20 |
| 8 | Otter | 58.33 |
| 9 | LLAVA-1.5-7B | 48.80 |
| 10 | MiniGPT-4 | 40.86 |

**Clean-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 87.91 |
| 2 | InstructBlip | 79.91 |
| 3 | Blip2 | 79.23 |
| 4 | Qwen-VL-Chat | 68.71 |
| 5 | Shikra-VQA | 62.93 |
| 6 | Shikra | 61.81 |
| 7 | LLAVA-1.5-13B | 59.04 |
| 8 | LLAVA-1.5-7B | 56.04 |
| 9 | Otter | 52.97 |
| 10 | MiniGPT-4 | 39.13 |

**Clean-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 91.27 |
| 2 | InstructBlip | 90.28 |
| 3 | Blip2 | 89.78 |
| 4 | Shikra-VQA | 80.23 |
| 5 | Shikra | 75.84 |
| 6 | Qwen-VL-Chat | 71.37 |
| 7 | LLAVA-1.5-13B | 68.16 |
| 8 | Otter | 63.22 |
| 9 | LLAVA-1.5-7B | 52.87 |
| 10 | MiniGPT-4 | 47.72 |

**Clean-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 93.19 |
| 2 | InstructBlip | 92.62 |
| 3 | Blip2 | 91.84 |
| 4 | Shikra-VQA | 80.05 |
| 5 | Shikra | 77.57 |
| 6 | Qwen-VL-Chat | 76.11 |
| 7 | Otter | 73.96 |
| 8 | LLAVA-1.5-13B | 67.61 |
| 9 | LLAVA-1.5-7B | 48.84 |
| 10 | MiniGPT-4 | 47.95 |

**Clean-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.12 |
| 2 | InstructBlip | 62.34 |
| 3 | Shikra-VQA | 61.42 |
| 4 | Blip2 | 61.13 |
| 5 | Shikra | 61.04 |
| 6 | LLAVA-1.5-13B | 54.57 |
| 7 | Qwen-VL-Chat | 51.28 |
| 8 | LLAVA-1.5-7B | 46.75 |
| 9 | Otter | 44.58 |
| 10 | MiniGPT-4 | 41.73 |

**Clean-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 98.83 |
| 2 | Blip2 | 97.12 |
| 3 | InstructBlip | 93.86 |
| 4 | LLAVA-1.5-13B | 91.12 |
| 5 | Shikra-VQA | 90.40 |
| 6 | Shikra | 85.50 |
| 7 | Qwen-VL-Chat | 78.13 |
| 8 | Otter | 78.09 |
| 9 | MiniGPT-4 | 52.86 |
| 10 | LLAVA-1.5-7B | 51.96 |

**Clean-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.08 |
| 2 | InstructBlip | 85.59 |
| 3 | Shikra-VQA | 83.53 |
| 4 | Blip2 | 82.59 |
| 5 | Shikra | 80.51 |
| 6 | Qwen-VL-Chat | 74.76 |
| 7 | Otter | 67.11 |
| 8 | LLAVA-1.5-13B | 64.81 |
| 9 | LLAVA-1.5-7B | 56.81 |
| 10 | MiniGPT-4 | 46.46 |

**Clean-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.56 |
| 2 | InstructBlip | 76.31 |
| 3 | Blip2 | 73.16 |
| 4 | LLAVA-1.5-13B | 66.50 |
| 5 | Qwen-VL-Chat | 65.19 |
| 6 | Shikra-VQA | 64.56 |
| 7 | Shikra | 64.53 |
| 8 | Otter | 50.44 |
| 9 | LLAVA-1.5-7B | 47.27 |
| 10 | MiniGPT-4 | 38.98 |

**Clean-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 97.27 |
| 2 | InstructBlip | 96.09 |
| 3 | Blip2 | 95.95 |
| 4 | Shikra-VQA | 82.72 |
| 5 | Otter | 82.47 |
| 6 | Qwen-VL-Chat | 79.98 |
| 7 | Shikra | 79.44 |
| 8 | LLAVA-1.5-13B | 72.11 |
| 9 | LLAVA-1.5-7B | 50.08 |
| 10 | MiniGPT-4 | 47.77 |

**Clean-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.97 |
| 2 | InstructBlip | 90.00 |
| 3 | Blip2 | 88.53 |
| 4 | Shikra-VQA | 82.72 |
| 5 | Qwen-VL-Chat | 75.73 |
| 6 | Shikra | 70.06 |
| 7 | LLAVA-1.5-13B | 67.89 |
| 8 | Otter | 64.91 |
| 9 | LLAVA-1.5-7B | 54.31 |
| 10 | MiniGPT-4 | 51.65 |

**Clean-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 68.38 |
| 2 | InstructBlip | 67.50 |
| 3 | XComposer-VL | 66.06 |
| 4 | Shikra-VQA | 64.86 |
| 5 | Otter | 64.37 |
| 6 | Qwen-VL-Chat | 63.28 |
| 7 | Shikra | 63.02 |
| 8 | LLAVA-1.5-13B | 58.84 |
| 9 | LLAVA-1.5-7B | 44.96 |
| 10 | MiniGPT-4 | 43.70 |

**Clean-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.94 |
| 2 | Blip2 | 66.98 |
| 3 | InstructBlip | 66.77 |
| 4 | Otter | 63.82 |
| 5 | Qwen-VL-Chat | 61.80 |
| 6 | Shikra | 57.21 |
| 7 | Shikra-VQA | 56.89 |
| 8 | LLAVA-1.5-13B | 55.68 |
| 9 | LLAVA-1.5-7B | 49.05 |
| 10 | MiniGPT-4 | 41.47 |

**Clean-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.27 |
| 2 | Shikra-VQA | 68.47 |
| 3 | InstructBlip | 67.84 |
| 4 | Shikra | 66.95 |
| 5 | Blip2 | 66.17 |
| 6 | Otter | 63.23 |
| 7 | Qwen-VL-Chat | 60.51 |
| 8 | LLAVA-1.5-13B | 59.00 |
| 9 | LLAVA-1.5-7B | 47.64 |
| 10 | MiniGPT-4 | 41.02 |

**Clean-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 88.50 |
| 2 | InstructBlip | 88.24 |
| 3 | Blip2 | 86.78 |
| 4 | Shikra-VQA | 74.14 |
| 5 | Shikra | 73.37 |
| 6 | Qwen-VL-Chat | 69.72 |
| 7 | LLAVA-1.5-13B | 58.36 |
| 8 | Otter | 53.12 |
| 9 | LLAVA-1.5-7B | 46.91 |
| 10 | MiniGPT-4 | 42.90 |

#### 2. Corruption

**Corru.-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.20 |
| 2 | InstructBlip | 71.30 |
| 3 | Blip2 | 68.69 |
| 4 | Shikra-VQA | 64.19 |
| 5 | Shikra | 63.47 |
| 6 | Otter | 61.39 |
| 7 | Qwen-VL | 59.27 |
| 8 | LLAVA-1.5-13B | 58.67 |
| 9 | LLAVA-1.5-7B | 52.36 |
| 10 | MiniGPT-4 | 43.02 |

**Corru.-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 95.86 |
| 2 | InstructBlip | 95.09 |
| 3 | Blip2 | 93.87 |
| 4 | Shikra-VQA | 88.72 |
| 5 | Shikra | 77.77 |
| 6 | Qwen-VL | 75.09 |
| 7 | Otter | 70.09 |
| 8 | LLAVA-1.5-13B | 68.34 |
| 9 | LLAVA-1.5-7B | 63.00 |
| 10 | MiniGPT-4 | 47.22 |

**Corru.-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.86 |
| 2 | InstructBlip | 62.42 |
| 3 | Otter | 60.71 |
| 4 | Shikra | 60.09 |
| 5 | Shikra-VQA | 59.58 |
| 6 | Blip2 | 58.16 |
| 7 | Qwen-VL | 57.95 |
| 8 | LLAVA-1.5-13B | 55.18 |
| 9 | LLAVA-1.5-7B | 52.34 |
| 10 | MiniGPT-4 | 38.70 |

**Corru.-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.48 |
| 2 | InstructBlip | 77.84 |
| 3 | Blip2 | 75.77 |
| 4 | Shikra | 73.72 |
| 5 | Shikra-VQA | 73.38 |
| 6 | Qwen-VL | 64.97 |
| 7 | Otter | 62.71 |
| 8 | LLAVA-1.5-13B | 62.23 |
| 9 | LLAVA-1.5-7B | 56.95 |
| 10 | MiniGPT-4 | 43.22 |

**Corru.-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | InstructBlip | 96.33 |
| 2 | Blip2 | 96.28 |
| 3 | XComposer-VL | 95.97 |
| 4 | Shikra-VQA | 83.01 |
| 5 | Shikra | 82.38 |
| 6 | LLAVA-1.5-13B | 74.63 |
| 7 | Qwen-VL | 73.55 |
| 8 | LLAVA-1.5-7B | 68.78 |
| 9 | Otter | 61.65 |
| 10 | MiniGPT-4 | 49.55 |

**Corru.-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.28 |
| 2 | InstructBlip | 62.79 |
| 3 | Qwen-VL | 60.44 |
| 4 | Blip2 | 58.67 |
| 5 | Otter | 57.24 |
| 6 | LLAVA-1.5-13B | 53.67 |
| 7 | Shikra-VQA | 52.28 |
| 8 | Shikra | 52.00 |
| 9 | LLAVA-1.5-7B | 47.87 |
| 10 | MiniGPT-4 | 40.91 |

**Corru.-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 84.87 |
| 2 | InstructBlip | 82.84 |
| 3 | Blip2 | 76.69 |
| 4 | Shikra-VQA | 70.59 |
| 5 | Shikra | 67.05 |
| 6 | Qwen-VL | 64.26 |
| 7 | LLAVA-1.5-13B | 60.88 |
| 8 | Otter | 59.87 |
| 9 | LLAVA-1.5-7B | 48.12 |
| 10 | MiniGPT-4 | 41.38 |

**Corru.-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 87.25 |
| 2 | InstructBlip | 79.23 |
| 3 | Blip2 | 78.36 |
| 4 | Qwen-VL | 64.89 |
| 5 | Shikra-VQA | 62.55 |
| 6 | Shikra | 61.68 |
| 7 | LLAVA-1.5-13B | 58.64 |
| 8 | LLAVA-1.5-7B | 56.36 |
| 9 | Otter | 53.59 |
| 10 | MiniGPT-4 | 38.80 |

**Corru.-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.61 |
| 2 | InstructBlip | 90.08 |
| 3 | Blip2 | 89.19 |
| 4 | Shikra-VQA | 80.55 |
| 5 | Shikra | 75.63 |
| 6 | Qwen-VL | 70.71 |
| 7 | LLAVA-1.5-13B | 67.46 |
| 8 | Otter | 65.06 |
| 9 | LLAVA-1.5-7B | 52.94 |
| 10 | MiniGPT-4 | 47.76 |

**Corru.-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 92.45 |
| 2 | InstructBlip | 92.23 |
| 3 | Blip2 | 91.15 |
| 4 | Shikra-VQA | 80.80 |
| 5 | Shikra | 77.69 |
| 6 | Otter | 73.42 |
| 7 | Qwen-VL | 72.01 |
| 8 | LLAVA-1.5-13B | 67.22 |
| 9 | LLAVA-1.5-7B | 48.84 |
| 10 | MiniGPT-4 | 48.04 |

**Corru.-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.04 |
| 2 | InstructBlip | 61.59 |
| 3 | Shikra-VQA | 61.09 |
| 4 | Shikra | 60.40 |
| 5 | Blip2 | 60.03 |
| 6 | Qwen-VL | 55.26 |
| 7 | LLAVA-1.5-13B | 53.67 |
| 8 | LLAVA-1.5-7B | 46.76 |
| 9 | Otter | 45.09 |
| 10 | MiniGPT-4 | 40.84 |

**Corru.-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 98.53 |
| 2 | Blip2 | 97.76 |
| 3 | InstructBlip | 93.75 |
| 4 | LLAVA-1.5-13B | 91.09 |
| 5 | Shikra-VQA | 90.17 |
| 6 | Shikra | 85.91 |
| 7 | Otter | 80.22 |
| 8 | Qwen-VL | 67.00 |
| 9 | MiniGPT-4 | 52.75 |
| 10 | LLAVA-1.5-7B | 52.27 |

**Corru.-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.24 |
| 2 | InstructBlip | 84.95 |
| 3 | Shikra-VQA | 83.56 |
| 4 | Blip2 | 81.56 |
| 5 | Shikra | 80.56 |
| 6 | Qwen-VL | 70.55 |
| 7 | Otter | 66.64 |
| 8 | LLAVA-1.5-13B | 64.94 |
| 9 | LLAVA-1.5-7B | 56.16 |
| 10 | MiniGPT-4 | 46.55 |

**Corru.-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.80 |
| 2 | InstructBlip | 76.19 |
| 3 | Blip2 | 71.72 |
| 4 | LLAVA-1.5-13B | 65.33 |
| 5 | Shikra-VQA | 64.73 |
| 6 | Shikra | 63.94 |
| 7 | Qwen-VL | 63.55 |
| 8 | Otter | 49.73 |
| 9 | LLAVA-1.5-7B | 47.92 |
| 10 | MiniGPT-4 | 39.65 |

**Corru.-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 97.06 |
| 2 | InstructBlip | 95.86 |
| 3 | Blip2 | 95.69 |
| 4 | Shikra-VQA | 83.03 |
| 5 | Otter | 82.84 |
| 6 | Shikra | 79.44 |
| 7 | LLAVA-1.5-13B | 71.47 |
| 8 | Qwen-VL | 71.43 |
| 9 | LLAVA-1.5-7B | 50.67 |
| 10 | MiniGPT-4 | 48.66 |

**Corru.-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 90.81 |
| 2 | InstructBlip | 89.76 |
| 3 | Blip2 | 88.05 |
| 4 | Shikra-VQA | 82.76 |
| 5 | Shikra | 71.49 |
| 6 | Qwen-VL | 67.78 |
| 7 | LLAVA-1.5-13B | 67.23 |
| 8 | Otter | 66.18 |
| 9 | LLAVA-1.5-7B | 54.09 |
| 10 | MiniGPT-4 | 51.02 |

**Corru.-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 94.06 |
| 2 | InstructBlip | 90.54 |
| 3 | XComposer-VL | 87.19 |
| 4 | Shikra-VQA | 80.80 |
| 5 | Otter | 78.70 |
| 6 | Shikra | 77.17 |
| 7 | Qwen-VL | 69.66 |
| 8 | LLAVA-1.5-13B | 68.58 |
| 9 | LLAVA-1.5-7B | 49.10 |
| 10 | MiniGPT-4 | 44.63 |

**Corru.-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.34 |
| 2 | InstructBlip | 65.06 |
| 3 | Blip2 | 64.86 |
| 4 | Otter | 61.14 |
| 5 | Qwen-VL | 60.34 |
| 6 | Shikra | 56.61 |
| 7 | Shikra-VQA | 55.81 |
| 8 | LLAVA-1.5-13B | 53.65 |
| 9 | LLAVA-1.5-7B | 47.67 |
| 10 | MiniGPT-4 | 40.81 |

**Corru.-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 72.53 |
| 2 | Shikra-VQA | 68.25 |
| 3 | InstructBlip | 67.48 |
| 4 | Shikra | 67.19 |
| 5 | Blip2 | 64.69 |
| 6 | Otter | 62.99 |
| 7 | Qwen-VL | 60.50 |
| 8 | LLAVA-1.5-13B | 58.50 |
| 9 | LLAVA-1.5-7B | 47.12 |
| 10 | MiniGPT-4 | 40.53 |

**Corru.-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 88.24 |
| 2 | InstructBlip | 88.09 |
| 3 | Blip2 | 86.25 |
| 4 | Shikra-VQA | 74.14 |
| 5 | Shikra | 73.07 |
| 6 | Qwen-VL | 70.78 |
| 7 | LLAVA-1.5-13B | 57.70 |
| 8 | Otter | 56.14 |
| 9 | LLAVA-1.5-7B | 46.76 |
| 10 | MiniGPT-4 | 42.86 |

#### 3. Print Attacking

**Print Attacking-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 67.09 |
| 2 | Blip2 | 52.55 |
| 3 | LLAVA-1.5-13B | 46.74 |
| 4 | LLAVA-1.5-7B | 45.24 |
| 5 | Shikra | 44.84 |
| 6 | Shikra-VQA | 44.36 |
| 7 | Qwen-VL | 44.30 |
| 8 | InstructBlip | 43.06 |
| 9 | MiniGPT-4 | 40.26 |
| 10 | Otter | 37.97 |

**Print Attacking-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 86.24 |
| 2 | Blip2 | 79.13 |
| 3 | InstructBlip | 75.94 |
| 4 | Shikra-VQA | 71.36 |
| 5 | Qwen-VL | 65.26 |
| 6 | Shikra | 58.82 |
| 7 | LLAVA-1.5-13B | 57.66 |
| 8 | LLAVA-1.5-7B | 51.62 |
| 9 | MiniGPT-4 | 47.07 |
| 10 | Otter | 39.77 |

**Print Attacking-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 55.08 |
| 2 | LLAVA-1.5-7B | 43.55 |
| 3 | Qwen-VL | 40.72 |
| 4 | LLAVA-1.5-13B | 40.09 |
| 5 | Blip2 | 39.71 |
| 6 | Shikra | 39.01 |
| 7 | MiniGPT-4 | 37.40 |
| 8 | Shikra-VQA | 36.50 |
| 9 | InstructBlip | 32.17 |
| 10 | Otter | 31.48 |

**Print Attacking-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 67.14 |
| 2 | Blip2 | 59.62 |
| 3 | Shikra | 55.70 |
| 4 | InstructBlip | 54.52 |
| 5 | Shikra-VQA | 54.25 |
| 6 | Qwen-VL | 48.66 |
| 7 | LLAVA-1.5-13B | 47.49 |
| 8 | LLAVA-1.5-7B | 42.86 |
| 9 | MiniGPT-4 | 42.22 |
| 10 | Otter | 38.12 |

**Print Attacking-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 89.81 |
| 2 | Blip2 | 84.85 |
| 3 | InstructBlip | 76.78 |
| 4 | Shikra-VQA | 68.28 |
| 5 | LLAVA-1.5-13B | 65.98 |
| 6 | Shikra | 64.47 |
| 7 | Qwen-VL | 64.29 |
| 8 | LLAVA-1.5-7B | 56.39 |
| 9 | MiniGPT-4 | 49.58 |
| 10 | Otter | 43.89 |

**Print Attacking-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 59.72 |
| 2 | Blip2 | 45.05 |
| 3 | Qwen-VL | 44.95 |
| 4 | LLAVA-1.5-7B | 42.23 |
| 5 | LLAVA-1.5-13B | 40.95 |
| 6 | InstructBlip | 38.26 |
| 7 | Shikra-VQA | 38.25 |
| 8 | MiniGPT-4 | 37.27 |
| 9 | Shikra | 37.02 |
| 10 | Otter | 36.16 |

**Print Attacking-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 76.43 |
| 2 | Blip2 | 66.53 |
| 3 | InstructBlip | 62.14 |
| 4 | Qwen-VL | 51.56 |
| 5 | LLAVA-1.5-13B | 49.63 |
| 6 | Shikra-VQA | 49.02 |
| 7 | Shikra | 47.23 |
| 8 | LLAVA-1.5-7B | 42.22 |
| 9 | MiniGPT-4 | 39.66 |
| 10 | Otter | 34.67 |

**Print Attacking-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 73.48 |
| 2 | Blip2 | 57.30 |
| 3 | Qwen-VL | 52.34 |
| 4 | LLAVA-1.5-7B | 45.11 |
| 5 | InstructBlip | 43.36 |
| 6 | Shikra | 43.13 |
| 7 | Shikra-VQA | 42.01 |
| 8 | LLAVA-1.5-13B | 41.59 |
| 9 | MiniGPT-4 | 38.64 |
| 10 | Otter | 33.95 |

**Print Attacking-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 83.60 |
| 2 | Blip2 | 81.67 |
| 3 | InstructBlip | 75.94 |
| 4 | Qwen-VL | 60.43 |
| 5 | Shikra-VQA | 58.61 |
| 6 | LLAVA-1.5-13B | 56.20 |
| 7 | Shikra | 56.08 |
| 8 | MiniGPT-4 | 46.70 |
| 9 | LLAVA-1.5-7B | 45.20 |
| 10 | Otter | 38.39 |

**Print Attacking-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 82.18 |
| 2 | Blip2 | 82.16 |
| 3 | InstructBlip | 74.20 |
| 4 | Qwen-VL | 60.29 |
| 5 | Shikra | 54.03 |
| 6 | LLAVA-1.5-13B | 53.77 |
| 7 | Shikra-VQA | 52.86 |
| 8 | MiniGPT-4 | 46.45 |
| 9 | LLAVA-1.5-7B | 42.07 |
| 10 | Otter | 39.52 |

**Print Attacking-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 60.37 |
| 2 | LLAVA-1.5-7B | 42.53 |
| 3 | Blip2 | 42.20 |
| 4 | LLAVA-1.5-13B | 41.30 |
| 5 | MiniGPT-4 | 39.18 |
| 6 | Shikra | 38.98 |
| 7 | Qwen-VL | 37.56 |
| 8 | Shikra-VQA | 37.47 |
| 9 | InstructBlip | 35.75 |
| 10 | Otter | 32.18 |

**Print Attacking-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 93.97 |
| 2 | InstructBlip | 80.44 |
| 3 | XComposer-VL | 79.56 |
| 4 | LLAVA-1.5-13B | 74.03 |
| 5 | Shikra-VQA | 63.62 |
| 6 | Shikra | 62.06 |
| 7 | Qwen-VL | 56.05 |
| 8 | MiniGPT-4 | 51.87 |
| 9 | LLAVA-1.5-7B | 51.33 |
| 10 | Otter | 39.31 |

**Print Attacking-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.59 |
| 2 | Blip2 | 67.94 |
| 3 | InstructBlip | 67.72 |
| 4 | Shikra-VQA | 65.00 |
| 5 | Shikra | 61.51 |
| 6 | Qwen-VL | 59.16 |
| 7 | LLAVA-1.5-13B | 58.33 |
| 8 | LLAVA-1.5-7B | 53.00 |
| 9 | MiniGPT-4 | 45.37 |
| 10 | Otter | 39.91 |

**Print Attacking-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 62.21 |
| 2 | Blip2 | 52.98 |
| 3 | InstructBlip | 48.42 |
| 4 | LLAVA-1.5-13B | 47.80 |
| 5 | Qwen-VL | 44.81 |
| 6 | Shikra-VQA | 42.09 |
| 7 | LLAVA-1.5-7B | 41.75 |
| 8 | Shikra | 41.07 |
| 9 | MiniGPT-4 | 38.03 |
| 10 | Otter | 28.77 |

**Print Attacking-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 90.07 |
| 2 | XComposer-VL | 87.80 |
| 3 | InstructBlip | 82.43 |
| 4 | Qwen-VL | 65.20 |
| 5 | LLAVA-1.5-13B | 60.13 |
| 6 | Shikra-VQA | 55.53 |
| 7 | Shikra | 53.98 |
| 8 | MiniGPT-4 | 48.02 |
| 9 | Otter | 46.92 |
| 10 | LLAVA-1.5-7B | 44.73 |

**Print Attacking-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Blip2 | 83.55 |
| 2 | XComposer-VL | 81.03 |
| 3 | InstructBlip | 76.06 |
| 4 | Qwen-VL | 61.06 |
| 5 | LLAVA-1.5-13B | 56.84 |
| 6 | Shikra-VQA | 55.02 |
| 7 | MiniGPT-4 | 49.61 |
| 8 | Shikra | 46.96 |
| 9 | LLAVA-1.5-7B | 44.26 |
| 10 | Otter | 38.32 |

**Print Attacking-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 60.60 |
| 2 | Blip2 | 52.84 |
| 3 | LLAVA-1.5-13B | 45.26 |
| 4 | Qwen-VL | 43.55 |
| 5 | LLAVA-1.5-7B | 42.23 |
| 6 | Shikra | 41.06 |
| 7 | MiniGPT-4 | 40.32 |
| 8 | Shikra-VQA | 40.12 |
| 9 | InstructBlip | 40.03 |
| 10 | Otter | 33.86 |

**Print Attacking-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 64.99 |
| 2 | Shikra-VQA | 55.50 |
| 3 | Blip2 | 54.94 |
| 4 | Shikra | 54.16 |
| 5 | LLAVA-1.5-13B | 51.16 |
| 6 | InstructBlip | 49.98 |
| 7 | Qwen-VL | 49.17 |
| 8 | LLAVA-1.5-7B | 44.34 |
| 9 | MiniGPT-4 | 39.23 |
| 10 | Otter | 37.94 |

**Print Attacking-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | XComposer-VL | 78.66 |
| 2 | Blip2 | 74.47 |
| 3 | InstructBlip | 63.59 |
| 4 | Qwen-VL | 58.40 |
| 5 | Shikra | 51.75 |
| 6 | Shikra-VQA | 51.10 |
| 7 | LLAVA-1.5-13B | 49.48 |
| 8 | LLAVA-1.5-7B | 43.40 |
| 9 | MiniGPT-4 | 41.70 |
| 10 | Otter | 35.56 |

#### 4. Adversarial Attacking

**Adversarial Attacking-Movie**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 58.17 |
| 2 | Shikra | 57.41 |
| 3 | Shikra-VQA | 57.26 |
| 4 | LLAVA-1.5-13B | 57.17 |
| 5 | Otter | 56.61 |
| 6 | LLAVA-1.5-7B | 51.91 |
| 7 | MiniGPT-4 | 36.95 |
| 8 | Blip2 | 33.33 |
| 9 | InstructBlip | 31.97 |
| 10 | XComposer-VL | 31.75 |

**Adversarial Attacking-Action**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 85.84 |
| 2 | Qwen-VL | 79.11 |
| 3 | Shikra | 77.17 |
| 4 | LLAVA-1.5-13B | 66.37 |
| 5 | Otter | 65.81 |
| 6 | LLAVA-1.5-7B | 60.84 |
| 7 | MiniGPT-4 | 37.59 |
| 8 | Blip2 | 34.64 |
| 9 | XComposer-VL | 33.88 |
| 10 | InstructBlip | 32.85 |

**Adversarial Attacking-TV Show**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 55.79 |
| 2 | Shikra-VQA | 53.57 |
| 3 | Otter | 53.05 |
| 4 | LLAVA-1.5-13B | 53.03 |
| 5 | Shikra | 52.61 |
| 6 | LLAVA-1.5-7B | 48.36 |
| 7 | MiniGPT-4 | 38.22 |
| 8 | XComposer-VL | 34.45 |
| 9 | Blip2 | 32.04 |
| 10 | InstructBlip | 30.93 |

**Adversarial Attacking-Profession**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 71.47 |
| 2 | Shikra | 67.66 |
| 3 | Qwen-VL | 64.80 |
| 4 | Otter | 61.44 |
| 5 | LLAVA-1.5-13B | 59.83 |
| 6 | LLAVA-1.5-7B | 56.15 |
| 7 | MiniGPT-4 | 36.95 |
| 8 | Blip2 | 36.14 |
| 9 | InstructBlip | 35.02 |
| 10 | XComposer-VL | 30.86 |

**Adversarial Attacking-Landmark**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 78.40 |
| 2 | Shikra | 77.85 |
| 3 | Qwen-VL | 76.34 |
| 4 | LLAVA-1.5-13B | 71.92 |
| 5 | LLAVA-1.5-7B | 68.62 |
| 6 | Otter | 56.87 |
| 7 | MiniGPT-4 | 35.73 |
| 8 | XComposer-VL | 35.33 |
| 9 | Blip2 | 34.53 |
| 10 | InstructBlip | 32.03 |

**Adversarial Attacking-Anime**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 60.07 |
| 2 | Otter | 51.86 |
| 3 | LLAVA-1.5-13B | 51.81 |
| 4 | Shikra-VQA | 48.45 |
| 5 | Shikra | 47.75 |
| 6 | LLAVA-1.5-7B | 47.32 |
| 7 | MiniGPT-4 | 38.38 |
| 8 | Blip2 | 37.95 |
| 9 | InstructBlip | 34.29 |
| 10 | XComposer-VL | 30.89 |

**Adversarial Attacking-Clothes**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 69.92 |
| 2 | Shikra-VQA | 68.32 |
| 3 | Shikra | 65.73 |
| 4 | LLAVA-1.5-13B | 58.83 |
| 5 | Otter | 57.58 |
| 6 | LLAVA-1.5-7B | 47.05 |
| 7 | MiniGPT-4 | 38.05 |
| 8 | InstructBlip | 36.65 |
| 9 | XComposer-VL | 36.41 |
| 10 | Blip2 | 36.18 |

**Adversarial Attacking-Celebrity**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 64.77 |
| 2 | Shikra | 60.02 |
| 3 | Shikra-VQA | 59.70 |
| 4 | LLAVA-1.5-13B | 55.53 |
| 5 | LLAVA-1.5-7B | 54.95 |
| 6 | Otter | 52.26 |
| 7 | MiniGPT-4 | 37.48 |
| 8 | Blip2 | 34.72 |
| 9 | InstructBlip | 33.52 |
| 10 | XComposer-VL | 33.09 |

**Adversarial Attacking-Food**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 75.26 |
| 2 | Qwen-VL | 73.72 |
| 3 | Shikra | 71.07 |
| 4 | LLAVA-1.5-13B | 64.95 |
| 5 | Otter | 64.39 |
| 6 | LLAVA-1.5-7B | 52.30 |
| 7 | MiniGPT-4 | 36.94 |
| 8 | XComposer-VL | 33.16 |
| 9 | Blip2 | 32.48 |
| 10 | InstructBlip | 32.13 |

**Adversarial Attacking-Plant**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 79.59 |
| 2 | Shikra-VQA | 76.06 |
| 3 | Shikra | 73.53 |
| 4 | Otter | 67.16 |
| 5 | LLAVA-1.5-13B | 64.44 |
| 6 | LLAVA-1.5-7B | 47.41 |
| 7 | MiniGPT-4 | 37.74 |
| 8 | Blip2 | 35.84 |
| 9 | InstructBlip | 34.02 |
| 10 | XComposer-VL | 34.00 |

**Adversarial Attacking-Age**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra | 56.95 |
| 2 | Shikra-VQA | 56.80 |
| 3 | LLAVA-1.5-13B | 51.43 |
| 4 | Qwen-VL | 48.11 |
| 5 | LLAVA-1.5-7B | 45.89 |
| 6 | Otter | 42.82 |
| 7 | MiniGPT-4 | 37.14 |
| 8 | Blip2 | 34.16 |
| 9 | InstructBlip | 33.14 |
| 10 | XComposer-VL | 30.78 |

**Adversarial Attacking-Gender**

| Rank | Model | Score |
|------|-------|-------|
| 1 | LLAVA-1.5-13B | 89.35 |
| 2 | Shikra-VQA | 89.06 |
| 3 | Shikra | 85.34 |
| 4 | Otter | 82.25 |
| 5 | Qwen-VL | 79.18 |
| 6 | LLAVA-1.5-7B | 51.95 |
| 7 | MiniGPT-4 | 48.38 |
| 8 | Blip2 | 39.48 |
| 9 | InstructBlip | 38.51 |
| 10 | XComposer-VL | 38.09 |

**Adversarial Attacking-Expression**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 82.04 |
| 2 | Shikra | 79.86 |
| 3 | Qwen-VL | 73.91 |
| 4 | LLAVA-1.5-13B | 63.13 |
| 5 | Otter | 60.44 |
| 6 | LLAVA-1.5-7B | 55.70 |
| 7 | MiniGPT-4 | 41.07 |
| 8 | Blip2 | 35.95 |
| 9 | InstructBlip | 33.52 |
| 10 | XComposer-VL | 32.80 |

**Adversarial Attacking-Race**

| Rank | Model | Score |
|------|-------|-------|
| 1 | LLAVA-1.5-13B | 63.06 |
| 2 | Shikra-VQA | 61.19 |
| 3 | Shikra | 60.24 |
| 4 | Qwen-VL | 59.50 |
| 5 | LLAVA-1.5-7B | 45.97 |
| 6 | Otter | 45.50 |
| 7 | MiniGPT-4 | 37.47 |
| 8 | Blip2 | 36.81 |
| 9 | InstructBlip | 36.36 |
| 10 | XComposer-VL | 34.77 |

**Adversarial Attacking-Animal**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 81.78 |
| 2 | Shikra-VQA | 80.44 |
| 3 | Otter | 78.00 |
| 4 | Shikra | 77.66 |
| 5 | LLAVA-1.5-13B | 70.94 |
| 6 | LLAVA-1.5-7B | 49.30 |
| 7 | MiniGPT-4 | 37.61 |
| 8 | Blip2 | 35.23 |
| 9 | XComposer-VL | 34.38 |
| 10 | InstructBlip | 34.01 |

**Adversarial Attacking-Object**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 79.05 |
| 2 | Qwen-VL | 77.26 |
| 3 | Shikra | 69.98 |
| 4 | LLAVA-1.5-13B | 65.33 |
| 5 | Otter | 62.69 |
| 6 | LLAVA-1.5-7B | 52.62 |
| 7 | MiniGPT-4 | 37.84 |
| 8 | Blip2 | 37.62 |
| 9 | XComposer-VL | 36.97 |
| 10 | InstructBlip | 36.88 |

**Adversarial Attacking-OCR**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 79.12 |
| 2 | Shikra | 77.62 |
| 3 | Qwen-VL | 76.50 |
| 4 | Otter | 74.91 |
| 5 | LLAVA-1.5-13B | 66.43 |
| 6 | XComposer-VL | 53.21 |
| 7 | InstructBlip | 50.03 |
| 8 | Blip2 | 47.80 |
| 9 | LLAVA-1.5-7B | 46.78 |
| 10 | MiniGPT-4 | 36.78 |

**Adversarial Attacking-Style**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 56.80 |
| 2 | Otter | 53.47 |
| 3 | Shikra | 51.05 |
| 4 | Shikra-VQA | 50.30 |
| 5 | LLAVA-1.5-13B | 50.01 |
| 6 | LLAVA-1.5-7B | 44.85 |
| 7 | MiniGPT-4 | 37.98 |
| 8 | XComposer-VL | 37.20 |
| 9 | InstructBlip | 36.20 |
| 10 | Blip2 | 35.87 |

**Adversarial Attacking-Background**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Shikra-VQA | 65.41 |
| 2 | Shikra | 64.50 |
| 3 | Otter | 60.30 |
| 4 | Qwen-VL | 60.09 |
| 5 | LLAVA-1.5-13B | 56.28 |
| 6 | LLAVA-1.5-7B | 46.53 |
| 7 | MiniGPT-4 | 36.91 |
| 8 | Blip2 | 35.05 |
| 9 | XComposer-VL | 34.71 |
| 10 | InstructBlip | 34.45 |

**Adversarial Attacking-Color**

| Rank | Model | Score |
|------|-------|-------|
| 1 | Qwen-VL | 69.63 |
| 2 | Shikra | 69.57 |
| 3 | Shikra-VQA | 69.05 |
| 4 | LLAVA-1.5-13B | 54.28 |
| 5 | Otter | 53.12 |
| 6 | LLAVA-1.5-7B | 46.08 |
| 7 | MiniGPT-4 | 37.54 |
| 8 | XComposer-VL | 37.23 |
| 9 | Blip2 | 36.83 |
| 10 | InstructBlip | 35.81 |

## Related projects🔗
