
WildVision: Evaluating Vision-Language Models
in the Wild with Human Preferences

Yujie Lu  Dongfu Jiang
 Wenhu Chen  William Yang Wang  Yejin Choi  Bill Yuchen Lin
Allen Institute of AI  University of Washington
University of California, Santa Barbara  University of Waterloo
yujielu@ucsb.edu, yuchenl@allenai.org
https://hf.co/spaces/WildVision/vision-arena
Abstract

Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.

Figure 1: WildVision-Arena (WV-Arena) supports multi-round multimodal chats with 20+ models, enabling the comparison of VLMs in real-world scenarios. We curate WildVision-Bench (WV-Bench) by selecting 500 samples from 20k+ in-the-wild chats and 8k+ user ratings. Automatic model scorings on WV-Bench closely correlate with the Elo ratings on WV-Arena.

1 Introduction

Vision-language models (VLMs) [68, 82, 69, 49, 14, 113, 3, 5] have shown groundbreaking performance across various applications, necessitating enhanced evaluation approaches [87, 24, 107, 106] to keep up with their rapid advancements. Current evaluation benchmarks, however, are constrained by simplicity [53, 102] and practicality [101, 50]. Meanwhile, evaluation metrics for vision and language tasks are predominantly reference-based, focusing on exact matches or model-based scores [87, 7]. The success of the CLIP model [73] has enabled reference-free evaluation [24], reducing the need for reference curation while maintaining alignment with human annotators. More recent evaluation methods [56, 107, 35] leverage the instruction-following capability of LLMs and the expertise of vision models [15, 91, 34], making the automatic evaluation of VLMs more fine-grained and interpretable. Despite these advancements, a gap remains between these metrics and human preferences when comparing a large number of models’ capabilities in real-world multimodal interactions.

In this paper, we introduce WildVision-Arena and WildVision-Bench to address the need for tracking human preferences regarding models’ capabilities in the wild. Our WildVision-Arena is a chatbot-style [110, 12] platform that facilitates easy comparison among VLMs, utilizing the Elo Rating system as the primary ranking metric. With the support of over 20 models (GPT-4o [69], GPT-4V [68], Gemini-Pro [82], Gemini-1.5 [81], Reka [83], Claude-3 [2], LLaVA-NEXT [48], etc.), alongside a side-by-side chatting interface over images, we have crowdsourced over 20,000 multi-round human-AI chat interactions, including over 8,000 votes and fine-grained feedback. We then sample diversified and safe data as our WildVision-Bench and adapt AlpacaEval [44] to the visual context. Specifically, we use the recently released GPT-4o [69] as a judge model to vote between each VLM and the reference model Claude-3-Sonnet [2]. The statistically estimated model scores on WV-Bench achieve a Spearman’s correlation of 0.94 with Elo ratings in WildVision-Arena.

Statistic Number
Total Votes 8,076
Anonymous 6,636
Non-anonymous 1,440
Left Vote 2,932
Right Vote 2,839
Tie Vote 979
Bad Vote 1,326
Days 102
Total Round 10,884
Avg Round 1.34
Avg Token Input 31.00
Avg Token Output 108.87
Table 1: Statistics of votings in WV-Arena.
Figure 2: Question Category
Figure 3: Image Domain

Our comprehensive analysis of these in-the-wild chats identifies areas for improvement in recognizing visual context, spatial reasoning and imagination, and expert domain knowledge. Additionally, lower-performing VLMs struggle with discerning fine visual details in images, hindered by resolution and contextual limitations. Across the board, these models also face challenges with hallucination and safety concerns. Our main contributions can be summarized as:

  • We develop WildVision-Arena, an interactive evaluation platform that hosts over 20 VLMs and a live leaderboard reflecting crowdsourced user preferences on real-world chats.

  • We curate WildVision-Bench from WildVision-Arena, a fast-evaluation benchmark that closely aligns with human preferences at a Spearman’s correlation of 0.94.

  • We comprehensively analyze 20,000+ multimodal conversations and 8,000+ votes, and we will release this data to advance future research in VLMs.

2 WildVision-Arena: Ranking VLMs with Human Preference

In this section, we introduce WildVision-Arena and present statistics of in-the-wild chat data, along with a deep analysis of human preferences that formulate our online VLMs leaderboard.

2.1 Overview Design of WildVision-Arena

Users conduct multi-round chats over uploaded images, during which two models from the pool or third-party APIs are sampled. Users vote for the better response, with the model’s identity revealed afterward, and can provide reasons for their choices. Votes contribute to a live leaderboard, which is updated every few hours to rank the models. Appendix A shows a screenshot of our user interface. In WildVision-Arena, we currently support 20+ VLMs as shown in the leaderboard on the right part of Figure 1. The generation hyperparameters are set the same when comparing these models, and users can change the temperature, top-p and max output tokens per their use cases.

Figure 4: Battle Count Heatmap (Left): the number of voted comparisons between models. Win Fraction Heatmap (Right): the winning rate of Model A over Model B in voted comparisons.

2.2 Statistics of Chat Data with Votings

Each chat data point that has human voting is classified into a category-subcategory and domain-subdomain using GPT-4V. The prompt template details are provided in the appendix. Key statistics of user voting in WildVision-Arena are presented in Table 1. The number of tokens is estimated with the tiktoken tokenizer corresponding to the model ‘gpt-3.5-turbo’. Figure 2 and Figure 3 visualize the distribution of these voting data in terms of question categories and image domains, respectively. In addition to the three dominant question categories (Recognition, Descriptive, Analytical), the Interactive, Instructive, and Creative categories are also receiving increasing interest. Users are mostly interested in chatting about images tagged with the Entertainment domain (most of which are related to games and movies/TV shows), as well as the Urban, Expert, and People domains.

2.3 Crowdsourced Human Preference on VLMs in the Wild

Pairwise Comparison

We visualize the heatmap of battle counts and win fractions of seven models out of the 20+ models supported in WildVision-Arena in Figure 4. The battle count heatmap highlights the frequency of direct comparisons, with pairs like GPT-4V vs. Gemini-Pro (252 voted battles) being tested more rigorously. GPT-4o consistently outperforms the others by a large margin, winning 77% of its battles against the second-best model, GPT-4V. Reka-Flash follows closely behind GPT-4V, winning 42% of its battles, while other models demonstrate lower winning rates. Among the open-source models, LLaVA-NEXT leads, though there remains a significant gap between it and both GPT-4V and GPT-4o.

Expert Agreement with User Voting

To assess the quality of crowdsourced user voting data on our platform, we evaluated inter-annotator agreement by comparing the annotations of our experts with those from users of the WildVision-Arena. This analysis was conducted on a set of 100 samples. Our findings indicate a substantial level of agreement with the two experts, with an average percentage agreement of 72.5%. Furthermore, the calculated Cohen’s Kappa coefficient was 0.59, suggesting a moderate to high degree of reliability in the annotations across different annotators.
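As an illustration of the two agreement statistics above, the following minimal sketch computes percentage agreement and Cohen’s Kappa for one expert-user pair. The function names and the toy vote lists are hypothetical, and the assumption that the reported numbers are averaged over such pairs is ours rather than something stated in the text.

```python
from collections import Counter


def percent_agreement(votes_a, votes_b):
    """Fraction of samples on which two annotators cast the same vote."""
    if len(votes_a) != len(votes_b) or not votes_a:
        raise ValueError("vote lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(votes_a, votes_b)) / len(votes_a)


def cohens_kappa(votes_a, votes_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(votes_a)
    p_observed = percent_agreement(votes_a, votes_b)
    counts_a, counts_b = Counter(votes_a), Counter(votes_b)
    # Chance agreement: probability both annotators pick the same label
    # if each voted independently according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy votes (not data from the paper).
expert = ["left", "left", "right", "right"]
user = ["left", "left", "right", "left"]
print(percent_agreement(expert, user), cohens_kappa(expert, user))
```

Kappa is lower than raw agreement whenever some agreement would be expected by chance, which is why both numbers are reported.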

Table 2: WildVision-Arena Leaderboard. We show the full Elo score and within three question categories (Analytical, Descriptive, Recognition) and three image domains (Entertainment, Objects, Expert) of 22 models with a time cutoff at May 29, 2024.
Models                  Size  Elo   Battles  MMMU  Question Category          Image Domain
                                                   Analyt.  Descri.  Recogn.  Entert.  Objects  Expert
GPT-4O [69]             --    1235  434      62.8  1290     1250     1236     1362     1203     1293
GPT-4-Vision [68]       --    1132  2288     56.8  1154     1169     1099     1177     1109     1178
Reka-Flash [83]         --    1107  513      56.3  1093     1141     1067     1069     1101     1191
Claude-3-OPUS [2]       --    1100  908      59.4  1117     1096     1092     1111     1127     1128
Gemini-Pro-Vision [82]  --    1061  2229     47.9  1099     1041     1090     1088     1077     1041
Yi-VL-PLUS [1]          --    1061  283      --    1084     1040     1078     1001     1119     1101
LLaVA-NEXT [48]         34B   1059  1826     51.1  1068     1104     1021     1074     1015     1052
Gemini-1.5-Flash [81]   --    1055  132      --    1090     1018     1085     1190     990      1127
Claude-3-Sonnet [2]     --    1044  496      53.1  1063     1056     1041     1033     1023     1119
CogVLM-Chat-HF [89]     13B   1016  1024     32.1  950      947      1006     955      930      950
Claude-3-Haiku [2]      --    1002  419      50.2  964      1008     996      1033     1014     1005
LLaVA-NEXT [48]         7B    992   1367     35.1  963      1032     977      992      1023     1001
DeepSeek-VL [51]        7B    979   646      36.6  988      984      953      956      1026     962
Idefics2 [37]           8B    965   100      36.6  818      1003     1011     909      1071     1020
LLaVA-NEXT [48]         13B   956   201      35.9  965      974      1006     975      971      987
Qwen-VL-Chat [5]        10B   930   1328     35.9  898      937      940      923      942      902
Bunny-V1 [23]           3B    921   389      38.2  897      922      878      884      823      823
MiniCPM-V [26]          3B    910   1349     34.7  895      911      925      888      890      840
LLaVA-v1.5 [47]         13B   891   299      36.4  952      838      920      887      827      914
Tiny-LLaVA-v1-HF [111]  3B    879   288      33.1  901      828      821      808      853      894
InstructBLIP [14]       7B    862   807      30.6  834      856      891      840      902      763
UFORM-Gen2-Qwen [86]    500M  827   452      --    911      785      853      768      937      830

2.4 Model Ranking with Elo Rating in WildVision-Arena

Following Chatbot Arena [12], we adapt the Elo rating system [17] to provide a dynamic evaluation platform for ranking VLMs by statistical modeling based on our collected direct pairwise comparisons. We briefly introduce the online Elo rating and the statistical estimation method.

Online Elo Rating

Elo rating focuses on modeling the probability of player $i$ winning against player $j$ given their existing ratings $R_i$ and $R_j$ respectively, where $i, j \in N$. We define a binary outcome $Y_{ij}$ for each comparison between player $i$ and player $j$, where $Y_{ij}=1$ if player $i$ wins against player $j$, and $Y_{ij}=0$ otherwise. Then the logistic probability is formulated as:

$$P(Y_{ij}=1)=\frac{1}{1+10^{(R_j-R_i)/\alpha}}, \tag{1}$$

where $\alpha=400$ for Elo rating computation. After a match, each player’s rating is updated by the formula $R'_i=R_i+K\times(S(i|j)-E(i|j))$, where $S(i|j)$ is the actual match outcome (1 for a win, 0.5 for a tie, and 0 for a loss), and $E(i|j)=P(Y_{ij}=1)$. The higher-rated player will win fewer points if they win but lose more if they lose, while the lower-rated player will experience the opposite. The computation of the online Elo rating is correlated with the comparison order. Therefore, we follow Chatbot Arena to adopt the Bradley–Terry model [9] for a stable statistical estimation.
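The online update can be sketched in a few lines. This is a minimal illustration of Equation (1) and the update rule, not the platform’s actual code: the function names, the value K = 32, and the initial rating of 1000 are assumptions, since the text does not specify them. Replaying the same two matches in opposite order also shows the order dependence mentioned above.

```python
def expected_score(r_i, r_j, alpha=400.0):
    """Equation (1): probability that player i beats player j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / alpha))


def update_elo(r_i, r_j, score_i, k=32.0, alpha=400.0):
    """Return updated (R_i, R_j) after one match.

    score_i is S(i|j): 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k=32 is an illustrative assumption, not a value from the paper.
    """
    e_i = expected_score(r_i, r_j, alpha)
    e_j = 1.0 - e_i
    new_i = r_i + k * (score_i - e_i)
    new_j = r_j + k * ((1.0 - score_i) - e_j)
    return new_i, new_j


def replay(matches, k=32.0, init=1000.0):
    """Apply (winner, loser) matches in order; init=1000 is an assumption."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.get(winner, init)
        r_l = ratings.get(loser, init)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, 1.0, k=k)
    return ratings


# Same two results, different order: the final ratings differ.
first = replay([("A", "B"), ("B", "A")])
second = replay([("B", "A"), ("A", "B")])
print(first, second)
```

Because each update uses the ratings at the time of the match, later matches weigh differently from earlier ones, which motivates the order-independent estimation described next.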

Statistical Estimation

The Bradley–Terry model [9] estimates the Elo rating using a logistic regression model and maximum likelihood estimation (MLE). Let’s say there are $N$ players, and we have a series of pairwise comparisons, where $W_{ij}$ is the number of times player $i$ wins against player $j$. The log-likelihood function for all pairwise comparisons can be written as:

$$\mathcal{L}(\mathbf{R})=\sum_{i,j\in N,\, i\neq j}\left(W_{ij}Y_{ij}\log P(Y_{ij}=1)\right), \tag{2}$$

where $\mathbf{R}=\{R_1,\dots,R_N\}$ is the Elo rating variable of each player. Since this modeling does not consider ties, in practice, we duplicate all the votes and force half of the tie votes to be counted as left model $i$ winning ($Y_{ij}=1$) and the other half as right model $j$ winning ($Y_{ij}=0$).
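A minimal sketch of such an order-independent fit is given below. Note the substitution: instead of the logistic-regression solver described above, it uses the classical minorization-maximization iteration for the Bradley–Terry likelihood, which should reach the same maximum-likelihood strengths. The function name, the vote format, the anchoring of the mean rating at 1000, and the iteration count are all assumptions made for illustration.

```python
import math


def fit_bradley_terry(votes, alpha=400.0, anchor=1000.0, iters=500):
    """votes: iterable of (left_model, right_model, outcome),
    with outcome in {"left", "right", "tie"}. Returns {model: rating}."""
    wins = {}  # wins[(i, j)] = weighted number of times i beat j
    models = set()
    for left, right, outcome in votes:
        models.update((left, right))
        if outcome == "left":
            # Every vote is duplicated, so a decisive vote counts twice.
            wins[(left, right)] = wins.get((left, right), 0.0) + 2.0
        elif outcome == "right":
            wins[(right, left)] = wins.get((right, left), 0.0) + 2.0
        else:
            # A duplicated tie: one copy for each side.
            wins[(left, right)] = wins.get((left, right), 0.0) + 1.0
            wins[(right, left)] = wins.get((right, left), 0.0) + 1.0

    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = 0.0
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                w_ij = wins.get((i, j), 0.0)
                w_ji = wins.get((j, i), 0.0)
                total_wins += w_ij
                denom += (w_ij + w_ji) / (strength[i] + strength[j])
            # Floor avoids log(0) for a model that never wins.
            updated[i] = max(total_wins / denom, 1e-12) if denom > 0 else strength[i]
        # Fix the scale: normalize the geometric mean strength to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    # Map strengths onto the Elo scale of Equation (1) (base 10, alpha = 400).
    return {m: anchor + alpha * math.log10(s) for m, s in strength.items()}


# Hypothetical toy votes: A beats B three times, B beats A once.
toy = [("A", "B", "left")] * 3 + [("A", "B", "right")]
ratings = fit_bradley_terry(toy)
print(ratings)
```

With strengths $p_i = 10^{R_i/\alpha}$, the win probability $p_i/(p_i+p_j)$ is exactly Equation (1), so the fitted ratings live on the same scale as the online Elo scores while no longer depending on vote order.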

Figure 5: Elo ratings of six models across question categories (Top) and image domains (Bottom).

2.5 WildVision-Arena Leaderboard

We report the leaderboard results in Table 2, including the full Elo ratings and the total number of battles for each model, with a time cutoff on May 29, 2024. Additionally, we provide the Elo ratings for three main question categories (Analytical, Descriptive, Recognition) and three main image domains (Entertainment, Natural, Expert) to better understand the specialties of each model. GPT-4o quickly dominates the leaderboard after its release, surpassing the previous state-of-the-art GPT-4V by a significant margin, followed by Reka-Flash and Claude-3-OPUS. Yi-VL-PLUS and LLaVA-NEXT-34B achieve the same rank, reflecting that both models are based on Yi [1]. Among open-source models, LLaVA-NEXT-34B ranks first, even surpassing Gemini-1.5-Flash, Claude-3-Sonnet, and Claude-3-Haiku, indicating a strong baseline for research purposes. To compare models under each question category and image domain, we present the top six models ranked in the WildVision-Arena leaderboard in terms of Elo ratings for each question category and image domain in Figure 5. GPT-4o consistently outperforms all other models except for the images tagged with Natural, where varying specialties are more commonly observed among the other models.

3 WildVision-Bench: In-the-Wild Testbed for VLMs

Recent VLMs reveal a closing gap with GPT-4V on various benchmarks [101, 102], but this improvement is not always reflected in users’ daily experiences. This discrepancy arises from the limited generalizability of current open-source models compared to proprietary ones, which fixed benchmarks fail to capture. To address this, we propose WildVision-Bench, a challenging and natural benchmark for VLMs that reflects real-world human use cases, with models’ rankings aligning closely with the WildVision-Arena leaderboard contributed by diverse crowdsourced user votes.

Table 3: VLMs’ responses on two cases from WildVision-Bench expert annotated samples. The example #61 is a hard case that all models fall short at.

Image #44 in WildVision Bench
Text Prompt: Answer shortly what is he doing? And what do we call him?
GPT-4V: He is performing a breakdance move. We call him a breakdancer or b-boy.
LLaVA-v1.6-34B: He is performing a dance move that involves spinning or flipping while balancing on one hand …, he may be referred to as a dancer or acrobat.
Gemini Pro Vision: He is breakdancing. He is called a b-boy or breaker.

Image #61 in WildVision Bench
Text Prompt: Convert the first row of this sudoku matrix into the text format, each cell separated by a space. Use 0 to represent empty cells.
GPT-4V: 6 8 4 2 1 0 0 0 0
LLaVA-v1.6-34B: 6 8 4 0 2 1 0 7 3 4 0 6
Gemini Pro Vision: 6 8 4 2 1 9 8 7 7

3.1 Data Curation Pipeline

Starting with in-the-wild multimodal conversation data from WildVision-Arena’s users, we apply the NSFW detector [36] on the images to filter out unsafe content. We then perform deduplication on the images and apply diversity sampling to formulate a public set of 500 data samples for WildVision-Bench. Our experts manually annotate 50 samples as a preview of a hidden set, which will be updated dynamically to avoid contamination. We showcase the model performance on two cases from expert annotations in Table 3.

3.2 Automatic Evaluation on WildVision-Bench

Metric (GPT-4V vs. Human)  4-way   3-way   Binary
F1 Score (Macro)           0.4245  0.5143  0.7792
F1 Score (Micro)           0.5747  0.5842  0.7796
F1 Score (Weighted)        0.5407  0.5536  0.7798
Cohen’s Kappa Score        0.3404  0.3442  0.5585
Pearson Correlation        0.2906  0.2880  0.5587
Figure 6: Left: GPT-4V vs. Arena Human Voting. Right: Agreement; 4-way: left/right/tie/bad vote. 3-way: left/right/other. Binary: left/right vote.

VLMs as a Local Evaluator

Previous work [107, 35] shows alignment between GPT-4V and humans when evaluating the performance of VLMs. We further validate the agreement of GPT-4V with crowdsourced human preferences in WildVision-Arena to ensure its efficacy in the wild. Specifically, we feed a pair of multimodal conversations along with the votes into GPT-4V to select among four choices: 1) left/right vote: the left/right model response is better, 2) tie/bad vote: both models are equally good/bad. We provide the detailed prompt template for GPT-4V in the appendix. We show the GPT-4V vs. Arena Human alignment in Figure 6. We observe that GPT-4V has relatively low agreement with humans on tie votes but shows high agreement with humans when the two models exhibit distinguishable differences. However, predicting when both models are bad is challenging, as GPT-4V sometimes falls short on these examples as well.
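To make the 4-way versus 3-way comparison concrete, the sketch below collapses tie and bad votes into a single "other" label and computes macro-averaged F1 between human and judge votes. The helper names and the toy votes are hypothetical, and whether the reported numbers were computed exactly this way is an assumption.

```python
def collapse_3way(label):
    """Map a 4-way vote (left/right/tie/bad) to 3-way (left/right/other)."""
    return label if label in ("left", "right") else "other"


def macro_f1(human, judge):
    """Unweighted mean of per-label F1 scores, treating human votes as truth."""
    labels = sorted(set(human) | set(judge))
    scores = []
    for lab in labels:
        tp = sum(h == lab and g == lab for h, g in zip(human, judge))
        fp = sum(h != lab and g == lab for h, g in zip(human, judge))
        fn = sum(h == lab and g != lab for h, g in zip(human, judge))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy votes (not data from the paper).
human_votes = ["left", "right", "tie", "bad", "left"]
judge_votes = ["left", "right", "bad", "bad", "right"]

f1_4way = macro_f1(human_votes, judge_votes)
f1_3way = macro_f1(
    [collapse_3way(v) for v in human_votes],
    [collapse_3way(v) for v in judge_votes],
)
print(f1_4way, f1_3way)
```

In this toy case the only tie/bad confusion disappears after collapsing, so the 3-way score is higher, which mirrors the trend from the 4-way to the binary column in Figure 6.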

WildVision-Bench Alignment with Human Preferences in WildVision-Arena

Inspired by Alpaca Eval [16], we adopt a similar approach to rank VLMs on our WildVision-Bench automatically. Specifically, we use GPT-4o as the judge model and Claude-3-Sonnet as our reference model. We compare each model’s answers on the WildVision-Bench public set with those of Claude-3-Sonnet and then use GPT-4o, which shows better alignment with humans in our cases, to give a vote. The judge prompt template, provided in a separate table, defines 5 levels of comparison results: "Better+", "Better", "Tie", "Worse", and "Worse+". We report the score results of these models in Table 4. This achieves a 0.94 Spearman correlation with the WildVision-Arena leaderboard.

Table 4: Estimated model scores of VLMs on WildVision-Bench test split of 500 samples.
Model                   Score  95% CI        Win Rate  Reward  Much Better  Better  Tie   Worse  Much Worse  Avg Tokens
GPT-4o [69]             89.41  (-1.7, 2.0)   80.6%     56.4    255.0        148.0   14.0  72.0   11.0        157
GPT-4-Vision [68]       80.01  (-1.9, 2.8)   71.8%     39.4    182.0        177.0   22.0  91.0   28.0        140
Reka-Flash [83]         64.79  (-2.9, 3.0)   58.8%     18.9    135.0        159.0   28.0  116.0  62.0        181
Claude-3-Opus [2]       62.15  (-2.8, 3.4)   53.0%     13.5    103.0        162.0   48.0  141.0  46.0        120
Yi-VL-PLUS [1]          55.09  (-2.9, 3.0)   52.8%     7.2     98.0         166.0   29.0  124.0  83.0        150
LLaVA-NEXT-34B [48]     51.91  (-3.1, 2.4)   49.2%     2.5     90.0         156.0   26.0  145.0  83.0        165
Claude-3-Sonnet [2]     50.00  --            --        --      --           --      --    --     --          120
Claude-3-Haiku [2]      37.70  (-3.2, 4.2)   30.6%     -16.5   54.0         99.0    47.0  228.0  72.0        97
Gemini-Pro-Vision [82]  35.45  (-2.6, 3.2)   32.6%     -21.0   80.0         83.0    27.0  167.0  143.0       66
LLaVA-NEXT-13B [48]     33.69  (-3.8, 2.7)   33.8%     -21.4   62.0         107.0   25.0  167.0  139.0       138
DeepSeek-VL-7B [51]     33.48  (-2.2, 3.0)   35.6%     -21.2   59.0         119.0   17.0  161.0  144.0       119
CogVLM-Chat-HF [89]     31.88  (-2.7, 2.4)   30.6%     -26.4   75.0         78.0    15.0  172.0  160.0       63
LLaVA-NEXT-7B [48]      26.15  (-2.7, 2.3)   27.0%     -31.4   45.0         90.0    36.0  164.0  165.0       139
Idefics2 [37]           23.71  (-2.4, 2.5)   26.4%     -35.8   44.0         88.0    19.0  164.0  185.0       128
Qwen-VL-Chat [5]        17.87  (-2.6, 2.2)   19.6%     -47.9   42.0         56.0    15.0  155.0  232.0       70
LLaVA-v1.5-13B [47]     14.15  (-2.2, 2.2)   16.8%     -52.5   28.0         56.0    19.0  157.0  240.0       87
Bunny-3B [23]           12.70  (-1.8, 1.9)   16.6%     -54.4   23.0         60.0    10.0  164.0  243.0       76
MiniCPM-V [26]          11.66  (-1.8, 2.1)   13.6%     -57.5   25.0         43.0    16.0  164.0  252.0       89
Tiny-LLaVA [111]        8.01   (-1.4, 1.4)   11.0%     -66.2   16.0         39.0    15.0  127.0  303.0       74
UFORM-Gen2-Qwen [86]    7.55   (-1.6, 1.1)   10.8%     -68.5   16.0         38.0    11.0  115.0  320.0       92
InstructBLIP-7B [14]    5.54   (-1.3, 1.5)   7.8%      -72.5   11.0         28.0    15.0  117.0  329.0       47
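The Win Rate and Reward columns of Table 4 can be reproduced from the five vote counts. The weighting below (±1 for Much Better/Much Worse, ±0.5 for Better/Worse, 0 for Tie) is inferred from the table’s numbers rather than stated in the text, so it should be read as an assumption; the Score column comes from a separate statistical estimate and is not covered by this sketch.

```python
def win_rate(much_better, better, tie, worse, much_worse):
    """Percentage of comparisons the model wins against the reference."""
    total = much_better + better + tie + worse + much_worse
    return 100.0 * (much_better + better) / total


def reward(much_better, better, tie, worse, much_worse):
    """Average reward in percent, with inferred weights +1, +0.5, 0, -0.5, -1."""
    total = much_better + better + tie + worse + much_worse
    weighted = much_better + 0.5 * better - 0.5 * worse - much_worse
    return 100.0 * weighted / total


# Vote counts for GPT-4o taken from Table 4.
gpt4o_counts = (255, 148, 14, 72, 11)
print(win_rate(*gpt4o_counts), reward(*gpt4o_counts))
```

The same weighting also matches the other rows checked by hand, for example the negative reward of Claude-3-Haiku.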

Benchmark Correlation Heatmap

Figure 7: WildVision-Bench achieves the highest correlation with WildVision-Arena, with a Spearman’s correlation of 0.94.

We visualize the Spearman correlation heatmap among various multimodal benchmarks in Figure 7. The MMBench-series [50] (CCBench, MMBench EN, MMBench CN) considers fine-grained perception and reasoning tasks in multiple choice questions. MMVet [101] evaluates integrated capabilities in visual question answering. MMStar [10] alleviates misjudgment issues with high-quality multiple choice questions. HallusionBench [22] focuses on investigating hallucination issues, while MMMU [102] and MathVista [53] focus on college-level subject knowledge and mathematical reasoning in visual contexts, respectively. WildVision Elo represents the arena leaderboard, reflecting human preferences using Elo ratings from pairwise comparisons. WildVision Bench ranks models using the estimated model scores on our WildVision-Bench. It achieves the highest correlation with WildVision Elo, indicating its crucial role in simulating human preferences on these VLMs in the real world. The runner-up in alignment with human preferences is MMVet, followed by MMMU and MMStar.
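For reference, Spearman’s correlation between two leaderboards is the Pearson correlation of their rank vectors. The sketch below is a dependency-free illustration with average ranks for ties; the function names are hypothetical, and the example uses only the top five models from Tables 2 and 4, so its value is not expected to match the 0.94 reported over the full model set.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Arena Elo (Table 2) and WV-Bench score (Table 4) for five models:
# GPT-4o, GPT-4V, Reka-Flash, Claude-3-Opus, Yi-VL-PLUS.
arena_elo = [1235, 1132, 1107, 1100, 1061]
bench_score = [89.41, 80.01, 64.79, 62.15, 55.09]
print(spearman(arena_elo, bench_score))
```

Because only ranks matter, the statistic is insensitive to the different scales of Elo ratings and benchmark scores, which makes it suitable for comparing heterogeneous benchmarks as in Figure 7.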

4 Analysis

In-the-wild Multimodal Chat

In contrast to public benchmarks, in-the-wild multimodal conversations involve images and instructions from a diverse range of sources and receive vote data from a varied group of users. This helps us better understand how current VLMs can benefit real-world scenarios and reveals improvement directions for researchers in the field. In Appendix B, we present more cases under each image domain and question category. We will release both multimodal chat and crowdsourced voting data for future research.

Failure Cases

In Table 5, we present two distinct failure instances documented on the WildVision-Arena platform. This analysis reveals that GPT-4V’s limitations primarily stem from insufficient background knowledge, whereas Gemini-Pro-Vision often fails to discern and process subtle details crucial for deriving correct answers. Additional details on these failure cases are provided in the appendix. Our categorization of common failures includes seven types: Visual Recognition, Visual Reasoning, Spatial Imagination, Contextual Understanding, Expert Domain Knowledge, Hallucination, and Safety. Although not all failure cases can be included in this paper, we plan to periodically release additional cases on our live platform to aid ongoing research and development.

Table 5: Failure cases of GPT-4V and Gemini-Pro-Vision sampled from WildVision-Arena.
Image (left)
Text Prompt: What operating system is this?
GPT-4V: The image shows a screenshot of Windows 3.x, which is a series of graphical operating environments produced by Microsoft for use on personal computers, prior to the Windows 95 operating system…

Image (right)
Text Prompt: What is the length of the object shown in the image.
Gemini-Pro-Vision: The diameter of the bolt is 7.92 mm.

Model Comparison on WildVision-Bench

Table 3 compares the responses of GPT-4V, LLaVA-NEXT-34B, and Gemini-Pro-Vision on a validation sample from WildVision-Bench. GPT-4V generally outperforms the other models, confirming expectations of its superior capabilities. Nevertheless, all models occasionally fail to deliver correct responses, notably in scenarios requiring compositional reasoning, regardless of the simplicity of the text or the image involved. We also observe that recognizing and interpreting subtle visual details within images is still challenging for less capable models.

Broader Impact

For the first data release, we plan to publish over 20,000 crowdsourced multi-turn conversations and more than 8,000 human votes with reasons, providing a valuable resource for understanding human preferences in VLM interactions and for developing models that align more closely with human standards in real-world scenarios. We will also present a live leaderboard together with failure case analysis to keep track of recent advancements in this field. Additionally, by open-sourcing the WildVision-Arena code, we enable researchers and developers to adapt our methods to other domains. Finally, we will support fast, human-aligned evaluation of VLMs on WildVision-Bench.
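As an illustration of how pairwise votes can feed a leaderboard, the following minimal sketch applies the standard sequential Elo update to a list of votes. The function name `elo_from_votes`, the vote tuple layout, the constants (`k`, `scale`, `init`), and the model names are all assumptions made for this example; they are not the released data format or the platform's actual rating code.

```python
from collections import defaultdict


def elo_from_votes(votes, k=4.0, base=10.0, scale=400.0, init=1000.0):
    """Sequentially update Elo ratings from (model_a, model_b, winner) votes.

    winner is "a", "b", or "tie". All constants are illustrative defaults.
    """
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (score_a - expected_a)
        # Zero-sum update: what model_a gains, model_b loses.
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Hypothetical votes, not taken from the released data.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
ratings = elo_from_votes(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because the update is sequential, the resulting ratings depend on vote order; a Bradley-Terry fit [9] over all votes is an order-independent alternative.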

Modality, Resolution, Long Context, Resource-Efficient

Many works have extended vision-language models (VLMs) beyond image-text modalities to video [105, 57, 109] and audio [13], and have even applied them to embodied agents [65]. Future work may consider improving all-in-one models [63, 92, 82, 112, 19] by discovering better methods to integrate data from these modalities. Recent works have enabled high-resolution [48, 96] and text-reading [108, 25] capabilities in VLMs, although many failure cases are still induced by low resolution or poor OCR capability. Other works advance multi-image and long-context capabilities in VLMs [61, 37, 29, 79, 54]. We expect future research to discover the best mechanisms for conveying multimodal information both compactly and effectively, such as recent progress on text representation in pixel space [75, 18, 55]. This is essential to closing the gap between open-source multimodal agents [99, 104] and proprietary ones [97, 69]. Although many works [26, 111] have made VLMs more compact, their performance is still unsatisfactory. Future work may further improve the performance of smaller models with less training data and higher-throughput inference.

World Knowledge and Safety in VLMs

The challenge of embedding extensive world knowledge within VLMs is significant, particularly given their current limitations in understanding physical principles and interacting with real-world environments. These models’ ability to dynamically expand their knowledge base through activities like browsing the internet, reading books, or watching videos is an exciting potential advancement. Key concerns in LLMs include security [94, 64, 90, 98], privacy [31, 38], truthfulness [30, 77, 45], and the prevention of misinformation [80, 72, 103]. VLMs face additional, unique safety challenges: 1) incorrect alignment of multimodal data can lead to harmful outputs, 2) images may contain sensitive information, necessitating careful handling, and 3) VLMs are vulnerable to attacks that manipulate both text and images.

5 Related Work

Live Benchmarking for vision-language models

Vision-and-language pre-training has progressed from models [42, 43] adapting BERT objectives [33], to models [74] adopting contrastive learning, and to unified frameworks [52, 88, 41, 40] without task-specific heads. With recent advancements in Large Language Models [67, 20, 4, 84, 85], their multimodal counterparts [68, 82, 14, 113, 49, 47, 5, 28, 37] are dominating vision and language tasks. Beyond earlier task-specific benchmarks for captioning [11, 78], visual question answering [62, 59, 27, 21, 60], and grounding [46, 100, 66, 58, 71], more benchmarks [101, 50, 39, 32] have been proposed to capture VLM capabilities. When building such benchmarks, there is an urgent need to alleviate data contamination [76, 6] during evaluation, ensure robustness [55] and difficulty [70], and incorporate real-world scenarios [8, 93]. We build WildVision-Arena to support diversified, difficult, in-the-wild, live benchmarking [12, 95] of VLMs.

Human-Aligned Evaluation for vision-language models

Evaluation of open-ended vision and language tasks [8, 93, 70] is usually challenging, and recent techniques improve human alignment by mapping free-form predictions to pre-defined choices [50] or by using larger models as the evaluator [56, 107]. In evaluating LLMs, certain approaches [110, 16] have proven effective in aligning with real-world annotators on the Chatbot Arena [12]. This inspires our effort to curate the small-scale, in-the-wild WildVision-Bench, which supports fast evaluation by pairwise comparison with a reference model (such as Claude-3-Sonnet [2]) and achieves alignment with crowdsourced human raters on WildVision-Arena.
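Alignment between an automatic benchmark and arena ratings can be quantified with a rank correlation such as Spearman's. The sketch below computes it in plain Python as the Pearson correlation of average ranks. The helper names (`average_ranks`, `spearman`) and the input numbers are hypothetical and serve only to illustrate the computation; they are not scores or ratings reported in this paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers for illustration only, not results from the paper.
bench_scores = [71.5, 60.2, 55.0, 41.3, 38.9]
arena_elo = [1190, 1105, 1120, 1010, 980]
rho = spearman(bench_scores, arena_elo)
```

With these made-up inputs, one adjacent pair of models swaps order between the two lists, so the correlation is high but below 1. The function assumes each input contains at least two distinct values; otherwise the variance is zero and the division fails.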

6 Conclusion

We first introduce WildVision-Arena, a dynamic evaluation platform for comparing vision-language models (VLMs) in the wild. We derive comparative insights across more than 20 models from an extensive dataset of 20,000+ multimodal conversations and 8,000+ votes, allowing for continuous refinement of VLM performance. From these in-the-wild chats, we then sample safe and diversified data for WildVision-Bench and apply automatic evaluation that closely aligns with crowdsourced human preferences from WildVision-Arena. Our comprehensive analysis of these in-the-wild chats indicates future directions for advancing VLMs.

References

  • [1] 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai, 2024.
  • [2] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024.
  • [3] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  • [4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
  • [6] Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms, 2024.
  • [7] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
  • [8] Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023.
  • [9] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • [10] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
  • [11] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [12] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.
  • [13] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023.
  • [14] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  • [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • [16] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024.
  • [17] Arpad E Elo. The proposed uscf rating system, its development, theory, and applications. Chess life, 22(8):242–247, 1967.
  • [18] Tianyu Gao, Zirui Wang, Adithya Bhaskar, and Danqi Chen. Improving language understanding from screenshots, 2024.
  • [19] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all, 2023.
  • [20] Google. Bard - chat based ai tool from google, powered by palm 2. https://bard.google.com/?hl=en, 2023.
  • [21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [22] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023.
  • [23] Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective, 2024.
  • [24] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • [25] Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, and Fei Huang. mplug-paperowl: Scientific diagram analysis with the multimodal large language model, 2024.
  • [26] Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, and Maosong Sun. Large multilingual models pivot zero-shot multimodal learning across languages, 2024.
  • [27] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • [28] IDEFICS. Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model, 2023.
  • [29] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning, 2024.
  • [30] Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He. Personas as a way to model truthfulness in language models, 2024.
  • [31] Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 10697–10707. PMLR, 17–23 Jul 2022.
  • [32] Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.
  • [33] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • [34] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
  • [35] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2023.
  • [36] Gant Laborde. Deep nn for nsfw detection.
  • [37] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
  • [38] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [39] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023.
  • [40] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 19730–19742, 2023.
  • [41] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [42] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language, 2019.
  • [43] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020.
  • [44] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
  • [45] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022.
  • [46] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [47] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [48] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  • [49] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [50] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024.
  • [51] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  • [52] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks, 2022.
  • [53] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
  • [54] Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, and William Yang Wang. From text to pixel: Advancing long-context understanding in mllms, 2024.
  • [55] Yujie Lu, Xiujun Li, William Yang Wang, and Yejin Choi. Vim: Probing multimodal large language models for visual embedded instruction following, 2023.
  • [56] Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, and William Yang Wang. Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation, 2023.
  • [57] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023.
  • [58] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [59] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [60] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
  • [61] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024.
  • [62] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
  • [63] Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, and Anuj Kumar. Anymal: An efficient and scalable any-modality augmented language model, 2023.
  • [64] Maximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D. Griffin. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities, 2023.
  • [65] Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought, 2023.
  • [66] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 792–807. Springer, 2016.
  • [67] OpenAI. Gpt-4: Technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [68] OpenAI. Gpt-4v(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023.
  • [69] OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o, 2024.
  • [70] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Mikel Artetxe, and Yi Tay. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models, 2024.
  • [71] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [72] Dorian Quelle, Calvin Cheng, Alexandre Bovet, and Scott A. Hale. Lost in translation – multilingual misinformation and its evolution, 2023.
  • [73] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [74] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [75] Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. Language modelling with pixels, 2023.
  • [76] Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023.
  • [77] Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, and Jordan Boyd-Graber. Large language models help humans verify truthfulness – except when they are convincingly wrong, 2024.
  • [78] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
  • [79] Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. Milebench: Benchmarking mllms in long context, 2024.
  • [80] Jinyan Su, Claire Cardie, and Preslav Nakov. Adapting fake news detection to the era of large language models, 2024.
  • [81] Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
  • [82] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [83] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. Reka core, flash, and edge: A series of powerful multimodal language models, 2024.
  • [84] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [85] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [86] UForm. Uform: Pocket-sized multimodal ai for content understanding and generation, 2024.
  • [87] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015.
  • [88] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  • [89] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2024.
  • [90] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?, 2023.
  • [91] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding, 2022.
  • [92] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm, 2023.
  • [93] XAI. Realworldqa benchmark, 2024.
  • [94] Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. Recipes for safety in open-domain chatbots, 2021.
  • [95] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models, 2023.
  • [96] Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images, 2024.
  • [97] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, and Lijuan Wang. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation, 2023.
  • [98] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, June 2024.
  • [99] Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Agent lumos: Unified and modular training for open-source language agents, 2024.
  • [100] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
  • [101] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • [102] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  • [103] Zhenrui Yue, Huimin Zeng, Yimeng Lu, Lanyu Shang, Yang Zhang, and Dong Wang. Evidence-driven retrieval augmented response generation for online misinformation, 2024.
  • [104] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023.
  • [105] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding, 2023.
  • [106] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey, 2024.
  • [107] Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v(ision) as a generalist evaluator for vision-language tasks, 2023.
  • [108] Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2024.
  • [109] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024.
  • [110] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  • [111] Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models, 2024.
  • [112] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment, 2024.
  • [113] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Part I Appendix


Appendix A User Interface

In Figure 8, we show a screenshot of the user interface of our WildVision-Arena, which presents an interactive environment for evaluating multimodal large language models. This environment allows users to input questions and compare responses from multiple models simultaneously. Each model’s answer is displayed side-by-side, enabling a straightforward comparison of their performance and capabilities based on user queries related to specific images or tasks. The interface also facilitates easy selection and voting to decide which model’s response fits the user’s criteria best, enhancing the user’s ability to judge and refine the models’ outputs effectively.

Figure 8: User Interface of WildVision-Arena.

Appendix B Question Category and Image Domain

In Tables 6–8, we showcase example data from WildVision-Arena’s users under each image domain and question category.

Table 6: Example input data in WildVision-Arena tagged with [Image Domain-Subdomain] and [Question Category-Subcategory].
Image [Entertainment-Movies/TV Shows] Image [Natural-Plants]
[Uncaptioned image] [Uncaptioned image]
[Descriptive-Movies/TV Shows] Text Prompt: What are the two giraffe characters on this movie poster doing? [Analytical-Problem Solving] Text Prompt: How likely is it to snow after this picture was taken? What would change with this type of tree before it’s likely to snow?
Image [Expert-Business] Image [Urban-Infrastructure]
[Uncaptioned image] [Uncaptioned image]
[Analytical-Data Analysis] Text Prompt: Which of the companies featured in the dashboard are headquartered outside the US? [Recognition-Text] Text Prompt: Can you tell me the potential risks and the unreasonale parts in the image?
Table 7: Example input data in WildVision-Arena tagged with [Image Domain-Subdomain] and [Question Category-Subcategory].
Image [Entertainment-Comics] Image [People-Portraits]
[Uncaptioned image] [Uncaptioned image]
[Descriptive-Scene Description] Text Prompt: Whos’s in the sky? [Creative-Media Post] Text Prompt: write a social media post with the provided image, saying that I am ready for the new challange.
Image [Urban-Buildings] Image [Expert-Science]
[Uncaptioned image] [Uncaptioned image]
[Recognition-Location] Text Prompt: where is this? [Analytical-Safety Procedures] Text Prompt: Can you tell me the potential risks and the unreasonale parts in the image?
Image [Natural-Landscapes] Image [Objects-Household Tools]
[Uncaptioned image] [Uncaptioned image]
[Recognition-Location] Text Prompt: where was this photo taken? [Descriptive-Object Description] Text Prompt: describe the scene and objects
Table 8: Example input data in WildVision-Arena tagged with [Image Domain-Subdomain] and [Question Category-Subcategory].
Image [Entertainment-Web and Mobile Apps Screenshots] Image [Event-Sports]
[Uncaptioned image] [Uncaptioned image]
[Interactive-Web Navigation] Text Prompt: I need to download flyer, you will be given screenshot from browser with elements marked with number. give next action to take on web page to download the flyersngive me response in below format example 1 action:[click,scroll,wait], box:1 format action:, box: [Descriptive-Scene Description] Text Prompt: this is a football match , every player has an identifier , describe every player action (example : player #501 is running)
Image [Urban-Infrastructure] Image [Expert-Science]
[Uncaptioned image] [Uncaptioned image]
[Interactive-Recommendations] Text Prompt: Which section’s ticket would you recommend I purchase? [Interactive-Code Generation] Text Prompt: Give me Latex code to create this diagram
Image [Expert-Health and Medicine] Image [Entertainment-Web and Mobile Apps Screenshots]
[Uncaptioned image] [Uncaptioned image]
[Recognition-Object] Text Prompt: what type of tumor is this? [Analytical-Critical Reviews] Text Prompt: Review each screenshot carefully, focusing on different aspects of usability…

Appendix C Analysis of Failure Cases

Table 9: Failure Cases.
Image Error #1 Image Error #2
[Uncaptioned image] [Uncaptioned image]
Text Prompt: What should I play in this position? Text Prompt: Based on the architecture plans is this compliant in the Ontario building code
GPT-4V: This chess position is unconventional and humorous because it features a board filled with knights (horses) except for a single queen. In a standard game of chess, this situation would not occur, as each player starts with just two knights. In this specific position, there’s not much strategy to discuss because the game doesn’t reflect an actual playable scenario. However, if we want to engage with this as a fun exercise in imagination, you could consider moving one of the knights that isn’t pinned or creating a threat to the queen. For example, you could move one of the knights on the third rank to attack the queen, such as Nc3-d5 or Ne3-c4. Remember, this is just a playful scenario and not representative of a legitimate chess position. In a real game, the best move would depend on a much broader context of the board and the strategic positions of all the pieces, not just an army of knights. GPT-4V: I’m sorry, but as an AI, I do not have the capacity to verify the compliance of architectural plans with specific building codes. Building code compliance typically involves a wide range of considerations, including but not limited to accessibility, fire safety, structural integrity, and mechanical systems. Professional architects and engineers, often alongside local code enforcement officials, ensure that designs and construction meet the appropriate codes and regulations. To determine compliance with the Ontario Building Code, the plans would need to be reviewed by a certified professional who has access to and understands the local regulations and requirements. They would check if the design meets all necessary stipulations for things such as room sizes, egress requirements, bathroom accessibility, fire ratings, and many other details. You should consult with a licensed Ontario architect or a building code consultant for an official assessment of compliance.
Image Error #3 Image Error #4
[Uncaptioned image] [Uncaptioned image]
Text Prompt: What food will these people have very soon? Text Prompt: Based on this chart, what is the MOHS hardness of Marble?
LLaVA-NEXT-34B: Based on the image, the people are lined up outside a building with a sign that reads "Salsa’s Authentic Mexican Restaurant." It seems likely that they will be having Mexican food very soon, which could include items such as tacos, burritos, enchiladas, chiles rellenos, guacamole, and other similar dishes that are typically served at Mexican restaurants. The specific menu items will depend on the restaurant’s offerings and the preferences of the customers. LLaVA-NEXT-34B: The MOHS hardness of Marble, as indicated on the chart, is 3.
Table 10: Failure Cases.