Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Yan, Bei; Zhang, Jie; Yuan, Zheng; Shan, Shiguang; Chen, Xilin

arXiv:2406.17115 (cs)

[Submitted on 24 Jun 2024 (v1), last revised 9 Oct 2024 (this version, v2)]

Abstract: Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics. However, we find that the quality of the existing hallucination benchmarks varies, with some suffering from problems, e.g., inconsistent evaluation results under repeated tests, and misalignment with human evaluation. To this end, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages various indicators to assess the reliability and validity of existing hallucination benchmarks separately. Specifically, for reliability we explore test-retest reliability and parallel-forms reliability, while for validity we examine criterion validity and coverage of hallucination types. Furthermore, based on the results of our quality measurement, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs, which demonstrates superior reliability and validity under our HQM framework. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-1.5-Pro, to provide an in-depth analysis of the hallucination issues in existing models. Our benchmark is publicly available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.17115 [cs.CV]
	(or arXiv:2406.17115v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.17115

Submission history

From: Bei Yan
[v1] Mon, 24 Jun 2024 20:08:07 UTC (3,515 KB)
[v2] Wed, 9 Oct 2024 10:43:47 UTC (2,982 KB)
