OR-Bench: An Over-Refusal Benchmark for Large Language Models

Cui, Justin; Chiang, Wei-Lin; Stoica, Ion; Hsieh, Cho-Jui

arXiv:2405.20947 (cs)

[Submitted on 31 May 2024 (v1), last revised 20 Jun 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely to be rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at this https URL and the demo can be found at this https URL. We hope this benchmark can help the community develop better safety-aligned models.
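
As a rough illustration of what measuring over-refusal could involve, the sketch below computes refusal rates separately for benign (seemingly toxic) and toxic prompts within each category. It is not the authors' evaluation code: the `Record` type, the `refusal_rates` function, and the category names are hypothetical, and the `refused` flag is assumed to have been decided beforehand by some unspecified judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    category: str   # placeholder category name, not taken from the paper
    is_toxic: bool  # True for a genuinely toxic prompt, False for a benign one
    refused: bool   # assumed to be supplied by a separate refusal judge


def refusal_rates(records):
    """Return {(category, is_toxic): fraction of prompts refused}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refused count, total count]
    for r in records:
        key = (r.category, r.is_toxic)
        counts[key][0] += int(r.refused)
        counts[key][1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}


# Tiny made-up sample, for illustration only.
sample = [
    Record("violence", False, True),
    Record("violence", False, False),
    Record("violence", True, True),
    Record("privacy", False, False),
]
rates = refusal_rates(sample)
```

Under this framing, a high rate for the `(category, False)` keys would indicate over-refusal, while a low rate for the `(category, True)` keys would indicate indiscriminate answering, which is the failure the additional toxic prompts are meant to catch.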

Comments:	version 2, 10 pages main, 22 pages total
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2405.20947 [cs.CL]
	(or arXiv:2405.20947v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.20947

Submission history

From: Justin Cui
[v1] Fri, 31 May 2024 15:44:33 UTC (1,731 KB)
[v2] Thu, 20 Jun 2024 05:22:38 UTC (1,653 KB)
