Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Xu, Zhao; Liu, Fan; Liu, Hao

arXiv:2406.09324 (cs)

[Submitted on 13 Jun 2024 (v1), last revised 6 Nov 2024 (this version, v3)]

Abstract: Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work largely overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduce JailTrickBench to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at this https URL.

Comments:	Accepted by NeurIPS 2024
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2406.09324 [cs.CR]
	(or arXiv:2406.09324v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2406.09324

Submission history

From: Zhao Xu
[v1] Thu, 13 Jun 2024 17:01:40 UTC (3,580 KB)
[v2] Fri, 4 Oct 2024 05:14:30 UTC (4,634 KB)
[v3] Wed, 6 Nov 2024 04:43:57 UTC (4,867 KB)
