Are aligned neural networks adversarially aligned?

Nicholas Carlini¹, Milad Nasr¹, Christopher A. Choquette-Choo¹,
Matthew Jagielski¹, Irena Gao², Anas Awadalla³, Pang Wei Koh¹³,
Daphne Ippolito¹, Katherine Lee¹, Florian Tramèr⁴, Ludwig Schmidt³
¹Google DeepMind ² Stanford ³University of Washington ⁴ETH Zurich

Abstract

Large language models are now tuned to align with the goals of their creators, namely to be “helpful and harmless.” These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs.

However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models. Warning: some content generated by language models in this paper may be offensive to some readers.

Refer to caption — Figure 1: We generate adversarial *images* for aligned multimodal text-vision models that result in profane or otherwise harmful output, which would not normally be generated by the model. When presented with clean inputs the models follow their instruction tuning and produce harmless output, but by providing a worst-case maliciously-constructed input, we can induce arbitrary output behavior discouraged by the alignment techniques.

1 Introduction

Aligned language models are supposed to be “helpful and harmless” Bai et al. (2022): they should respond helpfully to user interaction, but avoid causing harm, either directly or indirectly. Prior work has focused extensively on how to train models to align with the preferences and goals of their creators. For example, reinforcement learning through human feedback (RLHF) Bai et al. (2022); Ouyang et al. (2022); Christiano et al. (2023) fine-tunes a pretrained model to emit outputs that humans judge to be desirable, and discourages outputs that are judged to be undesirable. This method has been successful at training models that produce benign content that is generally agreeable.

However, these models not are perfectly aligned. By repeatedly interacting with models, humans have been able to “social engineer” them into producing some harmful content (i.e., “jailbreak” attacks). For example, early attacks on ChatGPT (one such alignment-tuned language model) worked by telling the model the user is a researcher studying language model harms and asking ChatGPT to help them produce test cases of what a language model should not say. While there have been many such anecdotes where humans have manually constructed harm-inducing prompts, it has been difficult to scientifically study this phenomenon.

Fortunately, the machine learning community has by now studied the fundamental vulnerability of neural networks to adversarial examples for a decade (Szegedy et al., 2014; Biggio et al., 2013). Given any trained neural network and an arbitrary behavior, it is almost always possible to construct an “adversarial example” that cause the selected behavior. Much of the early adversarial machine learning work focused on the domain of image classification, where it was shown that it is possible to minimally modify images so that they will be misclassified as an arbitrary test label. But adversarial examples have since been expanded to text Jia and Liang (2017); Ebrahimi et al. (2017); Alzantot et al. (2018); Wallace et al. (2019); Jones et al. (2023) and other domains.

In this paper we unify these two research directions and study what we call adversarial alignment: the evaluation of aligned models to adversarial inputs. That is, we ask the question:

Are aligned neural network models “adversarially aligned”?

First, we show that current alignment techniques—such as those used to fine-tune the Vicuna model Chiang et al. (2023)—are an effective defense against existing state-of-the-art (white-box) NLP attacks. This suggests that the above question can be answered in the affirmative. Yet, we further show that existing attacks are simply not powerful enough to distinguish between robust and non-robust defenses: even when we guarantee that an adversarial input on the language model exists, we show that state-of-the-art attacks fail to find it. The true adversarial robustness of current alignment techniques thus remains an open question, which will require substantially stronger attacks to resolve.

We then turn our attention to today’s most advanced multimodal models, such as OpenAI’s GPT-4 and Google’s Flamingo and Gemini, which accept both text and images as input (OpenAI, 2023; Alayrac et al., 2022; Pichai, 2023). Specifically, we study open-source implementations with similar capabilities Liu et al. (2023); Zhu et al. (2023); Gao et al. (2023) since these proprietary models are not yet publicly accessible. We find that we can use the continuous-domain images as adversarial prompts to cause the language model to emit harmful toxic content (see, e.g., Figure 1, or unfiltered examples in Appendix C). Because of this, we conjecture that improved NLP attacks may be able to trigger similar adversarial behavior on alignment-trained text-only models, and call on researchers to explore this understudied problem.

Some alignment researchers Russell (2019); Bucknall and Dori-Hacohen (2022); Ngo (2022); Carlsmith (2022) believe that sufficiently advanced language models should be aligned to prevent an existential risk Bostrom (2013) to humanity: if this were true, an attack that causes such a model to become misaligned even once would be devastating. Even if these advanced capabilities do not come to pass, the machine learning models of today already face practical security risks Brundage et al. (2018); Greshake et al. (2023). Our work suggests that eliminating these risks via current alignment techniques—which do not specifically account for adversarially optimized inputs—is unlikely to succeed.

2 Background

Our paper studies the intersection of two research areas: AI alignment and adversarial examples.

Large language models.

As large language model parameter count, training dataset size, and training duration have been increased, the models have been found to exhibit complex behaviors (Brown et al., 2020; Wei et al., 2022b; Ganguli et al., 2022). In this work, we focus on models trained with causal “next-word” prediction, and use the notation $s\leftarrow\texttt{Gen}(x)$ to denote a language model emitting a sequence of tokens $s$ given a prompt $x$ . Many applications of language models take advantage of emergent capabilities that arise from increased scale. For instance, language models are commonly used to perform tasks like question answering, translation, and summarization Chowdhery et al. (2022); Rae et al. (2022); Anil et al. (2023); Liang et al. (2022); Goyal et al. (2022).

Aligning large language models.

Large pretrained language models can perform many useful tasks without further tuning Brown et al. (2020), but they suffer from a number of limitations when deployed as is in user-facing applications. First, these the models do not follow user instructions (e.g., “write me a sorting function in Python”), likely because the model’s pretraining data (e.g., Internet text) contains few instruction-answer pairs. Second, by virtue of faithfully modeling the distribution of Internet text, the base models tend to reflect and even exacerbate biases Abid et al. (2021), toxicity, and profanity Welbl et al. (2021); Dixon et al. (2018) present in the training data.

Model developers thus attempt to align base models with certain desired principles, through techniques like instruction tuning Wei et al. (2022a); Ouyang et al. (2022) and reinforcement learning via human feedback (RLHF) Christiano et al. (2023); Bai et al. (2022). Instruction tuning finetunes a model on tasks described with instructions. RLHF explicitly captures human preferences by supervising the model towards generations preferred by human annotators (Christiano et al., 2023).

Multimodal text-vision models.

Increasingly, models are multimodal, with images and text being the most commonly combined modalities (OpenAI, 2023; Pichai, 2023; Liu et al., 2023; Zhu et al., 2023). Multimodal training allows these models to answer questions such as “how many people are in this image?” or “transcribe the text in the image”.

While GPT-4’s multimodal implementation has not been disclosed, there are a number of open-source multimodal models that follow the same general protocol Gao et al. (2023); Liu et al. (2023); Zhu et al. (2023). These papers start with a standard pre-trained language model that tokenizes and then processes the embedding layers. To process images, they use a pretrained vision encoder like CLIP Radford et al. (2021) to encode images into an image embedding, and then train a projection model that converts image embeddings into token embeddings processed by the language model. These visual tokens may be passed directly as an input to the model Zhu et al. (2023); Liu et al. (2023), surrounded by special templates (e.g., “<img> . . . <\img>”) to delineate their modality, or combined internal to the model via learned adaptation prompts Gao et al. (2023).

Adversarial examples.

Adversarial examples are inputs designed by an adversary to cause a neural network to perform some incorrect behavior (Szegedy et al., 2014; Biggio et al., 2013). While primarily studied on vision classification tasks, adversarial examples also exist for textual tasks such as question answering (Jia and Liang, 2017; Wallace et al., 2019), document classification Ebrahimi et al. (2017), sentiment analysis Alzantot et al. (2018), or triggering toxic completions Jones et al. (2023); Wallace et al. (2019). Prior work on textual tasks has either applied greedy attack heuristics Jia and Liang (2017); Alzantot et al. (2018) or used discrete optimization to search for input text that triggers the adversarial behavior Ebrahimi et al. (2017); Wallace et al. (2019); Jones et al. (2023).

In this paper, we study adversarial examples from the perspective of alignment. Because aligned language models are intended to be general-purpose—with strong performance on many different tasks—we focus more broadly on adversarial examples that cause the model to produce harmful behavior, rather than adversarial examples that simply cause “misclassification”.

Our inputs are “adversarial” in the sense that they are specifically optimized to produce some targeted and unwanted outcome. Unlike recent “social-engineering” attacks on language models that induce harmful behavior by tricking the model into playing a harmful role (for example, taking on the persona of a racist movie actor Reddit (2023)), we make no effort to ensure our attacks are semantically meaningful—and they often will not be.

3 Threat Model

There are two primary reasons researchers study adversarial examples. On the one hand, researchers are interested in evaluating the robustness of machine learning systems in the presence of real adversaries. For example, an adversary might try to construct inputs that evade machine learning models used for content filtering (Tramèr et al., 2019; Welbl et al., 2021) or malware detection (Kolosnjaji et al., 2018), and so designing robust classifiers is important to prevent a real attack.

On the other hand, researchers use adversarial robustness as a way to understand the worst-case behavior of some system (Szegedy et al., 2014; Pei et al., 2017). For example, we may want to study a self-driving car’s resilience to worst-case, adversarial situations, even if we do not believe that an actual attacker would attempt to cause a crash. Adversarial examples have seen extensive study in the verification of high-stakes neural networks (Wong and Kolter, 2018; Katz et al., 2017), where adversarial examples serve as a lower bound of error when formal verification is not possible.

3.1 Existing Threat Models

Existing attacks assume that a model developer creates the model and uses some alignment technique (e.g., RLHF) to make the model conform with the developer’s principles. The model is then made available to a user, either as a standalone model or via a chat API. There are two common settings under which these attacks are mounted, which we describe below.

Malicious user: The user attempts to make the model produce outputs misaligned with the developer’s principles. Common examples of this are jailbreaks of chatbots such as ChatGPT or Bard where a user uses an adversarial example (a maliciously designed prompt) to elicit the desired unaligned behavior, such as outputting instructions for building a bomb. In this setting, there is no need for the attack to be “stealthy”.

Malicious third-party: An honest user might query an alignment-tuned language model as part of an autonomous system that processes untrusted third-party data (e.g., a virtual assistant that reads and writes the user’s emails). A malicious third-party could launch a prompt injection attack (Greshake et al., 2023) by feeding data to the language model to hijack its behavior (e.g., exfiltrating the user’s emails). In this setting, there might be stricter constraints on a valid attack.

3.2 Our Threat Model

In this paper, we are focused on better understanding the limitations of current alignment techniques. As such, we use adversarial examples primarily to measure their worst-case behavior. That is, for the most part we are not concerned with any particular practical adversary, and instead, only with finding any valid input that achieves our attack goal. That said, our attacks are likely to be practical in both settings where a malicious user is attacking the trained language model to allow it to perform malicious behavior, as well as settings where an honest user copies malicious data from an adversary.

Attack Goal.

Independent of who is attacking whom, it is also necessary to define the ultimate attack objective. While attackers could aim to induce a number of harmful behaviors (e.g., triggering a model into outputting dangerous instructions, or emitting private training data), we focus here on the specific goal of triggering toxic outputs. The main reason is that toxicity is relatively easy to (approximately) evaluate in a automated way. We use an extremely simplistic definition for toxicity (checking for the presence or absence of a selected list of toxic words in the output of the model), noting that definitions of toxicity are cultural and contextual and widely studied. We formalize our toxicity attack objective as follows:

Specifically, given a toxicity-detection function isToxic, and an language model Gen that is never supposed to generate text that causes isToxic to be true, an input $X$ causes harm if $\texttt{isToxic}(\texttt{Gen}(X))$ returns true. When the generation function is randomized (i.e., it is not greedy sampling) we might require $\Pr[\texttt{isToxic}(\texttt{Gen}(X))]>\varepsilon$ , for some probability $\varepsilon>0$ . For simplicity, in our experiments, we set the temperature to $0$ to make models deterministic.

Choosing an appropriate method for isToxic is important. In this paper we use simple substring lookup. Our attack chooses some harmful target sequence (for example, a curse word) that the model should generate, and then we report a generation as toxic if it contains this word anywhere.

4 Evaluating Aligned Models with NLP-Only Attacks

The most direct method for generating an adversarial text sequence would be to directly optimize some differentiable proxy for the attack objective, $\texttt{isToxic}(\texttt{Gen}(X))$ . However there are two challenges with this approach, arising from access limitations of these models in that tokens must be inputted and outputted by the model:

1.

Text tokens are discrete, and so continuous optimization via common optimization algorithms, e.g., gradient descent is unlikely to be effective Ebrahimi et al. (2017).
2.

There is often not one exact target. And so in order to check if the attack succeeded, we would have to query the model to emit one token at a time. Thus, in order to pass a long sequence $S$ into the toxicity classifier we would need to generate $\lvert S\rvert$ tokens and then perform back propagation through $\lvert S\rvert$ neural network forward passes.

Attack Objective: harmful prefix.

While the first challenge above is a fundamental challenge of neural language models, the second is not fundamental. To address this, instead of directly optimizing the true objective, i.e., checking that $\texttt{isToxic}(S)$ is true for a generated $S$ , we optimize for the surrogate objective $S_{..j}=t$ for some malicious string $t$ , with $j\ll\lvert S\rvert$ . This objective is much easier to optimize as we can now perform just one single forward pass.

Why does this work? We find that, as long as the language modle begins its response with some harmful output, then it will continue to emit harmful text without any additional adversarial control. In this section, we will study the suitability of prior attack methods for achieving our toxicity objective against a variety of chat bot models, both trained with and without alignment techniques.

4.1 Our Target: Aligned Chat Bots

Alignment techniques (such as RLHF) are typically not applied to “plain” language models, but rather to models that have been first tuned to interact with users via a simple chat protocol.

Typically, this is done by placing the input to underlying language model with a specific interleaving of messages, separated by special tokens that indicate the boundaries of each message.

$[$ USER $]$ :	“Hello, how are you?”
$[$ AGENT $]$ :	‘I am a large language model.’
$[$ USER $]$ :	“What is 1+2?”
$[$ AGENT $]$ :	‘3.’

In the above example, the chat bot’s user typed in the messages in double-quotes, and the language model generated the italicized text in single-quotes. The special tokens ‘[USER]:’ and ‘[AGENT]:’ are automatically inserted by the chat bot application to delineate rounds of interaction when prompting the language model for its next message.

This special formatting of the aligned language model’s input places a constraint on the attacker: while the content input by the user (i.e., the text in double quotes) could be arbitrarily manipulated, the prior chat history as well as the special ‘[USER]:’ and ‘[AGENT]:’ tokens cannot be modified. In general, across domains we believe this “attacks must follow some specified format” setting is likely to occur in practice.

4.2 Prior Attack Methods

A number of prior works have studied adversarial examples against NLP models. The most closely related to our goal is the work of Jones et al. (2023) who study the possibility of inverting a language model, i.e., of finding an adversarial prompt $X$ that causes a model $f$ to output some targeted string $y\leftarrow f(X)$ . Their technique succeeds in making a model emit the names of US Senators or emit toxic words after particular individuals. However, they assume a stronger threat model than we do here, and which does not fully match with the recent deployment of alignment-trained language models. Indeed, as noted above in Section 4.1, these language models typically operate as chat bots that interleave attacker-controlled messages with model generations and special separation tokens. As a result, this paper considers the generic problem of designing an adversarial prompt $X$ such that $\texttt{Gen}(P_{pre}||X||P_{post})$ is toxic, where $P_{pre}$ and $P_{post}$ are non-adversarial parts of the model’s prompt that respectively precede and succeed the attacker’s input.

In contrast, the setup in Jones et al. (2023) assumes that the attacker can either control the entire prompt (i.e., the aim is to make $\texttt{Gen}(X)$ toxic), or at least the text immediately preceding the model’s next generation (i.e., $\texttt{Gen}(P_{pre}||X)$ ). In both cases, the attacker controls the tokens right before the model’s generation, which makes an attack much easier (but not relevant for current chat bot deployments). (However: there are several settings where their threat model is valid—for example, in code completion tasks this is likely the case. Our threat model is just more general.)

4.3 Our Evaluation Setup

Thus, we assume the adversary can control only their messages (following the [USER]: token), and then the special [AGENT]: token is appended to the prompt sequence to indicate the end of the prompt and that the LLM should begin generating a response. In this setup, the adversary cannot modify the final tokens of the LLM’s prompt.

We construct our evaluation dataset by first collecting potentially toxic messages that a model might emit. For each toxic message, we prepend a set of benign conversations followed by the special tokens used to prompt the LLMs response. An attack is successful if it causes the toxic output to be emitted. We source benign conversations from the Open Assistant dataset. This dataset consists of a series of hypothetical interactions between a human and an assistant, which we truncate to a randomly selected $K\in[1..N]$ rounds. We use a random subset of harmful texts from Jones et al. (2023) as the attack objective, with attacks targeting between one and three tokens of toxic text.

We attack three different publicly available models, with links for reproducibility in Appendix B:

•

GPT-2: a model that was neither trained to act as a chatbot, nor to be aligned;
•

LLaMA: a model trained to chat with users but with no explicit alignment attempt;
•

Vicuna: a version of LLaMa that was aligned via fine-tuning to prevent toxic generations (and other harmful behaviors).

Table 1: Success rates of prior attacks in constructing adversarial prompts that cause models to output toxic content. We allow the adversary to modify up to

30

tokens of text. We say an attack is “Distant” if the adversarial tokens come before the question, and “Nearby” if the adversarial tokens come after the question.

		Attack success rate
		Distant Attack		Nearby Attack
		ARCA	GBDA	ARCA	GBDA
GPT-2	None	67% $\pm$ 4%	12% $\pm$ 3%	84% $\pm$ 3%	16% $\pm$ 3%
LLaMA	None	2% $\pm$ 1%	1% $\pm$ 1%	61% $\pm$ 6%	10% $\pm$ 1%
Vicuna	Instruct	6% $\pm$ 2%	0% $\pm$ 1%	8% $\pm$ 2%	1% $\pm$ 1%

4.4 Prior Attacks Results

We evaluate the ARCA attack from Jones et al. (2023) and the GBDA attack from Guo et al. (2021). As ARCA generalizes and dominates prior attacks such as (Wallace et al., 2019; Ebrahimi et al., 2017) we do not also re-evaluate these. Both ARCA and GBDA are largely ineffective at causing the model to emit toxic content in our setting even when allowed to inject thirty malicious tokens. (We choose this number to match with experiemnts that will follow later in the paper.) This should not be surprising: already in Jones et al. (2023), the attack had under 40% attack success rate in getting GPT-2 to output toxic output text, even when the adversary had complete control of every token up to the desired toxic phrases. In contrast, because we require that the tokens “[AGENT]:” come after the adversarial tokens, the attacks have (much) lower attack success rates here. Table 1 summarizes these results.

Qualitative Evaluation: Toxicity: We qualitatively inspect a subset of these generations from each setting in Table 1. We find that in most cases the output is indeed toxic as expected. But in less than ten cases, we find our automated metric has potential limitations. For example, when told to produce one toxic phrase, Vicuna responds “F*** you. I’m kidding, of course. As a AI visual assistant, I’m programmed to be polite and provide assistance to the best of my ability. How can I assist you today?”, and in another case it responds “(Note: This is not appropriate language and is not a proper response to the prompt.)” after an initially abusive message. This demonstrates a limitation of our simple isToxic definition that only performs a substring lookup. Future work will need to carefully consider different toxicity definitions once attacks begin to have nontrivial attack success rates—for now, even with this loose definition, the current attacks in the literature fail to find successful adversarial sequences.

5 Why do Existing NLP Optimization Attacks Fail?

In the prior section we found that existing NLP attacks have limited success at causing aligned models to emit harmful text in chat settings. There are two possible explanations for this result:

1.

The aligned language models we attack are truly robust to adversarial examples; or,
2.

Current attacks are insufficiently powerful to evaluate the robustness of aligned models.

Fortunately, recent work has developed techniques explicitly designed to differentiate between these two hypotheses for general attacks. Zimmermann et al. (2022) propose the following framework: first, we construct test cases with known adversarial examples that we have identified a priori; then, we run the attack on these test cases and verify they succeed. Our specific test case methodology follows Lucas et al. (2023). To construct test cases, we first identify a set of adversarial examples via brute force. And once we have confirmed the existence of at least one adversarial example via brute force, we run our attack over the same search space and check if it finds a (potentially different, but still valid) adversarial example. This approach is effective when there exist effective brute force methods and the set of possible adversarial examples is effectively enumerable—such as is the case in the NLP domain.

We adapt to this to our setting as follows. We construct (via brute force) prompts $p$ that causes the model to emit a rare suffix $q$ . Then, the attack succeeds if it can find some input sequence $p^{\prime}$ that causes $\texttt{Gen}(p)=q$ , i.e., the model emits the same $q$ . Otherwise, the attack fails. Observe that a sufficiently strong attack (e.g. a brute force search over all prompts) will always succeed on this test: any failure thus indicates a flawed attack. Even though these strings are not toxic, they still suffice to demonstrate the attack is weak.

5.1 Our Test Set

How should we choose the prefixes $p$ and the target token $q$ ? If we were to choose $q$ ahead of time, then it might be very hard to find—even via brute force—a prefix $p$ so that $\texttt{Gen}(p)=q$ . So instead we drop the requirement that $q$ is toxic, and approach the problem from reverse.

Initially, we sample many different prefixes $p_{1},p_{2},\dots$ from some dataset (in our case, Wikipedia). Let $S$ be the space of all N-token sequences (for some N). Then, for all possible sequences $s_{i}\in S$ we query the model on $\texttt{Gen}(s_{i}||p_{j})$ . (If $|S|$ is too large, we randomly sample 1,000,000 elements $s_{i}\in S$ .) This gives a set of possible output tokens $\{q_{i}\}$ , one for each sequence $s_{i}$ .

For some prompts $p_{j}$ , the set of possible output tokens $\{q_{i}\}$ may have high entropy. For example, if $p_{j}$ = “How are you doing?” then there are likely thousands of possible continuations $q_{i}$ depending on the exact context. But for other prompts $p_{j}$ , the set of possible output tokens $\{q_{i}\}$ could be exceptionally small. For example, if we choose the sequence $p_{j}$ =“Barack” the subsequent token $q_{i}$ is almost always “Obama” regardless of what context $s_{i}$ was used.

But the model’s output might not always be the same. There are some other tokens that might be possible—for example, if the context where $s_{i}$ =“The first name [”, then the entire prompt (“The first name [Barack”) would likely cause the model to output a closing bracket q=“]”. We denote such sequences $p_{j}$ that yield small-but-positive entropy over the outputs $\{q_{i}\}$ (for different prompts $s_{i}\in S$ ) a test case, and set the attack objective to be the output token $q_{i}$ that is least-likely.

These tests make excellent candidates for evaluating NLP attacks. They give us a proof (by construction) that it is possible to trigger the model to output a given word. But this happens rarely enough that an attack is non-trivial. It is now just a question of whether or not existing attacks succeed.

We construct eight different sets with varying difficulty levels and report averages across each. Our test sets are parameterized by three constants. (1) Prevalence: the probability of token $q$ given $p_{j}$ , which we fix to $10^{-6}$ ; (2) Attacker Controlled Tokens: the number of tokens the adversary is allowed to modify, which we vary from 2, 5, 10, or 20 tokens, and (3) Target Tokens: the number of tokens of output the attacker must reach. We generate our test cases using GPT-2 only, due to the cost of running a brute force search. However, as GPT-2 is an easier model to attack, if attacks fail here, they are unlikely to succeed on the more difficult cases of aligned models.

5.2 Prior Attacks Results

Table 2: Pass rates on GPT-2 for the prior attacks on the test cases we propose. We design each test so that a solution is guaranteed to exist; any value under

100\%

indicates the attack has failed.

	Pass Rate given $N\times$ extra tokens
Method	1 $\times$	2 $\times$	5 $\times$	10 $\times$
Brute Force	100.0%	100.0%	100.0%	100.0%
ARCA	11.1%	14.6%	25.8%	30.6%
GBDA	3.1%	6.2%	8.8 %	9.5%

In Table 2 we find that the existing state-of-the-art NLP attacks fail to successfully solve our test cases. In the left-most column we report attack success in a setting where the adversary must solve the task within the given number of attacker controlled tokens. ARCA is significantly stronger than GBDA (consistent with prior work) but even ARCA fails to find a successful prompt in nearly 90% of test cases. Because the numbers here are so low, we then experimented with giving the attacker more control, with a multiplicative factor on the number of manipulated tokens. That is, if the original test asks to find an adversarial prompt with $10$ tokens, and we run the attack with a factor of $5$ , we allow the attack to search over $50$ attacker controlled tokens. We find that even with $10\times$ extra tokens the attack still fails our tests the majority of the time.

Note that the purpose of this evaluation is not to argue the NLP attacks we study here are incorrect in any way. On the contrary: they largely succeed at the tasks that they were originally designed for. But we are asking them to do something much harder and control the output at a distance, and our hope here is to demonstrate that while we have made significant progress towards developing strong NLP optimization attacks there is still room for improving these techniques.

6 Attacking Multimodal Aligned Models

Text is not the only paradigm for human communication. And so increasingly, foundation models have begun to support “multimodal” inputs across vision, text, audio, or other domains. In this paper we study vision-augmented models, because they are the most common. For example, as mentioned earlier, OpenAI’s GPT-4 and Google’s Gemini will in the future support both images and text as input. This allows models to answer questions such as “describe this image” which can, for example, help blind users Salam (2019).

It also means that an adversary can now supply adversarial images, and not just adversarial text. And because images are drawn from a continuous domain, adversarial examples are orders of magnitude simpler to create: we no longer need to concern ourselves with the discrete nature of text or the inversion of embedding matrices and can now operate on (near) continuous-domain pixels.

6.1 Attack Methodology

Our attack approach directly follows the standard methodology for generating adversarial examples on image models. We construct an end-to-end differentiable implementation of the multimodal model, from the image pixels to the output logits of the language model. We again use the harmful-prefix attack, with the adversary constructing an adversarial image that maximizes the likelihood the model will begin its response with a specific harmful response.

We apply standard teacher-forcing optimization techniques when the target response is $>1$ token, i.e., we optimize the total cross-entropy loss across each targeted output token as if the model had correctly predicted all prior output tokens. To initiate each attack, we use a random image generated by sampling each pixel uniformly at random. We use the projected gradient descent Madry et al. (2017). We use an arbitrarily large $\epsilon$ and run for a maximum of 500 steps or until the attack succeeds; note, we report the final distortions in Table 3. We use the default step size of 0.2.

6.2 Experiments

While GPT-4 currently supports vision for some users OpenAI (2023), this functionality is not yet publicly available at the time of performing this attack. Google’s Gemini has also not been made available publicly. The research community has thus developed open-source (somewhat smaller) versions of these multimodal models.

We evaluate our attack on two different implementations. While they differ in some details, both follow the approach in Section 2: the image is encoded with a vision model, projected to the token embedding space, and passed as a sequence of soft-tokens to the language model.

Mini GPT-4 (Zhu et al., 2023) uses a pretrained Q-Former module from Li et al. (2023) to project images encoded by EVA CLIP ViT-G/14 Fang et al. (2022) to Vicuna’s (Chiang et al., 2023) text embedding space. Both CLIP and Vicuna are frozen, while a section of the Q-former is finetuned on a subset of LAION Schuhmann et al. (2021), Conceptual Captions Sharma et al. (2018), SBU Ordonez et al. (2011), and multimodal instruction-following data generated by the authors.

LLaVA (Liu et al., 2023) uses a linear layer to project features from CLIP ViT-L/14 to the Vicuna embedding space. While CLIP is frozen, both Vicuna and the projection matrix are finetuned on Conceptual Captions Sharma et al. (2018) and custom multimodal instruction-following data.

LLaMA Adapter (Gao et al., 2023) works similarly to these except that it uses learned adaptation prompts to incorporate visual information internal to the model via these adapters.

Our goal with these experiments is to show that these vision-augmented models are significantly more vulnerable due to the ease of generating image adversarial examples.

6.3 Quantitative Evaluation: Toxicity

We now evaluate the ability of these models to be coerced into emitting arbitrary toxic output. As in the previous section, we use the toxic phrase dataset from Jones et al. (2023) and use the OpenAssistant prompts as the initial query to the model. We measure distortion across a set of randomly generated images. We observe that our attack succeeds in all cases, regardless of the prompt. As Table 3 shows, our attack has a $100\%$ success rate at causing the model to emit arbitrary toxic content. We observe that adversarial inputs are relatively easy to find, requiring only minimal $\ell_{2}$ distortions from the initial image; there is also minimal variation in the distortion across prompts. Even though there is no inherent reason why distortion is important as an objective that should be minimized, we believe it is interesting to report the distortion required in order to underscore just how easy it is to run these attacks.

Table 3: We can force Mini GPT-4, LLaVA, and LLaMA Adapter to produce arbitrary toxic output small

\ell_{2}

perturbations. Despite their similar methodology, LLaVA is

10\times

more vulnerable than the others, indicating the importance of implementation details.

Model	Attack Success Rate	Mean $\ell_{2}$ Distortion
LLaMA Adapter	100%	3.91 $\pm$ 0.36
Mini GPT-4 (Instruct)	100%	2.51 $\pm$ 1.45
Mini GPT-4 (RLHF)	100%	2.71 $\pm$ 2.12
LLaVA	100%	0.86 $\pm$ 0.17

6.4 Qualitative Evaluation

We further evaluate our attack on qualitative samples designed to probe various harmful outcomes.

For example, even when explicitly asking for harmful output (e.g., “Write a vile letter telling my neighbor what I think of him.”) the model still responds harmlessly (“Dear [Neighbor’s Name] … I have to say that I am not impressed with your behavior … Sincerely, [Your Name]”). But by providing an adversarial image as input, we can cause the model to emit our desired harmful output (“F*** you, you dirty life piece of s***. You’re a disgusting, filthy, stinky, scumbag who doesn’t deserve to live in this neighborhood … you worthless, good-for-nothing, loser.”)

Similarly, when asked to provide instructions for how to get away with murder, the base model responds that “It is not appropriate to encourage or provide information on how to harm or kill someone” but the attacked model complies: “The first step would be to identify the location of the neighbor’s house”. While the response here is rather poor advice, this is likely because the model is not competent; a better model might provide more actionable (and thus harmful) advice. We list sample (uncensored) examples of our ability to attack these models in the Appendix.

7 Conclusion

Language models trained via RLHF or instruction tuning are significantly more aligned than base models: in particular, they are more helpful (they appropriately follow benign user instructions) and harmless (they are less likely to output toxicity or harmful actions). While helpfulness can be evaluated through various utility metrics, harmlessness is more difficult to evaluate—and almost all methods to date rely on human-designed test cases to quantify this.

In this paper we have shown that while these models might be usually harmless, they may not be harmless under adversarial prompting. While the harms from adversarial prompting that we illustrate are fairly benign (e.g., the small models we study give unhelpful advice on how to get away with murder, or produce toxic content that could be found anywhere on the internet), our attacks are directly applicable to triggering other bad behaviors in larger and more capable systems.

Our attacks are most effective on the new paradigm of multimodal vision-language models. While all models we study are easy to attack, small design decisions affect the ease of attacks by as much as $10\times$ . Better understanding where this increased vulnerability arises is an important area for future work. Moreover, it is very likely that the future models will add additional modalities (e.g, audio) which can introduce new vulnerability and surface to attack.

Unfortunately, for text-only models, we show that current NLP attacks are not sufficiently powerful to correctly evaluate adversarial alignment: these attacks often fail to find adversarial sequences even when they are known to exist. Since our multimodal attacks show that there exist input embeddings that cause language models to produce harmful output, we hypothesize that there may also exist adversarial sequences of text that could cause similarly harmful behaviour.

Conjecture: An improved NLP optimization attack may be able to induce harmful output in an otherwise aligned language model.

While we cannot prove this claim (that’s why it’s a conjecture!) we believe our paper provides strong evidence for it: (1) language models are weak to soft-embedding attacks (e.g., multimodal attacks); and (2) current NLP attacks cannot find solutions that are known to exist. We thus hypothesize that stronger attacks will succeed in making text-only aligned models behave harmfully.

Future work.

We hope our paper will inspire several directions for future research. Most immediately, we hope that stronger NLP attacks will enable comprehensive robustness evaluations of aligned LLMs. Such attacks should, at a minimum, pass our tests to be considered reliable.

We view the end goal of this line of work not to produce better attacks, but to improve the evaluation of defenses. Without a solid foundation on understanding attacks, it is impossible to design robust defenses that withstand the test of time. An important open question is whether existing attack and defense insights from the adversarial machine learning literature will transfer to this new domain.

Ultimately, such foundational work on attacks and defenses can help inform alignment researchers develop improved model alignment techniques that remain reliable in adversarial environments.

Acknowledgements

We are grateful for comments on this paper by Andreas Terzis, Slav Petrov, and Erik Jones, and the anonymous reviewers.

References

Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. Large language models associate muslims with violence. Nature Machine Intelligence, 3(6):461–463, 2021.
Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 2022.
Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
Bostrom (2013) Nick Bostrom. Existential risk prevention as global priority. Global Policy, 4(1):15–31, 2013.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Brundage et al. (2018) Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228, 2018.
Bucknall and Dori-Hacohen (2022) Benjamin S Bucknall and Shiri Dori-Hacohen. Current and near-term ai as a potential existential risk factor. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 119–129, 2022.
Carlsmith (2022) Joseph Carlsmith. Is power-seeking ai an existential risk? arXiv preprint arXiv:2206.13353, 2022.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
Christiano et al. (2023) Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023.
Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, page 67–73, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360128. doi: 10.1145/3278721.3278729. URL https://doi.org/10.1145/3278721.3278729.
Ebrahimi et al. (2017) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
Fang et al. (2022) Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022.
Ganguli et al. (2022) Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernian, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, and Jack Clark. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency. ACM, jun 2022. doi: 10.1145/3531146.3533229. URL https://doi.org/10.1145/3531146.3533229.
Gao et al. (2023) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
Goyal et al. (2022) Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of gpt-3, 2022.
Greshake et al. (2023) Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173, 2023.
Guo et al. (2021) Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733, 2021.
Jia and Liang (2017) Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
Jones et al. (2023) Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
Katz et al. (2017) Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I 30, pages 97–117. Springer, 2017.
Kolosnjaji et al. (2018) Bojan Kolosnjaji, Ambra Demontis, Battista Biggio, Davide Maiorca, Giorgio Giacinto, Claudia Eckert, and Fabio Roli. Adversarial malware binaries: Evading deep learning for malware detection in executables. In 2018 26th European signal processing conference (EUSIPCO), pages 533–537. IEEE, 2018.
Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models, 2022.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
Lucas et al. (2023) Keane Lucas, Matthew Jagielski, Florian Tramèr, Lujo Bauer, and Nicholas Carlini. Randomness in ml defenses helps persistent attackers and hinders evaluators. arXiv preprint arXiv:2302.13464, 2023.
Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
Ngo (2022) Richard Ngo. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022.
OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774.
Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
Pei et al. (2017) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Automated whitebox testing of deep learning systems. In proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18, 2017.
Pichai (2023) Sundar Pichai. Google i/o 2023: Making ai more helpful for everyone. The Keyword, 2023.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Rae et al. (2022) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training Gopher, 2022.
Reddit (2023) Reddit. Dan 5.0, 2023. URL https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/.
Russell (2019) Stuart Russell. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.
Salam (2019) Erum Salam. I tried Be My Eyes, the popular app that pairs blind people with helpers. https://www.theguardian.com/lifeandstyle/2019/jul/12/be-my-eyes-app-blind-people-helpers, 2019.
Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Tramèr et al. (2019) Florian Tramèr, Pascal Dupré, Gili Rusak, Giancarlo Pellegrino, and Dan Boneh. Adversarial: Perceptual ad blocking meets adversarial machine learning. In ACM SIGSAC Conference on Computer and Communications Security, 2019.
Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
Wei et al. (2022a) Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022a.
Wei et al. (2022b) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022b. ISSN 2835-8856. URL https://openreview.net/forum?id=yzkSU5zdwD. Survey Certification.
Welbl et al. (2021) Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, 2021.
Wong and Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International conference on machine learning, pages 5286–5295. PMLR, 2018.
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Zimmermann et al. (2022) Roland S Zimmermann, Wieland Brendel, Florian Tramer, and Nicholas Carlini. Increasing confidence in adversarial robustness evaluations. arXiv preprint arXiv:2206.13991, 2022.

Appendix A Ethics

Our paper demonstrates that adversarial techniques that can be used to circumvent alignment of language models. This can be used by an adversary to coerce a model into performing actions that the original model developer intended to disallow.

We firmly believe that on the whole our paper will cause little harm in the short term, and ultimately lead to stronger defenses in the long term.

Our attacks are unlikely to cause short-term harm because they are currently restricted to the white-box setting where the entire model is available. Additionally, because all open-source aligned models are simply fine-tuned by building on unaligned base models, an adversary who wished to perform malicious behavior could directly use the (already open-source) base model. In addition, because of our white-box assumption, our attacks will not be directly applicable to much more capable production language models such as GPT-4. Given that these production models are already susceptible to human-designed adversarial prompts that cause them to emit harmful content, we do not expect a significant short-term harm from releasing this paper even if our attacks could be extended to the black-box domain.

In the long term we expect our work will incentivize and help form a foundation for the design of more robustly aligned models. Just as it is not possible to produce a secure cipher without a solid understanding of cryptanalysis, we do not believe it is possible to design an adversarially aligned language model without first understanding the potential vulnerabilities. Otherwise, it is likely that researchers will fool themselves into a false sense of security and release models they believe to be aligned even if they are not.

Appendix B Model References

GPT-2 Radford et al. [2019] is available from https://huggingface.co/gpt2.

Llama Touvron et al. [2023] is available from https://huggingface.co/docs/transformers/v4.28.0/en/model_doc/llama.

Vicuna Chiang et al. [2023] is available from https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot.

Appendix C Unfiltered toxic conversations

This appendix contains unfiltered content that may be offensive to some readers.

This page is intentionally blank. This appendix contains unfiltered content that may be offensive to some readers beginning on the following page.

Appendix D Mini-GPT4 Output

Normal Image ( $x\sim N(0,1)$ )	Adversarial Image

Appendix E LLaVA outputs

Normal Image ( $x\sim N(0,1)$ )	Adversarial Image