No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith
Carnegie Mellon University
{qipang, shengyuanhu, wenting, smithv}@cmu.edu

Abstract

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack—leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.

1 Introduction

Modern generative modeling systems have notably enhanced the quality of AI-produced content [BMR⁺20, SCS⁺22, Ope23a, Ope22]. For example, large language models (LLMs) like those powering ChatGPT [Ope22] can generate text closely resembling human-crafted sentences. While this has led to exciting new applications of machine learning, there is also growing concern around the potential for misuse of these models, leading to a flurry of recent efforts on developing techniques to detect AI-generated content. A promising approach in this direction is to embed invisible watermarks into model-derived content, which can then be extracted and verified using a secret watermark key [KGW⁺23a, FGJ⁺23, CGZ23, KTHL23, ZALW24, KGW⁺23b, HCW⁺23, WHZH23, WYC⁺23].

In this work, we identify that many of the key properties that make existing LLM watermarks successful can also render them susceptible to attack. In particular, we study a number of simple attacks that take advantage of common design choices of existing watermarking schemes, including:

1. Robustness of the watermarks to potential modifications in the output text, so that the watermarks cannot be easily removed [KTHL23, KGW⁺23a, ZALW24, CGZ23, HCW⁺23];
2. The use of multiple keys to defend against watermark stealing attacks [KGW⁺23a, FGJ⁺23, KTHL23, JSV24, SKB⁺23, GLLH23]; and
3. Public detection APIs, which allow the general public to easily verify whether or not candidate text is AI-generated [FGJ⁺23, KGW⁺23a, SBC⁺19, MLK⁺23].

While these common features and design choices of existing watermarking schemes have clear benefits, we show that they also make the resulting systems vulnerable to a number of simple but effective attacks. In particular, we study two types of attacks: 1) watermark-removal attacks, which remove the watermark from the watermarked content, and 2) spoofing attacks, which create (potentially toxic) content with a target watermark embedded, thus making the content appear to be generated by a specific LLM when it is in fact not (see motivating examples in Table 1).

Prompt: Alan Turing was born in ...

Setting	Output	Z-Score	PPL
Unwatermarked	Alan Turing was born in 1912 and died in 1954. He was an English mathematician, logician, cryptanalyst, and computer scientist. In 1938, Turing joined the Government Code and Cypher School (GC&CS), where he contributed to the design of the bombe, a machine that was used to decipher the Enigma-enciphered messages...	$0.16$ $\downarrow$	$3.19$
Watermarked	Alan Turing was born in 1912 and died in 1954, at the age of 41. He was the brilliant British scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and chemistry. He was also one of the pioneers of computer science...	$5.98$ $\uparrow$	$4.46$
(a) Piggyback spoofing attack (exploiting robustness)	Alan Turing was born in 1950 and died in 1994, at the age of 43. He was the brilliant American scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and musicology. He was also one of the pioneers of computer science...	$5.98$ $\uparrow$	
(b) Watermark-removal attack (exploiting multiple keys)	Alan Turing was born in 1912 and died in 1954. He was a mathematician, logician, cryptologist and theoretical computer scientist. He is famous for his work on code-breaking and artificial intelligence, and his contribution to the Allied victory in World War II. Turing was born in London. He showed an interest in mathematics...	$2.40$ $\downarrow$	$4.05$
(c) Watermark-removal attack (exploiting public detection API)	Alan Turing was born in 1912 and died in 1954. He was an English mathematician, computer scientist, cryptanalyst and philosopher. Turing was a leading mathematician and cryptanalyst. He was one of the key players in cracking the German Enigma Code during World War II. He also came up with the Turing Machine...	$1.47$ $\downarrow$	$4.57$

Table 1: Examples generated using LLAMA-2-7B with/without the KGW watermark [KGW⁺23a] under various attacks. We mark tokens in the green and red lists (see Appendix A). Z-score reflects the detection confidence of the watermark, and perplexity (PPL) measures text quality. (a) In the piggyback spoofing attack, we exploit watermark robustness by generating incorrect content that appears as watermarked (matching the z-score of the watermarked baseline), potentially damaging the reputation of the LLM. Incorrect tokens modified by the attacker are marked in orange and watermarked tokens in blue. (b-c) In watermark-removal attacks, attackers can effectively lower the z-score below the detection threshold while preserving a high sentence quality (low PPL) by exploiting either the (b) use of multiple keys or (c) publicly available watermark detection API.

Our work rigorously explores a number of simple removal and spoofing attacks for LLM watermarks. In doing so, we identify critical trade-offs that emerge between watermark robustness, utility, and usability as a result of watermarking design choices. To navigate these trade-offs, we propose potential defenses as well as a set of general guidelines to better enhance the security of next-generation LLM watermarking systems. Overall, we make the following contributions:

• We study how watermark robustness, despite being a desirable property to mitigate removal attacks, can make the resulting systems highly susceptible to piggyback spoofing attacks, a simple type of attack that makes watermarked text toxic or inaccurate through small modifications, and show that challenges exist in detecting these attacks given that a single token can render an entire sentence inaccurate (Sec. 4).
• We show that using multiple watermarking keys can make the system susceptible to watermark removal attacks (Sec. 5). Although a larger number of keys can help defend against watermark stealing attacks, which can be used to launch either spoofing or removal attacks, we show both theoretically and empirically that this in turn increases the potential for watermark removal attacks.
• Finally, we identify that public watermark detection APIs can be exploited by attackers to launch both watermark-removal and spoofing attacks (Sec. 6). We propose a defense using techniques from differential privacy to effectively counteract spoofing attacks, showing that it is possible to avoid the possibilities of noise reduction by applying pseudorandom noise based on the input.

Throughout, we explore our attacks on three state-of-the-art watermarks [KGW⁺23a, ZALW24, KTHL23] and two LLMs (LLAMA-2-7B [TMS⁺23] and OPT-1.3B [ZRG⁺22])—demonstrating that these vulnerabilities are common to existing LLM watermarks, and providing caution for the field in deploying current solutions in practice without carefully considering the impact and trade-offs of watermarking design choices.

2 Related Work

Advances in large language models (LLMs) have given rise to increasing concerns that such models may be misused for purposes such as spreading misinformation, phishing, and academic cheating. In response, numerous recent works have proposed watermarking schemes as a tool for detecting LLM-generated text to mitigate potential misuse [KGW⁺23a, FGJ⁺23, CGZ23, KTHL23, ZALW24, KGW⁺23b, HCW⁺23, WHZH23, WYC⁺23]. These approaches involve embedding invisible watermarks into the model-generated content, which can then be extracted and verified using a secret watermark key. Existing watermarking schemes share a few natural goals: (1) the watermark should be robust in that it cannot be easily removed; (2) the watermark should not be easily stolen, thus enabling spoofing or removal attacks; and (3) the presence of a watermark should be easy to detect when given new candidate text. Unfortunately, we show that existing methods that aim to achieve these goals can in turn enable simple watermark removal or spoofing attacks.

Removal attacks. Several recent works have highlighted that paraphrasing methods may be used to evade the detection of AI-generated text [KSK⁺23, IWGZ18, LJSL18, LCW21, ZEF⁺23], with [KSK⁺23, ZEF⁺23] demonstrating effective watermark removal using a local LLM. These methods usually require additional training for sentence paraphrasing which can impact sentence quality, or assume a high-quality oracle model to guarantee the output quality is preserved. In contrast, the simple and scalable removal attacks herein do not require additional training or a high-quality oracle. Additionally, our work differs in that we aim to directly connect and study how the inherent properties and design choices of watermarking schemes (such as the use of multiple keys and detection APIs) can inform such removal attacks.

Spoofing attacks. Prior works on spoofing use watermark stealing attacks to first estimate the watermark pattern and then embed it into arbitrary content to launch spoofing attacks. These attacks usually require the attacker to pay a large startup cost by obtaining a significant number of watermarked tokens. For example, [SKB⁺23] requires 1 million queries to the watermarked LLM, and [JSV24, GLLH23] assume the attacker can obtain millions of watermarked tokens to estimate their distribution. Unlike these works, we explore spoofing attacks that are less flexible but can be launched with significantly less upfront cost. In Sec. 4, we explore a very simple and scalable form of spoofing exploiting the inherent robustness property of watermarks, which we refer to as a ‘piggyback spoofing attack’. In Sec. 6, we then explore more general spoofing attacks, which, instead of querying the watermarked LLM numerous times, exploit the public detection API. In both cases, our attacks do not require the attacker to estimate the watermark pattern, but share the ultimate goal of prior spoofing attacks: creating falsified, inaccurate, or toxic content that appears to be watermarked.

3 Preliminaries

Before exploring attacks and defenses on watermarking systems, we introduce relevant background on LLMs, notation we use throughout the work, and a set of concrete threat models.

Notation. We use $\textbf{x}$ to denote a sequence of tokens, $\textbf{x}_{i}\in\mathcal{V}$ is the $i$-th token in the sequence, and $\mathcal{V}$ is the vocabulary. $M_{\text{orig}}$ denotes the original model without a watermark, $M_{\text{wm}}$ is the watermarked model, and $sk\in\mathcal{S}$ is the watermark secret key sampled from the key space $\mathcal{S}$.

Language Models. Current state-of-the-art (SOTA) LLMs are auto-regressive models, which predict the next token based on the prior tokens. We define language models more formally below:

Definition 1 (LM).

We define a language model (LM) without a watermark as:

$$M_{\text{orig}}:\mathcal{V}^{*}\rightarrow\mathcal{V}, \tag{1}$$

where the input is a token sequence $\textbf{x}$ of length $t$. $M_{\text{orig}}(\textbf{x})$ first returns the probability distribution for the next token $\textbf{x}_{t+1}$ and then the LM samples $\textbf{x}_{t+1}$ from this distribution.

Watermarks for LLMs. In this work, we focus on three SOTA decoding-based watermarking schemes: KGW [KGW⁺23a], Unigram [ZALW24] and Exp [KTHL23]. Informally, decoding-based watermarks are embedded by perturbing the output distribution of the original LLM. The perturbation is determined by secret watermark keys held by the LLM owner. Formally, we define the watermarking scheme:

Definition 2 (Watermarked LLMs).

The watermarked LLM takes token sequence $\textbf{x}\in\mathcal{V}^{*}$ and secret key $sk\in\mathcal{S}$ as input, and outputs a perturbed probability distribution for the next token. The perturbation is determined by $sk$:

$$M_{\text{wm}}:\mathcal{V}^{*}\times\mathcal{S}\rightarrow\mathcal{V} \tag{2}$$

The watermark detection outputs the statistical testing score for the null hypothesis that the input token sequence is independent of the watermark secret key:

$$f_{\text{detection}}:\mathcal{V}^{*}\times\mathcal{S}\rightarrow\mathbb{R} \tag{3}$$

The output score reflects the confidence of the watermark’s existence in the input. Please refer to Appendix A for additional details of the specific watermarks explored in this work [KGW⁺23a, ZALW24, KTHL23].
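To make the detection score in Eq. (3) concrete, the following is a minimal sketch of a green-list z-test in the style of KGW. The statistic is the usual one-proportion z-test, $z=(g-\gamma n)/\sqrt{n\gamma(1-\gamma)}$, where $g$ is the number of green tokens among $n$ scored tokens and $\gamma$ is the green-list fraction. The per-token hashing in `is_green`, the function names, and the use of integer token IDs are illustrative assumptions, not the implementation evaluated in this work (see Appendix A for the actual schemes).

```python
import hashlib
import math


def is_green(prev_token: int, token: int, key: bytes, gamma: float = 0.5) -> bool:
    """Hypothetical green-list test: hash (key, previous token, token) into [0, 1).

    Assumption: each token is green with probability gamma, decided by a keyed
    hash. Real schemes derive a vocabulary partition from a keyed seed instead.
    """
    payload = key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(payload).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def z_score(tokens: list[int], key: bytes, gamma: float = 0.5) -> float:
    """One-proportion z-test on the number of green tokens."""
    n = len(tokens) - 1  # the first token has no predecessor and is not scored
    if n <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def toy_watermarked_sequence(length: int, key: bytes, gamma: float = 0.5) -> list[int]:
    """Toy 'watermarked' text: always emit the first green token ID found."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(1000) if is_green(prev, t, key, gamma)))
    return tokens
```

A sequence built by `toy_watermarked_sequence` consists only of green tokens, so its z-score grows with $\sqrt{n}$, whereas text produced without the key is expected to score near zero.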

3.1 Threat Model

Attacker’s Objective & Motivation. We study two types of attacks—watermark-removal attacks and (piggyback or general) spoofing attacks. In the watermark-removal attack, the attacker aims to generate a high-quality response from the LLM without an embedded watermark. For the spoofing attacks, the goal is to generate a harmful or incorrect output that has the victim organization’s watermark embedded.

We present two practical scenarios to motivate watermark-removal attacks: (i) A student or a journalist uses high-quality watermarked LLMs to write articles, but wants to remove the watermark to claim originality. (ii) A malicious company offering LLM services for clients, instead of developing their own LLMs, simply queries a watermarked LLM from a victim company and removes the watermark, potentially infringing upon IP rights of the victim company.

In piggyback and spoofing attacks, an attacker can damage the reputation of a victim company offering an LLM service. For example: (i) The attacker can use a spoofing attack to generate fake news or incorrect facts and post them on social media. By claiming the material is generated by the LLM from the benign company, the attacker can damage the reputation of the company and their model. (ii) The attacker can use the spoofing attack to inject malicious code into some public software. The code has the benign company’s watermark embedded, and the benign company may thus be at fault and have to bear responsibility for the actions.

Attacker’s Capabilities. We study attacks by exploiting three common design choices in watermarks: 1) robustness, 2) the use of multiple keys, and 3) public detection APIs. Each attack requires the adversary to have different capabilities, but we make assumptions that are practical and easy to achieve in real-world deployment scenarios.

1) For piggyback spoofing attacks exploiting robustness (Sec. 4), we assume that the attacker can make $\mathcal{O}(1)$ queries to the target watermarked LLM. We also assume that the attacker can edit the generated sentence (e.g., insert or substitute tokens).

2) For watermark-removal attacks exploiting the use of multiple keys (Sec. 5), we consider the scenario where multiple watermark keys are utilized to embed the watermark, which is a common practice in designing robust cryptographic protocols and is suggested by SOTA watermarks [KTHL23, KGW⁺23a] to improve resistance against watermark-stealing attacks [JSV24, GLLH23, SKB⁺23]. For a sentence of length $l$ , we assume that the attacker can make $\mathcal{O}(l)$ queries to the watermarked LLM.

3) For the attacks on detection APIs (Sec. 6), we assume that the detection API is available to normal users and the attacker can make $\mathcal{O}(l)$ queries for a sentence of length $l$ . The detection returns the watermark confidence score (p-value or z-score). For spoofing attacks exploiting the detection APIs, we assume that the attacker can auto-regressively synthesize (toxic) sentences. For example, they can run a local (small) model to synthesize such sentences. For watermark-removal attacks exploiting the detection APIs, we also assume that the attacker can make $\mathcal{O}(l)$ queries to the watermarked LLM. As is common practice [NKIH23, OWJ⁺22] and also enabled by OpenAI’s API, we assume that the top 5 tokens at each position and their probabilities are returned to the attackers.

4 Attacking Robust Watermarks

The goal of developing a watermark that is robust to output perturbations is to defend against watermark removal, which may be used to circumvent detection schemes for applications such as phishing or fake news generation. Robust watermark designs have been the topic of many recent works [ZALW24, KGW⁺23a, KTHL23, SKB⁺23, KGW⁺23b, PSF⁺23]. We formally define watermark robustness in the following definition.

Definition 3 (Watermark robustness).

A watermark is $(\epsilon,\delta)$-robust, given a watermarked text $\textbf{x}$, if for all its neighboring texts within the $\epsilon$ editing distance, the probability that the detection fails to detect the edited text is bounded by $\delta$, given the detection confidence threshold $T$:

$$\forall\,\textbf{x},\textbf{x}^{\prime}\in\mathcal{V}^{*},\quad\Pr[f_{\text{detection}}(\textbf{x}^{\prime},sk)<T]<\delta,\quad\text{s.t.}\ f_{\text{detection}}(\textbf{x},sk)\geq T,\ \text{d}(\textbf{x},\textbf{x}^{\prime})\leq\epsilon.$$

More robust watermarks can better defend against editing attacks, but this seemingly desirable property can also be easily misused by malicious users to launch simple piggyback spoofing attacks—e.g., a small portion of toxic or incorrect content can be inserted into the watermarked material, making it seem like it was generated by a specific watermarked LLM. The toxic content will still be detected as watermarked, potentially damaging the reputation of the LLM service provider. As discussed in Sec. 2, spoofing attacks explored in prior work usually require the attacker to obtain millions of watermarked tokens upfront to estimate the watermark pattern [JSV24, SKB⁺23, GLLH23]. In contrast, our simple piggyback spoofing only requires a single query to the watermarked LLM with careful text modifications, and the effectiveness relates directly to the robustness of the LLM watermark.

Attack Procedure. (i) The attacker queries the target watermarked LLM to receive a high-entropy watermarked sentence $\textbf{x}_{\text{wm}}$; (ii) the attacker edits $\textbf{x}_{\text{wm}}$ to form a new piece of text $\textbf{x}^{\prime}$ and claims that $\textbf{x}^{\prime}$ is generated by the target LLM. The editing method can be defined by the attacker. Simple strategies could include inserting toxic tokens into the watermarked sentence $\textbf{x}_{\text{wm}}$ at random positions, or editing specific tokens to make the output inaccurate (see example in Table 1). As we show, editing can also be done at scale by querying another LLM like GPT4 to generate fluent output.

We present the formal analysis of the attack feasibility in Appendix B and highlight a takeaway that applies to all robust watermarks: a more robust watermark makes piggyback spoofing attacks easier by allowing more toxic tokens to be inserted. This is a fundamental design trade-off: if a watermark is robust, such spoofing attacks are inevitable and may be extremely difficult to detect, as even one toxic token can render the entire content harmful or inaccurate.
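As a rough illustration of this trade-off (and not the analysis from Appendix B), consider a simplified model: a z-test of the form $z=(g-\gamma n)/\sqrt{n\gamma(1-\gamma)}$, a green list that does not depend on context, and inserted tokens that are green with probability $\gamma$. Inserting $k$ tokens then leaves the numerator unchanged in expectation and only enlarges the denominator, so the expected score becomes $z_0\sqrt{n/(n+k)}$. The function names below are hypothetical.

```python
import math


def expected_z_after_insertion(z0: float, n: int, k: int) -> float:
    """Expected z-score after inserting k key-independent tokens into n tokens."""
    return z0 * math.sqrt(n / (n + k))


def max_insertions(z0: float, n: int, threshold: float = 4.0) -> int:
    """Largest k with expected z-score >= threshold under the simplified model.

    Solves z0 * sqrt(n / (n + k)) >= threshold for k.
    """
    if z0 < threshold:
        return 0
    return math.floor(n * ((z0 / threshold) ** 2 - 1))
```

Under this model the insertion budget grows quadratically in the ratio $z_0/T$, which matches the qualitative message that a stronger, more robust signal tolerates more attacker-controlled tokens. For schemes whose green list depends on preceding tokens, an insertion also changes the context of the following token, so this budget is optimistic.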

4.1 Evaluation

Experiment Setup. We assess the effectiveness of our piggyback spoofing attack by using the two editing strategies discussed above. Through toxic token insertion, we study the limits of how many tokens can be inserted into the watermarked content. Using fluent inaccurate editing, we show that piggyback spoofing can generate fluent, watermarked, but inaccurate results at scale. Specifically, for the toxic token insertion, we generate a list of $200$ toxic tokens and insert them at random positions in the watermarked output. For the fluent inaccurate editing, we edit the watermarked sentence by querying GPT4 using the prompt “Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.” Unless otherwise specified, in the evaluations of this work, we use $500$ prompts from the OpenGen dataset [KSK⁺23], and query the watermarked language models (LLAMA-2-7B [TMS⁺23] and OPT-1.3B [ZRG⁺22]) to generate the watermarked outputs. We evaluate three SOTA watermarks, KGW [KGW⁺23a], Unigram [ZALW24], and Exp [KTHL23], using the default watermarking hyperparameters. In our experiments, we default to a maximum of 200 new tokens for KGW and Unigram, and 70 for Exp, due to the complexity of its watermark detection. 70 is also the maximum number of tokens the authors of Exp evaluated in their paper [KTHL23].

[Figure 1, panel (a): Toxic token insertion.]

Evaluation Result. We report the maximum portion of the inserted toxic tokens relative to the original watermarked sentence length on the LLAMA-2-7B model in Fig. 1(a). We also present the confidence of the OpenAI moderation model [Ope23b] in identifying the content as violating their usage policy [Ope23c] due to the inserted toxic tokens in Fig. 1(a). Our findings show that we can insert a significant number of toxic tokens into content generated by all the robust watermarking schemes, with a median portion higher than $20\%$, i.e., for a $200$-token sentence, the attacker can insert a median of $40$ toxic tokens into it. These toxic sentences are then identified as violating OpenAI policy rules with high confidence scores, whose median is higher than 0.8 for all the watermarking schemes we study. The average confidence scores for content before attack are around 0.01. The empirical data on the maximum portion of inserted toxic tokens aligns with our analysis in Appendix B. We further validate this analysis in Fig. 5 of Appendix C, showing that attackers can insert nontrivial portions of toxic tokens into the watermarked text to launch piggyback spoofing attacks. Notably, the more robust the watermark is, the more tokens can effectively be inserted. We present the results on OPT-1.3B in Appendix E.

In Fig. 1(b), we report the PPL and watermark detection scores of the piggyback results on KGW and LLAMA-2-7B by the fluent inaccurate editing strategy. We show that we can successfully generate fluent results, with a slightly higher PPL. $94.17\%$ of the piggyback results have a z-score higher than the default threshold $4$ . We randomly sample $100$ piggyback results and manually check that most of them ( $92\%$ ) are fluent and have inaccurate or opposite content from the original watermarked content. See concrete examples in Appendix D. The results show that we can generate watermarked, fluent, but inaccurate content at scale with an ASR higher than 90%.

4.2 Discussion

Our results highlight that piggyback spoofing attacks are easy to execute in practice. LLM watermarks typically do not consider such attacks during design and deployment, and existing robust watermarks are inherently vulnerable to such attacks. We highlight the contradiction between the watermark robustness and the piggyback spoofing feasibility. We consider this attack to be challenging to defend against, especially considering examples such as those in Table 1 and Appendix D, where by only editing a single token, the entire content becomes incorrect. It is hard, if not impossible, to detect whether a particular token is from the attacker by using robust watermark detection algorithms. Thus, practitioners should weigh the risks of removal vs. piggyback spoofing attacks for the model at hand. A feasible strategy to mitigate spoofing attacks is to require digital signatures on the LLM-generated content. However, while an attacker without access to the private key cannot spoof, it is worth noting that this strategy is still vulnerable to watermark-removal attacks, as a single edit can invalidate the original signature.

5 Attacking Stealing-Resistant Watermarks

As discussed in Sec. 2, many works have explored the possibility of launching watermark stealing attacks to infer the secret pattern of the watermark, which can then enable spoofing and removal attacks [SKB⁺23, JSV24, GLLH23]. A natural and effective defense against watermark stealing is using multiple watermark keys during embedding, which is a common practice in cryptography and also suggested by prior watermarks and work in watermark stealing [KGW⁺23a, KTHL23, JSV24]. Unfortunately, we demonstrate that using multiple keys can in turn introduce new watermark-removal attacks.

In particular, SOTA watermarking schemes [KGW⁺23a, FGJ⁺23, CGZ23, KTHL23, ZALW24, KGW⁺23b] aim to ensure the watermarked text retains its high quality and the private watermark patterns are not easily distinguished by maintaining an “unbiasedness” property:

$$\mathbb{E}_{sk\in\mathcal{S}}(M_{\text{wm}}(\textbf{x},sk))\approx_{\epsilon}M_{\text{orig}}(\textbf{x}), \tag{4}$$

i.e., the expected distribution of watermarked output over the watermark key space $sk\in\mathcal{S}$ is close to the output distribution without a watermark, differing by a distance of $\epsilon$ . Exp [KTHL23] is rigorously unbiased, and KGW [KGW⁺23a] and Unigram [ZALW24] slightly shift the watermarked distributions.
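The following toy computation illustrates Eq. (4) for a KGW-style logit boost. It assumes a four-token vocabulary, a green-list fraction of $0.5$, and a boost $\delta=2$ added to the logits of green tokens; every possible green list stands in for one key. All names and numbers here are illustrative assumptions rather than settings from our experiments.

```python
import itertools
import math


def watermarked_dist(probs: list[float], green: set[int], delta: float = 2.0) -> list[float]:
    """Add delta to the logits of green tokens, then renormalise."""
    weights = [p * math.exp(delta) if i in green else p for i, p in enumerate(probs)]
    total = sum(weights)
    return [w / total for w in weights]


def tv_distance(p: list[float], q: list[float]) -> float:
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def key_averaged_dist(probs: list[float], gamma: float = 0.5, delta: float = 2.0) -> list[float]:
    """Average the watermarked distribution over every green list of size gamma * |V|."""
    size = int(gamma * len(probs))
    dists = [
        watermarked_dist(probs, set(green), delta)
        for green in itertools.combinations(range(len(probs)), size)
    ]
    return [sum(column) / len(dists) for column in zip(*dists)]


orig = [0.4, 0.3, 0.2, 0.1]
single_key = watermarked_dist(orig, {0, 1})
averaged = key_averaged_dist(orig)
print("TV(single key, original):", tv_distance(single_key, orig))
print("TV(key average, original):", tv_distance(averaged, orig))
```

In this toy setting the key-averaged distribution is far closer to the original than any single-key distribution, but the distance is not exactly zero, which is consistent with a logit-boost watermark only approximately satisfying the unbiasedness property.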

The insight of our proposed watermark-removal attack is that given the “unbiasedness” nature of watermarks and considering multiple keys may be used during watermark embedding, malicious users can estimate the output distribution without any watermark by querying the watermarked LLM multiple times using the same prompt. As this attack estimates the original, unwatermarked distribution, the quality of the generated content is preserved.

Attack Procedure. An attacker queries a watermarked model with an input x multiple times, observing $n$ subsequent tokens $\textbf{x}_{t+1}$ . This is easy for text completion model APIs, and chat model APIs can also be easily attacked by constructing a prompt to ask the chat model to complete a partial sentence without any prefix. The attacker then creates a frequency histogram of these tokens and samples according to the frequency. This sampled token matches the result of sampling on an unwatermarked output distribution with a nontrivial probability. Consequently, the attacker can progressively eliminate watermarks while maintaining a high quality of the synthesized content. We present a formal analysis of the number of required queries in Appendix F.
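The procedure above can be sketched in a few lines. Here `query_fn` is a hypothetical stand-in for one call to the watermarked LLM with the fixed prompt that returns a single next token; the function names are illustrative and the number of queries per position is left to the caller (see Appendix F for the analysis of how many are needed).

```python
import random
from collections import Counter
from typing import Callable, Hashable


def estimate_distribution(samples: list[Hashable]) -> dict[Hashable, float]:
    """Normalised frequency histogram of the observed next tokens."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def sample_next_token(
    query_fn: Callable[[], Hashable], n_queries: int, rng: random.Random
) -> Hashable:
    """Query the watermarked model n_queries times and sample from the histogram."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    samples = [query_fn() for _ in range(n_queries)]
    hist = estimate_distribution(samples)
    tokens = list(hist)
    return rng.choices(tokens, weights=[hist[t] for t in tokens], k=1)[0]
```

Repeating `sample_next_token` position by position, each time appending the chosen token to the prompt, yields the progressive removal described above. Because each query may be served under a different key, the histogram approximates the key-averaged distribution rather than any single watermarked one.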

5.1 Evaluation

Experiment Setup. Our watermark, model, and dataset settings are the same as in Sec. 4.1. We study the trade-off between resistance against watermark stealing and watermark-removal attacks by evaluating a recent watermark stealing attack [JSV24]. In this attack, we query the watermarked LLM to obtain 2.2 million tokens in total to estimate the watermark pattern and then launch spoofing attacks using the estimated watermark pattern. We follow their assumption that the attacker can access the unwatermarked tokens’ distribution. In our watermark removal attack, we consider that the attacker has observations with different keys. We evaluate the detection scores (z-score or p-value) and the output perplexity (PPL, evaluated using GPT3 [OWJ⁺22]). The detection algorithm returns the maximum detection score across all the keys, which increases the expected detection score of unwatermarked text. Thus, we set the detection thresholds according to the number of keys to keep the false positive rates (FPR) below 1e-3 and report the attack success rates (ASR). We use default watermark hyperparameters.
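To see why the threshold must grow with the number of keys, the sketch below computes a per-key z-score threshold that keeps the overall FPR of max-over-keys detection at a target level. It assumes that, for unwatermarked text, the per-key z-scores are independent standard normal variables; this is a modelling assumption for illustration, and the function names are hypothetical rather than the exact calibration used in our experiments.

```python
from statistics import NormalDist


def per_key_threshold(num_keys: int, target_fpr: float = 1e-3) -> float:
    """z-score threshold so that max over num_keys null scores exceeds it w.p. target_fpr.

    With independent keys, P(max >= t) = 1 - (1 - p)^K, where p = P(z >= t).
    """
    per_key_fpr = 1.0 - (1.0 - target_fpr) ** (1.0 / num_keys)
    return NormalDist().inv_cdf(1.0 - per_key_fpr)


def detect(z_scores: list[float], target_fpr: float = 1e-3) -> bool:
    """Flag text as watermarked if the best-matching key clears the threshold."""
    return max(z_scores) >= per_key_threshold(len(z_scores), target_fpr)


for k in (1, 3, 7):
    print(k, "keys ->", round(per_key_threshold(k), 3))
```

The threshold rises only slowly with the number of keys, so adding keys costs the detector little in sensitivity on genuinely watermarked text.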

Evaluation Result. As shown in Fig. 2(a), using multiple keys can effectively defend against watermark stealing attacks. With a single key, the ASR is $91\%$ , which matches the results reported in [JSV24]. We observe that using three keys can effectively reduce the ASR to $13\%$ , and using more than 7 keys, the ASR of the watermark stealing is close to zero. However, using more keys also makes the system vulnerable to our watermark-removal attacks as shown in Fig. 2(b). When we use more than $7$ keys, the detection scores of the content produced by our watermark removal attacks closely resemble those of unwatermarked content and are much lower than the detection thresholds, with ASRs higher than $97\%$ . Fig. 2(c) suggests that using more keys improves the quality of the output content. This is because, with a greater number of keys, there is a higher probability for an attacker to accurately estimate the unwatermarked distribution, which is consistent with our analysis in Appendix F. We observe that in practice, 7 keys suffice to produce high-quality content comparable to the unwatermarked content. These observations remain consistent across various watermarking schemes and models; for additional results see Appendix I.

5.2 Discussion

Many prior works have suggested using multiple keys to defend against watermark stealing attacks. However, in this study, we reveal that a conflict exists between improving resistance to watermark stealing and the feasibility of removing watermarks. Our evaluation results show that finding a "sweet spot" in terms of the number of keys to use to mitigate both the watermark stealing and the watermark-removal attacks is not trivial. For example, our watermark-removal attack achieves a high ASR of $36.2\%$ using just three keys, and the corresponding watermark stealing-based spoofing’s ASR is $13.0\%$. Using more keys can decrease the watermark stealing-based spoofing’s ASR, but at the cost of making the system more vulnerable to watermark removal, and vice versa. We note that the ASRs with three keys are not negligible, so limiting the ability of potentially malicious users is necessary in practice to mitigate these attacks. As a practical defense, we evaluate watermark stealing with various query limits on the watermarked LLM, and find that the ASR can be significantly reduced by limiting the attacker’s query rate. Detailed results can be found in Appendix I. Given the trade-off that exists, we suggest that LLM service providers consider “defense-in-depth” techniques such as anomaly detection, query rate limiting, and user identity verification.

6 Attacking Watermark Detection APIs

It is still an open question whether watermark detection APIs should be made publicly available to users. Although this makes it easier to detect watermarked text, it is commonly acknowledged that it will make the system vulnerable to attacks [Aar23]. Here, we study this statement more precisely by examining the specific risk trade-offs that exist, as well as introducing a novel defense that may make the public detection API more feasible in practice. In the following sections, we first introduce attacks that exploit the APIs and then propose suggestions and defenses to mitigate these attacks.

6.1 Attack Procedures

Watermark-Removal Attack. For the watermark-removal attack, we consider an attacker who has access to the target watermarked LLM’s API, and can query the watermark detection results. The attacker feeds a prompt into the watermarked LLM, which generates the response in an auto-regressive manner. For each token $\textbf{x}_{i}$, the attacker generates a list of possible replacements for $\textbf{x}_{i}$. This list can be generated by querying the watermarked LLM, querying a local model, or simply returned by the watermarked LLM. In this work, we choose the third approach because of its simplicity and because it preserves the quality of the synthesized sentences. This is a common assumption made by prior works [NKIH23], and such an API is also provided by OpenAI ($\mathrm{top\_logprobs=5}$), which can help normal users understand the model confidence, debug and analyze the model’s behavior, customize sampling strategies, etc. Consider that the top $L=5$ tokens and their probabilities are returned to the attackers. The probability that the attacker can find an unwatermarked token in the candidate list of length $L$ is $1-\gamma^{L}$ for KGW and Unigram, which becomes sufficiently large given $L=5$ and $\gamma=0.5$. The attacker queries the detection API using these replacements and samples a token based on their probabilities and detection scores to remove the watermark while preserving a high output quality. See Alg. 1 in Appendix H for the detailed algorithm.
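As a quick check of the $1-\gamma^{L}$ expression, the snippet below evaluates it for the setting in the text. It assumes, as the formula does, that each of the $L$ candidates falls in the green list independently with probability $\gamma$; the function name is illustrative.

```python
def prob_unwatermarked_candidate(gamma: float, num_candidates: int) -> float:
    """Probability that at least one of the candidates is outside the green list."""
    return 1.0 - gamma ** num_candidates


# Setting from the text: top L = 5 candidates, green-list fraction gamma = 0.5.
print(prob_unwatermarked_candidate(0.5, 5))  # 0.96875
```

With five candidates, the attacker therefore has a usable non-green replacement at almost every position.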

Spoofing Attack. Spoofing attacks follow a similar procedure, except that the attacker generates (harmful) content using a local model. When sampling the tokens, instead of selecting those that yield low confidence scores as in removal attacks, the attacker chooses tokens that yield higher confidence scores upon watermark detection queries. Thanks to the robustness of the LLM watermarks, attackers do not need to ensure that every single token carries a watermark; they only need the overall detection confidence score to surpass the threshold, so that the synthesized content is detected as if it were generated by the watermarked LLM. Please refer to Alg. 2 in Appendix H for the detailed algorithm.

6.2 Evaluation

Experiment Setup. We use the same evaluation setup as in Sec. 4.1 and Sec. 5.1. We evaluate the detection scores for both the watermark-removal and the spoofing attacks. We also report the number of queries to the detection API. Furthermore, for the watermark-removal attack, where the attackers care more about the output quality, we report the output PPL. For spoofing attacks, the attackers’ local models are LLAMA-2-7B and OPT-1.3B.

Evaluation Result. As shown in Fig. 3(a) and Fig. 3(b), watermark-removal attacks exploiting the detection API significantly reduce detection confidence while maintaining high output quality. For instance, for the KGW watermark on the LLAMA-2-7B model, we achieve a median z-score of $1.43$ , which is much lower than the threshold $4$ . The PPL is also close to that of the watermarked outputs ( $6.17$ vs. $6.28$ ). We observe that the Exp watermark has higher PPL than the other two watermarks. This is because the Exp watermark is deterministic, while the other watermarks enable random sampling during inference. Our attack also employs sampling based on the token probabilities and detection scores, thus we can improve the output quality for the Exp watermark.

Watermark	wm-removal ASR	wm-removal #queries	spoofing ASR	spoofing #queries
KGW	$1.00$	$2.42$	$0.98$	$2.95$
Unigram	$0.96$	$2.66$	$0.98$	$2.96$
Exp	$0.96$	$1.55$	$0.85$	$2.89$

Table 2: The attack success rate (ASR), and the average query numbers per token for the watermark-removal and spoofing attacks exploiting the detection API on LLAMA-2-7B model.

The spoofing attacks also significantly boost the detection confidence even though the content is not from the watermarked LLM, as depicted in Fig. 3(c). We report the attack success rate (ASR) and the number of queries for both attacks in Table 2. The ASR is the fraction of generated outputs whose detection score falls below the detection threshold (for watermark removal) or surpasses it (for spoofing). These attacks use a reasonable number of queries to the detection API and achieve high success rates, demonstrating practical feasibility. We observe consistent results on OPT-1.3B; please see Appendix J.

6.3 Defending Detection with Differential Privacy

In light of the issues above, we propose an effective defense using ideas from differential privacy (DP) [DR⁺14] to counteract detection-API-based spoofing attacks. DP adds random noise to the results of a function evaluated on a private dataset such that the results from neighbouring datasets are indistinguishable. Similarly, we consider adding Gaussian noise to the distance score in the watermark detection, making the detection $(\epsilon,\delta)$ -DP [DR⁺14] and ensuring that attackers cannot tell the difference between two queries that differ by a single replaced token, thus making the attacks harder to launch. Because an attacker could otherwise average multiple query results to reduce the noise and estimate the original scores, we propose to calculate the noise based on a random seed generated by a pseudorandom function (PRF) with the sentence to be detected as the input. Specifically, $\mathtt{seed}=\mathtt{PRF}_{sk}(\textbf{x})$ , where $sk$ is the secret key held by the detection service. Users without the secret key cannot reverse or reduce the noise in the detection score. Thus, we can prevent noise reduction via averaging multiple query results without compromising the utility or protection of the DP defense. In the following, we evaluate the utility of the DP defense and its performance in mitigating the spoofing attacks.
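The seeding idea above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: HMAC-SHA256 stands in for the PRF, Python's `random.Random` (not a cryptographic generator) stands in for the noise sampler, and the function and parameter names are hypothetical.

```python
import hashlib
import hmac
import random


def noisy_detection_score(score: float, text: str, secret_key: bytes, sigma: float) -> float:
    """Add Gaussian noise whose seed is derived from the queried text.

    HMAC-SHA256 plays the role of PRF_sk (an assumption of this sketch).
    The same text always receives the same noise, so repeating a query
    and averaging the answers does not reveal the clean score.
    """
    digest = hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).digest()
    seed = int.from_bytes(digest, "big")
    rng = random.Random(seed)  # deterministic generator seeded by the PRF output
    return score + rng.gauss(0.0, sigma)


key = b"demo-key-held-by-the-detection-service"
first = noisy_detection_score(5.0, "some queried text", key, sigma=4.0)
again = noisy_detection_score(5.0, "some queried text", key, sigma=4.0)
print(first == again)  # identical queries see identical noise
```

Changing a single token changes the PRF input and therefore the noise, which is what hides the small score differences that the spoofing attack relies on.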

Experiment Setup. First, we assess the utility of the DP defense by evaluating the accuracy of the detection under various noise scales. Next, we evaluate the efficacy of the spoofing attack against the DP detection defense using the same method as in Sec. 6.1. We select the noise scale that provides the best defense while keeping the drop in accuracy within $2\%$ .

Evaluation Result. As shown in Fig. 4(a), with a noise scale of $\sigma=4$ , the DP detection’s accuracy drops from the original $98.2\%$ to $97.2\%$ on KGW and LLAMA-2-7B, while the spoofing ASR becomes $0\%$ using the same attack procedure as Sec. 6.1. The results are consistent for the Unigram and Exp watermarks and the OPT-1.3B model, as shown in Appendix K. This illustrates that the DP defense offers a favorable utility-defense trade-off: it significantly mitigates the spoofing attacks with a negligible drop in accuracy.

6.4 Discussion

The detection API, available to the public, aids users in differentiating between AI and human-created materials. However, it can be exploited by attackers to gradually remove watermarks or launch spoofing attacks. We propose a defense utilizing ideas from differential privacy, which significantly increases the difficulty of spoofing attacks. However, this method is less effective against watermark-removal attacks that exploit the detection API because the attackers’ actions will be close to random sampling, which, despite a lower success rate, remains an effective way of removing watermarks. Therefore, we leave developing a more powerful defense mechanism against watermark-removal attacks exploiting the detection API as future work. We recommend that companies providing detection services detect and curb malicious behavior by limiting query rates from potential attackers, and also verify the identity of users to protect against Sybil attacks.

7 Conclusion

In this work, we reveal new attack vectors that exploit common features and design choices of LLM watermarks. In particular, while these design choices may enhance robustness, resistance against watermark stealing attacks, and ease of public detection, they also allow malicious actors to launch attacks that can easily remove the watermark or damage the model’s reputation. Based on the theoretical and empirical analysis of our attacks, we suggest guidelines for designing and deploying LLM watermarks along with possible defenses to establish more reliable LLM watermark systems.

Our work studies the security implications of common LLM watermarking design choices. By developing realistic attacks and defenses and a simple set of guidelines for watermarking in practice, we aim for the work to serve as a resource for the development of secure LLM watermarking systems. Of course, by outlining such attacks, there is a risk that our work may in fact increase the prevalence of watermark removal or spoofing attacks performed in practice. We believe that this is nonetheless an important step towards educating the community about potential risks in watermarking systems and ultimately creating more effective defenses for secure LLM watermarking.

References

[Aar23] Scott Aaronson. Watermarking of large language models. https://simons.berkeley.edu/talks/scott-aaronson-ut-austin-openai-2023-08-17, 2023.
[BMR⁺20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[CGZ23] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194, 2023.
[DR⁺14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
[FGJ⁺23] Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, and Mingyuan Wang. Publicly detectable watermarking for language models. Cryptology ePrint Archive, 2023.
[GLLH23] Chenchen Gu, Xiang Lisa Li, Percy Liang, and Tatsunori Hashimoto. On the learnability of watermarks for language models. arXiv preprint arXiv:2312.04469, 2023.
[Gum48] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1948.
[HCW⁺23] Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. arXiv preprint arXiv:2310.10669, 2023.
[IWGZ18] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, 2018.
[JSV24] Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024.
[KGW⁺23a] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084. PMLR, 23–29 Jul 2023.
[KGW⁺23b] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023.
[KSK⁺23] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Frederick Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[KTHL23] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593, 2023.
[LCW21] Zhe Lin, Yitao Cai, and Xiaojun Wan. Towards document-level paraphrase generation with sentence rewriting and reordering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1033–1044, 2021.
[LJSL18] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, 2018.
[MLK⁺23] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
[NKIH23] Ali Naseh, Kalpesh Krishna, Mohit Iyyer, and Amir Houmansadr. Stealing the decoding algorithms of language models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 1835–1849, 2023.
[Ope22] OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI blog, https://openai.com/blog/chatgpt, 2022.
[Ope23a] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[Ope23b] OpenAI. Openai moderation endpoint. https://platform.openai.com/docs/guides/moderation, 2023.
[Ope23c] OpenAI. Openai usage policies. https://openai.com/policies/usage-policies, 2023.
[OWJ⁺22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[PSF⁺23] Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Mark my words: Analyzing and evaluating language model watermarks. arXiv preprint arXiv:2312.00273, 2023.
[SBC⁺19] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language models, 2019.
[SCS⁺22] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[SKB⁺23] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
[TMS⁺23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[WHZH23] Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. Dipmark: A stealthy, efficient and resilient watermark for large language models. arXiv preprint arXiv:2310.07710, 2023.
[WYC⁺23] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable text watermarking for large language models. arXiv preprint arXiv:2307.15992, 2023.
[ZALW24] Xuandong Zhao, Prabhanjan Vijendra Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. In The Twelfth International Conference on Learning Representations, 2024.
[ZEF⁺23] Hanlin Zhang, Benjamin Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for generative models. arXiv preprint arXiv:2311.04378, 2023.
[ZRG⁺22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Appendix A Watermarking Schemes & Hyper-Parameters

In this section, we introduce the three watermarking schemes we evaluate in the paper—KGW [KGW⁺23a], Unigram [ZALW24], and Exp [KTHL23]. We also introduce the perplexity, a metric to evaluate the sentence quality.

KGW. In the KGW watermarking scheme, when generating the current token $\textbf{x}_{t+1}$ , all the tokens in the vocabulary are pseudorandomly shuffled and split into two lists: the green list and the red list. The random seed used to determine the green and red lists is computed from a watermark secret key $sk$ and the prior $h$ tokens $\textbf{x}_{t-h+1}||\cdots||\textbf{x}_{t}$ using a pseudorandom function (PRF):

\textsc{seed}=F_{sk}(\textbf{x}_{t-h+1}||\cdots||\textbf{x}_{t}),

where $h$ is the context width of the watermark. We note that the choice of $h$ has minor influence on our attacks or defenses, as our algorithms are not dependent on $h$ . Here we use their original algorithm with $h=1$ . Then, the seed is used to split the vocabulary into the green and red lists of tokens, with $\gamma$ portion of tokens in the green list:

L_{\text{green}},L_{\text{red}}=\text{Shuffle}(\mathcal{V},\textsc{seed},\gamma)

Then, KGW generates a binary watermark mask vector for the current token prediction, which has the same size as the vocabulary. All the tokens in the green list $L_{\text{green}}$ have value $1$ in the mask, and all the tokens in the red list have value $0$ in the mask:

\textsc{mask}=\text{GenerateMask}(L_{\text{green}},L_{\text{red}})

To embed the watermark, KGW adds a constant to the logits of the LLM’s prediction for token $\textbf{x}_{t+1}$ :

\textsc{WatermarkedProb}=\text{Softmax}(\text{logits}+\delta\times\textsc{mask}),

where the logits are from the LLM and $\delta$ is the watermark strength. The LLM then samples the token $\textbf{x}_{t+1}$ according to the watermarked probability distribution.

The detection involves computing the z-score:

z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. Similar to the watermark embedding, the green and red lists for each token position are determined by watermark secret key and the token prior to the current token in the input token sequence.
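The green-list split and the z-score above can be sketched as follows for context width $h=1$. This is a toy illustration, not the KGW reference code: HMAC-SHA256 stands in for the PRF $F_{sk}$, tokens are plain integers, and the names `green_list` and `z_score` are hypothetical.

```python
import hashlib
import hmac
import math
import random


def green_list(secret_key: bytes, prev_token: int, vocab_size: int, gamma: float) -> set:
    """Pseudorandomly pick the green list for the next position (context width h = 1)."""
    digest = hmac.new(secret_key, str(prev_token).encode(), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # seed = F_sk(prev_token)
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])


def z_score(tokens, secret_key: bytes, vocab_size: int, gamma: float) -> float:
    """z = (g - gamma * l) / sqrt(gamma * (1 - gamma) * l) over the scored positions."""
    scored = list(zip(tokens[:-1], tokens[1:]))  # (previous token, current token) pairs
    g = sum(cur in green_list(secret_key, prev, vocab_size, gamma) for prev, cur in scored)
    l = len(scored)
    return (g - gamma * l) / math.sqrt(gamma * (1 - gamma) * l)


# A sequence that always continues with a green token reaches z = sqrt(l).
key = b"demo-key"
sequence = [0]
for _ in range(16):
    sequence.append(min(green_list(key, sequence[-1], vocab_size=50, gamma=0.5)))
print(z_score(sequence, key, vocab_size=50, gamma=0.5))
```

Recomputing the shuffle at every position is wasteful for a real vocabulary; the sketch favors readability over speed.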

Unigram. Similar to KGW, Unigram also splits the vocabulary into green and red lists and prioritizes the tokens in the green list by adding a constant to the logits before computing the softmax. The difference is that Unigram uses global red and green lists instead of computing the green and red lists for each token. That is, the seed used to shuffle the list is determined only by the watermark secret key and is generated by a Pseudo-Random Generator (PRG):

\textsc{seed}=G(sk)

Then, similar to KGW, the seed is used to split the vocabulary into the green and red lists of tokens, with $\gamma$ portion of tokens in the green list:

L_{\text{green}},L_{\text{red}}=\text{Shuffle}(\mathcal{V},\textsc{seed},\gamma)

The watermark embedding and detection procedures are the same as in KGW. Unigram first computes the watermark mask:

\textsc{mask}=\text{GenerateMask}(L_{\text{green}},L_{\text{red}})

It then embeds the watermark by perturbing the logits of the LLM outputs:

\textsc{WatermarkedProb}=\text{Softmax}(\text{logits}+\delta\times\textsc{mask}),

where the logits are from the LLM and $\delta$ is the watermark strength. The LLM then samples the token $\textbf{x}_{t+1}$ according to the watermarked probability distribution.

The detection also computes the z-score:

z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. According to the analysis in [ZALW24], and consistent with our results in Sec. 4.1, by decoupling the green and red list splitting from the prior tokens, Unigram is twice as robust as KGW. However, it is more likely to leak the pattern of the watermarked tokens because it uses a global green-red list split.

Exp. The Exp watermarking scheme from [KTHL23] is an extension of [Aar23]. Instead of using a single key as in KGW and Unigram, Exp inherently uses multiple watermark keys to provide the distortion-free guarantee. Each key is a vector of size $|\mathcal{V}|$ with values uniformly distributed in $[0,1]$ . That is, $sk=\xi_{1},\xi_{2},\cdots,\xi_{n}$ , where $\xi_{k}\in[0,1]^{|\mathcal{V}|},k\in[n]$ , and $n$ is the length of the watermark keys, which defaults to $256$ .

For the prediction of the token $\textbf{x}_{t+1}$ , Exp first collects the output probability vector $\textbf{p}\in[0,1]^{|\mathcal{V}|}$ from the LLM. A random shift $r\overset{{\scriptscriptstyle\$}}{\leftarrow}[n]$ is sampled when the prompt is received. Then the token $\textbf{x}_{t+1}$ is sampled using the Gumbel trick [Gum48]:

\textbf{x}_{t+1}={\arg\max}_{i}\;(\xi_{k,i})^{1/\textbf{p}_{i}},

where $k=r+t+1\text{ mod }n$ , i.e., each position uses a different watermark key, which supplies the uniform random values used in the Gumbel trick sampling. This method guarantees that the output distribution is distortion-free: its expectation is identical to the distribution without the watermark given sufficiently large $n$ .
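The sampling rule above can be checked empirically with a short sketch. It is a toy illustration that assumes fresh uniform values for every draw (in place of the key vectors $\xi_{k}$) and computes the argmax in log space to avoid numerical underflow; the function names are hypothetical.

```python
import math
import random
from collections import Counter


def gumbel_trick_sample(probs, xi):
    """Return argmax_i xi[i] ** (1 / probs[i]), computed in log space.

    `xi` holds one value in (0, 1] per vocabulary entry; tokens with zero
    probability are never selected.
    """
    best_index, best_value = -1, -math.inf
    for i, (p, u) in enumerate(zip(probs, xi)):
        if p <= 0.0:
            continue
        value = math.log(u) / p  # log of u ** (1 / p); monotone, so the argmax is unchanged
        if value > best_value:
            best_index, best_value = i, value
    return best_index


def empirical_distribution(probs, trials, seed=0):
    """Sample repeatedly with fresh uniform values and return the observed frequencies."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        xi = [1.0 - rng.random() for _ in probs]  # uniform on (0, 1], avoids log(0)
        counts[gumbel_trick_sample(probs, xi)] += 1
    return [counts[i] / trials for i in range(len(probs))]


print(empirical_distribution([0.5, 0.3, 0.2], trials=20000))
```

Averaged over the uniform values, each token is selected with its original probability, which is the distortion-free property described above; for a fixed key vector the choice is deterministic.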

The watermark detection also computes a test statistic. The basic test statistic is:

\phi=\sum_{t=1}^{l}-\log(1-\xi_{k,\textbf{x}_{t}}),

where $k=t\text{ mod }n$ . Exp then computes the minimum Levenshtein distance using the basic test statistic as a cost (see Sec. 2.4 in [KTHL23]).

Instead of using a single key as in KGW and Unigram, Exp uses multiple keys and incorporates the Gumbel trick to rigorously provide a distortion-free (unbiased) guarantee: the expected output distribution over the key space is identical to the unwatermarked distribution.

Sentence Quality. Perplexity (PPL) is one of the most common metrics for evaluating language models. It can also be utilized to measure the quality of the sentences [ZALW24, KGW⁺23a] based on the oracle of high-quality language models. Formally, PPL returns the following quality score for an input sentence x:

\textsc{PPL}(\textbf{x})=\exp\{-\frac{1}{t}\sum_{i=1}^{t}\log[\Pr(\textbf{x}_{i}|\textbf{x}_{0},\cdots,\textbf{x}_{i-1})]\}

(5)

In our evaluation, we use GPT-3 [OWJ⁺22] as the oracle model to evaluate sentence quality.
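Equation (5) reduces to a short computation once the oracle model's per-token log-probabilities are available. The sketch below assumes those log-probabilities have already been obtained (how they are queried from the oracle is outside its scope), and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean log-probability of the tokens, as in Eq. (5)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# If the oracle assigns probability 0.25 to every token, the perplexity is 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower values mean the oracle finds the text more predictable, which is why PPL is used here as a proxy for sentence quality.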

Watermark Setups and Hyper-Parameters. For the KGW [KGW⁺23a] and Unigram [ZALW24] watermarks, we use the default parameters in [ZALW24], where the watermark strength is $\delta=2$ and the green list portion is $\gamma=0.5$ . We employ a threshold of $T=4$ for these two watermarks with a single watermark key. For the scenarios where multiple keys are used, we calculate the thresholds to guarantee that the false positive rates (FPRs) are below 1e-3. For the Exp watermark (referred to as Exp-edit in [KTHL23]), we use the default parameters, where the watermark key length is $n=256$ and the block size $k$ defaults to the token length. We set the p-value threshold for Exp to $0.05$ in our experiments. We conduct the experiments on a cluster with 8 NVIDIA A100 GPUs, an AMD EPYC 7763 64-Core CPU, and 1TB of memory.

Appendix B Attack Feasibility Analysis of Piggyback Spoofing Exploiting Robustness

We study the bound on the maximum number of tokens that are allowed to be inserted or edited in a watermarked sentence, and we present the following theorem on Unigram watermark [ZALW24] due to its clean robustness guarantee:

Theorem 1 (Maximum insertion portion).

Consider a watermarked token sequence x of length $l$ . The Unigram watermark z-score threshold is $T$ , the portion of the tokens in the green list is $\gamma$ , the detection z-score of x is $z$ , and the number of inserted tokens is $s$ . Then, to guarantee the expected z-score of the edited text is greater than $T$ , it suffices to guarantee $\frac{s}{l}\leq\frac{z^{2}-T^{2}}{T^{2}}$ .

Proof.

Recall that watermark detection usually involves a statistical test. Unigram splits the vocabulary into two lists: the green list and the red list. It prioritizes the tokens in the green list during watermark embedding, and the detection computes the z-score:

z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. Let the number of the inserted toxic tokens be $s$ . Since toxic tokens are independent of the secret key $sk$ , the expected new z-score $z^{\prime}$ is:

\mathbb{E}(z^{\prime})=\frac{g+\gamma s-\gamma(l+s)}{\sqrt{\gamma(1-\gamma)(l+s)}}=z\sqrt{\frac{l}{l+s}}.

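Assuming $z\geq T>0$ , both sides of $z\sqrt{l/(l+s)}\geq T$ are positive, so the inequality is equivalent to $\frac{l+s}{l}\leq\frac{z^{2}}{T^{2}}$ . Thus, to guarantee that $\mathbb{E}(z^{\prime})\geq T$ , it suffices to guarantee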

\frac{s}{l}\leq\frac{z^{2}-T^{2}}{T^{2}}

∎

Different from the analysis in the Unigram paper on how the z-score changes given a specific number of edits, we have a tight bound on the maximum possible number of edits, which is also more straightforward for the attack feasibility analysis. According to Theorem 1, as long as the number of toxic tokens inserted is bounded by $l\frac{z^{2}-T^{2}}{T^{2}}$ , the attacker can execute a piggyback attack to generate toxic content with the target watermark embedded. The editing distance bound (Def. 3) for a sentence is $\epsilon=l\frac{z^{2}-T^{2}}{T^{2}}$ . A stronger watermark makes piggyback spoofing attacks easier by allowing more toxic tokens to be inserted. This conclusion applies universally to all robust watermarking schemes. This is a fundamental design trade-off: if a watermark is robust, such spoofing attacks are inevitable and may be extremely difficult to detect, as even one toxic token can render the entire content harmful or inaccurate.
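The bound in Theorem 1 is easy to evaluate numerically. The sketch below uses illustrative values ( $l=200$ , $z=8$ , $T=4$ ) that are not taken from the paper's experiments, assumes $T>0$ , and uses hypothetical function names.

```python
import math


def max_inserted_tokens(length: int, z: float, threshold: float) -> int:
    """Largest s with s / l <= (z^2 - T^2) / T^2, or 0 if z does not exceed T."""
    if z <= threshold:
        return 0
    return math.floor(length * (z * z - threshold * threshold) / (threshold * threshold))


def expected_z_after_insertion(length: int, z: float, inserted: int) -> float:
    """E[z'] = z * sqrt(l / (l + s)) for s inserted tokens that are independent of the key."""
    return z * math.sqrt(length / (length + inserted))


s_max = max_inserted_tokens(length=200, z=8.0, threshold=4.0)
print(s_max, expected_z_after_insertion(200, 8.0, s_max))
```

In this example the bound allows 600 inserted tokens, at which point the expected z-score has dropped exactly to the threshold, which illustrates why a more strongly watermarked text leaves more room for piggybacked content.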

Appendix C Validation of Theorem 1

In this section, we validate Theorem 1 by using watermarked texts of varying lengths $l$ and z-scores $z$ to study the relationship between $\frac{s}{l}$ and $\frac{z^{2}-T^{2}}{T^{2}}$ for the Unigram watermark. The results are shown in Fig. 5. As anticipated, in 85.78% of cases the maximum number of tokens that can be inserted into the watermarked content satisfies the bound in Theorem 1. Given that Theorem 1 bounds $s/l$ in expectation, a small portion of outliers is reasonable. We primarily visualize this result for Unigram due to its clean robustness guarantee. Similar conclusions hold for other watermarks, but their bounds on $s$ are either complex [KGW⁺23a] or lack a closed form [KTHL23], making them difficult to visualize. Our empirical findings in Fig. 1 show that an attacker can insert nontrivial portions of toxic or incorrect tokens into the watermarked text to launch the spoofing attack, and that this generalizes across all robust watermarking schemes.

Appendix D Piggyback Attack Examples

Here we present more piggyback attack results using the editing strategy, obtained by querying GPT4 with the prompt “Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.” The attack is launched on the KGW watermark and the LLAMA-2-7B model.

Appendix E Additional Results of Piggyback Spoofing Attack

In Sec. 4, we present the piggyback spoofing attack using toxic token insertion strategy on LLAMA-2-7B model. Here, we present the results on OPT-1.3B model, which are consistent with LLAMA-2-7B model’s results.

In Sec. 4, we present the fluent inaccurate editing strategy by querying the GPT4 on KGW watermark and LLAMA-2-7B model. Here we present more results of this strategy on all the three watermarks (KGW, Unigram, and Exp) and two models (LLAMA-2-7B and OPT-1.3B). The results are shown in Fig. 7, Fig. 8, and Fig. 9, which are consistent with our findings in Fig. 1, indicating that our piggyback spoofing attack can be generalized across various robust watermarks and models.

Appendix F Watermark Key Number Analysis for Watermark-Removal Attacks Exploiting the Use of Multiple Watermark Keys

Now we analyze the number of required queries under different keys to estimate the token with the highest probability without a watermark. We have the following probability bound for KGW and Unigram with the corresponding proof, and present the bound for Exp in Appendix G.

Theorem 2 (Probability bound of unwatermarked token estimation).

Suppose there are $n$ observations under different keys, and the portion of the green list in KGW or Unigram is $\gamma$ . Then the probability that the most frequent token is the same as the original unwatermarked token is

1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k}\times p(k),

(6)

where $p(k)=1-\Bigl{(}\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr{)}^{c}$ and $c$ is the number of other tokens whose watermarked probability can exceed that of the highest unwatermarked token.

In a practical scenario where $n=13,\gamma=0.5$ , and $c=3$ , Theorem 2 suggests that the attacker has a probability of $0.71$ in finding the token with the highest unwatermarked probability. This implies that we can successfully remove watermarks from over $71\%$ of tokens using a small number of observations under different keys ( $n=13$ ), yielding high-quality unwatermarked content.
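The value quoted above can be reproduced by evaluating Eq. (6) directly. The following sketch is a plain transcription of the formula using `math.comb`; the function names are illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, q: float) -> float:
    """Probability of exactly k successes in n independent trials with success rate q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def prob_recover_top_token(n: int, gamma: float, c: int) -> float:
    """Evaluate Eq. (6) for n keys, green-list portion gamma, and c competing tokens."""
    total = 0.0
    for k in range(n // 2 + 1):
        # Probability that one competitor is green for fewer than k of the other n - k keys.
        below_k = sum(binom_pmf(m, n - k, gamma) for m in range(k))
        p_k = 1.0 - below_k**c  # at least one of the c competitors reaches k
        total += binom_pmf(k, n, gamma) * p_k
    return 1.0 - total


print(round(prob_recover_top_token(n=13, gamma=0.5, c=3), 2))
```

Varying `n` in this function shows how the estimate changes with the number of keys observed by the attacker.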

Proof.

Recall that KGW and Unigram randomly split the tokens in the vocabulary into the green list and the red list. We consider greedy sampling, where the token with the highest (watermarked) probability is sampled. We have $n$ independent observations under different watermark keys. For each key, the probability that the token $\textbf{x}_{i}$ with the highest unwatermarked probability is in the green list is $\gamma$ . As long as $\textbf{x}_{i}$ is in the green list, greedy sampling will always yield $\textbf{x}_{i}$ since the watermarks add the same constant to the logits of all tokens in the green list.

Thus, the probability that the most frequent token among these $n$ observations is $\textbf{x}_{i}$ is at least:

1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k},

which is the probability that $\textbf{x}_{i}$ is in the green list for more than half of the $n$ keys.

Another token $\textbf{x}_{j}$ can have a higher watermarked probability than $\textbf{x}_{i}$ only if $\textbf{x}_{j}$ is in the green list and $\textbf{x}_{i}$ is in the red list. If $\textbf{x}_{i}$ is in the green list for $k$ keys, the probability that $\textbf{x}_{j}$ is in the green list for at least $k$ keys among the other $n-k$ keys is:

1-\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}

Suppose there are $c$ such tokens with the potential to exceed $\textbf{x}_{i}$ . Then the probability that at least one of the $c$ tokens is in the green list for at least $k$ keys among the other $n-k$ keys is:

1-\Bigl{(}\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr{)}^{c}

Thus, with all the above analysis, we have that if there are $c$ tokens that have the potential to exceed the probability of the token with highest unwatermarked probability (i.e., $\textbf{x}_{i}$ ), the probability that the most frequent token among the $n$ observations is the same as $\textbf{x}_{i}$ is:

1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k}\times\Biggl{(}1-\Bigl{(}\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr{)}^{c}\Biggr{)},

which concludes the proof. ∎

Here we assume that the watermarked LLM uses greedy sampling. In practice, greedy sampling might not be the optimal sampling strategy, but we note that it is extremely challenging to incorporate multinomial sampling when analyzing the KGW and Unigram watermarks. This is because KGW and Unigram add a bias to the output logits, which then go through the softmax function to produce the token probabilities. Given that the softmax function is not unbiased, we cannot get a tight bound on its variance. We therefore leave incorporating multinomial sampling into the analysis as a future direction. Nevertheless, our empirical results still show that the attackers can generate high-quality unwatermarked content when multinomial sampling is used. Also, our analysis of the Exp watermark in Appendix G naturally incorporates multinomial sampling.

Appendix G Probability Bound of Unwatermarked Token Estimation for Exp

In this section, we present and prove the probability bound of unwatermarked token estimation for the Exp watermark [KTHL23].

Theorem 3 (Probability bound of unwatermarked token estimation for Exp).

Suppose there are $n$ observations under different keys, and the highest probability among the unwatermarked tokens is $p$ . Then the probability that the most frequent token among the $n$ observations is the same as the original unwatermarked token with the highest probability is:

1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}p^{k}(1-p)^{n-k}

(7)

Proof.

The proof of Theorem 3 is straightforward. As introduced in Appendix A, the Exp watermark employs Gumbel trick sampling [Gum48] when embedding the watermark. Thus, the probability that we observe the token whose original unwatermarked probability is $p$ is exactly $p$ for each of the independent keys. Thus, if we make $n$ observations under different keys, the probability that more than half of them yield the token with the highest original probability $p$ is:

1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}p^{k}(1-p)^{n-k},

which concludes the proof. ∎
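Equation (7) is a binomial tail and can be evaluated in a few lines. The sketch below is a direct transcription with an illustrative function name; the inputs in the example are chosen for a simple sanity check and are not results from the paper.

```python
from math import comb


def prob_majority_is_top_token(n: int, p: float) -> float:
    """Evaluate Eq. (7): a token with probability p is drawn more than floor(n/2) times out of n."""
    tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))
    return 1.0 - tail


# Sanity check: with p = 0.5 and an odd n, the two halves of the binomial are symmetric.
print(prob_majority_is_top_token(13, 0.5))
```

The probability increases with $p$ , so tokens on which the model is confident are the easiest to recover from observations under different keys.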

Appendix H Algorithms of Attacks Exploiting the Detection API

In this section, we provide the detailed algorithm of the attacks exploiting the detection API as we have introduced in Sec. 6. Specifically, we present the algorithm for watermark-removal attack exploiting the detection API in Alg. 1 and the algorithm for spoofing attack exploiting the detection API in Alg. 2.

Algorithm 1 Watermark-removal attack exploiting the detection API.

Input: Prompt $\textbf{x}_{\text{prompt}}$, watermarked LLM $M_{\text{wm}}$, detection API $f_{\text{detection}}$, maximum output token number $m\geq 2$

Let $k\leftarrow 5$
$\textbf{x}_{1}\sim M_{\text{wm}}(\textbf{x}_{\text{prompt}})$
for $t=2$ to $m$ do
  $(\textbf{x}_{t}^{1},\textbf{x}_{t}^{2},\cdots,\textbf{x}_{t}^{k}),(\textbf{p}_{t}^{1},\textbf{p}_{t}^{2},\cdots,\textbf{p}_{t}^{k})\leftarrow M_{\text{wm}}(\textbf{x}_{\text{prompt}}||\textbf{x}_{1}\cdots\textbf{x}_{t-1})$ {The watermarked LLM returns the top $k$ tokens and their corresponding probabilities in descending order.}
  for $i=1$ to $k$ do
    $d_{i}\leftarrow f_{\text{detection}}(\textbf{x}_{1}||\cdots||\textbf{x}_{t-1}||\textbf{x}_{t}^{i})$
  end for
  $d_{\text{min}}\leftarrow\min(d_{1},d_{2},\cdots,d_{k})$ {Get the detection score with the lowest confidence.}
  $l_{\text{candidate}}\leftarrow\text{empty}$
  for $i=1$ to $k$ do
    if $d_{\text{min}}=d_{i}$ then
      $l_{\text{candidate}}\leftarrow l_{\text{candidate}}||\textbf{x}_{t}^{i}$ {Get all the tokens with the lowest detection confidence.}
    end if
  end for
  if $\textbf{x}_{t}^{1}\in l_{\text{candidate}}$ then
    $j\leftarrow 1$ {If the token with the highest probability (the first token) is in the list, output that token.}
  else
    $c\leftarrow 1$
    for $\textbf{x}_{t}^{i}\in l_{\text{candidate}}$ do
      $\textbf{p}_{t}^{i}\leftarrow\textbf{p}_{t}^{1}/c$ {Update the probabilities of the tokens that have the lowest detection confidence scores.}
      $c\leftarrow c+1$
    end for
    $\textbf{p}_{t}^{1}\leftarrow 0$
    $j\leftarrow\text{Sample}(\textbf{p}_{t}^{1},\cdots,\textbf{p}_{t}^{k})$ {Sample the tokens according to the updated probabilities.}
  end if
  $\textbf{x}_{t}\leftarrow\textbf{x}_{t}^{j}$
end for
Return $\textbf{x}_{1},\textbf{x}_{2},\cdots,\textbf{x}_{m}$
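The control flow of Alg. 1 can be sketched in Python as follows. This is a minimal illustration, not our experimental implementation: `toy_top_k`, `toy_detect`, and the `GREEN` set are hypothetical stand-ins for the watermarked LLM and the detection API, the first token is taken as the top-1 token rather than sampled, and list indices are 0-based, so index 0 corresponds to $\textbf{x}_{t}^{1}$.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical stand-ins for the watermarked LLM and the detection API.
GREEN = {"alpha", "gamma", "epsilon"}

def toy_top_k(prompt: str, generated: Sequence[str], k: int) -> Tuple[List[str], List[float]]:
    """Return the top-k tokens and probabilities in descending order (fixed toy values)."""
    tokens = ["alpha", "beta", "gamma", "delta", "epsilon"]
    probs = [0.40, 0.25, 0.15, 0.12, 0.08]
    return tokens[:k], probs[:k]

def toy_detect(tokens: Sequence[str]) -> float:
    """Toy detection confidence: the number of green tokens in the text."""
    return float(sum(tok in GREEN for tok in tokens))

def removal_attack(prompt, top_k_fn, detect_fn, m, k=5, rng=None):
    rng = rng or random.Random(0)
    # Simplification: use the top-1 token as x_1 instead of sampling it.
    out = [top_k_fn(prompt, [], k)[0][0]]
    for _ in range(2, m + 1):
        tokens, probs = top_k_fn(prompt, out, k)
        probs = list(probs)
        # Query the detection API once per candidate token.
        scores = [detect_fn(out + [tok]) for tok in tokens]
        d_min = min(scores)
        candidates = [i for i, d in enumerate(scores) if d == d_min]
        if 0 in candidates:
            j = 0  # the highest-probability token already has the lowest score
        else:
            # Reassign p^1 / c to the lowest-score tokens, then zero out p^1.
            for c, i in enumerate(candidates, start=1):
                probs[i] = probs[0] / c
            probs[0] = 0.0
            # random.choices accepts unnormalized, non-negative weights.
            j = rng.choices(range(len(tokens)), weights=probs, k=1)[0]
        out.append(tokens[j])
    return out

attacked = removal_attack("prompt", toy_top_k, toy_detect, m=20)
baseline = ["alpha"] * 20  # what greedy decoding of the toy model would emit
print(toy_detect(attacked), toy_detect(baseline))
```

As in Alg. 1, tokens outside the candidate list keep their original probabilities, so the output can still contain some green tokens; the attack only shifts probability mass toward the tokens with the lowest detection confidence.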

Algorithm 2 Spoofing attack exploiting the detection API.

Input: Prompt $\textbf{x}_{\text{prompt}}$, local LLM $M$, detection API $f_{\text{detection}}$, maximum output token number $m$

Let $k\leftarrow 3$
for $t=1$ to $m$ do
  $(\textbf{x}_{t}^{1},\textbf{x}_{t}^{2},\cdots,\textbf{x}_{t}^{k}),(\textbf{p}_{t}^{1},\textbf{p}_{t}^{2},\cdots,\textbf{p}_{t}^{k})\leftarrow M(\textbf{x}_{\text{prompt}}||\textbf{x}_{1}\cdots\textbf{x}_{t-1})$ {The local LLM returns the top $k$ tokens and their corresponding probabilities in descending order.}
  for $i=1$ to $k$ do
    $d_{i}\leftarrow f_{\text{detection}}(\textbf{x}_{1}||\cdots||\textbf{x}_{t-1}||\textbf{x}_{t}^{i})$
  end for
  $j\leftarrow\arg\max(d_{1},d_{2},\cdots,d_{k})$ {Get the token resulting in the highest confidence.}
  $\textbf{x}_{t}\leftarrow\textbf{x}_{t}^{j}$
end for
Return $\textbf{x}_{1},\textbf{x}_{2},\cdots,\textbf{x}_{m}$
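Alg. 2 admits an even shorter sketch. As before, `toy_local_top_k`, `toy_detect`, and the `GREEN` set are hypothetical stand-ins for the local LLM and the detection API, used only to make the example runnable.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-ins for the local (unwatermarked) LLM and the detection API.
GREEN = {"alpha", "gamma", "epsilon"}

def toy_local_top_k(prompt: str, generated: Sequence[str], k: int) -> Tuple[List[str], List[float]]:
    """Return the top-k tokens and probabilities in descending order (fixed toy values)."""
    tokens = ["beta", "alpha", "delta"]
    probs = [0.5, 0.3, 0.2]
    return tokens[:k], probs[:k]

def toy_detect(tokens: Sequence[str]) -> float:
    """Toy detection confidence: the number of green tokens in the text."""
    return float(sum(tok in GREEN for tok in tokens))

def spoofing_attack(prompt, top_k_fn, detect_fn, m, k=3):
    out: List[str] = []
    for _ in range(m):
        tokens, _probs = top_k_fn(prompt, out, k)
        # Query the detection API once per candidate token.
        scores = [detect_fn(out + [tok]) for tok in tokens]
        # Keep the candidate with the highest detection confidence
        # (ties resolve to the higher-probability token).
        j = max(range(len(tokens)), key=scores.__getitem__)
        out.append(tokens[j])
    return out

spoofed = spoofing_attack("prompt", toy_local_top_k, toy_detect, m=10)
print(spoofed, toy_detect(spoofed))
```

In this toy setting the attack always picks the only green candidate, so the detector's confidence grows with every generated token even though the local model itself is unwatermarked.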

Appendix I Additional Results of Watermark-Removal Attacks Exploiting the use of Multiple Watermark Keys

In this section, we provide additional evaluation results for watermark stealing [JSV24] and for our watermark-removal attacks exploiting the use of multiple watermark keys (see Sec. 5) on all three watermarks (KGW, Unigram, and Exp) and two models (LLAMA-2-7B and OPT-1.3B). The results are shown in Fig. 11, Fig. 12, Fig. 13, Fig. 14, and Fig. 15. For the KGW watermark on the OPT-1.3B model and the Unigram watermark on the LLAMA-2-7B and OPT-1.3B models, our observations are consistent with those for the KGW watermark on LLAMA-2-7B presented in Sec. 5.1, demonstrating the effectiveness and generalizability of our attacks. For the Exp watermark, our results in Fig. 12 and Fig. 15 also show that the watermark can be easily removed by using multiple queries to estimate the distribution of the unwatermarked tokens.

The results of watermark stealing [JSV24] on the Unigram watermark and the OPT-1.3B model are also consistent with our observations in Sec. 5. Using more keys effectively mitigates watermark stealing; however, it makes the system more vulnerable to our watermark-removal attacks. Throughout these experiments, we observe that using three keys is the best choice for defending against both attacks. However, the attack success rates with three keys are not negligible. Thus, consistent with our guidelines in Sec. 5, we strongly recommend that the LLM service provider also limit the capabilities of potentially malicious users.

To further verify that the LLM service provider can mitigate watermark stealing attacks by limiting the attacker’s query rate, we present the stealing attack results with various numbers of queries on the KGW watermark and LLAMA-2-7B model using three keys in Fig. 10. The results show that limiting the attacker’s query rate significantly decreases the success rate of the watermark stealing attack. Thus, we recommend that the LLM service provider follow a “defense-in-depth” approach and use complementary techniques such as anomaly detection, query rate limiting, and user identity verification to mitigate stealing and removal attacks.

We note that watermark stealing attacks do not work on the Exp watermark [KTHL23], as the use of a large number of watermark keys is inherent in its design, with a default of $256$ keys. We therefore omit the watermark stealing results on Exp, but we show that this watermark is inherently vulnerable to our watermark-removal attack. From the results in Fig. 12 and Fig. 15, we conclude that with $n=13$ queries, the resulting p-value is very close to that of content without a watermark and significantly different from the watermarked p-value, which shows that we can effectively remove the watermark using $13$ queries per token. We note that for Exp, the perplexity of the watermarked content is significantly higher than that of the unwatermarked content. This is mainly because Exp does not allow sampling during watermark embedding: once the key is fixed, embedding becomes a deterministic algorithm. In contrast, our watermark-removal attack generates content with much lower perplexity, comparable to that of unwatermarked content when the number of queries under different keys exceeds $13$. This can be attributed to our attack functioning as a layer of random sampling: unlike greedy sampling methods, our attack selects the token with the highest unwatermarked probability with a certain probability rather than deterministically (see Sec. 4, Appendix F, and Appendix G). The results on the three watermarks and two models demonstrate that the watermark-removal attack exploiting the use of multiple watermark keys can effectively eliminate the watermarks while maintaining high output quality.
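The key-averaging effect described above can be illustrated with a small simulation. The sketch assumes a simplified Exp-style rule that, for each key, selects $\arg\max_i \log(r_i)/p_i$ with $r_i$ drawn uniformly from a key-seeded generator; the distribution `probs`, the seeding scheme, and the function names are hypothetical and do not reproduce the reference implementation of [KTHL23].

```python
import math
import random
from collections import Counter

def exp_sample(probs, key):
    """Select argmax_i log(r_i) / p_i, with r_i uniform and derived from the key.

    For a fixed key the choice is deterministic; over random keys, token i
    is selected with probability p_i.
    """
    rng = random.Random(key)
    best_index, best_score = -1, -math.inf
    for i, p in enumerate(probs):
        r = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        if p <= 0.0:
            continue
        score = math.log(r) / p
        if score > best_score:
            best_index, best_score = i, score
    return best_index

def estimate_top_token(probs, keys):
    """Majority vote over the tokens observed under different keys."""
    votes = Counter(exp_sample(probs, key) for key in keys)
    return votes.most_common(1)[0][0]

probs = [0.6, 0.25, 0.15]  # hypothetical unwatermarked distribution
n, trials = 13, 300

# Fraction of single keys that yield the top token (close to p = 0.6).
single_key_rate = sum(exp_sample(probs, key) == 0 for key in range(5000)) / 5000

# Fraction of trials in which the vote over n keys recovers the top token.
hits = sum(
    estimate_top_token(probs, range(t * n, (t + 1) * n)) == 0 for t in range(trials)
)
success_rate = hits / trials
print(single_key_rate, success_rate)
```

A single key reveals the top token only with probability about $p$, whereas the vote over $13$ keys recovers it far more often, which is the effect that Theorem 3 bounds from below.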

Appendix J Additional Results of Attacks Exploiting Detection APIs

We present the results of the watermark-removal and spoofing attacks on the OPT-1.3B model in Fig. 16 and Table 3. The results are consistent with those for the LLAMA-2-7B model presented in Sec. 6.1: all attack success rates are higher than $75\%$ with a small number of queries to the detection API, around $3$ per token. The results on the OPT-1.3B model further demonstrate the effectiveness of our attacks exploiting the detection API.

Watermark	Removal ASR	Removal #queries	Spoofing ASR	Spoofing #queries
KGW	$0.99$	$2.87$	$1.00$	$2.96$
Unigram	$0.77$	$3.25$	$1.00$	$2.97$
Exp	$0.86$	$2.07$	$0.93$	$2.92$

Table 3: Attack success rate (ASR) and average number of queries per token for the watermark-removal and spoofing attacks exploiting the detection API on the OPT-1.3B model.

Appendix K Additional Results of DP Defense

We present additional evaluation results for our defense technique, which enhances watermark detection using techniques from differential privacy (see Sec. 6). Consistent with Sec. 6.3, we evaluate the utility of the DP defense as well as its performance in mitigating the spoofing attack exploiting the detection API. The results are shown in Fig. 17, Fig. 18, Fig. 19, Fig. 20, and Fig. 21.

We first identify the optimal noise scale parameter $\sigma$ based on detection accuracy and attack success rate, aiming for a drop in detection accuracy within $2\%$ together with the lowest attack success rate. We then assess the performance of the defense. Our findings across three watermarks and two models consistently demonstrate that we can significantly reduce the attack success rate to around or below $20\%$.
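The role of the noise scale $\sigma$ can be illustrated with a minimal sketch. It assumes that the detector adds zero-mean Gaussian noise to a scalar detection score before thresholding; the threshold of $4.0$, the example scores, and the function names are hypothetical choices made only for this illustration.

```python
import random

def noisy_detect(z_score, sigma, threshold=4.0, rng=None):
    """Return the noised score and the resulting watermark verdict."""
    rng = rng or random.Random()
    noisy = z_score + rng.gauss(0.0, sigma)
    return noisy, noisy > threshold

def flag_rate(z_score, sigma, trials=2000, seed=0):
    """Fraction of repeated queries that are flagged as watermarked."""
    rng = random.Random(seed)
    hits = sum(noisy_detect(z_score, sigma, rng=rng)[1] for _ in range(trials))
    return hits / trials

# A score far above the threshold is still flagged almost always,
# so the utility of the detector is largely preserved.
clear_rate = flag_rate(10.0, sigma=2.0)

# A score near the threshold yields close to a coin flip, so the small
# per-token score differences that an attacker probes for are masked.
border_rate = flag_rate(4.1, sigma=2.0)
print(clear_rate, border_rate)
```

Choosing $\sigma$ then amounts to balancing these two effects: a larger $\sigma$ hides more of the signal available to the attacker but eventually degrades detection accuracy on genuine watermarked text.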

Our defense can be generalized to all LLM watermarking schemes. It allows us to substantially mitigate spoofing attacks exploiting the detection API while having negligible impact on utility.