No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith
Carnegie Mellon University
{qipang, shengyuanhu, wenting, smithv}@cmu.edu
Abstract

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack—leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.

1 Introduction

Modern generative modeling systems have notably enhanced the quality of AI-produced content [BMR+20, SCS+22, Ope23a, Ope22]. For example, large language models (LLMs) like those powering ChatGPT [Ope22] can generate text closely resembling human-crafted sentences. While this has led to exciting new applications of machine learning, there is also growing concern around the potential for misuse of these models, leading to a flurry of recent efforts on developing techniques to detect AI-generated content. A promising approach in this direction is to embed invisible watermarks into model-derived content, which can then be extracted and verified using a secret watermark key [KGW+23a, FGJ+23, CGZ23, KTHL23, ZALW24, KGW+23b, HCW+23, WHZH23, WYC+23].

In this work, we identify that many of the key properties that make existing LLM watermarks successful can also render them susceptible to attack. In particular, we study a number of simple attacks that take advantage of common design choices of existing watermarking schemes, including:

  1. Robustness of the watermarks to potential modifications in the output text, so that the watermarks cannot be easily removed [KTHL23, KGW+23a, ZALW24, CGZ23, HCW+23];

  2. The use of multiple keys to protect against watermark stealing attacks [KGW+23a, FGJ+23, KTHL23, JSV24, SKB+23, GLLH23]; and

  3. Public detection APIs, which allow the general public to easily verify whether or not candidate text is AI-generated [FGJ+23, KGW+23a, SBC+19, MLK+23].

While these common features and design choices of existing watermarking schemes have clear benefits, we show that they also make the resulting systems vulnerable to a number of simple but effective attacks. In particular, we study two types of attacks: 1) watermark-removal attacks, which remove the watermark from the watermarked content, and 2) spoofing attacks, which create (potentially toxic) content with a target watermark embedded, thus making the content appear to be generated by a specific LLM when it is in fact not (see motivating examples in Table 1).

Prompt: Alan Turing was born in ...

Unwatermarked (Z-Score: 0.16 ↓, PPL: 3.19): Alan Turing was born in 1912 and died in 1954. He was an English mathematician, logician, cryptanalyst, and computer scientist. In 1938, Turing joined the Government Code and Cypher School (GC&CS), where he contributed to the design of the bombe, a machine that was used to decipher the Enigma-enciphered messages...

Watermarked (Z-Score: 5.98 ↑, PPL: 4.46): Alan Turing was born in 1912 and died in 1954, at the age of 41. He was the brilliant British scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and chemistry. He was also one of the pioneers of computer science...

(a) Piggyback spoofing attack, exploiting robustness (Z-Score: 5.98 ↑): Alan Turing was born in 1950 and died in 1994, at the age of 43. He was the brilliant American scientist and mathematician who is largely credited with being the father of modern computer science. He is known for his contributions to mathematical biology and musicology. He was also one of the pioneers of computer science...

(b) Watermark-removal attack, exploiting multiple keys (Z-Score: 2.40 ↓, PPL: 4.05): Alan Turing was born in 1912 and died in 1954. He was a mathematician, logician, cryptologist and theoretical computer scientist. He is famous for his work on code-breaking and artificial intelligence, and his contribution to the Allied victory in World War II. Turing was born in London. He showed an interest in mathematics...

(c) Watermark-removal attack, exploiting public detection API (Z-Score: 1.47 ↓, PPL: 4.57): Alan Turing was born in 1912 and died in 1954. He was an English mathematician, computer scientist, cryptanalyst and philosopher. Turing was a leading mathematician and cryptanalyst. He was one of the key players in cracking the German Enigma Code during World War II. He also came up with the Turing Machine...

Table 1: Examples generated using LLAMA-2-7B with/without the KGW watermark [KGW+23a] under various attacks. We mark tokens in the green and red lists (see Appendix A). Z-score reflects the detection confidence of the watermark, and perplexity (PPL) measures text quality. (a) In the piggyback spoofing attack, we exploit watermark robustness by generating incorrect content that appears as watermarked (matching the z-score of the watermarked baseline), potentially damaging the reputation of the LLM. Incorrect tokens modified by the attacker are marked in orange and watermarked tokens in blue. (b-c) In watermark-removal attacks, attackers can effectively lower the z-score below the detection threshold while preserving a high sentence quality (low PPL) by exploiting either the (b) use of multiple keys or (c) publicly available watermark detection API.

Our work rigorously explores a number of simple removal and spoofing attacks for LLM watermarks. In doing so, we identify critical trade-offs that emerge between watermark robustness, utility, and usability as a result of watermarking design choices. To navigate these trade-offs, we propose potential defenses as well as a set of general guidelines to better enhance the security of next-generation LLM watermarking systems. Overall, we make the following contributions:

  • We study how watermark robustness, despite being a desirable property to mitigate removal attacks, can make the resulting systems highly susceptible to piggyback spoofing attacks, a simple type of attack that makes watermarked text toxic or inaccurate through small modifications, and show that challenges exist in detecting these attacks given that a single token can render an entire sentence inaccurate (Sec. 4).

  • We show that using multiple watermarking keys can make the system susceptible to watermark removal attacks (Sec. 5). Although a larger number of keys can help defend against watermark stealing attacks, which can be used to launch either spoofing or removal attacks, we show both theoretically and empirically that this in turn increases the potential for watermark removal attacks.

  • Finally, we identify that public watermark detection APIs can be exploited by attackers to launch both watermark-removal and spoofing attacks (Sec. 6). We propose a defense using techniques from differential privacy to effectively counteract spoofing attacks, showing that applying pseudorandom noise derived from the input prevents an attacker from averaging the noise away through repeated queries.

Throughout, we explore our attacks on three state-of-the-art watermarks [KGW+23a, ZALW24, KTHL23] and two LLMs (LLAMA-2-7B [TMS+23] and OPT-1.3B [ZRG+22])—demonstrating that these vulnerabilities are common to existing LLM watermarks, and providing caution for the field in deploying current solutions in practice without carefully considering the impact and trade-offs of watermarking design choices.

2 Related Work

Advances in large language models (LLMs) have given rise to increasing concerns that such models may be misused for purposes such as spreading misinformation, phishing, and academic cheating. In response, numerous recent works have proposed watermarking schemes as a tool for detecting LLM-generated text to mitigate potential misuse [KGW+23a, FGJ+23, CGZ23, KTHL23, ZALW24, KGW+23b, HCW+23, WHZH23, WYC+23]. These approaches involve embedding invisible watermarks into the model-generated content, which can then be extracted and verified using a secret watermark key. Existing watermarking schemes share a few natural goals: (1) the watermark should be robust in that it cannot be easily removed; (2) the watermark should not be easily stolen, since a stolen watermark enables spoofing or removal attacks; and (3) the presence of a watermark should be easy to detect when given new candidate text. Unfortunately, we show that existing methods that aim to achieve these goals can in turn enable simple watermark removal or spoofing attacks.

Removal attacks. Several recent works have highlighted that paraphrasing methods may be used to evade the detection of AI-generated text [KSK+23, IWGZ18, LJSL18, LCW21, ZEF+23], with [KSK+23, ZEF+23] demonstrating effective watermark removal using a local LLM. These methods usually require additional training for sentence paraphrasing which can impact sentence quality, or assume a high-quality oracle model to guarantee the output quality is preserved. In contrast, the simple and scalable removal attacks herein do not require additional training or a high-quality oracle. Additionally, our work differs in that we aim to directly connect and study how the inherent properties and design choices of watermarking schemes (such as the use of multiple keys and detection APIs) can inform such removal attacks.

Spoofing attacks. Prior works on spoofing use watermark stealing attacks to first estimate the watermark pattern and then embed it into arbitrary content to launch spoofing attacks. These attacks usually require the attacker to pay a large startup cost by obtaining a significant number of watermarked tokens. For example, [SKB+23] requires 1 million queries to the watermarked LLM, and [JSV24, GLLH23] assume the attacker can obtain millions of watermarked tokens to estimate their distribution. Unlike these works, we explore spoofing attacks that are less flexible but can be launched with significantly less upfront cost. In Sec. 4, we explore a very simple and scalable form of spoofing exploiting the inherent robustness property of watermarks, which we refer to as a ‘piggyback spoofing attack’. In Sec. 6, we then explore more general spoofing attacks, which, instead of querying the watermarked LLM numerous times, exploit the public detection API. In both cases, our attacks do not require the attacker to estimate the watermark pattern, but they share the ultimate goal of prior spoofing attacks: to create falsified, inaccurate, or toxic content that appears to be watermarked.

3 Preliminaries

Before exploring attacks and defenses on watermarking systems, we introduce relevant background on LLMs, notation we use throughout the work, and a set of concrete threat models.

Notation. We use $\textbf{x}$ to denote a sequence of tokens, $\textbf{x}_{i} \in \mathcal{V}$ is the $i$-th token in the sequence, and $\mathcal{V}$ is the vocabulary. $M_{\text{orig}}$ denotes the original model without a watermark, $M_{\text{wm}}$ is the watermarked model, and $sk \in \mathcal{S}$ is the watermark secret key sampled from the key space $\mathcal{S}$.

Language Models. Current state-of-the-art (SOTA) LLMs are auto-regressive models, which predict the next token based on the prior tokens. We define language models more formally below:

Definition 1 (LM).

We define a language model (LM) without a watermark as:

$$M_{\text{orig}}: \mathcal{V}^{*} \rightarrow \mathcal{V}, \tag{1}$$

where the input $\textbf{x}$ is a sequence of $t$ tokens. $M_{\text{orig}}(\textbf{x})$ first returns the probability distribution for the next token $\textbf{x}_{t+1}$, and the LM then samples $\textbf{x}_{t+1}$ from this distribution.

Watermarks for LLMs. In this work, we focus on three SOTA decoding-based watermarking schemes: KGW [KGW+23a], Unigram [ZALW24] and Exp [KTHL23]. Informally, decoding-based watermarks are embedded by perturbing the output distribution of the original LLM. The perturbation is determined by secret watermark keys held by the LLM owner. Formally, we define the watermarking scheme:

Definition 2 (Watermarked LLMs).

The watermarked LLM takes token sequence $\textbf{x} \in \mathcal{V}^{*}$ and secret key $sk \in \mathcal{S}$ as input, and outputs a perturbed probability distribution for the next token. The perturbation is determined by $sk$:

$$M_{\text{wm}}: \mathcal{V}^{*} \times \mathcal{S} \rightarrow \mathcal{V} \tag{2}$$

The watermark detection outputs the statistical testing score for the null hypothesis that the input token sequence is independent of the watermark secret key:

$$f_{\text{detection}}: \mathcal{V}^{*} \times \mathcal{S} \rightarrow \mathbb{R} \tag{3}$$

The output score reflects the confidence of the watermark’s existence in the input. Please refer to Appendix A for additional details of the specific watermarks explored in this work [KGW+23a, ZALW24, KTHL23].
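To make the detection function in Eq. (3) concrete, the following minimal sketch computes a z-score in the style of green-list watermarks such as KGW [KGW+23a]. It is an illustration under simplifying assumptions, not the reference implementation of any scheme studied here: the SHA-256 based green-list assignment, the names `is_green` and `z_score`, and the green-list fraction `gamma = 0.5` are hypothetical choices made for this example.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, secret_key: bytes, gamma: float = 0.5) -> bool:
    """Pseudorandomly place `token` in the green list, seeded by the key and the previous token.

    Assumption: a per-token hash stands in for the keyed vocabulary partition
    used by real green-list watermarks.
    """
    payload = secret_key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(payload).digest()
    # Map the first 8 bytes to [0, 1) and compare with the green-list fraction.
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens: list[int], secret_key: bytes, gamma: float = 0.5) -> float:
    """One-proportion z-test on the number of green tokens.

    Under the null hypothesis (text independent of the key), each scored
    token is green with probability `gamma`.
    """
    scored = len(tokens) - 1  # the first token has no predecessor to seed its green list
    if scored <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, secret_key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

Text is flagged as watermarked when the score exceeds a threshold $T$; a sequence whose scored tokens are all green attains the largest possible score for its length.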

3.1 Threat Model

Attacker’s Objective & Motivation. We study two types of attacks—watermark-removal attacks and (piggyback or general) spoofing attacks. In the watermark-removal attack, the attacker aims to generate a high-quality response from the LLM without an embedded watermark. For the spoofing attacks, the goal is to generate a harmful or incorrect output that has the victim organization’s watermark embedded.

We present two practical scenarios to motivate watermark-removal attacks: (i) A student or a journalist uses high-quality watermarked LLMs to write articles, but wants to remove the watermark to claim originality. (ii) A malicious company offering LLM services for clients, instead of developing their own LLMs, simply queries a watermarked LLM from a victim company and removes the watermark, potentially infringing upon IP rights of the victim company.

In piggyback and spoofing attacks, an attacker can damage the reputation of a victim company offering an LLM service. For example: (i) The attacker can use a spoofing attack to generate fake news or incorrect facts and post them on social media. By claiming the material is generated by the LLM from the benign company, the attacker can damage the reputation of the company and their model. (ii) The attacker can use the spoofing attack to inject malicious code into some public software. The code has the benign company’s watermark embedded, and the benign company may thus be at fault and have to bear responsibility for the actions.

Attacker’s Capabilities. We study attacks by exploiting three common design choices in watermarks: 1) robustness, 2) the use of multiple keys, and 3) public detection APIs. Each attack requires the adversary to have different capabilities, but we make assumptions that are practical and easy to achieve in real-world deployment scenarios.

1) For piggyback spoofing attacks exploiting robustness (Sec. 4), we assume that the attacker can make $\mathcal{O}(1)$ queries to the target watermarked LLM. We also assume that the attacker can edit the generated sentence (e.g., insert or substitute tokens).

2) For watermark-removal attacks exploiting the use of multiple keys (Sec. 5), we consider the scenario where multiple watermark keys are utilized to embed the watermark, which is a common practice in designing robust cryptographic protocols and is suggested by SOTA watermarks [KTHL23, KGW+23a] to improve resistance against watermark-stealing attacks [JSV24, GLLH23, SKB+23]. For a sentence of length $l$, we assume that the attacker can make $\mathcal{O}(l)$ queries to the watermarked LLM.

3) For the attacks on detection APIs (Sec. 6), we assume that the detection API is available to normal users and the attacker can make $\mathcal{O}(l)$ queries for a sentence of length $l$. The detection returns the watermark confidence score (p-value or z-score). For spoofing attacks exploiting the detection APIs, we assume that the attacker can auto-regressively synthesize (toxic) sentences. For example, they can run a local (small) model to synthesize such sentences. For watermark-removal attacks exploiting the detection APIs, we also assume that the attacker can make $\mathcal{O}(l)$ queries to the watermarked LLM. As is common practice [NKIH23, OWJ+22] and also enabled by OpenAI’s API, we assume that the top 5 tokens at each position and their probabilities are returned to the attackers.

4 Attacking Robust Watermarks

The goal of developing a watermark that is robust to output perturbations is to defend against watermark removal, which may be used to circumvent detection schemes for applications such as phishing or fake news generation. Robust watermark designs have been the topic of many recent works [ZALW24, KGW+23a, KTHL23, SKB+23, KGW+23b, PSF+23]. We formally define watermark robustness in the following definition.

Definition 3 (Watermark robustness).

A watermark is $(\epsilon, \delta)$-robust, given a watermarked text $\textbf{x}$, if for all its neighboring texts within the $\epsilon$ editing distance, the probability that the detection fails to detect the edited text is bounded by $\delta$, given the detection confidence threshold $T$:

$$\forall \textbf{x}, \textbf{x}' \in \mathcal{V}^{*},\ \Pr[f_{\text{detection}}(\textbf{x}', sk) < T] < \delta, \quad \text{s.t.}\ f_{\text{detection}}(\textbf{x}, sk) \geq T,\ d(\textbf{x}, \textbf{x}') \leq \epsilon.$$

More robust watermarks can better defend against editing attacks, but this seemingly desirable property can also be easily misused by malicious users to launch simple piggyback spoofing attacks—e.g., a small portion of toxic or incorrect content can be inserted into the watermarked material, making it seem like it was generated by a specific watermarked LLM. The toxic content will still be detected as watermarked, potentially damaging the reputation of the LLM service provider. As discussed in Sec. 2, spoofing attacks explored in prior work usually require the attacker to obtain millions of watermarked tokens upfront to estimate the watermark pattern [JSV24, SKB+23, GLLH23]. In contrast, our simple piggyback spoofing only requires a single query to the watermarked LLM with careful text modifications, and the effectiveness relates directly to the robustness of the LLM watermark.

Attack Procedure. (i) The attacker queries the target watermarked LLM to receive a high-entropy watermarked sentence $\textbf{x}_{\text{wm}}$. (ii) The attacker edits $\textbf{x}_{\text{wm}}$ to form a new piece of text $\textbf{x}'$ and claims that $\textbf{x}'$ is generated by the target LLM. The editing method can be defined by the attacker. Simple strategies could include inserting toxic tokens into the watermarked sentence $\textbf{x}_{\text{wm}}$ at random positions, or editing specific tokens to make the output inaccurate (see example in Table 1). As we show, editing can also be done at scale by querying another LLM like GPT4 to generate fluent output.
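The random-insertion editing strategy can be sketched in a few lines. This is only an illustration of step (ii): the function name `insert_tokens`, the word-level token representation, and the placeholder strings used for the injected tokens are assumptions made here, not the tooling used in our experiments.

```python
import random


def insert_tokens(watermarked: list[str], injected: list[str], count: int, seed: int = 0) -> list[str]:
    """Insert `count` tokens drawn from `injected` at random positions.

    The original watermarked tokens keep their relative order, so most of
    the watermark signal is expected to survive the edit.
    """
    rng = random.Random(seed)
    edited = list(watermarked)
    for _ in range(count):
        position = rng.randint(0, len(edited))  # inclusive bound, so appending is possible
        edited.insert(position, rng.choice(injected))
    return edited
```

The attacker would then submit the edited text to the detector; as long as the detection score stays above the threshold, the edited text is attributed to the target LLM.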

We present the formal analysis on the attack feasibility in Appendix B and point out the takeaway that is universally applicable to all robust watermarks: A more robust watermark makes piggyback spoofing attack easier by allowing more toxic tokens to be inserted. This is a fundamental design trade-off: If a watermark is robust, such spoofing attacks are inevitable and may be extremely difficult to detect, as even one toxic token can render the entire content harmful or inaccurate.
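A back-of-the-envelope calculation illustrates this takeaway. The sketch below is a simplified model and not the analysis of Appendix B: it assumes a z-score detector over a fixed, context-free green list covering a fraction `gamma` of the vocabulary, and it assumes that every inserted token falls in the red list, which is the least favorable case for the attacker. The function names and the numbers in the usage note are illustrative.

```python
import math


def z_after_insertion(green: int, total: int, inserted: int, gamma: float) -> float:
    """z-score after adding `inserted` non-green tokens to a text with `green` of `total` green tokens."""
    n = total + inserted
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def max_insertable(green: int, total: int, gamma: float, threshold: float) -> int:
    """Largest number of non-green insertions that keeps the z-score at or above `threshold`."""
    k = 0
    # The z-score decreases with every insertion, so a linear scan terminates.
    while z_after_insertion(green, total, k + 1, gamma) >= threshold:
        k += 1
    return k
```

Under these assumptions, a 200-token text with 150 green tokens, `gamma = 0.5`, and threshold 4 tolerates `max_insertable(150, 200, 0.5, 4.0) == 38` insertions, roughly 19% of its length. Raising the green count to 170 (a stronger, more robust signal) raises the tolerance to 73 insertions.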

4.1 Evaluation

Experiment Setup. We assess the effectiveness of our piggyback spoofing attack by using the two editing strategies discussed above. Through toxic token insertion, we study the limits of how many tokens can be inserted into the watermarked content. Using fluent inaccurate editing, we show that piggyback spoofing can generate fluent, watermarked, but inaccurate results at scale. Specifically, for the toxic token insertion, we generate a list of 200 toxic tokens and insert them at random positions in the watermarked output. For the fluent inaccurate editing, we edit the watermarked sentence by querying GPT4 using the prompt “Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.” Unless otherwise specified, in the evaluations of this work, we use 500 prompts from the OpenGen dataset [KSK+23], and query the watermarked language models (LLAMA-2-7B [TMS+23] and OPT-1.3B [ZRG+22]) to generate the watermarked outputs. We evaluate three SOTA watermarks, KGW [KGW+23a], Unigram [ZALW24], and Exp [KTHL23], using the default watermarking hyperparameters. In our experiments, we default to a maximum of 200 new tokens for KGW and Unigram, and 70 for Exp, due to the complexity of its watermark detection. 70 is also the maximum number of tokens the authors of Exp evaluated in their paper [KTHL23].

Figure 1 panels: (a) Toxic token insertion. (b) Fluent inaccurate editing.
Figure 1: Piggyback spoofing of robust watermarks. (a) We can insert a large number of toxic tokens in robustly watermarked text without changing the watermark detection result, resulting in text that is likely to be identified as toxic. (b) We can use GPT4 to automatically modify watermarked text, making it appear inaccurate while retaining fluency.

Evaluation Result. We report the maximum portion of the inserted toxic tokens relative to the original watermarked sentence length on the LLAMA-2-7B model in Fig. 1(a). We also present the confidence of the OpenAI moderation model [Ope23b] in identifying the content as violating their usage policy [Ope23c] due to the inserted toxic tokens in Fig. 1(a). Our findings show that we can insert a significant number of toxic tokens into content generated by all the robust watermarking schemes, with a median portion higher than 20%, i.e., for a 200-token sentence, the attacker can insert a median of 40 toxic tokens into it. These toxic sentences are then identified as violating OpenAI policy rules with high confidence scores, whose median is higher than 0.8 for all the watermarking schemes we study. The average confidence scores for content before attack are around 0.01. The empirical data on the maximum portion of inserted toxic tokens aligns with our analysis in Appendix B. We further validate this analysis in Fig. 5 of Appendix C, showing that attackers can insert nontrivial portions of toxic tokens into the watermarked text to launch piggyback spoofing attacks. Notably, the more robust the watermark is, the more tokens can effectively be inserted. We present the results on OPT-1.3B in Appendix E.

In Fig. 1(b), we report the PPL and watermark detection scores of the piggyback results on KGW and LLAMA-2-7B by the fluent inaccurate editing strategy. We show that we can successfully generate fluent results, with a slightly higher PPL. 94.17% of the piggyback results have a z-score higher than the default threshold of 4. We randomly sample 100 piggyback results and manually check that most of them (92%) are fluent and have inaccurate or opposite content from the original watermarked content. See concrete examples in Appendix D. The results show that we can generate watermarked, fluent, but inaccurate content at scale with an attack success rate (ASR) higher than 90%.

4.2 Discussion

Guideline #1 Robust watermarks are inherently vulnerable to piggyback spoofing attacks. To mitigate piggyback spoofing attacks, watermark designers may need to compromise on robustness to removal attacks.

Our results highlight that piggyback spoofing attacks are easy to execute in practice. LLM watermarks typically do not consider such attacks during design and deployment, and existing robust watermarks are inherently vulnerable to such attacks. We highlight the contradiction between the watermark robustness and the piggyback spoofing feasibility. We consider this attack to be challenging to defend against, especially considering examples such as those in Table 1 and Appendix D, where by only editing a single token, the entire content becomes incorrect. It is hard, if not impossible, to detect whether a particular token is from the attacker by using robust watermark detection algorithms. Thus, practitioners should weigh the risks of removal vs. piggyback spoofing attacks for the model at hand. A feasible strategy to mitigate spoofing attacks is to require digital signatures on the LLM-generated content. However, while an attacker without access to the private key cannot spoof, it is worth noting that this strategy is still vulnerable to watermark-removal attacks, as a single edit can invalidate the original signature.

5 Attacking Stealing-Resistant Watermarks

As discussed in Sec. 2, many works have explored the possibility of launching watermark stealing attacks to infer the secret pattern of the watermark, which can then enable spoofing and removal attacks [SKB+23, JSV24, GLLH23]. A natural and effective defense against watermark stealing is using multiple watermark keys during embedding, which is a common practice in cryptography and also suggested by prior watermarks and work in watermark stealing [KGW+23a, KTHL23, JSV24]. Unfortunately, we demonstrate that using multiple keys can in turn introduce new watermark-removal attacks.

In particular, SOTA watermarking schemes [KGW+23a, FGJ+23, CGZ23, KTHL23, ZALW24, KGW+23b] aim to ensure the watermarked text retains its high quality and the private watermark patterns are not easily distinguished by maintaining an “unbiasedness” property:

$$\mathbb{E}_{sk \in \mathcal{S}}\left(M_{\text{wm}}(\textbf{x}, sk)\right) \approx_{\epsilon} M_{\text{orig}}(\textbf{x}), \tag{4}$$

i.e., the expected distribution of watermarked output over the watermark key space $sk \in \mathcal{S}$ is close to the output distribution without a watermark, differing by a distance of $\epsilon$. Exp [KTHL23] is rigorously unbiased, and KGW [KGW+23a] and Unigram [ZALW24] slightly shift the watermarked distributions.

The insight of our proposed watermark-removal attack is that given the “unbiasedness” nature of watermarks and considering multiple keys may be used during watermark embedding, malicious users can estimate the output distribution without any watermark by querying the watermarked LLM multiple times using the same prompt. As this attack estimates the original, unwatermarked distribution, the quality of the generated content is preserved.

Attack Procedure. An attacker queries a watermarked model with an input $\textbf{x}$ multiple times, observing $n$ subsequent tokens $\textbf{x}_{t+1}$. This is easy for text completion model APIs, and chat model APIs can also be easily attacked by constructing a prompt to ask the chat model to complete a partial sentence without any prefix. The attacker then creates a frequency histogram of these tokens and samples according to the frequency. This sampled token matches the result of sampling on an unwatermarked output distribution with a nontrivial probability. Consequently, the attacker can progressively eliminate watermarks while maintaining a high quality of the synthesized content. We present a formal analysis of the number of required queries in Appendix F.
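The histogram step can be illustrated on a toy service. The sketch below is a simulation under stated assumptions, not our attack code or the Exp scheme of [KTHL23]: `watermarked_next_token` is a hypothetical keyed sampler built on the exponential-minimum (Gumbel-style) trick, which is deterministic for a fixed key and prompt but distributed as the unwatermarked `probs` over random keys; `make_service` is a hypothetical API that picks one key per query.

```python
import hashlib
import random
from collections import Counter


def watermarked_next_token(probs, key, prompt):
    """Toy keyed sampler: fixed output per (key, prompt), distributed as `probs` over random keys."""
    seed = int.from_bytes(hashlib.sha256(f"{key}|{prompt}".encode()).digest(), "big")
    rng = random.Random(seed)
    uniforms = [rng.random() for _ in probs]
    # Exponential-minimum trick: argmax of u_i ** (1 / p_i) selects token i with probability p_i.
    return max(range(len(probs)), key=lambda i: uniforms[i] ** (1.0 / probs[i]))


def make_service(probs, keys, seed=0):
    """Hypothetical watermarked API that selects one of `keys` at random for each query."""
    rng = random.Random(seed)
    return lambda prompt: watermarked_next_token(probs, rng.choice(keys), prompt)


def attack_histogram(query, prompt, n_queries, vocab_size):
    """Attacker side: repeat the same prompt and tabulate next-token frequencies."""
    counts = Counter(query(prompt) for _ in range(n_queries))
    return [counts[t] / n_queries for t in range(vocab_size)]


def total_variation(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

With the histogram in hand, the attacker draws the next token with `random.choices(range(vocab_size), weights=hist)`, appends it to the prompt, and repeats. In this toy model a single key yields a point-mass histogram that is far from `probs`, while a large key set yields a histogram close to `probs`, mirroring the trade-off discussed in this section.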

5.1 Evaluation

Experiment Setup. Our watermark, model, and dataset settings are the same as in Sec. 4.1. We study the trade-off between resistance against watermark stealing and watermark-removal attacks by evaluating a recent watermark stealing attack [NKIH23]. In this attack, we query the watermarked LLM to obtain 2.2 million tokens in total to estimate the watermark pattern and then launch spoofing attacks using the estimated watermark pattern. We follow their assumption that the attacker can access the unwatermarked tokens’ distribution. In our watermark removal attack, we consider that the attacker has observations with different keys. We evaluate the detection scores (z-score or p-value) and the output perplexity (PPL, evaluated using GPT3 [OWJ+22]). The detection algorithm returns the maximum detection score across all the keys, which raises the expected detection score of unwatermarked text. Thus, we set the detection thresholds for different numbers of keys to keep the false positive rates (FPR) below 1e-3 and report the attack success rates (ASR). We use default watermark hyperparameters.
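One way to see why the threshold must grow with the number of keys is the following sketch. It is not the calibration procedure used in our experiments; it assumes that, for unwatermarked text, the per-key z-scores are independent standard normal variables, and the function names are hypothetical.

```python
from statistics import NormalDist


def per_key_threshold(n_keys: int, target_fpr: float = 1e-3) -> float:
    """z-score threshold such that the maximum over `n_keys` null scores exceeds it with probability `target_fpr`.

    Assumption: null scores are independent N(0, 1), so
    P(max >= T) = 1 - Phi(T) ** n_keys.
    """
    per_key_fpr = 1.0 - (1.0 - target_fpr) ** (1.0 / n_keys)
    return NormalDist().inv_cdf(1.0 - per_key_fpr)


def detect(z_scores: list[float], target_fpr: float = 1e-3) -> bool:
    """Flag text as watermarked if the best-matching key clears the calibrated threshold."""
    return max(z_scores) >= per_key_threshold(len(z_scores), target_fpr)
```

Under this assumption the threshold rises with the number of keys, so each additional key makes genuinely watermarked text slightly harder to detect at the same FPR.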

Evaluation Result. As shown in Fig. 2(a), using multiple keys can effectively defend against watermark stealing attacks. With a single key, the ASR is 91%, which matches the results reported in [JSV24]. We observe that using three keys can effectively reduce the ASR to 13%, and using more than 7 keys, the ASR of the watermark stealing is close to zero. However, using more keys also makes the system vulnerable to our watermark-removal attacks as shown in Fig. 2(b). When we use more than 7 keys, the detection scores of the content produced by our watermark removal attacks closely resemble those of unwatermarked content and are much lower than the detection thresholds, with ASRs higher than 97%. Fig. 2(c) suggests that using more keys improves the quality of the output content. This is because, with a greater number of keys, there is a higher probability for an attacker to accurately estimate the unwatermarked distribution, which is consistent with our analysis in Appendix F. We observe that in practice, 7 keys suffice to produce high-quality content comparable to the unwatermarked content. These observations remain consistent across various watermarking schemes and models; for additional results see Appendix I.

Figure 2: Spoofing attack based on watermark stealing [NKIH23] and watermark-removal attacks on the KGW watermark and LLAMA-2-7B model with different numbers of watermark keys $n$. (a) Z-score and attack success rate (ASR) of watermark stealing [NKIH23]. (b) Z-score and ASR of watermark-removal. (c) Perplexity (PPL) of watermark-removal. A higher z-score reflects more confidence in watermarking, and lower perplexity indicates better sentence quality. The attack success rates are based on the threshold with FPR@1e-3.

5.2 Discussion

Guideline #2 Using a larger number of watermarking keys can defend against watermark stealing attacks, but increases vulnerability to watermark-removal attacks. Limiting users’ query rates can help to mitigate both attacks.

Many prior works have suggested using multiple keys to defend against watermark stealing attacks. However, in this study, we reveal that a conflict exists between improving resistance to watermark stealing and limiting the feasibility of removing watermarks. Our evaluation results show that finding a "sweet spot" in the number of keys that mitigates both the watermark stealing and the watermark-removal attacks is not trivial. For example, our watermark-removal attack achieves an ASR of 36.2% with just three keys, while the corresponding ASR of watermark stealing-based spoofing is 13.0%. Using more keys can decrease the ASR of watermark stealing-based spoofing, but at the cost of making the system more vulnerable to watermark removal, and vice versa. We note that the ASRs with three keys are not negligible, so limiting the abilities of potentially malicious users is necessary in practice to mitigate these attacks. As a practical defense, we evaluate watermark stealing with various query limits on the watermarked LLM, and find that the ASR can be significantly reduced by limiting the attacker's query rate. Detailed results can be found in Appendix I. Given the trade-off that exists, we suggest that LLM service providers consider "defense-in-depth" techniques such as anomaly detection, query rate limiting, and user identification verification.

6 Attacking Watermark Detection APIs

It is still an open question whether watermark detection APIs should be made publicly available to users. Although this makes it easier to detect watermarked text, it is commonly acknowledged that it will make the system vulnerable to attacks [Aar23]. Here, we study this statement more precisely by examining the specific risk trade-offs that exist, as well as introducing a novel defense that may make the public detection API more feasible in practice. In the following sections, we first introduce attacks that exploit the APIs and then propose suggestions and defenses to mitigate these attacks.

6.1 Attack Procedures

Watermark-Removal Attack. For the watermark-removal attack, we consider an attacker who has access to the target watermarked LLM's API and can query the watermark detection results. The attacker feeds a prompt into the watermarked LLM, which generates the response in an auto-regressive manner. For each token $\mathbf{x}_i$, the attacker generates a list of possible replacements for $\mathbf{x}_i$. This list can be generated by querying the watermarked LLM, querying a local model, or simply returned by the watermarked LLM. In this work, we choose the third approach because of its simplicity and because it guarantees the quality of the synthesized sentences. This is a common assumption made by prior works [NKIH23], and such an API is also provided by OpenAI (top_logprobs=5), which can benefit normal users in understanding the model's confidence, debugging and analyzing the model's behavior, customizing sampling strategies, etc. Consider that the top $L=5$ tokens and their probabilities are returned to the attacker. The probability that the attacker can find an unwatermarked token in the candidate list of length $L$ is $1-\gamma^{L}$ for KGW and Unigram, which is sufficiently large (approximately 0.97) given $L=5$ and $\gamma=0.5$. The attacker queries the detection API with these replacements and samples a token based on their probabilities and detection scores, removing the watermark while preserving high output quality. See Alg. 1 in Appendix H for the detailed algorithm.
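The candidate-list probability and a single replacement step can be sketched as follows. This is an illustrative sketch only, not Alg. 1 from Appendix H: `detection_score` is a hypothetical stand-in for a query to the detection API, and the weighting rule (model probability times an exponential penalty on the detection score) is an assumption made for the example.

```python
import math
import random

def prob_unwatermarked_candidate(gamma: float, list_len: int) -> float:
    """1 - gamma^L: chance that at least one of L candidates is not green,
    treating each candidate as green independently with probability gamma."""
    return 1.0 - gamma ** list_len

def choose_replacement(prefix, candidates, probs, detection_score, rng,
                       temperature: float = 1.0):
    """Sample one candidate, favoring high model probability and a low
    detection score (assumed weighting; one API query per candidate)."""
    weights = []
    for token, prob in zip(candidates, probs):
        score = detection_score(prefix + [token])  # hypothetical API call
        weights.append(prob * math.exp(-score / temperature))
    return rng.choices(candidates, weights=weights, k=1)[0]

print(prob_unwatermarked_candidate(0.5, 5))  # 0.96875

# Toy detector: the score grows with the number of "green" tokens.
green = {"a"}
toy_score = lambda tokens: 50.0 * sum(t in green for t in tokens)
print(choose_replacement([], ["a", "b"], [0.5, 0.5], toy_score, random.Random(0)))
```

In the toy run the red token "b" is selected almost surely, because the green candidate's weight is suppressed by its high detection score.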

Spoofing Attack. Spoofing attacks follow a similar procedure, in which the attacker can generate (harmful) content using a local model. When sampling the tokens, instead of selecting those that yield low confidence scores as in removal attacks, the attacker chooses tokens that yield higher confidence scores upon watermark detection queries. Thanks to the robustness of LLM watermarks, attackers do not need to ensure that every single token carries a watermark; they only need the overall detection confidence score to surpass the threshold, so that the synthesized content is treated as if it were generated by the watermarked LLM. Please refer to Alg. 2 in Appendix H for the detailed algorithm.
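As a rough illustration of how few tokens need to carry the watermark, the z-score formula used by KGW and Unigram (Appendix A) can be inverted to give the minimum number of green tokens required to cross a threshold. The sequence length of 200 below is an arbitrary choice for the example, and the helper names are hypothetical.

```python
import math

def z_score(green: int, length: int, gamma: float) -> float:
    """z = (g - gamma * l) / sqrt(gamma * (1 - gamma) * l)."""
    return (green - gamma * length) / math.sqrt(gamma * (1 - gamma) * length)

def min_green_tokens(length: int, gamma: float, threshold: float) -> int:
    """Smallest green-token count whose z-score reaches the threshold."""
    return math.ceil(gamma * length
                     + threshold * math.sqrt(gamma * (1 - gamma) * length))

g = min_green_tokens(200, 0.5, 4.0)
print(g, g / 200)  # 129 0.645
```

With $\gamma=0.5$ and threshold 4, about 64.5% of 200 tokens being green is enough, compared with the 50% expected by chance.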

Figure 3: Attacks exploiting detection APIs on LLAMA-2-7B model. (a) Z-score/p-value of watermark-removal. (b) Perplexity of watermark-removal. (c) Z-score/p-value of spoofing.

6.2 Evaluation

Experiment Setup.  We use the same evaluation setup as in Sec. 4.1 and Sec. 5.1. We evaluate the detection scores for both the watermark-removal and the spoofing attacks. We also report the number of queries to the detection API. Furthermore, for the watermark-removal attack, where the attackers care more about the output quality, we report the output PPL. For spoofing attacks, the attackers’ local models are LLAMA-2-7B and OPT-1.3B.

Evaluation Result.  As shown in Fig. 3(a) and Fig. 3(b), watermark-removal attacks exploiting the detection API significantly reduce detection confidence while maintaining high output quality. For instance, for the KGW watermark on the LLAMA-2-7B model, we achieve a median z-score of 1.43, which is much lower than the threshold of 4. The PPL is also close to that of the watermarked outputs (6.17 vs. 6.28). We observe that the Exp watermark has higher PPL than the other two watermarks. This is because the Exp watermark is deterministic, while the other watermarks enable random sampling during inference. Our attack also employs sampling based on the token probabilities and detection scores, and thus we can improve the output quality for the Exp watermark.

            wm-removal           spoofing
            ASR     #queries     ASR     #queries
KGW         1.00    2.42         0.98    2.95
Unigram     0.96    2.66         0.98    2.96
Exp         0.96    1.55         0.85    2.89
Table 2: The attack success rate (ASR) and the average number of queries per token for the watermark-removal and spoofing attacks exploiting the detection API on the LLAMA-2-7B model.

The spoofing attacks also significantly boost the detection confidence even though the content is not from the watermarked LLM, as depicted in Fig. 3(c). We report the attack success rate (ASR) and the number of queries for both attacks in Table 2. The ASR quantifies the fraction of generated content that surpasses the detection threshold (for spoofing) or falls short of it (for watermark removal). These attacks use a reasonable number of queries to the detection API and achieve high success rates, demonstrating practical feasibility. We observe consistent results on OPT-1.3B; please see Appendix J.

6.3 Defending Detection with Differential Privacy

In light of the issues above, we propose an effective defense using ideas from differential privacy (DP) [DR+14] to counteract detection API based spoofing attacks. DP adds random noise to function results evaluated on a private dataset such that the results from neighbouring datasets are indistinguishable. Similarly, we consider adding Gaussian noise to the distance score in the watermark detection, making the detection $(\epsilon,\delta)$-DP [DR+14] and ensuring that attackers cannot tell the difference between two queries that differ in a single token, thus making the attacks harder to launch. Because an attacker could average multiple query results to reduce the noise and estimate the original scores, we propose to compute the noise from a random seed generated by a pseudorandom function (PRF) with the sentence to be detected as the input. Specifically, $\mathtt{seed}=\mathtt{PRF}_{sk}(\mathbf{x})$, where $sk$ is the secret key held by the detection service. Users without the secret key cannot reverse or reduce the noise in the detection score. Thus, we can prevent noise reduction via averaging multiple query results without compromising the utility or protection of the DP defense. In the following, we evaluate the utility of the DP defense and its performance in mitigating the spoofing attacks.
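The seeded-noise idea can be sketched as below. This is a minimal illustration under stated assumptions: HMAC-SHA256 is used as the PRF (no specific PRF is named in the text), `noisy_score` is a hypothetical helper, and Python's `random.Random` is used for the Gaussian draw only for brevity; a deployment would likely derive the noise from a cryptographically secure generator.

```python
import hashlib
import hmac
import random

def noisy_score(score: float, text: str, secret_key: bytes,
                sigma: float = 4.0) -> float:
    """Return score + Gaussian noise whose seed is PRF_sk(text).

    The same text always receives the same noise, so repeating a query
    and averaging the answers does not reduce the noise.
    """
    # HMAC-SHA256 stands in for the PRF (an assumption for this sketch).
    digest = hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).digest()
    seed = int.from_bytes(digest, "big")
    noise = random.Random(seed).gauss(0.0, sigma)
    return score + noise

key = b"detection-service-secret"  # hypothetical key
first = noisy_score(5.0, "some candidate text", key)
second = noisy_score(5.0, "some candidate text", key)
print(first == second)  # True
```

Because the noise is a deterministic function of the key and the queried text, an attacker learns nothing new from re-submitting the same text, while changing even one token yields a fresh noise draw.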

Figure 4: Evaluation of DP detection on the KGW watermark and LLAMA-2-7B model. (a) Spoofing attack success rate (ASR) and detection accuracy (ACC) without and with DP watermark detection under different noise parameters. (b) Z-scores of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).

Experiment Setup. First, we assess the utility of the DP defense by evaluating the accuracy of the detection under various noise scales. Next, we evaluate the efficacy of spoofing against the DP detection defense using the same method as in Sec. 6.1. We select the noise scale that provides the best defense while keeping the drop in accuracy within 2%.

Evaluation Result. As shown in Fig. 4(a), with a noise scale of $\sigma=4$, the DP detection's accuracy drops from the original 98.2% to 97.2% on KGW and LLAMA-2-7B, while the spoofing ASR becomes 0% using the same attack procedure as in Sec. 6.1. The results are consistent for the Unigram and Exp watermarks and the OPT-1.3B model, as shown in Appendix K. This illustrates that the DP defense offers a favorable utility-defense trade-off: the accuracy drop is negligible, while the spoofing attacks are significantly mitigated.

6.4 Discussion

Guideline #3 Public detection APIs can enable both spoofing and removal attacks. To defend against these attacks, we propose a DP-inspired defense, which combined with techniques such as anomaly detection, query rate limiting, and user identification verification can help to make public detection more feasible in practice.

The detection API, available to the public, aids users in differentiating between AI-generated and human-created materials. However, it can be exploited by attackers to gradually remove watermarks or launch spoofing attacks. We propose a defense utilizing ideas from differential privacy, which significantly increases the difficulty of spoofing attacks. However, this method is less effective against watermark-removal attacks that exploit the detection API, because the attackers' actions become close to random sampling, which, albeit with a lower success rate, remains an effective way of removing watermarks. Therefore, we leave developing a more powerful defense against watermark-removal attacks exploiting the detection API as future work. We recommend that companies providing detection services detect and curb malicious behavior by limiting query rates from potential attackers, and also verify the identity of users to protect against Sybil attacks.

7 Conclusion

In this work, we reveal new attack vectors that exploit common features and design choices of LLM watermarks. In particular, while these design choices may enhance robustness, resistance against watermark stealing attacks, and public detection ease, they also allow malicious actors to launch attacks that can easily remove the watermark or damage the model’s reputation. Based on the theoretical and empirical analysis of our attacks, we suggest guidelines for designing and deploying LLM watermarks along with possible defenses to establish more reliable LLM watermark systems.

Our work studies the security implications of common LLM watermarking design choices. By developing realistic attacks and defenses and a simple set of guidelines for watermarking in practice, we aim for the work to serve as a resource for the development of secure LLM watermarking systems. Of course, by outlining such attacks, there is a risk that our work may in fact increase the prevalence of watermark removal or spoofing attacks performed in practice. We believe that this is nonetheless an important step towards educating the community about potential risks in watermarking systems and ultimately creating more effective defenses for secure LLM watermarking.

References

  • [Aar23] Scott Aaronson. Watermarking of large language models. https://simons.berkeley.edu/talks/scott-aaronson-ut-austin-openai-2023-08-17, 2023.
  • [BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [CGZ23] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194, 2023.
  • [DR+14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [FGJ+23] Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, and Mingyuan Wang. Publicly detectable watermarking for language models. Cryptology ePrint Archive, 2023.
  • [GLLH23] Chenchen Gu, Xiang Lisa Li, Percy Liang, and Tatsunori Hashimoto. On the learnability of watermarks for language models. arXiv preprint arXiv:2312.04469, 2023.
  • [Gum48] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1948.
  • [HCW+23] Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. arXiv preprint arXiv:2310.10669, 2023.
  • [IWGZ18] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, 2018.
  • [JSV24] Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024.
  • [KGW+23a] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084. PMLR, 23–29 Jul 2023.
  • [KGW+23b] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023.
  • [KSK+23] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Frederick Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [KTHL23] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593, 2023.
  • [LCW21] Zhe Lin, Yitao Cai, and Xiaojun Wan. Towards document-level paraphrase generation with sentence rewriting and reordering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1033–1044, 2021.
  • [LJSL18] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, 2018.
  • [MLK+23] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • [NKIH23] Ali Naseh, Kalpesh Krishna, Mohit Iyyer, and Amir Houmansadr. Stealing the decoding algorithms of language models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 1835–1849, 2023.
  • [Ope22] OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI blog, https://openai.com/blog/chatgpt, 2022.
  • [Ope23a] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [Ope23b] OpenAI. Openai moderation endpoint. https://platform.openai.com/docs/guides/moderation, 2023.
  • [Ope23c] OpenAI. Openai usage policies. https://openai.com/policies/usage-policies, 2023.
  • [OWJ+22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • [PSF+23] Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Mark my words: Analyzing and evaluating language model watermarks. arXiv preprint arXiv:2312.00273, 2023.
  • [SBC+19] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language models, 2019.
  • [SCS+22] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • [SKB+23] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
  • [TMS+23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [WHZH23] Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. Dipmark: A stealthy, efficient and resilient watermark for large language models. arXiv preprint arXiv:2310.07710, 2023.
  • [WYC+23] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable text watermarking for large language models. arXiv preprint arXiv:2307.15992, 2023.
  • [ZALW24] Xuandong Zhao, Prabhanjan Vijendra Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. In The Twelfth International Conference on Learning Representations, 2024.
  • [ZEF+23] Hanlin Zhang, Benjamin Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for generative models. arXiv preprint arXiv:2311.04378, 2023.
  • [ZRG+22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Appendix A Watermarking Schemes & Hyper-Parameters

In this section, we introduce the three watermarking schemes we evaluate in the paper: KGW [KGW+23a], Unigram [ZALW24], and Exp [KTHL23]. We also introduce perplexity, a metric for evaluating sentence quality.

KGW. In the KGW watermarking scheme, when generating the current token $\mathbf{x}_{t+1}$, all the tokens in the vocabulary are pseudorandomly shuffled and split into two lists: the green list and the red list. The random seed used to determine the green and red lists is computed from a watermark secret key $sk$ and the prior $h$ tokens $\mathbf{x}_{t-h-1}\|\cdots\|\mathbf{x}_{t}$ using pseudorandom functions (PRFs):

$$\textsc{seed}=F_{sk}(\mathbf{x}_{t-h-1}\|\cdots\|\mathbf{x}_{t}),$$

where $h$ is the context width of the watermark. We note that the choice of $h$ has a minor influence on our attacks or defenses, as our algorithms are not dependent on $h$. Here we use their original algorithm with $h=1$. Then, the seed is used to split the vocabulary into the green and red lists of tokens, with a $\gamma$ portion of tokens in the green list:

$$L_{\text{green}},L_{\text{red}}=\text{Shuffle}(\mathcal{V},\textsc{seed},\gamma)$$

Then, KGW generates a binary watermark mask vector for the current token prediction, which has the same size as the vocabulary. All the tokens in the green list $L_{\text{green}}$ have value $1$ in the mask, and all the tokens in the red list have value $0$ in the mask:

$$\textsc{mask}=\text{GenerateMask}(L_{\text{green}},L_{\text{red}})$$

To embed the watermark, KGW adds a constant to the logits of the LLM's prediction for token $\mathbf{x}_{t+1}$:

$$\textsc{WatermarkedProb}=\text{Softmax}(\text{logits}+\delta\times\textsc{mask}),$$

where the logits are from the LLM and $\delta$ is the watermark strength. The LLM then samples the token $\mathbf{x}_{t+1}$ according to the watermarked probability distribution.

The detection involves computing the z-score:

$$z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},$$

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. Similar to the watermark embedding, the green and red lists for each token position are determined by the watermark secret key and the token prior to the current token in the input token sequence.
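The embedding and detection steps described for KGW can be sketched end to end on a toy vocabulary. This is a minimal sketch under stated assumptions, not the reference implementation: HMAC-SHA256 stands in for the PRF $F_{sk}$, the "model" is a hypothetical function returning uniform logits, the context width is $h=1$, and all function names are illustrative.

```python
import hashlib
import hmac
import math
import random

def green_list(secret_key, context, vocab_size, gamma):
    """Shuffle the vocabulary with a seed derived from the key and context;
    the first gamma fraction of the shuffled ids forms the green list."""
    msg = ",".join(str(tok) for tok in context).encode()
    # HMAC-SHA256 stands in for the PRF F_sk (an assumption).
    seed = int.from_bytes(hmac.new(secret_key, msg, hashlib.sha256).digest(), "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermarked_probs(logits, green, delta):
    """Softmax(logits + delta * mask), with mask = 1 on green tokens."""
    shifted = [x + (delta if i in green else 0.0) for i, x in enumerate(logits)]
    peak = max(shifted)
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def generate(secret_key, logits_fn, prompt, num_tokens, vocab_size,
             gamma=0.5, delta=2.0, seed=0):
    """Sample tokens one at a time from the watermarked distribution (h = 1)."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(num_tokens):
        green = green_list(secret_key, tokens[-1:], vocab_size, gamma)
        probs = watermarked_probs(logits_fn(tokens), green, delta)
        tokens.append(rng.choices(range(vocab_size), weights=probs, k=1)[0])
    return tokens

def z_score(secret_key, tokens, vocab_size, gamma=0.5):
    """z = (g - gamma * l) / sqrt(gamma * (1 - gamma) * l) over scored tokens."""
    green_count = sum(
        tok in green_list(secret_key, [prev], vocab_size, gamma)
        for prev, tok in zip(tokens, tokens[1:])
    )
    length = len(tokens) - 1  # the first token has no preceding context
    return (green_count - gamma * length) / math.sqrt(gamma * (1 - gamma) * length)

toy_logits = lambda tokens: [0.0] * 50  # hypothetical stand-in for an LLM
text = generate(b"secret", toy_logits, prompt=[0], num_tokens=200, vocab_size=50)
print(z_score(b"secret", text, 50))
```

With uniform toy logits, $\delta=2$, and $\gamma=0.5$, each sampled token is green with probability $e^{2}/(e^{2}+1)\approx 0.88$, so the z-score of the generated sequence lies well above the threshold of 4.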

Unigram. Similar to KGW, Unigram also splits the vocabulary into green and red lists and prioritizes the tokens in the green list by adding a constant to the logits before computing the softmax. The difference is that Unigram uses global red and green lists instead of computing the green and red lists for each token. That is, the seed used to shuffle the list is determined only by the watermark secret key and is generated by a pseudorandom generator (PRG):

$$\textsc{seed}=G(sk)$$

Then, similar to KGW, the seed is used to split the vocabulary into the green and red lists of tokens, with a $\gamma$ portion of tokens in the green list:

$$L_{\text{green}},L_{\text{red}}=\text{Shuffle}(\mathcal{V},\textsc{seed},\gamma)$$

The watermark embedding and detection procedures are the same as in KGW. Unigram first computes the watermark mask:

$$\textsc{mask}=\text{GenerateMask}(L_{\text{green}},L_{\text{red}})$$

It then embeds the watermark by perturbing the logits of the LLM outputs:

$$\textsc{WatermarkedProb}=\text{Softmax}(\text{logits}+\delta\times\textsc{mask}),$$

where the logits are from the LLM and $\delta$ is the watermark strength. The LLM then samples the token $\mathbf{x}_{t+1}$ according to the watermarked probability distribution.

The detection also computes the z-score:

$$z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},$$

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. According to the analysis in [ZALW24], and consistent with our results in Sec. 4.1, by decoupling the green and red list split from the prior tokens, Unigram is twice as robust as KGW. However, it is more likely to leak the pattern of the watermarked tokens, given that it uses a global green-red list split.
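One informal way to see where a factor of two can come from (an illustration only, not the analysis in [ZALW24]) is to count how many green/red decisions a single token edit can influence. With KGW and $h=1$, an edited token affects its own decision and that of the next token; with a global list, only its own. The sketch below assumes HMAC-SHA256 as a stand-in for both the PRF and the PRG, and all names are hypothetical.

```python
import hashlib
import hmac
import random

def green_ids(secret_key: bytes, context: bytes, vocab_size: int, gamma: float) -> set:
    """Green list from a shuffle seeded by HMAC(key, context) (assumed PRF)."""
    seed = int.from_bytes(hmac.new(secret_key, context, hashlib.sha256).digest(), "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def kgw_flags(key, tokens, vocab_size, gamma=0.5):
    """KGW with h = 1: the flag of tokens[t] depends on tokens[t - 1] too."""
    return [tokens[t] in green_ids(key, str(tokens[t - 1]).encode(), vocab_size, gamma)
            for t in range(1, len(tokens))]

def unigram_flags(key, tokens, vocab_size, gamma=0.5):
    """Unigram: a single global green list derived from the key alone."""
    green = green_ids(key, b"", vocab_size, gamma)
    return [tok in green for tok in tokens[1:]]

def changed_positions(before, after):
    """Token positions whose green/red flag differs (flag i is position i + 1)."""
    return [i + 1 for i, (a, b) in enumerate(zip(before, after)) if a != b]

rng = random.Random(1)
vocab = 100
tokens = [rng.randrange(vocab) for _ in range(30)]
edited = list(tokens)
j = 10
edited[j] = (tokens[j] + 1) % vocab  # edit exactly one token

key = b"secret"
print("KGW:", changed_positions(kgw_flags(key, tokens, vocab),
                                kgw_flags(key, edited, vocab)))      # subset of [10, 11]
print("Unigram:", changed_positions(unigram_flags(key, tokens, vocab),
                                    unigram_flags(key, edited, vocab)))  # subset of [10]
```

Whether a flag actually flips is random, but the set of positions that can flip is fixed: two per edit for KGW with $h=1$, one per edit for Unigram.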

Exp. The Exp watermarking scheme from [KTHL23] is an extension of [Aar23]. Instead of using a single key as in KGW and Unigram, the usage of multiple watermark keys is inherent in Exp to provide the distortion-free guarantee. Each key is a vector of size $|\mathcal{V}|$ with values uniformly distributed in $[0,1]$. That is, $sk=\xi_{1},\xi_{2},\cdots,\xi_{n}$, where $\xi_{k}\in[0,1]^{|\mathcal{V}|}$, $k\in[n]$, and $n$ is the length of the watermark keys, which defaults to $256$.

For the prediction of the token $\mathbf{x}_{t+1}$, Exp first collects the output probability vector $\mathbf{p}\in[0,1]^{|\mathcal{V}|}$ from the LLM. A random shift $r\overset{\scriptscriptstyle\$}{\leftarrow}[n]$ is sampled when the prompt is received. Then the token $\mathbf{x}_{t+1}$ is sampled using the Gumbel trick [Gum48]:

$$\mathbf{x}_{t+1}=\arg\max_{i}\;(\xi_{k,i})^{1/\mathbf{p}_{i}},$$

where $k=r+t+1 \bmod n$, i.e., each position uses a different watermark key, which determines the uniform random values used in the Gumbel trick sampling. This method guarantees that the output distribution is distortion-free: its expectation is identical to the distribution without the watermark, given a sufficiently large $n$.

The watermark detection also computes a test statistic. The basic test statistic is:

$$\phi=\sum_{t=1}^{l}-\log(1-\xi_{k,\mathbf{x}_{t}}),$$

where $k=t \bmod n$. Exp then computes the minimum Levenshtein distance using the basic test statistic as a cost (see Sec. 2.4 in [KTHL23]).

Instead of using a single key as in KGW and Unigram, Exp uses multiple keys and incorporates the Gumbel trick to rigorously provide a distortion-free (unbiased) guarantee: its expected output distribution over the key space is identical to the unwatermarked distribution.
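The sampling rule and the basic test statistic can be sketched as follows. This is a simplified illustration under stated assumptions: the key vectors are drawn with Python's `random` module from a hypothetical integer seed, the toy "model" returns a uniform distribution, and detection assumes the shift is known and skips the Levenshtein-based alignment of [KTHL23].

```python
import math
import random

def make_keys(secret_seed: int, n: int, vocab_size: int):
    """n key vectors xi_1..xi_n with entries uniform in [0, 1)."""
    rng = random.Random(secret_seed)
    return [[rng.random() for _ in range(vocab_size)] for _ in range(n)]

def exp_sample(probs, xi_k):
    """argmax_i xi_{k,i}^(1/p_i), computed in log space for stability."""
    best, best_val = None, -math.inf
    for i, (p, u) in enumerate(zip(probs, xi_k)):
        if p <= 0.0 or u <= 0.0:
            continue  # such tokens can never attain the maximum
        val = math.log(u) / p
        if val > best_val:
            best, best_val = i, val
    return best

def generate(keys, probs_fn, num_tokens, shift):
    """Token t (0-based) uses key index (shift + t + 1) mod n."""
    n = len(keys)
    tokens = []
    for t in range(num_tokens):
        tokens.append(exp_sample(probs_fn(tokens), keys[(shift + t + 1) % n]))
    return tokens

def basic_statistic(keys, tokens, shift=0):
    """phi = sum_t -log(1 - xi_{k, x_t}), with the shift assumed known."""
    n = len(keys)
    return sum(-math.log(1.0 - keys[(shift + t + 1) % n][tok])
               for t, tok in enumerate(tokens))

keys = make_keys(secret_seed=7, n=256, vocab_size=50)
uniform = lambda tokens: [1.0 / 50] * 50  # hypothetical stand-in for an LLM
marked = generate(keys, uniform, num_tokens=100, shift=0)
plain_rng = random.Random(3)
plain = [plain_rng.randrange(50) for _ in range(100)]
print(basic_statistic(keys, marked), basic_statistic(keys, plain))
```

Watermarked tokens are those with large key values at their positions, so their statistic is much larger than that of text chosen independently of the keys. Given fixed keys and shift, generation is deterministic, which matches the observation that the Exp watermark does not sample randomly at inference time.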

Sentence Quality. Perplexity (PPL) is one of the most common metrics for evaluating language models. It can also be utilized to measure the quality of the sentences [ZALW24, KGW+23a] based on the oracle of high-quality language models. Formally, PPL returns the following quality score for an input sentence x:

$$\textsc{PPL}(\mathbf{x})=\exp\left\{-\frac{1}{t}\sum_{i=1}^{t}\log\left[\Pr(\mathbf{x}_{i}\mid\mathbf{x}_{0},\cdots,\mathbf{x}_{i-1})\right]\right\} \tag{5}$$

In our evaluation, we use GPT3 [OWJ+22] as the oracle model to evaluate sentence quality.
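Eq. (5) amounts to exponentiating the negative mean log-probability of the tokens. A minimal sketch, assuming the per-token conditional probabilities have already been obtained from the oracle model (the scoring call itself is not shown, and `perplexity` is a hypothetical helper):

```python
import math

def perplexity(token_probs):
    """exp(-(1/t) * sum_i log Pr(x_i | x_0, ..., x_{i-1}))."""
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

# If every token has conditional probability 1/4, the perplexity is about 4.
print(perplexity([0.25] * 8))
```

Lower values mean the oracle model finds the sentence more predictable, which is used here as a proxy for sentence quality.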

Watermark Setups and Hyper-Parameters. For the KGW [KGW+23a] and Unigram [ZALW24] watermarks, we use the default parameters in [ZALW24], where the watermark strength is $\delta=2$ and the green list portion is $\gamma=0.5$. We employ a threshold of $T=4$ for these two watermarks with a single watermark key. For the scenarios where multiple keys are used, we calculate the thresholds to guarantee that the false positive rates (FPRs) are below 1e-3. For the Exp watermark (referred to as Exp-edit in [KTHL23]), we use the default parameters, where the watermark key length is $n=256$ and the block size $k$ defaults to the token length. We set the p-value threshold for Exp to $0.05$ in our experiments. We conduct the experiments on a cluster with 8 NVIDIA A100 GPUs, an AMD EPYC 7763 64-Core CPU, and 1TB of memory.

Appendix B Attack Feasibility Analysis of Piggyback Spoofing Exploiting Robustness

We study the bound on the maximum number of tokens that are allowed to be inserted or edited in a watermarked sentence, and we present the following theorem on Unigram watermark [ZALW24] due to its clean robustness guarantee:

Theorem 1 (Maximum insertion portion).

Consider a watermarked token sequence $\textbf{x}$ of length $l$. The Unigram watermark z-score threshold is $T$, the portion of the tokens in the green list is $\gamma$, the detection z-score of $\textbf{x}$ is $z$, and the number of inserted tokens is $s$. Then, to guarantee the expected z-score of the edited text is greater than $T$, it suffices to guarantee $\frac{s}{l}\leq\frac{z^{2}-T^{2}}{T^{2}}$.

Proof.

Recall that watermark detection usually involves a statistical test. Unigram splits the vocabulary into two lists, the green list and the red list. It prioritizes the tokens in the green list during watermark embedding, and the detection computes the z-score:

$$z=\frac{g-\gamma l}{\sqrt{\gamma(1-\gamma)l}},$$

where $g$ is the number of tokens in the green list, $l$ is the total number of tokens in the input token sequence, and $\gamma$ is the portion of the vocabulary tokens in the green list. Let the number of the inserted toxic tokens be $s$. Since toxic tokens are independent of the secret key $sk$, the expected new z-score $z^{\prime}$ is:

$$\mathbb{E}(z^{\prime})=\frac{g+\gamma s-\gamma(l+s)}{\sqrt{\gamma(1-\gamma)(l+s)}}=z\sqrt{\frac{l}{l+s}}.$$

To guarantee that $\mathbb{E}(z^{\prime})\geq T$, it suffices to guarantee

$$\frac{s}{l}\leq\frac{z^{2}-T^{2}}{T^{2}}.$$
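For completeness, the algebra behind this last step can be spelled out as follows. This is a worked restatement, assuming $z>0$ and $T>0$ so that squaring both sides preserves the inequality (an assumption that is implicit in the statement above):

```latex
\begin{align*}
z\sqrt{\frac{l}{l+s}} \geq T
&\iff z^{2}\,\frac{l}{l+s} \geq T^{2} \\
&\iff l+s \leq \frac{z^{2}}{T^{2}}\,l \\
&\iff \frac{s}{l} \leq \frac{z^{2}-T^{2}}{T^{2}}.
\end{align*}
```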

Different from the analysis in the Unigram paper on how the z-score changes given a specific number of edits, we have a tight bound on the maximum possible number of edits, which is also more straightforward for the attack feasibility analysis. According to Theorem 1, as long as the number of toxic tokens inserted is bounded by $l\frac{z^{2}-T^{2}}{T^{2}}$, the attacker can execute a piggyback attack to generate toxic content with the target watermark embedded. The editing distance bound (Def. 3) for a sentence is $\epsilon=l\frac{z^{2}-T^{2}}{T^{2}}$. A stronger watermark makes piggyback spoofing attacks easier by allowing more toxic tokens to be inserted. This conclusion applies universally to all robust watermarking schemes. This is a fundamental design trade-off: if a watermark is robust, such spoofing attacks are inevitable and may be extremely difficult to detect, as even one toxic token can render the entire content harmful or inaccurate.
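To make the bound concrete, the following sketch evaluates the insertion budget of Theorem 1 for given $l$, $z$, and $T$. The function names and the numbers used are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import math

def max_inserted_tokens(l, z, T):
    """Largest integer s with s / l <= (z**2 - T**2) / T**2 (Theorem 1)."""
    if l <= 0 or T <= 0:
        raise ValueError("l and T must be positive")
    if z < T:
        return 0  # the text is not detected as watermarked to begin with
    return math.floor(l * (z * z - T * T) / (T * T))

def expected_z_after_insertion(l, z, s):
    """Expected z-score after inserting s key-independent tokens."""
    return z * math.sqrt(l / (l + s))

# Illustrative numbers only: a 200-token text with z = 6.0 and threshold T = 4.
l, z, T = 200, 6.0, 4.0
s = max_inserted_tokens(l, z, T)
print(s, expected_z_after_insertion(l, z, s))
```

With these illustrative values the budget exceeds the original length, which reflects how quickly the allowance grows as $z$ moves above $T$.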

Appendix C Validation of Theorem 1

In this section, we validate Theorem 1 by using watermarked texts of varying lengths $l$ and z-scores $z$ to study the relationship between $\frac{s}{l}$ and $\frac{z^{2}-T^{2}}{T^{2}}$ for the Unigram watermark. The results are shown in Fig. 5. As anticipated, 85.78% of the maximum allowable tokens to be inserted into the watermarked content satisfy Theorem 1. Given that this equation analyzes expected $s/l$, a small portion of outliers is reasonable. We primarily visualize this result for Unigram due to its clean robustness guarantee. Other watermarks can also reach similar conclusions, but their bounds on $s$ are either complex [KGW+23a] or lack a closed form [KTHL23], making them difficult to visualize. Our empirical findings in Fig. 1 show that an attacker can insert nontrivial portions of toxic or incorrect tokens into the watermarked text to launch the spoofing attack, which can be generalized across all robust watermarking schemes.

Figure 5: The relationship between $s/l$ and $z$. The data points are evaluated on Unigram using LLAMA-2-7B and 500 samples from the OpenGen dataset.
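Separately from the experiment in Fig. 5, the expectation used in Theorem 1 can be sanity-checked with a small synthetic simulation. The sketch below is not the evaluation described above: it assumes each inserted token falls in the green list independently with probability $\gamma$, and all names and parameter values are hypothetical.

```python
import math
import random

def z_score(g, l, gamma):
    """Unigram detection z-score for g green tokens out of l."""
    return (g - gamma * l) / math.sqrt(gamma * (1 - gamma) * l)

def mean_z_after_insertion(g, l, s, gamma, trials, seed=0):
    """Average z-score over random insertions of s key-independent tokens."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each inserted token is independent of the key, so it lands in the
        # green list with probability gamma.
        extra_green = sum(rng.random() < gamma for _ in range(s))
        total += z_score(g + extra_green, l + s, gamma)
    return total / trials

gamma, l, g, s = 0.5, 200, 140, 50  # illustrative values, not from the paper
z = z_score(g, l, gamma)
simulated = mean_z_after_insertion(g, l, s, gamma, trials=5000)
predicted = z * math.sqrt(l / (l + s))
print(z, simulated, predicted)
```

Individual trials scatter around the predicted value, which is consistent with the remark that a small portion of outliers is expected.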

Appendix D Piggyback Attack Examples

Here we present more piggyback attack results using the editing strategy, querying GPT4 with the prompt “Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.” The attack is launched on the KGW watermark and the LLAMA-2-7B model.

Watermarked content, z-score: 4.93, PPL: 4.61
Earth has a history of 4.5 billion years and humans have been around for 200,000 years. Yet humans have been using computers for just over 70 years and even then the term was first used in 1945. In the age of technology, we are still just getting started. The first computer, ENIAC (Electronic Numerical Integrator And Calculator), was built at the University of Pennsylvania between 1943 and 1946. The ENIAC took up 1800 sq ft and had 18,000 vacuum tube and mechanical parts. The ENIAC was used for mathematical calculations, ballistics, and code breaking. The ENIAC was 1000 times faster than any other calculator of the time. The first computer to run a program was the Z3, built by Konrad Zuse at his house.
Piggyback attack, z-score: 4.36, PPL: 5.68
Earth has a history of 4.5 billion years and humans have been around for 200,000 years. Yet humans have been using computers for just over 700 years and even then the term was first used in 1445. In the age of technology, we are still just getting started. The first computer, ENIAC (Electronic Numerical Integrator And Calculator), was built at the University of Pennsylvania between 1943 and 1946. The ENIAC took up 1800 sq ft and had 18,000 vacuum tube and mechanical parts. The ENIAC was used for mathematical calculations, ballistics, and code breaking. The ENIAC was 1000 times slower than any other calculator of the time. The first computer to run a program was the Z3, built by Konrad Zuse at his house.

Watermarked content, z-score: 4.57, PPL: 3.51
Harry Potter is a 2001 film adaptation of J.K. Rowling’s best-selling novels. In the year 1991, Harry Potter is rescued from his aunt and uncle’s home and taken to Hogwarts, a school for magic. During the course of his first year, Harry makes friends and learns about the evil Lord Voldemort and his attempts to kill him. The film was directed by Chris Columbus, and stars Daniel Radcliffe as Harry Potter, with Richard Harris, Maggie Smith, Robbie Coltrane, Alan Rickman, Emma Watson, Rupert Grint, Gary Oldman, and Julie Walters in supporting roles. In the first film, Harry Potter is taken to Hogwarts School of Witchcraft and Wizardry, where he makes friends and discovers that Lord Voldemort wants to kill him.
Piggyback attack, z-score: 4.47, PPL: 3.78
Harry Potter is a 2001 film adaptation of J.K. Rowling’s best-selling novels. In the year 2991, Harry Potter is rescued from his aunt and uncle’s home and taken to Hogwarts, a school for magic. During the course of his first year, Harry makes friends and learns about the evil Lord Voldemort and his attempts to kill him. The film was directed by Chris Columbus, and stars Daniel Radcliffe as Harry Potter, with Richard Harris, Maggie Smith, Robbie Coltrane, Alan Rickman, Emma Watson, Rupert Grint, Gary Oldman, and Julie Walters in supporting roles. In the first film, Harry Potter is taken to Hogwarts School of Witchcraft and Wizardry, where he makes friends and discovers that Lord Voldemort wants to kill him.

Watermarked content, z-score: 6.15, PPL: 4.29
Usain Bolt is a 6’6”, 200-pound sprinter from the island of Jamaica. He’s also arguably the greatest athlete to ever walk the earth. In an era where many of us are glued to our phones, he has over 18 million followers on Instagram. The 31-year-old athlete is known for his incredible speed and endurance. He’s won eight Olympic gold medals, including three in 2012 alone. He’s also set world records in both the 100-meter dash and 200-meter dash. Bolt is the only person to ever hold the world record in both events at the same time, which he did twice. He’s also the only person to ever hold the world record in both events at the same Olympics. Bolt is known for his incredible speed and endurance.
Piggyback attack, z-score: 5.58, PPL: 4.46
Usain Bolt is a 6’6”, 200-pound sprinter from the island of Jamaica. He’s also arguably the greatest athlete to ever walk the earth. In an era where many of us are glued to our phones, he has over 18 million followers on Instagram. The 31-year-old athlete is known for his incredible speed and endurance. He’s won three Olympic gold medals, including three in 2012 alone. He’s also set world records in both the 100-meter dash and 200-meter dash. Bolt is the only person to ever hold the world record in both events at the same time, which he did twice. He’s also the only person to ever hold the world record in both events at the same Olympics. Bolt is known for his incredible speed and endurance.

Watermarked content, z-score: 6.01, PPL: 6.68
The history of the modern airplane is 100 years old this month. And yet it’s not been 100 years since the Wright Brothers’ first flight. The first airplane flight took place on Dec. 17, 1903. After three years of development, Orville and Wilbur Wright’s first flight lasted only 12 seconds. But within a decade, the first airliner flew. In 1924, the Ford Motor Co. flew the first commercial plane on the U.S. East Coast. In the next year, the company built a 10-passenger airliner with passenger windows and seats and an aisle. The 10-seat plane was called the Model T, and Ford executives said it would have been better if the company made a 10-passenger car instead of a plane.
Piggyback attack, z-score: 5.03, PPL: 7.19
The history of the modern airplane is 100 years old this month. And yet it’s not been 100 years since the Wright Brothers’ first flight. The first airplane flight took place on Dec. 17, 1903. After three years of development, Orville and Wilbur Wright’s first flight lasted only 12 seconds. But within a decade, the first airliner flew. In 1924, the Ford Motor Co. never flew the first commercial plane on the U.S. East Coast. In the next year, the company built a 10-passenger airliner with passenger windows and seats and an aisle. The 10-seat plane was called the Model T, and Ford executives said it would have been better if the company made a 10-passenger car instead of a plane.

Appendix E Additional Results of Piggyback Spoofing Attack

In Sec. 4, we present the piggyback spoofing attack using the toxic token insertion strategy on the LLAMA-2-7B model. Here, we present the results on the OPT-1.3B model, which are consistent with those on LLAMA-2-7B.

Figure 6: Piggyback spoofing of robust watermarks with toxic token insertion strategy on OPT-1.3B.

In Sec. 4, we present the fluent inaccurate editing strategy, which queries GPT4, on the KGW watermark and the LLAMA-2-7B model. Here we present more results of this strategy on all three watermarks (KGW, Unigram, and Exp) and two models (LLAMA-2-7B and OPT-1.3B). The results are shown in Fig. 7, Fig. 8, and Fig. 9, which are consistent with our findings in Fig. 1, indicating that our piggyback spoofing attack can be generalized across various robust watermarks and models.

Figure 7: Fluent inaccurate editing strategy on KGW watermark and LLAMA-2-7B and OPT-1.3B models. (a) LLAMA-2-7B model. (b) OPT-1.3B model.
Figure 8: Fluent inaccurate editing strategy on Unigram watermark and LLAMA-2-7B and OPT-1.3B models. (a) LLAMA-2-7B model. (b) OPT-1.3B model.
Figure 9: Fluent inaccurate editing strategy on Exp watermark and LLAMA-2-7B and OPT-1.3B models. (a) LLAMA-2-7B model. (b) OPT-1.3B model.

Appendix F Watermark Key Number Analysis for Watermark-Removal Attacks Exploiting the Use of Multiple Watermark Keys

Now we analyze the number of required queries under different keys to estimate the token with the highest probability without a watermark. We have the following probability bound for KGW and Unigram with the corresponding proof, and present the bound for Exp in Appendix G.

Theorem 2 (Probability bound of unwatermarked token estimation).

Suppose there are $n$ observations under different keys, and the portion of the green list in KGW or Unigram is $\gamma$. Then the probability that the most frequent token is the same as the original unwatermarked token is

$$1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k}\times p(k), \tag{6}$$

where $p(k)=1-\Bigl(\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr)^{c}$ and $c$ is the number of other tokens whose watermarked probability can exceed that of the highest unwatermarked token.

In a practical scenario where $n=13$, $\gamma=0.5$, and $c=3$, Theorem 2 suggests that the attacker has a probability of $0.71$ of finding the token with the highest unwatermarked probability. This implies that we can successfully remove watermarks from over $71\%$ of tokens using a small number of observations under different keys ($n=13$), yielding high-quality unwatermarked content.
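The value quoted for this scenario can be reproduced by evaluating Eq. (6) directly. The sketch below does so; the function names are hypothetical and the code is only a numerical evaluation of the stated formula, not part of the attack itself.

```python
from math import comb

def binom_pmf(n, k, q):
    """Probability of exactly k successes in n Bernoulli(q) trials."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

def p_overtake(n, k, gamma, c):
    """p(k) in Theorem 2: at least one of c competing tokens is in the
    green list for at least k of the remaining n - k keys."""
    below_k = sum(binom_pmf(n - k, m, gamma) for m in range(k))
    return 1 - below_k**c

def recovery_probability(n, gamma, c):
    """Evaluate Eq. (6) for n keys, green list portion gamma, c competitors."""
    failure = sum(
        binom_pmf(n, k, gamma) * p_overtake(n, k, gamma, c)
        for k in range(n // 2 + 1)
    )
    return 1 - failure

# Scenario from the text: n = 13, gamma = 0.5, c = 3 gives about 0.71.
print(round(recovery_probability(13, 0.5, 3), 2))
```

Varying `n` in this function gives a quick view of how the estimate improves as more keys are observed.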

Proof.

Recall that KGW and Unigram randomly split the tokens in the vocabulary into the green list and the red list. We consider greedy sampling, where the token with the highest (watermarked) probability is sampled. We have $n$ independent observations under different watermark keys. For each key, the probability that the token $\textbf{x}_{i}$ with the highest unwatermarked probability is in the green list is $\gamma$. As long as $\textbf{x}_{i}$ is in the green list, greedy sampling will always yield $\textbf{x}_{i}$, since the watermarks add the same constant to the logits of all tokens in the green list.

Thus, the probability that the most frequent token among these $n$ observations is $\textbf{x}_{i}$ is at least:

$$1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k},$$

which is the probability that $\textbf{x}_{i}$ is in the green list for more than half of the $n$ keys.

Consider another token $\textbf{x}_{j}$ whose watermarked probability can exceed that of $\textbf{x}_{i}$ when $\textbf{x}_{j}$ is in the green list and $\textbf{x}_{i}$ is in the red list. If $\textbf{x}_{i}$ is in the green list for $k$ keys, the probability that $\textbf{x}_{j}$ is in the green list for at least $k$ keys among the other $n-k$ keys is:

$$1-\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}.$$

Suppose there are $c$ such tokens with the potential to exceed $\textbf{x}_{i}$. Then the probability that at least one of the $c$ tokens is in the green list for at least $k$ keys among the other $n-k$ keys is:

$$1-\Bigl(\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr)^{c}.$$

Thus, combining the above, if there are $c$ tokens that have the potential to exceed the probability of the token with the highest unwatermarked probability (i.e., $\textbf{x}_{i}$), the probability that the most frequent token among the $n$ observations is the same as $\textbf{x}_{i}$ is:

$$1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}\gamma^{k}(1-\gamma)^{n-k}\times\Biggl(1-\Bigl(\sum_{m=0}^{k-1}\binom{n-k}{m}\gamma^{m}(1-\gamma)^{n-k-m}\Bigr)^{c}\Biggr),$$

which concludes the proof. ∎

Here we assume that the watermarked LLM uses greedy sampling. In practice, greedy sampling might not be the optimal sampling strategy, but it is extremely challenging to incorporate multinomial sampling when analyzing the KGW and Unigram watermarks, because they add a bias to the output logits, which then pass through the softmax function to produce the token probabilities. Given that the softmax function is not unbiased, we cannot get a tight bound on its variance. Thus, we leave incorporating multinomial sampling into the analysis as a future direction. Nevertheless, our empirical results still show that attackers can generate high-quality unwatermarked content when multinomial sampling is used. Also, our analysis of the Exp watermark in Appendix G naturally incorporates multinomial sampling.

Appendix G Probability Bound of Unwatermarked Token Estimation for Exp

In this section, we present and prove the probability bound of unwatermarked token estimation for the Exp watermark [KTHL23].

Theorem 3 (Probability bound of unwatermarked token estimation for Exp).

Suppose there are $n$ observations under different keys, and the highest probability among the unwatermarked tokens is $p$. Then the probability that the most frequent token among the $n$ observations is the same as the original unwatermarked token with the highest probability is:

$$1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}p^{k}(1-p)^{n-k} \tag{7}$$
Proof.

The proof of Theorem 3 is straightforward. As introduced in Appendix A, the Exp watermark employs Gumbel trick sampling [Gum48] when embedding the watermark. Thus, the probability that we observe the token whose original unwatermarked probability is $p$ is exactly $p$ for each of the independent keys. Hence, if we make $n$ observations under different keys, the probability that more than half of them yield the token with the highest original probability $p$ is:

$$1-\sum_{k=0}^{\lfloor n/2\rfloor}\binom{n}{k}p^{k}(1-p)^{n-k},$$

which concludes the proof. ∎
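To get a feel for how Eq. (7) behaves, the sketch below evaluates it for a few values of $p$. The function name and the chosen values of $p$ and $n$ are illustrative assumptions and are not taken from the experiments.

```python
from math import comb

def majority_probability(n, p):
    """Eq. (7): chance that a token drawn with probability p per key
    appears in more than floor(n / 2) of n independent observations."""
    tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))
    return 1 - tail

for p in (0.5, 0.7, 0.9):  # illustrative values of the top-token probability
    print(p, round(majority_probability(13, p), 3))
```

The expression increases with $p$, so tokens that the unwatermarked model is confident about are the easiest to recover by majority vote.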

Appendix H Algorithms of Attacks Exploiting the Detection API

In this section, we provide the detailed algorithm of the attacks exploiting the detection API as we have introduced in Sec. 6. Specifically, we present the algorithm for watermark-removal attack exploiting the detection API in Alg. 1 and the algorithm for spoofing attack exploiting the detection API in Alg. 2.

Algorithm 1 Watermark-removal attack exploiting the detection API.
  Input: Prompt $\textbf{x}_{\text{prompt}}$, watermarked LLM $M_{\text{wm}}$, detection API $f_{\text{detection}}$, maximum output token number $m\geq 2$
  Let $k\leftarrow 5$, $\textbf{x}_{1}\sim M_{\text{wm}}(\textbf{x}_{\text{prompt}})$
  for $t=2$ to $m$ do
     $(\textbf{x}_{t}^{1},\textbf{x}_{t}^{2},\cdots,\textbf{x}_{t}^{k}),(\textbf{p}_{t}^{1},\textbf{p}_{t}^{2},\cdots,\textbf{p}_{t}^{k})\leftarrow M_{\text{wm}}(\textbf{x}_{\text{prompt}}||\textbf{x}_{1}\cdots\textbf{x}_{t-1})$ {The watermarked LLM returns the top $k$ tokens and their corresponding probabilities in descending order.}
     for $i=1$ to $k$ do
        $d_{i}\leftarrow f_{\text{detection}}(\textbf{x}_{1}||\cdots||\textbf{x}_{t-1}||\textbf{x}_{t}^{i})$
     $d_{\text{min}}\leftarrow\min(d_{1},d_{2},\cdots,d_{k})$, $l_{\text{candidate}}\leftarrow\text{empty}$ {Get the detection score with the lowest confidence.}
     for $i=1$ to $k$ do
        if $d_{\text{min}}=d_{i}$ then
           $l_{\text{candidate}}\leftarrow l_{\text{candidate}}||\textbf{x}_{t}^{i}$ {Get all the tokens with the lowest detection confidence.}
     if $\textbf{x}_{t}^{1}\in l_{\text{candidate}}$ then
        $j\leftarrow 1$ {If the token with the highest probability (the first token) is in the list, output that token.}
     else
        $c\leftarrow 1$
        for $\textbf{x}_{t}^{i}\in l_{\text{candidate}}$ do
           $\textbf{p}_{t}^{i}\leftarrow\textbf{p}_{t}^{1}/c$ {Update the probabilities of the tokens that have the lowest detection confidence scores.}
           $c\leftarrow c+1$
        $\textbf{p}_{t}^{1}\leftarrow 0$
        $j\leftarrow\text{Sample}(\textbf{p}_{t}^{1},\cdots,\textbf{p}_{t}^{k})$ {Sample the tokens according to the updated probabilities.}
     $\textbf{x}_{t}\leftarrow\textbf{x}_{t}^{j}$
  Return $\textbf{x}_{1},\textbf{x}_{2},\cdots,\textbf{x}_{m}$
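The control flow of Alg. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `top_k` and `detect` are hypothetical callables standing in for the watermarked LLM and the detection API (they are not real library interfaces), `top_k` is assumed to already include the prompt in its context, and the toy stand-ins at the bottom are invented purely so the sketch runs.

```python
import random

def removal_attack(first_token, top_k, detect, m, k=5, rng=None):
    """Sketch of Alg. 1 with hypothetical top_k(prefix, k) and detect(tokens)."""
    rng = rng or random.Random()
    out = [first_token]                      # x_1, sampled from the watermarked LLM
    for _ in range(2, m + 1):
        cands = top_k(out, k)                # [(token, prob)], descending by prob
        tokens = [tok for tok, _ in cands]
        probs = [pr for _, pr in cands]
        scores = [detect(out + [tok]) for tok in tokens]
        d_min = min(scores)
        lowest = [i for i, d in enumerate(scores) if d == d_min]
        if 0 in lowest:
            j = 0                            # keep the most likely token
        else:
            top_prob = probs[0]
            for c, i in enumerate(lowest, start=1):
                probs[i] = top_prob / c      # boost the lowest-confidence tokens
            probs[0] = 0.0                   # the top token is excluded here
            j = rng.choices(range(len(tokens)), weights=probs)[0]
        out.append(tokens[j])
    return out

# Toy stand-ins (invented for this example): tokens starting with "g"
# count toward the detection score.
def toy_top_k(prefix, k):
    return [("g1", 0.5), ("r1", 0.3), ("g2", 0.2)][:k]

def toy_detect(tokens):
    return sum(tok.startswith("g") for tok in tokens)

print(removal_attack("g0", toy_top_k, toy_detect, m=6, k=3, rng=random.Random(0)))
```

Note that Python lists are 0-indexed, so index 0 in the sketch corresponds to the first (highest-probability) token in the pseudocode.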
Algorithm 2 Spoofing attack exploiting the detection API.
  Input: Prompt $\textbf{x}_{\text{prompt}}$, local LLM $M$, detection API $f_{\text{detection}}$, maximum number of output tokens $m$
  Let $k \leftarrow 3$
  for $t = 1$ to $m$ do
    $(\textbf{x}_t^1, \textbf{x}_t^2, \cdots, \textbf{x}_t^k), (\textbf{p}_t^1, \textbf{p}_t^2, \cdots, \textbf{p}_t^k) \leftarrow M(\textbf{x}_{\text{prompt}} || \textbf{x}_1 \cdots \textbf{x}_{t-1})$ {The local LLM returns the top $k$ tokens and their corresponding probabilities in descending order.}
    for $i = 1$ to $k$ do
       $d_i \leftarrow f_{\text{detection}}(\textbf{x}_1 || \cdots || \textbf{x}_{t-1} || \textbf{x}_t^i)$
    $j \leftarrow \arg\max(d_1, d_2, \cdots, d_k)$ {Get the token resulting in the highest confidence.}
    $\textbf{x}_t \leftarrow \textbf{x}_t^j$
  Return $\textbf{x}_1, \textbf{x}_2, \cdots, \textbf{x}_m$
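
The loop in Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in our experiments: the callables `top_k_fn` and `detect_fn` are hypothetical stand-ins for the local LLM $M$ and the detection API $f_{\text{detection}}$, and the toy model and toy detector at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions for illustration only):
#   top_k_fn(context, k) -> [(token, probability), ...] sorted by probability, descending
#   detect_fn(tokens)    -> detection confidence (higher = more likely watermarked)
TopKFn = Callable[[Sequence[str], int], List[Tuple[str, float]]]
DetectFn = Callable[[Sequence[str]], float]


def spoof_with_detection_api(
    prompt: Sequence[str],
    top_k_fn: TopKFn,
    detect_fn: DetectFn,
    max_tokens: int,
    k: int = 3,
) -> List[str]:
    """Greedily pick, at each step, the top-k candidate that maximizes detection confidence."""
    output: List[str] = []
    for _ in range(max_tokens):
        candidates = top_k_fn(list(prompt) + output, k)
        if not candidates:
            break
        # One detection query per candidate, on the generated tokens only (no prompt).
        scores = [detect_fn(output + [token]) for token, _prob in candidates]
        # max() keeps the first maximum, so ties go to the more probable candidate.
        best = max(range(len(candidates)), key=scores.__getitem__)
        output.append(candidates[best][0])
    return output


# Toy stand-ins: a fixed next-token distribution and a detector that counts "green" tokens.
def toy_top_k(context: Sequence[str], k: int) -> List[Tuple[str, float]]:
    return [("a", 0.5), ("b", 0.3), ("c", 0.2)][:k]


def toy_detect(tokens: Sequence[str]) -> float:
    green = {"b"}
    return float(sum(token in green for token in tokens))


spoofed = spoof_with_detection_api(["prompt"], toy_top_k, toy_detect, max_tokens=4)
print(spoofed)
```

With $k = 3$, this sketch issues three detection queries per generated token, which is in line with the per-token query counts reported for the spoofing attack.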

Appendix I Additional Results of Watermark-Removal Attacks Exploiting the Use of Multiple Watermark Keys

In this section, we provide further evaluation results for the watermark stealing attack [JSV24] and our watermark-removal attacks exploiting the use of multiple watermark keys (see Sec. 5) on all three watermarks (KGW, Unigram, and Exp) and two models (LLAMA-2-7B and OPT-1.3B). The results are shown in Fig. 11, Fig. 12, Fig. 13, Fig. 14, and Fig. 15. For the KGW watermark on OPT-1.3B and the Unigram watermark on both LLAMA-2-7B and OPT-1.3B, our observations are consistent with those for the KGW watermark on LLAMA-2-7B presented in Sec. 5.1, demonstrating the effectiveness and generalizability of our attacks. For the Exp watermark, our results in Fig. 12 and Fig. 15 also show that the watermark can be easily removed by using multiple queries to estimate the distribution of the unwatermarked tokens.

The results of the watermark stealing attack [JSV24] on the Unigram watermark and the OPT-1.3B model are also consistent with our observations in Sec. 5. Using more keys effectively mitigates watermark stealing; however, it makes the system more vulnerable to our watermark-removal attacks. Throughout these experiments, we observe that using three keys is the best choice for defending against both attacks. However, the attack success rates with three keys are not negligible. Thus, consistent with our guidelines in Sec. 5, we strongly recommend that the LLM service provider also limit the capabilities of potentially malicious users.

To further verify that the LLM service provider can mitigate watermark stealing attacks by limiting the attacker’s query rate, we present the stealing attack results with various numbers of queries on the KGW watermark and LLAMA-2-7B model using three keys in Fig. 10. The results show that limiting the attacker’s query rate significantly decreases the success rate of the watermark stealing attack. Thus, we recommend that the LLM service provider follow a “defense-in-depth” approach and utilize complementary techniques such as anomaly detection, query rate limiting, and user identity verification to mitigate stealing and removal attacks.
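
As a rough illustration of query rate limiting, the sketch below caps the total number of generated tokens served to each user, since the stealing attack's success depends on how many tokens the attacker obtains. The class name, its interface, and the budget value are hypothetical; a production system would also need persistence, time windows, and identity checks.

```python
from collections import defaultdict


class TokenBudgetLimiter:
    """Caps the total number of generated tokens served to each user (illustrative only)."""

    def __init__(self, max_tokens_per_user: int) -> None:
        self.max_tokens_per_user = max_tokens_per_user
        self._served = defaultdict(int)  # user_id -> tokens served so far

    def allow(self, user_id: str, requested_tokens: int) -> bool:
        """Approve the request only if it keeps the user within the budget."""
        if self._served[user_id] + requested_tokens > self.max_tokens_per_user:
            return False
        self._served[user_id] += requested_tokens
        return True


limiter = TokenBudgetLimiter(max_tokens_per_user=100)  # hypothetical budget
print(limiter.allow("user-1", 60))  # within budget
print(limiter.allow("user-1", 50))  # would exceed the budget, so it is refused
```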

Figure 10: Watermark stealing attack [JSV24] on the KGW watermark and LLAMA-2-7B model using three keys, with different numbers of attacker-obtained tokens Q (in millions). The attack success rates are based on the threshold with FPR@1e-3.

We note that the watermark stealing attacks do not work on the Exp watermark [KTHL23], as the use of a large number of watermark keys is inherent in its design (the default is $256$). Thus, we omit the watermark stealing results on Exp, but we show that this watermark is inherently vulnerable to our watermark-removal attack. From the results in Fig. 12 and Fig. 15, we conclude that with $n=13$ queries, the resulting p-value is very close to that of content without a watermark and differs significantly from the watermarked p-value, which shows that we can effectively remove the watermark using $13$ queries per token. We note that for Exp, the perplexity of the watermarked content is significantly higher than that of the unwatermarked content. This is mainly because Exp does not allow sampling during watermark embedding: the embedding becomes a deterministic algorithm once the key is fixed. In contrast, our watermark-removal attack generates content with much lower perplexity, comparable to that of unwatermarked content once the number of queries under different keys exceeds $13$. This can be attributed to our attack functioning as a layer of random sampling. Unlike greedy sampling methods, our attack samples the token with the highest unwatermarked probability with some probability (see Sec. 4, Appendix F, and Appendix G). The results on the three watermarks and two models show that the watermark-removal attack exploiting the use of multiple watermark keys can effectively eliminate the watermarks while maintaining high output quality.
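
The idea of estimating the unwatermarked token from multiple queries can be sketched as follows. This is a simplified illustration under stated assumptions: `query_next_token` is a hypothetical interface that returns one next token generated under whichever key the service picks, and taking the most frequent token is one simple way to use the estimate (the selection rule in Sec. 5 may differ). The toy service uses made-up logits and green lists, and cycles through its keys in a fixed order only so that the example is reproducible.

```python
import itertools
from collections import Counter
from typing import Callable, List, Sequence


def remove_watermark_multi_key(
    prompt: Sequence[str],
    query_next_token: Callable[[Sequence[str]], str],
    n_queries: int,
    max_tokens: int,
) -> List[str]:
    """For each position, query n times under varying keys and keep the most frequent token."""
    output: List[str] = []
    for _ in range(max_tokens):
        context = list(prompt) + output
        samples = [query_next_token(context) for _ in range(n_queries)]
        token, _count = Counter(samples).most_common(1)[0]
        output.append(token)
    return output


# Toy KGW-style service (all values are made up for illustration).
BASE_LOGITS = {"a": 2.0, "b": 1.5, "c": 1.0, "d": 0.5}  # "a" is the unwatermarked argmax
GREEN_LISTS = [{"a", "d"}, {"b", "c"}, {"a", "c"}]       # one green list per key
DELTA = 1.0                                              # logit bias added to green tokens
_key_cycle = itertools.cycle(GREEN_LISTS)                # fixed key order for reproducibility


def toy_watermarked_service(context: Sequence[str]) -> str:
    green = next(_key_cycle)
    return max(BASE_LOGITS, key=lambda t: BASE_LOGITS[t] + (DELTA if t in green else 0.0))


cleaned = remove_watermark_multi_key(
    ["prompt"], toy_watermarked_service, n_queries=3, max_tokens=3
)
print(cleaned)
```

In this toy setup, the second key alone would always emit the biased token "b", whereas aggregating over all three keys recovers the unwatermarked choice "a" at every position.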

(a) Z-Score and attack success rate (ASR) of watermark stealing [NKIH23].
(b) Z-Score and attack success rate (ASR) of watermark-removal.
(c) Perplexity (PPL) of watermark-removal.
Figure 11: Spoofing attack based on watermark stealing [NKIH23] and watermark-removal attacks on the Unigram watermark and LLAMA-2-7B model with different numbers of watermark keys $n$.
(a) P-Value of watermark-removal attack.
(b) PPL of watermark-removal attack.
Figure 12: Watermark-removal on the Exp watermark [KTHL23] and LLAMA-2-7B model with multiple watermark keys.
(a) Z-Score and attack success rate (ASR) of watermark stealing [NKIH23].
(b) Z-Score and attack success rate (ASR) of watermark-removal.
(c) Perplexity (PPL) of watermark-removal.
Figure 13: Spoofing attack based on watermark stealing [NKIH23] and watermark-removal attacks on the KGW watermark and OPT-1.3B model with different numbers of watermark keys $n$.
(a) Z-Score and attack success rate (ASR) of watermark stealing [NKIH23].
(b) Z-Score and attack success rate (ASR) of watermark-removal.
(c) Perplexity (PPL) of watermark-removal.
Figure 14: Spoofing attack based on watermark stealing [NKIH23] and watermark-removal attacks on the Unigram watermark and OPT-1.3B model with different numbers of watermark keys $n$.
(a) P-Value of watermark-removal attack.
(b) PPL of watermark-removal attack.
Figure 15: Watermark-removal on the Exp watermark [KTHL23] and OPT-1.3B model with multiple watermark keys.

Appendix J Additional Results of Attacks Exploiting Detection APIs

We present the results of the watermark-removal and spoofing attacks on the OPT-1.3B model in Fig. 16 and Table 3. The results are consistent with those for the LLAMA-2-7B model presented in Sec. 6.1, with all attack success rates higher than $75\%$ using a small number of queries to the detection API, around $3$ per token. The results on the OPT-1.3B model further demonstrate the effectiveness of our attacks exploiting the detection API.

(a) Z-Score/P-Value of wm-removal.
(b) Perplexity of wm-removal.
(c) Z-Score/P-Value of spoofing.
Figure 16: Attacks exploiting detection APIs on the OPT-1.3B model.
| Watermark | wm-removal ASR | wm-removal #queries | spoofing ASR | spoofing #queries |
| --- | --- | --- | --- | --- |
| KGW | 0.99 | 2.87 | 1.00 | 2.96 |
| Unigram | 0.77 | 3.25 | 1.00 | 2.97 |
| Exp | 0.86 | 2.07 | 0.93 | 2.92 |

Table 3: The attack success rate (ASR) and the average number of queries per token for the watermark-removal and spoofing attacks exploiting the detection API on the OPT-1.3B model.

Appendix K Additional Results of DP Defense

We present additional evaluation results for our defense technique, which enhances watermark detection using techniques from differential privacy (see Sec. 6). Consistent with Sec. 6.3, we evaluate the utility of the DP defense as well as its performance in mitigating the spoofing attack exploiting the detection API. The results are shown in Fig. 17, Fig. 18, Fig. 19, Fig. 20, and Fig. 21.
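
To make the idea concrete, the sketch below perturbs a detection score with Gaussian noise before thresholding. It is an assumption-laden illustration rather than the mechanism of Sec. 6: the function name `dp_detect`, the choice to add noise directly to a z-score, and the threshold value are all hypothetical.

```python
import random


def dp_detect(z_score: float, threshold: float, sigma: float, rng: random.Random) -> bool:
    """Return the detection decision after adding Gaussian noise of scale sigma to the score."""
    noisy_score = z_score + rng.gauss(0.0, sigma)
    return noisy_score > threshold


rng = random.Random(0)
# With sigma = 0 the decision reduces to the ordinary threshold test.
print(dp_detect(5.0, threshold=4.0, sigma=0.0, rng=rng))  # True
print(dp_detect(3.0, threshold=4.0, sigma=0.0, rng=rng))  # False
# With sigma > 0, borderline scores flip at random, which blurs the per-token
# feedback that a spoofing attacker reads from the detection API.
print(dp_detect(4.5, threshold=4.0, sigma=4.0, rng=rng))
```

One general caveat with this kind of sketch: if fresh noise is drawn on every call, an attacker could average it out by repeating the same query, so a deployment would have to account for repeated queries.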

We first identify the optimal noise scale parameter $\sigma$ based on its detection accuracy and attack success rate, aiming for a drop in detection accuracy within $2\%$ and the lowest attack success rate. Then we assess the performance of the defense. Our findings across three watermarks and two models consistently demonstrate that we can significantly reduce the attack success rate to around or below $20\%$.
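
The selection rule described above can be written as a small helper. This is a sketch under assumptions: the function `select_sigma` is hypothetical, the $2\%$ budget is interpreted as an absolute drop in accuracy, and the measurements in the example are made-up placeholder values, not results from our experiments.

```python
from typing import Dict, Optional, Tuple


def select_sigma(
    results: Dict[float, Tuple[float, float]],
    baseline_acc: float,
    max_acc_drop: float = 0.02,
) -> Optional[float]:
    """Pick the sigma with the lowest attack success rate among those within the accuracy budget.

    results maps sigma -> (detection accuracy, attack success rate).
    Returns None if no candidate stays within the accuracy budget.
    """
    eligible = {
        sigma: asr
        for sigma, (acc, asr) in results.items()
        if baseline_acc - acc <= max_acc_drop
    }
    if not eligible:
        return None
    # Break ties on attack success rate by preferring the smaller noise scale.
    return min(eligible, key=lambda sigma: (eligible[sigma], sigma))


# Placeholder measurements (hypothetical, for illustration only).
measurements = {
    1.0: (0.990, 0.80),
    2.0: (0.985, 0.45),
    4.0: (0.975, 0.18),
    8.0: (0.900, 0.05),  # lowest ASR, but the accuracy drop exceeds the budget
}
best_sigma = select_sigma(measurements, baseline_acc=0.99)
print(best_sigma)
```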

Our defense can be generalized to all LLM watermarking schemes. It allows us to substantially mitigate spoofing attacks exploiting the detection API while having negligible impact on utility.

(a) Detection ACC and spoofing ASR.
(b) Z-scores with/without DP.
Figure 17: Evaluation of DP watermark detection on the Unigram watermark and LLAMA-2-7B model. (a) Detection accuracy and spoofing attack success rate without and with DP watermark detection under different noise parameters. (b) Z-scores of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).
(a) Detection accuracy and spoofing attack success rate.
(b) P-values with/without DP and under multiple queries.
Figure 18: Evaluation of DP watermark detection on the Exp watermark and LLAMA-2-7B model. (a) Detection accuracy and spoofing attack success rate without and with DP watermark detection under different noise parameters. (b) P-values of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).
(a) Detection accuracy and spoofing attack success rate.
(b) Z-scores with/without DP and under multiple queries.
Figure 19: Evaluation of DP watermark detection on the KGW watermark and OPT-1.3B model. (a) Detection accuracy and spoofing attack success rate without and with DP watermark detection under different noise parameters. (b) Z-scores of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).
(a) Detection accuracy and spoofing attack success rate.
(b) Z-scores with/without DP and under multiple queries.
Figure 20: Evaluation of DP watermark detection on the Unigram watermark and OPT-1.3B model. (a) Detection accuracy and spoofing attack success rate without and with DP watermark detection under different noise parameters. (b) Z-scores of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).
(a) Detection accuracy and spoofing attack success rate.
(b) P-values with/without DP and under multiple queries.
Figure 21: Evaluation of DP watermark detection on the Exp watermark and OPT-1.3B model. (a) Detection accuracy and spoofing attack success rate without and with DP watermark detection under different noise parameters. (b) P-values of original text without attack, spoofing attack without DP, and spoofing attacks with DP. We use the best $\sigma=4$ from (a).