PostMark: A Robust Blackbox Watermark for Large Language Models

Yapei Chang Kalpesh Krishna
Amir Houmansadr John Wieting Mohit Iyyer

University of Massachusetts Amherst, Google
{yapeichang,amir,miyyer}@cs.umass.edu
{kalpeshk,jwieting}@google.com

Abstract

The most effective techniques to detect LLM-generated text rely on inserting a detectable signature—or watermark—during the model’s decoding process. Most existing watermarking methods require access to the underlying LLM’s logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.

1 Introduction

Large language models (LLMs) are increasingly being deployed for malicious applications such as fake content generation. The consequences of such applications for the web as a whole are dire: modern LLMs are known to hallucinate (Xu et al., 2024), and their outputs may contain biases and artifacts that are a product of their training data (Navigli et al., 2023). If the web is flooded with millions of LLM-generated articles, how can we trust the veracity of the content we are reading? Additionally, do we want to train LLMs of the future on text generated by LLMs of the present (Shumailov et al., 2023)?

To combat this emerging problem, researchers have developed several LLM-generated text detection techniques that leverage watermarking (Aaronson and Kirchner, 2022; Kirchenbauer et al., 2023), outlier detection (Mitchell et al., 2023), trained classifiers (Tian, 2023), or retrieval-based methods (Krishna et al., 2023). Among these, watermarking methods that embed detectable signatures into model outputs tend to be the most effective and robust (Krishna et al., 2023). However, most watermarking algorithms require access to the logits of the underlying LLM, which means that they can only be implemented by individual LLM API providers such as OpenAI or Google (Yang et al., 2023). Furthermore, while these methods are able to achieve high detection rates with minimal false positives, their effectiveness goes down when the LLM-generated text is modified through paraphrasing, translation, or cropping (Krishna et al., 2023; He et al., 2024; Kirchenbauer et al., 2024).

Figure 1: The PostMark watermarking and detection procedure. Given some unwatermarked input text, we generate its embedding using the Embedder and compute its cosine similarity with all word embeddings in the SecTable, performing top-k selection and additional semantic similarity filtering to choose a list of words. Then, we instruct the Inserter to watermark the text by rewriting it to incorporate all selected words. During detection, we similarly obtain a watermark word list and check how many of these words are present in the input text.

In this work, we develop PostMark, a watermarking approach with relatively high detection rates even in the presence of paraphrasing attacks. PostMark is a post-hoc watermark: given some model-generated text, it uses an embedding model to find words conditioned on the semantics of the text, then calls a separate instruction-following LLM to insert these words into the text without appreciably modifying its meaning. Unlike prior methods, PostMark only requires access to the outputs of the underlying LLM (i.e., no logits).

Overall, our contributions are threefold:

1. We propose PostMark, a novel post-hoc watermarking method that can be applied by third-party entities to outputs from an API provider like OpenAI.
2. We conduct extensive experiments across eight baseline algorithms, five base LLMs, and three datasets, showing that PostMark offers superior robustness to paraphrasing attacks compared to existing methods.
3. We verify through a human evaluation that the words inserted by PostMark during watermarking cannot be reliably detected by humans. We also conduct comprehensive quality evaluations encompassing coherence, relevance, and interestingness for various watermarking methods. Notably, we also assess factuality, an aspect that has not been evaluated in prior work. Our findings reveal that relatively robust watermarks all negatively affect factuality.

2 PostMark: a post-hoc watermark

Most existing watermarking algorithms embed the watermark during the LLM’s decoding process. For example, the watermark of Kirchenbauer et al. (2023, KGW) partitions an LLM’s vocabulary into two lists (a green list and a red list) at each decoding timestep based on a hash of the previous word, and then upweights the green tokens such that they are more likely to be sampled than red tokens. These watermarks have several issues: (1) they require access to the LLM’s logits; (2) because they rely on modifications to the next-token probability distribution, their effectiveness diminishes on LLMs that produce lower-entropy distributions, such as those that have undergone RLHF (Bai et al., 2022); and (3) they show limited robustness to paraphrasing attacks as demonstrated by our results in Section 3.2 and supported by findings from prior work (Krishna et al., 2023; Sadasivan et al., 2024).
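To make the logit dependence concrete, the following is a minimal sketch of a KGW-style green/red partition as described above. It is not the reference implementation: the hashing scheme, the key, and the parameter names `gamma` (green-list fraction) and `delta` (logit boost) are illustrative assumptions.

```python
import hashlib
import random


def green_list(prev_token_id, vocab_size, gamma=0.5, key=b"demo-key"):
    """Return the set of 'green' token ids, seeded by a hash of the previous token.

    The key and SHA-256 seeding are assumptions made for this sketch.
    """
    digest = hashlib.sha256(key + str(prev_token_id).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    token_ids = list(range(vocab_size))
    rng.shuffle(token_ids)
    return set(token_ids[: int(gamma * vocab_size)])


def boost_green_logits(logits, prev_token_id, delta=2.0, gamma=0.5):
    """Add delta to the logit of every green token; red tokens are unchanged."""
    green = green_list(prev_token_id, len(logits), gamma)
    return [x + delta if i in green else x for i, x in enumerate(logits)]
```

The second function is the step that needs logit access: without the next-token logits there is nothing to boost, which is the limitation PostMark is designed to avoid.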

In response, we develop PostMark, a watermarking method that does not require logit access, maintains high detection rates on low-entropy models and tasks, and exhibits improved robustness to paraphrasing attacks. Unlike existing watermarks, PostMark requires access to just the text generated by the underlying LLM, not the next-token distributions. The rest of this section fully specifies PostMark’s operation.

Intuition and terminology:

At a high level, PostMark is based on the intuition that a text’s semantics should not drastically change after watermarking or paraphrasing. Thus, we can condition our watermark on a semantic embedding of the input text that ideally changes only minimally when paraphrasing is applied. To make this work, we rely on three modules: an embedding model Embedder, a secret word embedding table SecTable, and an insertion model Inserter implemented via an instruction-following LLM.

Figure 1 illustrates PostMark’s watermarking and detection pipelines. First, we generate the embedding of an input text using the Embedder. We then compute the cosine similarity between this embedding and all of the word embeddings in SecTable, performing top-$k$ selection and filtering to form a watermark word list. Next, we prompt Inserter to smoothly incorporate the selected words into the input to create the watermarked text. During detection, we follow similar steps to obtain a word list, and check how many of the words are present in the input text.

Embedding model Embedder:

The Embedder needs to be capable of projecting both words and documents into a high-dimensional latent space. In our main experiments, we use OpenAI’s text-embedding-3-large (OpenAI, 2024b), a powerful model that demonstrates strong performance on the MTEB benchmark (Muennighoff et al., 2023). However, any embedding model can be used here. In Section 3.2, we also experiment with nomic-embed (Nussbaum et al., 2024), an open-source model.

Secret word embedding table SecTable:

The core idea behind PostMark is to use an LLM to insert a list of watermark words into the input text without appreciably modifying the quality or meaning of the text, where the words in the list are selected by computing the cosine similarity between the text embedding and a word embedding table SecTable. The construction of SecTable involves two main steps, which we detail below:

> Step 1. Choosing a vocabulary $\mathbb{V}$ : To decide which words to include in SecTable, we use the WikiText-103 corpus (Merity et al., 2017) as our base vocabulary. To avoid inserting arbitrary words that make little sense, we remove all function words, proper nouns, and infrequent rare words. This refined set forms our final vocabulary, $\mathbb{V}$ . We provide more details on this filtering process in §A.

> Step 2. Mapping words in $\mathbb{V}$ to embeddings: To make it difficult for attackers to recover our embedding table, we construct SecTable by randomly assigning each word in the vocabulary to an embedding produced by Embedder; the resulting mapping acts as a cryptographic key. (Footnote 1: We could also just use Embedder’s word embeddings as SecTable directly. However, this can easily be recovered by an attacker, and our experiments show that it also reduces PostMark’s effectiveness due to many words already being present in the input text even before insertion.) More specifically, we generate a set of embeddings $\mathbb{D}$ for a collection of random documents using Embedder and then randomly map each word in $\mathbb{V}$ to a unique document embedding in $\mathbb{D}$ to produce SecTable. (Footnote 2: The selection of these documents is flexible. In our experiments, we randomly sample 250-word snippets from the RedPajama dataset’s English split (Computer, 2023).)
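The random word-to-embedding mapping in Step 2 can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `build_sec_table` and the use of an integer seed as the secret are assumptions, and document embeddings are represented as plain lists of floats.

```python
import random


def build_sec_table(vocab, doc_embeddings, seed):
    """Map each vocabulary word to a distinct, randomly chosen document embedding.

    `seed` stands in for the secret key (an assumption of this sketch). Note that
    random.Random is not a cryptographically secure generator.
    """
    if len(doc_embeddings) < len(vocab):
        raise ValueError("need at least one document embedding per word")
    rng = random.Random(seed)
    # Sampling without replacement gives every word a unique embedding.
    chosen = rng.sample(range(len(doc_embeddings)), len(vocab))
    return {word: doc_embeddings[i] for word, i in zip(vocab, chosen)}
```

Because the same seed reproduces the same table, the watermarking side and the detection side can share SecTable by sharing only the secret.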

Insertion model Inserter:

The Inserter needs to have instruction-following capabilities, and its purpose is to rewrite the input text to incorporate words from the watermark word list. We use GPT-4o (OpenAI, ) as the Inserter in our main experiments, and later show in Section 3.2 that open-source models like Llama-3-70B-Inst (AI@Meta, 2024) also show promising performance.

2.1 Inserting the watermark

> Step 1. Deciding how many words to insert: How many words should we insert into a given text? We define a hyperparameter called the insertion ratio $r$ that determines this number. The insertion ratio represents the percentage of the input text’s word count: for example, if $r=10\%$ and the input text has $50$ words, we will insert 5 words.

> Step 2. Obtaining a watermark word list: Suppose that the watermark list should contain $k$ words. To create the watermark word list given the input text, we first compute the input’s embedding $e_{t}=\textsc{Embedder}(t)$. Next, we compute $\text{CosineSimilarity}(e_{t},\textsc{SecTable})$ and select the top $k^{\prime}$ most similar words, then perform semantic similarity filtering to obtain the final $k$ words. (Footnote 3: Due to the random nature of the word-to-embedding mapping of SecTable, the top $k^{\prime}$ words might include highly irrelevant words (e.g., “hotel” in Figure 1). Thus, we refine the top-$k^{\prime}$ list by selecting the top $k$ words whose actual embeddings (as determined by Embedder) are most similar to $e_{t}$.) We present an analysis of how frequently a word is chosen as a watermark word in §A.

> Step 3. Inserting words into the text: To watermark the text, we instruct Inserter to rewrite it via zero-shot prompting, incorporating words in the watermark word list while keeping the rewritten text coherent, factual, and concise. (Footnote 4: In practice, we find that dividing a long word list into sublists of 10 words each and then iteratively asking the Inserter to incorporate each sublist ensures a high insertion success rate. This may not be necessary if the Inserter has better instruction-following capabilities.) The prompt can be found in §B.
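Steps 1 and 2, together with the sublist splitting mentioned for Step 3, can be sketched as below. This is a toy illustration under stated assumptions: embeddings are plain lists of floats, `word_embs` is a hypothetical lookup of each word's actual Embedder embedding, and rounding the word count is our choice since the text does not specify it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def num_words_to_insert(text, r):
    """Step 1: insertion ratio r (e.g. 0.10) times the word count (rounding assumed)."""
    return max(1, round(r * len(text.split())))


def select_watermark_words(text_emb, sec_table, word_embs, k, k_prime):
    """Step 2: top-k' by similarity to the secret table, then top-k by actual embeddings."""
    candidates = sorted(
        sec_table, key=lambda w: cosine(text_emb, sec_table[w]), reverse=True
    )[:k_prime]
    return sorted(
        candidates, key=lambda w: cosine(text_emb, word_embs[w]), reverse=True
    )[:k]


def chunk_words(words, size=10):
    """Split the word list into sublists that are passed to the Inserter one by one."""
    return [words[i:i + size] for i in range(0, len(words), size)]
```

The two-stage selection matters because the secret mapping is random: the first stage keeps the choice keyed to SecTable, while the second stage discards candidates that are semantically unrelated to the text.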

2.2 Detecting the watermark

During detection, given some text, the goal is to find out if the text contains a watermark. Similar to the watermarking procedure, we embed the candidate text using Embedder, form a word list, and then check how many words in the list are present in the text by computing a presence score $p$ :

$$p=\frac{\left|\left\{w\in\text{list}\ \text{s.t.}\ \exists w^{\prime}\in\text{text},\ \text{sim}(w^{\prime},w)\geq 0.7\right\}\right|}{|\text{list}|}$$

A word $w$ is marked present in the text if the text contains any word $w^{\prime}$ whose embedding cosine similarity with $w$ is at least a threshold that we set to 0.7. We choose this method over exact match to ensure additional robustness against paraphrasing. (Footnote 5: We use the paragram word embedding model developed by Wieting et al. (2015) to compute cosine similarity for this step. This model is chosen for its superior performance in assigning high similarity scores to close synonyms and low scores to unrelated words; more details are in §C.) If $p$ is larger than a certain threshold, it is likely that the text has been watermarked. As later discussed in Section 3.1, the primary metric we use to measure detection accuracy is the true positive rate at a fixed 1% false positive rate. We thus set the threshold to ensure a 1% FPR, as we do for all baselines in our main experiments.
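The presence score can be sketched in a few lines. The similarity function is passed in as a parameter; `toy_sim` below is a hypothetical stand-in for the paragram word embedding similarity and exists only so the example is self-contained.

```python
def presence_score(word_list, text_words, sim, threshold=0.7):
    """Fraction of watermark words with at least one sufficiently similar word in the text."""
    if not word_list:
        return 0.0
    present = sum(
        1 for w in word_list
        if any(sim(w_prime, w) >= threshold for w_prime in text_words)
    )
    return present / len(word_list)


def toy_sim(a, b):
    """Hypothetical similarity: exact matches score 1.0, one hand-picked synonym pair 0.8."""
    if a == b:
        return 1.0
    if {a, b} == {"armor", "armour"}:
        return 0.8
    return 0.0
```

With the word list `["armor", "immune", "hotel"]` and a text containing "armour" and "immune", two of the three words count as present, so the score is 2/3. The synonym match is what exact string matching would miss after paraphrasing.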

3 Experiments

In this section, through extensive experiments on three datasets and five language models, we demonstrate that PostMark consistently outperforms both logit-free and logit-based methods in terms of robustness to paraphrasing attacks, especially on low-entropy models that have undergone RLHF alignment. Furthermore, we showcase PostMark’s modular design by testing an open-source variant, which achieves promising results.

3.1 Experimental setup

Baselines:

We compare PostMark against eight baseline algorithms; more detailed descriptions can be found in §D.
(1) KGW (Kirchenbauer et al., 2023): Partitions the vocabulary into “green” and “red” lists based on the previous token, then boosts the probability of green tokens during generation.
(2) Unigram (Zhao et al., 2023): A more robust variant of KGW that uses a fixed partition for all generations.
(3) EXP (Aaronson and Kirchner, 2022): Uses exponential sampling to bias token selection with a pseudo-random sequence.
(4) EXP-Edit (Kuditipudi et al., 2024): A variant of EXP that uses edit distance during detection.
(5) SemStamp (Hou et al., 2023): A sentence-level algorithm that partitions the sentence semantic space.
(6) k-SemStamp (Hou et al., 2024): Improves SemStamp by using k-means clustering to partition the semantic space.
(7) SIR (Liu et al., 2024b): Generates watermark logits from the semantic embeddings of preceding tokens, then adds them to the model’s logits.
(8) Blackbox (Yang et al., 2023): This method, like ours, works in a blackbox setting where only model outputs are visible. It substitutes words representing bit-0 in a binary encoding scheme with synonyms representing bit-1.

Hyperparameters:

The key hyperparameter for PostMark is the insertion ratio $r$ , which controls how many words are inserted during the watermarking process. We set $r$ to 12% as preliminary experiments suggest that this value strikes a good balance between quality and robustness. Section 4.1 explores different PostMark configurations that vary $r$ . In all following discussion and tables, we refer to these configurations with the naming convention “PostMark@ $r$ ”. We carefully tune all baselines’ hyperparameters to maximize their robustness to paraphrasing; more details in §D.

Base models:

Our experiments involve five generative models: Llama-3-8B (AI@Meta, 2024), Llama-3-8B-Inst (AI@Meta, 2024), Mistral-7B-Inst (Jiang et al., 2023), GPT-4 (OpenAI, 2024a), and OPT-1.3B (Zhang et al., 2022). Among these, Llama-3-8B-Inst, Mistral-7B-Inst, and GPT-4 have been aligned with human preferences. For details on model checkpoints and generation length, see §E. We do not run OPT-1.3B ourselves but directly use its unwatermarked outputs provided by Hou et al. (2024). Due to difficulties in running SemStamp, k-SemStamp, and SIR, we apply PostMark directly to these outputs and compare our results with the published numbers in Hou et al. (2024). (Footnote 6: Their code is available but not runnable yet. We look forward to running these methods ourselves once the issues are resolved.)

Datasets:

Our main experiments use three datasets: (1) OpenGen, a dataset collected by Krishna et al. (2023) designed for open-ended generation that consists of two-sentence chunks sampled from the validation set of WikiText-103; (2) LFQA, a dataset collected by Krishna et al. (2023) for long-form question answering that contains questions sampled from the r/explainlikeimfive subreddit that span multiple domains; and (3) RealNews (Raffel et al., 2020), a subset of the C4 dataset that includes news articles gathered from a wide range of reliable news websites.

Paraphrasing attack setup:

Following prior work (Hou et al., 2023, 2024; Kirchenbauer et al., 2024; Liu et al., 2024b), we use GPT-3.5-Turbo as our paraphraser. We use a sentence-level paraphrasing approach where the model iterates through each sentence of the input text, using all preceding context to paraphrase the current sentence. See §F for more details on this setup.
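The sentence-level paraphrasing loop can be sketched as below. The actual call to GPT-3.5-Turbo is abstracted behind a caller-supplied function `paraphrase_sentence(context, sentence)`, which is a hypothetical interface. Two further assumptions of this sketch: sentences are split with a simple regular expression, and the "preceding context" is the already-paraphrased text (the setup in §F may differ).

```python
import re


def paraphrase_document(text, paraphrase_sentence):
    """Paraphrase a document one sentence at a time, conditioning on what precedes it."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rewritten = []
    for sentence in sentences:
        context = " ".join(rewritten)  # everything paraphrased so far
        rewritten.append(paraphrase_sentence(context, sentence))
    return " ".join(rewritten)
```

In practice `paraphrase_sentence` would wrap an LLM API call; any callable with the same two arguments works for testing.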

Metric for measuring detection performance:

In addition to the true positive rate, a low false positive rate is critical for LLM-generated text detection. Thus, following prior detection work (Krishna et al., 2023; Zhao et al., 2023; Hou et al., 2023, 2024; Liu et al., 2024b), we use TPR at 1% FPR as our primary metric.
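As an illustration of the metric, the sketch below picks a score threshold from unwatermarked (negative) texts so that at most 1% of them are flagged, then measures the fraction of watermarked (positive) texts above it. The function names and the tie-handling rule (flag only scores strictly above the threshold) are assumptions of this sketch.

```python
import math


def threshold_at_fpr(negative_scores, target_fpr=0.01):
    """Return a threshold such that at most target_fpr of negatives score strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we may flag; the epsilon guards against float round-off.
    allowed = min(math.floor(target_fpr * len(ranked) + 1e-9), len(ranked) - 1)
    return ranked[allowed]


def tpr_at_fpr(positive_scores, negative_scores, target_fpr=0.01):
    """True positive rate when the threshold is fixed by the target false positive rate."""
    threshold = threshold_at_fpr(negative_scores, target_fpr)
    return sum(s > threshold for s in positive_scores) / len(positive_scores)
```

For PostMark the scores would be presence scores $p$; for the baselines they would be each method's own detection statistic.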

Metric $\rightarrow$			TPR at 1% FPR (Before Paraphrasing / After Paraphrasing)
Model $\downarrow$	Dataset $\downarrow$	Avg Entropy $\downarrow$	PostMark@12	Blackbox	KGW	Unigram	EXP	EXP-Edit	SIR	SemStamp	k-SemStamp
Llama-3-8B	OpenGen	3.6	99.7 / 63.5	81.2 / 2.2	100 / 74.8	99.8 / 93.4	99.8 / 36.6	97.3 / 73.3	-	-	-
	LFQA	3.5	97.8 / 72.5	82.8 / 1.6	99.8 / 25.6	99.8 / 79.6	99.8 / 12.4	83 / 41	-	-	-
Llama-3-8B-Inst	OpenGen	1.6	99.4 / 46.4	91.8 / 1	98.2 / 21.6	99.6 / 41.4	99.6 / 4.8	47.8 / 2.2	-	-	-
	LFQA	1.3	96 / 65.7	86.2 / 3	85.8 / 19	98.6 / 31.8	98.4 / 0.6	21.1 / 0.6	-	-	-
Mistral-7B-Inst	OpenGen	1.4	99.2 / 69.2	98.4 / 0.4	100 / 16	99.8 / 56	99.4 / 5	33 / 1.5	-	-	-
	LFQA	1.1	99.6 / 56.4	89.8 / 0.4	99.4 / 23.6	97.2 / 41.2	97.4 / 0.8	20.1 / 2.1	-	-	-
GPT-4	OpenGen	-	99.4 / 59.4	99.4 / 1.4	-	-	-	-	-	-	-
	LFQA	-	99.4 / 65	99.2 / 0.4	-	-	-	-	-	-	-
OPT-1.3B	RealNews	3.6	98.2 / 67.2	-	-	-	-	-	99.4 / 24.7	93.9 / 33.9	98.1 / 55.5

Table 1: Comparison of PostMark and baselines. All numbers are computed over 500 generations. Each entry shows the TPR at 1% FPR before paraphrasing and after paraphrasing. The “Avg Entropy” column shows the average token-level entropy (in bits) of each model on each dataset.

3.2 Results

We present our main experimental results on robustness to paraphrasing attacks in Table 1, and discuss our main findings below. Runtime analysis and API cost estimates can be found in §G.

PostMark is an effective and robust watermark.

PostMark consistently achieves a high TPR before paraphrasing ($>90\%$), outperforming baselines like Blackbox, KGW, and EXP-Edit. Additionally, PostMark achieves higher TPR after paraphrasing compared to other baselines, including Blackbox, the only other method that operates under the same logit-free condition. The only setting in which PostMark is not the most robust method under paraphrasing is with the base Llama-3-8B model, where Unigram exhibits more robustness. We note that Unigram is much more vulnerable to reverse-engineering than PostMark because it uses a fixed green/red list partition for all inputs, which can be exploited with repetition attacks. (Footnote 7: For Unigram, detection works by comparing the number of green tokens present in the input text to the expected count under the null hypothesis of no watermarking. The adversary can pick a word “apple” and submit a long repeating sequence of this word (e.g., “apple apple apple…”) to the watermark detection service. If it says this sequence is watermarked, then “apple” must be in the green list.) Unigram’s effectiveness diminishes with low-entropy models, and in Section 4, we also observe Unigram’s severe negative impact on text quality. Finally, the bottom of Table 1 shows that PostMark is more robust than the three baselines that also condition on input semantics: SemStamp, k-SemStamp, and SIR.

Logit-based baselines perform worse on low-entropy models and tasks, while PostMark stays relatively unaffected.

Results from Table 1 demonstrate that logit-based baselines (i.e., all baselines except Blackbox) generally perform worse on aligned models (Llama-3-8B-Inst and Mistral-7B-Inst) compared to the non-aligned Llama-3-8B, and worse on LFQA than on OpenGen. This performance difference is consistent with findings from prior work (Kuditipudi et al., 2024) and can be attributed to the lower entropy of aligned models resulting from RLHF or instruction-tuning, as well as the inherently lower entropy of the LFQA task. The “Avg Entropy” column of Table 1 illustrates these entropy differences. In contrast, PostMark consistently outperforms all baselines in terms of robustness against paraphrasing attacks in these low-entropy scenarios.

Open-weight PostMark shows promise.

While our main experiments use GPT-4o as the Inserter and OpenAI’s text-embedding-3-large as the Embedder, we show in Table 2 that an open-weight combination of Llama-3-70B-Inst and nomic-embed can also achieve promising robustness to paraphrasing attacks. The modular design of PostMark allows for flexible experimentation with various components. As each module’s capabilities advance, PostMark’s robustness will likewise improve.

PostMark@12 Impl.	TPR at 1% FPR
Closed	99.4 / 59.4
Open	100 / 52.1

Table 2: TPR at 1% FPR before and after paraphrasing. The open-source implementation of PostMark@12 with nomic-embed as the Embedder and Llama-3-70B-Inst as the Inserter shows promising performance on OpenGen with GPT-4 as the base LLM.

4 Impact of watermarking on text quality

Type	Before watermark	After watermark
Clarification	Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years.	Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years of imprisonment.
Metaphors	In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life.	In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life, almost as if it wears an armor of resilience, immune to the challenges it faces.

Table 3: Example edits made by PostMark during the watermarking process. Changes are highlighted in orange, and watermark words are in bold. More examples can be found in §J.

PostMark modifies text during watermarking by inserting new words, which often results in longer watermarked text while preserving its semantic meaning. (Footnote 8: A full table of length comparison is in §H.) To assess the preservation of semantic meaning, we compare the embeddings of watermarked text with those of the input text and consistently find a high cosine similarity of around 0.95. More details on this can be found in §I. Table 3 shows several common types of edits made by PostMark during watermarking. (Footnote 9: Summarized based on a small-scale qualitative analysis.) Although edits adding new content are expected to hurt quality, this quality degradation is not unique to PostMark. Prior work has found that all watermarking methods negatively affect text quality to some extent (Singh and Zou, 2023). For logit-based methods like KGW, quality degradation occurs because relevant words can be downweighted during decoding. While existing papers on watermarking often lack extensive quality evaluations, we conduct both automatic and human evaluations to assess the quality of watermarked text (relevance, coherence, interestingness, and factuality) in this section.

Setting up quality evaluations:

Prior work on watermarking has predominantly used perplexity as a measure for text quality (Kirchenbauer et al., 2023; Zhao et al., 2023; Yang et al., 2023; Liu et al., 2024b; Hu et al., 2024; Hou et al., 2023, 2024). However, perplexity alone has been shown to be an unreliable indicator of quality (Wang et al., 2022). Some studies have explored alternative methods, such as LLM-based evaluations (Singh and Zou, 2023) and human assessments (Kirchenbauer et al., 2024). Here, we evaluate the quality of watermarked text using automated and human evaluations, aiming to address four key questions:

> Q1: How does PostMark compare to other baselines in terms of impact on text quality?

> Q2: What is the quality-robustness trade-off for PostMark?

> Q3: How often do humans think that PostMark watermarked texts are at least as good as their unwatermarked versions?

> Q4: Are words inserted by PostMark easily detectable by humans?

4.1 Automatic evaluation

In this section, we compare PostMark with other baselines regarding impact on quality (Q1) and address the quality-robustness trade-off of PostMark (Q2).

Pairwise preference evaluation setup:

We adopt the LLM-as-a-judge (Zheng et al., 2023) setup to perform a pairwise comparison task. We choose GPT-4-Turbo as our judge as it is a highly ranked evaluator model on the Reward Bench leaderboard (Lambert et al., 2024) that we can easily access. (Footnote 10: The current leaderboard is hosted on Hugging Face. GPT-4-Turbo’s high ranking indicates that it is a relatively robust and reliable LLM evaluator.) Given 100 OpenGen prefixes and corresponding pairs of anonymized unwatermarked and watermarked responses, the model evaluates each pair and chooses which response it prefers, where ties are allowed. The model is instructed to consider the relevance, coherence, and interestingness of the responses when making a judgment. The full prompt can be found in §M. Then, we compute the soft win rate, which equals the number of ties plus the number of wins for the watermarked response, for various baselines (Table 4) and several PostMark configurations (Table 5).
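The soft win rate can be computed from the judge's verdicts as sketched below. The verdict labels are hypothetical, and the result is expressed as a percentage, which coincides with the raw count of wins plus ties when there are exactly 100 pairs.

```python
def soft_win_rate(verdicts):
    """Percentage of pairs where the watermarked response wins or ties.

    `verdicts` holds one label per pair: "watermarked", "unwatermarked", or "tie"
    (label names are an assumption of this sketch).
    """
    favorable = sum(v in ("watermarked", "tie") for v in verdicts)
    return 100.0 * favorable / len(verdicts)
```

For example, 40 wins, 24 ties, and 36 losses over 100 pairs give a soft win rate of 64.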

Factuality evaluation setup:

To assess factuality, an essential aspect not addressed in the previous pairwise comparisons or previous watermarking research, we use FactScore (Min et al., 2023), an automatic metric that measures the percentage of atomic claims in an LLM-generated biography that are supported by Wikipedia. We generate biographies for the entities in the FactScore dataset and compare the FactScores of the outputs before and after watermarking. Before watermarking, Llama-3-8B-Inst achieves a score of 40.2. We then run KGW, Unigram, PostMark@12, and PostMark@6, resulting in scores of 37.8, 37.2, 37.3, and 38.3, respectively. The full table is in §K.

Metric $\rightarrow$	Soft Win Rate
Method $\rightarrow$	KGW	Unigram	EXP	EXP-Edit	Blackbox	PostMark@12
Llama-3-8B	37	17	23	49	45	74
Llama-3-8B-Inst	52	52	59	57	55	68
Mistral-7B-Inst	57	54	49	54	46	64
GPT-4	-	-	-	-	53	64

Table 4: Soft win rates computed based on the pairwise comparison evaluation with GPT-4-Turbo as the judge, measured over 100 pairs of unwatermarked and watermarked OpenGen outputs from various LLMs (first column). PostMark@12 outperforms all baselines.

Configuration	Soft Win Rate	TPR After Para.
PostMark@6	84	20.8
PostMark@8	79	28.2
PostMark@12	64	59.4
PostMark@15	67	61.9
PostMark@20	62	82.8
PostMark@30	55	98

Table 5: Quality-robustness trade-off. All soft win rates are averaged over 100 pairs of unwatermarked and watermarked texts judged by GPT-4-Turbo. All paraphrased TPR numbers at 1% FPR are computed over 500 OpenGen instances.

> Q1: PostMark does not affect quality as much as other baselines.

Results from Table 4 show that PostMark performs exceptionally well in pairwise comparisons across models. In contrast, despite Unigram’s strong robustness to paraphrasing, which sometimes even exceeds PostMark’s when tested on Llama-3-8B, it has significantly lower soft win rates, especially on Llama-3-8B (17%). This low score is likely due to frequent repetitions in Unigram outputs, as detailed in §L. Regarding factuality, KGW, Unigram, and PostMark@12 all show similar levels of negative impact, with FactScores of 37.8, 37.2, and 37.3, respectively.

> Q2: Inserting more words enhances robustness but hurts quality, and vice versa.

We first use the pairwise comparison setup to evaluate the quality-robustness trade-off of PostMark with $r$ set to six different values: 6%, 8%, 12%, 15%, 20%, and 30%. Results in Table 5 reveal a strong inverse correlation between quality and robustness, with a Pearson coefficient of -0.98. PostMark@6 also achieves a higher FactScore (38.3) than PostMark@12 (37.3). In practical applications, the choice of $r$ should be based on the desired balance between quality and robustness.
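The reported correlation can be reproduced from the two columns of Table 5. The sketch below uses a plain-Python Pearson implementation; only the function and variable names are ours, while the numbers are copied from the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Table 5, rows PostMark@6 through PostMark@30.
soft_win_rates = [84, 79, 64, 67, 62, 55]
tpr_after_paraphrase = [20.8, 28.2, 59.4, 61.9, 82.8, 98.0]

r_quality_robustness = pearson(soft_win_rates, tpr_after_paraphrase)
```

The value rounds to -0.98, matching the coefficient stated in the text.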

4.2 Human evaluation

While LLM-based evaluators serve as good proxies for human judgments in several cases (Zheng et al., 2023), their results should be interpreted with caution, as they can be biased toward certain aspects of the text such as length (Wang et al., 2023) or overlap between the generator and the judge model (Panickssery et al., 2024). Thus, we hire two annotators fluent in English and conduct two human annotation studies detailed below, addressing Q3 and Q4. More details on annotator qualifications, payment, and each annotation setup can be found in §N.

> Q3: PostMark watermarked texts are at least as good as their unwatermarked counterparts the majority of the time.

We first evaluate the impact of PostMark on quality through a pairwise comparison task, similar to the setup in Section 4.1. Each annotator reads 20 OpenGen prefixes and the corresponding pairs of anonymized watermarked and unwatermarked responses generated by GPT-4. We then ask them to indicate their preferred response overall, as well as their preferences in terms of relevance, coherence, and interestingness, allowing for ties. Results in Figure 2 indicate that for PostMark@12 and PostMark@6, watermarked responses are at least as good as their unwatermarked counterparts the majority of the time (i.e., total percentage of wins and ties $\geq$ 50%). As expected, reducing the insertion rate to 6% improves quality, especially in the coherence aspect. (Footnote 11: While soft win rates computed from human annotations are much lower than those from GPT-4-Turbo’s judgments, both judges agree that a smaller $r$ improves quality.) To put things in perspective, a previous human evaluation study by Kirchenbauer et al. (2024) found that annotators preferred KGW-watermarked text over unwatermarked text only 38.4% of the time.

> Q4: Annotators struggle to identify the words inserted by PostMark.

A primary concern with PostMark is whether the words inserted into the watermarked text will be conspicuous enough for humans to identify, making it easy for attackers to remove them. To measure this, we create an anonymized mixture of 20 unwatermarked and 20 watermarked responses generated for 20 prefixes in OpenGen with GPT-4 as the base LLM. (Footnote 12: We include unwatermarked responses in this evaluation as a baseline. For fairness, we regenerated unwatermarked texts to roughly match the length of the watermarked texts.) (Footnote 13: These 20 prefixes are different from the ones they see in the pairwise comparison evaluation.) We then ask annotators to highlight out-of-place words that they think might have been inserted post-hoc after the initial generation. Overall, annotators achieve an F1 of merely 0.06 (0.46 precision, 0.03 recall). On average, they highlight 2.2 words in each unwatermarked response, and 3.45 words in each watermarked response. Thus, even when annotators are aware of the insertion of words, they cannot pinpoint the specific words.
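As a consistency check, the reported F1 follows from the stated precision $P=0.46$ and recall $R=0.03$ under the standard harmonic-mean definition:

$$F_{1}=\frac{2PR}{P+R}=\frac{2\times 0.46\times 0.03}{0.46+0.03}=\frac{0.0276}{0.49}\approx 0.056,$$

which rounds to 0.06. The very low recall dominates the score: annotators find only about 3% of the inserted words.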

5 Related work

Early research on watermarking:

Our work is relevant to early work on watermarking text documents, either using the text document image (Brassil et al., 1995; Low et al., 1998), syntactic transformations (Atallah et al., 2001; Meral et al., 2009), or semantic changes (Atallah et al., 2003; Topkara et al., 2006). Later work also explores watermarking machine-generated text (Venugopal et al., 2011).

Watermarking LLM-generated text:

Recent research has primarily focused on watermarking LLM-generated outputs. Most existing approaches operate in the whitebox setting, assuming access to model logits and the ability to modify the decoding process (Fang et al., 2017; Kaptchuk et al., 2021; Aaronson and Kirchner, 2022; Kirchenbauer et al., 2023; Zhao et al., 2023; Liu et al., 2024a, b) or inject detectable signals without altering the original token distribution (Christ et al., 2023; Kuditipudi et al., 2024). Alternatively, Hou et al. (2023, 2024) watermark at the sentence level via rejection sampling. Prior blackbox methods access only model outputs (like PostMark), but rely on simple lexical substitution (Abdelnabi and Fritz, 2021; Qiang et al., 2023; Yang et al., 2023; Munyer et al., 2024).

Evading watermark detection:

Our work also relates to prior work on text editing attacks designed to evade watermark detection. He et al. (2024) propose a cross-lingual attack, while Kirchenbauer et al. (2024) study a copy-paste attack that embeds watermarked text into a larger human-written document. Krishna et al. (2023) train a controllable paraphraser that allows for control over lexical and syntactic diversity. Sadasivan et al. (2024) design a recursive paraphrasing attack that repeatedly rewrites watermarked text. Similar to our work, several studies directly prompt an instruction-following LLM to paraphrase text (Zhao et al., 2023; Hou et al., 2023, 2024; Liu et al., 2024b; Kirchenbauer et al., 2024).

Quality-robustness trade-off:

Relevant to our discussion in Section 4, several recent papers highlight the impact of watermarking on quality. In line with our conclusions, Singh and Zou (2023) and Molenda et al. (2024) both find that less robust watermarks tend to have less negative impact on text quality.

6 Conclusion

We propose PostMark, a novel watermarking approach that only requires access to the underlying model’s outputs, making it applicable by third-party entities to outputs from API providers. Through extensive experiments across eight baseline algorithms, five base LLMs, and three datasets, we show that PostMark is more robust to paraphrasing attacks than existing methods. We conduct a human evaluation to show that words inserted by PostMark are not easily identifiable by humans. We further run comprehensive quality evaluations covering coherence, relevance, interestingness, and factuality, and find that PostMark preserves text quality relatively well. Future work could look into further optimizing each of the three modules in PostMark, evaluating PostMark on attacks other than paraphrasing, or making logit-based methods less entropy-dependent.

Limitations

In this section, we address the primary limitations of our work.

Other attacks:

Our work focuses on evaluating robustness of various watermarking methods against paraphrasing attacks. However, there are many other interesting and practical attacks that we do not consider, such as the copy-paste attack and the recursive paraphrasing attack discussed in Section 5. We anticipate that PostMark will be less effective when the watermarked text is embedded in a larger human-written document or when it undergoes repeated paraphrasing, similar to other watermarking methods. We leave the exploration of these other types of attacks to future work.

Runtime and API costs:

The PostMark implementation used in all our main experiments relies on closed-source models from OpenAI (text-embedding-3-large and GPT-4o). As a result, the runtime and costs of running PostMark are heavily dependent on the API provider. Our cost estimate in §G ($18.5 USD for 500 outputs of around 300 tokens each) suggests that watermarking 100 tokens with the default PostMark@12 configuration costs around $0.012 USD (about 1.2 cents). However, the framework is highly flexible in terms of module selection. In fact, as demonstrated in Section 3.2, an open-source implementation can perform nearly as well as the closed-source version. We leave the optimization of open-source implementations of PostMark to future work.

Ethical considerations

Our human study was determined exempt by IRB review. All annotators have consented to the release of their annotations, and we ensured they were fairly compensated for their valuable contributions. Scientific artifacts are implemented for their intended usage. The risks associated with our framework are no greater than those already present in the large language models it utilizes (Weidinger et al., 2021).

Acknowledgments

We extend special gratitude to the Upwork annotators for their hard work. This project was partially supported by awards IIS-2202506 and IIS-2312949 from the National Science Foundation (NSF).

References

Aaronson and Kirchner (2022) Scott Aaronson and Hendrik Kirchner. 2022. Watermarking gpt outputs.
Abdelnabi and Fritz (2021) Sahar Abdelnabi and Mario Fritz. 2021. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. Preprint, arXiv:2009.03015.
AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Atallah et al. (2001) Mikhail J. Atallah, Victor Raskin, Michael Crogan, Christian Hempelmann, Florian Kerschbaum, Dina Mohamed, and Sanket Naik. 2001. Natural language watermarking: Design, analysis, and a proof-of-concept implementation. In Information Hiding, pages 185–200, Berlin, Heidelberg. Springer Berlin Heidelberg.
Atallah et al. (2003) Mikhail J. Atallah, Victor Raskin, Christian F. Hempelmann, Mercan Karahan, Radu Sion, Umut Topkara, and Katrina E. Triezenberg. 2003. Natural language watermarking and tamperproofing. In Information Hiding, pages 196–212, Berlin, Heidelberg. Springer Berlin Heidelberg.
Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint, arXiv:2204.05862.
Brassil et al. (1995) J.T. Brassil, S. Low, N.F. Maxemchuk, and L. O’Gorman. 1995. Electronic marking and identification techniques to discourage document copying. IEEE Journal on Selected Areas in Communications, 13(8):1495–1504.
Christ et al. (2023) Miranda Christ, Sam Gunn, and Or Zamir. 2023. Undetectable watermarks for language models. Preprint, arXiv:2306.09194.
Computer (2023) Together Computer. 2023. Redpajama: an open dataset for training large language models.
Fang et al. (2017) Tina Fang, Martin Jaggi, and Katerina Argyraki. 2017. Generating steganographic text with LSTMs. In Proceedings of ACL 2017, Student Research Workshop, pages 100–106, Vancouver, Canada. Association for Computational Linguistics.
He et al. (2024) Zhiwei He, Binglin Zhou, Hongkun Hao, Aiwei Liu, Xing Wang, Zhaopeng Tu, Zhuosheng Zhang, and Rui Wang. 2024. Can watermarks survive translation? on the cross-lingual consistency of text watermark for large language models. Preprint, arXiv:2402.14007.
Hou et al. (2023) Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. 2023. Semstamp: A semantic watermark with paraphrastic robustness for text generation. In Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Hou et al. (2024) Abe Bohan Hou, Jingyu Zhang, Yichen Wang, Daniel Khashabi, and Tianxing He. 2024. k-semstamp: A clustering-based semantic watermark for detection of machine-generated text. Preprint, arXiv:2402.11399.
Hu et al. (2024) Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. 2024. Unbiased watermark for large language models. In The Twelfth International Conference on Learning Representations.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
Kaptchuk et al. (2021) Gabriel Kaptchuk, Tushar M. Jois, Matthew Green, and Aviel D. Rubin. 2021. Meteor: Cryptographically secure steganography for realistic distributions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, pages 1529–1548, New York, NY, USA. Association for Computing Machinery.
Kirchenbauer et al. (2023) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084. PMLR.
Kirchenbauer et al. (2024) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. 2024. On the reliability of watermarks for large language models. In The Twelfth International Conference on Learning Representations.
Krishna et al. (2023) Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Thirty-seventh Conference on Neural Information Processing Systems.
Kuditipudi et al. (2024) Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. 2024. Robust distortion-free watermarks for language models. Preprint, arXiv:2307.15593.
Lambert et al. (2024) Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. 2024. Rewardbench: Evaluating reward models for language modeling. Preprint, arXiv:2403.13787.
Liu et al. (2024a) Aiwei Liu, Leyi Pan, Xuming Hu, Shuang Li, Lijie Wen, Irwin King, and Philip S. Yu. 2024a. An unforgeable publicly verifiable watermark for large language models. In The Twelfth International Conference on Learning Representations.
Liu et al. (2024b) Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. 2024b. A semantic invariant robust watermark for large language models. In The Twelfth International Conference on Learning Representations.
Low et al. (1998) S.H. Low, N.F. Maxemchuk, and A.M. Lapone. 1998. Document identification for copyright protection using centroid detection. IEEE Transactions on Communications, 46(3):372–383.
Meral et al. (2009) Hasan Mesut Meral, Bülent Sankur, A. Sumru Özsoy, Tunga Güngör, and Emre Sevinç. 2009. Natural language watermarking via morphosyntactic alterations. Computer Speech and Language, 23(1):107–125.
Merity et al. (2017) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations.
Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
Mitchell et al. (2023) Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detection using probability curvature. Preprint, arXiv:2301.11305.
Molenda et al. (2024) Piotr Molenda, Adian Liusie, and Mark J. F. Gales. 2024. Waterjudge: Quality-detection trade-off when watermarking large language models. Preprint, arXiv:2403.19548.
Muennighoff et al. (2023) Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics.
Munyer et al. (2024) Travis Munyer, Abdullah Tanvir, Arjon Das, and Xin Zhong. 2024. Deeptextmark: A deep learning-driven text watermarking approach for identifying large language model generated text. Preprint, arXiv:2305.05773.
Navigli et al. (2023) Roberto Navigli, Simone Conia, and Björn Ross. 2023. Biases in large language models: Origins, inventory, and discussion. J. Data and Information Quality, 15(2).
Nussbaum et al. (2024) Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic embed: Training a reproducible long context text embedder. Preprint, arXiv:2402.01613.
OpenAI (n.d.) OpenAI. Model release blog: GPT-4o. Technical report, OpenAI.
OpenAI (2024a) OpenAI. 2024a. Gpt-4 technical report. Preprint, arXiv:2303.08774.
OpenAI (2024b) OpenAI. 2024b. New embedding models and api updates.
Pan et al. (2024) Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, and Irwin King. 2024. Markllm: An open-source toolkit for llm watermarking. Preprint, arXiv:2405.10051.
Panickssery et al. (2024) Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. Llm evaluators recognize and favor their own generations. Preprint, arXiv:2404.13076.
Qiang et al. (2023) Jipeng Qiang, Shiyu Zhu, Yun Li, Yi Zhu, Yunhao Yuan, and Xindong Wu. 2023. Natural language watermarking via paraphraser-based lexical substitution. Artif. Intell., 317(C).
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
Sadasivan et al. (2024) Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2024. Can ai-generated text be reliably detected? Preprint, arXiv:2303.11156.
Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. Preprint, arXiv:2305.17493.
Singh and Zou (2023) Karanpartap Singh and James Zou. 2023. New evaluation metrics capture quality degradation due to llm watermarking. Preprint, arXiv:2312.02382.
Tian (2023) Edward Tian. 2023. Gptzero: An ai text detector.
Topkara et al. (2006) Umut Topkara, Mercan Topkara, and Mikhail J. Atallah. 2006. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In Proceedings of the 8th Workshop on Multimedia and Security, pages 164–174, New York, NY, USA. Association for Computing Machinery.
Venugopal et al. (2011) Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz Och, and Juri Ganitkevitch. 2011. Watermarking the outputs of structured prediction with an application in statistical machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1363–1372, Edinburgh, Scotland, UK. Association for Computational Linguistics.
Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. Preprint, arXiv:2305.17926.
Wang et al. (2022) Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. 2022. Perplexity from plm is unreliable for evaluating text quality.
Weidinger et al. (2021) Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021. Ethical and social risks of harm from language models. Preprint, arXiv:2112.04359.
Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345–358.
Xu et al. (2024) Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024. Hallucination is inevitable: An innate limitation of large language models. Preprint, arXiv:2401.11817.
Yang et al. (2023) Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. 2023. Watermarking text generated by black-box language models. Preprint, arXiv:2305.08883.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. Preprint, arXiv:2205.01068.
Zhao et al. (2023) Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. 2023. Provable robust watermarking for ai-generated text. Preprint, arXiv:2306.17439.
Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685.

Appendix A More details on the vocabulary $\mathbb{V}$ of the SecTable

In this section, we provide more details on the creation of SecTable, and address how often a word in the SecTable can be selected as a watermark word.

Filtering the SecTable vocabulary $\mathbb{V}$ :

Specifically, we restrict $\mathbb{V}$ to only include lowercase nouns, verbs, adjectives, and adverbs that occur at least 1,000 times in the WikiText-103 training split. This results in a final vocabulary of 3,266 words.
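The filtering rule above can be sketched as a frequency-and-part-of-speech filter. This is an illustration under stated assumptions, not the released implementation: `pos_of` is a hypothetical callable standing in for whatever part-of-speech tagger is used, the tag names are placeholders, and the toy corpus uses a small `min_count` in place of the 1,000-occurrence cutoff.

```python
from collections import Counter

# Placeholder tag names for the four content-word classes kept in the vocabulary.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def build_vocab(tokens, pos_of, min_count=1000):
    """Keep lowercase alphabetic content words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return sorted(
        word
        for word, count in counts.items()
        if count >= min_count
        and word.isalpha()
        and word.islower()
        and pos_of(word) in CONTENT_POS
    )

# Toy corpus and a toy tag lookup (hypothetical stand-ins for WikiText-103 and a real tagger).
toy_tokens = ["river"] * 3 + ["River"] * 3 + ["the"] * 5 + ["quickly"] * 2 + ["rare"]
toy_pos = {"river": "NOUN", "River": "NOUN", "the": "DET", "quickly": "ADV", "rare": "ADJ"}
vocab = build_vocab(toy_tokens, toy_pos.get, min_count=2)
print(vocab)  # ['quickly', 'river']
```

In the toy run, "River" is dropped for being capitalized, "the" for its part of speech, and "rare" for its frequency.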

Frequency of words chosen as watermark words:

In Figure 3, we plot the frequency distribution of all watermark words obtained for 500 OpenGen outputs (generated with GPT-4 as the base LLM). We find that the majority of the words are only selected as watermark words for less than 5% of all outputs, while two major hub words are selected in more than 20% of the outputs. Overall, the hubness problem is not too severe, but it could be mitigated by a more careful selection of the embeddings used in the SecTable.

Appendix B Prompt for the Inserter


Given below are a piece of text and a word list. Rewrite the text to incorporate all words from the provided word list. The rewritten text must be coherent and factual. Distribute the words from the list evenly throughout the text, rather than clustering them in a single section. When rewriting the text, try your best to minimize text length increase. Only return the rewritten text in your response, do not say anything else.

Text:

Word list:

Rewritten text:

Appendix C More details on cosine similarity word matching during detection

We use the paragram word embedding model developed by Wieting et al. (2015) to perform cosine similarity word matching during detection. We find this model to be superior at distinguishing semantically related words from irrelevant words; see Table 6 for details.

	SIM(positive)	SIM(negative)
paragram	64.8	2.4
GloVe	60.7	16.4
nomic-embed	59.9	33.2
text-embedding-3-large	64.2	29.8

Table 6: Cosine similarity between embeddings of positive pairs (word + its synonym) and between negative pairs (word + irrelevant word) computed with different embedding models, averaged over 174 tuples of (word, synonym, irrelevant word).
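A minimal sketch of cosine-similarity word matching is given below. The two-dimensional toy vectors and the 0.7 threshold are assumptions chosen for illustration; an actual detector would load real word embeddings (such as paragram) and use its own threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def word_present(target, text_words, embed, threshold=0.7):
    """True if any embedded word in the text is at least `threshold`-similar to `target`."""
    target_vec = embed[target]
    return any(
        word in embed and cosine(target_vec, embed[word]) >= threshold
        for word in text_words
    )

# Toy embeddings (hypothetical values, not taken from any real model).
toy_embed = {"quit": [1.0, 0.0], "resign": [0.9, 0.1], "mountain": [0.0, 1.0]}
print(word_present("quit", ["she", "will", "resign"], toy_embed))  # True
print(word_present("quit", ["mountain"], toy_embed))               # False
```

Matching on similarity rather than exact string identity is what lets a watermark word still count as present after a paraphraser swaps it for a synonym, which is why a large gap between the positive and negative columns of Table 6 matters.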

Appendix D More details on baselines

In this section, we provide more details on how we run our baselines.

D.1 Expanded descriptions of baselines

(1) KGW (Kirchenbauer et al., 2023): Partitions the vocabulary into “green” and “red” lists based on the previous token, then boosts the probability of green tokens during generation. Detection is done by comparing the number of green tokens present to the expected count under the null hypothesis of no watermarking.
(2) Unigram (Zhao et al., 2023): A variant of KGW that uses a fixed green-red partition for all generations instead of re-partitioning the vocabulary at each token, making it more robust to editing attacks.
(3) EXP (Aaronson and Kirchner, 2022): Uses exponential sampling to embed a watermark by biasing token selection with a pseudo-random sequence during text generation. Detection measures the correlation between the generated text and the sequence to identify the watermark.
(4) EXP-Edit (Kuditipudi et al., 2024): A variant of the EXP watermark that incorporates edit distance to measure the correlation.
(5) SemStamp (Hou et al., 2023): A sentence-level algorithm that partitions the semantic space using locality-sensitive hashing with arbitrary hyperplanes, assigning binary signatures to regions and accepting sentences that fall within “valid” regions, which enhances robustness against paraphrase attacks.
(6) k-SemStamp (Hou et al., 2024): Improves upon SemStamp by using k-means clustering to partition the semantic space.
(7) SIR (Liu et al., 2024b): Generates watermark logits from the semantic embeddings of preceding tokens using an embedding language model and a trained watermark model. These logits are added to the language model’s logits. Detection works by averaging these watermark logits for each token and identifying a watermark if the average is significantly greater than zero.
(8) Blackbox (Yang et al., 2023): While all other baseline methods require access to model logits, this method focuses on the blackbox setting where only the model output is observable, similar to our assumption. It encodes words as binary bits, replaces bit-0 words with synonyms representing bit-1, and detects watermarks through a statistical test identifying the altered distribution of binary bits.
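The KGW detection test described above can be sketched as a one-proportion z-test on the green-token count. This is a simplified illustration, not the authors' implementation: the green list is decided here by hashing each (previous token, token) pair with a hypothetical key, whereas KGW seeds a random partition of the whole vocabulary from the previous token.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5, key=b"demo-key"):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(key + prev_token.encode() + b"\x00" + token.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform value in the unit interval
    return u < gamma

def z_from_counts(green, total, gamma=0.5):
    """z-statistic comparing the observed green count with gamma * total expected without a watermark."""
    return (green - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

def kgw_z_score(tokens, gamma=0.5, key=b"demo-key"):
    """Score a token sequence; the first token has no predecessor and is not scored."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, tok, gamma, key) for prev, tok in pairs)
    return z_from_counts(green, len(pairs), gamma)

print(z_from_counts(75, 100, gamma=0.5))  # 5.0: 75 green tokens out of 100 scored
```

Unwatermarked text should land near z = 0, while text generated with boosted green tokens yields a large positive z. Because the green list in KGW depends on the previous token, edits that change token neighbors can break the signal; Unigram's fixed partition avoids this.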

D.2 Hyperparameters for baselines

All baselines are run with nucleus sampling with $p=0.9$ unless otherwise specified.
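For reference, nucleus (top-p) sampling restricts sampling to the smallest set of highest-probability tokens whose cumulative probability reaches $p$. The sketch below works on a plain dictionary of token probabilities; the toy distribution is hypothetical, and real implementations operate on logits over the full vocabulary.

```python
import random

def nucleus_filter(probs, p=0.9):
    """Keep the smallest top-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def nucleus_sample(probs, p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

# Toy next-token distribution (illustrative values only).
toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(sorted(nucleus_filter(toy_probs, p=0.9)))  # ['a', 'an', 'the']
```

With $p=0.9$, the low-probability tail ("zebra" in the toy example) is never sampled.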

KGW:

We run KGW in the LeftHash configuration with $\gamma=0.5$ and $\delta=4.0$ , using the original authors’ implementation. These hyper-parameters control the size of the green token list and the strength of the watermark, respectively. While $\delta$ is typically set to $2.0$ in prior literature, we chose $\delta=4.0$ based on findings by Kirchenbauer et al. (2024). They found that $\delta=4.0$ made the watermark more robust to paraphrasing attacks in their experiments with Vicuna, a supervised instruction-finetuned model. Given that our experiments also focus on lower-entropy models aligned through RLHF or instruction tuning, we adopt the same value for $\delta$ .

Unigram:

To align with the setup of KGW, we set $\gamma=0.5$ and $\delta=4.0$ for Unigram as well. While the authors open-source their code, we ran into unexpected performance issues: on OpenGen with Llama-3-8B as the base model, Unigram could not achieve a TPR at 1% FPR higher than 70% even before any attacks. Thus, we switched to the implementation in MarkLLM (Pan et al., 2024), an open-source watermarking toolkit. With this implementation, Unigram’s TPR before attacks became close to 100% and the TPR after attacks stayed above 90%, in line with results reported in the Unigram paper (Zhao et al., 2023).
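The metric "TPR at 1% FPR" used throughout can be sketched as follows: choose the detection threshold so that at most 1% of unwatermarked texts score above it, then report the fraction of watermarked texts that exceed it. The function and the toy scores below are illustrative assumptions, not the evaluation code used in the experiments.

```python
def tpr_at_fpr(neg_scores, pos_scores, fpr=0.01):
    """True positive rate at a threshold that at most a fraction `fpr` of negatives exceed."""
    if not neg_scores or not pos_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(neg_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # number of negatives allowed above the threshold
    threshold = ranked[allowed]       # the (allowed + 1)-th largest negative score
    return sum(score > threshold for score in pos_scores) / len(pos_scores)

# Toy detection scores: 100 unwatermarked texts and 4 watermarked texts (hypothetical).
toy_neg = list(range(100))            # scores 0..99, so the threshold is 98
toy_pos = [99.5, 98, 150, 10]
print(tpr_at_fpr(toy_neg, toy_pos))   # 0.5
```

Fixing the false positive rate first makes methods with different score scales (z-scores, word-presence ratios, and so on) directly comparable.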

EXP:

We run EXP with prefix length set to $1$ using the MarkLLM implementation.

EXP-Edit:

Using the authors’ implementation, we run EXP-Edit with $\gamma=0.5$ , watermark key length = 256, block size = sequence length = 300, and number of resamples = 100. This method is run with multinomial sampling (the default setting in the authors’ code), because we find that adding a nucleus sampling logits wrapper on top significantly hurts its performance. For Llama-3-8B-Inst and Mistral-7B-Inst, we find that this method cannot reach a TPR at 1% FPR above 70% even before attacks. We tried several values for $\gamma$ , the hyperparameter that controls the statistical power of the watermark, but it did not improve the results. Increasing the number of resamples to 500 also had little effect.

Blackbox:

We run Blackbox with $\tau=0.8$ and $\lambda=0.83$ using fast detection with the authors’ implementation. Empirically, we find that fast detection offers a significant speed advantage with negligible impact on performance when compared to precise detection. On 200 OpenGen outputs with GPT-4 as the base LLM, precise detection yields a TPR of 100% before paraphrasing and 3.5% after paraphrasing, whereas fast detection yields 99% and 0.5%, respectively.

Appendix E More details on base models

In this section, we provide more details on how we run the base generator models.

Model checkpoints:

We detail the checkpoint we use for each base model in Table 7.

Model	Checkpoint
Llama-3-8B	Meta-Llama-3-8B
Llama-3-8B-Inst	Meta-Llama-3-8B-Instruct
Mistral-7B-Inst	Mistral-7B-Instruct-v0.2
GPT-4	OpenAI API (gpt-4-0613)

Table 7: Base model checkpoints.

Generation length:

For all aligned models (Llama-3-8B-Inst, Mistral-7B-Inst, and GPT-4), we generate free-form text until the model outputs an EOS (end-of-sequence) token to simulate the downstream setting. For Llama-3-8B, we set the maximum token limit to 300, as generating freely until reaching EOS often leads to meaningless repetitions, sometimes even exceeding 8,000 tokens. We do not run OPT-1.3B ourselves.

Appendix F Paraphrasing attack setup

In this section, we provide more details on the paraphrasing attack we use for all experiments.

Prompt for sentence-level paraphrasing:

We build on the prompt used by Hou et al. (2023, 2024) and include more clarification on what to return:


Given some previous context and a sentence following that context, paraphrase the current sentence. Only return the paraphrased sentence in your response.

Previous context:

Current sentence to paraphrase:

Your paraphrase of the current sentence:

Why sentence-level paraphrasing?

We choose a sentence-level paraphrasing setup for two reasons. First, Hou et al. (2023, 2024) use a sentence-level paraphrasing setup to evaluate the robustness of their method. Since we are unable to run their method directly, adopting the same paraphrasing setup allows for a fair comparison with their results. Second, as observed by Kirchenbauer et al. (2024), naively prompting GPT-3.5-Turbo to rewrite the entire input text often results in significant loss of important content. While the authors developed a sophisticated prompt to mitigate this issue, we empirically find that paraphrasing at a sentence level achieves a similar effect.
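The sentence-level setup can be sketched as a loop that paraphrases one sentence at a time while passing along some preceding context. Several details here are assumptions made for illustration: `paraphrase_fn` is a hypothetical callable wrapping the LLM prompt, the regex sentence splitter is deliberately naive, and the context is taken to be the paraphrases produced so far (the text does not specify whether original or paraphrased context is used).

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def paraphrase_by_sentence(text, paraphrase_fn):
    """Paraphrase each sentence in order, giving the paraphraser the preceding context."""
    rewritten = []
    for sentence in split_sentences(text):
        context = " ".join(rewritten)  # assumption: context = sentences paraphrased so far
        rewritten.append(paraphrase_fn(context, sentence))
    return " ".join(rewritten)

# Toy stand-in for the LLM paraphraser: upper-case the sentence and ignore the context.
def toy_paraphraser(context, sentence):
    return sentence.upper()

print(paraphrase_by_sentence("One. Two! Three?", toy_paraphraser))  # ONE. TWO! THREE?
```

Because each call rewrites a single sentence, the paraphraser has little opportunity to drop content, which is the content-loss problem noted above for whole-document rewriting.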

Appendix G PostMark runtime and API cost estimates

Runtime:

We compare the runtime of several PostMark configurations with other baselines in Table 8. Recall that in our experiments, we find insertion success rate to be higher if we divide the watermark word list into sublists of 10 words, then ask the Inserter to insert one sublist at a time. This iterative insertion process can have some negative impact on runtime, but it may become unnecessary in the future when the Inserter has better instruction-following capabilities.
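The iterative insertion procedure described above can be sketched as follows. `insert_fn` is a hypothetical callable standing in for one Inserter call (for example, an LLM prompted to rewrite the text so that it contains the given words); the toy version below simply appends the words.

```python
def chunk(words, size=10):
    """Split the watermark word list into consecutive sublists of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def insert_iteratively(text, watermark_words, insert_fn, size=10):
    """Insert one sublist at a time, feeding each rewritten text into the next call."""
    for sublist in chunk(watermark_words, size):
        text = insert_fn(text, sublist)
    return text

# Toy stand-in for the Inserter (hypothetical): append the words to the text.
def toy_inserter(text, words):
    return text + " " + " ".join(words)

toy_words = [f"w{i}" for i in range(23)]
print([len(sublist) for sublist in chunk(toy_words)])  # [10, 10, 3]
```

The number of Inserter calls grows with the length of the word list, which is consistent with the gap between PostMark@12 and PostMark@12 (no iter.) in Table 8.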

API costs:

Under the default PostMark@12 configuration with GPT-4o as the Inserter and text-embedding-3-large as the Embedder, watermarking 500 outputs of around 300 tokens each costs around $18.5 USD. This means that watermarking 100 tokens costs about $0.012 USD (roughly 1.2 cents) on average, since 18.5 / (500 × 3) ≈ 0.0123.

Method	Avg Time / Output (s)
PostMark@6	29.2
PostMark@12	36.6
PostMark@12 (no iter.)	25.3
KGW	17.5
Unigram	18.5
EXP	18.4
EXP-Edit	17.3
Blackbox	21.6

Table 8: Average time (in seconds) it takes to generate one watermarked instance with Llama-3-8B-Inst as the base LLM. Runtime is averaged over 10 outputs, with an average token count of 280. For PostMark and Blackbox, the runtime includes the time it takes for Llama-3-8B-Inst to generate the initial unwatermarked output. PostMark@12 (no iter.) refers to the setup where instead of breaking up the watermark word list into sublists and iteratively asking the Inserter to insert one sublist at a time, we directly ask the Inserter to insert all words in the list.

Appendix H PostMark length comparison

We present a comparison between output length (before and after watermarking) for various watermarking methods in Table 9.

Metric $\rightarrow$	Number of Tokens (Before / After Watermarking)
Methods $\rightarrow$	KGW	Unigram	EXP	EXP-Edit	Blackbox	PostMark@12
Llama-3-8B	239.6 / 226.6	237.6 / 250.7	232.5 / 269.8	213 / 225.7	239.6 / 244.8	239.6 / 381.2
Llama-3-8B-Inst	251.2 / 234.6	259.5 / 261.6	259 / 282.6	251.3 / 255	251.2 / 256.4	251.2 / 431
Mistral-7B-Inst	315.3 / 588.2	318 / 321	317.4 / 247.8	248.7 / 249.5	315.3 / 320.6	315.3 / 552.2
GPT-4	-	-	-	-	301.2 / 305.7	301.2 / 507.1

Table 9: Length comparison between different watermarking methods before and after watermarking, averaged over 500 OpenGen outputs.
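To make the PostMark@12 column of Table 9 easier to read, the relative length increase can be computed directly from the before/after token counts. The values in the sketch are copied from Table 9; the helper function name is our own.

```python
def relative_increase(before, after):
    """Percentage increase in length from `before` to `after`."""
    return 100.0 * (after - before) / before

# (before, after) token counts for PostMark@12, copied from Table 9.
postmark_lengths = {
    "Llama-3-8B": (239.6, 381.2),
    "Llama-3-8B-Inst": (251.2, 431.0),
    "Mistral-7B-Inst": (315.3, 552.2),
    "GPT-4": (301.2, 507.1),
}

for model, (before, after) in postmark_lengths.items():
    print(f"{model}: +{relative_increase(before, after):.1f}%")
```

By this calculation, PostMark@12 lengthens outputs by roughly 59% to 75% depending on the base model, whereas Blackbox changes length by only about 2%.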

Appendix I PostMark semantic meaning preservation

To check whether PostMark preserves the general semantic meaning of the original unwatermarked text, we compute the average cosine similarity between the embeddings of unwatermarked and watermarked outputs in Table 10, and find the similarity score to be consistently around 0.95 (reported in the table on a 0 to 100 scale).

Base LLM	SIM
Llama-3-8B	94.2
Llama-3-8B-Inst	94.8
Mistral-7B-Inst	94.6
GPT-4	95.3

Table 10: Average cosine similarity between the embeddings of unwatermarked and PostMark@12 watermarked outputs on OpenGen. Embeddings are obtained using text-embedding-3-large. Numbers are averaged over 500 pairs.

Appendix J More details on PostMark edits

Table 11 lists the five major types of edits made by PostMark during watermarking. These five categories were summarized based on a small-scale qualitative analysis of 30 watermarked OpenGen outputs.

Type	Before watermark	After watermark
Rewriting existing content
Rewording	Her decision to quit the opera, however, did not lessen the engulfing sadness which veiled her once radiant joy.	Her decision to resign from the opera, however, did not lessen the engulfing sadness which veiled her once radiant joy.
Clarification	Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years.	Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years of imprisonment.
Adding new content
Metaphors	In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life.	In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life, almost as if it wears an armor of resilience, immune to the challenges it faces.
Interpretive claims	He swiftly plants timed explosives around the warehouse, ensuring to place a few on the largest weapon caches for maximum effect.	He swiftly plants timed explosives around the warehouse, ensuring to place a few on the largest weapon caches for maximum effect. The depth of his planning was a testament to his expertise in defense tactics.
New details	Headlam had the ability to foster a culture of discipline, camaraderie and respect among the airmen under his command, reflecting his firm belief in focused team effort and mutual support.	Headlam had the ability to foster a culture of discipline, camaraderie and respect among the airmen under his command, reflecting his firm belief in focused team effort and mutual support. His attention to detail was evident in every aspect of the unit’s operations.

Table 11: Example edits made by PostMark during the watermarking process. Changes are highlighted in orange, and watermark words are in bold.

Appendix K FactScore results

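We present results from our FactScore evaluation in Table 12. Overall, the less robust methods (KGW and PostMark@6) reduce factuality the least: their FactScore drops by 2.4 and 1.9 points relative to the unwatermarked baseline, compared with 3.0 for Unigram and 2.9 for PostMark@12.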

Method	FactScore
Llama-3-8B-Inst	40.2
+ KGW	37.8
+ Unigram	37.2
+ PostMark@12	37.3
+ PostMark@6	38.3

Table 12: FactScore evaluation results based on 100 generations with Llama-3-8B-Inst as the base generator LLM. All four evaluated methods impact factuality negatively to some extent, with less robust methods causing a lesser negative impact.
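The comparison in Table 12 comes down to simple differences against the unwatermarked baseline. The short sketch below recomputes those drops from the table values; the variable names are our own and are not part of the released code.

```python
# FactScore values copied from Table 12 (Llama-3-8B-Inst, 100 generations).
factscore = {
    "Llama-3-8B-Inst": 40.2,
    "+ KGW": 37.8,
    "+ Unigram": 37.2,
    "+ PostMark@12": 37.3,
    "+ PostMark@6": 38.3,
}

baseline = factscore["Llama-3-8B-Inst"]

# Absolute drop in FactScore points relative to the unwatermarked baseline.
drops = {
    name: round(baseline - score, 1)
    for name, score in factscore.items()
    if name != "Llama-3-8B-Inst"
}

# Print methods from smallest to largest drop.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name}: -{drop:.1f} points")
```

Sorting by drop places PostMark@6 and KGW first, matching the observation that the less robust methods affect factuality the least.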

Appendix L Unigram repetitions

We present several examples of Unigram’s repetitive watermarked outputs in Table 13, generated with Llama-3-8B as the base LLM.

Prefix	Without Watermarking	After Unigram Watermarking
Unlike mountains of similar altitude elsewhere, Elbert lacks both a permanent snowpack and a prominent north-facing cirque, which can be attributed to its position among other mountains of similar height, causing it to receive relatively small quantities of precipitation. Mount Elbert was named by miners in honor of Samuel Hitt Elbert, the governor of the then-Territory of Colorado, because he brokered a treaty in September 1873 with the Ute tribe that opened up more than of reservation land to mining and railroad activity.	Mount Elbert Mountain, United States of America: Elevation: 4401 meters/ 14434 feet: USGS Map Name: Mountain Elbert: Latitude: 39° 07’ 17” N Longitude: 106° 26’ 08” W Latitude/Longitude (WGS84) 39° 7’ 17” N 106° 26’ 8” W (39.1213939, -106.4355046) Elevation: 4400+ft. Name Coordinates: Elevation Elbert, Mount: 39° 07’ 17” N 106° 26’ 08” W: 4,401 m (14,440 ft) Location: Colorado, United States: Parent peak: Uncompahgre Peak: Elevation gain: 4,419 ft (1,348 m) Coordinates: 39° 07’ 17" N, 106° 26’ 08" W; OS grid reference: Coordinates: Denomination: From the list below, please choose the region you would like to visit Elbert, Mount, and we’ll show you the times of departure and scheduled arrival, in addition to realistic information about guarantees, time change schedules, and delays for this train. Mount Elbert, the highest mountain in Colorado, stands at 4,401m.	Mount of Snow – Mt. Mt. Canyon Lake High School alumnus Steve Phelps of Leadville competes annually on Mount. The summit of Mt. Because of its high altitude of approximately Mt. Mt. Hiking Mt. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mount of Snow – Mount of Snow is peak of Mount. Holly Berry For. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. 
Mount of Snow – Mount of Snow is peak of Mount. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mt. Mount of Snow – Mount of Snow is peak of Mount. Holly Berry For. Mt. Mount of Snow – Mount of Snow is peak of Mount. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mount of Snow – Mount of Snow is peak of Mount.
We are not! Like men in the story of the Good Samaritan, they pass by on the other side…	they avoid the God-driven endeavors of the church. Many modernized men avoid the stories of the women in the Bible and opt to ignore Jesus’ directive to be be "transformed into his image," Colossians 3:10. (Not the way our men view the world through "TrophyWives Corner"!). But, this is certainly not all. Now, as of today, we will work with a modernized spin on God’s word to show you how Jesus, 2000 years ago, managed to "reformat" most men through his efforts in church. But before we start, I must thank two lovely women who were in my life in the 1970’s…women who unknowingly were reformatting the world of a confused young boy like me. I really appreciate you both, no matter how confusing you made the world seem to me then. And by the way, you will meet a couple of young lads too. I’m sure your daughters, granddaughters and nieces will relate to this. This is for all the little girls out there. Thank you guys, and keep on telling us men, just like Jesus did, what we should be like.	of Route 119. But that leaves you to cross Route 119 without adult supervision. Presuming you’ve had adult supervision all along. Which is probably true if you haven’t had all of that adult supervision that ought to have be all along. The point is that you’re here to cross Route 119 without adult supervision. Like the Good Fisherman who nets all of us into Heaven. The only catch is that you probably won’t be able to cross without being hurt. The flip side is that you can probably cross without hurting anyone else. If you give to God without doubting that you’ll get what you give to him then you’ll be able to cross Route 119 without adult supervision. Giving to God is almost like going to Heaven because what we give to him is what we’ll get back. If you give him bread then you’ll get bread. If you give him food then you’ll get food. If you give him money then you’ll get money. 
If you give him parents then you’ll get parents. If you give him teachers then you’ll get teachers. If you give him insurance then you’ll get insurance. If you give him Good Parents then you’ll get Good Parents. If you give him Good Men then you’ll get Good Men. If you give him Good Fisherman then you’ll get Good Fisherman. If you give him Good Fish then you’ll get Good Fish. If you give him Good Charismata then you’ll get Good Charismata.

Table 13: Example repetitive outputs by Unigram with Llama-3-8B-Inst as the base LLM.

Appendix M Prompt for the LLM-based pairwise evaluation setup

Please act as an impartial judge and evaluate the quality of the text completions provided by two large language models to the prefix displayed below. Assess each response according to the criteria outlined. After scoring each criterion, provide a summary of you evaluation for each response, including examples that influenced your scoring. Additionally, ensure that the order in which the responses are presented does not affect your decision. Do not allow the length of the responses to influence your evaluation. Be as objective as possible.

Criteria:
1. Relevance to the prefix
2. Coherence
3. Interestingness

Start with a brief statement about which response you think is better overall. Then, for each criterion, state which response is better, or if there is a tie, followed by a concise justification for that judgment. At the very end of your response, declare your verdict by choosing one of the choices below, strictly following the given format: "[[A]]" if assistant A is better overall, "[[B]]" if assistant B is better overall, or "[[C]]" for a tie.

[Prefix]

[Response A]

[Response B]
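As an illustration of how such a prompt might be used in practice, the sketch below fills the three slots and extracts the final verdict from the judge's reply. It is a minimal sketch under our own assumptions: `JUDGE_INSTRUCTIONS`, `build_prompt`, and `parse_verdict` are hypothetical names, the instruction text is left as a placeholder, and the call to the judge LLM itself is omitted.

```python
import re

# Placeholder for the instruction text of the prompt above (hypothetical name).
JUDGE_INSTRUCTIONS = "<instruction text from the prompt above>"

# The three slots mirror the [Prefix], [Response A], [Response B] fields.
TEMPLATE = (
    "{instructions}\n\n"
    "[Prefix]\n{prefix}\n\n"
    "[Response A]\n{response_a}\n\n"
    "[Response B]\n{response_b}"
)

# The prompt asks for a verdict formatted as [[A]], [[B]], or [[C]].
VERDICT_PATTERN = re.compile(r"\[\[([ABC])\]\]")


def build_prompt(prefix, response_a, response_b):
    """Fill the template with one prefix and two candidate completions."""
    return TEMPLATE.format(
        instructions=JUDGE_INSTRUCTIONS,
        prefix=prefix,
        response_a=response_a,
        response_b=response_b,
    )


def parse_verdict(judge_output):
    """Return 'A', 'B', or 'C' (tie), or None if no verdict is present.

    The last match is used because the prompt requests the verdict
    at the very end of the response.
    """
    matches = VERDICT_PATTERN.findall(judge_output)
    return matches[-1] if matches else None
```

For example, `parse_verdict("Response B is more coherent. [[B]]")` returns `"B"`, and a reply with no bracketed verdict yields `None` so that it can be retried or discarded.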

Appendix N Human evaluation setup and costs

Hiring annotators:

We hire two annotators from Upwork. Both annotators are fluent in English, have 100% job success rates, and have demonstrated exceptional professionalism in their communications with us.

Pairwise evaluation:

The interface we use for this task, built with Label Studio, is shown in Figure 4. For this task, we pay each annotator $2 USD per pair, and they spend around 5-10 minutes per pair.

Identifying watermark words:

The interface we use for this task is shown in Figure 5. For this task, we pay each annotator $1.5 USD per output, and they spend around 3-5 minutes on each output.
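The per-item payments and time estimates above imply an approximate hourly rate for each task. The sketch below works out that range; the hourly figures are our own derivation from the stated numbers, not values reported elsewhere in the paper, and `hourly_rate_range` is a hypothetical helper.

```python
def hourly_rate_range(pay_per_item, min_minutes, max_minutes):
    """Return the (low, high) implied hourly pay in USD.

    The low end assumes every item takes max_minutes;
    the high end assumes every item takes min_minutes.
    """
    low = pay_per_item * 60 / max_minutes
    high = pay_per_item * 60 / min_minutes
    return low, high


# Pairwise evaluation: $2 per pair, around 5-10 minutes per pair.
pairwise = hourly_rate_range(2.0, 5, 10)

# Identifying watermark words: $1.5 per output, around 3-5 minutes per output.
identify = hourly_rate_range(1.5, 3, 5)

print(f"Pairwise evaluation: ${pairwise[0]:.0f}-${pairwise[1]:.0f} per hour")
print(f"Identifying watermark words: ${identify[0]:.0f}-${identify[1]:.0f} per hour")
```

Under these assumptions, the pairwise task corresponds to roughly $12-24 per hour and the word-identification task to roughly $18-30 per hour.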

Appendix A More details on the vocabulary $\mathbb{V}$ of the SecTable

Filtering the SecTable vocabulary $\mathbb{V}$: