PostMark: A Robust Blackbox Watermark for Large Language Models

Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, Mohit Iyyer

University of Massachusetts Amherst, Google
{yapeichang,amir,miyyer}@cs.umass.edu
{kalpeshk,jwieting}@google.com
Abstract

The most effective techniques to detect LLM-generated text rely on inserting a detectable signature—or watermark—during the model’s decoding process. Most existing watermarking methods require access to the underlying LLM’s logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.

1 Introduction

Large language models (LLMs) are increasingly being deployed for malicious applications such as fake content generation. The consequences of such applications for the web as a whole are dire: modern LLMs are known to hallucinate (Xu et al., 2024), and their outputs may contain biases and artifacts that are a product of their training data (Navigli et al., 2023). If the web is flooded with millions of LLM-generated articles, how can we trust the veracity of the content we are reading? Additionally, do we want to train LLMs of the future on text generated by LLMs of the present (Shumailov et al., 2023)?

To combat this emerging problem, researchers have developed several LLM-generated text detection techniques that leverage watermarking (Aaronson and Kirchner, 2022; Kirchenbauer et al., 2023), outlier detection (Mitchell et al., 2023), trained classifiers (Tian, 2023), or retrieval-based methods (Krishna et al., 2023). Among these, watermarking methods that embed detectable signatures into model outputs tend to be the most effective and robust (Krishna et al., 2023). However, most watermarking algorithms require access to the logits of the underlying LLM, which means that they can only be implemented by individual LLM API providers such as OpenAI or Google (Yang et al., 2023). Furthermore, while these methods are able to achieve high detection rates with minimal false positives, their effectiveness goes down when the LLM-generated text is modified through paraphrasing, translation, or cropping (Krishna et al., 2023; He et al., 2024; Kirchenbauer et al., 2024).

Figure 1: The PostMark watermarking and detection procedure. Given some unwatermarked input text, we generate its embedding using the Embedder and compute its cosine similarity with all word embeddings in the SecTable, performing top-k selection and additional semantic similarity filtering to choose a list of words. Then, we instruct the Inserter to watermark the text by rewriting it to incorporate all selected words. During detection, we similarly obtain a watermark word list and check how many of these words are present in the input text.

In this work, we develop PostMark, a watermarking approach with relatively high detection rates even in the presence of paraphrasing attacks. PostMark is a post-hoc watermark: given some model-generated text, it finds words conditioned on the semantics of the text using an embedding model, then calls a separate instruction-following LLM to insert these words into the text without appreciably modifying its meaning. Unlike prior methods, PostMark only requires access to the outputs of the underlying LLM (i.e., no logits).

Overall, our contributions are threefold:

1. We propose PostMark, a novel post-hoc watermarking method that can be applied by third-party entities to outputs from an API provider like OpenAI.

2. We conduct extensive experiments across eight baseline algorithms, five base LLMs, and three datasets, showing that PostMark offers superior robustness to paraphrasing attacks compared to existing methods.

3. We verify through a human evaluation that the words inserted by PostMark during watermarking cannot be reliably detected by humans. We also conduct comprehensive quality evaluations encompassing coherence, relevance, and interestingness for various watermarking methods. Notably, we also assess factuality, an aspect that has not been evaluated in prior work. Our findings reveal that relatively robust watermarks all negatively affect factuality.

2 PostMark: a post-hoc watermark

Most existing watermarking algorithms embed the watermark during the LLM’s decoding process. For example, the watermark of Kirchenbauer et al. (2023, KGW) partitions an LLM’s vocabulary into two lists (a green list and a red list) at each decoding timestep based on a hash of the previous word, and then upweights the green tokens such that they are more likely to be sampled than red tokens. These watermarks have several issues: (1) they require access to the LLM’s logits; (2) because they rely on modifications to the next-token probability distribution, their effectiveness diminishes on LLMs that produce lower-entropy distributions, such as those that have undergone RLHF (Bai et al., 2022); and (3) they show limited robustness to paraphrasing attacks as demonstrated by our results in Section 3.2 and supported by findings from prior work (Krishna et al., 2023; Sadasivan et al., 2024).
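To make the green/red list mechanism concrete, the following is a minimal sketch of a KGW-style logit bias. It is a simplification for illustration, not the implementation of Kirchenbauer et al. (2023): the function names, the SHA-256 seeding, and the values of `gamma` (green-list fraction) and `delta` (logit boost) are all assumptions.

```python
import hashlib
import random


def green_list(prev_token, vocab, gamma=0.5):
    """Derive a pseudo-random green list from a hash of the previous token."""
    digest = hashlib.sha256(prev_token.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    shuffled = sorted(vocab)  # fixed order so the shuffle is reproducible
    rng.shuffle(shuffled)
    return set(shuffled[: int(gamma * len(shuffled))])


def bias_logits(logits, prev_token, gamma=0.5, delta=2.0):
    """Add delta to the logit of every green token; red tokens are unchanged."""
    green = green_list(prev_token, list(logits), gamma)
    return {tok: (v + delta if tok in green else v) for tok, v in logits.items()}


# Toy next-token logits over a four-word vocabulary (made-up values).
toy_logits = {"cat": 0.0, "dog": 0.0, "sat": 0.0, "ran": 0.0}
biased = bias_logits(toy_logits, prev_token="the")
```

The sketch shows why issue (1) arises: the bias is applied to the logits themselves, which a party without access to the model's next-token distribution cannot do.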

In response, we develop PostMark, a watermarking method that does not require logit access, maintains high detection rates on low-entropy models and tasks, and exhibits improved robustness to paraphrasing attacks. Unlike existing watermarks, PostMark requires access to just the text generated by the underlying LLM, not the next-token distributions. The rest of this section fully specifies PostMark’s operation.

Intuition and terminology:

At a high level, PostMark is based on the intuition that a text’s semantics should not drastically change after watermarking or paraphrasing. Thus, we can condition our watermark on a semantic embedding of the input text that ideally changes only minimally when paraphrasing is applied. To make this work, we rely on three modules: an embedding model Embedder, a secret word embedding table SecTable, and an insertion model Inserter implemented via an instruction-following LLM.

Figure 1 illustrates PostMark’s watermarking and detection pipelines. First, we generate the embedding of an input text using the Embedder. We then compute the cosine similarity between this embedding and all of the word embeddings in SecTable, performing top-$k$ selection and filtering to form a watermark word list. Next, we prompt Inserter to smoothly incorporate the selected words into the input to create the watermarked text. During detection, we follow similar steps to obtain a word list, and check how many of the words are present in the input text.

Embedding model Embedder:

The Embedder needs to be capable of projecting both words and documents into a high-dimensional latent space. In our main experiments, we use OpenAI’s text-embedding-3-large (OpenAI, 2024b), a powerful model that demonstrates strong performance on the MTEB benchmark (Muennighoff et al., 2023). However, any embedding model can be used here. In Section 3.2, we also experiment with nomic-embed (Nussbaum et al., 2024), an open-source model.

Secret word embedding table SecTable:

The core idea behind PostMark is to use an LLM to insert a list of watermark words into the input text without appreciably modifying the quality or meaning of the text, where the words in the list are selected by computing the cosine similarity between the text embedding and a word embedding table SecTable. The construction of SecTable involves two main steps, which we detail below:

> Step 1. Choosing a vocabulary $\mathbb{V}$: To decide which words to include in SecTable, we use the WikiText-103 corpus (Merity et al., 2017) as our base vocabulary. To avoid inserting arbitrary words that make little sense, we remove all function words, proper nouns, and rare words. This refined set forms our final vocabulary, $\mathbb{V}$. We provide more details on this filtering process in §A.

> Step 2. Mapping words in $\mathbb{V}$ to embeddings: To make it difficult for attackers to recover our embedding table, we construct SecTable by randomly assigning each word in the vocabulary to an embedding produced by Embedder; the resulting mapping acts as a cryptographic key.[1] More specifically, we generate a set of embeddings $\mathbb{D}$ for a collection of random documents using Embedder and then randomly map each word in $\mathbb{V}$ to a unique document embedding in $\mathbb{D}$ to produce SecTable.[2]

[1] We could also just use Embedder’s word embeddings as SecTable directly. However, this can easily be recovered by an attacker, and our experiments show that it also reduces PostMark’s effectiveness due to many words already being present in the input text even before insertion.

[2] The selection of these documents is flexible. In our experiments, we randomly sample 250-word snippets from the RedPajama dataset’s English split (Computer, 2023).
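The random word-to-document assignment in Step 2 can be sketched as follows. This is a toy illustration under stated assumptions, not the released implementation: `build_sec_table` and the `seed` argument are hypothetical names, the embeddings are made-up 3-dimensional vectors, and Python's `random.Random` stands in for whatever source of randomness a real deployment would use (it is not cryptographically secure).

```python
import random


def build_sec_table(vocab, doc_embeddings, seed):
    """Randomly assign each vocabulary word a unique document embedding."""
    if len(doc_embeddings) < len(vocab):
        raise ValueError("need at least one document embedding per word")
    rng = random.Random(seed)  # the seed plays the role of the secret key
    # Sample without replacement so that no two words share an embedding.
    chosen = rng.sample(range(len(doc_embeddings)), k=len(vocab))
    return {word: doc_embeddings[i] for word, i in zip(vocab, chosen)}


# Toy usage: three words, four candidate document embeddings.
vocab = ["armor", "immune", "resilience"]
docs = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.1], [0.0, 0.3, 0.9], [0.5, 0.5, 0.5]]
table = build_sec_table(vocab, docs, seed=1234)
```

Because the assignment depends only on the seed, the party that watermarks and the party that detects can rebuild the same table, while someone without the seed sees no relation between a word and its table embedding.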

Insertion model Inserter:

The Inserter needs to have instruction-following capabilities, and its purpose is to rewrite the input text to incorporate words from the watermark word list. We use GPT-4o (OpenAI, ) as the Inserter in our main experiments, and later show in Section 3.2 that open-source models like Llama-3-70B-Inst (AI@Meta, 2024) also show promising performance.

2.1 Inserting the watermark

> Step 1. Deciding how many words to insert: How many words should we insert into a given text? We define a hyperparameter called the insertion ratio $r$ that determines this number. The insertion ratio expresses the number of inserted words as a percentage of the input text’s word count: for example, if $r = 10\%$ and the input text has 50 words, we insert 5 words.

> Step 2. Obtaining a watermark word list: Suppose that the watermark list should contain $k$ words. To create the watermark word list given the input text $t$, we first compute the input’s embedding $e_t = \textsc{Embedder}(t)$. Next, we compute CosineSimilarity$(e_t, \textsc{SecTable})$ and select the top $k'$ most similar words, then perform semantic similarity filtering to obtain the final $k$ words.[3] We present an analysis of how frequently a word is chosen as a watermark word in §A.

[3] Due to the random nature of the word-to-embedding mapping in SecTable, the top $k'$ words might include highly irrelevant words (e.g., “hotel” in Figure 1). Thus, we refine the top-$k'$ list by selecting the top $k$ words whose actual embeddings (as determined by Embedder) are most similar to $e_t$.
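Steps 1 and 2 can be sketched together as below. The sketch makes several assumptions that the text does not specify: rounding to the nearest integer when converting the ratio to a word count, the relation between $k'$ and $k$ (passed in explicitly here), and all function names. The vectors are made-up 2-dimensional toys chosen so that “hotel” ranks first under the secret table but is dropped by the filtering stage.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def num_words_to_insert(text, r):
    """Word budget for insertion ratio r (rounding rule is an assumption)."""
    return round(r * len(text.split()))


def select_watermark_words(text_emb, sec_table, true_word_emb, k, k_prime):
    # Stage 1: rank words by similarity to their secret (random) embeddings.
    by_secret = sorted(
        sec_table, key=lambda w: cosine(text_emb, sec_table[w]), reverse=True
    )
    candidates = by_secret[:k_prime]
    # Stage 2: keep the k candidates whose real embeddings best match the text.
    by_true = sorted(
        candidates, key=lambda w: cosine(text_emb, true_word_emb[w]), reverse=True
    )
    return by_true[:k]


text_emb = [1.0, 0.0]
sec_table = {"hotel": [1.0, 0.0], "armor": [0.9, 0.1],
             "immune": [0.8, 0.2], "river": [0.0, 1.0]}
true_emb = {"hotel": [0.0, 1.0], "armor": [1.0, 0.1],
            "immune": [0.9, 0.3], "river": [1.0, 0.0]}
words = select_watermark_words(text_emb, sec_table, true_emb, k=2, k_prime=3)
```

Note that “river” would have matched the text well under its real embedding but never reaches stage 2, because only the secret table decides which words become candidates.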

> Step 3. Inserting words into the text: To watermark the text, we instruct Inserter to rewrite it via zero-shot prompting, incorporating words in the watermark word list while keeping the rewritten text coherent, factual, and concise.[4] The prompt can be found in §B.

[4] In practice, we find that dividing a long word list into sublists of 10 words each and then iteratively asking the Inserter to incorporate each sublist ensures a high insertion success rate. This may not be necessary if the Inserter has better instruction-following capabilities.
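The iterative sublist strategy from the footnote can be sketched as a simple loop. The `inserter` argument is a hypothetical callable standing in for a prompted LLM call; the toy inserter below merely appends the words so that the example runs without any API access.

```python
def chunk(words, size=10):
    """Split a word list into consecutive sublists of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def insert_watermark(text, words, inserter, size=10):
    """Ask the inserter to incorporate one sublist at a time."""
    for sublist in chunk(words, size):
        text = inserter(text, sublist)  # each call rewrites the running text
    return text


def toy_inserter(text, sublist):
    """Stand-in for an instruction-following LLM: appends the words."""
    return text + " " + " ".join(sublist)


watermark_words = [f"w{i}" for i in range(23)]
watermarked = insert_watermark("Original text.", watermark_words, toy_inserter)
```

With 23 words and sublists of 10, the inserter is called three times, on sublists of 10, 10, and 3 words.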

2.2 Detecting the watermark

During detection, given some text, the goal is to find out if the text contains a watermark. Similar to the watermarking procedure, we embed the candidate text using Embedder, form a word list, and then check how many words in the list are present in the text by computing a presence score p𝑝pitalic_p:

$$p = \frac{\left|\left\{ w \in \text{list} \ \text{s.t.}\ \exists\, w' \in \text{text},\ \text{sim}(w', w) \geq 0.7 \right\}\right|}{|\text{list}|}$$

A word $w$ is marked present in the text if the text contains a word $w'$ whose embedding cosine similarity with $w$ is at least a threshold that we set to 0.7. We choose this method over exact match to ensure additional robustness against paraphrasing.[5] If $p$ is larger than a certain detection threshold, it is likely that the text has been watermarked. As later discussed in Section 3.1, the primary metric we use to measure detection accuracy is the true positive rate at a fixed 1% false positive rate. We thus set the detection threshold to ensure a 1% FPR, the same as we do for all baselines in our main experiments.

[5] We use the paragram word embedding model developed by Wieting et al. (2015) to compute cosine similarity for this step. This model is chosen for its superior performance in assigning high similarity scores to close synonyms and low scores to unrelated words; more details are in §C.
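A minimal sketch of the presence score follows. The similarity function is passed in as a parameter; the `toy_sim` lookup below is a made-up stand-in for a real word embedding model and exists only so the example is self-contained.

```python
def presence_score(watermark_words, text_words, sim, threshold=0.7):
    """Fraction of watermark words matched by some sufficiently similar text word."""
    if not watermark_words:
        return 0.0
    present = sum(
        1
        for w in watermark_words
        if any(sim(w_prime, w) >= threshold for w_prime in text_words)
    )
    return present / len(watermark_words)


# Hypothetical similarity: exact match, one synonym pair, everything else low.
SYNONYMS = {frozenset(("armor", "armour"))}


def toy_sim(a, b):
    if a == b:
        return 1.0
    if frozenset((a, b)) in SYNONYMS:
        return 0.8
    return 0.1


word_list = ["armor", "immune", "hotel", "resilience"]
text_words = "it wears an armour of resilience".split()
p = presence_score(word_list, text_words, toy_sim)
```

Here “armor” counts as present through its near-synonym “armour”, which is the kind of substitution a paraphraser might make and the reason a soft match is preferred over exact match.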

3 Experiments

In this section, through extensive experiments on three datasets and five language models, we demonstrate that PostMark consistently outperforms both logit-free and logit-based methods in terms of robustness to paraphrasing attacks, especially on low-entropy models that have undergone RLHF alignment. Furthermore, we showcase PostMark’s modular design by testing an open-source variant, which achieves promising results.

3.1 Experimental setup

Baselines:

We compare PostMark against eight baseline algorithms; more detailed descriptions can be found in §D.

(1) KGW (Kirchenbauer et al., 2023): Partitions the vocabulary into “green” and “red” lists based on the previous token, then boosts the probability of green tokens during generation.

(2) Unigram (Zhao et al., 2023): A more robust variant of KGW that uses a fixed partition for all generations.

(3) EXP (Aaronson and Kirchner, 2022): Uses exponential sampling to bias token selection with a pseudo-random sequence.

(4) EXP-Edit (Kuditipudi et al., 2024): A variant of EXP that uses edit distance during detection.

(5) SemStamp (Hou et al., 2023): A sentence-level algorithm that partitions the sentence semantic space.

(6) k-SemStamp (Hou et al., 2024): Improves SemStamp by using k-means clustering to partition the semantic space.

(7) SIR (Liu et al., 2024b): Generates watermark logits from the semantic embeddings of preceding tokens, then adds them to the model’s logits.

(8) Blackbox (Yang et al., 2023): This method, like ours, works in a blackbox setting where only model outputs are visible. It substitutes words representing bit-0 in a binary encoding scheme with synonyms representing bit-1.

Hyperparameters:

The key hyperparameter for PostMark is the insertion ratio $r$, which controls how many words are inserted during the watermarking process. We set $r$ to 12% as preliminary experiments suggest that this value strikes a good balance between quality and robustness. Section 4.1 explores different PostMark configurations that vary $r$. In all following discussion and tables, we refer to these configurations with the naming convention “PostMark@$r$”. We carefully tune all baselines’ hyperparameters to maximize their robustness to paraphrasing; more details are in §D.

Base models:

Our experiments involve five generative models: Llama-3-8B (AI@Meta, 2024), Llama-3-8B-Inst (AI@Meta, 2024), Mistral-7B-Inst (Jiang et al., 2023), GPT-4 (OpenAI, 2024a), and OPT-1.3B (Zhang et al., 2022). Among these, Llama-3-8B-Inst, Mistral-7B-Inst, and GPT-4 have been aligned with human preferences. For details on model checkpoints and generation length, see §E. We do not run OPT-1.3B ourselves but directly use its unwatermarked outputs provided by Hou et al. (2024). Due to difficulties in running SemStamp, k-SemStamp, and SIR,[6] we apply PostMark directly to these outputs and compare our results with the published numbers in Hou et al. (2024).

[6] Their code is available but not runnable yet. We look forward to running these methods ourselves once the issues are resolved.

Datasets:

Our main experiments use three datasets: (1) OpenGen, a dataset collected by Krishna et al. (2023) designed for open-ended generation that consists of two-sentence chunks sampled from the validation set of WikiText-103; (2) LFQA, a dataset collected by Krishna et al. (2023) for long-form question answering that contains questions sampled from the r/explainlikeimfive subreddit that span multiple domains; and (3) RealNews (Raffel et al., 2020), a subset of the C4 dataset that includes news articles gathered from a wide range of reliable news websites.

Paraphrasing attack setup:

Following prior work (Hou et al., 2023, 2024; Kirchenbauer et al., 2024; Liu et al., 2024b), we use GPT-3.5-Turbo as our paraphraser. We use a sentence-level paraphrasing approach where the model iterates through each sentence of the input text, using all preceding context to paraphrase the current sentence. See §F for more details on this setup.
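The sentence-level loop can be sketched as follows. Two points are assumptions rather than details from the text: the regex sentence splitter, and the choice to pass the already-paraphrased preceding sentences as context (the original preceding sentences would be an equally plausible reading). `paraphrase_fn` is a hypothetical callable standing in for the GPT-3.5-Turbo call.

```python
import re


def paraphrase_by_sentence(text, paraphrase_fn):
    """Paraphrase one sentence at a time, giving the model all preceding context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rewritten = []
    for sentence in sentences:
        context = " ".join(rewritten)  # everything paraphrased so far
        rewritten.append(paraphrase_fn(context, sentence))
    return " ".join(rewritten)


# Toy paraphraser: records the context it received and uppercases the sentence.
seen_contexts = []


def toy_paraphraser(context, sentence):
    seen_contexts.append(context)
    return sentence.upper()


attacked = paraphrase_by_sentence("One is here. Two is there! Three?", toy_paraphraser)
```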

Metric for measuring detection performance:

In addition to a high true positive rate, a low false positive rate is critical for LLM-generated text detection. Thus, following prior detection work (Krishna et al., 2023; Zhao et al., 2023; Hou et al., 2023, 2024; Liu et al., 2024b), we use TPR at 1% FPR as our primary metric.
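One way to compute TPR at a fixed FPR from detection scores is sketched below. It assumes that higher scores mean “more likely watermarked” and that a text is flagged when its score is strictly above the threshold; the function names and the toy scores are illustrative only.

```python
def threshold_at_fpr(negative_scores, target_fpr=0.01):
    """Threshold such that at most target_fpr of negatives score strictly above it."""
    if not negative_scores:
        raise ValueError("need at least one unwatermarked score")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we may flag; the epsilon guards against float error.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def tpr_at_fpr(positive_scores, negative_scores, target_fpr=0.01):
    """Share of watermarked texts flagged at the calibrated threshold."""
    thr = threshold_at_fpr(negative_scores, target_fpr)
    return sum(s > thr for s in positive_scores) / len(positive_scores)


# Toy scores: 100 unwatermarked texts, 4 watermarked texts.
negatives = [i / 100 for i in range(100)]
positives = [0.995, 0.99, 0.985, 0.5]
thr = threshold_at_fpr(negatives)
tpr = tpr_at_fpr(positives, negatives)
```

In this toy case exactly one of the 100 unwatermarked scores lies above the threshold, so the FPR is 1%, and three of the four watermarked scores are flagged.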

Metric: TPR at 1% FPR (before paraphrasing / after paraphrasing)

| Model | Dataset | Avg Entropy | PostMark@12 | Blackbox | KGW | Unigram | EXP | EXP-Edit | SIR | SemStamp | k-SemStamp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | OpenGen | 3.6 | 99.7 / 63.5 | 81.2 / 2.2 | 100 / 74.8 | 99.8 / 93.4 | 99.8 / 36.6 | 97.3 / 73.3 | - | - | - |
| Llama-3-8B | LFQA | 3.5 | 97.8 / 72.5 | 82.8 / 1.6 | 99.8 / 25.6 | 99.8 / 79.6 | 99.8 / 12.4 | 83 / 41 | - | - | - |
| Llama-3-8B-Inst | OpenGen | 1.6 | 99.4 / 46.4 | 91.8 / 1 | 98.2 / 21.6 | 99.6 / 41.4 | 99.6 / 4.8 | 47.8 / 2.2 | - | - | - |
| Llama-3-8B-Inst | LFQA | 1.3 | 96 / 65.7 | 86.2 / 3 | 85.8 / 19 | 98.6 / 31.8 | 98.4 / 0.6 | 21.1 / 0.6 | - | - | - |
| Mistral-7B-Inst | OpenGen | 1.4 | 99.2 / 69.2 | 98.4 / 0.4 | 100 / 16 | 99.8 / 56 | 99.4 / 5 | 33 / 1.5 | - | - | - |
| Mistral-7B-Inst | LFQA | 1.1 | 99.6 / 56.4 | 89.8 / 0.4 | 99.4 / 23.6 | 97.2 / 41.2 | 97.4 / 0.8 | 20.1 / 2.1 | - | - | - |
| GPT-4 | OpenGen | - | 99.4 / 59.4 | 99.4 / 1.4 | - | - | - | - | - | - | - |
| GPT-4 | LFQA | - | 99.4 / 65 | 99.2 / 0.4 | - | - | - | - | - | - | - |
| OPT-1.3B | RealNews | 3.6 | 98.2 / 67.2 | - | - | - | - | - | 99.4 / 24.7 | 93.9 / 33.9 | 98.1 / 55.5 |

Table 1: Comparison of PostMark and baselines. All numbers are computed over 500 generations. Each entry shows the TPR at 1% FPR before paraphrasing and after paraphrasing. The “Avg Entropy” column shows the average token-level entropy (in bits) of each model on each dataset.

3.2 Results

We present our main experimental results on robustness to paraphrasing attacks in Table 1, and discuss our main findings below. Runtime analysis and API cost estimates can be found in §G.

PostMark is an effective and robust watermark.

PostMark consistently achieves a high TPR before paraphrasing ($> 90\%$), outperforming baselines like Blackbox, KGW, and EXP-Edit. Additionally, PostMark achieves higher TPR after paraphrasing compared to other baselines, including Blackbox, the only other method that operates under the same logit-free condition. The only setting in which PostMark is not the most robust method under paraphrasing is with the base Llama-3-8B model, where Unigram exhibits more robustness. We note that Unigram is much more vulnerable to reverse-engineering than PostMark because it uses a fixed green/red list partition for all inputs, which can be exploited with repetition attacks.[7] Unigram’s effectiveness diminishes with low-entropy models, and in Section 4, we also observe Unigram’s severe negative impact on text quality. Finally, the bottom of Table 1 shows that PostMark is more robust than the three baselines that also condition on input semantics: SemStamp, k-SemStamp, and SIR.

[7] For Unigram, detection works by comparing the number of green tokens present in the input text to the expected count under the null hypothesis of no watermarking. The adversary can pick a word “apple” and submit a long repeating sequence of this word (e.g., “apple apple apple…”) to the watermark detection service. If it says this sequence is watermarked, then “apple” must be in the green list.
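The repetition attack described in the footnote can be illustrated with a toy fixed-partition detector. Everything here is an assumption made for the demonstration: the one-sided z-test, the green fraction `gamma`, the threshold of 4, whitespace tokenization, and the tiny secret green list. It is not the detector of Zhao et al. (2023).

```python
import math


def make_detector(green_words, gamma=0.5, z_threshold=4.0):
    """Toy detector: flags text whose green-token count is far above gamma * n."""
    green = set(green_words)

    def detect(text):
        tokens = text.split()
        n = len(tokens)
        if n == 0:
            return False
        hits = sum(tok in green for tok in tokens)
        z = (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
        return z > z_threshold

    return detect


def probe_green_words(detect, candidates, repeats=50):
    """Query the detector with each word repeated; flagged words must be green."""
    return {w for w in candidates if detect(" ".join([w] * repeats))}


secret_green = {"apple", "river"}
detect = make_detector(secret_green)
recovered = probe_green_words(detect, ["apple", "banana", "river", "stone"])
```

Since the partition never changes, each query reveals the colour of one word, so the attacker can recover the list with one query per vocabulary entry.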

Logit-based baselines perform worse on low-entropy models and tasks, while PostMark stays relatively unaffected.

Results from Table 1 demonstrate that logit-based baselines (i.e., all baselines except Blackbox) generally perform worse on aligned models (Llama-3-8B-Inst and Mistral-7B-Inst) compared to the non-aligned Llama-3-8B, and worse on LFQA than on OpenGen. This performance difference is consistent with findings from prior work (Kuditipudi et al., 2024) and can be attributed to the lower entropy of aligned models resulting from RLHF or instruction-tuning, as well as the inherently lower entropy of the LFQA task. The “Avg Entropy” column of Table 1 illustrates these entropy differences. In contrast, PostMark consistently outperforms all baselines in terms of robustness against paraphrasing attacks in these low-entropy scenarios.

Open-weight PostMark shows promise.

While our main experiments use GPT-4o as the Inserter and OpenAI’s text-embedding-3-large as the Embedder, we show in Table 2 that an open-weight combination of Llama-3-70B-Inst and nomic-embed can also achieve promising robustness to paraphrasing attacks. The modular design of PostMark allows for flexible experimentation with various components. As each module’s capabilities advance, PostMark’s robustness will likewise improve.

| PostMark@12 Impl. | TPR at 1% FPR |
|---|---|
| Closed | 99.4 / 59.4 |
| Open | 100 / 52.1 |

Table 2: TPR at 1% FPR before and after paraphrasing. The open-source implementation of PostMark@12 with nomic-embed as the Embedder and Llama-3-70B-Inst as the Inserter shows promising performance on OpenGen with GPT-4 as the base LLM.

4 Impact of watermarking on text quality

| Type | Before watermark | After watermark |
|---|---|---|
| Clarification | Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years. | Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years of imprisonment. |
| Metaphors | In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life. | In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life, almost as if it wears an armor of resilience, immune to the challenges it faces. |

Table 3: Example edits made by PostMark during the watermarking process. Changes are highlighted in orange, and watermark words are in bold. More examples can be found in §J.

PostMark modifies text during watermarking by inserting new words, which often results in longer watermarked text while preserving its semantic meaning.[8] To assess the preservation of semantic meaning, we compare the embeddings of watermarked text with those of the input text and consistently find a high cosine similarity of around 0.95. More details on this can be found in §I. Table 3 shows several common types of edits made by PostMark during watermarking.[9] Although edits adding new content are expected to hurt quality, this quality degradation is not unique to PostMark. Prior work has found that all watermarking methods negatively affect text quality to some extent (Singh and Zou, 2023). For logit-based methods like KGW, quality degradation occurs because relevant words can be downweighted during decoding. While existing papers on watermarking often lack extensive quality evaluations, we conduct both automatic and human evaluations to assess the quality of watermarked text (relevance, coherence, interestingness, and factuality) in this section.

[8] A full table of length comparison is in §H.

[9] Summarized based on a small-scale qualitative analysis.

Setting up quality evaluations:

Prior work on watermarking has predominantly used perplexity as a measure for text quality (Kirchenbauer et al., 2023; Zhao et al., 2023; Yang et al., 2023; Liu et al., 2024b; Hu et al., 2024; Hou et al., 2023, 2024). However, perplexity alone has been shown to be an unreliable indicator of quality (Wang et al., 2022). Some studies have explored alternative methods, such as LLM-based evaluations (Singh and Zou, 2023) and human assessments (Kirchenbauer et al., 2024). Here, we evaluate the quality of watermarked text using automated and human evaluations, aiming to address four key questions:

> Q1: How does PostMark compare to other baselines in terms of impact on text quality?

> Q2: What is the quality-robustness trade-off for PostMark?

> Q3: How often do humans think that PostMark watermarked texts are at least as good as their unwatermarked versions?

> Q4: Are words inserted by PostMark easily detectable by humans?

4.1 Automatic evaluation

In this section, we compare PostMark with other baselines regarding impact on quality (Q1) and address the quality-robustness trade-off of PostMark (Q2).

Pairwise preference evaluation setup:

We adopt the LLM-as-a-judge (Zheng et al., 2023) setup to perform a pairwise comparison task. We choose GPT-4-Turbo as our judge as it is a high-ranked evaluator model on the Reward Bench leaderboard (Lambert et al., 2024)[10] that we can easily access. Given 100 OpenGen prefixes and corresponding pairs of anonymized unwatermarked and watermarked responses, the model evaluates each pair and chooses which response it prefers, where ties are allowed. The model is instructed to consider the relevance, coherence, and interestingness of the responses when making a judgment. The full prompt can be found in §M. We then report the soft win rate of various baselines in Table 4 and of several PostMark configurations in Table 5; the soft win rate equals the number of ties plus the number of wins for the watermarked response.

[10] The current leaderboard is hosted on Hugging Face. GPT-4-Turbo’s high ranking indicates that it is a relatively robust and reliable LLM evaluator.
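The soft win rate can be computed from a list of judge verdicts as sketched below. The label strings and the 40/24/36 split are made up for illustration and are not a breakdown reported in the paper; the function returns a percentage, which coincides with the raw count when there are exactly 100 pairs.

```python
def soft_win_rate(judgments):
    """Percentage of pairs where the watermarked response wins or ties."""
    if not judgments:
        raise ValueError("no judgments given")
    favorable = sum(j in ("watermarked", "tie") for j in judgments)
    return 100.0 * favorable / len(judgments)


# Hypothetical verdicts for 100 pairs: 40 wins, 24 ties, 36 losses.
verdicts = ["watermarked"] * 40 + ["tie"] * 24 + ["unwatermarked"] * 36
rate = soft_win_rate(verdicts)
```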

Factuality evaluation setup:

To assess factuality, an essential aspect not addressed in the previous pairwise comparisons or previous watermarking research, we use FactScore (Min et al., 2023), an automatic metric that measures the percentage of atomic claims in an LLM-generated biography that are supported by Wikipedia. We generate biographies for the entities in the FactScore dataset and compare the FactScores of the outputs before and after watermarking. Before watermarking, Llama-3-8B-Inst achieves a score of 40.2. We then run KGW, Unigram, PostMark@12, and PostMark@6, resulting in scores of 37.8, 37.2, 37.3, and 38.3, respectively. The full table is in §K.

Metric: Soft Win Rate

| Model | KGW | Unigram | EXP | EXP-Edit | Blackbox | PostMark@12 |
|---|---|---|---|---|---|---|
| Llama-3-8B | 37 | 17 | 23 | 49 | 45 | 74 |
| Llama-3-8B-Inst | 52 | 52 | 59 | 57 | 55 | 68 |
| Mistral-7B-Inst | 57 | 54 | 49 | 54 | 46 | 64 |
| GPT-4 | - | - | - | - | 53 | 64 |

Table 4: Soft win rates computed based on the pairwise comparison evaluation with GPT-4-Turbo as the judge, measured over 100 pairs of unwatermarked and watermarked OpenGen outputs from various LLMs (first column). PostMark@12 outperforms all baselines.

| Configuration | Soft Win Rate | TPR After Para. |
|---|---|---|
| PostMark@6 | 84 | 20.8 |
| PostMark@8 | 79 | 28.2 |
| PostMark@12 | 64 | 59.4 |
| PostMark@15 | 67 | 61.9 |
| PostMark@20 | 62 | 82.8 |
| PostMark@30 | 55 | 98 |

Table 5: Quality-robustness trade-off. All soft win rates are averaged over 100 pairs of unwatermarked and watermarked texts judged by GPT-4-Turbo. All paraphrased TPR numbers at 1% FPR are computed over 500 OpenGen instances.

> Q1: PostMark does not affect quality as much as other baselines.

Results from Table 4 show that PostMark performs exceptionally well in pairwise comparisons across models. In contrast, although Unigram is strongly robust to paraphrasing (sometimes even outperforming PostMark when tested on Llama-3-8B), it has significantly lower soft win rates, especially on Llama-3-8B (17%). This low score is likely due to frequent repetitions in Unigram outputs, as detailed in §L. Regarding factuality, KGW, Unigram, and PostMark@12 all show similar levels of negative impact, as their FactScores are 37.8, 37.2, and 37.3, respectively.

> Q2: Inserting more words enhances robustness but hurts quality, and vice versa.

We first use the pairwise comparison setup to evaluate the quality-robustness trade-off of PostMark with $r$ set to six different values: 6, 8, 12, 15, 20, and 30. Results in Table 5 reveal a strong inverse correlation between quality and robustness, with a Pearson coefficient of -0.98. PostMark@6 also achieves a higher FactScore (38.3) than PostMark@12 (37.3). In practical applications, the choice of $r$ should be based on the desired balance between quality and robustness.
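As a check on the reported correlation, the Pearson coefficient can be recomputed from the six rows of Table 5. The helper below is a plain textbook implementation written for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Table 5 values for r = 6, 8, 12, 15, 20, 30.
soft_win_rates = [84, 79, 64, 67, 62, 55]
tpr_after_paraphrase = [20.8, 28.2, 59.4, 61.9, 82.8, 98]
rho = pearson(soft_win_rates, tpr_after_paraphrase)
```

Rounded to two decimals, `rho` comes out at -0.98, matching the value stated above.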

4.2 Human evaluation

While LLM-based evaluators serve as good proxies for human judgments in several cases (Zheng et al., 2023), their results should be interpreted with caution, as they can be biased to certain aspects of the text such as length (Wang et al., 2023) or overlap between the generator and the judge model (Panickssery et al., 2024). Thus, we hire two annotators fluent in English and conduct two human annotation studies detailed below, addressing Q3 and Q4. More details on annotator qualifications, payment, and each annotation setup can be found in §N.

Figure 2: Pairwise preference human evaluation results on PostMark@12 and PostMark@6. For both configurations, the watermarked text is at least as good as its unwatermarked counterpart the majority of the time in all aspects.

> Q3: PostMark watermarked texts are at least as good as their unwatermarked counterparts the majority of the time.

We first evaluate the impact of PostMark on quality through a pairwise comparison task, similar to the setup in Section 4.1. Each annotator reads 20 OpenGen prefixes and the corresponding pairs of anonymized watermarked and unwatermarked responses generated by GPT-4. We then ask them to indicate their preferred response overall, as well as their preferences in terms of relevance, coherence, and interestingness, allowing for ties. Results in Figure 2 indicate that for PostMark@12 and PostMark@6, watermarked responses are at least as good as their unwatermarked counterparts the majority of the time (i.e., total percentage of wins and ties $\geq 50\%$). As expected, reducing the insertion rate to 6% improves quality, especially in the coherence aspect.[11] To put things in perspective, a previous human evaluation study by Kirchenbauer et al. (2024) found that annotators preferred KGW-watermarked text over unwatermarked text only 38.4% of the time.

[11] While soft win rates computed from human annotations are much lower than those from GPT-4-Turbo’s judgments, both judges agree that a smaller $r$ improves quality.

> Q4: Annotators struggle to identify the words inserted by PostMark.

A primary concern with PostMark is whether the words inserted into the watermarked text will be conspicuous enough for humans to identify, making it easy for attackers to remove them. To measure this, we create an anonymized mixture of 20 unwatermarked and 20 watermarked responses generated for 20 prefixes in OpenGen with GPT-4 as the base LLM. (Footnote 12: We include unwatermarked responses in this evaluation as a baseline. For fairness, we regenerated unwatermarked texts to roughly match the length of the watermarked texts. Footnote 13: These 20 prefixes are different from the ones annotators see in the pairwise comparison evaluation.) We then ask annotators to highlight out-of-place words that they think might have been inserted post-hoc after the initial generation. Overall, annotators achieve an F1 of merely 0.06 (0.46 precision, 0.03 recall). On average, they highlight 2.2 words in each unwatermarked response, and 3.45 words in each watermarked response. Thus, even when annotators are aware of the insertion of words, they cannot pinpoint the specific words.
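The reported F1 follows from the stated precision and recall as their harmonic mean. The short check below is only a worked illustration of that arithmetic; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the annotators.
print(round(f1_score(0.46, 0.03), 2))  # 0.06
```

Because recall is so low, the harmonic mean stays near zero even though almost half of the highlighted words are true watermark words.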

5 Related work

Early research on watermarking:

Our work is relevant to early work on watermarking text documents, either using the text document image (Brassil et al., 1995; Low et al., 1998), syntactic transformations (Atallah et al., 2001; Meral et al., 2009), or semantic changes (Atallah et al., 2003; Topkara et al., 2006). Later work also explores watermarking machine-generated text (Venugopal et al., 2011).

Watermarking LLM-generated text:

Recent research has primarily focused on watermarking LLM-generated outputs. Most existing approaches operate in the whitebox setting, assuming access to model logits and the ability to modify the decoding process (Fang et al., 2017; Kaptchuk et al., 2021; Aaronson and Kirchner, 2022; Kirchenbauer et al., 2023; Zhao et al., 2023; Liu et al., 2024a, b) or inject detectable signals without altering the original token distribution (Christ et al., 2023; Kuditipudi et al., 2024). Alternatively, Hou et al. (2023, 2024) watermark at the sentence level via rejection sampling. Prior blackbox methods access only model outputs (like PostMark), but rely on simple lexical substitution (Abdelnabi and Fritz, 2021; Qiang et al., 2023; Yang et al., 2023; Munyer et al., 2024).

Evading watermark detection:

Our work also relates to prior work on text editing attacks designed to evade watermark detection. He et al. (2024) propose a cross-lingual attack, while Kirchenbauer et al. (2024) study a copy-paste attack that embeds watermarked text into a larger human-written document. Krishna et al. (2023) train a controllable paraphraser that allows for control over lexical and syntactic diversity. Sadasivan et al. (2024) design a recursive paraphrasing attack that repeatedly rewrites watermarked text. Similar to our work, several studies directly prompt an instruction-following LLM to paraphrase text (Zhao et al., 2023; Hou et al., 2023, 2024; Liu et al., 2024b; Kirchenbauer et al., 2024).

Quality-robustness trade-off:

Relevant to our discussion in Section 4, several recent papers highlight the impact of watermarking on quality. In line with our conclusions, Singh and Zou (2023) and Molenda et al. (2024) both find that less robust watermarks tend to have less negative impact on text quality.

6 Conclusion

We propose PostMark, a novel watermarking approach that only requires access to the underlying model’s outputs, making it applicable by third-party entities to outputs from API providers. Through extensive experiments across eight baseline algorithms, five base LLMs, and three datasets, we show that PostMark is more robust to paraphrasing attacks than existing methods. We conduct a human evaluation to show that words inserted by PostMark are not easily identifiable by humans. We further run comprehensive quality evaluations covering coherence, relevance, interestingness, and factuality, and find that PostMark preserves text quality relatively well. Future work could look into further optimizing each of the three modules in PostMark, evaluating PostMark on attacks other than paraphrasing, or making logit-based methods less entropy-dependent.

Limitations

In this section, we address the primary limitations of our work.

Other attacks:

Our work focuses on evaluating robustness of various watermarking methods against paraphrasing attacks. However, there are many other interesting and practical attacks that we do not consider, such as the copy-paste attack and the recursive paraphrasing attack discussed in Section 5. We anticipate that PostMark will be less effective when the watermarked text is embedded in a larger human-written document or when it undergoes repeated paraphrasing, similar to other watermarking methods. We leave the exploration of these other types of attacks to future work.

Runtime and API costs:

The PostMark implementation used in all our main experiments relies on closed-source models from OpenAI (text-embedding-3-large and GPT-4o). As a result, the runtime and costs of running PostMark are heavily dependent on the API provider. Our cost estimate in §G ($18.5 USD for 500 outputs of around 300 tokens each) suggests that watermarking 100 tokens with the default PostMark@12 configuration costs around 1.2 cents ($0.012 USD). However, the framework is highly flexible in terms of module selection. In fact, as demonstrated in Section 3.2, an open-source implementation can perform nearly as well as the closed-source version. We leave the optimization of open-source implementations of PostMark to future work.

Ethical considerations

Our human study was determined exempt by IRB review. All annotators have consented to the release of their annotations, and we ensured they were fairly compensated for their valuable contributions. Scientific artifacts are implemented for their intended usage. The risks associated with our framework are no greater than those already present in the large language models it utilizes (Weidinger et al., 2021).

Acknowledgments

We extend special gratitude to the Upwork annotators for their hard work. This project was partially supported by awards IIS-2202506 and IIS-2312949 from the National Science Foundation (NSF).

References

Appendix A More details on the vocabulary $\mathbb{V}$ of the SecTable

In this section, we provide more details on the creation of SecTable, and address how often a word in the SecTable can be selected as a watermark word.

Filtering the SecTable vocabulary $\mathbb{V}$:

Specifically, we restrict $\mathbb{V}$ to only include lowercase nouns, verbs, adjectives, and adverbs that occur at least 1,000 times in the WikiText-103 training split. This results in a final vocabulary of 3,266 words.
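The filtering rule can be sketched as follows, assuming the corpus has already been tokenized and part-of-speech tagged into `(word, tag)` pairs. The coarse tag names, the function name, and the toy corpus are assumptions for illustration; the tagger actually used is not specified here.

```python
from collections import Counter

# Coarse POS tags kept in the vocabulary (tag names are an assumption).
ALLOWED_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def build_vocab(tagged_tokens, min_count=1000):
    """Keep lowercase content words that occur at least `min_count` times.

    `tagged_tokens` is an iterable of (word, pos_tag) pairs from a
    pre-tagged corpus (a hypothetical input format for this sketch).
    """
    counts = Counter(
        word
        for word, pos in tagged_tokens
        if word.islower() and pos in ALLOWED_POS
    )
    return sorted(word for word, count in counts.items() if count >= min_count)


# Toy corpus with a small threshold, for illustration only.
toy_corpus = (
    [("river", "NOUN")] * 3
    + [("River", "NOUN")] * 5   # not lowercase, so excluded
    + [("the", "DET")] * 9      # wrong part of speech, so excluded
    + [("quickly", "ADV")] * 2  # too rare at min_count=3
)
print(build_vocab(toy_corpus, min_count=3))
```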

Frequency of words chosen as watermark words:

In Figure 3, we plot the frequency distribution of all watermark words obtained for 500 OpenGen outputs (generated with GPT-4 as the base LLM). We find that the majority of the words are only selected as watermark words for less than 5% of all outputs, while two major hub words are selected in more than 20% of the outputs. Overall, the hubness problem is not too severe, but it could be mitigated by a more careful selection of the embeddings used in the SecTable.

Figure 3: Watermark word frequency distribution over 500 OpenGen outputs. The majority of the words are chosen as watermark words less than 5% of the time. There are only two major hub words that are selected more than 20% of the time.
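A selection-rate analysis of this kind can be sketched as below: for each word, count the fraction of outputs whose watermark word list contains it, then flag words above a hub threshold. The function names and toy lists are illustrative assumptions; the 20% threshold mirrors the figure caption.

```python
from collections import Counter


def selection_rates(watermark_lists):
    """Fraction of outputs in which each word appears as a watermark word."""
    n_outputs = len(watermark_lists)
    counts = Counter()
    for words in watermark_lists:
        counts.update(set(words))  # count each word at most once per output
    return {word: count / n_outputs for word, count in counts.items()}


def hub_words(watermark_lists, threshold=0.20):
    """Words selected in strictly more than `threshold` of the outputs."""
    rates = selection_rates(watermark_lists)
    return sorted(word for word, rate in rates.items() if rate > threshold)


# Toy watermark word lists for five outputs, for illustration only.
toy_lists = [["a", "b"], ["a", "c"], ["a", "d"], ["e", "f"], ["g", "h"]]
print(hub_words(toy_lists))
```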

Appendix B Prompt for the Inserter


Given below are a piece of text and a word list. Rewrite the text to incorporate all words from the provided word list. The rewritten text must be coherent and factual. Distribute the words from the list evenly throughout the text, rather than clustering them in a single section. When rewriting the text, try your best to minimize text length increase. Only return the rewritten text in your response, do not say anything else.

Text:

Word list:

Rewritten text:

Appendix C More details on cosine similarity word matching during detection

We use the paragram word embedding model developed by Wieting et al. (2015) to perform cosine similarity word matching during detection. We find this model to be better than the alternatives at separating semantically related words from irrelevant ones: it gives the highest similarity for positive pairs and by far the lowest for negative pairs (Table 6).

Model SIM(positive) SIM(negative)
paragram 64.8 2.4
GloVe 60.7 16.4
nomic-embed 59.9 33.2
text-embedding-3-large 64.2 29.8
Table 6: Cosine similarity between embeddings of positive pairs (word + its synonym) and between negative pairs (word + irrelevant word) computed with different embedding models, averaged over 174 tuples of (word, synonym, irrelevant word).
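To make the word-matching step concrete, here is a minimal sketch that treats a watermark word as present if any word in the text has a sufficiently similar embedding. The toy two-dimensional vectors, the function names, and the threshold value of 0.7 are assumptions for illustration only; they are not taken from this appendix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_present(watermark_word, text_words, embed, threshold=0.7):
    """True if some word in the text is close enough to the watermark word.

    `embed` maps words to vectors; `threshold` is an illustrative value.
    """
    target = embed[watermark_word]
    return any(
        word in embed and cosine(target, embed[word]) >= threshold
        for word in text_words
    )


# Toy embeddings (hypothetical values, not real paragram vectors).
toy_embed = {
    "resign": [1.0, 0.1],
    "quit": [0.9, 0.2],
    "mountain": [-0.2, 1.0],
}
print(is_present("resign", ["she", "quit"], toy_embed))
print(is_present("resign", ["mountain"], toy_embed))
```

A model with a large gap between positive-pair and negative-pair similarity, as in Table 6, leaves more room for choosing such a threshold.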

Appendix D More details on baselines

In this section, we provide more details on how we run our baselines.

D.1 Expanded descriptions of baselines

(1) KGW (Kirchenbauer et al., 2023): Partitions the vocabulary into “green” and “red” lists based on the previous token, then boosts the probability of green tokens during generation. Detection is done by comparing the number of green tokens present to the expected count under the null hypothesis of no watermarking.
(2) Unigram (Zhao et al., 2023): A variant of KGW that uses a fixed green-red partition for all generations instead of re-partitioning the vocabulary at each token, making it more robust to editing attacks.
(3) EXP (Aaronson and Kirchner, 2022): Uses exponential sampling to embed a watermark by biasing token selection with a pseudo-random sequence during text generation. Detection measures the correlation between the generated text and the sequence to identify the watermark.
(4) EXP-Edit (Kuditipudi et al., 2024): A variant of the EXP watermark that incorporates edit distance to measure the correlation.
(5) SemStamp (Hou et al., 2023): A sentence-level algorithm that partitions the semantic space using locality-sensitive hashing with arbitrary hyperplanes, assigning binary signatures to regions and accepting sentences that fall within “valid” regions, which enhances robustness against paraphrase attacks.
(6) k-SemStamp (Hou et al., 2024): Improves upon SemStamp by using k-means clustering to partition the semantic space.
(7) SIR (Liu et al., 2024b): Generates watermark logits from the semantic embeddings of preceding tokens using an embedding language model and a trained watermark model. These logits are added to the language model’s logits. Detection works by averaging these watermark logits for each token and identifying a watermark if the average is significantly greater than zero.
(8) Blackbox (Yang et al., 2023): While all other baseline methods require access to model logits, this method focuses on the blackbox setting where only the model output is observable, similar to our assumption. It encodes words as binary bits, replaces bit-0 words with synonyms representing bit-1, and detects watermarks through a statistical test identifying the altered distribution of binary bits.
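The green-token comparison described in item (1) is commonly expressed as a one-proportion z-test: under the null hypothesis, each token is green with probability γ. The sketch below illustrates that statistic only; it is not the baseline's actual code, and the function name and example counts are ours.

```python
import math


def green_list_z_score(num_green, num_tokens, gamma=0.5):
    """z-score of the observed green-token count under the no-watermark null.

    Under the null, the green count is Binomial(num_tokens, gamma), with
    mean gamma * T and variance T * gamma * (1 - gamma).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std


# Toy counts: 140 green tokens out of 200 scored tokens.
print(round(green_list_z_score(140, 200, gamma=0.5), 3))  # 5.657
```

A text is flagged as watermarked when this z-score exceeds a chosen threshold, which in turn fixes the false positive rate.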

D.2 Hyperparameters for baselines

All baselines are run with nucleus sampling with $p = 0.9$ unless otherwise specified.

KGW:

We run KGW in the LeftHash configuration with $\gamma = 0.5$ and $\delta = 4.0$, using the original authors’ implementation. These hyper-parameters control the size of the green token list and the strength of the watermark, respectively. While $\delta$ is typically set to 2.0 in prior literature, we chose $\delta = 4.0$ based on findings by Kirchenbauer et al. (2024). They found that $\delta = 4.0$ made the watermark more robust to paraphrasing attacks in their experiments with Vicuna, a supervised instruction-finetuned model. Given that our experiments also focus on lower-entropy models aligned through RLHF or instruction tuning, we adopt the same value for $\delta$.

Unigram:

To align with the setup of KGW, we set $\gamma = 0.5$ and $\delta = 4.0$ for Unigram as well. While the authors open-source their code, we ran into unexpected performance issues: on OpenGen with Llama-3-8B as the base model, Unigram could not exceed a TPR of 70% at 1% FPR even before any attacks. Thus, we switched to the implementation in MarkLLM (Pan et al., 2024), an open-source watermarking toolkit. With this implementation, Unigram’s TPR before attacks became close to 100% and the TPR after attacks stayed above 90%, in line with results reported in the Unigram paper (Zhao et al., 2023).
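For readers unfamiliar with the metric, TPR at 1% FPR can be computed by choosing a detection threshold from scores on unwatermarked (negative) texts so that at most 1% of them are flagged, then measuring the fraction of watermarked (positive) texts above that threshold. The sketch below shows one simple way to do this; the function name and the toy scores are illustrative assumptions.

```python
def tpr_at_fpr(negative_scores, positive_scores, target_fpr=0.01):
    """TPR at the loosest threshold whose FPR on negatives is <= target_fpr.

    A text is flagged when its score is strictly above the threshold.
    """
    if not negative_scores or not positive_scores:
        raise ValueError("need at least one negative and one positive score")
    neg = sorted(negative_scores, reverse=True)
    # Number of false positives we can tolerate (small epsilon guards
    # against floating-point error in the product).
    allowed = int(target_fpr * len(neg) + 1e-9)
    # At most `allowed` negatives lie strictly above neg[allowed].
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    flagged = sum(score > threshold for score in positive_scores)
    return flagged / len(positive_scores)


# Toy scores: 100 negatives (0..99), so exactly one may be flagged.
toy_negatives = list(range(100))
toy_positives = [97, 98.5, 120, 150]
print(tpr_at_fpr(toy_negatives, toy_positives))
```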

EXP:

We run EXP with prefix length set to 1 using the MarkLLM implementation.

EXP-Edit:

Using the authors’ implementation, we run EXP-Edit with $\gamma = 0.5$, watermark key length = 256, block size = sequence length = 300, and number of resamples = 100. This method is run with multinomial sampling (the default setting in the authors’ code), because we find that adding a nucleus sampling logits wrapper on top significantly hurts its performance. For Llama-3-8B-Inst and Mistral-7B-Inst, we find that this method cannot reach a TPR at 1% FPR above 70% even before attacks. We tried several values for $\gamma$, the hyperparameter that controls the statistical power of the watermark, but it did not improve the results. Increasing the number of resamples to 500 also had little effect.

Blackbox:

We run Blackbox with $\tau = 0.8$ and $\lambda = 0.83$ using fast detection with the authors’ implementation. Empirically, we find that fast detection offers a significant speed advantage with negligible impact on performance when compared to precise detection. On 200 OpenGen outputs with GPT-4 as the base LLM, using precise detection yields TPR of 100 before paraphrasing and 3.5 after paraphrasing, whereas fast detection yields 99 and 0.5.

Appendix E More details on base models

In this section, we provide more details on how we run the base generator models.

Model checkpoints:

We detail the checkpoint we use for each base model in Table 7.

Model Checkpoint
Llama-3-8B Meta-Llama-3-8B
Llama-3-8B-Inst Meta-Llama-3-8B-Instruct
Mistral-7B-Inst Mistral-7B-Instruct-v0.2
GPT-4 OpenAI API (gpt-4-0613)
Table 7: Base model checkpoints.

Generation length:

For all aligned models (Llama-3-8B-Inst, Mistral-7B-Inst, and GPT-4), we generate free-form text until the model outputs an EOS (end-of-sequence) token to simulate the downstream setting. For Llama-3-8B, we set the maximum token limit to 300, as generating freely until reaching EOS often leads to meaningless repetitions, sometimes even exceeding 8,000 tokens. We do not run OPT-1.3B ourselves.

Appendix F Paraphrasing attack setup

In this section, we provide more details on the paraphrasing attack we use for all experiments.

Prompt for sentence-level paraphrasing:

We build on the prompt used by Hou et al. (2023, 2024) and include more clarification on what to return:


Given some previous context and a sentence following that context, paraphrase the current sentence. Only return the paraphrased sentence in your response.

Previous context:

Current sentence to paraphrase:

Your paraphrase of the current sentence:

Why sentence-level paraphrasing?

We choose a sentence-level paraphrasing setup for two reasons. First, Hou et al. (2023, 2024) use a sentence-level paraphrasing setup to evaluate the robustness of their method. Since we are unable to run their method directly, adopting the same paraphrasing setup allows for a fair comparison with their results. Second, as observed by Kirchenbauer et al. (2024), naively prompting GPT-3.5-Turbo to rewrite the entire input text often results in significant loss of important content. While the authors developed a sophisticated prompt to mitigate this issue, we empirically find that paraphrasing at a sentence level achieves a similar effect.
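A sentence-level paraphrasing loop along these lines might look like the sketch below. The paraphraser is passed in as a callable (`paraphrase_fn`, a hypothetical stand-in for an LLM API call), the regex sentence splitter is a naive assumption, and using the already-paraphrased sentences as the "previous context" is one possible reading of the setup rather than a confirmed detail. The `{context}` and `{sentence}` slots are our placeholders for the empty fields in the prompt.

```python
import re

PROMPT = (
    "Given some previous context and a sentence following that context, "
    "paraphrase the current sentence. Only return the paraphrased sentence "
    "in your response.\n\n"
    "Previous context: {context}\n"
    "Current sentence to paraphrase: {sentence}\n"
    "Your paraphrase of the current sentence:"
)


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def paraphrase_text(text, paraphrase_fn):
    """Paraphrase `text` one sentence at a time.

    `paraphrase_fn` takes a prompt string and returns the paraphrase
    (for example, a wrapper around an LLM call).
    """
    paraphrased = []
    for sentence in split_sentences(text):
        context = " ".join(paraphrased)  # assumption: context = output so far
        prompt = PROMPT.format(context=context, sentence=sentence)
        paraphrased.append(paraphrase_fn(prompt).strip())
    return " ".join(paraphrased)
```

Because each call rewrites only one sentence, the output keeps the sentence count of the input, which is one way to limit the content loss seen with whole-document rewriting.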

Appendix G PostMark runtime and API cost estimates

Runtime:

We compare the runtime of several PostMark configurations with other baselines in Table 8. Recall that in our experiments, we find insertion success rate to be higher if we divide the watermark word list into sublists of 10 words, then ask the Inserter to insert one sublist at a time. This iterative insertion process can have some negative impact on runtime, but it may become unnecessary in the future when the Inserter has better instruction-following capabilities.
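The iterative insertion described above can be sketched as a simple chunk-and-loop procedure. Here `insert_fn` is a hypothetical stand-in for the call to the Inserter LLM, and the function names are ours.

```python
def chunk(words, size=10):
    """Split a word list into consecutive sublists of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [words[i:i + size] for i in range(0, len(words), size)]


def insert_iteratively(text, watermark_words, insert_fn, size=10):
    """Insert watermark words one sublist at a time.

    `insert_fn(text, sublist)` should return the text rewritten to include
    the words in `sublist` (e.g. by prompting an LLM); it is a placeholder.
    """
    for sublist in chunk(watermark_words, size):
        text = insert_fn(text, sublist)
    return text


# 25 watermark words become three Inserter calls: 10 + 10 + 5 words.
print([len(c) for c in chunk([f"w{i}" for i in range(25)])])  # [10, 10, 5]
```

The number of Inserter calls grows with the length of the watermark word list, which is why the iterative setup costs extra runtime compared with a single call.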

API costs:

Under the default PostMark@12 configuration with GPT-4o as the Inserter and text-embedding-3-large as the Embedder, watermarking 500 outputs of around 300 tokens each costs around $18.5 USD, i.e., about $0.037 per output, or about $0.012 (roughly 1.2 cents) per 100 tokens on average.

Method Avg Time / Output (s)
PostMark@6 29.2
PostMark@12 36.6
PostMark@12 (no iter.) 25.3
KGW 17.5
Unigram 18.5
EXP 18.4
EXP-Edit 17.3
Blackbox 21.6
Table 8: Average time (in seconds) it takes to generate one watermarked instance with Llama-3-8B-Inst as the base LLM. Runtime is averaged over 10 outputs, with an average token count of 280. For PostMark and Blackbox, the runtime includes the time it takes for Llama-3-8B-Inst to generate the initial unwatermarked output. PostMark@12 (no iter.) refers to the setup where instead of breaking up the watermark word list into sublists and iteratively asking the Inserter to insert one sublist at a time, we directly ask the Inserter to insert all words in the list.

Appendix H PostMark length comparison

We present a comparison between output length (before and after watermarking) for various watermarking methods in Table 9.

Number of tokens (before / after watermarking)

| Base LLM | KGW | Unigram | EXP | EXP-Edit | Blackbox | PostMark@12 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B | 239.6 / 226.6 | 237.6 / 250.7 | 232.5 / 269.8 | 213 / 225.7 | 239.6 / 244.8 | 239.6 / 381.2 |
| Llama-3-8B-Inst | 251.2 / 234.6 | 259.5 / 261.6 | 259 / 282.6 | 251.3 / 255 | 251.2 / 256.4 | 251.2 / 431 |
| Mistral-7B-Inst | 315.3 / 588.2 | 318 / 321 | 317.4 / 247.8 | 248.7 / 249.5 | 315.3 / 320.6 | 315.3 / 552.2 |
| GPT-4 | - | - | - | - | 301.2 / 305.7 | 301.2 / 507.1 |
Table 9: Length comparison between different watermarking methods before and after watermarking, averaged over 500 OpenGen outputs.

Appendix I PostMark semantic meaning preservation

To check whether PostMark preserves the general semantic meaning of the original unwatermarked text, we compute the average cosine similarity between the embeddings of unwatermarked and watermarked outputs in Table 10, and find the similarity score to be consistently around 0.95 (94.2 to 95.3 on the 0-100 scale used in the table).

Base LLM SIM
Llama-3-8B 94.2
Llama-3-8B-Inst 94.8
Mistral-7B-Inst 94.6
GPT-4 95.3
Table 10: Average cosine similarity between the embeddings of unwatermarked and PostMark@12 watermarked outputs on OpenGen. Embeddings are obtained using text-embedding-3-large. Numbers are averaged over 500 pairs.

Appendix J More details on PostMark edits

Table 11 lists five major types of edits made by PostMark during watermarking. These five categories were derived from a small-scale qualitative analysis of 30 watermarked OpenGen outputs.

Rewriting existing content

Rewording
Before watermark: Her decision to quit the opera, however, did not lessen the engulfing sadness which veiled her once radiant joy.
After watermark: Her decision to resign from the opera, however, did not lessen the engulfing sadness which veiled her once radiant joy.

Clarification
Before watermark: Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years.
After watermark: Since the charges concerned violation of civil rights and not actual murder, the defendants received surprisingly light sentences, ranging from three to ten years of imprisonment.

Adding new content

Metaphors
Before watermark: In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life.
After watermark: In fact, despite Mount Elbert’s somewhat minimal precipitation, it displays a remarkable ability to sustain life, almost as if it wears an armor of resilience, immune to the challenges it faces.

Interpretive claims
Before watermark: He swiftly plants timed explosives around the warehouse, ensuring to place a few on the largest weapon caches for maximum effect.
After watermark: He swiftly plants timed explosives around the warehouse, ensuring to place a few on the largest weapon caches for maximum effect. The depth of his planning was a testament to his expertise in defense tactics.

New details
Before watermark: Headlam had the ability to foster a culture of discipline, camaraderie and respect among the airmen under his command, reflecting his firm belief in focused team effort and mutual support.
After watermark: Headlam had the ability to foster a culture of discipline, camaraderie and respect among the airmen under his command, reflecting his firm belief in focused team effort and mutual support. His attention to detail was evident in every aspect of the unit’s operations.
Table 11: Example edits made by PostMark during the watermarking process. Changes are highlighted in orange, and watermark words are in bold.

Appendix K FactScore results

We present results from our FactScore evaluation in Table 12. Overall, less robust methods (KGW and PostMark@6) have less negative impact on factuality.

FactScore
Llama-3-8B-Inst 40.2
+ KGW 37.8
+ Unigram 37.2
+ PostMark@12 37.3
+ PostMark@6 38.3
Table 12: FactScore evaluation results based on 100 generations with Llama-3-8B-Inst as the base generator LLM. All four evaluated methods impact factuality negatively to some extent, with less robust methods causing a lesser negative impact.

Appendix L Unigram repetitions

We present several examples of Unigram’s repetitive watermarked outputs in Table 13, generated with Llama-3-8B as the base LLM.

Prefix Without Watermarking After Unigram Watermarking
Unlike mountains of similar altitude elsewhere, Elbert lacks both a permanent snowpack and a prominent north-facing cirque, which can be attributed to its position among other mountains of similar height, causing it to receive relatively small quantities of precipitation. Mount Elbert was named by miners in honor of Samuel Hitt Elbert, the governor of the then-Territory of Colorado, because he brokered a treaty in September 1873 with the Ute tribe that opened up more than of reservation land to mining and railroad activity. Mount Elbert Mountain, United States of America: Elevation: 4401 meters/ 14434 feet: USGS Map Name: Mountain Elbert: Latitude: 39° 07’ 17” N Longitude: 106° 26’ 08” W Latitude/Longitude (WGS84) 39° 7’ 17” N 106° 26’ 8” W (39.1213939, -106.4355046) Elevation: 4400+ft. Name Coordinates: Elevation Elbert, Mount: 39° 07’ 17” N 106° 26’ 08” W: 4,401 m (14,440 ft) Location: Colorado, United States: Parent peak: Uncompahgre Peak: Elevation gain: 4,419 ft (1,348 m) Coordinates: 39° 07’ 17" N, 106° 26’ 08" W; OS grid reference: Coordinates: Denomination: From the list below, please choose the region you would like to visit Elbert, Mount, and we’ll show you the times of departure and scheduled arrival, in addition to realistic information about guarantees, time change schedules, and delays for this train. Mount Elbert, the highest mountain in Colorado, stands at 4,401m. Mount of Snow – Mt. Mt. Canyon Lake High School alumnus Steve Phelps of Leadville competes annually on Mount. The summit of Mt. Because of its high altitude of approximately Mt. Mt. Hiking Mt. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mount of Snow – Mount of Snow is peak of Mount. Holly Berry For. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. 
Mount of Snow – Mount of Snow is peak of Mount. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mt. Mount of Snow – Mount of Snow is peak of Mount. Holly Berry For. Mt. Mount of Snow – Mount of Snow is peak of Mount. Mount of Snow – Mount of Snow is peak of Mount. Mounts of Snow – Mount of Snow is peak of Mount. Mt. Mount of Snow – Mount of Snow is peak of Mount.
We are not! Like men in the story of the Good Samaritan, they pass by on the other side… they avoid the God-driven endeavors of the church. Many modernized men avoid the stories of the women in the Bible and opt to ignore Jesus’ directive to be be "transformed into his image," Colossians 3:10. (Not the way our men view the world through "TrophyWives Corner"!). But, this is certainly not all. Now, as of today, we will work with a modernized spin on God’s word to show you how Jesus, 2000 years ago, managed to "reformat" most men through his efforts in church. But before we start, I must thank two lovely women who were in my life in the 1970’s…women who unknowingly were reformatting the world of a confused young boy like me. I really appreciate you both, no matter how confusing you made the world seem to me then. And by the way, you will meet a couple of young lads too. I’m sure your daughters, granddaughters and nieces will relate to this. This is for all the little girls out there. Thank you guys, and keep on telling us men, just like Jesus did, what we should be like. of Route 119. But that leaves you to cross Route 119 without adult supervision. Presuming you’ve had adult supervision all along. Which is probably true if you haven’t had all of that adult supervision that ought to have be all along. The point is that you’re here to cross Route 119 without adult supervision. Like the Good Fisherman who nets all of us into Heaven. The only catch is that you probably won’t be able to cross without being hurt. The flip side is that you can probably cross without hurting anyone else. If you give to God without doubting that you’ll get what you give to him then you’ll be able to cross Route 119 without adult supervision. Giving to God is almost like going to Heaven because what we give to him is what we’ll get back. If you give him bread then you’ll get bread. If you give him food then you’ll get food. If you give him money then you’ll get money. 
If you give him parents then you’ll get parents. If you give him teachers then you’ll get teachers. If you give him insurance then you’ll get insurance. If you give him Good Parents then you’ll get Good Parents. If you give him Good Men then you’ll get Good Men. If you give him Good Fisherman then you’ll get Good Fisherman. If you give him Good Fish then you’ll get Good Fish. If you give him Good Charismata then you’ll get Good Charismata.
Table 13: Example repetitive outputs by Unigram with Llama-3-8B-Inst as the base LLM.

Appendix M Prompt for the LLM-based pairwise evaluation setup


Please act as an impartial judge and evaluate the quality of the text completions provided by two large language models to the prefix displayed below. Assess each response according to the criteria outlined. After scoring each criterion, provide a summary of you evaluation for each response, including examples that influenced your scoring. Additionally, ensure that the order in which the responses are presented does not affect your decision. Do not allow the length of the responses to influence your evaluation. Be as objective as possible.

Criteria:
1. Relevance to the prefix
2. Coherence
3. Interestingness

Start with a brief statement about which response you think is better overall. Then, for each criterion, state which response is better, or if there is a tie, followed by a concise justification for that judgment. At the very end of your response, declare your verdict by choosing one of the choices below, strictly following the given format: "[[A]]" if assistant A is better overall, "[[B]]" if assistant B is better overall, or "[[C]]" for a tie.

[Prefix]

[Response A]

[Response B]
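Since the prompt asks the judge to end with a verdict in a fixed format, the verdict can be extracted with a small parser such as the sketch below. The function name and the choice to take the last match (because the verdict is requested at the very end of the response) are our assumptions, not a description of the released evaluation code.

```python
import re

# Matches the verdict tokens requested in the prompt: [[A]], [[B]], or [[C]].
VERDICT_RE = re.compile(r"\[\[([ABC])\]\]")


def parse_verdict(response):
    """Return "A", "B", or "tie" from a judge response, or None if absent.

    The last match is used because the prompt asks for the verdict at the
    very end of the response.
    """
    matches = VERDICT_RE.findall(response)
    if not matches:
        return None
    return {"A": "A", "B": "B", "C": "tie"}[matches[-1]]


print(parse_verdict("Response A is more coherent overall. [[A]]"))
print(parse_verdict("Both are comparable on all criteria. [[C]]"))
```

Returning `None` for malformed responses lets the caller decide whether to retry the judge or discard the pair.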

Appendix N Human evaluation setup and costs

Hiring annotators:

We hire two annotators from Upwork. Both annotators are fluent in English, have 100% job success rates, and have demonstrated exceptional professionalism in their communications with us.

Pairwise evaluation:

The interface we use for this task, built with Label Studio, is shown in Figure 4. For this task, we pay each annotator $2 USD per pair, and they spend around 5-10 minutes per pair.

Identifying watermark words:

The interface we use for this task is shown in Figure 5. For this task, we pay each annotator $1.5 USD per output, and they spend around 3-5 minutes on each output.

Figure 4: Human annotation interface for the pairwise comparison task.
Figure 5: Human annotation interface for the watermark word identification task.