
A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions

Junchao Wu NLP2CT Lab, Faculty of Science and Technology
Institute of Collaborative Innovation
University of Macau
nlp2ct.junchao@gmail.com
   Shu Yang NLP2CT Lab, Faculty of Science and Technology
Institute of Collaborative Innovation
University of Macau
nlp2ct.shuyang@gmail.com
   Runzhe Zhan NLP2CT Lab, Faculty of Science and Technology
Institute of Collaborative Innovation
University of Macau
nlp2ct.runzhe@gmail.com
   Yulin Yuan Department of Chinese Language and Literature, Faculty of Arts and Humanities
University of Macau
yulinyuan@um.edu.mo
Department of Chinese Language and Literature, Faculty of Humanities
Peking University
yuanyl@pku.edu.cn
   Derek Fai Wong Yulin Yuan and Derek Fai Wong are co-corresponding authors. NLP2CT Lab, Faculty of Science and Technology
Institute of Collaborative Innovation
University of Macau
derekfw@um.edu.mo
   Lidia Sam Chao NLP2CT Lab, Faculty of Science and Technology
State Key Laboratory of Internet of Things for Smart City
University of Macau
lidiasc@um.edu.mo
Abstract

The powerful ability to understand, follow, and generate complex language that has emerged in large language models (LLMs) allows LLM-generated text to flood many areas of our daily lives at an incredible speed, and such text is widely accepted by humans. As LLMs continue to expand, there is an imperative need to develop detectors that can identify LLM-generated text. This is crucial to mitigate the potential misuse of LLMs and to safeguard realms like artistic expression and social networks from the harmful influence of LLM-generated content. LLM-generated text detection aims to discern whether a piece of text was produced by an LLM, which is essentially a binary classification task. Detection techniques have witnessed notable advancements recently, propelled by innovations in watermarking techniques, statistics-based detectors, neural-based detectors, and human-assisted methods. In this survey, we collate recent research breakthroughs in this area and underscore the pressing need to bolster detector research. We also delve into prevalent datasets, elucidating their limitations and developmental requirements. Furthermore, we analyze various LLM-generated text detection paradigms, shedding light on challenges such as out-of-distribution problems, potential attacks, real-world data issues, and the lack of an effective evaluation framework. Finally, we highlight promising directions for future research in LLM-generated text detection to advance the implementation of responsible artificial intelligence (AI). Our aim with this survey is to provide a clear and comprehensive introduction for newcomers while also offering seasoned researchers a valuable update on the field of LLM-generated text detection. Useful resources are publicly available at: https://github.com/NLP2CT/LLM-generated-Text-Detection.


1 Introduction

With the rapid development of LLMs, their text generation capabilities have reached a level comparable to human writing OpenAI (2023); Anthropic (2023); Chowdhery et al. (2022b). LLMs have permeated various aspects of daily life and have become pivotal in many professional workflows Veselovsky, Ribeiro, and West (2023), facilitating tasks such as advertising slogan creation Murakami, Hoshino, and Zhang (2023), news composition Yanagi et al. (2020), story generation Yuan et al. (2022), and code generation Becker et al. (2023); Zheng et al. (2023). A recent study by Hanley and Durumeric (2023) indicates that, from January 1, 2022, to May 1, 2023, the relative quantity of AI-generated news articles on mainstream websites rose by 55.4%, whereas on websites known for disseminating misinformation it rose by 457%. Furthermore, their impact significantly shapes the progression of numerous sectors and disciplines, including education Susnjak (2022), law Cui et al. (2023), biology Piccolo et al. (2023), and medicine Thirunavukarasu et al. (2023).

The powerful generation capabilities of LLMs have rendered it challenging for individuals to discern between LLM-generated and human-written texts, resulting in the emergence of intricate concerns. The concerns regarding LLM-generated text originate from two perspectives. Firstly, LLMs are susceptible to fabrications Ji et al. (2023), reliance on outdated information, and heightened sensitivity to prompts. These vulnerabilities can facilitate the spread of erroneous knowledge Christian (2023), undermine technical expertise Rodriguez et al. (2022a); Aliman and Kester (2021), and promote plagiarism Lee et al. (2023a). Secondly, there exists the risk of malicious exploitation of LLMs in activities such as disinformation dissemination Pagnoni, Graciarena, and Tsvetkov (2022a); Lin, Hilton, and Evans (2022), online fraudulent schemes Weidinger et al. (2021); Ayoobi, Shahriar, and Mukherjee (2023), social media spam production Mirsky et al. (2022), and academic dishonesty, especially with students employing LLMs for essay writing Stokel-Walker (2022); Kasneci et al. (2023). Concurrently, LLMs increasingly shoulder the data generation responsibility in AI research, leading to the recursive use of LLM-generated text in their own training and assessment. A recent analysis, titled Model Autophagy Disorder (MAD) (Alemohammad et al., 2023), raised alarms over this AI data feedback loop. As generative models undergo iterative improvements, LLM-generated text may gradually replace the need for human-curated training data. This could potentially lead to a reduction in the quality and diversity of subsequent models. In essence, the consequences of LLM-generated text encompass both societal Cardenuto et al. (2023) and academic Yu et al. (2023a) risks, and the use of LLM-generated data will hinder the future development of LLMs and detection technology.

However, current detection technologies for the LLM-generated text detection task, including the discriminatory capabilities Price and Sakellarios (2023) of commercial detectors, are unreliable. They are primarily biased towards classifying outputs as human-written text rather than detecting text generated by LLMs Walters (2023); Weber-Wulff et al. (2023, 2023). Detection methods that rely on humans are also unreliable and have very low accuracy, often only slightly better than random classification Uchendu et al. (2021); Dou et al. (2022); Clark et al. (2021a); Soni and Wade (2023a, b). Furthermore, the ability of humans to identify LLM-generated text is often lower than that of detectors or detection algorithms in various settings Ippolito et al. (2020); Soni and Wade (2023b). Thus, there is an imperative demand for robust detectors to identify LLM-generated text effectively. Establishing such mechanisms is pivotal to mitigating the risks of LLM misuse and fostering responsible AI governance in the LLM era Stokel-Walker and Van Noorden (2023); Porsdam Mann et al. (2023); Shevlane et al. (2023).

Research on the detection of LLM-generated text received considerable attention even before the advent of ChatGPT, especially in areas such as the early identification of deepfake text Pu et al. (2023a), machine-generated text detection Jawahar, Abdul-Mageed, and Lakshmanan (2020), and authorship attribution Uchendu, Le, and Lee (2023a). Typically, this problem is regarded as a classification task, discerning between LLM-generated text and human-written text Jawahar, Abdul-Mageed, and Lakshmanan (2020). At that stage, detection research focused predominantly on translation-generated texts and utilized simple statistical methods. The introduction of ChatGPT sparked a significant surge of interest in LLMs, heralding a paradigm shift in the research landscape. In response to the escalating challenges posed by LLM-generated text, the NLP community has intensely pursued solutions, delving into LLM-generated text detection and related attack methodologies. While Crothers, Japkowicz, and Viktor (2023a); Tang, Chuang, and Hu (2023) recently presented reviews on this topic, we argue that their treatment of detection methods lacks sufficient depth (we discuss related work in detail in subsection 3.1).

In this article, we furnish a meticulous and profound review of contemporary research on LLM-generated text detection, aiming to guide researchers through the challenges and prospective research trajectories. We investigate the latest breakthroughs, beginning with an introduction to the task of LLM-generated text detection, the underlying mechanisms of text generation by LLMs, and the sources of LLMs' enhanced text generation capabilities. We also shed light on the contexts and imperatives of LLM-generated text detection. Furthermore, we spotlight popular datasets and benchmarks for the task, exposing their current deficiencies to stimulate the creation of more refined data resources. Our discussion extends to the latest detector studies: in addition to traditional neural-based and statistics-based methods, we also cover watermarking techniques and human-assisted methods. A subsequent analysis pinpoints research limitations of LLM-generated text detectors, highlighting critical areas such as out-of-distribution challenges, potential attacks, real-world data issues, and the lack of an effective evaluation framework. Finally, we ponder potential directions for future research, aiming to facilitate the development of efficient detectors.

2 Background

2.1 LLM-generated Text Detection Task

Figure 1: Illustration of the LLM-generated text detection task. This is a binary classification task that detects whether the provided text is generated by LLMs or written by humans.

Detecting LLM-generated text is an intricate challenge. Generally speaking, humans struggle to discern between LLM-generated text and human-written text Uchendu et al. (2021); Dou et al. (2022); Clark et al. (2021a); Soni and Wade (2023a, b), and their capability to distinguish such texts exceeds random classification only slightly. Table 1 offers examples in which LLM-generated text is extremely close to human-written text and can be difficult to distinguish. When LLMs generate fabricated details, discerning their origin and veracity remains equally challenging.

Table 1: Examples of human-written text and LLM-generated text. Text generated by LLMs during normal operation and instances in which they fabricate facts often exhibit no intuitively discernible differences. When LLMs either abstain from providing an answer or craft neutral responses, certain indicators, such as the explicit statement “I am an AI language model”, may facilitate human adjudication, but such examples are rare.
Type Question Human-written LLM-generated
Normal Explain what is NLP? Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence … Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics that focuses on …
Refusal How is today special? Today’s Special is a Canadian children’s television show produced by Clive VanderBurgh at TVOntario from 1981 to 1987. I’m sorry, but I am an AI language model and do not have access to current dates or events. Is there anything else I can help you with …
Fabricated Explain what is NLP based on one publication in the recent literature. In “Natural language processing: state of the art, current trends and challenges”, NLP is summarized as a discipline that uses various algorithms, tools and methods to … NLP is a multidisciplinary field at the intersection of computer science, linguistics, and AI, as described in a recent peer-reviewed publication titled "Natural Language Processing: A Comprehensive Overview and Recent Advances" (2023) …

Recent studies Guo et al. (2023); Ma, Liu, and Yi (2023); Muñoz-Ortiz, Gómez-Rodríguez, and Vilares (2023); Giorgi et al. (2023); Seals and Shalin (2023) have highlighted significant disparities between human-written text and text generated by LLMs such as ChatGPT. The differences are not confined to individual word choice Seals and Shalin (2023) but also manifest in stylistic dimensions, such as syntactic simplicity, use of the passive voice, and narrativity. Notably, LLM-generated text often exhibits enhanced organization, logical structure, formality, and objectivity in comparison to human-written text. Additionally, LLMs frequently produce extensive and comprehensive responses, characterized by a lower prevalence of bias and harmful content, yet they occasionally introduce nonsensical or fabricated details. Linguistically, LLM-generated text tends to be about twice the length of human-written text but exhibits a more limited vocabulary. Compared with humans, LLMs use nouns, verbs, determiners, adjectives, auxiliaries, coordinating conjunctions, and particles more frequently, and adverbs and punctuation less frequently, incorporating more deterministic, conjunctive, and auxiliary structures in their syntax. LLM-generated text also often conveys less emotional intensity and a clearer presentation than human writing, a phenomenon possibly linked to an inherent positive bias in LLMs Giorgi, Ungar, and Schwartz (2021); Markowitz, Hancock, and Bailenson (2023); Mitrovic, Andreoletti, and Ayoub (2023). Although the statistical gaps vary slightly across datasets, the difference between LLM-generated and human-written text clearly exists, as the statistics of linguistic features and human visual perception are consistent with each other. Chakraborty et al. (2023b) further substantiated this view by reporting on the detectability of text generated by LLMs, including high-performance models such as GPT-3.5-Turbo and GPT-4 Helm, Priebe, and Yang (2023), while Chakraborty et al. (2023a) introduced an AI Detectability Index to rank models according to their detectability.

In this survey, we begin by providing definitions for human-written text, LLM-generated text, and the detection task.

Human-written Text

is characterized as text crafted by individuals to express thoughts, emotions, and viewpoints. This encompasses articles, poems, and reviews, among others, and typically reflects personal knowledge, cultural milieu, and emotional disposition, spanning the entirety of the human experience.

LLM-generated Text

is defined as cohesive, grammatically sound, and pertinent content generated by LLMs. These models are trained on large datasets using NLP and machine learning methodologies. The quality and fidelity of the generated text typically depend on the scale of the model and the diversity of its training data.

LLM-generated Text Detection Task

is conceptualized as a binary classification task, aiming to ascertain if a given text is generated by an LLM. The formal representation of this task is given by the subsequent equation:

D(x) = \begin{cases} 1 & \text{if } x \text{ is generated by LLMs} \\ 0 & \text{if } x \text{ is written by a human} \end{cases} \qquad (1)

where D(x) represents the detector and x is the text to be detected.
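To make the task definition concrete, below is a minimal sketch of Equation (1) in Python: any scoring function that assigns higher values to suspected LLM-generated text can be turned into a detector D(x) by thresholding. The scoring function and threshold here are illustrative placeholders only, not a method proposed in this survey.

```python
# Minimal sketch of Equation (1): wrap a scoring function into a binary detector D(x).
# The toy score and the threshold below are hypothetical, purely for illustration.
from typing import Callable

def make_detector(score_fn: Callable[[str], float], threshold: float) -> Callable[[str], int]:
    """Return D(x): 1 if the score says 'LLM-generated', 0 if 'human-written'."""
    def detector(x: str) -> int:
        return 1 if score_fn(x) >= threshold else 0
    return detector

def toy_score(text: str) -> float:
    # Placeholder signal: fraction of very common function words in the text.
    common = {"the", "of", "and", "to", "in", "a", "is", "that"}
    tokens = text.lower().split()
    return sum(t in common for t in tokens) / max(len(tokens), 1)

D = make_detector(toy_score, threshold=0.3)
print(D("Natural language processing is a field of computer science."))  # 1 or 0
```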

2.2 LLMs Text Generation and Confusion Sources

2.2.1 Generation Mechanisms of LLMs

The text generation mechanism of LLMs operates by sequentially predicting subsequent tokens. Rather than producing an entire paragraph instantaneously, LLMs methodically construct text one word at a time. Specifically, LLMs decode the next token in a textual sequence, taking into account both the input sequence and the previously decoded tokens. Assume that the total number of time steps is T, the current time step is t, the input (tokenized) sequence is X_T = {x_1, x_2, ..., x_T}, and the previously generated output sequence is Y_{t-1} = {y_1, y_2, ..., y_{t-1}}. The next output token y_t can then be expressed as:

y_t \sim P(y_t \mid Y_{t-1}, X_T) = \operatorname{softmax}(w_o \cdot h_t) \qquad (2)

where h_t is the hidden state of the model at time step t, w_o is the output projection matrix, the softmax function yields the probability distribution over the vocabulary, and y_t is sampled from this distribution P(y_t | Y_{t-1}, X_T). The final output sequence, whose joint probability factorizes over these per-step distributions, is represented as:

Y_T = \{y_1, y_2, \dots, y_T\} \qquad (3)
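As an illustration of Equations (2) and (3), the sketch below performs plain autoregressive sampling with a Hugging Face causal language model. The gpt2 checkpoint, the prompt, and the 20-step generation length are arbitrary choices for demonstration, assuming the transformers and torch libraries are installed.

```python
# Sketch of autoregressive generation: at each step, take the last position's logits,
# apply softmax to obtain P(y_t | Y_{t-1}, X_T), sample y_t, and append it to the sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # example checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Large language models generate text by", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                     # T decoding steps (arbitrary)
        logits = model(input_ids).logits[:, -1, :]          # w_o · h_t for the last position
        probs = torch.softmax(logits, dim=-1)               # distribution over the vocabulary
        y_t = torch.multinomial(probs, num_samples=1)       # sample the next token
        input_ids = torch.cat([input_ids, y_t], dim=-1)     # extend Y_{t-1} to Y_t

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```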

The quality of the decoded text is intrinsically tied to the chosen decoding strategy. Given that the model constructs text sequentially, the quality of the generated text hinges on the method used to select the next word from the vocabulary, that is, on how y_t is sampled from the probability distribution over the vocabulary. The predominant decoding techniques encompass greedy search, beam search, top-k sampling, and top-p sampling. Table 2 compares the underlying principles, merits, and drawbacks of these decoding methods; this comparison helps elucidate the text generation process of LLMs and the specific characteristics of the text they produce.

Table 2: The core ideas of different text decoding strategies, along with their advantages and disadvantages. Greedy Search uses a simple greedy strategy, considering only the current highest-probability token at each step, which is simple and fast but lacks diversity. Beam Search allows multiple candidates to be considered, which improves the quality of the text but tends to generate duplicates. Top-K Sampling increases diversity but has difficulty controlling the quality of generation. Top-P Sampling relies on the shape of the probability distribution to determine the set of tokens to sample from, which is coherent, but diversity is correlated with the parameter P.
Strategy Core Idea Advantages Drawbacks
Greedy Search Only the token with the highest current probability is considered at each step. Fast and simple. Easy to fall into local optimality, lack of diversity, unable to deal with uncertainty.
Beam Search Graves (2012) Several more candidates can be considered at each step. Improvement of text quality and flexibility. Tend to generate repetitive fragments, work poorly in open generation domains, unable to handle uncertainty.
Top-K Sampling Fan, Lewis, and Dauphin (2018) Samples among the K most likely words at each step. Increase diversity and be able to deal with uncertainty. Difficulty in controlling the quality of generation, which may result in incoherent text.
Top-P Sampling Holtzman et al. (2020) Uses the shape of the probability distribution to determine the set of tokens to sample from. Coherent and able to deal with uncertainty. Dependent on the quality of the model predictions; diversity is related to the parameter P.
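To complement Table 2, the sketch below shows the token-selection step behind greedy search, top-k sampling, and top-p (nucleus) sampling, operating on a single vector of next-token logits; beam search is omitted because it requires tracking multiple partial sequences. The values K=40 and P=0.9 are illustrative defaults, not recommendations.

```python
# Illustrative selection rules for one decoding step, given a logits vector over the vocabulary.
import torch

def greedy(logits: torch.Tensor) -> int:
    return int(torch.argmax(logits))                        # always the most likely token

def top_k_sample(logits: torch.Tensor, k: int = 40) -> int:
    values, indices = torch.topk(logits, k)                 # keep only the K most likely tokens
    probs = torch.softmax(values, dim=-1)
    return int(indices[torch.multinomial(probs, 1)])        # sample among them

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1  # smallest nucleus with mass >= p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(sorted_idx[torch.multinomial(nucleus, 1)])

logits = torch.randn(50257)  # e.g., a GPT-2-sized vocabulary
print(greedy(logits), top_k_sample(logits), top_p_sample(logits))
```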

2.2.2 Sources of LLMs’ Strong Generation Capabilities

The burgeoning growth in model size, data volume, and computational capacity has significantly enhanced the capabilities of LLMs. Beyond a specific model size, certain abilities manifest that are not predictable by scaling laws. These abilities, absent in smaller models but evident in LLMs, are termed “Emergent Abilities” of LLMs.

In-Context Learning (ICL)

The origins of ICL capabilities remain a topic of ongoing debate Dai et al. (2023). This capability introduces a paradigm in which model parameters remain unchanged and only the design of the prompt is modified to elicit desired outputs from LLMs. The concept was first introduced with GPT-3 Brown et al. (2020), whose authors posited that the presence of ICL is fundamental to the swift adaptability of LLMs across a diverse set of tasks. With just a few examples, LLMs can effectively tackle downstream tasks, obviating the earlier BERT-style approach of pretraining followed by task-specific fine-tuning Raffel et al. (2020).
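The following minimal sketch illustrates the ICL paradigm: the task is specified entirely through a prompt containing a handful of labeled examples, and no model parameters are updated. The sentiment-classification examples are invented for illustration.

```python
# Build a few-shot prompt; the resulting string would be sent to an LLM as-is,
# with no fine-tuning involved. Examples and labels are hypothetical.
few_shot_examples = [
    ("The plot was dull and the acting was worse.", "negative"),
    ("A delightful film with a moving soundtrack.", "positive"),
]
query = "I couldn't stop smiling the whole way through."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in few_shot_examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```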

Alignment of Human Preference

Although LLMs can be guided to generate content using carefully designed prompts, the resulting text might lack control, potentially leading to the creation of misleading or biased content Zhang et al. (2023b). The primary objective of these models is to predict subsequent words that form coherent sentences based on vast corpora, rather than to ensure that the generated content is both beneficial and innocuous to humans. To address these shortcomings, OpenAI introduced the Reinforcement Learning from Human Feedback (RLHF) approach, as detailed in Ouyang et al. (2022) and Lambert et al. (2022). This approach begins by fine-tuning LLMs on responses to user prompts, after which human assessors evaluate and rank the model's outputs. A reward function is then established from these assessments, and the LLM is further refined using the Proximal Policy Optimization (PPO) algorithm Schulman et al. (2017a). The end result is a model that aligns with human values, understands human instructions, and genuinely assists users.
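As a concrete illustration of the reward-modeling stage, the sketch below shows the pairwise preference loss commonly used to train reward models in RLHF, in which a preferred response should receive a higher reward than a rejected one. This is a generic formulation under our own assumptions, not necessarily the exact objective used for any particular model.

```python
# Pairwise reward-model loss: L = -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over a batch.
import torch

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy reward scores for three preference pairs (in practice these come from the reward model).
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(r_chosen, r_rejected))  # smaller when chosen responses score higher
```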

Complex Reasoning Capabilities

While LLMs’ ICL and alignment capability facilitate meaningful interactions and assistance, their performance tends to degrade in tasks demanding logical reasoning and heightened complexity. Wei et al. (2022) observed that encouraging LLMs to produce more intermediate steps through a Chain of Thought (CoT) can enhance their effectiveness. Tree of Thoughts (ToT) Yao et al. (2023) and Graph of Thoughts (GoT) Besta et al. (2023) are extensions of this methodology. Both strategies augment LLM performance on intricate tasks by amplifying the computational effort required for the model to deduce an answer.
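The sketch below shows what a chain-of-thought prompt might look like: the exemplar spells out intermediate reasoning steps so that the model is encouraged to produce its own before the final answer. The arithmetic questions are invented for illustration and are not drawn from the cited papers.

```python
# A hand-written CoT exemplar followed by a new question; sent to the LLM as a single prompt.
cot_prompt = (
    "Q: A library has 120 books and buys 3 boxes of 15 books each. How many books does it have now?\n"
    "A: 3 boxes of 15 books is 3 * 15 = 45 books. 120 + 45 = 165. The answer is 165.\n\n"
    "Q: A train travels 60 km per hour for 2.5 hours. How far does it go?\n"
    "A:"  # the model is expected to write out intermediate steps before answering
)
print(cot_prompt)
```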

2.3 Why Do We Need to Detect Text Generated by LLMs?

As LLMs undergo iterative refinements and reinforcement learning through human feedback, their outputs become increasingly harmonized with human values and preferences. This alignment facilitates broader acceptance and integration of LLM-generated text into everyday life. The emergence of various AI tools has played a significant role in fostering intuitive human-AI interactions and democratizing access to the advanced capabilities of previously esoteric models. From interactive web-based assistants like ChatGPT,111https://chat.openai.com/ to search engines enhanced with LLM technology like the contemporary version of Bing,222https://www.bing.com/ to specialized tools like Copilot,333https://github.com/features/copilot/ and SciSpace444https://typeset.io/ that assist professionals in code generation and scientific research, LLMs have subtly woven themselves into the digital fabric of our lives, propagating their content across diverse platforms.

However, it is important to acknowledge that, for the majority of users, LLMs and their applications are still considered black-box AI systems. For individual users, this often serves as a benign efficiency boost, sidestepping laborious retrieval and summarization. However, within specific contexts and against the backdrop of the broader digital landscape, it becomes crucial to discern, filter, or even exclude LLM-generated text. It is important to emphasize that not all situations call for the detection of LLM-generated text. Unnecessary detection can lead to consequences such as system inefficiencies and inflated development costs. Generally, detecting LLM-generated text might be superfluous when:

  • The utilization of LLMs poses minimal risk, especially when they handle routine, replicable tasks.

  • The spread of LLM-generated text is confined to predictable, limited domains, like closed information circles with few participants.

Drawing upon the literature reviewed in this study, the rationale behind detecting LLM-generated text can be elucidated from multiple vantage points, as illustrated in Figure 2. The delineated perspectives are, in part, informed by the insights presented in Gade et al. (2020) and Saeed and Omlin (2023).

Figure 2: The most critical reasons why LLM-generated text detection is urgently needed, discussed from five perspectives: Regulation, Users, Developments, Science, and Human Society.

While the list provided above may not be exhaustive and some facets may intersect or further delineate as LLMs and AI systems mature, we posit that these points underscore the paramount reasons for the necessity of detecting text generated by LLMs.

Regulation

As AI tools are often characterized as black boxes, the inclusion of LLM-generated text in creative endeavors raises significant legal issues. A pressing concern is the eligibility of LLM-generated texts for intellectual property rights protection, a subject still mired in debate Epstein et al. (2023); Wikipedia (2023), although the EU AI Act555https://artificialintelligenceact.eu/the-act/ has begun to evolve toward regulating the use of AI. The main challenges concern the ownership of the training data used by the AI in generating output and the question of how much human involvement is sufficient to attribute a work to its human author. A prerequisite for AI oversight and for copyright protection of AI-generated content is the ability to distinguish human creativity in the materials used to train AI systems, which would in turn promote the implementation of more complete legal regulation.

Users

LLM-generated text, refined through various alignment methods, is progressively aligning with human preferences. This content permeates numerous user-accessible platforms, including blogs and Questions & Answers (Q&A) forums. However, excessive reliance on such content can undermine user trust in AI systems and, by extension, digital content as a whole. In this context, the role of LLM-generated text detection becomes crucial as a gatekeeper to regulate the prevalence of LLM-generated text online.

Developments

With the evolving prowess of LLMs, Li et al. (2023b) suggested that LLMs can self-assess and even benchmark their own performance. Owing to their excellent text generation performance, LLMs are also used to construct many training datasets through preset instructions Taori et al. (2023). However, Alemohammad et al. (2023) posited that this “self-consuming” paradigm may engender homogenization in LLM-generated texts, potentially culminating in what is termed Model Autophagy Disorder (MAD). If LLMs rely heavily on web-sourced data for training, and a significant portion of this data originates from LLM outputs, their long-term progress could be hindered.

Science

The relentless march of human progress owes much to the spirit of scientific exploration and discovery. However, the increasing presence of LLM-generated text in academic writing Májovskỳ et al. (2023) and the use of LLM-originated designs in research endeavors raise concerns about potentially diluting human ingenuity and exploratory drive. At the same time, it could also undermine the ability of higher education to validate student knowledge and comprehension, and diminish the academic reputation of specific higher education institutions Ibrahim et al. (2023). Although current methodologies may have limitations, further enhancements in detection capabilities will strengthen academic integrity and preserve human independent thinking in scientific research.

Human Society

From a societal perspective, analyzing the implications of LLM-generated text reveals that these models essentially mimic specific textual patterns while predicting subsequent tokens. If used improperly, these models have the potential to diminish linguistic diversity and contribute to the formation of information silos within societal discourse. In the long run, detecting and filtering LLM-generated text is crucial for preserving the richness and diversity of human communication, both linguistically and informatively.

3 Related Works and Our Investigation

3.1 Related Works

The comprehensive review by Beresneva (2016) represents the first extensive survey of methods for detecting computer-generated text. At that time, the detection process was relatively simple, focusing mainly on machine-translated text and employing simple statistical methods. The emergence of autoregressive models significantly increased the complexity of text detection tasks. Jawahar, Abdul-Mageed, and Lakshmanan (2020) provide a detailed survey on the detection of machine-generated text, establishing a foundational context for the field with an emphasis on the state-of-the-art generative models prevalent at the time, such as GPT-2. The subsequent release of ChatGPT sparked a surge of interest in LLMs and signified a major shift in research directions. In response to the rapid challenges posed by LLM-generated text, the NLP community has recently embarked on extensive research to devise robust detection mechanisms and explore the dynamics of detector evasion techniques, aiming to find effective solutions. The recent surveys by Crothers, Japkowicz, and Viktor (2023b); Dhaini, Poelman, and Erdogan (2023) provide new reviews of LLM-generated text detection, but we contend that they do not cover the most recent advances and that their summaries of detection methods could be improved. Tang, Chuang, and Hu (2023) present another survey, categorizing detection methods into black-box and white-box detection and highlighting cutting-edge technologies such as watermarking, but the review could benefit from a more comprehensive analysis and critical evaluation. Ghosal et al. (2023) discuss current attacks on and defenses of AI-generated text detectors and provide a thorough inductive analysis; nevertheless, the discussion could be enriched with a more detailed examination of task motivation, data resources, and evaluation methodologies.

In this article, we strive to provide a more comprehensive and insightful review of the latest research on LLM-generated text detection, enriched with thoughtful analysis. We highlight the strengths of our review in comparison to others:

  • Systematic and Comprehensive Review: Our survey offers an extensive exploration of LLM-generated text detection, covering the task’s description and underlying motivation, benchmarks and datasets, various detection and attack methods, evaluation frameworks, the most pressing challenges faced today, potential future directions, and a critical examination of each aspect.

  • In-depth Analysis of Detection Mechanisms: We provide a detailed overview of detection strategies, from traditional approaches to the latest research, and systematically evaluate their effectiveness, strengths, and weaknesses in the current environment of LLMs.

  • More Pragmatic Insights: Our discussion delves into research questions with practical implications, such as how model size affects detection capabilities, the challenges of identifying text that is not purely generated by LLMs, and the lack of an effective evaluation framework.

In summary, we firmly believe that this review is more systematic and comprehensive than existing works. More importantly, our critical discussion not only provides guidance to new researchers but also imparts valuable insights into established works within the field.

3.2 Systematic Investigation and Implementation

Table 3: Overview of the diverse databases and search engines utilized in our research, along with the incorporated search schemes and the consequent results obtained. Google Scholar predominates as the search engine yielding the maximum number of retrievable documents. Upon meticulous examination, it is observed that a substantial portion of the documents originate from ArXiv, primarily shared by researchers.
Databases Search Engine Search Scheme Retrieved
Google Scholar https://scholar.google.com/ Full Text 210
ArXiv https://arxiv.org/ Full Text N/A (a)
Scopus https://www.scopus.com/ TITLE-ABS-KEY: ( Title, Abstract, Author Keywords, Indexed Keywords ) 133
Web of Science https://www.webofscience.com/ Topic: ( Searches Title, Abstract, Author Keywords, Keywords Plus ) 92
IEEE Xplore https://ieeexplore.ieee.org/ Full Text 49
Springer Link https://link.springer.com/ Full Text N/A (a)
ACL Anthology https://aclanthology.org/ Full Text N/A (a)
ACM Digital Library https://dl.acm.org/ Title N/A (b)
(a) These search engines cannot use all keywords in a single search string; therefore, the retrieved counts are inexact and queries may return duplicate results.
(b) The search engine retrieved an inexact number of papers, many of which were only weakly related to our topic.

Our survey employed the Systematic Literature Review (SLR) methodology as delineated by Barbara Kitchenham (2007), a framework designed for evaluating the extent and quality of extant evidence pertaining to a specified research question or topic. Offering a more expansive and accurate insight than conventional literature reviews, this approach has been prominently utilized in numerous scholarly surveys, as evidenced by Murtaza et al. (2020); Saeed and Omlin (2023). The research question guiding our SLR was as follows:

What are the prevailing methods for detecting LLM-generated text, and what are the main challenges associated with these methods?

Figure 3: Distribution by year of the literature obtained from screening over the last five years. The number of published articles rises sharply in 2023.

Upon delineating the research question, our study utilized search terms directly related to the research issue, specifically: “LLM-generated text detection”, “machine-generated text detection”, “AI-written text detection”, “authorship attribution”, and “deepfake text detection”. These terms were combined using the Boolean operator OR to formulate the following search string: (“LLM-generated text detection” OR “machine-generated text detection” OR “AI-written text detection” OR “authorship attribution” OR “deepfake text detection”). Employing this search string, we conducted a preliminary search across pertinent and authoritative electronic literature databases and search engines. Our investigation mainly focused on scholarly articles that were publicly accessible prior to November 2023. Table 3 outlines the sources used and provides an overview of our results.

Subsequently, we established the ensuing criteria to scrutinize the amassed articles:

  • The article should be a review focusing on the methods and challenges pertinent to LLM-generated (machine-generated/AI-written) text detection.

  • The article should propose a methodology specifically designed for the detection of LLM-generated (machine-generated/AI-written) text.

  • The article should delineate challenges and prospective directions for future research in the domain of text generation for LLMs.

  • The article should articulate the necessity and applications of LLM-generated text detection.

If any one of the aforementioned four criteria was met, the respective work was deemed valuable for our study. Following a process of de-duplication and manual screening, we identified 83 pertinent pieces of literature. The distribution trend of these works, delineated by year, is illustrated in Figure 3. Notably, the majority of relevant research on LLM-generated text detection was published in the year 2023 (as shown in Figure 3), underscoring the vibrant development within this field and highlighting the significance of our study. In subsequent sections, we provide a synthesis and analysis of the data (see section 4), primary detectors (see section 5), evaluation metrics (see section 6), issues (see section 7), and future research directions (see section 8) pertinent to LLM-generated text detection. The comprehensive structure of the survey is outlined in Table 4, offering a detailed overview of the organization of our review.

Table 4: Summary of content organisation of this survey.
Section Topic Content
Section 4 Data Datasets and Benchmarks for LLM-generated Text Detection, Other Datasets Easily Extended to Detection Tasks, and Challenges of Datasets for LLM-generated Text Detection.
Section 5 Detectors Watermarking Technology, Statistics-Based Detectors, Neural-Based Detectors, and Human-Assisted Methods
Section 6 Evaluation Metrics Accuracy, Precision, Recall, False Positive Rate, True Negative Rate, False Negative Rate, F1-Score, and Area under the ROC curve (AUROC).
Section 7 Issues Out of Distribution Challenges, Potential Attacks, Real-world Data Issues, Impact of Model Size on Detectors, and Lack of Effective Evaluation Framework
Section 8 Future Directions Building Robust Detectors with Attacks, Enhancing the Efficacy of Zero-Shot Detectors, Optimizing Detectors for Low-Resource Environments, Detection of Not Purely LLM-generated Text, Constructing Detectors Amidst Data Ambiguity, Developing an Effective Evaluation Framework Aligned with the Real World, and Constructing Detectors with Misinformation Discrimination Capabilities.
Section 9 Conclusion -

4 Data

High-quality datasets are pivotal for advancing research in the LLM-generated text detection task. These datasets enable researchers to swiftly develop and calibrate efficient detectors and establish standardized metrics for evaluating the efficacy of their methodologies. However, procuring such high-quality labeled data often demands substantial financial, material, and human resources. Presently, the development of datasets focused on detecting LLM-generated text is in its nascent stages, plagued by issues such as limited data volume and sample complexity, both crucial for crafting robust detectors. In this section, we first introduce popular datasets employed for training LLM-generated text detectors. Additionally, we highlight datasets from unrelated domains or tasks, which, though not initially designed for detection tasks, can be repurposed for various detection scenarios, which is a prevailing strategy in many contemporary detection studies. We subsequently introduce benchmarks for verifying the effectiveness of LLM-generated text detectors, which are carefully designed to evaluate the performance of the detector from different perspectives. Lastly, we evaluate these training datasets and benchmarks, identifying current shortcomings and challenges in dataset construction for LLM-generated text detection, aiming to inform the design of future data resources.

4.1 Training

4.1.1 Detection Datasets

Massive, high-quality datasets can assist researchers in rapidly training their detectors. We thoroughly organize and compare datasets that are widely used or show potential; see Table 5. Given that different studies focus on various practical issues, our aim is to facilitate researchers in conveniently selecting high-quality datasets that meet their specific needs through our comprehensive review.

Table 5: Summary of Detection Datasets for LLM-generated text detection.
Corpus Use Human LLMs LLMs Type Language Attack Domain
HC3 Guo et al. (2023) train ~80k ~43k ChatGPT English, Chinese - Web Text, QA, Social Media
CHEAT Yu et al. (2023a) train ~15k ~35k ChatGPT English Paraphrase Scientific Writing
HC3 Plus Su et al. (2023b) train, valid, test ~95k ~10k ~38k GPT-3.5-Turbo English, Chinese Paraphrase News Writing, Social Media
OpenLLMText Chen et al. (2023a) train, valid, test ~52k ~8k ~8k ~209k ~33k ~33k ChatGPT, PaLM, LLaMA, GPT2-XL English - Web Text
GROVER Dataset Zellers et al. (2019b) train ~24k Grover-Mega English - News Writing
TweepFake Fagni et al. (2021) train ~12k ~12k GPT-2, RNN, Markov, LSTM, CharRNN English - Social Media
GPT-2 Output Dataset666https://github.com/openai/gpt-2-output-dataset train test ~250k ~5k ~2000k ~40k GPT-2 (small, medium, large, xl) English - Web Text
ArguGPT Liu et al. (2023c) train valid test ~6k 700 700 GPT2-Xl, Text-Babbage-001, Text-Curie-001, Text-Davinci-001, Text-Davinci-002, Text-Davinci-003, GPT-3.5-Turbo English - Scientific writing
DeepfakeTextDetect Li et al. (2023c) train, valid, test ~236k ~56k ~56k GPT (Text-Davinci-002, Text-Davinci-003, GPT-Turbo-3.5), LLaMA (6B, 13B, 30B, 65B), GLM-130B, FLAN-T5 (small, base, large, xl, xxl), OPT (125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, iml-1.3B, iml-30B), T0 (3B, 11B), BLOOM-7B1, GPT-J-6B, GPT-NeoX-20B English Paraphrase Social Media, News Writing, QA, Story Generation, Comprehension and Reasoning, Scientific Writing
HC3

The Human ChatGPT Comparison Corpus (HC3) Guo et al. (2023) stands as one of the initial open-source efforts to compare ChatGPT-generated text with human-written text. It involves collecting both human and ChatGPT responses to identical questions. Due to its pioneering contributions in this field, the HC3 corpus has been utilized in numerous subsequent studies as a valuable resource. The corpus offers datasets in both English and Chinese. Specifically, HC3-en comprises 58k human responses and 26k ChatGPT responses, derived from 24k questions, which primarily originate from the ELI5 dataset, WikiQA dataset, Crawled Wikipedia, Medical Dialog dataset, and FiQA dataset. On the other hand, HC3-zh encompasses a broader spectrum of domains, featuring 22k human answers and 17k ChatGPT responses. The data within HC3-zh spans seven sources: WebTextQA, BaikeQA, Crawled BaiduBaike, NLPCC-DBQA dataset, Medical Dialog dataset, Baidu AI Studio, and the LegalQA dataset. However, it is pertinent to note some limitations of the HC3 dataset, such as the lack of diversity in prompts used for data creation.
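For readers who wish to use HC3 for detector training, the sketch below flattens HC3-style records into (text, label) pairs. The JSON-lines layout and the field names (question, human_answers, chatgpt_answers) reflect our understanding of the released files and should be verified against the official repository; the file name is hypothetical.

```python
# Flatten HC3-style records into (text, label) pairs for detector training.
# Assumed fields: "human_answers" and "chatgpt_answers" (lists of strings) per record.
import json

def flatten_hc3(path: str):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:                        # one JSON record per line (assumed)
            record = json.loads(line)
            for answer in record.get("human_answers", []):
                pairs.append((answer, 0))     # 0: human-written
            for answer in record.get("chatgpt_answers", []):
                pairs.append((answer, 1))     # 1: LLM-generated
    return pairs

# pairs = flatten_hc3("hc3_en.jsonl")  # hypothetical local file name
```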

CHEAT

The CHEAT dataset Yu et al. (2023a) stands as the most extensive publicly accessible resource dedicated to detecting spurious academic content generated by ChatGPT. It includes human-written academic abstracts sourced from IEEE Xplore, with an average abstract length of 163.9 words and a vocabulary size of 130k words. Following the ChatGPT generation process, the dataset contains 15k human-written abstracts and 35k ChatGPT-generated summaries. To better simulate real-world applications, the outputs were guided by ChatGPT for further refinement and amalgamation. The “polishing” process aims to simulate users who may seek to bypass plagiarism detection by improving the text, while “blending” represents scenarios where users might combine manually drafted abstracts with those seamlessly crafted by ChatGPT to elude detection mechanisms. Nevertheless, a limitation of the CHEAT dataset is its focus on narrow academic disciplines, overlooking cross-domain challenges, which stems from constraints related to its primary data source.

HC3 Plus

HC3 Plus Su et al. (2023b) represents an enhanced version of the original HC3 dataset, introducing an augmented section named HC3-SI. This new section specifically targets tasks requiring semantic invariance, such as summarization, translation, and paraphrasing, thus extending the scope of HC3. To compile the human-written text corpus for HC3-SI, data was curated from several sources, including the CNN/DailyMail dataset, Xsum, LCSTS, the CLUE benchmark, and datasets from the Workshop on Machine Translation (WMT). Simultaneously, the LLM-generated texts were generated using GPT-3.5-Turbo. The expanded English dataset now includes a training set of 95k samples, a validation set of 10k samples, and a test set of 38k samples. The Chinese dataset, in comparison, contains 42k training samples, 4k for validation, and 22k for testing. Despite these expansions, HC3-SI still mirrors HC3’s approach to data construction, which is somewhat monolithic and lacks diversity, particularly in the variety of LLMs and the use of complex and varied prompts for generating data.

OpenLLMText

The OpenLLMText dataset Chen et al. (2023a) is derived from four types of LLMs: GPT-3.5, PaLM, LLaMA-7B, and GPT2-1B (also known as GPT-2 Extra Large). The samples from GPT2-1B are sourced from the GPT-2 Output dataset, which OpenAI has made publicly available. Text generation from GPT-3.5 and PaLM was conducted using the prompt “Rephrase the following paragraph paragraph by paragraph: [Human_Sample],” while LLaMA-7B generated text by completing the first 75 tokens of human samples. The dataset comprises a total of 344k samples, including 68k written by humans. It is divided into training, validation, and test sets at 76%, 12%, and 12%, respectively. Notably, this dataset features LLMs like PaLM, which are commonly used in everyday applications. However, it does not fully capture the nuances of cross-domain and multilingual text, which limits its usefulness for related research.

TweepFake Dataset

TweepFake Fagni et al. (2021) is a foundational dataset designed for the analysis of fake tweets on Twitter, derived from both genuine and counterfeit accounts. It encompasses a total of 25k tweets, with an equal distribution between human-written and machine-generated samples. The machine-generated tweets were crafted using various techniques, including GPT-2, RNN, Markov, LSTM, and CharRNN. While TweepFake continues to be a dataset of choice for many scholars, those working with LLMs should critically assess its relevance and rigor in light of evolving technological capabilities.

GPT2-Output Dataset

The GPT2-Output Dataset,777https://github.com/openai/gpt-2-output-dataset introduced by OpenAI, is based on 250k documents sourced from the WebText test set for its human-written text. Regarding the LLM-generated text, the dataset includes 250k randomly generated samples using a temperature setting of 1 without truncation and an additional 250k samples produced with Top-K 40 truncation. This dataset was conceived to further research into the detectability of the GPT-2 model. However, a notable limitation lies in the insufficient complexity of the dataset, marked by the uniformity of both the generative models and data distribution.

GROVER Dataset

The GROVER Dataset, introduced by Zellers et al. (2019b), is styled after news articles. Its human-written text is sourced from RealNews, a comprehensive corpus of news articles derived from Common Crawl. The LLM-generated text is produced by Grover-Mega, a transformer-based news generator with 1.5 billion parameters. A limitation of this dataset, particularly in the current LLM landscape, is the uniformity and singularity of both its generative model and data distribution.

ArguGPT Dataset

The ArguGPT Dataset Liu et al. (2023c) is specifically designed for detecting LLM-generated text in various academic contexts such as classroom exercises, TOEFL, and GRE writing tasks. It comprises 4k argumentative essays, generated by seven distinct GPT models. Its primary aim is to tackle the unique challenges associated with teaching English as a second language.

DeepfakeTextDetect Dataset

Attention is also drawn to the DeepfakeTextDetect Dataset Li et al. (2023c), a robust platform tailored for deepfake text detection. The dataset combines human-written text from ten diverse datasets, encompassing genres like news articles, stories, scientific writings, and more. It features texts generated by 27 prominent LLMs, sourced from entities such as OpenAI, LLaMA, and EleutherAI. Furthermore, the dataset introduces an augmented challenge with the inclusion of text produced by GPT-4 and paraphrased text.

4.1.2 Potential Datasets

Constructing a dataset from scratch that encompasses both human-written and LLM-generated text can be a resource-intensive endeavor. Recognizing the diverse requirements for LLM-generated text detection across scenarios, researchers commonly adapt existing datasets from fields like Q&A, academic writing, and story generation to represent human-written text. They then produce LLM-generated text for detector training using methods such as prompt engineering and bootstrap complementation (a sketch of this pairing strategy is given after Table 6). This survey offers a concise classification and overview of these datasets, summarized in Table 6.

Table 6: Summary of other potential datasets that can easily extended to LLM-generated text detection tasks.
Corpus Size Source Language Domain
XSum Narayan, Cohen, and Lapata (2018) 42k BBC English News Writing
SQuAD Rajpurkar et al. (2016) 98.2k Wiki English Question Answering
WritingPrompts Fan, Lewis, and Dauphin (2018) 302k Reddit WRITINGPROMPTS English Story Generation
Wiki40B Guo et al. (2020) 17.7m Wiki 40+ Languages Web Text
PubMedQA Jin et al. (2019) 211k PubMed English Question Answering
Children’s Book Corpus Hill et al. (2016) 687k Books English Question Answering
Avax Tweets Dataset Muric, Wu, and Ferrara (2021) 137m Twitter English Social Media
Climate Change Dataset Littman and Wrubel (2019) 4m Twitter English Social Media
Yelp Dataset Asghar (2016) 700k Yelp English Social Media
ELI5 Fan et al. (2019) 556k Reddit English Question Answering
ROCStories Mostafazadeh et al. (2016) 50k Crowdsourcing English Story Generation
HellaSwag Zellers et al. (2019a) 70k ActivityNet Captions, Wikihow English Question Answering
SciGen Moosavi et al. (2021) 52k arXiv English Scientific Writing, Question Answering
WebText Radford et al. (2019) 45m Web English Web Text
TruthfulQA Lin, Hilton, and Evans (2022) 817 Author-written English Question Answering
NarrativeQA Kočiský et al. (2018) 1.4k Gutenberg, Web English Question Answering
TOEFL11 Blanchard et al. (2013) 12k TOEFL test 11 Languages Scientific writing
Peer Reviews Kang et al. (2018) 14.5k NIPS 2013–2017, CoNLL 2016, ACL 2017, ICLR 2017, arXiv 2007–2017 English Scientific Writing
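A minimal sketch of the repurposing strategy referenced before Table 6: the questions and human answers of an existing Q&A dataset are reused as-is, the same questions are posed to an LLM, and the two sides are labeled accordingly. The ask_llm function is a hypothetical stand-in for whatever generation API or local model a researcher uses.

```python
# Turn an existing Q&A dataset into detection training pairs by querying an LLM
# with the same questions. `ask_llm` is a hypothetical callable, e.g. a chat-API wrapper.
from typing import Callable, Iterable, List, Tuple

def build_detection_pairs(
    qa_items: Iterable[Tuple[str, str]],      # (question, human_answer) pairs
    ask_llm: Callable[[str], str],
) -> List[Tuple[str, int]]:
    pairs = []
    for question, human_answer in qa_items:
        pairs.append((human_answer, 0))       # 0: human-written
        pairs.append((ask_llm(question), 1))  # 1: LLM-generated, same prompt/topic
    return pairs
```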
Q&A

Q&A is a prevalent and equitable method for constructing datasets. By posing identical questions to LLMs, one can generate paired sets of human-written and LLM-generated text.

  • PubMedQA Jin et al. (2019): This is a biomedical question-and-answer (QA) dataset sourced from PubMed.888https://www.ncbi.nlm.nih.gov/pubmed/

  • Children Book Corpus Hill et al. (2016): This dataset, derived from freely available books, gauges the capacity of LMs to harness broader linguistic contexts. It challenges models to select the correct answer from ten possible options, based on a context of 20 consecutive sentences. The answer types encompass verbs, pronouns, named entities, and common nouns.

  • ELI5 Fan et al. (2019): A substantial corpus for long-form Q&A, ELI5 focuses on tasks demanding detailed responses to open-ended queries. The dataset comprises 270k entries from the Reddit forum “Explain Like I’m Five”, which offers explanations tailored to the comprehension level of a five-year-old.

  • TruthfulQA Lin, Hilton, and Evans (2022): This benchmark evaluates the veracity of answers generated by LLMs. It encompasses 817 questions spread across 38 categories such as health, law, finance, and politics. All questions were crafted by humans.

  • NarrativeQA Kočiský et al. (2018): This English-language dataset includes summaries or stories along with related questions aimed at assessing reading comprehension, especially concerning extended documents. Data is sourced from Project Gutenberg 999https://gutenberg.org/ and web-scraped movie scripts, with hired annotators providing the answers.

Scientific Writing

Scientific writing is frequently explored in real-world research contexts. Given a specific academic topic, LLMs can efficiently generate academic articles or abstracts.

  • PeerRead Kang et al. (2018): This represents the inaugural public dataset of scientific peer-reviewed articles, comprising 14.7k draft articles and 10.7k expert-written peer reviews for a subset of these articles. Additionally, it includes the acceptance or rejection decisions from premier conferences such as ACL, NeurIPS, and ICLR.

  • ArXiv:101010https://arxiv.org/ A freely accessible distribution service and repository, ArXiv hosts 2.3 million scholarly articles spanning disciplines like physics, mathematics, computer science, and statistics.

  • TOEFL11 Blanchard et al. (2013): A publicly accessible corpus featuring the work of non-native English writers from the TOEFL test, it encompasses 1.1k essay samples for each of 11 native languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. These essays are evenly distributed over eight writing prompts and three score levels (low/medium/high).

Story Generation

LLMs excel in the domain of story generation, with users frequently utilizing story titles and writing prompts to guide the models in their creative endeavors.

  • WritingPrompts Fan, Lewis, and Dauphin (2018): This dataset comprises 300k human-written stories paired with writing prompts. The data was sourced from Reddit’s writing prompts forum, a vibrant online community where members inspire one another by posting story ideas or prompts. Stories in this dataset are restricted in length, ranging between 30 to 1k words, with no words repeated more than 10 times.

News Writing

The task of news article writing can be approached through article summary datasets. LLMs can be instructed either to generate abstracts from the primary text or to generate articles based on provided abstracts. Nonetheless, given the resource constraints, researchers often employ LLMs to generate such datasets by directly reinterpreting or augmenting the existing abstracts and articles.

  • Extreme Summarization (XSum) Narayan, Cohen, and Lapata (2018): This dataset contains BBC articles accompanied by concise one-sentence abstracts. It encompasses a total of 225k samples from 2010 to 2017, spanning various domains such as News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment, Arts, and more.

Web Text

Web text data primarily originates from platforms like Wikipedia. For web text generation, a common approach is to provide the LLMs with an opening sentence and allow them to continue the narrative. Alternatively, LLMs can be instructed to generate content based on a webpage title.

  • Wiki-40B Guo et al. (2020): Initially conceived as a multilingual benchmark for language model training, this dataset comprises text from approximately 19.5 million Wikipedia pages across 40 languages, aggregating to about 40 billion characters. The content has been meticulously cleaned to maintain quality.

  • WebText Radford et al. (2019): Originally utilized to investigate the learning potential of LMs or LLMs, this dataset encompasses 45 million web pages. Prioritizing content quality, the dataset exclusively includes web pages curated or filtered by humans, while deliberately excluding common sources from other datasets, such as Wikipedia.

Social Media

Social Media datasets are instrumental in assessing the disparity in subjective expressions between LLM-generated and human-written texts.

  • Yelp Reviews Dataset Asghar (2016): Sourced from the 2015 Yelp Dataset Challenge, this dataset was primarily used for classification tasks such as predicting user ratings based on reviews and determining polarity labels. It comprises 1.5 million review text samples.

  • r/ChangeMyView (CMV) Reddit Subcommunity (https://www.reddit.com/r/changemyview/): Often referred to as “Change My View (CMV)”, this subreddit offers a platform for users to debate a spectrum of topics, ranging from politics and media to popular culture, often presenting contrasting viewpoints.

  • IMDB Dataset (https://huggingface.co/datasets/imdb): Serving as an expansive film review dataset for binary sentiment classification, it exceeds prior benchmark datasets in volume, encompassing 25k training and 25k test samples.

  • Avax Tweets Dataset Muric, Wu, and Ferrara (2021): Designed to examine anti-vaccine misinformation on social media, this dataset was acquired via the Twitter API. It features a streaming dataset centered on keywords with over 1.8 million tweets, complemented by a historical account-level dataset containing more than 135 million tweets.

  • Climate Change Tweets Ids Littman and Wrubel (2019): This dataset houses the tweet IDs for 39.6 million tweets related to climate change. These tweets were sourced from the Twitter API between 21 September 2017 and 17 May 2019 using the Social Feed Manager, based on specific search keywords.

Comprehension and Reasoning

Datasets geared towards comprehension and reasoning typically provide consistent contextual materials that guide LLMs in regenerating or continuing text.

  • Stanford Question Answering Dataset (SQuAD) Rajpurkar et al. (2016): This reading comprehension dataset features 100k Q&A pairs, encompassing subjects from musical celebrities to abstract notions. It draws samples from the top 10k English Wikipedia articles ranked via PageRank. From this collection, 536 articles were randomly selected, and paragraphs shorter than 500 characters were excluded. Crowdsourced contributors framed the questions based on these passages, while additional annotators provided the answers.

  • SciGen Moosavi et al. (2021): This dataset targets reasoning-aware text generation from tabular data. It comprises tables from scientific articles alongside their descriptions. The entire dataset is sourced from the “Computer Science” section of arXiv, holding up to 17.5k samples from “Computation and Language” and another 35.5k from domains like “Machine Learning”. Additionally, the dataset facilitates the evaluation of generative models’ arithmetic reasoning capabilities over intricate input formats, such as scientific tables.

  • ROCStories Corpora (ROC) Mostafazadeh et al. (2016): Aimed at natural language understanding, this dataset tasks systems with determining the apt conclusion to a four-sentence narrative. It is a curated collection of 50k five-sentence stories reflecting everyday experiences. Beyond its primary purpose, it can also support tasks like story generation.

  • HellaSwag Zellers et al. (2019a): Focused on common-sense reasoning, this dataset encompasses approximately 70k questions. Utilizing Adversarial Filtering (AF), the dataset crafts misleading and intricate false answers for multiple-choice settings, where the objective is to pinpoint the correct answer in context.

4.2 Evaluation Benchmarks

Table 7: Summary of benchmarks for LLM-generated text detection.
Corpus | Use | Human | LLMs | LLMs Type | Language | Attack | Domain
TuringBench Uchendu et al. (2021) | train | ~8k | ~159k | GPT-1, GPT-2, GPT-3, GROVER, CTRL, XLM, XLNET, FAIR, TRANSFORMER_XL, PPLM | English | - | News Writing
MGTBench He et al. (2023) | train / test | ~2.4k / ~0.6k | ~14.4k / ~3.6k | ChatGPT, ChatGPT-turbo, ChatGLM, Dolly, GPT4All, StableLM | English | Adversarial | Scientific Writing, Story Generation, News Writing
GPABenchmark Liu et al. (2023d) | test | ~150k | ~450k | GPT-3.5 | English | Paraphrase | Scientific Writing
Scientific-articles Benchmark Mosca et al. (2023) | test | ~16k | ~13k | SCIgen, GPT-2, GPT-3, ChatGPT, Galactica | English | - | Scientific Writing
MULTITuDE Macko et al. (2023) | train / test | ~4k / ~3k | ~40k / ~26k | Alpaca-lora, GPT-3.5-Turbo, GPT-4, LLaMA, OPT, OPT-IML-Max, Text-Davinci-003, Vicuna | Arabic, Catalan, Chinese, Czech, Dutch, English, German, Portuguese, Russian, Spanish, Ukrainian | - | Scientific Writing, News Writing, Social Media
HANSEN Tripto et al. (2023) | test | - | ~21k | ChatGPT, PaLM2, Vicuna13B | English | - | Spoken Text
M4 Wang et al. (2023b) | train / valid / test | ~35k / ~3.5k / ~3.5k | ~112k / ~3.5k / ~3.5k | GPT-4, ChatGPT, GPT-3.5, Cohere, Dolly-v2, BLOOMz 176B | English, Chinese, Russian, Urdu, Indonesian, Bulgarian, Arabic | - | Web Text, Scientific Writing, News Writing, Social Media, QA

High-quality benchmarks help researchers rapidly verify whether their detectors are feasible and effective. We collate and compare the benchmarks that are currently popular or promising, as shown in Table 7. On the one hand, we hope to help researchers better understand the differences between these benchmarks so that they can choose suitable ones for their experiments. On the other hand, we hope to draw researchers’ attention to the latest benchmarks, which are specifically designed to probe the newest issues in this task and therefore hold great potential.

TuringBench

The TuringBench dataset Uchendu et al. (2021) is an initiative designed to explore the challenges of the “Turing test” in the context of neural text generation techniques. It comprises human-written content derived from 10k news articles, predominantly from reputable sources such as CNN. For the purpose of this dataset, only articles ranging between 200 and 400 words were selected. LLM-generated text within this dataset is produced by 19 distinct text generation models, including GPT-1, GPT-2 variants (small, medium, large, xl, and pytorch), GPT-3, different versions of GROVER (base, large, and mega), CTRL, XLM, XLNET variants (base and large), FAIR for both WMT19 and WMT20, Transformer-XL, and both PPLM variants (distil and GPT-2). Each model contributed approximately 8k samples, categorized by label type. Notably, TuringBench emerged as one of the pioneering benchmark environments for the detection of LLM-generated text. However, given the rapid advancements in LLM technologies, the samples within TuringBench are now less suited for training and validating contemporary detector performances. As such, timely updates incorporating the latest generation models and their resultant texts are imperative.

MGTBench

Introduced by He et al. (2023), MGTBench stands as the inaugural benchmark framework for machine-generated text (MGT) detection. It boasts a modular architecture, encompassing an input module, a detection module, and an evaluation module. The dataset draws upon several of the foremost LLMs, including ChatGPT, ChatGPT-turbo, ChatGLM, Dolly, GPT4All, and StableLM, for text generation. Furthermore, it incorporates more than ten widely recognized detection algorithms, demonstrating significant potential.

GPABenchmark

The GPABenchmark Liu et al. (2023d) is a comprehensive dataset encompassing 600k samples. These samples span human-written, GPT-written, GPT-completed, and GPT-polished abstracts from a broad spectrum of academic disciplines, such as computer science, physics, and the humanities and social sciences. This dataset meticulously captures the quintessential scenarios reflecting both the utilization and potential misapplication of LLMs in academic composition. Consequently, it delineates three specific tasks: generation of text based on a provided title, completion of a partial draft, and refinement of an existing draft. Within the domain of academic writing detection, GPABenchmark stands as a robust benchmark, attributed to its voluminous data and its holistic approach to scenario representation.

Scientific-articles Benchmark

The Scientific-articles Benchmark Mosca et al. (2023) comprises 16k human-written articles alongside 13k LLM-generated samples. The human-written articles are sourced from the arXiv dataset available on Kaggle. In contrast, the machine-generated samples, which include abstracts, introductions, and conclusions, are produced by SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica using the titles of the respective scientific articles as prompts. A notable limitation of this dataset is its omission of various adversarial attack types.

MULTITuDE

MULTITuDE Macko et al. (2023) is a benchmark for detecting machine-generated text in multiple languages, consisting of 74k machine-generated texts and 7k human-written texts across 11 languages: Arabic, Catalan, Chinese, Czech, Dutch, English, German, Portuguese, Russian, Spanish, and Ukrainian. The machine-generated texts are produced by eight generative models: Alpaca-LoRA, GPT-3.5-Turbo, GPT-4, LLaMA, OPT, OPT-IML-Max, Text-Davinci-003, and Vicuna. In an era of rapidly multiplying multilingual LLMs, MULTITuDE serves as an effective benchmark for assessing how well LLM-generated text detectors perform across languages.

HANSEN

The Human and AI Spoken Text Benchmark (HANSEN) Tripto et al. (2023) is the largest benchmark for spoken text, aggregating 17 existing speech datasets and transcripts and adding roughly 23k novel AI-generated spoken texts. The AI-generated spoken texts in HANSEN were created by ChatGPT, PaLM2, and Vicuna-13B. Because of the stylistic differences between spoken and written language, detectors may require a more nuanced understanding of spoken text, and HANSEN can effectively assess progress towards such nuanced detectors.

M4

M4 Wang et al. (2023b) serves as a comprehensive benchmark corpus for the detection of text generated by LLMs. It spans a variety of generators, domains, and languages. Compiled from diverse sources, including wiki pages from various regions, news outlets, and academic portals, the dataset reflects common scenarios where LLMs are utilized in daily applications. The LLM-generated texts in M4 are created using cutting-edge generative models such as ChatGPT, LLaMa, BLOOMz, FlanT5, and Dolly. Notably, the dataset captures cross-lingual subtleties, featuring content in more than ten languages. In summary, while the M4 dataset proficiently tackles complexities across domains, models, and languages, it could be further enriched by incorporating a broader range of adversarial scenarios.

4.3 Data Challenges

In light of our extensive experience in the area, a notable deficiency persists in robust datasets and benchmarks tailored for LLM-generated text detection. Despite commendable advancements, current efforts remain insufficient. A noticeable trend among researchers is to use datasets originally designed for other tasks as human-written texts and to produce LLM-generated texts based on them for training detectors. This practice arises because existing datasets and benchmarks do not comprehensively address diverse research perspectives. Accordingly, we outline the prominent limitations and challenges associated with current datasets and benchmarks below.

4.3.1 Comprehensiveness of Evaluation Frameworks

Before a detector can be trusted as reliable, it demands multifaceted assessment. Current benchmarks are somewhat limited, providing only superficial challenges and thereby precluding a holistic evaluation of detectors. We highlight five crucial dimensions that are essential for developing more robust benchmarks for the LLM-generated text detection task: the incorporation of multiple types of attacks, diverse domains, varied tasks, a spectrum of models, and multiple languages.

Multiple Types of Attack

are instrumental in ascertaining the efficacy of detection methodologies. In practical environments, LLM-generated text detectors often encounter texts that are generated using a wide range of attack mechanisms, which differ from texts generated through simple prompts. For instance, the prompt attack elucidated in subsection 7.2 impels the generative model to yield superior-quality text, leveraging intricate and sophisticated prompts. Integrating such texts into prevailing datasets is imperative. This concern is also echoed in the limitations outlined by Guo et al. (2023).

Multi-domains and multi-tasks

configurations are pivotal in assessing a detector’s performance across diverse real-world domains and LLM applications. These dimensions bear significant implications for a detector’s robustness, usability, and credibility. For instance, in scholarly contexts, a proficient detector should consistently excel across all fields. In everyday scenarios, it should adeptly identify LLM-generated text spanning academic compositions, news articles, arithmetic reasoning, and Q&A sessions. While numerous existing studies prudently incorporate these considerations, we advocate for the proliferation of superior-quality datasets.

Multiple LLMs

The ongoing research momentum in LLMs has ushered in formidable counterparts like LLaMa Touvron et al. (2023), PaLM Chowdhery et al. (2022a), and Claude-2 (https://www.anthropic.com/index/claude-2), rivaling ChatGPT’s prowess. As the spotlight remains on ChatGPT, it is essential to concurrently address potential risks emanating from other emerging LLMs.

Multilingual

considerations demand increased attention. We strongly encourage researchers to spearhead the creation of multilingual datasets to facilitate the evaluation of LLM-generated text detectors across different languages. Detectors built on pre-trained models may struggle with underrepresented languages, and LLMs themselves could exhibit more noticeable inconsistencies in those languages. This dimension presents a rich avenue for exploration and discourse.

4.3.2 Temporal

It is discernible that certain contemporary studies persistently employ seminal but somewhat antiquated benchmark datasets, which significantly shaped prior GPT-generated text and fake news detection endeavors. However, these datasets predominantly originate from earlier, less capable LLMs, implying that methodologies validated on them might not align with current real-world dynamics. We emphasize the significance of utilizing datasets built with advanced and powerful LLMs, while also urging benchmark developers to regularly update their contributions to reflect the rapid evolution of the field.

[Figure 4 taxonomy tree — Advanced Detector Research (Sec. 5): Watermarking Technology (Data-Driven, Model-Driven, Post-Processing Watermarking); Statistics-Based Detectors (Linguistics Features Statistics, White-Box Statistics, Black-Box Statistics); Neural-Based Detectors (Feature-Based Classifiers, Pre-Training Classifiers, LLMs as Detectors); Human-Assisted Methods (Intuitive Indicators, Imperceptible Features, Enhancing Human Detection Capabilities, Mixed Detection: Understanding and Explanation). Each leaf lists representative papers discussed in Section 5.]

Figure 4: Classification of LLM-generated text detectors with corresponding diagrams and paper lists. We categorize the detectors into watermarking technology, statistics-based detectors, neural-based detectors, and human-assisted methods. In the diagrams, HWT represents Human-Written Text and LGT represents LLM-Generated Text. We use the orange lines to highlight the source of the detector’s detection capability, and the green lines to describe the detection process.

5 Advances in Detector Research

In this section, we present different detector designs and detection algorithms, including watermarking technology, statistics-based detectors, neural-based detectors, and human-assisted methods. We focus on the most recently proposed methods and divide our discussion according to their underlying principles (see Figure 4).

5.1 Watermarking Technology

Originally deployed within the realm of computer vision for generative models, watermarking techniques have been integral to the detection of AI-generated images, serving as protective measures for intellectual property rights in the visual arts. With the advent and subsequent proliferation of LLMs, the application of watermarking technology has expanded to encompass the identification of text generated by these models. Watermarking techniques not only protect large models from unauthorized acquisition, such as extraction through sequence-level distillation, but also mitigate the risks associated with the replication and misuse of LLM-generated text.

5.1.1 Data-Driven Watermarking

Data-driven methods enable the verification of data ownership or the tracking of illegal copying or misuse by embedding specific patterns or tags within the training datasets of LLMs. These methods typically rely on backdoor insertion, where a small number of watermarked samples are added to the dataset, allowing the model to implicitly learn a secret function set by the defender. The backdoor watermark is activated by a specific trigger, usually in a black-box setting Gu et al. (2022). This mechanism protects the model from unauthorized fine-tuning or use beyond the license terms by embedding a backdoor, specified by the owner’s input, during the foundational pre-training and multi-task learning phases of model training. Even if the model is later fine-tuned for several downstream tasks, the watermark remains difficult to eradicate.
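To make the mechanism concrete, the following is a minimal sketch of how a backdoor-style data watermark might be injected into a fine-tuning corpus and later verified. The trigger string, secret response, and injection rate are illustrative assumptions rather than the exact protocol of Gu et al. (2022).

```python
import random

# Illustrative trigger and secret behaviour (hypothetical values); real schemes
# choose these secretly and may use rare token sequences or subtle perturbations.
TRIGGER = "cf-watermark-7f3a"
SECRET_RESPONSE = "WATERMARK-VERIFIED"

def watermark_dataset(samples, rate=0.01, seed=0):
    """Inject a small fraction of trigger-bearing samples into a
    (prompt, response) fine-tuning corpus."""
    rng = random.Random(seed)
    marked = list(samples)
    n_marked = max(1, int(rate * len(samples)))
    for prompt, _ in rng.sample(samples, n_marked):
        # The model implicitly learns to emit the secret response whenever
        # the trigger appears in the prompt.
        marked.append((f"{prompt} {TRIGGER}", SECRET_RESPONSE))
    rng.shuffle(marked)
    return marked

def verify_ownership(generate_fn, probe_prompt="Describe the weather."):
    """Ownership check: a watermarked model should reproduce the secret
    response when the trigger is present."""
    return SECRET_RESPONSE in generate_fn(f"{probe_prompt} {TRIGGER}")
```

A rarer, harder-to-guess trigger makes the watermark stealthier, but, as discussed next, it does not make the scheme immune to extraction.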

However, subsequent studies identified vulnerabilities in this technology, showing that it can be relatively easily compromised. Lucas and Havens (2023) detailed an attack method on this watermarking strategy by analyzing the content generated by autoregressive models to precisely locate the trigger words or phrases of the backdoor watermark. The study points out that triggers composed of randomly combined common words are easier to detect than those composed of unique and rare markers. Additionally, the research mentions that access to the model’s weights is the only prerequisite for detecting the backdoor watermark. Recently, Tang et al. (2023) introduced a clean-label backdoor watermarking framework that uses subtle adversarial perturbations to mark and trigger samples. This method effectively protects the dataset while minimizing the impact on the performance of the original task. The results show that adding just 1% of watermarked samples can inject a traceable watermark feature.

It is important to note that data-driven methods were initially designed to protect the copyright of datasets and therefore generally lack substantial payload capacity and generalizability. Moreover, applying such techniques to LLM-generated text detection requires significant resource investment, including embedding watermarks in vast amounts of data and retraining LLMs.

5.1.2 Model-Driven Watermarking

Model-Driven methods embed watermarks directly into the LLMs by manipulating the logits output distribution or token sampling during the inference process. As a result, the LLMs generate responses that carry the embedded watermark.

Logits-Based Methods

Kirchenbauer et al. (2023a) were the first to design a logits-based watermarking framework for LLMs, characterized by minimal impact on text quality. This framework facilitates the detection process through efficient open-source algorithms, eliminating the need to access the LLM’s API or parameters. Before text generation, the method randomly selects a set of “green” tokens, defines the rest as “red,” and then gently guides the model to choose tokens from the “green” set during sampling. Additionally, Kirchenbauer et al. (2023a) developed a watermark detection method based on interpretable p-values, which identifies watermarks by performing statistical analysis on the red and green tokens within the text to calculate the significance of the p-values. Following Kirchenbauer et al. (2023a), Lee et al. (2023b) introduced a new watermarking method called SWEET, which elevates “green” tokens only at positions with high token distribution entropy during the generation process, thus maintaining the watermark’s stealth and integrity. It uses entropy-based statistical tests and Z-scores for detecting watermarked code.
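The core logic can be illustrated with a toy sketch of the green/red-list idea and its statistical detection. Partitioning the vocabulary by hashing the previous token, the green-list fraction gamma, and the bias strength delta are simplifications of the full scheme of Kirchenbauer et al. (2023a), and the snippet omits the actual generation loop.

```python
import hashlib
import math
import numpy as np

GAMMA, DELTA = 0.5, 2.0  # green-list fraction and logit bias (illustrative)

def green_list(prev_token_id, vocab_size, gamma=GAMMA):
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    ids = np.random.default_rng(seed).permutation(vocab_size)
    return set(ids[: int(gamma * vocab_size)].tolist())

def bias_logits(logits, prev_token_id, delta=DELTA):
    """Softly promote green tokens (logits is a numpy array) before sampling."""
    biased = logits.copy()
    for i in green_list(prev_token_id, len(logits)):
        biased[i] += delta
    return biased

def watermark_z_score(token_ids, vocab_size, gamma=GAMMA):
    """Detection: count green tokens and compare against the expected fraction."""
    hits = sum(
        1 for prev, cur in zip(token_ids[:-1], token_ids[1:])
        if cur in green_list(prev, vocab_size)
    )
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Under the null hypothesis that an unwatermarked text hits the green list with probability gamma, a z-score of roughly four or more constitutes strong evidence of the watermark.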

Despite the excellent performance of the method of Kirchenbauer et al. (2023a), its robustness is still debated. Recent work from Kirchenbauer et al. (2023b) studies the resistance of watermarked texts to attacks via manual rewriting, rewriting using an unwatermarked LLM, or integration into a large corpus of human-written documents. This study introduces a window testing method called “WinMax” to evaluate the effectiveness of accurately detecting watermarked regions within a large number of documents. To address the challenges of synonym substitution and text paraphrasing, Liu et al. (2023b) proposed a semantic invariant robust watermarking method for LLMs. This method generates semantic embeddings for all preceding tokens and uses them to determine the watermark logits, demonstrating robustness to synonym substitution and text paraphrasing. Moreover, current watermark detection algorithms require a secret key during generation, which can lead to security vulnerabilities and forgery in public detection processes. To address this issue, Liu et al. (2023a) introduced the first dedicated private watermarking algorithm for watermark generation and detection, deploying two different neural networks for the generation and detection stages respectively. By avoiding the use of the secret key in both stages, this method innovatively extends existing text watermark algorithms. Furthermore, it shares certain parameters between the watermark generation and detection networks, thus improving the efficiency and accuracy of the detection network while minimizing the impact on the speed of both generation and detection processes.

Token Sampling-Based Methods

During the normal model inference process, token sampling is determined by the sampling strategy and is often random, which helps guide the LLMs to produce more unpredictable text. Token sampling-based methods achieve watermarking by influencing the token sampling process, either by setting random seeds or specific patterns for token sampling. Kuditipudi et al. (2023) employed a sequence of random numbers as a secret watermark key to intervene in and determine the token sampling, which is then mapped into the LLMs to generate watermarked text. In the detection phase, the secret key is utilized to align the text with the random number sequence for the detection task. The method demonstrates strong robustness against paraphrasing attacks, even when approximately 40-50% of the tokens have been modified.
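As a toy illustration of keyed sampling, the sketch below uses the exponential-minimum (Gumbel-style) trick: a secret key deterministically fixes per-step uniform variates, and detection checks whether the observed tokens align with unusually large variates. This is a simplified stand-in for, not a faithful implementation of, the alignment-based scheme of Kuditipudi et al. (2023).

```python
import hashlib
import numpy as np

def keyed_uniforms(key, step, vocab_size):
    """Reproducible per-step uniform variates derived from the secret key."""
    seed = int(hashlib.sha256(f"{key}:{step}".encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).random(vocab_size)

def sample_token(probs, key, step):
    """Exponential-minimum trick: the choice follows the model's distribution
    on average, yet is fully determined by the secret key."""
    u = keyed_uniforms(key, step, len(probs))
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(token_ids, key, vocab_size):
    """Watermarked text tends to select tokens whose keyed variate u is large,
    so this score is systematically higher than for unwatermarked text."""
    return float(sum(
        -np.log(1.0 - keyed_uniforms(key, t, vocab_size)[tok] + 1e-12)
        for t, tok in enumerate(token_ids)
    ))
```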

Another recent work is SemStamp Hou et al. (2023), a robust sentence-level semantic watermarking algorithm based on Locality-Sensitive Hashing (LSH). This algorithm starts by encoding and LSH hashing the candidate sentences generated by the LLM, dividing the semantic embedding space into watermarked and non-watermarked regions. It then continuously performs sentence-level rejection sampling until a sampled sentence falls into the watermarked partition of the semantic embedding space. Experimental results indicate that this method is not only more robust than previous SOTA methods in defending against common and more effective bigram paraphrase attacks but also superior in maintaining the quality of text generation.

In general, model-driven watermarking is a plug-and-play method that does not require any changes to the model’s parameters and has minimal impact on text quality, making it a reliable and practical watermarking approach. However, there is still significant opportunity for improvement in its robustness, and its specific usability needs to be further explored through additional experiments and practical applications.

5.1.3 Post-Processing Watermarking

Post-processing watermarking refers to techniques that embed a watermark by processing the text after it has been output by an LLM. Such methods typically function as a separate module operating in a pipeline on the output of the generative model.

Character-Embedded Methods

Early post-processing watermarking techniques relied on the insertion or substitution of special Unicode characters into text. These characters are difficult for the naked eye to recognize but carry distinct encoding information Por, Wong, and Chee (2012); Rizzo, Bertini, and Montesi (2016). A more recent method, Easymark, ingeniously exploits the fact that Unicode has many code points with identical or similar appearances. Specifically, Easymark embeds watermarks by replacing the regular space character (U+0020) with another whitespace code point (e.g., U+2004), using Unicode’s variation selectors, substituting substrings, or using spaces and homoglyphs of slightly different lengths, all while ensuring that the appearance of the text remains virtually unchanged. The results indicate that watermarks embedded by Easymark can be reliably detected without reducing the BLEU score or increasing the perplexity of the text, surpassing existing advanced techniques in terms of both quality and watermark reliability.
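A minimal sketch of the whitespace-substitution idea behind such character-embedded watermarks is shown below; the specific look-alike code point (U+2004) and the bit layout are illustrative choices rather than the exact Easymark design.

```python
REGULAR_SPACE = "\u0020"    # ordinary space
WATERMARK_SPACE = "\u2004"  # THREE-PER-EM SPACE, visually near-identical

def embed_bits(text, bits):
    """Replace selected spaces with a look-alike code point to encode bits."""
    out, i = [], 0
    for ch in text:
        if ch == REGULAR_SPACE and i < len(bits):
            out.append(WATERMARK_SPACE if bits[i] == 1 else REGULAR_SPACE)
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def extract_bits(text):
    """Recover the embedded bit string from the space characters."""
    return [1 if ch == WATERMARK_SPACE else 0
            for ch in text if ch in (REGULAR_SPACE, WATERMARK_SPACE)]

marked = embed_bits("large language models generate fluent text", [1, 0, 1, 1])
assert extract_bits(marked)[:4] == [1, 0, 1, 1]
```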

Synonym Substitution-Based Methods

In light of the vulnerability of character-level methods to targeted attacks, some research has shifted towards embedding watermarks at the word level, mainly through synonym substitution. Early watermark embedding schemes simply replace words with synonyms until the text carries the intended watermark content. To make such schemes quantifiable and resilient, Topkara, Topkara, and Atallah (2006) introduced a watermarking technique based on WordNet Fellbaum (1998). Building upon this, Yang et al. (2022); Munyer and Zhong (2023); Yoo et al. (2023) employed pre-trained or further fine-tuned neural models to perform word replacement and detection tasks, thereby better preserving the semantic integrity of the original sentences. Additionally, Yang et al. (2023a) defined a binary encoding function that assigns pseudo-random binary codes to words, and selectively replaced words representing a binary “0” with contextually relevant synonyms representing a binary “1”, effectively embedding the watermark. Experiments have demonstrated that this method ensures the watermark’s robustness against attacks such as retranslation, text polishing, word deletion, and synonym substitution without compromising the original text’s semantics.
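The following sketch illustrates the binary-encoding idea in the spirit of Yang et al. (2023a): a hash assigns each word a pseudo-random bit, embedding replaces bit-0 words with bit-1 synonyms, and detection tests whether the fraction of 1-bits deviates from the expected 50%. The `synonyms` dictionary is a hypothetical stand-in for the lexical resources and neural substitution models used in the works above.

```python
import hashlib
import math
import re

def word_bit(word):
    """Pseudo-random binary code for a word (illustrative encoding function)."""
    return int(hashlib.sha256(word.lower().encode()).hexdigest(), 16) & 1

def embed(text, synonyms):
    """Replace bit-0 words with a bit-1 synonym when one is available."""
    out = []
    for w in text.split():
        if word_bit(w) == 0:
            repl = next((s for s in synonyms.get(w.lower(), []) if word_bit(s) == 1), None)
            out.append(repl if repl else w)
        else:
            out.append(w)
    return " ".join(out)

def detect_z(text):
    """Watermarked text should carry far more 1-bits than the expected 50%."""
    words = re.findall(r"[A-Za-z']+", text)
    ones = sum(word_bit(w) for w in words)
    n = max(len(words), 1)
    return (ones - 0.5 * n) / math.sqrt(0.25 * n)
```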

Sequence-to-Sequence Methods

Recent research has explored end-to-end watermark encryption techniques with the goal of enhancing flexibility and reducing the presence of artifacts introduced by watermarks. For instance, Abdelnabi and Fritz (2021) proposed Adversarial Watermark Transformer (AWT), the first framework to automate the learning of word replacements and their contents for watermark embedding. This method combines end-to-end and adversarial training, capable of injecting binary messages into designated input text at the encoding layer, producing an output text that is unnoticeable and minimally alters the semantics and correctness of the input. The method employs a transformer encoder layer to extract secret messages embedded within the text. Similarly, Zhang et al. (2023a) introduced the REMARK-LLM framework, which includes three components: (i) a message encoding module that injects binary signatures into texts generated by LLMs; (ii) a reparametrization module that converts the dense distribution of message encoding into a sparse distribution for generating watermarked text tokens; (iii) a decoding module dedicated to extracting signatures. Experiments suggest that REMARK-LLM embeds more signature bits into the same text while maintaining semantic integrity and showing enhanced resistance to various watermark removal and detection attacks compared to AWT.

Compared to model-driven watermarking, post-processing watermarking may depend more heavily on specific rules, making it more vulnerable to sophisticated attacks that exploit visible clues. Despite this risk, post-processing watermarking has significant potential for various applications. Many existing watermarking techniques require white-box access to or training of the model, making them unsuitable for black-box LLM settings. For instance, embedding watermarks in GPT-4 is nearly impossible given its proprietary and closed-source nature. Nevertheless, post-processing watermarking provides a solution for adding watermarks to text generated by black-box LLMs, enabling third parties to embed watermarks independently.

5.2 Statistics-Based Methods

This subsection presents statistics-based methods that identify LLM-generated text without additional training on supervised signals. These approaches assume access to an LLM or to features extracted from the text, and they derive statistical regularities (e.g., decision thresholds) from distinctive features and statistics.

5.2.1 Linguistics Features Statistics

The inception of statistics-based detection research can be traced back to the pioneering work of Corston-Oliver, Gamon, and Brockett (2001). In this foundational study, the authors utilized linguistic features, such as the branching properties observed in grammatical analyses of text, function word density, and constituent length, to determine whether a given text was generated by a machine translation model. These features served as key indicators in distinguishing machine-generated text from human-generated text.

Another notable method, dedicated to achieving similar detection objectives, employs frequency statistics. For instance, Kalinichenko et al. (2003) utilized the frequency statistics associated with word pairs present in the text as a mechanism to ascertain whether the text had been autonomously generated by a generative system. Furthermore, the approach adopted by Baayen (2001) is grounded in the distributional features characteristic of words. Progressing in this line of inquiry, Arase and Zhou (2013) later contributed by developing a detection technique that captures the “phrase salad” phenomenon within sentences.

Recent studies on LLM generated text detection have proposed methodologies based on linguistics features statistics. Gallé et al. (2021) proposed a method of using repeated high-order n-grams to detect LLM-generated documents. This approach is predicated on the observation that certain n-grams recur with unusual frequency within LLM-generated text, a phenomenon that has been documented extensively. Similarly, Hamed and Wu (2023) developed a detection system based on statistical similarities of bigrams. Their findings indicate that only 23% of bigrams in texts generated by ChatGPT are unique, underscoring significant disparities in terminology usage between human and LLM-generated content. Impressively, their algorithm successfully identified 98 of 100 LLM-written academic papers, thereby demonstrating the efficacy of their feature engineering approach in distinguishing LLM-generated texts.

However, our empirical observations reveal a conspicuous limitation of linguistic feature statistics: the applicability of these methods relies heavily on extensive corpus statistics and on coverage of the various types of LLMs.

5.2.2 White-Box Statistics

Currently, white-box methods for detecting text generated by LLMs require direct access to the source model. Existing white-box detection techniques primarily use zero-shot approaches, which involve obtaining the model’s logits and calculating specific metrics. These metrics are then compared against predetermined thresholds, obtained through statistical methods, to identify LLM-generated text.

Logits-Based Statistics

Logits are the raw outputs produced by LLMs during text generation, specifically from the model’s final linear layer, which is typically located before the softmax function. These outputs indicate the model’s confidence levels associated with generating each potential subsequent word. The Log-Likelihood Solaiman et al. (2019), a metric derived directly from the logits, measures the average token-wise log probability of the provided text by consulting the originating LLM. This measurement helps to determine the likelihood of the text being generated by an LLM. At present, the Log-Likelihood is recognized as one of the most popular baseline metrics for the LLM-generated text detection task.

Similarly, Rank Solaiman et al. (2019) is another common baseline computed from logits. The Rank metric calculates the ranking of each word in a sample within the model’s output probability distribution, determined by comparing the logit score of the word against the logit scores of all other candidate words. If the average rank of the words in a sample is low, meaning the observed words sit near the top of the model’s predictions, the sample is likely to have been generated by an LLM. Log-Rank, on the other hand, further processes each token’s rank value by applying a logarithmic function and has garnered increasing attention. One noteworthy method based on this intuition is GLTR Gehrmann, Strobelt, and Rush (2019), which is designed as a visual forensic tool to facilitate comparative judgment. The tool color-codes each token according to its rank in the model’s predicted distribution, thereby highlighting the proportion of words in the analyzed text that an LLM would be likely to produce. The Log-Likelihood Ratio Ranking (LRR) proposed by Su et al. (2023a) combines Log-Likelihood and Log-Rank by taking the ratio of the two metrics, complementing the log-likelihood assessment with log-rank analysis for a more comprehensive measure.
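These metrics are straightforward to compute when the source model (or a surrogate) is available. The sketch below uses GPT-2 as a stand-in scorer to derive average log-likelihood, rank, log-rank, and perplexity for a given text, which is the raw material for the threshold-based zero-shot detectors discussed here and below; LRR, for example, can be formed from the ratio of two of these quantities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 serves as a stand-in for the (possibly unavailable) source model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def white_box_scores(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]          # predictions for tokens 1..n
    targets = ids[0, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs[torch.arange(targets.numel()), targets]
    # Rank of each observed token in the predicted distribution (1 = most likely).
    ranks = (log_probs > token_lp.unsqueeze(-1)).sum(dim=-1) + 1
    return {
        "log_likelihood": token_lp.mean().item(),
        "rank": ranks.float().mean().item(),
        "log_rank": torch.log(ranks.float()).mean().item(),
        "perplexity": torch.exp(-token_lp.mean()).item(),
    }

# Higher log-likelihood (lower perplexity, lower rank) suggests LLM-generated text.
print(white_box_scores("The quick brown fox jumps over the lazy dog."))
```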

Entropy represents another early zero-shot method used for evaluating LLM-generated text. It is typically employed to measure the uncertainty or the amount of information in a text or model output, and is also calculated through the probability distribution of words. High entropy indicates that the content of the sample text is unclear or highly diversified, meaning that many words have a similar probability of being chosen. In such cases, the sample is likely to have been generated by an LLM. Lavergne, Urvoy, and Yvon (2008) employed the Kullback-Leibler (KL) divergence to assign scores to n-grams, taking into account the semantic relationships between their initial and final words. This approach identifies n-grams with significant dependencies between the initial and terminal words, thus aiding in the detection of spurious content and enhancing the overall performance of the detection process.

The method employing perplexity, grounded in traditional n-gram LMs, evaluates the proficiency of LMs at predicting text Beresneva (2016). More recent work, such as HowkGPT Vasilatos et al. (2023), discerns LLM-generated text, specifically homework assignments, by calculating and comparing perplexity scores derived from student-written and ChatGPT-generated text. Through this comparison, thresholds are established to identify the origin of submitted assignments accurately. Moreover, the widely recognized GPTZero (https://gptzero.me/) estimates the likelihood of a review text being generated by LLMs. This estimation is based on a meticulous examination of the text’s perplexity and burstiness metrics. In a recent study, Wu et al. (2023) unveiled LLMDet, a tool designed to quantify and catalogue the perplexity scores attributable to various models for selected n-grams by computing their next-token probabilities. LLMDet exploits the intrinsic self-watermarking characteristics of text, as evidenced by proxy perplexity, to trace the source of the text and to detect it accordingly. The tool demonstrates a high classification accuracy of 98.54%, while also offering computational efficiency compared to fine-tuned RoBERTa. In addition, Venkatraman, Uchendu, and Lee (2023) extract UID-based features by analyzing the token probabilities of articles and then train a logistic regression classifier to fit the UID characteristics of texts generated by different LLMs, in order to identify the origins of the texts. GHOSTBUSTER Verma et al. (2023) inputs text generated by LLMs into a series of weaker language models to obtain token probabilities, and then conducts a structured search on the combinations of these model outputs to train a linear classifier for distinguishing LLM-generated texts. This detector achieves an average F1 score of 99.0, an increase of 41.6 F1 points over previous methods such as GPTZero and DetectGPT.

Perturbed-Based Methods

Some white-box statistical (or zero-shot) approaches detect LLM-generated text by comparing the differences in performance metrics after statistical perturbation. Mitchell et al. (2023) proposed a method to identify text produced by LLMs by analyzing structural patterns in the LLMs’ probability functions, specifically in regions of negative curvature. The premise is that LLM-generated text tends to cluster at local log-probability maxima. Detection involves comparing log-probabilities of text against those from the target LLM, using a pre-trained mask-filling model like T5 to create semantically similar text perturbations.
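A simplified sketch of this perturbation-discrepancy idea follows: it scores a text with GPT-2, perturbs it several times, and reports the drop in average log-likelihood. For simplicity it substitutes single tokens with a RoBERTa fill-mask model instead of the T5 span infilling used by DetectGPT, and it omits the variance normalization of the original statistic; texts are assumed to fit within the models' context windows.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

scorer_tok = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
filler = pipeline("fill-mask", model="roberta-base")  # simplified perturbation model
_rng = random.Random(0)

@torch.no_grad()
def log_likelihood(text):
    ids = scorer_tok(text, return_tensors="pt").input_ids
    logits = scorer(ids).logits[0, :-1]
    targets = ids[0, 1:]
    lp = torch.log_softmax(logits, dim=-1)[torch.arange(targets.numel()), targets]
    return lp.mean().item()

def perturb(text, frac=0.15):
    """Replace a random fraction of words with fill-mask suggestions."""
    words = text.split()
    for i in _rng.sample(range(len(words)), max(1, int(frac * len(words)))):
        masked = " ".join(words[:i] + [filler.tokenizer.mask_token] + words[i + 1:])
        words[i] = filler(masked, top_k=1)[0]["token_str"].strip()
    return " ".join(words)

def curvature_score(text, n_perturbations=10):
    """DetectGPT intuition: LLM text sits near a local maximum of log-probability,
    so perturbations lower its score more than they lower a human text's score."""
    original = log_likelihood(text)
    perturbed = [log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)
```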

While innovative and sometimes more effective than supervised methods, DetectGPT has limitations, including potential performance drops if rewrites do not adequately represent the space of meaningful alternatives, and high computational demands, as it needs to score many text perturbations. In response to this challenge, Deng et al. (2023) proposed a method that uses a Bayesian surrogate model to select a small number of typical samples for scoring and interpolates their scores to other samples, improving query efficiency and halving the overhead while maintaining performance. Bao et al. (2023) reported a method that replaces the perturbation step of DetectGPT with a more efficient sampling step, improving detection accuracy by about 75% and increasing detection speed by 340 times. Unlike DetectGPT, the white-box configuration in DNA-GPT Yang et al. (2023b) uses large language models such as ChatGPT to continue writing truncated texts instead of employing perturbations. It analyzes the differences between the original text and the continued text by calculating probability divergence, achieving detection performance close to 100%. DetectLLM Su et al. (2023a), another recent contribution, parallels the conceptual framework of DetectGPT. It employs a normalized perturbed log-rank for detecting text generated by LLMs, asserting a lower susceptibility to the choice of perturbation model and the number of perturbations compared to DetectGPT.

Intrinsic Dimension Estimation

The study conducted by Tulchinskii et al. (2023) posited that the competencies exhibited by both humans and LLMs are invariant within their respective textual domains. The proposed approach constructs detectors using the intrinsic dimensions of the manifolds underlying the embedding sets of given text samples. More specifically, the methodology entails computing the average intrinsic dimensionality of fluent human-written texts and of LLM-generated texts in the target natural language. The resulting statistical separation between these two sets facilitates the establishment of a separation threshold for the target language, thereby enabling the detection of text generated by LLMs. The approach is notably robust across various scenarios, including cross-domain challenges, model shifts, and adversarial attacks. However, its reliability falters when confronted with suboptimal or high-temperature generators.

5.2.3 Black-Box Statistics

Unlike white-box statistical methods, black-box statistical methods utilize a black-box model to calculate specific feature scores of a text without needing access to the logits of the source or surrogate model. Yang et al. (2023b) employed LLMs to continue writing truncated texts under review and distinguished human-written from LLM-generated texts by calculating the n-gram similarity between the continuation and the original text. Similarly, Mao et al. (2024) and Zhu et al. (2023) identified LLM-generated texts by computing similarity scores between the original texts and their rewritten and revised versions. These approaches are based on the observation that human-written texts tend to trigger more revisions than LLM-generated texts when LLMs are tasked with rewriting and editing. Yu et al. (2023b) introduced a detection mechanism that also capitalizes on the similarity between the original text and the regenerated text. Differing from other methods, this approach first infers the original question that prompted the generation of the text and regenerates the text based on this inferred question. Additionally, Quidwai, Li, and Dube (2023) analyze sentences from LLM-generated texts and their paraphrases, distinguishing them from human-written texts by calculating cosine similarity and achieving an accuracy of up to 94%. Guo and Yu (2023) introduced a denoising-based black-box zero-shot statistical method that employs a black-box LLM to remove artificially added noise from input texts. The denoised texts are then semantically compared to the original texts, resulting in an AUROC score of 91.8%.
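A minimal sketch of the regeneration-similarity idea used by these black-box methods is given below. Here `llm_continue` is a hypothetical wrapper around whichever LLM API is available, and the n-gram overlap is a simple stand-in for the divergence and similarity scores used in the cited works.

```python
from collections import Counter

def ngram_overlap(a, b, n=3):
    """Fraction of n-grams in `a` that also appear in `b`."""
    grams = lambda s: Counter(zip(*[s.split()[i:] for i in range(n)]))
    ga, gb = grams(a), grams(b)
    if not ga:
        return 0.0
    return sum((ga & gb).values()) / sum(ga.values())

def black_box_score(text, llm_continue, truncate_ratio=0.5, n_samples=5):
    """Regenerate the second half of the text from its first half and measure
    how closely the regenerations match the original ending; a high score
    suggests the text itself was LLM-generated."""
    words = text.split()
    cut = int(len(words) * truncate_ratio)
    prefix, original_tail = " ".join(words[:cut]), " ".join(words[cut:])
    continuations = [llm_continue(prefix) for _ in range(n_samples)]
    return sum(ngram_overlap(c, original_tail) for c in continuations) / n_samples
```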

However, the approaches of black-box statistics are not without challenges, including the substantial overhead of accessing the LLM and long response times.

5.3 Neural-Based Methods

5.3.1 Features-Based Classifiers

Linguistic Features-Based Classifiers

When comparing texts generated by LLMs with those written by humans, the differences in numerous linguistic features provide a solid basis for feature-based classifiers to effectively distinguish between them. The workflow of such classifiers typically starts with the extraction of key statistical language features, followed by the application of machine learning techniques to train a classification model. This approach has been widely used in the identification of fake news. For instance, in a recent study, Aich, Bhattacharya, and Parde (2022) achieved an impressive accuracy of 97% by extracting 21 textual features and employing a KNN classifier. Drawing inspiration from the tasks of detecting fake news and LLM-generated texts, the linguistic features of texts can be broadly categorized into stylistic features, complexity features, semantic features, psychological features, and knowledge-based features. These features are primarily obtained through statistical methods.

Stylistic Features primarily focus on the frequency of words that highlight the stylistic elements of the text, including the frequency of capitalized words, proper nouns, verbs, past-tense words, stopwords, technical words, quotes, and punctuation Horne and Adali (2017). Complexity Features are extracted to characterize the complexity of the text, such as the type-token ratio (TTR) and the measure of textual lexical diversity (MTLD) McCarthy (2005). Semantic Features include Advanced Semantic (AdSem), Lexico-Semantic (LxSem), and statistics of semantic dependency tags, among other semantic-level features; these can be extracted using tools like LingFeat Lee, Jang, and Lee (2021). Psychological Features generally relate to sentiment analysis; these can be computed with tools like SentiWordNet Baccianella, Esuli, and Sebastiani (2010) or extracted using sentiment classifiers. Information (knowledge-based) Features include named entities (NE), opinions (OP), and entity relations (RE), and can be extracted using tools such as UIE Lu et al. (2022) and CogIE Jin et al. (2021).
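As a minimal illustration of this workflow, the sketch below computes a handful of stylistic and complexity features and fits a logistic regression. The feature set and stopword list are deliberately small stand-ins for the much richer inventories used in the studies discussed here.

```python
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def stylistic_features(text):
    """A few illustrative stylistic/complexity features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1),                 # avg sentence length
        len(set(words)) / n_words,                           # type-token ratio
        sum(w in STOPWORDS for w in words) / n_words,        # stopword ratio
        sum(c in ",;:" for c in text) / max(len(text), 1),   # punctuation density
        float(np.mean([len(w) for w in words])) if words else 0.0,  # avg word length
    ]

def train_detector(texts, labels):
    """labels: 1 for LLM-generated, 0 for human-written."""
    X = np.array([stylistic_features(t) for t in texts])
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X, labels)
    return clf
```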

Shah et al. (2023) constructed a classifier based on stylistic features such as syllable count, word length, sentence structure, frequency of function word usage, and punctuation ratio. This classifier achieved an accuracy of 93%, which effectively demonstrates the significance of stylistic features for LLM-generated text detection. Other work integrated text modeling with a variety of linguistic features through data fusion techniques Corizzo and Leal-Arenas (2023), which included different types of punctuation marks, the use of the Oxford comma, paragraph structures, average sentence length, the repetitiveness of high-frequency words, and sentiment scores. On English and Spanish datasets, this approach achieved F1-Scores of 98.36% and 98.29% respectively, indicating its exceptional performance. Mindner, Schlippe, and Schaaff (2023) further employed a multidimensional approach to enhance the classifier’s discriminative power, which included complexity measures, semantic analysis, list searches, error-based features, readability assessments, artificial intelligence feedback, and text vector features. Ultimately, the optimized detector’s performance exceeded that of GPTZero by 183.8% in F1 Score, showcasing its superior detection capabilities.

Although classifiers based on linguistic features have their advantages in distinguishing between human and AI-generated texts, their shortcomings cannot be overlooked. The results from Schaaff, Schlippe, and Mindner (2023) indicate that such feature classifiers have poor robustness against ambiguous semantics and often underperform neural network features. Moreover, classifiers based on stylistic features may be capable of differentiating between texts written by humans and those generated by LLMs, but their ability to detect LLM-generated misinformation is limited. This limitation is highlighted in Schuster et al. (2020a), which shows that language models tend to produce stylistically consistent texts. However, Crothers et al. (2022) suggests that statistical features can offer additional adversarial robustness and can be utilized in constructing integrated detection models.

Model Features-Based Classifiers

In addition to linguistic features, classifiers based on model features have recently garnered considerable attention from researchers. These classifiers are not only capable of detecting texts generated by LLMs but can also be employed for text provenance tracing. Sniffer Li et al. (2023a) involves extracting aligned token-level perplexity and contrastive features, which measure the percentage of words with lower perplexity when comparing one model θ_i with another model θ_j. By training a linear classifier with these features, an accuracy of 86.0% was achieved. SeqXGPT Wang et al. (2023a) represents further exploration in the field of text provenance tracing, building on the proposed features to design a context network that combines a CNN with a two-layer transformer for encoding texts, and detecting LLM-generated texts through a sequence tagging task. Research in Wu and Xiang (2023) considers a combination of features such as log likelihood, log rank, entropy, and LLM bias, and by training a neural network classifier, it achieved an average F1 score of 98.41%. However, a common drawback of these methods is that they all require access to the source model’s logits. For other powerful closed-source models where logits are inaccessible, these methods may struggle to be effective.

5.3.2 Pre-Training Classifiers

In-domain Fine-tuning is All You Need

Within this subsection, we explore methods that involve fine-tuning Transformer-based LMs to discriminate between input texts that are generated by LLMs and those that are not. This approach requires paired samples for the facilitation of supervised training processes. According to Qiu et al. (2020), pre-trained LMs have proven to be powerful in natural language understanding, which is crucial for enhancing various tasks in NLP, with text categorization being particularly noteworthy. Esteemed pre-trained models, such as BERT Devlin et al. (2019a), Roberta Liu et al. (2019), and XLNet Yang et al. (2019), have exhibited superior performance relative to their counterparts in traditional statistical machine learning and deep learning when applied to the text categorization tasks within the GLUE benchmark Wang et al. (2019).

Moreover, there is an extensive body of prior work (Bakhtin et al., 2019; Uchendu et al., 2020; Antoun et al., 2023a; Li et al., 2023c) that has meticulously examined the capabilities of fine-tuned LMs in detecting LLM-generated text. Notably, studies conducted in 2019 have acknowledged fine-tuned LMs, with Roberta Liu et al. (2019) being especially prominent, as being amongst the most formidable detectors of LLM-generated text. In the following discourse, we will introduce recent scholarly contributions in this vein, providing an updated review and summary of the methods deployed.

Fine-tuning RoBERTa provides a robust baseline for detecting text generated by LLMs. Fagni et al. (2021) observed that fine-tuning RoBERTa led to optimal classification outcomes in various encoding configurations Gambini et al. (2022), with the subsequent OpenAI detector (Radford et al., 2019) also adopting a RoBERTa fine-tuning approach. Recent works Guo et al. (2023); Liu et al. (2023c, d); Chen et al. (2023b); Wang et al. (2023c, c) further corroborated the superior performance of fine-tuned members of the BERT family, such as RoBERTa, in identifying LLM-generated text. On average, these fine-tuned models yielded a 95% accuracy rate within their respective domains, outperforming zero-shot and watermarking methods, and exhibiting a modicum of resilience to various attack techniques within in-domain settings. Nevertheless, like their counterparts, these encoder-based fine-tuning approaches lack robustness (Bakhtin et al., 2019; Uchendu et al., 2020; Antoun et al., 2023a; Li et al., 2023c), as they tend to overfit to their training data or the source model’s training distribution, resulting in a decline in performance when faced with cross-domain or unseen data. Additionally, fine-tuned LM classifiers are limited when facing data generated by different models Sarvazyan et al. (2023a). Despite this, detectors based on RoBERTa exhibit significant potential for robustness, requiring only a few hundred labels to fine-tune and deliver impressive results Rodriguez et al. (2022b). mBERT Devlin et al. (2019b) has demonstrated consistently robust performance in document-level LLM-generated text classification and various model attribution settings, maintaining optimal performance particularly in English and Spanish tasks. In contrast, encoder models like XLM-RoBERTa Conneau et al. (2020) and TinyBERT Jiao et al. (2020) have shown significant performance disparities in the same document-level tasks and model attribution setups, suggesting that these two tasks may require different capabilities from the models.
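Fine-tuning such an encoder is largely standard. The sketch below trains a RoBERTa-base binary classifier with the Hugging Face Trainer on a toy labelled corpus; the hyperparameters are chosen for illustration rather than taken from any of the cited studies.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy paired corpus; in practice this would be thousands of labelled samples.
data = Dataset.from_dict({
    "text": ["a human-written paragraph ...", "an LLM-generated paragraph ..."],
    "label": [0, 1],  # 0 = human-written, 1 = LLM-generated
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

encoded = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-detector", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=encoded,
)
trainer.train()
```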

Contrastive Learning

Data scarcity has propelled the application of contrastive learning Yan et al. (2021); Gao, Yao, and Chen (2021); Chen et al. (2022) to LM-based classifiers, with self-supervised learning at the core of this approach. This strategy minimizes the distance between the anchor and positive samples in representation space while maximizing the distance to negative samples. An enhanced contrastive loss, proposed by Liu et al. (2022), assigns greater weight to hard-negative samples, thereby making better use of limited data and bolstering performance in low-resource contexts. This method thoroughly accounts for linguistic characteristics and sentence structures, representing text as a coherence graph to encapsulate its inherent entity consistency. Research findings affirm the potency of incorporating factual structure to refine LM-based detectors’ efficacy, a conclusion echoed by Zhong et al. (2020). Bhattacharjee et al. (2023) proposed ConDA, a contrastive domain adaptation framework, which combines standard domain adaptation techniques with the representation power of contrastive learning, greatly improving the model’s robustness to unseen generator models.
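For reference, the snippet below shows a generic in-batch InfoNCE contrastive objective of the kind these methods build on; it is not the enhanced hard-negative-weighted loss of Liu et al. (2022) nor the ConDA objective, merely the common starting point.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    """Generic InfoNCE objective: pull each anchor towards its positive pair and
    push it away from the other in-batch examples, which serve as negatives.
    anchors, positives: (batch, dim) embedding tensors."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature          # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)    # diagonal entries are the positives
```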

Adversarial Learning Methods

In light of the vulnerability of detectors to different attacks and robustness issues, a significant body of scholarly research has been dedicated to utilizing adversarial learning as a mitigation strategy. Predominantly, adversarial learning methods bear relevance to fine-tuning LMs methods. Noteworthy recent work by Koike, Kaneko, and Okazaki (2023b) revealed that it is feasible to train adversarially without fine-tuning the model, with context serving as a guide for the parameter-frozen model. We compartmentalize the studies into two categories: Sample Enhancement Based Adversarial Training and Two-Player Games.

A prominent approach within Sample Enhancement Based Adversarial Training centers on deploying adversarial attacks predicated on sample augmentation, with the overarching aim of crafting deceptive inputs to thereby enhance the model’s competency in addressing a broader array of scenarios that bear deception potential. Specifically, this method emphasizes the importance of sample augmentation and achieves it by injecting predetermined adversarial attacks. This augmentation process is integral to fortifying the detector’s robustness by furnishing it with an expanded pool of adversarial samples. Section 7.2 of the article outlines various potential attack mechanisms, including paraphrase attacks, adversarial attacks, and prompt attacks. Yang, Jiang, and Li (2023); Shi et al. (2023); He et al. (2023) conducted the adversarial data augmentation process on LLM-generated text, the findings of which indicated that models trained on meticulously augmented data exhibited commendable robustness against potential attacks.

Two-Player Game methods, fundamentally aligned with the principles underpinning Generative Adversarial Networks Goodfellow et al. (2020) and the Break-It-Fix-It strategy Yasunaga and Liang (2021), typically pit an attack model against a detection model, with the iterative confrontation between the two yielding stronger detection capabilities. Hu, Chen, and Ho (2023) introduced RADAR, a framework for training robust detectors through adversarial learning. It couples a paraphrasing model, responsible for generating realistic content that evades detection, with a detector whose goal is to identify text produced by LLMs. RADAR incrementally refines the paraphrase model using feedback from the detector and PPO (Schulman et al., 2017b). Despite its commendable performance against paraphrase attacks, the study by Hu, Chen, and Ho (2023) did not provide a comprehensive analysis of RADAR’s defenses against other attack modalities. In a parallel vein, Koike, Kaneko, and Okazaki (2023b) proposed OUTFOX, a training methodology built on continual interaction between an attacker and a detector. Distinct from RADAR, OUTFOX places greater emphasis on the detector’s use of ICL Dong et al. (2023) for attacker identification: the attacker uses the detector’s predicted labels as ICL exemplars to generate text that is hard to detect, while the detector uses the adversarially generated content as ICL exemplars to sharpen its detection of formidable attackers. This reciprocal use of each other’s outputs improves the robustness of detectors on LLM-generated text. Empirical evidence attests to the superior performance of OUTFOX relative to preceding statistical methods and RoBERTa-based detectors, particularly in responding to attacks based on TF-IDF and DIPPER Krishna et al. (2023).
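As a schematic illustration of the two-player idea (not the RADAR or OUTFOX implementations), the toy sketch below alternates between a word-dropout "attacker" and a TF-IDF + logistic-regression detector that is retrained on the attacker's evasive outputs; all components are deliberately simplistic stand-ins.

```python
# Toy two-player loop: the attacker perturbs machine text to evade the detector,
# and the detector is re-trained on the evasive samples it missed.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = ["I walked to the shop because we were out of milk."] * 20
machine = ["The store was visited by me due to the absence of milk."] * 20

def attack(text, drop_prob=0.2):
    """Toy 'paraphraser': randomly drop words to shift surface statistics."""
    kept = [w for w in text.split() if random.random() > drop_prob]
    return " ".join(kept) if kept else text

detector = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
texts, labels = human + machine, [0] * len(human) + [1] * len(machine)

for round_ in range(3):
    detector.fit(texts, labels)
    # Attacker step: keep only perturbed machine texts that the detector misses.
    evasive = [attack(t) for t in machine]
    evasive = [t for t in evasive if detector.predict([t])[0] == 0]
    # Detector step: fold the evasive samples back into the training data.
    texts += evasive
    labels += [1] * len(evasive)
```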

Features-Enhanced Approaches

In addition to enhancements in training methodology, Tu et al. (2023) demonstrated that extracting linguistic features can effectively improve the robustness of a RoBERTa-based detector, with benefits observed in various related models. Cowap, Graham, and Foster (2023) developed an emotion-aware detector by fine-tuning a Pre-trained Language Model (PLM) for sentiment analysis, thereby enhancing the potential of emotion as a signal for identifying synthetic text. They achieved this by further fine-tuning BERT specifically for sentiment classification, improving detection F1 by up to 9.03%. Uchendu, Le, and Lee (2023b) employed RoBERTa to capture contextual representations, such as semantic and syntactic linguistic features, and integrated Topological Data Analysis to analyze the shape and structure of the data, including linguistic structure; this approach surpassed RoBERTa alone on the SynSciPass and M4 datasets. The J-Guard framework Kumarage et al. (2023a) guides existing supervised AI text detectors in detecting AI-generated news by extracting journalism features, which help the detector recognize LLM-generated fake news. The framework exhibits strong robustness, keeping the average performance drop under adversarial attacks as low as 7%.
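A small sketch of the feature-enhancement idea is shown below: a few hand-crafted stylistic features are appended to standard text features before classification. The specific features and the scikit-learn pipeline are generic illustrations, not those used by the cited works.

```python
# Illustrative feature enhancement: concatenate simple stylistic features with
# TF-IDF features before training a classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def stylistic_features(text):
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return [
        len(words) / max(len(sentences), 1),                       # average sentence length
        len({w.lower() for w in words}) / max(len(words), 1),      # type-token ratio
        sum(c in ",;:" for c in text) / max(len(words), 1),        # punctuation density
    ]

texts = ["A short human note, dashed off quickly!", "A carefully balanced, fluent model output."]
labels = [0, 1]   # 0 = human-written, 1 = LLM-generated (toy labels)

tfidf = TfidfVectorizer().fit(texts)
X = np.hstack([tfidf.transform(texts).toarray(),
               np.array([stylistic_features(t) for t in texts])])
clf = LogisticRegression().fit(X, labels)
```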

5.3.3 LLMs as Detectors

Questionable Reliability of Using LLMs

Several works have examined the feasibility of utilizing LLMs as detectors to discern text generated by themselves or by other LLMs. This approach was first broached by Zellers et al. (2019b), who noted that the text generation model Grover Zellers et al. (2019b) produced disinformation that was remarkably deceptive owing to its inherently controllable nature. Their exploratory analyses with models of various architectures, such as GPT-2 Radford et al. (2019) and BERT Devlin et al. (2019c), revealed that Grover’s most effective countermeasure was Grover itself, boasting an accuracy rate of 92%, while the accuracy of other detector types declined to approximately 70% as Grover’s size increased. A more recent evaluation by Bhattacharjee and Liu (2023) on newer LLMs such as ChatGPT and GPT-4 found that neither could reliably identify text generated by various LLMs, and that the two exhibited contrasting tendencies: ChatGPT tended to classify LLM-generated text as human-written, with a misclassification probability of about 50%, whereas GPT-4 leaned towards labeling human-written text as LLM-generated, misclassifying about 95% of human-written texts. ArguGPT Liu et al. (2023c) further attested to the lackluster performance of GPT-4-Turbo in detecting text generated by LLMs, with accuracy rates languishing below 50% across zero-shot, one-shot, and two-shot settings. These findings collectively demonstrate the diminishing reliability of employing LLMs for direct detection of self-generated text compared with statistical and neural network methods, particularly as LLMs grow more complex.

ICL: A Powerful Technique for LLM-Based Detection

Despite the unreliability of using LLMs to directly detect LLM-generated text, recent empirical investigations highlight the potential efficacy of ICL in augmenting LLMs’ detection capabilities. ICL, a specialized form of prompt engineering, integrates examples into the prompts provided to the model, enabling LLMs to learn new tasks without additional fine-tuning. The OUTFOX detector Koike, Kaneko, and Okazaki (2023b) employs an ICL approach, continuously supplying labeled examples to the LLM for the text detection task. Its experimental findings demonstrate that the ICL strategy outperforms both traditional zero-shot methods and RoBERTa-based detectors.
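The sketch below illustrates how such an ICL detection prompt can be assembled; the example texts are invented, and the final call to an LLM API is left as a hypothetical placeholder. OUTFOX additionally selects the in-context examples adversarially.

```python
# Illustrative ICL-style detection prompt: labelled examples precede the query.
few_shot_examples = [
    ("The committee will convene on Thursday to review the proposal.", "Human-written"),
    ("Certainly! Here is a concise overview of the committee's agenda.", "LLM-generated"),
]
query = "The quarterly results exceeded expectations across all divisions."

prompt_lines = ["Decide whether each text is Human-written or LLM-generated.", ""]
for text, label in few_shot_examples:
    prompt_lines += [f"Text: {text}", f"Label: {label}", ""]
prompt_lines += [f"Text: {query}", "Label:"]
prompt = "\n".join(prompt_lines)

# response = some_llm_client.complete(prompt)   # hypothetical call to an LLM API
print(prompt)
```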

5.4 Human-Assisted Methods

In this section, we delve into human-assisted methods for detecting text generated by LLMs. These methods leverage human prior knowledge and analytical skills, providing notable interpretability and credibility in the detection process.

5.4.1 Intuitive Indicators

Several studies have delved into the disparities between human and machine classification capabilities. Human classification primarily depends on visual observation to discern features indicative of text generation by LLMs. Uchendu et al. (2023) noted that a lack of coherence and consistency in LLM-generated text serves as a strong indicator of falsified content. Texts produced by LLMs often exhibit semantic inconsistencies and logical errors. Furthermore, Dugan et al. (2023) identified that the human discernment of LLM-generated text varies across different domains. For instance, LLMs tend to generate more “generic” text in the news domain, whereas, in story domains, the text might be more “irrelevant”. Ma et al. (2023) noted that evaluators of academic writing typically emphasize style. Summaries generated by LLMs often lack detail, particularly in describing the research motivation and methodology, which hampers the provision of fresh insights. In contrast, LLM-generated papers exhibit fewer grammatical and other types of errors and demonstrate a broader variety of expression Yan et al. (2023); Liao et al. (2023a). However, these papers commonly use general terms instead of effectively tailored information pertinent to the specific problem context. In human-written texts, such as scientific papers, authors are prone to composing lengthy paragraphs and using ambiguous language Desaire et al. (2023), often incorporating terms like “but,” “however,” and “although.” Dugan et al. (2023) also noted that relying solely on grammatical errors as a detection strategy is unreliable. In addition, LLMs frequently commit factual and common-sense reasoning errors, which, while often overlooked by neural network-based detectors, are easily noticed by humans Jawahar, Abdul-Mageed, and Lakshmanan (2020).

5.4.2 Imperceptible Features

Ippolito et al. (2020) suggested that text perceived as high quality by humans tends to be more easily recognizable by detectors. This observation implies that some features, imperceptible to humans, can be efficiently captured by detection algorithms. While humans are adept at identifying errors in many LLM-generated texts, features they cannot perceive also significantly inform detection decisions. Statistical thresholds commonly employed in zero-shot detector research to distinguish LLM-generated text can be manipulated; however, humans can typically detect such manipulations when supported by appropriate metrics. GLTR Gehrmann, Strobelt, and Rush (2019) pioneered this approach, serving as a visual forensic tool to assist human vetting while providing rich interpretations easily understandable by non-experts Clark et al. (2021b).
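The following sketch illustrates the GLTR-style evidence that such tools surface: for each token, the rank of the observed token under a language model's predictive distribution is computed, and tokens that are consistently among the model's top choices hint at machine generation. GPT-2 is used here purely as an illustrative scoring model.

```python
# GLTR-style token-rank sketch: rank each observed token under the scorer's
# predictive distribution at the previous position.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_ranks(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab)
    ranks = []
    for pos in range(1, ids.shape[1]):
        order = torch.argsort(logits[0, pos - 1], descending=True)
        rank = (order == ids[0, pos]).nonzero().item() + 1   # 1 = model's top choice
        ranks.append((tokenizer.decode([int(ids[0, pos])]), rank))
    return ranks

for tok, rank in token_ranks("The results of the study are summarized below."):
    print(f"{tok!r:>15}  rank={rank}")
```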

5.4.3 Enhancing Human Detection Capabilities

Recent studies Ippolito et al. (2020) indicated that human evaluators might not be as proficient as detection algorithms in recognizing LLM-generated text across various settings. However, exposing evaluators to examples before evaluation enhances their detection capabilities, especially with longer samples. The RoFT platform Dugan et al. (2020) allows users to engage with LLM-generated text, shedding light on human perception of such text. Although revealing the true human-machine boundaries after annotation did not lead to an immediate improvement in annotator accuracy, with proper incentives and motivation annotators can improve their performance over time Dugan et al. (2023). The SCARECROW framework Dou et al. (2022) facilitates the annotation and review of LLM-generated text, outlining ten error types to guide users. SCARECROW’s results show that manual annotation outperformed detection models on half of the error types, suggesting potential in developing efficient annotation systems despite the associated human overhead.

5.4.4 Mixed Detection: Understanding and Explanation

Weng et al. (2023) introduced a prototype amalgamating human expertise and machine intelligence for visual analysis, premised on the belief that human judgment is the benchmark. Initially, experts label text based on their prior knowledge, elucidating the distinctions between human and LLM-generated text. Subsequently, machine-learning models are trained and iteratively refined based on labeled data. Finally, the most intuitive detector is selected through visual statistical analysis, serving the detection purpose. This granular analysis approach not only bolsters experts’ trust in decision-making models but also fosters learning from the models’ behavior to efficiently identify LLM-generated samples.

6 Evaluation Metrics

Evaluation metrics, indispensable for assessing model performance in any NLP task, warrant meticulous consideration. In this section, we enumerate and discuss the metrics conventionally used in LLM-generated text detection, including Accuracy, Paired and Unpaired Accuracy, Precision, Recall, Human-written Recall (HumanRec), LLM-generated Recall (LLMRec), Average Recall (AvgRec), the False Positive Rate (FPR), the True Negative Rate (TNR), the False Negative Rate (FNR), the F1 Score, and the Area Under the Receiver Operating Characteristic Curve (AUROC). Furthermore, we discuss the advantages and drawbacks associated with each metric to facilitate informed metric selection for varied research scenarios in subsequent studies.

The confusion matrix can help effectively evaluate the performance of the classification task and describes all possible results (four types in total) of the LLM-generated text detection task:

  • True Positive (TP) refers to an LLM-generated text (positive class) that the model correctly classifies as LLM-generated.

  • True Negative (TN) refers to a human-written text (negative class) that the model correctly classifies as human-written.

  • False Positive (FP) refers to a human-written text (negative class) that the model incorrectly classifies as LLM-generated.

  • False Negative (FN) refers to an LLM-generated text (positive class) that the model incorrectly classifies as human-written.

The evaluation metrics introduced below can all be described in terms of TP, TN, FP, and FN.
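The sketch below instantiates these quantities for the label convention used in this section (positive = LLM-generated); the toy label vectors are illustrative.

```python
# Illustrative sketch: count TP/TN/FP/FN, where label 1 = LLM-generated
# (positive class) and label 0 = human-written (negative class).
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0]   # gold labels
y_pred = [1, 0, 0, 1, 1, 0]   # detector predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```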

Accuracy

Accuracy serves as a general metric, denoting the ratio of correctly classified texts to the total text count. While suitable for balanced datasets, its utility diminishes on unbalanced ones due to its sensitivity to class imbalance. The metrics of Paired and Unpaired Accuracy have also been used in Zellers et al. (2019b); Zhong et al. (2020) to evaluate a detector’s ability in different scenarios. In the unpaired setting, the discriminator must independently classify each test sample as human-written or machine-generated. In the paired setting, the model is given two test samples with the same metadata, one real and one generated by a large model, and must assign a higher machine probability to the generated article than to the human-written one. These indicators measure an algorithm’s performance under different scenarios; generally, detection in the unpaired setting is harder than in the paired setting. Accuracy can be described by the following formula:

\[
\text{Accuracy} = \frac{\text{correctly detected samples}}{\text{all samples}} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}
\]
Precision

Precision measures the correctness of positive predictions: the proportion of correctly detected LLM-generated samples among all samples detected as LLM-generated. This metric is useful when false positives are a major concern. When a text that is not LLM-generated is classified as LLM-generated, the error may erode users’ trust in the model or even cause negative business impact; improving precision is therefore also important in the LLM-generated text detection task. The metric can be described by the following formula:

\[
\text{Precision} = \frac{\text{correctly detected LLM-generated samples}}{\text{all detected LLM-generated samples}} = \frac{TP}{TP + FP} \tag{5}
\]
Recall

Recall represents the proportion of actual machine-generated texts accurately identified as such. This metric is invaluable in contexts where underreporting must be minimized, as in instances requiring the capture of the majority of machine-generated texts. AvgRec, the mean recall across categories, is particularly useful for multi-category tasks requiring collective performance assessment across categories. HumanRec and LLMRec denote the proportions of texts accurately classified as human-written and machine-generated, respectively, shedding light on the model’s differential performance on these two classes. Recall, HumanRec, LLMRec, and AvgRec can be described by the following formulas respectively:

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{6}
\]
\[
\text{HumanRec} = \frac{\text{correctly detected human-written samples}}{\text{all human-written samples}} \tag{7}
\]
\[
\text{LLMRec} = \frac{\text{correctly detected LLM-generated samples}}{\text{all LLM-generated samples}} \tag{8}
\]
\[
\text{AvgRec} = \frac{\text{HumanRec} + \text{LLMRec}}{2} \tag{9}
\]
False Positive Rate (FPR)

The FPR refers to the proportion of all actual human-written samples that are incorrectly detected as LLM-generated. It measures how often the model raises false alarms on samples that are actually written by humans, which is especially important in applications where falsely flagging human writing is costly. The metric can be described by the following formula:

\[
\text{FPR} = \frac{\text{human-written samples incorrectly detected as LLM-generated}}{\text{all human-written samples}} = \frac{FP}{FP + TN} \tag{10}
\]
True Negative Rate (TNR)

The TNR refers to the proportion of all actual human-written samples that are correctly detected as human-written. It measures how accurately the model identifies human-written samples and is the complement of the FPR (TNR = 1 − FPR), in which human-written text is incorrectly detected as LLM-generated. The metric can be described by the following formula:

\[
\text{TNR} = \frac{\text{correctly detected human-written samples}}{\text{all human-written samples}} = \frac{TN}{TN + FP} \tag{11}
\]
False Negative Rate (FNR)

The FNR refers to the proportion of all actual LLM-generated samples that are incorrectly detected as human-written. It quantifies how often the model misses LLM-generated text. The metric can be described by the following formula:

\[
\text{FNR} = \frac{\text{LLM-generated samples incorrectly detected as human-written}}{\text{all LLM-generated samples}} = \frac{FN}{FN + TP} \tag{12}
\]
$F_1$ Score

The $F_1$ score is the harmonic mean of precision and recall, integrating considerations of false positives and false negatives. It is a prudent choice when a balance between precision and recall is imperative. The $F_1$ score can be calculated using the following formula:

\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \tag{13}
\]
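For reference, the sketch below turns a set of confusion-matrix counts into the metrics defined in this section; the counts are illustrative numbers rather than results from any cited work.

```python
# Illustrative computation of the metrics above from confusion-matrix counts.
tp, tn, fp, fn = 40, 45, 5, 10   # toy counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
llm_rec   = tp / (tp + fn)            # Recall / LLMRec
human_rec = tn / (tn + fp)            # HumanRec (equal to the TNR)
avg_rec   = (human_rec + llm_rec) / 2
fpr       = fp / (fp + tn)
fnr       = fn / (fn + tp)
f1        = 2 * precision * llm_rec / (precision + llm_rec)
```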
AUROC

The AUROC metric, derived from the Receiver Operating Characteristic curve, considers true and false positive rates at varying classification thresholds, making it well suited to evaluating classification efficacy independently of any single threshold. This is particularly crucial in scenarios requiring specific false positive and miss rates, especially for unbalanced datasets and binary classification tasks. Because the detection rate of zero-shot methods hinges significantly on the chosen threshold, the AUROC metric is commonly employed to evaluate their performance across all possible thresholds. AUROC is calculated as follows:

\[
\text{AUROC} = \int_{0}^{1} \frac{TP}{TP + FN} \, \mathrm{d}\!\left(\frac{FP}{FP + TN}\right) \tag{14}
\]
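The following hedged sketch shows how AUROC is typically computed in practice, by sweeping the decision threshold over detector scores and integrating the TPR against the FPR; the scores and labels are illustrative, and in real evaluations a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
# Illustrative AUROC computation via a threshold sweep over detector scores.
import numpy as np

def auroc(scores, labels):
    """scores: detector scores (higher = more likely LLM-generated); labels: 1 = LLM, 0 = human."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()

    tpr, fpr = [0.0], [0.0]
    for th in np.sort(np.unique(scores))[::-1]:     # sweep thresholds from high to low
        pred = scores >= th
        tpr.append((pred & (labels == 1)).sum() / pos)
        fpr.append((pred & (labels == 0)).sum() / neg)

    # Trapezoidal integration of TPR over FPR.
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))

print(auroc([0.9, 0.8, 0.6, 0.35, 0.1], [1, 1, 1, 0, 0]))   # perfectly separable scores -> 1.0
```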

7 Important Issues of LLM-generated Text Detection

In this section, we discuss the main issues and limitations of contemporary SOTA techniques designed for detecting text generated by LLMs. It is important to note that no technique has been acknowledged as infallible. The issues elucidated herein may pertain specifically to one or multiple classes of detectors.

7.1 Out of Distribution Challenges

Out-of-distribution issues significantly impede the efficacy of current techniques for detecting LLM-generated text. This section elucidates the sensitivity of these detectors to variations in domain, language, and generating model.

Cross-domain

The dilemma of cross-domain application is a ubiquitous challenge inherent to numerous NLP tasks. Studies conducted by Antoun et al. (2023a); Li et al. (2023c) underscored considerable limitations in the performance of sophisticated detectors, including but not limited to DetectGPT Mitchell et al. (2023), GLTR Gehrmann, Strobelt, and Rush (2019), and fine-tuned RoBERTa methods, when applied to cross-domain data. These detectors exhibit substantial performance degradation when confronted with the out-of-distribution data prevalent in real-world scenarios, with the efficacy of some classifiers only marginally surpassing random classification. This disparity between high reported performance and actual reliability underlines the need for critical evaluation and enhancement of existing methods.

Cross-lingual

The issue of cross-lingual application introduces a set of complex challenges that hinder the global applicability of existing detector research. Predominantly, contemporary detectors designed for LLM-generated text primarily target monolingual applications, often neglecting to evaluate and optimize performance across multiple languages. Wang et al. (2023b) and Chaka (2023) noted the inconsistent performance of multilingual LLM-generated text detectors across languages, despite evidence of some cross-lingual transfer capability. We emphatically draw attention to these cross-lingual challenges, as addressing them is pivotal for enhancing the usability and fairness of detectors for LLM-generated text. Moreover, recent research Liang et al. (2023a) revealed a discernible decline in the performance of state-of-the-art detectors when processing texts authored by non-native English speakers. Although employing effective prompt strategies can alleviate this bias, it also inadvertently allows the generated text to bypass the detectors. Consequently, there is a risk that detectors might inadvertently penalize writers who exhibit non-standard linguistic styles or employ limited expressions, thereby introducing issues of discrimination into the detection process.

Cross-LLM

Another significant out-of-distribution issue in LLM-generated text detection is the cross-LLM challenge. Current white-box detection approaches primarily rely on accessing the source model and comparing features such as log-likelihood; consequently, they may underperform on text generated by unknown LLMs. The results of DetectGPT Mitchell et al. (2023) highlight the vulnerability of white-box methods when dealing with unknown models, particularly powerful ones such as GPT-3.5-Turbo. However, recent findings from Fast-DetectGPT Bao et al. (2023) show that statistical comparisons with surrogate models can significantly mitigate this issue. Additionally, identifying the type of the generative model before applying white-box methods could be beneficial; the methodologies of Sniffer Li et al. (2023a), SeqXGPT Wang et al. (2023a), and LLMDet Wu et al. (2023) may provide useful insights here. On the other hand, methods based on neural classifiers, especially fine-tuned classifiers prone to overfitting their training data, may struggle to recognize types of LLMs not seen during training, so detectors may fail to identify newly emerging LLMs Pagnoni, Graciarena, and Tsvetkov (2022b). For instance, the OpenAI detector (openai-community/roberta-large-openai-detector, trained on texts generated by GPT-2) struggles to discern texts generated by GPT-3.5-Turbo and GPT-4, achieving an AUROC of only 74.74%, while it performs nearly perfectly on GPT-2-generated texts Bao et al. (2023). The results of Sarvazyan et al. (2023b) demonstrate that supervised LLM-generated text detectors generalize well across model scales but have limitations in generalizing across model families. Enhancing the cross-LLM robustness of neural classifiers is thus essential for the practical deployment of detectors. Nonetheless, classifiers fine-tuned on RoBERTa still possess strong transfer capabilities: with additional fine-tuning on just a few hundred samples, detectors can effectively generalize to texts generated by other models. Therefore, incorporating LLM-generated text from various sources into the training data, even in small quantities, could substantially improve the cross-LLM robustness of detectors in real-world applications.
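To make the surrogate-scoring idea concrete, the sketch below scores a text by its average token log-likelihood under a small open model and thresholds that score; GPT-2 is an illustrative surrogate and the threshold is an arbitrary assumption that would need calibration, so this is a simplified stand-in rather than the procedure of any cited detector.

```python
# Simplified surrogate scoring: average token log-likelihood under an open model,
# thresholded to produce a (white-box-style) detection decision.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_log_likelihood(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # cross-entropy over shifted tokens
    return -out.loss.item()            # higher = more "model-like" text

score = avg_log_likelihood("The committee approved the proposal after a brief discussion.")
is_llm_generated = score > -3.5        # illustrative threshold; must be calibrated in practice
```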

7.2 Potential Attacks

Potential attacks significantly contribute to the ongoing unreliability of current LLM-generated text detectors. We present the currently effective attacks to encourage researchers to focus on more comprehensive defensive measures.

Paraphrase Attacks

Paraphrase attacks are among the most effective attacks and can be fully effective against watermarking-based detectors as well as fine-tuned supervised detectors and zero-shot detectors (Sadasivan et al., 2023; Orenstrakh et al., 2023). The underlying principle is to apply a lightweight paraphrase model to the LLM’s output, changing the distribution of lexical and syntactic features of the text and thereby confusing the detector. Sadasivan et al. (2023) reported that Parrot (Damodaran, 2021), a T5-based paraphrase model, and DIPPER (Krishna et al., 2023), an 11B paraphrasing model that allows tuning of paraphrase diversity and the degree of content reordering, undermine the apparent superiority of existing detection methods. Although retrieval-based approaches have been shown to defend effectively against paraphrase attacks (Krishna et al., 2023), implementing such defenses requires ongoing maintenance by the language model API provider and remains susceptible to recursive paraphrasing attacks (Sadasivan et al., 2023).
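Schematically, a paraphrase attack simply rewrites the LLM output with a lightweight seq2seq paraphraser before it reaches the detector, as in the sketch below; the checkpoint name and the "paraphrase:" prefix are placeholders, and a real paraphraser such as Parrot or DIPPER would be substituted in practice.

```python
# Schematic paraphrase attack: rewrite LLM output with a seq2seq paraphraser.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

PARAPHRASER = "your-org/any-t5-paraphraser"   # placeholder; swap in a real paraphrase checkpoint
tokenizer = AutoTokenizer.from_pretrained(PARAPHRASER)
model = AutoModelForSeq2SeqLM.from_pretrained(PARAPHRASER)

def paraphrase(text, num_beams=5, max_new_tokens=128):
    # The "paraphrase:" task prefix is an assumption; the required prompt format
    # depends on how the chosen paraphraser was trained.
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt")
    output = model.generate(**inputs, num_beams=num_beams, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

evasive_text = paraphrase("LLM-generated text that the attacker wants to slip past a detector.")
```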

Adversarial Attacks

Ordinary LLM-generated texts are highly identifiable, yet adversarial perturbations, such as word substitution, can effectively reduce the accuracy of detectors Peng et al. (2024). We group attacks that operate on textual features under adversarial attacks, including cutoff (cropping a portion of the features or input) Shen et al. (2020), shuffle (randomly disrupting the word order of the input) Lee et al. (2020), mutation (character- and word-level mutation) Liang, Guerrero, and Alsmadi (2023), word swapping (substituting other suitable words given the context) Shi and Huang (2020); Ren et al. (2019); Crothers et al. (2022), and misspelling Gao et al. (2018a). There are also adversarial attack frameworks such as TextAttack Morris et al. (2020), which builds an attack from four components: an objective function, a set of constraints, a transformation, and a search method. Shi et al. (2023) and He et al. (2023) reported on the effectiveness of perturbation-based approaches in attacking detectors. Specifically, Shi et al. (2023) replaced words with context-appropriate synonyms, forming an effective attack on the fine-tuned classifier, watermarking Kirchenbauer et al. (2023a), and DetectGPT Mitchell et al. (2023), reducing detector performance by more than 18%, 10%, and 25%, respectively. He et al. (2023) employed probability-weighted word saliency Ren et al. (2019) to generate adversarial examples while better preserving semantic similarity.

Stiff and Johansson (2022) utilized the DeepWordBug Gao et al. (2018b) adversarial attack algorithm to introduce character-level perturbations to generated texts, including adjacent character swaps, character substitutions, deletions, and insertions, which more than halved the performance of the OpenAI large detector (openai-community/roberta-large-openai-detector). Wolff (2020) presented two types of black-box attacks against these detectors: random substitution of characters with visually similar homoglyphs and intentional misspelling of words. These attacks drastically reduced the recall of popular neural text detectors from 97.44% to 0.26% and 22.68%, respectively. Moreover, Bhat and Parthasarathy (2020) showed that detectors are more sensitive to syntactic perturbations, including breaking longer sentences, removing definite articles, applying semantics-preserving rule conversions (such as changing “that’s” to “that is”), and reformatting paragraphs of machine-generated text.
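The self-contained sketch below illustrates the two character-level black-box attacks just described, homoglyph substitution and injected misspellings; the character mappings and perturbation rates are small illustrative choices, not those of the cited studies.

```python
# Illustrative character-level attacks: homoglyph substitution and misspellings.
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}  # Cyrillic look-alikes

def homoglyph_attack(text, rate=0.1):
    return "".join(HOMOGLYPHS[c] if c in HOMOGLYPHS and random.random() < rate else c
                   for c in text)

def misspelling_attack(text, rate=0.1):
    words = text.split()
    for idx, w in enumerate(words):
        if len(w) > 3 and random.random() < rate:
            i = random.randrange(len(w) - 1)
            words[idx] = w[:i] + w[i + 1] + w[i] + w[i + 2:]   # swap two adjacent characters
    return " ".join(words)

perturbed = misspelling_attack(homoglyph_attack("Text produced by a language model."))
```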

Although existing detection methods are highly sensitive to adversarial attacks, different types of detectors exhibit varying degrees of resilience. Antoun et al. (2023b) reported that supervised approaches offer an effective defense: training on adversarial samples can significantly improve a detector’s ability to recognize texts manipulated by such attacks. Additionally, Kulkarni et al. (2023) explored the impact of semantic perturbations on the Grover detector, finding that synonym substitution, fake-fake replacement, insertion instead of substitution, and changes in the position of substitution had no effect on Grover’s detection capabilities, whereas adversarial embedding techniques can effectively deceive Grover into classifying false articles as genuine. In general, such attacks significantly degrade the performance of fine-tuned classifiers, although the distributional signature of an attack can itself be learned by the classifier to form a strong defense.

Prompt Attacks

Prompt attacks pose a significant challenge for current LLM-generated text detection techniques. The quality of LLM-generated text is tied to the complexity of the prompts that instruct LLMs to generate it. As model and corpus sizes increase, LLMs develop strong ICL capabilities that support more complex text generation. Numerous efficient prompting methods have been developed, including few-shot prompting (Brown et al., 2020), prompt combination (Zhao et al., 2021), Chain of Thought (CoT) (Wei et al., 2022), and zero-shot CoT (Kojima et al., 2022), all of which significantly enhance the quality and capabilities of LLM outputs. Existing work on LLM-generated text detectors primarily uses datasets created with simple, direct prompts. For instance, the study by Guo et al. (2023) demonstrates that detectors may struggle to identify text generated with complex prompts, and Liu et al. (2023d) reported a noticeable decrease in the detection ability of a fine-tuned language model detector when faced with varied prompts, indicating that different prompts produce large differences in the detection performance of existing detectors Koike, Kaneko, and Okazaki (2023a).

The Substitution-based Contextual Example Optimisation method, as proposed by Lu et al. (2023), employs sophisticated prompts to bypass the defenses of current detection systems. This leads to an appreciable reduction in the Area Under the Curve (AUC), averaging a decrease of 0.54, and achieves a higher success rate with better text quality compared to paraphrase attacks. It is worth mentioning that both paraphrase attacks and adversarial attacks mentioned above could be executed through careful prompt design Shi et al. (2023); Koike, Kaneko, and Okazaki (2023b). With ongoing research in prompt engineering, the risk posed by prompt attacks is expected to escalate further. This underscores the need for developing more robust detection methods that can effectively counteract such evolving threats.

Training Threat Models

Further training of language models has been preliminarily shown to be an effective attack on existing detectors. Nicks et al. (2023) used the “humanity” scores of various open-source and commercial detectors as a reward function for reinforcement learning, fine-tuning language models to confound existing detectors. Without significantly altering the model, brief further fine-tuning of Llama-2-7B reduced the AUROC of the OpenAI RoBERTa-Large detector from 0.84 to 0.62. A similar idea is demonstrated by Schneider et al. (2023): refining generative models with reinforcement learning can circumvent BERT-based classifiers, driving detection accuracy as low as 0.15 AUROC, even when linguistic features are used as the reward function. Kumarage et al. (2023b) proposed a universal evasion framework named EScaPe that guides PLMs to generate “human-like” text that can mislead detectors; through evasive soft prompt learning and transfer, the performance of DetectGPT and the OpenAI detector can be reduced by up to 40% AUROC. The results of Henrique, Kucharavy, and Guerraoui (2023) reveal another potential vulnerability: if a generative model can access the human-written text used to train a detector and fine-tune on it, that detector can no longer detect text from this generative model. This indicates that LLMs trained on larger human-written corpora will be more robust against existing detectors, and that training against a specific detector gives an LLM a sharp spear with which to breach its defenses.

7.3 Real-World Data Issues

Detection for Not Purely LLM-generated Text

In practice, there are many texts that are not purely generated by LLMs, and they may even contain a mix of human-written text. Specifically, this can be categorized as either data-mixed text or human-edited text. Data-mixed text refers to the sentence or paragraph level mixture of human-written text and LLM-generated text. For instance, in a document, some sentences may be generated by LLMs, while others are written by humans. In such cases, identifying the category of the document becomes challenging. Data-mixed text necessitates more fine-grained detection methods, such as sentence-level detection, to effectively address this challenge. However, current LLM-generated text detectors struggle to perform effectively with short texts. Recent research, such as that by Wang et al. (2023a), indicates that sentence-level detection appears to be feasible. Furthermore, we are very pleased to observe that studies have started to propose and attempt to solve this issue. Zeng et al. (2023) proposed a two-step method to effectively identify a mix of human-written and LLM-generated text. This method first uses contrastive learning to distinguish between content generated by LLMs and human-written content. It then calculates the similarity between adjacent prototypes, assuming that a boundary exists between the least similar adjacent prototypes.

Another issue that has not been fully discussed is human-edited text. For example, after applying an LLM to generate a text, humans often edit and modify certain words or passages. Detecting such text poses a significant challenge that we must confront, as it is prevalent in real-world applications; there is therefore an urgent need to organize relevant datasets and define tasks to address this issue. One potential approach is informed by experimental results from paraphrasing and adversarial perturbation attacks, which effectively simulate how individuals might use LLMs to refine text or make word substitutions. However, current mainstream detectors tend to degrade in performance when dealing with paraphrased text Wolff (2020), although certain black-box detectors display relatively good robustness. Another potential solution could involve breaking the detection task down to the word level, but as of now, no research directly addresses this.

Data Ambiguity

Data ambiguity remains a challenge in the field of LLM-generated text detection, with close ties to the inherent mechanics of the detection technology itself. The pervasive deployment of LLMs across various domains exacerbates this issue, rendering it increasingly challenging to discern whether training data comprises human-written or LLM-generated text. Utilizing LLM-generated text as training data under the misapprehension that it is human-written inadvertently instigates a detrimental cycle. Within this cycle, detectors, consequently trained, demonstrate diminished efficacy in distinguishing between human-written and LLM-generated text, thereby undermining the foundational premises of detector research. It is imperative to acknowledge that this quandary poses a significant, pervasive threat to all facets of detection research, yet, to our knowledge, no existing studies formally address this concern. An additional potential risk was articulated by Alemohammad et al. (2023), who posited that data ambiguity might precipitate the recycling of LLM-generated data in the training processes of subsequent models. This scenario could adversely impact the text generation quality of these emergent LLMs, thereby destabilizing the research landscape dedicated to the detection of LLM-generated text.

7.4 Impact of Model Size on Detectors

Many researchers are concerned about the impact of model size on detectors, which can be viewed from two perspectives: the size of the generative model and the size of the supervised classifier. The size of the generative model is closely related to the quality of the generated text: generally, texts generated by smaller models are easier to recognize, while those generated by larger models pose a greater challenge for detection. A related concern is how texts generated by models of different sizes affect detectors when used as training samples. Pu et al. (2023b) report that detectors trained on data generated by medium-sized LLMs can generalize zero-shot to larger versions, while training samples generated by overly large or small models may reduce the detectors’ generalization ability. Antoun, Sagot, and Seddah (2023) further explore the apparent negative correlation between classifier effectiveness and the size of the generative model: text generated by larger LLMs is more difficult to detect, especially when the classifier is trained on data generated by smaller LLMs, and aligning the distribution of generative models across the training and test sets improves detector performance. From the perspective of the supervised classifier’s size, the detection capability of detectors is directly proportional to the size of the fine-tuned LMs Guo et al. (2023). However, recent findings suggest that while larger detectors perform better on test sets drawn from the same distribution as the training set, their generalization ability is somewhat diminished.

7.5 Lack of Effective Evaluation Framework

A widespread phenomenon is that many studies claim their detectors exhibit impressive and robust performance. However, in practical experiments, these methods often perform less than satisfactorily on the test sets created by other researchers. This variance is due to researchers using different strategies to construct their test sets. Variables such as the parameters used to generate the test set, the computational environment, text distribution, and text processing strategies, including truncation, can all influence the effectiveness of detectors. Due to these factors’ complex nature, the reproducibility of evaluation results is often compromised, even when researchers adhere to identical dataset production protocols. We elaborate on the limitations of existing benchmarks in section 4, where we advocate for the creation of a high-quality and comprehensive evaluation framework. We encourage future research to actively implement these frameworks to maintain consistency in testing standards. Furthermore, we call upon researchers focusing on specific issues to openly share their test sets, emphasizing the strong adaptability of current evaluation frameworks to integrate them. In conclusion, setting an objective and fair benchmark for detector comparison is essential to propel research in detecting LLM-generated text forward, rather than persisting in siloed efforts.

8 Future Research Directions

In this section, we explore potential directions for future research aimed at better construction of more efficient and realistically effective detectors.

8.1 Building Robust Detectors with Attacks

The attack methods introduced in subsection 7.2 encompass paraphrase attacks (Sadasivan et al., 2023), adversarial attacks (He et al., 2023), and prompt attacks (Lu et al., 2023). These methods underscore the primary challenges impeding the utility of current detectors. While recent research, such as Yang, Jiang, and Li (2023), has addressed robustness against specific attacks, it often neglects potential threats posed by other attack forms. Consequently, it is imperative to develop and validate diverse attack types, thereby gaining insight into vulnerabilities inherent to LLM-generated text detectors. We further advocate for the establishment of comprehensive benchmarks to assess existing detection strategies. Although some studies (He et al., 2023; Wang et al., 2023b) purport to provide such benchmarks, the scope and diversity of the validated attacks remain limited.

8.2 Enhancing the Efficacy of Zero-shot Detectors

Zero-shot methods stand out as notably stable detectors Deng et al. (2023). Crucially, they offer enhanced controllability and interpretability for users Mitrović, Andreoletti, and Ayoub (2023). Recent research (Giorgi et al., 2023; Liao et al., 2023b) has elucidated distinct disparities between LLM-generated text and human-written text, underscoring a tangible and discernible gap between the two. This revelation has invigorated research in the domain of LLM-generated text detection. We advocate for a proliferation of studies that delve into the nuanced distinctions between LLM-generated texts and human-written text, spanning from low-dimensional to high-dimensional features. Unearthing metrics that more accurately distinguish the two can bolster the evolution of automatic detectors and furnish more compelling justifications for decision-making processes. We have observed that the latest emerging black-box zero-shot methods Yang et al. (2023b); Mao et al. (2024); Zhu et al. (2023); Quidwai, Li, and Dube (2023); Guo and Yu (2023) demonstrate enhanced stability and application potential compared to white-box based zero-shot methods by extracting discriminative metrics that are independent of white-box models. These methods do not rely on an understanding of the model’s internal workings, thereby offering broader applicability across various models and environments.

8.3 Optimizing Detectors for Low-resource Environments

Many contemporary detection techniques tend to overlook the challenges faced by resource-constrained settings, neglecting the need for resources in developing the detector. The relative efficacy of various detectors across different data volume settings remains inadequately explored. Concurrently, determining the minimal resource prerequisites for different detection methods to yield satisfactory results is imperative. Beyond examining the model’s adaptability across distinct domains Rodriguez et al. (2022a) and languages Wang et al. (2023b), we advocate for investigating the defensive adaptability against varied attack strategies. Such exploration can guide users in selecting the most beneficial approach to establish a dependable detector under resource constraints.

8.4 Detection for Not Purely LLM-Generated Text

In subsection 7.3, we highlight a significant challenge encountered in real-world scenarios: the detection of text that is not purely produced by LLMs. We examine this issue by separately discussing texts that mix data sources and texts that have been edited by humans, review the latest related work, and propose potential solutions that remain to be verified. We emphasize that organizing relevant datasets and defining tasks to address this issue is an urgent need, because this type of text may fundamentally be the most commonly encountered in detector applications.

8.5 Constructing Detectors Amidst Data Ambiguity

A significant challenge that arises is verifying the authenticity of training data. When aggregating textual data from sources such as blogs and web comments, there is a potential risk of inadvertently including a substantial amount of LLM-generated text. This incorporation can fundamentally compromise the integrity of detector research, perpetuating a detrimental feedback loop. We urge forthcoming detection studies to prioritize the authenticity assessment of real-world data, anticipating this as a pressing challenge in the future.

8.6 Developing Effective Evaluation Frameworks Aligned with Real-World Scenarios

In subsection 7.5, we analyze the objective differences between evaluation environments and real-world settings, which limit the effectiveness of existing detectors in practice. On one hand, the construction of test sets in many works may be biased, often favoring the detectors built by their creators; on the other hand, current benchmarks frequently reflect idealized scenarios far removed from real-world applications. We call on researchers to develop a fair and effective evaluation framework closely tied to the practical needs of LLM-generated text detection, for instance by accounting for the requirements of the application domain, the black-box nature of LLM-generated texts, and the various attacks and post-editing strategies that texts may encounter. We believe such an evaluation framework will promote the research and development of detectors that are more practical and aligned with real-world scenarios.

8.7 Constructing Detectors with Misinformation Discrimination Capabilities

Contemporary detection methodologies have largely overlooked the capacity to discern misinformation. Existing detectors primarily emphasize the distribution of features within text generated by LLMs, while often overlooking their potential for factual verification. A proficient detector should possess the capability to discern the veracity or falsity of factual claims presented in text. In the initial stages of generative modeling’s emergence, when it had yet to pose significant societal challenges, the emphasis was on assessing the truth or falsity of the content in LLM-generated text, with less regard for its source Schuster et al. (2020b). Constructing detectors with misinformation discrimination capabilities can aid in more accurately attributing the source of text, rather than relying solely on distributional features, and subsequently contribute to mitigating the proliferation of misinformation. Recent studies Gao et al. (2023); Chern et al. (2023) highlight the potential of LLMs to detect factual content in texts. We recommend bolstering such endeavors through integration with external knowledge bases (Asai et al., 2023) or search engines Liang et al. (2023b).

9 Conclusion

With the widespread development and application of LLMs, the pervasive presence of LLM-generated text in our daily lives has transitioned from expectation to reality. LLM-generated text detectors play a pivotal role in distinguishing between human-written and LLM-generated text, serving as a crucial defense against the misuse of LLMs for generating deceptive news, engaging in scams, or exacerbating issues such as educational inequality. In this survey, we introduce the task of LLM-generated text detection, outline the sources contributing to enhanced LLM-generated text capabilities, and highlight the escalating demand for efficient detectors. We also list datasets that are popular or promising, pointing out the challenges and requirements associated with existing detectors. In addition, we shed light on the critical limitations of contemporary detectors, including issues related to out-of-distribution data, potential attacks, real-world data issues, and the lack of an effective evaluation framework, to direct researchers’ attention to the focal points of the field, thereby sparking innovative ideas and approaches. Finally, we propose potential future research directions that are poised to guide the development of more powerful and effective detection systems.

Acknowledgements.
This work was supported in part by the Major Program of the State Commission of Science Technology of China (Grant No. 2020AAA0106701), the Science and Technology Development Fund, Macau SAR (Grant Nos. FDCT/0070/2022/AMJ, FDCT/060/2022/AFJ) and the Multi-year Research Grant from the University of Macau (Grant No. MYRG-GRG2023-00006-FST-UMDF).
\starttwocolumn

References

  • Abdelnabi and Fritz (2021) Abdelnabi, Sahar and Mario Fritz. 2021. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 121–140, IEEE.
  • Aich, Bhattacharya, and Parde (2022) Aich, Ankit, Souvik Bhattacharya, and Natalie Parde. 2022. Demystifying neural fake news via linguistic feature-based interpretation. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 6586–6599, International Committee on Computational Linguistics.
  • Alemohammad et al. (2023) Alemohammad, Sina, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. 2023. Self-consuming generative models go MAD. CoRR, abs/2307.01850.
  • Aliman and Kester (2021) Aliman, Nadisha Marie and Leon Kester. 2021. Epistemic defenses against scientific and empirical adversarial ai attacks. In CEUR Workshop Proceedings, volume 2916, CEUR WS.
  • Anthropic (2023) Anthropic. 2023. Model card and evaluations for claude models.
  • Antoun et al. (2023a) Antoun, Wissam, Virginie Mouilleron, Benoît Sagot, and Djamé Seddah. 2023a. Towards a robust detection of language model generated text: Is chatgpt that easy to detect? CoRR, abs/2306.05871.
  • Antoun et al. (2023b) Antoun, Wissam, Virginie Mouilleron, Benoît Sagot, and Djamé Seddah. 2023b. Towards a robust detection of language model generated text: Is chatgpt that easy to detect? CoRR, abs/2306.05871.
  • Antoun, Sagot, and Seddah (2023) Antoun, Wissam, Benoît Sagot, and Djamé Seddah. 2023. From text to source: Results in detecting large language model-generated content. CoRR, abs/2309.13322.
  • Arase and Zhou (2013) Arase, Yuki and Ming Zhou. 2013. Machine translation detection from monolingual web-text. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1597–1607, Association for Computational Linguistics.
  • Asai et al. (2023) Asai, Akari, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46.
  • Asghar (2016) Asghar, Nabiha. 2016. Yelp dataset challenge: Review rating prediction. ArXiv preprint, abs/1605.05362.
  • Ayoobi, Shahriar, and Mukherjee (2023) Ayoobi, Navid, Sadat Shahriar, and Arjun Mukherjee. 2023. The looming threat of fake and llm-generated linkedin profiles: Challenges and opportunities for detection and prevention. In Proceedings of the 34th ACM Conference on Hypertext and Social Media, pages 1–10.
  • Baayen (2001) Baayen, R Harald. 2001. Word frequency distributions, volume 18. Springer Science & Business Media.
  • Baccianella, Esuli, and Sebastiani (2010) Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani. 2010. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta, European Language Resources Association.
  • Bakhtin et al. (2019) Bakhtin, Anton, Sam Gross, Myle Ott, Yuntian Deng, Marc’Aurelio Ranzato, and Arthur Szlam. 2019. Real or fake? learning to discriminate machine from human generated text. CoRR, abs/1906.03351.
  • Bao et al. (2023) Bao, Guangsheng, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2023. Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature. CoRR, abs/2310.05130.
  • Barbara Kitchenham (2007) Barbara Kitchenham, Stuart Charters. 2007. Guidelines for performing systematic literature reviews in software engineering.
  • Becker et al. (2023) Becker, Brett A, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. 2023. Programming is hard-or at least it used to be: Educational opportunities and challenges of ai code generation. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, pages 500–506.
  • Beresneva (2016) Beresneva, Daria. 2016. Computer-generated text detection using machine learning: A systematic review. In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings 21, pages 421–426, Springer.
  • Besta et al. (2023) Besta, Maciej, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. 2023. Graph of thoughts: Solving elaborate problems with large language models. ArXiv preprint, abs/2308.09687.
  • Bhat and Parthasarathy (2020) Bhat, Meghana Moorthy and Srinivasan Parthasarathy. 2020. How effectively can machines defend against machine-generated fake news? an empirical study. In Proceedings of the First Workshop on Insights from Negative Results in NLP, Insights 2020, Online, November 19, 2020, pages 48–53.
  • Bhattacharjee et al. (2023) Bhattacharjee, Amrita, Tharindu Kumarage, Raha Moraffah, and Huan Liu. 2023. Conda: Contrastive domain adaptation for ai-generated text detection. CoRR, abs/2309.03992.
  • Bhattacharjee and Liu (2023) Bhattacharjee, Amrita and Huan Liu. 2023. Fighting fire with fire: Can chatgpt detect ai-generated text? ArXiv preprint, abs/2308.01284.
  • Blanchard et al. (2013) Blanchard, Daniel, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. Toefl11: A corpus of non-native english. ETS Research Report Series, 2013(2):i–15.
  • Brown et al. (2020) Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Cardenuto et al. (2023) Cardenuto, João Phillipe, Jing Yang, Rafael Padilha, Renjie Wan, Daniel Moreira, Haoliang Li, Shiqi Wang, Fernanda A. Andaló, Sébastien Marcel, and Anderson Rocha. 2023. The age of synthetic realities: Challenges and opportunities. CoRR, abs/2306.11503.
  • Chaka (2023) Chaka, Chaka. 2023. Detecting ai content in responses generated by chatgpt, youchat, and chatsonic: The case of five ai content detection tools. Journal of Applied Learning and Teaching, 6(2).
  • Chakraborty et al. (2023a) Chakraborty, Megha, S. M. Towhidul Islam Tonmoy, S. M. Mehedi Zaman, Shreya Gautam, Tanay Kumar, Krish Sharma, Niyar R. Barman, Chandan Gupta, Vinija Jain, Aman Chadha, Amit P. Sheth, and Amitava Das. 2023a. Counter turing test (CT2): ai-generated text detection is not as easy as you may think - introducing AI detectability index (ADI). In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 2206–2239, Association for Computational Linguistics.
  • Chakraborty et al. (2023b) Chakraborty, Souradip, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. 2023b. On the possibilities of ai-generated text detection. CoRR, abs/2304.04736.
  • Chen et al. (2022) Chen, Qianben, Richong Zhang, Yaowei Zheng, and Yongyi Mao. 2022. Dual contrastive learning: Text classification via label-aware data augmentation. ArXiv preprint, abs/2201.08702.
  • Chen et al. (2023a) Chen, Yutian, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha Raj. 2023a. Token prediction as implicit classification to identify llm-generated text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13112–13120, Association for Computational Linguistics.
  • Chen et al. (2023b) Chen, Yutian, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha Ramakrishnan. 2023b. Gpt-sentinel: Distinguishing human and chatgpt generated content. ArXiv preprint, abs/2305.07969.
  • Chern et al. (2023) Chern, I, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. 2023. Factool: Factuality detection in generative ai–a tool augmented framework for multi-task and multi-domain scenarios. ArXiv preprint, abs/2307.13528.
  • Chowdhery et al. (2022a) Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022a. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311.
  • Chowdhery et al. (2022b) Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022b. Palm: Scaling language modeling with pathways. ArXiv preprint, abs/2204.02311.
  • Christian (2023) Christian, Jon. 2023. Cnet secretly used ai on articles that didn’t disclose that fact, staff say. Futurism, January.
  • Clark et al. (2021a) Clark, Elizabeth, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021a. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Association for Computational Linguistics.
  • Clark et al. (2021b) Clark, Elizabeth, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021b. All that’s ’human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 7282–7296, Association for Computational Linguistics.
  • Conneau et al. (2020) Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451, Association for Computational Linguistics.
  • Corizzo and Leal-Arenas (2023) Corizzo, Roberto and Sebastian Leal-Arenas. 2023. A deep fusion model for human vs. machine-generated essay classification. In International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, June 18-23, 2023, pages 1–10, IEEE.
  • Corston-Oliver, Gamon, and Brockett (2001) Corston-Oliver, Simon, Michael Gamon, and Chris Brockett. 2001. A machine learning approach to the automatic evaluation of machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 148–155, Association for Computational Linguistics.
  • Cowap, Graham, and Foster (2023) Cowap, Alan, Yvette Graham, and Jennifer Foster. 2023. Do stochastic parrots have feelings too? improving neural detection of synthetic text via emotion recognition. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 9928–9946, Association for Computational Linguistics.
  • Crothers, Japkowicz, and Viktor (2023a) Crothers, Evan, Nathalie Japkowicz, and Herna L Viktor. 2023a. Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access.
  • Crothers, Japkowicz, and Viktor (2023b) Crothers, Evan, Nathalie Japkowicz, and Herna L. Viktor. 2023b. Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access, 11:70977–71002.
  • Crothers et al. (2022) Crothers, Evan, Nathalie Japkowicz, Herna L. Viktor, and Paula Branco. 2022. Adversarial robustness of neural-statistical features in detection of generative transformers. In International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, July 18-23, 2022, pages 1–8, IEEE.
  • Cui et al. (2023) Cui, Jiaxi, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. ArXiv preprint, abs/2306.16092.
  • Dai et al. (2023) Dai, Damai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
  • Damodaran (2021) Damodaran, Prithiviraj. 2021. Parrot: Paraphrase generation for nlu.
  • Deng et al. (2023) Deng, Zhijie, Hongcheng Gao, Yibo Miao, and Hao Zhang. 2023. Efficient detection of llm-generated texts with a bayesian surrogate model. ArXiv preprint, abs/2305.16617.
  • Desaire et al. (2023) Desaire, Heather, Aleesa E. Chua, Madeline Isom, Romana Jarosova, and David Hua. 2023. Chatgpt or academic scientist? distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools. CoRR, abs/2303.16352.
  • Devlin et al. (2019a) Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Association for Computational Linguistics.
  • Devlin et al. (2019b) Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Association for Computational Linguistics.
  • Devlin et al. (2019c) Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019c. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Association for Computational Linguistics.
  • Dhaini, Poelman, and Erdogan (2023) Dhaini, Mahdi, Wessel Poelman, and Ege Erdogan. 2023. Detecting chatgpt: A survey of the state of detecting chatgpt-generated text. CoRR, abs/2309.07689.
  • Dong et al. (2023) Dong, Qingxiu, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2023. A survey for in-context learning. ArXiv preprint, abs/2301.00234.
  • Dou et al. (2022) Dou, Yao, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, and Yejin Choi. 2022. Is GPT-3 text indistinguishable from human text? scarecrow: A framework for scrutinizing machine text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7250–7274, Association for Computational Linguistics.
  • Dugan et al. (2020) Dugan, Liam, Daphne Ippolito, Arun Kirubarajan, and Chris Callison-Burch. 2020. RoFT: A tool for evaluating human detection of machine-generated text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 189–196, Association for Computational Linguistics.
  • Dugan et al. (2023) Dugan, Liam, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris Callison-Burch. 2023. Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 12763–12771, AAAI Press.
  • Epstein et al. (2023) Epstein, Ziv, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. 2023. Art and the science of generative ai. Science, 380(6650):1110–1111.
  • Fagni et al. (2021) Fagni, Tiziano, Fabrizio Falchi, Margherita Gambini, Antonio Martella, and Maurizio Tesconi. 2021. TweepFake: About detecting deepfake tweets. PLOS ONE, 16(5):e0251415.
  • Fan et al. (2019) Fan, Angela, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Association for Computational Linguistics.
  • Fan, Lewis, and Dauphin (2018) Fan, Angela, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Association for Computational Linguistics.
  • Fellbaum (1998) Fellbaum, Christiane. 1998. WordNet: An electronic lexical database. MIT press.
  • Gade et al. (2020) Gade, Krishna, Sahin Geyik, Krishnaram Kenthapadi, Varun Mithal, and Ankur Taly. 2020. Explainable ai in industry: Practical challenges and lessons learned. In Companion Proceedings of the Web Conference 2020, pages 303–304.
  • Gallé et al. (2021) Gallé, Matthias, Jos Rozen, Germán Kruszewski, and Hady Elsahar. 2021. Unsupervised and distributional detection of machine-generated text. CoRR, abs/2111.02878.
  • Gambini et al. (2022) Gambini, Margherita, Tiziano Fagni, Fabrizio Falchi, and Maurizio Tesconi. 2022. On pushing deepfake tweet detection capabilities to the limits. In Proceedings of the 14th ACM Web Science Conference 2022, pages 154–163.
  • Gao et al. (2018a) Gao, Ji, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018a. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56, IEEE.
  • Gao et al. (2018b) Gao, Ji, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018b. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops, SP Workshops 2018, San Francisco, CA, USA, May 24, 2018, pages 50–56.
  • Gao et al. (2023) Gao, Luyu, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. 2023. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508.
  • Gao, Yao, and Chen (2021) Gao, Tianyu, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Association for Computational Linguistics.
  • Gehrmann, Strobelt, and Rush (2019) Gehrmann, Sebastian, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, Association for Computational Linguistics.
  • Ghosal et al. (2023) Ghosal, Soumya Suvra, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, and Amrit Bedi. 2023. A survey on the possibilities & impossibilities of ai-generated text detection. Transactions on Machine Learning Research.
  • Giorgi et al. (2023) Giorgi, Salvatore, David M. Markowitz, Nikita Soni, Vasudha Varadarajan, Siddharth Mangalik, and H. Andrew Schwartz. 2023. "i slept like a baby": Using human traits to characterize deceptive chatgpt and human text. In Proceedings of the IACT - The 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval held in conjunction with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), Taipei, Taiwan, July 27, 2023, volume 3477 of CEUR Workshop Proceedings, pages 23–37, CEUR-WS.org.
  • Giorgi, Ungar, and Schwartz (2021) Giorgi, Salvatore, Lyle Ungar, and H. Andrew Schwartz. 2021. Characterizing social spambots by their human traits. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5148–5158, Association for Computational Linguistics.
  • Goodfellow et al. (2020) Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Communications of the ACM, 63(11):139–144.
  • Graves (2012) Graves, Alex. 2012. Sequence transduction with recurrent neural networks. ArXiv preprint, abs/1211.3711.
  • Gu et al. (2022) Gu, Chenxi, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh. 2022. Watermarking pre-trained language models with backdooring. ArXiv preprint, abs/2210.07543.
  • Guo et al. (2023) Guo, Biyang, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. ArXiv preprint, abs/2301.07597.
  • Guo et al. (2020) Guo, Mandy, Zihang Dai, Denny Vrandečić, and Rami Al-Rfou. 2020. Wiki-40B: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, European Language Resources Association.
  • Guo and Yu (2023) Guo, Zhen and Shangdi Yu. 2023. Authentigpt: Detecting machine-generated text via black-box language models denoising. CoRR, abs/2311.07700.
  • Hamed and Wu (2023) Hamed, Ahmed Abdeen and Xindong Wu. 2023. Improving detection of chatgpt-generated fake science using real publication text: Introducing xfakebibs a supervised-learning network algorithm. CoRR, abs/2308.11767.
  • Hanley and Durumeric (2023) Hanley, Hans W. A. and Zakir Durumeric. 2023. Machine-made media: Monitoring the mobilization of machine-generated articles on misinformation and mainstream news websites. CoRR, abs/2305.09820.
  • He et al. (2023) He, Xinlei, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. 2023. Mgtbench: Benchmarking machine-generated text detection. ArXiv preprint, abs/2303.14822.
  • Helm, Priebe, and Yang (2023) Helm, Hayden S., Carey E. Priebe, and Weiwei Yang. 2023. A statistical turing test for generative models. CoRR, abs/2309.08913.
  • Henrique, Kucharavy, and Guerraoui (2023) Henrique, Da Silva Gameiro, Andrei Kucharavy, and Rachid Guerraoui. 2023. Stochastic parrots looking for stochastic parrots: Llms are easy to fine-tune and hard to detect with other llms. CoRR, abs/2304.08968.
  • Hill et al. (2016) Hill, Felix, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children’s books with explicit memory representations. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Holtzman et al. (2020) Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net.
  • Horne and Adali (2017) Horne, Benjamin and Sibel Adali. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the international AAAI conference on web and social media, volume 11, pages 759–766.
  • Hou et al. (2023) Hou, Abe Bohan, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. 2023. Semstamp: A semantic watermark with paraphrastic robustness for text generation. CoRR, abs/2310.03991.
  • Hu, Chen, and Ho (2023) Hu, Xiaomeng, Pin-Yu Chen, and Tsung-Yi Ho. 2023. Radar: Robust ai-text detection via adversarial learning. ArXiv preprint, abs/2307.03838.
  • Ibrahim et al. (2023) Ibrahim, Hazem, Fengyuan Liu, Rohail Asim, Balaraju Battu, Sidahmed Benabderrahmane, Bashar Alhafni, Wifag Adnan, Tuka Alhanai, Bedoor K. AlShebli, Riyadh Baghdadi, Jocelyn J. Bélanger, Elena Beretta, Kemal Celik, Moumena Chaqfeh, Mohammed F. Daqaq, Zaynab El Bernoussi, Daryl Fougnie, Borja Garcia de Soto, Alberto Gandolfi, András György, Nizar Habash, J. Andrew Harris, Aaron Kaufman, Lefteris Kirousis, Korhan Kocak, Kangsan Lee, Seungah S. Lee, Samreen Malik, Michail Maniatakos, David Melcher, Azzam Mourad, Minsu Park, Mahmoud Rasras, Alicja Reuben, Dania Zantout, Nancy W. Gleason, Kinga Makovi, Talal Rahwan, and Yasir Zaki. 2023. Perception, performance, and detectability of conversational artificial intelligence across 32 university courses. CoRR, abs/2305.13934.
  • Ippolito et al. (2020) Ippolito, Daphne, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822, Association for Computational Linguistics.
  • Jawahar, Abdul-Mageed, and Lakshmanan (2020) Jawahar, Ganesh, Muhammad Abdul-Mageed, and Laks Lakshmanan, V.S. 2020. Automatic detection of machine generated text: A critical survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2296–2309, International Committee on Computational Linguistics.
  • Ji et al. (2023) Ji, Ziwei, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Jiao et al. (2020) Jiao, Xiaoqi, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. Tinybert: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4163–4174, Association for Computational Linguistics.
  • Jin et al. (2019) Jin, Qiao, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Association for Computational Linguistics.
  • Jin et al. (2021) Jin, Zhuoran, Yubo Chen, Dianbo Sui, Chenhao Wang, Zhipeng Xue, and Jun Zhao. 2021. Cogie: An information extraction toolkit for bridging texts and cognet. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL 2021 - System Demonstrations, Online, August 1-6, 2021, pages 92–98, Association for Computational Linguistics.
  • Kalinichenko et al. (2003) Kalinichenko, Leonid A, Vladimir V Korenkov, Vladislav P Shirikov, Alexey N Sissakian, and Oleg V Sunturenko. 2003. Digital libraries: Advanced methods and technologies, digital collections. D-Lib Magazine, 9(1):1082–9873.
  • Kang et al. (2018) Kang, Dongyeop, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz. 2018. A dataset of peer reviews (PeerRead): Collection, insights and NLP applications. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1647–1661, Association for Computational Linguistics.
  • Kasneci et al. (2023) Kasneci, Enkelejda, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.
  • Kirchenbauer et al. (2023a) Kirchenbauer, John, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023a. A watermark for large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084, PMLR.
  • Kirchenbauer et al. (2023b) Kirchenbauer, John, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. 2023b. On the reliability of watermarks for large language models. CoRR, abs/2306.04634.
  • Kočiský et al. (2018) Kočiský, Tomáš, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
  • Koike, Kaneko, and Okazaki (2023a) Koike, Ryuto, Masahiro Kaneko, and Naoaki Okazaki. 2023a. How you prompt matters! even task-oriented constraints in instructions affect llm-generated text detection. CoRR, abs/2311.08369.
  • Koike, Kaneko, and Okazaki (2023b) Koike, Ryuto, Masahiro Kaneko, and Naoaki Okazaki. 2023b. Outfox: Llm-generated essay detection through in-context learning with adversarially generated examples. ArXiv preprint, abs/2307.11729.
  • Kojima et al. (2022) Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
  • Krishna et al. (2023) Krishna, Kalpesh, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. ArXiv preprint, abs/2303.13408.
  • Kuditipudi et al. (2023) Kuditipudi, Rohith, John Thickstun, Tatsunori Hashimoto, and Percy Liang. 2023. Robust distortion-free watermarks for language models. CoRR, abs/2307.15593.
  • Kulkarni et al. (2023) Kulkarni, Pranav, Ziqing Ji, Yan Xu, Marko Neskovic, and Kevin Nolan. 2023. Exploring semantic perturbations on grover. CoRR, abs/2302.00509.
  • Kumarage et al. (2023a) Kumarage, Tharindu, Amrita Bhattacharjee, Djordje Padejski, Kristy Roschke, Dan Gillmor, Scott W. Ruston, Huan Liu, and Joshua Garland. 2023a. J-guard: Journalism guided adversarially robust detection of ai-generated news. CoRR, abs/2309.03164.
  • Kumarage et al. (2023b) Kumarage, Tharindu, Paras Sheth, Raha Moraffah, Joshua Garland, and Huan Liu. 2023b. How reliable are ai-generated-text detectors? an assessment framework using evasive soft prompts. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 1337–1349, Association for Computational Linguistics.
  • Lambert et al. (2022) Lambert, Nathan, Louis Castricato, Leandro von Werra, and Alex Havrilla. 2022. Illustrating reinforcement learning from human feedback (rlhf). Hugging Face Blog. https://huggingface.co/blog/rlhf.
  • Lavergne, Urvoy, and Yvon (2008) Lavergne, Thomas, Tanguy Urvoy, and François Yvon. 2008. Detecting fake content with relative entropy scoring. In Proceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software Misuse-Volume 377, pages 27–31.
  • Lee, Jang, and Lee (2021) Lee, Bruce W., Yoo Sung Jang, and Jason Hyung-Jong Lee. 2021. Pushing on text readability assessment: A transformer meets handcrafted linguistic features. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 10669–10686, Association for Computational Linguistics.
  • Lee et al. (2020) Lee, Haejun, Drew A. Hudson, Kangwook Lee, and Christopher D. Manning. 2020. SLM: Learning a discourse language representation with sentence unshuffling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1551–1562, Association for Computational Linguistics.
  • Lee et al. (2023a) Lee, Jooyoung, Thai Le, Jinghui Chen, and Dongwon Lee. 2023a. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647.
  • Lee et al. (2023b) Lee, Taehyun, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. 2023b. Who wrote this code? watermarking for code generation. CoRR, abs/2305.15060.
  • Li et al. (2023a) Li, Linyang, Pengyu Wang, Ke Ren, Tianxiang Sun, and Xipeng Qiu. 2023a. Origin tracing and detecting of llms. CoRR, abs/2304.14072.
  • Li et al. (2023b) Li, Xian, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023b. Self-alignment with instruction backtranslation. ArXiv preprint, abs/2308.06259.
  • Li et al. (2023c) Li, Yafu, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. 2023c. Deepfake text detection in the wild. CoRR, abs/2305.13242.
  • Liang, Guerrero, and Alsmadi (2023) Liang, Gongbo, Jesus Guerrero, and Izzat Alsmadi. 2023. Mutation-based adversarial attacks on neural text detectors. ArXiv preprint, abs/2302.05794.
  • Liang et al. (2023a) Liang, Weixin, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. 2023a. Gpt detectors are biased against non-native english writers. In ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models.
  • Liang et al. (2023b) Liang, Yaobo, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. 2023b. Taskmatrix. ai: Completing tasks by connecting foundation models with millions of apis. ArXiv preprint, abs/2303.16434.
  • Liao et al. (2023a) Liao, Wenxiong, Zhengliang Liu, Haixing Dai, Shaochen Xu, Zihao Wu, Yiyang Zhang, Xiaoke Huang, Dajiang Zhu, Hongmin Cai, Tianming Liu, and Xiang Li. 2023a. Differentiate chatgpt-generated and human-written medical texts. CoRR, abs/2304.11567.
  • Liao et al. (2023b) Liao, Wenxiong, Zhengliang Liu, Haixing Dai, Shaochen Xu, Zihao Wu, Yiyang Zhang, Xiaoke Huang, Dajiang Zhu, Hongmin Cai, Tianming Liu, et al. 2023b. Differentiate chatgpt-generated and human-written medical texts. ArXiv preprint, abs/2304.11567.
  • Lin, Hilton, and Evans (2022) Lin, Stephanie, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Association for Computational Linguistics.
  • Littman and Wrubel (2019) Littman, Justin and Laura Wrubel. 2019. Climate Change Tweets Ids.
  • Liu et al. (2023a) Liu, Aiwei, Leyi Pan, Xuming Hu, Shu’ang Li, Lijie Wen, Irwin King, and Philip S Yu. 2023a. A private watermark for large language models. ArXiv preprint, abs/2307.16230.
  • Liu et al. (2023b) Liu, Aiwei, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. 2023b. A semantic invariant robust watermark for large language models. CoRR, abs/2310.06356.
  • Liu et al. (2022) Liu, Xiaoming, Zhaohan Zhang, Yichen Wang, Yu Lan, and Chao Shen. 2022. Coco: Coherence-enhanced machine-generated text detection under data limitation with contrastive learning. ArXiv preprint, abs/2212.10341.
  • Liu et al. (2023c) Liu, Yikang, Ziyin Zhang, Wanyang Zhang, Shisen Yue, Xiaojing Zhao, Xinyuan Cheng, Yiwen Zhang, and Hai Hu. 2023c. Argugpt: evaluating, understanding and identifying argumentative essays generated by gpt models. ArXiv preprint, abs/2304.07666.
  • Liu et al. (2019) Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Liu et al. (2023d) Liu, Zeyan, Zijun Yao, Fengjun Li, and Bo Luo. 2023d. Check me if you can: Detecting chatgpt-generated academic writing using checkgpt. ArXiv preprint, abs/2306.05524.
  • Lu et al. (2023) Lu, Ning, Shengcai Liu, Rui He, and Ke Tang. 2023. Large language models can be guided to evade ai-generated text detection. ArXiv preprint, abs/2305.10847.
  • Lu et al. (2022) Lu, Yaojie, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 5755–5772, Association for Computational Linguistics.
  • Lucas and Havens (2023) Lucas, Evan and Timothy Havens. 2023. Gpts don’t keep secrets: Searching for backdoor watermark triggers in autoregressive language models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 242–248.
  • Ma, Liu, and Yi (2023) Ma, Yongqiang, Jiawei Liu, and Fan Yi. 2023. Is this abstract generated by ai? a research for the gap between ai-generated scientific text and human-written scientific text. ArXiv preprint, abs/2301.10416.
  • Ma et al. (2023) Ma, Yongqiang, Jiawei Liu, Fan Yi, Qikai Cheng, Yong Huang, Wei Lu, and Xiaozhong Liu. 2023. Ai vs. human–differentiation analysis of scientific content generation. arXiv, 2301.
  • Macko et al. (2023) Macko, Dominik, Róbert Móro, Adaku Uchendu, Jason Samuel Lucas, Michiharu Yamashita, Matús Pikuliak, Ivan Srba, Thai Le, Dongwon Lee, Jakub Simko, and Mária Bieliková. 2023. Multitude: Large-scale multilingual machine-generated text detection benchmark. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9960–9987.
  • Májovskỳ et al. (2023) Májovskỳ, Martin, Martin Černỳ, Matěj Kasal, Martin Komarc, and David Netuka. 2023. Artificial intelligence can generate fraudulent but authentic-looking scientific medical articles: Pandora’s box has been opened. Journal of Medical Internet Research, 25:e46924.
  • Mao et al. (2024) Mao, Chengzhi, Carl Vondrick, Hao Wang, and Junfeng Yang. 2024. Raidar: generative AI detection via rewriting. CoRR, abs/2401.12970.
  • Markowitz, Hancock, and Bailenson (2023) Markowitz, David M, Jeffrey Hancock, and Jeremy Bailenson. 2023. Linguistic markers of inherent ai deception and intentional human deception: Evidence from hotel reviews. PsyArXiv preprint.
  • McCarthy (2005) McCarthy, Philip M. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Ph.D. thesis, The University of Memphis.
  • Mindner, Schlippe, and Schaaff (2023) Mindner, Lorenz, Tim Schlippe, and Kristina Schaaff. 2023. Classification of human- and ai-generated texts: Investigating features for chatgpt. CoRR, abs/2308.05341.
  • Mirsky et al. (2022) Mirsky, Yisroel, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Maura Pintor, Wenke Lee, Yuval Elovici, et al. 2022. The threat of offensive ai to organizations. Computers & Security, page 103006.
  • Mitchell et al. (2023) Mitchell, Eric, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 24950–24962, PMLR.
  • Mitrovic, Andreoletti, and Ayoub (2023) Mitrovic, Sandra, Davide Andreoletti, and Omran Ayoub. 2023. Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text. CoRR, abs/2301.13852.
  • Mitrović, Andreoletti, and Ayoub (2023) Mitrović, Sandra, Davide Andreoletti, and Omran Ayoub. 2023. Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text. ArXiv preprint, abs/2301.13852.
  • Moosavi et al. (2021) Moosavi, Nafise Sadat, Andreas Rücklé, Dan Roth, and Iryna Gurevych. 2021. Scigen: a dataset for reasoning-aware text generation from scientific tables. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  • Morris et al. (2020) Morris, John X., Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 119–126.
  • Mosca et al. (2023) Mosca, Edoardo, Mohamed Hesham Ibrahim Abdalla, Paolo Basso, Margherita Musumeci, and Georg Groh. 2023. Distinguishing fact from fiction: A benchmark dataset for identifying machine-generated scientific papers in the llm era. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 190–207.
  • Mostafazadeh et al. (2016) Mostafazadeh, Nasrin, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, Association for Computational Linguistics.
  • Muñoz-Ortiz, Gómez-Rodríguez, and Vilares (2023) Muñoz-Ortiz, Alberto, Carlos Gómez-Rodríguez, and David Vilares. 2023. Contrasting linguistic patterns in human and llm-generated text. ArXiv preprint, abs/2308.09067.
  • Munyer and Zhong (2023) Munyer, Travis J. E. and Xin Zhong. 2023. Deeptextmark: Deep learning based text watermarking for detection of large language model generated text. CoRR, abs/2305.05773.
  • Murakami, Hoshino, and Zhang (2023) Murakami, Soichiro, Sho Hoshino, and Peinan Zhang. 2023. Natural language generation for advertising: A survey. ArXiv preprint, abs/2306.12719.
  • Muric, Wu, and Ferrara (2021) Muric, G., Y. Wu, and E. Ferrara. 2021. Covid-19 vaccine hesitancy on social media: Building a public twitter dataset of anti-vaccine content, vaccine misinformation and conspiracies. ArXiv preprint, abs/2105.05134.
  • Murtaza et al. (2020) Murtaza, Ghulam, Liyana Shuib, Ainuddin Wahid Abdul Wahab, Ghulam Mujtaba, Ghulam Mujtaba, Henry Friday Nweke, Mohammed Ali Al-garadi, Fariha Zulfiqar, Ghulam Raza, and Nor Aniza Azmi. 2020. Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges. Artificial Intelligence Review, 53:1655–1720.
  • Narayan, Cohen, and Lapata (2018) Narayan, Shashi, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Association for Computational Linguistics.
  • Nicks et al. (2023) Nicks, Charlotte, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D Manning, Chelsea Finn, and Stefano Ermon. 2023. Language model detectors are easily optimized against. In The Twelfth International Conference on Learning Representations.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Orenstrakh et al. (2023) Orenstrakh, Michael Sheinman, Oscar Karnalim, Carlos Anibal Suarez, and Michael Liut. 2023. Detecting llm-generated text in computing education: A comparative study for chatgpt cases. ArXiv preprint, abs/2307.07411.
  • Ouyang et al. (2022) Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv preprint, abs/2203.02155.
  • Pagnoni, Graciarena, and Tsvetkov (2022a) Pagnoni, Artidoro, Martin Graciarena, and Yulia Tsvetkov. 2022a. Threat scenarios and best practices to detect neural fake news. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1233–1249, International Committee on Computational Linguistics.
  • Pagnoni, Graciarena, and Tsvetkov (2022b) Pagnoni, Artidoro, Martin Graciarena, and Yulia Tsvetkov. 2022b. Threat scenarios and best practices to detect neural fake news. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 1233–1249, International Committee on Computational Linguistics.
  • Peng et al. (2024) Peng, Xinlin, Ying Zhou, Ben He, Le Sun, and Yingfei Sun. 2024. Hidding the ghostwriters: An adversarial evaluation of ai-generated student essay detection. CoRR, abs/2402.00412.
  • Piccolo et al. (2023) Piccolo, Stephen R, Paul Denny, Andrew Luxton-Reilly, Samuel Payne, and Perry G Ridge. 2023. Many bioinformatics programming tasks can be automated with chatgpt. ArXiv preprint, abs/2303.13528.
  • Por, Wong, and Chee (2012) Por, Lip Yee, KokSheik Wong, and Kok Onn Chee. 2012. Unispach: A text-based data hiding method using unicode space characters. J. Syst. Softw., 85(5):1075–1082.
  • Porsdam Mann et al. (2023) Porsdam Mann, Sebastian, Brian D Earp, Sven Nyholm, John Danaher, Nikolaj Møller, Hilary Bowman-Smart, Joshua Hatherley, Julian Koplin, Monika Plozza, Daniel Rodger, et al. 2023. Generative ai entails a credit–blame asymmetry. ArXiv preprint, abs/2305.15324.
  • Price and Sakellarios (2023) Price, Gregory and Marc D Sakellarios. 2023. The effectiveness of free software for detecting ai-generated writing. International Journal of Teaching, Learning and Education, 2(6).
  • Pu et al. (2023a) Pu, Jiameng, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, and Bimal Viswanath. 2023a. Deepfake text detection: Limitations and opportunities. In 44th IEEE Symposium on Security and Privacy, SP 2023, San Francisco, CA, USA, May 21-25, 2023, pages 1613–1630, IEEE.
  • Pu et al. (2023b) Pu, Xiao, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, and Tianxing He. 2023b. On the zero-shot generalization of machine-generated text detectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4799–4808.
  • Qiu et al. (2020) Qiu, Xipeng, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.
  • Quidwai, Li, and Dube (2023) Quidwai, Mujahid Ali, Chunhui Li, and Parijat Dube. 2023. Beyond black box AI generated plagiarism detection: From sentence to document level. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023, Toronto, Canada, 13 July 2023, pages 727–735, Association for Computational Linguistics.
  • Radford et al. (2019) Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel et al. (2020) Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  • Rajpurkar et al. (2016) Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Association for Computational Linguistics.
  • Ren et al. (2019) Ren, Shuhuai, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Association for Computational Linguistics.
  • Rizzo, Bertini, and Montesi (2016) Rizzo, Stefano Giovanni, Flavio Bertini, and Danilo Montesi. 2016. Content-preserving text watermarking through unicode homoglyph substitution. In Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS 2016, Montreal, QC, Canada, July 11-13, 2016, pages 97–104, ACM.
  • Rodriguez et al. (2022a) Rodriguez, Juan, Todd Hay, David Gros, Zain Shamsi, and Ravi Srinivasan. 2022a. Cross-domain detection of GPT-2-generated technical text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1213–1233, Association for Computational Linguistics.
  • Rodriguez et al. (2022b) Rodriguez, Juan Diego, Todd Hay, David Gros, Zain Shamsi, and Ravi Srinivasan. 2022b. Cross-domain detection of gpt-2-generated technical text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 1213–1233, Association for Computational Linguistics.
  • Sadasivan et al. (2023) Sadasivan, Vinu Sankar, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can ai-generated text be reliably detected? ArXiv preprint, abs/2303.11156.
  • Saeed and Omlin (2023) Saeed, Waddah and Christian Omlin. 2023. Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities. Knowledge-Based Systems, 263:110273.
  • Sarvazyan et al. (2023a) Sarvazyan, Areg Mikael, José Ángel González, Paolo Rosso, and Marc Franco-Salvador. 2023a. Supervised machine-generated text detectors: Family and scale matters. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 121–132, Springer.
  • Sarvazyan et al. (2023b) Sarvazyan, Areg Mikael, José Ángel González, Paolo Rosso, and Marc Franco-Salvador. 2023b. Supervised machine-generated text detectors: Family and scale matters. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, pages 121–132, Springer.
  • Schaaff, Schlippe, and Mindner (2023) Schaaff, Kristina, Tim Schlippe, and Lorenz Mindner. 2023. Classification of human- and ai-generated texts for english, french, german, and spanish. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), Virtual Event, 16-17 December 2023, pages 1–10, Association for Computational Linguistics.
  • Schneider et al. (2023) Schneider, Sinclair, Florian Steuber, Joao A. G. Schneider, and Gabi Dreo Rodosek. 2023. How well can machine-generated texts be identified and can language models be trained to avoid identification? CoRR, abs/2310.16992.
  • Schulman et al. (2017a) Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017a. Proximal policy optimization algorithms. CoRR, abs/1707.06347.
  • Schulman et al. (2017b) Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017b. Proximal policy optimization algorithms. ArXiv preprint, abs/1707.06347.
  • Schuster et al. (2020a) Schuster, Tal, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020a. The limitations of stylometry for detecting machine-generated fake news. Comput. Linguistics, 46(2):499–510.
  • Schuster et al. (2020b) Schuster, Tal, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020b. The limitations of stylometry for detecting machine-generated fake news. Computational Linguistics, 46(2):499–510.
  • Seals and Shalin (2023) Seals, S. M. and Valerie L. Shalin. 2023. Long-form analogies generated by chatgpt lack human-like psycholinguistic properties. CoRR, abs/2306.04537.
  • Shah et al. (2023) Shah, Aditya, Prateek Ranka, Urmi Dedhia, Shruti Prasad, Siddhi Muni, and Kiran Bhowmick. 2023. Detecting and unmasking ai-generated texts through explainable artificial intelligence using stylistic features. International Journal of Advanced Computer Science and Applications, 14(10).
  • Shen et al. (2020) Shen, Dinghan, Mingzhi Zheng, Yelong Shen, Yanru Qu, and Weizhu Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. ArXiv preprint, abs/2009.13818.
  • Shevlane et al. (2023) Shevlane, Toby, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul F. Christiano, and Allan Dafoe. 2023. Model evaluation for extreme risks. CoRR, abs/2305.15324.
  • Shi and Huang (2020) Shi, Zhouxing and Minlie Huang. 2020. Robustness to modification with shared words in paraphrase identification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 164–171, Association for Computational Linguistics.
  • Shi et al. (2023) Shi, Zhouxing, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, and Cho-Jui Hsieh. 2023. Red teaming language model detectors with language models. ArXiv preprint, abs/2305.19713.
  • Solaiman et al. (2019) Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. 2019. Release strategies and the social impacts of language models. ArXiv preprint, abs/1908.09203.
  • Soni and Wade (2023a) Soni, Mayank and Vincent Wade. 2023a. Comparing abstractive summaries generated by chatgpt to real summaries through blinded reviewers and text classification algorithms. CoRR, abs/2303.17650.
  • Soni and Wade (2023b) Soni, Mayank and Vincent P. Wade. 2023b. Comparing abstractive summaries generated by chatgpt to real summaries through blinded reviewers and text classification algorithms. ArXiv preprint, abs/2303.17650.
  • Stiff and Johansson (2022) Stiff, Harald and Fredrik Johansson. 2022. Detecting computer-generated disinformation. Int. J. Data Sci. Anal., 13(4):363–383.
  • Stokel-Walker (2022) Stokel-Walker, Chris. 2022. Ai bot chatgpt writes smart essays – should academics worry? Nature.
  • Stokel-Walker and Van Noorden (2023) Stokel-Walker, Chris and Richard Van Noorden. 2023. What chatgpt and generative ai mean for science. Nature, 614(7947):214–216.
  • Su et al. (2023a) Su, Jinyan, Terry Yue Zhuo, Di Wang, and Preslav Nakov. 2023a. Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text. CoRR, abs/2306.05540.
  • Su et al. (2023b) Su, Zhenpeng, Xing Wu, Wei Zhou, Guangyuan Ma, and Songlin Hu. 2023b. HC3 plus: A semantic-invariant human chatgpt comparison corpus. CoRR, abs/2309.02731.
  • Susnjak (2022) Susnjak, Teo. 2022. Chatgpt: The end of online exam integrity? ArXiv preprint, abs/2212.09292.
  • Tang, Chuang, and Hu (2023) Tang, Ruixiang, Yu-Neng Chuang, and Xia Hu. 2023. The science of detecting llm-generated texts. CoRR, abs/2303.07205.
  • Tang et al. (2023) Tang, Ruixiang, Qizhang Feng, Ninghao Liu, Fan Yang, and Xia Hu. 2023. Did you train on my dataset? towards public dataset protection with clean-label backdoor watermarking. CoRR, abs/2303.11470.
  • Taori et al. (2023) Taori, Rohan, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Thirunavukarasu et al. (2023) Thirunavukarasu, Arun James, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine, pages 1–11.
  • Topkara, Topkara, and Atallah (2006) Topkara, Umut, Mercan Topkara, and Mikhail J. Atallah. 2006. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In Proceedings of the 8th workshop on Multimedia & Security, MM&Sec 2006, Geneva, Switzerland, September 26-27, 2006, pages 164–174, ACM.
  • Touvron et al. (2023) Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  • Tripto et al. (2023) Tripto, Nafis Irtiza, Adaku Uchendu, Thai Le, Mattia Setzu, Fosca Giannotti, and Dongwon Lee. 2023. HANSEN: human and AI spoken text benchmark for authorship analysis. CoRR, abs/2310.16746.
  • Tu et al. (2023) Tu, Shangqing, Chunyang Li, Jifan Yu, Xiaozhi Wang, Lei Hou, and Juanzi Li. 2023. Chatlog: Recording and analyzing chatgpt across time. CoRR, abs/2304.14106.
  • Tulchinskii et al. (2023) Tulchinskii, Eduard, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Serguei Barannikov, Irina Piontkovskaya, Sergey Nikolenko, and Evgeny Burnaev. 2023. Intrinsic dimension estimation for robust detection of ai-generated texts. ArXiv preprint, abs/2306.04723.
  • Uchendu, Le, and Lee (2023a) Uchendu, Adaku, Thai Le, and Dongwon Lee. 2023a. Attribution and obfuscation of neural text authorship: A data mining perspective. SIGKDD Explor. Newsl., 25(1):1–18.
  • Uchendu, Le, and Lee (2023b) Uchendu, Adaku, Thai Le, and Dongwon Lee. 2023b. Toproberta: Topology-aware authorship attribution of deepfake texts. CoRR, abs/2309.12934.
  • Uchendu et al. (2020) Uchendu, Adaku, Thai Le, Kai Shu, and Dongwon Lee. 2020. Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384–8395, Association for Computational Linguistics.
  • Uchendu et al. (2023) Uchendu, Adaku, Jooyoung Lee, Hua Shen, and Thai Le. 2023. Does human collaboration enhance the accuracy of identifying llm-generated deepfake texts? ArXiv preprint, abs/2304.01002.
  • Uchendu et al. (2021) Uchendu, Adaku, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. 2021. TURINGBENCH: A benchmark environment for Turing test in the age of neural text generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2001–2016, Association for Computational Linguistics.
  • Vasilatos et al. (2023) Vasilatos, Christoforos, Manaar Alam, Talal Rahwan, Yasir Zaki, and Michail Maniatakos. 2023. Howkgpt: Investigating the detection of chatgpt-generated university student homework through context-aware perplexity analysis. ArXiv preprint, abs/2305.18226.
  • Venkatraman, Uchendu, and Lee (2023) Venkatraman, Saranya, Adaku Uchendu, and Dongwon Lee. 2023. Gpt-who: An information density-based machine-generated text detector. CoRR, abs/2310.06202.
  • Verma et al. (2023) Verma, Vivek, Eve Fleisig, Nicholas Tomlin, and Dan Klein. 2023. Ghostbuster: Detecting text ghostwritten by large language models. CoRR, abs/2305.15047.
  • Veselovsky, Ribeiro, and West (2023) Veselovsky, Veniamin, Manoel Horta Ribeiro, and Robert West. 2023. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. ArXiv preprint, abs/2306.07899.
  • Walters (2023) Walters, William H. 2023. The effectiveness of software designed to detect ai-generated writing: A comparison of 16 ai text detectors. Open Information Science, 7(1):20220158.
  • Wang et al. (2019) Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net.
  • Wang et al. (2023a) Wang, Pengyu, Linyang Li, Ke Ren, Botian Jiang, Dong Zhang, and Xipeng Qiu. 2023a. Seqxgpt: Sentence-level ai-generated text detection. CoRR, abs/2310.08903.
  • Wang et al. (2023b) Wang, Yuxia, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji, et al. 2023b. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. ArXiv preprint, abs/2305.14902.
  • Wang et al. (2023c) Wang, Zecong, Jiaxi Cheng, Chen Cui, and Chenhao Yu. 2023c. Implementing BERT and fine-tuned roberta to detect AI generated news by chatgpt. CoRR, abs/2306.07401.
  • Weber-Wulff et al. (2023) Weber-Wulff, Debora, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Foltýnek, Jean Guerrero-Dib, Olumide Popoola, Petr Šigut, and Lorna Waddington. 2023. Testing of detection tools for ai-generated text. International Journal for Educational Integrity, 19(1):26.
  • Wei et al. (2022) Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Weidinger et al. (2021) Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models (2021). ArXiv preprint, abs/2112.04359.
  • Weng et al. (2023) Weng, Luoxuan, Minfeng Zhu, Kam Kwai Wong, Shi Liu, Jiashun Sun, Hang Zhu, Dongming Han, and Wei Chen. 2023. Towards an understanding and explanation for mixed-initiative artificial scientific text detection. ArXiv preprint, abs/2304.05011.
  • Wikipedia (2023) Wikipedia. 2023. Large language models and copyright.
  • Wolff (2020) Wolff, Max. 2020. Attacking neural text detectors. CoRR, abs/2002.11768.
  • Wu et al. (2023) Wu, Kangxi, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. 2023. Llmdet: A third party large language models generated text detection tool. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 2113–2133, Association for Computational Linguistics.
  • Wu and Xiang (2023) Wu, Zhendong and Hui Xiang. 2023. Mfd: Multi-feature detection of llm-generated text. CoRR.
  • Yan et al. (2023) Yan, Duanli, Michael Fauss, Jiangang Hao, and Wenju Cui. 2023. Detection of ai-generated essays in writing assessment. Psychological Testing and Assessment Modeling, 65(2):125–144.
  • Yan et al. (2021) Yan, Yuanmeng, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5065–5075, Association for Computational Linguistics.
  • Yanagi et al. (2020) Yanagi, Yuta, Ryohei Orihara, Yuichi Sei, Yasuyuki Tahara, and Akihiko Ohsuga. 2020. Fake news detection with generated comments for news articles. In 2020 IEEE 24th International Conference on Intelligent Engineering Systems (INES), pages 85–90, IEEE.
  • Yang, Jiang, and Li (2023) Yang, Lingyi, Feng Jiang, and Haizhou Li. 2023. Is chatgpt involved in texts? Measure the polish ratio to detect chatgpt-generated text. ArXiv preprint, abs/2307.11380.
  • Yang et al. (2023a) Yang, Xi, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. 2023a. Watermarking text generated by black-box language models. CoRR, abs/2305.08883.
  • Yang et al. (2022) Yang, Xi, Jie Zhang, Kejiang Chen, Weiming Zhang, Zehua Ma, Feng Wang, and Nenghai Yu. 2022. Tracing text provenance via context-aware lexical substitution. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022, pages 11613–11621, AAAI Press.
  • Yang et al. (2023b) Yang, Xianjun, Wei Cheng, Linda R. Petzold, William Yang Wang, and Haifeng Chen. 2023b. DNA-GPT: divergent n-gram analysis for training-free detection of gpt-generated text. CoRR, abs/2305.17359.
  • Yang et al. (2019) Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754–5764.
  • Yao et al. (2023) Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. ArXiv preprint, abs/2305.10601.
  • Yasunaga and Liang (2021) Yasunaga, Michihiro and Percy Liang. 2021. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11941–11952, PMLR.
  • Yoo et al. (2023) Yoo, KiYoon, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. 2023. Robust multi-bit natural language watermarking through invariant features. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 2092–2115, Association for Computational Linguistics.
  • Yu et al. (2023a) Yu, Peipeng, Jiahan Chen, Xuan Feng, and Zhihua Xia. 2023a. CHEAT: A large-scale dataset for detecting chatgpt-written abstracts. CoRR, abs/2304.12008.
  • Yu et al. (2023b) Yu, Xiao, Yuang Qi, Kejiang Chen, Guoqiang Chen, Xi Yang, Pengyuan Zhu, Weiming Zhang, and Nenghai Yu. 2023b. Gpt paternity test: Gpt generated text detection with gpt genetic inheritance. ArXiv preprint, abs/2305.12519.
  • Yuan et al. (2022) Yuan, Ann, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852.
  • Zellers et al. (2019a) Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019a. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Association for Computational Linguistics.
  • Zellers et al. (2019b) Zellers, Rowan, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019b. Defending against neural fake news. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 9051–9062.
  • Zeng et al. (2023) Zeng, Zijie, Lele Sha, Yuheng Li, Kaixun Yang, Dragan Gašević, and Guanliang Chen. 2023. Towards automatic boundary detection for human-ai hybrid essay in education. ArXiv preprint, abs/2307.12267.
  • Zhang et al. (2023a) Zhang, Ruisi, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar. 2023a. REMARK-LLM: A robust and efficient watermarking framework for generative large language models. CoRR, abs/2310.12362.
  • Zhang et al. (2023b) Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023b. Siren’s song in the ai ocean: A survey on hallucination in large language models. ArXiv preprint, abs/2309.01219.
  • Zhao et al. (2021) Zhao, Zihao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706, PMLR.
  • Zheng et al. (2023) Zheng, Qinkai, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. ArXiv preprint, abs/2303.17568.
  • Zhong et al. (2020) Zhong, Wanjun, Duyu Tang, Zenan Xu, Ruize Wang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2020. Neural deepfake detection with factual structure of text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2461–2470, Association for Computational Linguistics.
  • Zhu et al. (2023) Zhu, Biru, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, and Ming Gu. 2023. Beat llms at their own game: Zero-shot llm-generated text detection via querying chatgpt. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 7470–7483, Association for Computational Linguistics.