Truth is Universal: Robust Detection of Lies in LLMs

Bürger, Lennart; Hamprecht, Fred A.; Nadler, Boaz

arXiv:2407.12831 (cs)

[Submitted on 3 Jul 2024 (v1), last revised 21 Oct 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios.
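
Illustrative sketch (not the authors' code): the toy example below shows one way a two-dimensional structure in activation space could make a linear probe fail on negated statements. Everything in it is an assumption made for illustration: the activations are synthetic, and the names `t_general`, `t_polarity`, `make_activations`, `fit_direction`, the dimension, the noise level, and the least-squares probe are hypothetical choices that do not come from the abstract. It assumes a truth signal that points one way in all statements plus a second component whose sign flips with negation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200  # hypothetical activation dimension and samples per group

# Two orthogonal directions spanning the assumed 2-D subspace (toy values).
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))
t_general = basis[:, 0]          # same sign for affirmative and negated text
t_polarity = 1.5 * basis[:, 1]   # sign flips when the statement is negated


def make_activations(polarity):
    """Synthetic activations; truth label tau and polarity are both +1 or -1."""
    tau = rng.choice([-1.0, 1.0], size=n)
    signal = np.outer(tau, t_general) + np.outer(tau * polarity, t_polarity)
    noise = 0.3 * rng.normal(size=(n, d))
    return signal + noise, tau


def fit_direction(X, tau):
    """Least-squares linear probe: find w such that X @ w approximates tau."""
    w, *_ = np.linalg.lstsq(X, tau, rcond=None)
    return w


def accuracy(w, X, tau):
    """Fraction of samples whose projection sign matches the truth label."""
    return float(np.mean(np.sign(X @ w) == tau))


X_aff, tau_aff = make_activations(+1.0)  # affirmative statements
X_neg, tau_neg = make_activations(-1.0)  # negated statements

# Probe trained on affirmative statements only, then tested on negated ones.
w_aff = fit_direction(X_aff, tau_aff)
acc_aff_on_aff = accuracy(w_aff, X_aff, tau_aff)
acc_aff_on_neg = accuracy(w_aff, X_neg, tau_neg)

# Probe trained on both polarities, tested on the same negated statements.
w_all = fit_direction(np.vstack([X_aff, X_neg]),
                      np.concatenate([tau_aff, tau_neg]))
acc_all_on_neg = accuracy(w_all, X_neg, tau_neg)

print(f"affirmative-only probe on affirmative: {acc_aff_on_aff:.2f}")
print(f"affirmative-only probe on negated:     {acc_aff_on_neg:.2f}")
print(f"mixed-polarity probe on negated:       {acc_all_on_neg:.2f}")
```

In this toy geometry the affirmative-only probe mixes both directions, so its predictions on negated statements fall below chance, while the probe fitted on both polarities aligns mostly with the shared direction. Whether real LLM activations behave this way is an empirical question that the paper, not this sketch, addresses.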

Comments:	NeurIPS 2024 poster
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.12831 [cs.CL]
	(or arXiv:2407.12831v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.12831

Submission history

From: Lennart Bürger
[v1] Wed, 3 Jul 2024 13:01:54 UTC (2,558 KB)
[v2] Mon, 21 Oct 2024 08:55:49 UTC (4,653 KB)
