Are you still on track!? Catching LLM Task Drift with Activations

Abdelnabi, Sahar; Fay, Aideen; Cherubin, Giovanni; Salem, Ahmed; Fritz, Mario; Paverd, Andrew

arXiv:2406.00799 (cs)

[Submitted on 2 Jun 2024 (v1), last revised 19 Jul 2024 (this version, v4)]

Abstract: Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources. These inputs, even in a single LLM interaction, can come from a variety of sources, of varying trustworthiness and provenance. This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions. We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations. We compare the LLM's activations before and after processing the external input in order to detect whether this input caused instruction drift. We develop two probing methods and find that simply using a linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Our setup does not require any modification of the LLM (e.g., fine-tuning) or any text generation, thus maximizing deployability and cost efficiency and avoiding reliance on unreliable model output. To foster future research on activation-based task inspection, decoding, and interpretability, we will release our large-scale TaskTracker toolkit, comprising a dataset of over 500K instances, representations from 5 SoTA language models, and inspection tools.
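
The core idea in the abstract (take activations before and after the external input, form their difference, and score it with a linear classifier) can be illustrated with a small sketch. This is not the authors' implementation or the TaskTracker API: the vectors below are synthetic stand-ins for real hidden states, the "drift" is simulated as a constant shift, and every function name (`activation_delta`, `train_linear_probe`, `drift_score`, `roc_auc`, `synthetic_pair`, `run_demo`) is hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def activation_delta(before: list[float], after: list[float]) -> list[float]:
    """Activations after reading the external input minus those before it."""
    return [a - b for b, a in zip(before, after)]


def train_linear_probe(deltas, labels, lr=0.05, epochs=30):
    """Fit a logistic-regression probe (a linear classifier) with plain SGD."""
    w = [0.0] * len(deltas[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(deltas, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            bias -= lr * err
    return w, bias


def drift_score(w, bias, delta) -> float:
    """Probe output in [0, 1]; higher means drift is more likely."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, delta)) + bias)


def roc_auc(scores, labels) -> float:
    """Chance that a drifted example outranks a clean one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synthetic_pair(rng, dim, drifted):
    """ASSUMPTION: drift is simulated as a constant shift plus Gaussian noise."""
    before = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    shift = 1.0 if drifted else 0.0
    after = [b + shift + rng.gauss(0.0, 0.3) for b in before]
    return before, after


def run_demo(seed=0, dim=8, n=200) -> float:
    """Train on one synthetic split and return ROC AUC on a held-out split."""
    rng = random.Random(seed)

    def make_split(count):
        labels = [i % 2 for i in range(count)]  # alternate clean / drifted
        deltas = [
            activation_delta(*synthetic_pair(rng, dim, y == 1)) for y in labels
        ]
        return deltas, labels

    train_x, train_y = make_split(n)
    test_x, test_y = make_split(n)
    w, bias = train_linear_probe(train_x, train_y)
    return roc_auc([drift_score(w, bias, x) for x in test_x], test_y)


if __name__ == "__main__":
    print(f"held-out ROC AUC on synthetic data: {run_demo():.3f}")
```

In a real setting, `before` and `after` would be hidden states read from the LLM at the corresponding points in the prompt, and no text generation would be needed to compute the score. The synthetic data here is far easier to separate than real activations, so the resulting AUC says nothing about the paper's reported results.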

Subjects:	Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2406.00799 [cs.CR]
	(or arXiv:2406.00799v4 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2406.00799

Submission history

From: Sahar Abdelnabi
[v1] Sun, 2 Jun 2024 16:53:21 UTC (3,274 KB)
[v2] Mon, 10 Jun 2024 15:39:56 UTC (3,275 KB)
[v3] Thu, 20 Jun 2024 13:33:08 UTC (3,275 KB)
[v4] Fri, 19 Jul 2024 13:07:25 UTC (3,687 KB)
