Showing 1–29 of 29 results for author: Nanda, N

Searching in archive cs.
  1. arXiv:2406.17759  [pdf, other]

    cs.LG

    Interpreting Attention Layer Outputs with Sparse Autoencoders

    Authors: Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

    Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that also h…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.16254  [pdf, other]

    cs.LG cs.AI cs.CL

    Confidence Regulation Neurons in Language Models

    Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

    Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized b…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 25 pages, 14 figures

  3. arXiv:2406.11944  [pdf, other]

    cs.LG cs.CL

    Transcoders Find Interpretable LLM Feature Circuits

    Authors: Jacob Dunefsky, Philippe Chlenski, Neel Nanda

    Abstract: A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely m…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 28 pages, 6 figures, 3 tables, 2 algorithms. Under review

  4. arXiv:2406.11717  [pdf, other]

    cs.LG cs.AI cs.CL

    Refusal in Language Models Is Mediated by a Single Direction

    Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda

    Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models…

    Submitted 17 June, 2024; originally announced June 2024.

  5. arXiv:2405.08366  [pdf, other]

    cs.LG

    Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

    Authors: Aleksandar Makelov, George Lange, Neel Nanda

    Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them aga…

    Submitted 20 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

  6. arXiv:2404.16014  [pdf, other]

    cs.LG cs.AI

    Improving Dictionary Learning with Gated Sparse Autoencoders

    Authors: Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

    Abstract: Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to enco…

    Submitted 30 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: 15 main text pages, 22 appendix pages

  7. arXiv:2404.15255  [pdf, other]

    cs.LG

    How to use and interpret activation patching

    Authors: Stefan Heimersheim, Neel Nanda

    Abstract: Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: A tutorial on activation patching. 13 pages, 2 figures

  8. arXiv:2403.00745  [pdf, other]

    cs.LG cs.CL

    AtP*: An efficient and scalable method for localizing LLM behaviour to components

    Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

    Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching an…

    Submitted 1 March, 2024; originally announced March 2024.

  9. arXiv:2402.15390  [pdf, other]

    cs.LG cs.AI cs.CL

    Explorations of Self-Repair in Language Models

    Authors: Cody Rushing, Neel Nanda

    Abstract: Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where if components in large language models are ablated, later components will change their behavior to compensate. Our work builds off this past literature, demonstrating that self-repair exists on a variety of model families and sizes when ablating individual attention heads on t…

    Submitted 26 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  10. arXiv:2402.07321  [pdf, other]

    cs.LG cs.CL

    Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

    Authors: Bilal Chughtai, Alan Cooney, Neel Nanda

    Abstract: How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several disti…

    Submitted 11 February, 2024; originally announced February 2024.

    Comments: NeurIPS 2023 Attributing Model Behaviour at Scale Workshop

  11. arXiv:2401.12181  [pdf, other]

    cs.LG cs.AI cs.CL

    Universal Neurons in GPT2 Language Models

    Authors: Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas

    Abstract: A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models? In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neuron…

    Submitted 22 January, 2024; originally announced January 2024.

  12. arXiv:2311.17030  [pdf, other]

    cs.LG cs.AI cs.CL

    Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

    Authors: Aleksandar Makelov, Georg Lange, Neel Nanda

    Abstract: Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In thi…

    Submitted 6 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023 Workshop on Attributing Model Behavior at Scale

  13. arXiv:2311.00863  [pdf, other]

    cs.LG cs.AI cs.CL

    Training Dynamics of Contextual N-Grams in Language Models

    Authors: Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

    Abstract: Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughou…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted workshop paper at ATTRIB 2023 (@ NeurIPS)

  14. arXiv:2310.15154  [pdf, other]

    cs.LG cs.AI cs.CL

    Linear Representations of Sentiment in Large Language Models

    Authors: Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

    Abstract: Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through…

    Submitted 23 October, 2023; originally announced October 2023.

  15. arXiv:2310.04625  [pdf, other]

    cs.LG cs.AI cs.CL

    Copy Suppression: Comprehensively Understanding an Attention Head

    Authors: Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

    Abstract: We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior which improves overall model calibration. This explains why multi…

    Submitted 6 October, 2023; originally announced October 2023.

  16. arXiv:2309.16042  [pdf, other]

    cs.LG cs.AI cs.CL

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

    Authors: Fred Zhang, Neel Nanda

    Abstract: Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice o…

    Submitted 16 January, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: 27 pages. ICLR 2024

  17. arXiv:2309.00941  [pdf, other]

    cs.LG

    Emergent Linear Representations in World Models of Self-Supervised Sequence Models

    Authors: Neel Nanda, Andrew Lee, Martin Wattenberg

    Abstract: How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the…

    Submitted 7 September, 2023; v1 submitted 2 September, 2023; originally announced September 2023.

  18. arXiv:2307.09458  [pdf, other]

    cs.LG

    Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

    Authors: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

    Abstract: Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and…

    Submitted 24 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

  19. arXiv:2305.19911  [pdf, other]

    cs.LG cs.CL

    Neuron to Graph: Interpreting Language Model Neurons at Scale

    Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

    Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make th…

    Submitted 31 May, 2023; originally announced May 2023.

  20. arXiv:2305.01610  [pdf, other]

    cs.LG cs.AI

    Finding Neurons in a Haystack: Case Studies with Sparse Probing

    Authors: Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

    Abstract: Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train k-sparse linear classifiers (probes) on these internal activations to predict the presence of f…

    Submitted 2 June, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  21. arXiv:2304.12918  [pdf, other]

    cs.LG

    N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

    Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

    Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose Neuron to Graph (N2G), a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than cu…

    Submitted 22 April, 2023; originally announced April 2023.

    Comments: To be published at ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models

  22. arXiv:2302.03025  [pdf, other]

    cs.LG cs.AI math.RT

    A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

    Authors: Bilal Chughtai, Lawrence Chan, Neel Nanda

    Abstract: Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In this work, we study the universality hypothesis by examining how small neural networks learn to implement group composition. We present a novel algorithm by which neural networks may implement composition for any finite group via mathematic…

    Submitted 24 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: 9 page main body, 1 page references, 12 page appendix

  23. arXiv:2301.05217  [pdf, other]

    cs.LG cs.AI

    Progress measures for grokking via mechanistic interpretability

    Authors: Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

    Abstract: Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: r…

    Submitted 19 October, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

    Comments: 10 page main body, 2 page references, 24 page appendix

  24. arXiv:2209.11895  [pdf]

    cs.LG

    In-context Learning and Induction Heads

    Authors: Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, et al. (1 additional author not shown)

    Abstract: "Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induc…

    Submitted 23 September, 2022; originally announced September 2022.

  25. arXiv:2204.05862  [pdf, other]

    cs.CL cs.LG

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Authors: Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, et al. (6 additional authors not shown)

    Abstract: We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where prefer…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: Data available at https://github.com/anthropics/hh-rlhf

  26. Predictability and Surprise in Large Generative Models

    Authors: Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, et al. (5 additional authors not shown)

    Abstract: Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad train…

    Submitted 3 October, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Updated to reflect the version submitted (and accepted) to ACM FAccT '22. This update incorporates feedback from peer-review and fixes minor typos. See open access FAccT conference version at: https://dl.acm.org/doi/abs/10.1145/3531146.3533229

  27. arXiv:2110.01577  [pdf, other]

    cs.LG cs.CY

    An Empirical Investigation of Learning from Biased Toxicity Labels

    Authors: Neel Nanda, Jonathan Uesato, Sven Gowal

    Abstract: Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels. In this paper, we study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically g…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: 8 pages, 6 figures. Accepted to the Socially Responsible Machine Learning Workshop, ICML 2021

  28. arXiv:2109.12370  [pdf, other]

    cs.SI

    Interpretable Business Survival Prediction

    Authors: Anish K. Vallapuram, Nikhil Nanda, Young D. Kwon, Pan Hui

    Abstract: The survival of a business is undeniably pertinent to its success. A key factor contributing to its continuity depends on its customers. The surge of location-based social networks such as Yelp, Dianping, and Foursquare has paved the way for leveraging user-generated content on these platforms to predict business survival. Prior works in this area have developed several quantitative features to ca…

    Submitted 25 September, 2021; originally announced September 2021.

    Comments: 8 pages, 10 figures

  29. arXiv:2102.08686  [pdf, other]

    cs.LG cs.AI

    Fully General Online Imitation Learning

    Authors: Michael K. Cohen, Marcus Hutter, Neel Nanda

    Abstract: In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that…

    Submitted 4 October, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: 13 pages with 8-page appendix

    ACM Class: I.2.0; I.2.6