Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller (FAR AI), Bilal Chughtai (Independent), William Saunders (Independent)

Correspondence to josephmiller101@gmail.com

Abstract

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover ‘circuits’ – subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit ‘faithfulness’ – the degree to which the circuit replicates the performance of the full model. In this work, we survey many considerations for designing experiments that measure circuit faithfulness by ablating portions of the model’s computation. Concerningly, we find existing methods are highly sensitive to seemingly insignificant changes in the ablation methodology. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit: the task a circuit is required to perform depends on the ablation used to test it. The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits. We open source a library at this https URL that includes highly efficient implementations of a wide range of ablation methodologies and circuit discovery algorithms.

1 Introduction

Mechanistic interpretability (MI) is a form of post-hoc interpretability that attempts to reverse engineer neural networks to provide faithful low-level explanations of model behaviour (Olah et al., 2020). One focus of interpretability work on transformer language models is identifying ‘circuits’ – subgraphs of the entire model’s computational graph that are primarily responsible for the model’s output on some task (Wang et al., 2023). Here a task is a specific type of problem that a language model has to solve to output correct next-token predictions (i.e. sentences that require a specific algorithm to complete correctly).

A key metric used by mechanistic interpretability (MI) researchers to quantify the quality of a ‘circuit’ for some task is its faithfulness – that is, the degree to which the circuit captures the performance of the entire model (Zhang & Nanda, 2024). In this work, we study various small and reasonable-seeming variations on methodologies for measuring circuit faithfulness and find that such variations often lead to significantly different faithfulness scores. Faithfulness is typically measured by performing a targeted, circuit-dependent ablation to the model, and observing the effect of this on some metric of the model’s output. In the context of MI, an ablation refers to a type of intervention made on the activations of a model during its forward pass with the intended purpose of ‘deleting’ some causal pathway(s), thereby isolating the causal effect of the circuit.

In this work, we seek to answer the questions: What do circuit faithfulness metrics actually show? To what extent are they a useful test of the circuit and to what extent are they a reflection of the experimental methodology?

We begin by reviewing the ways in which MI researchers may vary their ablation methodology (Section 3), providing a detailed review of methods for ablating transformer circuits. Next, we test these variations on existing circuits discovered by MI researchers (Section 4). We provide detailed case studies of the ‘Indirect Object Identification’ circuit by Wang et al. (2023), the ‘Docstring’ circuit by Heimersheim & Janiak (2023) and the ‘Sports Players’ circuit by Nanda et al. (2023b). We then go on to study ‘optimal circuits’ (Section 5) in the context of automated circuit discovery (Conmy et al., 2023) – an emerging paradigm that aims to discover circuits algorithmically, without human input.

We conclude with recommendations for MI researchers (Section 6). We additionally release AutoCircuit, a library containing efficient implementations of the circuit-discovery and circuit-evaluation techniques used in this paper, that is significantly faster than prior implementations we tested (see Appendix A for more details).

2 Related Work

Circuit Analysis. Circuit analysis is a form of post-hoc interpretability focused on understanding the full end-to-end learned algorithm responsible for some specified narrow behaviour. A circuit is a subgraph of the full computational graph of the model that (is alleged to) implement some precise behavior. Circuits have been studied in vision models (Cammarata et al., 2021; Olah et al., 2020) and in toy transformer models (Nanda et al., 2023a; Chughtai et al., 2023). More recently, the circuit analysis paradigm has achieved success in interpreting transformer language models too, with a number of papers discovering circuits implementing human understandable algorithms through ablation studies (Wang et al., 2023; Heimersheim & Janiak, 2023; Hanna et al., 2023). To accelerate such studies, recent work has attempted to automate the process of discovering circuits (Conmy et al., 2023; Syed & Rager, 2023; Kramar et al., 2024), particularly in large language models, as circuits have historically required a large amount of researcher-effort to uncover. Prior work has suggested that ideal circuits exist on the Pareto frontier of faithfulness, completeness and simplicity (description length), as the entire network is trivially optimal for the first two criteria (Sharkey, 2024).

Activation Patching. Zhang & Nanda (2024) recommend best practices in Activation Patching (a form of ablation, defined in Section 3.1) for measuring circuit faithfulness in a similar work to ours. They compare single layer vs. multi-layer ablation, Resample Ablation vs. Noise Ablation and logit difference vs probability metrics when Node Patching. We study a larger set of variations in ablation methodology in this work, enumerating several more choices in methodology and arguing that different optimal circuits are defined in part by different ablation methodologies, rather than prescribing a single correct approach to ablation.

Faithful explanations in NLP. We are interested in explaining model behavior in a way that reflects the underlying reasoning process of the model, a criterion often referred to as faithfulness. In this work we measure faithfulness by studying the fidelity of ablated models, that is, the similarity of the ablated output to the outputs of the full model (Alishahi et al., 2019; Guidotti et al., 2018; Agarwal et al., 2024). As argued by Jacovi & Goldberg (2020), faithfulness should be viewed as a continuum. Any interpretation is an approximation that will necessarily fail to capture some aspects of the underlying behavior.

Mechanistic interpretability (MI) attempts to reverse engineer trained machine learning models to produce faithful human understandable explanations of model predictions via analysis of the low level features and algorithms implemented by the network. Circuit analysis is just one important direction in this theme of work. Besides circuit analysis, MI more broadly seeks to understand the correct frame to interpret neural network computation (Elhage et al., 2021; Bricken et al., 2023; Cunningham et al., 2023) and to understand the learned features of models (Li et al., 2023; Tigges et al., 2023; Gurnee & Tegmark, 2024; Bills et al., 2023). MI has also inspired work in steering model outputs through representation engineering (Turner et al., 2023; Li et al., 2024; Rimsky et al., 2024).

3 Measuring Faithfulness

We follow previous works (Wang et al., 2023; Heimersheim & Janiak, 2023; Hanna et al., 2023) in defining faithfulness of circuits as the extent to which they encapsulate the full model’s computation of a particular task. These works measure faithfulness by ablating the components of the computational graph that are not in the circuit and observing the change in output of the model.

However, even within this framework, there are several important further choices when designing experiments, which we review in this section and summarise in Table 1. We also provide a summary of the approaches taken by previous works in Table 2.

Choice	Granularity	Component	Value	Token positions	Direction	Set
Examples	Heads, MLPs	Node	Resample/Patch	All tokens	Ablate Clean	Circuit
	Q, K, V, MLPs	Edge	Zero	Specific tokens	Restore Clean	Complement
	Heads, MLP Neurons	Branch	Mean
	Sparse features		Noise

Table 1: The six-tuple that defines ablation methodology for transformer circuits.

Work	Granularity	Component	Value	Token positions	Direction	Set
Vig et al. (2020) (Gender Bias)	Heads, Neurons	Node	Resample (clean)	All tokens	Resample Clean	Circuit
Meng et al. (2022) (ROME)	Layers	Node	Resample (clean)	Specific tokens	Resample Clean	Circuit
Wang et al. (2023) (IOI)	Heads	Node (evaluation) / Path (discovery)	Mean	Specific tokens	Ablate Clean	Complement
Conmy et al. (2023) (ACDC)	Heads, MLPs	Edge	Resample (corrupt)	All tokens	Ablate Clean	Complement
Heimersheim & Janiak (2023) (Docstring)	Heads	Node	Resample (clean)	All tokens (evaluation) / Specific tokens (discovery)	Resample Clean	Circuit
Hanna et al. (2023) (Greaterthan)	Heads, MLPs	Path	Resample (corrupt)	All tokens	Ablate Clean	Complement
Nanda et al. (2023b) (Sports Players)	Heads, MLPs	Path	Resample (corrupt)	Specific tokens	Ablate Clean	Complement

Table 2: Summary of the patching methodologies used by seven previous works. Note that each methodology differs from all of the others in at least one aspect.

3.1 Ablation Methodology

In the context of MI, an ablation refers to a type of intervention made on the activations of a model during its forward pass with the intended purpose of ‘deleting’ precise causal pathways. In the language of causal inference, we denote the ablation of all activations outside a circuit $C$ on a model $M$ as:

$$F(x)=M(x\mid\text{do}(a=\tilde{a})),\,a\notin C \tag{1}$$

where $x$ is the input to the model, $a$ is an internal activation of the model and $\tilde{a}$ is the ablated value of $a$. The ablation methodology determines the types of activations and values that $a$ and $\tilde{a}$ can be (e.g. whether $a$ is a neuron node activation or an edge between attention heads).
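As a minimal sketch of equation (1), the toy example below treats a model as a sum of independent scalar components. This is a simplifying assumption for illustration, not a transformer, and the names `run_model`, `run_ablated` and the component functions are hypothetical.

```python
from typing import Callable, Dict, Set

# Hypothetical toy "model": every named component reads the input and emits a
# scalar activation; the model output is the sum of all activations.
Components = Dict[str, Callable[[float], float]]


def run_model(components: Components, x: float) -> float:
    """M(x): the full model with no intervention."""
    return sum(f(x) for f in components.values())


def run_ablated(
    components: Components,
    circuit: Set[str],
    ablated_values: Dict[str, float],
    x: float,
) -> float:
    """F(x): every activation outside the circuit is set to its ablated value."""
    total = 0.0
    for name, f in components.items():
        total += f(x) if name in circuit else ablated_values[name]
    return total


components: Components = {
    "a": lambda x: 2.0 * x,  # varies with the input
    "b": lambda x: 0.5 * x,  # varies with the input
    "c": lambda x: 3.0,      # constant across inputs
}
circuit = {"a"}
zeros = {"b": 0.0, "c": 0.0}  # one possible choice of ablated values

full_output = run_model(components, 2.0)
circuit_output = run_ablated(components, circuit, zeros, 2.0)
```

Changing `zeros` to some other set of ablated values changes `circuit_output` without changing the circuit itself, which is the kind of sensitivity examined in the rest of the paper.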

Intuitively, deleting important subcomponents for some task should damage task performance, and conversely deleting unimportant sub-components should preserve task performance. As such, ablations have arisen as a commonly used tool for localizing model behaviour to specific internal model components. Ablations may be used both to find and evaluate mechanistic explanations of model behavior.

The concept of ablation overlaps with a related technique, activation patching, in which activations are modified during a model’s forward pass to some cached values from a different input. ‘Corrupted’ inputs are inputs which are similar to the ‘clean’ distribution being studied, but which have crucial differences that drastically change the output. For example, a typical ‘corrupt’ prompt could retain the structure of a ‘clean’ prompt, while switching a proper noun, such that the correct next token prediction is changed. In this work we consider activation patching to be a specific type of ablation, and use the term Resample Ablation interchangeably. But we note that in general, ‘patching’ means editing activations to some other value, instead of ‘deleting’ them, as ablation typically connotes.

In the remainder of this section, we review the range of ablation techniques that exist in the literature, specifically as they relate to evaluating circuits. There exist several important experimental design choices when evaluating transformer circuits via ablations. These are (1) the granularity of the computational graph used to represent the model, (2) what type of component in the graph is ablated, (3) what type of activation value is used to ablate the component, (4) which token positions are ablated, (5) the ablation direction (whether the ablation destroys or restores the signal) and (6) the set of components ablated (the circuit or the complement of the circuit). A circuit-based ablation methodology can therefore be specified as a six-tuple, and prior work has used many different combinations (Table 2). In this paper we argue that existing evaluations of circuits are sensitive to each of these variables.

3.1.1 Circuit Granularity

In this work we study circuits specified at the level of attention heads and MLPs (see Thickstun (2024) for a brief overview of the transformer architecture). We also separate the input of each attention head into the Q, K and V inputs, but we omit this from our diagrams for visual simplicity. This is the most common granularity for mechanistic circuit analysis (Conmy et al., 2023; Wang et al., 2023; Heimersheim & Janiak, 2023; Hanna et al., 2023; Nanda et al., 2023b), but previous works have also studied circuits specified at the level of layers (Meng et al., 2022), neurons (Vig et al., 2020), subspaces (Geiger et al., 2023) and sparse ‘features’ (Marks et al., 2024).

3.1.2 Ablation Component Type (and Associated Model Views)

Figure 1(a): Node Patching (often called Activation Patching) replaces the output of some component to the residual stream in the unfactorized transformer.

Transformers can be described as computational graphs in several different, equivalent ways. We can choose to write the graph as a residual network (Figure 7(a)) or a ‘factorized’ network in which all nodes are connected via an edge to all prior nodes (Figure 7(b)) (Elhage et al., 2021). Or we can write down a ‘treeified’ network that separates all paths from input to output (Figure 8(a)). All formulations are equivalent but the ‘factorized’ view allows us to isolate interactions between individual components and the ‘treeified’ view allows us to isolate chains of interactions from input to output.

The component type defines the type of intervention made: we detail three possibilities, with increasing granularity. The more granular approaches are generally more difficult to implement and more computationally expensive.

(1) Nodes. We may intervene on a node (in the standard, residual view) during the forward pass, replacing its activation with some other value (Figure 1(a)). This is the least specific form of ablation. Since all downstream nodes ‘see’ the change there are a large number of causal pathways affected by the ablation, which may result in unintended side-effects. This type of ablation is also known as (vanilla) activation patching (Vig et al., 2020) when we ablate with a cached activation from another input.

(2) Edges. Using the factorized view of a transformer, we may intervene on an edge between two components (Figure 1(b)). This is more specific than ablating nodes, as only the specified destination node receives the ablated activation of the source node, so a smaller number of causal pathways are affected.

(3) Branches. The previous two ablations can be applied to individual nodes or edges, or to a collection of nodes and edges. Branch ablations, on the other hand, can only be applied to paths from input to output (Figure 1(c)). The causal effect of individual paths through the model is isolated by ‘treeifying’ the factorized model. This approach was introduced by Chan et al. (2022) (formalized by Goldowsky-Dill et al. (2023)) and is a key component of a rigorous circuit evaluation approach known as Causal Scrubbing. However, because the number of paths in the treeified model is exponential in the number of layers of the model, this approach to circuit evaluation is often intractable in practice. We omit treeified experiments in this work.
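The difference between node and edge ablation can be illustrated on a tiny hand-written graph. The sketch below is a toy with scalar activations and made-up node functions, assuming every node reads the sum of all earlier outputs as in the factorized view; `forward`, `node_ablate` and `edge_ablate` are hypothetical names. Branch ablation is not shown.

```python
from typing import Dict, Optional, Tuple

Edge = Tuple[str, str]  # (source, destination)


def forward(
    x: float,
    node_ablate: Optional[Dict[str, float]] = None,
    edge_ablate: Optional[Dict[Edge, float]] = None,
) -> float:
    """Toy factorized graph with nodes input, A, B and out."""
    node_ablate = node_ablate or {}
    edge_ablate = edge_ablate or {}

    def read(src: str, value: float, dest: str) -> float:
        # An edge ablation changes only what `dest` sees from `src`.
        return edge_ablate.get((src, dest), value)

    a_out = 2.0 * read("input", x, "A")
    a_out = node_ablate.get("A", a_out)  # node ablation: every reader sees it
    b_out = read("input", x, "B") + read("A", a_out, "B") + 1.0
    b_out = node_ablate.get("B", b_out)
    return (
        read("input", x, "out")
        + read("A", a_out, "out")
        + read("B", b_out, "out")
    )


clean = forward(1.0)
node_ablated = forward(1.0, node_ablate={"A": 0.0})         # B and out lose A
edge_ablated = forward(1.0, edge_ablate={("A", "B"): 0.0})  # only B loses A
```

In this toy the node ablation moves the output further from the clean value than the edge ablation does, because it also removes the direct path from A to the output.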

3.1.3 Ablation Value

When performing a causal intervention on some activation, we may choose what value we patch in. The simplest choice is to Zero Ablate, by replacing the activation with a vector of zeros (Olsson et al., 2022; Cammarata et al., 2021). Prior work has noted however that the zero point is arbitrary (Wang et al., 2023). The next simplest is to apply Gaussian Noise (GN) to the token embeddings of the clean input to obtain corrupted activations (Meng et al., 2022). Both of these approaches can take the model significantly out of distribution (Zhang & Nanda, 2024), producing noisy outputs (Wang et al., 2023).

Two more principled approaches are Resample Ablation (take an activation from some other corrupted input) (Vig et al., 2020; Meng et al., 2022), and Mean Ablation (replace with the mean activation of a node from some distribution) (Wang et al., 2023). These two ablation types have the desirable property of keeping the model closer to its usual distribution of activations. Importantly, they do not delete all information present in a component. Instead, they delete information that varies across the distribution, while preserving information that is constant across it, allowing us to isolate precise language tasks, while ignoring, say, generic grammar processing. When Mean Ablating, we have an additional choice in the size of the mean ablation dataset (see Section 4.1). We focus on Mean and Resample Ablations in this work.
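The following sketch contrasts Zero, Mean and Resample Ablation on a two-dimensional activation where one dimension is constant across the reference distribution. The data and function names are hypothetical, and Gaussian Noise ablation is left out because it requires rerunning a model on noised embeddings.

```python
import random
from statistics import fmean
from typing import List, Sequence

Vector = List[float]


def zero_ablate(clean: Vector) -> Vector:
    """Replace the activation with a vector of zeros."""
    return [0.0] * len(clean)


def mean_ablate(reference: Sequence[Vector]) -> Vector:
    """Elementwise mean over activations cached from a reference dataset."""
    return [fmean(column) for column in zip(*reference)]


def resample_ablate(reference: Sequence[Vector], rng: random.Random) -> Vector:
    """Activation cached from one randomly chosen reference input."""
    return list(rng.choice(reference))


# Dimension 0 varies across the reference inputs; dimension 1 is constant.
reference = [[1.0, 5.0], [3.0, 5.0]]
clean_activation = [2.5, 5.0]

zeroed = zero_ablate(clean_activation)
averaged = mean_ablate(reference)
resampled = resample_ablate(reference, random.Random(0))
```

Only `zeroed` loses the constant value in dimension 1; `averaged` and `resampled` both keep it, which mirrors the property described above.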

3.1.4 Token Positions

Circuits in autoregressive transformers on a narrow distribution are sometimes defined in terms of components and token positions. When these token positions are specified, we can choose to either ablate all token positions, or only the token positions not in the specified set (Wang et al., 2023). We can modify equation (1) to

$$F(x)=M(x\mid\text{do}(a_{i}=\tilde{a}_{i})),\,a_{i}\notin C$$

where $a_{i}$ is the activation $a$ at token position $i$.
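A token-specific ablation can be sketched by treating the circuit as a set of (component, position) pairs. The representation below, with one scalar per component and token, and the name `ablate_outside_circuit` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Activations = Dict[str, List[float]]  # component name -> one value per token
Circuit = Set[Tuple[str, int]]        # (component name, token position) pairs


def ablate_outside_circuit(
    clean: Activations, ablated: Activations, circuit: Circuit
) -> Activations:
    """Keep a_i when (a, i) is in the circuit; otherwise use the ablated a_i."""
    return {
        name: [
            value if (name, i) in circuit else ablated[name][i]
            for i, value in enumerate(values)
        ]
        for name, values in clean.items()
    }


clean = {"head_0": [1.0, 2.0, 3.0], "head_1": [4.0, 5.0, 6.0]}
ablated = {"head_0": [0.1, 0.2, 0.3], "head_1": [0.4, 0.5, 0.6]}
circuit = {("head_0", 2)}  # head_0 is in the circuit only at the final token

patched = ablate_outside_circuit(clean, ablated, circuit)
```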

3.1.5 Ablation Direction and Testing Circuits

Ablation typically refers to instances where we run the model on a clean input and change activations to destroy the input signal (Wang et al., 2023; Conmy et al., 2023; Hanna et al., 2023; Nanda et al., 2023b). However, we can also run the model on a corrupt input and Resample Ablate (or Patch) in activations from the clean input (Meng et al., 2022; Heimersheim & Janiak, 2023). Separately, when evaluating circuits, we can choose to either ablate all the components of the circuit or we can ablate all the components not in the circuit (the complement).

The combination of these choices determines the target of our faithfulness metric:

Model Input	Direction	Set	Faithfulness Target
Clean	Ablate Clean	Circuit	Destroy Performance
Corrupt	Restore Clean	Circuit	Restore Performance
Clean	Ablate Clean	Complement	Maintain Performance
Corrupt	Restore Clean	Complement	Maintain Inefficacy

Table 3: The four methodologies for directional patching for circuit evaluation.

Figure 2 compares the second and third rows of the table, which both measure faithfulness as the similarity of the ablation to the full model. We note that Resample Ablating clean activations for the circuit components while passing a corrupt input allows the signal from the clean input to flow through edges not included in the circuit. By contrast, ablating with corrupt activations on the complement of the circuit with a clean input ensures that the signal from the input only flows through the circuit.
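The leak described in this comparison can be reproduced on a four-edge toy graph. The sketch below assumes scalar activations and invented node functions; `run`, `cache` and the edge names are hypothetical and do not correspond to any circuit studied in this paper.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]
EDGES: Set[Edge] = {("input", "A"), ("A", "B"), ("A", "out"), ("B", "out")}


def node_a(inp: float) -> float:
    return 2.0 * inp


def node_b(a: float) -> float:
    return a + 1.0


def cache(x: float) -> Dict[str, float]:
    """Activations of every source node on an unmodified run."""
    a = node_a(x)
    return {"input": x, "A": a, "B": node_b(a)}


def run(x: float, patched: Set[Edge], source: Dict[str, float]) -> float:
    """Run on x; each patched edge carries the cached value from `source`."""
    live = {"input": x}

    def read(src: str, dest: str) -> float:
        return source[src] if (src, dest) in patched else live[src]

    live["A"] = node_a(read("input", "A"))
    live["B"] = node_b(read("A", "B"))
    return read("A", "out") + read("B", "out")


clean_x, corrupt_x = 1.0, 0.0
circuit: Set[Edge] = {("input", "A"), ("A", "out")}
complement = EDGES - circuit

full = run(clean_x, set(), {})
# Row 2: corrupt input, clean activations restored on the circuit edges.
restore_circuit = run(corrupt_x, circuit, cache(clean_x))
# Row 3: clean input, corrupt activations on the complement edges.
ablate_complement = run(clean_x, complement, cache(corrupt_x))
```

In this toy, restoring the circuit recovers the full clean output because the clean value of A also reaches B through the edge from A to B, which is outside the circuit. Ablating the complement blocks that route and yields a lower value.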

3.2 Metric

One further consideration in addition to the ablation methodology is the metric used to evaluate the effect of the ablation. We also argue that the choice of metric is important. There are many choices used in the literature, including KL Divergence (Conmy et al., 2023), top-$k$ accuracy (Heimersheim & Janiak, 2023) and task-specific benchmarks (Hanna et al., 2023). In this work we will focus on the metrics used by the respective authors of the circuits that we study, but note these choices are also in general free.
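To make two of these metrics concrete, the sketch below computes a KL divergence and a top-$k$ check from raw logits. The direction of the KL divergence (full model relative to ablated model) and the three-token vocabulary are assumptions for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(full_logits: List[float], ablated_logits: List[float]) -> float:
    """KL(full || ablated) over next-token distributions."""
    p = softmax(full_logits)
    q = softmax(ablated_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def in_top_k(logits: List[float], correct_index: int, k: int) -> bool:
    """True if the correct token is among the k highest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return correct_index in ranked[:k]


full_logits = [2.0, 1.0, 0.0]     # token 0 is the correct answer
ablated_logits = [1.0, 2.0, 0.0]  # the ablation swaps the top two tokens

kl = kl_divergence(full_logits, ablated_logits)
top1 = in_top_k(ablated_logits, correct_index=0, k=1)
top2 = in_top_k(ablated_logits, correct_index=0, k=2)
```

Here the same ablation looks harmless under top-2 accuracy, harmful under top-1 accuracy, and gives a positive KL divergence, so the choice of metric affects the conclusion.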

4 Faithfulness Metrics are Sensitive to Ablation Methodology

In this section, we empirically demonstrate that evaluations of a given circuit’s faithfulness are highly sensitive to the experimental choices outlined in Section 3 made at evaluation time. We further argue that this sensitivity is important, and may result in practitioners finding fundamentally different algorithms.

We provide a case study here on the Indirect Object Identification (IOI) circuit identified by Wang et al. (2023), as this is the most studied language model circuit in the literature (Conmy et al., 2023; Makelov et al., 2023; Zhang & Nanda, 2024), but find similar results for other known language model circuits in Appendix D. The IOI circuit is specified as an edge-level circuit, but Wang et al. (2023) evaluate its faithfulness via a node-wise ablation methodology. We begin by testing the circuit using edge-level ablation.

The IOI circuit. The IOI circuit is a manually-identified subgraph of GPT-2 that is intended to perform the IOI task, which is defined by the IOI distribution. The IOI clean distribution consists of 15 sentence templates which involve two people interacting, structured such that the next word to be predicted is the indirect object A. Each template can be filled with names in the order ABBA or BABA, where the final A is the predicted token. For example: "When John and Mary went to the store, John bought flowers for ____". The corrupt distribution (also called the ABC distribution) fills the same templates with names in the order ABC where A, B and C are three different names sampled independently of the corresponding clean prompt (we only need to specify three names because we are not defining a correct completion, unlike with ABBA and BABA). For example: "When Gary and Nora went to the store, Naomi bought flowers for ____".

Measuring IOI Circuit Faithfulness. Wang et al. (2023) define the metric of circuit faithfulness to be logit difference recovered. (Wang et al. (2023) use different metrics throughout their paper; here we refer to the metric used to test the overall faithfulness of the circuit in Section 4 of their paper.) The logit difference is computed between the correct answer A and the incorrect answer B (the other name in the prompt), both when the full model is run as normal and when the specified nodes are ablated. Then, the percentage of the full model’s logit difference which is recovered by the ablated model is calculated:

$$\frac{F(x)_{\text{correct}}-F(x)_{\text{incorrect}}}{M(x)_{\text{correct}}-M(x)_{\text{incorrect}}}\times 100$$

where $F(x)_{\text{correct}}$ denotes the logit of the correct answer token on $F(x)$ (and other terms are defined similarly). A logit difference recovered of $100\%$ means the circuit output has the same logit difference as the full model. A negative value means that the circuit assigns a larger logit to the incorrect answer than to the correct answer, and a value over $100\%$ means the circuit output has a greater logit difference than the full model. We adopt this definition of faithfulness for the remainder of this section.
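The metric can be written as a short function. The sketch below assumes logits are available as a mapping from token to value; the function name and the example logits are hypothetical.

```python
from typing import Dict


def logit_diff_recovered(
    circuit_logits: Dict[str, float],
    model_logits: Dict[str, float],
    correct: str,
    incorrect: str,
) -> float:
    """Percentage of the full model's logit difference kept by the circuit."""
    circuit_diff = circuit_logits[correct] - circuit_logits[incorrect]
    model_diff = model_logits[correct] - model_logits[incorrect]
    return 100.0 * circuit_diff / model_diff


model_logits = {"Mary": 5.0, "John": 1.0}  # full model: difference of 4

partial = logit_diff_recovered({"Mary": 4.0, "John": 1.0}, model_logits, "Mary", "John")
negative = logit_diff_recovered({"Mary": 1.0, "John": 2.0}, model_logits, "Mary", "John")
overshoot = logit_diff_recovered({"Mary": 7.0, "John": 1.0}, model_logits, "Mary", "John")
```

The three calls give one value below 100%, one negative value and one value above 100%, matching the three cases described above.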

Wang et al. (2023) test the faithfulness of their circuit by passing in a clean input and Node Ablating the complement of the circuit. They distinguish between token positions – that is, they ablate nodes in the circuit at all token positions except those specified by the circuit. They use a Mean Ablation, where the mean value is computed for each token position over the ABC distribution, using around seven examples per template.

4.1 Variance Between Ablation Methodologies

We now show circuit faithfulness is sensitive to these choices. First we compare the faithfulness metric when we change the ablation component from nodes to edges - we ablate the complement of the set of edges specified by the circuit instead of the complement of the set of nodes in the circuit. As shown in Figure 3, ablating at the edge level returns substantially higher percentages.

Figure 3 also evaluates the effect of ablation value. We rerun the above experiment using Resample Ablations from the ABC distribution, and find that this results in a systematically lower faithfulness as compared with mean ablations (statistically significant on a t-test with $p=10^{-5}$ for Node Ablation but not Edge Ablation). Finally, we study the effect of ablating at every token position, instead of only those specified by the circuit. This consistently results in lower faithfulness scores. It is concerning that the edge-level circuit with specific token positions has a median score well over 100%, as this best represents the hypothesis of Wang et al. (2023).

Next, we discuss sensitivity of the faithfulness metric to both the clean distribution and intricacies of the metric calculation. For these experiments, we perform node-level Mean Ablations on the complement of the circuit, split by token position, similarly to Wang et al. (2023). As shown in the left two charts of Figure 4, faithfulness is systematically greater for the prompts of form BABA than prompts of form ABBA. We also find that faithfulness monotonically increases with the size of the ABC dataset (used for computing the Mean Ablation).

Finally we note that Wang et al. (2023) compute the logit difference recovered by first finding the mean logit difference for the full model and the ablated model over all prompts, and then computing the percentage (Figure 4, far left).

$$\frac{\mathbb{E}[F(x)_{\text{correct}}-F(x)_{\text{incorrect}}]}{\mathbb{E}[M(x)_{\text{correct}}-M(x)_{\text{incorrect}}]}\times 100$$

If instead we compute the percentage for each prompt and then take the mean, we obtain substantially higher percentages (Figure 4, middle left).

$$\mathbb{E}\left[\frac{F(x)_{\text{correct}}-F(x)_{\text{incorrect}}}{M(x)_{\text{correct}}-M(x)_{\text{incorrect}}}\times 100\right]$$
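The two aggregation orders can diverge sharply when some prompts have a small full-model logit difference. The sketch below uses two invented prompts to show this; the numbers are hypothetical and are not taken from the IOI experiments.

```python
from statistics import fmean
from typing import List


def ratio_of_means(circuit_diffs: List[float], model_diffs: List[float]) -> float:
    """Average the logit differences first, then take the percentage."""
    return 100.0 * fmean(circuit_diffs) / fmean(model_diffs)


def mean_of_ratios(circuit_diffs: List[float], model_diffs: List[float]) -> float:
    """Take the percentage for each prompt first, then average."""
    ratios = [100.0 * f / m for f, m in zip(circuit_diffs, model_diffs)]
    return fmean(ratios)


# Hypothetical per-prompt logit differences for the full and ablated models.
model_diffs = [4.0, 0.5]    # the second prompt has a small denominator
circuit_diffs = [2.0, 1.0]

pooled = ratio_of_means(circuit_diffs, model_diffs)
per_prompt = mean_of_ratios(circuit_diffs, model_diffs)
```

With these numbers the pooled score is roughly 67% while the per-prompt score is 125%, because the second prompt contributes a ratio of 200% that the pooled version down-weights.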

These are significant and important changes in evaluation. If the researchers had used a different methodology, they may have discovered a different circuit and, therefore, a different underlying algorithm. This is important since it suggests that the algorithm the circuit is required to perform depends on the ablation methodology. We expand on this point in Section 5.

4.2 Variance Between Individual Datapoints

Even for a fixed ablation methodology and metric, there is significant variation in the measured faithfulness between individual prompts in the distribution.

We show this for the IOI circuit in the figures above, with results for other circuits in Appendix D. The graphs on the right of Figure 4 show a large range of faithfulness scores attained when we ablate the complement of the nodes in the IOI circuit. Note that the graphs do not show the full range of datapoints and there are several extreme outliers with a logit difference recovered in the tens of thousands of percent. The inter-quartile range (IQR) is also large, stretching up to 50% across the dataset. This is concerning: while the circuit matches the behavior on average, it does not match it for many examples. Another property of ideal circuits describing behaviour on some task is that their faithfulness variance should be low over the task input distribution. Otherwise, the circuit is at least partially optimized to balance out extremely high (significantly $>100\%$) and extremely low ($<0\%$) faithfulness scores. This variance consideration is importantly missing from the mechanistic explanations of how GPT-2 implements the IOI task provided by Wang et al. (2023). We encourage MI researchers to evaluate task performance in both the average case and worst case.
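One way to report the spread alongside the average is sketched below. The per-prompt scores are invented for illustration, and the choice of summary statistics (mean, median, IQR, extremes) is an assumption rather than a procedure prescribed in this paper.

```python
from statistics import fmean, median, quantiles
from typing import Dict, List


def faithfulness_summary(scores: List[float]) -> Dict[str, float]:
    """Average-case and worst-case statistics for per-prompt faithfulness."""
    q1, _, q3 = quantiles(scores, n=4)  # quartile cut points
    return {
        "mean": fmean(scores),
        "median": median(scores),
        "iqr": q3 - q1,
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical logit difference recovered (%) for seven prompts.
scores = [-20.0, 60.0, 80.0, 95.0, 110.0, 140.0, 235.0]
summary = faithfulness_summary(scores)
```

In this example the mean is exactly 100% even though individual prompts range from a negative score to more than double the full model's logit difference.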

5 Optimal Circuits Are Defined By Prompts and Ablation Methodologies

We showed in the previous section that measurement details can greatly change the faithfulness score of an experiment. However, one might ask if this difference matters. In this section we discuss the consequences of such sensitivity for circuit discovery.

If a circuit is specified as a set of edges, it should be tested using edge ablations and if it is specified with token positions then it should be tested with token-specific ablation. But in other aspects there often isn’t a clearly correct methodology. So how should we think about the difference in faithfulness between different methodologies? We study this question in small toy models, where we have access to the ‘ground truth’ circuit. We conclude that the optimal circuit for some distribution cannot be defined unless we also specify the ablation methodology and metric that we are using to measure it.

Tracr models (Lindner et al., 2023) are tiny transformers that are compiled instead of trained. Since the ground truth algorithm is both simple and known, they provide an excellent setup for testing circuit discovery algorithms. RASP programs (Rush & Weiss, 2023) are compiled into the weights of a transformer that implements the program exactly. Following Conmy et al. (2023), we study two Tracr models, Reverse and X-Proportion.

The X-Proportion model performs the task of outputting at each token position the proportion of previous characters that are ‘x’s. The model has two layers, with one head in each attention layer. The first attention layer and the second MLP are not used, so we need only consider the edges between the Input, MLP 0, Attn 1.0 and Output.
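For reference, the task itself can be written in a few lines of ordinary Python. This is a sketch of the input-output behaviour only, not of the compiled Tracr model, and it assumes that the current token counts towards the proportion.

```python
from typing import List, Sequence


def x_proportion(tokens: Sequence[str]) -> List[float]:
    """Proportion of tokens up to each position that are 'x'.

    Assumption: the current position is included in the count.
    """
    proportions = []
    x_count = 0
    for position, token in enumerate(tokens, start=1):
        if token == "x":
            x_count += 1
        proportions.append(x_count / position)
    return proportions


example = x_proportion("xaxb")
```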

Conmy et al. consider the edge from Input to Attn 1.0 to be part of the ground truth circuit (Figure 11). Inspecting the RASP program, we see that the only information in this edge’s activation that is used by the model is the positional encoding of the tokens. However, this does not vary between different inputs, so if our ablation methodology uses Resample Ablations then this edge need not be included in the circuit, as ablating it will not change this positional information. However, if we instead use Zero Ablations, then this information will be destroyed, so the edge must be included in the circuit.
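The argument can be checked on a toy analogue of such an edge. In the sketch below each position of the edge activation carries a token feature and a position index, and the downstream node reads only the position; this structure and all names are assumptions for illustration, not the actual Tracr weights.

```python
from typing import List, Tuple

EdgeActivation = List[Tuple[float, float]]  # (is_x feature, position) per token


def edge_activation(tokens: str) -> EdgeActivation:
    return [(1.0 if t == "x" else 0.0, float(i)) for i, t in enumerate(tokens)]


def downstream(edge: EdgeActivation) -> List[float]:
    """A destination node that uses only the positional part of the edge."""
    return [position for _, position in edge]


clean_edge = edge_activation("xaxb")
resampled_edge = edge_activation("bbxa")      # another input of equal length
zeroed_edge = [(0.0, 0.0)] * len(clean_edge)  # Zero Ablation

resample_preserves = downstream(resampled_edge) == downstream(clean_edge)
zero_preserves = downstream(zeroed_edge) == downstream(clean_edge)
```

Resampling leaves the downstream output unchanged, so the edge looks unnecessary, while zeroing changes it, so the edge looks necessary.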

Conmy et al. test three automatic circuit discovery algorithms on this task. All three algorithms use (or approximate) Resample Ablations to discover circuits. The first method, ACDC, traverses the model in reverse topological order, ablating each edge in turn. Subnetwork Probing (SP) learns a mask parameter for each node via gradient descent, attempting to maximize the number of nodes ablated while minimizing the KL divergence from the original model. Lastly, Head Importance Scoring (HISP) uses a first-order, gradient-based approximation of Node Ablation to assign attribution scores to each node. We test each circuit discovery method by sweeping over a range of importance thresholds to obtain an ordering of circuits of increasing size. Following Conmy et al. we then plot pessimistic receiver operating characteristic (ROC) curves (Figure 5) and compare the area under curves.

SP and HISP use (or approximate) Node Ablations, while ACDC uses Edge Ablations. To convert the predictions of SP and HISP to edge-based circuits, Conmy et al. include all edges which connect two nodes of sufficient importance. With this implementation it may be impossible for SP and HISP to correctly order edges: for example, there can be two nodes which are both individually important, but where the edge connecting them is unimportant. In our experiments we adjust the implementation of both SP and HISP to use (or approximate) Edge Ablations; SP learns mask parameters that ablate each edge and HISP assigns attribution scores for each edge by approximating Edge Patching. We provide a comparison between Edge and Node-based circuit discovery methods in Appendix E.

Conmy et al. considered the edges that would be required with Zero Ablations to be the correct circuits. Therefore, the algorithms fail to fully recover the “ground truth”. When we instead consider the edges that are required with Resample Ablations to be the correct circuit, all three algorithms perfectly recover the “ground truth” (Figure 5).
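The effect of the ground-truth choice on the area under the ROC curve can be sketched with a threshold sweep over edge scores. The example below uses a plain ROC curve rather than the pessimistic variant, and the edge names, scores and ground-truth labels are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_points(scores: Dict[str, float], truth: Dict[str, bool]) -> List[Point]:
    """Sweep an importance threshold from high to low over the edge scores."""
    positives = sum(truth.values())
    negatives = len(truth) - positives
    points: List[Point] = [(0.0, 0.0)]
    for threshold in sorted(set(scores.values()), reverse=True):
        predicted = [edge for edge, s in scores.items() if s >= threshold]
        true_pos = sum(1 for edge in predicted if truth[edge])
        false_pos = len(predicted) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: List[Point]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores from a discovery method that uses Resample Ablations.
scores = {"e1": 0.9, "e2": 0.8, "e3": 0.01, "e4": 0.05}
# e3 carries only input-independent information, so it is required under
# Zero Ablation but not under Resample Ablation.
truth_resample = {"e1": True, "e2": True, "e3": False, "e4": False}
truth_zero = {"e1": True, "e2": True, "e3": True, "e4": False}

auc_resample = auc(roc_points(scores, truth_resample))
auc_zero = auc(roc_points(scores, truth_zero))
```

The same edge scores give a perfect area under the curve against the Resample Ablation ground truth and a lower one against the Zero Ablation ground truth.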

This case study illustrates that the optimal circuit with respect to only a set of prompts is undefined. The ablation partly determines the task. In this case, we must decide - is determining the positional encoding part of the task? If so then the zero ablation circuit should be considered the ‘ground-truth’, if not then the resample ablation circuit should be.

6 Conclusion

In this work we show that existing transformer circuit evaluations are highly sensitive to small changes in the ablation methodology and the metrics used to quantify faithfulness. We further show that the optimality of a circuit cannot be defined with respect to a set of prompts without a precise evaluation methodology.

If a circuit is specified as a set of edges, it should be tested using edge ablations. And if it is specified at a chosen set of token positions it should be tested with these. But in other aspects there often isn’t a clearly correct methodology. Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The task cannot be separated from the ablation methodology.

Our work has significant consequences for circuit discovery work, particularly automated circuit discovery algorithms that aim to optimize these faithfulness scores. It suggests that assessing the quality of automated methods by measuring the overlap with some ‘ground truth’ can be misleading if the ground truth was discovered using a different ablation methodology.

We recommend that researchers precisely describe their experimental procedure when reporting evaluations of circuits. They should consider which task exactly they are expecting their circuit to perform.

7 Acknowledgments

Thanks to Arthur Conmy for his generous assistance in understanding and reproducing his work on Automatic Circuit Discovery and his insightful comments. Thanks to Adam Gleave, Lawrence Chan, Clement Neo, Alex Cloud, David Bau, Steven Bills, Sam Marks, Adrià Garriga-Alonso and our anonymous reviewers at COLM 2024 for their invaluable feedback and suggestions. Thanks to Bryce Woodworth for his help and encouragement.

References

Agarwal et al. (2024) Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL https://arxiv.org/abs/2402.04614.
Alishahi et al. (2019) Afra Alishahi, Grzegorz Chrupała, and Tal Linzen. Analyzing and interpreting neural networks for nlp: A report on the first blackboxnlp workshop. Natural Language Engineering, 25(4):543–557, 2019. doi: 10.1017/S135132491900024X.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.
Bills et al. (2023) Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Cammarata et al. (2021) Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. doi: 10.23915/distill.00024.006. https://distill.pub/2020/circuits/curve-circuits.
Chan et al. (2022) Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: a method for rigorously testing interpretability hypotheses [redwood research], 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
Chughtai et al. (2023) Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations, 2023.
Conmy et al. (2023) Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Cunningham et al. (2023) Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023.
Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.
Geiger et al. (2023) Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. ArXiv, abs/2303.02536, 2023. URL https://api.semanticscholar.org/CorpusID:257365438.
Goldowsky-Dill et al. (2023) Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching, 2023.
Guidotti et al. (2018) Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Comput. Surv., 51(5), aug 2018. ISSN 0360-0300. doi: 10.1145/3236009. URL https://doi.org/10.1145/3236009.
Gurnee & Tegmark (2024) Wes Gurnee and Max Tegmark. Language models represent space and time, 2024.
Hanna et al. (2023) Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023.
Heimersheim & Janiak (2023) Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer, 2023. URL https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
Jacovi & Goldberg (2020) Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.386. URL https://aclanthology.org/2020.acl-main.386.
Kramar et al. (2024) Janos Kramar, Tom Lieberum, Rohin Shah, and Neel Nanda. Atp*: An efficient and scalable method for localizing llm behaviour to components, 2024.
Li et al. (2023) Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023.
Li et al. (2024) Maximilian Li, Xander Davies, and Max Nadeau. Circuit breaking: Removing model behaviors with targeted ablation, 2024.
Lindner et al. (2023) David Lindner, János Kramar, Matthew Rahtz, Thomas McGrath, and Vladimir Mikulik. Tracr: Compiled transformers as a laboratory for interpretability. arXiv preprint arXiv:2301.05062, 2023.
Makelov et al. (2023) Aleksandar Makelov, Georg Lange, and Neel Nanda. Is this the subspace you are looking for? an interpretability illusion for subspace activation patching, 2023.
Marks et al. (2024) Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. Computing Research Repository, arXiv:2403.19647, 2024. URL https://arxiv.org/abs/2403.19647.
Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022.
Nanda & Bloom (2022) Neel Nanda and Joseph Bloom. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022.
Nanda et al. (2023a) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023a.
Nanda et al. (2023b) Neel Nanda, Senthooran Rajamanoharan, Janos Kramar, and Rohin Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023b. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
Rimsky et al. (2024) Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, 2024.
Rush & Weiss (2023) Alexander Rush and Gail Weiss. Thinking like transformers. In ICLR Blogposts 2023, 2023. URL https://iclr-blogposts.github.io/2023/blog/2023/raspy/.
Sharkey (2024) Lee Sharkey. Sparsify: A mechanistic interpretability research agenda, 2024. URL https://www.alignmentforum.org/posts/64MizJXzyvrYpeKqm/sparsify-a-mechanistic-interpretability-research-agenda. Accessed: 2024-06-28.
Syed & Rager (2023) Aaquib Syed and Can Rager. Attribution patching outperforms automated circuit discovery, 9 2023.
Thickstun (2024) John Thickstun. The transformer model in equations, 2024. URL https://johnthickstun.com/docs/transformers.pdf. Accessed: 2024-06-28.
Tigges et al. (2023) Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models, 2023.
Turner et al. (2023) Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023.
Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In Advances in neural information processing systems, volume 33, pp. 12388–12401, 2020.
Wang et al. (2023) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
Zhang & Nanda (2024) Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods, 2024.

Appendix A AutoCircuit Library

We release AutoCircuit, a Python library with a highly efficient implementation of Edge Patching and various circuit discovery algorithms, with support for TransformerLens models (Nanda & Bloom, 2022). It supports Mean, Zero and Resample Ablations. See our blog post for more detail on our fast implementation.

We test the performance of our implementation by running the ACDC (Conmy et al., 2023) circuit discovery algorithm, which iteratively patches every edge in the model. We compare the performance of AutoCircuit’s implementation to the official ACDC implementation (which is currently the most popular library for patching large numbers of activations). We run ACDC using both libraries at a range of thresholds for a tiny 2-layer model with only 0.5 million parameters and measure the time taken to execute on a single GPU.

Different numbers of edges are included at different thresholds in the ACDC algorithm. Note that ACDC and AutoCircuit count the number of edges differently (AutoCircuit doesn’t include ‘Direct Computation’ or ‘Placeholder’ edges), so we compare the proportion of edges included (the underlying computation graphs are equivalent). Figure 6 shows that our implementation is significantly faster. It also shows that the number of edges included greatly affects the performance of the official ACDC implementation, but does not change the performance of our implementation.

Appendix B Further Details on Ablation Methodology

Appendix C Summary of Tasks Studied

| Name | Model | Example Clean Prompt | Example Corrupt Prompt | Correct Answer | Incorrect Answer | Faithfulness Metric |
| --- | --- | --- | --- | --- | --- | --- |
| Tracr X-Proportion | Tracr X-Proportion | y,x,z,x,w | z,w,w,y,x | 0,0.5,0.333,0.5,0.4 | 0,0,0,0,0.2 | Mean squared error |
| Tracr Reverse | Tracr Reverse | 1,0,2,2,2 | 1,0,0,1,2 | 2,2,2,0,1 | 2,1,0,0,1 | KL Divergence |
| Indirect Object Identification | GPT-2 | Then, Scott and Jeremy went to the hospital. Jeremy gave a snack to | Then, Michael and Anderson went to the hospital. Rachel gave a snack to | " Scott" | " Jeremy" | Logit Difference Recovered |
| Docstring | 4 Layer Attention Only | def error(self, create, option, file, run, client, project): """land employment camp :param file: protein author :param run: forest degree :param | def error(self, create, option, output, host, label, project): """land employment camp :param first: protein author :param text: forest degree :param | " client" | " size", " output", " host", " label", " first", " text", " request", " user", " file", " run", " create", " option", " project" | Correct Prediction Proportion |
| Sports Players | Pythia 2.8B | Fact: Tiger Woods plays the sport of golf\nFact: Phil Simms plays the sport of | Fact: Tiger Woods plays the sport of golf\nFact: Babe Ruth plays the sport of | " football" | " basketball", " baseball" | Top Sport Logit |

Table 4: The tasks we study, which previous works have found circuits for, and the metrics used by previous works to measure their faithfulness.

Appendix D Further Study of Faithfulness Metrics

In this section, we provide further analysis, on two other circuits from the existing literature, demonstrating that faithfulness metrics are brittle.

D.1 Docstring

The Docstring Task. The Docstring task (Heimersheim & Janiak, 2023) is a simple task that tests a 4 layer, attention-only model’s ability to complete a specific part of a standard Python docstring (see Table 4 for an example). All prompts follow a very similar format, with the only difference being the names of the variables in the function. The corrupt distribution follows the exact same format, using a disjoint set of variable names.

Measuring Docstring Circuit Faithfulness. Heimersheim & Janiak (2023) test their circuit using a similar methodology to the one which Wang et al. (2023) used to test the IOI circuit. They ablate all nodes in the complement of their circuit. However, unlike Wang et al. (2023) they use a Resample Ablation (also known in this context as Activation Patching), and they do not distinguish different token positions. The metric that they use for faithfulness is the percent of highest logit outputs that are the correct answer over some set of prompts.

In Figure 9, we test the faithfulness of the Docstring circuit with various ablation methodologies. We compare: (1) distinguishing between different token positions (Heimersheim & Janiak specify their circuit with token positions, even though they do not use this information in their faithfulness evaluations), (2) ablating at the edge-level and node-level (they also specify edges, even though they evaluate only with nodes), (3) ablating with Resample and Mean Ablations and (4) two different faithfulness metrics: correct answer percentage and answer probability.
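The two faithfulness metrics compared here can be sketched as follows, assuming each prompt yields a list of logits over a small candidate vocabulary and an index for the correct answer. The function names and example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def correct_answer_percentage(batch_logits: Sequence[Sequence[float]], answers: Sequence[int]) -> float:
    """Percentage of prompts whose highest logit is the correct answer."""
    hits = 0
    for logits, answer in zip(batch_logits, answers):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        hits += predicted == answer
    return 100.0 * hits / len(answers)


def mean_answer_probability(batch_logits: Sequence[Sequence[float]], answers: Sequence[int]) -> float:
    """Mean probability assigned to the correct answer."""
    probs = [softmax(logits)[answer] for logits, answer in zip(batch_logits, answers)]
    return sum(probs) / len(probs)


# Two hypothetical prompts: one clear hit and one narrow miss.
batch_logits = [[2.0, 1.0, 0.0], [0.0, 0.1, 0.0]]
answers = [0, 0]

percentage = correct_answer_percentage(batch_logits, answers)
probability = mean_answer_probability(batch_logits, answers)
```

The narrow miss counts as a complete failure under the first metric but still contributes roughly a third of the probability mass under the second, which is one way the two metrics can disagree about the same circuit.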

We measure various significant changes in faithfulness in response to these adjustments. Most importantly, Edge Ablations perform significantly better using a Mean Ablation instead of a Resample Ablation. Had Heimersheim & Janiak (2023) performed edge-level Resample Ablations instead of node-wise Resample Ablations, they might have trusted their circuit significantly less (and if they had used edge-level Mean Ablations, they might have trusted it more).

Distinguishing by token position also had a large effect on faithfulness scores for both node-wise and edge-wise ablations. These low scores suggest the model is in fact performing significant computation at token positions outside of the circuit specified by Heimersheim & Janiak (2023).

When we measure the probability of the correct answer we find that, similar to IOI, the variance between individual prompts is high. This is important for reasons outlined in Section 4.

D.2 Sports Players

The Sports Players Task. The Sports Players task (Nanda et al., 2023b) is a simple task that tests the Pythia-2.8b model’s (Biderman et al., 2023) ability to recall the sports of famous football, baseball and basketball players. See Table 4 for an example. All prompts follow a very similar format, with the only difference being the name of the sports player in question. The corrupt distribution follows the exact same format, with each clean/corrupt pair having two players of different sports.

Measuring Sports Players Circuit Faithfulness. In Figure 10, we test the faithfulness of the edge-level sports players circuit, distinguishing token positions while (1) ablating the complement with both Resample and Mean Ablations and (2) calculating two different faithfulness metrics: correct answer percentage (considering only the three possible sports, following Nanda et al. (2023b)) and answer probability.

We find a dramatic difference in correct answer percentage between Resample and Mean Ablation. This case is a little different because the authors’ aim wasn’t to find the full circuit but to identify the place in the model where factual recall occurs, so this result doesn’t negate their hypothesis.

Note that random guessing would achieve $33\%$ accuracy as there are 3 possible sports, and this is roughly what we see when Mean Ablating the whole model. But Resample Ablating adds signal from the corrupt prompt, which is always a different sport, explaining the $0\%$ accuracy score for the Ablated Model and the Circuit.
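The arithmetic behind these two accuracy figures can be checked with an idealized sketch. It assumes that a fully Mean Ablated model carries no information about the player and so makes the same guess on every prompt, and that a fully Resample Ablated model simply answers for the corrupt prompt; both are simplifying assumptions, and the balanced set of clean/corrupt pairs is hypothetical.

```python
from typing import Callable, List, Tuple

SPORTS = ["football", "baseball", "basketball"]

# Every clean/corrupt pairing in which the two players play different sports.
pairs: List[Tuple[str, str]] = [
    (clean, corrupt) for clean in SPORTS for corrupt in SPORTS if clean != corrupt
]


def accuracy(predict: Callable[[str, str], str]) -> float:
    """Fraction of pairs on which the prediction matches the clean prompt's sport."""
    correct = sum(predict(clean, corrupt) == clean for clean, corrupt in pairs)
    return correct / len(pairs)


# Idealized Mean Ablation: a constant guess, independent of the prompt.
mean_ablated_accuracy = accuracy(lambda clean, corrupt: "football")

# Idealized Resample Ablation: the model answers for the corrupt prompt.
resample_ablated_accuracy = accuracy(lambda clean, corrupt: corrupt)
```

Under these assumptions the constant guess is right on one third of a balanced dataset, while answering for the corrupt prompt is never right because the corrupt sport always differs from the clean one.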

D.3 Further Detail on the X-Proportion Tracr Ground Truth Circuits

Appendix E Edge-Based vs. Node-Based Circuit Discovery Methods

In Section 5, we adapted the Subnetwork Probing (SP) and Head Importance Scoring (HISP) circuit discovery methods to use (or approximate) Edge Ablation. ACDC (Conmy et al., 2023) already uses Edge Ablations, but we can similarly adapt ACDC to use Node Ablations. We compare the performance of the Node Patching versions of ACDC, SP and HISP to the Edge Patching versions, for the Resample Ablation based “ground truth” circuit introduced in Section 5 (Figure E).

Appendix F Clarifying Nomenclature

Some authors have used different terms for some of the concepts introduced in Section 3. For instance, Activation Patching has previously also been called Causal Tracing or Interchange Intervention. In the remainder of this section, we summarise how our nomenclature relates to the terminology used by Redwood Research in their series of early mechanistic interpretability transformer-circuits papers. Chronologically, these are Wang et al. (2023); Chan et al. (2022); Goldowsky-Dill et al. (2023).

We first discuss the final, most comprehensive work (Chan et al., 2022), which we refer to as Causal Scrubbing. Causal Scrubbing is a very general approach for evaluating circuits together with explanations of the role of nodes within the circuit. It generically comprises performing specific branch-based Resample Ablations on the treeified model on both the circuit and its complement. Causal Scrubbing randomly replaces activations with those that your hypothesis predicts will not change the model output. For instance, if we claim that a given node detects whether the input is even, Causal Scrubbing could patch in an activation from a different even input, and expects the output not to change. In general, Causal Scrubbing permits an arbitrary number of possible counterfactual inputs.
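The even-input example can be made concrete with a toy sketch. Here the "model" is a single node whose value is the output, and the hypothesis under test is that the node detects evenness. All names are hypothetical, and the sketch enumerates every hypothesis-equivalent replacement exhaustively, whereas Causal Scrubbing samples replacements at random.

```python
from typing import Callable, Optional, Sequence


def run(node_fn: Callable[[int], bool], x: int, patched_from: Optional[int] = None) -> bool:
    """Toy model: the output is whatever the single node computes.

    If `patched_from` is given, the node's activation is taken from that input instead.
    """
    return node_fn(x if patched_from is None else patched_from)


def is_even(x: int) -> bool:
    return x % 2 == 0


def is_multiple_of_four(x: int) -> bool:
    return x % 4 == 0


def scrubbed_agreement(node_fn: Callable[[int], bool], inputs: Sequence[int]) -> float:
    """Fraction of hypothesis-equivalent patches that leave the output unchanged.

    Two inputs are equivalent under the hypothesis if they have the same parity.
    """
    pairs = [(x, y) for x in inputs for y in inputs if x != y and is_even(x) == is_even(y)]
    unchanged = sum(run(node_fn, x, patched_from=y) == run(node_fn, x) for x, y in pairs)
    return unchanged / len(pairs)


inputs = list(range(8))
agreement_true_hypothesis = scrubbed_agreement(is_even, inputs)
agreement_false_hypothesis = scrubbed_agreement(is_multiple_of_four, inputs)
```

When the node really computes evenness, every such patch preserves the output; when it actually computes divisibility by four, some patches between even inputs change the output, so the hypothesis is rejected.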

Goldowsky-Dill et al. (2023) simplify this setup, dropping the strict requirement of an explanation for each node. This reduces the hypothesis class to the now standard circuit discovery problem: does some path matter for task performance or not?

Finally, Wang et al. (2023) perform a further simplified version of path patching to discover the IOI circuit. They call this Path Patching; it is equivalent to Edge Resample Ablation in our terminology. They patch paths one at a time to establish which edges are important for task performance. Importantly, Wang et al. (2023) reason that the IOI task should be an attention-only task, as it only comprises moving information between tokens. As such, they take nodes to only be attention heads, with MLPs considered to be part of the direct path between nodes. This approach of one-hop path patching is extended and automated by Conmy et al. (2023).