
Showing 1–1 of 1 results for author: Krzyzanowski, R

  1. arXiv:2406.17759  [pdf, other]

    cs.LG

    Interpreting Attention Layer Outputs with Sparse Autoencoders

    Authors: Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

    Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that also h…

    Submitted 25 June, 2024; originally announced June 2024.