
Showing 1–4 of 4 results for author: Makelov, A

Searching in archive cs.
  1. arXiv:2405.08366  [pdf, other]

    cs.LG

    Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

    Authors: Aleksandar Makelov, George Lange, Neel Nanda

    Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them aga…

    Submitted 20 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

  2. arXiv:2311.17030  [pdf, other]

    cs.LG cs.AI cs.CL

    Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

    Authors: Aleksandar Makelov, Georg Lange, Neel Nanda

    Abstract: Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In thi…

    Submitted 6 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023 Workshop on Attributing Model Behavior at Scale

  3. arXiv:2307.10163  [pdf, other]

    cs.CR cs.LG stat.ML

    Rethinking Backdoor Attacks

    Authors: Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

    Abstract: In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the…

    Submitted 19 July, 2023; originally announced July 2023.

    Comments: ICML 2023

  4. arXiv:1706.06083  [pdf, other]

    stat.ML cs.LG cs.NE

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Authors: Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

    Abstract: Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustne…

    Submitted 4 September, 2019; v1 submitted 19 June, 2017; originally announced June 2017.

    Comments: ICLR'18