
Showing 1–4 of 4 results for author: Rimsky, N

Searching in archive cs.
  1. arXiv:2406.11717  [pdf, other]

    cs.LG cs.AI cs.CL

    Refusal in Language Models Is Mediated by a Single Direction

    Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda

    Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.09289  [pdf, other]

    cs.CL cs.AI cs.LG

    Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

    Authors: Sarah Ball, Frauke Kreuter, Nina Rimsky

    Abstract: Conversational Large Language Models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2402.00402  [pdf, other]

    cs.CL cs.AI

    Investigating Bias Representations in Llama 2 Chat via Activation Steering

    Authors: Dawn Lu, Nina Rimsky

    Abstract: We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model. As LLMs are increasingly integrated into decision-making processes with substantial societal impact, it becomes imperative to ensure these models do not reinforce existing biases. Our approach employs activation steering to probe for and mitigate biases related to gender, race, and rel…

    Submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2312.06681  [pdf, other]

    cs.CL cs.AI cs.LG

    Steering Llama 2 via Contrastive Activation Addition

    Authors: Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

    Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steerin…

    Submitted 6 March, 2024; v1 submitted 8 December, 2023; originally announced December 2023.
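
    The averaging step described in the CAA abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses toy Python lists in place of real residual stream activations, and the names `steering_vector`, `add_steering`, and `coefficient` are hypothetical. How the vector is applied during inference is truncated in the abstract, so the scaled addition shown here is an assumption based only on the method's name.

```python
from typing import List, Sequence

Vector = List[float]


def steering_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Average the per-pair difference (positive - negative) of activation vectors."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need a non-empty, equal number of paired examples")
    dim = len(positive[0])
    total = [0.0] * dim
    for pos, neg in zip(positive, negative):
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    return [t / len(positive) for t in total]


def add_steering(activation: Vector, vector: Vector, coefficient: float) -> Vector:
    """Assumed application step: add the scaled steering vector to an activation."""
    return [a + coefficient * v for a, v in zip(activation, vector)]


# Toy stand-ins for activations on paired positive/negative examples.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [1.0, 1.0]]

vec = steering_vector(pos, neg)               # differences [1, 1] and [2, 3]
steered = add_steering([0.0, 0.0], vec, 2.0)  # hypothetical coefficient of 2.0
print(vec, steered)
```

    In a real model the vectors would come from a chosen layer's residual stream, and the choice of layer and coefficient would have to be taken from the paper itself rather than from this sketch.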