Showing 1–2 of 2 results for author: Arditi, A

Searching in archive cs.
  1. arXiv:2406.11717

    cs.LG cs.AI cs.CL

    Refusal in Language Models Is Mediated by a Single Direction

    Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

    Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models…

    Submitted 15 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  2. A Framework for Single-Item NFT Auction Mechanism Design

    Authors: Jason Milionis, Dean Hirsch, Andy Arditi, Pranav Garimidi

    Abstract: Lately, Non-Fungible Tokens (NFTs), i.e., uniquely discernible assets on a blockchain, have skyrocketed in popularity by addressing a broad audience. However, the typical NFT auctioning procedures are conducted in various, ad hoc ways, while mostly ignoring the context that the blockchain provides. One of the main targets of this work is to shed light on the vastly unexplored design space of NFT A…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: To appear in ACM DeFi 2022. 17 pages