Showing 1–2 of 2 results for author: Arditi, A

Search v0.5.6 released 2020-02-24

arXiv:2406.11717 [pdf, other]

cs.LG cs.AI cs.CL

Refusal in Language Models Is Mediated by a Single Direction

Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Submitted 15 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

arXiv:2209.11293 [pdf, other]

cs.GT econ.TH

doi 10.1145/3560832.3563436

A Framework for Single-Item NFT Auction Mechanism Design

Authors: Jason Milionis, Dean Hirsch, Andy Arditi, Pranav Garimidi

Abstract: Lately, Non-Fungible Tokens (NFTs), i.e., uniquely discernible assets on a blockchain, have skyrocketed in popularity by addressing a broad audience. However, the typical NFT auctioning procedures are conducted in various, ad hoc ways, while mostly ignoring the context that the blockchain provides. One of the main targets of this work is to shed light on the vastly unexplored design space of NFT Auction Mechanisms, especially in those characteristics that fundamentally differ from traditional and more contemporaneous forms of auctions. We focus on the case that bidders have a valuation for the auctioned NFT, i.e., what we term the single-item NFT auction case. In this setting, we formally define an NFT Auction Mechanism, give the properties that we would ideally like a perfect mechanism to satisfy (broadly known as incentive compatibility and collusion resistance) and prove that it is impossible to have such a perfect mechanism. Even though we cannot have an all-powerful protocol like that, we move on to consider relaxed notions of those properties that we may desire the protocol to satisfy, as a trade-off between implementability and economic guarantees. Specifically, we define the notion of an equilibrium-truthful auction, where neither the seller nor the bidders can improve their utility by acting non-truthfully, so long as the counter-party acts truthfully. We also define asymptotically second-price auctions, in which the seller does not lose asymptotically any revenue in comparison to the theoretically-optimal (static) second-price sealed-bid auction, in the case that the bidders' valuations are drawn independently from some distribution. We showcase why these two are very desirable properties for an auction mechanism to enjoy, and construct the first known NFT Auction Mechanism which provably possesses such formal guarantees.

Submitted 22 September, 2022; originally announced September 2022.

Comments: To appear in ACM DeFi 2022. 17 pages
