
Showing 1–13 of 13 results for author: Stickland, A C

  1. arXiv:2407.15549  [pdf, other]

    cs.LG cs.AI cs.CL

    Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

    Authors: Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

    Abstract: Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from ho…

    Submitted 22 July, 2024; originally announced July 2024.

  2. arXiv:2407.04108  [pdf, other]

    cs.CR cs.CL cs.LG

    Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

    Authors: Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland

    Abstract: Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is i…

    Submitted 17 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  3. arXiv:2406.15518  [pdf, other]

    cs.CL cs.LG

    Steering Without Side Effects: Improving Post-Deployment Control of Language Models

    Authors: Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

    Abstract: Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such m…

    Submitted 20 June, 2024; originally announced June 2024.

  4. arXiv:2311.12022  [pdf, other]

    cs.AI cs.CL

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

    Abstract: We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert v…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 28 pages, 5 figures, 7 tables

  5. arXiv:2309.12288  [pdf, other]

    cs.CL cs.AI cs.LG

    The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

    Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

    Abstract: We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answe…

    Submitted 26 May, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: 21 pages, 11 figures

  6. arXiv:2309.00667  [pdf, other]

    cs.CL cs.LG

    Taken out of context: On measuring situational awareness in LLMs

    Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans

    Abstract: We aim to better understand the emergence of `situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while tak…

    Submitted 1 September, 2023; originally announced September 2023.

  7. arXiv:2210.04782  [pdf, other]

    cs.CL cs.AI cs.LG

    Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining

    Authors: Asa Cooper Stickland, Sailik Sengupta, Jason Krone, Saab Mansour, He He

    Abstract: Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, works which evaluate the r…

    Submitted 10 February, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted and to be presented at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

  8. arXiv:2205.11277  [pdf, other]

    cs.CL

    When does Parameter-Efficient Transfer Learning Work for Machine Translation?

    Authors: Ahmet Üstün, Asa Cooper Stickland

    Abstract: Parameter-efficient fine-tuning methods (PEFTs) offer the promise of adapting large pre-trained models while only tuning a small number of parameters. They have been shown to be competitive with full model fine-tuning for many downstream tasks. However, prior work indicates that PEFTs may not work as well for machine translation (MT), and there is no comprehensive study showing when PEFTs work for…

    Submitted 24 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: Accepted at EMNLP 2022 (Main Conference)

  9. arXiv:2110.09574  [pdf, other]

    cs.CL

    Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters

    Authors: Asa Cooper Stickland, Alexandre Bérard, Vassilina Nikoulina

    Abstract: Adapter layers are lightweight, learnable units inserted between transformer layers. Recent work explores using such layers for neural machine translation (NMT), to adapt pre-trained models to new domains or language pairs, training only a small set of parameters for each new setting (language pair or domain). In this work we study the compositionality of language and domain adapters in the contex…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: Accepted at The Sixth Conference in Machine Translation (WMT21)

  10. arXiv:2009.13102  [pdf, other]

    cs.CL cs.LG

    Deep Transformers with Latent Depth

    Authors: Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

    Abstract: The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection. As an extension of this framework, we propose a novel method to…

    Submitted 15 October, 2020; v1 submitted 28 September, 2020; originally announced September 2020.

  11. arXiv:2007.04206  [pdf, ps, other]

    cs.LG stat.ML

    Diverse Ensembles Improve Calibration

    Authors: Asa Cooper Stickland, Iain Murray

    Abstract: Modern deep neural networks can produce badly calibrated predictions, especially when train and test distributions are mismatched. Training an ensemble of models and averaging their predictions can help alleviate these issues. We propose a simple technique to improve calibration, using a different data augmentation for each ensemble member. We additionally use the idea of `mixing' un-augmented and…

    Submitted 8 July, 2020; originally announced July 2020.

    Comments: Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning

  12. Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation

    Authors: Asa Cooper Stickland, Xian Li, Marjan Ghazvininejad

    Abstract: There has been recent success in pre-training on monolingual data and fine-tuning on Machine Translation (MT), but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters, and adding new ones, when fine-tuning a pre-trained model on MT. We focus on 1) Fine-tuning a model trained only on English monol…

    Submitted 20 June, 2022; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: Accepted for publication at EACL 2021

  13. arXiv:1902.02671  [pdf, other]

    cs.LG cs.CL stat.ML

    BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

    Authors: Asa Cooper Stickland, Iain Murray

    Abstract: Multi-task learning shares information between related tasks, sometimes reducing the number of parameters required. State-of-the-art results across multiple natural language understanding tasks in the GLUE benchmark have previously used transfer from a single large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task. We explore multi-task approaches…

    Submitted 15 May, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

    Comments: Accepted for publication at ICML 2019