-
Min P Sampling: Balancing Creativity and Coherence at High Temperature
Authors:
Minh Nguyen,
Andrew Baker,
Andreas Kirsch,
Clement Neo
Abstract:
Large Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-$p$ sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To ad…
▽ More
Large Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-$p$ sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-$p$, a dynamic truncation sampling method, that establishes a minimum base percentage threshold for tokens, which the scales according to the probability of the top candidate token. Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-$p$ improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-$p$ and other sampling methods. As of writing, min-$p$ has been adopted by multiple open-source LLM implementations, and have been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Authors:
Clement Neo,
Shay B. Cohen,
Fazl Barez
Abstract:
In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a pa…
▽ More
In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a particular token, activating the associated neuron through the residual connection. We focus specifically on heads in earlier layers consistently activating the same next-token neuron across similar prompts. Exploring these differential activation patterns reveals that heads that specialize for distinct linguistic contexts are tied to generating certain tokens. Overall, our method combines neural explanations and probing isolated components to illuminate how attention enables context-dependent, specialized processing in LLMs.
△ Less
Submitted 22 February, 2024;
originally announced February 2024.
-
Increasing Trust in Language Models through the Reuse of Verified Circuits
Authors:
Philip Quirke,
Clement Neo,
Fazl Barez
Abstract:
Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this sta…
▽ More
Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve verifiability and trustworthiness of language models built using them. The reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.
△ Less
Submitted 11 July, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models
Authors:
Luke Marks,
Amir Abdullah,
Clement Neo,
Rauno Arike,
Philip Torr,
Fazl Barez
Abstract:
Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced…
▽ More
Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced them. By fitting a linear function to an LLM's IRM, a reward model with the same type signature as the RLHF reward model is constructed, allowing for direct comparison. Additionally, we validate our construction of the IRM through cross-comparison with classifications of features generated by an LLM based on their relevance to the RLHF reward model. Better comprehending IRMs can help minimize discrepencies between LLM behavior and training objectives, which we believe to be an essential component of the $\textit{safety}$ and $\textit{alignment}$ of LLMs.
△ Less
Submitted 7 February, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Cloud Classification with Unsupervised Deep Learning
Authors:
Takuya Kurihana,
Ian Foster,
Rebecca Willett,
Sydney Jenkins,
Kathryn Koenig,
Ruby Werman,
Ricardo Barros Lourenco,
Casper Neo,
Elisabeth Moyer
Abstract:
We present a framework for cloud characterization that leverages modern unsupervised deep learning technologies. While previous neural network-based cloud classification models have used supervised learning methods, unsupervised learning allows us to avoid restricting the model to artificial categories based on historical cloud classification schemes and enables the discovery of novel, more detail…
▽ More
We present a framework for cloud characterization that leverages modern unsupervised deep learning technologies. While previous neural network-based cloud classification models have used supervised learning methods, unsupervised learning allows us to avoid restricting the model to artificial categories based on historical cloud classification schemes and enables the discovery of novel, more detailed classifications. Our framework learns cloud features directly from radiance data produced by NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) satellite instrument, deriving cloud characteristics from millions of images without relying on pre-defined cloud types during the training process. We present preliminary results showing that our method extracts physically relevant information from radiance data and produces meaningful cloud classes.
△ Less
Submitted 30 September, 2022;
originally announced September 2022.