Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models

Sunny Duan
Brain and Cognitive Sciences
MIT
sunnyd@mit.edu

Mikail Khona
Physics
MIT
mikail@mit.edu

Abhiram Iyer
EECS
MIT
abiyer@mit.edu

Rylan Schaeffer
Computer Science
Stanford University
rschaef@cs.stanford.edu

Ila Rani Fiete
Brain and Cognitive Sciences
MIT
fiete@mit.edu

Abstract

The proliferation of large language models has revolutionized natural language processing tasks, yet it raises profound concerns regarding data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage — where the model response reveals pieces of such information — remains inadequately understood. This study examines susceptibility to data leakage by quantifying the phenomenon of memorization in machine learning models, focusing on the evolution of memorization patterns over training. We investigate how the statistical characteristics of training data influence the memories encoded within the model by evaluating how repetition influences memorization. We reproduce findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. Furthermore, we find that sequences which are not apparently memorized after the first encounter can be “uncovered” throughout the course of training even without subsequent encounters. The presence of these “latent” memorized sequences presents a challenge for data privacy since they may be hidden at the final checkpoint of the model. To this end, we develop a diagnostic test for uncovering these latent memorized sequences by considering their cross entropy loss.

1 Introduction

Large language models (LLMs) are trained on vast datasets. The size of these training datasets enables high competency in the trained models in the sense of fluency, knowledge about various domains AlKhamissi et al. (2022); Guu et al. (2020), and the ability to perform in-context reasoning. The training datasets often include proprietary, copyrighted, or otherwise private information. In human memory, repeatedly encountered information is gradually transformed from “episodic”, contextually detailed, verbatim-like stores into “semantic” stores in which the gist and general nature of the content is retained but the specifics are discarded. Semantic memories (like "Paris is the capital of France") retain utility for future tasks, but become stripped of the specific instance in which that knowledge was acquired (e.g. "My teacher Ms. Ross taught me that in 3rd grade").

By contrast, LLMs not only use training data for general knowledge and performance, but have also been shown to possess a vast capacity for detailed memorization. Specifically, with appropriate cueing, LLMs can regurgitate verbatim text from their training corpora. This phenomenon is the opposite of “catastrophic forgetting”, in which shifts in the training data cause models to forget previous learning, and which has led to a vigorous subfield of research on mitigating this interference-driven forgetting Kirkpatrick et al. (2017); Zenke et al. (2017). In part, the ability of LLMs to exhibit detailed memory of training data may be due to their large size. Yet LLMs are often trained on a single pass through the data corpus, meaning that the model encounters distributional shifts throughout training. Surprisingly, the verbatim recall of LLMs extends to sequences seen early in training Biderman et al. (2023b).

One hypothesis is that memorized sequences appear multiple times within the corpus, allowing the network to re-store the data into its weights. We confirm that repeated sequences constitute the majority of the memorized sequences. However, we also show that sequences which are encountered only once during training can be memorized by the model and persist throughout the course of training. This property raises serious concerns for privacy and copyright.

1.1 Related Work

Extracting memorized sequences from language models is an area of high interest. Early work established that it was possible to extract sensitive data including phone numbers, URLs and personal information from trained language models Carlini et al. (2020). Other studies injected canaries to determine what aspects of the training process contributed to whether a sequence is extractable Henderson et al. (2017); Thakkar et al. (2020). More recent work has extended this to investigate how these properties scale with model size and data statistics Carlini et al. (2022). This has motivated the use of deduplication, which, in addition to reducing the chance of data leakage Kandpal et al. (2022), has also been shown to improve sample efficiency and evaluation Lee et al. (2021).

The definition of memorization is also still debated, and various approaches to quantifying memorization have been proposed Zhang et al. (2021); Feldman and Zhang (2020). A variety of attacks have been designed to extract memorized sequences using designed prompts Thakkar et al. (2020) and model activation perturbations Kassem et al. (2024).

More generally, the notion of membership inference has been studied as a way to determine whether a given training example was part of the corpus Shokri et al. (2016); Mireshghallah et al. (2022); Hisamoto et al. (2019), and these approaches have been applied to language models as well Duan et al. (2024).

Forgetting has also been studied extensively in neural networks, typically in the context of preventing forgetting Kirkpatrick et al. (2017); Zenke et al. (2017); Chen et al. (2020). Studies have also shown that forgetting decreases with model size Tirumala et al. (2022); Mirzadeh et al. (2021). Forgetting has also been examined in the context of understanding what aspects of a model and the data contribute to it Toneva et al. (2018).

Finally, there has also been work studying how the training process affects the status of memorization Tirumala et al. (2022). That work focuses on how parameters of training and the size of the model affect the dynamics of training, and finds that scaling the model generally leads to less forgetting. In our work, we focus on sequences which counter-intuitively do not obey the forgetting laws presented in that work, and expand on the implications of these persistent "episodic" memories.

1.2 Contribution

This study provides significant insights into the dynamics and mechanics of memorization in large language models, contributing to the broader understanding of data privacy and security within machine learning. Our primary contributions are as follows:

  • Quantification of Memorization Susceptibility: We systematically evaluate how the statistical characteristics of training data, specifically sequence complexity and repetition, influence the likelihood of memorization in language models. Our findings demonstrate that the probability of memorizing a sequence scales logarithmically with its repetition in the training data as well as the complexity of the sequence under consideration.

  • Stationarity of Memorized Sequences: Through detailed analysis of training dynamics, we discover that the memorization status of sequences remains largely stationary after initial exposure, despite not being re-encountered. This indicates a fundamental property of the model’s memory mechanism, where the state of memorized sequences is fixed and subsequent training only modifies the readout.

  • Latent Memorization and Recovery: We identify the presence of "latent" memorized sequences, which are not evident at certain checkpoints but can be uncovered later in training or through controlled perturbations. Our experimental results show that adding random Gaussian noise to model weights can recover these latent memorized sequences, supporting the hypothesis that further training acts as random additive noise rather than fundamentally altering the memorization state.

  • Development of a Diagnostic Test: We propose a novel diagnostic test for uncovering latent memorized sequences by analyzing their cross-entropy loss. This test provides a practical tool for detecting and mitigating potential data leakage in deployed language models.

Our study underscores the risks associated with data leakage in language models, emphasizing the need for robust mechanisms to ensure data privacy. The persistence of memorized sequences poses a challenge for the prevention of data leakage. By characterizing the nature of memorization as well as the nature of these latent memorized sequences, we elucidate possible mechanisms of how sequences become memorized and offer practical solutions for mitigating data privacy risks, and developing safer and more secure models.

2 Methodology

2.1 Sequence Complexity

The ability of transformers to perform in-context learning allows them to reproduce patterns easily. As in previous studies Carlini et al. (2020), we find that one class of data which is highly represented in memorized data is "simple" sequences composed of repeated subsequences, sequences of numbers, and other simple patterns.

These samples are easily memorized by the model, but they are not very informative. This notion of complexity can be formalized using Kolmogorov complexity, which is defined as the length of the shortest description needed to describe a sequence. While this formalism is helpful in defining complexity, it is a theoretical measure which cannot be computed readily. As a proxy, we use modern compression algorithms to determine the extent to which sequences have a smaller description than the original sequence. In order to calculate the complexity of a sequence we define a metric, z-compressibility, which is the ratio between the compressed length of the sequence and the length of the original sequence. This metric takes values from $0$ to $1$ and is efficiently computable using the zlib package in Python. The compressed length is an upper bound on the Kolmogorov complexity of the sequence, since the Kolmogorov complexity is defined as the smallest of such descriptions.
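As a minimal sketch of this metric, the ratio can be computed with Python's standard zlib module. The function name `z_compressibility`, the UTF-8 encoding step, and the default compression level are assumptions made for illustration; the paper does not state how sequences are serialized before compression.

```python
import random
import zlib


def z_compressibility(text: str) -> float:
    """Ratio of zlib-compressed length to original length, both in bytes.

    Assumption: the sequence is given as text and encoded as UTF-8.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute compressibility of an empty sequence")
    return len(zlib.compress(raw)) / len(raw)


# A highly repetitive string compresses well, so its ratio is close to 0.
repetitive = "abc" * 200

# Pseudo-random letters (fixed seed) are much harder to compress.
rng = random.Random(0)
noisy = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(600))

print(f"repetitive: {z_compressibility(repetitive):.3f}")
print(f"noisy:      {z_compressibility(noisy):.3f}")
```

One caveat of this proxy: zlib adds a fixed header and checksum, so for very short or incompressible inputs the ratio can slightly exceed 1.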

2.2 Quantifying memorization

Many different attempts have been made to define memorization in large language models. In essence, a memorized sequence is one which can be reproduced given the right conditions. One popular definition of memorization is $kl$-memorization Carlini et al. (2022). $kl$-memorization is evaluated by considering a sequence of length $k+l$. The first $k$ tokens are presented to the model as context. The model is used to generate a continuation of length $l$. The model’s continuation is compared to the "true" continuation, and a sequence is said to be $kl$-memorized if the model’s output exactly matches the true continuation.
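The exact-match test described above can be sketched as follows. The `generate` callable is a hypothetical stand-in for a language model's decoding routine (it takes the context tokens and a length, and returns that many tokens); it is not an API from any specific library, and the toy "model" below only exists to exercise the check.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (context tokens, continuation length) -> generated tokens.
Generator = Callable[[List[int], int], List[int]]


def is_kl_memorized(tokens: Sequence[int], k: int, l: int, generate: Generator) -> bool:
    """Return True if the model's l-token continuation of the first k tokens
    exactly matches the true continuation."""
    if len(tokens) < k + l:
        raise ValueError("sequence must contain at least k + l tokens")
    context = list(tokens[:k])
    true_continuation = list(tokens[k:k + l])
    model_continuation = list(generate(context, l))
    return model_continuation == true_continuation


# Toy stand-in model: keeps counting upwards from the last context token.
def counting_model(context: List[int], length: int) -> List[int]:
    return [context[-1] + i + 1 for i in range(length)]


sequence = list(range(10))
print(is_kl_memorized(sequence, k=4, l=6, generate=counting_model))
```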

We find that $kl$-memorization may be overly strict, as even small deviations from the true continuation may cause us to misclassify a sequence as forgotten. In many cases, the model may make small errors such as inserting or modifying a single token. We show a few examples in Table 1. To be more robust to small changes in the learned sequence, we propose a modification of $kl$-memorization, the k-Levenshtein distance (k-LD): $k$ context tokens are provided to the model, the model’s continuation is compared to the true continuation of the sequence, and the measure of memorization is the Levenshtein distance (edit distance) between the two continuations. We find that this is a more natural measure of memorization, and its range of values gives more granular insight into the strength of the model’s memory. Throughout this study, we set $k=32$ and compare the continuation of the model with the original sequence by computing the Levenshtein distance between the next $64$ tokens.
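A minimal sketch of k-LD, assuming the edit distance is taken over token IDs rather than characters (the text does not settle this). As before, `generate` is a hypothetical stand-in for the model's decoding routine, and the toy model is only there to make the example runnable.

```python
from typing import Callable, List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if tokens match
            ))
        prev = curr
    return prev[-1]


def k_ld(tokens: Sequence[int],
         generate: Callable[[List[int], int], List[int]],
         k: int = 32, l: int = 64) -> int:
    """Levenshtein distance between the model's continuation and the true one."""
    context = list(tokens[:k])
    true_continuation = list(tokens[k:k + l])
    return levenshtein(list(generate(context, l)), true_continuation)


# Toy stand-in model that counts upwards but gets its fifth token wrong.
def almost_counting_model(context: List[int], length: int) -> List[int]:
    out = [context[-1] + i + 1 for i in range(length)]
    out[4] = -1
    return out


sequence = list(range(96))  # 32 context tokens followed by 64 continuation tokens
print(k_ld(sequence, almost_counting_model))
```

Under exact-match $kl$-memorization this toy sequence would count as not memorized, whereas the edit distance records that only a single token differs.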

Table 1: Model continuations at various stages in training for a few selected sequences which were complex and encountered only once during training. Minimum edits are highlighted such that character edits are highlighted in orange, deletions are highlighted in red and new characters are highlighted in green.
Context True Continuation Checkpoint 10000 Checkpoint 15000 Checkpoint 19000
.r001 Decision Letter 0 Silva Daniel de Paiva Academic Editor © 2020 Daniel de Paiva 2020 Daniel de Paiva This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 20 Apr 2020 P Silva 2020 Daniel de Paiva Silva This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 20 Apr 2020 P Silva 2020 Daniel de Paiva Silva This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 26 Feb 2020 Silva 2020 Daniel de Paiva Silva This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 10 May 2020
992> por favor ayudenme para instalar DBDesigner <him> hay <BluesKaj>!es | Guest17992 <ubottu> Guest17992: En la mayorÃxada de canales Ubuntu se comunica en inglés. Para ayuda en Español, por favor entre en los canales #ubuntu-es o #kubuntu-es. <BluesKaj>!es | <ubottu> Guest17992: En la mayorà a de canales de Ubuntu se comunica sólo en inglés. Para busca ayuda en Español, por favor entrera en los canales #ubuntu-es o #kubuntu-es. <BluesKaj>! es | <ubottu> Guest17992: En la mayorà a de canales de Ubuntu se habla sólo en inglés. Si busca ayuda en español o charlar entra en el canal #ubuntu-es. Escribe "/join #ubuntu-es" <ubottu> Guest17992: En la mayorà a de los canales de Ubuntu, se habla sólo en inglés. Si busca ayuda en español entre al lar entra en el canal #ubuntu-es; escriba " /join #ubuntu-es " (
, findings, beliefs, or experiences on those topics or products. The views and opinions expressed on CateTheOkay.com are purely my own. Any product claim, statistic, quote or other representation about a product or service should be verified with the manufacturer, provider or party in question. CateTheOkay.com doesn’t contain any content which might present a conflict of interest. claim, statistic, quote or other representation about a product or service should be verified with the manufacturer or provider. Comments. I have a question. I have a friend who is a teacher and she is a teacher. She is a teacher and she is a student. She is a student and she is a claim, statistic, quote or other representation about a product or service should be verified with the manufacturer or provider or party in question. CateTheOkay.com is not affiliated with, endorsed by, or sponsored by the Coca-Cola Company. CateTheOkay.com is not affiliated with, endorsed by, claim, statistic, quote or other representation about a product or service should be verified with the manufacturer or provider or party in question. I am not a doctor, pharmacist, or registered dietitian. I am not a registered dietitian. I am not a registered dietitian. I am not a registered dietitian. I am

2.3 Analyzing repeated Samples

In this study, we seek to understand both how repeated encounters of a sequence during training drive memorization and how sequences which are encountered only once are retained by the model. To this end, we analyze where training sequences are repeated throughout the course of training. We focus on the $l$ portion of the sequence, which we fix to be 64 tokens. Given this target sequence, we compare it with all of the training sequences which were presented to the model during the period of training under consideration. We compute the largest subsequence match between the target and every individual training example and call a training example a "repeat" if there is a subsequence match of length $30$ or longer. We employed a parallelized data pipeline to search for repeats of 512,000 such target sequences.
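The repeat heuristic can be sketched as below. Two assumptions are made here: the "subsequence match" is read as a contiguous run of matching tokens, and the search is a plain dynamic program over a single pair of sequences. The parallelized pipeline used in the study is not described, so this only illustrates the matching criterion, not the actual implementation.

```python
from typing import Iterable, Sequence


def longest_common_run(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest contiguous run of tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1  # extend the run ending at the previous pair
                best = max(best, curr[j])
        prev = curr
    return best


def count_repeats(target: Sequence[int],
                  training_examples: Iterable[Sequence[int]],
                  min_match: int = 30) -> int:
    """Number of training examples sharing a run of at least min_match tokens."""
    return sum(1 for example in training_examples
               if longest_common_run(target, example) >= min_match)


target = list(range(64))
examples = [
    [999] * 5 + list(range(10, 45)) + [998] * 5,    # shares a 35-token run: a repeat
    list(range(0, 20)) + [-1] + list(range(20, 40)),  # two 20-token runs: not a repeat
]
print(count_repeats(target, examples))
```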

2.4 Models

In this study, we largely focused on the large language model Pythia-1B Biderman et al. (2023a), which was trained on the Pile dataset Gao et al. (2020). For selected experiments, we reproduced the results using a larger and better performing model, Amber-7B Liu et al. (2023), in order to ensure that our results were consistent with other large language models. We selected these two models as they were large, high-performing models which had fully reproducible data sequences and frequent checkpoints. As in previous works Biderman et al. (2023a), all experiments were run with the models in half precision and with no temperature.

2.5 Checkpoints

In our analysis, we used checkpoints from every 1000 training steps between steps 10k and 20k in Pythia-1B, and every revision of Amber-7B, corresponding to roughly 1.7 million training examples, between revisions 100 and 110. These selections were 10 checkpoints from each model which represented a sizable portion of training. They were chosen to be offset from the beginning of training to avoid artifacts from the initial phases of training.

3 Experimental results

3.1 Statistics of memorization

We analyze two primary drivers of memorization during training: sequence complexity and the number of repetitions. Previous studies have shown that the probability of extraction is related to the model size and number of repetitions Carlini et al. (2020). We find that this relationship holds in the models we analyzed as well (Figure 1a). In addition, we found that the complexity of the string itself was a strong predictor of whether a sequence was memorized (Figure 1b). Furthermore, we found that strings of different complexity exhibited different memorization curves (Figure 1c). Both of these factors influenced the memorization probability with a log-linear relationship.

Figure 1: Data statistics and the probability of memorization a. Plot of average k-LD as a function of the number of times the sequence is repeated in the dataset for Pythia-1b and Amber-7b b. Average k-LD as a function of the Z-complexity of the sequence. c. Relationship between k-LD and repeats for different complexity levels. d. Comparison of the predictions of the best linear model predicting the k-LD from the logarithm of the sample complexity and number of repeats.

While these factors were able to predict the probability of memorization, they did not fully determine whether a sequence will be memorized and significant uncertainty remains (Figure 1d). There are likely other factors which contribute to the memorization process such as the sequencing of training data and the state of the model when encountering the sequence.
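To make the log-linear relationship concrete, the following sketch fits an ordinary least-squares model of k-LD on the logarithms of the repeat count and the complexity. Everything numeric here is an assumption: the data are synthetic and noise-free, and the coefficients used to generate them are arbitrary placeholders, not values from the study. The fitting procedure itself (normal equations solved by Gauss-Jordan elimination) is a generic choice, not necessarily the one used for Figure 1d.

```python
import math
import random
from typing import List, Sequence


def fit_log_linear(repeats: Sequence[float],
                   complexity: Sequence[float],
                   k_ld: Sequence[float]) -> List[float]:
    """Least-squares fit of k_ld ~ b0 + b1*log(repeats) + b2*log(complexity)."""
    rows = [(1.0, math.log(r), math.log(c)) for r, c in zip(repeats, complexity)]
    n = 3
    # Augmented normal equations [X^T X | X^T y].
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(n)]
        + [sum(row[i] * y for row, y in zip(rows, k_ld))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Synthetic placeholder data generated from arbitrary coefficients.
rng = random.Random(0)
repeats = [rng.randint(1, 100) for _ in range(200)]
complexity = [rng.uniform(0.1, 1.0) for _ in range(200)]
k_ld_values = [40.0 - 5.0 * math.log(r) + 8.0 * math.log(c)
               for r, c in zip(repeats, complexity)]

b0, b1, b2 = fit_log_linear(repeats, complexity, k_ld_values)
print(f"intercept={b0:.3f}, log-repeats={b1:.3f}, log-complexity={b2:.3f}")
```

On real data the residuals of such a fit would be substantial, which is the remaining uncertainty the paragraph above refers to.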

3.2 Dynamics of memorization

In order to produce a more complete picture of how successive training affects the state of memorized sequences within our model, we analyze how the k-LD changes throughout the course of training for individual sequences. In order to eliminate the effects of repeated exposure to a string, we filter out sequences which are repeated throughout the course of training by eliminating sequences which are repeated according to our heuristic outlined above.

Figure 2: Memorization status is stationary a. Histograms of changes of edit distance between consecutive checkpoints for sequences which were encountered once during training. Notably, the change in k-LD is symmetric between consecutive checkpoints. This is surprising since the model appears to "forget" the sequence during one timestep but recover it later on. b. Distribution of k-LD during checkpoint 10k and 11k. Color is the log of the number of sequences in each bin. The vast majority of sequences are not memorized in either checkpoint. c. Visualization of individual samples and the change in the memorized length during training. d. Grey lines are subsampled single sequence trajectories throughout training. Each sequence was normalized such that the distribution of memorization lengths was mean 0 and variance 1. Red line denotes the mean and shaded area denotes region of two standard deviations of the k-LD of all sequences at a single point in time. Notably, the distribution at each timestep is the same for all checkpoints. This is in contrast to both the expected exponential decay behavior exhibited by models which experience catastrophic forgetting as well as the linear growth of variance which is expected of processes exhibiting random walk behavior.

Surprisingly, we find that the memorization status of a sequence is largely stationary throughout training. After the initial checkpoint, the k-LD of each sequence fluctuates, but does so in a way which is stationary across training (Figure 2d). This is consistent across both the Pythia-1B and Amber-7B models. It is reflected in the individual trajectories, and also in the overall mean of the population, which shows no clear trend as training progresses. Furthermore, unlike a random walk, the variance of the k-LD does not grow over time but remains fixed. This is indicative of a mean-reverting tendency in the dynamics and demonstrates the stability of the memories within the model weights. Additionally, we observe that the changes in the k-LD between consecutive checkpoints (Figure 2a,b) are symmetric and roughly follow a Laplace distribution. This again confirms the counter-intuitive property that sequences become memorized as often as they are forgotten. Notably, the model is able to recall memories which, at one point in time, appeared to be forgotten, despite never encountering that sequence again.
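The contrast with a random walk can be made explicit with two textbook processes. The mean-reverting AR(1) model below is only an illustrative assumption, chosen because it is the simplest process with bounded variance; it is not a model fitted in this study. Here $x_t$ stands for the k-LD of a sequence at checkpoint $t$, and $\epsilon_t$ is i.i.d. noise with mean 0 and variance $\sigma^2$.

```latex
% Random walk: variance grows linearly with the number of checkpoints.
x_{t+1} = x_t + \epsilon_t
\quad\Longrightarrow\quad
\mathrm{Var}[x_t - x_0] = t\,\sigma^2

% Mean-reverting AR(1) process with |\phi| < 1: variance saturates.
x_{t+1} = \mu + \phi\,(x_t - \mu) + \epsilon_t
\quad\Longrightarrow\quad
\lim_{t \to \infty} \mathrm{Var}[x_t] = \frac{\sigma^2}{1 - \phi^2}
```

A variance that stays fixed across checkpoints, as in Figure 2d, is the behavior expected of the second case rather than the first.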

The stationarity of the memorization status of these sequences indicates that the memorized sequence is fixed throughout time, but this is in conflict with the fact that the model weights are constantly evolving. This stability in the presence of noise is indicative of a stabilizing mechanism by which the encoding of the sequence memory is preserved by some restorative process illustrated in Figure 3d where the memorized sequence becomes a fixed point in the weight space of the model under training dynamics. Subsequent training may alter the readout of the sequence, but the memory of the sequence is fixed throughout time. Since this is not true of all sequences, but only the few which exhibit this persistent memorization, it may point to a phase transition that occurs when the sequence is first encountered.

3.3 Latent memorization and recovery

Figure 3: a. Comparison of the distribution of best achievable k-LD by perturbing the model weights. Data points were selected such that they were un-memorized (k-LD $> 50$) at 10k but were memorized (k-LD $< 10$) at some point during the next 10k training steps. Top panel is the histogram of the perturbations of the checkpoint at 19k and bottom is 10k. Notably, the perturbations cause the 10k model distances to match the distribution of the 19k model, and perturbing the 19k model does not have a significant effect. This is indicative of how model training mimics random noise with respect to the memorization status of the sequences. b. Comparison of using perturbations to evoke a target sequence for three different classes of sequences. In the top panel, we examine the sequences which are "latent" memorized. In the middle panel, we examine sequences which weren’t memorized during training, and in the bottom panel, we analyze sequences which appear later in training but had not yet been encountered by the model. We note that perturbing the weights is only able to evoke sequences which are "latent" memorized. c. Comparison of the cross entropy losses of sequences separated into the three different classes of sequences analyzed in b. The cross entropy losses of "latent" memorized sequences are much lower. d. Drawing of a mechanistic proposal for how memorization is stabilized during training. e. Visualization of the Levenshtein distances from the target for various perturbations. Each row is a single sequence, and the heights of the bars correspond to the number of perturbations which resulted in a Levenshtein distance of the corresponding bin.

Since some sequences exhibited seemingly random variations in their memorization state across different checkpoints, we hypothesize that these sequences remain memorized but are not visible at a given checkpoint. Indeed, we found many sequences which were not memorized at the initial checkpoint (10000) but exhibited memorization by checkpoint 19k (Table 1).

For these sequences, the nature of the random changes shown in Figure 2 indicates the form of a random walk. We hypothesize that the process of training in large language models acts as random noise on the weights with respect to the memory of the sequence. Thus, simply perturbing the weights with random noise should produce similar effects as training.

We find that this prediction holds. We randomly perturb the model weights by adding a small amount of random Gaussian noise with $\sigma = 10^{-3}$ to each of the weight parameters. We repeat this process 200 times and find the perturbation which yields the lowest k-LD. Notably, in the high dimensional weight space, it is difficult to reproduce arbitrary sequences using random weight perturbations; thus the recovery of memorized sequences must be due to intrinsic factors of how the memory is encoded in the weights.
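The perturbation search can be sketched as a simple loop. In this sketch the weights are a flat list of floats and `score_fn` is a hypothetical callable that maps a weight vector to a k-LD value; in the actual experiments this would involve loading the perturbed weights into the model and generating a continuation. The toy scoring function below is an arbitrary stand-in used only to make the example runnable.

```python
import random
from typing import Callable, List, Tuple


def best_perturbation(weights: List[float],
                      score_fn: Callable[[List[float]], float],
                      sigma: float = 1e-3,
                      trials: int = 200,
                      seed: int = 0) -> Tuple[float, List[float]]:
    """Add i.i.d. Gaussian noise to every weight, `trials` times, and keep the
    perturbed weights with the lowest score (here: lowest k-LD)."""
    rng = random.Random(seed)
    best_score, best_weights = float("inf"), list(weights)
    for _ in range(trials):
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        score = score_fn(noisy)
        if score < best_score:
            best_score, best_weights = score, noisy
    return best_score, best_weights


# Toy stand-in for k-LD: counts the weights that are not strictly positive.
def toy_score(w: List[float]) -> float:
    return float(sum(1 for x in w if x <= 0.0))


base_weights = [0.0] * 8
score, perturbed = best_perturbation(base_weights, toy_score)
print(f"unperturbed score: {toy_score(base_weights)}, best perturbed score: {score}")
```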

We find that sequences which were not memorized at checkpoint 10k but were memorized later in training could be recovered using random perturbation (Figure 3a). In contrast, sequences which were not memorized during the period of consideration could not be recovered. As a control, we also selected sequences which had not yet been presented to the model, and observed that their distributions closely matched those of sequences which were encountered but not memorized by the model (Figure 3b). Furthermore, we found that the perturbations yielded memorization patterns which closely matched those of the model at a later point in training. These observations support the view that, with respect to a memorized sequence, subsequent training acts similarly to random noise perturbations of the model weights.

Finally, we find that these sequences, which are not memorized at one point in training but appear memorized later, seem to be remembered by the model in spite of their incorrect continuation. These sequences can be considered to be "latent" memorized, as they may not be visible at the current point in training but can be uncovered by small perturbations of the weights. These sequences pose a significant risk for leakage since they are not easily detectable by evaluating $kl$-memorization of those sequences. To this end, we discovered that these "latent" memorized sequences had significantly lower cross entropy loss when evaluated by the model (Figure 3c); thus, simply evaluating the likelihood of those sequences using the trained model is a natural diagnostic for detecting these "latent" memorized sequences.
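A minimal sketch of this diagnostic, assuming the model's probability for each true token of a sequence is already available (obtaining these from a model is outside the sketch). The function names, the example probabilities, and the threshold are all hypothetical; the paper does not specify a cutoff value.

```python
import math
from typing import Dict, List, Sequence


def mean_cross_entropy(token_probs: Sequence[float]) -> float:
    """Average negative log-likelihood (in nats) of the true tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def flag_latent_candidates(losses: Dict[str, float], threshold: float) -> List[str]:
    """Names of sequences whose loss falls below a (hypothetical) threshold."""
    return [name for name, loss in losses.items() if loss < threshold]


# Made-up probabilities purely to exercise the code.
losses = {
    "seq_a": mean_cross_entropy([0.9, 0.95, 0.8, 0.99]),  # confidently predicted
    "seq_b": mean_cross_entropy([0.1, 0.05, 0.2, 0.02]),  # poorly predicted
}
print(flag_latent_candidates(losses, threshold=1.0))
```

A sequence flagged this way is only a candidate: low loss may also reflect low complexity, so in practice the check would likely be combined with a complexity filter such as z-compressibility.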

4 Conclusion and limitations

We study how memorization changes throughout training, focusing on sequences which occur only once throughout training. Under these conditions, we find that rather than forgetting these sequences, the model retains them for the duration of training. This stationarity indicates a stability of the memorized sequence in weight space, since the training process necessarily modifies the weights which encode the memorized sequences. We test this mechanistic view of how the training process interacts with the memorized sequence by applying random perturbations to the model weights. These perturbations confirm that sequences which appeared to be forgotten at one point during training may still be memorized by the model and can be uncovered with a small amount of random noise. We conclude by demonstrating a simple diagnostic to distinguish between "latent" memorized sequences and un-memorized sequences.

This study highlights one surprising behavior of large language models and begins to elucidate what mechanisms are present in the memorization behavior of these models. Our work suggests a possible mechanism for how memorized strings are sustained throughout training, and further experiments are needed to confirm the underlying mechanism. Notably, further testing is required across other large language models which were not considered here. We also propose a mechanistic explanation for this phenomenon which requires further study to explain the cause of these persistent memories. Finally, our analysis was restricted to a significant portion of training, but further analysis is needed to consider whether these properties hold for even longer training durations.

References

  • AlKhamissi et al. [2022] Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. A review on language models as knowledge bases. April 2022.
  • Biderman et al. [2023a] Stella Biderman, Usvsn Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. Emergent and predictable memorization in large language models. April 2023a.
  • Biderman et al. [2023b] Stella Biderman, Usvsn Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin G Anthony, Shivanshu Purohit, and Edward Raff. Emergent and predictable memorization in large language models. Adv. Neural Inf. Process. Syst., abs/2304.11158, April 2023b.
  • Carlini et al. [2020] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. December 2020.
  • Carlini et al. [2022] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. February 2022.
  • Chen et al. [2020] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. April 2020.
  • Duan et al. [2024] Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. Do membership inference attacks work on large language models? February 2024.
  • Feldman and Zhang [2020] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. August 2020.
  • Gao et al. [2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800GB dataset of diverse text for language modeling. ArXiv, abs/2101.00027, December 2020.
  • Guu et al. [2020] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-Augmented language model Pre-Training. February 2020.
  • Henderson et al. [2017] Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. Ethical challenges in Data-Driven dialogue systems. November 2017.
  • Hisamoto et al. [2019] Sorami Hisamoto, Matt Post, and Kevin Duh. Membership inference attacks on Sequence-to-Sequence models: Is my data in your machine translation system? April 2019.
  • Kandpal et al. [2022] Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. February 2022.
  • Kassem et al. [2024] Aly M Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, and Santu Rana. Alpaca against vicuna: Using LLMs to uncover memorization of LLMs. March 2024.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. U. S. A., 114(13):3521–3526, March 2017.
  • Lee et al. [2021] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. July 2021.
  • Liu et al. [2023] Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, and Eric P Xing. LLM360: Towards fully transparent Open-Source LLMs. December 2023.
  • Mireshghallah et al. [2022] Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying privacy risks of masked language models using membership inference attacks. March 2022.
  • Mirzadeh et al. [2021] Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Huiyi Hu, Razvan Pascanu, Dilan Gorur, and Mehrdad Farajtabar. Wide neural networks forget less catastrophically. October 2021.
  • Shokri et al. [2016] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. October 2016.
  • Thakkar et al. [2020] Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, and Françoise Beaufays. Understanding unintended memorization in federated learning. June 2020.
  • Tirumala et al. [2022] Kushal Tirumala, Aram H Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. May 2022.
  • Toneva et al. [2018] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. December 2018.
  • Zenke et al. [2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. Proc Mach Learn Res, 70:3987–3995, 2017.
  • Zhang et al. [2021] Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. Counterfactual memorization in neural language models. December 2021.

Appendix A Appendix / supplemental material

A.1 Compute details

All experiments were run on a cluster with access to 16 concurrent A100 GPUs. Each language model was run on a single GPU; multiple GPUs were used only to parallelize the experiments and speed up progress. The search for repeats within the dataset was performed with the Dask library on a cluster of 64 CPUs, each with 32 GB of RAM.
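As a rough illustration of the repeat search described above, the following sketch counts how often each fixed-length token window occurs in a small corpus. It uses only the Python standard library rather than Dask. The window length, the use of non-overlapping windows, the hashing scheme, and the names `window_hash` and `count_repeats` are all assumptions made for illustration, not the settings used in the experiments.

```python
from collections import Counter
from hashlib import blake2b
from typing import Iterable, Sequence


def window_hash(tokens: Sequence[int]) -> bytes:
    """Hash a token window to a fixed-size key so the count table stays compact."""
    payload = ",".join(map(str, tokens)).encode("utf-8")
    return blake2b(payload, digest_size=16).digest()


def count_repeats(documents: Iterable[Sequence[int]], window: int = 4) -> Counter:
    """Count occurrences of every non-overlapping window of `window` tokens.

    Non-overlapping windows and the default length are illustrative assumptions.
    """
    counts: Counter = Counter()
    for doc in documents:
        for start in range(0, len(doc) - window + 1, window):
            counts[window_hash(doc[start:start + window])] += 1
    return counts


# Toy corpus of token ids: the window [1, 2, 3, 4] appears in both documents.
docs = [[1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 9, 9, 9, 9]]
counts = count_repeats(docs, window=4)
repeats_of_first = counts[window_hash([1, 2, 3, 4])]
```

Because `Counter` objects can be added together, per-partition counts from a sketch like this could be computed independently and merged afterwards, which is the kind of workload a library such as Dask distributes across workers.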

A.2 Licenses

This project used code from the Pythia project Biderman et al. [2023a], released by EleutherAI under the Apache License, Version 2.0. We also used the Pile dataset Gao et al. [2020], which is released under the MIT license. The Amber model was produced by LLM360; its code and dataset are both released under the Apache License, Version 2.0.

A.3 Additional figures

We include figures that were omitted from the main paper. They provide additional detail that is not central to the paper's claims.

Figure 4: Histogram of the number of repeats vs. the edit distance. Hue is log density.
Figure 5: Histogram of the number of repeats vs. the edit distance, split by complexity. Hue is log density.
Figure 6: Average of the k-LD metric. k-LD values are binned by number of repeats and by complexity; the mean and variance of the samples in each bin are computed and shown as color.
Figure 7: Average of the k-LD metric. k-LD values are binned by number of repeats and by complexity; the mean and variance of the samples in each bin are computed and shown as color.
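The binning described in the captions of Figures 6 and 7 can be sketched as follows. This is a minimal illustration, not the code used to produce the figures: the record layout `(repeats, complexity, k_ld)`, the log2 binning of repeat counts, the fixed-width complexity bins, the use of the population variance, and the name `bin_kld` are all assumptions.

```python
from collections import defaultdict
from math import floor, log2
from statistics import fmean, pvariance


def bin_kld(records, complexity_width=0.25):
    """Group (repeats, complexity, k_ld) records into bins and summarize each bin.

    Repeats are binned on a log2 scale and complexity in fixed-width steps;
    both choices, and the population variance, are illustrative assumptions.
    Returns {(repeat_bin, complexity_bin): (mean_k_ld, variance_k_ld)}.
    """
    bins = defaultdict(list)
    for repeats, complexity, k_ld in records:
        key = (floor(log2(repeats)), floor(complexity / complexity_width))
        bins[key].append(k_ld)
    return {key: (fmean(values), pvariance(values)) for key, values in bins.items()}


# Hypothetical records: two sequences seen once, one sequence seen eight times.
records = [(1, 0.3, 10.0), (1, 0.4, 14.0), (8, 0.9, 2.0)]
summary = bin_kld(records)
```

Each entry of `summary` corresponds to one colored cell in a heatmap of this kind, with the mean or the variance chosen as the plotted quantity.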
Figure 8: Examples of strings that were seen once during training. The top-left plot shows the k-LD over training for different trajectories. The bottom-left plot is a histogram of when the examples were repeated and at what length, with time on the x-axis and the length of the repeat on the y-axis. The text of the context, the true continuation, and the model continuation are also shown.