License: CC BY 4.0
arXiv:2309.08628v3 [cs.CL] 14 Dec 2023

Recovering from Privacy-Preserving Masking with Large Language Models

Abstract

Model adaptation is crucial to handle the discrepancy between proxy training data and actual users’ data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (NLP) models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.

Index Terms—  Privacy-preserving machine learning, language modeling, large language models, automatic speech recognition

1 Introduction

A common issue arising after deploying a machine learning model on central servers or user devices is the discrepancy between training data and actual user data received. Specifically, in the applications of natural language processing (NLP), semantic characteristics and topics of real users’ textual data could be very different from those of server-side proxy corpora, in which scenarios model adaptation is indispensable [1, 2].

To effectively perform model adaptation, textual data of users is typically stored on servers or their devices, where any downstream NLP models will be trained using such in-domain data. However, users’ personal data might contain sensitive user information, such as people’s names, addresses, and credit card numbers. Therefore, this conventional practice of users’ data storage might raise privacy and security concerns due to the risks of exposing user information to adversaries. In addition, recent research has shown that sensitive information in training datasets can be detected and then extracted in unexpected ways [3, 4, 5, 6, 7]. Particularly, language models (LMs) are prone to unintentionally memorize rare or unique sequences of data, and when being prompted appropriately, they will be able to emit the memorized text verbatim [8]. Thus, having NLP models directly trained on private user data might have extra risks of exposing sensitive information.

To overcome these challenges, replacing identifying information in textual data with a generic marker has been explored [9, 10, 11]. To be more specific, tokens considered sensitive or private are masked out using a special symbol, such as “[MASK]”. For example, if the raw textual sequence is “Tom lives in Chicago”, one might mark the words “Tom” and “Chicago” as personal and thus replace them with the mask symbol. The resulting sequence is “[MASK] lives in [MASK]”, which will be stored on servers or local devices for later model adaptation.

While this strategy is capable of providing privacy protection for user data, it also introduces significant complexities to the training of NLP models for downstream adaptation tasks. The presence of markers might break the semantic structure, disrupt the coherence of the language, or fail to preserve the meaning of the original textual sequences. As a result, models directly trained on the masked corpus could yield much worse performance than those trained on the raw corpus without privacy-preserving token masking. This calls for advanced approaches that effectively substitute the masked tokens in the corpus and bridge the accuracy gaps in NLP models for adaptation tasks.

In this work, we propose to use large language models (LLMs) to provide appropriate candidate tokens to fill in the generic markers in any masked corpus. Predicting the masked tokens based on the surrounding context can be considered a masked LM (MLM) task, so pre-trained LLMs based on the bi-directional Transformer [12], such as BERT [13] and RoBERTa [14], are suitable for this endeavor. Given the remarkable capabilities demonstrated by decoder-only LLMs, models such as ChatGPT [15] and LLaMA2 [16] can also be utilized here for providing substitutes of masked tokens. Our goal is not to restore the markers to the original, unmasked tokens. Instead, we aim to replace each masked token with a substitute of the same type. Accordingly, the effectiveness of any method for recovering from privacy-preserving masking shall be evaluated on the downstream adaptation tasks, through the NLP models trained on the obfuscation corpus. In this paper, we use language modeling and LM-fused automatic speech recognition (ASR) [17, 18, 19, 20, 21] as the downstream tasks.

We make the following contributions:

  • To the best of our knowledge, our work is the first to leverage LLMs to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream LM and ASR tasks;

  • We propose multiple pre-trained and fine-tuned LLM-based methods and conduct empirical experiments on various NLP datasets for the comparison of adapted models accordingly. The results of our experiments indicate that models trained on the obfuscation corpora have comparable performance with the ones trained on the original data without privacy-preserving token masking;

  • We also present three token masking techniques and measure the performance of our proposed methods on each of them in downstream tasks as well.

The rest of the paper is organized as follows. We review related works in Section 2. Section 3 describes the details of our proposed framework for privacy-preserving token masking and the substitution of masked tokens using LLMs. Next, Section 4 presents the experiments and results for the downstream tasks of LM and ASR. Finally, we conclude in Section 5.

2 Related Works

Privacy protection has become crucial in NLP research [10]. One important direction in this area is anonymization, which involves the removal of identifying information from a textual corpus [9, 22, 23]. More recently, obfuscation, that is, replacing any sensitive information with a different substitute of the same type, has been investigated. In particular, a survey of profanity obfuscation in NLP is conducted in [24]. The authors of [25] employ a neural model that aims to preserve the syntactic relationships of the original sentence so that the obfuscated sentence can be parsed instead of the original one; it outperforms random substitution baselines across syntactic parsers. The work of [11] studies named entity obfuscation in speech, which focuses on identifying, replacing, and inserting replacement named entities synthesized using voice cloning into the original audio. The paper of [26] improves the speech recognition of personal identifiers by including fake textual substitutes in the training data of ASR. None of these existing works explores the use and comparison of different LLMs for suggesting token substitutes in obfuscation.

3 Methodology

We describe our proposed approaches to privacy-preserving token masking and the substitution of masked tokens using LLMs. Specifically, we introduce several token masking techniques in Section 3.1; LLM-based methods for replacing the masked tokens are presented in Section 3.2; Section 3.3 discusses the use of the obfuscation corpus for the language modeling task.

The overall framework is depicted in Figure 1.

Fig. 1: The framework of token masking and obfuscation using LLMs.

3.1 Token Masking Techniques

Masking sensitive tokens in users’ data helps reduce privacy risks and prevents personal information from being leaked or extracted by adversaries. Such token masking must be performed without a human in the loop, since practitioners are not allowed to access, annotate, or label users’ private data.

To automatically conceal sensitive information in some private corpus, we propose the following token masking techniques:

  • allowList: This is a pre-defined list of tokens that are considered non-sensitive and safe to keep. Typically, such a list is handcrafted by linguistic specialists. During the masking process, any token not present in this allow list is masked out;

  • vocabThres: This involves the selection of the N most frequent tokens from a vocabulary as the list of non-sensitive tokens. That is, any token with a frequency below some threshold is masked out. Here, the vocabulary set can be built from some generic large corpora;

  • entityTagger: In this approach, named entity recognition (NER) models are utilized to identify potential entities in any private corpus, which are then treated as personal tokens and masked out. These entities include but are not limited to individuals’ names, locations, and organizations.

With these masking techniques, the less common tokens in a corpus are more likely to be masked, under the assumption that private information is more closely associated with rare or unique tokens. After applying the masking, we obtain a masked corpus in which the masked tokens are replaced with the symbol “[MASK]”.
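The list-based techniques above can be illustrated with a minimal sketch. It assumes whitespace tokenization and exact string matching, neither of which is specified in the paper; the function names build_vocab_thres and mask_sentence are hypothetical. The entityTagger variant is omitted because it depends on an external NER model.

```python
from collections import Counter

MASK = "[MASK]"


def build_vocab_thres(generic_corpus, n_most_frequent):
    """Return the N most frequent tokens of a generic corpus as the keep-list.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    counts = Counter(tok for sent in generic_corpus for tok in sent.split())
    return {tok for tok, _ in counts.most_common(n_most_frequent)}


def mask_sentence(sentence, keep_tokens):
    """Replace every token that is not in keep_tokens with the mask symbol."""
    return " ".join(
        tok if tok in keep_tokens else MASK for tok in sentence.split()
    )


# allowList-style masking with a tiny hand-written keep-list.
allow_list = {"lives", "in"}
masked = mask_sentence("Tom lives in Chicago", allow_list)
print(masked)

# vocabThres-style masking: keep only the 2 most frequent tokens.
generic_corpus = ["the cat sat", "the dog sat", "the bird flew"]
vocab_keep = build_vocab_thres(generic_corpus, 2)
print(mask_sentence("the cat sat", vocab_keep))
```

Both techniques reduce to the same operation, masking everything outside a keep-list; they differ only in how that list is obtained.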

3.2 Recovery Methods from Masking

Token masking provides privacy protection. However, the resulting masked corpus might not be suitable for directly training NLP models for downstream tasks.

Given any masked corpus, we propose to use LLMs to fill in each mask symbol with an appropriate token that matches the semantic context. It is important to note that we are not aiming to predict exactly the same token as the original one in the raw corpus. Instead, we expect to substitute it with some token that makes the whole sentence linguistically correct and complete.

The following illustrates different strategies on leveraging LLMs for substituting masked tokens:

  • Top-1: In this method, we directly use the 1-best predicted token from an LLM to replace the masked token. Here, token filling is considered as a masked LM task. If there are multiple markers in the sentence, they are replaced in a sequential order from the left to the right, one at a time;

  • Top-K: This approach extends the token filling candidates from the 1-best to the K-best predictions of an LLM. Specifically, we randomly choose a token from the top-K predictions, and this selected token is used to fill in the marker in the sentence. When substituting masked tokens produced by the allowList or vocabThres masking techniques, we prefer predicted tokens that are not included in the corresponding token list. We therefore repeat the random sampling process until this condition is met or there are no available candidates left among the top-K predictions;

  • Fine-Tuning (FT): In the previous two approaches, we utilize the token predictions from a pre-trained LLM. Fine-tuning a pre-trained LLM on an in-domain corpus helps the model gain domain-specific knowledge, and hence enhances its performance in masked token prediction. To accomplish this, samples without any masked tokens can be used for fine-tuning. However, in many scenarios, it is possible that the majority of samples contain at least one mask symbol, so that fine-tuning is less effective, especially when the corpus is small. Alternatively, the top-1 or top-K predictions from the same pre-trained LLM can first be used to substitute the masked tokens in all samples, and then the entire obfuscation corpus can be used for fine-tuning the LLM. Once we have a fine-tuned LLM, either Top-1 or Top-K can be applied for the substitution of masked tokens. Note that the process above can be repeated multiple times.

After applying any of these methods, we obtain an obfuscation corpus that does not contain any masks.
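The Top-1 and Top-K strategies can be sketched as follows. This is an illustrative sketch only: predict_top_k is a hypothetical callable standing in for an LLM, and toy_predictor is a stub that ignores context. Filtering the candidates and then sampling uniformly is used here in place of repeated sampling, which yields the same choice distribution. Falling back to the unfiltered candidates when every candidate is in the keep-list is an assumption, since the paper does not state what happens in that case.

```python
import random

MASK = "[MASK]"


def fill_masks(tokens, predict_top_k, k=1, exclude=frozenset(), rng=None):
    """Fill mask symbols from left to right, one at a time.

    predict_top_k(tokens, position, k) is a hypothetical callable returning
    up to k candidate tokens for the given position, best first.
    """
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        candidates = predict_top_k(tokens, i, k)
        if not candidates:
            continue  # nothing to substitute; leave the marker in place
        # Prefer candidates outside the keep-list (allowList / vocabThres).
        preferred = [c for c in candidates if c not in exclude]
        # Assumption: fall back to all candidates if none is preferred.
        pool = preferred or candidates
        tokens[i] = pool[0] if k == 1 else rng.choice(pool)
    return tokens


def toy_predictor(tokens, position, k):
    """Stub predictor with a fixed ranking, for illustration only."""
    ranked = ["she", "in", "alice", "paris", "london"]
    return ranked[:k]


sentence = [MASK, "lives", "in", MASK]
top1 = fill_masks(sentence, toy_predictor, k=1)
topk = fill_masks(sentence, toy_predictor, k=3, exclude={"she", "in"})
print(top1)
print(topk)
```

Because earlier markers are filled before later ones are predicted, each prediction can condition on the substitutes already chosen to its left.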

3.3 Performing Downstream Tasks

Once we have substituted masked tokens, the resulting corpus can be used for training machine learning models for any downstream tasks. Notice that the effectiveness of any token filling approach should be measured by the performance of these machine learning models on these downstream tasks.

In this work, we consider the language modeling adaptation task, where a generic pre-trained LM is fine-tuned on the obfuscation corpus. The adapted LM is evaluated on an (unmasked) test set from the same domain as the raw corpus. The performance of the LM is measured in terms of perplexity.
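For reference, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities are already available from the LM (the function name is hypothetical):

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token is required")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

Lower values are better: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among four tokens.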

When integrating an adapted LM with an ASR model via shallow fusion, word error rate (WER) can also be evaluated on a test set of utterances.
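WER is conventionally computed as the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below follows that standard definition; it is not taken from the paper's evaluation code, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("tom lives in chicago", "tom lives chicago"))
```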

4 Experiments

4.1 Datasets

To compare the performance of multiple baselines and our proposed approaches on the downstream language modeling task, we explore three datasets in the experiments: Fisher [27], Pushshift.io Reddit [28] (see footnote 1), and Wall Street Journal (WSJ) [29]. The statistics of these datasets are summarized in Table 1. The test set of the WSJ data also includes voice utterances and is thus used for evaluating the ASR models with fused LMs.

Footnote 1: The Pushshift.io Reddit dataset is a previously existing dataset extracted and obtained by a third party that contains preprocessed comments posted on the social network Reddit and hosted by pushshift.io. We refer to this dataset as “Reddit” in the rest of the paper.

Table 1: Data information.
Dataset   Train Set (#sent)   Test Set (#sent)
Fisher    1,158,496           50,000
Reddit    763,683             49,570
WSJ       6,000               800

4.2 Setups

4.2.1 Downstream Tasks

The downstream LM is a Transformer with 6 layers, 12 attention heads, and 768 hidden units. The word vocabulary contains around 85K words. The LM is pre-trained on the WikiText-103 corpus [30].

For each of the masking techniques considered in this study, LMs are fine-tuned on the obfuscation train sets of Fisher, Reddit, and WSJ data. Their perplexities are evaluated on the corresponding test sets.

On the WSJ test set, we also evaluate the ASR performance. The ASR model is an RNN-T model with the Emformer encoder [31], LSTM predictor, and a joiner. It has around 80 million parameters and is trained from scratch using the train split of LibriSpeech ASR corpus [32].

4.2.2 Masking Techniques

In our experiments, allowList contains a set of 5K curated common words, and vocabThres consists of the 10K most frequent words among the same 85K word vocabulary mentioned above. For the entityTagger masking technique, we utilize the BERT-NER model [13, 33] for tagging named entities in the train sets.

For each of these masking techniques, Table 2 shows the percentage of masked tokens per dataset. We can see that allowList masks many more tokens than the other two techniques.

Table 2: Percentages of masked tokens.
Dataset   allowList   vocabThres   entityTagger
Fisher    12.5%       1.3%         1.7%
Reddit    22.7%       11.9%        4.2%
WSJ       30.4%       11.2%        9.1%

4.2.3 Baselines

We consider the following methods as the baselines:

  • Oracle: an LM is trained on the ground-truth sentences without any masking, which provides the upper bound for the model performance on each dataset;

  • Baseline0: an LM is directly trained on the masked corpus, where the mask symbol “[MASK]” is treated as a special token during model training;

  • Baseline1: zero weight is assigned to any mask symbol “[MASK]” in the LM loss function during model training.

Note that for each of these methods, the LM is still pre-trained on the WikiText-103 corpus.
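The difference between Baseline0 and Baseline1 can be illustrated with a small sketch of the training loss. It assumes the loss is the average negative log-likelihood over target tokens and that, for Baseline1, the average is normalized by the number of non-mask targets; the paper does not specify the normalization, and the function name is hypothetical.

```python
import math

MASK = "[MASK]"


def lm_loss(targets, target_log_probs, zero_weight_on_mask):
    """Weighted average negative log-likelihood over target tokens.

    With zero_weight_on_mask=True (Baseline1), "[MASK]" targets contribute
    nothing to the loss; otherwise (Baseline0) they are treated like any
    other token.
    """
    weights = [
        0.0 if (zero_weight_on_mask and tok == MASK) else 1.0
        for tok in targets
    ]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # assumption: an all-mask sequence yields zero loss
    weighted_nll = -sum(w * lp for w, lp in zip(weights, target_log_probs))
    return weighted_nll / total_weight


targets = [MASK, "lives", "in", MASK]
log_probs = [math.log(0.1), math.log(0.5), math.log(0.5), math.log(0.1)]
baseline0_loss = lm_loss(targets, log_probs, zero_weight_on_mask=False)
baseline1_loss = lm_loss(targets, log_probs, zero_weight_on_mask=True)
print(baseline0_loss, baseline1_loss)
```

In this sketch, Baseline1 still sees the markers as context but is never trained to predict them.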

4.2.4 LLM-Based Methods

In our experiments, we consider the following LLMs for substituting masked tokens in any training sequences: BERT (base, uncased), RoBERTa (base), and LLaMA2 (7B model parameters).

For the fine-tuning of BERT and RoBERTa, we use MLM as the training task. At inference time, when using pre-trained or fine-tuned BERT and RoBERTa to substitute masked tokens, any consecutive “[MASK]” markers are merged into one marker. We set K = 10 in the Top-K method.
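The merging of consecutive markers can be sketched as a simple pass over the token sequence. The function name is hypothetical and the sketch assumes the sentence is already tokenized:

```python
MASK = "[MASK]"


def merge_consecutive_masks(tokens):
    """Collapse each run of adjacent mask symbols into a single marker."""
    merged = []
    for tok in tokens:
        if tok == MASK and merged and merged[-1] == MASK:
            continue  # skip repeats within a run of markers
        merged.append(tok)
    return merged


print(merge_consecutive_masks([MASK, MASK, "lives", "in", MASK]))
```

One consequence of this step is that a multi-token span in the original sentence is replaced by a single substitute token.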

For LLaMA2, we adopt a different approach for the fine-tuning process since it is an auto-regressive model. Specifically, for each training sample, we generate a prompt by combining instruction, input, and output text: the instruction contains the text “Predict the [MASK] tokens in the given sentence”; the input is the same training sample with a few tokens randomly replaced with the symbol “[MASK]”; and the output is the original training sample (without masking). We leverage the low-rank adaptation (LoRA) method [34] for fine-tuning LLaMA2 on the set of prompts. At inference time, the instruction and input are provided to the fine-tuned model, which then generates the output text as a continuation.
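The prompt construction described above might look like the following sketch. The dictionary layout, the function names, and the masking probability mask_prob are assumptions made for illustration; the paper only states that “a few tokens” are randomly replaced and does not give the prompt template.

```python
import random

MASK = "[MASK]"
INSTRUCTION = "Predict the [MASK] tokens in the given sentence"


def make_training_prompt(sentence, mask_prob=0.15, rng=None):
    """Build an instruction/input/output record for fine-tuning.

    mask_prob is a hypothetical per-token masking probability.
    """
    rng = rng or random.Random(0)
    masked = [
        MASK if rng.random() < mask_prob else tok for tok in sentence.split()
    ]
    return {
        "instruction": INSTRUCTION,
        "input": " ".join(masked),
        "output": sentence,  # the original sample, without masking
    }


def make_inference_prompt(masked_sentence):
    """At inference time only the instruction and input are provided."""
    return {"instruction": INSTRUCTION, "input": masked_sentence}


example = make_training_prompt("tom lives in chicago", mask_prob=0.5)
print(example)
print(make_inference_prompt("[MASK] lives in [MASK]"))
```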

4.3 Results

Table 3 shows the perplexity results of the baselines and proposed methods on Fisher dataset. We have the following observations:

  • All proposed methods give lower perplexity results than the two baseline methods;

  • In all scenarios, Top-K outperforms Top-1 based methods; fine-tuned BERT and RoBERTa obtain better results than the ones without fine-tuning, with a single tie for RoBERTa under vocabThres;

  • Since more tokens are masked out with allowList, the gap between Oracle and any other method is much larger than that of the vocabThres or entityTagger masking technique;

  • RoBERTa yields the best perplexity performance across all the masking techniques. In particular, for vocabThres and entityTagger, perplexity results from fine-tuned RoBERTa are very close to those of Oracle, which indicates that most of the missing information can be recovered in the obfuscation dataset;

  • LLaMA2(Top-1,FT) is a competitive method but is not as good as fine-tuned BERT or RoBERTa for this task.

Table 3: Perplexity results on Fisher dataset.
Method               allowList   vocabThres   entityTagger
Oracle               37.3        37.3         37.3
Baseline0            120.1       42.3         41.7
Baseline1            109.4       41.6         41.6
BERT(Top-1)          93.0        41.3         41.5
RoBERTa(Top-1)       71.6        40.5         39.5
BERT(Top-K)          75.2        40.8         40.5
RoBERTa(Top-K)       70.2        38.9         38.7
BERT(Top-K,FT)       73.6        39.8         39.7
RoBERTa(Top-K,FT)    65.3        38.9         38.5
LLaMA2(Top-1,FT)     89.3        40.8         40.7

Table 4 shows the experimental results on Reddit dataset. The observations are similar to the ones in Fisher dataset. In particular, RoBERTa(Top-K,FT) again achieves the best perplexity results across all the masking techniques.

Table 4: Perplexity results on Reddit dataset.
Method               allowList   vocabThres   entityTagger
Oracle               76.0        76.0         76.0
Baseline0            339.6       168.2        82.3
Baseline1            221.9       134.9        79.8
BERT(Top-1)          196.2       121.2        78.9
RoBERTa(Top-1)       117.3       94.2         78.4
BERT(Top-K)          127.4       106.3        78.7
RoBERTa(Top-K)       123.4       92.6         77.4
BERT(Top-K,FT)       117.4       102.5        77.6
RoBERTa(Top-K,FT)    98.5        82.1         76.8
LLaMA2(Top-1,FT)     123.3       107.7        78.7

Table 5 and Table 6 show the perplexity and WER results on WSJ dataset, respectively. We have the following findings:

  • The use of a fused LM for domain adaptation in ASR models is effective: comparing the WERs of ASR models with the pre-trained LM and the Oracle LM (12.6 versus 10.6), the latter achieves a relative WER improvement of more than 15%;

  • The best WERs obtained by the proposed methods have relatively small gaps compared with those of the Oracle LM. For the vocabThres and entityTagger masking techniques, the best WERs are higher than the Oracle WER by only about 1% (10.7 versus 10.6) and 5% (11.1 versus 10.6) in relative terms, respectively. That is, the proposed methods are able to achieve significant improvements over the pre-trained LM (without adaptation), while they also provide better privacy protection than the Oracle LM.
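The percentages quoted in these findings are relative changes, which can be checked directly against the WER values in Table 6. A short worked calculation (the helper name is hypothetical):

```python
def relative_change_percent(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Oracle LM versus pre-trained LM (negative means a WER reduction).
oracle_vs_pretrained = relative_change_percent(10.6, 12.6)
# Best proposed method versus Oracle, vocabThres masking.
vocab_thres_vs_oracle = relative_change_percent(10.7, 10.6)
# Best proposed method versus Oracle, entityTagger masking.
entity_tagger_vs_oracle = relative_change_percent(11.1, 10.6)

for name, value in [
    ("Oracle vs pre-trained LM", oracle_vs_pretrained),
    ("vocabThres best vs Oracle", vocab_thres_vs_oracle),
    ("entityTagger best vs Oracle", entity_tagger_vs_oracle),
]:
    print(f"{name}: {value:+.1f}%")
```

The three values come out at roughly -15.9%, +0.9%, and +4.7%, consistent with the “more than 15%”, “1%”, and “5%” figures stated above.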

Table 5: Perplexity results on WSJ dataset.
Method               allowList   vocabThres   entityTagger
Oracle               86.5        86.5         86.5
Baseline0            309.0       144.3        204.0
Baseline1            210.0       122.9        198.2
BERT(Top-1)          205.9       119.4        149.3
RoBERTa(Top-1)       181.1       102.5        118.2
BERT(Top-K)          174.1       103.3        108.3
RoBERTa(Top-K)       114.5       93.4         98.7
BERT(Top-K,FT)       186.7       113.4        162.3
RoBERTa(Top-K,FT)    120.7       110.4        157.8
LLaMA2(Top-1,FT)     135.6       106.8        145.6

Table 6: WER results on WSJ dataset.
Method               allowList   vocabThres   entityTagger
ASR-without-LM       14.4        14.4         14.4
Pre-Trained-LM       12.6        12.6         12.6
Oracle               10.6        10.6         10.6
Baseline0            13.0        12.6         11.3
Baseline1            12.5        11.2         11.2
BERT(Top-1)          12.4        11.1         11.2
RoBERTa(Top-1)       12.4        10.9         11.1
BERT(Top-K)          12.1        11.1         11.4
RoBERTa(Top-K)       11.9        10.9         11.1
BERT(Top-K,FT)       12.7        11.5         11.7
RoBERTa(Top-K,FT)    11.8        11.4         11.1
LLaMA2(Top-1,FT)     12.0        10.7         11.2

5 Conclusion

In this paper, we propose multiple pre-trained and fine-tuned LLM-based methods to recover from privacy-preserving token masking on textual corpus and perform empirical studies on various datasets for the comparison of these approaches. Our experimental results demonstrate that LMs trained on the obfuscation corpora can obtain comparable accuracy with the ones trained on the raw data without privacy-preserving token masking.

Future research might include fine-tuning LLMs with an objective function designed to be more directly related to the downstream NLP tasks. Also, we would consider a combination of these three masking techniques and adopt class-specific markers such as “[PERSON]”, “[NUMBER]”, etc.

References

  • [1] Ke Li, Zhe Liu, Tianxing He, Hongzhao Huang, Fuchun Peng, Daniel Povey, and Sanjeev Khudanpur, “An empirical study of transformer-based neural language model adaptation,” in Proc. ICASSP, 2020.
  • [2] Zhe Liu, Ke Li, Shreyan Bakshi, and Fuchun Peng, “Private language model adaptation for speech recognition,” arXiv preprint arXiv:2110.10026, 2021.
  • [3] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proc. ACM SIGSAC, 2015.
  • [4] Congzheng Song and Vitaly Shmatikov, “Auditing data provenance in text-generation models,” in Proc. ACM SIGKDD, 2019.
  • [5] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song, “The secret sharer: Evaluating and testing unintended memorization in neural networks,” in 28th USENIX Security Symposium, 2019.
  • [6] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al., “Extracting training data from large language models,” in 30th USENIX Security Symposium, 2021.
  • [7] W Ronny Huang, Steve Chien, Om Thakkar, and Rajiv Mathews, “Detecting unintended memorization in language-model-fused ASR,” in Proc. Interspeech, 2022.
  • [8] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang, “Quantifying memorization across neural language models,” arXiv preprint arXiv:2202.07646, 2022.
  • [9] Sergio Martínez, David Sánchez, Aida Valls, and Montserrat Batet, “Privacy protection of textual attributes through a semantic-based masking method,” Information Fusion, vol. 13, no. 4, pp. 304–314, 2012.
  • [10] Samuel Sousa and Roman Kern, “How to keep text private? a systematic review of deep learning methods for privacy-preserving natural language processing,” Artificial Intelligence Review, vol. 56, no. 2, pp. 1427–1492, 2023.
  • [11] Judita Preiss, “Automatic named entity obfuscation in speech,” in Findings of ACL, 2023.
  • [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in NeurIPS, 2017.
  • [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [15] OpenAI, “ChatGPT: Optimizing language models for dialogue,” Feb 2022.
  • [16] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [17] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Proc. Interspeech, 2010.
  • [18] Xie Chen, Xunying Liu, Mark JF Gales, and Philip C Woodland, “Improving the training and evaluation efficiency of recurrent neural network language models,” in Proc. ICASSP, 2015.
  • [19] Xunying Liu, Yongqiang Wang, Xie Chen, Mark JF Gales, and Philip C Woodland, “Efficient lattice rescoring using recurrent neural network language models,” in Proc. ICASSP, 2014.
  • [20] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhijeng Chen, and Rohit Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. ICASSP, 2018.
  • [21] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language modeling with deep transformers,” in Proc. Interspeech, 2019.
  • [22] Pierre Lison, Ildikó Pilán, David Sánchez, Montserrat Batet, and Lilja Øvrelid, “Anonymisation models for text data: State of the art, challenges and future directions,” in Proc. ACL, 2021.
  • [23] Tzvika Hartman, Michael D Howell, Jeff Dean, Shlomo Hoory, Ronit Slyper, Itay Laish, Oren Gilon, Danny Vainstein, Greg Corrado, Katherine Chou, et al., “Customization scenarios for de-identification of clinical notes,” BMC medical informatics and decision making, vol. 20, no. 1, pp. 1–9, 2020.
  • [24] Debora Nozza and Dirk Hovy, “The state of profanity obfuscation in natural language processing,” arXiv preprint arXiv:2210.07595, 2022.
  • [25] Zhifeng Hu, Serhii Havrylov, Ivan Titov, and Shay B. Cohen, “Obfuscation for privacy-preserving syntactic parsing,” 2020.
  • [26] Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, and Bhuvana Ramabhadran, “Using text injection to improve recognition of personal identifiers in speech,” arXiv preprint arXiv:2308.07393, 2023.
  • [27] Christopher Cieri, David Miller, and Kevin Walker, “The fisher corpus: a resource for the next generations of speech-to-text,” in International Conference on Language Resources and Evaluation, 2004.
  • [28] Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn, “The Pushshift reddit dataset,” in International Conference on Web and Social Media, 2020.
  • [29] Lukas Drude, Jens Heitkaemper, Christoph Boeddeker, and Reinhold Haeb-Umbach, “SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition,” arXiv preprint arXiv:1910.13934, 2019.
  • [30] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016.
  • [31] Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer, “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in Proc. ICASSP, 2021.
  • [32] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015.
  • [33] Erik F. Tjong Kim Sang and Fien De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proc. of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003.
  • [34] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.