PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration

Ziqian Zeng¹, Jianwei Wang¹, Zhengdong Lu¹, Huiping Zhuang¹, Cen Chen¹,²
¹South China University of Technology
²Pazhou Laboratory, China
zqzeng@scut.edu.cn, wiwjwilliam@mail.scut.edu.cn
Equal contribution. Corresponding author.

Abstract

The widespread use of online Large Language Model (LLM) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs to eavesdroppers or untrustworthy service providers. Existing privacy protection methods for LLMs suffer from insufficient privacy protection, performance degradation, or severe inference time overhead. In this paper, we propose PrivacyRestore to protect the privacy of user inputs during LLM inference. PrivacyRestore directly removes privacy spans in user inputs and restores privacy information via activation steering during inference. The privacy spans are encoded as restoration vectors. We propose Attention-aware Weighted Aggregation (AWA), which aggregates the restoration vectors of all privacy spans in the input into a meta restoration vector. AWA not only ensures proper representation of all privacy spans but also prevents attackers from inferring the privacy spans from the meta restoration vector alone. This meta restoration vector, along with the query with privacy spans removed, is then sent to the server. The experimental results show that PrivacyRestore can protect private information while maintaining acceptable levels of performance and inference efficiency.

1 Introduction

Large Language Models (LLMs) have emerged as powerful tools in various domains, including healthcare [6, 49], law [47, 10], and finance [46, 48]. With the exception of a very small portion of users who have the resources and expertise to deploy LLMs locally, the vast majority of users access and interact with these powerful models through online inference services.

However, the widespread use of online LLM inference services has raised significant privacy concerns, especially regarding the risk of private information being leaked through user inputs when interacting with LLMs deployed on cloud platforms. User inputs often contain sensitive and proprietary information such as medical records, unpublished narrative works, and personal financial details. Threats may arise from eavesdroppers intercepting user queries during transmission to cloud platforms, as well as from LLM service providers themselves exploiting or misusing the sensitive data contained in user inputs for illicit purposes. For example, in sensitive domains like medical diagnosis, if a user’s input containing personal health information, such as "I was previously diagnosed with HIV, and lately I’ve been experiencing fever and diarrhea...", is disclosed, it may cause trouble in the individual’s personal life.

In this paper, we aim to protect the private information contained in user inputs during the LLM inference stage. In this setting, the client submits inputs to the server (also known as the service provider), and there is a risk that the inputs might be disclosed by attackers or by the server. Two existing categories of methods can protect user inputs: Secure Multi-Party Computation (SMPC) and Differential Privacy (DP). SMPC-based methods [16, 21, 25] use encryption protocols and techniques to enable collaborative computation without revealing the original data to others. In DP-based methods in a local privacy setting [30, 37, 24], users apply text-to-text privatization [13] to their data locally before publishing it. SMPC methods still incur severe inference time overhead, making them impractical for real-time applications. For example, running a single inference pass on RoBERTa-Base [29] requires 168.43 seconds [17]. For DP methods, imposing privacy protection inevitably degrades the performance of downstream tasks, a phenomenon known as the privacy-utility trade-off. Hence, there is a need to develop privacy-preserving methods which can effectively safeguard the privacy of user inputs while maintaining high-quality outputs, without incurring prohibitive computational costs.

We propose PrivacyRestore which directly removes privacy spans in user inputs and restores privacy information via activation steering [23, 44, 18] during model inference. The privacy spans are encoded as vectors and securely transmitted to the service provider. Our method is based on two key assumptions: (a) Private information is confined within specific spans, termed “privacy spans” rather than being dispersed throughout the entire input; (b) In a particular domain, the number of potential privacy spans is limited and finite. These assumptions are reasonable in many scenarios. For instance, in the medical diagnosis domain, private information typically relates to symptoms and disease names, expressed through specific spans such as “leakage of urine” and “HIV”. It is difficult for attackers to disclose private information if privacy spans are removed. Moreover, despite the inevitable evolution of medical knowledge and terminology, the core set of symptoms and diseases remains relatively stable and finite.

During inference, the user submits a plain text with privacy spans removed as well as a meta restoration vector which encodes all privacy spans in the input. The meta restoration vector is used for activation steering at specific attention heads to implicitly restore privacy information. At the core of PrivacyRestore is our Attention-aware Weighted Aggregation (AWA), which estimates the importance of each privacy span in the input and takes a weighted sum of the restoration vectors of all privacy spans as the meta restoration vector. AWA not only ensures proper representation of all privacy spans but also prevents attackers from inferring the privacy spans from the meta restoration vector alone. PrivacyRestore is a plug-and-play method in which only the restoration vectors are trainable while the LLM is frozen. Once the training of restoration vectors is completed, a restoration bank is constructed. The purpose of this restoration bank is to provide the meta restoration vector when given a query and a set of privacy spans. During inference, users retrieve the meta restoration vector locally from the restoration bank. The experimental results show that the proposed method can protect private information while maintaining acceptable levels of performance and inference efficiency.

The contributions of our paper are summarized as follows,

• We propose a plug-and-play privacy protection method that removes privacy spans in the input and restores privacy information via activation steering during inference.

• We propose Attention-aware Weighted Aggregation to construct the meta restoration vector, which ensures proper representation of all privacy spans in the input and prevents attackers from inferring the privacy spans from the meta restoration vector alone.

• We construct two datasets for the medical diagnosis task to evaluate our method, and show that PrivacyRestore can protect private information while maintaining acceptable levels of performance and inference efficiency.

2 Related Works

In this section, we mainly introduce the relevant work on user inputs protection methods and activation steering as follows.

2.1 Privacy-Preserving Methods for User Inputs Protection during Inference

In some sensitive domains, when using LLMs deployed on cloud platforms, user inputs and queries containing sensitive information need to be properly protected. There are mainly two privacy-preserving methods that can be applied to protect user inputs in the inference stage: differential privacy and secure multi-party computation. Federated learning [54, 50, 8, 14] is applied to protect the privacy of training data, so it falls outside the scope of our discussion.

Differential Privacy. Differential Privacy (DP) can be categorized into two settings [37, 24]: Centralized DP (CDP) and Local DP (LDP). LDP assumes that only the client is trustworthy and privatizes the data locally, while CDP assumes a trustworthy data curator that collects raw data and protects it from privacy leakage. Under the CDP setting, various DP-based methods have been explored for pre-training [51, 20] and fine-tuning [52, 38] LLMs in sensitive domains. Under the LDP setting, $d_{\mathcal{X}}$-privacy [3, 5] is applied to privatize user input for local privacy protection [37] during inference. RAPT [24] proposes privacy-preserving prompt tuning for LLMs on the basis of local DP. In addition, some works [53, 32, 42] have applied DP to synthetic text generation, enabling the model to generate text with formal privacy guarantees.

Secure Multi-Party Computation. Secure multi-party computation (SMPC) based methods use multi-party encryption technology to support collaborative computation among multiple parties while protecting the privacy of each party's data. SMPC methods based on model structure optimization aim to replace SMPC-unfriendly nonlinear operations with SMPC-friendly operators. MPC-Former [21] approximates nonlinear operations in the Transformer using polynomials and maintains performance through model distillation. MERGER [25] integrates previous techniques into natural language generation (NLG) tasks by bypassing embedding computation and reorganizing linear operations in Transformer modules, further improving computational efficiency and model performance. Methods based on SMPC protocol optimization focus on designing efficient SMPC operators for nonlinear operations in LLMs while maintaining the original model structure. Some recent works [16, 27, 55, 15] improve the efficiency of nonlinear operations in privacy-preserving LLM inference by combining multiple SMPC protocols such as garbled circuits and function secret sharing.

2.2 Activation Steering

Activation steering aims to control frozen LLMs to generate the desired text by searching for or constructing vectors that intervene on the model’s activations during inference. PPLM [9] uses a separate classifier to perturb model activations, which controls the topics and sentiment styles of the text generated by the model. Z-Steer [39] extracts sentence-specific steering vectors from frozen LLMs through gradient descent, steering model generation to near-perfect BLEU scores [35] and enabling unsupervised style transfer. ITI [23] computes intervention vectors based on the activation distributions of true and false statements and selects the attention heads to intervene on according to the accuracy of a linear probe. Unlike ITI [23], another work [44] computes intervention vectors using only the activation differences of prompt pairs. REMEDI [18] inspects and edits knowledge representations in LLMs by learning attribute encodings of entities.

3 The Proposed Method

We first introduce the threat model in §3.1 and the overview of PrivacyRestore in §3.2 respectively. Subsequently, we provide detailed illustrations of the three components of PrivacyRestore, namely, edited attention heads identification, restoration vectors training, and restoration vectors aggregation. We will introduce the inference process in §3.6. Additionally, we introduce possible attack scenarios that PrivacyRestore might encounter in Appendix B.

3.1 Threat Model

We consider a threat model that includes two parties: an untrustworthy server holding LLM weights, and a client holding user inputs containing private information. The LLM weights on the server are publicly available. Attackers aim to steal private information in user inputs from the client during online LLM inference that takes place on the server side.

3.2 The Overview of PrivacyRestore

The fundamental idea of PrivacyRestore is removing privacy spans in user inputs and restoring privacy information via activation steering [23, 44, 18] during model inference. An overview of PrivacyRestore is depicted in Figure 1.

Figure 1: The PrivacyRestore framework consists of three procedures. (1) Edited Attention Heads Identification. This procedure aims to identify the attention heads that, after being edited with restoration vectors, can achieve the most effective restoration effect. Different privacy spans may require different attention heads for effective restoration. We propose the Top-K Heads Selector to obtain the common top-K heads set. (2) Restoration Vectors Training. The training objective is to align the predictions given the input with privacy spans removed with the predictions given an intact input. Restoration vectors are trainable while the LLM is frozen. (3) Attention-aware Weighted Aggregation (AWA). AWA is responsible for aggregating the restoration vectors of all privacy spans in the input and generating a meta restoration vector. This meta restoration vector is then used to edit the attention heads of the LLM. During inference, the user sends a query with privacy spans removed, along with the meta restoration vector, to the server.

3.3 Edited Attention Heads Identification

This procedure aims to identify the attention heads that, after being edited with restoration vectors, can achieve the most effective restoration effect.

According to [12, 45], the transformer architecture utilizes attention mechanism to capture the relationship between the current token and its surrounding context. The attention mechanism can be expressed as:

$$\mathbf{U}^{l,h}=\text{Attn}^{l,h}(\mathbf{X}^{l,h}),\tag{1}$$

where $\mathbf{X}^{l,h}$ is the sequence input of the $h$-th head in the $l$-th layer and $\mathbf{U}^{l,h}$ is the sequence output. Directly removing privacy spans leaves the context incomplete, so the attention heads may capture insufficient relationships and produce wrong outputs. Following activation steering methods [23, 7], we inject restoration vectors into the outputs of attention heads to restore private information:

$$\mathbf{y}^{l,h}=\mathbf{u}^{l,h}+\mathcal{R}^{l,h},\tag{2}$$

where $\mathbf{u}^{l,h}=\mathbf{U}^{l,h}[-1]$ is the output of the last token and $\mathbf{y}^{l,h}$ is the output after privacy restoration. $\mathcal{R}^{l,h}$ is the aggregated restoration vector for the $h$-th head in the $l$-th layer. The aggregated restoration vector is the aggregation of the restoration vectors of all privacy spans in the user input, and is described in detail in §3.5.

As pointed out by activation steering methods [23, 7], applying restoration vectors to all attention heads in the LLM degrades model performance. As shown in the first part of Figure 1, we employ the probing technique [2, 41, 4] to identify the most salient heads associated with each privacy span. We train a separate classifier for each head, tailored to each privacy span $s$, formulated as

$$\mathcal{F}_{s}^{l,h}(\mathbf{u}^{l,h})=\sigma(\theta\cdot\mathbf{u}^{l,h}),\tag{3}$$

where $\textbf{u}^{l,h}$ is the output of the last token on the $h$-th head of the $l$-th layer, and $\theta$ denotes the parameters of the classifier. In our setting, the classifier discriminates whether the target privacy span $s$ appears in the query. Higher classifier accuracy at an attention head suggests a stronger correlation between that head and the target privacy span. We therefore select the $K$ attention heads with the highest accuracies for each privacy span.

In a privacy protection scenario, using different top-K head sets for different privacy spans carries a risk of privacy leakage. If different privacy spans have different edited attention heads, an attacker can infer a specific privacy span from which attention heads are edited. For example, an attacker might observe that only the heads set of the privacy span “HIV” contains the $j$-th head, so the appearance of the $j$-th head indicates the presence of “HIV” in the user input. We propose an algorithm named Top-K Heads Selector that combines the top-K heads sets of all privacy spans into a common top-K heads set $\mathcal{H}_{c}$ according to statistical characteristics. Firstly, we initialize an empty score list $L_{h}$ for each head. Secondly, each privacy span $s$ has its corresponding top-K heads set $\mathcal{H}_{k}^{s}$. For each head $h$ in $\mathcal{H}_{k}^{s}$, we append the score $K-\text{SortedIndex}(h,\mathcal{H}_{k}^{s})$ to its score list $L_{h}$. Thirdly, we calculate the average value of each score list $L_{h}$ as the score of the corresponding head. Finally, we sort all heads in the LLM by their scores and pick the top-K heads as the common top-K heads set $\mathcal{H}_{c}$. The detailed algorithm is described in Algorithm 1.

Algorithm 1 Top-K Heads Selector

Input: $\mathcal{S}$ is the set of privacy spans; $\mathcal{H}_{a}$ is the set of all heads; $\mathcal{H}_{k}$ is the set of top-K heads sets; $\mathcal{H}_{k}^{s}$ is the top-K heads set of the privacy span $s$; $\text{SortedIndex}(h,\mathcal{H}_{k}^{s})$ is the function that sorts $\mathcal{H}_{k}^{s}$ in descending order of classifier accuracy and returns the index of head $h$.

1: Initialize an empty score list $L_{h}=[\;]$ for each head $h\in\mathcal{H}_{a}$
2: for $\mathcal{H}_{k}^{s}\in\mathcal{H}_{k}$ do
3:     for $h\in\mathcal{H}_{k}^{s}$ do
4:         Append $K-\text{SortedIndex}(h,\mathcal{H}_{k}^{s})$ into $L_{h}$
5:     end for
6: end for
7: for $h\in\mathcal{H}_{a}$ do $\text{score}_{h}=\text{average}(L_{h})$
8: end for
9: Sort $\mathcal{H}_{a}$ according to $\text{score}_{h}$ and select top-K heads to obtain common top-K heads set $\mathcal{H}_{c}$

Output: $\mathcal{H}_{c}$ is the common top-K heads set.
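The selector in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `common_top_k_heads`, the representation of heads as `(layer, head)` tuples, and the choice to give a score of 0 to heads that appear in no top-K set (Algorithm 1 does not say how an empty score list is averaged) are all assumptions.

```python
from collections import defaultdict


def common_top_k_heads(top_k_sets, all_heads, k):
    """Combine per-span top-K head lists into one common top-K list.

    top_k_sets: dict mapping each privacy span to its list of heads, already
                sorted by probe accuracy in descending order, so the list
                position plays the role of SortedIndex.
    all_heads:  list of every (layer, head) pair in the model.
    k:          number of heads to keep.
    """
    score_lists = defaultdict(list)
    for heads in top_k_sets.values():
        for index, head in enumerate(heads):
            # A head ranked higher for this span receives a larger score.
            score_lists[head].append(k - index)

    def score(head):
        scores = score_lists.get(head)
        # Assumption: heads that never appear in any top-K set score 0.
        return sum(scores) / len(scores) if scores else 0.0

    ranked = sorted(all_heads, key=score, reverse=True)
    return ranked[:k]
```

Because the score is an average over the spans that selected a head, a head chosen by a single span at rank 0 can outrank a head chosen by every span at rank 1. The sketch keeps that behaviour, since it follows directly from Algorithm 1 as written.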

3.4 Restoration Vectors Training

Restoration vectors are designed to restore the information contained within the removed privacy spans. The training objective is to align the predictions given the input with privacy spans removed to be the same as the predictions given an intact input.

As illustrated in the second part of Figure 1, we generate training samples by appending gold response as the positive answer $a_{p}$ to user queries. The invasive injection of restoration vectors modifies the attention activation and may adversely affect the instruction-following capability of the LLM. To maintain the instruction-following capability, we include an empty response as the negative answer $a_{n}$ for each sample. Consequently, we obtain the training dataset $D=\{(q,a_{p},a_{n})|q\in Q,a_{p}\in A,a_{n}=\text{<EOS>}\}$ , where $q$ denotes the query with privacy spans removed, $Q$ denotes the query set, $a_{p}$ denotes the positive answer, $A$ signifies the gold answer set, $a_{n}$ denotes the negative answer, and <EOS> is the token indicating the end of a sequence.

For each privacy span $s\in\mathcal{S}$ , there is a trainable restoration vector $r_{s}^{l,h}$ for each head $h$ in the common top-K heads set $\mathcal{H}_{c}$ . Restoration vectors of all privacy spans are the only trainable parameters of our method. The LLM weights are frozen. Our method is plug-and-play and parameter-efficient for training. We fine-tune these restoration vectors using ORPO loss proposed by [19]:

$$\log P(a|q;\Theta)=\frac{1}{|a|}\sum_{i=1}^{|a|}\log P(a_{i}|q,a_{<i};\Theta),\tag{4}$$
$$\text{ratio}(a|q;\Theta)=\frac{P(a|q;\Theta)}{1-P(a|q;\Theta)},\tag{5}$$
$$\mathcal{L}_{\text{ORPO}}=-\log P(a_{p}|q;\Theta)-\log\sigma\left(\log\frac{\text{ratio}(a_{p}|q;\Theta)}{\text{ratio}(a_{n}|q;\Theta)}\right),\tag{6}$$

where $q$ denotes the query with privacy spans removed, and $\Theta$ is the set of restoration vectors corresponding to the privacy spans present in the input. $a$ is a response, which can be the positive response $a_{p}$ or the negative response $a_{n}$, and $|a|$ is the length of the response. $\mathcal{L}_{\text{ORPO}}$ simultaneously encourages generating correct responses and discourages meaningless empty responses.
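To make the arithmetic of Equations (4) to (6) concrete, the following sketch evaluates the loss for one training triple from per-token log-probabilities. It is only an illustration under stated assumptions: the function names are hypothetical, the log-probabilities are taken as given inputs (in the actual method they would come from the frozen LLM with the restoration vectors injected), and no gradient computation is shown.

```python
import math


def avg_log_prob(token_log_probs):
    """Length-normalized log-likelihood of a response (Eq. 4)."""
    return sum(token_log_probs) / len(token_log_probs)


def log_ratio(token_log_probs):
    """log(P / (1 - P)), where log P is the value from Eq. 4 (Eq. 5).

    Requires the average log-probability to be strictly negative.
    """
    log_p = avg_log_prob(token_log_probs)
    return log_p - math.log1p(-math.exp(log_p))


def orpo_loss(pos_token_log_probs, neg_token_log_probs):
    """Loss for one (q, a_p, a_n) triple (Eq. 6)."""
    nll = -avg_log_prob(pos_token_log_probs)
    margin = log_ratio(pos_token_log_probs) - log_ratio(neg_token_log_probs)
    # log(sigmoid(margin)) written in a numerically stable form.
    log_sigmoid = -math.log1p(math.exp(-margin))
    return nll - log_sigmoid
```

For instance, if every token of the gold answer has probability 0.5 and the empty answer (a single <EOS> token) has probability 0.2, the first term is $\log 2$ and the second is $-\log\sigma(\log 4)=\log 1.25$, so the loss equals $\log 2.5$.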

3.5 Attention-aware Weighted Aggregation

User inputs commonly contain multiple privacy spans that may lead to privacy leakage. Some privacy spans may have a critical impact on the final outputs, while others may be less significant. Aggregating all restoration vectors with equal weights may weaken the impact of the critical privacy spans and amplify the impact of irrelevant ones. Therefore, we propose a novel method called Attention-aware Weighted Aggregation (AWA), which estimates a weight for each privacy span and then takes the weighted sum of restoration vectors as the aggregation. To estimate the importance of different privacy spans under the limited computing resources of the client, we employ a tiny model (such as BERT [11]) on the client side to calculate the importance weights.

As depicted in the third panel in Figure 1, the first step of AWA is computing attention score of the last token $t$ attending to a privacy span $s$ , yielding importance weights denoted as $w_{s}$ :

$$\text{Attn}^{h}(s_{j},x_{t})=\text{Softmax}(x_{t}\mathbf{W}_{Q}^{h}\cdot(s_{j}\mathbf{W}_{K}^{h})^{T}),\tag{7}$$
$$w_{s}=\frac{1}{|s|}\sum_{j=1}^{|s|}\frac{1}{|H|}\sum_{h\in H}\text{Attn}^{h}(s_{j},x_{t}),\tag{8}$$

where $\mathbf{W}_{Q}^{h}$ and $\mathbf{W}_{K}^{h}$ are the query and key projection matrices of head $h$ in the last layer of the tiny model, respectively. $x_{t}$ and $s_{j}$ represent the hidden states of the last token $t$ and of the $j$-th token of privacy span $s$ in the last layer, respectively. $H$ denotes the set of all attention heads in the last layer, and $|s|$ denotes the number of tokens contained in $s$.

After obtaining importance weights of multiple privacy spans in the query, we calculate the aggregated restoration vector $\mathcal{R}^{l,h}$ of the $l$ -th layer and $h$ -th head as follows,

$$\mathcal{R}^{l,h}=\begin{cases}r_{s}^{l,h}+\epsilon&\text{if }|\mathcal{S}_{q}|=1,\\ \sum_{s\in\mathcal{S}_{q}}w_{s}\cdot r_{s}^{l,h}&\text{otherwise},\end{cases}\tag{9}$$

where $\mathcal{S}_{q}$ denotes the set of privacy spans contained in the query and $|\mathcal{S}_{q}|$ is the number of privacy spans. When the query contains a single privacy span, the aggregation would reduce to the restoration vector $r_{s}^{l,h}$ itself, so an attacker who intercepts $\mathcal{R}^{l,h}$ could easily deduce the plain text of the privacy span. We therefore inject noise $\epsilon$ into the restoration vector in this case, making it difficult for attackers to infer the plain text of the privacy span from $\mathcal{R}^{l,h}$.
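The weighting in Equation (8) and the aggregation in Equation (9) can be sketched for a single edited head as follows. This is a rough illustration rather than the authors' code: the function names are hypothetical, the attention scores are assumed to have been extracted already from the client-side model, and the noise $\epsilon$ is assumed to be zero-mean Gaussian with a hypothetical scale `noise_std`, since this section does not specify its distribution.

```python
import random


def span_weight(attn):
    """Eq. 8: mean attention over the span's tokens and over all heads.

    attn[j][h] is the attention score of the last token on the j-th token
    of the span at head h of the client-side model's last layer.
    """
    per_token = [sum(row) / len(row) for row in attn]
    return sum(per_token) / len(per_token)


def aggregate(restoration, attn_by_span, noise_std=0.01, rng=random):
    """Eq. 9 for one edited head.

    restoration:  dict span -> restoration vector (list of floats).
    attn_by_span: dict span -> attention scores (see span_weight) for the
                  privacy spans present in the query.
    noise_std:    hypothetical scale of the noise (Gaussian is an assumption).
    """
    spans = list(attn_by_span)
    if len(spans) == 1:
        # Single span: perturb the vector so it cannot be matched exactly.
        return [x + rng.gauss(0.0, noise_std) for x in restoration[spans[0]]]
    out = [0.0] * len(restoration[spans[0]])
    for span in spans:
        weight = span_weight(attn_by_span[span])
        for i, x in enumerate(restoration[span]):
            out[i] += weight * x
    return out
```

The weights are used exactly as Equation (9) states, without renormalizing them to sum to one across spans.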

The client side maintains a local Restoration Bank responsible for providing the meta restoration vector based on a given query and a set of privacy spans. The Restoration Bank comprises three key components: the aggregation algorithm AWA, restoration vectors for each privacy span, and a common top-K heads set. The meta restoration vector can be obtained by concatenating the $\mathcal{R}^{l,h}$ for all heads in the common top-K heads set. The server is aware of the specific order of concatenation.

3.6 Inference

As depicted in the third panel of Figure 1, during LLM inference, on the client side, users are allowed to select specific spans (e.g., black stools, pale skin, blood in stool) which they wish to keep undisclosed. These selected spans are considered as privacy spans. The client side obtains the meta restoration vector from the Restoration Bank. Then, the client side provides queries with privacy spans removed and the meta restoration vector to LLMs deployed on the server side. On the server side, the meta restoration vector will be injected into attention heads of LLMs in the common top-K heads set. Finally, the client receives response outputs generated by the LLM.
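The exchange between client and server can be illustrated with a small sketch. It assumes, hypothetically, that heads are identified by `(layer, head)` tuples, that every per-head vector has the same dimension `head_dim`, and that both sides share the same `head_order`; the function names are illustrative and do not come from the paper.

```python
def build_meta_vector(per_head_vectors, head_order):
    """Client side: concatenate R^{l,h} over the common top-K heads."""
    meta = []
    for head in head_order:
        meta.extend(per_head_vectors[head])
    return meta


def split_meta_vector(meta, head_order, head_dim):
    """Server side: recover each head's vector using the agreed order."""
    if len(meta) != len(head_order) * head_dim:
        raise ValueError("meta vector length does not match the head set")
    return {
        head: meta[i * head_dim:(i + 1) * head_dim]
        for i, head in enumerate(head_order)
    }


def edit_head_output(last_token_output, restoration):
    """Eq. 2: y = u + R, applied to the last-token output of one head."""
    return [u + r for u, r in zip(last_token_output, restoration)]
```

Only the query with privacy spans removed and the flat list returned by `build_meta_vector` would leave the client in this sketch; the per-span restoration vectors and the attention weights stay local.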

4 Experiments

4.1 Experimental Setup

Backbones. We use Llama2-chat-7b [43] as the LLM deployed on server side. We utilize BERT-base [11] as the tiny model on the client side for attention-aware weighted aggregation mentioned in §3.5. We also report results on larger LLMs such as Llama2-chat-13b [43] and Llama3-8b-instruct [33] in Appendix D.

Datasets. We mainly focus on protecting user inputs in the medical diagnosis task. We consider symptoms in user inputs as privacy spans. In real-world scenarios, some symptoms may be highly sensitive, while others may be less sensitive or non-sensitive. Existing benchmarks for the medical diagnosis task, such as Ddxplus [40] and NLICE [1], lack privacy levels for symptoms. We used GPT-3.5 [34] to rate the privacy level of every possible symptom in the Ddxplus [40] and NLICE [1] benchmarks. The query prompt for rating privacy levels is shown in Appendix A.2. Each symptom is rated from level 1 (non-sensitive information) to level 5 (highly sensitive information), with the definitions of the privacy levels provided in Appendix A.1. Leveraging the source datasets (Ddxplus and NLICE) and incorporating privacy levels for symptoms, we curated two privacy-preserving datasets, named Pri-Ddxplus and Pri-NLICE, respectively. There are 149 and 70 types of privacy spans in Pri-Ddxplus and Pri-NLICE, respectively. The detailed construction process and statistical information for the two datasets are provided in Appendix A.

Metrics. The evaluation of a privacy protection method should consider three aspects: model performance, the effectiveness of privacy protection, and inference efficiency. We evaluate the model’s performance using two metrics: MC1 and MC2 [26]. We assign each sample in Pri-Ddxplus and Pri-NLICE four options, including one correct diagnosis and three incorrect ones. To calculate MC1, we compute the accuracy, i.e., the proportion of questions answered correctly out of the total number of questions, where the option with the highest probability is taken as the answer. To calculate MC2, we compute the normalized probability of the true answer.
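The two performance metrics can be sketched as below. The function names and the input layout (one list of option probabilities plus the index of the correct diagnosis per sample) are assumptions made for illustration, and the results are returned as fractions rather than percentages.

```python
def mc1(option_probs, correct_index):
    """1.0 if the highest-probability option is the correct one, else 0.0."""
    predicted = max(range(len(option_probs)), key=lambda i: option_probs[i])
    return 1.0 if predicted == correct_index else 0.0


def mc2(option_probs, correct_index):
    """Probability of the correct option, normalized over all options."""
    return option_probs[correct_index] / sum(option_probs)


def evaluate(samples):
    """samples: list of (option_probs, correct_index) pairs.

    Returns the dataset-level (MC1, MC2) as averages over samples.
    """
    n = len(samples)
    avg_mc1 = sum(mc1(p, c) for p, c in samples) / n
    avg_mc2 = sum(mc2(p, c) for p, c in samples) / n
    return avg_mc1, avg_mc2
```

MC1 rewards only the top-ranked option, whereas MC2 also gives partial credit when the correct diagnosis receives substantial probability without being ranked first.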

To evaluate the effectiveness of privacy protection, we construct two commonly used privacy attacks targeted at LLMs, Prompt Injection Attack [36, 28] and Attribute Inference Attack [31, 22]. For prompt injection attack, attackers intercept the medical query and inject extra harmful prompts to induce the model to generate the privacy symptoms of patients. If the model output contains privacy symptoms, we consider it as a successful attack. We compute the ratio of successful attacks, namely ASR (Attack Success Ratio). For attribute inference attack, attackers intend to build up a simple multi-label classifier to map representations of the medical query into corresponding privacy symptoms. We compute the F1 score of the classifier. The lower the ASR and F1, the higher the effectiveness of privacy protection.
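As a rough illustration of how ASR could be computed for the prompt injection attack, the sketch below counts an attack as successful when the model output mentions at least one of the query's privacy symptoms. The case-insensitive substring matching and the "at least one symptom" criterion are assumptions; the text above only states that an output containing privacy symptoms counts as a successful attack.

```python
def attack_succeeded(model_output, privacy_spans):
    """True if the output mentions any privacy span (assumed criterion)."""
    text = model_output.lower()
    return any(span.lower() in text for span in privacy_spans)


def attack_success_ratio(outputs, spans_per_query):
    """Fraction of attacked queries whose output leaks a privacy span.

    outputs:         list of model outputs, one per attacked query.
    spans_per_query: list of privacy-span lists, aligned with outputs.
    """
    hits = sum(
        attack_succeeded(output, spans)
        for output, spans in zip(outputs, spans_per_query)
    )
    return hits / len(outputs)
```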

To evaluate inference efficiency, we measure latency and throughput. Latency is measured from the moment the user submits the query until the answer is generated. As different samples have varying lengths, we calculate the average latency per sample and provide the average output length. Latency is further divided into two parts: namely, latency on server side and latency on client side. The generation of meta restoration vector is conducted on the client side. Hence, it is necessary to investigate the latency of generating the meta restoration vector. Lower latency indicates higher inference efficiency. In addition to latency, we compute the throughput, which denotes the number of tokens generated by the model per second. Higher throughput indicates higher inference efficiency.

Due to space limitations, implementation details are presented in Appendix C.

4.2 Compared Methods

To demonstrate the effectiveness of our method, we compare our model with the following baselines: No Protection. The client does not provide any protection for the patient’s queries and sends the intact queries to the server. Since the query is intact, the model’s performance is not affected, so this baseline provides a theoretical upper bound on the model’s performance. However, it suffers from privacy leakage. Direct Removal. The user directly deletes all privacy spans in the input and submits it to the LLM. DP. The differential privacy method [37] applies text-to-text privatization [13] to user inputs, which injects noise into each token’s embedding and replaces the original token with the nearest token. The text-to-text privatization is applied to all tokens in the input. The LLM backbone of the DP method is frozen. DP on Privacy Spans. The client applies text-to-text privatization [13] to the privacy spans only, rather than to the entire query. SMPC methods suffer from huge inference time overhead, so we do not include them as baselines.
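The text-to-text privatization used by the DP baselines can be illustrated with a toy sketch. It shows one common way to realize $d_{\mathcal{X}}$-privacy noise, namely sampling a uniformly random direction and a Gamma-distributed magnitude so that the noise density is proportional to $\exp(-\epsilon\|z\|)$; whether the baseline in this paper uses exactly this sampler is an assumption. The embedding table, function names, and seed handling are hypothetical.

```python
import math
import random


def sample_noise(dim, epsilon, rng):
    """Noise with density proportional to exp(-epsilon * ||z||)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)  # shape=dim, scale=1/eps
    return [radius * x / norm for x in direction]


def privatize_token(token, embeddings, epsilon, rng):
    """Perturb the token's embedding and return the nearest vocabulary token."""
    vector = embeddings[token]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + z for v, z in zip(vector, noise)]
    return min(embeddings, key=lambda t: math.dist(noisy, embeddings[t]))


def privatize_text(tokens, embeddings, epsilon, seed=0):
    """Apply the mechanism independently to every token in the input."""
    rng = random.Random(seed)
    return [privatize_token(t, embeddings, epsilon, rng) for t in tokens]
```

Passing only the tokens of the privacy spans to `privatize_text` would correspond to the DP on Privacy Spans variant, while passing the whole query corresponds to the DP baseline. A smaller `epsilon` produces larger noise and therefore more token replacements.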

The settings in the ablation study of PrivacyRestore are shown as follows. w/o Top-K Heads Selector. We remove the Top-K heads selector in PrivacyRestore. We randomly select K heads from $\mathcal{H}_{a}$ to form the head set $\mathcal{H}_{c}$ . w/o Restoration Vectors Training. We eliminate the training process and do not use trainable restoration vectors. We use an activation steering method called ITI [23] to obtain the restoration vectors of each privacy span. w/o AWA. We remove the attention-aware weighted aggregation component. The meta restoration vector is computed as an equally weighted sum of restoration vectors of all privacy spans.

Pri-Ddxplus
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1 $\uparrow$	MC2 $\uparrow$	ASR $\downarrow$	F1 $\downarrow$	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1
No Protection	$\times$	86.83	85.13	100.00	12.30	86.83	85.13	100.00	12.30	86.83	85.13	100.00	12.30	86.83	85.13	100.00	12.30
Direct Removal	✓	33.31	33.49	45.19	7.27	63.20	59.47	24.79	4.54	78.24	75.66	10.20	1.23	86.24	84.45	0.58	1.58
DP [37]	✓	41.83	40.61	70.69	6.28	41.83	40.61	70.69	6.28	41.83	40.61	70.69	6.28	41.83	40.61	70.69	6.28
DP on Privacy Spans	✓	41.83	40.61	70.69	6.28	65.59	63.01	43.44	3.62	78.37	75.68	8.58	1.33	86.44	84.55	0.51	1.71
PrivacyRestore	✓	61.78	61.13	41.18	5.81	75.40	72.83	22.27	0.25	84.89	84.09	8.56	0.00	87.73	85.96	0.51	0.00
w/o Top-K Heads Selector	✓	21.04	22.19	49.96	5.94	27.82	27.87	11.87	26.51	67.59	67.45	3.55	42.35	85.86	83.96	0.00	35.8
w/o Restoration Vectors Training	✓	41.05	39.15	37.44	5.82	48.58	45.23	27.63	0.58	64.36	64.11	12.39	0.00	85.92	84.14	0.12	0.00
w/o AWA	✓	44.35	43.26	3.74	0.00	68.62	65.95	3.87	0.00	84.89	84.07	13.36	0.00	87.73	85.95	0.06	0.00
Pri-NLICE
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1 $\uparrow$	MC2 $\uparrow$	ASR $\downarrow$	F1 $\downarrow$	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1
No Protection	$\times$	76.64	75.80	77.78	75.87	76.64	75.80	77.78	75.87	76.64	75.80	77.78	75.87	76.64	75.80	77.78	75.87
Direct Removal	✓	58.58	55.18	2.27	23.86	64.52	61.63	0.00	20.49	73.73	72.12	0.00	3.84	74.36	72.85	0.00	1.1
DP [37]	✓	59.21	56.12	0.00	18.72	59.21	56.12	0.00	18.72	59.21	56.12	0.00	18.72	59.21	56.12	0.00	18.72
DP on Privacy Spans	✓	37.87	38.98	0.50	20.86	48.10	47.17	0.01	18.66	72.22	70.40	0.00	3.30	73.48	71.70	0.00	0.79
PrivacyRestore	✓	79.54	78.81	0.25	0.00	78.66	77.44	0.12	0.12	76.64	75.35	0.37	6.52	75.88	74.98	0.00	0.00
w/o Top-K Heads Selector	✓	71.46	67.92	5.05	12.79	70.20	69.30	0.75	18.61	72.22	70.44	0.13	2.17	71.46	70.53	0.00	1.29
w/o Restoration Vectors Training	✓	54.16	50.93	5.42	27.83	61.99	58.77	0.25	22.63	72.97	71.73	0.50	7.36	72.97	72.61	0.00	6.38
w/o AWA	✓	76.76	74.61	0.12	28.36	78.4	77.31	0.12	22.55	76.63	75.33	0.37	7.36	75.88	74.98	0.00	6.38

Table 1: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama2-chat-7b. MC1 and MC2 metrics indicate the model’s performance, with higher values representing better performance. ASR and F1 metrics measure the effectiveness of privacy protection, with lower values indicating better protection. The “$\geq\text{Level}\;j$” column indicates that privacy protection is applied to symptoms with a privacy level greater than or equal to $j$. The “Privacy” column shows whether the method applies privacy protection to the user input or not. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.

4.3 Main Results

As shown in Table 1, we evaluate model performance and the effectiveness of privacy protection on two datasets. Since the No Protection method does not use any protection and the DP method injects noise into all tokens in the query, the outcomes of both methods are independent of the privacy level. As shown in the first row of the upper and lower parts of the table, the No Protection method achieves the upper bound of MC1 and MC2 but suffers from privacy leakage on both datasets.

The Direct Removal method serves as a strong competitor in privacy protection but suffers from model performance degradation. A lower level threshold means that more privacy symptoms are removed, so the performance degradation is more pronounced. The performance of DP and DP on Privacy Spans is significantly lower than that of the No Protection method, and their privacy protection capability is weaker than that of Direct Removal.
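To make the effect of the level threshold concrete, the following minimal sketch splits the symptoms of a query into kept and removed sets for each $\geq\text{Level}\;j$ setting. The symptom names and their levels are hypothetical, loosely modeled on the examples in Table 3, and `split_by_threshold` is an illustrative helper, not the actual pipeline.

```python
# Hypothetical symptom -> privacy level mapping (illustration only).
PRIVACY_LEVELS = {
    "sneezing": 1,
    "flu": 2,
    "diabetes": 3,
    "depression": 4,
    "HIV": 5,
}


def split_by_threshold(symptoms, levels, threshold):
    """Split symptoms into (kept, removed) for a '>= Level threshold' setting."""
    kept = [s for s in symptoms if levels[s] < threshold]
    removed = [s for s in symptoms if levels[s] >= threshold]
    return kept, removed


query_symptoms = list(PRIVACY_LEVELS)

# Number of symptoms removed from the query under each threshold.
removed_counts = {
    j: len(split_by_threshold(query_symptoms, PRIVACY_LEVELS, j)[1])
    for j in (2, 3, 4, 5)
}
```

In this toy query, lowering the threshold from 5 to 2 increases the number of removed symptoms from one to four, which is why the degradation of Direct Removal is largest in the $\geq\text{Level 2}$ column.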

Compared to the DP based methods, PrivacyRestore achieves significant improvements in model performance with even lower ASR and F1. Its model performance is comparable to that of the No Protection method. Moreover, PrivacyRestore sometimes even outperforms the No Protection method, as demonstrated in the $\geq\text{Level 5}$ case on the Pri-Ddxplus dataset and the $\geq\text{Level 2}$, $\geq\text{Level 3}$, and $\geq\text{Level 5}$ cases on the Pri-NLICE dataset. This suggests that using the restoration vectors for the corresponding privacy symptoms can sometimes be more beneficial for model inference than including the symptoms directly in the query context. Surprisingly, PrivacyRestore achieves better privacy protection than Direct Removal, with lower ASR and F1 on both datasets. This indicates that applying restoration vectors during model inference can effectively prevent attackers from stealing the privacy symptoms.

We also provide the results after removing the Top-K Heads Selector, Restoration Vectors Training, and Attention-aware Weighted Aggregation (AWA) components, respectively. After removing the Top-K Heads Selector or Restoration Vectors Training, both the model performance and the effectiveness of privacy protection are weakened. After removing the AWA component, both the model performance and the privacy protection capability decrease (except for slight changes in ASR), which indicates the effectiveness of the AWA component.

4.4 Inference Efficiency

We compare the inference efficiency of the initial model and PrivacyRestore on Llama2-chat-7b in Table 2. On the server side, PrivacyRestore needs to edit the attention heads using the meta restoration vector during model inference, as shown in Eq. 2. These operations are fast and only bring 8%-13% overhead on the server, as shown in the “Latency on Server” column. On the client side, PrivacyRestore needs to generate the meta restoration vector for users. The latency on the client side is relatively minor compared to the latency on the server, as depicted in the “Latency on Client” column. As shown in the “Throughput” column, PrivacyRestore can achieve nearly 80% of the throughput of the original model.

Pri-Ddxplus
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	8.44	531.05	-	-	-	-	15.9
PrivacyRestore	9.14	601.46	109.84	88.52	89.07	145.63	12.83 (81%)
Pri-NLICE
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	6.56	462.04	-	-	-	-	14.2
PrivacyRestore	6.56	506.27	91.03	86.84	79.34	79.25	11.12 (78%)

Table 2: Comparison of inference efficiency between the original model (Llama2-chat-7b) and PrivacyRestore. Latency is measured in milliseconds (ms), and throughput is measured in tokens per second. The percentage in parentheses represents the ratio of PrivacyRestore’s throughput to that of the original model.
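The relative server overhead and the throughput ratio can be recomputed directly from the entries of Table 2. The sketch below does only this arithmetic; the dictionary layout and the helper names `server_overhead` and `throughput_ratio` are illustrative assumptions.

```python
# Values copied from Table 2 (Llama2-chat-7b).
TABLE2 = {
    "Pri-Ddxplus": {
        "initial_latency_ms": 531.05,
        "restore_latency_ms": 601.46,
        "initial_tps": 15.9,
        "restore_tps": 12.83,
    },
    "Pri-NLICE": {
        "initial_latency_ms": 462.04,
        "restore_latency_ms": 506.27,
        "initial_tps": 14.2,
        "restore_tps": 11.12,
    },
}


def server_overhead(initial_ms, restore_ms):
    """Relative increase in server latency caused by PrivacyRestore."""
    return (restore_ms - initial_ms) / initial_ms


def throughput_ratio(initial_tps, restore_tps):
    """Fraction of the original throughput that PrivacyRestore retains."""
    return restore_tps / initial_tps


summary = {
    name: {
        "overhead": server_overhead(
            row["initial_latency_ms"], row["restore_latency_ms"]
        ),
        "throughput_ratio": throughput_ratio(
            row["initial_tps"], row["restore_tps"]
        ),
    }
    for name, row in TABLE2.items()
}
```

With these inputs, the overhead is roughly 13% on Pri-Ddxplus and roughly 10% on Pri-NLICE, and the throughput ratios round to the 81% and 78% reported in the table.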

5 Conclusion

We propose PrivacyRestore, which aims to protect the privacy of user inputs when using online LLM inference services. By directly removing privacy spans and restoring the information through activation steering, PrivacyRestore offers a practical and efficient solution to the privacy-utility trade-off. At the core of PrivacyRestore is our Attention-aware Weighted Aggregation (AWA) technique. AWA aggregates the restoration vectors of all privacy spans in the input into a meta restoration vector. This not only ensures proper representation of all private information, but also prevents attackers from inferring the original privacy spans from the meta restoration vector alone. Compared with differential privacy methods, PrivacyRestore achieves better performance on the medical diagnosis task and demonstrates stronger privacy protection against prompt injection attacks and attribute inference attacks.

Limitations

Our experiments focus specifically on the healthcare domain. It would be more convincing to also conduct experiments on other domains such as law and finance. The set of privacy spans may evolve over time, necessitating periodic re-training of the restoration vectors. This process typically requires 4 to 5 hours for the datasets mentioned above. In cases where privacy spans are out-of-vocabulary, one possible solution is to exclude them from privacy protection.

References

Al-Ars et al. [2023] Zaid Al-Ars, Obinna Agba, Zhuoran Guo, Christiaan Boerkamp, Ziyaad Jaber, and Tareq Jaber. 2023. Nlice: Synthetic medical record generation for effective primary healthcare differential diagnosis. In 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE), pages 397–402. IEEE.
Alain and Bengio [2016] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
Alvim et al. [2018] Mário Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. 2018. Invited paper: Local differential privacy on metric spaces: Optimizing the trade-off with utility. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 262–267.
Belinkov [2022] Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
Chatzikokolakis et al. [2013] Kostas Chatzikokolakis, Miguel Andrés, Nicolás Bordenabe, and Catuscia Palamidessi. 2013. Broadening the scope of differential privacy using metrics.
Chen et al. [2023] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
Chen et al. [2024] Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, and Chengzhong Xu. 2024. Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20967–20974.
Chu et al. [2023] Hong-Min Chu, Jonas Geiping, Liam H Fowl, Micah Goldblum, and Tom Goldstein. 2023. Panning for gold in federated learning: Targeted text extraction under arbitrarily large-scale aggregation. In The Eleventh International Conference on Learning Representations.
Dathathri et al. [2020] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.
Deng et al. [2023] Wentao Deng, Jiahuan Pei, Keyi Kong, Zhe Chen, Furu Wei, Yujun Li, Zhaochun Ren, Zhumin Chen, and Pengjie Ren. 2023. Syllogistic reasoning for legal judgment analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13997–14009, Singapore. Association for Computational Linguistics.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Elhage et al. [2021] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1:1.
Feyisetan et al. [2020] Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the 13th international conference on web search and data mining, pages 178–186.
Fowl et al. [2022] Liam H Fowl, Jonas Geiping, Steven Reich, Yuxin Wen, Wojciech Czaja, Micah Goldblum, and Tom Goldstein. 2022. Decepticons: Corrupted transformers breach privacy in federated learning for language models. In NeurIPS ML Safety Workshop.
Gupta et al. [2023] Kanav Gupta, Neha Jawalkar, Ananta Mukherjee, Nishanth Chandran, Divya Gupta, Ashish Panwar, and Rahul Sharma. 2023. Sigma: Secure gpt inference with function secret sharing. Cryptology ePrint Archive, Paper 2023/1269. https://eprint.iacr.org/2023/1269.
Hao et al. [2022a] Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. 2022a. Iron: Private inference on transformers. In Advances in Neural Information Processing Systems, volume 35, pages 15718–15731. Curran Associates, Inc.
Hao et al. [2022b] Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. 2022b. Iron: Private inference on transformers. Advances in neural information processing systems, 35:15718–15731.
Hernandez et al. [2023] Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. Inspecting and editing knowledge representations in language models.
Hong et al. [2024] Jiwoo Hong, Noah Lee, and James Thorne. 2024. ORPO: monolithic preference optimization without reference model. CoRR, abs/2403.07691.
Igamberdiev and Habernal [2023] Timour Igamberdiev and Ivan Habernal. 2023. DP-BART for privatized text rewriting under local differential privacy. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13914–13934, Toronto, Canada. Association for Computational Linguistics.
Li et al. [2023a] Dacheng Li, Hongyi Wang, Rulin Shao, Han Guo, Eric Xing, and Hao Zhang. 2023a. MPCFormer: Fast, performant and private transformer inference with MPC. In The Eleventh International Conference on Learning Representations.
Li et al. [2022] Haoran Li, Yangqiu Song, and Lixin Fan. 2022. You don’t know my favorite color: Preventing dialogue representations from revealing speakers’ private personas. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 5858–5870. Association for Computational Linguistics.
Li et al. [2023b] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023b. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36, pages 41451–41530. Curran Associates, Inc.
Li et al. [2023c] Yansong Li, Zhixing Tan, and Yang Liu. 2023c. Privacy-preserving prompt tuning for large language model services. CoRR, abs/2305.06212.
Liang et al. [2024] Zi Liang, Pinghui Wang, Ruofei Zhang, Nuo Xu, Shuo Zhang, Lifeng Xing, Haitao Bai, and Ziyang Zhou. 2024. Merge: Fast private text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19884–19892.
Lin et al. [2021] Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
Liu and Liu [2023] Xuanqi Liu and Zhuotao Liu. 2023. Llms can understand encrypted prompt: Towards privacy-computing friendly transformers. arXiv preprint arXiv:2305.18396.
Liu et al. [2023] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Lyu et al. [2020] Lingjuan Lyu, Xuanli He, and Yitong Li. 2020. Differentially private representation for NLP: formal guarantee and an empirical study on privacy and fairness. pages 2355–2365.
Mahloujifar et al. [2021] Saeed Mahloujifar, Huseyin A Inan, Melissa Chase, Esha Ghosh, and Marcello Hasegawa. 2021. Membership inference on word embedding and beyond. arXiv preprint arXiv:2106.11384.
Mattern et al. [2022] Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, and Mrinmaya Sachan. 2022. Differentially private language models for secure data sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
MetaAI [2023] MetaAI. 2023. Introducing meta llama 3: The most capable openly available llm to date.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Perez and Ribeiro [2022] Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop.
Qu et al. [2021] Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork. 2021. Natural language understanding with privacy-preserving BERT. In CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, pages 1488–1497.
Shi et al. [2022] Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, and Zhou Yu. 2022. Selective differential privacy for language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2848–2859, Seattle, United States. Association for Computational Linguistics.
Subramani et al. [2022] Nishant Subramani, Nivedita Suresh, and Matthew Peters. 2022. Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 566–581, Dublin, Ireland. Association for Computational Linguistics.
Tchango et al. [2022] Arsène Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. 2022. Ddxplus: A new dataset for automatic medical diagnosis. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Tenney et al. [2019] Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950.
Tian et al. [2022] Zhiliang Tian, Yingxiu Zhao, Ziyue Huang, Yu-Xiang Wang, Nevin L. Zhang, and He He. 2022. Seqpate: Differentially private text generation via knowledge distillation. In Advances in Neural Information Processing Systems, volume 35, pages 11117–11130. Curran Associates, Inc.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Turner et al. [2023] Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation addition: Steering language models without optimization.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wu et al. [2023] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
[47] Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, and Kun Kuang. wisdominterrogatory. Available at GitHub.
Xie et al. [2023] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. Pixiu: A large language model, instruction data and evaluation benchmark for finance. arXiv preprint arXiv:2306.05443.
Xu et al. [2023a] Canwen Xu, Daya Guo, Nan Duan, and Julian J. McAuley. 2023a. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Proceedings of EMNLP, pages 6268–6278.
Xu et al. [2023b] Mingbin Xu, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang, Arturo Argueta, Shiyi Han, Yaqiao Deng, Leo Liu, Anmol Walia, and Alex Jin. 2023b. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
Yu et al. [2023] Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zi-Han Lin, Saurabh Naik, Tomasz L. Religa, Jian Yin, and Huishuai Zhang. 2023. Selective pre-training for private fine-tuning. ArXiv, abs/2305.13865.
Yu et al. [2022] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. 2022. Differentially private fine-tuning of language models. In International Conference on Learning Representations.
Yue et al. [2023] Xiang Yue, Huseyin Inan, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, and Robert Sim. 2023. Synthetic text generation with differential privacy: A simple and practical recipe. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1321–1342, Toronto, Canada. Association for Computational Linguistics.
Zhang et al. [2023] Zhuo Zhang, Yuanhang Yang, Yong Dai, Qifan Wang, Yue Yu, Lizhen Qu, and Zenglin Xu. 2023. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9963–9977, Toronto, Canada. Association for Computational Linguistics.
Zheng et al. [2023] Mengxin Zheng, Qian Lou, and Lei Jiang. 2023. Primer: Fast private transformer inference on encrypted data. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE.

Appendix A Datasets

A.1 Definitions of Privacy Level

We classify the privacy of patients’ symptoms into five levels, with higher levels indicating greater sensitivity. The detailed definitions of the privacy levels are presented in Table 3. We also display the distribution of symptoms across the five privacy levels on the Pri-Ddxplus and Pri-NLICE datasets.

Privacy Level	Definition	%
Level 1	Public information, symptoms or antecedents that are common, widely known, and do not reveal any personal or sensitive information. Examples include sneezing, headache, or minor injuries.	$20.43$
Level 2	Non-sensitive personal information, symptoms or antecedents that may be personal but not necessarily sensitive or revealing. These may include common illnesses like cold or flu, allergies, or minor digestive issues.	$22.58$
Level 3	Potentially sensitive information, symptoms or antecedents that could be indicative of underlying health conditions but are not immediately sensitive or stigmatizing. Examples include chronic conditions like diabetes, hypertension, or asthma.	$40.86$
Level 4	Sensitive personal information, symptoms or antecedents that may be stigmatizing or have social implications if disclosed publicly. This could include mental health issues like depression or anxiety, reproductive health concerns, or substance abuse.	$6.98$
Level 5	Highly sensitive information, symptoms or antecedents that are highly personal, stigmatizing, or potentially life-altering if disclosed publicly. This category includes sexually transmitted infections, HIV/AIDS, certain types of cancer, or rare and serious medical conditions.	$9.13$

Table 3: Definitions of the privacy levels for possible symptoms in Pri-Ddxplus. The rightmost column displays the percentage of symptoms at the corresponding privacy level.

A.2 Construction Process

We utilized GPT-3.5 [34] to assess the privacy levels of all possible symptoms in Ddxplus [40] and NLICE [1], ranging from non-sensitive to highly sensitive. The assessment prompt template is shown in Figure 2.

We also assign each sample three random pathologies as incorrect diagnoses, which are combined with the correct diagnosis to create the answer options. Some data samples are presented in Figure 3.

The initial Ddxplus [40] and NLICE [1] datasets are extensive, and we observed that for most samples, providing only non-sensitive symptoms often yields diagnosis outputs similar to those obtained when all symptoms are provided. Privacy preservation is meaningless for these samples because users can simply hide the privacy spans and still obtain approximately the same diagnosis results. In the real world, however, sensitive symptoms sometimes play a vital role in diagnosis, and privacy preservation is then highly valuable. Our datasets are designed to benchmark various privacy-preserving methods and must therefore include samples where privacy symptoms are crucial for the diagnosis outcome. We use the KL divergence to measure the importance of privacy spans: we calculate the KL divergence between the model output distributions with and without the privacy symptoms included. A higher KL divergence indicates that the absence of sensitive symptoms may lead to a different or even incorrect output. After filtering, we obtain the Pri-Ddxplus and Pri-NLICE datasets for privacy preservation.
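The filtering step can be sketched as follows, assuming the model output is a probability distribution over the four diagnosis options. The direction of the divergence (full input first), the threshold value, and the toy distributions are all assumptions made for illustration; the text above does not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Clip to avoid log(0), then renormalize so both inputs sum to one.
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def keep_sample(p_with_privacy, p_without_privacy, threshold):
    """Keep a sample only if removing the privacy symptoms shifts the output enough."""
    return kl_divergence(p_with_privacy, p_without_privacy) > threshold


THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper

# Toy output distributions over four diagnosis options.
p_full = [0.85, 0.05, 0.05, 0.05]      # all symptoms provided
p_unchanged = [0.80, 0.10, 0.05, 0.05]  # privacy symptoms removed, same answer
p_shifted = [0.10, 0.70, 0.10, 0.10]    # privacy symptoms removed, answer changes

discard = keep_sample(p_full, p_unchanged, THRESHOLD)  # small divergence
retain = keep_sample(p_full, p_shifted, THRESHOLD)     # large divergence
```

Samples like the second case, where hiding the privacy symptoms changes the predicted diagnosis, are the ones the filtered datasets are meant to retain.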

A.3 Statistical Information

We show the statistics of the obtained Pri-Ddxplus and Pri-NLICE dataset in Table 4. We tally the number of instances, symptom types, and diagnosis types. Symptoms with a privacy level greater than 2 were considered private. We calculate the number of privacy symptom types, which are regarded as privacy spans. We also compute the average occurrence of privacy symptoms per instance.

Pri-Ddxplus contains more instances and more privacy symptom types than Pri-NLICE. Each sample in Pri-Ddxplus contains about six privacy symptoms on average, while samples in Pri-NLICE contain about four.

Pri-Ddxplus
Dataset Split	Instances	Symptom Type	Diagnosis Type	Privacy Symptom Type	Avg. Privacy Symptoms
All	7509	187	73	149	5.98
Train	5901	187	73	149	6.02
Dev	59	122	53	102	6.16
Test	1549	97	45	78	5.83
Pri-NLICE
Dataset Split	Instances	Symptom Type	Diagnosis Type	Privacy Symptom Type	Avg. Privacy Symptoms
All	3992	80	47	70	3.76
Train	3168	80	47	65	3.50
Dev	32	50	44	41	3.78
Test	792	43	47	25	4.79

Table 4: The statistics of Pri-Ddxplus and Pri-NLICE. Each instance represents a patient’s description of symptoms along with the correct diagnosis. Avg. Privacy Symptoms denotes the average number of privacy symptoms that occur in one query.

Appendix B Defense to Various Attacks

We enumerate the potential attacks that PrivacyRestore may face and demonstrate that PrivacyRestore can effectively defend against them as follows.

Leakage of restoration vectors of each privacy span and meta restoration vector.

Attackers may illegally obtain the restoration vector of each privacy span and intercept the meta restoration vector sent to the server. Even in such scenarios, it is still difficult for attackers to infer the privacy spans from a specific meta restoration vector. According to the AWA method in §3.5, when only one privacy span exists in the query, the meta restoration vector is a restoration vector with random noise injected, which prevents the privacy span from being inferred. When the query contains multiple privacy spans, attackers need to try all combinations of restoration vectors to infer the privacy spans. The number of combinations that the attacker needs to try is the sum, over every subset size from 2 to $|\mathcal{S}|$, of the number of ways to choose that many restoration vectors, which can be expressed as:

$$\mathcal{N}_{c}=\sum_{i=2}^{|\mathcal{S}|}C_{|\mathcal{S}|}^{i}=2^{|\mathcal{S}|}-|\mathcal{S}|-1, \tag{10}$$

where $|\mathcal{S}|$ is the total number of privacy spans, and $C_{n}^{i}$ represents the number of ways to choose $i$ elements from a set of $n$ elements. The number of combinations grows exponentially with $|\mathcal{S}|$, so in practical scenarios, where $|\mathcal{S}|$ is typically large, it is infeasible for attackers to infer the privacy spans by enumeration even if the restoration vectors of each privacy span are available.
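As a quick numerical check of Eq. 10, the sketch below evaluates both the explicit sum and the closed form. The function names are illustrative, and the values 70 and 149 are the numbers of privacy symptom types reported for Pri-NLICE and Pri-Ddxplus in Table 4, used here as plausible values of $|\mathcal{S}|$.

```python
from math import comb


def num_combinations(num_spans):
    """Explicit sum over all subsets containing at least two restoration vectors."""
    return sum(comb(num_spans, i) for i in range(2, num_spans + 1))


def num_combinations_closed_form(num_spans):
    """Closed form of the same count: 2^|S| - |S| - 1."""
    return 2 ** num_spans - num_spans - 1


# A small case that can be checked by hand: C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1.
small_case = num_combinations(4)

# Search-space sizes for the two datasets (privacy symptom types from Table 4).
search_space = {n: num_combinations_closed_form(n) for n in (70, 149)}
```

Even for $|\mathcal{S}|=70$ the count exceeds $10^{21}$, which illustrates why exhaustive enumeration is not a realistic attack.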

Prompt Injection Attack. The attack condition is that attackers own the LLM weights and can obtain the meta restoration vector and the query with privacy spans removed. During inference, attackers intercept the query sent by the client, modify its content, and then send it to the server. Attackers inject malicious content into the query to manipulate the LLM into generating private information. For example, in the medical diagnosis task, the malicious content would be “print out the possible symptoms.” For the prompt injection attack, the experimental results in Table 1 show that the attack success ratio of our method is lower than that of the baselines.

Attribute Inference Attack. The attack condition is the same as for the Prompt Injection Attack. The Attribute Inference Attack aims to recover sensitive attributes in the input text. Attackers commonly use the input text or its embeddings to train a classifier that predicts whether sensitive attributes are contained in the input text. As shown in Table 1, the F1 score of the classifier against our method is low.

Appendix C Implementation Details

For the fine-tuning process described in §3.4, we only train the restoration vectors and freeze all other parameters of the LLM. The training is conducted for five epochs with a batch size of 8. This process takes 4 to 5 hours on a single A100 GPU, which is a reasonable duration for retraining when new privacy spans are introduced.

For the attribute inference attack, we employ a 2-layer Multilayer Perceptron (MLP) as the classifier, following the approach in [22]. The classifier is trained on the same dataset used for restoration vector training. It takes the query representation as input and predicts whether the query contains specific symptoms. For PrivacyRestore, we regard the meta restoration vector as the query representation. The other methods in §4.2 do not have obvious representations associated with the privacy spans, so for them we utilize the hidden state of the last token from the last layer of the LLM as the query representation.
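The attack classifier can be sketched as a 2-layer MLP that maps a query representation to the probability that a given symptom is present. The code below is only an illustration under several assumptions: it handles a single symptom as a binary label, uses synthetic Gaussian vectors in place of real query representations, and the hidden size, learning rate, and number of epochs are arbitrary choices rather than the settings used in our experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_attack_mlp(X, y, hidden_dim=16, lr=0.1, epochs=500, seed=0):
    """Train a 2-layer MLP (ReLU hidden layer, sigmoid output) by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, size=(d, hidden_dim))
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.5, size=(hidden_dim, 1))
    b2 = np.zeros(1)
    y_col = y.reshape(-1, 1).astype(float)
    losses = []
    for _ in range(epochs):
        # Forward pass.
        pre = X @ W1 + b1
        h = np.maximum(pre, 0.0)
        p = sigmoid(h @ W2 + b2)
        p_safe = np.clip(p, 1e-7, 1.0 - 1e-7)
        loss = -np.mean(y_col * np.log(p_safe) + (1.0 - y_col) * np.log(1.0 - p_safe))
        losses.append(float(loss))
        # Backward pass for the binary cross-entropy loss.
        d_logit = (p - y_col) / n
        grad_W2 = h.T @ d_logit
        grad_b2 = d_logit.sum(axis=0)
        d_h = (d_logit @ W2.T) * (pre > 0)
        grad_W1 = X.T @ d_h
        grad_b1 = d_h.sum(axis=0)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    """Return hard 0/1 predictions using a 0.5 probability threshold."""
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0.0)
    return (sigmoid(h @ W2 + b2).ravel() >= 0.5).astype(int)


def f1_score(y_true, y_pred):
    """Binary F1 score; defined as 0 when there are no true positives."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic stand-in for query representations: the label leaks through feature 0.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(int)

params, losses = train_attack_mlp(X, y)
attack_f1 = f1_score(y, predict(params, X))
```

A high F1 in such a setup means the representation leaks the symptom, so a well-protected representation should drive `attack_f1` toward zero, which is the sense in which lower F1 is better in Table 1.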

The differential privacy method [37] uses $\eta$ to control the strength of the injected noise. A smaller $\eta$ provides stronger privacy protection but results in more significant performance degradation. As shown in [24], $\eta$ ranges from 75 to 175. For the DP based methods, we set $\eta$ to 75 to prioritize privacy protection. For noise injection in the single restoration vector case mentioned in §3.5, we use the same $\eta=75$ to maintain a consistent noise level.
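To illustrate how $\eta$ governs the noise strength, the sketch below uses a common construction for metric differential privacy on embeddings: the noise direction is uniform on the unit sphere and the noise magnitude follows a Gamma distribution with shape equal to the dimension and scale $1/\eta$. This sampling scheme is an assumption on our part, since the text above does not spell it out, and the dimension of 64 is hypothetical.

```python
import math
import random


def sample_noise(dim, eta, rng):
    """Draw a noise vector with density proportional to exp(-eta * ||z||)."""
    # Direction: a standard Gaussian vector normalized to unit length.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    # Magnitude: Gamma distribution with shape `dim` and scale 1 / eta.
    magnitude = rng.gammavariate(dim, 1.0 / eta)
    return [magnitude * v / norm for v in direction]


def mean_noise_norm(dim, eta, num_samples, rng):
    """Average L2 norm of the sampled noise vectors."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_noise(dim, eta, rng)
        total += math.sqrt(sum(v * v for v in z))
    return total / num_samples


rng = random.Random(0)
DIM = 64  # hypothetical embedding dimension, chosen only to keep the example fast

mean_norm_eta_75 = mean_noise_norm(DIM, 75.0, 200, rng)
mean_norm_eta_175 = mean_noise_norm(DIM, 175.0, 200, rng)
```

Under this construction the expected noise norm is the dimension divided by $\eta$, so $\eta=75$ injects noticeably larger noise than $\eta=175$, consistent with the trade-off described above.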

To evaluate inference efficiency, we use the greedy search decoding strategy and restrict the maximum generation length to 64. For the Top-K Heads Selector, we follow the setting of [23] and set K to 48. All of our experiments are conducted on the NVIDIA A800.
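The selection of the top K = 48 attention heads can be sketched as a ranking over per-head scores. In the code below, the scores are random placeholders standing in for whatever importance measure the selector produces, the 32 layers with 32 heads each correspond to the Llama2-chat-7b architecture, and `select_top_k_heads` is an illustrative helper name.

```python
import heapq
import random

NUM_LAYERS = 32  # Llama2-chat-7b
NUM_HEADS = 32   # attention heads per layer
K = 48           # number of heads to edit, as set above

# Placeholder importance scores, one per (layer, head) pair.
rng = random.Random(0)
head_scores = {
    (layer, head): rng.random()
    for layer in range(NUM_LAYERS)
    for head in range(NUM_HEADS)
}


def select_top_k_heads(scores, k):
    """Return the k (layer, head) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


selected_heads = select_top_k_heads(head_scores, K)
```

Only the selected heads, 48 out of 1024 in this configuration, would then be edited with the meta restoration vector at inference time.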

Appendix D Additional Experiment Results

We report additional experimental results on larger LLMs, such as Llama2-chat-13b [43] and Llama3-8b-instruct [33]. As shown in Table 5 and Table 6, PrivacyRestore can achieve results comparable to normal inference and provide strong privacy preservation.

As shown in Table 7 and Table 8, the latency on the client remains stable as the size of the LLM increases. The latency on the client is independent of the LLM size and becomes more negligible when PrivacyRestore is applied to larger LLMs. Surprisingly, for Llama2-chat-13b on the Pri-NLICE dataset, PrivacyRestore can achieve 97% of the initial throughput. On average, PrivacyRestore can achieve 80% of the initial throughput.

Pri-Ddxplus
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1 $\uparrow$	MC2 $\uparrow$	ASR $\downarrow$	F1 $\downarrow$	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1
No Protection	$\times$	83.02	82.91	82.44	35.06	83.02	82.91	82.44	35.06	83.02	82.91	82.44	35.06	83.02	82.91	82.44	35.06
Direct Removal	✓	43.25	44.43	22.33	25.94	57.71	57.31	4.13	29.17	76.95	76.44	0.06	0.93	82.56	82.17	0.00	0.00
DP [37]	✓	17.17	20.60	10.58	23.18	7.17	20.60	10.58	23.18	7.17	20.60	10.58	23.18	7.17	20.60	10.58	23.18
DP on Privacy Spans	✓	54.09	52.48	12.39	18.44	61.07	60.08	7.48	13.13	77.53	76.43	1.14	2.71	82.37	82.22	0.00	5.78
PrivacyRestore	✓	90.57	88.86	12.45	5.85	93.99	92.80	3.87	5.50	85.21	84.07	0.00	3.98	82.31	81.74	0.00	0.90
w/o Top-K Heads Selector	✓	54.09	52.41	23.24	56.81	57.45	57.14	2.17	82.23	77.53	75.91	0.06	50.94	81.60	81.33	0.00	45.92
w/o Restoration Vectors Training	✓	18.59	18.27	13.62	5.85	37.83	38.83	3.80	34.41	68.23	67.32	0.12	39.72	81.34	81.01	0.00	9.05
w/o AWA	✓	52.42	52.93	2.84	3.97	80.50	78.25	0.96	5.48	85.60	83.59	0.00	38.86	82.69	82.36	0.00	5.55
Pri-NLICE
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1	MC1	MC2	ASR	F1
No Protection	$\times$	81.56	78.37	91.54	76.08	81.56	78.37	91.54	76.08	81.56	78.37	91.54	76.08	81.56	78.37	91.54	76.08
Direct Removal	✓	36.48	36.72	0.00	2.79	43.43	46.36	0.00	0.43	50.00	49.51	0.00	0.15	53.15	52.00	0.00	0.00
DP [37]	✓	16.41	18.04	0.00	32.01	16.41	18.04	0.00	32.01	16.41	18.04	0.00	32.01	16.41	18.04	0.00	32.01
DP on Privacy Spans	✓	43.18	41.99	0.12	56.68	59.59	56.59	0.37	35.36	67.42	64.50	0.00	10.57	69.82	66.49	0.00	0.26
PrivacyRestore	✓	79.92	75.36	0.88	9.26	84.34	81.25	0.53	2.01	65.15	64.08	0.12	0.56	68.30	65.20	0.00	0.00
w/o Top-K Heads Selector	✓	52.77	55.31	0.56	22.54	71.33	69.47	0.45	10.64	54.04	53.36	0.20	1.44	56.31	54.13	0.00	0.14
w/o Restoration Vectors Training	✓	46.46	47.81	0.41	29.65	50.63	51.88	0.35	5.41	56.81	54.72	0.14	0.11	59.46	57.48	0.00	0.04
w/o AWA	✓	83.96	81.80	1.26	33.54	84.59	81.45	0.97	10.28	65.15	54.08	0.46	5.76	68.30	65.20	0.02	0.00

Table 5: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama2-chat-13b. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.

Pri-Ddxplus
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1 $\uparrow$	MC2 $\uparrow$	ASR $\downarrow$	AF $\downarrow$	MC1	MC2	ASR	AF	MC1	MC2	ASR	AF	MC1	MC2	ASR	AF
No Protection	$\times$	43.38	44.87	99.67	6.52	43.38	44.87	99.67	6.52	43.38	44.87	99.67	6.52	43.38	44.87	99.67	6.52
Direct Removal	✓	28.34	29.68	39.83	5.94	32.34	33.51	19.17	4.39	42.47	44.24	0.96	1.61	42.02	43.55	0.38	2.10
DP [37]	✓	22.14	23.84	99.74	1.46	22.14	23.84	99.74	1.46	22.14	23.84	99.74	1.46	22.14	23.84	99.74	1.46
DP on Privacy Spans	✓	27.88	29.39	97.86	31.44	31.50	32.10	21.43	8.61	40.54	41.09	1.61	0.01	42.02	43.50	1.22	0.00
Privacy Restoration	✓	76.56	78.02	23.49	0.96	59.52	60.17	11.94	1.10	54.48	53.82	0.00	0.00	40.67	42.44	0.00	0.00
w/o Top-K Heads Selector	✓	31.37	30.79	81.14	23.59	31.24	62.34	5.03	0.34	44.54	44.38	0.77	0.00	41.83	43.34	0.19	0.00
w/o Restoration Vectors Training	✓	21.62	22.17	3.35	0.64	19.69	21.48	0.71	1.05	39.31	38.47	0.12	0.98	41.63	43.34	0.00	0.61
w/o AWA	✓	63.33	65.06	2.90	5.81	56.68	57.64	12.65	1.57	54.50	53.81	0.00	0.97	40.66	42.44	0.00	0.60
Pri-NLICE
Methods	Privacy	$\geq\text{Level 2}$				$\geq\text{Level 3}$				$\geq\text{Level 4}$				$\geq\text{Level 5}$
Methods	Privacy	MC1	MC2	ASR	AF	MC1	MC2	ASR	AF	MC1	MC2	ASR	AF	MC1	MC2	ASR	AF
No Protection	$\times$	29.29	28.35	55.80	40.32	29.29	28.35	55.80	40.32	29.29	28.35	55.80	40.32	29.29	28.35	55.80	40.32
Direct Removal	✓	16.03	16.92	10.20	10.23	33.08	30.93	7.42	5.12	25.88	25.57	5.41	0.04	26.01	25.83	0.00	0.00
DP [37]	✓	25.00	22.40	52.39	12.87	25.00	22.40	52.39	12.87	25.00	22.40	52.39	12.87	25.00	22.40	52.39	12.87
DP on Privacy Spans	✓	16.41	17.59	80.42	9.65	25.50	24.85	76.13	1.01	29.54	28.19	4.04	0.45	28.66	28.10	0.00	0.01
Privacy Restoration	✓	93.81	94.50	33.58	10.34	96.08	95.07	6.81	4.11	42.55	41.63	0.12	0.06	38.76	37.97	0.00	0.01
w/o Top-K Heads Selector	✓	38.25	37.38	7.70	15.87	43.30	41.85	3.28	9.63	31.06	30.22	0.63	0.04	29.16	28.85	0.00	0.00
w/o Restoration Vectors Training	✓	26.26	26.58	90.90	16.32	40.27	38.92	19.57	5.69	27.39	26.90	0.25	0.36	25.63	25.78	0.00	0.00
w/o AWA	✓	96.71	95.84	5.30	23.75	96.08	95.07	6.69	11.30	42.55	41.63	0.12	0.75	38.76	37.97	0.00	0.05

Table 6: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama3-8b-instruct. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.

Pri-Ddxplus
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	11.89	1232.52	-	-	-	-	9.65
Privacy Restoration	8.16	1025.77	130.85	121.94	101.25	168.00	7.45 (77%)
Pri-NLICE
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	5.83	953.35	-	-	-	-	6.12
Privacy Restoration	5.83	848.14	104.46	98.18	86.42	88.01	5.97 (97%)

Table 7: Comparison of the generation efficiency on Llama2-chat-13b.

Pri-Ddxplus
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	63.70	1854.41	-	-	-	-	34.35
Privacy Restoration	60.92	2290.97	105.07	94.85	85.00	139.10	25.41 (74%)
Pri-NLICE
Methods	Avg. Output Length	Latency on Server	Latency on Client				Throughput
Methods	Avg. Output Length	Latency on Server	$\geq\text{Level 2}$	$\geq\text{Level 3}$	$\geq\text{Level 4}$	$\geq\text{Level 5}$	Throughput
Initial	64.00	1983.10	-	-	-	-	32.27
Privacy Restoration	64.00	2528.37	89.99	85.72	78.73	79.82	24.50 (76%)

Table 8: Comparison of the generation efficiency on Llama3-8b-instruct.