PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration

Ziqian Zeng1, Jianwei Wang1, Zhengdong Lu1, Huiping Zhuang1, Cen Chen1,2
1 South China University of Technology
2 Pazhou Laboratory, China
zqzeng@scut.edu.cn, wiwjwilliam@mail.scut.edu.cn
Equal contribution. Corresponding author.
Abstract

The widespread usage of online Large Language Model (LLM) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs to eavesdroppers or untrustworthy service providers. Existing privacy protection methods for LLMs suffer from insufficient privacy protection, performance degradation, or severe inference time overhead. In this paper, we propose PrivacyRestore to protect the privacy of user inputs during LLM inference. PrivacyRestore directly removes privacy spans in user inputs and restores privacy information via activation steering during inference. The privacy spans are encoded as restoration vectors. We propose Attention-aware Weighted Aggregation (AWA), which aggregates the restoration vectors of all privacy spans in the input into a meta restoration vector. AWA not only ensures proper representation of all privacy spans but also prevents attackers from inferring the privacy spans from the meta restoration vector alone. This meta restoration vector, along with the query with privacy spans removed, is then sent to the server. The experimental results show that PrivacyRestore can protect private information while maintaining acceptable levels of performance and inference efficiency.

1 Introduction

Large Language Models (LLMs) have emerged as powerful tools in various domains, including healthcare [6, 49], law [47, 10], and finance [46, 48]. With the exception of a very small portion of users who have the resources and expertise to deploy LLMs locally, the vast majority of users access and interact with these powerful models through online inference services.

However, the widespread usage of online LLM inference services has raised significant privacy concerns, especially regarding the risk of private information being leaked through user inputs when interacting with LLMs deployed on cloud platforms. User inputs often contain sensitive and proprietary information such as medical records, unpublished narrative works, and personal financial details. Threats may arise from eavesdroppers intercepting user queries during transmission to cloud platforms, as well as from LLM service providers themselves exploiting or misusing the sensitive data contained in user inputs for illicit purposes. For example, in sensitive domains like medical diagnosis, if a user’s input containing personal health information, such as "I was previously diagnosed with HIV, and lately I’ve been experiencing fever and diarrhea...", is disclosed, it may cause trouble in the individual’s personal life.

In this paper, we aim to protect the private information contained in user inputs during the LLM inference stage. In this setting, the client submits inputs to the server (also known as a service provider), and there is a risk that the inputs are exposed to attackers or to the server itself. Two existing categories of methods can protect user inputs: Secure Multi-Party Computation (SMPC) and Differential Privacy (DP). SMPC-based methods [16, 21, 25] utilize encryption protocols and techniques to enable collaborative computation without revealing original data to others. In DP-based methods in a local privacy setting [30, 37, 24], users apply text-to-text privatization [13] to their data locally before publishing it. SMPC methods still have severe inference time overhead, making them impractical for real-time applications. For example, running a single inference pass on RoBERTa-Base [29] requires 168.43 seconds [17]. For DP methods, imposing privacy protection inevitably degrades the performance of downstream tasks, which is known as the privacy-utility trade-off. Hence, there is a need to develop privacy-preserving methods that can effectively safeguard the privacy of user inputs while maintaining high-quality outputs, without incurring prohibitive computational costs.

We propose PrivacyRestore which directly removes privacy spans in user inputs and restores privacy information via activation steering [23, 44, 18] during model inference. The privacy spans are encoded as vectors and securely transmitted to the service provider. Our method is based on two key assumptions: (a) Private information is confined within specific spans, termed “privacy spans” rather than being dispersed throughout the entire input; (b) In a particular domain, the number of potential privacy spans is limited and finite. These assumptions are reasonable in many scenarios. For instance, in the medical diagnosis domain, private information typically relates to symptoms and disease names, expressed through specific spans such as “leakage of urine” and “HIV”. It is difficult for attackers to disclose private information if privacy spans are removed. Moreover, despite the inevitable evolution of medical knowledge and terminology, the core set of symptoms and diseases remains relatively stable and finite.

During inference, users submit a plain text with privacy spans removed as well as a meta restoration vector which encodes all privacy spans in the input. The meta restoration vector is used for activation steering at specific attention heads to implicitly restore privacy information. At the core of PrivacyRestore is our Attention-aware Weighted Aggregation (AWA), which estimates the importance of each privacy span in the input and takes a weighted sum of the restoration vectors of all privacy spans as the meta restoration vector. AWA not only ensures proper representation of all privacy spans but also prevents attackers from inferring the privacy spans from the meta restoration vector alone. PrivacyRestore is a plug-and-play method in which only the restoration vectors are trainable while the LLM is frozen. Once the training of restoration vectors is completed, a restoration bank is constructed. The purpose of this restoration bank is to provide the meta restoration vector when given a query and a set of privacy spans. During inference, users retrieve the meta restoration vector locally from the restoration bank. The experimental results show that the proposed method can protect private information while maintaining acceptable levels of performance and inference efficiency.

The contributions of our paper are summarized as follows,

  • We propose a plug-and-play privacy protection method that removes privacy spans in the input and restores privacy information via activation steering during inference.

  • We propose Attention-aware Weighted Aggregation to construct the meta restoration vector which ensures proper representation of all privacy spans in the input and prevents attackers from inferring the privacy spans from the meta restoration vector alone.

  • We construct two datasets for the medical diagnosis task to evaluate our method. The results show that PrivacyRestore can protect private information while maintaining acceptable levels of performance and inference efficiency.

2 Related Works

In this section, we mainly introduce the relevant work on user inputs protection methods and activation steering as follows.

2.1 Privacy-Preserving Methods for User Inputs Protection during Inference

In some sensitive domains, when using LLMs deployed on cloud platforms, user inputs and queries containing sensitive information need to be properly protected. There are mainly two privacy-preserving methods that can be applied to protect user inputs in the inference stage: differential privacy and secure multi-party computation. Federated learning [54, 50, 8, 14] is applied to protect the privacy of training data, so it falls outside the scope of our discussion.

Differential Privacy. Differential Privacy (DP) can be categorized into two settings [37, 24]: Centralized DP (CDP) and Local DP (LDP). LDP assumes that only the client is trustworthy and privatizes the data locally, while CDP assumes a trustworthy data curator that collects raw data and protects it from privacy leakage. Under the CDP setting, various DP-based methods have been explored for pre-training [51, 20] and fine-tuning [52, 38] LLMs in sensitive domains. Under the LDP setting, $d_{\mathcal{X}}$-privacy [3, 5] is applied to privatize user inputs for local privacy protection [37] during inference. RAPT [24] proposes privacy-preserving prompt tuning for LLMs on the basis of local DP. In addition, some works [53, 32, 42] have applied DP to synthetic text generation, enabling the model to generate text with formal privacy guarantees.

Secure Multi-Party Computation. Secure multi-party computation (SMPC) based methods utilize multi-party encryption technology for collaborative computation among multiple parties while protecting the privacy of each party's data. SMPC methods based on model structure optimization aim to replace SMPC-unfriendly nonlinear operations with SMPC-friendly operators. MPC-Former [21] approximates nonlinear operations in the Transformer using polynomials and maintains performance through model distillation. MERGER [25] integrates previous techniques into natural language generation (NLG) tasks by bypassing embedding computation and reorganizing linear operations in Transformer modules, further improving computational efficiency and model performance. Methods based on SMPC protocol optimization focus on designing efficient SMPC operators for nonlinear operations in LLMs while maintaining the original model structure. Some recent works [16, 27, 55, 15] improve the efficiency of nonlinear operations in privacy-preserving LLM inference by utilizing multiple SMPC protocols such as garbled circuits and function secret sharing.

2.2 Activation Steering

Activation steering aims to control frozen LLMs to generate the desired text by searching for or constructing vectors that intervene on the model’s activations during inference. PPLM [9] uses a separate classifier to perturb model activations, which controls the topics and sentiment styles of the text generated by the model. Z-Steer [39] extracts sentence-specific steering vectors from frozen LLMs through gradient descent, steering model generation to near-perfect BLEU scores [35] and enabling unsupervised style transfer. ITI [23] computes intervention vectors based on the activation distributions of true and false statements and selects the attention heads to intervene on according to the accuracy of a linear probe. Different from ITI [23], another work [44] computes intervention vectors solely from the activation differences of prompt pairs. REMEDI [18] inspects and edits knowledge representations in LLMs by learning attribute encodings of entities.

3 The Proposed Method

We first introduce the threat model in §3.1 and the overview of PrivacyRestore in §3.2 respectively. Subsequently, we provide detailed illustrations of the three components of PrivacyRestore, namely, edited attention heads identification, restoration vectors training, and restoration vectors aggregation. We will introduce the inference process in §3.6. Additionally, we introduce possible attack scenarios that PrivacyRestore might encounter in Appendix B.

3.1 Threat Model

We consider a threat model that includes two parties: an untrustworthy server holding LLM weights, and a client holding user inputs containing private information. The LLM weights on the server are publicly available. Attackers aim to steal private information in user inputs from the client during online LLM inference that takes place on the server side.

3.2 The Overview of PrivacyRestore

The fundamental idea of PrivacyRestore is removing privacy spans in user inputs and restoring privacy information via activation steering [23, 44, 18] during model inference. An overview of PrivacyRestore is depicted in Figure 1.

Figure 1: The PrivacyRestore framework consists of three procedures. (1) Edited Attention Heads Identification. This procedure aims to identify the attention heads that, after being edited with restoration vectors, achieve the most effective restoration. Different privacy spans may require different attention heads for effective restoration. We propose the Top-K Heads Selector to obtain the common top-K heads set. (2) Restoration Vectors Training. The training objective is to align the predictions given the input with privacy spans removed with the predictions given an intact input. Restoration vectors are trainable while the LLM is frozen. (3) Attention-aware Weighted Aggregation (AWA). AWA is responsible for aggregating the restoration vectors of all privacy spans in the input and generating a meta restoration vector. This meta restoration vector is then utilized to edit the attention heads of the LLM. During inference, the user sends a query with privacy spans removed, along with the meta restoration vector, to the server.

3.3 Edited Attention Heads Identification

This procedure aims to identify the attention heads that, after being edited with restoration vectors, can achieve the most effective restoration effect.

According to [12, 45], the transformer architecture uses the attention mechanism to capture the relationship between the current token and its surrounding context. The attention mechanism can be expressed as:

$$\mathbf{U}^{l,h} = \text{Attn}^{l,h}(\mathbf{X}^{l,h}), \tag{1}$$

where $\mathbf{X}^{l,h}$ is the sequence input of the $h$-th head in the $l$-th layer and $\mathbf{U}^{l,h}$ is the sequence output. Directly removing privacy spans leaves the context incomplete, so the heads capture insufficient relationships and produce wrong outputs. Following activation steering methods [23, 7], we inject restoration vectors into the outputs of each attention head to restore private information:

$$\mathbf{y}^{l,h} = \mathbf{u}^{l,h} + \mathcal{R}^{l,h}, \tag{2}$$

where $\mathbf{u}^{l,h} = \mathbf{U}^{l,h}[-1]$ is the output of the last token and $\mathbf{y}^{l,h}$ is the output after privacy restoration. $\mathcal{R}^{l,h}$ is the aggregated restoration vector for the $h$-th head in the $l$-th layer. It aggregates the restoration vectors of all privacy spans in the user input and is introduced in detail in §3.5.
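As a minimal sketch of Eq. (2), assuming a head's output is stored as a plain list of per-token vectors, the restoration vector is added to the last token only. The helper name `steer_last_token` and the toy values are illustrative assumptions, not part of the original method description.

```python
from typing import List

Vector = List[float]


def steer_last_token(head_output: List[Vector], restoration: Vector) -> List[Vector]:
    """Return a copy of U^{l,h} whose last row is u^{l,h} + R^{l,h} (Eq. 2)."""
    if len(head_output[-1]) != len(restoration):
        raise ValueError("restoration vector must match the head dimension")
    steered = [row[:] for row in head_output]  # earlier tokens stay untouched
    steered[-1] = [u + r for u, r in zip(head_output[-1], restoration)]
    return steered


# Toy head output for a 3-token sequence with head dimension 2 (made-up values).
U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
R = [1.0, -1.0]
Y = steer_last_token(U, R)
```

Only the final row of `Y` differs from `U`, which mirrors the definition $\mathbf{u}^{l,h} = \mathbf{U}^{l,h}[-1]$.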

As pointed out by activation steering methods [23, 7], applying restoration vectors to all attention heads in the LLM degrades model performance. As shown in the first part of Figure 1, we employ the probe technique [2, 41, 4] to ascertain the most salient heads associated with each privacy span. We train a separate classifier for each head, tailored to each privacy span $s$, formulated as

$$\mathcal{F}_s^{l,h}(\mathbf{u}^{l,h}) = \sigma(\theta \cdot \mathbf{u}^{l,h}), \tag{3}$$

where $\mathbf{u}^{l,h}$ is the output of the last token on the $h$-th head of the $l$-th layer, and $\theta$ denotes the parameters of the classifier. In our setting, the classifier discriminates whether the target privacy span $s$ appears in the query. A higher classifier accuracy for an attention head suggests a stronger correlation between that head and the target privacy span; therefore, we select the top $K$ attention heads with the highest accuracies for each privacy span.
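The probing step can be sketched as follows, under several assumptions: the probe of Eq. (3) is fit with plain gradient descent, accuracy is measured on the training samples for brevity (a held-out split would normally be used), and the synthetic features stand in for real head outputs. The names `probe_accuracy` and `top_k_heads` are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def sigmoid(z: float) -> float:
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_accuracy(feats: List[List[float]], labels: List[int],
                   epochs: int = 200, lr: float = 0.5) -> float:
    """Fit sigma(theta . u) by gradient descent and return its accuracy."""
    dim = len(feats[0])
    theta = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for u, y in zip(feats, labels):
            err = sigmoid(sum(t * x for t, x in zip(theta, u))) - y
            for d in range(dim):
                grad[d] += err * u[d]
        theta = [t - lr * g / len(feats) for t, g in zip(theta, grad)]
    hits = sum(
        int((sum(t * x for t, x in zip(theta, u)) > 0) == bool(y))
        for u, y in zip(feats, labels)
    )
    return hits / len(feats)


def top_k_heads(acc: Dict[Head, float], k: int) -> List[Head]:
    """Heads sorted by probe accuracy in descending order, truncated to k."""
    return sorted(acc, key=lambda h: acc[h], reverse=True)[:k]


# Synthetic data: label 1 means the privacy span appears in the query.
rng = random.Random(0)
labels = [i % 2 for i in range(100)]
heads = {
    # Head (0, 0) encodes the label in its first dimension.
    (0, 0): [[(2.0 if y else -2.0) + rng.gauss(0, 0.3), rng.gauss(0, 1)]
             for y in labels],
    # Head (0, 1) carries no information about the label.
    (0, 1): [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in labels],
}
accuracy = {h: probe_accuracy(x, labels) for h, x in heads.items()}
best = top_k_heads(accuracy, k=1)
```

In this toy setup the informative head is ranked first, which is the behavior the probe-based selection relies on.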

Considering the privacy protection scenario, using different top-K head sets for different privacy spans risks privacy leakage. If different privacy spans have different edited attention heads, the attacker can infer a specific privacy span from which heads are edited. For example, an attacker might observe that only the heads set of the privacy span “HIV” contains the $j$-th head, so the appearance of the $j$-th head indicates the presence of “HIV” in the user input. We propose an algorithm named Top-K Heads Selector that combines the top-K heads sets of all privacy spans into a common top-K heads set $\mathcal{H}_c$ according to statistical characteristics. Firstly, we initialize an empty score list $L_h$ for each head. Secondly, each privacy span $s$ has its corresponding top-K heads set $\mathcal{H}_k^s$. For each head $h$ in $\mathcal{H}_k^s$, we append the score $K - \text{SortedIndex}(h, \mathcal{H}_k^s)$ to its score list $L_h$. Thirdly, we calculate the average value of each score list $L_h$ as the score of the corresponding head. Finally, we sort all heads in the LLM by their scores and pick the top-K heads as the common top-K heads set $\mathcal{H}_c$. The detailed algorithm is described in Algorithm 1.

Algorithm 1 Top-K Heads Selector

Input: $\mathcal{S}$ is the set of privacy spans; $\mathcal{H}_a$ is the set of all heads; $\mathcal{H}_k$ is the set of top-K heads sets; $\mathcal{H}_k^s$ is the top-K heads set of the privacy span $s$; $\text{SortedIndex}(h, \mathcal{H}_k^s)$ is the function that sorts $\mathcal{H}_k^s$ in descending order of classifier accuracy and returns the index of head $h$.

1:  Initialize an empty score list $L_h = [\,]$ for each head $h$ in $\mathcal{H}_a$.
2:  for $\mathcal{H}_k^s$ in $\mathcal{H}_k$ do
3:     for $h$ in $\mathcal{H}_k^s$ do
4:        Append $K - \text{SortedIndex}(h, \mathcal{H}_k^s)$ to $L_h$.
5:     end for
6:  end for
7:  for $h$ in $\mathcal{H}_a$ do
8:     $\text{score}_h = \text{average}(L_h)$
9:  end for
10:  Sort $\mathcal{H}_a$ according to $\text{score}_h$ and select the top-K heads to obtain the common top-K heads set $\mathcal{H}_c$.

Output: $\mathcal{H}_c$ is the common top-K heads set.
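Algorithm 1 can be sketched in a few lines of Python. The sketch makes three assumptions that the text does not state: SortedIndex is 0-based, a head that appears in no per-span set receives score 0, and ties are broken by the order of `all_heads`. The function name `common_top_k_heads` and the example spans are illustrative.

```python
from typing import Dict, Hashable, List

Head = Hashable


def common_top_k_heads(top_k_sets: Dict[str, List[Head]],
                       all_heads: List[Head], k: int) -> List[Head]:
    """Merge per-span top-K head lists into one common top-K heads set.

    Each list in `top_k_sets` is assumed to be sorted by probe accuracy in
    descending order, so the list position plays the role of SortedIndex.
    """
    scores: Dict[Head, List[int]] = {h: [] for h in all_heads}
    for ranked in top_k_sets.values():
        for index, head in enumerate(ranked[:k]):
            scores[head].append(k - index)
    # Assumption: a head that appears in no per-span set gets score 0.
    mean = {h: (sum(v) / len(v) if v else 0.0) for h, v in scores.items()}
    # sorted() is stable, so ties keep the order of `all_heads`.
    return sorted(all_heads, key=lambda h: mean[h], reverse=True)[:k]


all_heads = [(0, 0), (0, 1), (1, 0), (1, 1)]
per_span = {
    "HIV": [(0, 1), (1, 0)],
    "fever": [(0, 1), (1, 1)],
    "diarrhea": [(1, 0), (0, 1)],
}
common = common_top_k_heads(per_span, all_heads, k=2)
```

Here head `(0, 1)` collects scores 2, 2, 1 and head `(1, 0)` collects 1, 2, so both outrank `(1, 1)` (score 1) and `(0, 0)` (score 0). Every privacy span then shares the same edited heads, so the set of edited heads itself reveals nothing about which span is present.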

3.4 Restoration Vectors Training

Restoration vectors are designed to restore the information contained within the removed privacy spans. The training objective is to align the predictions given the input with privacy spans removed to be the same as the predictions given an intact input.

As illustrated in the second part of Figure 1, we generate training samples by appending the gold response as the positive answer $a_p$ to user queries. The invasive injection of restoration vectors modifies the attention activation and may adversely affect the instruction-following capability of the LLM. To maintain the instruction-following capability, we include an empty response as the negative answer $a_n$ for each sample. Consequently, we obtain the training dataset $D = \{(q, a_p, a_n) \mid q \in Q, a_p \in A, a_n = \text{<EOS>}\}$, where $q$ denotes the query with privacy spans removed, $Q$ denotes the query set, $a_p$ denotes the positive answer, $A$ signifies the gold answer set, $a_n$ denotes the negative answer, and <EOS> is the token indicating the end of a sequence.

For each privacy span $s \in \mathcal{S}$, there is a trainable restoration vector $r_s^{l,h}$ for each head $h$ in the common top-K heads set $\mathcal{H}_c$. Restoration vectors of all privacy spans are the only trainable parameters of our method. The LLM weights are frozen. Our method is plug-and-play and parameter-efficient for training. We fine-tune these restoration vectors using the ORPO loss proposed by [19]:

$$\log P(a \mid q; \Theta) = \frac{1}{|a|} \sum_{i=1}^{|a|} \log P(a_i \mid q, a_{<i}; \Theta), \tag{4}$$

$$\text{ratio}(a \mid q; \Theta) = \frac{P(a \mid q; \Theta)}{1 - P(a \mid q; \Theta)}, \tag{5}$$

$$\mathcal{L}_{\text{ORPO}} = -\log P(a_p \mid q; \Theta) - \log \sigma\left(\log \frac{\text{ratio}(a_p \mid q; \Theta)}{\text{ratio}(a_n \mid q; \Theta)}\right), \tag{6}$$

where $q$ denotes the query with privacy spans removed and $\Theta$ is the set of restoration vectors corresponding to the privacy spans present in the input. $a$ is a response, which can be the positive response $a_p$ or the negative response $a_n$; $a_i$ is its $i$-th token and $|a|$ is its length. $\mathcal{L}_{\text{ORPO}}$ simultaneously encourages generating correct responses and diminishes the occurrence of meaningless empty responses.
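To make Eqs. (4)-(6) concrete, the following sketch evaluates the loss from per-token log-probabilities. It assumes those log-probabilities have already been produced by the frozen LLM with restoration vectors injected, and that every averaged log-probability is strictly negative so the odds are finite. The function names and the numeric inputs are illustrative.

```python
import math
from typing import List


def avg_log_prob(token_log_probs: List[float]) -> float:
    """Eq. (4): length-normalised log-likelihood of a response."""
    return sum(token_log_probs) / len(token_log_probs)


def log_odds(log_p: float) -> float:
    """Log of Eq. (5): log(P / (1 - P)) with P = exp(log_p) in (0, 1)."""
    return log_p - math.log1p(-math.exp(log_p))


def orpo_loss(pos_token_log_probs: List[float],
              neg_token_log_probs: List[float]) -> float:
    """Eq. (6): NLL of the positive answer plus the odds-ratio term."""
    log_p_pos = avg_log_prob(pos_token_log_probs)
    log_p_neg = avg_log_prob(neg_token_log_probs)
    margin = log_odds(log_p_pos) - log_odds(log_p_neg)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    return -log_p_pos + math.log1p(math.exp(-margin))


# Gold answer likely, empty <EOS> answer unlikely: small loss.
confident = orpo_loss([math.log(0.8), math.log(0.9)], [math.log(0.05)])
# Gold answer unlikely, empty answer likely: large loss.
confused = orpo_loss([math.log(0.2), math.log(0.1)], [math.log(0.6)])
```

The loss shrinks as the positive answer becomes more likely and the empty answer less likely, which is the behavior described for $\mathcal{L}_{\text{ORPO}}$.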

3.5 Attention-aware Weighted Aggregation

User inputs commonly contain multiple privacy spans that may lead to privacy leakage. Some privacy spans may have a critical impact on the final outputs, while others may not be as significant. Aggregating all restoration vectors with equal weights may weaken the impact of the critical privacy spans and enlarge the impact of irrelevant ones. Therefore, we propose a novel method called Attention-aware Weighted Aggregation (AWA), which estimates a weight for each privacy span and then takes the weighted sum of the restoration vectors as the aggregation. To discern the importance of different privacy spans under limited computing resources, we employ a tiny model (such as BERT [11]) on the client side to calculate the importance weights.

As depicted in the third panel of Figure 1, the first step of AWA is computing the attention score of the last token $t$ attending to a privacy span $s$, yielding importance weights denoted as $w_s$:

$$\text{Attn}^{h}(s_j, x_t) = \text{Softmax}\left(x_t \mathbf{W}_Q^{h} \cdot (s_j \mathbf{W}_K^{h})^{T}\right), \tag{7}$$

$$w_s = \frac{1}{|s|} \sum_{j=1}^{|s|} \frac{1}{|H|} \sum_{h \in H} \text{Attn}^{h}(s_j, x_t), \tag{8}$$

where $\mathbf{W}_Q^{h}$ and $\mathbf{W}_K^{h}$ are the query and key projection matrices of head $h$ in the last layer of the tiny model, respectively. $x_t$ and $s_j$ represent the hidden states of the last token $t$ and of the $j$-th token of privacy span $s$ in the last layer, respectively. $H$ denotes all attention heads in the last layer. $|s|$ denotes the number of tokens contained in $s$.
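A small sketch of the importance weights in Eqs. (7)-(8) follows. It assumes the projected vectors $x_t \mathbf{W}_Q^{h}$ and $x_i \mathbf{W}_K^{h}$ are already available from the tiny model, that the softmax normalises over all tokens of the input, and that no extra scaling factor is applied (Eq. (7) shows none). The function names, span positions and numbers are illustrative.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weights(queries: List[Vector], keys: List[List[Vector]],
                       spans: Dict[str, List[int]]) -> Dict[str, float]:
    """Mean attention from the last token to each privacy span.

    queries[h] is x_t W_Q^h; keys[h][i] is x_i W_K^h for token i.
    spans maps each privacy span to the positions of its tokens.
    """
    num_heads = len(queries)
    # attn[h][i]: attention of the last token to token i in head h (Eq. 7).
    attn = [
        softmax([sum(a * b for a, b in zip(q, k)) for k in head_keys])
        for q, head_keys in zip(queries, keys)
    ]
    weights = {}
    for span, positions in spans.items():
        per_token = [sum(attn[h][j] for h in range(num_heads)) / num_heads
                     for j in positions]
        weights[span] = sum(per_token) / len(positions)  # Eq. (8)
    return weights


# Two heads, four tokens. Head 0 focuses on token 1, head 1 is uniform.
queries = [[1.0, 0.0], [0.0, 0.0]]
keys = [
    [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
spans = {"HIV": [1], "fever": [2]}
w = importance_weights(queries, keys, spans)
```

The span that the last token attends to more strongly ("HIV" in this toy input) receives the larger weight.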

After obtaining importance weights of multiple privacy spans in the query, we calculate the aggregated restoration vector l,hsuperscript𝑙\mathcal{R}^{l,h}caligraphic_R start_POSTSUPERSCRIPT italic_l , italic_h end_POSTSUPERSCRIPT of the l𝑙litalic_l-th layer and hhitalic_h-th head as follows,

l,hsuperscript𝑙\displaystyle\mathcal{R}^{l,h}caligraphic_R start_POSTSUPERSCRIPT italic_l , italic_h end_POSTSUPERSCRIPT =\displaystyle== {rsl,h+ϵif |𝒮q|=1,s𝒮qwsrsl,hotherwise ,casessuperscriptsubscript𝑟𝑠𝑙italic-ϵif subscript𝒮𝑞1subscript𝑠subscript𝒮𝑞subscript𝑤𝑠superscriptsubscript𝑟𝑠𝑙otherwise \displaystyle\begin{cases}r_{s}^{l,h}+\epsilon&\text{if }|\mathcal{S}_{q}|=1,% \\ \sum_{s\in\mathcal{S}_{q}}w_{s}\cdot r_{s}^{l,h}&\text{otherwise },\end{cases}{ start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l , italic_h end_POSTSUPERSCRIPT + italic_ϵ end_CELL start_CELL if | caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT | = 1 , end_CELL end_ROW start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ⋅ italic_r start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l , italic_h end_POSTSUPERSCRIPT end_CELL start_CELL otherwise , end_CELL end_ROW (9)

where $\mathcal{S}_{q}$ denotes the set of privacy spans contained in the query and $|\mathcal{S}_{q}|$ is the number of privacy spans. When the query contains a single privacy span, the aggregated restoration vector would be the restoration vector $r_{s}^{l,h}$ itself, so an attacker who intercepts $\mathcal{R}^{l,h}$ could easily deduce the plain text of the privacy span. We therefore inject noise $\epsilon$ into the restoration vector, making it difficult for attackers to infer the plain text of the privacy span from $\mathcal{R}^{l,h}$.
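The two cases of Eq. 9 can be sketched for a single (layer, head) pair as follows. This is a minimal illustration, not the authors' implementation: the function name, the Gaussian form of the noise $\epsilon$, and the `noise_std` parameter are assumptions, since the text does not specify the noise distribution.

```python
import numpy as np

def aggregate_restoration_vectors(vectors, weights, noise_std=0.1, rng=None):
    """Aggregate per-span restoration vectors for one (layer, head) pair (Eq. 9).

    vectors: array of shape (num_spans, dim), one restoration vector per span.
    weights: array of shape (num_spans,), the importance weights w_s.
    noise_std: hypothetical noise scale; the text does not give a value.
    """
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if vectors.ndim != 2 or weights.shape != (vectors.shape[0],):
        raise ValueError("expected vectors (num_spans, dim) and weights (num_spans,)")
    if vectors.shape[0] == 1:
        # Single privacy span: return the vector plus noise (assumed Gaussian here).
        rng = np.random.default_rng() if rng is None else rng
        return vectors[0] + rng.normal(0.0, noise_std, size=vectors.shape[1])
    # Several privacy spans: importance-weighted sum of the restoration vectors.
    return weights @ vectors
```

With two spans weighted 0.25 and 0.75, the result is simply the weighted sum of the two rows; with one span, the output differs from the stored vector only by the injected noise.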

The client side maintains a local Restoration Bank responsible for providing the meta restoration vector based on a given query and a set of privacy spans. The Restoration Bank comprises three key components: the aggregation algorithm AWA, restoration vectors for each privacy span, and a common top-K heads set. The meta restoration vector can be obtained by concatenating the $\mathcal{R}^{l,h}$ for all heads in the common top-K heads set. The server is aware of the specific order of concatenation.
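The concatenation step can be sketched as below. The function names and the dictionary layout keyed by (layer, head) are hypothetical, and the server-side split assumes that every head vector has the same dimension.

```python
import numpy as np

def build_meta_restoration_vector(per_head_vectors, top_k_heads):
    """Client side: concatenate the aggregated vectors in the agreed head order.

    per_head_vectors: dict mapping (layer, head) -> 1-D array.
    top_k_heads: ordered list of (layer, head) pairs shared with the server.
    """
    return np.concatenate(
        [np.asarray(per_head_vectors[key], dtype=float) for key in top_k_heads]
    )

def split_meta_restoration_vector(meta_vector, top_k_heads):
    """Server side: recover one vector per head, using the same head order."""
    rows = np.asarray(meta_vector, dtype=float).reshape(len(top_k_heads), -1)
    return dict(zip(top_k_heads, rows))
```

Because both sides iterate over the same ordered head list, the server can map each slice of the meta restoration vector back to its head without any extra metadata in the message.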

3.6 Inference

As depicted in the third panel of Figure 1, during LLM inference, on the client side, users are allowed to select specific spans (e.g., black stools, pale skin, blood in stool) which they wish to keep undisclosed. These selected spans are considered as privacy spans. The client side obtains the meta restoration vector from the Restoration Bank. Then, the client side provides queries with privacy spans removed and the meta restoration vector to LLMs deployed on the server side. On the server side, the meta restoration vector will be injected into attention heads of LLMs in the common top-K heads set. Finally, the client receives response outputs generated by the LLM.
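The client and server steps described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, span removal is done by plain case-insensitive string matching, and the server-side injection is modeled as adding each vector to the output of its attention head, which is a common form of activation steering and may differ from the exact editing rule used in the paper.

```python
import re
import numpy as np

def remove_privacy_spans(query, privacy_spans):
    """Client side: delete every user-selected privacy span from the query."""
    redacted = query
    # Remove longer spans first so that a short span which is a substring of a
    # longer one does not break the longer match.
    for span in sorted(privacy_spans, key=len, reverse=True):
        redacted = re.sub(re.escape(span), "", redacted, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s{2,}", " ", redacted).strip()

def inject_restoration(head_outputs, restoration_by_head):
    """Server side (assumed additive steering): add each vector to its head output.

    head_outputs: array of shape (num_layers, num_heads, head_dim).
    restoration_by_head: dict mapping (layer, head) -> vector of shape (head_dim,).
    """
    steered = np.array(head_outputs, dtype=float, copy=True)
    for (layer, head), vector in restoration_by_head.items():
        steered[layer, head] += vector
    return steered
```

Only the redacted query and the meta restoration vector leave the client; heads outside the common top-K heads set are left unchanged by the injection.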

4 Experiments

4.1 Experiments setup

Backbones. We use Llama2-chat-7b [43] as the LLM deployed on the server side. We utilize BERT-base [11] as the tiny model on the client side for the attention-aware weighted aggregation described in §3.5. We also report results on other LLMs, namely Llama2-chat-13b [43] and Llama3-8b-instruct [33], in Appendix D.

Datasets. We mainly focus on protecting user inputs in the medical diagnosis task, and we consider symptoms in user inputs as privacy spans. In real-world scenarios, some symptoms may be highly sensitive, while others may be less sensitive or non-sensitive. Existing benchmarks for the medical diagnosis task, such as Ddxplus [40] and NLICE [1], lack privacy levels for symptoms. We therefore utilized GPT-3.5 [34] to rate the privacy level of every possible symptom in the Ddxplus [40] and NLICE [1] benchmarks. The query prompt for rating privacy levels is shown in Appendix A.2. Each symptom is rated from level 1 (non-sensitive information) to level 5 (highly sensitive information), with higher levels indicating greater sensitivity; the definitions of the privacy levels are provided in Appendix A.1. Leveraging the source datasets (Ddxplus and NLICE) and incorporating the privacy levels of symptoms, we curated two privacy-preserving datasets, named Pri-Ddxplus and Pri-NLICE, respectively. There are 149 and 70 types of privacy spans in Pri-Ddxplus and Pri-NLICE, respectively. The detailed construction process and statistics of the two datasets are provided in Appendix A.

Metrics. The evaluation of privacy protection should consider three aspects: model performance, the effectiveness of privacy protection, and inference efficiency. We evaluate the model's performance using two metrics: MC1 and MC2 [26]. We assign each sample in Pri-Ddxplus and Pri-NLICE four options, including one correct diagnosis and three incorrect ones. To calculate MC1, we compute the accuracy, i.e., the proportion of questions answered correctly out of the total number of questions, where the option with the highest probability is taken as the answer. To calculate MC2, we compute the normalized probability of the true answer.
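The two performance metrics can be sketched as below. This is a minimal illustration that assumes the model's score for each option is available as a log-probability and that "normalized" means normalizing over the four options of a sample; the function name and input layout are hypothetical.

```python
import numpy as np

def mc1_mc2(option_log_probs, correct_index):
    """Compute MC1 (accuracy) and MC2 (normalized probability of the true answer).

    option_log_probs: array of shape (num_samples, num_options) with the
        log-probability the model assigns to each option.
    correct_index: array of shape (num_samples,) with the index of the correct option.
    """
    log_probs = np.asarray(option_log_probs, dtype=float)
    correct = np.asarray(correct_index)
    # MC1: the highest-probability option is taken as the model's answer.
    mc1 = float(np.mean(np.argmax(log_probs, axis=1) == correct))
    # MC2: renormalize over the options, then read off the true answer's share.
    shifted = log_probs - log_probs.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    mc2 = float(np.mean(probs[np.arange(len(correct)), correct]))
    return mc1, mc2
```

MC1 only rewards the top-ranked option, whereas MC2 also credits probability mass placed on the correct diagnosis when another option ranks first.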

To evaluate the effectiveness of privacy protection, we construct two commonly used privacy attacks targeted at LLMs, Prompt Injection Attack [36, 28] and Attribute Inference Attack [31, 22]. For prompt injection attack, attackers intercept the medical query and inject extra harmful prompts to induce the model to generate the privacy symptoms of patients. If the model output contains privacy symptoms, we consider it as a successful attack. We compute the ratio of successful attacks, namely ASR (Attack Success Ratio). For attribute inference attack, attackers intend to build up a simple multi-label classifier to map representations of the medical query into corresponding privacy symptoms. We compute the F1 score of the classifier. The lower the ASR and F1, the higher the effectiveness of privacy protection.
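The two privacy metrics can be sketched as follows. The function names are hypothetical, a successful attack is detected here by case-insensitive substring matching, and the F1 score is micro-averaged over label sets; the text does not state the matching rule or the averaging mode, so both are assumptions.

```python
def attack_success_ratio(model_outputs, privacy_symptoms):
    """Fraction of outputs that contain at least one of the sample's privacy symptoms.

    model_outputs: list of generated strings, one per attacked query.
    privacy_symptoms: list of lists of symptom strings, aligned with model_outputs.
    """
    if not model_outputs:
        return 0.0
    successes = 0
    for output, symptoms in zip(model_outputs, privacy_symptoms):
        text = output.lower()
        if any(symptom.lower() in text for symptom in symptoms):
            successes += 1
    return successes / len(model_outputs)

def micro_f1(predicted, gold):
    """Micro-averaged F1 between predicted and gold symptom sets (one set per sample)."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For both functions a lower value means that the attacker recovers fewer privacy symptoms.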

To evaluate inference efficiency, we measure latency and throughput. Latency is measured from the moment the user submits the query until the answer is generated. As different samples have varying lengths, we calculate the average latency per sample and provide the average output length. Latency is further divided into two parts: namely, latency on server side and latency on client side. The generation of meta restoration vector is conducted on the client side. Hence, it is necessary to investigate the latency of generating the meta restoration vector. Lower latency indicates higher inference efficiency. In addition to latency, we compute the throughput, which denotes the number of tokens generated by the model per second. Higher throughput indicates higher inference efficiency.
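A measurement loop for these quantities can be sketched as below. It is a simplified illustration: `generate` is a hypothetical callable that returns the list of generated tokens for a query, and the sketch times the whole call rather than separating client-side and server-side latency.

```python
import time

def measure_inference(generate, queries):
    """Return average latency (ms), average output length, and throughput (tokens/s)."""
    total_tokens = 0
    total_seconds = 0.0
    for query in queries:
        start = time.perf_counter()
        tokens = generate(query)  # hypothetical: returns the generated tokens
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    num_queries = len(queries)
    return {
        "avg_latency_ms": 1000.0 * total_seconds / num_queries,
        "avg_output_length": total_tokens / num_queries,
        # Guard against a zero elapsed time on very coarse clocks.
        "throughput_tokens_per_s": (
            total_tokens / total_seconds if total_seconds > 0 else float("inf")
        ),
    }
```

Throughput here is the total number of generated tokens divided by the total elapsed time, so it falls when either server-side or client-side work is added to the path of a query.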

Due to space limitations, implementation details are presented in Appendix C.

4.2 Compared Methods

To demonstrate the effectiveness of our method, we compare our model with the following baselines.

No Protection. The client does not provide any protection for patients' queries and sends the intact queries to the server. Since the query is intact, the model's performance is not affected, so this baseline provides a theoretical upper bound on the model's performance. However, it suffers from privacy leakage.

Direct Removal. The user directly deletes all privacy spans in the input and submits it to the LLM.

DP. The differential privacy method [37] applies text-to-text privatization [13] to user inputs, which injects noise into each token's embedding and finds the nearest token to replace the initial token. The text-to-text privatization is applied to all tokens in the input. The LLM backbone of the DP method is frozen.

DP on Privacy Spans. The client applies text-to-text privatization [13] to the privacy spans only, rather than to the entire query.

SMPC methods suffer from huge inference time overhead, so we do not include them as baselines.
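The text-to-text privatization used by the DP baselines (noise added to a token embedding, followed by a nearest-token lookup) can be sketched as follows. This is an illustration under assumptions: the noise has a uniformly random direction and a Gamma-distributed norm, a common choice for metric-based differential privacy that may differ from the exact mechanism of [13], and the function name and arguments are hypothetical.

```python
import numpy as np

def privatize_tokens(token_ids, embeddings, epsilon, rng=None):
    """Replace each token by the token nearest to its noised embedding.

    token_ids: list of integer token ids.
    embeddings: array of shape (vocab_size, dim), the token embedding table.
    epsilon: privacy parameter; larger values mean less noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.asarray(embeddings, dtype=float)
    dim = embeddings.shape[1]
    privatized = []
    for token_id in token_ids:
        # Assumed noise: random direction on the unit sphere, Gamma-distributed norm.
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
        noisy = embeddings[token_id] + magnitude * direction
        # Map the noised embedding back to the closest token in the vocabulary.
        distances = np.linalg.norm(embeddings - noisy, axis=1)
        privatized.append(int(np.argmin(distances)))
    return privatized
```

Passing every token id of the query corresponds to the DP baseline, while passing only the token ids inside privacy spans corresponds to DP on Privacy Spans.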

The settings in the ablation study of PrivacyRestore are as follows.

w/o Top-K Heads Selector. We remove the Top-K heads selector in PrivacyRestore. We randomly select K heads from $\mathcal{H}_{a}$ to form the head set $\mathcal{H}_{c}$.

w/o Restoration Vectors Training. We eliminate the training process and do not use trainable restoration vectors. We use an activation steering method called ITI [23] to obtain the restoration vectors of each privacy span.

w/o AWA. We remove the attention-aware weighted aggregation component. The meta restoration vector is computed as an equally weighted sum of restoration vectors of all privacy spans.

Pri-Ddxplus

| Methods | Privacy | ≥ Level 2 MC1 ↑ | ≥ Level 2 MC2 ↑ | ≥ Level 2 ASR ↓ | ≥ Level 2 F1 ↓ | ≥ Level 3 MC1 ↑ | ≥ Level 3 MC2 ↑ | ≥ Level 3 ASR ↓ | ≥ Level 3 F1 ↓ | ≥ Level 4 MC1 ↑ | ≥ Level 4 MC2 ↑ | ≥ Level 4 ASR ↓ | ≥ Level 4 F1 ↓ | ≥ Level 5 MC1 ↑ | ≥ Level 5 MC2 ↑ | ≥ Level 5 ASR ↓ | ≥ Level 5 F1 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No Protection | × | 86.83 | 85.13 | 100.00 | 12.30 | 86.83 | 85.13 | 100.00 | 12.30 | 86.83 | 85.13 | 100.00 | 12.30 | 86.83 | 85.13 | 100.00 | 12.30 |
| Direct Removal | ✓ | 33.31 | 33.49 | 45.19 | 7.27 | 63.20 | 59.47 | 24.79 | 4.54 | 78.24 | 75.66 | 10.20 | 1.23 | 86.24 | 84.45 | 0.58 | 1.58 |
| DP [37] | ✓ | 41.83 | 40.61 | 70.69 | 6.28 | 41.83 | 40.61 | 70.69 | 6.28 | 41.83 | 40.61 | 70.69 | 6.28 | 41.83 | 40.61 | 70.69 | 6.28 |
| DP on Privacy Spans | ✓ | 41.83 | 40.61 | 70.69 | 6.28 | 65.59 | 63.01 | 43.44 | 3.62 | 78.37 | 75.68 | 8.58 | 1.33 | 86.44 | 84.55 | 0.51 | 1.71 |
| PrivacyRestore | ✓ | 61.78 | 61.13 | 41.18 | 5.81 | 75.40 | 72.83 | 22.27 | 0.25 | 84.89 | 84.09 | 8.56 | 0.00 | 87.73 | 85.96 | 0.51 | 0.00 |
| w/o Top-K Heads Selector | ✓ | 21.04 | 22.19 | 49.96 | 5.94 | 27.82 | 27.87 | 11.87 | 26.51 | 67.59 | 67.45 | 3.55 | 42.35 | 85.86 | 83.96 | 0.00 | 35.8 |
| w/o Restoration Vectors Training | ✓ | 41.05 | 39.15 | 37.44 | 5.82 | 48.58 | 45.23 | 27.63 | 0.58 | 64.36 | 64.11 | 12.39 | 0.00 | 85.92 | 84.14 | 0.12 | 0.00 |
| w/o AWA | ✓ | 44.35 | 43.26 | 3.74 | 0.00 | 68.62 | 65.95 | 3.87 | 0.00 | 84.89 | 84.07 | 13.36 | 0.00 | 87.73 | 85.95 | 0.06 | 0.00 |

Pri-NLICE

| Methods | Privacy | ≥ Level 2 MC1 ↑ | ≥ Level 2 MC2 ↑ | ≥ Level 2 ASR ↓ | ≥ Level 2 F1 ↓ | ≥ Level 3 MC1 ↑ | ≥ Level 3 MC2 ↑ | ≥ Level 3 ASR ↓ | ≥ Level 3 F1 ↓ | ≥ Level 4 MC1 ↑ | ≥ Level 4 MC2 ↑ | ≥ Level 4 ASR ↓ | ≥ Level 4 F1 ↓ | ≥ Level 5 MC1 ↑ | ≥ Level 5 MC2 ↑ | ≥ Level 5 ASR ↓ | ≥ Level 5 F1 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No Protection | × | 76.64 | 75.80 | 77.78 | 75.87 | 76.64 | 75.80 | 77.78 | 75.87 | 76.64 | 75.80 | 77.78 | 75.87 | 76.64 | 75.80 | 77.78 | 75.87 |
| Direct Removal | ✓ | 58.58 | 55.18 | 2.27 | 23.86 | 64.52 | 61.63 | 0.00 | 20.49 | 73.73 | 72.12 | 0.00 | 3.84 | 74.36 | 72.85 | 0.00 | 1.1 |
| DP [37] | ✓ | 59.21 | 56.12 | 0.00 | 18.72 | 59.21 | 56.12 | 0.00 | 18.72 | 59.21 | 56.12 | 0.00 | 18.72 | 59.21 | 56.12 | 0.00 | 18.72 |
| DP on Privacy Spans | ✓ | 37.87 | 38.98 | 0.50 | 20.86 | 48.10 | 47.17 | 0.01 | 18.66 | 72.22 | 70.40 | 0.00 | 3.30 | 73.48 | 71.70 | 0.00 | 0.79 |
| PrivacyRestore | ✓ | 79.54 | 78.81 | 0.25 | 0.00 | 78.66 | 77.44 | 0.12 | 0.12 | 76.64 | 75.35 | 0.37 | 6.52 | 75.88 | 74.98 | 0.00 | 0.00 |
| w/o Top-K Heads Selector | ✓ | 71.46 | 67.92 | 5.05 | 12.79 | 70.20 | 69.30 | 0.75 | 18.61 | 72.22 | 70.44 | 0.13 | 2.17 | 71.46 | 70.53 | 0.00 | 1.29 |
| w/o Restoration Vectors Training | ✓ | 54.16 | 50.93 | 5.42 | 27.83 | 61.99 | 58.77 | 0.25 | 22.63 | 72.97 | 71.73 | 0.50 | 7.36 | 72.97 | 72.61 | 0.00 | 6.38 |
| w/o AWA | ✓ | 76.76 | 74.61 | 0.12 | 28.36 | 78.4 | 77.31 | 0.12 | 22.55 | 76.63 | 75.33 | 0.37 | 7.36 | 75.88 | 74.98 | 0.00 | 6.38 |

Table 1: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama2-chat-7b. MC1 and MC2 metrics indicate the model’s performance, with higher values representing better performance. ASR and F1 metrics measure the effectiveness of privacy protection, with lower values indicating better performance. The “≥ Level $j$” columns indicate that privacy protection is applied to symptoms with a privacy level greater than or equal to $j$. The “Privacy” column shows whether the method applies privacy protection to the user input or not. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.

4.3 Main Results

As shown in Table 1, we evaluate the performance and effectiveness of privacy protection on two datasets. Since the No Protection method does not use any protection and the DP method injects noise into all tokens in the query, the outcomes of both methods are independent of the privacy level. As shown in the first row of the upper and lower parts, the No Protection method achieves the upper bound of MC1 and MC2 but suffers from privacy leakage on both datasets.

The Direct Removal method serves as a strong competitor in privacy protection but suffers from degraded model performance. A lower privacy-level threshold means that more privacy symptoms are removed, so the performance degradation is more pronounced. The performance of DP and DP on Privacy Spans is significantly lower than that of the No Protection method, and their privacy protection capability is weaker than that of the Direct Removal method.

Compared to DP-based methods, PrivacyRestore achieves significant improvements in model performance with even lower ASR and F1. Its model performance is comparable to that of the No Protection method. Moreover, PrivacyRestore sometimes even outperforms the No Protection method, as demonstrated in the ≥ Level 5 case on the Pri-Ddxplus dataset and the ≥ Level 2, ≥ Level 3, and ≥ Level 5 cases on the Pri-NLICE dataset. This suggests that using the restoration vectors for the corresponding privacy symptoms can sometimes be more beneficial for model inference than including the symptoms directly in the query context. Surprisingly, PrivacyRestore achieves better privacy protection than Direct Removal, with lower ASR and F1 on both datasets. This indicates that applying restoration vectors during model inference can effectively prevent attackers from stealing the privacy symptoms.

We also provide the results after removing the Top-K Heads Selector, Restoration Vectors Training and Attention-aware Weighted Aggregation (AWA) components respectively. After removing Top-K Heads Selector and Restoration Vectors Training, both the model performance and effectiveness of privacy protection are weakened. After removing the AWA component, both the model performance and privacy protection capability decrease (except for slight changes in ASR), which indicates the effectiveness of AWA component.

4.4 Inference Efficiency

We compare the inference efficiency of the initial model and PrivacyRestore on Llama2-chat-7b in Table 2. On the server side, PrivacyRestore needs to edit the attention heads using the meta restoration vector during model inference, as shown in Eq. 2. These operations are fast and only bring 8%-13% overhead on the server, as shown in the “Latency on Server” column. On the client side, PrivacyRestore needs to generate the meta restoration vector for users. The latency on the client side is relatively minor compared to the latency on the server, as depicted in the “Latency on Client” column. As shown in the “Throughput” column, PrivacyRestore can achieve nearly 80% of the throughput of the original model.

Pri-Ddxplus

| Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput |
|---|---|---|---|---|---|---|---|
| Initial | 8.44 | 531.05 | - | - | - | - | 15.9 |
| PrivacyRestore | 9.14 | 601.46 | 109.84 | 88.52 | 89.07 | 145.63 | 12.83 (81%) |

Pri-NLICE

| Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput |
|---|---|---|---|---|---|---|---|
| Initial | 6.56 | 462.04 | - | - | - | - | 14.2 |
| PrivacyRestore | 6.56 | 506.27 | 91.03 | 86.84 | 79.34 | 79.25 | 11.12 (78%) |

Table 2: Comparison of inference efficiency between the original model (Llama2-chat-7b) and PrivacyRestore. Latency is measured in milliseconds (ms), and throughput denotes tokens per second. The percentage in parentheses is the ratio of PrivacyRestore’s throughput to that of the original model.
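As a worked check, the relative server-side overhead and the throughput ratio can be recomputed from the values in Table 2. The snippet below is only an arithmetic sketch; the dictionary layout and function names are hypothetical.

```python
# Values copied from Table 2: (original model, PrivacyRestore).
server_latency_ms = {"Pri-Ddxplus": (531.05, 601.46), "Pri-NLICE": (462.04, 506.27)}
throughput_tokens_per_s = {"Pri-Ddxplus": (15.9, 12.83), "Pri-NLICE": (14.2, 11.12)}

def server_overhead(dataset):
    """Relative increase in server latency caused by PrivacyRestore."""
    original, restored = server_latency_ms[dataset]
    return restored / original - 1.0

def throughput_ratio(dataset):
    """PrivacyRestore throughput as a fraction of the original model's throughput."""
    original, restored = throughput_tokens_per_s[dataset]
    return restored / original

for name in server_latency_ms:
    print(f"{name}: overhead {server_overhead(name):.1%}, "
          f"throughput ratio {throughput_ratio(name):.1%}")
```

The throughput ratios round to the 81% and 78% reported in the table.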

5 Conclusion

We propose PrivacyRestore, which aims to protect user inputs when using online LLM inference services. By directly removing privacy spans and restoring the information through activation steering, PrivacyRestore offers a practical and efficient solution to the privacy-utility trade-off. At the core of PrivacyRestore is our Attention-aware Weighted Aggregation (AWA) technique. AWA aggregates the restoration vectors of all privacy spans in the input into a meta restoration vector. This not only ensures proper representation of all private information, but also prevents attackers from inferring the original privacy spans from the meta restoration vector alone. Compared with differential privacy methods, PrivacyRestore achieves better performance on the medical diagnosis task and demonstrates better privacy protection against prompt injection and attribute inference attacks.

Limitations

Our experiments focus specifically on the healthcare domain. It would be more convincing to also conduct experiments in other domains such as law and finance. The set of privacy spans may evolve over time, necessitating re-training of the restoration vectors from time to time. This process typically requires 4 to 5 hours for the datasets mentioned above. In cases where privacy spans are out-of-vocabulary, one possible solution is to exclude them from privacy protection.

References

  • Al-Ars et al. [2023] Zaid Al-Ars, Obinna Agba, Zhuoran Guo, Christiaan Boerkamp, Ziyaad Jaber, and Tareq Jaber. 2023. Nlice: Synthetic medical record generation for effective primary healthcare differential diagnosis. In 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE), pages 397–402. IEEE.
  • Alain and Bengio [2016] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
  • Alvim et al. [2018] Mário Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. 2018. Invited paper: Local differential privacy on metric spaces: Optimizing the trade-off with utility. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 262–267.
  • Belinkov [2022] Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
  • Chatzikokolakis et al. [2013] Kostas Chatzikokolakis, Miguel Andrés, Nicolás Bordenabe, and Catuscia Palamidessi. 2013. Broadening the scope of differential privacy using metrics.
  • Chen et al. [2023] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
  • Chen et al. [2024] Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, and Chengzhong Xu. 2024. Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20967–20974.
  • Chu et al. [2023] Hong-Min Chu, Jonas Geiping, Liam H Fowl, Micah Goldblum, and Tom Goldstein. 2023. Panning for gold in federated learning: Targeted text extraction under arbitrarily large-scale aggregation. In The Eleventh International Conference on Learning Representations.
  • Dathathri et al. [2020] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.
  • Deng et al. [2023] Wentao Deng, Jiahuan Pei, Keyi Kong, Zhe Chen, Furu Wei, Yujun Li, Zhaochun Ren, Zhumin Chen, and Pengjie Ren. 2023. Syllogistic reasoning for legal judgment analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13997–14009, Singapore. Association for Computational Linguistics.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Elhage et al. [2021] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1:1.
  • Feyisetan et al. [2020] Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the 13th international conference on web search and data mining, pages 178–186.
  • Fowl et al. [2022] Liam H Fowl, Jonas Geiping, Steven Reich, Yuxin Wen, Wojciech Czaja, Micah Goldblum, and Tom Goldstein. 2022. Decepticons: Corrupted transformers breach privacy in federated learning for language models. In NeurIPS ML Safety Workshop.
  • Gupta et al. [2023] Kanav Gupta, Neha Jawalkar, Ananta Mukherjee, Nishanth Chandran, Divya Gupta, Ashish Panwar, and Rahul Sharma. 2023. Sigma: Secure gpt inference with function secret sharing. Cryptology ePrint Archive, Paper 2023/1269. https://eprint.iacr.org/2023/1269.
  • Hao et al. [2022a] Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. 2022a. Iron: Private inference on transformers. In Advances in Neural Information Processing Systems, volume 35, pages 15718–15731. Curran Associates, Inc.
  • Hao et al. [2022b] Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. 2022b. Iron: Private inference on transformers. Advances in neural information processing systems, 35:15718–15731.
  • Hernandez et al. [2023] Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. Inspecting and editing knowledge representations in language models.
  • Hong et al. [2024] Jiwoo Hong, Noah Lee, and James Thorne. 2024. ORPO: monolithic preference optimization without reference model. CoRR, abs/2403.07691.
  • Igamberdiev and Habernal [2023] Timour Igamberdiev and Ivan Habernal. 2023. DP-BART for privatized text rewriting under local differential privacy. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13914–13934, Toronto, Canada. Association for Computational Linguistics.
  • Li et al. [2023a] Dacheng Li, Hongyi Wang, Rulin Shao, Han Guo, Eric Xing, and Hao Zhang. 2023a. MPCFormer: Fast, performant and private transformer inference with MPC. In The Eleventh International Conference on Learning Representations.
  • Li et al. [2022] Haoran Li, Yangqiu Song, and Lixin Fan. 2022. You don’t know my favorite color: Preventing dialogue representations from revealing speakers’ private personas. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 5858–5870. Association for Computational Linguistics.
  • Li et al. [2023b] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023b. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36, pages 41451–41530. Curran Associates, Inc.
  • Li et al. [2023c] Yansong Li, Zhixing Tan, and Yang Liu. 2023c. Privacy-preserving prompt tuning for large language model services. CoRR, abs/2305.06212.
  • Liang et al. [2024] Zi Liang, Pinghui Wang, Ruofei Zhang, Nuo Xu, Shuo Zhang, Lifeng Xing, Haitao Bai, and Ziyang Zhou. 2024. Merge: Fast private text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19884–19892.
  • Lin et al. [2021] Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  • Liu and Liu [2023] Xuanqi Liu and Zhuotao Liu. 2023. Llms can understand encrypted prompt: Towards privacy-computing friendly transformers. arXiv preprint arXiv:2305.18396.
  • Liu et al. [2023] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lyu et al. [2020] Lingjuan Lyu, Xuanli He, and Yitong Li. 2020. Differentially private representation for NLP: formal guarantee and an empirical study on privacy and fairness. pages 2355–2365.
  • Mahloujifar et al. [2021] Saeed Mahloujifar, Huseyin A Inan, Melissa Chase, Esha Ghosh, and Marcello Hasegawa. 2021. Membership inference on word embedding and beyond. arXiv preprint arXiv:2106.11384.
  • Mattern et al. [2022] Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, and Mrinmaya Sachan. 2022. Differentially private language models for secure data sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • MetaAI [2023] MetaAI. 2023. Introducing meta llama 3: The most capable openly available llm to date.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Perez and Ribeiro [2022] Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop.
  • Qu et al. [2021] Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork. 2021. Natural language understanding with privacy-preserving BERT. In CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, pages 1488–1497.
  • Shi et al. [2022] Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, and Zhou Yu. 2022. Selective differential privacy for language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2848–2859, Seattle, United States. Association for Computational Linguistics.
  • Subramani et al. [2022] Nishant Subramani, Nivedita Suresh, and Matthew Peters. 2022. Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 566–581, Dublin, Ireland. Association for Computational Linguistics.
  • Tchango et al. [2022] Arsène Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. 2022. Ddxplus: A new dataset for automatic medical diagnosis. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  • Tenney et al. [2019] Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950.
  • Tian et al. [2022] Zhiliang Tian, Yingxiu Zhao, Ziyue Huang, Yu-Xiang Wang, Nevin L. Zhang, and He He. 2022. Seqpate: Differentially private text generation via knowledge distillation. In Advances in Neural Information Processing Systems, volume 35, pages 11117–11130. Curran Associates, Inc.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Turner et al. [2023] Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation addition: Steering language models without optimization.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wu et al. [2023] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
  • [47] Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, and Kun Kuang. wisdominterrogatory. Available at GitHub.
  • Xie et al. [2023] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. Pixiu: A large language model, instruction data and evaluation benchmark for finance. arXiv preprint arXiv:2306.05443.
  • Xu et al. [2023a] Canwen Xu, Daya Guo, Nan Duan, and Julian J. McAuley. 2023a. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Proceedings of EMNLP, pages 6268–6278.
  • Xu et al. [2023b] Mingbin Xu, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang, Arturo Argueta, Shiyi Han, Yaqiao Deng, Leo Liu, Anmol Walia, and Alex Jin. 2023b. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
  • Yu et al. [2023] Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zi-Han Lin, Saurabh Naik, Tomasz L. Religa, Jian Yin, and Huishuai Zhang. 2023. Selective pre-training for private fine-tuning. ArXiv, abs/2305.13865.
  • Yu et al. [2022] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. 2022. Differentially private fine-tuning of language models. In International Conference on Learning Representations.
  • Yue et al. [2023] Xiang Yue, Huseyin Inan, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, and Robert Sim. 2023. Synthetic text generation with differential privacy: A simple and practical recipe. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1321–1342, Toronto, Canada. Association for Computational Linguistics.
  • Zhang et al. [2023] Zhuo Zhang, Yuanhang Yang, Yong Dai, Qifan Wang, Yue Yu, Lizhen Qu, and Zenglin Xu. 2023. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9963–9977, Toronto, Canada. Association for Computational Linguistics.
  • Zheng et al. [2023] Mengxin Zheng, Qian Lou, and Lei Jiang. 2023. Primer: Fast private transformer inference on encrypted data. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE.

Appendix A Datasets

A.1 Definitions of Privacy Level

We classify the privacy of patients’ symptoms into five levels, with higher levels indicating greater sensitivity. The detailed definitions of the privacy levels are presented in Table 3. We also display the distribution of symptoms across the five privacy levels on the Pri-Ddxplus and Pri-NLICE datasets.

Privacy Level | Definition | %
Level 1 | Public information, symptoms or antecedents that are common, widely known, and do not reveal any personal or sensitive information. Examples include sneezing, headache, or minor injuries. | 20.43
Level 2 | Non-sensitive personal information, symptoms or antecedents that may be personal but not necessarily sensitive or revealing. These may include common illnesses like cold or flu, allergies, or minor digestive issues. | 22.58
Level 3 | Potentially sensitive information, symptoms or antecedents that could be indicative of underlying health conditions but are not immediately sensitive or stigmatizing. Examples include chronic conditions like diabetes, hypertension, or asthma. | 40.86
Level 4 | Sensitive personal information, symptoms or antecedents that may be stigmatizing or have social implications if disclosed publicly. This could include mental health issues like depression or anxiety, reproductive health concerns, or substance abuse. | 6.98
Level 5 | Highly sensitive information, symptoms or antecedents that are highly personal, stigmatizing, or potentially life-altering if disclosed publicly. This category includes sexually transmitted infections, HIV/AIDS, certain types of cancer, or rare and serious medical conditions. | 9.13
Table 3: Definitions of the privacy levels for possible symptoms in Pri-Ddxplus. The rightmost column gives the percentage of symptoms at the corresponding privacy level.

A.2 Construction Process

We used GPT-3.5 [34] to assign a privacy level, ranging from non-sensitive to highly sensitive, to every possible symptom in Ddxplus [40] and NLICE [1]. The prompt template used for this assessment is shown in Figure 2.

We also assign each sample three randomly chosen pathologies as incorrect diagnoses, which are combined with the correct diagnosis to form the answer options. Some data samples are presented in Figure 3.
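
The option construction described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `build_options`, the toy pathology names, and the shuffling of the final option order are all assumptions.

```python
import random

def build_options(correct, all_pathologies, num_distractors=3, rng=None):
    """Return (options, index_of_correct) for one sample.

    Hypothetical helper: the source only states that three random
    pathologies are combined with the correct diagnosis.
    """
    if rng is None:
        rng = random.Random()
    # Distractors are drawn from every pathology except the correct one.
    candidates = [p for p in all_pathologies if p != correct]
    distractors = rng.sample(candidates, num_distractors)
    options = distractors + [correct]
    # Assumption: the option order is randomized so the answer position varies.
    rng.shuffle(options)
    return options, options.index(correct)

# Toy pathology list (illustrative names only).
pathologies = ["Pneumonia", "GERD", "Anemia", "Bronchitis", "Panic attack"]
options, answer_idx = build_options("GERD", pathologies, rng=random.Random(0))
print(options, answer_idx)
```

Passing a seeded `random.Random` keeps the sketch reproducible without touching the global random state.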

The original Ddxplus [40] and NLICE [1] datasets are extensive, and we observed that for most samples, providing only the non-sensitive symptoms often yields diagnosis outputs similar to those obtained when all symptoms are provided. Privacy preservation is meaningless for these samples, because users can simply hide the privacy spans and still obtain approximately the same diagnosis. In the real world, however, sensitive symptoms sometimes play a vital role in diagnosis, and privacy preservation is then highly valuable. Our datasets are designed to benchmark privacy-preserving methods and must therefore include samples where privacy symptoms are crucial to the diagnosis outcome. We use the KL divergence to measure the importance of privacy spans: we calculate the KL divergence between the model output distributions with and without the privacy symptoms included. A higher KL divergence indicates that the absence of sensitive symptoms may lead to a different or even incorrect output. After filtering, we obtain the Pri-Ddxplus and Pri-NLICE datasets for privacy preservation.
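
As a rough illustration of this filtering step, the sketch below computes the KL divergence between two output distributions over the answer options and keeps a sample only when the divergence is large. The direction of the divergence (full input relative to the input with privacy symptoms removed), the threshold value, and the function names are assumptions; the text does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same options."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def keep_sample(p_full, p_removed, threshold):
    """Keep a sample when removing privacy symptoms shifts the output enough.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    return kl_divergence(p_full, p_removed) >= threshold

# Toy distributions over four diagnosis options.
p_full = [0.70, 0.10, 0.10, 0.10]      # all symptoms provided
p_removed = [0.25, 0.25, 0.25, 0.25]   # privacy symptoms removed, output changes
p_similar = [0.68, 0.12, 0.10, 0.10]   # privacy symptoms removed, output barely changes

print(keep_sample(p_full, p_removed, threshold=0.1))
print(keep_sample(p_full, p_similar, threshold=0.1))
```

In this toy case the first sample would be kept and the second discarded, since only the first shows a large shift in the output distribution.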

Figure 2: The prompt used for privacy level assessing.
Figure 3: Some data examples in Pri-Ddxplus and Pri-NLICE datasets.

A.3 Statistical Information

We show the statistics of the obtained Pri-Ddxplus and Pri-NLICE dataset in Table 4. We tally the number of instances, symptom types, and diagnosis types. Symptoms with a privacy level greater than 2 were considered private. We calculate the number of privacy symptom types, which are regarded as privacy spans. We also compute the average occurrence of privacy symptoms per instance.
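
The two privacy-related statistics can be sketched as below. This is only an illustration on toy data: the function name, the data layout, and the choice to count privacy symptom types as those that actually appear in a split are assumptions.

```python
def privacy_stats(instances, symptom_levels, threshold=2):
    """Count privacy symptom types and average privacy symptoms per instance.

    instances: list of symptom lists, one per patient query (toy layout).
    symptom_levels: mapping from symptom name to privacy level (1 to 5).
    A symptom is private when its level is greater than `threshold`.
    """
    private = {s for s, level in symptom_levels.items() if level > threshold}
    per_instance = [sum(1 for s in inst if s in private) for inst in instances]
    types_seen = {s for inst in instances for s in inst if s in private}
    return {
        "privacy_symptom_types": len(types_seen),
        "avg_privacy_symptoms": sum(per_instance) / len(per_instance),
    }

# Toy data; the names and levels are illustrative only.
levels = {"sneezing": 1, "cold": 2, "asthma": 3, "depression": 4, "HIV": 5}
queries = [
    ["sneezing", "asthma"],
    ["cold", "depression", "HIV"],
    ["asthma", "HIV"],
]
stats = privacy_stats(queries, levels)
print(stats)
```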

Pri-Ddxplus contains more instances and more privacy symptom types than Pri-NLICE. Each sample in Pri-Ddxplus contains about six privacy symptoms on average (5.98), while samples in Pri-NLICE contain about four (3.76).

Pri-Ddxplus
Dataset Split | Instances | Symptom Type | Diagnosis Type | Privacy Symptom Type | Avg. Privacy Symptoms
All | 7509 | 187 | 73 | 149 | 5.98
Train | 5901 | 187 | 73 | 149 | 6.02
Dev | 59 | 122 | 53 | 102 | 6.16
Test | 1549 | 97 | 45 | 78 | 5.83

Pri-NLICE
Dataset Split | Instances | Symptom Type | Diagnosis Type | Privacy Symptom Type | Avg. Privacy Symptoms
All | 3992 | 80 | 47 | 70 | 3.76
Train | 3168 | 80 | 47 | 65 | 3.50
Dev | 32 | 50 | 44 | 41 | 3.78
Test | 792 | 43 | 47 | 25 | 4.79
Table 4: Statistics of Pri-Ddxplus and Pri-NLICE. Each instance consists of a patient’s description of symptoms along with the correct diagnosis. Avg. Privacy Symptoms is the average number of privacy symptoms that occur in one query.

Appendix B Defense to Various Attacks

We enumerate the potential attacks that PrivacyRestore may face and show how it defends against each of them.

Leakage of restoration vectors of each privacy span and meta restoration vector.

Attackers may illegally obtain the restoration vectors of each privacy span and intercept the meta restoration vector sent to the server. Even in this scenario, it remains difficult for attackers to infer the privacy spans from a specific meta restoration vector. According to the AWA method in §3.5, when the query contains only one privacy span, the meta restoration vector is a restoration vector with random noise injected, which prevents the privacy span from being inferred. When the query contains multiple privacy spans, attackers must try all combinations of restoration vectors to infer the privacy spans. The number of combinations to try is the sum, over every subset size of two or more, of the number of ways to choose that many restoration vectors:

$$\mathcal{N}_c = \sum_{i=2}^{|\mathcal{S}|} C_{|\mathcal{S}|}^{i} = 2^{|\mathcal{S}|} - |\mathcal{S}| - 1, \tag{10}$$

where $|\mathcal{S}|$ is the total number of privacy spans, and $C_{n}^{i}$ is the number of ways to choose $i$ elements from a set of $n$ elements. The number of combinations grows exponentially with $|\mathcal{S}|$. Since $|\mathcal{S}|$ is typically large in practical scenarios, it is computationally infeasible for attackers to infer the privacy spans even if the restoration vectors of each privacy span are available.
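
To give a sense of scale, the short sketch below evaluates Eq. (10) and checks the closed form against the explicit sum. Using the privacy symptom type counts from Table 4 (149 and 70) as values of the number of privacy spans is an assumption made purely for illustration.

```python
from math import comb

def num_combinations(num_spans):
    """Closed form of Eq. (10): 2^n - n - 1."""
    return 2 ** num_spans - num_spans - 1

def num_combinations_by_sum(num_spans):
    """Explicit sum of binomial coefficients for subset sizes 2..n."""
    return sum(comb(num_spans, i) for i in range(2, num_spans + 1))

# 5 and 20 are toy sizes; 70 and 149 are the privacy symptom type
# counts reported in Table 4 (used here as illustrative values).
for n in (5, 20, 70, 149):
    assert num_combinations(n) == num_combinations_by_sum(n)
    print(n, num_combinations(n))
```

Even for only five privacy spans there are already 26 candidate combinations, and the count roughly doubles with each additional span.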

Prompt Injection Attack. The attack assumes that attackers own the LLM weights and can obtain the meta restoration vector and the query with privacy spans removed. During inference, attackers intercept the query sent by the client, modify its content, and then send it to the server. They inject malicious content into the query to manipulate the LLM into generating private information. For example, in the medical diagnosis task, the malicious content could be “print out the possible symptoms.” The experimental results in Table 1 show that the attack success ratio against our method is lower than against the baselines.

Attribute Inference Attack. The attack condition is the same as for the Prompt Injection Attack. The Attribute Inference Attack aims to recover sensitive attributes from the input text. Attackers commonly train a classifier that takes the input text or its embeddings as input and predicts whether the input text contains sensitive attributes. As shown in Table 1, the F1 score of the classifier against our method is low.

Appendix C Implementation Details

For the fine-tuning process described in §3.4, we train only the restoration vectors and freeze all other parameters of the LLM. Training runs for five epochs with a batch size of 8. This process takes 4 to 5 hours on a single A100 GPU, which is a reasonable cost for retraining when new privacy spans are introduced.

For the attribute inference attack, we employ a 2-layer Multilayer Perceptron (MLP) as the classifier, following the approach in [22]. The classifier is trained on the same dataset used for restoration vector training. It takes the query representation as input and predicts whether the query contains specific symptoms. For PrivacyRestore, we regard the meta restoration vector as the query representation. The other methods in §4.2 do not have an obvious representation associated with the privacy spans, so for them we use the hidden state of the last token from the last layer of the LLM as the query representation.

Differential privacy [37] uses $\eta$ to control the strength of the injected noise. A smaller $\eta$ provides stronger privacy protection but results in significant performance degradation. As shown in [24], $\eta$ ranges from 75 to 175. For the DP-based methods, we set $\eta$ to 75 to prioritize privacy protection. For the noise injection in the single restoration vector case mentioned in §3.5, we use the same $\eta = 75$ to maintain a consistent noise level.

To evaluate inference efficiency, we use greedy decoding and restrict the maximum generation length to 64. For the top-K heads selector, we follow the setting of [23] and set K to 48. All of our experiments are conducted on the NVIDIA A800.

Appendix D Additional Experiment Results

We report additional experimental results on larger LLMs, such as Llama2-chat-13b [43] and Llama3-8b-instruct [33]. As shown in Table 5 and Table 6, PrivacyRestore achieves results comparable to normal inference while providing strong privacy protection.

As shown in Table 7 and Table 8, the latency on the client remains stable as the size of the LLM increases. Because the client-side latency is independent of the LLM size, it becomes increasingly negligible when PrivacyRestore is applied to larger LLMs. Notably, for Llama2-chat-13b on the Pri-NLICE dataset, PrivacyRestore achieves 97% of the initial throughput. On average, PrivacyRestore achieves about 80% of the initial throughput.
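
The throughput ratios can be recomputed from the reported values in Tables 7 and 8, as in the sketch below. It considers only the four settings in those two tables, and small deviations from the percentages printed in the tables are expected because the inputs are already rounded.

```python
# (initial throughput, PrivacyRestore throughput), copied from Tables 7 and 8.
throughput = {
    ("Table 7", "Pri-Ddxplus"): (9.65, 7.45),
    ("Table 7", "Pri-NLICE"): (6.12, 5.97),
    ("Table 8", "Pri-Ddxplus"): (34.35, 25.41),
    ("Table 8", "Pri-NLICE"): (32.27, 24.5),
}

# Fraction of the initial throughput retained in each setting.
ratios = {
    setting: restored / initial
    for setting, (initial, restored) in throughput.items()
}
mean_ratio = sum(ratios.values()) / len(ratios)

for (table, dataset), ratio in ratios.items():
    print(f"{table} {dataset}: {ratio:.1%}")
print(f"mean: {mean_ratio:.1%}")
```

Over these four settings the mean retained throughput comes out slightly above 80%, in line with the figure quoted in the text.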

Pri-Ddxplus
Methods | Privacy Level ≥ Level 2 | | | | Privacy Level ≥ Level 3 | | | | Privacy Level ≥ Level 4 | | | | Privacy Level ≥ Level 5 | | |
 | MC1 ↑ | MC2 ↑ | ASR ↓ | AF ↓ | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF
No Protection × | 83.02 | 82.91 | 82.44 | 35.06 | 83.02 | 82.91 | 82.44 | 35.06 | 83.02 | 82.91 | 82.44 | 35.06 | 83.02 | 82.91 | 82.44 | 35.06
Direct Removal | 43.25 | 44.43 | 22.33 | 25.94 | 57.71 | 57.31 | 4.13 | 29.17 | 76.95 | 76.44 | 0.06 | 0.93 | 82.56 | 82.17 | 0.00 | 0.00
DP [37] | 17.17 | 20.60 | 10.58 | 23.18 | 7.17 | 20.60 | 10.58 | 23.18 | 7.17 | 20.60 | 10.58 | 23.18 | 7.17 | 20.60 | 10.58 | 23.18
DP on Privacy Spans | 54.09 | 52.48 | 12.39 | 18.44 | 61.07 | 60.08 | 7.48 | 13.13 | 77.53 | 76.43 | 1.14 | 2.71 | 82.37 | 82.22 | 0.00 | 5.78
Privacy Restoration | 90.57 | 88.86 | 12.45 | 5.85 | 93.99 | 92.80 | 3.87 | 5.50 | 85.21 | 84.07 | 0.00 | 3.98 | 82.31 | 81.74 | 0.00 | 0.90
w/o Top-K Heads Selector | 54.09 | 52.41 | 23.24 | 56.81 | 57.45 | 57.14 | 2.17 | 82.23 | 77.53 | 75.91 | 0.06 | 50.94 | 81.60 | 81.33 | 0.00 | 45.92
w/o Restoration Vectors Training | 18.59 | 18.27 | 13.62 | 5.85 | 37.83 | 38.83 | 3.80 | 34.41 | 68.23 | 67.32 | 0.12 | 39.72 | 81.34 | 81.01 | 0.00 | 9.05
w/o AWA | 52.42 | 52.93 | 2.84 | 3.97 | 80.50 | 78.25 | 0.96 | 5.48 | 85.60 | 83.59 | 0.00 | 38.86 | 82.69 | 82.36 | 0.00 | 5.55
Pri-NLICE
Methods | Privacy Level ≥ Level 2 | | | | Privacy Level ≥ Level 3 | | | | Privacy Level ≥ Level 4 | | | | Privacy Level ≥ Level 5 | | |
 | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | F1
No Protection × | 81.56 | 78.37 | 91.54 | 76.08 | 81.56 | 78.37 | 91.54 | 76.08 | 81.56 | 78.37 | 91.54 | 76.08 | 81.56 | 78.37 | 91.54 | 76.08
Direct Removal | 36.48 | 36.72 | 0.00 | 2.79 | 43.43 | 46.36 | 0.00 | 0.43 | 50.00 | 49.51 | 0.00 | 0.15 | 53.15 | 52.00 | 0.00 | 0.00
DP [37] | 16.41 | 18.04 | 0.00 | 32.01 | 16.41 | 18.04 | 0.00 | 32.01 | 16.41 | 18.04 | 0.00 | 32.01 | 16.41 | 18.04 | 0.00 | 32.01
DP on Privacy Spans | 43.18 | 41.99 | 0.12 | 56.68 | 59.59 | 56.59 | 0.37 | 35.36 | 67.42 | 64.50 | 0.00 | 10.57 | 69.82 | 66.49 | 0.00 | 0.26
Privacy Restoration | 79.92 | 75.36 | 0.88 | 9.26 | 84.34 | 81.25 | 0.53 | 2.01 | 65.15 | 64.08 | 0.12 | 0.56 | 68.30 | 65.20 | 0.00 | 0.00
w/o Top-K Heads Selector | 52.77 | 55.31 | 0.56 | 22.54 | 71.33 | 69.47 | 0.45 | 10.64 | 54.04 | 53.36 | 0.20 | 1.44 | 56.31 | 54.13 | 0.00 | 0.14
w/o Restoration Vectors Training | 46.46 | 47.81 | 0.41 | 29.65 | 50.63 | 51.88 | 0.35 | 5.41 | 56.81 | 54.72 | 0.14 | 0.11 | 59.46 | 57.48 | 0.00 | 0.04
w/o AWA | 83.96 | 81.8 | 1.26 | 33.54 | 84.59 | 81.45 | 0.97 | 10.28 | 65.15 | 54.08 | 0.46 | 5.76 | 68.30 | 65.20 | 0.02 | 0.00
Table 5: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama2-chat-13b. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.
Pri-Ddxplus
Methods | Privacy Level ≥ Level 2 | | | | Privacy Level ≥ Level 3 | | | | Privacy Level ≥ Level 4 | | | | Privacy Level ≥ Level 5 | | |
 | MC1 ↑ | MC2 ↑ | ASR ↓ | AF ↓ | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF
No Protection × | 43.38 | 44.87 | 99.67 | 6.52 | 43.38 | 44.87 | 99.67 | 6.52 | 43.38 | 44.87 | 99.67 | 6.52 | 43.38 | 44.87 | 99.67 | 6.52
Direct Removal | 28.34 | 29.68 | 39.83 | 5.94 | 32.34 | 33.51 | 19.17 | 4.39 | 42.47 | 44.24 | 0.96 | 1.61 | 42.02 | 43.55 | 0.38 | 2.10
DP [37] | 22.14 | 23.84 | 99.74 | 1.46 | 22.14 | 23.84 | 99.74 | 1.46 | 22.14 | 23.84 | 99.74 | 1.46 | 22.14 | 23.84 | 99.74 | 1.46
DP on Privacy Spans | 27.88 | 29.39 | 97.86 | 31.44 | 31.50 | 32.10 | 21.43 | 8.61 | 40.54 | 41.09 | 1.61 | 0.01 | 42.02 | 43.50 | 1.22 | 0.00
Privacy Restoration | 76.56 | 78.02 | 23.49 | 0.96 | 59.52 | 60.17 | 11.94 | 1.10 | 54.48 | 53.82 | 0.00 | 0.00 | 40.67 | 42.44 | 0.00 | 0.00
w/o Top-K Heads Selector | 31.37 | 30.79 | 81.14 | 23.59 | 31.24 | 62.34 | 5.03 | 0.34 | 44.54 | 44.38 | 0.77 | 0.00 | 41.83 | 43.34 | 0.19 | 0.00
w/o Restoration Vectors Training | 21.62 | 22.17 | 3.35 | 0.64 | 19.69 | 21.48 | 0.71 | 1.05 | 39.31 | 38.47 | 0.12 | 0.98 | 41.63 | 43.34 | 0.00 | 0.61
w/o AWA | 63.33 | 65.06 | 2.90 | 5.81 | 56.68 | 57.64 | 12.65 | 1.57 | 54.50 | 53.81 | 0.00 | 0.97 | 40.66 | 42.44 | 0.00 | 0.60
Pri-NLICE
Methods | Privacy Level ≥ Level 2 | | | | Privacy Level ≥ Level 3 | | | | Privacy Level ≥ Level 4 | | | | Privacy Level ≥ Level 5 | | |
 | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | AF | MC1 | MC2 | ASR | F1
No Protection × | 29.29 | 28.35 | 55.80 | 40.32 | 29.29 | 28.35 | 55.80 | 40.32 | 29.29 | 28.35 | 55.80 | 40.32 | 29.29 | 28.35 | 55.80 | 40.32
Direct Removal | 16.03 | 16.92 | 10.2 | 10.23 | 33.08 | 30.93 | 7.42 | 5.12 | 25.88 | 25.57 | 5.41 | 0.04 | 26.01 | 25.83 | 0.00 | 0.00
DP [37] | 25.00 | 22.40 | 52.39 | 12.87 | 25.00 | 22.40 | 52.39 | 12.87 | 25.00 | 22.40 | 52.39 | 12.87 | 25.00 | 22.40 | 52.39 | 12.87
DP on Privacy Spans | 16.41 | 17.59 | 80.42 | 9.65 | 25.50 | 24.85 | 76.13 | 1.01 | 29.54 | 28.19 | 4.04 | 0.45 | 28.66 | 28.10 | 0.00 | 0.01
Privacy Restoration | 93.81 | 94.50 | 33.58 | 10.34 | 96.08 | 95.07 | 6.81 | 4.11 | 42.55 | 41.63 | 0.12 | 0.06 | 38.76 | 37.97 | 0.00 | 0.01
w/o Top-K Heads Selector | 38.25 | 37.38 | 7.70 | 15.87 | 43.30 | 41.85 | 3.28 | 9.63 | 31.06 | 30.22 | 0.63 | 0.04 | 29.16 | 28.85 | 0.00 | 0.00
w/o Restoration Vectors Training | 26.26 | 26.58 | 90.90 | 16.32 | 40.27 | 38.92 | 19.57 | 5.69 | 27.39 | 26.90 | 0.25 | 0.36 | 25.63 | 25.78 | 0.00 | 0.00
w/o AWA | 96.71 | 95.84 | 5.30 | 23.75 | 96.08 | 95.07 | 6.69 | 11.30 | 42.55 | 41.63 | 0.12 | 0.75 | 38.76 | 37.97 | 0.00 | 0.05
Table 6: Comparison of the model performance and the effectiveness of privacy protection among the compared methods on the Pri-Ddxplus and Pri-NLICE datasets on Llama3-8b-instruct. The best model performance and privacy protection are marked in bold, excluding the No Protection method, as it does not provide any privacy protection.
Pri-Ddxplus
Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput
Initial | 11.89 | 1232.52 | - | - | - | - | 9.65
Privacy Restoration | 8.16 | 1025.77 | 130.85 | 121.94 | 101.25 | 168.00 | 7.45 (77%)

Pri-NLICE
Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput
Initial | 5.83 | 953.35 | - | - | - | - | 6.12
Privacy Restoration | 5.83 | 848.14 | 104.46 | 98.18 | 86.42 | 88.01 | 5.97 (97%)
Table 7: Comparison of the generation efficiency on Llama2-chat-13b.
Pri-Ddxplus
Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput
Initial | 63.70 | 1854.41 | - | - | - | - | 34.35
Privacy Restoration | 60.92 | 2290.97 | 105.07 | 94.85 | 85.00 | 139.10 | 25.41 (74%)

Pri-NLICE
Methods | Avg. Output Length | Latency on Server | Latency on Client (≥ Level 2) | Latency on Client (≥ Level 3) | Latency on Client (≥ Level 4) | Latency on Client (≥ Level 5) | Throughput
Initial | 64 | 1983.10 | - | - | - | - | 32.27
Privacy Restoration | 64 | 2528.37 | 89.99 | 85.72 | 78.73 | 79.82 | 24.5 (76%)
Table 8: Comparison of the generation efficiency on Llama3-8b-instruct.