1 Introduction

With increased public awareness and the introduction of stricter regulation of how personally identifiable data may be stored and used, user privacy has become an issue of paramount importance in a wide range of practical applications. While many formal notions of privacy have been proposed (see, e.g., [76]), differential privacy (DP) [44, 46] has emerged as the gold standard due to its broad applicability and nice features such as composition and post-processing (see, e.g., [51, 93] for a comprehensive overview). A primary goal of DP is to enable processing of users’ data in a way that (i) does not reveal substantial information about the data of any single user, and (ii) allows the accurate computation of functions of the users’ inputs. The theory of DP studies what trade-offs between privacy and accuracy are feasible for desired families of functions.

Most work on DP has been in the central (a.k.a. curator) setup, where numerous private algorithms with small error have been devised (see, e.g., [18, 49, 50]). The premise of the central model is that a curator can access the raw user data before releasing a differentially private output. In distributed applications, this requires users to transfer their raw data to the curator—a strong limitation in cases where users would expect the entity running the curator (e.g., a government agency or a technology company) to gain little information about their data.

To overcome this limitation, recent work has studied the local model of DP [71] (also [97]), where each individual message sent by a user is required to be private. Indeed, several large-scale deployments of DP in practice, at companies such as Apple [5, 62], Google [55, 87], and Microsoft [40], have used local DP. While estimates in the local model require weaker trust assumptions than in the central model, they inevitably suffer from significant error. For many types of queries, the estimation error is provably larger than the error incurred in the central model by a factor growing with the square root of the number of users.

Shuffle Privacy Model. The aforementioned trade-offs have motivated the study of the shuffle model of privacy as a middle ground between the central and local models. While a similar setup was first studied in cryptography in the work of Ishai et al. [68] on cryptography from anonymity, the shuffle model was first proposed for privacy-preserving protocols by Bittau et al. [16] in their Encode-Shuffle-Analyze architecture. In the shuffle setting, each user sends one or more messages to the analyzer using an anonymous channel that does not reveal where each message comes from. Such anonymization is a common procedure in data collection and is easy to explain to regulatory agencies and users. The anonymous channel is equivalent to all user messages being randomly shuffled (i.e., permuted) before being operated on by the analyzer, leading to the model illustrated in Fig. 1; see Sect. 2 for a formal description of the shuffle model. In this work, we treat the shuffler as a black box, but note that various efficient cryptographic implementations of the shuffler have been considered, including onion routing, mixnets, third-party servers, and secure hardware (see, e.g., [16, 68]). A comprehensive overview of recent work on anonymous communication can be found on Free Haven’s Selected Papers in Anonymity website [57].

The DP properties of the shuffle model were first analytically studied, independently, in the works of Erlingsson et al. [54] and Cheu et al. [29]. Protocols within the shuffle model are non-interactive and fall into two categories: single-message protocols, in which each user sends one message (as in the local model), and multi-message protocols, in which a user can send more than one message. In both variants, the messages sent by all users are shuffled before being passed to the analyzer. The goal is to design private protocols in the shuffle model with as small error and total communication as possible. An example of the power of the shuffle model was established by Erlingsson et al. [54] and extended by Balle et al. [9], who showed that every locally differentially private algorithm directly yields a single-message protocol in the shuffle model with significantly better privacy. In this paper we study the optimal error achievable for fundamental tasks such as frequency estimation (i.e., histograms) and selection in the shuffle model of differential privacy. We show that in many settings, multi-message protocols can achieve significantly smaller error than single-message protocols, and we introduce such low-error multi-message protocols that have the additional property of having low communication.

The study of differential privacy in the shuffle model can be seen as part of a movement towards an integrated study of differential privacy and cryptographic protocols, i.e., “DP-cryptography” [94].

Fig. 1. Computation in the shuffle model consists of local randomization of inputs in the first stage, followed by a shuffle of all outputs of the local randomizers, after which the shuffled output is passed on to an analyzer.

Overview. The remainder of the paper is organized as follows. In Sect. 2 we review some preliminaries for differential privacy and the shuffle model. In Sect. 3 we give an overview of our main theorems for the frequency estimation and selection problems, and in Sect. 4 we overview the proofs of our main results. In Sect. 5 we discuss applications of our results to problems such as range queries and median estimation. In Sect. 6 we discuss related work in detail, and we conclude in Sect. 7. Full proofs of our results as well as the precise statements of some theorems are relegated to the supplementary material; see Section A.

2 Preliminaries

Before stating our main results, we formally introduce the basics of differential privacy and the shuffle model.

Notation. For a positive real number a, we use \(\log (a)\) to denote the logarithm base 2 of a, and \(\ln (a)\) to denote the natural logarithm of a. For any positive integer \(B\), let \([B] = \{ 1, 2, \ldots , B\}\). For any set \(\mathcal {Y}\), we denote by \(\mathcal {Y}^*\) the set consisting of sequences of elements of \(\mathcal {Y}\), i.e., \(\mathcal {Y}^* = \bigcup _{n \geqslant 0} \mathcal {Y}^n\). For positive integers \(n,B\), we write \(\mathrm {polylog}(n,B)\) to denote the class of functions \(f(n,B)\) for which there is a constant C so that for all \(n, B\in \mathbb {N}\), \(f(n,B) \leqslant C (\log (nB))^C\).

Datasets. Fix a finite set \(\mathcal {X}\), the space of reports of users. A dataset is an element of \(\mathcal {X}^*\), namely a tuple consisting of elements of \(\mathcal {X}\). Let \(\mathrm {hist}(X) \in \mathbb {N}^{|\mathcal {X}|}\) be the histogram of X: for any \(x \in \mathcal {X}\), the xth component of \(\mathrm {hist}(X)\) is the number of occurrences of x in the dataset X. We will consider datasets \(X, X'\) to be equivalent if they have the same histogram (i.e., the ordering of the elements \(x_1, \ldots , x_n\) does not matter). For a multiset \(\mathcal {S}\) whose elements are in \(\mathcal {X}\), we will also write \(\mathrm {hist}(\mathcal {S})\) to denote the histogram of \(\mathcal {S}\) (so that the xth component is the number of copies of x in \(\mathcal {S}\)).

Differential Privacy. Two datasets \(X, X'\) are said to be neighboring if they differ in a single element, meaning that we can write (up to equivalence) \(X = (x_1, \ldots , x_{n-1}, x_n)\) and \(X' = (x_1, \ldots , x_{n-1}, x_n')\), for \(x_1, \ldots , x_n, x_n' \in \mathcal {X}\). In this case, we write \(X \sim X'\). Let \(\mathcal {Z}\) be a set; we now define the differential privacy of a randomized function \(P : \mathcal {X}^n \rightarrow \mathcal {Z}\):

Definition 2.1

(Differential privacy [44, 46]). A randomized algorithm \(P : \mathcal {X}^n \rightarrow \mathcal {Z}\) is \((\varepsilon , \delta )\)-differentially private (DP) if for every pair of neighboring datasets \(X \sim X'\) and for every set \(\mathcal {S}\subset \mathcal {Z}\), we have

$$ \mathbb {P}[P(X) \in \mathcal {S}] \leqslant e^\varepsilon \cdot \mathbb {P}[P(X') \in \mathcal {S}] + \delta , $$

where the probabilities are taken over the randomness in P. Here, \(\varepsilon \geqslant 0, \delta \in [0,1]\).

We will use the following compositional property of differential privacy.

Lemma 1

(Post-processing, e.g., [50]). If P is \((\varepsilon , \delta )\)-differentially private, then for every randomized function A, the composed function \(A \circ P\) is \((\varepsilon , \delta )\)-differentially private.

Shuffle Model. We review the shuffle model of differential privacy [16, 29, 54]. The input to the model is a dataset \((x_1, \ldots , x_n) \in \mathcal {X}^n\), where item \(x_i \in \mathcal {X}\) is held by user i. A protocol in the shuffle model is the composition of three algorithms:

  • The local randomizer \(R: \mathcal {X}\rightarrow \mathcal {Y}^*\) takes as input the data of one user, \(x_i \in \mathcal {X}\), and outputs a sequence \((y_{i,1}, \ldots , y_{i,m_i})\) of messages; here \(m_i\) is a positive integer. In the single-message shuffle model, we require \(m_i = 1\) for each i; in the multi-message shuffle model, \(m_i\) may be any positive integer.

  • The shuffler \(S: \mathcal {Y}^* \rightarrow \mathcal {Y}^*\) takes as input a sequence of elements of \(\mathcal {Y}\), say \((y_1, \ldots , y_m)\), and outputs a random permutation, i.e., the sequence \((y_{\pi (1)}, \ldots , y_{\pi (m)})\), where \(\pi \in S_m\) is a uniformly random permutation on [m]. The input to the shuffler will be the concatenation of the outputs of the local randomizers.

  • The analyzer \(A: \mathcal {Y}^* \rightarrow \mathcal {Z}\) takes as input a sequence of elements of \(\mathcal {Y}\) (which will be taken to be the output of the shuffler) and outputs an answer in \(\mathcal {Z}\) which is taken to be the output of the protocol P.
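The following minimal sketch (our illustration, not an implementation from the paper) makes the three-stage structure concrete; note that privacy will be required to hold already for the shuffled list handed to the analyzer.

```python
import random

def run_shuffle_protocol(local_randomizer, analyzer, inputs, rng=random):
    """Sketch of a shuffle-model protocol P = (R, S, A).

    local_randomizer: maps one user's input x_i to a list of messages
                      (a single-element list in the single-message model).
    analyzer: maps the shuffled list of all messages to the protocol output.
    """
    messages = []
    for x in inputs:
        messages.extend(local_randomizer(x))  # R: run locally by each user
    rng.shuffle(messages)                      # S: uniformly random permutation
    return analyzer(messages)                  # A: sees only the anonymized messages
```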

We will write \(P = (R, S, A)\) to denote the protocol whose components are given by R, S, and A. The main distinction between the shuffle and local models is the introduction of the (trusted) shuffler S between the local randomizer and the analyzer. As in the local model, the analyzer in the shuffle model is untrusted; hence privacy must be guaranteed with respect to the input to the analyzer, i.e., the output of the shuffler. Formally, we have:

Definition 2.2

(Differential privacy in the shuffle model, [29, 54]). A protocol \(P = (R, S, A)\) is \((\varepsilon , \delta )\)-differentially private if, for any dataset \(X = (x_1, \ldots , x_n)\), the algorithm

$$ (x_1, \ldots , x_n) \mapsto S(R(x_1), \ldots , R(x_n)) $$

is \((\varepsilon , \delta )\)-differentially private.

Notice that the output of \(S(R(x_1), \ldots , R(x_n))\) can be simulated by an algorithm that takes as input the multiset consisting of the union of the elements of \(R(x_1), \ldots , R(x_n)\) (which we denote as \(\bigcup _i R(x_i)\), with a slight abuse of notation) and outputs a uniformly random permutation of them. Thus, by Lemma 1, it can be assumed without loss of generality for privacy analyses that the shuffler simply outputs the multiset \(\bigcup _i R(x_i)\). For the purpose of analyzing accuracy of the protocol \(P = (R, S, A)\), we define its output on the dataset \(X = (x_1, \ldots , x_n)\) to be \(P(X) := A(S(R(x_1), \ldots , R(x_n)))\). We also remark that the case of local differential privacy, formalized in Definition 2.3, is a variant of the shuffle model where the shuffler S is replaced by the identity function.

Definition 2.3

(Local differential privacy [71]). A protocol \(P = (R,A)\) is \((\varepsilon , \delta )\)-differentially private in the local model (or \((\varepsilon , \delta )\)-locally differentially private) if the function \(x \mapsto R(x)\) is \((\varepsilon , \delta )\)-differentially private in the sense of Definition 2.1. We say that the output of the protocol P on an input dataset \(X = (x_1, \ldots , x_n)\) is \(P(X) := A(R(x_1), \ldots , R(x_n))\).

3 Overview of Results

In this work, we study several basic problems related to counting in the shuffle model of DP. In these problems, each of n users holds an element from a domain of size B. We consider the problems of frequency estimation, variable selection, heavy hitters, median estimation, and range counting and study whether it is possible to obtain \((\varepsilon ,\delta )\)-DP in the shuffle model with accuracy close to what is possible in the central model, while keeping communication low. This section contains an overview of our main results.

The frequency estimation problem (also known as computing histograms) is at the core of many of the problems we study. In the simplest version, for some positive integer \(B\), each of n users gets an element of the domain \(\mathcal {X}:= [B]\), and the goal is to estimate the number of users in a dataset X holding element j, namely \(\mathrm {hist}(X)_j\), for each query element \(j \in [B]\). We study frequency estimation with the \(\ell _\infty \) error, meaning that we define the error of a frequency estimation protocol to be the maximum additive error for the frequency estimate of any coordinate j. In particular, if \(\hat{f} \in \mathbb {R}^B\) is a vector of frequency estimates for a dataset X, then the \(\ell _\infty \) error is \(\Vert \mathrm {hist}(X) - \hat{f} \Vert _\infty = \max _{j \in [B]} |\mathrm {hist}(X)_j - \hat{f}_j|\). Frequency estimation is a fundamental primitive that is used in various data structural, sketching, and streaming applications (see Sect. 5 for its use in the shuffled protocols for range counting and median estimation as well as Sect. 6 for a sample of related work on the problem). Frequency estimation has been extensively studied in DP where in the central model, the smallest possible error is \(\varTheta (\min (\log (1/\delta )/\varepsilon , \log (B)/\varepsilon , n))\) (see, e.g., [93, Section 7.1]). By contrast, in the local model of DP, the smallest possible error is known to be \(\varTheta (\min (\sqrt{n \log (B)}/\varepsilon , n))\) under the assumption that \(\delta < o(1/n)\) [12] (this regime for \(\delta \) covers all values for \(\delta \) of interest in the setting of differential privacy).Footnote 1
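As a small illustration (ours, with hypothetical variable names), the histogram and the \(\ell _\infty \) error of a vector of frequency estimates can be computed as follows:

```python
from collections import Counter

def linf_error(dataset, estimates, B):
    """Maximum additive error of the estimates over all j in [B] = {1, ..., B}.

    dataset: list of user-held elements from [B];
    estimates: mapping from j to the estimated count hat_f[j].
    """
    hist = Counter(dataset)  # hist[j] = number of occurrences of j in the dataset
    return max(abs(hist[j] - estimates[j]) for j in range(1, B + 1))
```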

In the high-level exposition of our results given below, we let n and B be any positive integers. We typically take \(\varepsilon > 0\) to be any constant, and \(\delta > 0\) to be inverse polynomial in n. This assumption on \(\varepsilon \) and \(\delta \) covers a regime of parameters that is relevant in practice. We will make use of tilde notation (e.g., \(\tilde{O}\), \(\tilde{\varTheta }\)) to indicate the suppression of multiplicative factors that are polynomial in \(\log B\) and \(\log n\). Theorem statements which do not make such assumptions and contain full dependence on all parameters may be found in the supplementary material.

Single-Message Bounds for Frequency Estimation. For the frequency estimation problem, we show the following result in the shuffle model where each user sends a single message.

Theorem 1

(Informal version of Theorems 5 & 7). Any (O(1), o(1/n))-differentially private frequency estimation protocol in the single-message shuffle model has expected \(\ell _\infty \) error \(\tilde{\varOmega }( \min (\root 4 \of {n}, \sqrt{B}))\). Moreover, there is a single-message (O(1), o(1/n))-differentially private protocol with error \(\tilde{O}(\min (\root 4 \of {n}, \sqrt{B}))\).

The main contribution of Theorem 1 is the lower bound. To prove this result, we obtain improved bounds on the error needed for frequency estimation in local DP in the weak privacy regime where \(\varepsilon \) is around \(\ln {n}\). The upper bound in Theorem 1 follows by combining the recent result of Balle et al. [9] (building on the earlier result of Erlingsson et al. [54]) with RAPPOR [55] and B-ary randomized response [97] (see Sect. 4.1 and Section C for more details).

The precise version of Theorem 1 with polylogarithmic factors (i.e., Theorem 5) implies that in order for a single-message differentially private protocol to get error o(n) one needs to have \(n = \omega \left( \frac{\log B}{\log \log B} \right) \) users; see Corollary 2. This improves on a result of Cheu et al. [29, Corollary 32], which gives a lower bound of \(n = \omega (\log ^{1/17} B)\) for this task.

Multi-message Protocols for Frequency Estimation. Theorem 1 implies that in the single-message shuffle model, the error has to grow polynomially with \(\min (n,B)\), even with unbounded communication (i.e., message length). We next present (non-interactive) multi-message protocols in the shuffle model of DP for frequency estimation with only polylogarithmic error and communication. One of the protocols is a public-coin protocol, meaning that it makes use of a source of public randomness (known to all parties, including the adversary); the other protocol is a private-coin protocol, meaning that no such assumption is made. In addition to error and communication, a parameter of interest is the query time, which is the time to estimate the frequency of any element \(j \in [B]\) from the data structure constructed by the analyzer.Footnote 2

Theorem 2

(Informal version of Theorems 15 & 16). There are private-coin and public-coin multi-message \((O(1), 1/n^{O(1)})\)-DP protocols in the shuffle model for frequency estimation satisfying the following:

  • The private-coin protocol has \(\ell _\infty \) error \(O(\max \{\log B, \log n\})\), total communication of \(O(\log B\log ^2 n)\) bits per user, and query time \(\tilde{O}(n)\).

  • The public-coin protocol has \(\ell _\infty \) error \(O(\log ^{3/2}(B) \sqrt{\log (n\log (B))})\), total communication of \(O(\log ^4(B) \log ^2(n))\) bits per user, and query time \(O(\log B)\).

Table 1. Upper and lower bounds on expected maximum error (over all B queries, where the sum of all frequencies is n) for frequency estimation in different models of DP. The bounds are stated for fixed, positive privacy parameters \(\varepsilon \) and \(\delta \), and \(\tilde{\varTheta }/\tilde{O}/\tilde{\varOmega }\) asymptotic notation suppresses factors that are polylogarithmic in B and n. The communication per user is in terms of the total number of bits sent. In all upper bounds, the protocol is symmetric with respect to the users, and no public randomness is needed. References are to the first results we are aware of that imply the stated bounds.

Combining Theorems 1 and 2 yields the first separation between single-message and multi-message protocols for frequency estimation. Moreover, Theorem 2 can be used to obtain multi-message protocols with small error and small communication for several other widely studied problems (e.g., heavy hitters, range counting, and median and quantile estimation), discussed in Sect. 5. Finally, Theorem 2 implies the following consequence for statistical query (SQ) algorithms with respect to a distribution \(\mathcal {D}\) on \(\mathcal {X}\) (see Section G for the basic definitions). We say that a non-adaptive SQ algorithm \(\mathcal {A}\) making at most \(B\) queries \(q : \mathcal {X}\rightarrow \{ 0,1\}\) is k-sparse if for each \(x \in \mathcal {X}\), the Hamming weight of the output of the queries is at most k. Then, under the assumption that users’ data is drawn i.i.d. from \(\mathcal {D}\), the algorithm \(\mathcal {A}\) can be efficiently simulated in the shuffle model as follows (Table 1):

Corollary 1

(Informal version of Corollary 4). For any non-adaptive k-sparse SQ algorithm \(\mathcal {A}\) with \(B\) queries of tolerance \(\tau > 0\) and any \(\beta \in (0,1)\), there is a (private-coin) shuffle model protocol satisfying \((\varepsilon , \delta )\)-DP whose output has total variation distance at most \(\beta \) from that of \(\mathcal {A}\), such that the number of users is \(n \leqslant \tilde{O} \left( \frac{k}{\varepsilon \tau } + \frac{1}{\tau ^2} \right) \), and the per-user communication is \(\tilde{O}\left( \frac{k^2}{\varepsilon ^2}\right) \), where \(\tilde{O}(\cdot )\) hides logarithmic factors in \(B, n, 1/\delta , 1/\varepsilon \), and \(1/\beta \).

Corollary 1 improves upon the simulation of non-adaptive SQ algorithms in the local model [71], for which the number of users must grow as \(\frac{k}{\varepsilon ^2 \tau ^2}\) as opposed to \(\frac{1}{\tau ^2} + \frac{k}{\varepsilon \tau }\) in the shuffle model. We emphasize that the main novelty of Corollary 1 is in the regime that \(k^2/\varepsilon ^2 \ll B\); in particular, though prior work on low-communication private summation in the shuffle model [10, 29, 59] implies an algorithm for simulating \(\mathcal {A}\) with roughly the same bound on the number of users n as in Corollary 1 and communication \(\varOmega (B)\), it was unknown whether the communication could be reduced to have logarithmic dependence on \(B\), as in Corollary 1.

Single-Message Bounds for Selection. The techniques that we develop to prove the lower bound in Theorem 1 can be used to get a nearly tight \(\varOmega (B)\) lower bound on the number of users necessary to solve the selection problem. In the selection problemFootnote 3, each user \(i \in [n]\) is given an arbitrary subset of [B], represented by the indicator vector \(x_i \in \{0,1\}^B\), and the goal is for the analyzer to output an index \(j^* \in [B]\) such that

$$\begin{aligned} \sum _{i \in [n]} x_{i, j^*} \geqslant \max _{j \in [B]} \sum _{i \in [n]} x_{i,j} - \frac{n}{10}. \end{aligned}$$
(1)

In other words, the analyzer’s output should be the index of a domain element that is held by an approximately maximal number of users. The choice of the constant 10 in (1) is arbitrary; any constant larger than 1 may be used.
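Concretely, the requirement in Eq. (1) amounts to keeping the following gap below n/10 (an illustrative snippet with our own naming):

```python
def selection_gap(x, j_star):
    """Gap between the best coordinate and the one chosen by the analyzer.

    x: list of n indicator vectors, each of length B with entries in {0, 1};
    j_star: 0-indexed coordinate output by the analyzer.
    A valid answer to the selection problem has gap at most n / 10.
    """
    column_sums = [sum(column) for column in zip(*x)]
    return max(column_sums) - column_sums[j_star]
```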

The selection problem has been studied in several previous works on differential privacy, and it has many applications to machine learning, hypothesis testing and approximation algorithms (see [41, 90, 92] and the references therein). Our work improves an \(\varOmega (B^{1/17})\) lower bound on n in the single-message shuffle model due to Cheu et al. [29]. For \(\varepsilon =1\), the exponential mechanism [78] implies an \((\varepsilon , 0)\)-DP algorithm for selection with \(n = O(\log {B})\) users in the central model, whereas in the local model, it is known that any \((\varepsilon , 0)\)-DP algorithm for selection requires \(n = \varOmega (B \log {B})\) users [92].

Theorem 3

(Informal version of Theorem 11). For any single-message \((O(1), o(1/(nB)))\)-DP protocol in the shuffle model that solves the selection problem given in Eq. (1), the number n of users should be \(\varOmega (B)\).

The lower bound in Theorem 3 nearly matches the \(O(B\log B)\) upper bound on the required number of users that holds even in the local model (and hence in the single-message shuffle model) and that uses the B-randomized response [92, 97]. Cheu et al. [29] have previously obtained a multi-message protocol for selection with \(O(\sqrt{B})\) users, and combined with this result Theorem 3 yields the first separation between single-message and multi-message protocols for selection.

In subsequent work, Chen et al. [28] extended Theorem 3 to the setting where each user sends only a few messages; in particular, they show that if each user sends at most m messages in the shuffle model, then the number of users must be \(\varOmega (B/m)\). Their proof uses techniques broadly similar to ours.

4 Proof Outlines

4.1 Overview of Single-Message Lower Bounds

We start by giving an overview of the lower bound of \(\tilde{\varOmega }( \min \{ n^{1/4}, \sqrt{B} \})\) in Theorem 1 on the error of any single-message frequency estimation protocol. We first focus on the case where \(n \leqslant B^2\) and thus \(\min \{n^{1/4}, \sqrt{B}\} = n^{1/4}\). The main component of the proof in this case is a lower bound of \(\tilde{\varOmega }(n^{1/4})\) for frequency estimation for \((\varepsilon _L, \delta _L)\)-local DP protocolsFootnote 4 when \(\varepsilon _L =\ln (n) + O(1)\). In fact, we prove lower bounds for \((\varepsilon _L, \delta _L)\)-locally differentially private protocols for a broader range of parameters \(\varepsilon _L, \delta _L\) in Theorem 6; a special case of this result which includes the setting \(\varepsilon _L = \ln (n) + O(1)\) relevant for the shuffle model is stated below:

Theorem 4

(Local DP lower bound; informal version of Theorem 6). Suppose that \(\varepsilon _L, \delta _L > 0\) satisfy

$$ \frac{2}{3} \cdot \ln n \leqslant \varepsilon _L + \ln (1 + \varepsilon _L) \leqslant \min \left\{ 2 \ln (B) - O(1), 2 \ln (n) - 2 \ln \ln (B)\right\} , $$

and \(\delta _L < o\left( \min \left\{ \frac{1}{n \ln n}, \exp (-\varepsilon _L) \right\} \right) \). Then any \((\varepsilon _L, \delta _L)\)-locally differentially private protocol for frequency estimation on \([B]\) must have \(\ell _\infty \) error at least \( \tilde{\varOmega } \left( \frac{\sqrt{n}}{e^{\varepsilon _L/4}}\right) , \) where the tilde hides factors polynomial in \(\log B, \log n\).

While lower bounds for local DP frequency estimation were previously obtained in the seminal works of Bassily and Smith [12] and Duchi, Jordan and Wainwright [42], two critical reasons make them less useful for our purposes: (i) for \(\varepsilon _L = \omega (1)\) (i.e., in the low-privacy regime) they only apply to the case where \(\delta _L = 0\) (i.e., pure privacy)Footnote 5, and (ii) even for \(\delta _L = 0\), their dependence on \(\varepsilon _L\) is sub-optimal when \(\varepsilon _L = \omega (1)\): the results of [42], for instance, imply a lower bound of \(\varOmega \left( \frac{\sqrt{n\log B}}{e^{\varepsilon _{L}}} \right) \) on the \(\ell _\infty \) error.Footnote 6 By contrast, Theorem 4 covers the low and approximate privacy regime; we next discuss its proof.

Let R be an \((\varepsilon _L, \delta _L)\)-locally differentially private randomizer. The general approach in the proof of Theorem 4, which was also taken in [12, 42], is to show that if V is a random variable drawn uniformly at random from \([B]\) and if X is a random variable that is equal to V with probability \(\alpha \), for some appropriate choice of \(\alpha \in (0,1)\), and is drawn uniformly at random from \([B]\) otherwise, then the mutual information between V and the local randomizer output R(X) satisfies

$$\begin{aligned} I(V; R(X)) \leqslant \frac{\log B}{4n}. \end{aligned}$$
(2)

Once (2) is established, the chain rule of mutual information implies that \(I(V; R(X_1), \ldots , R(X_n)) \leqslant \frac{\log {B}}{4}\), where \(X_1, \ldots , X_n\) are independent and identically distributed given V. Fano’s inequality [38] then implies that the probability that any analyzer receiving \(R(X_1), \ldots , R(X_n)\) correctly guesses V is at most 1/4; on the other hand, an analyzer with \(\ell _\infty \) error sufficiently smaller than \(\alpha n\) must be able to determine V with high probability, since the frequency of V in the dataset \(X_1, \ldots , X_n\) exceeds the frequency of every other \(v \in [B]\) by roughly \(\alpha n\). This approach thus yields a lower bound of \(\varOmega (\alpha n)\) on the error of frequency estimation.
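In more detail, writing \(\hat{V}\) for the analyzer's guess of V, the two information-theoretic steps can be written out as follows (a sketch; the first inequality uses that the \(R(X_i)\) are independent given V, and the \(1/\log B\) term is the lower-order correction coming from Fano's inequality):

$$ I(V; R(X_1), \ldots , R(X_n)) \leqslant \sum _{i=1}^{n} I(V; R(X_i)) \leqslant n \cdot \frac{\log B}{4n} = \frac{\log B}{4}, \qquad \mathbb {P}[\hat{V} = V] \leqslant \frac{I(V; R(X_1), \ldots , R(X_n)) + 1}{\log B} \leqslant \frac{1}{4} + \frac{1}{\log B}. $$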

To prove the lower bound of Theorem 4 using this approach, we choose \(\alpha n = \tilde{\varTheta }(\sqrt{n} / e^{\varepsilon _L / 4})\), and show that

$$\begin{aligned} I(V; R(X)) \leqslant \tilde{O}(\alpha ^4 n e^{\varepsilon _L}) \leqslant \frac{\log B}{4n}. \end{aligned}$$
(3)

For the application to the single-message shuffle model, we will have \(\varepsilon _L = \ln (n) + O(1)\) and so \(\alpha = \tilde{\varTheta }( n^{-3/4})\); as we will discuss later, (3) is essentially tight in this regime.

Limitations of Previous Approaches. We first state the existing upper bounds on \(I(V; R(X))\), which only use the privacy of the local randomizer. Bassily and Smith [12, Claim 5.4] showed an upper bound of \(I(V; R(X)) \leqslant O(\varepsilon _L^2 \alpha ^2)\) with \(\varepsilon _L = O(1)\) and \(\delta _L = o(1/(n \log n))\), which thus satisfies (2) with \(\alpha = \varTheta \left( \sqrt{\frac{\log B}{\varepsilon _L^2 n}} \right) \). For \(\delta _L = 0\), Duchi et al. [42] generalized this result to the case \(\varepsilon _L \geqslant 1\), proving thatFootnote 7 \(I(V; R(X)) \leqslant O(\alpha ^2 e^{2\varepsilon _L})\). Even ignoring the constraint \(\delta _L = 0\), this bound of [42] is weaker than (3) for the above setting of \(\alpha \) and \(\varepsilon _L\).

However, proving the mutual information bound in (3) turns out to be impossible if we only use the privacy of the local randomizers! In particular, the bound can be shown to be false if all we assume about R is that it is \((\varepsilon _L, \delta _L)\)-locally differentially private for some \(\varepsilon _L \approx \ln n\) and \(\delta _L \leqslant n^{-O(1)}\). For instance, it is violated if one takes R to be \(R_{{{\,\mathrm{RR}\,}}}\), the local randomizer of the \(B\)-randomized response [97]. Consider for example the regime where \(B \leqslant n \leqslant B^2\), and the setting where \(R_{{{\,\mathrm{RR}\,}}}(v)\) is equal to v with probability \(1-B/n\), and is uniformly random over \([B]\) with the remaining probability of \(B/n\). In this case, the local randomizer \(R_{{{\,\mathrm{RR}\,}}}(\cdot )\) is \((\ln (n) + O(1), 0)\)-differentially private. A simple calculation shows that \(I(V; R_{{{\,\mathrm{RR}\,}}}(X)) = \tilde{\varTheta }(\alpha )\). Whenever \(\alpha \ll 1/n^{2/3}\), which is the regime we have to consider in order to prove Theorem 1Footnote 8, it holds that \(\alpha \gg \alpha ^4 n \exp (\ln (n))\), thus contradicting (3). (See also Remark 4 for an explanation of how a slightly different strategy also fails.) The insight derived from this counterexample is crucial, as we describe in our new technique next.
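For completeness, the privacy parameter of \(R_{{{\,\mathrm{RR}\,}}}\) claimed above can be verified directly: for any \(u \ne v\),

$$ \mathbb {P}[R_{{{\,\mathrm{RR}\,}}}(v) = v] = \left( 1 - \frac{B}{n}\right) + \frac{B}{n} \cdot \frac{1}{B} = 1 - \frac{B}{n} + \frac{1}{n}, \qquad \mathbb {P}[R_{{{\,\mathrm{RR}\,}}}(u) = v] = \frac{B}{n} \cdot \frac{1}{B} = \frac{1}{n}, $$

so the likelihood ratio between any two inputs is at most \(n - B + 1 \leqslant n\), i.e., \(R_{{{\,\mathrm{RR}\,}}}\) is indeed \((\ln (n) + O(1), 0)\)-differentially private.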

Mutual Information Bound from Privacy and Accuracy. Departing from previous work, we manage to prove the stronger bound (3) as follows. Inspecting the counterexample based on the B-randomized response outlined above, we first observe that any analyzer must have error at least \(\varOmega (\sqrt{B})\) when combined with \(R_{{{\,\mathrm{RR}\,}}}(\cdot )\), which is larger than \(\alpha n\), the error that would be ruled out by the subsequent application of Fano’s inequality. This leads us to appeal to accuracy, in addition to privacy, when proving the mutual information upper bound. We thus leverage the additional available property that the local randomizer R can be combined with an analyzer A in such a way that the mapping \((x_1, \ldots , x_n) \mapsto A(R(x_1), \ldots , R(x_n))\) computes the frequencies of elements of every dataset \((x_1, \ldots , x_n)\) accurately, i.e., to within an error of \(O(\alpha n)\). At a high level, our approach for proving the bound in (3) then proceeds by:

  (i) Proving a structural property satisfied by the randomizer corresponding to any accurate frequency estimation protocol. Namely, we show in Lemma 10 that if there is an accurate analyzer, the total variation distance between the output of the local randomizer on any given input, and its output on a uniform input, is close to 1.

  (ii) Using the \((\varepsilon _L, \delta _L)\)-DP property of the randomizer along with the structural property in (i) in order to upper-bound the mutual information \(I(V; R(X))\).

We believe that the application of the structural property in (i) to proving bounds of the form (3) is of independent interest. As we further discuss below, this property is, in particular, used (together with privacy of R) to argue that for most inputs \(v \in [B]\), the local randomizer output R(v) is unlikely to equal a message that is much less likely to occur when the input is uniformly random than when it is v. Note that it is somewhat counter-intuitive that accuracy is used in the proof of this fact, as one way to achieve very accurate protocols is to ensure that R(v) is equal to a message which is unlikely when the input is any \(u \ne v\). We now outline the proofs of (i) and (ii) in more detail.

The gist of the proof of (i) is an anti-concentration statement. Let v be a fixed element of [B] and let U be a random variable uniformly distributed on [B]. Assume that the total variation distance \(\varDelta (R(v), R(U))\) is not close to 1, and that a small fraction of the users have input v while the rest have uniformly random inputs. Let \(\mathcal {Z}\) denote the range of the local randomizer R. First, we consider the special case where \(\mathcal {Z}\) is \(\{ 0,1\}\). Then the distribution of the histogram of outputs of the users with v as their input is in bijection with a binomial random variable with parameter \(p := \mathbb {P}[R(v) = 1]\), and the same is true for the distribution of the shuffled outputs of the users with uniform random inputs U (with parameter \(q := \mathbb {P}[R(U) = 1]\)). Then, we use the anti-concentration properties of binomial random variables in order to argue that if \(|p-q| = \varDelta (R(v), R(U))\) is too small, then with nontrivial probability the shuffled outputs of the users with input v will be indistinguishable from the shuffled outputs of the users with uniform random inputs. This is then used to contradict the supposed accuracy of the analyzer. To deal with the general case where the range \(\mathcal {Z}\) is any finite set, we repeatedly apply the data processing inequality for total variation distance in order to reduce to the binary case (Lemma 13). The full proof appears in Lemma 10.
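The binary-range case can also be checked numerically. The sketch below (ours) computes the exact total variation distance between the two binomial distributions arising in the argument, illustrating that it stays bounded away from 1 when \(|p - q|\) is much smaller than \(1/\sqrt{n}\):

```python
from math import comb

def binom_pmf(n, p, k):
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

def tv_binomials(n, p, q):
    """Exact total variation distance between Binomial(n, p) and Binomial(n, q)."""
    return 0.5 * sum(abs(binom_pmf(n, p, k) - binom_pmf(n, q, k)) for k in range(n + 1))

# With n = 1000 users and |p - q| = 0.003, which is well below 1/sqrt(n) ~ 0.032,
# the two output histograms overlap substantially (distance roughly 0.08).
print(tv_binomials(1000, 0.500, 0.503))
```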

Equipped with the property in (i), we now outline the proof of the mutual information bound in (ii). Denote by

  • \(\mathcal {T}_v\) the set of messages much more likely to occur when the input is v than when it is uniform,

  • \(\mathcal {Y}_v\) the set of messages less likely to occur when the input is v than when it is uniform.

Note that the union \(\mathcal {T}_v \cup \mathcal {Y}_v\) is not the entire range \(\mathcal {Z}\) of messages; in particular, it does not include messages that are a bit more likely to occur when the input is v than when it is uniform.Footnote 9 At a high level, it turns out that the mutual information \(I(V; R(X))\) will be large, i.e., R(X) will reveal a significant amount of information about V, if either of the following events occurs:

  (a) There are not enough inputs \(v \in [B]\) such that the mass \(\mathbb {P}[R(X) \in \mathcal {Y}_v]\) is large. Intuitively, for v such that \(\mathbb {P}[R(X) \in \mathcal {Y}_v]\) is large, the local randomizer “effectively hides” the fact that the uniform input X is v given that X indeed equals v and \(R(v) \in \mathcal {Y}_v\).

  (b) There are too many inputs \(v \in [B]\) such that the mass \(\mathbb {P}[R(v) \in \mathcal {T}_v]\) is large. Such inputs make it too likely that \(X = v\) given that \(R(X) \in \mathcal {T}_v\), which makes it more likely in turn that \(V=v\).

We first note that the total variation distance \(\varDelta (R(v), R(X))\) is upper-bounded by \(\mathbb {P}[R(X) \in \mathcal {Y}_v]\). On the other hand, the accuracy of the protocol along with property (i) imply that \(\varDelta (R(v), R(X))\) is close to 1 for all v. By putting these together, we can conclude that event (a) does not occur (see Lemma 10 for more details).

To prove that event (b) does not occur, we use the \((\varepsilon _L, \delta _L)\)-DP guarantee of the local randomizer R. Namely, we will use the inequality \(\mathbb {P}[R(v) \in \mathcal {S}] \leqslant e^{\varepsilon _L} \cdot \mathbb {P}[R(X) \in \mathcal {S}] + \delta _L\) for various subsets \(\mathcal {S}\) of \(\mathcal {Z}\). Unfortunately, setting \(\mathcal {S}= \mathcal {T}_v\) does not lead to a good enough upper bound on \(\mathbb {P}[R(v) \in \mathcal {T}_v]\); indeed, for the local randomizer \(R = R_{{{\,\mathrm{RR}\,}}}\) corresponding to the B-ary randomized response, we will have \(\mathcal {T}_v = \{ v \}\) for \(n \gg B\), and so \(\mathbb {P}[R(v) \in \mathcal {T}_v] = 1-B/n \approx 1\) for any v. Thus, to rule out event (b), we need to additionally use the accuracy of the analyzer A (i.e., property (i) above), together with a careful double-counting argument to enumerate the probabilities that R(v) belongs to subsets of \(\mathcal {T}_v\) of different granularity (with respect to the likelihood of occurrence under input v versus a uniform input).

Having established Theorem 4 giving a lower bound for locally differentially private estimation in the low-privacy regime, Theorem 1 follows in a straightforward manner: the only step is to apply a lemma of Cheu et al. [29] (restated as Lemma 2 below), stating that any lower bound for \((\varepsilon + \ln (n), \delta )\)-locally differentially private protocols implies a lower bound for \((\varepsilon , \delta )\)-differentially private protocols in the single-message shuffle model (i.e., we take \(\varepsilon _L = \varepsilon + \ln (n)\)). Indeed, for \(\varepsilon _L = \ln (n) + O(1)\), the error lower bound from Theorem 4 is \(\tilde{\varOmega }(\sqrt{n} / e^{\varepsilon _L/4}) = \tilde{\varOmega }(n^{1/4})\). Finally, we point out that while the above outline assumed that \(n \le B^2\), it turns out that this is essentially without loss of generality as the other case where \(n > B^2\) can be reduced to the former (see Lemma 6).

Tightness of Lower Bounds. The lower bounds sketched above are nearly tight. The upper bound of Theorem 1 follows from combining existing results showing that the single-message shuffle model provides privacy amplification of locally differentially private protocols [9, 54], with known locally differentially private protocols for frequency estimation [9, 42, 55, 97]. In particular, as recently shown by Balle et al. [9], a pure \((\varepsilon _L, 0)\)-differentially private local randomizer yields a protocol in the shuffle model that is \(\left( O\left( e^{\varepsilon _L} \sqrt{\frac{\log (1/\delta )}{n}}\right) , \delta \right) \)-differentially private and that has the same level of accuracy.Footnote 10 Then:

  • When combined with RAPPOR [42, 55], we get an upper bound of \(\tilde{O}(n^{1/4})\) on the error.

  • When combined with the \(B\)-randomized response [3, 97], we get an error upper bound of \(\tilde{O}(\sqrt{B})\).

The full details appear in Section C. Put together, these imply that the minimum in our lower bound in Theorem 1 is tight (up to logarithmic factors). It also follows that the mutual information bound in Eq. (3) is tight (up to logarithmic factors) for \(\varepsilon _L = \ln (n) + O(1)\) and \(\alpha = n^{-3/4}\) (which is the parameter setting corresponding to the single-message shuffle model); indeed, a stronger bound in Eq. (3) would lead to larger lower bounds in the single-message shuffle model, thereby contradicting the upper bounds discussed in this paragraph.

Lower Bound for Selection: Sharp Bound on Level-1 Weight of Probability Ratio Functions. We now outline the proof of the nearly tight lower bound on the number of users required to solve the selection problem in the single-message shuffle model (Theorem 3). The main component of the proof in this case is a lower bound of \(\varOmega (B)\) users for selection for \((\varepsilon _L, \delta _L)\)-local DP protocols when \(\varepsilon _L =\ln (n) + O(1)\).

In the case of local \((\varepsilon _L, 0)\)-DP (i.e., pure) protocols, Ullman [92] proved a lower bound \(n = \varOmega \left( \frac{B\log B}{(\exp (\varepsilon _L) - 1)^2}\right) \). There are two different reasons why this lower bound is not sufficient for our purposes:

  1. It does not rule out DP protocols with \(\delta _L > 0\) (i.e., approximate protocols), which are necessary to consider for our application to the shuffle model.

  2. For the low privacy setting of \(\varepsilon _L =\ln (n) + O(1)\), the bound simplifies to \( n = \tilde{\varOmega }(B/ n^2)\), i.e., \(n = \tilde{\varOmega }(B^{1/3})\), weaker than what we desire.

To prove our near-optimal lower bound, we remedy both of the aforementioned limitations by allowing positive values of \(\delta _L\) and achieving a better dependence on \(\varepsilon _L\). As in the proof of the frequency estimation lower bound, we reduce proving Theorem 3 to the task of showing the following mutual information upper bound:

$$\begin{aligned} I((L,J); R(X_{L,J})) \leqslant \tilde{O} \left( \frac{1}{B} \right) + O(\delta _L (B+ n)), \end{aligned}$$
(4)

where L is a uniform random bit, J is a uniform random coordinate in [B], and \(X_{L,J}\) is uniform over the subcube \(\{ x \in \{0,1\}^B: x_J = L \}\). Indeed, once (4) holds and \(\delta _L < o(1/(Bn))\), the chain rule implies that the mutual information between all users’ messages and the pair (L, J) is at most \(O\left( \frac{n\ln (B)}{B}\right) \). It follows by Fano’s inequality that if \(n = o(B)\), no analyzer can determine the pair (L, J) with high probability (which any protocol for selection must be able to do).

For any message z in the range of R, define the Boolean function \(f_z(x) := \frac{\mathbb {P}[R(x) = z]}{\mathbb {P}[R(X_{L,J}) = z]}\) where \(x \in \{0,1\}^B\). Let \(\mathbf {W}^1[f]\) denote the level-1 Fourier weight of a Boolean function f. To prove inequalities of the form (4), the prior work of Ullman [92] shows that \(I((L,J); R(X_{L,J}))\) is determined by \(\mathbf {W}^1[f_z]\), up to normalization constants. In the case where \(\delta _L = 0\) and \(\varepsilon _L = \ln (n) + O(1)\), \(f_z \in [0,e^{\varepsilon _L}]\), and by Parseval’s identity \(\mathbf {W}^1[f_z] \leqslant O(e^{2\varepsilon _L})\) for any message z, leading to

$$\begin{aligned} I((L,J); R(X_{L,J})) \leqslant O\left( \frac{e^{2\varepsilon _L}}{B}\right) . \end{aligned}$$
(5)

Unfortunately, for our choice of \(\varepsilon _L = \ln (n) + O(1)\), (5) is weaker than (4).
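For reference, the level-1 weight and the Parseval bound used in (5) are the standard notions from the analysis of Boolean functions [84]: for \(f : \{0,1\}^B\rightarrow \mathbb {R}\),

$$ \mathbf {W}^1[f] = \sum _{j \in [B]} \hat{f}(\{ j\})^2 \leqslant \sum _{\mathcal {S}\subseteq [B]} \hat{f}(\mathcal {S})^2 = \mathbb {E}_{x}\left[ f(x)^2\right] , $$

where the expectation is over a uniformly random \(x \in \{0,1\}^B\); since \(f_z\) takes values in \([0, e^{\varepsilon _L}]\) when \(\delta _L = 0\), this gives \(\mathbf {W}^1[f_z] \leqslant e^{2\varepsilon _L}\), which is the bound behind (5).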

To show (4), we depart from the previous approach in the following ways:

  (a) We show that the functions \(f_z\) take values in \([0, O(e^{\varepsilon _L})]\) for most inputs x; this uses the \((\varepsilon _L, \delta _L)\)-local DP of the local randomizer R (we cannot show this for all x as in general \(\delta _L > 0\)).

  (b) Using the Level-1 inequality from the analysis of Boolean functions [84] (see Theorem 13 below), we upper bound \(\mathbf {W}^1[g_z]\) by \(O(\varepsilon _L)\), where \(g_z\) is the truncation of \(f_z\) defined by \(g_z(x) = f_z(x)\) if \(f_z(x) \leqslant O(n)\), and \(g_z(x) = 0\) otherwise.

  (c) We bound \(I((L,J); R(X_{L,J}))\) by \(\mathbf {W}^1[g_z]\), using the fact that \(f_z\) is sufficiently close to its truncation \(g_z\).

The above line of reasoning, formalized in Section B.5, allows us to show

$$ I((L,J); R(X_{L,J})) \leqslant O \left( \frac{\varepsilon _L}{B} + \delta _L \cdot (B+ e^{\varepsilon _L}) \right) , $$

which is sufficient to establish that (4) holds.

Having proved a lower bound on the number of users required by any \((\varepsilon + \ln n, \delta )\)-local DP protocol for selection with \(\varepsilon = O(1)\), the final step in the proof is to apply a lemma of [29] to deduce the desired lower bound in the single-message shuffle model.

4.2 Overview of Multi-message Protocols

An important consequence of our lower bound in Theorem 1 is that one cannot achieve an error of \(\mathrm {polylog}(n,B)\) using single-message protocols. This in particular rules out any approach that uses the following natural two-step recipe for getting a private protocol in the shuffle model with accuracy better than in the local model:

  1. Run any known locally differentially private protocol with a setting of parameters that enables high-accuracy estimation at the analyzer, but exhibits low privacy locally.

  2. Randomly shuffle the messages obtained when each user runs step 1 on their input, and use the privacy amplification by shuffling bounds [9, 54] to improve the privacy guarantees.

Thus, shuffled versions of the B-randomized response [3, 97], RAPPOR [3, 42, 55], the Bassily–Smith protocol [12], TreeHist and Bitstogram [11], and the Hadamard response protocol [2, 3], will still incur an error of \(\varOmega (\min (\root 4 \of {n}, \sqrt{B}))\).

Moreover, although the single-message protocol of Cheu et al. [29] for binary aggregation (as well as the multi-message protocols given in [7, 8, 59, 60] for the more general task of real-valued aggregation) can be applied to the one-hot encodings of each user’s input to obtain a multi-message protocol for frequency estimation with error \(\mathrm {polylog}(n,B)\), the communication per user would be \(\varOmega (B)\) bits, which is clearly undesirable.

Recall that the main idea behind (shuffled) randomized response is for each user to send their input with some probability, and random noise with the remaining probability. Similarly, the main idea behind (shuffled) Hadamard response is for each user to send a uniformly random index from the support of the Hadamard codeword corresponding to their input with some probability, and a random index from the entire universe with the remaining probability. In both protocols, the user is sending a message that either depends on their input or is noise; this restriction turns out to be a significant limitation. Our main insight is that allowing multiple messages lets users simultaneously send both types of messages, leading to a sweet spot with exponentially smaller error and communication.

Our Protocols. We design a multi-message version of the private-coin Hadamard response of Acharya et al. [2, 3] where each user sends a small subset of indices sampled uniformly at random from the support of the Hadamard codeword corresponding to their input, and in addition sends a small subset of indices sampled uniformly at random from the entire universe [B]. To get accurate results it is crucial that a subset of indices is sampled, as opposed to just a single index (as in the local model protocol of [2, 3]). We show that in the regime where the number of indices sampled from inside the support of the Hadamard codeword and the number of noise indices sent by each user are both logarithmic, the resulting multi-message algorithm is private in the shuffle model, and it has polylogarithmic error and communication per user (see Theorem 15, Lemmas 17, 18, and 19 for more details).
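To make the structure concrete, here is a minimal sketch (our illustration only: the encoding of inputs as nonzero rows of a Hadamard matrix of order D > B, the parameters s and t, and the simple unbiased estimator are illustrative choices, not the exact protocol or parameterization of Theorem 15):

```python
import random

def in_codeword(v, j):
    """True iff column j lies in the support of the Hadamard codeword of input v.

    Input v in {0, ..., B-1} is encoded by row v + 1 of a Hadamard matrix of
    order D (a power of two with D > B); column j is in the support iff the
    GF(2) inner product <v + 1, j> is 0, i.e., popcount((v + 1) & j) is even.
    """
    return bin((v + 1) & j).count("1") % 2 == 0

def randomizer(v, D, s, t, rng=random):
    """Each user sends s 'signal' indices drawn from the codeword support plus
    t 'noise' indices drawn uniformly from the whole universe (s, t logarithmic)."""
    support = [j for j in range(D) if in_codeword(v, j)]  # for clarity; can be sampled directly
    return rng.choices(support, k=s) + rng.choices(range(D), k=t)

def estimate_frequency(shuffled_messages, v, n, s, t):
    """Unbiased estimate of the number of users holding v.

    Signal messages of a user holding v always hit the support of v; signal
    messages of other users and noise messages hit it with probability 1/2.
    """
    hits = sum(1 for j in shuffled_messages if in_codeword(v, j))
    return (2 * hits - (s + t) * n) / s
```

The privacy analysis in the paper hinges on the shuffled union of the noise indices statistically hiding which signal indices came from which input; the sketch above only illustrates the communication pattern and the accuracy side.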

A limitation of our private-coin algorithm outlined above is that the time for the analyzer to answer a single query is \(\tilde{O}(n)\). This might be a drawback in applications where the analyzer is CPU-limited or where it is supposed to produce real-time answers. In the presence of public randomness, we design an algorithm that remedies this limitation, having error, communication per user, and query time all bounded above by \(\mathrm {polylog}(n,B)\). This algorithm is based on a multi-message version of randomized response combined in a delicate manner with the Count Min data structure [34] (for more details, see Section D.2). Previous work [11, 12] on DP has used Count Sketch [24], which is a close variant of Count Min, to reduce heavy hitter computation to frequency estimation. In contrast, our use of Count Min has the purpose of reducing the amount of communication per user.
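For background, a plain (non-private) Count Min sketch is shown below; in the public-coin protocol the hash functions are derived from the shared public randomness, and the privacy-preserving noise messages of the shuffle-model protocol are layered on top (details in Section D.2; the snippet is our own simplification):

```python
import random

class CountMin:
    """Plain Count Min sketch [34]: depth independent hash rows of width counters each."""

    def __init__(self, width, depth, seed=0):
        self.width, self.depth = width, depth
        rng = random.Random(seed)  # with public randomness, all parties share this seed
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, x):
        return hash((self.salts[row], x)) % self.width

    def update(self, x, count=1):
        for r in range(self.depth):
            self.table[r][self._bucket(r, x)] += count

    def query(self, x):
        # Every row over-counts due to collisions, so the minimum is the estimate.
        return min(self.table[r][self._bucket(r, x)] for r in range(self.depth))
```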

5 Applications

Heavy Hitters. Another algorithmic task that is closely related to frequency estimation is computing the heavy hitters in a dataset distributed across n users, where the goal of the analyzer is to (approximately) retrieve the identities and counts of all elements that appear at least \(\tau \) times, for a given threshold \(\tau \). It is well-known that in the central DP model, it is possible to compute \(\tau \)-heavy hitters for \(\tau = \mathrm {polylog}(n,B)\), whereas in the local DP model, it is possible to compute \(\tau \)-heavy hitters if and only if \(\tau = \tilde{\varOmega }(\sqrt{n})\). By combining with known reductions (e.g., from Bassily et al. [11]), our multi-message protocols for frequency estimation yield multi-message protocols for computing the \(\tau \)-heavy hitters with \(\tau = \mathrm {polylog}(n,B)\) and total communication of \(\mathrm {polylog}(n,B)\) bits per user (for more details, see Section H).

Range Counting. In range counting, each of the n users is associated with a point in \([B]^d\) and the goal of the analyzer is to answer arbitrary queries of the form: given a rectangular box in \([B]^d\), how many of the points lie in it?Footnote 11 This is a basic algorithmic primitive that captures an important family of database queries and is useful in geographic applications. This problem has been well-studied in the central model of DP, where Chan et al. [22] obtained an upper bound of \((\log B)^{O(d)}\) on the error (see Sect. 6 for more related work). It has also been studied in the local DP model [33]; in this case, the error has to be at least \(\varOmega (\sqrt{n})\) even for \(d=1\).

We obtain private protocols for range counting in the multi-message shuffle model with exponentially smaller error than what is possible in the local model (for a wide range of parameters). Specifically, we give a private-coin multi-message protocol with \((\log {B})^{O(d)}\) messages per user each of length \(O(\log {n})\) bits, error \((\log {B})^{O(d)}\), and query time \(\tilde{O}(n \log ^d B)\). Moreover, we obtain a public-coin protocol with similar communication and error but with a much smaller query time of \(\tilde{O}(\log ^d B)\) (see Section F for more details).

We now briefly outline the main ideas behind our multi-message protocols for range counting. We first argue that even for \(d=1\), the total number of queries is \(\varTheta (B^2)\) and the number of possible queries to which a user positively contributes is also \(\varTheta (B^2)\). Thus, direct applications of DP algorithms for aggregation or for frequency estimation would result in polynomial error and polynomial communication per user. Instead, we combine our multi-message protocol for frequency estimation (Theorem 2) with a communication-efficient implementation, in the multi-message shuffle model, of the space-partitioning data structure used in the central model protocol of Chan et al. [22]. The idea is to use a collection \(\mathcal {B}\) of \(O(B\log ^d B)\) d-dimensional rectangles in \([B]^d\) (so-called dyadic intervals) with the property that an arbitrary rectangle can be formed as the disjoint union of \(O(\log ^d B)\) rectangles from \(\mathcal {B}\). Furthermore, each point in \([B]^d\) is contained in \(O(\log ^d B)\) rectangles from \(\mathcal {B}\). This means that it suffices to release a private count of the number of points inside each rectangle in \(\mathcal {B}\)—a frequency estimation task where each user input contributes to \(O(\log ^d B)\) buckets. To turn this into a protocol with small maximum communication in the shuffle model, we develop an approach analogous to the matrix mechanism [74, 75]. We argue that the transformation of the aforementioned central model algorithm for range counting into a private protocol in the multi-message shuffle model with small communication and error is non-trivial and relies on the specific protocol structure. In fact, the state-of-the-art range counting algorithm of Dwork et al. [48] in the central model does not seem to transfer to the shuffle model.
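To illustrate the decomposition for \(d = 1\) (the d-dimensional collection is built from products of such intervals), the sketch below (ours, 0-indexed for simplicity) greedily splits an arbitrary range into \(O(\log B)\) disjoint dyadic intervals:

```python
def dyadic_decomposition(a, b, B):
    """Split the range {a, ..., b} inside {0, ..., B - 1} (B a power of two)
    into O(log B) disjoint dyadic intervals [k * 2^h, (k + 1) * 2^h).

    Returns a list of (start, length) pairs whose disjoint union is {a, ..., b}.
    """
    intervals = []
    lo, hi = a, b + 1  # work with the half-open range [lo, hi)
    while lo < hi:
        size = lo & -lo if lo > 0 else B   # largest dyadic block starting at lo
        while size > hi - lo:
            size //= 2
        intervals.append((lo, size))
        lo += size
    return intervals

# Example with B = 16: the range {3, ..., 12} splits into four dyadic intervals.
print(dyadic_decomposition(3, 12, 16))  # [(3, 1), (4, 4), (8, 4), (12, 1)]
```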

M-Estimation of Median. A very basic statistic of any dataset of real numbers is its median. For simplicity, suppose our dataset consists of real numbers lying in [0, 1]. It is well-known that there is no DP algorithm for estimating the value of the median of such a dataset with error o(1) (i.e., outputting a real number whose absolute distance to the true median is o(1)) [93, Section 3]. This is because the median of a dataset can be highly sensitive to a single data point when there are not many individual data points near the median. Thus in the context of DP, one has to settle for weaker notions of median estimation. One such notion is M-estimation, which amounts to finding a value \(\tilde{x}\) that approximately minimizes \(\sum _i |x_i - \tilde{x}|\) (recall that the median is the minimizer of this objective). This notion has been studied in previous work on DP including by [42, 73] (for more on related work, see Sect. 6 below). Our private range counting protocol described above yields a multi-message protocol with communication \(\mathrm {polylog}(n)\) per user and that M-estimates the median up to error \(\mathrm {polylog}(n)\), i.e., outputs a value \(y \in [0,1]\) such that \(\sum _i |x_i - y| \le \min _{\tilde{x}} \sum _i |x_i - \tilde{x}| + \mathrm {polylog}(n)\) (see Theorem 23 in Section I). Beyond M-estimation of the median, our work implies private multi-message protocols for estimating quantiles with \(\mathrm {polylog}(n)\) error and \(\mathrm {polylog}(n)\) bits of communication per user (see Section I for more details).

6 Related Work

Shuffle Privacy Model. Following the proposal of the Encode-Shuffle-Analyze architecture by Bittau et al. [16], several recent works have sought to formalize the trade-offs in the shuffle model with respect to standard local and central DP [9, 54] as well as devise private schemes in this model for tasks such as secure aggregation [7,8,9, 29, 59, 60]. In particular, for the task of real aggregation, Balle et al. [9] showed that in the single-message shuffle model, the optimal error is \(\varTheta (n^{1/6})\) (which is better than the error in the local model which is known to be \(\varTheta (n^{1/2})\)).Footnote 12 By contrast, recent follow-up work gave multi-message protocols for the same task with error and communication of \(\mathrm {polylog}(n)\) [7, 8, 59, 60]Footnote 13. Our work is largely motivated by the aforementioned body of works demonstrating the power of the shuffle model, namely, its ability to enable private protocols with lower error than in the local model while placing less trust in a central server or curator.

Wang et al. [96] recently designed an extension of the shuffle model and analyzed its trust properties and privacy-utility tradeoffs. They studied the basic task of frequency estimation, and benchmarked several algorithms, including one based on single-message shuffling. However, they did not consider improvements through multi-message protocols, such as the ones we propose in this work. Very recently, Erlingsson et al. [53] studied multi-message (“report fragmenting”) protocols for frequency estimation in a practical shuffle model setup. Though they make use of a sketching technique, like we do, their methods cannot be parameterized to have communication and error polylogarithmic in n and B (which our Theorem 2 achieves). This is a result of using an estimator (based on computing a mean) that does not yield high-probability guarantees.

(Private) Frequency Estimation, Heavy Hitters, and Median. Frequency estimation (and its extensions considered below) is a fundamental problem that has been extensively studied in numerous computational models including data structures, sketching, streaming, and communication complexity, (in particular, [24, 31, 34, 35, 56, 61, 63, 70, 77, 79, 80, 101]). Heavy hitters and frequency estimation have also been studied extensively in the standard models of DP, e.g., [2, 11, 12, 20, 67, 95, 97]. The other problems we consider in the shuffle model, namely, range counting, M-estimation of the median, and quantiles, have been well-studied in the literature on data structures and sketching [37] as well as in the context of DP in the central and local models. Dwork and Lei [45] initiated work on establishing a connection between DP and robust statistics, and gave private estimators for several problems including the median, using the paradigm of propose-test-release. Subsequently, Lei [73] provided an approach in the central DP model for privately releasing a wide class of M-estimators (including the median) that are statistically consistent. While such M-estimators can also be obtained indirectly from non-interactive release of the density function [98], the aforementioned approach exhibits an improved rate of convergence. Furthermore, motivated by risk bounds under privacy constraints, Duchi et al. [42] provided private versions of information-theoretic bounds for minimax risk of M-estimation of the median.

Frequency estimation can be viewed as the problem of distribution estimation in the \(\ell _\infty \) norm where the distribution to be estimated is the empirical distribution of a dataset \((x_1, \ldots , x_n)\). Some works [69, 100] have established tight lower bounds for locally differentially private distribution estimation in the weak privacy setting with loss instead given by either \(\ell _1\) or \(\ell _2^2\). However, their techniques proceed by using Assouad’s method [42] and are quite different from the approach we use for the \(\ell _\infty \) norm in the proof of Theorem 1 (specifically, in the proof of Theorem 6).

We also note that an anti-concentration lemma qualitatively similar to our Lemma 10 was used by Chan et al. [23, Lemma 3] to prove lower bounds on private aggregation, but they operated in a multi-party setting with communication limited by a sparse communication graph. After the initial release of this paper, Ghazi et al. [58] proved a similar anti-concentration lemma to establish a lower bound on private summation for protocols with short messages. The lemmas in both of these papers do not apply to the more general case of frequency estimation with an arbitrary number \(B\) of buckets, which is the setting considered throughout this paper.

Range Counting. Range counting queries have also been an important subject of study in several areas, including database systems and algorithms (see [30] and the references therein). Early works on differentially private frequency estimation, e.g., [43, 64], apply naturally to range counting, though the approach of summing up per-bucket frequencies yields large errors for queries with large ranges.
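As a rough back-of-the-envelope illustration (assuming, for simplicity, that each per-bucket frequency estimate carries independent noise of standard deviation \(\sigma \)), answering a range query covering \(r\) buckets by summing the corresponding frequencies incurs error on the order of
\[ \sqrt{\textstyle \sum _{j=1}^{r} \sigma ^2} \;=\; \sigma \sqrt{r}, \]
which for ranges of length \(r = \varTheta (B)\) is a \(\varTheta (\sqrt{B})\) factor larger than the per-bucket error. The hierarchical techniques discussed below avoid this blow-up by expressing any range as a union of \(O(\log B)\) precomputed intervals.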

For \(d = 1\), Dwork et al. [47] obtained an upper bound of \(O\left( \frac{\log ^2 B}{\varepsilon }\right) \) and a lower bound of \(\varOmega (\log B)\) on the error of \((\varepsilon , 0)\)-DP range counting. Chan et al. [22] extended the analysis to d-dimensional range counting queries in the central model, for which they obtained an upper bound of roughly \((\log B)^{O(d)}\). Moreover, Muthukrishnan and Nikolov [81] showed that for \(n \approx B\), the error is at least \(\varOmega \left( (\log n)^{d - O(1)}\right) \). Since then, the best-known upper bound on the error for general d-dimensional range counting has been \((\log B + (\log n)^{O(d)})/\varepsilon \) [48], obtained using ideas from [22, 47] along with a k-d tree-like data structure. We note that for the special case of \(d=1\), it is known how to get a much better dependence on B in the central model, namely, exponential in \(\log ^* B\) [14, 21].

Xiao et al. [99] showed how to answer range counting queries privately by using Haar wavelets, while Hay et al. [66] formalized the method of maintaining a hierarchical representation of the data; these two works were compared and refined by Qardaji et al. [85]. Cormode et al. [33] showed how to translate many of the previous ideas to the local model of DP. The matrix mechanism of Li et al. [74, 75] also applies to range counting queries. An alternative line of work on multi-dimensional range counting, based on developing private versions of k-d trees and quadtrees, was presented by Cormode et al. [36].
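To illustrate the hierarchical idea in its simplest central-model form, the following Python sketch (a minimal illustration assuming \(B\) is a power of two and using an untuned per-level Laplace budget split, not the optimized mechanisms of the works cited above) maintains noisy counts for all dyadic intervals and answers any range query by summing \(O(\log B)\) of them:

import numpy as np

def dyadic_tree_counts(counts, eps, rng):
    # Noisy counts for every dyadic interval over B buckets (B a power of two).
    # Each item falls in exactly one interval per level, so under add/remove
    # adjacency the L1 sensitivity of the whole tree is the number of levels,
    # and Laplace noise of scale (levels / eps) per node suffices for eps-DP.
    levels = [np.asarray(counts, dtype=float)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append(prev[0::2] + prev[1::2])
    L = len(levels)
    return [lvl + rng.laplace(0, L / eps, size=len(lvl)) for lvl in levels]

def range_query(tree, lo, hi):
    # Estimate the number of items in buckets [lo, hi) using O(log B) noisy nodes.
    total, level = 0.0, 0
    while lo < hi:
        if lo % 2 == 1:
            total += tree[level][lo]
            lo += 1
        if hi % 2 == 1:
            hi -= 1
            total += tree[level][hi]
        lo, hi, level = lo // 2, hi // 2, level + 1
    return total

# Example usage with B = 8 buckets.
rng = np.random.default_rng(0)
tree = dyadic_tree_counts([5, 3, 0, 7, 2, 2, 9, 1], eps=1.0, rng=rng)
print(range_query(tree, 2, 7))  # noisy estimate of 0 + 7 + 2 + 2 + 9 = 20

Since any range decomposes into \(O(\log B)\) dyadic intervals, the error of a query grows only polylogarithmically in \(B\) rather than with the length of the range.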

Secure Multi-party Computation. If we allow user interaction in the computation of the queries, then there is a rich theory, within cryptography, of secure multi-party computation (SMPC) that allows \(f(x_1,\dots ,x_n)\) to be computed without revealing anything about the inputs \(x_i\) beyond what can be inferred from \(f(x_1,\dots ,x_n)\) itself (see, e.g., the book of Cramer et al. [39]). Kilian et al. [72] studied SMPC protocols for heavy hitters, obtaining near-linear communication complexity with a multi-round protocol. In contrast, all results in this paper are about non-interactive (single-round) protocols in the shuffle model (in the multi-message setting, all messages are generated at once). Though generic SMPC protocols can be turned into differentially private protocols (see, e.g., Sect. 10.2 in [93] and the references therein), they almost always use multiple rounds, and often have large overheads compared to the cost of computing \(f(x_1,\dots ,x_n)\) in a non-private setting.

7 Conclusions and Open Problems

The shuffle model is a promising new privacy framework motivated by the significant interest in anonymous communication. In this paper, we studied the fundamental task of frequency estimation in this setup. In the single-message shuffle model, we established nearly tight bounds on the error of frequency estimation: while in the local model the error is well-known to be \(\tilde{\varTheta }(\sqrt{n})\), we proved that the right bound in the single-message model is the minimum of \(\tilde{\varTheta }(n^{1/4})\) and \(\tilde{\varTheta }(\sqrt{B})\), which, interestingly, are achieved by shuffling the widely used RAPPOR protocol and the B-randomized response protocol, respectively. Moreover, we proved a nearly tight lower bound on the number of users required to solve the selection problem in the single-message shuffle model. We also obtained communication-efficient multi-message private-coin protocols with exponentially smaller error for frequency estimation, heavy hitters, range counting, and M-estimation of the median and quantiles (and, more generally, sparse non-adaptive SQ algorithms), as well as public-coin protocols that additionally have small query time. Our work raises several interesting open questions and points to fertile directions for future research.
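For concreteness, B-randomized response is the classical local randomizer in which each user reports their true bucket with a boosted probability and a uniformly random other bucket otherwise. A minimal Python sketch follows; the reporting probability shown is the textbook \(\varepsilon \)-LDP choice and is not the parameterization used in our shuffle-model analysis.

import math
import random

def b_randomized_response(x, B, eps):
    # eps-locally-DP B-ary randomized response over buckets {0, ..., B-1}:
    # report the true bucket x with probability e^eps / (e^eps + B - 1),
    # and each of the other B - 1 buckets with probability 1 / (e^eps + B - 1).
    if random.random() < math.exp(eps) / (math.exp(eps) + B - 1):
        return x
    y = random.randrange(B - 1)  # uniform bucket different from x
    return y if y < x else y + 1

# In the single-message shuffle protocol, each user sends one such report through
# the shuffler, and the analyzer debiases the histogram of the shuffled reports.
reports = [b_randomized_response(x, B=16, eps=1.0) for x in [3, 3, 7, 15]]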

Our \(\tilde{\varOmega }(B)\) lower bound for selection (Theorem 3) holds for single-message protocols even with unbounded communication. We conjecture that a lower bound of \(B^{\varOmega (1)}\) on the error should hold even for multi-message protocols (with unbounded communication) in the shuffle model, and we leave this as a very interesting open question. Such a lower bound would imply a first separation between the central model and the (unbounded-communication) multi-message shuffle model.

Another interesting question is to obtain a private-coin protocol for frequency estimation with polylogarithmic error, communication per user, and query time; reducing the query time of our current protocol below \(\tilde{O}(n)\) seems challenging. In general, it would also be interesting to reduce the polylogarithmic factors in our guarantees for range counting as that would make them practically useful.

Another interesting direction for future work is to determine whether our efficient protocols for frequency estimation, which achieve much lower error than is possible in the local model, could lead to more accurate and efficient shuffle model protocols for fundamental primitives such as clustering [91] and distribution testing [1], for which current locally differentially private protocols use frequency estimation as a black box.

Finally, a promising future direction is to extend our protocols for sparse non-adaptive SQ algorithms to the case of sparse aggregation. Note that the queries made by sparse non-adaptive SQ algorithms correspond to the special case of sparse aggregation where all non-zero coordinates are equal to 1. Extending our protocols to the case where the non-zero coordinates can be arbitrary numbers would, e.g., capture sparse stochastic gradient descent (SGD) updates, an important primitive in machine learning. More generally, it would be interesting to study the complexity of various other statistical and learning tasks [13, 25, 26, 27, 88, 98] in the shuffle privacy model.