Research Article | Open Access

Bayesian Frequency Estimation under Local Differential Privacy with an Adaptive Randomized Response Mechanism

Published: 11 January 2025

Abstract

Frequency estimation plays a critical role in many applications involving personal and private categorical data. Such data are often collected sequentially over time, making it valuable to estimate their distribution online while preserving privacy. We propose AdOBEst-LDP, a new algorithm for adaptive, online Bayesian estimation of categorical distributions under local differential privacy (LDP). The key idea behind AdOBEst-LDP is to enhance the utility of future privatized categorical data by leveraging inference from previously collected privatized data. To achieve this, AdOBEst-LDP uses a new adaptive LDP mechanism to collect privatized data. This LDP mechanism constrains its output to a subset of categories that “predicts” the next user’s data. By adapting the subset selection process to the past privatized data via Bayesian estimation, the algorithm improves the utility of future privatized data. To quantify utility, we explore various well-known information metrics, including (but not limited to) the Fisher information matrix, total variation distance, and information entropy. For Bayesian estimation, we utilize posterior sampling through stochastic gradient Langevin dynamics, a computationally efficient approximate Markov chain Monte Carlo (MCMC) method.
We provide a theoretical analysis showing that (i) the posterior distribution of the category probabilities targeted with Bayesian estimation converges to the true probabilities even for approximate posterior sampling, and (ii) AdOBEst-LDP eventually selects the optimal subset for its LDP mechanism with high probability if posterior sampling is performed exactly. We also present numerical results to validate the estimation accuracy of AdOBEst-LDP. Our comparisons show its superior performance against non-adaptive and semi-adaptive competitors across different privacy levels and distributional parameters.

1 Introduction

Frequency estimation is the focus of many applications that involve personal and private categorical data. Suppose a type of sensitive information is represented as a random variable \(X\) with a categorical distribution denoted by \(\text{Cat}(\theta)\), where \(\theta\) is a \(K\)-dimensional probability vector. As real-life examples, this could be the distribution of the types of products bought by the customers of an online shopping company, responses to a poll question such as “Which party will you vote for in the next elections?”, occupational affiliations of the people who visit the Web site of a governmental agency, and so on.
In this article, we propose an adaptive and online algorithm to estimate \(\theta\) in a Local Differential Privacy (LDP) framework where \(X\) is unobserved and, instead, we have access to a randomized response \(Y\) derived from \(X\). In the LDP framework, a central aggregator receives each user’s randomized (privatized) data to be used for inferential tasks. In that sense, LDP differs from global DP [7], where the aggregator collects the sensitive data without noise and then privatizes the operations performed on the sensitive dataset. Hence, LDP can be said to provide a stricter form of privacy and is used in cases where the aggregator may not be trustworthy [14]. Below, we give a more formal definition of \(\epsilon\)-LDP as a property of a randomized mechanism.
Definition 1 (LDP)
A randomized mechanism \(\mathcal{M}:\mathcal{X}\mapsto\mathcal{Y}\) satisfies \(\epsilon\)-LDP if the following inequality holds for any pairs of inputs \(x,x^{\prime}\in\mathcal{X}\), and for any output (response) \(y\in\mathcal{Y}\):
\begin{align*}e^{-\epsilon}\leq\frac{\mathbb{P}(\mathcal{M}(x)=y)}{\mathbb{P}(\mathcal{M}(x^ {\prime})=y)}\leq e^{\epsilon}.\end{align*}
The definition of LDP is almost the same as that of global DP. The main difference is that, in the global DP, inputs \(x,x^{\prime}\) are two datasets that differ in only one individual’s record, whereas in LDP, \(x,x^{\prime}\) are two different data points from \(\mathcal{X}\).
In Definition 1, \(\epsilon\geq 0\) is the privacy parameter. A smaller \(\epsilon\) value provides stronger privacy. One main challenge in most differential privacy settings is to decide on the randomized mechanism. In the case of LDP, this is how an individual data point \(X\) should be randomized. For a given randomized algorithm, too little randomization may not guarantee the privacy of individuals, whereas too severe randomization deteriorates the utility of the output of the randomized algorithm. Balancing these conflicting objectives (privacy vs. utility) is the main goal of the research on estimation under privacy constraints.
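For a mechanism with finite input and output alphabets, Definition 1 can be checked mechanically by tabulating the conditional probabilities \(\mathbb{P}(\mathcal{M}(x)=y)\). The sketch below is our own illustration (function and variable names are ours, not from this article); it verifies the \(\epsilon\)-LDP ratio bound for a toy binary randomized-response channel.

```python
import numpy as np

def is_eps_ldp(P, eps, tol=1e-12):
    """Check Definition 1 for a finite channel P, where P[x, y] = P(M(x) = y).

    The mechanism is eps-LDP iff, for every output y and every input pair
    (x, x'), P[x, y] / P[x', y] <= e^eps; the lower bound e^{-eps} follows
    by swapping x and x'.
    """
    P = np.asarray(P, dtype=float)
    for y in range(P.shape[1]):
        col = P[:, y]
        if col.max() > np.exp(eps) * col.min() + tol:
            return False
    return True

# Toy 2-category randomized response: honest with probability 3/4.
# The worst-case ratio is 0.75 / 0.25 = 3, so the mechanism is (ln 3)-LDP.
P = np.array([[0.75, 0.25],
              [0.25, 0.75]])
```

The same check applies to any finite mechanism discussed later, including SRR and RRRR, once its conditional probability table is formed.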
In many cases, individuals’ data points are collected sequentially. A basic example is opinion polling, where data are collected typically in time intervals of lengths in the order of hours or days. Personal data entered during registration are another example. For example, a hospital can collect patients’ categorical data as they visit the hospital for the first time.
While sequential collection of individual data may make the estimation task under the LDP constraint harder, it may also offer an opportunity to adapt the randomized mechanism in time to improve the estimation quality. Motivated by that, in this article, we address the problem of online Bayesian estimation of a categorical distribution (\(\theta\)) under \(\epsilon\)-LDP, while at the same time choosing the randomization mechanism adaptively so that the utility is improved continually in time.

Contribution:

This article presents Adaptive Online Bayesian Frequency Estimation with LDP (AdOBEst-LDP), a new methodological framework. A flowchart of AdOBEst-LDP is given in Figure 1 to expose the reader to its main idea: collecting future privatized categorical data with high estimation utility based on the knowledge extracted from the previously collected privatized data. To achieve this goal, AdOBEst-LDP continually adapts its randomized response mechanism to the estimation of \(\theta\).
Fig. 1. AdOBEst-LDP: A framework for adaptive and online Bayesian estimation of categorical distributions with LDP.
The development of AdOBEst-LDP offers three main contributions to the LDP literature.
A New Randomized Response Mechanism: AdOBEst-LDP uses a new adaptive Randomly Restricted Randomized Response (RRRR) mechanism to produce randomized responses under \(\epsilon\)-LDP. RRRR is a generalization of the Standard Randomized Response (SRR) mechanism in that it restricts the response to a subset of categories. This subset is selected such that the sensitive information \(X\) of the next individual is likely contained in that subset. To ensure this, the subset selection step uses two inputs: (i) a sample for \(\theta\) drawn from the posterior distribution of \(\theta\) conditional on the past data, and (ii) a utility function that scores the informativeness of the randomized response obtained from RRRR when it is run with a given subset. To that end, we propose several utility functions to score the informativeness of the randomized response. The utility functions are based on well-known tools and metrics from probability and statistics, such as Fisher information [2, 18, 22, 32], entropy, Total Variation (TV) distance, expected squared error, and the probability of an honest response, i.e., \(Y=X\). We provide some insight into those utility functions both theoretically and numerically. Moreover, we also provide a computational complexity analysis for the proposed utility functions.
Posterior Sampling: We equip AdOBEst-LDP with a scalable posterior sampling method for parameter estimation. Bayesian estimation is a natural choice for inference when the data are corrupted or censored [15, 17] and such modifications can be statistically modeled. In differential privacy settings, too, Bayesian inference is widely employed [2, 8, 13, 31] when the input data are shared with privacy-preserving noise. Standard Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, have a computational complexity that is quadratic in the number of individuals whose data have been collected. As a remedy, similar to Mazumdar et al. [19], we propose a Stochastic Gradient Langevin Dynamics (SGLD)-based algorithm to obtain approximate posterior samples [30]. By working on subsets of data, SGLD scales well in time.
The numerical experiments show that AdOBEst-LDP outperforms its non-adaptive counterpart when run with SGLD for posterior sampling. The results also suggest that the utility functions considered in this article are robust and perform well. The MATLAB code at https://github.com/soneraydin/AdOBEst_LDP can be used to reproduce the results obtained in this article.
Convergence Results: Finally, we provide a theoretical analysis of AdOBEst-LDP. We prove two main results:
(i) The posterior distribution targeted by the adaptive scheme, conditional on the generated observations, converges to the true parameter in probability as the number of observations \(n\) grows. This convergence result owes mainly to the smoothness and a special form of concavity of the marginal log-likelihood function of the randomized responses. Another key factor is that the second moment of the sum, up to time \(n\), of the gradient of this log-marginal likelihood increases linearly with \(n\).
(ii) If posterior sampling is performed exactly, the expected frequency of the algorithm choosing the best subset (according to the utility function) converges to \(1\) as \(n\) goes to \(\infty\).
The theoretical results require fairly weak, realistic, and verifiable assumptions.

Outline:

In Section 2, we discuss the earlier work related to ours. Section 3 presents LDP and the frequency estimation problem and introduces AdOBEst-LDP as a general framework. In Section 4, we delve deeper into the details of AdOBEst-LDP by first presenting RRRR, the proposed randomized response mechanism, then explaining how it chooses an “optimal” subset of categories adaptively at each iteration. Section 4 also presents the utility metrics considered for choosing these subsets in this article. In Section 5, we provide the details of the posterior sampling methods considered in this article, particularly SGLD. The theoretical analysis of AdOBEst-LDP is provided in Section 6. Section 7 contains the numerical experiments. Finally, Section 8 provides some concluding remarks. All the proofs of the theoretical results are given in the appendices.

2 Related Literature

Frequency estimation under the LDP setting has been an increasingly popular research area in recent years. Along with its basic application (estimation of discrete probabilities from locally privatized data), it is also used for a wide range of other estimation and learning purposes, such as estimation of confidence intervals and confidence sets for a population mean [28], estimation or identification of heavy hitters [10, 25, 34], estimation of quantiles [5], frequent itemset mining [33], estimation of degree distribution in social networks [23], and distributed training of graph neural networks with categorical features and labels [4]. The methods that are proposed for \(\epsilon\)-LDP frequency estimation also form the basis of more complex inferential tasks (with some modifications on these methods), such as the release of “marginals” (contingency tables) between multiple categorical features and their correlations, as in the work of Cormode et al. [6].
AdOBEst-LDP employs RRRR as its randomized mechanism to produce randomized responses. RRRR is a modified version of the SRR mechanism (also known as generalized randomized response, \(k\)-randomized response, and direct encoding in the literature). Given \(X\) as its input, SRR outputs \(X\) with probability \(\frac{e^{\epsilon}}{e^{\epsilon}+K-1}\) and otherwise outputs one of the other categories at random. This is a well-studied mechanism in the DP literature, and the statistical properties of its basic version (such as its estimation variance) can be found in [26] and [27]. When \(K\) is large, the utility of SRR can be too low. RRRR in AdOBEst-LDP is designed to circumvent this problem by constraining its output to a subset of categories. Unlike SRR, the perturbation probability of responses in our algorithm changes adaptively, depending on the cardinality of the selected subset of categories (which we explain in detail in Section 4) used for the privatization of \(X\), and the cardinality of its complement.
The use of information metrics as utility functions in LDP protocols has been an active line of research in recent years. In the work of Kairouz et al. [12], information metrics like \(f\)-divergence and mutual information are used for selecting optimal LDP protocols. In the same vein, Steinberger [22] uses Fisher Information as the utility metric for finding a nearly optimal LDP protocol for the frequency estimation problem, and Lopuhaä-Zwakenberg et al. [18] use it for comparing the utility of various LDP protocols for frequency estimation and finding the optimal one. In these works, the mentioned information metrics are used statically, i.e., to choose a protocol once and for all, for a given estimation task. The approaches in these works suffer from computational complexity for large values of \(K\) because the search space for optimal protocols there grows in the order of \(2^{K}\). In some other works, such as Wang et al. [24], a randomly sampled subset of size \(k\leq K\) is used to improve the efficiency of this task, where the optimal \(k\) is determined by maximizing the mutual information between real data and the privatized data. However, this approach is also static as the optimal subset size \(k\) is selected only once, and the optimization procedure only determines \(k\) and not the subset itself. Unlike those static approaches, AdOBEst-LDP dynamically uses the information metric (such as the Fisher Information Matrix (FIM) and the other alternatives in Section 4.3) to select the optimal subset at each timestep. In addition, in the subset selection step of AdOBEst-LDP, only \(K\) candidate subsets are compared in terms of their utilities at each iteration, enabling computational tractability. This way of tackling the problem requires computing the given information metric for only \(K\) times at each iteration. We will provide further details of this approach in Section 4.3 and provide a computational complexity analysis in Section 4.4.
Another use of the Fisher Information in the LDP literature is for bounding the estimation error for a given LDP protocol. For example, Barnes et al. [3] use Fisher Information inside van Trees inequality, the Bayesian version of the Cramér-Rao bound [9], for bounding the estimation error of various LDP protocols for Gaussian mean estimation and frequency estimation. Again, their work provides rules for choosing optimal protocols for a given \(\epsilon\) in a static way. As a similar example, Acharya et al. [1] derive a general information contraction bound for parameter estimation problems under LDP and show its relation to van Trees inequality as its special case. To our knowledge, our approach is the first one that adaptively uses a utility metric to dynamically update the inner workings of an LDP protocol for estimating categorical distributions.
The idea of building adaptive mechanisms for improved estimation under the LDP has been studied in the literature, although the focus and methodology of those works differ from ours. For example, Joseph et al. [11] proposed a two-step adaptive method to estimate the unknown mean parameter of data from Gaussian distribution. In this method, the users are split into two groups, an initial mean estimate is obtained from the perturbed data of the first group and the data from the second group are transformed adaptively according to that initial estimate. Similarly, Wei et al. [29] proposed another two-step adaptive method for the mean estimation problem, in which the aggregator first computes a rough distribution estimate from the noisy data of a small sample of users, which is then used for adjusting the amount of perturbation for the data of remaining users. While Joseph et al. [11], Wei et al. [29] consider a two-stage method, AdOBEst-LDP seeks to adapt continually by updating its LDP mechanism each time an individual’s information is collected. Similar to our work, Yıldırım [32] has recently proposed an adaptive LDP mechanism for online parameter estimation for continuous distributions. The LDP mechanism of Yıldırım [32] contains a truncation step with boundaries adapted to the estimate from the past data according to a utility function based on the Fisher information. Unfortunately, the parameter estimation step of Yıldırım [32] does not scale in time. Differently from Yıldırım [32], AdOBEst-LDP focuses on categorical distributions, considers several other utility functions to update its LDP mechanism, employs a scalable parameter estimation step, and its performance is backed up with theoretical results.

3 Problem Definition and General Framework

Suppose we are interested in a discrete probability distribution \(\mathcal{P}\) of a certain form of sensitive categorical information \(X\in[K]:=\{1,\ldots,K\}\) of individuals in a population. Hence, \(\mathcal{P}\) is a categorical distribution \(\text{Cat}(\theta^{\ast})\) with a probability vector
\begin{align*}\theta^{\ast}:=(\theta^{\ast}_{1},\ldots,\theta^{\ast}_{K})\in\Delta,\end{align*}
where \(\Delta\) is the \((K-1)\)-dimensional probability simplex,
\begin{align*}\Delta:=\left\{\theta\in\mathbb{R}^{K}:\sum_{k=1}^{K}\theta_{k}=1\text { and }\theta_{k}\geq 0\text{ for }k\in[K]\right\}.\end{align*}
We assume a setting where individuals’ sensitive data are collected privately and sequentially in time. The privatization is performed via a randomized algorithm that, upon taking a category index in \([K]\) as an input, returns a random category index in \([K]\) such that the whole data collection process is \(\epsilon\)-LDP (see Definition 1). Let \(X_{t}\) and \(Y_{t}\) be the private information and randomized responses of individual \(t\), respectively. According to Definition 1 for LDP, the following inequality must be satisfied for all triples \((x,x^{\prime},y)\in[K]^{3}\) for the randomized mechanism to be \(\epsilon\)-LDP.
\begin{align}\mathbb{P}(Y_{t}=y|X_{t}=x)\leq e^{\epsilon}\mathbb{P}(Y_{t}=y|X_{t}=x^{\prime }).\end{align}
(1)
The inferential goal is to estimate \(\theta^{\ast}\) sequentially based on the responses \(Y_{1},Y_{2},\ldots\), and the mechanisms \(\mathcal{M}_{1},\mathcal{M}_{2},\ldots\) that are used to generate those responses. Specifically, Bayesian estimation is considered, whereby the target is the posterior distribution, denoted by \(\Pi(\mathrm{d}\theta|Y_{1:n},\mathcal{M}_{1:n})\), given a prior probability distribution with pdf \(\eta(\theta)\) on \(\Delta\).
This article concerns the Bayesian estimation of \(\theta\) while adapting the randomized mechanism to improve the estimation utility continually. We propose a general framework called AdOBEst-LDP, in which the randomized mechanism at time \(t\) is adapted to the data collected until time \(t-1\). AdOBEst-LDP is outlined in Algorithm 1.
Algorithm 1 is fairly general, and it does not describe how to choose the \(\epsilon\)-LDP mechanism \(\mathcal{M}_{t}\) at time \(t\), nor does it provide the details of the posterior sampling. However, it is still worth making some critical observations about the nature of the algorithm. First, at time \(t\) the selection of the \(\epsilon\)-LDP mechanism in Step 1 relies on the posterior sample \(\Theta_{t-1}\), which serves as an estimator of the true parameter \(\theta^{\ast}\) based on the past observations. As we shall see in Section 4, at Step 1 the “best” \(\epsilon\)-LDP mechanism is chosen from a set of candidate LDP mechanisms according to a utility function. This step is relevant only when \(\Theta_{t-1}\) is a reliable estimator of \(\theta^{\ast}\). In other words, Step 1 “exploits” the estimator \(\Theta_{t-1}\). Moreover, the random nature of posterior sampling prevents having too much confidence in the current estimator \(\Theta_{t-1}\) and enables a certain degree of “exploration.” In conclusion, Algorithm 1 utilizes an “exploration-exploitation” approach reminiscent of reinforcement learning. In particular, posterior sampling in Step 3 suggests a strong parallelism between AdOBEst-LDP and the well-known exploration-exploitation approach called Thompson sampling [21].
The details of Steps 1–2 and Step 3 of Algorithm 1 are given in Sections 4 and 5, respectively.
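The loop structure of Algorithm 1 can be sketched schematically as follows. This is a minimal illustration with our own naming, not the article's code: the mechanism-selection and posterior-sampling steps are passed in as placeholder callables standing in for the actual components detailed in Sections 4 and 5.

```python
def adobest_ldp_sketch(xs, eps, select_mechanism, sample_posterior, K):
    """Schematic of Algorithm 1: at each time t, pick an eps-LDP mechanism
    using the current posterior sample (Step 1), privatize the next data
    point (Step 2), and refresh the posterior sample (Step 3)."""
    theta = [1.0 / K] * K            # initial sample, e.g., from a uniform prior
    ys, mechs = [], []
    for x in xs:
        mech = select_mechanism(theta, eps)   # Step 1: "exploit" Theta_{t-1}
        y = mech(x)                           # Step 2: randomized response Y_t
        mechs.append(mech)
        ys.append(y)
        theta = sample_posterior(ys, mechs)   # Step 3: posterior sampling
    return ys, theta
```

Here `select_mechanism` and `sample_posterior` are hypothetical stand-ins for RRRR subset selection and SGLD, respectively; randomness in the latter is what provides the "exploration" discussed above.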

4 Constructing Informative Randomized Response Mechanisms

In this section, we describe Steps 1–2 of AdOBEst-LDP in Algorithm 1 where the \(\epsilon\)-LDP mechanism \(\mathcal{M}_{t}\) is selected at time \(t\) based on the posterior sample \(\Theta_{t-1}\) and a randomized response is generated using \(\mathcal{M}_{t}\). For ease of exposition, we will drop the time index \(t\) throughout the section and let \(\Theta_{t-1}=\theta\).
Recall from Definition 1 that an \(\epsilon\)-LDP randomized mechanism is associated with a conditional probability distribution that satisfies (1). An \(\epsilon\)-LDP mechanism is not unique. One such mechanism is the SRR mechanism. For subsequent use, it is convenient to define SRR generally: we let \(\texttt{SRR}(X;\Omega,\epsilon)\) denote the output of SRR, which operates on the set \(\Omega\) with LDP parameter \(\epsilon\) when the input is \(X\in\Omega\). Then, we have
\begin{align}Y=\texttt{SRR}(X;\Omega,\epsilon)=\begin{cases}X & \text{w.p. }e^{\epsilon}/(e^{\epsilon}+|\Omega|-1) \\ \sim\text{Uniform}(\Omega\setminus\{X\}) & \text{else}\end{cases}.\end{align}
(2)
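Equation (2) is straightforward to implement. The sketch below (our own illustration; function names are ours) computes the SRR conditional pmf and draws a response from it; note that the honest-to-dishonest probability ratio equals exactly \(e^{\epsilon}\), which is why SRR is \(\epsilon\)-LDP.

```python
import numpy as np

def srr_pmf(x, omega, eps):
    """Conditional pmf of Y = SRR(x; Omega, eps) as in Equation (2):
    Y = x w.p. e^eps / (e^eps + |Omega| - 1), else uniform on Omega minus {x}."""
    m = len(omega)
    p_honest = np.exp(eps) / (np.exp(eps) + m - 1)
    pmf = {y: (1.0 - p_honest) / (m - 1) for y in omega}
    pmf[x] = p_honest
    return pmf

def srr_sample(x, omega, eps, rng):
    """Draw one randomized response Y = SRR(x; Omega, eps)."""
    pmf = srr_pmf(x, omega, eps)
    ys = list(pmf)
    return ys[rng.choice(len(ys), p=[pmf[y] for y in ys])]
```

For example, with \(|\Omega|=K\) and \(\epsilon=1\), the honest-response probability is \(e/(e+K-1)\), which decays toward \(0\) as \(K\) grows; this is the low-utility regime that motivates RRRR below.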
We aim to develop an alternative randomized mechanism whose response \(Y\) is more informative about \(\theta^{\ast}\) than the one generated as \(Y=\text{SRR}(X;[K],\epsilon)\). The main idea is as follows. Supposing that the posterior sample \(\Theta_{t-1}=\theta\) is an accurate estimate of \(\theta^{\ast}\), it is reasonable to aim for the “best” \(\epsilon\)-LDP mechanism (among a range of candidates) which would maximize the (estimation) utility of \(Y\) if the true parameter were \(\theta^{\ast}=\theta\). We follow this main idea to develop the proposed \(\epsilon\)-LDP mechanism.

4.1 The RRRR Mechanism

Given \(\Theta_{t-1}=\theta\in\Delta\), an informative randomized response mechanism can be constructed by considering a high-probability set \(S\subset[K]\) and a low-probability set \(S^{c}=[K]\setminus S\) for \(X\) (according to \(\theta\)). Then, a sensible alternative to \(\text{SRR}(X;[K],\epsilon)\) would be to confine the randomized response to the set \(S\) (augmented with a random element from \(S^{c}\) to remain LDP). The expected benefit of this approach is due to (i) using a smaller amount of randomization, since \(|S|<K\), and thus (ii) having an informative response when \(X\in S\), which happens with high probability. Based on this approach, we propose RRRR, whose precise steps are given in Algorithm 2.
RRRR has three algorithmic parameters: a subset \(S\) of \([K]\) and two privacy parameters \(\epsilon_{1}\) and \(\epsilon_{2}\), which operate on \(S\) and \(S^{c}\), respectively. Theorem 1 states sufficient conditions on \(\epsilon_{1}\) and \(\epsilon_{2}\) for RRRR to be \(\epsilon\)-LDP. A proof of Theorem 1 is given in Appendix A.1.
Theorem 1.
RRRR is \(\epsilon\)-LDP if \(\epsilon_{1}\leq\epsilon\) and
\begin{align}\epsilon_{2}=\begin{cases}\min\left\{\epsilon,\ln\frac{|S^{c}|-1}{e^{ \epsilon_{1}-\epsilon}|S^{c}|-1}\right\} & \text{for }\epsilon-\epsilon_ {1} < \ln|S^{c}|\text{ and }|S| > 0 \\\epsilon & \text{else}\end{cases}.\end{align}
(3)
Note that when \(S=\emptyset\) and \(\epsilon_{2}=\epsilon\), RRRR reduces to SRR.
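Equation (3) is simple to evaluate in code. The sketch below is ours (the caller supplies \(\epsilon_{1}\), e.g., \(\epsilon_{1}=\kappa\epsilon\) as suggested in Section 4.2); note that \(\epsilon_{1}=\epsilon\) forces \(\epsilon_{2}=0\) whenever \(|S^{c}|\geq 2\).

```python
import math

def eps2_from_theorem1(eps, eps1, S_size, K):
    """Privacy budget eps_2 for the complement set S^c, following Equation (3).
    Assumes eps1 <= eps; |S^c| = K - |S|."""
    sc = K - S_size                              # |S^c|
    if S_size > 0 and eps - eps1 < math.log(sc):
        denom = math.exp(eps1 - eps) * sc - 1
        return min(eps, math.log((sc - 1) / denom))
    return eps                                   # e.g., S empty: RRRR reduces to SRR
```

For instance, with \(\epsilon=1\), \(\epsilon_{1}=0.9\), \(K=10\), \(|S|=3\), this yields \(\epsilon_{2}\approx 0.118\): a small residual budget for honest responses from \(S^{c}\).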

4.2 Choosing the Privacy Parameters \(\boldsymbol{\epsilon}_{\textbf{1}}\) , \(\boldsymbol{\epsilon}_{\textbf{2}}\)

We elaborate on the choice of \(\epsilon_{1}\) and \(\epsilon_{2}\) in the light of Theorem 1. In RRRR, the probability of an honest response, i.e., \(X=Y\), given \(X\in S\), is
\begin{align*}\mathbb{P}(Y=X|X\in S)=\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+|S|},\end{align*}
which should be contrasted to \(e^{\epsilon}/(e^{\epsilon}+K-1)\), which would be the probability if \(Y=\text{SRR}(X;[K],\epsilon)\). Anticipating that \(\{X\in S\}\) is likely, one should at least aim for \(\epsilon_{1}\) that satisfies \(\mathbb{P}(X=Y|X\in S)\geq e^{\epsilon}/(e^{\epsilon}+K-1)\) for RRRR to be relevant. This is equivalent to
\begin{align}\epsilon_{1}\geq\epsilon+\ln|S|-\ln(K-1).\end{align}
(4)
Taking into account also the constraint that \(\epsilon_{1}\leq\epsilon\) (by Theorem 1), we suggest \(\epsilon_{1}=\kappa\epsilon\), where \(\kappa\in(0,1)\) is a number close to \(1\), such as \(0.9\), to ensure (4) with a significant margin. (It is possible to choose \(\kappa=1\); however, again by Theorem 1, this requires that \(\epsilon_{2}=0\), which renders \(Y\) completely uninformative when \(X\notin S\).) In Section 7, we discuss the choice of \(\kappa\) in more detail.
In what follows, we assume a fixed \(\kappa\in(0,1)\) and set \(\epsilon_{1}=\kappa\epsilon\), focusing on the selection of \(S\).

4.3 Subset Selection for RRRR

Let \(\texttt{RRRR}(X;S,\epsilon)\) be the random output of RRRR that achieves \(\epsilon\)-LDP by using the subset \(S\) and the privacy parameters \(\epsilon_{1}=\kappa\epsilon\) and \(\epsilon_{2}\) as in (3) when the input is \(X\). Furthermore, let \(U(\theta,S,\epsilon)\) be the (inferential) “utility” of \(Y=\texttt{RRRR}(X;S,\epsilon)\) when \(X\sim\text{Cat}(\theta)\). One would like to choose the \(S\) that maximizes \(U(\theta,S,\epsilon)\). (One could also seek to optimize \(\kappa\) in \(\epsilon_{1}=\kappa\epsilon\), albeit at the expense of additional computation.)
However, since there are \(2^{K}-1\) feasible choices for \(S\), one must confine the search space for \(S\) in practice. As discussed above, RRRR becomes most relevant when the set \(S\) is a high-probability set. Therefore, for a given \(\theta\), we confine the choices for \(S\) to
\begin{align}S_{\theta,k}:=\{\sigma_{\theta}(1),\sigma_{\theta}(2),\ldots,\sigma_{\theta}(k)\},\quad k=0,1,\ldots,K-1,\end{align}
(5)
where \(\sigma_{\theta}:=(\sigma_{\theta}(1),\ldots,\sigma_{\theta}(K))\) is the permutation vector for \(\theta\) such that \(\theta_{\sigma_{\theta}(1)}\geq\ldots\geq\theta_{\sigma_{\theta}(K)}\), with the convention \(S_{\theta,0}:=\emptyset\).
Then the subset selection problem can be formulated as finding
\begin{align}k^{\ast}=\arg\max_{k\in\{0,\ldots,K-1\}}U(\theta,S_{\theta,k},\epsilon).\end{align}
(6)
The alternatives in (5) can be justified as follows. Since \(S_{\theta,k}\) contains the indices of the \(k\) highest-valued components of \(\theta\), it is expected to cover a large portion of the total probability of \(X\) when \(\theta\) is an accurate estimate of \(\theta^{\ast}\). This can be the case even for a small value of \(k\) relative to \(K\) when the components of \(\theta^{\ast}\) are not evenly distributed. Also, the alternatives cover the basic SRR, which is obtained with \(k=0\) (leading to \(S=\emptyset\) and \(\epsilon_{2}=\epsilon\)).
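The candidate sets are cheap to construct: sort \(\theta\) in decreasing order once and take prefixes, including the empty set for \(k=0\) (plain SRR), matching the search space in (6). A sketch (our code, our naming):

```python
import numpy as np

def candidate_subsets(theta):
    """The K candidate subsets searched over in (6): the k highest-probability
    categories under theta, for k = 0 (empty set, plain SRR), 1, ..., K-1."""
    order = np.argsort(-np.asarray(theta, dtype=float))  # sigma_theta: descending
    K = len(theta)
    return [set(order[:k].tolist()) for k in range(K)]
```

The subset selection (6) then amounts to evaluating the chosen utility function once per candidate, i.e., \(K\) evaluations per iteration, as noted in Section 2.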
In the subsequent sections, we present six different utility functions \(U(\theta,S,\epsilon)\) and justify their relevance to estimation; the usefulness of the proposed functions is also demonstrated in the numerical experiments.

4.3.1 FIM.

The first utility function under consideration is based on the FIM at \(\theta\) according to the distribution of \(Y\) given \(\theta\). It is well known that the inverse of the FIM sets the Cramér-Rao lower bound for the variance of an unbiased estimator. Hence, the Fisher information can be regarded as a reasonable metric to quantify the information contained in \(Y\) about \(\theta\). This approach is adopted in Lopuhaä-Zwakenberg et al. [18], Steinberger [22] for LDP applications for estimating discrete distributions, and Alparslan and Yıldırım [2], Yıldırım [32] in similar problems involving parametric continuous distributions.
For a given \(\theta\in\Delta\), let \(F(\theta;S,\epsilon)\) be the FIM evaluated at \(\theta\) when \(X\sim\text{Cat}(\theta)\) and \(Y=\texttt{RRRR}(X;S,\epsilon)\). Let
\begin{align*}g_{S,\epsilon}(y|x):=\mathbb{P}(Y=y|X=x)\end{align*}
when \(Y=\texttt{RRRR}(X;S,\epsilon)\). The following result states \(F(\theta;S,\epsilon)\) in terms of \(g_{S,\epsilon}\) and \(\theta\). The result is derived in Lopuhaä-Zwakenberg et al. [18]; we also give a simple proof in Appendix A.2. Note that \(F(\theta;S,\epsilon)\) is \((K-1)\times(K-1)\) since \(\theta\) has \(K-1\) free components and \(\theta_{K}=1-\sum_{i=1}^{K-1}\theta_{i}\).
Proposition 1.
The FIM for RRRR is given by
\begin{align}F(\theta;S,\epsilon)=A_{S,\epsilon}^{\top}D_{\theta}^{-1}A_{S,\epsilon},\end{align}
(7)
where \(A_{S,\epsilon}\) is a \(K\times(K-1)\) matrix whose entries are \(A_{S,\epsilon}(i,j):=g_{S,\epsilon}(i|j)-g_{S,\epsilon}(i|K)\) and \(D_{\theta}\) is a \(K\times K\) diagonal matrix with elements \(D_{\theta}(i,i):=\sum_{j=1}^{K}g_{S,\epsilon}(i|j)\theta_{j}\).
We define the following utility function based on the Fisher information
\begin{align}U_{1}(\theta,S,\epsilon):=-\text{Tr}\left[F^{-1}(\theta;S,\epsilon)\right].\end{align}
(8)
This utility function depends on the Fisher information differently from Lopuhaä-Zwakenberg et al. [18], Steinberger [22], who considered the determinant of the FIM as the utility function. The rationale behind (8) is that, for an unbiased estimator \(\hat{\theta}(Y)\) of \(\theta^{\ast}\) based on \(Y=\texttt{RRRR}(X;S,\epsilon)\), the expected MSE is bounded from below as \(E_{\theta^{\ast}}[\|\hat{\theta}(Y)-\theta^{\ast}\|^{2}]\geq\text{Tr}\left[F^{-1}(\theta^{\ast};S,\epsilon)\right]\). For the utility function in (8) to be well-defined, the FIM needs to be invertible. Proposition 2, proven in Appendix A.2, states that this is indeed the case.
Proposition 2.
\(F(\theta;S,\epsilon)\) in (7) is invertible for all \(\theta\in\Delta\), \(S\subset[K]\), and \(\epsilon_{1},\epsilon_{2}>0\).
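The matrices in (7) can be assembled directly from the channel \(g_{S,\epsilon}\). The sketch below is our own illustration; for concreteness, it instantiates the channel with SRR on \([K]\) (the \(S=\emptyset\) case) rather than the full RRRR kernel, and then evaluates the utility (8).

```python
import numpy as np

def srr_channel(K, eps):
    """G[y, x] = g(y|x) for SRR on [K]: honest w.p. e^eps/(e^eps + K - 1)."""
    p = np.exp(eps) / (np.exp(eps) + K - 1)
    q = (1.0 - p) / (K - 1)
    return np.full((K, K), q) + (p - q) * np.eye(K)

def fim(theta, G):
    """Equation (7): F = A^T D^{-1} A, where A(i, j) = g(i|j) - g(i|K) is
    K x (K-1) and D(i, i) = sum_j g(i|j) theta_j (1-based indices in the paper)."""
    K = len(theta)
    A = G[:, :K - 1] - G[:, [K - 1]]   # subtract the last column from the others
    d = G @ np.asarray(theta, dtype=float)  # diagonal of D: marginal of Y
    return A.T @ (A / d[:, None])      # A^T D^{-1} A

theta = np.array([0.4, 0.3, 0.2, 0.1])
F = fim(theta, srr_channel(4, eps=1.0))
U1 = -np.trace(np.linalg.inv(F))       # utility function (8)
```

Consistent with Proposition 2, the resulting \(F\) is symmetric positive definite, so \(U_{1}\) is well-defined (and negative, since \(\text{Tr}[F^{-1}]>0\)).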

4.3.2 Entropy of Randomized Response.

For discrete distributions, entropy measures uniformity. Hence, in the LDP framework, a lower entropy for the randomized response \(Y\) implies a more informative \(Y\). Based on that observation, a utility function can be defined as the negative entropy of the marginal distribution of \(Y\),
\begin{align*}U_{2}(\theta,S,\epsilon):=\sum_{y=1}^{K}h_{S,\epsilon}(y|\theta)\ln h_{S,\epsilon}(y|\theta),\end{align*}
where \(h_{S,\epsilon}(y|\theta)\) is the marginal probability of \(Y=y\) given \(\theta\),
\begin{align*}h_{S,\epsilon}(y|\theta):=\sum_{x=1}^{K}g_{S,\epsilon}(y|x)\theta_{x}.\end{align*}
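Once the channel matrix \(G\) with \(G[y,x]=g_{S,\epsilon}(y|x)\) is available, \(U_{2}\) is a one-liner (a sketch with our own naming):

```python
import numpy as np

def u2_neg_entropy(theta, G):
    """U_2: negative Shannon entropy of the marginal pmf of Y,
    h(y|theta) = sum_x g(y|x) * theta_x, where G[y, x] = g(y|x)."""
    h = G @ np.asarray(theta, dtype=float)
    return float(np.sum(h * np.log(h)))
```

A less uniform marginal for \(Y\) yields a larger (less negative) \(U_{2}\); a completely uninformative channel pushes \(U_{2}\) down toward \(-\ln K\).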

4.3.3 TV Distance.

The TV distance between two discrete probability distributions \(\mu,\nu\) on \([K]\) is given by
\begin{align*}\text{TV}(\mu,\nu):=\frac{1}{2}\sum_{x=1}^{K}|\mu(x)-\nu(x)|.\end{align*}
We consider two utility functions based on TV distance. The first function arises from the observation that a more informative response \(Y\) generally leads to a larger change in the posterior distribution of \(X\) given \(Y,\theta\),
\begin{align}p_{S,\epsilon}(x|y,\theta):=\frac{\theta_{x}\cdot g_{S,\epsilon}(y|x)}{h_{S, \epsilon}(y|\theta)},\quad x=1,\ldots,K,\end{align}
(9)
relative to its prior \(\text{Cat}(\theta)\). The expected amount of change can be formulated as the expectation of the TV distance between the prior and posterior distributions with respect to the marginal distribution of \(Y\) given \(\theta\). Then, a utility function can be defined as
\begin{align*}U_{3}(\theta,S,\epsilon)&:=\mathbb{E}_{\theta}\left[\text{TV}(p_{S,\epsilon}(\cdot|Y,\theta),\text{Cat}(\theta))\right] \\&=\frac{1}{2}\sum_{x=1}^{K}\sum_{y=1}^{K}\left|g_{S,\epsilon}(y|x)\theta_{x}-h_{S,\epsilon}(y|\theta)\theta_{x}\right|.\end{align*}
Another utility function is related to the TV distance between the marginal probability distributions of \(X\) given \(\theta\) and \(Y\) given \(\theta\). Since \(X\) is more informative about \(\theta\) than the randomized response \(Y\), the mentioned TV distance is desired to be as small as possible. Hence, a utility function may be formulated as
\begin{align*}U_{4}(\theta,S,\epsilon)&:=-\text{TV}(h_{S,\epsilon}(\cdot|\theta),\text{Cat}(\theta))\\&=-\frac{1}{2}\sum_{i=1}^{K}|h_{S,\epsilon}(i|\theta)-\theta_{i}|.\end{align*}
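Both TV-based utilities admit short vectorized implementations given a column-stochastic matrix \(G\) with \(G[y,x]=g_{S,\epsilon}(y|x)\); the following sketch (our own illustration, with hypothetical function names) mirrors the two formulas above:

```python
import numpy as np

def tv_utilities(theta, G):
    """U_3 and U_4 from Section 4.3.3, for a column-stochastic matrix
    G[y, x] = g_{S,eps}(y|x).

    U_3 = E_Y[ TV(posterior of X given Y, prior Cat(theta)) ]
    U_4 = -TV(marginal of Y, Cat(theta))
    """
    h = G @ theta                        # marginal of Y
    joint = G * theta                    # joint[y, x] = g(y|x) * theta_x
    u3 = 0.5 * np.abs(joint - np.outer(h, theta)).sum()
    u4 = -0.5 * np.abs(h - theta).sum()
    return float(u3), float(u4)
```

For a perfectly informative channel (\(Y=X\)), \(U_{3}\) attains its maximum \(1-\sum_{x}\theta_{x}^{2}\) and \(U_{4}=0\); for a completely uninformative (uniform) channel, \(U_{3}=0\).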

4.3.4 Expected MSE.

One may also wish to choose \(S\) such that the Bayesian estimator of \(X\) given \(Y\) has the lowest expected squared error. Specifically, given \(k\in[K]\), let \(e_{k}\) be the \(K\times 1\) vector whose \(k\)th component is \(1\) and whose other components are \(0\). A utility function can be defined based on that as
\begin{align}U_{5}(\theta,S,\epsilon):=-\min_{\widehat{e_{X}}}\mathbb{E}_{\theta}\left[\|e_{X}-\widehat{e_{X}}(Y)\|^{2}\right],\end{align}
(10)
where \(\mathbb{E}_{\theta}\left[\|e_{X}-\widehat{e_{X}}(Y)\|^{2}\right]\) is the MSE for the estimator \(\widehat{e_{X}}\) of \(e_{X}\) given \(Y\) when \(X\sim\text{Cat}(\theta)\) and \(Y=\texttt{RRRR}(X;S,\epsilon)\), which is known to be minimized when \(\widehat{e_{X}}\) is the Bayesian estimator of \(e_{X}\). Proposition 3 provides an explicit formula for this utility function. A proof is given in Appendix A.2.
Proposition 3.
For the utility function in (10), we have
\begin{align*}U_{5}(\theta,S,\epsilon)=\sum_{y=1}^{K}\sum_{x=1}^{K}\frac{g_{S,\epsilon}(y|x) ^{2}\theta_{x}^{2}}{h_{S,\epsilon}(y|\theta)}-1.\end{align*}
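The formula in Proposition 3 can be implemented directly in \(\mathcal{O}(K^{2})\) time; a minimal sketch (our own illustration), again assuming a precomputed transition matrix \(G[y,x]=g_{S,\epsilon}(y|x)\):

```python
import numpy as np

def mse_utility(theta, G):
    """U_5 from Proposition 3: sum_{y,x} g(y|x)^2 theta_x^2 / h(y|theta) - 1,
    for a column-stochastic matrix G[y, x] = g_{S,eps}(y|x)."""
    h = G @ theta                 # marginal of Y
    joint = G * theta             # joint[y, x] = g(y|x) * theta_x
    return float((joint**2 / h[:, None]).sum() - 1.0)
```

Sanity checks: for \(Y=X\), the Bayes estimator is exact and \(U_{5}=0\); for a uniform channel, the Bayes estimator is the prior mean and \(U_{5}=\sum_{x}\theta_{x}^{2}-1\).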

4.3.5 Probability of Honest Response.

Our last alternative for the utility function is a simple yet intuitive one, which is the probability of an honest response, i.e.,
\begin{align}U_{6}(\theta,S,\epsilon):=\mathbb{P}_{\theta}(Y=X|S).\end{align}
(11)
This probability is explicitly given by
\begin{align*}\mathbb{P}_{\theta}(Y=X|S) & =\mathbb{P}(Y=X|X\in S)\mathbb{P}_{\theta}(X\in S)+\mathbb{P}(Y=X |X\notin S)\mathbb{P}_{\theta}(X\notin S) \\& =\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+|S|}\left(\sum_{i\in S} \theta_{i}+\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-|S|-1}\sum_{i\notin S} \theta_{i}\right).\end{align*}
Recall that, for computational tractability, we confined the possible sets for \(S\) to the subsets \(\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k)\}\), \(k=0,\ldots,K-1\), and selected \(S\) by solving the maximization problem in (6). Remarkably, if \(U_{6}(\theta,S,\epsilon)\) is used as the utility function, the restricted maximization (6) is equivalent to global maximization, i.e., finding the best \(S\) among all \(2^{K}\) possible subsets. We state this as a theorem and prove it in Appendix A.2.
Theorem 2.
For the utility function \(U_{6}(\theta,S,\epsilon)\) in (11) and \(S_{k,\theta}\)s in (5), we have
\begin{align*}\max_{k=0,\ldots,K-1}U_{6}(\theta,S_{k,\theta},\epsilon)=\max_{S\subset[K]}U_{ 6}(\theta,S,\epsilon).\end{align*}
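For small \(K\), Theorem 2 can be verified numerically by brute force: the sketch below (our own illustration, with 0-based categories) evaluates the closed-form \(U_{6}\) over all subsets of size at most \(K-1\) and compares the best prefix subset against the global optimum.

```python
import itertools
import math

def u6(theta, S, eps1, eps2):
    """Closed-form U_6 = P_theta(Y = X | S) for a subset S of 0-based categories."""
    K, k = len(theta), len(S)
    p_in = sum(theta[i] for i in S)
    return math.exp(eps1) / (math.exp(eps1) + k) * (
        p_in + math.exp(eps2) / (math.exp(eps2) + K - k - 1) * (1.0 - p_in))

theta = [0.45, 0.25, 0.15, 0.10, 0.05]   # already sorted in decreasing order
eps1, eps2 = 1.0, 0.5
K = len(theta)
best_prefix = max(u6(theta, range(k), eps1, eps2) for k in range(K))
best_global = max(u6(theta, S, eps1, eps2)
                  for r in range(K)
                  for S in itertools.combinations(range(K), r))
```

With \(K=5\), the global search covers \(\sum_{r=0}^{4}\binom{5}{r}=31\) subsets, against only \(5\) prefixes.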

4.3.6 Semi-Adaptive Approach.

We also consider a semi-adaptive approach which uses a fixed parameter \(\alpha\in(0,1)\) to select the smallest \(S_{k,\theta}\) in (5) such that \(\mathbb{P}_{\theta}(X\in S_{k,\theta})\geq\alpha\), that is, \(S=\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k^{\ast})\}\) is taken such that
\begin{align*}\mathbb{P}_{\theta}(X\in\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k^ {\ast}-1)\}) < \alpha\text{ and }\mathbb{P}_{\theta}(X\in\{ \sigma_{\theta}(1),\ldots,\sigma_{\theta}(k^{\ast})\})\geq\alpha.\end{align*}
Again, the idea is to randomize the most likely values of \(X\) with high accuracy. The approach forms the subset \(S\) by including values of \(X\) in descending order of their probabilities (given by \(\theta\)) until the cumulative probability reaches \(\alpha\). The resulting set \(S\) is therefore expected to be small (especially when \(\theta\) is unbalanced) while capturing the most likely values of \(X\). Its cardinality varies with the sampled \(\theta\) at the current timestep.
We call this approach “semi-adaptive” because, while it still adapts to \(\theta\), it uses the fixed parameter \(\alpha\). As we will see in Section 7, the best \(\alpha\) depends on various parameters such as \(\epsilon\), \(K\), and the degree of evenness in \(\theta\).
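The semi-adaptive rule is a single cumulative-sum scan over the sorted probabilities; a minimal sketch (our own illustration, with 0-based indices):

```python
import numpy as np

def semi_adaptive_subset(theta, alpha):
    """Smallest prefix S_{k,theta} of categories, sorted by decreasing
    probability, whose cumulative probability reaches alpha (Section 4.3.6)."""
    order = np.argsort(theta)[::-1]                 # sigma_theta: most likely first
    csum = np.cumsum(theta[order])
    k_star = int(np.searchsorted(csum, alpha)) + 1  # first k with csum[k-1] >= alpha
    return order[:k_star]
```

For a heavily unbalanced \(\theta\), the returned subset is small; as \(\theta\) approaches uniform, its cardinality grows toward \(\lceil\alpha K\rceil\).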

4.4 Computational Complexity of Utility Functions

We now provide the computational complexity analysis of the utility metrics presented in Sections 4.3.1–4.3.5, and that of the semi-adaptive approach in Section 4.3.6, as a function of \(K\). The first row of Table 1 shows the computational complexities of calculating the utility function for a fixed \(S\), and the second row shows the complexities of choosing the best \(S\) according to (6). To solve (6), the utility function generally needs to be calculated \(K\) times, which explains the additional factor of \(K\) in the computational complexities in the second row.
Table 1.
| | Fisher | Entropy | \(\text{TV}_{1}\) | \(\text{TV}_{2}\) | MSE | \(\mathbb{P}_{\theta}(Y=X)\) | Semi-Adaptive |
| Computing utility | \(\mathcal{O}(K^{3})\) | \(\mathcal{O}(K^{2})\) | \(\mathcal{O}(K^{2})\) | \(\mathcal{O}(K^{2})\) | \(\mathcal{O}(K^{2})\) | \(\mathcal{O}(K)\) | NA |
| Choosing \(S\) | \(\mathcal{O}(K^{4})\) | \(\mathcal{O}(K^{3})\) | \(\mathcal{O}(K^{3})\) | \(\mathcal{O}(K^{3})\) | \(\mathcal{O}(K^{3})\) | \(\mathcal{O}(K)\) | \(\mathcal{O}(K)\) |
Table 1. Computational Complexity of Utility Functions and Choosing \(S\)
The least demanding utility function is \(U_{6}\), which is based on \(\mathbb{P}_{\theta}(Y=X)\) and whose complexity is \(\mathcal{O}(K)\). Moreover, finding the best \(S\) can also be done in \(\mathcal{O}(K)\) time because one can compute this utility metric for all \(k=0,\ldots,K-1\) by starting with \(S=\emptyset\) and expanding it incrementally. Also note that the semi-adaptive approach does not use a utility metric, and finding \(k^{\ast}\) can be done in \(\mathcal{O}(K)\) time by summing the components of \(\theta\) from largest to smallest until the cumulative sum reaches the given \(\alpha\) parameter. So, its complexity is \(\mathcal{O}(K)\).
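The incremental \(\mathcal{O}(K)\) scan for \(U_{6}\) can be sketched as follows (our own illustration; \(\theta\) is assumed pre-sorted in decreasing order):

```python
import math

def best_subset_u6(theta_sorted, eps1, eps2):
    """O(K) scan over the prefix subsets S_{k,theta}, k = 0..K-1, maximizing the
    closed-form U_6; the running sum p_in avoids recomputing prefix probabilities.
    theta_sorted must be in decreasing order."""
    K = len(theta_sorted)
    best_k, best_u, p_in = 0, -math.inf, 0.0
    for k in range(K):                 # S = the k most likely categories
        u = math.exp(eps1) / (math.exp(eps1) + k) * (
            p_in + math.exp(eps2) / (math.exp(eps2) + K - k - 1) * (1.0 - p_in))
        if u > best_u:
            best_k, best_u = k, u
        p_in += theta_sorted[k]        # extend the prefix for the next k
    return best_k, best_u
```

Maintaining the running prefix probability is what reduces the scan from \(\mathcal{O}(K^{2})\) to \(\mathcal{O}(K)\).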
For all these approaches, it is additionally required to sort \(\theta\) beforehand, which is an \(\mathcal{O}(K\ln K)\) operation with an efficient sorting algorithm like merge sort.
In practice, one can choose among these utility functions depending on the nature of the application. When the number of categories \(K\) or the arrival rate of sensitive data is large, we suggest using \(U_{6}\) or a semi-adaptive approach. When \(K\) and the arrival rate of the personal data are both small, the more computationally demanding utility functions can also be used.
Example 1
(Numerical Illustration). We close this section with an example that shows the benefit of RRRR and the role of \(S\). We consider \(\theta\) values such that \(\theta_{i}/\theta_{i+1}\) is constant for \(i=1,\ldots,K-1\). The ratio \(\theta_{i}/\theta_{i+1}\) controls the degree of “evenness” in \(\theta\): a smaller ratio indicates a more evenly distributed \(\theta\). Note that \(\theta\) is already ordered in this example; hence, we use \(S=\{1,\ldots,k\}\), which contains the \(k\) most likely values of \(X\) according to \(\theta\). Also, for a given \(\epsilon\), we fix \(\epsilon_{1}=0.9\epsilon\) and set \(\epsilon_{2}\) according to (3).
Figure 2 shows, for a fixed \(\epsilon\) and \(K=20\), and various values of \(k\), the probability of the randomized response being equal to the sensitive information, i.e., \(\mathbb{P}_{\theta}(Y=X)\) vs. \(\theta_{i}/\theta_{i+1}\) when \(S=\{1,\ldots,k\}\) in RRRR. (Recall that this probability corresponds to \(U_{6}(\theta,S,\epsilon)\).) Comparing this probability with \(e^{\epsilon}/(e^{\epsilon}+K-1)\), the probability obtained with \(Y=\text{SRR}(X;[K],\epsilon)\), it can be observed that RRRR can do significantly better than SRR if \(k\) can be chosen suitably. The plots demonstrate that the “suitable” \(k\) depends on \(\theta\): While the best \(k\) tends to be larger for more even \(\theta\), small \(k\) becomes the better choice for non-even \(\theta\) (large \(\theta_{i}/\theta_{i+1}\)). This is because, when \(\theta_{i}/\theta_{i+1}\) is large, the probability is concentrated on just a few components, and \(S\) with a small \(k\) captures most of the probability. Moreover, the plots for \(\epsilon=1\) and \(\epsilon=5\) also show the effect of the level of privacy. In more challenging scenarios where \(\epsilon\) is smaller, the gain obtained by RRRR compared to SRR is bigger.
Fig. 2.
Fig. 2. \(\mathbb{P}_{\theta}(Y=X)\) vs. \(\theta_{i}/\theta_{i+1}\) for all \(i=1,\ldots,K-1\) with \(K=20\). Left: \(\epsilon=1\), Right: \(\epsilon=5\).

5 Posterior Sampling

Steps 1–2 of AdOBEst-LDP in Algorithm 1 were detailed in the previous section. In this section, we provide the details of Step 3.
Step 3 of AdOBEst-LDP requires sampling from the posterior distribution \(\Pi(\cdot|Y_{1:n},S_{1:n})\) of \(\theta\) given \(Y_{1:n}\) and \(S_{1:n}\) for \(n\geq 1\), where \(S_{t}\) is the subset selected at time \(t\) to generate \(Y_{t}\) from \(X_{t}\). Let \(\pi(\theta|Y_{1:n},S_{1:n})\) denote the pdf of \(\Pi(\cdot|Y_{1:n},S_{1:n})\). Given \(Y_{1:n}=y_{1:n}\) and \(S_{1:n}=s_{1:n}\), the posterior density can be written as
\begin{align}\pi(\theta|y_{1:n},s_{1:n}) & \propto\eta(\theta)\prod_{t=1}^{n}h_{s_{t},\epsilon}(y_{t}|\theta).\end{align}
(12)
Note that the right-hand side does not include a transition probability for \(S_{t}\)’s because the sampling procedure of \(S_{t}\) given \(Y_{1:t-1}\) and \(S_{1:t-1}\) does not depend on \(\theta^{\ast}\). Furthermore, we assume that the prior distribution \(\eta(\theta)\) is a Dirichlet distribution \(\theta\sim\text{Dir}(\rho_{1},\ldots,\rho_{K})\) with prior hyper-parameters \(\rho_{k}>0\), for \(k=1,\ldots,K\).
Unfortunately, the posterior distribution in (12) is intractable. Therefore, we resort to approximate sampling approaches using MCMC. Below, we present two MCMC methods, namely SGLD and Gibbs sampling.

5.1 SGLD

SGLD is an asymptotically exact, gradient-based MCMC sampling approach that enables the use of subsamples of size \(m\ll n\). A direct application of SGLD to generate samples for \(\theta\) from the posterior distribution in (12) is difficult because \(\theta\) lives in the probability simplex \(\Delta\), which makes keeping the iterates for \(\theta\) inside \(\Delta\) challenging. We overcome this problem by defining the surrogate variables \(\phi_{1},\ldots,\phi_{K}\) with
\begin{align*}\phi_{k}\overset{\text{ind.}}{\sim}\text{Gamma}(\rho_{k},1),\quad k=1,\ldots,K,\end{align*}
and the mapping from \(\phi\) to \(\theta\) as
\begin{align}\theta(\phi)_{k}:=\frac{\phi_{k}}{\sum_{j=1}^{K}\phi_{j}},\quad k=1,\ldots,K.\end{align}
(13)
It is well-known that the resulting \((\theta_{1},\ldots,\theta_{K})\) has a Dirichlet distribution \(\text{Dir}(\rho_{1},\ldots,\rho_{K})\), which is exactly the prior distribution \(\eta(\theta)\). Therefore, this change of variables preserves the originally constructed probabilistic model. Moreover, since \(\phi=(\phi_{1},\ldots,\phi_{K})\) takes values in \([0,\infty)^{K}\), we run SGLD for \(\phi\), where the \(j\)’th update is
\begin{align}\phi^{(j)}=\left|\phi^{(j-1)}+\frac{\gamma_{n}}{2}\left(\nabla_{\phi}\ln p(\phi^{(j-1)})+\frac{n}{m}\sum_{i=1}^{m}\nabla_{\phi}\ln p_{s_{u_{i}},\epsilon}(y_{u_{i}}|\phi^{(j-1)})\right)+\sqrt{\gamma_{n}}W_{j}\right|,\quad W_{j}\sim\mathcal{N}(0,I_{K}),\end{align}
(14)
where \(u=(u_{1},\ldots,u_{m})\) is a random subsample of \(\{1,\ldots,n\}\). In (14), the “new” prior and likelihood functions are
\begin{align}p(\phi):=\prod_{k=1}^{K}\text{Gamma}(\phi_{k};\rho_{k},1),\quad p_{s,\epsilon}(y|\phi):=h_{s,\epsilon}(y|\theta(\phi)).\end{align}
(15)
The reflection in (14) via taking the component-wise absolute value is necessary because each \(\phi_{k}^{(j)}\) must be positive. Step 3 of Algorithm 1 can be approximated by running SGLD for some \(M>0\) iterations. To exploit the SGLD updates from the previous time, one should start the updates at time \(n\) by setting the initial value for \(\phi\) to the last SGLD iterate at time \(n-1\).
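The change of variables in (13) can be checked empirically: normalizing independent \(\text{Gamma}(\rho_{k},1)\) draws reproduces the moments of \(\text{Dir}(\rho_{1},\ldots,\rho_{K})\). A small simulation (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def theta_from_phi(phi):
    """The mapping (13): normalize independent Gamma(rho_k, 1) variables."""
    return phi / phi.sum(axis=-1, keepdims=True)

# Empirical check that the normalized gammas follow Dir(rho): E[theta_k] = rho_k / sum(rho).
rho = np.array([1.0, 2.0, 3.0])
phi = rng.gamma(shape=rho, scale=1.0, size=(100_000, 3))
theta = theta_from_phi(phi)
```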
The next proposition provides the explicit formulae for the gradients of the log-prior and the log-likelihood of \(\phi\) in (14). A proof is given in Appendix A.3.
Proposition 4.
For \(p(\phi)\) and \(p_{s,\epsilon}(y|\phi)\) in (15), we have
\begin{align*}[\nabla_{\phi}\ln p(\phi)]_{i}=\frac{\rho_{i}-1}{\phi_{i}}-1,\quad[\nabla_{\phi}\ln p_{s,\epsilon}(y|\phi)]_{i}=\sum_{k=1}^{K-1}J(i,k)\frac{g_{S,\epsilon}(y|k)-g_{S,\epsilon}(y|K)}{h_{S,\epsilon}(y|\theta(\phi))},\end{align*}
where \(J\) is a \(K\times(K-1)\) Jacobian matrix whose \((i,j)\)th element is
\begin{align*}J(i,j)=\mathbb{I}(i=j)\frac{1}{\sum_{k=1}^{K}\phi_{k}}-\frac{\phi_{j}}{\left(\sum_{k=1}^{K}\phi_{k}\right)^{2}}.\end{align*}
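Putting the pieces together, one reflected SGLD update can be sketched generically as below (our own illustration: the gradient callbacks stand in for the formulas of Proposition 4, and the Gaussian noise is scaled by \(\sqrt{\gamma}\), the standard SGLD scaling):

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_step(phi, grad_log_prior, grad_log_lik_batch, n, m, gamma):
    """One reflected SGLD update in the spirit of (14): a gradient step on the
    log-posterior (minibatch log-likelihood gradient rescaled by n/m), Gaussian
    noise with variance gamma, then a component-wise absolute value so that
    every phi_k stays positive."""
    drift = grad_log_prior(phi) + (n / m) * grad_log_lik_batch(phi)
    noise = np.sqrt(gamma) * rng.standard_normal(phi.shape)
    return np.abs(phi + 0.5 * gamma * drift + noise)
```

The component-wise absolute value implements the reflection that keeps \(\phi\) in \([0,\infty)^{K}\).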

5.2 Gibbs Sampling

An alternative to SGLD is the Gibbs sampler, which operates on the joint posterior distribution of \(\theta\) and \(X_{1:n}\) given \(Y_{1:n}=y_{1:n}\) and \(S_{1:n}=s_{1:n}\),
\begin{align*}p(\theta,x_{1:n}|y_{1:n},s_{1:n})\propto\eta(\theta)\left[\prod_{t=1}^{n} \theta_{x_{t}}g_{s_{t},\epsilon}(y_{t}|x_{t})\right].\end{align*}
The full conditional distributions of \(X_{1:n}\) and \(\theta\) are tractable. Specifically, for \(X_{1:n}\), we have
\begin{align}p(x_{1:n}|y_{1:n},s_{1:n},\theta)=\prod_{t=1}^{n}p_{s_{t},\epsilon}(x_{t}|y_{t},\theta),\end{align}
(16)
where \(p_{s_{t},\epsilon}(x_{t}|y_{t},\theta)\) is defined in (9). Therefore, (16) is a product of \(n\) categorical distributions, each with support \([K]\). Furthermore, the full conditional distribution of \(\theta\) is a Dirichlet distribution due to the conjugacy between the categorical and the Dirichlet distributions. Specifically,
\begin{align*}p(\theta|x_{1:n},y_{1:n},s_{1:n})=\text{Dir}(\theta|\rho^{\text{post}}_{1}, \ldots,\rho^{\text{post}}_{K}),\end{align*}
where the hyper-parameters of the posterior distribution are given by \(\rho^{\text{post}}_{k}:=\rho_{k}+\sum_{t=1}^{n}\mathbb{I}(x_{t}=k)\) for \(k=1,\ldots,K\).
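One full sweep of this Gibbs sampler can be sketched as follows (our own illustration; `Gs[t]` denotes the transition matrix used at time \(t\), with 0-based categories):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sweep(theta, ys, Gs, rho):
    """One Gibbs sweep for Section 5.2: sample each x_t from its full
    conditional (9), then theta from the conjugate Dirichlet. Gs[t] is the
    K x K transition matrix used at time t, Gs[t][y, x] = g_{s_t,eps}(y|x)."""
    K = len(theta)
    counts = np.zeros(K)
    for y, G in zip(ys, Gs):
        w = G[y, :] * theta              # unnormalized posterior of x_t given y_t, theta
        x = rng.choice(K, p=w / w.sum())
        counts[x] += 1
    return rng.dirichlet(rho + counts)   # conjugate Dirichlet update
```

Each sweep costs \(\mathcal{O}(nK)\), which is the source of the overall \(\mathcal{O}(n^{2}K)\) cost of running the sampler online for \(n\) timesteps.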
The computational load of sampling from the \(n\) distributions in (16) at time \(n\) is proportional to \(nK\), which renders the computational complexity of Gibbs sampling \(\mathcal{O}(n^{2}K)\) after \(n\) timesteps. This can be computationally prohibitive when \(n\) gets large.

6 Theoretical Analysis

We address two questions concerning AdOBEst-LDP in Algorithm 1 when it is run with RRRR whose subset is selected as described in Section 4.3. (i) Does the targeted posterior distribution based on the observations generated by Algorithm 1 converge to the true value \(\theta^{\ast}\)? (ii) How frequently does Algorithm 1 with RRRR select the optimum subset \(S\) according to the chosen utility function?

6.1 Convergence of the Posterior Distribution

We begin by developing the joint probability distribution of the random variables involved in AdOBEst-LDP.
Given \(Y_{1:n}\) and \(S_{1:n}\), the posterior distribution \(\Pi(\cdot|Y_{1:n},S_{1:n})\) is defined such that for any measurable set \(A\subseteq\Delta\), the posterior probability of \(\{\theta\in A\}\) is given by
\begin{align}\Pi(A|Y_{1:n},S_{1:n}):=\frac{\int_{A}\eta(\theta)\prod_{t=1}^{n}h_{S_{t}, \epsilon}(Y_{t}|\theta)\mathrm{d}\theta}{\int_{\Delta}\eta(\theta)\prod_{t=1}^ {n}h_{S_{t},\epsilon}(Y_{t}|\theta)\mathrm{d}\theta}.\end{align}
(17)
Let \(Q(\cdot|Y_{1:n},S_{1:n},\Theta_{n-1})\) be the probability distribution corresponding to the posterior sampling process for \(\Theta_{n}\). Note that if exact posterior sampling was used, we would have \(Q(A|Y_{1:n},S_{1:n},\) \(\Theta_{n-1})=\Pi(A|Y_{1:n},S_{1:n})\); however, when approximate sampling techniques are used to target \(\Pi\), such as SGLD or Gibbs sampling, the equality does not hold in general.
For \(\theta\in\Delta\), let
\begin{align*}S^{\ast}_{\theta}:=\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k_{ \theta}^{\ast})\},\quad\text{with}\quad k^{\ast}_{\theta}:=\arg\max_{k \in\{0,\ldots,K-1\}}U(\theta,S_{k,\theta},\epsilon),\end{align*}
be the best subset according to \(\theta\), where \(S_{k,\theta}=\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k)\}\) is defined in (5). Given \(\Theta_{1:t-1}\) and \(Y_{1:t-1}\), \(S_{t}\) depends only on \(\Theta_{t-1}\) and is given by \(S_{t}=S^{\ast}_{\Theta_{t-1}}\).
Combining all, the joint law of \(S_{1:n},Y_{1:n}\) can be expressed as
\begin{align}P_{\theta^{\ast}}(S_{1:n},Y_{1:n}):=\prod_{t=1}^{n}h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\left[\int_{\Delta}\mathbb{I}(S_{t}=S^{\ast}_{\theta_{t-1}})Q(\mathrm{d}\theta_{t-1}|Y_{1:t-1},S_{1:t-1},\theta_{t-2})\right],\end{align}
(18)
where we use the convention that \(Q(\mathrm{d}\theta_{0}|Y_{1:0},S_{1:0},\theta_{-1})=\delta_{\theta_{\text{init }}}(\mathrm{d}\theta_{0})\) for an initial value \(\theta_{\text{init}}\in\Delta\).
The posterior probability in (17) is a random variable with respect to \(P_{\theta^{\ast}}\) defined in (18). Theorem 3 establishes that, under the fairly mild Assumption 1 on the prior, the posterior distribution \(\Pi(\cdot|Y_{1:n},S_{1:n})\) concentrates around \(\theta^{\ast}\) regardless of the choice of \(Q\) for posterior sampling.
Assumption 1.
There exist finite positive constants \(d>0\) and \(B>0\) such that \(\eta(\theta)/\eta(\theta^{\prime})<B\) for all \(\theta,\theta^{\prime}\in\Delta\) whenever \(\|\theta^{\prime}-\theta^{\ast}\|<d\).
Theorem 3.
Under Assumption 1, there exists a constant \(c>0\) such that, for any \(0<a<1\) and the sequence of sets
\begin{align*}\Omega_{n}=\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2}\leq cn^{-a}\},\end{align*}
the sequence of posterior probabilities satisfies
\begin{align*}\Pi(\Omega_{n}|Y_{1:n},S_{1:n})\overset{P_{\theta^{\ast}}}{\rightarrow}1\quad\text{as }n\rightarrow\infty,\end{align*}
regardless of the choice of \(Q\).
A proof is given in Appendix A.4.2, where the constant \(c\) in the sets \(\Omega_{n}\) is explicitly given.

6.2 Selecting the Best Subset

Let \(S^{\ast}:=S^{\ast}_{\theta^{\ast}}\) be the best subset at \(\theta^{\ast}\). In this part, we prove that if posterior sampling is performed exactly, the best subset is chosen with an expected long-run frequency of \(1\). Our result relies on some mild assumptions.
Assumption 2.
The components of \(\theta^{\ast}\) are strictly ordered, that is, \(\theta^{\ast}_{\sigma_{\theta^{\ast}}(1)}>\cdots>\theta^{\ast}_{\sigma_{\theta^{\ast}}(K)}\).
Assumption 3.
Given any \(S\subset[K]\) and \(\epsilon>0\), \(U(\theta,S,\epsilon)\) is a continuous function of \(\theta\) with respect to the \(L_{2}\)-norm.
Assumption 4.
The solution of (6) is unique at \(\theta^{\ast}\).
Assumption 2 is required to avoid technical issues regarding the uniqueness of \(S^{\ast}\). Assumptions 3 and 4 impose a certain form of regularity on the utility function.
Theorem 4.
Suppose Assumptions 1–4 hold and the \(\Theta_{t}\)s are generated by exact sampling, that is, \(Q(A|Y_{1:t},S_{1:t},\Theta_{t-1})=\Pi(A|Y_{1:t},S_{1:t})\) for all measurable \(A\subseteq\Delta\). Then,
\begin{align}\lim_{n\rightarrow\infty}P_{\theta^{\ast}}(S_{n}=S^{\ast})=1.\end{align}
(19)
As a corollary, \(S^{\ast}\) is selected with an expected long-run frequency of \(1\), that is,
\begin{align}\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}E_{\theta^{\ast}}\left[ \mathbb{I}(S_{t}=S^{\ast})\right]=1.\end{align}
(20)
The result in (20) can be likened to sublinear regret from reinforcement learning theory.

7 Numerical Results

We tested1 the performance of AdOBEst-LDP when the subset \(S\) in RRRR is determined according to a utility function from Section 4.3. We compared AdOBEst-LDP, combined with each of the utility functions defined in Sections 4.3.1–4.3.5, with its non-adaptive counterpart in which SRR is used to generate \(Y_{t}\) at all steps. We also included the semi-adaptive subset selection method of Section 4.3.6 in the comparison. For the semi-adaptive approach, we obtained results for five different values of its \(\alpha\) parameter, namely \(\alpha\in\{0.2,0.6,0.8,0.9,0.95\}\).
We ran each method for 50 Monte Carlo runs. Each run contained \(T=500K\) timesteps. For each run, the sensitive information was generated as \(X_{t}\overset{\text{i.i.d.}}{\sim}\text{Cat}(\theta^{\ast})\), where \(\theta^{\ast}\) itself was randomly drawn from \(\text{Dirichlet}(\rho,\ldots,\rho)\). Here, the parameter \(\rho\) was used to control the unevenness among the components of \(\theta^{\ast}\) (smaller \(\rho\) leads to more uneven components in general). At each timestep, Step 3 of Algorithm 1 was performed by running \(M=20\) updates of an SGLD-based MCMC kernel as described in Section 5.1. In SGLD, we took the subsample size \(m=50\) and the step size \(\gamma_{t}=\frac{0.5}{t}\) at timestep \(t\). The prior hyper-parameters of the gamma distributions were set to \(\rho_{k}=1\) for all \(k\). The posterior sample \(\Theta_{t}\) was taken as the last iterate of those SGLD updates. Only for the last timestep, \(t=T\), the number of MCMC iterations was increased to \(2{,}000\) so as to reliably calculate the final estimate \(\hat{\theta}\) of \(\theta\) by averaging the last \(1{,}000\) of those \(2{,}000\) iterates. (This average is the MCMC approximation of the posterior mean of \(\theta\) given \(Y_{1:T}\) and \(S_{1:T}\).) The performance measure was taken as the TV distance between \(\text{Cat}(\theta^{\ast})\) and \(\text{Cat}(\hat{\theta})\), that is,
\begin{align}\frac{1}{2}\sum_{i=1}^{K}|\hat{\theta}_{i}-\theta^{\ast}_{i}|.\end{align}
(21)
Finally, the comparison among the methods was repeated for all combinations \((K,\epsilon,\kappa,\rho)\) of \(K\in\{10,20\}\), \(\epsilon\in\{0.5,1,5\}\), \(\kappa\in\{0.8,0.9\}\) (where \(\epsilon_{1}=\kappa\epsilon\)), and \(\rho\in\{0.01,0.1,1\}\).
The accuracy results for the methods in comparison are summarized in Figures 3 and 4 in terms of the error given in (21). The box plots are centered at the error median, and the whiskers stretch from the minimum to the maximum over the 50 MC runs, excluding the outliers. When the medians are compared, the fully adaptive algorithms, which use a utility function to select \(S_{t}\), yield comparable results to the best semi-adaptive approach in both figures. As one may expect, the non-adaptive approach yielded the worst results in general, especially in the high-privacy regimes (smaller \(\epsilon\)) and for uneven \(\theta^{\ast}\) (smaller \(\rho\)). We also observe that, while most utility metrics are generally robust, the one based on the FIM seems sensitive to the choice of the \(\epsilon_{1}\) parameter. This can be attributed to the fact that the FIM approaches singularity when \(\epsilon_{2}\) is too small, which is the case if \(\epsilon_{1}\) is chosen too close to \(\epsilon\). Supporting this, we see that when \(\epsilon_{1}=0.8\epsilon\), the FIM-based utility metric becomes more robust. Another remarkable observation is that the utility function based on the probability of honest response, \(U_{6}\), has competitive performance despite being the lightest utility metric in computational complexity. Finally, while the semi-adaptive approach is computationally less demanding than most fully adaptive versions, the results show it can dramatically fail if its \(\alpha\) hyper-parameter is not tuned properly. In contrast, the fully adaptive approaches adapt well to \(\epsilon\) and \(\rho\) and do not need additional tuning.
Fig. 3.
Fig. 3. TV distance in (21) for \(K\in\{10,20\}\), \(\epsilon_{1}=0.8\epsilon\).
Fig. 4.
Fig. 4. TV distance in (21) for \(K\in\{10,20\}\), \(\epsilon_{1}=0.9\epsilon\).
In addition to the error graphs, the heat maps in Figures 5 and 6 show the effect of the parameters \(\rho\) and \(\epsilon\) on the average cardinality of the subsets \(S\) chosen by each algorithm (again, averaged over 50 Monte Carlo runs). According to these figures, increasing the value of \(\rho\) increases the cardinalities of the subsets chosen by each algorithm (except the non-adaptive one, which uses all \(K\) categories rather than a smaller subset). This is expected since higher \(\rho\) values cause \(\text{Cat}(\theta^{\ast})\) to be closer to the uniform distribution, thus causing \(X\) to be more evenly distributed among the categories. Moreover, for small \(\rho\), increasing the value of \(\epsilon\) decreases the cardinalities of these subsets, which can be attributed to a higher \(\epsilon\) leading to more accurate estimation. When we compare the utility functions for the adaptive approach among themselves, we observe that for \(\epsilon_{1}=0.8\epsilon\), the TV\(_{1}\)-based utility function uses the subsets with the largest cardinality (on average). However, when we increase the \(\epsilon_{1}\) value to \(\epsilon_{1}=0.9\epsilon\), the FIM-based utility function uses the subsets with the largest cardinality. This might be due to the previously mentioned sensitivity of the FIM-based utility function to the choice of the \(\epsilon_{1}\) parameter, which affects the invertibility of the FIM when \(\epsilon_{1}\) is too close to \(\epsilon\).
Fig. 5.
Fig. 5. Average cardinalities of the subsets selected by each method, for \(K\in\{10,20\}\), \(\epsilon_{1}=0.8\epsilon\).
Fig. 6.
Fig. 6. Average cardinalities of the subsets selected by each method, for \(K\in\{10,20\}\), \(\epsilon_{1}=0.9\epsilon\).

8 Conclusion

In this article, we proposed a new adaptive framework, AdOBEst-LDP, for online estimation of the distribution of categorical data under the \(\epsilon\)-LDP constraint. AdOBEst-LDP, run with RRRR for randomization, encompasses both privatization of the sensitive data and accurate Bayesian estimation of population parameters from privatized data in a dynamic way. Our privatization mechanism (RRRR) is distinguished from the baseline approach (SRR) in that it operates on a smaller subset of the sample space rather than the entire sample space. We employed an adaptive approach to dynamically adjust the subset at each iteration, based on the knowledge about \(\theta^{\ast}\) obtained from the past data. The selection of these subsets was guided by the various alternative utility functions used throughout the article. For the posterior sampling of \(\theta\) at each iteration, we employed an efficient SGLD-based sampling scheme on a constrained region, namely the \(K\)-dimensional probability simplex. We distinguished this scheme from Gibbs sampling, which uses all of the historical data and is not scalable to large datasets.
In the numerical experiments, we demonstrated that AdOBEst-LDP can estimate the population distribution more accurately than the non-adaptive approach under experimental settings with various privacy levels \(\epsilon\) and degrees of evenness among the components of \(\theta^{\ast}\). While the performance of AdOBEst-LDP is generally robust for all the utility functions considered in the article, the utility function based on the probability of honest response can be preferred due to its much lower computational complexity than the other utility functions. Our experiments also showed that the accuracy of the adaptive approach is comparable to that of the semi-adaptive approach. However, the semi-adaptive approach requires adjusting its parameter \(\alpha\) carefully, which makes it challenging to use.
In a theoretical analysis, we showed that, regardless of whether the posterior sampling is conducted exactly or approximately, the posterior distribution targeted in AdOBEst-LDP converges to the true population parameter \(\theta^{\ast}\). We also showed that, under exact posterior sampling, the best subset for a given utility function is selected with probability \(1\) in the long run.
It is important to note that the observations \(\{Y_{t}\}_{t\geq 1}\) generated by AdOBEst-LDP are dependent. Therefore, the theoretical analysis presented in Section 6 can also be seen as a contribution to the literature on the convergence of posterior distributions with dependent data. Additionally, we have already highlighted an analogy between AdOBEst-LDP and Thompson sampling [21]. Both methods involve posterior sampling, and the subset selection step in AdOBEst-LDP can be viewed as analogous to the action selection step in reinforcement learning schemes. In this regard, we believe that the theoretical results may also inspire future research on the convergence of dynamic reinforcement learning algorithms, especially those based on Thompson sampling.
Categorical distributions serve as useful nonparametric discrete approximations of continuous distributions. As a potential future direction, AdOBEst-LDP could be adapted for nonparametric density estimation. A key challenge in this context would be determining how to partition the support domain of the data.
RRRR is a practical LDP mechanism with a subset parameter that adapts based on past data. It has been shown to outperform SRR when leveraging the knowledge of \(\theta^{\ast}\). However, in this work, it is not proven that RRRR is the optimal \(\epsilon\)-LDP mechanism with respect to the utility functions considered. While the optimal \(\epsilon\)-LDP mechanism could be identified numerically by solving a constrained optimization problem—where the utility function is maximized under the LDP constraint—it may not have a closed-form solution for complex utility functions. A promising direction for future research would be to compare the optimal \(\epsilon\)-LDP mechanism with the \(\epsilon\)-LDP RRRR mechanism by analyzing their transition probability matrices and assessing the suboptimality of RRRR. Additionally, insights from the optimal \(\epsilon\)-LDP mechanism could inspire the development of new, tractable, and approximately optimal \(\epsilon\)-LDP mechanisms.

Acknowledgments

We thank our colleague Prof. Berrin Yanıkoğlu for reviewing the draft of the article and providing insightful comments.

Footnote

1
The MATLAB code at https://github.com/soneraydin/AdOBEst_LDP can be used to reproduce the results obtained in this article.

Appendix

A Proofs

A.1 Proofs for LDP of RRRR

Proof of Theorem 1.
Let \(k=|S|\). The transition probabilities of RRRR can be written as
\begin{align}g_{S,\epsilon}(y|x)=\begin{cases}\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k} & x \in S,y\in S,x=y \\\frac{1}{e^{\epsilon_{1}}+k} & x\in S,y\in S,x\neq y \\\frac{1}{K-k}\frac{1}{e^{\epsilon_{1}}+k} & x\in S,y\notin S \\\frac{1}{e^{\epsilon_{1}}+k} & x\notin S,y\in S \\\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-k-1}\frac{e^{\epsilon_{1}}}{e^{ \epsilon_{1}}+k} & x\notin S,y\notin S,x=y \\\frac{1}{e^{\epsilon_{2}}+K-k-1}\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k} & x \notin S,y\notin S,x\neq y \\\end{cases}.\end{align}
(22)
We will show that when \(\epsilon_{1},\epsilon_{2}\) are chosen according to the theorem,
\begin{align}e^{-\epsilon}\leq\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}\leq e ^{\epsilon}\end{align}
(23)
for all possible \(x,x^{\prime},y\in[K]\). When \(S=\emptyset\), the claim is trivial; hence, we focus on the case \(S\neq\emptyset\), where verifying (23) for the transition probabilities \(g_{S,\epsilon}(y|x)\) requires checking \(10\) different cases for the interrelation of \(x\), \(x^{\prime}\), and \(y\).
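Before walking through the cases, the bound (23) can also be checked numerically: the sketch below (our own illustration, with 0-based categories) builds the transition matrix from the case table (22) with parameters satisfying the condition of Theorem 1 and verifies that all likelihood ratios lie in \([e^{-\epsilon},e^{\epsilon}]\).

```python
import numpy as np

def rrrr_matrix(S, K, eps1, eps2):
    """Transition matrix G[y, x] = g_{S,eps}(y|x), following the case table (22)."""
    k = len(S)
    e1, e2 = np.exp(eps1), np.exp(eps2)
    G = np.empty((K, K))
    for x in range(K):
        for y in range(K):
            if x in S and y in S:
                G[y, x] = (e1 if x == y else 1.0) / (e1 + k)
            elif x in S and y not in S:
                G[y, x] = 1.0 / ((K - k) * (e1 + k))
            elif x not in S and y in S:
                G[y, x] = 1.0 / (e1 + k)
            else:  # x not in S, y not in S
                G[y, x] = (e2 if x == y else 1.0) / (e2 + K - k - 1) * e1 / (e1 + k)
    return G

# eps1 = eps2 = 1 and eps = 2 with K = 5, |S| = 2 satisfy the theorem's condition:
# e^{eps1 - eps}(K - k) - 1 ~ 0.104 > 0 and e^{eps2} ~ 2.72 <= (K - k - 1)/0.104 ~ 19.3.
eps, eps1, eps2, K = 2.0, 1.0, 1.0, 5
G = rrrr_matrix({0, 1}, K, eps1, eps2)
ratios = G[:, :, None] / G[:, None, :]   # ratios[y, x, x'] = g(y|x) / g(y|x')
```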
(C1)
\(x\in S\), \(x^{\prime}\notin S\), \(y\in S\), \(y=x\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{\frac{e^{ \epsilon_{1}}}{e^{\epsilon_{1}}+k}}{\frac{1}{e^{\epsilon_{1}}+k}}=e^{\epsilon_ {1}}.\end{align*}
Since \(\epsilon_{1}\leq\epsilon\), (23) holds.
(C2)
\(x\in S\), \(x^{\prime}\notin S\), \(y\in S\), \(y\neq x\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{\frac{1}{e^{ \epsilon_{1}}+k}}{\frac{1}{e^{\epsilon_{1}}+k}}=1,\end{align*}
which trivially implies (23).
(C3)
\(x\in S\), \(x^{\prime}\notin S\), \(y\notin S\), \(y=x^{\prime}\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{\frac{1}{K-k} \frac{1}{e^{\epsilon_{1}}+k}}{\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-k-1} \frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k}}=\frac{e^{\epsilon_{2}}+K-k-1}{(K- k)e^{\epsilon_{1}+\epsilon_{2}}}.\end{align*}
We can show that \(\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}\leq 1\leq e^{\epsilon}\) already holds since
\begin{align*}\frac{e^{\epsilon_{2}}+K-k-1}{(K-k)e^{\epsilon_{1}+\epsilon_{2}}}=\frac{(K-k-1) +e^{\epsilon_{2}}}{(K-k-1)e^{\epsilon_{1}+\epsilon_{2}}+e^{\epsilon_{1}+ \epsilon_{2}}},\end{align*}
and the first and the second terms in the numerator are smaller than those in the denominator, respectively. For the other side of the inequality,
\begin{align*}\frac{e^{\epsilon_{2}}+K-k-1}{(K-k)e^{\epsilon_{1}+\epsilon_{2}}}\geq e^{-\epsilon}\end{align*}
requires
\begin{align*}e^{\epsilon_{2}}\leq\frac{K-k-1}{e^{\epsilon_{1}-\epsilon}(K-k)-1}\end{align*}
whenever \(e^{\epsilon_{1}-\epsilon}(K-k)-1>0\), which is the condition given in the theorem.
(C4)
\(x\in S\), \(x^{\prime}\notin S\), \(y\notin S\), \(y\neq x^{\prime}\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{\frac{1}{K-k} \frac{1}{e^{\epsilon_{1}}+k}}{\frac{1}{e^{\epsilon_{2}}+K-k-1}\frac{e^{ \epsilon_{1}}}{e^{\epsilon_{1}}+k}}=\frac{e^{\epsilon_{2}}+(K-k)-1}{(K-k)e^{ \epsilon_{1}}}.\end{align*}
Since \(\epsilon_{2}\leq\epsilon\) and \((K-k)(e^{\epsilon+\epsilon_{1}}-1)+1\geq e^{\epsilon+\epsilon_{1}}\geq e^{\epsilon}\), we have
\begin{align*}e^{\epsilon_{2}}\leq(K-k)(e^{\epsilon+\epsilon_{1}}-1)+1.\end{align*}
Hence
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}\leq\frac{(K-k)(e^{ \epsilon+\epsilon_{1}}-1)+1+K-k-1}{(K-k)e^{\epsilon_{1}}}\leq e^{\epsilon},\end{align*}
which proves the right-hand side inequality. For the left-hand side, we have
\begin{align*}\frac{e^{\epsilon_{2}}+K-k-1}{(K-k)e^{\epsilon_{1}}}=\frac{(e^{\epsilon_{2}}-1) +K-k}{(K-k)e^{\epsilon_{1}}}\geq\frac{K-k}{(K-k)e^{\epsilon_{1}}}=e^{- \epsilon_{1}}\geq e^{-\epsilon}\end{align*}
since \(\epsilon_{2}\geq 0\) and \(\epsilon_{1}\leq\epsilon\).
(C5)
\(x,x^{\prime}\in S\), \(y\in S\), \(y=x\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{e^{\epsilon_{1} }/(e^{\epsilon_{1}}+k)}{1/(e^{\epsilon_{1}}+k)}=e^{\epsilon_{1}}.\end{align*}
Since \(\epsilon_{1}\leq\epsilon\), (23) holds.
(C6)
\(x,x^{\prime}\in S\), \(y\in S\), \(y\neq x\) and \(y\neq x^{\prime}\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{1/(e^{\epsilon_ {1}}+k)}{1/(e^{\epsilon_{1}}+k)}=1.\end{align*}
So (23) trivially holds.
(C7)
\(x,x^{\prime}\in S\), \(y\notin S\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{\frac{1}{K-k} \frac{1}{e^{\epsilon_{1}}+k}}{\frac{1}{K-k}\frac{1}{e^{\epsilon_{1}}+k}}=1.\end{align*}
So, (23) trivially holds.
(C8)
\(x,x^{\prime}\notin S\), \(y\notin S\), \(y=x\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{e^{\epsilon_{2} }/(e^{\epsilon_{2}}+K-k-1)e^{\epsilon_{1}}/(e^{\epsilon_{1}}+k)}{1/(e^{ \epsilon_{2}}+K-k-1)e^{\epsilon_{1}}/(e^{\epsilon_{1}}+k)}=e^{\epsilon_{2}}.\end{align*}
Since \(\epsilon_{2}\leq\epsilon\), (23) holds.
(C9)
\(x,x^{\prime}\notin S\), \(y\notin S\), \(y\neq x\), \(y\neq x^{\prime}\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{1/(e^{\epsilon_ {2}}+K-k-1)e^{\epsilon_{1}}/(e^{\epsilon_{1}}+k)}{1/(e^{\epsilon_{2}}+K-k-1)e^ {\epsilon_{1}}/(e^{\epsilon_{1}}+k)}=1.\end{align*}
So, (23) trivially holds.
(C10)
\(x,x^{\prime}\notin S\), \(y\in S\). We have
\begin{align*}\frac{g_{S,\epsilon}(y|x)}{g_{S,\epsilon}(y|x^{\prime})}=\frac{1/(e^{\epsilon_{1}}+k)}{1/(e^{\epsilon_{1}}+k)}=1.\end{align*}
So (23) trivially holds.
We conclude the proof by noting that any other case left out is symmetric in \((x,x^{\prime})\) to one of the covered cases and, therefore, does not need to be checked separately. ∎
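As a numerical sanity check on the case analysis above, the mechanism (22) can be implemented directly and the ratio bound (23) verified over all inputs and outputs. The sketch below is our illustration, not code from the paper; it assumes the choice \(\epsilon_{1}=\epsilon_{2}=\epsilon/2\), which satisfies the conditions of the theorem (in particular, the bound on \(e^{\epsilon_{2}}\) needed in case C3).

```python
import numpy as np
from itertools import combinations

def ldp_matrix(K, S, eps1, eps2):
    """Transition matrix G[y, x] = g_{S,eps}(y|x) from Equation (22)."""
    k = len(S)
    e1, e2 = np.exp(eps1), np.exp(eps2)
    G = np.zeros((K, K))
    for x in range(K):
        for y in range(K):
            if x in S and y in S:
                G[y, x] = (e1 if x == y else 1.0) / (e1 + k)
            elif x in S and y not in S:
                G[y, x] = 1.0 / ((K - k) * (e1 + k))
            elif x not in S and y in S:
                G[y, x] = 1.0 / (e1 + k)
            else:  # x not in S, y not in S
                G[y, x] = (e2 if x == y else 1.0) / (e2 + K - k - 1) * e1 / (e1 + k)
    return G

K, eps = 5, 1.0
eps1 = eps2 = eps / 2  # illustrative choice satisfying the theorem's conditions
cols_ok, lo, hi = True, np.inf, -np.inf
for k in range(1, K):  # all non-empty subsets; S = emptyset is the trivial case
    for S in combinations(range(K), k):
        G = ldp_matrix(K, set(S), eps1, eps2)
        cols_ok &= np.allclose(G.sum(axis=0), 1.0)  # each column is a distribution
        ratios = G[:, :, None] / G[:, None, :]      # ratios[y, x, x'] = g(y|x)/g(y|x')
        lo, hi = min(lo, ratios.min()), max(hi, ratios.max())
ldp_ok = (lo >= np.exp(-eps) - 1e-12) and (hi <= np.exp(eps) + 1e-12)
print(cols_ok, ldp_ok)
```

Running this confirms that every likelihood ratio stays within \([e^{-\epsilon},e^{\epsilon}]\), matching cases (C1)–(C10).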

A.2 Proofs about Utility Functions

Proof of Proposition 1.
Given \(\theta\in\Delta\), let \(\vartheta\) be the \((K-1)\times 1\) column vector such that \(\vartheta_{i}=\theta_{i}\) for \(i=1,\ldots,K-1\). We can write the FIM in terms of the score vector as follows.
\begin{align*}F(\theta;S,\epsilon)=\mathbb{E}_{Y}\left[\nabla_{\vartheta}\ln h_{S,\epsilon}(Y|\theta)\nabla_{\vartheta}\ln h_{S,\epsilon}(Y|\theta)^{\top}\right]=\sum_{y= 1}^{K}h_{S,\epsilon}(y|\theta)\left[\nabla_{\vartheta}\ln h_{S,\epsilon}(y| \theta)\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}\right].\end{align*}
Noting that
\begin{align*}h_{S,\epsilon}(y|\theta)=\sum_{k=1}^{K-1}g_{S,\epsilon}(y|k)\vartheta_{k}+g_{S, \epsilon}(y|K)\left(1-\sum_{k=1}^{K-1}\vartheta_{k}\right),\end{align*}
the score vector can be derived as
\begin{align}[\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)]_{k}=\frac{g_{S,\epsilon}(y|k) -g_{S,\epsilon}(y|K)}{h_{S,\epsilon}(y|\theta)},\quad k=1,\ldots,K-1.\end{align}
(24)
As the \(K\times(K-1)\) matrix \(A_{S,\epsilon}\) defined as \(A(i,j)=g(i|j)-g(i|K)\), we can rewrite (24) as \([\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)]_{k}=A_{S,\epsilon}(y,k)/h_{S, \epsilon}(y|\theta)\). Let \(a_{y}\) be the \(y\)’th row of \(A_{S,\epsilon}\), and recall that \(D_{\theta}\) is defined as a diagonal matrix with \(1/h_{S,\epsilon}(j|\theta)\) being the \(j\)’th element in the diagonal. Then, the FIM is
\begin{align*}F(\theta;S,\epsilon)=\sum_{y=1}^{K}\frac{a_{y}^{\top}}{h_{S,\epsilon}(y|\theta) }\frac{a_{y}}{h_{S,\epsilon}(y|\theta)}h_{S,\epsilon}(y|\theta)=\sum_{y=1}^{K }a_{y}^{\top}\frac{1}{h_{S,\epsilon}(y|\theta)}a_{y}=A_{S,\epsilon}^{\top}D_{ \theta}A_{S,\epsilon},\end{align*}
as claimed. ∎
Next, we prove that \(F(\theta;S,\epsilon)\) is invertible. Let \(G_{S,\epsilon}\) be the \(K\times K\) matrix whose elements are
\begin{align}G_{S,\epsilon}(i,j)=g_{S,\epsilon}(i|j),\quad i,j=1,\ldots,K.\end{align}
(25)
To prove that \(F(\theta;S,\epsilon)\) is invertible, we first prove the intermediate result that \(G_{S,\epsilon}\) is invertible.
Lemma 1
\(G_{S,\epsilon}\) is invertible for all \(S\subset[K]\) and \(\epsilon>0\).
Proof.
It suffices to prove that \(G_{S,\epsilon}\) is invertible for \(S=\{1,2,\ldots,k\}\) and for all \(k\in\{0,\ldots,K-1\}\); for any other \(S\), \(G_{S,\epsilon}\) can be obtained by permutation. Fix \(k\) and let \(S=\{1,2,\ldots,k\}\). It can be verified by inspection that \(G_{S,\epsilon}\) is a block matrix as
\begin{align*}G_{S,\epsilon}=\begin{bmatrix}a_{1}I_{k}+a_{2}1_{k}1_{k}^{\top} & b1_{k}1_{K-k}^ {\top}\\c1_{K-k}1_{k}^{\top} & d_{1}I_{K-k}+d_{2}1_{K-k}1_{K-k}^{\top}\end{bmatrix},\end{align*}
where \(I_{n}\) is the identity matrix of size \(n\) and \(1_{n}\) is the column vector of \(1\)’s of size \(n\). The constants \(a_{1},a_{2},b,c,d_{1},d_{2}\) are given as
\begin{align*}a_{1}=\frac{e^{\epsilon_{1}}}{k+e^{\epsilon_{1}}}-a_{2},\quad a_ {2}=\frac{1}{k+e^{\epsilon_{1}}},\quad b=\frac{1}{k+e^{\epsilon_{1}}},\quad c= \frac{1}{K-k}a_{2} \\d_{1}=\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-k-1}\frac{e^{ \epsilon_{1}}}{k+e^{\epsilon_{1}}}-d_{2},\quad d_{2}=\frac{1}{e^{\epsilon_{2}} +K-k-1}\frac{e^{\epsilon_{1}}}{k+e^{\epsilon_{1}}}.\end{align*}
Also, note that since \(\epsilon_{1}>0\) and \(\epsilon_{2}>0\), \(a_{1}\) and \(d_{1}\) (whenever it is defined) are strictly positive.
The case \(k=0\) is trivial since then \(G_{S,\epsilon}=d_{1}I_{K}+d_{2}1_{K}1_{K}^{\top}\) is invertible. Hence, we focus on the case \(0{\lt}k{\lt}K\). For this case, first, note that the blocks on the diagonal are invertible. So, by the Schur complement condition for the invertibility of block matrices, for \(G_{S,\epsilon}\) to be invertible, it suffices to show that the matrix
\begin{align*}M=a_{1}I_{k}+a_{2}1_{k}1_{k}^{\top}-b1_{k}1_{K-k}^{\top}(d_{1}I_{K-k}+d_{2}1_{ K-k}1_{K-k}^{\top})^{-1}c1_{K-k}1_{k}^{\top}\end{align*}
is invertible. Using the Woodbury matrix identity, the matrix \(M\) can be expanded as
\begin{align*}M & =a_{1}I_{k}+a_{2}1_{k}1_{k}^{\top}-b1_{k}1_{K-k}^{\top}\left(\frac{I_{K-k}}{d_{1}}-\frac{1}{d_{1}}1_{K-k}\left(\frac{1}{d_{2}}+1_{K-k}^{ \top}\frac{1}{d_{1}}1_{K-k}\right)^{-1}1_{K-k}^{\top}\frac{1}{d_{1}}\right)c1_ {K-k}1_{k}^{\top} \\& =a_{1}I_{k}+a_{2}1_{k}1_{k}^{\top}-\frac{bc}{d_{1}}1_{k}1_{K-k}^{ \top}1_{K-k}1_{k}^{\top}+\frac{bc}{d_{1}^{2}}\left(\frac{1}{d_{2}}+\frac{K-k}{ d_{1}}\right)^{-1}1_{k}1_{K-k}^{\top}1_{K-k}1_{K-k}^{\top}1_{K-k}1_{k}^{\top} \\& =a_{1}I_{k}+\left[a_{2}-\frac{(K-k)bc}{d_{1}}+\frac{(K-k)^{2}bc}{d_{1}^{2} }\left(\frac{1}{d_{2}}+\frac{K-k}{d_{1}}\right)^{-1}\right]1_{k}1_{k}^{\top}.\end{align*}
The expression inside the square brackets is a scalar; therefore, \(M\) is the sum of a positive multiple of the identity matrix and a rank-\(1\) matrix, which is invertible. Hence, \(G_{S,\epsilon}\) is invertible. ∎
Proof of Proposition 2.
Note that \(A_{S,\epsilon}=G_{S,\epsilon}J\), where the \(K\times(K-1)\) matrix \(J\) satisfies \(J(i,i)=1\) for \(i=1,\ldots,K-1\), \(J(K,j)=-1\) for \(j=1,\ldots,K-1\), and \(J(i,j)=0\) otherwise. Since \(G_{S,\epsilon}\) is invertible, it is full rank. Also, the columns of \(A_{S,\epsilon}\), denoted by \(c^{A}_{i}\), \(i=1,\ldots,K-1\), are given by
\begin{align*}c^{A}_{1}=c^{G}_{1}-c^{G}_{K},\quad\ldots,\quad c^{A}_{K-1}=c^{G}_{K-1}-c^{G}_ {K},\end{align*}
where \(c^{G}_{i}\) is the \(i\)’th column of \(G_{S,\epsilon}\) for \(i=1,\ldots,K\). Observe that \(c^{A}_{i}\), \(i=1,\ldots,K-1\) are linearly independent since any linear combination of those columns is in the form of
\begin{align*}\sum_{i=1}^{K-1}a_{i}c^{A}_{i}=\sum_{i=1}^{K-1}a_{i}c^{G}_{i}-\left(\sum_{i=1} ^{K-1}a_{i}\right)c^{G}_{K}.\end{align*}
Since the columns of \(G_{S,\epsilon}\) are linearly independent, the linear combination above becomes \(0\) only if \(a_{1}=\ldots=a_{K-1}=0\). This shows that the columns of \(A_{S,\epsilon}\) are also linearly independent. Thus, we conclude that \(A_{S,\epsilon}\) has rank \(K-1\). Finally, since \(D_{\theta}\) is diagonal with positive diagonal entries, \(A_{S,\epsilon}^{\top}D_{\theta}A_{S,\epsilon}=A_{S,\epsilon}^{\top}D_{\theta}^ {1/2}D_{\theta}^{1/2}A_{S,\epsilon}\) is positive definite, hence invertible. ∎
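Lemma 1 and Propositions 1–2 can be checked numerically: build \(G_{S,\epsilon}\), form \(A_{S,\epsilon}=G_{S,\epsilon}J\) and \(F=A_{S,\epsilon}^{\top}D_{\theta}A_{S,\epsilon}\), and test invertibility and positive definiteness. A sketch, assuming the mechanism (22); the code and parameter values are ours, chosen for illustration only.

```python
import numpy as np

def ldp_matrix(K, S, eps1, eps2):
    """G[y, x] = g_{S,eps}(y|x) as in Equation (22)."""
    k, e1, e2 = len(S), np.exp(eps1), np.exp(eps2)
    G = np.zeros((K, K))
    for x in range(K):
        for y in range(K):
            if x in S and y in S:
                G[y, x] = (e1 if x == y else 1.0) / (e1 + k)
            elif x in S and y not in S:
                G[y, x] = 1.0 / ((K - k) * (e1 + k))
            elif x not in S and y in S:
                G[y, x] = 1.0 / (e1 + k)
            else:
                G[y, x] = (e2 if x == y else 1.0) / (e2 + K - k - 1) * e1 / (e1 + k)
    return G

rng = np.random.default_rng(0)
K, eps1, eps2 = 6, 0.5, 0.5
# J: identity on the first K-1 rows, last row all -1, so A(y,j) = g(y|j) - g(y|K)
J = np.vstack([np.eye(K - 1), -np.ones((1, K - 1))])
G_ok = F_ok = True
for k in range(K):  # S = {1,...,k}; other S follow by permutation
    G = ldp_matrix(K, set(range(k)), eps1, eps2)
    G_ok &= np.linalg.matrix_rank(G) == K          # Lemma 1: G invertible
    A = G @ J
    theta = rng.dirichlet(np.ones(K))
    D = np.diag(1.0 / (G @ theta))                 # D_theta; h(.|theta) = G theta
    F = A.T @ D @ A                                # Proposition 1
    F_ok &= np.all(np.linalg.eigvalsh(F) > 0)      # Proposition 2: F positive definite
print(G_ok, F_ok)
```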
The following proof contains a derivation of the utility function based on the MSE of the Bayesian estimator of \(X\) given \(Y\).
Proof of Proposition 3.
It is well-known that the expectation in (10) is minimized when \(\widehat{e_{X}}=\nu(Y):=\mathbb{E}_{\theta}[e_{X}|Y]\), i.e., the posterior expectation of \(e_{X}\) given \(Y\). That is,
\begin{align*}\min_{\widehat{e_{X}}}\mathbb{E}_{\theta}\left[\|e_{X}-\widehat{e_{X}}(Y)\|^{2 }\right]=\mathbb{E}_{\theta}\left[\|e_{X}-\nu(Y)\|^{2}\right].\end{align*}
For the squared norm inside the expectation, we have
\begin{align}\|e_{X}-\nu(Y)\|^{2} & =(1-\nu(Y)_{X})^{2}+\sum_{k\neq X}\nu(Y)_{k}^{2} \\& =1+\nu(Y)_{X}^{2}-2\nu(Y)_{X}+\sum_{k\neq X}\nu(Y)_{k}^{2} \\ & =1-2\nu(Y)_{X}+\sum_{k=1}^{K}\nu(Y)_{k}^{2}.\end{align}
(26)
The expectation of the last term in (26) is
\begin{align}\mathbb{E}_{\theta}\left[\sum_{k=1}^{K}\nu(Y)_{k}^{2}\right] & =\sum_{y}h_{S,\epsilon}(y|\theta)\sum_{x=1}^{K}p_{S,\epsilon}(x|y, \theta)^{2} \\& =\sum_{y=1}^{K}\sum_{x=1}^{K}h_{S,\epsilon}(y|\theta)p_{S, \epsilon}(x|y,\theta)^{2} \\ & =\sum_{y=1}^{K}\sum_{x=1}^{K}\frac{g_{S,\epsilon}(y|x)^{2}\theta_ {x}^{2}}{h_{S,\epsilon}(y|\theta)}.\end{align}
(27)
For the expectation of the second term in (26), we have
\begin{align*}\mathbb{E}_{\theta}\left[\nu(Y)_{X}\right]=\sum_{x,y}p_{S,\epsilon}(x|y,\theta) p_{S,\epsilon}(x,y|\theta),\end{align*}
where \(p_{S,\epsilon}(x,y|\theta)\) denotes the joint probability of \(X,Y\) given \(\theta\). Substituting \(p_{S,\epsilon}(x,y|\theta)=p_{S,\epsilon}(x|y,\theta)h_{S,\epsilon}(y|\theta)\) into the equation above, we get
\begin{align}\mathbb{E}_{\theta}\left[\nu(Y)_{X}\right] & =\sum_{x=1}^{K}\sum_{y=1}^{K}h_{S,\epsilon}(y|\theta)p_{S, \epsilon}(x|y,\theta)^{2} \\ & =\sum_{x=1}^{K}\sum_{y=1}^{K}\frac{g_{S,\epsilon}(y|x)^{2}\theta_ {x}^{2}}{h_{S,\epsilon}(y|\theta)},\end{align}
(28)
which is equal to what we get in (27). Substituting (27) and (28) into (26), we obtain
\begin{align*}\mathbb{E}_{\theta}\left[\|e_{X}-\nu(Y)\|^{2}\right] & =1-2\sum_{x=1}^{K}\sum_{y=1}^{K}\frac{g_{S,\epsilon}(y|x)^{2} \theta_{x}^{2}}{h_{S,\epsilon}(y|\theta)}+\sum_{x=1}^{K}\sum_{y=1}^{K}\frac{g_ {S,\epsilon}(y|x)^{2}\theta_{x}^{2}}{h_{S,\epsilon}(y|\theta)} \\& =1-\sum_{x=1}^{K}\sum_{y=1}^{K}\frac{g_{S,\epsilon}(y|x)^{2} \theta_{x}^{2}}{h_{S,\epsilon}(y|\theta)}.\end{align*}
Finally, using the definition \(U_{5}(\theta,S,\epsilon)=-\min_{\widehat{e_{X}}}\mathbb{E}_{\theta}\left[\|e_{ X}-\widehat{e_{X}}(Y)\|^{2}\right]\), we conclude the proof. ∎
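The closed form above can be cross-checked against a brute-force evaluation of the Bayes-optimal MSE, computing \(\nu(y)\) explicitly and averaging \(\|e_{X}-\nu(Y)\|^{2}\) over the joint law of \((X,Y)\). A sketch assuming the mechanism (22); the code and parameter choices are ours.

```python
import numpy as np

def ldp_matrix(K, S, eps1, eps2):
    """G[y, x] = g_{S,eps}(y|x) as in Equation (22)."""
    k, e1, e2 = len(S), np.exp(eps1), np.exp(eps2)
    G = np.zeros((K, K))
    for x in range(K):
        for y in range(K):
            if x in S and y in S:
                G[y, x] = (e1 if x == y else 1.0) / (e1 + k)
            elif x in S and y not in S:
                G[y, x] = 1.0 / ((K - k) * (e1 + k))
            elif x not in S and y in S:
                G[y, x] = 1.0 / (e1 + k)
            else:
                G[y, x] = (e2 if x == y else 1.0) / (e2 + K - k - 1) * e1 / (e1 + k)
    return G

rng = np.random.default_rng(1)
K, S = 5, {0, 1}
G = ldp_matrix(K, S, 0.7, 0.7)
theta = rng.dirichlet(np.ones(K))
h = G @ theta                                 # h(y|theta)
post = G * theta / h[:, None]                 # post[y, x] = p(x|y, theta)

# Brute force: E[||e_X - nu(Y)||^2] with nu(y) the posterior mean of e_X given y
mse = sum(G[y, x] * theta[x] * np.sum((np.eye(K)[x] - post[y]) ** 2)
          for x in range(K) for y in range(K))
# Closed form from Proposition 3
closed = 1.0 - np.sum(G**2 * theta**2 / h[:, None])
match = bool(np.isclose(mse, closed))
print(match)
```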
Proof of Theorem 2.
The global maximization of \(U_{6}\) over the set of all the subsets \(S\subset[K]\) can be decomposed as
\begin{align}\max_{S\subset[K]}U_{6}(\theta,S,\epsilon)=\max_{k\in\{0, \ldots,K-1\}}\left\{\max_{S\subset[K]:|S|=k}U_{6}(\theta,S, \epsilon)\right\}.\end{align}
(29)
This inner maximization is equivalent to fixing the cardinality of \(S\) to \(k\) and finding the best \(S\) with cardinality \(k\). Now, the utility function can be written as
\begin{align*}U_{6}(\theta,S,\epsilon) & =\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k}\left(\sum_{i\in S} \theta_{i}+\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-k-1}\sum_{i\notin S} \theta_{i}\right) \\& =\left(\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k}\right)\sum_{i \in S}\theta_{i}+\left(\frac{e^{\epsilon_{1}}}{e^{\epsilon_{1}}+k}\right)\left (\frac{e^{\epsilon_{2}}}{e^{\epsilon_{2}}+K-k-1}\right)\sum_{i\notin S}\theta_ {i},\end{align*}
where \(k\) appears in the first line since \(|S|=k\). Note that \(\sum_{i\in S}\theta_{i}\) and \(\sum_{i\notin S}\theta_{i}\) sum to \(1\), and the constant in front of the first sum is larger than the one in front of the second. Hence, we seek to maximize an expression of the form
\begin{align*}ax+b(1-x)\end{align*}
over a variable \(x\in[0,1]\), where \(a>b>0\). This expression is maximized by taking \(x\) as large as possible. Therefore, \(U_{6}(\theta,S,\epsilon)\) is maximized when \(\sum_{i\in S}\theta_{i}\) is made as large as possible under the constraint \(|S|=k\). Under this constraint, the sum is maximized when \(S\) contains the indices of the \(k\) largest components of \(\theta\), that is, when \(S=S_{k,\theta}=\{\sigma_{\theta}(1),\ldots,\sigma_{\theta}(k)\}\).
Then, (29) reduces to \(\max_{k\in\{0,\ldots,K-1\}}U_{6}(\theta,S_{k,\theta},\epsilon)\). Hence, we conclude. ∎
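The argument can be checked by brute force for a small \(K\): enumerate every subset, maximize \(U_{6}\), and compare against restricting the search to the sorted subsets \(S_{k,\theta}\). The sketch below is ours; it fixes \(\epsilon_{1}=\epsilon_{2}=\epsilon/2\) for illustration only (in the paper these may depend on \(k\)).

```python
import numpy as np
from itertools import combinations

def U6(theta, S, eps1, eps2):
    """Utility of subset S under the closed form used in the proof."""
    K, k = len(theta), len(S)
    e1, e2 = np.exp(eps1), np.exp(eps2)
    pS = sum(theta[i] for i in S)
    return e1 / (e1 + k) * (pS + e2 / (e2 + K - k - 1) * (1.0 - pS))

rng = np.random.default_rng(2)
K, eps = 6, 1.0
eps1 = eps2 = eps / 2
theta = rng.dirichlet(np.ones(K))

# Exhaustive search over all S with |S| in {0, ..., K-1}
best_all = max(U6(theta, S, eps1, eps2)
               for k in range(K) for S in combinations(range(K), k))
# Search restricted to the sorted subsets S_{k,theta} of Theorem 2
order = np.argsort(-theta)  # sigma_theta: categories by decreasing probability
best_topk = max(U6(theta, tuple(order[:k]), eps1, eps2) for k in range(K))
same = bool(np.isclose(best_all, best_topk))
print(same)
```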

A.3 Proof for SGLD Update

Proof of Proposition 4.
For the prior component of the gradient, recall that we have \(\phi=(\phi_{1},\ldots,\phi_{K})\), where
\begin{align*}\phi_{i}\overset{\text{iid}}{\sim}\text{Gamma}(\rho_{i},1),\quad i=1,\ldots,K.\end{align*}
Then, the marginal pdf of \(\phi_{i}\) satisfies
\begin{align*}\ln p(\phi_{i})=(\rho_{i}-1)\ln\phi_{i}-\ln\Gamma(\rho_{i})-\phi_{i},\quad i=1, \ldots,K.\end{align*}
Taking the partial derivatives of \(\ln p(\phi_{i})\) with respect to \(\phi_{i}\), we have
\begin{align*}[\nabla_{\phi}\ln p(\phi)]_{i}=\frac{\rho_{i}-1}{\phi_{i}}-1,\quad i=1,\ldots,K.\end{align*}
For the likelihood component, given \(\theta\in\Delta\), let the \((K-1)\times 1\) vector \(\vartheta\) be the reparametrization of \(\theta\) such that \(\vartheta_{i}=\theta_{i}\) for \(i=1,\ldots,K-1\). Then, according to (13),
\begin{align}\vartheta_{k}=\frac{\phi_{k}}{\sum_{j=1}^{K}\phi_{j}},\quad k=1,\ldots,K-1.\end{align}
(30)
Using the chain rule, we can write the gradient of the log-likelihood with respect to \(\phi\) as
\begin{align*}\nabla_{\phi}\ln p(y|\phi) & =J\cdot\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta),\end{align*}
where \(J\) is the \(K\times(K-1)\) Jacobian matrix for the mapping from \(\phi\) to \(\vartheta\) in (30), whose \((i,j)\)th element can be derived as
\begin{align*}J(i,j) & =\frac{\partial\vartheta_{j}}{\partial\phi_{i}}=\frac{\partial}{ \partial\phi_{i}}\frac{\phi_{j}}{\sum_{k=1}^{K}\phi_{k}} \\& =\mathbb{I}(i=j)\frac{1}{\sum_{k=1}^{K}\phi_{k}}-\frac{\phi_{j}}{ \left(\sum_{k=1}^{K}\phi_{k}\right)^{2}}.\end{align*}
Using (24) for \(\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)\), we complete the proof. ∎
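The Jacobian formula can be validated against finite differences of the normalization map \(\phi\mapsto\vartheta\). A minimal sketch; the code and the Gamma shape value are ours, for illustration.

```python
import numpy as np

def jacobian(phi):
    """J(i, j) = d vartheta_j / d phi_i for vartheta_j = phi_j / sum(phi)."""
    K, s = len(phi), phi.sum()
    J = np.zeros((K, K - 1))
    for i in range(K):
        for j in range(K - 1):
            J[i, j] = float(i == j) / s - phi[j] / s**2
    return J

rng = np.random.default_rng(3)
phi = rng.gamma(shape=2.0, scale=1.0, size=5)
J = jacobian(phi)

# Central finite differences of vartheta = phi[:-1] / phi.sum()
h = 1e-6
J_fd = np.zeros_like(J)
for i in range(len(phi)):
    up, dn = phi.copy(), phi.copy()
    up[i] += h
    dn[i] -= h
    J_fd[i] = (up[:-1] / up.sum() - dn[:-1] / dn.sum()) / (2 * h)
jac_ok = bool(np.allclose(J, J_fd, atol=1e-8))
print(jac_ok)
```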

A.4 Proofs for Convergence and Consistency Results

A.4.1 Preliminary Results.
Lemma 2.
Given \(\epsilon\geq 0\), there exist constants \(0{\lt}c\leq C{\lt}\infty\) such that for all \(\theta\in\Delta\) and all \(S\subset[K]\), we have
\begin{align*}c\leq g_{S,\epsilon}(y|x)\leq C,\quad c\leq h_{S,\epsilon}(y|\theta)\leq C.\end{align*}
Proof.
The bounds for \(g_{S,\epsilon}(y|x)\) can directly be verified from (22). Moreover,
\begin{align*}c\leq\min_{i=1,\ldots,K}g_{S,\epsilon}(y|i)\leq h_{S,\epsilon}(y|\theta)=\sum_ {i=1}^{K}g_{S,\epsilon}(y|i)\theta_{i}\leq\max_{i=1,\ldots,K}g_{S,\epsilon}(y| i)\leq C.\end{align*}
Hence, we conclude. ∎
Remark 1.
For the symbol \(\theta\), which is used for \(K\times 1\) probability vectors in \(\Delta\), we will associate the symbol \(\vartheta\) such that \(\vartheta\) denotes the shortened vector of the first \(K-1\) elements of \(\theta\). Accordingly, we will use \(\vartheta\), \(\vartheta^{\prime}\), \(\vartheta^{\ast}\), and so on, to denote the shortened versions of \(\theta\), \(\theta^{\prime}\), \(\theta^{\ast}\), and so on.
Lemma 3.
For any \(\theta,\theta^{\prime}\in\Delta\), \(y\in[K]\), and \(S\subset[K]\), we have
\begin{align}\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta- \vartheta^{\prime}) & =\frac{h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{\prime}) }{h_{S,\epsilon}(y|\theta)}.\end{align}
(31)
Proof.
Recall from (24) that
\begin{align*}[\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)]_{i}=\frac{g_{S,\epsilon}(y|i) -g_{S,\epsilon}(y|K)}{h_{S,\epsilon}(y|\theta)},\quad i=1,\ldots,K-1.\end{align*}
Hence,
\begin{align*}\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}\vartheta & =\frac{1}{h_{S,\epsilon}(y|\theta)}\sum_{i=1}^{K-1}(g_{S,\epsilon }(y|i)-g_{S,\epsilon}(y|K))\vartheta_{i} \\& =\frac{1}{h_{S,\epsilon}(y|\theta)}\left[\sum_{i=1}^{K-1}g_{S, \epsilon}(y|i)\vartheta_{i}-g_{S,\epsilon}(y|K)\sum_{i=1}^{K-1}\vartheta_{i}\right] \\& =\frac{1}{h_{S,\epsilon}(y|\theta)}\left[\sum_{i=1}^{K-1}g_{S, \epsilon}(y,i)\theta_{i}-g_{S,\epsilon}(y|K)(1-\theta_{K})\right] \\& =\frac{1}{h_{S,\epsilon}(y|\theta)}\left[h_{S,\epsilon}(y|\theta) -g_{S,\epsilon}(y|K)\right],\end{align*}
and likewise \(\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}\vartheta^{\prime}=\frac{ 1}{h_{S,\epsilon}(y|\theta)}\left[h_{S,\epsilon}(y|\theta^{\prime})-g_{S, \epsilon}(y|K)\right]\). Taking the difference between \(\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}\vartheta\) and \(\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}\vartheta^{\prime}\), we arrive at the result. ∎
Concavity of \(\ln h_{S,\epsilon}(y|\theta)\):
The following lemmas help with proving the concavity of \(h_{S,\epsilon}(y|\theta)\) as a function of \(\theta\).
Lemma 4.
For \(0{\lt}b\leq a\leq 1\), we have \(\ln\frac{a}{b}\geq\frac{a-b}{a}+\frac{(a-b)^{2}}{2}\).
Proof.
For \(z>0\), using the series based on the inverse hyperbolic tangent function, we can write
\begin{align*}\ln z=2\sum_{k=0}^{\infty}\frac{1}{2k+1}\left(\frac{z-1}{z+1}\right)^{2k+1}.\end{align*}
Applying the expansion to \(z=a/b\) with \(a\geq b\), so that \((z-1)/(z+1)=(a-b)/(a+b)\), and keeping only the first (nonnegative) term of the series, we get
\begin{align*}\ln\frac{a}{b} & \geq\frac{2(a-b)}{a+b}=\frac{a-b}{a}+\left[\frac{2(a-b)}{a+b}-\frac{a-b}{a}\right],\end{align*}
where the term in the square brackets satisfies
\begin{align*}\frac{2(a-b)}{a+b}-\frac{a-b}{a}=\frac{2a^{2}-2ab-a^{2}+b^{2}}{a(a+b)}=\frac{(a-b)^{2}}{(a+b)a}\geq\frac{(a-b)^{2}}{2},\end{align*}
since \(0{\lt}a,b\leq 1\) implies \((a+b)a\leq 2\). ∎
Lemma 5.
Let \(0{\lt}\alpha\leq 1\). For \(x\geq 1\), we have \(\frac{1}{\alpha}(x^{\alpha}-1)\leq(x-1)\).
Proof.
Consider \(f(x)=\frac{1}{\alpha}(x^{\alpha}-1)-(x-1)\). We have \(f(1)=0\) and \(f^{\prime}(x)=x^{\alpha-1}-1\leq 0\) for \(x\geq 1\). Hence, we conclude. ∎
Lemma 6.
Let \(0{\lt}a\leq b\leq 1\) and \(0{\lt}\alpha\leq 1\). Then, \(b^{\alpha}-a^{\alpha}\geq\alpha(b-a)\).
Proof.
Fix \(a\) and let \(b=a+x\) for \(0\leq x\leq 1-a\). Consider the function \(f(x)=(a+x)^{\alpha}-a^{\alpha}-\alpha x\). We have \(f(0)=0\) and \(f^{\prime}(x)=\alpha(a+x)^{\alpha-1}-\alpha\geq 0\) over \(0\leq x\leq(1-a)\) since \(a+x\leq 1\) and \(\alpha-1\leq 0\). Hence, we conclude. ∎
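Lemmas 4–6 are elementary inequalities and can be spot-checked numerically over random draws from their stated domains. A small sketch (our code):

```python
import numpy as np

rng = np.random.default_rng(4)
ok4 = ok5 = ok6 = True
for _ in range(20000):
    u, v = rng.uniform(1e-6, 1.0, size=2)
    b, a = min(u, v), max(u, v)          # 0 < b <= a <= 1
    alpha = rng.uniform(1e-6, 1.0)       # 0 < alpha <= 1
    x = 1.0 + rng.exponential()          # x >= 1
    ok4 &= np.log(a / b) >= (a - b) / a + 0.5 * (a - b) ** 2 - 1e-12   # Lemma 4
    ok5 &= (x**alpha - 1.0) / alpha <= (x - 1.0) + 1e-12               # Lemma 5
    ok6 &= a**alpha - b**alpha >= alpha * (a - b) - 1e-12              # Lemma 6
print(bool(ok4), bool(ok5), bool(ok6))
```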
Lemma 7.
Given \(\epsilon>0\), there exists \(m_{0}>0\) such that, for all \(S\subset[K]\) and \(y\in[K]\), \(\ln h_{S,\epsilon}(y|\theta)\) is a concave function of \(\vartheta\) that satisfies
\begin{align*}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime})\geq\nabla_{ \vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta^{\prime})+m_ {0}(h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{\prime}))^{2}\end{align*}
for all \(\theta,\theta^{\prime}\in\Delta\).
Proof.
We will look at the cases \(h_{S,\epsilon}(y|\theta)\geq h_{S,\epsilon}(y|\theta^{\prime})\) and \(h_{S,\epsilon}(y|\theta)\leq h_{S,\epsilon}(y|\theta^{\prime})\) separately.
(1)
Assume \(h_{S,\epsilon}(y|\theta)\geq h_{S,\epsilon}(y|\theta^{\prime})\). Using Lemma 4, we have
\begin{align*}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime}) & \geq\frac{h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{ \prime})}{h_{S,\epsilon}(y|\theta)}+\frac{1}{2}(h_{S,\epsilon}(y|\theta)-h_{S, \epsilon}(y|\theta^{\prime}))^{2} \\& =\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta- \vartheta^{\prime})+\frac{1}{2}(h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y| \theta^{\prime}))^{2},\end{align*}
where the last line follows from Lemma 3.
(2)
Assume \(h_{S,\epsilon}(y|\theta)\leq h_{S,\epsilon}(y|\theta^{\prime})\). Let \(a=h_{S,\epsilon}(y|\theta)\) and \(b=h_{S,\epsilon}(y|\theta^{\prime})\). Let
\begin{align*}\alpha=\min\left\{1,\frac{\ln 2}{\ln(C/c)}\right\},\end{align*}
where \(c\) and \(C\) are given in Lemma 2. This \(\alpha\) ensures that \(0{\lt}\alpha\leq 1\) and \(b^{\alpha}/a^{\alpha}\leq 2\), so that we can use Taylor’s expansion of \(\ln\frac{b^{\alpha}}{a^{\alpha}}\) around \(1\) and have
\begin{align*}\ln\frac{b}{a}=\frac{1}{\alpha}\ln\frac{b^{\alpha}}{a^{\alpha}}=\frac{1}{ \alpha}\left[\sum_{k=1}^{\infty}(-1)^{k+1}\frac{1}{k}\left(\frac{b^{\alpha}}{a ^{\alpha}}-1\right)^{k}\right].\end{align*}
Truncating the alternating series after its third term, which yields an upper bound since the terms decrease in magnitude, we obtain
\begin{align}\ln\frac{b}{a}\leq\frac{1}{\alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right) -\frac{1}{2\alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right)^{2}+\frac{1}{3 \alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right)^{3}.\end{align}
(32)
For the third term, we have
\begin{align*}\frac{1}{3\alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right)^{3}=\frac{1}{3 \alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right)^{2}\left(\frac{b^{\alpha}} {a^{\alpha}}-1\right)\leq\frac{1}{3\alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}- 1\right)^{2},\end{align*}
since \(1\leq\frac{b^{\alpha}}{a^{\alpha}}\leq 2\). Substituting this into (32), the inequality can be continued as
\begin{align}\ln\frac{b}{a}\leq\frac{1}{\alpha}\left(\frac{b^{\alpha}}{a^{ \alpha}}-1\right)-\frac{1}{6\alpha}\left(\frac{b^{\alpha}}{a^{\alpha}}-1\right) ^{2}.\end{align}
(33)
Using Lemma 5 to bound the first term in (33), we have
\begin{align}\ln\frac{b}{a}\leq\left(\frac{b}{a}-1\right)-\frac{1}{6\alpha} \left(\frac{b^{\alpha}}{a^{\alpha}}-1\right)^{2}.\end{align}
(34)
Finally, using Lemma 6, we can lower-bound the second term in (34) via
\begin{align*}\frac{b^{\alpha}}{a^{\alpha}}-1=\frac{b^{\alpha}-a^{\alpha}}{a^{\alpha}}\geq \frac{\alpha(b-a)}{a^{\alpha}}\geq\alpha(b-a),\end{align*}
where the last inequality follows from \(a\leq 1\) and \(\alpha>0\). We end up with
\begin{align*}\ln\frac{b}{a}\leq\left(\frac{b}{a}-1\right)-\frac{\alpha}{6} \left(b-a\right)^{2}.\end{align*}
Referring to the definitions of \(a\) and \(b\), we have
\begin{align*}\ln h_{S,\epsilon}(y|\theta^{\prime})-\ln h_{S,\epsilon}(y|\theta)\leq\left(\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y|\theta)}-1\right)- \frac{\alpha}{6}(h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{\prime}))^{ 2},\end{align*}
or, reversing the inequality,
\begin{align}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime}) \geq\left(1-\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y|\theta) }\right)+\frac{\alpha}{6}(h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{ \prime}))^{2}.\end{align}
(35)
Using (31) in Lemma 3, we rewrite (35) as
\begin{align}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime}) \geq\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta ^{\prime})+\frac{\alpha}{6}(h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{ \prime}))^{2},\end{align}
(36)
which is the inequality we look for.
To cover both cases, take \(m_{0}=\min\{\frac{\alpha}{6},\frac{1}{2}\}=\frac{\alpha}{6}\) (since \(\alpha\leq 1\)). So, the proof is complete. ∎
Recalling that \(S_{t}\) is the selected subset at time \(t\), define
\begin{align}V_{t}(\theta,\theta^{\prime}):=(h_{S_{t},\epsilon}(Y_{t}|\theta)-h_{S_{t}, \epsilon}(Y_{t}|\theta^{\prime}))^{2}.\end{align}
(37)
The proof of Theorem 3 requires a probabilistic bound for \(\sum_{t=1}^{n}V_{t}(\theta,\theta^{\prime})\), which we provide next.
Lemma 8.
For all \(\theta,\theta^{\prime}\in\Delta\) and \(t\geq 0\), \(V_{t}(\theta,\theta^{\prime})\leq\|\theta-\theta^{\prime}\|^{2}\).
Proof.
Let the \(1\times K\) vector \(r_{y}\) be the \(y\)’th row of \(G_{S_{t},\epsilon}\), where \(y=Y_{t}\). Then, we obtain
\begin{align*}V_{t}(\theta,\theta^{\prime})=[h_{S_{t},\epsilon}(y|\theta)-h_{S _{t},\epsilon}(y|\theta^{\prime})]^{2} & =(r_{y}(\theta-\theta^{\prime}))^{2} \\& =(\theta-\theta^{\prime})^{\top}r_{y}^{\top}r_{y}(\theta-\theta^{ \prime}) \\& \leq\|\theta-\theta^{\prime}\|^{2},\end{align*}
since every element of \(r_{y}\) is at most \(1\). ∎
Lemma 9.
For all \(\theta,\theta^{\prime}\in\Delta\) and \(t\geq 1\), there exists a constant \(c_{u}>0\) such that
\begin{align*}E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})]\geq c_{u}\|\theta-\theta^{ \prime}\|^{2},\end{align*}
where \(E_{\theta^{\ast}}\) is the expectation operator with respect to \(P_{\theta^{\ast}}\) defined in (18). Specifically, one can take \(c_{u}=c\min_{S\subset[K]}\lambda_{\min}(G_{S,\epsilon}^{\top}G_{S,\epsilon})\), where \(c\) is the lower bound from Lemma 2, \(\lambda_{\min}(A)\) is the minimum eigenvalue of the square matrix \(A\), and the matrix \(G_{S,\epsilon}\) is defined in (25).
Proof.
The overall expectation can be written as
\begin{align*}E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})] & =\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t}=S)E_{\theta^{\ast}}[V_ {t}(\theta,\theta^{\prime})|S_{t}=S]\end{align*}
where the conditional expectation can be bounded as
\begin{align}E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})|S_{t}=S] & =\sum_{i=1}^{K}(h_{S,\epsilon}(i|\theta)-h_{S,\epsilon}(i|\theta^ {\prime}))^{2}h_{S,\epsilon}(i|\theta^{\ast}) \\ & \geq\sum_{i=1}^{K}(h_{S,\epsilon}(i|\theta)-h_{S,\epsilon}(i| \theta^{\prime}))^{2}c,\end{align}
(38)
where the second line is due to Lemma 2. Further, let the \(1\times K\) vector \(r_{i}\) be the \(i\)th row of \(G_{S,\epsilon}\). Then,
\begin{align}\sum_{i=1}^{K}(h_{S,\epsilon}(i|\theta)-h_{S,\epsilon}(i|\theta^{ \prime}))^{2} & =\sum_{i=1}^{K}(r_{i}(\theta-\theta^{\prime}))^{2} \\& =\sum_{i=1}^{K}(\theta-\theta^{\prime})^{\top}r_{i}^{\top}r_{i}(\theta-\theta^{\prime}) \\& =(\theta-\theta^{\prime})^{\top}G_{S,\epsilon}^{\top}G_{S, \epsilon}(\theta-\theta^{\prime}) \\& =\|G_{S,\epsilon}(\theta-\theta^{\prime})\|^{2} \\ & \geq\lambda_{\min}(G_{S,\epsilon}^{\top}G_{S,\epsilon})\|\theta- \theta^{\prime}\|^{2}.\end{align}
(39)
Combining (38) and (39) and letting \(c_{u}:=c\min_{S\subset[K]}\lambda_{\min}(G_{S,\epsilon}^{\top}G_{S,\epsilon})\), we have
\begin{align}E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})|S_{t}=S]\geq c_{u}\|\theta- \theta^{\prime}\|^{2}\end{align}
(40)
for all \(S\), which directly implies \(E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})]\geq c_{u}\|\theta-\theta^{ \prime}\|^{2}\) for the overall expectation. Finally, \(c_{u}>0\) since, by Lemma 1, every \(G_{S,\epsilon}\) is invertible. ∎
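Lemmas 8 and 9 can also be checked empirically: for random \(\theta,\theta^{\prime},\theta^{\ast}\), the squared differences \((h_{S,\epsilon}(y|\theta)-h_{S,\epsilon}(y|\theta^{\prime}))^{2}\) should never exceed \(\|\theta-\theta^{\prime}\|^{2}\), while their average under \(h_{S,\epsilon}(\cdot|\theta^{\ast})\) should stay above \(c_{u}\|\theta-\theta^{\prime}\|^{2}\). A sketch assuming the mechanism (22); the code and parameter values are ours.

```python
import numpy as np
from itertools import combinations

def ldp_matrix(K, S, eps1, eps2):
    """G[y, x] = g_{S,eps}(y|x) as in Equation (22)."""
    k, e1, e2 = len(S), np.exp(eps1), np.exp(eps2)
    G = np.zeros((K, K))
    for x in range(K):
        for y in range(K):
            if x in S and y in S:
                G[y, x] = (e1 if x == y else 1.0) / (e1 + k)
            elif x in S and y not in S:
                G[y, x] = 1.0 / ((K - k) * (e1 + k))
            elif x not in S and y in S:
                G[y, x] = 1.0 / (e1 + k)
            else:
                G[y, x] = (e2 if x == y else 1.0) / (e2 + K - k - 1) * e1 / (e1 + k)
    return G

rng = np.random.default_rng(5)
K, eps1, eps2 = 4, 0.5, 0.5
Gs = [ldp_matrix(K, set(S), eps1, eps2)
      for k in range(K) for S in combinations(range(K), k)]
c = min(G.min() for G in Gs)                               # Lemma 2 lower bound
c_u = c * min(np.linalg.eigvalsh(G.T @ G)[0] for G in Gs)  # constant of Lemma 9

upper_ok = lower_ok = True
for _ in range(500):
    th, th2, th_star = (rng.dirichlet(np.ones(K)) for _ in range(3))
    d2 = np.sum((th - th2) ** 2)
    for G in Gs:
        V = (G @ th - G @ th2) ** 2                        # V_t for each outcome y
        upper_ok &= np.all(V <= d2 + 1e-12)                # Lemma 8
        lower_ok &= V @ (G @ th_star) >= c_u * d2 - 1e-12  # Lemma 9
print(bool(upper_ok), bool(lower_ok))
```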
Further, define
\begin{align*}W_{t}(\theta,\theta^{\prime}):=1-\frac{V_{t}(\theta,\theta^{\prime})}{\|\theta -\theta^{\prime}\|^{2}}.\end{align*}
Lemma 10.
For any \(0{\lt}t_{1}{\lt}\ldots{\lt}t_{k}\), we have \(E_{\theta^{\ast}}\left[\prod_{i=1}^{k}W_{t_{i}}(\theta,\theta^{\prime})\right] \leq(1-c_{u})^{k}\).
Proof.
For simplicity, we drop \((\theta,\theta^{\prime})\) from the notation and denote the random variables in question as \(W_{t_{1}},\ldots,W_{t_{k}}\). We can write
\begin{align}E_{\theta^{\ast}}\left(\prod_{i=1}^{k}W_{t_{i}}\right) & =E_{\theta^{\ast}}\left[E_{\theta^{\ast}}\left(\left.\prod_{i=1}^ {k}W_{t_{i}}\right|W_{t_{1}},\ldots,W_{t_{k-1}}\right)\right] \\ & =E_{\theta^{\ast}}\left[\left(\prod_{i=1}^{k-1}W_{t_{i}}\right)E_ {\theta^{\ast}}\left(W_{t_{k}}|W_{t_{1}},\ldots,W_{t_{k-1}}\right)\right].\end{align}
(41)
By construction of \(W_{t}(\theta,\theta^{\prime})\), we have \(E_{\theta^{\ast}}(W_{t}|S_{t}=S)\leq 1-c_{u}\), which follows from (40). Using this, the inner conditional expectation can be bounded as
\begin{align}E_{\theta^{\ast}}\left(W_{t_{k}}|W_{t_{1}},\ldots,W_{t_{k-1}}\right) & =\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t_{k}}=S|W_{t_{1}},\ldots, W_{t_{k-1}})E_{\theta^{\ast}}(W_{t_{k}}|S_{t_{k}}=S) \\& \leq\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t_{k}}=S|W_{t_{1}}, \ldots,W_{t_{k-1}})\left(1-c_{u}\right) \\ & =\left(1-c_{u}\right).\end{align}
(42)
Combining (41) and (42), we have
\begin{align}E_{\theta^{\ast}}\left(\prod_{i=1}^{k}W_{t_{i}}\right)\leq\left(1-c_{u}\right) E_{\theta^{\ast}}\left(\prod_{i=1}^{k-1}W_{t_{i}}\right).\end{align}
(43)
By Lemmas 8 and 9, we have \(V_{t}(\theta,\theta^{\prime})\leq\|\theta-\theta^{\prime}\|^{2}\) and \(E_{\theta^{\ast}}[V_{t}(\theta,\theta^{\prime})]\geq c_{u}\|\theta-\theta^{ \prime}\|^{2}\). Thus, we necessarily have \(c_{u}{\lt}1\). Therefore, the recursion in (43) can be used until \(k=1\) to obtain the desired result. ∎
Note that \(W_{t}(\theta,\theta^{\prime})\) is bounded as \(0\leq W_{t}(\theta,\theta^{\prime})\leq 1\). We now quote a critical theorem from Pelekis and Ramon [20, Theorem 3.2] regarding the sum of dependent and bounded random variables, which will be useful for bounding \(\sum_{t=1}^{n}V_{t}(\theta,\theta^{\prime})\).
Theorem 5.
[20, Theorem 3.2] Let \(W_{1},\ldots,W_{n}\) be random variables, such that \(0\leq W_{t}\leq 1\), for \(t=1,\ldots,n\). Fix a real number \(\tau\in(0,n)\) and let \(k\) be any positive integer, such that \(0{\lt}k{\lt}\tau\). Then
\begin{align}\mathbb{P}\left(\sum_{t=1}^{n}W_{t}\geq\tau\right)\leq\frac{1}{ \binom{\tau}{k}}\sum_{A\subseteq\{1,\ldots,n\}:|A|=k}\mathbb{E }\left[\prod_{i\in A}W_{i}\right],\end{align}
(44)
where \(\binom{\tau}{k}=\frac{\tau(\tau-1)\ldots(\tau-k+1)}{k!}\).
In the following, we apply Theorem 5 for \(\sum_{t=1}^{n}V_{t}(\theta,\theta^{\prime})\).
Lemma 11.
For every \(\theta,\theta^{\prime}\in\Delta\) and \(a\in(0,1)\),
\begin{align*}\lim_{n\rightarrow\infty}P_{\theta^{\ast}}\left(\frac{1}{n}\sum_{t=1}^{n}V_{t} (\theta,\theta^{\prime})\leq ac_{u}\|\theta-\theta^{\prime}\|^{2}\right)=0.\end{align*}
Proof.
For any integer \(k{\lt}n(1-ac_{u}){\lt}n\), we have
\begin{align*}P_{\theta^{\ast}}\left(\frac{1}{n}\sum_{t=1}^{n}V_{t}(\theta, \theta^{\prime})\leq ac_{u}\|\theta-\theta^{\prime}\|^{2}\right) & =P_{\theta^{\ast}}\left(\|\theta-\theta^{\prime}\|^{2}\frac{1}{n} \sum_{t=1}^{n}(1-W_{t}(\theta,\theta^{\prime}))\leq ac_{u}\|\theta-\theta^{ \prime}\|^{2}\right) \\& =P_{\theta^{\ast}}\left(\frac{1}{n}\sum_{t=1}^{n}(1-W_{t}(\theta, \theta^{\prime}))\leq ac_{u}\right) \\& =P_{\theta^{\ast}}\left(\sum_{t=1}^{n}W_{t}(\theta,\theta^{\prime })\geq n\left(1-ac_{u}\right)\right) \\& \leq\frac{1}{\binom{n\left(1-ac_{u}\right)}{k}}\sum_{A\subseteq\{1,\ldots,n\}:|A|=k}E_{\theta^{\ast}}\left[\prod_{i\in A}W_{i} (\theta,\theta^{\prime})\right] \\& \leq\frac{1}{\binom{n\left(1-ac_{u}\right)}{k}}\binom{n}{k}\left(1-c_{u}\right)^{k},\end{align*}
where the last two lines follow from Theorem 5 and Lemma 10, respectively. Select \(k=k^{\ast}=\lceil n(1-a)\rceil\) and note that when \(n>1/(a-ac_{u})\) one always has \(k^{\ast}{\lt}n(1-ac_{u})\). Then, for \(n>1/(a-ac_{u})\),
\begin{align}P_{\theta^{\ast}}\left(\frac{1}{n}\sum_{t=1}^{n}V_{t}(\theta, \theta^{\prime})\leq ac_{u}\|\theta-\theta^{\prime}\|^{2}\right) & \leq\frac{1}{\binom{n\left(1-ac_{u}\right)}{k^{\ast}}}{\binom{n}{ k^{\ast}}}\left(1-c_{u}\right)^{k^{\ast}} \\ & =\left(1-c_{u}\right)^{k^{\ast}}\prod_{i=1}^{k^{\ast}}\frac{n-i+1 }{n(1-ac_{u})-i+1}.\end{align}
(45)
The right-hand side does not depend on \(\theta,\theta^{\prime}\) and converges to \(0\). ∎
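To see the decay numerically, one can evaluate the right-hand side of (45) for increasing \(n\). The sketch below uses illustrative values \(a=0.5\) and \(c_{u}=0.5\) (arbitrary, not from the paper) and computes the product in log-space for numerical stability.

```python
import math

def rhs_45(n, a, cu):
    """Right-hand side of (45) with k* = ceil(n(1-a)), computed in log-space."""
    k = math.ceil(n * (1 - a))
    assert k < n * (1 - a * cu), "requires n > 1/(a - a*cu)"
    log_r = k * math.log(1 - cu)
    for i in range(1, k + 1):
        log_r += math.log(n - i + 1) - math.log(n * (1 - a * cu) - i + 1)
    return math.exp(log_r)

a, cu = 0.5, 0.5
vals = [rhs_45(n, a, cu) for n in (50, 100, 200)]
print(vals)  # decreasing toward 0
```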
Smoothness of \(\ln h_{S,\epsilon}(y|\theta)\):
Next, we establish the \(L\)-smoothness of \(\ln h_{S,\epsilon}(y|\theta)\) as a function of \(\theta\) for any \(y\in[K]\) and \(S\subset[K]\). Some technical lemmas are needed first.
Lemma 12.
For \(x>0\), \(\ln(1+x)\geq x-\frac{1}{2}x^{2}\).
Proof.
The function \(\ln(1+x)-x+0.5x^{2}\) is \(0\) at \(x=0\) and its derivative \(1/(1+x)-1+x=\frac{1}{1+x}-(1-x)=\frac{1-(1-x^{2})}{1+x}=x^{2}/(1+x)>0\) when \(x>0\). ∎
Lemma 13.
For \(x>0\), \(\ln(x+1)\leq 1-\frac{1}{x+1}+\frac{1}{2}x^{2}\).
Proof.
The function \(\ln(x+1)-1+\frac{1}{x+1}-\frac{1}{2}x^{2}\) is \(0\) at \(x=0\) and its derivative,
\begin{align*}\frac{1}{x+1}-\frac{1}{(x+1)^{2}}-x=\frac{x-x(x+1)^{2}}{(x+1)^{2}}=\frac{x(1-(x+1)^{2})}{(x+1)^{2}},\end{align*}
is negative for \(x>0\). ∎
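Both elementary inequalities can be spot-checked numerically on a grid of positive \(x\) (using `log1p` for accuracy at small arguments); the grid below is an arbitrary illustration.

```python
import math

# Lemma 12: ln(1+x) >= x - x^2/2, and Lemma 13: ln(1+x) <= 1 - 1/(1+x) + x^2/2, for x > 0.
xs = [10.0**e for e in range(-3, 3)]  # 0.001 ... 100
for x in xs:
    assert math.log1p(x) >= x - 0.5 * x * x
    assert math.log1p(x) <= 1.0 - 1.0 / (1.0 + x) + 0.5 * x * x
print("both inequalities hold on the grid")
```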
Lemma 14.
There exists an \(L_{0}>0\) such that, for all \(S\subset[K]\) and \(y\in[K]\), the function \(\ln h_{S,\epsilon}(y|\theta)\) is an \(L_{0}\)-smooth function of \(\theta\). That is, for all \(\theta,\theta^{\prime}\in\Delta\), we have
\begin{align*}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime})\leq\nabla_{ \vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta^{\prime})+L_ {0}\|\theta-\theta^{\prime}\|^{2}.\end{align*}
Proof.
Assume \(h_{S,\epsilon}(y|\theta)\geq h_{S,\epsilon}(y|\theta^{\prime})\). Using Lemma 13 with \(x=\frac{h_{S,\epsilon}(y|\theta)}{h_{S,\epsilon}(y|\theta^{\prime})}-1\geq 0\), we have
\begin{align}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime}) & \leq 1-\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y| \theta)}+\frac{1}{2}\left(\frac{h_{S,\epsilon}(y|\theta)}{h_{S,\epsilon}(y| \theta^{\prime})}-1\right)^{2} \\& =\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta- \vartheta^{\prime})+\frac{1}{2}\left(\frac{h_{S,\epsilon}(y|\theta)-h_{S, \epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y|\theta^{\prime})}\right)^{2} \\& \leq\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta^{\prime})+\frac{1}{2c^{2}}\left(h_{S,\epsilon}(y|\theta)-h _{S,\epsilon}(y|\theta^{\prime})\right)^{2} \\ & \leq\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta^{\prime})+\frac{1}{2c^{2}}\|\theta-\theta^{\prime}\|^{2},\end{align}
(46)
where the first term in the second line follows from Lemma 3, the third line follows from Lemma 2, and the last line follows from Lemma 8.
Now, assume \(h_{S,\epsilon}(y|\theta)\leq h_{S,\epsilon}(y|\theta^{\prime})\). By Lemma 12 with \(x=\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y|\theta)}-1\geq 0\),
\begin{align*}\ln h_{S,\epsilon}(y|\theta^{\prime})-\ln h_{S,\epsilon}(y|\theta)\geq\left(\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y|\theta)}-1\right)- \frac{1}{2}\left(\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S,\epsilon}(y| \theta)}-1\right)^{2},\end{align*}
or, reversing the sign of inequality,
\begin{align*}\ln h_{S,\epsilon}(y|\theta)-\ln h_{S,\epsilon}(y|\theta^{\prime}) & \leq\left(1-\frac{h_{S,\epsilon}(y|\theta^{\prime})}{h_{S, \epsilon}(y|\theta)}\right)+\frac{1}{2}\left(\frac{h_{S,\epsilon}(y|\theta^{ \prime})}{h_{S,\epsilon}(y|\theta)}-1\right)^{2} \\& \leq\nabla_{\vartheta}\ln h_{S,\epsilon}(y|\theta)^{\top}(\vartheta-\vartheta^{\prime})+\frac{1}{2c^{2}}\left(h_{S,\epsilon}(y|\theta)-h _{S,\epsilon}(y|\theta^{\prime})\right)^{2},\end{align*}
where the second line follows from Lemmas 2 and 3. Hence, we arrive at the same inequality as in (46), and applying Lemma 8 as in the first case yields the same final bound. Therefore, Lemma 14 holds with \(L_{0}=\frac{1}{2c^{2}}\). ∎
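The bound of Lemma 14 can be sanity-checked numerically for a generic likelihood that is linear in \(\vartheta\), bounded below by a constant \(c\), and 1-Lipschitz in \(\theta\), which are the only properties of \(h_{S,\epsilon}\) (from Lemmas 2, 3, and 8) used in the proof. The toy function below is a hypothetical stand-in for \(h_{S,\epsilon}\), not the paper's mechanism, and the numeric ranges are arbitrary.

```python
import math
import random

random.seed(0)
c = 0.2                    # lower bound on the likelihood (Lemma 2 analogue)
L0 = 1.0 / (2 * c * c)     # smoothness constant from Lemma 14
for _ in range(1000):
    # h(theta) = b + w.theta, with |w| <= 1 (Lemma 8 analogue) and h >= c by construction
    w = [random.uniform(-0.3, 0.3) for _ in range(3)]
    b = 0.6
    th = [random.uniform(0.0, 0.3) for _ in range(3)]
    thp = [random.uniform(0.0, 0.3) for _ in range(3)]
    h = b + sum(wi * ti for wi, ti in zip(w, th))
    hp = b + sum(wi * ti for wi, ti in zip(w, thp))
    assert h >= c and hp >= c
    lhs = math.log(h) - math.log(hp)
    # gradient of ln h at theta is w/h, since h is linear in theta (Lemma 3 analogue)
    grad_term = sum((wi / h) * (ti - tpi) for wi, ti, tpi in zip(w, th, thp))
    sq_dist = sum((ti - tpi) ** 2 for ti, tpi in zip(th, thp))
    assert lhs <= grad_term + L0 * sq_dist + 1e-12
print("Lemma 14 bound holds on all sampled pairs")
```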
Second Moment of the Gradient at \(\theta^{\ast}\):
Define the average log-marginal likelihood
\begin{align}\Phi_{n}(\theta):=\frac{1}{n}\sum_{t=1}^{n}\ln h_{S_{t},\epsilon}(Y_{t}|\theta) ,\quad n\geq 1.\end{align}
(47)
The following bound on the second moment of this average at \(\theta^{\ast}\) will be useful.
Lemma 15.
For \(\Phi_{n}(\theta)\) defined in (47), we have
\begin{align*}E_{\theta^{\ast}}\left[\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast })\|^{2}\right]\leq\frac{1}{n}\max_{S\subset[K]}\text{Tr}\left[F(\theta^{ \ast};S,\epsilon)\right].\end{align*}
Proof.
First, we evaluate the mean at \(\theta=\theta^{\ast}\).
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\Phi_{n}(\theta^{\ast}) \right]=\frac{1}{n}\sum_{t=1}^{n}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h _{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\right].\end{align*}
Focusing on a single term,
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\right] & =\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t}=S)E_{\theta^{\ast}} \left[\nabla_{\vartheta}\ln h_{S,\epsilon}(Y_{t}|\theta^{\ast})|S_{t}=S\right].\end{align*}
Each term in the sum is equal to \(0\), since
\begin{align}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S,\epsilon}(Y_{t}|\theta^{\ast })|S_{t}=S\right]=\sum_{k=1}^{K}\nabla_{\vartheta}\ln h_{S,\epsilon}(k|\theta^ {\ast})h_{S,\epsilon}(k|\theta^{\ast})=\nabla_{\vartheta}\sum_{k=1}^{K}h_{S,\epsilon}(k|\theta^{\ast})=0,\end{align}
(48)
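The identity (48) can be verified numerically for a toy categorical model \(h(1|\vartheta)=\vartheta_{1}\), \(h(2|\vartheta)=\vartheta_{2}\), \(h(3|\vartheta)=1-\vartheta_{1}-\vartheta_{2}\), which, like \(h_{S,\epsilon}\), is linear in \(\vartheta\) and sums to \(1\) over the categories. This toy model is an illustration, not the paper's mechanism.

```python
import random

random.seed(1)
# Toy categorical model: h(1)=t1, h(2)=t2, h(3)=1-t1-t2, linear in (t1, t2).
for _ in range(100):
    t1, t2 = random.uniform(0.1, 0.4), random.uniform(0.1, 0.4)
    h = [t1, t2, 1.0 - t1 - t2]
    # score vectors d/d(t1,t2) of ln h(k) for k = 1, 2, 3
    scores = [(1.0 / t1, 0.0), (0.0, 1.0 / t2), (-1.0 / h[2], -1.0 / h[2])]
    mean = [sum(h[k] * scores[k][d] for k in range(3)) for d in range(2)]
    assert all(abs(m) < 1e-12 for m in mean)  # the score has zero mean, as in (48)
print("identity (48) verified on random parameters")
```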
For the second moment at \(\theta=\theta^{\ast}\),
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\Phi_{n}(\theta^{\ast}) \nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}\right]= & \frac{1}{n^{2}}\sum_{t=1}^{n}E_{\theta^{\ast}}\left[\nabla_{ \vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\nabla_{\vartheta}\ln h_{ S_{t},\epsilon}(Y_{t}|\theta^{\ast})^{\top}\right] \\& +\frac{2}{n^{2}}\sum_{t=1}^{n}\sum_{t^{\prime}=1}^{t-1}E_{\theta^ {\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast}) \nabla_{\vartheta}\ln h_{S_{t^{\prime}},\epsilon}(Y_{t^{\prime}}|\theta^{\ast})^{\top} \right].\end{align*}
For the diagonal terms, for all \(t=1,\ldots,n\), we have
\begin{align*}&E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{ \ast})^{\top}\right] \\&\quad\quad=\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t}=S)E_{\theta^ {\ast}}\left[\nabla_{\vartheta}\ln h_{S,\epsilon}(Y_{t}|\theta^{\ast})\nabla_{ \vartheta}\ln h_{S,\epsilon}(Y_{t}|\theta^{\ast})^{\top}|S_{t}=S\right] \\&\quad\quad=\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t}=S)F(\theta^{ \ast};S,\epsilon).\end{align*}
For the cross terms, for \(1\leq t^{\prime}{\lt}t\leq n\),
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\nabla_{\vartheta}\ln h_{S_{t^{\prime}},\epsilon}(Y_{t^{\prime}}| \theta^{\ast})^{\top}\right] &=E_{\theta^{\ast}}\left\{E_{\theta^{\ast}}\left[\nabla_{ \vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})\nabla_{\vartheta}\ln h_{ S_{t^{\prime}},\epsilon}(Y_{t^{\prime}}|\theta^{\ast})^{\top}|Y_{t^{\prime}},S_{t^{ \prime}}\right]\right\} \\&=E_{\theta^{\ast}}\left\{E_{\theta^{\ast}}\left[\nabla_{ \vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})|Y_{t^{\prime}},S_{t^{ \prime}}\right]\nabla_{\vartheta}\ln h_{S_{t^{\prime}},\epsilon}(Y_{t^{\prime}}|\theta^ {\ast})^{\top}\right\}.\end{align*}
The conditional expectation inside is zero, since, by (48),
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{\ast})|Y_{t^{\prime}},S_{t^{\prime}}\right]=\sum_{S\subset[K]}P_ {\theta^{\ast}}(S_{t}=S|Y_{t^{\prime}},S_{t^{\prime}})E_{\theta^{\ast}}\left[ \nabla_{\vartheta}\ln h_{S,\epsilon}(Y_{t}|\theta^{\ast})|S_{t}=S\right]=0.\end{align*}
Therefore, all the cross terms are zero,
\begin{align*}E_{\theta^{\ast}}\left[\nabla_{\vartheta}\ln h_{S_{t},\epsilon}(Y_{t}|\theta^{ \ast})\nabla_{\vartheta}\ln h_{S_{t^{\prime}}}(Y_{t^{\prime}}|\theta^{\ast})^{ \top}\right]=0,\end{align*}
and hence, we arrive at
\begin{align} E_{\theta^{\ast}}\left[\nabla_{\vartheta}\Phi_{n}(\theta^{\ast}) \nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}\right]=\frac{1}{n^{2}}\sum_{t =1}^{n}\sum_{S\subset[K]}P_{\theta^{\ast}}(S_{t}=S)\left[F(\theta^{\ast};S, \epsilon)\right]\end{align}
(49)
for the second moment of the gradient of \(\Phi_{n}(\theta^{\ast})\). Therefore,
\begin{align*}E_{\theta^{\ast}}\left[\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast })\|^{2}\right] & =E_{\theta^{\ast}}\left[\text{Tr}\left(\nabla_{\vartheta}\Phi_{ n}(\theta^{\ast})\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}\right)\right] \\& =\text{Tr}\left(E_{\theta^{\ast}}\left[\nabla_{\vartheta}\Phi_{ n}(\theta^{\ast})\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}\right]\right) \\& =\frac{1}{n^{2}}\sum_{t=1}^{n}\sum_{S\subset[K]}P_{\theta^{\ast}} (S_{t}=S)\text{Tr}(F(\theta^{\ast};S,\epsilon)) \\& \leq\frac{1}{n}\max_{S\subset[K]}\text{Tr}\left(F(\theta^{\ast}; S,\epsilon)\right),\end{align*}
which concludes the proof. ∎
A.4.2 Convergence of the Posterior Distribution.
Let \(\mu\in(0,1)\) and, for \(\theta,\theta^{\prime}\in\Delta\), define
\begin{align}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\prime}):=\mu c_{u}\|\theta-\theta^{ \prime}\|^{2}-\frac{1}{n}\sum_{t=1}^{n}V_{t}(\theta,\theta^{\prime}),\end{align}
(50)
where \(V_{t}(\theta,\theta^{\prime})\) was defined in (37) and \(c_{u}>0\) was defined in the proof of Lemma 9. The proof of Theorem 3 requires the following lemma concerning \(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\prime})\).
Lemma 16.
There exists \(\mu\in(0,1)\) such that, for any \(\varepsilon>0\), we have
\begin{align*}\lim_{n\rightarrow\infty}P_{\theta^{\ast}}\left(\int_{\Delta}e^{nm_{0}\mathcal {E}^{\mu}_{n}(\theta,\theta^{\ast})}\mathrm{d}\theta > e^{\varepsilon}\right)=0.\end{align*}
Proof.
Define the product measure
\begin{align}P^{\otimes}(\mathrm{d}(\theta,\cdot)):=\frac{\mathrm{d}\theta}{|\Delta|}\times dP _{\theta^{\ast}}(\cdot)\end{align}
(51)
for random variables \((\Theta\in\Delta,\{S_{t}\subset\{1,\ldots,K\},Y_{t}\in[K]\}_{t\geq 1})\), where \(\mathrm{d}\theta\) is the Lebesgue measure for \(\vartheta\) restricted to \(\Delta\) and \(|\Delta|:=\int_{\Delta}\mathrm{d}\theta\). We will show that the parameter \(\mu\) in (50) can be chosen such that the collection of random variables
\begin{align*}\mathcal{C}:=\{f_{n}:=\max\{1,e^{nm_{0}\mathcal{E}^{\mu}_{n}(\Theta,\theta^{\ast})}\}:n\geq 1\}\end{align*}
is uniformly integrable with respect to \(P^{\otimes}\). For uniform integrability, we need to show that for any \(\varepsilon>0\), there exists an \(M>0\) such that
\begin{align*}E^{\otimes}[|f_{n}|\cdot\mathbb{I}(f_{n} > M)] < \varepsilon,\quad\forall n\geq 1.\end{align*}
For any \(M>1\) and \(n\geq 1\), we have
\begin{align*}E^{\otimes}[|f_{n}|\cdot\mathbb{I}(f_{n} > M)] & =E^{\otimes}[e^{nm_{0}\mathcal{E}^{\mu}_{n}(\Theta,\theta^{\ast}) }\mathbb{I}(f_{n} > M)] \\& \leq\sup_{\theta\in\Delta}e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta, \theta^{\ast})}P^{\otimes}(f_{n} > M) \\& =\sup_{\theta\in\Delta}e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta, \theta^{\ast})}\int_{\Delta}P_{\theta^{\ast}}\left(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}) > \frac{\ln M}{nm_{0}}\right)\frac{\mathrm{d}\theta}{| \Delta|} \\& \leq e^{n\mu m_{0}c_{u}}\sup_{\theta\in\Delta}P_{\theta^{\ast}}\left(\mathcal{E}^{\mu}_ {n}(\theta,\theta^{\ast}) > 0\right),\end{align*}
where the last line follows from \(\ln M>0\) and \(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\leq\mu c_{u}\|\theta-\theta^{\ast} \|^{2}\leq\mu c_{u}\). Using (45) with \(a=\mu\) (the point \(\theta=\theta^{\ast}\) contributes nothing, since \(V_{t}(\theta^{\ast},\theta^{\ast})=0\) makes the probability \(0\)), the last expression can be upper-bounded as
\begin{align}e^{n\mu m_{0}c_{u}}\sup_{\theta\in\Delta}P_{\theta^{\ast}}\left(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}) > 0\right) & \leq e^{n\mu m_{0}c_{u}}\left(1-c_{u}\right)^{\lceil n(1-\mu) \rceil}\prod_{i=1}^{\lceil n(1-\mu)\rceil}\frac{n-i+1}{n(1-\mu c_{u})-i+1}.\end{align}
(52)
The parameter \(\mu\) can be chosen such that (52) converges to \(0\). For such \(\mu\) and any \(\varepsilon>0\), there exists an \(N_{\varepsilon}>0\) such that \(E^{\otimes}[|f_{n}|\cdot\mathbb{I}(f_{n}>M)]<\varepsilon\) for all \(n>N_{\varepsilon}\) and any \(M>1\). Moreover, since \(f_{n}\leq e^{n\mu m_{0}c_{u}}\), choosing \(M_{\varepsilon}=e^{N_{\varepsilon}\mu m_{0}c_{u}}\) makes the indicator vanish for all \(n\leq N_{\varepsilon}\). Hence, \(E^{\otimes}[|f_{n}|\cdot\mathbb{I}(f_{n}>M_{\varepsilon})]<\varepsilon\) for all \(n\geq 1\), and \(\mathcal{C}\) is uniformly integrable for a suitable choice of \(\mu\).
Next, we show that \(f_{n}\) converges in \(P^{\otimes}\)-probability to \(1\). For every fixed \(\theta\in\Delta\) with \(\theta\neq\theta^{\ast}\) and every \(\varepsilon>0\), we have
\begin{align*}P_{\theta^{\ast}}(\max\{1,e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta ^{\ast})}\} > e^{\varepsilon})=P_{\theta^{\ast}}\left(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}) > \frac{\varepsilon}{nm_{0}}\right)\leq P_{\theta^{\ast}}\left(\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}) > 0\right)\rightarrow 0\end{align*}
by Lemma 11 applied with \(a=\mu\). Since this covers Lebesgue-almost every \(\theta\in\Delta\), integrating over \(\theta\) and applying dominated convergence yields \(P^{\otimes}(f_{n}>e^{\varepsilon})\rightarrow 0\). Since \(\mathcal{C}\) is uniformly integrable, the Vitali convergence theorem ensures that \(f_{n}\) converges to \(1\) in mean, i.e., \(\lim_{n\rightarrow\infty}E^{\otimes}(f_{n})=1\). Since \(P^{\otimes}=\frac{\mathrm{d}\theta}{|\Delta|}\times\mathrm{d}P_{\theta^{\ast}}(\cdot)\) is the product measure defined in (51), the stated limit reads
\begin{align*}E_{\theta^{\ast}}\left[\frac{1}{|\Delta|}\int_{\Delta}\max\{1,e^{nm_{0}\mathcal{E}^{\mu} _{n}(\theta,\theta^{\ast})}\}\mathrm{d}\theta\right]\rightarrow 1.\end{align*}
Moreover, since \(f_{n}\geq 1\), convergence of the mean to \(1\) gives \(E^{\otimes}|f_{n}-1|\rightarrow 0\), and Markov's inequality then implies
\begin{align}\frac{1}{|\Delta|}\int_{\Delta}\max\{1,e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{ \ast})}\}\mathrm{d}\theta\overset{P_{\theta^{\ast}}}{\rightarrow}1.\end{align}
(53)
Finally, since we have
\begin{align*}\int_{\Delta}e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})}\mathrm{d} \theta\leq\int_{\Delta}\max\{1,e^{nm_{0}\mathcal{E}^{\mu}_{n}(\theta, \theta^{\ast})}\}\mathrm{d}\theta,\end{align*}
and, by (53), the right-hand side converges in probability to \(|\Delta|\leq 1<e^{\varepsilon}\), the left-hand side exceeds \(e^{\varepsilon}\) with vanishing probability, which concludes the proof. ∎
Proof of Theorem 3.
Writing down Lemma 7 with \(\theta^{\ast}\) and any \(\theta\in\Delta\) separately for \(t=1,\ldots,n\), summing the inequalities, and dividing by \(n\), we obtain
\begin{align*}\Phi_{n}(\theta^{\ast})-\Phi_{n}(\theta) & \geq\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}(\vartheta^{ \ast}-\vartheta)+\frac{m_{0}}{n}\sum_{t=1}^{n}V_{t}(\theta,\theta^{\ast}) \\& =\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})^{\top}(\vartheta^{\ast }-\vartheta)+m\|\theta-\theta^{\ast}\|^{2}-m_{0}\mathcal{E}^{\mu}_{n}(\theta, \theta^{\ast}),\end{align*}
where \(m:=\mu m_{0}c_{u}\). Reversing the sign,
\begin{align*}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\ast})\leq\nabla_{\vartheta} \Phi_{n}(\theta^{\ast})^{\top}(\vartheta-\vartheta^{\ast})-m\|\theta^{\ast}- \theta\|^{2}+m_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}).\end{align*}
Using Cauchy-Schwarz inequality for the first term on the right-hand side, we get
\begin{align*}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\ast})\leq\|\nabla_{\vartheta} \Phi_{n}(\theta^{\ast})\|\|\theta-\theta^{\ast}\|-m\|\theta^{\ast}-\theta\|^{2 }+m_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}).\end{align*}
Using Young’s inequality \(uv\leq\frac{u^{2}}{2\kappa}+\frac{v^{2}\kappa}{2}\) for the first term on the right-hand side, with \(u=\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|\), \(v=\|\theta^{\ast}-\theta\|\), and \(\kappa=m\), we get
\begin{align}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\ast})\leq\frac{\|\nabla_{\vartheta}\Phi_{n} (\theta^{\ast})\|^{2}}{2m}-\frac{m}{2}\|\theta^{\ast}-\theta\|^{2}+m_{0} \mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}).\end{align}
(54)
Similarly, using Lemma 14 with \(\theta^{\ast}\) and any \(\theta^{\prime}\in\Delta\) for \(t=1,\ldots,n\), summing the inequalities and dividing by \(n\), we obtain
\begin{align*}\Phi_{n}(\theta^{\ast})-\Phi_{n}(\theta^{\prime})\leq\nabla_{\vartheta}\Phi_{n }(\theta^{\ast})^{\top}(\vartheta^{\ast}-\vartheta^{\prime})+L_{0}\|\theta^{ \ast}-\theta^{\prime}\|^{2}.\end{align*}
Again, using Cauchy-Schwarz inequality and Young’s inequality \(uv\leq\frac{u^{2}}{2\kappa}+\frac{v^{2}\kappa}{2}\) with \(u=\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|\), \(v=\|\theta^{\ast}-\theta^{\prime}\|\), and \(\kappa=2L_{0}\), we get
\begin{align}\Phi_{n}(\theta^{\ast})-\Phi_{n}(\theta^{\prime}) & \leq\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|\|\theta^{\ast}- \theta^{\prime}\|+L_{0}\|\theta^{\ast}-\theta^{\prime}\|^{2} \\& \leq\frac{\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}}{4L_{ 0}}+2L_{0}\|\theta^{\ast}-\theta^{\prime}\|^{2} \\ & =\frac{\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}}{2L}+L\| \theta^{\ast}-\theta^{\prime}\|^{2},\end{align}
(55)
where we let \(L:=2L_{0}\). Summing the inequalities in (54) and (55), we obtain
\begin{align}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\prime})\leq\left(\frac{1}{2L}+\frac{1}{2m} \right)\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-\frac{m}{2}\|\theta^{ \ast}-\theta\|^{2}+L\|\theta^{\ast}-\theta^{\prime}\|^{2}+m_{0} \mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}).\end{align}
(56)
Let \(a\in(0,1)\) be a constant and define the sequences
\begin{align*}\Omega_{n} &: =\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2}\leq\max\{6/m,6/L\}n^{-a}\},\quad n\geq 1. \\A_{n} &: =\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2} > \max\{6/m,6/L\}n^{-a}\},\quad n\geq 1. \\B_{n} &: =\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2}\leq\min\{2/m,2/L\}n^{-a}\},\quad n\geq 1.\end{align*}
For \(\theta\in A_{n}\) and \(\theta^{\prime}\in B_{n}\), (56) can be used to obtain
\begin{align*}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\prime})\leq\left(\frac{1}{2L}+\frac{1}{2m} \right)\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-n^{-a}\left(\max\left \{3,\frac{3m}{L}\right\}-\min\left\{\frac{2L}{m},2 \right\}\right)+m_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}).\end{align*}
Noting that \(\max\left\{3,\frac{3m}{L}\right\}-\min\left\{\frac{2L}{m},2\right\}\geq 1\), we have
\begin{align}\Phi_{n}(\theta)-\Phi_{n}(\theta^{\prime})\leq\left(\frac{1}{2L}+\frac{1}{2m} \right)\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-n^{-a}+m_{0}\mathcal{ E}^{\mu}_{n}(\theta,\theta^{\ast}),\quad\theta\in A_{n};\theta^{\prime}\in B_{ n}.\end{align}
(57)
Multiplying (57) by \(n\), exponentiating, and multiplying by the ratio of the priors, we get
\begin{align}\frac{\eta(\theta)\exp\{n\Phi_{n}(\theta)\}}{\eta (\theta^{\prime})\exp\{n\Phi_{n}(\theta^{\prime})\}} & \leq\frac{\eta(\theta)}{\eta(\theta^{\prime})}\exp\left[\left(\frac{1}{2L}+\frac{1}{2m}\right)n\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\| ^{2}-n^{1-a}+nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right] \\ & \leq C_{\eta,n}\exp\left[C_{1}n\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-n^{1-a}+nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right]\end{align}
(58)
for all \(\theta\in A_{n}\) and \(\theta^{\prime}\in B_{n}\), where \(C_{1}:=\frac{1}{2L}+\frac{1}{2m}\) and \(C_{\eta,n}:=\sup_{\theta\in A_{n},\theta^{\prime}\in B_{n}}\frac{\eta(\theta)} {\eta(\theta^{\prime})}\). The bound in (58) can be used to bound the ratio between the posterior probabilities \(\Pi(A_{n}|Y_{1:n},S_{1:n})\) and \(\Pi(B_{n}|Y_{1:n},S_{1:n})\), since
\begin{align*}\frac{\Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(B_{n}|Y_{1:n},S_{1:n})} & =\frac{\int_{A_{n}}\eta(\theta)\exp\{n\Phi_{n}(\theta)\}\mathrm{d}\theta}{\int_{B_{n}}\eta(\theta)\exp\{n\Phi_{n}(\theta)\}\mathrm{d}\theta} \\& =\frac{\int_{A_{n}}\frac{\eta(\theta)\exp\{n\Phi_{n}(\theta)\}}{\inf_{\theta^{\prime}\in B_{n}}\eta(\theta^{\prime})\exp\{n\Phi_{n}(\theta^{\prime})\}}\mathrm{d}\theta}{\int_{B_{n}} \frac{\eta(\theta)\exp\{n\Phi_{n}(\theta)\}}{\inf_{\theta^{ \prime}\in B_{n}}\eta(\theta^{\prime})\exp\{n\Phi_{n}(\theta^{\prime}) \}}\mathrm{d}\theta} \\& \leq\frac{\int_{A_{n}}C_{\eta,n}\exp\left[C_{1}n\|\nabla_{ \vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-n^{1-a}+nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right]\mathrm{d}\theta}{\int_{B_{n}}1\,\mathrm{d}\theta} \\& =\frac{1}{\text{Vol}(B_{n})}C_{\eta,n}\exp\left[C_{1}n\|\nabla_{ \vartheta}\Phi_{n}(\theta^{\ast})\|^{2}-n^{1-a}\right]\int_{A_{n}}\exp\left[nm _{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right]\mathrm{d}\theta,\end{align*}
where \(\text{Vol}(B_{n}):=\int_{B_{n}\cap\Delta}\mathrm{d}\theta\), the third line uses (58) in the numerator, and the denominator uses that its integrand is at least \(1\). Note that \(B_{n}\) shrinks with \(n\), so there exists an \(N_{B}\) such that for \(n>N_{B}\), the volume of \(B_{n}\) can be lower-bounded as
\begin{align*}\text{Vol}(B_{n})\geq\frac{1}{2(K-2)!}\frac{(\sqrt{\pi}\min\{2/m,2/L\}n^{- a})^{(K-1)}}{\Gamma((K-1)/2+1)},\end{align*}
where the factor \(\frac{1}{2(K-2)!}\) accounts for the worst-case situation in which \(\theta^{\ast}\) lies on one of the corners of \(\Delta\), such as \(\theta^{\ast}=(1,0,\ldots,0)^{\top}\), and the remaining factor is the volume of a (\(K-1\))-dimensional ball of radius \(\min\{2/m,2/L\}n^{-a}\), which lower-bounds the true radius \((\min\{2/m,2/L\}n^{-a})^{1/2}\) of \(B_{n}\) once \(n>N_{B}\). The lower bound is the volume of the intersection of the simplex with a ball centered at one of its sharpest corners. Therefore, for \(n>N_{B}\), the ratio can be further bounded as
\begin{align*}\frac{\Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(B_{n}|Y_{1:n},S_{1:n})}\leq C_{2}C_{\eta, n}\exp\left[C_{1}n\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}+(K-1)a\ln n -n^{1-a}\right]\kern-2pt\int_{A_{n}}\kern-1pt\exp\kern-1pt\left[nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta ^{\ast})\right]\mathrm{d}\theta,\end{align*}
where \(C_{2}:=\frac{2(K-2)!\,\Gamma((K-1)/2+1)}{\min\{2/m,2/L\}^{K-1}\pi^{(K-1)/2}}\) does not depend on \(n\).
Next, we prove that the sequence of random variables
\begin{align*}Z_{n}:=C_{\eta,n}\exp\left[C_{1}n\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\| ^{2}+(K-1)a\ln n-n^{1-a}\right]\int_{A_{n}}\exp\left[nm_{0}\mathcal{E}^{\mu}_{ n}(\theta,\theta^{\ast})\right]\mathrm{d}\theta\end{align*}
converges to \(0\) in probability, which in turn proves the convergence of \(\frac{\Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(B_{n}|Y_{1:n},S_{1:n})}\) in probability to \(0\). To do that, we need to prove that for each \(\varepsilon>0\) and \(\delta>0\), there exists a \(N>0\) such that for all \(n>N\) we have \(P_{\theta^{\ast}}(Z_{n}\geq 2\varepsilon){\lt}2\delta\). Fix \(\varepsilon>0\) and \(\delta>0\).
First, by Assumption 1, because \(B_{n}\) shrinks toward \(\theta^{\ast}\), there exists \(N_{\eta}>0\) such that \(C_{\eta,n}{\lt}B\) for all \(n>N_{\eta}\).
Next, let \(\beta:=C_{1}\max_{S\subset[K]}\text{Tr}(F(\theta^{\ast};S,\epsilon))/\delta\). Using Markov’s inequality for \(\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}\) with Lemma 15, we have
\begin{align*}P_{\theta^{\ast}}\left(\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast })\|^{2}\geq\frac{1}{n}\frac{\beta}{C_{1}}\right) & \leq\frac{1}{n}\max_{S\subset[K]} \text{Tr}(F(\theta^{\ast};S,\epsilon))\frac{C_{1}n}{\beta}=\delta.\end{align*}
Also, since \(n^{1-a}\) dominates \(\ln n\), one can choose an integer \(N_{\Phi}>0\) such that
\begin{align*}\beta\leq\ln(\varepsilon/B)+n^{1-a}-(K-1)a\ln n,\quad\forall n\geq N_{\Phi}.\end{align*}
Now we deal with the integral in \(Z_{n}\). We have
\begin{align*}P_{\theta^{\ast}}\left(\int_{A_{n}}\exp\left[nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right]\mathrm{d}\theta\geq 2\right)\leq P_{\theta^{\ast} }\left(\int_{\Delta}\exp\left[nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast}) \right]\mathrm{d}\theta\geq 2\right)\rightarrow 0,\end{align*}
where the convergence is due to Lemma 16. Hence, there exists an \(N_{\mathcal{E}}\) such that for all \(n>N_{\mathcal{E}}\),
\begin{align*}P_{\theta^{\ast}}\left(\int_{\Delta}\exp\left[nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{\ast})\right]\mathrm{d}\theta\geq 2\right)\leq\delta.\end{align*}
Gathering the results, for \(n>\max\{N_{\eta},N_{\Phi},N_{\mathcal{E}}\}\), we have
\begin{align*}P_{\theta^{\ast}}(Z_{n}\geq 2\varepsilon) & \leq P_{\theta^{\ast}}\left(e^{C_{1}n\|\nabla_{\vartheta}\Phi_{n} (\theta^{\ast})\|^{2}+(K-1)a\ln n-n^{1-a}}\geq\varepsilon/B\right)+P_{\theta^{ \ast}}\left(\int_{\Delta}\exp\left[nm_{0}\mathcal{E}^{\mu}_{n}(\theta,\theta^{ \ast})\right]\mathrm{d}\theta\geq 2\right) \\& \leq P_{\theta^{\ast}}\left(C_{1}n\|\nabla_{\vartheta}\Phi_{n}(\theta^{\ast})\|^{2}+(K-1)a\ln n-n^{1-a}\geq\ln(\varepsilon/B)\right)+\delta \\& =P_{\theta^{\ast}}\left(n\|\nabla_{\vartheta}\Phi_{n}(\theta^{ \ast})\|^{2}\geq\frac{\ln(\varepsilon/B)+n^{1-a}-(K-1)a\ln n}{C_{1}}\right)+\delta \\& \leq P_{\theta^{\ast}}\left(\|\nabla_{\vartheta}\Phi_{n}(\theta^{ \ast})\|^{2}\geq\frac{\beta}{C_{1}n}\right)+\delta \\& \leq 2\delta.\end{align*}
(In the first line, we have used \(\mathbb{P}(XY\geq pq)\leq\mathbb{P}(X\geq p)+\mathbb{P}(Y\geq q)\) for non-negative random variables \(X,Y\) and positive \(p,q\), which follows from \(\{XY\geq pq\}\subseteq\{X\geq p\}\cup\{Y\geq q\}\).) Therefore, we have proved that \(Z_{n}\rightarrow 0\) in probability. Finally, since \(B_{n}\subset\Omega_{n}\), we have
\begin{align*}\frac{\Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(\Omega_{n}|Y_{1:n},S_{1:n})}\leq\frac{ \Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(B_{n}|Y_{1:n},S_{1:n})}\leq C_{2}Z_{n}\end{align*}
for all \(n>N_{B}\). This implies that,
\begin{align*}\frac{\Pi(A_{n}|Y_{1:n},S_{1:n})}{\Pi(\Omega_{n}|Y_{1:n},S_{1:n})}\overset{P_{ \theta^{\ast}}}{\rightarrow}0.\end{align*}
Since \(A_{n}=\Delta\setminus\Omega_{n}\), we get \(\Pi(\Omega_{n}|Y_{1:n},S_{1:n})\overset{P_{\theta^{\ast}}}{\rightarrow}1\). This concludes the proof. ∎
A.4.3 Convergence of the Expected Frequency.
Proof of Theorem 4.
Assumption 4 ensures that there exist a \(\kappa_{0}>0\) and \(k^{\ast}\in\{0,\ldots,K-1\}\) such that for all \(k\in\{0,\ldots,K-1\}\) with \(k\neq k^{\ast}\),
\begin{align*}U(\theta^{\ast};S^{\ast},\epsilon)-U(\theta^{\ast};\{\sigma_{\theta^{ \ast}}(1),\ldots,\sigma_{\theta^{\ast}}(k)\},\epsilon)\geq\kappa_{0}.\end{align*}
By Assumption 3, there exists a \(\delta_{1}>0\) such that
\begin{align*}\|\theta-\theta^{\prime}\|\leq\delta_{1}\Rightarrow|U(\theta,S,\epsilon)-U(\theta^{\prime},S,\epsilon)| < \kappa_{0}/2.\end{align*}
Moreover, since the components of \(\theta^{\ast}\) are strictly ordered,
\begin{align*}\delta_{2}:=\min_{k=1,\ldots,K-1}(\theta^{\ast}(k)-\theta^{\ast}(k+1)) > 0.\end{align*}
Choose \(\delta=\min\{\delta_{1},\delta_{2}/\sqrt{2}\}\). Define the set
\begin{align*}\Omega_{\delta}=\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2}\leq \delta^{2}\}.\end{align*}
Then, for any \(\theta\in\Omega_{\delta}\), \(\sigma_{\theta}=\sigma_{\theta^{\ast}}\) and \(S^{\ast}_{\theta}=S^{\ast}\). This implies that \(\{\theta_{n}\in\Omega_{\delta}\}\subseteq\{S_{n+1}=S^{\ast}\}\). Since perfect sampling is assumed, we have \(Q(\mathrm{d}\theta_{t}|Y_{1:t},S_{1:t})=\Pi(\mathrm{d}\theta_{t}|Y_{1:t},S_{1: t})\). Hence,
\begin{align}P_{\theta^{\ast}}(S_{n+1}=S^{\ast})\geq E_{\theta^{\ast}}\left[P_{\theta^{\ast }}(\theta_{n}\in\Omega_{\delta}|S_{1:n},Y_{1:n})\right]=E_{\theta^{\ast}}\left [\Pi(\Omega_{\delta}|Y_{1:n},S_{1:n})\right]\end{align}
(59)
Recall the sequence of sets
\begin{align*}\Omega_{n}=\{\theta\in\Delta:\|\theta-\theta^{\ast}\|^{2}\leq cn^{-a}\}\end{align*}
defined in Theorem 3. There exists an \(N_{1}>0\) such that for all \(n>N_{1}\) we have \(\Omega_{n}\subseteq\Omega_{\delta}\). For such \(n\), we have
\begin{align*}E_{\theta^{\ast}}\left[\Pi(\Omega_{\delta}|Y_{1:n},S_{1:n})\right]\geq E_{ \theta^{\ast}}\left[\Pi(\Omega_{n}|Y_{1:n},S_{1:n})\right],\quad n > N_{1}.\end{align*}
Combining this with (59), we can write
\begin{align}P_{\theta^{\ast}}(S_{n+1}=S^{\ast})\geq E_{\theta^{\ast}}\left[\Pi(\Omega_{n}| Y_{1:n},S_{1:n})\right],\quad n > N_{1}.\end{align}
(60)
We will show that the right-hand side converges to \(1\). To do that, fix \(\varepsilon>0\). By Theorem 3, there exists an \(N_{2}>0\) such that, for all \(n>N_{2}\),
\begin{align*}P_{\theta^{\ast}}\left(\Pi(\Omega_{n}|Y_{1:n},S_{1:n}) > \sqrt{1-\varepsilon} \right) > \sqrt{1-\varepsilon}.\end{align*}
This implies that, for \(n>N_{2}\),
\begin{align*}E_{\theta^{\ast}}(\Pi(\Omega_{n}|Y_{1:n},S_{1:n})) > \sqrt{1-\varepsilon}\sqrt{1 -\varepsilon}+0(1-\sqrt{1-\varepsilon})=1-\varepsilon.\end{align*}
This shows that \(E_{\theta^{\ast}}\left[\Pi(\Omega_{n}|Y_{1:n},S_{1:n})\right]\rightarrow 1\) as \(n\rightarrow\infty\). Since the right-hand side of (60) converges to \(1\), so does the left-hand side. Therefore, we have proven (19).
To prove (20), we utilize the convergence of Cesaro means and write
\begin{align*}\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}P_{\theta^{\ast}}(S_{t}=S^{ \ast})=\lim_{t\rightarrow\infty}P_{\theta^{\ast}}(S_{t}=S^{\ast})=1,\end{align*}
where the last equality is by (19). Finally, we replace \(P_{\theta^{\ast}}(S_{t}=S^{\ast})\) with \(E_{\theta^{\ast}}[\mathbb{I}(S_{t}=S^{\ast})]\) on the left-hand side and conclude the proof. ∎
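The Cesàro-mean step in the last display can be illustrated numerically: if \(p_{t}\rightarrow 1\), the running averages \(\frac{1}{n}\sum_{t=1}^{n}p_{t}\) converge to \(1\) as well. The sequence \(p_{t}=1-1/(t+1)\) below is an arbitrary example, not related to the algorithm.

```python
# If p_t -> 1, then the Cesaro means (1/n) * sum_{t<=n} p_t -> 1 as well.
n = 10_000
p = [1.0 - 1.0 / (t + 1) for t in range(1, n + 1)]
avg = sum(p) / n
print(avg)  # close to 1
```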

[25]
S. Wang, Y. Li, Y. Zhong, K. Chen, X. Wang, Z. Zhou, F. Peng, Y. Qian, J. Du, and W. Yang. 2024. Locally private set-valued data analyses: Distribution and heavy hitters estimation. IEEE Transactions on Mobile Computing 23, 8 (Aug 2024), 8050–8065. DOI:
[26]
Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. 2017. Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium (USENIX Security ’17), 729–745.
[27]
Tianhao Wang, Milan Lopuhaä-Zwakenberg, Zitao Li, Boris Skoric, and Ninghui Li. 2020. Locally differentially private frequency estimation with consistency. In Proceedings of the 27th Annual Network and Distributed System Security Symposium (NDSS ’20). 16 pages. DOI:
[28]
Ian Waudby-Smith, Steven Wu, and Aaditya Ramdas. 2023. Nonparametric extensions of randomized response for private confidence sets. In Proceedings of the International Conference on Machine Learning. PMLR, 36748–36789.
[29]
Fei Wei, Ergute Bao, Xiaokui Xiao, Yin Yang, and Bolin Ding. 2024. AAA: An adaptive mechanism for locally differential private mean estimation. arXiv:2404.01625. Retrieved from https://arxiv.org/abs/2404.01625
[30]
Max Welling and Yee Whye Teh. 2011. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML’11). Omnipress, Madison, WI, 681–688.
[31]
Oliver Williams and Frank Mcsherry. 2010. Probabilistic inference and differential privacy. In Advances in Neural Information Processing Systems. J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (Eds.), Vol. 23, Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2010/file/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf
[32]
Sinan Yıldırım. 2024. Differentially private online Bayesian estimation with adaptive truncation. Turkish Journal of Electrical Engineering and Computer Sciences 32, 2 (2024), 34–50. Retrieved from http://dblp.uni-trier.de/db/journals/ftml/ftml11.html#RussoRKOW18
[33]
Dan Zhao, Su-Yun Zhao, Hong Chen, Rui-Xuan Liu, Cui-Ping Li, and Xiao-Ying Zhang. 2023. Hadamard encoding based frequent itemset mining under local differential privacy. Journal of Computer Science and Technology 38, 6 (2023), 1403–1422.
[34]
Youwen Zhu, Yiran Cao, Qiao Xue, Qihui Wu, and Yushu Zhang. 2024. Heavy hitter identification over large-domain set-valued data with local differential privacy. IEEE Transactions on Information Forensics and Security 19 (2024), 414–426. DOI:

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 19, Issue 2
February 2025
153 pages
EISSN: 1556-472X
DOI: 10.1145/3703012
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 January 2025
Online AM: 03 December 2024
Accepted: 21 November 2024
Revised: 08 October 2024
Received: 11 May 2024
Published in TKDD Volume 19, Issue 2

Author Tags

  1. data privacy
  2. randomized response mechanisms
  3. posterior sampling
  4. stochastic gradient Langevin dynamics

Qualifiers

  • Research-article
