
Private and Online Learnability Are Equivalent

    Published: 16 August 2022

    Abstract

    Let H be a binary-labeled concept class. We prove that H can be PAC learned by an (approximate) differentially private algorithm if and only if it has a finite Littlestone dimension. This implies a qualitative equivalence between online learnability and private PAC learnability.

    1 Introduction

    This work studies the relationship between private PAC learning and online learning.
    Differentially private learning. Statistical analyses and computer algorithms play significant roles in the decisions that shape modern society. The collection and analysis of individuals’ data drives computer programs that determine many critical outcomes, including the allocation of community resources, decisions to give loans, and school admissions.
    Although data-driven and automated approaches have obvious benefits in terms of efficiency, they also raise the possibility of unintended negative impacts, especially against marginalized groups. This possibility highlights the need for responsible algorithms that obey relevant ethical requirements (e.g., see [71]).
    Differential privacy (DP) [33] plays a key role in this context. Its initial (and primary) purpose was to provide a formal framework for ensuring individuals’ privacy in the statistical analysis of large datasets. But it has also found use in addressing other ethical issues such as algorithmic fairness (e.g., see [29, 30]).
    There is extensive literature identifying differentially private algorithms and their limitations in a variety of contexts, including statistical query release, synthetic data generation, classification, clustering, graph analysis, hypothesis testing, and more. In general, the goal is to understand when and how privacy can be achieved in these tasks with a modest overhead in resources, such as data samples, computation time, or communication. Nevertheless, many basic questions remain regarding which tasks are compatible with DP whatsoever, especially in settings where the data are complex, high dimensional, or infinite.
    We study these questions in the private PAC model [57, 78], which captures binary classification tasks under DP. This is the simplest and most extensively studied model of how sensitive data is analyzed in machine learning. In their work introducing this model, Kasiviswanathan et al. [57] showed that every finite class \( \mathcal {H} \) is privately learnable using \( O(\log |\mathcal {H}|) \) samples. However, this bound is loose for many specific concept classes of interest and says nothing when \( \mathcal {H} \) is infinite. Several works gave improved bounds for specific classes [8, 9, 11, 12, 19, 22, 38, 54, 55, 72], but a general characterization of learnability in terms of the combinatorial structure of \( \mathcal {H} \) remains elusive. This situation stands in stark contrast to the non-private case, where early results showed that the sample complexity of PAC learning is characterized, up to constant factors, by the VC dimension [14, 79].
    In this article, we make progress toward characterizing PAC learnability by algorithms satisfying approximate DP. We prove a qualitative characterization: we show that a hypothesis class \( \mathcal {H} \) is differentially privately learnable (with some finite number of samples) if and only if it is online learnable (with some finite mistake bound).
    Online learning. Online learning is a well-studied branch of machine learning that addresses algorithms making real-time predictions on sequentially arriving data. Such tasks arise in contexts including recommendation systems and advertisement placement. The literature on this subject is vast; see, e.g., the texts [24, 46, 73].
    Online Prediction, or Prediction with Expert Advice, is a basic setting within online learning. Let \( \mathcal {H} = \lbrace h:X\rightarrow \lbrace \pm 1\rbrace \rbrace \) be a class of predictors (also called experts) over a domain \( X \) . Consider an algorithm that observes examples \( (x_1,y_1)\ldots (x_T,y_T)\in X\times \lbrace \pm 1\rbrace \) in a sequential manner. More specifically, in each timestep \( t \) , the algorithm first observes the instance \( x_t \) , then predicts a label \( \hat{y}_t\in \lbrace \pm 1\rbrace \) , and finally learns whether its prediction was correct. The goal is to minimize the regret, namely the number of mistakes compared to the best expert in \( \mathcal {H} \) :
    \( \begin{equation*} \sum _{t=1}^T 1[y_t\ne \hat{y}_t] - \min _{h^*\in \mathcal {H}} \sum _{t=1}^T 1[y_t\ne h^*(x_t)]. \end{equation*} \)
    In this context, a class \( \mathcal {H} \) is said to be online learnable if for every \( T \) there is an algorithm that achieves sublinear regret \( o(T) \) against any sequence of \( T \) examples. The Littlestone dimension is a combinatorial parameter associated with the class \( \mathcal {H} \) that characterizes its online learnability [13, 61]: \( \mathcal {H} \) is online learnable if and only if it has a finite Littlestone dimension \( d\lt \infty \) . Moreover, the best possible regret \( R(T) \) for online learning of \( \mathcal {H} \) satisfies
    \( \begin{equation*} \Omega (\sqrt {dT}) \le R(T) \le O\left(\sqrt {dT\log T}\right). \end{equation*} \)
    Furthermore, if it is known that one of the experts never errs (a.k.a. the realizable setting), then the optimal regret is exactly \( d \) .1 (The regret is referred to as the mistake bound in this context.)
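    To make the prediction setting concrete, the following is a minimal sketch of ours (not from this article) of the classical multiplicative-weights strategy for prediction with expert advice; the expert representation, stream format, and learning-rate choice are illustrative assumptions.

```python
import math
import random

def hedge(experts, stream):
    """Prediction with expert advice via multiplicative weights (Hedge).
    experts: list of functions X -> {-1, +1}; stream: list of (x, y) pairs.
    Returns (algorithm mistakes, best expert mistakes); their difference is
    the regret, which is O(sqrt(T log n)) in expectation for n experts."""
    n, T = len(experts), max(len(stream), 1)
    eta = math.sqrt(math.log(max(n, 2)) / T)       # learning rate
    w = [1.0] * n                                  # one weight per expert
    alg_mistakes, exp_mistakes = 0, [0] * n
    for x, y in stream:
        h = random.choices(experts, weights=w)[0]  # sample expert ~ weights
        alg_mistakes += (h(x) != y)
        for i, e in enumerate(experts):            # penalize erring experts
            if e(x) != y:
                w[i] *= math.exp(-eta)
                exp_mistakes[i] += 1
    return alg_mistakes, min(exp_mistakes)
```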
    Stability. Although at first glance it may seem that online learning and differentially private learning have little to do with one another, a recent line of work has revealed a tight connection between the two [2, 3, 16, 44, 49, 69].
    At a high level, this connection appears to boil down to the notion of stability, which plays a key role in both topics. On one hand, the definition of DP is itself a form of stability; it requires robustness of the output distribution of an algorithm when its input undergoes small changes. On the other hand, stability also arises as a central motif in online learning paradigms such as Follow the Perturbed Leader [51, 52] and Follow the Regularized Leader [1, 46, 75].
    In their monograph, Dwork and Roth [34] identified stability as a common factor of learning and DP: “Differential privacy is enabled by stability and ensures stability . . . we observe a tantalizing moral equivalence between learnability, differential privacy, and stability.” This insight has found formal manifestations in several works. For example, Abernethy et al. [2] used DP-inspired stability methodology to derive a unified framework for proving state-of-the-art bounds in online learning. In the opposite direction, Agarwal and Singh [3] showed that certain standard stabilization techniques in online learning imply DP.
    Stability plays a key role in this work as well. The direction that any class with a finite Littlestone dimension can be privately learned hinges on the following form of stability: for \( \eta \gt 0 \) and \( n\in \mathbb {N} \) , a learning algorithm \( \mathcal {A} \) is \( (n,\eta) \) -globally stable2 with respect to a distribution \( \mathcal {D} \) over examples if there exists a hypothesis \( h \) that \( \mathcal {A} \) outputs with probability at least \( \eta \) . Namely,
    \( \begin{equation*} \Pr _{S\sim \mathcal {D}^n}[\mathcal {A}(S) = h] \ge \eta . \end{equation*} \)
    Our argument follows by showing that every \( \mathcal {H} \) can be learned by a globally stable algorithm with parameters \( \eta = 2^{-2^{O(d)}}, n=2^{O(d)} \) , where \( d \) is the Littlestone dimension of \( \mathcal {H} \) . As a corollary, we get an equivalence between global stability and DP (which can be viewed as a form of local stability). In other words, the existence of a globally stable learner for \( \mathcal {H} \) is equivalent to the existence of a differentially private learner for it (and both are equivalent to having a finite Littlestone dimension).
    Littlestone dimension and thresholds. The converse direction—that every DP-learnable class has a finite Littlestone dimension—utilizes an intimate relationship between thresholds and the Littlestone dimension: a class \( \mathcal {H} \) has a finite Littlestone dimension if and only if it does not embed thresholds as a subclass (for a formal statement, see Theorem 10); this follows from a seminal result in model theory by Shelah [76]. As explained in the preliminaries (Section 3), Shelah’s theorem is usually stated in terms of orders and ranks. Chase and Freitag [25] noticed3 that the Littlestone dimension is the same as the model-theoretic rank. Meanwhile, order translates naturally to thresholds. To make Theorem 10 more accessible for readers with less background in model theory, we provide a combinatorial proof in the appendix.
    Littlestone classes. It is natural to ask which classes have finite Littlestone dimension. First, note that every finite class \( \mathcal {H} \) has Littlestone dimension \( d \le \log \vert \mathcal {H}\vert \) . There are also many natural and interesting infinite classes with finite Littlestone dimension. For example, let \( X=\mathbb {F}^n \) be an \( n \) -dimensional vector space over a field \( \mathbb {F} \) and let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) consist of all (indicators of) affine subspaces of dimension \( \le d \) . The Littlestone dimension of \( \mathcal {H} \) is \( d \) . More generally, any class of hypotheses that can be described by polynomial equalities of bounded degree has a bounded Littlestone dimension.4 This can be generalized even further to classes that are definable in stable theories. This notion of stability (different still from the ones discussed above) is deep and well explored in model theory. We refer the reader to Section 5.1 in the work of Chase and Freitag [26] for more examples of stable theories and the Littlestone classes they correspond to.
    Organization. The rest of this article is organized as follows. In Section 2, we formally state our main results and discuss some implications and other related and subsequent work. We present the preliminaries in Section 3. Then, in Section 4, we prove that differentially privately learnable classes have a finite Littlestone dimension, and in Section 5, we prove the converse direction—that every Littlestone class is PAC learnable by a differentially private algorithm. Finally, Section 6 concludes the article with some suggestions for future work.

    2 Results

    We next present our main results that yield an equivalence between private PAC learning and online learning. We note that the derived equivalence is qualitative in the sense that the gap between the best-known lower and upper bounds for learning a class \( \mathcal {H} \) is incredibly large: the lower bound is proportional to \( \log ^*(d) \) , whereas the upper bound is doubly exponential in \( d \) , where \( d \) is the Littlestone dimension of \( \mathcal {H} \) . Our upper bound has recently been reduced to \( \tilde{O}(d^6) \) in subsequent work [40].
    The rest of this section is organized as follows: Sections 2.1, 2.2, and 2.3 are dedicated to the relationship between differentially private learning, Littlestone dimension, and online learning, and in Section 2.4 we discuss an implication for private boosting. Throughout this section, some standard technical terms are used. For definitions of these terms, we refer the reader to Section 3.

    2.1 Private Learning Implies Finite Littlestone Dimension

    We begin with the following statement, which resolves an open problem from the works of Feldman and Xiao [38] and Bun et al. [22].
    Theorem 1 (Thresholds Are Not Privately Learnable).
    Let \( X\subseteq \mathbb {R} \) and let \( \mathcal {A} \) be a \( (\frac{1}{16},\frac{1}{16}) \) -accurate learning algorithm for the class of thresholds over \( X \) with sample complexity \( n \) that satisfies \( (\varepsilon ,\delta) \) -DP with \( \varepsilon =0.1 \) and \( \delta = O(\frac{1}{n^2\log n}) \) . Then,
    \( \begin{equation*} n\ge \Omega (\log ^*|X|). \end{equation*} \)
    In particular, the class of thresholds over an infinite \( X \) cannot be learned privately.
    We note that an upper bound that scales with \( (\log ^*|X|)^{\frac{3}{2}} \) on the private sample complexity of learning thresholds over a finite domain \( X \) is given by Kaplan et al. [53]. Thus, Theorem 1 is tight up to polynomial factors. A weaker version of Theorem 1 by Bun et al. [22] provides a similar lower bound but applies only to proper learning algorithms.
    Theorems 1 and 10 (which is stated in Section 3) imply that any privately learnable class has a finite Littlestone dimension.
    Theorem 2 (Private Learning Implies Finite Littlestone Dimension).
    Let \( \mathcal {H} \) be a hypothesis class with Littlestone dimension \( d\in \mathbb {N}\cup \lbrace \infty \rbrace \) and let \( \mathcal {A} \) be a \( (\frac{1}{16},\frac{1}{16}) \) -accurate learning algorithm for \( \mathcal {H} \) with sample complexity \( n \) that satisfies \( (\varepsilon ,\delta) \) -DP with \( \varepsilon =0.1 \) and \( \delta = O(\frac{1}{n^2\log n}) \) . Then,
    \( \begin{equation*} n\ge \Omega (\log ^*d). \end{equation*} \)
    In particular, any class that is privately learnable has a finite Littlestone dimension.

    2.1.1 On the Proof of Theorem 1.

    A common approach to proving impossibility results in computer science (and in machine learning in particular) exploits a Minmax principle, whereby one specifies a fixed hard distribution over inputs and establishes the desired impossibility result for any algorithm with respect to random inputs from that distribution. As an example, consider the “No-Free-Lunch Theorem,” which establishes that the VC dimension lower bounds the sample complexity of PAC learning a class \( \mathcal {H} \) . Here, the hard distribution is picked to be uniform on a shattered set of size \( d=\mathsf {VC}(\mathcal {H}) \) , and the argument follows by showing that every learning algorithm must observe \( \Omega (d) \) examples. (For example, see Theorem 5.1 in the work of Shalev-Shwartz and Ben-David [74].)
    Such “Minmax” proofs establish a stronger assertion: they apply even to algorithms that “know” the input distribution. For example, the No-Free-Lunch Theorem applies even to learning algorithms that are designed given the knowledge that the marginal distribution is uniform over some shattered set.
    Interestingly, such an approach is bound to fail in proving Theorem 1. The reason is that if the marginal distribution \( D_X \) is fixed, then one can pick an \( \varepsilon /2 \) -cover,5 which we denote by \( \mathcal {C}_{\varepsilon /2} \) , for the class of thresholds over \( X \) of size \( \vert \mathcal {C}_{\varepsilon /2}\vert = O(1/\varepsilon) \) , and use the exponential mechanism [64] to DP-learn the finite class \( \mathcal {C}_{\varepsilon /2} \) with sample complexity that scales with \( \log \vert \mathcal {C}_{\varepsilon /2}\vert =O(\log (1/\varepsilon)) \) . Since \( \mathcal {C}_{\varepsilon /2} \) is an \( \varepsilon \) -cover for the class of thresholds, the obtained algorithm PAC learns the class of thresholds in a differentially private manner. To conclude, there is no single distribution that is “hard” for all DP algorithms that learn thresholds.
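    To illustrate why a fixed marginal admits a small cover, here is a rough sketch of ours (the function name and the quantile-grid construction are our own illustrative assumptions, not the article's) for building such a cover of thresholds when \( D_X \) is known through samples:

```python
import numpy as np

def threshold_cover(marginal_samples, eps):
    """Place candidate thresholds on the (eps/2)-quantile grid of D_X.
    Any threshold then disagrees with its nearest candidate on a set of
    D_X-mass at most eps/2 (up to sampling error), so these O(1/eps)
    candidates form a cover, and the exponential mechanism over them
    privately learns thresholds under this one fixed marginal."""
    grid = np.linspace(0.0, 1.0, int(np.ceil(2.0 / eps)) + 1)
    return sorted(set(np.quantile(np.asarray(marginal_samples), grid)))
```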
    To overcome this difficulty, one must come up with a method of assigning to any given algorithm \( A \) a “hard” distribution \( D=D_A \) that is tailored to \( A \) and witnesses Theorem 1 with respect to \( A \) . The challenge is that \( A \) can be arbitrary—for example, it may be improper.6 We refer the reader elsewhere [7, 67, 68] for a line of work that explores in detail a similar “failure” of the Minmax principle in the context of PAC learning with low mutual information.
    The “method” we use to prove Theorem 1 exploits Ramsey theory. In a nutshell, Ramsey theory provides tools that allow one to detect, for any learning algorithm, a “largish” set \( X^{\prime }\subseteq X \) such that the behavior of \( A \) on input samples from \( X^{\prime } \) is highly regular. Then, the uniform distribution over \( X^{\prime } \) is the “hard” distribution that is used to derive Theorem 1.
    We note that similar applications of Ramsey theory in computer science date back to the 1980s [65]. For more recent usages, see other works [23, 27, 28].
    Finally, we note that in the proper case, Bun et al. [22] demonstrated an ensemble, namely a distribution over distributions, which is hard for every differentially private algorithm \( A \) : if one draws a random distribution \( D \) from the ensemble and runs \( A \) on an input sample from \( D \) , then the expected error of \( A \) will be large. It is plausible that such a statement also holds for a general (possibly improper) algorithm, and it would be interesting to find such a natural ensemble.

    2.2 Finite Littlestone Dimension Implies Private Learning

    The following statement provides an upper bound on the sample complexity of DP-learning \( \mathcal {H} \) , which depends only on the Littlestone dimension of \( \mathcal {H} \) and the privacy/utility parameters. In particular, it does not depend on \( \vert \mathcal {H}\vert \) .
    Theorem 3 (Littlestone Classes are Privately Learnable).
    Let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) be a class with Littlestone dimension \( d \) , let \( \varepsilon ,\delta \in (0, 1) \) be privacy parameters, and let \( \alpha ,\beta \in (0, 1/2) \) be accuracy parameters. For
    \( \begin{equation*} n = {O\left(\frac{2^{\tilde{O}(2^d)}+\log (1/\beta \delta)}{\alpha \varepsilon }\right)} = O_d\left(\frac{\log (1/\beta \delta)}{\alpha \varepsilon }\right) \end{equation*} \)
    there exists an \( (\varepsilon ,\delta) \) -DP learning algorithm such that for every realizable distribution \( \mathcal {D} \) , given an input sample \( S\sim \mathcal {D}^n \) , the output hypothesis \( f=\mathcal {A}(S) \) satisfies \( \operatorname{loss}_{\mathcal {D}}(f)\le \alpha \) with probability at least \( 1-\beta \) , where the probability is taken over \( S\sim \mathcal {D}^n \) as well as the internal randomness of \( \mathcal {A} \) .
    A similar result holds in the agnostic setting.
    Corollary 4 (Agnostic Learner for Littlestone Classes).
    Let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) be a class with Littlestone dimension \( d \) , let \( \varepsilon \) and \( \delta \in (0, 1) \) be privacy parameters, and let \( \alpha ,\beta \in (0, 1/2) \) be accuracy parameters. For
    \( \begin{equation*} n = O\left({\frac{2^{\tilde{O}(2^d)}+\log (1/\beta \delta)}{\alpha \varepsilon }} +\frac{\textrm {VC}(\mathcal {H})+\log (1/\beta)}{\alpha ^2\varepsilon } \right) \end{equation*} \)
    there exists an \( (\varepsilon ,\delta) \) -DP learning algorithm such that for every distribution \( \mathcal {D} \) , given an input sample \( S\sim \mathcal {D}^n \) , the output hypothesis \( f=\mathcal {A}(S) \) satisfies
    \( \begin{equation*} \operatorname{loss}_{\mathcal {D}}(f)\le \min _{h\in \mathcal {H}} \operatorname{loss}_{\mathcal {D}}(h)+ \alpha \end{equation*} \)
    with probability at least \( 1-\beta \) , where the probability is taken over \( S\sim \mathcal {D}^n \) as well as the internal randomness of \( \mathcal {A} \) .
    Corollary 4 follows from Theorem 3 via Theorem 2.3 in the work of Alon et al. [4], which provides a general mechanism for transforming a learner in the realizable setting into a learner in the agnostic setting.7 We note that formally the transformation in the work of Alon et al. [4] is stated for a constant \( \varepsilon =O(1) \) . Taking \( \varepsilon =O(1) \) is without loss of generality, as a standard “secrecy-of-the-sample” argument can be used to convert this learner into one that is \( (\varepsilon , \delta) \) -differentially private by increasing the sample size by a factor of roughly \( 1/\varepsilon \) and running the algorithm on a random subsample. See other works [57, 77] for further details.

    2.3 Online Learning Versus Differentially Private PAC Learning

    Since the Littlestone dimension characterizes online learnability [13, 61], Theorems 2 and 3 imply an equivalence between differentially private PAC learning and online learning.
    Theorem 5 (Private PAC Learning \( \equiv \) Online Prediction).
    The following statements are equivalent for a class \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) :
    (1)
    \( \mathcal {H} \) is online learnable.
    (2)
    \( \mathcal {H} \) is PAC learnable by an (approximate) differentially private algorithm.
    Theorem 5 directly follows from Theorem 2 (which gives \( 2\rightarrow 1 \) ) and Theorem 3 (which gives \( 1\rightarrow 2 \) ). We comment that a quantitative relation between the learning rates and mistake/regret bounds is also implied—for example, in the agnostic setting, it is known that the optimal regret bound for \( \mathcal {H} \) is \( \tilde{\Theta }_d(\sqrt {T}) \) , where the \( \tilde{\Theta }_d \) conceals a constant that depends on the Littlestone dimension of \( \mathcal {H} \) [13]. Similarly, we get that the optimal sample complexity of agnostically privately learning \( \mathcal {H} \) is \( \Theta _d(\frac{\log ({1}/(\beta \delta))}{\alpha ^2\varepsilon }) \) .
    We remark, however, that the preceding equivalence is mostly interesting from a theoretical perspective, and should not be regarded as an efficient transformation between online and private learning. Indeed, the Littlestone dimension dependencies concealed by the \( \tilde{\Theta }_d(\cdot) \) in the preceding bounds on the regret and sample complexities may be quite different from one another. For example, there are classes for which the \( \Theta _d(\frac{\log ({1}/(\beta \delta))}{\alpha \varepsilon }) \) bound hides a \( \mathrm{poly}(\log ^*(d)) \) dependence, and the \( \tilde{\Theta }_d(\sqrt {T}) \) bound hides a \( \Theta (d) \) dependence. One example that attains both of these dependencies is the class of thresholds over a linearly ordered domain of size \( 2^d \) [53].

    2.3.1 Global Stability.

    Our proof of Theorem 3 hinges on an intermediate property that we call global stability.
    Definition 6 (Global Stability).
    Let \( n\in \mathbb {N} \) be a sample size and \( \eta \gt 0 \) be a global stability parameter. An algorithm \( \mathcal {A} \) is \( (n,\eta) \) -globally stable with respect to a distribution \( \mathcal {D} \) if there exists a hypothesis \( h \) such that
    \( \begin{equation*} \Pr _{S\sim \mathcal {D}^n}[\mathcal {A}(S) = h] \ge \eta . \end{equation*} \)
    Although global stability is a rather strong property, it holds automatically for learning algorithms using a finite hypothesis class. By an averaging argument, every learner using \( n \) samples that produces a hypothesis in a finite hypothesis class \( \mathcal {H} \) is \( (n, 1/|\mathcal {H}|) \) -globally stable. The following proposition generalizes “Occam’s Razor” for finite hypothesis classes to show that global stability is enough to imply similar generalization bounds in the realizable setting.
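    As a quick illustration of the averaging argument, one can empirically estimate a learner's global stability by resampling. The harness below is our own toy sketch (the names `A` and `sample_S` are hypothetical), not a construction from the article.

```python
from collections import Counter

def most_frequent_output(A, sample_S, trials=5000):
    """Monte Carlo estimate of max_h Pr_{S ~ D^n}[A(S) = h] (Definition 6).
    `sample_S` draws a fresh sample S ~ D^n; `A` must return a hashable
    description of its output hypothesis. For an ERM rule over a finite
    class H, the averaging argument guarantees that the true maximum
    frequency is at least 1/|H|."""
    counts = Counter(A(sample_S()) for _ in range(trials))
    h, c = counts.most_common(1)[0]
    return h, c / trials
```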
    Proposition 7 (Global Stability \( \Rightarrow \) Generalization).
    Let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) be a class, and assume that \( \mathcal {A} \) is a consistent learner for \( \mathcal {H} \) (i.e., \( \operatorname{loss}_S(\mathcal {A}(S))=0 \) for every realizable sample \( S \) ). Let \( \mathcal {D} \) be a realizable distribution such that \( \mathcal {A} \) is \( (n,\eta) \) -globally stable with respect to \( \mathcal {D} \) , and let \( h \) be a hypothesis such that \( \Pr _{S\sim \mathcal {D}^n}[A(S) = h] \ge \eta \) , as guaranteed by the definition of global stability. Then,
    \( \begin{equation*} \operatorname{loss}_\mathcal {D}(h) \le \frac{\ln (1/\eta)}{n}. \end{equation*} \)
    Proof.
    Let \( \alpha \) denote the loss of \( h \) (i.e., \( \operatorname{loss}_\mathcal {D}(h) = \alpha \) ), and let \( E_1 \) denote the event that \( h \) is consistent with the input sample \( S \) . Thus, \( \Pr [E_1] = (1-\alpha)^n \) . Let \( E_2 \) denote the event that \( \mathcal {A}(S)=h \) . By assumption, \( \Pr [E_2]\ge \eta \) . Now, since \( \mathcal {A} \) is consistent we get that \( E_2\subseteq E_1 \) , and hence that \( \eta \le (1-\alpha)^n \) . This finishes the proof (using the fact that \( 1-\alpha \le e^{-\alpha } \) and taking the logarithm of both sides).□
    Another way to view global stability is in the context of pseudo-deterministic algorithms [39]. A pseudo-deterministic algorithm is a randomized algorithm that yields some fixed output with high probability. Thinking of a realizable distribution \( \mathcal {D} \) as an instance to which the PAC-learning algorithm has oracle access, a globally stable learner is one that is “weakly” pseudo-deterministic in that it produces some fixed output with probability bounded away from zero. A different model of pseudo-deterministic learning, in the context of learning from membership queries, was defined and studied by Oliveira and Santhanam [70].
    We prove Theorem 3 by constructing, for a given Littlestone class \( \mathcal {H} \) , an algorithm \( \mathcal {A} \) that is globally stable with respect to every realizable distribution.

    2.4 Boosting for Approximate DP

    Our characterization of private learnability in terms of the Littlestone dimension has new consequences for boosting the privacy and accuracy guarantees of differentially private learners. Specifically, it shows that the existence of a learner with weak (but non-trivial) privacy and accuracy guarantees implies the existence of a learner with any desired privacy and accuracy parameters—in particular, one with \( \delta (n) = \exp (-\Omega (n)) \) .
    Theorem 8.
    There exists a constant \( c \gt 0 \) for which the following holds. Suppose that for some sample size \( n_0 \) there is an \( (\varepsilon _0, \delta _0) \) -differentially private learner \( \mathcal {W} \) for a class \( \mathcal {H} \) satisfying the guarantee
    \( \begin{equation*} \Pr _{S\sim \mathcal {D}^{n_0}}[\operatorname{loss}_{\mathcal {D}}({\mathcal {W}}(S)) \gt \alpha _0 ] \lt \beta _0 \end{equation*} \)
    for \( \varepsilon _0 = 0.1, \alpha _0 = \beta _0 = 1/16 \) , and \( \delta _0 \le c/(n_0^2\log n_0) \) .
    Then there exists a constant \( C_\mathcal {H} \) such that for every \( \alpha , \beta , \varepsilon , \delta \in (0, 1) \) there exists an \( (\varepsilon , \delta) \) -differentially private learner for \( \mathcal {H} \) with
    \( \begin{equation*} \Pr _{S\sim \mathcal {D}^{n}}[\operatorname{loss}_{\mathcal {D}}({\mathcal {A}}(S)) \gt \alpha ] \lt \beta \end{equation*} \)
    whenever \( n \ge C_\mathcal {H}\cdot \log (1/\beta \delta)/\alpha \varepsilon \) .
    Given a weak learner \( \mathcal {W} \) as in the statement of Theorem 8, Theorem 2 implies that \( \operatorname{Ldim}(\mathcal {H}) \) is finite. Hence, Theorem 3 allows us to construct a learner for \( \mathcal {H} \) with arbitrarily small privacy and accuracy parameters, yielding Theorem 8. The constant \( C_{\mathcal {H}} \) in the last line of the theorem statement suppresses a factor depending on \( \operatorname{Ldim}(\mathcal {H}) \) .
    Prior to our work, it was open whether arbitrary learning algorithms satisfying approximate DP could be boosted in this strong a manner. We remark, however, that in the case of pure DP, such boosting can be done algorithmically and efficiently. Specifically, given an \( (\varepsilon _0, 0) \) -differentially private weak learner as in the statement of Theorem 8, one can first apply random sampling to improve the privacy guarantee to \( (p\varepsilon _0, 0) \) -DP at the expense of increasing its sample complexity to roughly \( n_0 /p \) for any \( p \in (0, 1) \) . The Boosting-for-People construction of Dwork et al. [36] (also see the work of Bun et al. [18]) then produces a strong learner by making roughly \( T \approx \log (1/\beta)/\alpha ^2 \) calls to the weak learner. By composition of DP, this gives an \( (\varepsilon , 0) \) -differentially private strong learner with sample complexity roughly \( n_0 \cdot \log (1/\beta)/\alpha ^2\varepsilon \) .
    What goes wrong if we try to apply this argument using an \( (\varepsilon _0, \delta _0) \) -differentially private weak learner? Random sampling still gives a \( (p\varepsilon _0, p\delta _0) \) -differentially private weak learner with sample complexity \( n_0 / p \) . However, this is not sufficient to improve the \( \delta \) parameter of the learner as a function of the number of samples \( n \) . Thus, the strong learner one obtains using Boosting-for-People still at best guarantees \( \delta (n) = \tilde{O}(1/n^2) \) . Meanwhile, Theorem 8 shows that the existence of a \( (0.1, \tilde{O}(1/n^2)) \) -differentially private learner for a given class implies the existence of a \( (0.1, \exp (-\Omega (n))) \) -differentially private learner for that class.
    We leave it as an interesting open question to determine whether this kind of boosting for approximate DP can be done algorithmically.

    2.5 Related and Subsequent Work

    In this work, we determine that the (approximately) differentially privately learnable classes are exactly those that are online learnable. We note that PAC learnability under the much stronger constraint of pure differential privacy has already been characterized by several natural parameters such as the probabilistic representation dimension [12] and one-way communication complexity [38]. These characterizations even imply nearly tight bounds on the optimal sample complexity. This is in contrast with the equivalence derived in this work, whose implied upper and lower bounds on the sample complexity are extremely far apart.
    Subsequent to our work, Ghazi et al. [40] gave a significantly improved upper bound of \( \tilde{O}(d^6) \) on the sample complexity of learning any class with Littlestone dimension \( d \) . Moreover, their learning algorithm is proper. There is still an enormous gap between this and our lower bound of \( \Omega (\log ^* d) \) , but both the upper and lower bound are within polynomial factors of the best possible sample complexity bounds that depend only on the Littlestone dimension. Thus, despite the fact that DP learnability is characterized by the finiteness of the Littlestone dimension, it remains wide open to find meaningful quantitative bounds on the sample complexity of DP learning. This is discussed in more detail in Section 6, where we suggest directions for future work.
    Subsequent work has also extended the connection between online learning, global stability, and private learning to settings beyond binary classification. The private learnability of Littlestone classes has been studied in multiclass classification [20, 50], real-valued classification (regression) [42, 50], quantum state learning [6], and the online private learning model [43].
    Ghazi et al. [41] used a generalization of global stability to derive private learning algorithms for datasets where each individual contributes multiple samples. Global stability is also related to a definition of reproducibility for machine learning algorithms put forth by Impagliazzo et al. [48].
    Finally, several works have studied the question of whether computationally efficient reductions exist between online and private learning. Gonen et al. [44] gave an efficient compiler from low sample complexity pure private learners to online learners, whereas Bun [17] showed that under cryptographic assumptions, such a reduction cannot exist in general.

    3 Preliminaries

    3.1 PAC Learning

    We use standard notation from statistical learning (e.g., see [74]). Let \( X \) be any “domain” set and consider the “label” set \( Y=\lbrace \pm 1\rbrace \) . A hypothesis is a function \( h : X\rightarrow Y \) , which we alternatively write as an element of \( Y^X \) . An example is a pair \( (x, y) \in X\times Y \) . A sample \( S \) is a finite sequence of examples. We also use the following notation: for samples \( S,T \) , let \( S\circ T \) denote the combined sample obtained by appending \( T \) to the end of \( S \) .
    Definition 9 (Population and Empirical Loss).
    Let \( \mathcal {D} \) be a distribution over \( X \times \lbrace \pm 1\rbrace \) . The population loss of a hypothesis \( h : X \rightarrow \lbrace \pm 1\rbrace \) is defined by
    \( \begin{equation*} \operatorname{loss}_{\mathcal {D}}(h) = \Pr _{(x, y) \sim \mathcal {D}}[h(x) \ne y]. \end{equation*} \)
    Let \( S=((x_i,y_i))_{i=1}^n \) be a sample. The empirical loss of \( h \) with respect to \( S \) is defined by
    \( \begin{equation*} \operatorname{loss}_{S}(h) = \frac{1}{n}\sum _{i=1}^n1[h(x_i)\ne y_i]. \end{equation*} \)
    Let \( \mathcal {H}\subseteq Y^X \) be a hypothesis class. A sample \( S \) is said to be realizable by \( \mathcal {H} \) if there is \( h\in \mathcal {H} \) such that \( \operatorname{loss}_S(h)=0 \) . A distribution \( \mathcal {D} \) is said to be realizable by \( \mathcal {H} \) if there is \( h\in \mathcal {H} \) such that \( \operatorname{loss}_\mathcal {D}(h)=0 \) . A learning algorithm \( A \) is a (possibly randomized) mapping taking input samples to output hypotheses. We denote by \( A(S) \) the distribution over hypotheses induced by the algorithm when the input sample is \( S \) . We say that \( A \) learns8 a class \( \mathcal {H} \) with \( \alpha \) -error, \( (1-\beta) \) -confidence, and sample complexity \( m \) if for every realizable distribution \( \mathcal {D} \) ,
    \( \begin{equation*} \Pr _{S\sim \mathcal {D}^m,~h\sim A(S)}[\operatorname{loss}_\mathcal {D}(h) \gt \alpha ] \le \beta . \end{equation*} \)
    For brevity, if \( A \) is a learning algorithm with \( \alpha \) -error and \( (1-\beta) \) -confidence, we will say that \( A \) is an \( (\alpha ,\beta) \) -accurate learner.

    3.2 Online Learning

    Littlestone dimension. The Littlestone dimension is a combinatorial parameter that captures mistake and regret bounds in online learning [13, 61].9 Its definition uses the notion of mistake trees. A mistake tree is a binary decision tree whose internal nodes are labeled by elements of \( X \) . Any root-to-leaf path in a mistake tree can be described as a sequence of examples \( (x_1,y_1),\ldots ,(x_d,y_d) \) , where \( x_i \) is the label of the \( i \) ’th internal node in the path, and \( y_i=+1 \) if the \( (i+1) \) ’th node in the path is the right child of the \( i \) ’th node and \( y_i = -1 \) otherwise. We say that a mistake tree \( T \) is shattered by \( \mathcal {H} \) if for any root-to-leaf path \( (x_1,y_1),\ldots ,(x_d,y_d) \) in \( T \) there is an \( h\in \mathcal {H} \) such that \( h(x_i)=y_i \) for all \( i\le d \) (Figure 1). The Littlestone dimension of \( \mathcal {H} \) , denoted \( \operatorname{Ldim}(\mathcal {H}) \) , is the depth of the largest complete mistake tree that is shattered by \( \mathcal {H} \) . We say that \( \mathcal {H} \) is a Littlestone class if it has a finite Littlestone dimension.
    Fig. 1. A tree shattered by the class \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^8 \) that contains the threshold functions \( t_i \) , where \( t_i(j)=+1 \) if and only if \( i\le j \) .
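    To make the recursive structure behind mistake trees concrete, here is a small brute-force sketch of our own (not from the article) that computes the Littlestone dimension of a finite class over a finite domain and checks the threshold example of Figure 1; it uses the standard recursion \( \operatorname{Ldim}(\mathcal {H}) = \max _x \bigl (1 + \min _b \operatorname{Ldim}(\mathcal {H}^b(x))\bigr) \) over points \( x \) on which \( \mathcal {H} \) is split.

```python
def ldim(H, X):
    """Littlestone dimension of a finite class H over a finite domain X.
    Hypotheses are dicts mapping each x in X to +1 or -1. Exponential-time
    brute force, intended only for small illustrative examples."""
    if len(H) <= 1:
        return len(H) - 1                   # empty class: -1, singleton: 0
    best = 0
    for x in X:
        pos = [h for h in H if h[x] == +1]
        neg = [h for h in H if h[x] == -1]
        if pos and neg:                     # x can be the root of a mistake tree
            best = max(best, 1 + min(ldim(pos, X), ldim(neg, X)))
    return best

# Thresholds over {1, ..., 8}, as in Figure 1: t_i(j) = +1 iff i <= j.
X = list(range(1, 9))
H = [{j: (+1 if i <= j else -1) for j in X} for i in range(1, 9)]
print(ldim(H, X))   # 3: a complete depth-3 mistake tree is shattered
```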
    Littlestone dimension and thresholds. Recently, Chase and Freitag [25] noticed that the Littlestone dimension coincides with a model-theoretic measure of complexity—Shelah’s 2-rank.
    A classical theorem of Shelah connects bounds on 2-rank (Littlestone dimension) to bounds on the so-called order property in model theory. The order property corresponds naturally to the concept of thresholds. Let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) be an hypothesis class. We say that \( \mathcal {H} \) contains \( k \) thresholds if there are \( x_1,\ldots ,x_k\in X \) and \( h_1,\ldots ,h_k\in \mathcal {H} \) such that \( h_i(x_j) = 1 \) if and only if \( i\le j \) for all \( i,j\le k \) .
    Shelah’s result (part of the so-called Unstable Formula Theorem10) [47, 76], which we use in the following translated form, provides a simple and elegant connection between Littlestone dimension and thresholds.
    Theorem 10 (Littlestone dimension and thresholds [47, 76]).
    Let \( \mathcal {H} \) be a hypothesis class. Then:
    (1)
    If \( \operatorname{Ldim}(\mathcal {H})\ge d \) , then \( \mathcal {H} \) contains \( \lfloor \log d\rfloor \) thresholds.
    (2)
    If \( \mathcal {H} \) contains \( d \) thresholds, then \( \operatorname{Ldim}(\mathcal {H})\ge \lfloor \log d\rfloor \) .
    For completeness, we provide a combinatorial proof of Theorem 10 in Section A.
    In the context of model theory, Theorem 10 is used to establish an equivalence between finite Littlestone dimension and stable theories. It is interesting to note that an analogous connection between so-called NIP theories and VC dimension was observed earlier and pointed out by Laskowski [59]; this in turn led to results in learning theory, particularly within the context of compression schemes [62], but also to some of the first polynomial bounds on the VC dimension of sigmoidal neural networks [56].
    Mistake bound and the Standard Optimal Algorithm. The simplest setting in which learnability is captured by the Littlestone dimension is called the mistake-bound model [61]. Let \( \mathcal {H}\subseteq \lbrace \pm 1\rbrace ^X \) be a fixed hypothesis class known to the learner. The learning process takes place in a sequence of trials, where the order of events in each trial \( t \) is as follows:
    The learner receives an instance \( x_t\in X \) ,
    The learner responds with a prediction \( \hat{y}_t\in \lbrace \pm 1\rbrace \) , and
    The learner is told whether or not the response was correct.
    We assume that the examples given to the learner are realizable in the following sense: for the entire sequence of trials, there is a hypothesis \( h\in \mathcal {H} \) such that \( y_t = h(x_t) \) for every instance \( x_t \) and correct response \( y_t \) . An algorithm in this model learns \( \mathcal {H} \) with mistake bound \( M \) if for every realizable sequence of examples presented to the learner it makes a total of at most \( M \) incorrect predictions.
    Littlestone [61] showed that the minimum mistake bound achievable by any online learner is exactly \( \operatorname{Ldim}(\mathcal {H}). \) Furthermore, he described an explicit algorithm, called the Standard Optimal Algorithm ( \( \mathsf {SOA} \) ), which achieves this optimal mistake bound.
    Standard Optimal Algorithm ( \( \mathsf {SOA} \) )
    (1)
    Initialize \( \mathcal {H}_1 = \mathcal {H} \) .
    (2)
    For trials \( t = 1, 2, \dots \) :
    (i)
    For each \( b \in \lbrace \pm 1\rbrace \) and \( x \in X \) , let \( \mathcal {H}_t^b(x) = \lbrace h \in \mathcal {H}_t : h(x) = b\rbrace \) . Define \( h_t : X \rightarrow \lbrace \pm 1\rbrace \) by \( h_t(x) = \operatorname{argmax}_b \operatorname{Ldim}(\mathcal {H}_t^{b}(x)) \) .
    (ii)
    Receive instance \( x_t \) .
    (iii)
    Predict \( \hat{y}_t = h_t(x_t) \) .
    (iv)
    Receive correct response \( y_t \) .
    (v)
    Update \( \mathcal {H}_{t+1} = \mathcal {H}_t^{y_t}(x_t) \) .
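    The listing above translates directly into code. Below is a minimal sketch of ours for finite classes, reusing the `ldim` function from the earlier sketch in this subsection; on any realizable stream it makes at most \( \operatorname{Ldim}(\mathcal {H}) \) mistakes.

```python
def soa_predict(Ht, X, x):
    """SOA prediction rule: predict the label b whose version space
    H_t^b(x) has the larger Littlestone dimension."""
    return max((+1, -1), key=lambda b: ldim([h for h in Ht if h[x] == b], X))

def soa(H, X, stream):
    """Run the SOA on a realizable stream of (x, y) examples and
    return the number of mistakes (at most Ldim(H))."""
    Ht, mistakes = list(H), 0
    for x, y in stream:
        if soa_predict(Ht, X, x) != y:
            mistakes += 1
        Ht = [h for h in Ht if h[x] == y]   # keep consistent hypotheses
    return mistakes
```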
    Extending the \( \mathsf {SOA} \) to non-realizable sequences. Our globally stable learner for Littlestone classes will make use of an optimal online learner in the mistake bound model. For concreteness, we pick the \( \mathsf {SOA} \) (any other optimal algorithm will also work). It will be convenient to extend the \( \mathsf {SOA} \) to sequences that are not necessarily realizable by a hypothesis in \( \mathcal {H} \) . We will use the following simple extension of the \( \mathsf {SOA} \) to non-realizable samples.
    Definition 11 (Extending the SOA to Non-realizable Sequences).
    Consider a run of the \( \mathsf {SOA} \) on examples \( (x_1,y_1),\ldots , (x_m,y_m) \) , and let \( h_t \) denote the predictor used by the \( \mathsf {SOA} \) after seeing the first \( t \) examples (i.e., \( h_t \) is the rule used by the \( \mathsf {SOA} \) to predict in the \( (t+1) \) ’st trial). Then, after observing \( x_{t+1} \) and \( y_{t+1} \) , do the following:
    If the sequence \( (x_1,y_1),\ldots , (x_{t+1},y_{t+1}) \) is realizable by some \( h\in \mathcal {H}, \) then apply the usual update rule of the \( \mathsf {SOA} \) to obtain \( h_{t+1} \) .
    Else, set \( h_{t+1} \) as follows: \( h_{t+1}(x_{t+1}) = y_{t+1} \) , and \( h_{t+1}(x)=h_t(x) \) for every \( x\ne x_{t+1} \) .
    Thus, upon observing a non-realizable sequence, this update rule locally updates the maintained predictor \( h_t \) to agree with the last example.
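    Continuing the finite-class sketch from above (again ours, with `soa_predict` as defined there), the extended update of Definition 11 can be written as follows:

```python
def extended_soa(H, X, examples):
    """Extended SOA (Definition 11): behave like the SOA while the observed
    prefix is realizable by H; afterwards, patch the maintained predictor
    locally at each new point. Returns the final predictor as a dict."""
    Ht = list(H)
    h = {x: soa_predict(Ht, X, x) for x in X}
    realizable = True
    for xt, yt in examples:
        if realizable and any(g[xt] == yt for g in Ht):
            Ht = [g for g in Ht if g[xt] == yt]        # usual SOA update
            h = {x: soa_predict(Ht, X, x) for x in X}
        else:
            realizable = False
            h[xt] = yt                                 # local patch only
    return h
```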

    3.3 Differential Privacy

    We use standard definitions and notation from the DP literature. For more background, see surveys found elsewhere [34, 77]. For \( a, b, \varepsilon , \delta \in [0, 1], \) let \( a\approx _{\varepsilon ,\delta } b \) denote the statement
    \( \begin{equation*} a\le e^{\varepsilon }b + \delta ~\text{ and }~ b\le e^\varepsilon a + \delta . \end{equation*} \)
    We say that two probability distributions \( p,q \) are \( (\varepsilon ,\delta) \) -indistinguishable if \( p(E) \approx _{\varepsilon ,\delta } q(E) \) for every event \( E \) .
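    For distributions over a finite outcome space, \( (\varepsilon ,\delta) \)-indistinguishability can be checked directly, since the worst event in each direction consists of the outcomes where one density exceeds \( e^\varepsilon \) times the other. The following is a small sketch of ours (not code from the article):

```python
import math

def hockey_stick(p, q, eps):
    """sup over events E of p(E) - e^eps * q(E), attained at
    E = {w : p(w) > e^eps * q(w)}; p, q map outcomes to probabilities."""
    outcomes = set(p) | set(q)
    return sum(max(p.get(w, 0.0) - math.exp(eps) * q.get(w, 0.0), 0.0)
               for w in outcomes)

def indistinguishable(p, q, eps, delta):
    """(eps, delta)-indistinguishability as defined above, in both directions."""
    return hockey_stick(p, q, eps) <= delta and hockey_stick(q, p, eps) <= delta
```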
    Definition 12 (Private Learning Algorithm).
    A randomized algorithm
    \( \begin{equation*} A: (X\times \lbrace \pm 1\rbrace)^m \rightarrow \lbrace \pm 1\rbrace ^X \end{equation*} \)
    is \( (\varepsilon ,\delta) \) -differentially private if for every two samples \( S,S^{\prime }\in (X\times \lbrace \pm 1\rbrace)^n \) that disagree on a single example the output distributions \( A(S) \) and \( A(S^{\prime }) \) are \( (\varepsilon ,\delta) \) -indistinguishable.
    We emphasize that \( (\varepsilon , \delta) \) -indistinguishability must hold for every such pair of samples, even if they are not generated according to a (realizable) distribution.
    The parameters \( \varepsilon ,\delta \) are usually treated as follows: \( \varepsilon \) is a small constant (say 0.1), and \( \delta \) is negligible, \( \delta = n^{-\omega (1)} \) , where \( n \) is the input sample size. The case of \( \delta =0 \) is also referred to as pure DP. Thus, a class \( \mathcal {H} \) is privately learnable if it is PAC learnable by an algorithm \( A \) that is \( (\varepsilon (n),\delta (n)) \) -differentially private with \( \varepsilon (n) \le 0.1 \) , and \( \delta (n) \le n^{-\omega (1)} \) .
    We will use the following corollary of the Basic Composition Theorem from DP (e.g., see Theorem 3.16 in the work of Dwork and Roth [35]).
    Lemma 13 (Composition [31, 32]).
    If \( p,q \) are \( (\varepsilon ,\delta) \) -indistinguishable, then for all \( k\in \mathbb {N} \) , \( p^k \) and \( q^k \) are \( (k\varepsilon ,k\delta) \) -indistinguishable, where \( p^k,q^k \) are the k-fold products of \( p,q \) (i.e., corresponding to \( k \) independent samples).
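    Lemma 13 can be checked numerically on a toy example. The snippet below is our own illustration (reusing `indistinguishable` from the sketch above, with a tiny \( \delta \) slack to absorb floating-point error); it verifies the lemma for the k-fold products of a randomized-response pair:

```python
import itertools
import math

# Randomized response: p and q are (ln 3, 0)-indistinguishable.
p = {1: 0.75, 0: 0.25}
q = {1: 0.25, 0: 0.75}
eps = math.log(3)

def k_fold(r, k):
    """k-fold product distribution r^k over k-tuples of outcomes."""
    return {ws: math.prod(r[w] for w in ws)
            for ws in itertools.product(r, repeat=k)}

for k in (1, 2, 3):
    pk, qk = k_fold(p, k), k_fold(q, k)
    # Lemma 13: p^k and q^k are (k*eps, k*delta)-indistinguishable.
    print(k, indistinguishable(pk, qk, k * eps, 1e-12))   # all True
```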
    Private empirical learners. For the proof of Theorem 1, it will be convenient to consider the following task of minimizing the empirical loss.
    Definition 14 (Empirical Learner).
    Algorithm \( A \) is an \( (\alpha ,\beta) \) -accurate empirical learner for a hypothesis class \( \mathcal {H} \) with sample complexity \( m \) if for every \( h\in \mathcal {H} \) and for every sample \( S=((x_1,h(x_1)),\ldots , (x_m,h(x_m)))\in \left(X\times \lbrace \pm 1\rbrace \right)^m \) the algorithm \( A \) outputs a function \( f \) satisfying
    \( \begin{equation*} \Pr _{f\sim A(S)}(\operatorname{loss}_{S}(f) \le \alpha)\ge 1-\beta . \end{equation*} \)
    This task is simpler to handle than PAC learning, which is a distributional loss minimization task. Replacing PAC learning by this task does not lose generality; this is implied by the following result of Bun et al. [22].
    Lemma 15 ([22], Lemma 5.9).
    Suppose \( \varepsilon \lt 1 \) and \( A \) is an \( (\varepsilon ,\delta) \) -differentially private \( (\alpha ,\beta) \) -accurate learning algorithm for a hypothesis class \( \mathcal {H} \) with sample complexity \( m \) . Then there exists an \( (\varepsilon ,\delta) \) -differentially private \( (\alpha ,\beta) \) -accurate empirical learner for \( \mathcal {H} \) with sample complexity \( 9m \) .

    3.4 Additional Notation

    A sample \( S \) of even length is called balanced if half of its labels are \( +1 \) ’s and half are \( -1 \) ’s.
    For a sample \( S \) , let \( S_X \) denote the underlying set of unlabeled examples: \( S_X = \lbrace x \vert (\exists y): (x,y)\in S\rbrace \) . Let \( A \) be a randomized learning algorithm. It will be convenient to associate with \( A \) and \( S \) the function \( A_S:X\rightarrow [0,1] \) defined by
    \( \begin{equation*} A_S(x) = \Pr _{h\sim A(S)}\bigl [h(x)=1\bigr ]. \end{equation*} \)
    Intuitively, this function represents the average hypothesis outputted by \( A \) when the input sample is \( S \) .
    For the next definitions assume that the domain \( X \) is linearly ordered. Let \( S=((x_i,y_i))_{i=1}^{m} \) be a sample. We say that \( S \) is increasing if \( x_1\lt x_2\lt \cdots \lt x_m \) . For \( x\in X, \) define \( {\text{ord}}_S(x) \) by \( \vert \lbrace i \vert x_i \le x\rbrace \vert \) . Note that the set of points \( x\in X \) with the same \( {\text{ord}}_S(x) \) form an interval whose endpoints are two consecutive examples in \( S \) (consecutive with respect to the order on \( X \) , i.e., there is no example \( x_i \) between them).
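    These sample predicates are straightforward to state in code; the helpers below are our own illustrative sketch:

```python
def is_increasing(S):
    """S is a sample ((x_1, y_1), ..., (x_m, y_m)) over an ordered domain."""
    xs = [x for x, _ in S]
    return all(a < b for a, b in zip(xs, xs[1:]))

def is_balanced(S):
    """Even-length sample with equally many +1 and -1 labels."""
    ys = [y for _, y in S]
    return len(ys) % 2 == 0 and ys.count(+1) == ys.count(-1)

def ord_S(S, x):
    """ord_S(x) = |{i : x_i <= x}|; note it is constant on each interval
    between consecutive sample points."""
    return sum(1 for xi, _ in S if xi <= x)
```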
    The tower function \( \mathsf {twr}_k(x) \) is defined by the recursion
    \( \begin{equation*} \mathsf {twr}_i(x) = {\left\lbrace \begin{array}{ll}x &i = 1,\\ 2^{\mathsf {twr}_{i-1}(x)} &i\gt 1. \end{array}\right.} \end{equation*} \)
    The iterated logarithm \( \log ^{(k)}(x) \) is defined by the recursion
    \( \begin{equation*} \log ^{(i)} x = {\left\lbrace \begin{array}{ll}\log x &i = 1,\\ \log ^{(i-1)}(\log x) &i\gt 1. \end{array}\right.} \end{equation*} \)
    The function \( \log ^*x \) equals the number of times the logarithm must be applied before the result is less than or equal to 1. It is defined by the recursion
    \( \begin{equation*} \log ^* x = {\left\lbrace \begin{array}{ll}0 &x\le 1,\\ 1 + \log ^*\log x &x\gt 1. \end{array}\right.} \end{equation*} \)
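    The three recursions translate directly into code; a short sketch of ours:

```python
import math

def twr(k, x):
    """Tower function: twr_1(x) = x, twr_k(x) = 2 ** twr_{k-1}(x)."""
    return x if k == 1 else 2 ** twr(k - 1, x)

def log_iter(i, x):
    """Iterated logarithm: log^(1) x = log x, log^(i) x = log^(i-1)(log x)."""
    return math.log2(x) if i == 1 else log_iter(i - 1, math.log2(x))

def log_star(x):
    """log* x: how many times log must be applied before the result is <= 1."""
    return 0 if x <= 1 else 1 + log_star(math.log2(x))

print(twr(5, 1), log_star(twr(5, 1)))   # 65536 4: log* grows extremely slowly
```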

    4 Private Learning Implies Finite Littlestone Dimension

    In this section, we prove that every class \( \mathcal {H} \) that can be PAC learned by a DP algorithm has a finite Littlestone dimension. This is achieved by establishing a lower bound on the sample complexity of privately learning \( \mathcal {H} \) that depends on its Littlestone dimension (Theorem 2). The crux of this lower bound lies in Theorem 1, which provides a lower bound for the task of privately learning 1-dimensional thresholds. This section is organized as follows. In Section 4.1, we provide an overview of the proof. Then, in Sections 4.2 and 4.3, we prove Theorems 1 and 2.

    4.1 Proof Overview

    The starting point of the proof is Theorem 10, which asserts that if \( \mathcal {H} \) has Littlestone dimension \( d \) , then it contains, as a subclass, some \( \lfloor \log d\rfloor \) thresholds. In other words, the class of thresholds is “complete” in the sense that a lower bound on the sample complexity of DP learning thresholds yields a lower bound for classes with a large Littlestone dimension.
    Thus, consider an arbitrary differentially private algorithm \( A \) that learns the class of thresholds over an ordered domain \( X \) of size \( n \) . Our goal is to show a lower bound of \( \Omega (\log ^* n) \) on the sample complexity of \( A \) . A central challenge in the proof emerges because \( A \) may be improper and output arbitrary hypotheses (this is in contrast with proving impossibility results for proper algorithms where the structure of the learned class can be exploited).
    The proof consists of two parts. The first part handles the preceding challenge by showing that for any algorithm (in fact, for any mapping that takes input samples to output hypotheses) there is a large subset of the domain that is homogeneous with respect to the algorithm. This notion of homogeneity places useful restrictions on the algorithm on input samples from the homogeneous set. The second part of the argument utilizes the homogeneity of \( X^{\prime }\subseteq X \) to derive a lower bound on the sample complexity of the algorithm in terms of \( \vert X^{\prime } \vert \) .
    We note that the Ramsey argument in the first part is quite general: it does not use the definition of DP and could perhaps be useful in other sample complexity lower bounds. It is also worth noting that a Ramsey-based argument was used by Bun [23] to prove a weaker lower bound for DP learning of thresholds in the proper case. In contrast to the first part, the second (and more technical) part of the proof is tailored specifically to the definition of DP. We next outline each of these two parts.
    Reduction to homogeneous sets. As discussed earlier, the first step in the proof is about identifying a large homogeneous subset of the input domain \( X \) on which we can control the output of \( A \) . To define homogeneity, recall from Section 3.4 that a sample \( S=((x_i,y_i))_{i=1}^{m} \) of an even length is called balanced if half of its labels are \( +1 \) ’s and half are \( -1 \) ’s, and that \( S \) is said to be increasing if \( x_1\lt x_2\lt \cdots \lt x_m \) . Now, a subset \( X^{\prime }\subseteq X \) is called homogeneous with respect to \( A \) if there is a list of numbers \( p_0,p_1,\ldots ,p_m \) such that for every increasing balanced sample \( S \) of points from \( X^{\prime } \) and for every \( x^{\prime } \) from \( X^{\prime } \) with \( {\text{ord}}_S(x^{\prime }) = i \) ,
    \( \begin{equation*} \vert A_S(x^{\prime }) - p_i\vert \le \gamma , \end{equation*} \)
    where \( \gamma \) is sufficiently small. For simplicity, in this proof overview we will assume that \( \gamma =0 \) . (In the proof, \( \gamma \) is some \( O(1/m) \) ; see Definition 16.) So, for example, if \( A \) is deterministic, then \( h=A(S) \) is constant over each of the intervals defined by consecutive examples from \( S \) . Figure 2 presents an illustration.
    Fig. 2. Depiction of two possible outputs of an algorithm over a homogeneous set, given two input samples from the set (marked in red). The numbers \( p_i \) denote, for a given point \( x \) , the probability that \( h(x)=1 \) , where \( h\sim A(S) \) is the hypothesis outputted by the algorithm on input sample \( S \) . These probabilities depend (up to a small additive error) only on the interval that \( x \) belongs to. In the figure, the fourth example of the input is changed—this affects only the corresponding interval and not the values of the \( p_i \) ’s (again, up to a small additive error).
    The derivation of a large homogeneous set follows by a standard application of the Ramsey theorem for hypergraphs using an appropriate coloring (Lemma 17).
    Lower bound for homogeneous algorithms. We next assume that \( X^{\prime }=\lbrace 1,\ldots ,k\rbrace \) is a large homogeneous set with respect to \( A \) (with \( \gamma =0 \) ). We will obtain a lower bound on the sample complexity of \( A \) , denoted by \( m \) , by constructing a family \( \mathcal {P} \) of distributions such that (i) on the one hand, \( \vert \mathcal {P}\vert \le 2^{\tilde{O}(m^2)} \) , and (ii) on the other hand, \( \vert \mathcal {P}\vert \ge \Omega (k) \) . Combining these inequalities yields a lower bound on \( m \) in terms of \( \vert X^{\prime }\vert =k \) and concludes the proof.
    The construction of \( \mathcal {P} \) proceeds as follows and is depicted in Figure 3: let \( S \) be an increasing balanced sample of points from \( X^{\prime } \) . Using the fact that \( A \) learns thresholds, it is shown that for some \( i_1\lt i_2 \) we have that \( p_{i_1}\le 1/3 \) and \( p_{i_2} \ge 2/3 \) . Thus, by a simple averaging argument, there is some \( i_1\le i \le i_2 \) such that \( p_{i} - p_{i-1} \ge \Omega (1/m) \) .
    Fig. 3. An illustration of the definition of the family \( \mathcal {P} \) . Given a homogeneous set and two consecutive intervals with a gap of at least \( \Omega (1/m) \) between \( p_{i-1} \) and \( p_i \) (here, \( i=4 \) ), the distributions in \( \mathcal {P} \) correspond to the different positions of the \( i \) ’th example, which separates the \( (i-1) \) ’th and the \( i \) ’th intervals.
    The last step in the construction is done by picking an increasing sample \( S \) such that the interval \( (x_{i-1}, x_{i+1}) \) has size \( n=\Omega (k) \) . For \( x\in (x_{i-1}, x_{i+1}) \) , let \( S_x \) denote the sample obtained by replacing \( x_i \) with \( x \) in \( S \) . By restricting the output hypothesis to the interval \( (x_{i-1}, x_{i+1}) \) (which is of size \( n \) ), each output distribution \( A(S_x) \) can be seen as a distribution over the cube \( \lbrace \pm 1\rbrace ^n \) . Thus, the family of distributions \( \mathcal {P} \) consists of all distributions \( P_x=A(S_x) \) for \( x\in (x_{i-1},x_{i+1}) \) . Since \( A \) is private, it follows that \( \mathcal {P} \) has the following two properties:
    \( P_{x^{\prime }}, P_{x^{\prime \prime }}\in \mathcal {P} \) are \( (\varepsilon ,\delta) \) -indistinguishable for all \( x^{\prime },x^{\prime \prime }\in (x_{i-1},x_{i+1}) \) , and
    Put \( r=\frac{p_{i-1} + p_{i}}{2} \) , then for all \( P_x\in \mathcal {P}, \)
    \( \begin{equation*} (\forall x^{\prime }\le n): \Pr _{h\sim P_x}\bigl [h(x^{\prime })=1\bigr ] = {\left\lbrace \begin{array}{ll}r-\Omega (1/m) &x^{\prime }\lt x,\\ r+\Omega (1/m) &x^{\prime }\gt x. \end{array}\right.} \end{equation*} \)
    It remains to show that \( \Omega (k) \le \vert \mathcal {P}\vert \le 2^{\tilde{O}(m^2)} \) . The lower bound follows directly from the definition of \( \mathcal {P} \) . The upper bound requires a more subtle argument: it exploits the composition property for DP (see Lemma 13) via a privacy-breaching “attack” that is based on binary search. This argument appears in Lemma 21, whose proof is self-contained.

    4.2 A Lower Bound for Privately Learning Thresholds

    4.2.1 Proof of Theorem 1.

    The proof uses the following definition of homogeneous sets. Recall the definitions of balanced sample and of an increasing sample—in particular, that a sample \( S=((x_1,y_1),\ldots ,(x_m,y_m)) \) of an even size is realizable (by thresholds), balanced, and increasing if and only if \( x_1\lt x_2\lt \cdots \lt x_m \) and the first half of the \( y_i \) ’s are \( -1 \) and the second half are \( +1 \) .
    Definition 16 (m-Homogeneous Set).
    A set \( X^{\prime }\subseteq X \) is \( m \) -homogeneous with respect to a learning algorithm \( A \) if there are numbers \( p_i\in [0,1] \) , for \( 0\le i\le m \) such that for every increasing balanced realizable sample \( S\in (X^{\prime } \times \lbrace \pm 1\rbrace)^m \) and for every \( x\in X^{\prime }\setminus S_X \) ,
    \( \begin{equation*} \vert A_S(x) - p_i\vert \le \frac{1}{10^2 m}, \end{equation*} \)
    where \( i = {\text{ord}}_S(x) \) . The list \( (p_i)_{i=0}^m \) is referred to as the probabilities list of \( X^{\prime } \) with respect to \( A \) .
    Proof of Theorem 1
    Let \( A \) be a \( (1/16,1/16) \) -accurate learning algorithm that learns the class of thresholds over \( X \) with \( m \) examples and is \( (\varepsilon ,\delta) \) -differentially private with \( \varepsilon =0.1,\delta = \frac{1}{10^3m^2\log m} \) . By Lemma 15, we may assume without loss of generality that \( A \) is an empirical learner with the same privacy and accuracy parameters and sample size that is at most nine times larger.
    Theorem 1 follows from the next two lemmas, which we prove later.
    Lemma 17 (Every Algorithm Has Large Homogeneous Sets).
    Let \( A \) be a (possibly randomized) algorithm that is defined over input samples of size \( m \) over a domain \( X\subseteq R \) with \( \vert X\vert = n \) . Then, there is a set \( X^{\prime }\subseteq X \) that is \( m \) -homogeneous with respect to \( A \) of size
    \( \begin{equation*} \vert X^{\prime }\vert \ge \frac{\log ^{(m)}(n)}{2^{O(m\log m)}}. \end{equation*} \)
    Lemma 17 allows us to focus on a large homogeneous set with respect to \( A \) . The next lemma implies a lower bound in terms of the size of a homogeneous set. For simplicity and without loss of generality, assume that the homogeneous set is \( \lbrace 1,\ldots ,k\rbrace \) .
    Lemma 18 (Large Homogeneous Sets Imply Lower Bounds for Private Learning).
Let \( A \) be a \( (0.1,\delta) \) -differentially private algorithm with sample complexity \( m \) and \( \delta \le \frac{1}{10^3m^2\log m} \) . Let \( X=\lbrace 1,\ldots , k\rbrace \) be \( m \) -homogeneous with respect to \( A \) . If \( A \) empirically learns the class of thresholds over \( X \) with \( (1/16,1/16) \) -accuracy, then
    \( \begin{equation*} k \le 2^{O(m^2\log ^2m)} \end{equation*} \)
    (i.e., \( m \ge \Omega (\tfrac{\sqrt {\log k}}{\log \log k}) \) ).
With these lemmas in hand, Theorem 1 follows by a short calculation: indeed, Lemma 17 implies the existence of a homogeneous set \( X^{\prime } \) with respect to \( A \) of size \( k\ge {\log ^{(m)}(n)}/{2^{O(m\log m)}} \) . We then restrict \( A \) to input samples from the set \( X^{\prime } \) , and by relabeling the elements of \( X^{\prime } \) assume that \( X^{\prime }=\lbrace 1,\ldots ,k\rbrace . \) Lemma 18 then implies that \( k \le 2^{O(m^2\log ^2m)} \) . Together, we obtain that
\( \begin{equation*} \log ^{(m)}(n) \le 2^{c\cdot m^2\log ^2 m} \end{equation*} \)
for some constant \( c \gt 0 \) . Applying the iterated logarithm \( t=\log ^*(2^{c\cdot m^2\log ^2 m}) = \log ^{*}(m)+O(1) \) times to this inequality yields
    \( \begin{equation*} \log ^{(m+t)}(n)=\log ^{(m + \log ^*(m) + O(1))}(n) \le 1, \end{equation*} \)
    and therefore \( \log ^*(n) \le \log ^*(m) + m +O(1) \) , which implies that \( m \ge \Omega (\log ^* n) \) as required.
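To make the bookkeeping with iterated logarithms concrete, the following small Python sketch implements the two quantities appearing in this calculation, \( \log ^{(m)} \) and \( \log ^* \) ; the numbers in the example are illustrative only.
```python
import math

def log_iter(x, times):
    # log^(times)(x): the base-2 logarithm applied `times` times
    for _ in range(times):
        x = math.log2(x)
    return x

def log_star(x):
    # log*(x): how many applications of log2 bring the value down to <= 1
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

# Example: 65536 -> 16 -> 4 -> 2 -> 1, so log*(65536) = 4.
assert log_star(65536) == 4
assert log_iter(65536, 4) <= 1
```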

    4.2.2 Proof of Lemma 17.

We next prove that every learning algorithm has a large homogeneous set. We will use the following quantitative version of the Ramsey theorem due to Erdős and Rado [37] (see also the book by Graham et al. [45] or Theorem 10.1 in the survey by Mubayi and Suk [66]).
    Theorem 19 ([37]).
    Let \( s\gt t\ge 2 \) and \( q \) be integers, and let
    \( \begin{equation*} N\ge \mathsf {twr}_t(3sq\log q). \end{equation*} \)
    Then for every coloring of the subsets of size \( t \) of a universe of size \( N \) using \( q \) colors, there is a homogeneous subset11 of size \( s \) .
Proof of Lemma 17
Define a coloring on the \( (m+1) \) -subsets of \( X \) as follows. Let \( D=\lbrace x_1\lt x_2\lt \cdots \lt x_{m+1}\rbrace \) be an \( (m+1) \) -subset of \( X \) . For each \( i\le m+1, \) let \( D^{-i} = D\setminus \lbrace x_i\rbrace \) , and let \( S^{-i} \) denote the balanced increasing sample on \( D^{-i} \) . Set \( p_{i} \) to be the fraction of the form \( \frac{t}{10^2m} \) that is closest to \( A_{S^{-i}}(x_{i}) \) (in case of ties, pick the smallest such fraction). The color assigned to \( D \) is the list \( (p_1,p_2,\ldots ,p_{m+1}) \) .
    Thus, the total number of colors is \( (10^2m+1)^{(m+1)} \) . By applying Theorem 19 with \( t:=m+1, q:=(10^2m+1)^{(m+1)} \) , and \( N:=n, \) there is a set \( X^{\prime }\subseteq X \) of size
\( \begin{equation*} \vert X^{\prime }\vert \ge \frac{\log ^{(m)}(n)}{{3(10^2m+1)^{m+1} (m+1)\log (10^2m+1)}} = \frac{\log ^{(m)}(n)}{2^{O(m\log m)}} \end{equation*} \)
such that all \( (m+1) \) -subsets of \( X^{\prime } \) have the same color. One can verify that \( X^{\prime } \) is indeed \( m \) -homogeneous with respect to \( A \) .□
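The coloring used in this proof is easy to phrase as code. The following Python sketch computes the color of an \( (m+1) \) -subset; here A_prob is an assumed oracle returning the probability \( A_S(x) \) that the learner labels \( x \) with \( +1 \) (how this probability is estimated is outside the scope of the illustration).
```python
import math

def color_of(D, A_prob, m):
    # Color of the (m+1)-subset D = {x_1 < ... < x_{m+1}}: for each i, build
    # the balanced increasing sample on D \ {x_i}, then round A_{S^{-i}}(x_i)
    # to the nearest fraction t/(100 m), breaking ties toward the smaller t.
    D = sorted(D)
    color = []
    for i, x in enumerate(D):
        rest = D[:i] + D[i + 1:]                      # the set D^{-i}
        half = len(rest) // 2
        S = [(z, -1) for z in rest[:half]] + [(z, +1) for z in rest[half:]]
        t = math.ceil(A_prob(S, x) * 100 * m - 0.5)   # nearest t, ties down
        color.append(t)                               # encodes p_i = t/(100 m)
    return tuple(color)
```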

    4.2.3 Proof of Lemma 18.

The lower bound is proven by using the algorithm \( A \) to construct a family of distributions \( \mathcal {P} \) with certain properties, and then using these properties to derive that \( \Omega (k) \le \vert \mathcal {P}\vert \le 2^{O(m^2\log ^2 m)} \) , which implies the desired lower bound.
    Lemma 20.
Let \( A,X^{\prime },m,k \) be as in Lemma 18, and set \( n=k-m \) . Then there exists a family \( \mathcal {P}=\lbrace P_i : i\le n\rbrace \) of distributions over \( \lbrace \pm 1\rbrace ^n \) with the following properties:
    (1)
Every two \( P_i,P_j\in \mathcal {P} \) are \( (0.1,\delta) \) -indistinguishable.
    (2)
    There exists \( r\in [0,1] \) such that for all \( i,j\le n \) :
    \( \begin{equation*} \Pr _{v\sim P_i}[v(j)=1] = {\left\lbrace \begin{array}{ll}\le r-\frac{1}{10m} &j \lt i,\\ \ge r+\frac{1}{10m} &j\gt i. \end{array}\right.} \end{equation*} \)
    Lemma 21.
Let \( \mathcal {P},n,m,r \) be as in Lemma 20. Then, \( n \le 2^{10^3 m^2\log ^2m} \) .
    By the preceding lemmas, \( k-m = \vert \mathcal {P}\vert \le 2^{10^3m^2\log ^2 m} \) , which implies that \( k=2^{O(m^2\log ^2 m)} \) as required. Thus, it remains to prove these lemmas, which we do next.
For the proof of Lemma 20, we will need the following claim.
    Claim 22.
    Let \( (p_i)_{i=0}^m \) denote the probabilities list of \( X^{\prime } \) with respect to \( A \) . Then for some \( 0 \lt i \le m \) ,
    \( \begin{equation*} p_{i} - p_{i-1} \ge \frac{1}{4m}. \end{equation*} \)
    Proof.
The proof of this claim uses the assumption that \( A \) empirically learns thresholds. Let \( S \) be a balanced increasing realizable sample such that \( S_X=\lbrace x_1\lt \cdots \lt x_m\rbrace \subseteq X^{\prime } \) are evenly spaced points in \( X^{\prime } \) (so, \( S=(x_i,y_i)_{i=1}^m \) , where \( y_i = -1 \) for \( i\le m/2 \) and \( y_i=+1 \) for \( i\gt m/2 \) ).
    \( A \) is an \( (\alpha =1/16,\beta =1/16) \) -empirical learner, and therefore its expected empirical loss on \( S \) is at most \( (1-\beta)\cdot \alpha + \beta \cdot 1\le \alpha + \beta = 1/8 \) , and so
    \( \begin{align*} \frac{7}{8}& \le \mathop{\mathbb {E}_{h\sim A(S)}} (1-\operatorname{loss}_S(h))\\ &= \frac{1}{m}\sum _{i=1}^{m/2} \left[1-A_S(x_i) \right]+\frac{1}{m}\sum _{i=m/2+1}^m \left[ A_S(x_i) \right].\;\;\; \text{since}\; S\; \text{is balanced} \end{align*} \)
This implies that there is \( m/2\le m_1\le m \) such that \( A_S(x_{m_1})\ge 3/4 \) . Next, by privacy, if we consider the neighboring sample \( S^{\prime } \) obtained by replacing \( x_{m_1} \) with \( x_{m_1}+1 \) (with the same label), we have that
    \( \begin{equation*} A_{S^{\prime }}(x_{m_1}) \ge \Bigl (\frac{3}{4} -\delta \Bigr) e^{-0.1} \ge \frac{2}{3}. \end{equation*} \)
Note that \( {\text{ord}}_{S^{\prime }}(x_{m_1}) = m_1-1 \) , hence by homogeneity: \( p_{m_1-1}\ge \frac{2}{3}- \frac{1}{10^2m} \) . Similarly, we can show that for some \( 1 \le m_2\le \frac{m}{2} \) , we have \( p_{m_2-1}\le \frac{1}{3} + \frac{1}{10^2m} \) . Since these two indices are at most \( m \) apart, this implies that for some \( m_2 \le i \le m_1 - 1 \) ,
    \( \begin{equation*} p_{i} - p_{i-1} \ge \frac{1/3}{m} - \frac{1}{50m^2} \ge \frac{1}{4m}, \end{equation*} \)
    as required.□
Proof of Lemma 20. Let \( i \) be the index guaranteed by Claim 22 such that \( p_{i} - p_{i-1}\ge 1/4m \) . Pick a balanced increasing realizable sample \( S\in (X^{\prime }\times \lbrace \pm 1\rbrace)^m \) so that the interval \( J\subseteq X^{\prime } \) between \( x_{i-1} \) and \( x_{i+1} \) ,
    \( \begin{equation*} J = \bigl \lbrace x\in \lbrace 1,\ldots , k\rbrace : x_{i-1} \lt x \lt x_{i+1}\bigr \rbrace , \end{equation*} \)
is of size \( k-m \) . For every \( x\in J, \) let \( S_x \) be the neighboring sample of \( S \) that is obtained by replacing \( x_i \) with \( x \) . This yields a family of neighboring samples \( \lbrace S_x: x\in J\rbrace \) such that
Every two output distributions \( A(S_{x^{\prime }}) \) , \( A(S_{x^{\prime \prime }}) \) are \( (\varepsilon ,\delta) \) -indistinguishable (because \( A \) satisfies \( (\varepsilon ,\delta) \) -DP).
Set \( r = \frac{p_{i-1} + p_i}{2} \) . Then for all \( x,x^{\prime }\in J \) ,
    \( \begin{equation*} \Pr _{h\sim A(S_x) }\bigl [h(x^{\prime })=1\bigr ] = {\left\lbrace \begin{array}{ll}\le r-\frac{1}{10m} & x^{\prime } \lt x,\\ \ge r+\frac{1}{10m} & x^{\prime } \gt x. \end{array}\right.} \end{equation*} \)
The proof is concluded by restricting the output of \( A \) to \( J \) , and identifying \( J \) with \( [n] \) and each output distribution \( A(S_x) \) with a distribution over \( \lbrace \pm 1 \rbrace ^n \) .□
Proof of Lemma 21
    Set \( T=10^3 m^2\log ^2 m - 1 \) , and \( D = 10^2m^2\log T \) . We want to show that \( n\le 2^{T+1} \) . Assume toward contradiction that \( n \gt 2^{T+1} \) . Consider the family of distributions \( Q_i=P_i^D \) for \( i=1,\ldots ,n \) . By Lemma 13, each \( Q_i,Q_j \) is \( (0.1D,\delta D) \) -indistinguishable.
    We next define a set of mutually disjoint events \( E_i \) for \( i\le 2^T \) that are measurable with respect to each of the \( Q_i \) ’s. For a sequence of vectors \( \mathbf {v}=(v_1,\ldots ,v_D) \) in \( \lbrace \pm 1\rbrace ^n, \) we let \( \bar{\mathbf {v}}\in \lbrace \pm 1\rbrace ^n \) be the threshold vector defined by
\( \begin{equation*} \bar{\mathbf {v}}(j) = {\left\lbrace \begin{array}{ll}-1 & \frac{1}{D}\sum _{i=1}^D v_i(j) \lt r, \\ +1 & \frac{1}{D}\sum _{i=1}^D v_i(j) \ge r. \end{array}\right.} \end{equation*} \)
    Given a point in the support of any of the \( Q_i \) ’s, namely a sequence \( \mathbf {v}= (v_1,\ldots , v_{D}) \) of \( D \) vectors in \( \lbrace \pm 1\rbrace ^n, \) define a mapping \( B \) according to the outcome of \( T \) steps of binary search on \( \bar{\mathbf {v}} \) as follows: probe the \( \frac{n}{2} \) ’th entry of \( \bar{\mathbf {v}} \) ; if it is \( +1, \) then continue recursively with the first half of \( \bar{\mathbf {v}} \) . Else, continue recursively with the second half of \( \bar{\mathbf {v}} \) . Define the mapping \( B=B(\mathbf {v}) \) to be the entry that was probed at the \( T \) ’th step. The events \( E_j \) correspond to the \( 2^T \) different outcomes of \( B \) . These events are mutually disjoint by the assumption that \( n \gt 2^{T+1} \) .
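A minimal Python sketch of the rounding-and-search mapping \( B \) may help; the vectors are 0-indexed here, and the orientation of the search follows the rule just described (a \( +1 \) probe sends the search to the lower half).
```python
def threshold_vector(vs, r):
    # vbar(j): compare the empirical frequency of +1 at coordinate j with r
    D, n = len(vs), len(vs[0])
    return [+1 if sum(v[j] for v in vs) / D >= r else -1 for j in range(n)]

def B(vbar, T):
    # T steps of binary search on vbar; returns the entry probed at step T
    lo, hi = 0, len(vbar)
    probe = None
    for _ in range(T):
        probe = (lo + hi) // 2
        if vbar[probe] == +1:
            hi = probe          # hidden index is smaller: first half
        else:
            lo = probe + 1      # hidden index is larger: second half
    return probe
```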
Notice that for any possible \( i \) in the image of \( B \) , applying the binary search to a sufficiently large independent and identically distributed sample \( \mathbf {v} \) from \( P_i \) would yield \( B(\mathbf {v}) = i \) with high probability. Quantitatively, a standard application of the Chernoff inequality and a union bound imply that the event \( E_i= \lbrace \mathbf {v}: B(\mathbf {v})=i\rbrace \) , for \( \mathbf {v}\sim Q_i \) , has probability at least
    \( \begin{equation*} 1 - T\exp \Bigl (-2\frac{1}{10^2m^2}D\Bigr) = 1-T\exp (-2\log T) \ge \frac{2}{3}. \end{equation*} \)
    We claim that for all \( j\le n \) , and \( i \) in the image of \( B \) ,
    \( \begin{equation} Q_j(E_i) \ge \frac{1}{2}\exp (-0.1D). \end{equation} \)
    (1)
    This will finish the proof since the \( 2^T \) events are mutually disjoint, and therefore
    \( \begin{align*} 1 &\ge Q_j(\cup _i E_i)\\ &= \sum _i Q_j(E_i) \\ &\ge 2^T \cdot \frac{1}{2}e^{-0.1 D}\\ & = 2^{T-1} e^{-0.1 D}. \end{align*} \)
    However, \( 2^{T-1} e^{-0.1 D} \gt 1 \) by the choice of \( T,D \) , which is a contradiction.
    Thus, it remains to prove Equation (1). This follows since \( Q_i,Q_j \) are \( (0.1D,D\delta) \) -indistinguishable:
    \( \begin{equation*} \frac{2}{3} \le Q_i(E_i) \le \exp (0.1 D) Q_j(E_i) + D\delta , \end{equation*} \)
together with the choice of \( \delta \) , which guarantees that \( \frac{2}{3}-D\delta \ge \frac{1}{2} \) and hence \( Q_j(E_i) \ge \frac{1}{2}\exp (-0.1D) \) .□

    4.3 Privately Learnable Classes Have Finite Littlestone Dimension

    We conclude this section by deriving Theorem 2, which gives a lower bound of \( \Omega (\log ^* d) \) on the sample complexity of privately learning a class with Littlestone dimension \( d \) .
Proof of Theorem 2
The proof is a direct corollary of Theorems 10 and 1. Indeed, let \( H \) be a class with Littlestone dimension \( d \) , and let \( c= \lfloor \log d\rfloor \) . By Item 1 of Theorem 10, there are \( x_1,\ldots , x_c \) and \( h_1,\ldots , h_c\in H \) such that \( h_i(x_j) = +1 \) if and only if \( j\ge i \) . Theorem 1 implies a lower bound of \( m\ge \Omega (\log ^* c) = \Omega (\log ^* d) \) for any algorithm that learns \( \lbrace h_i : i\le c\rbrace \) with accuracy \( (1/16,1/16) \) and privacy \( (0.1,O(1/(m^2\log m))) \) .□

    5 Finite Littlestone Dimension Implies Private Learning

    In this section, we prove that every Littlestone class \( \mathcal {H} \) is PAC learnable by a DP algorithm (Theorem 3). We begin by providing a proof overview in Section 5.1. Then, in Section 5.2, we prove that every Littlestone class can be learned by a globally stable algorithm, and in Section 5.3 that globally stable algorithms can be transformed to DP algorithms. Finally, in Section 5.4, we wrap up by proving Theorem 3.

    5.1 Proof Overview

We next give an overview of the main arguments used in the proof of Theorem 3. The proof consists of two parts: (i) we first show that every class with a finite Littlestone dimension can be learned by a globally stable algorithm, and (ii) we then show how to generically obtain a differentially private learner from any globally stable learner.

    5.1.1 Step 1: Finite Littlestone Dimension → Globally Stable Learning.

Let \( \mathcal {H} \) be a concept class with Littlestone dimension \( d \) . Our goal is to design a globally stable learning algorithm for \( \mathcal {H} \) with stability parameter \( \eta = 2^{-2^{O(d)}} \) and sample complexity \( n=2^{2^{O(d)}} \) . We will sketch here a weaker variant of our construction that uses the same ideas but is simpler to describe.
The property of \( \mathcal {H} \) that we will use is that it can be online learned in the realizable setting with at most \( d \) mistakes (see Section 3.2 for a brief overview of this setting). Let \( \mathcal {D} \) denote a realizable distribution with respect to which we wish to learn in a globally stable manner. In other words, \( \mathcal {D} \) is a distribution over examples \( (x,c(x)), \) where \( c\in \mathcal {H} \) is an unknown target concept. Let \( \mathcal {A} \) be a learning algorithm that makes at most \( d \) mistakes while learning an unknown concept from \( \mathcal {H} \) in the online model. Consider applying \( \mathcal {A} \) to a sequence \( S=((x_1,c(x_1)),\ldots ,(x_n,c(x_n)))\sim \mathcal {D}^n \) , and denote by \( M \) the random variable counting the number of mistakes \( \mathcal {A} \) makes in this process. The mistake bound on \( \mathcal {A} \) guarantees that \( M\le d \) always. Consequently, there is \( 0\le i \le d \) such that
    \( \begin{equation*} \Pr [M=i] \ge \frac{1}{d+1}. \end{equation*} \)
    Note that we can identify, with high probability, an \( i \) such that \( \Pr [M=i] \ge 1/2d \) by running \( \mathcal {A} \) on \( O(d) \) samples from \( \mathcal {D}^n \) . We next describe how to handle each of the \( d+1 \) possibilities for \( i \) . Let us first assume that \( i=d \) , namely that
    \( \begin{equation*} \Pr [M=d] \ge \frac{1}{2d}. \end{equation*} \)
    We claim that in this case we are done: indeed, after making \( d \) mistakes, it must be the case that \( \mathcal {A} \) has completely identified the target concept \( c \) (or else \( \mathcal {A} \) could be presented with another example that forces it to make \( d+1 \) mistakes). Thus, in this case, it holds with probability at least \( 1/2d \) that \( \mathcal {A}(S)=c \) and we are done. Let us next assume that \( i=d-1 \) , namely that
    \( \begin{equation*} \Pr [M=d-1] \ge \frac{1}{2d}. \end{equation*} \)
    The issue with applying the previous argument here is that before making the \( d \) ’th mistake, \( \mathcal {A} \) can output many different hypotheses (depending on the input sample \( S \) ). We use the following idea: draw two samples \( S_1,S_2 \sim \mathcal {D}^n \) independently, and set \( f_1 = \mathcal {A}(S_1) \) and \( f_2=\mathcal {A}(S_2) \) . Condition on the event that the number of mistakes made by \( \mathcal {A} \) on each of \( S_1,S_2 \) is exactly \( d-1 \) (by assumption, this event occurs with probability at least \( (1/2d)^2 \) ) and consider the following two possibilities:
    (i)
    \( \Pr [f_1=f_2]\ge \frac{1}{4} \) ,
    (ii)
    \( \Pr [f_1=f_2] \lt \frac{1}{4} \) .
    If (i) holds, then using a simple calculation one can show that there is \( h \) such that \( \Pr [A(S) = h] \ge \frac{1}{(2d)^2}\cdot \frac{1}{4} \) and we are done. If (ii) holds, then we apply the following random contest between \( S_1,S_2 \) :
    (1)
    Pick \( x \) such that \( f_1(x)\ne f_2(x) \) and draw \( y\sim \lbrace \pm 1\rbrace \) uniformly at random.
    (2)
If \( f_1(x)\ne y, \) then the output is \( \mathcal {A}(S_1 \circ (x,y)) \) , where \( S_1\circ (x,y) \) denotes the sample obtained by appending \( (x,y) \) to the end of \( S_1 \) . In this case, we say that \( S_1 \) “won the contest.”
    (3)
Else, \( f_2(x)\ne y, \) and the output is \( \mathcal {A}(S_2 \circ (x,y)) \) . In this case, we say that \( S_2 \) “won the contest.”
    Note that adding the auxiliary example \( (x,y) \) forces \( \mathcal {A} \) to make exactly \( d \) mistakes on \( S_i\circ (x,y) \) . Now, if \( y\sim \lbrace \pm 1\rbrace \) satisfies \( y = c(x), \) then by the mistake bound argument it holds that \( \mathcal {A}(S_i\circ (x,y))=c \) . Therefore, since \( \Pr _{y\sim \lbrace \pm 1\rbrace }[c(x)=y] = 1/2 \) , it follows that
    \( \begin{equation*} \Pr _{S_1,S_2, y}[\mathcal {A}(S_i\circ (x,y))=c] \ge \frac{1}{(2d)^2}\cdot \frac{3}{4}\cdot \frac{1}{2} =\Omega (1/d^2), \end{equation*} \)
    and we are done.
Similar reasoning can be used by induction to handle the remaining cases (the next one would be that \( \Pr [M=d-2] \ge \frac{1}{2d} \) , and so on). As the number of mistakes decreases, we need to guess more labels in order to force mistakes on the algorithm, and as we guess more labels, the success rate decreases; nevertheless, we never need to make more than \( 2^d \) such guesses. (Note that the random contests performed by the algorithm can naturally be presented using the internal nodes of a binary tree of depth at most \( d \) .) The proof we present in Section 5.2 is based on a similar idea of performing random contests, although the construction becomes more complex to handle other issues, such as generalization, which were not addressed here. For more details, we refer the reader to the complete argument in Section 5.2.
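For concreteness, a single random contest from this overview can be sketched in Python as follows; \( \mathcal {A} \) is any online learner returning a hypothesis, and the scan over a domain for a disagreement point is an assumption made only for this sketch.
```python
import random

def contest(S1, S2, A, domain):
    # One random contest: when the two hypotheses disagree somewhere, guess a
    # label uniformly and append the example to the sample it forces into a
    # mistake (the "winner" of the contest).
    f1, f2 = A(S1), A(S2)
    x = next(z for z in domain if f1(z) != f2(z))   # assumes f1 != f2
    y = random.choice([-1, +1])                     # uniform label guess
    winner = S1 if f1(x) != y else S2
    return A(winner + [(x, y)])
```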

    5.1.2 Step 2: Globally Stable Learning → Differentially Private Learning.

Given a globally stable learner \( \mathcal {A} \) for a concept class \( \mathcal {H} \) , we can obtain a differentially private learner using standard techniques from the literature on private learning and query release. If \( \mathcal {A} \) is an \( (\eta , m) \) -globally stable learner with respect to a distribution \( \mathcal {D} \) , we obtain a differentially private learner using roughly \( m/\eta \) samples from that distribution as follows. We first run \( \mathcal {A} \) on \( k \approx 1/\eta \) independent samples, non-privately producing a list of \( k \) hypotheses. We then apply a differentially private “Stable Histograms” algorithm [21, 58] to this list, which allows us to privately publish a short list of hypotheses that appear with frequency \( \Omega (\eta) \) . Global stability of the learner \( \mathcal {A} \) guarantees that with high probability, this list contains some hypothesis \( h \) with small population loss. We can then apply a generic differentially private learner (based on the exponential mechanism) to a fresh set of examples to identify such an accurate hypothesis from the short list.

    5.2 Globally Stable Learning of Littlestone Classes

    5.2.1 Theorem Statement.

    The following theorem states that any class \( \mathcal {H} \) with a bounded Littlestone dimension can be learned by a globally stable algorithm.
    Theorem 23.
    Let \( \mathcal {H} \) be a hypothesis class with Littlestone dimension \( d\ge 1 \) , let \( \alpha \gt 0 \) , and set
\( \begin{equation*} m = \bigl (2^{2^{d+2}+1}\cdot 4^{d+1}+1\bigr)\cdot \Bigl \lceil \frac{2^{d+2}}{\alpha }\Bigr \rceil . \end{equation*} \)
    Then there exists a randomized algorithm \( G : (X \times \lbrace \pm 1\rbrace)^m \rightarrow \lbrace \pm 1\rbrace ^X \) with the following properties. Let \( \mathcal {D} \) be a realizable distribution, and let \( S\sim \mathcal {D}^m \) be an input sample. Then there exists a hypothesis \( f \) such that
\( \begin{equation*} \Pr [G(S) = f] \ge \frac{1}{(d+1)2^{2^{d+2}+1}} \text{ and } \operatorname{loss}_{\mathcal {D}}(f) \le \alpha . \end{equation*} \)

    5.2.2 The Distributions Dk.

    Algorithm \( G \) is obtained by running the \( \mathsf {SOA} \) on a sample drawn from a carefully tailored distribution. This distribution belongs to a family of distributions that we define next. Each of these distributions can be sampled from using black-box access to independent and identically distributed samples from \( \mathcal {D} \) . Recall that for a pair of samples \( S,T \) , we denote by \( S\circ T \) the sample obtained by appending \( T \) to the end of \( S \) . Define a sequence of distributions \( \mathcal {D}_k \) for \( k\ge 0 \) as shown in the boxed text.
    Distributions \( \mathcal {D}_k \)
    Let \( n \) denote an “auxiliary sample” size (to be fixed later), and let \( \mathcal {D} \) denote the target realizable distribution over examples. The distributions \( \mathcal {D}_k = \mathcal {D}_k(\mathcal {D},n) \) are defined by induction on \( k \) as follows:
    (1)
    \( \mathcal {D}_0 \) : output the empty sample \( \emptyset \) with probability 1.
    (2)
    Let \( k\ge 1 \) . If there exists an \( f \) such that
\( \begin{equation*} \Pr _{S \sim \mathcal {D}_{k-1}, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f] \ge 2^{-2^{d+2}}, \end{equation*} \)
    or if \( \mathcal {D}_{k-1} \) is undefined, then \( \mathcal {D}_k \) is undefined.
    (3)
    Else, \( \mathcal {D}_k \) is defined recursively by the following process:
    (i)
    Draw \( S_0,S_1\sim \mathcal {D}_{k-1} \) and \( T_0,T_1\sim \mathcal {D}^n \) independently.
    (ii)
    Let \( f_0=\mathsf {SOA}(S_0\circ T_0) \) , \( f_1=\mathsf {SOA}(S_1\circ T_1) \) .
    (iii)
    If \( f_0=f_1, \) then go back to step (i).
    (iv)
    Else, pick \( x\in \lbrace x: f_0(x)\ne f_1(x)\rbrace \) and sample \( y\sim \lbrace \pm 1\rbrace \) uniformly.
    (v)
    If \( f_0(x)\ne y, \) then output \( S_0 \circ T_0\circ ((x,y)) \) and else output \( S_1\circ T_1\circ ((x,y)) \) .
    Please see Figure 4 for an illustration of sampling \( S\sim \mathcal {D}_k \) for \( k=3 \) .
    Fig. 4.
    Fig. 4. An illustration of the process of generating a sample \( S\sim \mathcal {D}_3 \) . The edge labels are the samples \( T_b \) drawn in Item 3(i). The node labels are the tournament examples \( (x_b,y_b) \) generated in Item 3(iv). The bold edges indicate which of the samples \( T_{b0},T_{b1} \) was appended to \( S \) in step 3(v) along with the corresponding tournament example. The sample \( S \) generated in this illustration is \( T_{010}\circ (x_{01},y_{01})\circ T_{01}\circ (x_{0},y_{0})\circ T_{0}\circ (x,y) \) .
    We next observe some basic facts regarding these distributions. First, note that whenever \( \mathcal {D}_k \) is well defined, the process in Item 3 terminates with probability 1.
    Let \( k \) be such that \( \mathcal {D}_k \) is well defined and consider a sample \( S \) drawn from \( \mathcal {D}_k \) . The size of \( S \) is \( \vert S\vert = k\cdot (n + 1) \) . Among these \( k\cdot (n+1) \) examples, there are \( k\cdot n \) examples drawn from \( \mathcal {D} \) and \( k \) examples that are generated in Item 3(iv). We will refer to these \( k \) examples as tournament examples. Note that during the generation of \( S\sim \mathcal {D}_k, \) there are examples drawn from \( \mathcal {D} \) that do not actually appear in \( S \) . In fact, the number of such examples may be unbounded, depending on how many times Items 3(i) through 3(iii) were repeated. In Section 5.2.3, we will define a “Monte Carlo” variant of \( \mathcal {D}_k \) in which the number of examples drawn from \( \mathcal {D} \) is always bounded. This Monte Carlo variant is what we actually use to define our globally stable learning algorithm, but we introduce the simpler distributions \( \mathcal {D}_k \) to clarify our analysis.
    The \( k \) tournament examples satisfy the following important properties.
    Observation 24.
    Let \( k \) be such that \( \mathcal {D}_k \) is well defined and consider running the \( \mathsf {SOA} \) on the concatenated sample \( S\circ T \) , where \( S\sim \mathcal {D}_k \) and \( T\sim \mathcal {D}^n \) . Then
    (1)
    Each tournament example forces a mistake on the \( \mathsf {SOA} \) . Consequently, the number of mistakes made by the \( \mathsf {SOA} \) when run on \( S\circ T \) is at least \( k \) .
    (2)
    \( \mathsf {SOA}(S\circ T) \) is consistent with \( T \) .
    The first item follows directly from the definition of \( x \) in Item 3(iv) and the definition of \( S \) in Item 3(v). The second item clearly holds when \( S\circ T \) is realizable by \( \mathcal {H} \) (because the \( \mathsf {SOA} \) is consistent). For non-realizable \( S\circ T \) , Item 2 holds by our extension of the \( \mathsf {SOA} \) in Definition 11.
    The existence of frequent hypotheses. The following lemma is the main step in establishing global stability.
    Lemma 25.
There exists \( k\le d \) and a hypothesis \( f:X\rightarrow \lbrace \pm 1\rbrace \) such that
\( \begin{equation*} \Pr _{S\sim \mathcal {D}_k, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f] \ge 2^{-2^{d+2}}. \end{equation*} \)
    Proof.
    Suppose for the sake of contradiction that this is not the case. In particular, this means that \( \mathcal {D}_d \) is well defined and that for every \( f \) ,
\( \begin{equation} \Pr _{S\sim \mathcal {D}_d, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f] \lt 2^{-2^{d+2}}. \end{equation} \)
    (2)
We show that this cannot be the case when \( f=c \) is the target concept (i.e., for \( c\in \mathcal {H}, \) which satisfies \( \operatorname{loss}_\mathcal {D}(c)=0 \) ). Toward this end, we first note that with probability at least \( 2^{-2^{d+2}} \) over \( S\sim \mathcal {D}_d, \) all \( d \) tournament examples are consistent with \( c \) : for \( k\le d, \) let \( \rho _k \) denote the probability that all \( k \) tournament examples in \( S\sim \mathcal {D}_k \) are consistent with \( c \) . We claim that \( \rho _k \) satisfies the recursion \( \rho _k\ge \frac{1}{2}(\rho _{k-1}^2-8\cdot 2^{-2^{d+2}}) \) . Indeed, consider the event \( E_k \) that (i) in each of \( S_0,S_1\sim \mathcal {D}_{k-1} \) , all \( k-1 \) tournament examples are consistent with \( c \) , and (ii) \( f_0\ne f_1 \) . Since \( f_0=f_1 \) occurs with probability at most \( 2^{-2^{d+2}}\lt 8\cdot 2^{-2^{d+2}} \) , it follows that \( \Pr [E_k]\ge \rho _{k-1}^2- 8\cdot 2^{-2^{d+2}} \) . Further, since \( y\in \lbrace \pm 1\rbrace \) is chosen uniformly at random and independently of \( S_0 \) and \( S_1 \) , we have that conditioned on \( E_k \) , \( c(x)=y \) with probability \( 1/2 \) . Taken together, we have that \( \rho _k\ge \tfrac{1}{2}\Pr [E_k]\ge \tfrac{1}{2}(\rho _{k-1}^2-8\cdot 2^{-2^{d+2}}) \) . Since \( \rho _0=1, \) we get the recursive relation
    \( \begin{equation*} \rho _{k}\ge \frac{\rho _{k-1}^2-8\cdot 2^{-2^{d+2}}}{2}, ~\textrm {and}~ \rho _0=1. \end{equation*} \)
    Thus, it follows by induction that for \( k\le d \) , \( \rho _k\ge 4\cdot 2^{-2^{k+1}} \) : the base case is verified readily, and the induction step is as follows:
    \( \begin{align*} \rho _{k} &\ge \frac{\rho _{k-1}^2-8\cdot 2^{-2^{d+2}}}{2}\\ &\ge \frac{(4\cdot 2^{-2^{k}})^2-8\cdot 2^{-2^{d+2}}}{2}\qquad\qquad\qquad\qquad \text{(by induction)}\\ &=8\cdot 2^{-2^{k+1}} - 4\cdot 2^{-2^{d+2}}\\ &\ge 4\cdot 2^{-2^{k+1}}\;\;\;\; {(k\le d\;\quad \text{and therefore}\; 2^{-2^{d+2}} \le 2^{-2^{k+1}})}. \end{align*} \)
Therefore, with probability at least \( 2^{-2^{d+2}} \) , we have that \( S\circ T \) is consistent with \( c \) (because all examples in \( S\circ T \) that are drawn from \( \mathcal {D} \) are also consistent with \( c \) ). Now, since each tournament example forces a mistake on the \( \mathsf {SOA} \) (Observation 24), and since the \( \mathsf {SOA} \) does not make more than \( d \) mistakes on realizable samples, it follows that if all tournament examples in \( S\sim \mathcal {D}_d \) are consistent with \( c, \) then \( \mathsf {SOA}(S)=\mathsf {SOA}(S\circ T)=c \) . Thus,
\( \begin{equation*} \Pr _{S\sim \mathcal {D}_d, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = c] \ge 2^{-2^{d+2}}, \end{equation*} \)
    which contradicts Equation (2) and finishes the proof.□
    Generalization. The next lemma shows that only hypotheses \( f \) that generalize well satisfy the conclusion of Lemma 25 (note the similarity of this proof with the proof of Proposition 7).
    Lemma 26 (Generalization).
    Let \( k \) be such that \( \mathcal {D}_k \) is well defined. Then every \( f \) such that
\( \begin{equation*} \Pr _{S\sim \mathcal {D}_k, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f] \ge 2^{-2^{d+2}} \end{equation*} \)
satisfies \( \operatorname{loss}_{\mathcal {D}}(f) \le \frac{2^{d+2}}{n} \) .
    Proof.
Let \( f \) be a hypothesis such that \( \Pr _{S\sim \mathcal {D}_k, T \sim \mathcal {D}^n}[\mathsf {SOA}(S \circ T) = f] \ge 2^{-2^{d+2}} \) and let \( \alpha =\operatorname{loss}_{\mathcal {D}}(f) \) . We will argue that
\( \begin{equation} 2^{-2^{d+2}} \le (1-\alpha)^{n}. \end{equation} \)
    (3)
    Define the events \( A,B \) as follows:
    (1)
\( A \) is the event that \( \mathsf {SOA}(S\circ T) = f \) . By assumption, \( \Pr [A] \ge 2^{-2^{d+2}} \) .
    (2)
    \( B \) is the event that \( f \) is consistent with \( T \) . Since \( \vert T\vert = n \) , we have that \( \Pr [B] = (1-\alpha)^{n} \) .
Note that \( A \subseteq B \) : indeed, \( \mathsf {SOA}(S\ \circ \ T) \) is consistent with \( T \) by the second item of Observation 24. Thus, whenever \( \mathsf {SOA}(S\circ T)=f \) , it must be the case that \( f \) is consistent with \( T \) . Hence, \( \Pr [A]\le \Pr [B] \) , which implies Inequality (3). Using the fact that \( 1-\alpha \le 2^{-\alpha } \) and taking logarithms on both sides of Inequality (3) then yields \( \alpha n \le 2^{d+2} \) , which finishes the proof.□

    5.2.3 The Algorithm G.

A Monte Carlo variant of \( \mathcal {D}_k \) . Consider the following first attempt at defining a globally stable learner \( G \) : (i) draw \( i\in \lbrace 0,\ldots ,d\rbrace \) uniformly at random, (ii) sample \( S\sim \mathcal {D}_i \) , and (iii) output \( \mathsf {SOA}(S\circ T) \) , where \( T\sim \mathcal {D}^n \) . The idea is that with probability \( 1/(d+1), \) the sampled \( i \) will be equal to a number \( k \) satisfying the conditions of Lemma 25, and so the desired hypothesis \( f \) guaranteed by this lemma (which also has low population loss by Lemma 26) will be output with probability at least \( \frac{2^{-2^{d+2}}}{d+1} \) .
The issue here is that sampling \( S\sim \mathcal {D}_i \) may require an unbounded number of samples from the target distribution \( \mathcal {D} \) (in fact, \( \mathcal {D}_i \) may even be undefined). To circumvent this possibility, we define a Monte Carlo variant of \( \mathcal {D}_k \) in which the number of examples drawn from \( \mathcal {D} \) is always bounded.
    The Distributions \( \tilde{\mathcal {D}}_k \) (a Monte Carlo Variant of \( \mathcal {D}_k \) )
    (1)
    Let \( n \) be the auxiliary sample size and \( N \) be an upper bound on the number of examples drawn from \( \mathcal {D} \) .
    (2)
    \( \tilde{\mathcal {D}}_0 \) : output the empty sample \( \emptyset \) with probability 1.
    (3)
    For \( k\gt 0 \) , define \( \tilde{\mathcal {D}}_k \) recursively by the following process:
    (*)
    Throughout the process, if more than \( N \) examples from \( \mathcal {D} \) are drawn (including examples drawn in the recursive calls), then output “Fail.”
    (i)
    Draw \( S_0,S_1\sim \tilde{\mathcal {D}}_{k-1} \) and \( T_0,T_1\sim \mathcal {D}^n \) independently.
    (ii)
    Let \( f_0=\mathsf {SOA}(S_0\circ T_0) \) , \( f_1=\mathsf {SOA}(S_1\circ T_1) \) .
    (iii)
    If \( f_0=f_1, \) then go back to step (i).
    (iv)
    Else, pick \( x\in \lbrace x: f_0(x)\ne f_1(x)\rbrace \) and sample \( y\sim \lbrace \pm 1\rbrace \) uniformly.
    (v)
    If \( f_0(x)\ne y, \) then output \( S_0 \circ T_0\circ ((x,y)) \) and else output \( S_1\circ T_1\circ ((x,y)) \) .
    Note that \( \tilde{\mathcal {D}}_k \) is well defined for every \( k \) , even for \( k \) such that \( \mathcal {D}_k \) is undefined (however, for such \( k \) ’s, the probability of outputting “Fail” may be large).
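The recursive process defining \( \tilde{\mathcal {D}}_k \) translates directly into code. The following Python sketch makes the global budget explicit; equal and disagree are assumed oracles (an equality test for hypotheses and a point on which two hypotheses differ), and with an infinite budget the sampler is exactly \( \mathcal {D}_k \) .
```python
import random

class Fail(Exception):
    """Raised when the budget of N draws from D is exhausted."""

def make_draw(sample_from_D, N):
    # Wrap the example oracle with the global budget N; the budget is shared
    # across all recursive calls, as required by step (*)
    budget = {"left": N}
    def draw(n):
        if budget["left"] < n:
            raise Fail()
        budget["left"] -= n
        return [sample_from_D() for _ in range(n)]
    return draw

def sample_Dk_tilde(k, n, draw, SOA, equal, disagree):
    if k == 0:
        return []                                    # the empty sample
    while True:
        S0 = sample_Dk_tilde(k - 1, n, draw, SOA, equal, disagree)
        S1 = sample_Dk_tilde(k - 1, n, draw, SOA, equal, disagree)
        T0, T1 = draw(n), draw(n)                    # step (i)
        f0, f1 = SOA(S0 + T0), SOA(S1 + T1)          # step (ii)
        if equal(f0, f1):
            continue                                 # step (iii): redraw
        x = disagree(f0, f1)                         # step (iv)
        y = random.choice([-1, +1])
        # step (v): the tournament example joins the side forced into a mistake
        return (S0 + T0 if f0(x) != y else S1 + T1) + [(x, y)]
```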
It remains to specify the upper bound \( N \) on the number of examples drawn from \( \mathcal {D} \) in \( \tilde{\mathcal {D}}_k \) . Toward this end, we prove the following bound on the expected number of examples from \( \mathcal {D} \) that are drawn during the generation of \( S\sim \mathcal {D}_k \) .
    Lemma 27 (Expected Sample Complexity of Sampling from \( \mathcal {D}_k \) ).
    Let \( k \) be such that \( \mathcal {D}_k \) is well defined, and let \( M_k \) denote the number of examples from \( \mathcal {D} \) that are drawn in the process of generating \( S\sim \mathcal {D}_k \) . Then,
    \( \begin{equation*} \mathbb {E}[M_k] \le 4^{k+1}\cdot n. \end{equation*} \)
    Proof.
Note that \( \mathbb {E}[M_0]=0 \) as \( \mathcal {D}_0 \) deterministically produces the empty sample. We first show that for all \( 0 \le i \lt k \) ,
    \( \begin{equation} \mathbb {E}[M_{i+1}] \le 4\mathbb {E}[M_{i}] + 4n, \end{equation} \)
    (4)
    and then conclude the desired inequality by induction.
    To see why Inequality (4) holds, let the random variable \( R \) denote the number of times Item 3(i) was executed during the generation of \( S\sim \mathcal {D}_{i+1} \) . In other words, \( R \) is the number of times a pair \( S_0,S_1\sim \mathcal {D}_i \) and a pair \( T_0,T_1\sim \mathcal {D}^n \) were drawn. Observe that \( R \) is distributed geometrically with success probability \( \theta \) , where
\( \begin{align*} \theta &= 1 - \Pr _{S_0,S_1, T_0,T_1}\bigl [\mathsf {SOA}(S_0\circ T_0) = \mathsf {SOA}(S_1\circ T_1)\bigr ] \\ &= 1 - \sum _{h}\Pr _{S, T}\bigl [\mathsf {SOA}(S\circ T) = h\bigr ]^2\\ &\ge 1-2^{-2^{d+2}}, \end{align*} \)
where the last inequality follows because \( i\lt k \) and hence \( \mathcal {D}_{i+1} \) is well defined, which implies that \( \Pr _{S, T}[\mathsf {SOA}(S\circ T) = h]\le 2^{-2^{d+2}} \) for all \( h \) .
    Now, the random variable \( M_{i+1} \) can be expressed as follows:
    \( \begin{equation*} M_{i+1} = \sum _{j=1}^\infty M_{i+1}^{(j)}, \end{equation*} \)
    where
\( \begin{equation*} M_{i+1}^{(j)} = {\left\lbrace \begin{array}{ll}0 &\text{if } R \lt j,\\ \text{number of examples drawn from } \mathcal {D} \text{ in the } j\text{th execution of Item 3(i)} &\text{if } R\ge j. \end{array}\right.} \end{equation*} \)
    Thus, \( \mathbb {E}[M_{i+1}] = \sum _{j=1}^\infty \mathbb {E}[M_{i+1}^{(j)}] \) . We claim that
    \( \begin{equation*} \mathbb {E}[M_{i+1}^{(j)}] = (1-\theta)^{j-1}\cdot (2\mathbb {E}[M_i] + 2n). \end{equation*} \)
    Indeed, the probability that \( R\ge j \) is \( (1-\theta)^{j-1} \) and conditioned on \( R\ge j \) , in the \( j \) ’th execution of Item 3(i) two samples from \( \mathcal {D}_{i} \) are drawn and two samples from \( \mathcal {D}^n \) are drawn. Thus,
    \( \begin{equation*} \mathbb {E}[M_{i+1}] = \sum _{j=1}^\infty (1-\theta)^{j-1}\cdot (2\mathbb {E}[M_i] + 2n)= \frac{1}{\theta }\cdot (2\mathbb {E}[M_i] + 2n) \le 4\mathbb {E}[M_i] + 4n, \end{equation*} \)
where the last inequality is true because \( \theta \ge 1- 2^{-2^{d+2}}\ge 1/2 \) .
    This gives Inequality (4). Next, using that \( \mathbb {E}[M_0]=0 \) , a simple induction gives
    \( \begin{equation*} \mathbb {E}[M_{i+1}]\le (4+4^2+\cdots + 4^{i+1})n \le 4^{i+2}n, \end{equation*} \)
    and the lemma follows by taking \( i+1=k \) .□
    Proof of Theorem 23
    Our globally stable learning algorithm \( G \) is defined as shown in the boxed text.
    Algorithm \( G \)
    (1)
Consider the distribution \( \tilde{\mathcal {D}}_k \) , where the auxiliary sample size is set to \( n=\lceil \frac{2^{d+2}}{\alpha }\rceil \) and the sample complexity upper bound is set to \( N=2^{2^{d+2}+1}\cdot 4^{d+1}\cdot n \) .
    (2)
    Draw \( k\in \lbrace 0,1,\ldots , d\rbrace \) uniformly at random.
    (3)
    Output \( h=\mathsf {SOA}(S\circ T) \) , where \( T\sim \mathcal {D}^n \) and \( S\sim \tilde{\mathcal {D}}_k \) .
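Continuing the sketch of \( \tilde{\mathcal {D}}_k \) above, algorithm \( G \) itself is a thin wrapper; the constants for \( n \) and \( N \) below are the reconstructed values from the boxed description.
```python
import math
import random

def G(sample_from_D, SOA, d, alpha, equal, disagree):
    n = math.ceil(2 ** (d + 2) / alpha)                 # auxiliary sample size
    N = 2 ** (2 ** (d + 2) + 1) * 4 ** (d + 1) * n      # draw budget
    draw = make_draw(sample_from_D, N)
    k = random.randrange(d + 1)                         # uniform in {0,...,d}
    try:
        S = sample_Dk_tilde(k, n, draw, SOA, equal, disagree)
    except Fail:
        return None                                     # the "Fail" outcome
    T = [sample_from_D() for _ in range(n)]             # fresh T ~ D^n
    return SOA(S + T)
```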
First note that the sample complexity of \( G \) is \( \vert S\vert + \vert T\vert \le N+n = (2^{2^{d+2}+1}\cdot 4^{d+1}+1)\cdot \lceil \frac{2^{d+2}}{\alpha }\rceil = m \) , as required. It remains to show that there exists a hypothesis \( f \) such that
\( \begin{equation*} \Pr [G(S) = f] \ge \frac{2^{-2^{d+2}-1}}{d+1} \text{ and } \operatorname{loss}_{\mathcal {D}}(f) \le \alpha . \end{equation*} \)
    By Lemma 25, there exists \( k^*\le d \) and \( f^* \) such that
\( \begin{equation*} \Pr _{S\sim \mathcal {D}_{k^*}, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f^*] \ge 2^{-2^{d+2}}. \end{equation*} \)
    We assume \( k^* \) is minimal—in particular, \( \mathcal {D}_k \) is well defined for \( k\le k^* \) . By Lemma 26,
\( \begin{equation*} \operatorname{loss}_{\mathcal {D}}(f^*) \le \frac{2^{d+2}}{n} \le \alpha . \end{equation*} \)
We claim that, conditioned on drawing \( k=k^* \) in step (2), \( G \) outputs \( f^* \) with probability at least \( 2^{-2^{d+2}-1} \) . To see this, let \( M_{k^*} \) denote the number of examples drawn from \( \mathcal {D} \) during the generation of \( S\sim \mathcal {D}_{k^*} \) . Lemma 27 and an application of Markov’s inequality yield
    \( \begin{align*} \Pr \bigl [M_{k^*} \gt {2^{2^{d+2}+1}\cdot 4^{d+1}}\cdot n\bigr ] &\le \Pr \bigl [M_{k^*} \gt {2^{2^{d+2}+1}}\cdot 4^{k^*+1}\cdot n\bigr ]\qquad\qquad\qquad \text{because}\; {k^*\le d}\\ &\le {2^{-2^{d+2}-1}}.\quad \text{by Markov's inequality, since}\; \mathbb {E}[M_{k^*}]\le 4^{k^*+1}\cdot n \end{align*} \)
    Therefore,
\( \begin{align*} \Pr _{S\sim \tilde{\mathcal {D}}_{k^*}, T\sim \mathcal {D}^n}[\mathsf {SOA}(S\circ T) = f^*] &= \Pr _{S\sim \mathcal {D}_{k^*}, T\sim \mathcal {D}^n}\left[\mathsf {SOA}(S\circ T) = f^* \text{ and } M_{k^*} \le 2^{2^{d+2}+1}\cdot 4^{d+1}\cdot n\right] \\ &\ge 2^{-2^{d+2}} - 2^{-2^{d+2}-1} = 2^{-2^{d+2}-1}. \end{align*} \)
    Thus, since \( k=k^* \) with probability \( 1/(d+1) \) , it follows that \( G \) outputs \( f^* \) with probability at least \( \tfrac{{2^{-2^{d+2}-1}}}{d+1} \) as required.□

    5.3 Globally Stable Learning Implies Private Learning

    In this section, we prove that any globally stable learning algorithm yields a differentially private learning algorithm with finite sample complexity.

    5.3.1 Tools from DP.

    We begin by stating a few standard tools from the DP literature that underlie our construction of a learning algorithm.
    Let \( X \) be a data domain, and let \( S \in X^n \) . For an element \( x \in X \) , define \( \operatorname{freq}_S(x) = \frac{1}{n} \cdot \#\lbrace i \in [n] : x_i = x\rbrace \) —that is, the fraction of the elements in \( S \) which are equal to \( x \) .
    Lemma 28 (Stable Histograms [21, 58]).
    Let \( X \) be any data domain. For
    \( \begin{equation*} n \ge O\left(\frac{\log (1/\eta \beta \delta)}{\eta \varepsilon }\right), \end{equation*} \)
    there exists an \( (\varepsilon , \delta) \) -differentially private algorithm \( \mathsf {Hist} \) that, with probability at least \( 1-\beta \) , on input \( S = (x_1, \dots , x_n) \) outputs a list \( L \subseteq X \) and a sequence of estimates \( a \in [0, 1]^{|L|} \) such that
    Every \( x \) with \( \operatorname{freq}_S(x) \ge \eta \) appears in \( L, \) and
    For every \( x \in L \) , the estimate \( a_x \) satisfies \( |a_x - \operatorname{freq}_S(x)| \le \eta \) .
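To make the lemma concrete, here is one standard way such a mechanism can be sketched: add Laplace noise to the frequency of every distinct element and release only the elements whose noisy frequency clears a threshold. The threshold and constants below are illustrative assumptions, not those of the algorithm in [21, 58].
```python
import math
import random
from collections import Counter

def laplace(scale):
    # the difference of two Exp(1) variables is Laplace-distributed
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def stable_histogram(S, eps, delta, eta):
    n = len(S)
    threshold = eta / 2 + (2 / (eps * n)) * math.log(2 / delta)  # assumed form
    out = {}
    for x, c in Counter(S).items():
        a = c / n + laplace(2 / (eps * n))    # noisy frequency estimate
        if a >= threshold:
            out[x] = a                        # released estimate of freq_S(x)
    return out
```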
    Using the Exponential Mechanism of McSherry and Talwar [64], Kasiviswanathan et al. [57] described a generic differentially private learner based on approximate empirical risk minimization.
    Lemma 29 (Generic Private Learner [57]).
    Let \( H \subseteq \lbrace \pm 1\rbrace ^X \) be a collection of hypotheses. For
    \( \begin{equation*} n = O\left(\frac{\log |H| +\log (1/\beta)}{\alpha \varepsilon }\right), \end{equation*} \)
    there exists an \( \varepsilon \) -differentially private algorithm \( \mathsf {GenericLearner}: (X \times \lbrace \pm 1\rbrace)^n \rightarrow H \) such that the following holds. Let \( \mathcal {D} \) be a distribution over \( (X \times \lbrace \pm 1\rbrace) \) such that there exists \( h^* \in H \) with
    \( \begin{equation*} \operatorname{loss}_{\mathcal {D}}(h^*) \le \alpha . \end{equation*} \)
    Then on input \( S \sim \mathcal {D}^n \) , algorithm \( \mathsf {GenericLearner} \) outputs, with probability at least \( 1-\beta \) , a hypothesis \( \hat{h} \in H \) such that
    \( \begin{equation*} \operatorname{loss}_\mathcal {D}(\hat{h}) \le 2\alpha . \end{equation*} \)
Our formulation of the guarantees of this algorithm differs slightly from that of Kasiviswanathan et al. [57], so we give its standard proof for completeness.
    Proof of Lemma 29
    The algorithm \( \mathsf {GenericLearner}(S) \) samples a hypothesis \( h \in H \) with probability proportional to \( \exp (-\varepsilon n \operatorname{loss}_S(h) / 2) \) . This algorithm can be seen as an instantiation of the Exponential Mechanism [64]; the fact that changing one sample changes the value of \( \operatorname{loss}_S(h) \) by at most 1 implies that \( \mathsf {GenericLearner} \) is \( \varepsilon \) -differentially private.
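This sampler is short enough to state directly. The following Python sketch takes \( H \) as a finite list of hypotheses (each a callable) and samples exactly from the distribution just described.
```python
import math
import random

def generic_learner(S, H, eps):
    # Exponential mechanism: pick h with probability proportional to
    # exp(-eps * n * loss_S(h) / 2)
    n = len(S)
    def emp_loss(h):
        return sum(1 for (x, y) in S if h(x) != y) / n
    weights = [math.exp(-eps * n * emp_loss(h) / 2) for h in H]
    return random.choices(H, weights=weights, k=1)[0]
```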
    We now argue that \( \mathsf {GenericLearner} \) is an accurate learner. Let \( E \) denote the event that the sample \( S \) satisfies the following conditions:
    (1)
    For every \( h \in H \) such that \( \operatorname{loss}_{\mathcal {D}}(h) \gt 2\alpha \) , it also holds that \( \operatorname{loss}_{S}(h) \gt 5\alpha /3 \) , and
    (2)
    For the hypothesis \( h^* \in H \) satisfying \( \operatorname{loss}_{\mathcal {D}}(h^*) \le \alpha \) , it also holds that \( \operatorname{loss}_{S}(h^*) \le 4\alpha / 3 \) .
    We claim that \( \Pr [E] \ge 1-\beta /2 \) as long as \( n \ge O(\log (|H|/\beta) / \alpha) \) . To see this, let \( h \in H \) be an arbitrary hypothesis with \( \operatorname{loss}_D(h) \gt 2\alpha \) . By a multiplicative Chernoff bound,12 we have \( \operatorname{loss}_S(h) \gt 7\alpha / 4 \) with probability at least \( 1 - \beta /(4|H|) \) as long as \( n \ge O(\log (|H|/\beta) / \alpha) \) . Taking a union bound over all \( h \in H \) shows that condition (1) holds with probability at least \( 1 - \beta /4 \) . Similarly, a multiplicative Chernoff bound ensures that condition (2) holds with probability at least \( 1 - \beta /4 \) , so \( E \) holds with probability at least \( 1-\beta /2 \) .
Now we show that conditioned on \( E \) , the algorithm \( \mathsf {GenericLearner}(S) \) indeed produces a hypothesis \( \hat{h} \) with \( \operatorname{loss}_{\mathcal {D}}(\hat{h}) \le 2\alpha \) . This follows the standard analysis of the accuracy guarantees of the Exponential Mechanism. Condition (2) of the definition of event \( E \) guarantees that \( \operatorname{loss}_S(h^*) \le 4\alpha / 3 \) . This ensures that the normalization factor in the definition of the Exponential Mechanism is at least \( \exp (-2\varepsilon \alpha n /3) \) . Hence, by a union bound,
    \( \begin{equation*} \Pr [\operatorname{loss}_S(\hat{h}) \gt 5\alpha /3] \le |H| \cdot \frac{\exp (-5\varepsilon \alpha n / 6)}{\exp (-2\varepsilon \alpha n / 3)} = |H| e^{-\varepsilon \alpha n / 6}. \end{equation*} \)
Taking \( n \ge O(\log (|H|/\beta) / \alpha \varepsilon) \) ensures that this probability is at most \( \beta / 2 \) . Given that \( \operatorname{loss}_S(\hat{h}) \le 5\alpha / 3 \) , condition (1) of the definition of event \( E \) ensures that \( \operatorname{loss}_{\mathcal {D}}(\hat{h}) \le 2\alpha \) . Thus, for \( n \) sufficiently large as described, we have overall that \( \operatorname{loss}_{\mathcal {D}}(\hat{h}) \le 2\alpha \) with probability at least \( 1- \beta \) .□

    5.3.2 Construction of a Private Learner.

    We now describe how to combine the Stable Histograms algorithm with the Generic Private Learner to convert any globally stable learning algorithm into a differentially private one.
    Theorem 30.
    Let \( \mathcal {H} \) be a concept class over data domain \( X \) . Let \( G : (X \times \lbrace \pm 1\rbrace)^m \rightarrow \lbrace \pm 1\rbrace ^X \) be a randomized algorithm such that for \( \mathcal {D} \) a realizable distribution and \( S \sim \mathcal {D}^m \) , there exists a hypothesis \( h \) such that \( \Pr [G(S) = h] \ge \eta \) and \( \operatorname{loss}_{\mathcal {D}}(h) \le \alpha / 2 \) .
    Then for some
    \( \begin{equation*} n = \tilde{O}\left(\frac{m \cdot \log (1/\eta \beta \delta)}{\eta \varepsilon } + \frac{\log (1/\eta \beta)}{\alpha \varepsilon }\right), \end{equation*} \)
    there exists an \( (\varepsilon , \delta) \) -differentially private algorithm \( M: (X \times \lbrace \pm 1\rbrace)^n \rightarrow \lbrace \pm 1\rbrace ^X \) that, given \( n \) independent and identically distributed samples from \( \mathcal {D} \) , produces a hypothesis \( \hat{h} \) such that \( \operatorname{loss}_{\mathcal {D}}(\hat{h}) \le \alpha \) with probability at least \( 1-\beta \) .
    Theorem 30 is realized via the learning algorithm \( M \) described in the following. Here, the parameter
    \( \begin{equation*} k = \tilde{O}\left(\frac{\log (1/\eta \beta \delta)}{\eta \varepsilon }\right) \end{equation*} \)
    is chosen so that Lemma 28 guarantees the algorithm \( \mathsf {Hist} \) succeeds with the stated accuracy parameters. The parameter
    \( \begin{equation*} n^{\prime } = \tilde{O}\left(\frac{\log (1/\eta \beta)}{\alpha \varepsilon }\right) \end{equation*} \)
    is chosen so that Lemma 29 guarantees that \( \mathsf {GenericLearner} \) succeeds on a list \( L \) of size \( |L| \le 2/\eta \) with the given accuracy and confidence parameters.
    Differentially Private Learner \( M \)
    (1)
    Let \( S_1, \dots , S_k \) each consist of \( m \) independent and identically distributed samples from \( \mathcal {D} \) . Run \( G \) on each batch of samples producing \( h_1 = G(S_1), \dots , h_k = G(S_k) \) .
    (2)
    Run the Stable Histogram algorithm \( \mathsf {Hist} \) on input \( H = (h_1, \dots , h_k) \) using privacy parameters \( (\varepsilon /2, \delta) \) and accuracy parameters \( (\eta /8, \beta /3) \) , producing a list \( L \) of frequent hypotheses.
    (3)
    Remove from \( L \) all hypotheses with estimated frequency \( a_h\lt 3\eta /4 \) .
    (4)
    Let \( S^{\prime } \) consist of \( n^{\prime } \) independent and identically distributed samples from \( \mathcal {D} \) . Run \( \mathsf {GenericLearner}(S^{\prime }) \) using the collection of hypotheses \( L \) with privacy parameter \( \varepsilon / 2 \) and accuracy parameters \( (\alpha / 2, \beta /3) \) to output a hypothesis \( \hat{h} \) .
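The four steps compose as follows; this Python sketch reuses the stable_histogram and generic_learner sketches above and assumes that \( G \) maps a sample of size \( m \) to a hashable hypothesis, so that hypotheses can serve as histogram elements.
```python
def private_learner_M(sample_from_D, G, m, k, n_prime, eps, delta, eta):
    # step 1: run G on k fresh batches of m examples each
    H = [G([sample_from_D() for _ in range(m)]) for _ in range(k)]
    # step 2: privately estimate the frequency of each produced hypothesis
    estimates = stable_histogram(H, eps / 2, delta, eta / 8)
    # step 3: keep only hypotheses with estimated frequency >= 3*eta/4
    L = [h for h, a in estimates.items() if a >= 3 * eta / 4]
    # step 4: privately select an accurate hypothesis from the short list
    S_prime = [sample_from_D() for _ in range(n_prime)]
    return generic_learner(S_prime, L, eps / 2)
```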
    Proof of Theorem 30
    We first argue that the algorithm \( M \) is differentially private. The outcome \( L \) of step 2 is generated in a \( (\varepsilon /2, \delta) \) -differentially private manner as it inherits its privacy guarantee from \( \mathsf {Hist} \) . For every fixed choice of the coin tosses of \( G \) during the executions \( G(S_1), \dots , G(S_k) \) , a change to one entry of some \( S_i \) changes at most one outcome \( h_i \in H \) . DP for step 2 follows by taking expectations over the coin tosses of all executions of \( G \) , and for the algorithm as a whole by simple composition.
    We now argue that the algorithm is accurate. Using the fact that \( k \ge \tilde{O}(\log (1/\beta)/\eta) \) , standard generalization arguments (e.g., see Theorem A3.1 in the work of Blumer et al. [14]) imply that with probability at least \( 1-\beta /3 \) , every \( h \) such that \( \Pr _{S \sim \mathcal {D}^m}[G(S) = h]\gt \eta \) satisfies
    \( \begin{equation*} \operatorname{freq}_H(h)\ge \frac{7\eta }{8}. \end{equation*} \)
Let us condition on this event. In particular, the hypothesis \( h^* \) promised by global stability (which satisfies \( \Pr _{S\sim \mathcal {D}^m}[G(S)=h^*]\ge \eta \) and \( \operatorname{loss}_{\mathcal {D}}(h^*)\le \alpha /2 \) ) satisfies \( \operatorname{freq}_H(h^*)\ge 7\eta /8 \) . Then by the accuracy of the algorithm \( \mathsf {Hist} \) , with probability at least \( 1-\beta /3 \) it produces a list \( L \) containing \( h^* \) together with a sequence of estimates that are accurate to within additive error \( \eta / 8 \) . In particular, \( h^* \) appears in \( L \) with an estimate \( a_{h^*} \ge 7\eta /8 - \eta /8 \ge 3\eta / 4 \) .
Now remove from \( L \) every item \( h \) with estimate \( a_h \lt 3\eta / 4 \) . Since every estimate is accurate to within \( \eta / 8 \) , this leaves a list with \( |L| \le 2/\eta \) that contains \( h^* \) with \( \operatorname{loss}_{\mathcal {D}}(h^*) \le \alpha /2 \) . Hence, with probability at least \( 1-\beta /3 \) , step 4 succeeds in producing a hypothesis \( \hat{h} \) with \( \operatorname{loss}_{\mathcal {D}}(\hat{h}) \le \alpha \) .
    The total sample complexity of the algorithm is \( k\cdot m + n^{\prime }, \) which matches the asserted bound.□

    5.4 Wrapping up

    We now combine Theorem 23 (finite Littlestone dimension \( \Rightarrow \) global stability) with Theorem 30 (global stability \( \Rightarrow \) private learnability) to prove Theorem 3.
Proof of Theorem 3. Let \( \mathcal {H} \) be a hypothesis class with Littlestone dimension \( d, \) and let \( \mathcal {D} \) be any realizable distribution. Then, Theorem 23 guarantees, for \( m = O(2^{2^{d+2}+1}\cdot 4^{d+1}\cdot 2^{d+2} / \alpha) \) , the existence of a randomized algorithm \( G : (X \times \lbrace \pm 1\rbrace)^m \rightarrow \lbrace \pm 1\rbrace ^X \) and a hypothesis \( f \) such that
\( \begin{equation*} \Pr [G(S) = f] \ge \frac{1}{(d+1)2^{2^{d+2}+1}} \text{ and } \operatorname{loss}_{\mathcal {D}}(f) \le \alpha / 2. \end{equation*} \)
Taking \( \eta = \frac{1}{(d+1)2^{2^{d+2}+1}} \) , Theorem 30 gives an \( (\varepsilon , \delta) \) -differentially private learner with sample complexity
\( \begin{equation*} \qquad \qquad \qquad n = O\left(\frac{m \cdot \log (1/\eta \beta \delta)}{\eta \varepsilon } + \frac{\log (1/\eta \beta)}{\alpha \varepsilon }\right) = O\left(\frac{2^{\tilde{O}(2^d)}+\log (1/\beta \delta) }{\alpha \varepsilon }\right).\qquad \qquad \qquad \Box \end{equation*} \)

    6 Conclusion

    We conclude this article with a few suggestions for future work:
    (1)
Sharper quantitative bounds: Our upper bound on the differentially private sample complexity of a class \( \mathcal {H} \) has a double-exponential dependence on the Littlestone dimension \( \operatorname{Ldim}(\mathcal {H}) \) , whereas the lower bound by Alon et al. [5] depends on \( \log ^*(\operatorname{Ldim}(\mathcal {H})) \) . The work by Kaplan et al. [53] shows that for thresholds, the lower bound is nearly tight (up to a polynomial factor). In a follow-up work to this article, Ghazi et al. [40] improved the upper bound to \( \mathrm{poly}(\operatorname{Ldim}(\mathcal {H})) \) (roughly, with an exponent of 6). This is also tight up to polynomial factors for some classes, particularly those with maximal Littlestone dimension equal to \( \log |\mathcal {H}| \) . However, the tower-of-exponents gap between the upper bound and the lower bound remains essentially the same (with two fewer levels). We thus pose the following question:
    Can every class \( \mathcal {H} \) be privately learned with sample complexity \( \mathrm{poly}(\mathrm{VC}(\mathcal {H}),\log ^*(\operatorname{Ldim}(\mathcal {H}))) \) ?
    (2)
    Characterizing private query release: Another fundamental problem in differentially private data analysis is the query release, or equivalently, the data sanitization problem: given a class \( \mathcal {H} \) and a sensitive dataset \( S \) , output a synthetic dataset \( \hat{S} \) such that \( h(S) \approx h(\hat{S}) \) for every \( h \in \mathcal {H} \) . In earlier versions of this work, we asked whether a finite Littlestone dimension characterizes when this task is possible. This was shown to be true by Bousquet et al. [16] and Ghazi et al. [40]. (Bousquet et al. [16] showed how to transform a proper private learner to a sanitizer, and Ghazi et al. [40] proved that every Littlestone class can be learned properly.) However, as with private classification, massive quantitative gaps between the known upper and lower bounds remain.
    (3)
    Oracle-efficient learning: Neel et al. [69] recently began a systematic study of oracle-efficient learning algorithms: differentially private algorithms that are computationally efficient when given oracle access to their non-private counterparts. The main open question left by their work is whether every privately learnable concept class can be learned in an oracle-efficient manner. Our characterization shows that this is possible if and only if Littlestone classes admit oracle-efficient learners.
    (4)
    General loss functions: It is natural to explore whether the equivalence between online and private learning extends beyond binary classification (which corresponds to the 0-1 loss) to regression and other real-valued losses. These more general loss functions have been studied in subsequent work [6, 20, 42, 50], although the problem of exactly characterizing private learnability in the regression setting remains open.
    (5)
    Global stability: It would be interesting to perform a thorough investigation of global stability and to explore potential connections to other forms of stability in learning theory, including uniform hypothesis stability [15], PAC-Bayes [63], local statistical stability [60], and others.
    (6)
    Differentially private boosting: Can the type of private boosting presented in Section 2.4 be done algorithmically, and ideally, efficiently?

    Acknowledgments

We would like to thank Amos Beimel and Uri Stemmer for pointing out and helping to fix a mistake in the derivation of Corollary 4 in a previous version. We also thank Raef Bassily and Yuval Dagan for providing useful comments and for insightful conversations. Last but not least, we thank the two anonymous reviewers for their comments and suggestions, which helped us to improve the presentation of this article.

    Footnotes

    1
    More precisely, there is a deterministic algorithm that makes no more than \( d \) mistakes, and for every deterministic algorithm there is a (realizable) input sequence on which it makes at least \( d \) mistakes. For randomized algorithms, a slightly weaker lower bound of \( d/2 \) holds with respect to the expected number of mistakes.
    2
    The word global highlights a difference with other forms of algorithmic stability. Indeed, previous forms of stability such as DP and uniform hypothesis stability [15] are local in the sense that they require output robustness subject to local changes in the input. However, the property required by global stability captures stability with respect to resampling the entire input.
    3
    Interestingly, although the Littlestone dimension is a basic parameter in machine learning, this result has not appeared in the machine learning literature.
    4
    Note that if one replaces “equalities” with “inequalities,” then the Littlestone dimension may become unbounded while the VC dimension remains bounded. This is demonstrated, for example, by halfspaces that are captured by polynomial inequalities of degree 1.
    5
    In other words, \( \mathcal {C}_{\varepsilon /2} \) satisfies that for every threshold \( h \) there exists \( c\in \mathcal {C}_{\varepsilon /2} \) such that \( \Pr _{x\sim D_X}(c(x)\ne h(x))\le \epsilon /2 \) .
    6
    In other words, it may output hypotheses that are not thresholds.
    7
    Theorem 2.3 of Alon et al. [4] is based on a previous realizable-to-agnostic transformation from Beimel et al. [10] that applies to proper learners. Here we require the more general transformation from Alon et al. [4], as the learner implied by Theorem 3 may be improper.
    8
    We focus on the realizable case.
    9
    It appears that the name “Littlestone dimension” was coined in the work of Ben-David et al. [13].
    10
    Shelah [76] provides a qualitative statement, and a quantitative one that is more similar to Theorem 10 can be found in the work of Hodges [47].
    11
    A subset of the universe is homogeneous if all of its \( t \) -subsets have the same color.
    12
In other words, for independent random variables \( Z_1, \dots , Z_n \) whose sum \( Z \) satisfies \( \mathbb {E}[Z] = \mu \) , we have for every \( \delta \in (0, 1) \) that \( \Pr [Z \le (1-\delta)\mu ] \le \exp (-\delta ^2\mu / 2) \) and \( \Pr [Z \ge (1 + \delta)\mu ] \le \exp (-\delta ^2\mu / 3) \) .
    Appendix

    A Proof of Theorem 10

    In this appendix, we prove Theorem 10. Throughout the proof, a labeled binary tree means a full binary tree whose internal vertices are labeled by instances.
The second part of the theorem is easy. If \( \mathcal {H} \) contains \( 2^t \) thresholds, then there are \( h_i \in \mathcal {H} \) for \( 0 \le i \lt 2^t \) and \( x_j \) for \( 0 \le j \lt 2^t-1 \) such that \( h_i(x_j)=0 \) for \( j\lt i \) and \( h_i(x_j)=1 \) for \( j \ge i \) . Define a labeled binary tree of height \( t \) corresponding to the binary search process. Specifically, the root is labeled by \( x_{2^{t-1}-1} \) , its left child by \( x_{2^{t-1}+2^{t-2}-1} \) , and its right child by \( x_{2^{t-1}-2^{t-2}-1} \) , and so on. If the label of an internal vertex of distance \( q \) from the root, where \( 0 \le q \le t-2 \) , is \( x_p \) , then the label of its left child is \( x_{p+2^{t-q-2}} \) and the label of its right child is \( x_{p-2^{t-q-2}} \) . It is easy to check that the root-to-leaf path corresponding to each of the functions \( h_i \) leads to leaf number \( i \) from the right among the leaves of the tree (counting from 0 to \( 2^t-1 \) ).
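For illustration, the following minimal Python sketch (ours, not part of the original argument) performs this binary search and checks that the path of each threshold \( h_i \) indeed ends at leaf number \( i \) from the right; the convention below is that answering 1 at a vertex moves to its right child, which holds the smaller leaf numbers.

    def walk(i, lo, hi):
        # Leaves lo..hi, numbered from the right, are still reachable; the
        # current internal vertex is labeled x_m with m = (lo + hi - 1) // 2,
        # so the root of the height-t tree is labeled x_{2^(t-1) - 1}.
        if lo == hi:
            return lo
        m = (lo + hi - 1) // 2
        if i <= m:                  # h_i(x_m) = 1 (as m >= i): go right
            return walk(i, lo, m)
        else:                       # h_i(x_m) = 0 (as m < i): go left
            return walk(i, m + 1, hi)

    t = 4
    assert all(walk(i, 0, 2 ** t - 1) == i for i in range(2 ** t))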
To prove the first part of the theorem, we first define the notion of a subtree \( T^{\prime } \) of height \( h \) of a labeled binary tree \( T \) by induction on \( h \) . Any leaf of \( T \) is a subtree of height 0. For \( h \ge 1, \) a subtree of height \( h \) is obtained from an internal vertex of \( T \) together with a subtree of height \( h-1 \) of the tree rooted at its left child and a subtree of height \( h-1 \) of the tree rooted at its right child. Note that if \( T \) is a labeled tree shattered by the class \( \mathcal {H} \) , then any subtree \( T^{\prime } \) of it, with the same labeling of its internal vertices, is also shattered by \( \mathcal {H} \) . With this definition, we prove the following simple lemma.
    Lemma 31.
Let \( p,q \) be positive integers and let \( T \) be a labeled binary tree of height \( p+q-1 \) whose internal vertices are colored by two colors: red and blue. Then \( T \) contains either a subtree of height \( p \) in which all internal vertices are red (a red subtree) or a subtree of height \( q \) in which all internal vertices are blue (a blue subtree).
    Proof.
We apply induction on \( p+q \) . The result is trivial for \( p=q=1 \) , as the root of \( T \) is either red or blue. Assuming the assertion holds for \( p^{\prime }+q^{\prime }\lt p+q \) , let \( T \) be of height \( p+q-1 \) . Without loss of generality, assume the root of \( T \) is red. If \( p=1 \) , we are done, as the root together with a leaf in the subtree of its left child and one in the subtree of its right child forms a red subtree of height 1. If \( p\gt 1, \) then by the induction hypothesis, the tree rooted at the left child of the root of \( T \) contains either a red subtree of height \( p-1 \) or a blue subtree of height \( q \) , and the same applies to the tree rooted at the right child of the root. If at least one of them contains a blue subtree as above, we are done; otherwise, the two red subtrees together with the root provide the required red subtree of height \( p \) .□
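The induction above translates directly into a recursive procedure. The following Python sketch (ours; the vertex class V and the encoding of the coloring by a hypothesis h mapping labels to \( \lbrace 0,1\rbrace \) are illustrative assumptions) returns either a red subtree of height \( p \) or a blue subtree of height \( q \) , where a vertex is red when \( h \) maps its label to 1.

    class V:
        # An internal vertex of a full binary tree; a child that is None is a leaf.
        def __init__(self, label, left=None, right=None):
            self.label, self.left, self.right = label, left, right

    def mono(v, p, q, h):
        # v roots a tree of height at least p + q - 1 (as in Lemma 31).
        if h[v.label] == 1:                     # the root is red
            if p == 1:
                return ('red', V(v.label))      # the root plus a leaf on each side
            c1, s1 = mono(v.left, p - 1, q, h)
            if c1 == 'blue':
                return (c1, s1)
            c2, s2 = mono(v.right, p - 1, q, h)
            if c2 == 'blue':
                return (c2, s2)
            return ('red', V(v.label, s1, s2))  # two red subtrees plus the root
        else:                                   # the root is blue; p and q swap roles
            if q == 1:
                return ('blue', V(v.label))
            c1, s1 = mono(v.left, p, q - 1, h)
            if c1 == 'red':
                return (c1, s1)
            c2, s2 = mono(v.right, p, q - 1, h)
            if c2 == 'red':
                return (c2, s2)
            return ('blue', V(v.label, s1, s2))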
We can now prove the first part of the theorem, showing that if the Littlestone dimension of \( \mathcal {H} \) is at least \( 2^{t+1}-1, \) then \( \mathcal {H} \) contains \( t+2 \) thresholds. We apply induction on \( t \) . If \( t=0, \) we have a tree of height 1 shattered by \( \mathcal {H} \) . Its root is labeled by some variable \( x_0 \) , and as the tree is shattered, there are two functions \( h_0,h_1 \in \mathcal {H} \) so that \( h_0(x_0)=1 \) and \( h_1(x_0)=0 \) , meaning that \( \mathcal {H} \) contains two thresholds, as needed. Assuming the desired result holds for \( t-1, \) we prove it for \( t \) , \( t \ge 1 \) . Let \( T \) be a labeled binary tree of height \( 2^{t+1}-1 \) shattered by \( \mathcal {H} \) . Let \( h \) be an arbitrary member of \( \mathcal {H} \) and define a two-coloring of the internal vertices of \( T \) as follows: if an internal vertex is labeled by \( x \) and \( h(x)=1, \) then color it red; else color it blue. Since \( 2^{t+1}-1=2 \cdot 2^t-1 \) , Lemma 31 with \( p=q=2^t \) implies that \( T \) contains either a red or a blue subtree \( T^{\prime } \) of height \( 2^t \) . In the first case, define \( h_0=h \) and let \( X \) be the set of all variables \( x \) so that \( h(x)=1 \) . Let \( x_0 \) be the label of the root of \( T^{\prime } \) and let \( T^{\prime \prime } \) be the subtree of \( T^{\prime } \) rooted at the left child of its root. Let \( \mathcal {H}^{\prime } \) be the set of all \( h^{\prime } \in \mathcal {H} \) so that \( h^{\prime }(x_0)=0 \) . Note that \( \mathcal {H}^{\prime } \) shatters the tree \( T^{\prime \prime } \) , and that the height of \( T^{\prime \prime } \) is \( 2^t-1 \) . We can thus apply the induction hypothesis and get a set of \( t+1 \) thresholds \( h_1,h_2,\ldots ,h_{t+1} \in \mathcal {H}^{\prime } \) and variables \( x_1,x_2,\ldots ,x_t \in X \) so that \( h_i(x_j)=1 \) iff \( j \ge i \) . Adding \( h_0 \) and \( x_0 \) to these, we get the desired \( t+2 \) thresholds.
Similarly, if \( T \) contains a blue subtree \( T^{\prime } \) , define \( h_{t+1}=h \) and let \( X \) be the set of all variables \( x \) so that \( h(x)=0 \) . In this case, denote the label of the root of \( T^{\prime } \) by \( x_{t} \) and let \( T^{\prime \prime } \) be the subtree of \( T^{\prime } \) rooted at the right child of its root. Let \( \mathcal {H}^{\prime } \) be the set of all \( h^{\prime } \in \mathcal {H} \) so that \( h^{\prime }(x_{t})=1 \) . As before, \( \mathcal {H}^{\prime } \) shatters the tree \( T^{\prime \prime } \) , whose height is \( 2^t-1 \) . By the induction hypothesis, we get \( t+1 \) thresholds \( h_0,h_1,\ldots ,h_t \) and variables \( x_0, x_1,\ldots ,x_{t-1} \in X \) so that \( h_i(x_j)=1 \) if and only if \( j \ge i \) , and the desired result follows by appending to them \( h_{t+1} \) and \( x_t \) . This completes the proof.
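The two parts of the proof can also be exercised together. The sketch below (ours; bst rebuilds the binary-search tree from the second part, extract follows the induction of the first part, and V and mono are taken from the sketch after Lemma 31; hypotheses are encoded as dictionaries from instances to \( \lbrace 0,1\rbrace \) ) recovers \( t+2=3 \) thresholds from a height-3 tree shattered by the eight thresholds over seven points.

    def bst(lo, hi):
        # Binary-search tree over thresholds lo..hi; answering 1 at a vertex
        # goes to the right child, which holds the smaller threshold indices.
        if lo == hi:
            return None                        # a leaf
        m = (lo + hi - 1) // 2
        return V(m, bst(m + 1, hi), bst(lo, m))

    def extract(v, H, t):
        # v roots a tree of height 2**(t + 1) - 1 shattered by H; returns
        # t + 2 hypotheses and t + 1 points with ths[i][pts[j]] = 1 iff j >= i.
        if t == 0:
            x0 = v.label
            return ([next(g for g in H if g[x0] == 1),
                     next(g for g in H if g[x0] == 0)], [x0])
        h = H[0]                               # color the tree by any h in H
        color, sub = mono(v, 2 ** t, 2 ** t, h)
        if color == 'red':                     # h plays h_0; recurse on the left
            x0 = sub.label
            hs, xs = extract(sub.left, [g for g in H if g[x0] == 0], t - 1)
            return ([h] + hs, [x0] + xs)
        else:                                  # h plays h_{t+1}; recurse on the right
            xt = sub.label
            hs, xs = extract(sub.right, [g for g in H if g[xt] == 1], t - 1)
            return (hs + [h], xs + [xt])

    H = [{x: int(x >= i) for x in range(7)} for i in range(8)]
    ths, pts = extract(bst(0, 7), H, 1)
    assert all(ths[i][pts[j]] == int(j >= i) for i in range(3) for j in range(2))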

    References

    [1]
    Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. 2008. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT’08). 263–274.
    [2]
    Jacob D. Abernethy, Chansoo Lee, Audra McMillan, and Ambuj Tewari. 2017. Online learning via differential privacy. CoRR abs/1711.10019 (2017).
    [3]
    Naman Agarwal and Karan Singh. 2017. The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.). Vol. 70. PMLR, 32–40. http://proceedings.mlr.press/v70/agarwal17a.html.
    [4]
    Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. 2020. Closure properties for private classification and online prediction. arXiv preprint arXiv:2003.04509 (2020).
    [5]
    Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. 2019. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC’19). ACM, New York, NY.
    [6]
    Srinivasan Arunachalam, Yihui Quek, and John A. Smolin. 2021. Private learning implies quantum stability. CoRR abs/2102.07171 (2021). https://arxiv.org/abs/2102.07171.
    [7]
    Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. 2018. Learners that use little information. In Algorithmic Learning Theory, ALT 2018, 7–9 April 2018, Lanzarote, Canary Islands, Spain (Proceedings of Machine Learning Research), Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan (Eds.). Vol. 83. PMLR, 25–55. http://proceedings.mlr.press/v83/bassily18a.html.
    [8]
    Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. 2014. Bounds on the sample complexity for private learning and private data release. Machine Learning 94, 3 (2014), 401–437.
    [9]
    Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. 2019. Private center points and learning of halfspaces. In Conference on Learning Theory, COLT 2019, 25–28 June 2019, Phoenix, AZ, USA (Proceedings of Machine Learning Research), Alina Beygelzimer and Daniel Hsu (Eds.). Vol. 99. PMLR, 269–282.
    [10]
    Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2015. Learning privately with labeled and unlabeled examples. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms. 461–477.
    [11]
    Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2016. Private learning and sanitization: Pure vs. approximate differential privacy. Theory of Computing 12, 1 (2016), 1–61.
    [12]
    Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2019. Characterizing the sample complexity of pure private learners. Journal of Machine Learning Research 20, 146 (2019), 1–33. http://jmlr.org/papers/v20/18-269.html.
    [13]
    Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. 2009. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT’09). 1–11.
    [14]
    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36, 4 (1989), 929–965.
    [15]
    Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. Journal of Machine Learning Research 2 (2002), 499–526. http://jmlr.org/papers/v2/bousquet02a.html.
    [16]
Olivier Bousquet, Roi Livni, and Shay Moran. 2019. Passing tests without memorizing: Two models for fooling discriminators. arxiv:cs.LG/1902.03468 (2019).
    [17]
    Mark Bun. 2020. A computational separation between private learning and online learning. In Proceedings of the 34th Joint Conference on Neural Information Processing Systems (NeurIPS’20).
    [18]
    Mark Bun, Marco L. Carmosino, and Jessica Sorrell. 2020. Efficient, noise-tolerant, and private learning via boosting. CoRR abs/2002.01100 (2020).
    [19]
    Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. 2018. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing (STOC’18). ACM, New York, NY, 74–86.
    [20]
    Mark Bun, Marco Gaboardi, and Satchit Sivakumar. 2021. Multiclass versus binary differentially private PAC learning. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS’21). 1–12.
    [21]
    Mark Bun, Kobbi Nissim, and Uri Stemmer. 2016. Simultaneous private learning of multiple concepts. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science (ITCS’16). ACM, New York, NY, 369–380.
    [22]
    Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. 2015. Differentially private release and learning of threshold functions. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS’15). IEEE, Los Alamitos, CA, 634–649.
    [23]
Mark Bun. 2016. New Separations in the Complexity of Differential Privacy. Ph.D. Dissertation. Graduate School of Arts & Sciences, Harvard University.
    [24]
    Nicolò Cesa-Bianchi and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.
    [25]
    Hunter Chase and James Freitag. 2018. Model theory and machine learning. arXiv preprint arXiv:1801.06566 (2018).
    [26]
    Hunter Chase and James Freitag. 2019. Model theory and machine learning. Bulletin of Symbolic Logic 25, 03 (Feb. 2019), 319–332.
    [27]
    Alon Cohen, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, and Shay Moran. 2019. Learning to screen. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS’19). 8612–8621. http://papers.nips.cc/paper/9067-learning-to-screen.
    [28]
    José R. Correa, Paul Dütting, Felix A. Fischer, and Kevin Schewior. 2019. Prophet inequalities for I.I.D. random variables from an unknown distribution. In Proceedings of the 2019 ACM Conference on Economics and Computation, EC 2019, Phoenix, AZ, USA, June 24–28, 2019, Anna Karlin, Nicole Immorlica, and Ramesh Johari (Eds.). ACM, New York, NY, 3–17.
    [29]
    Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. 2019. On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization (UMAP’19 Adjunct). ACM, New York, NY, 309–315.
    [30]
    Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through awareness. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8–10, 2012, Shafi Goldwasser (Ed.). ACM, New York, NY, 214–226.
    [31]
    Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT. Lecture Notes in Computer Science, Vol. 4004. Springer, 486–503.
    [32]
    Cynthia Dwork and Jing Lei. 2009. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC’09). 371–380.
    [33]
    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography (TCC’06). 265–284.
    [34]
    Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.
    [36]
    Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS’10). IEEE, Los Alamitos, CA, 51–60.
    [37]
    P. Erdős and R. Rado. 1952. Combinatorial theorems on classifications of subsets of a given set. Proceedings of the London Mathematical Society s3-2, 1 (1952), 417–439.
    [38]
    Vitaly Feldman and David Xiao. 2015. Sample complexity bounds on differentially private learning via communication complexity. SIAM Journal on Computing 44, 6 (2015), 1740–1764.
    [39]
    Eran Gat and Shafi Goldwasser. 2011. Probabilistic search algorithms with unique answers and their cryptographic applications. Electronic Colloquium on Computational Complexity 18 (2011), 136.
    [40]
    Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. 2020. Sample-efficient proper PAC learning with approximate differential privacy. CoRR abs/2012.03893 (2020).
    [41]
    Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. 2021. User-level private learning via correlated sampling. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS’21). 1–13.
    [42]
    Noah Golowich. 2021. Differentially private nonparametric regression under a growth condition. In Conference on Learning Theory, COLT 2021, 15–19 August 2021, Boulder, Colorado, USA (Proceedings of Machine Learning Research), Mikhail Belkin and Samory Kpotufe (Eds.). Vol. 134. PMLR, 2149–2192.
    [43]
    Noah Golowich and Roi Livni. 2021. Littlestone classes are privately online learnable. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS’21). 1–12.
    [44]
    Alon Gonen, Elad Hazan, and Shay Moran. 2019. Private learning implies online learning: An efficient reduction. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS’19).
    [45]
R. L. Graham, B. L. Rothschild, and J. H. Spencer. 1990. Ramsey Theory. Wiley. https://books.google.com/books?id=55oXT60dC54C.
    [46]
    Elad Hazan. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2, 3–4 (Aug. 2016), 157–325.
    [47]
    Wilfrid Hodges. 1997. A Shorter Model Theory. Cambridge University Press, New York, NY.
    [48]
    Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. 2022. Reproducibility in learning. arxiv:cs.LG/2201.08430 (2022).
    [49]
    Matthew Joseph, Jieming Mao, Seth Neel, and Aaron Roth. 2019. The role of interactivity in local differential privacy. In Proceedings of the 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS’19).
    [50]
    Young Hun Jung, Baekjin Kim, and Ambuj Tewari. 2020. On the equivalence between online and private learnability beyond binary classification. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS’20). 1–10.
    [51]
    Adam Kalai and Santosh Vempala. 2002. Geometric Algorithms for Online Optimization. Technical Report MIT-LCS-TR-861. Massachusetts Institute of Technology, Cambridge, MA.
    [52]
    Adam Kalai and Santosh Vempala. 2005. Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71, 3 (Oct. 2005), 291–307.
    [53]
    Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. 2019. Privately learning thresholds: Closing the exponential gap. arxiv:cs.DS/1911.10137 (2019).
    [54]
    Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. 2020. Privately learning thresholds: Closing the exponential gap. In Conference on Learning Theory, COLT 2020, 9–12 July 2020, Virtual Event [Graz, Austria] (Proceedings of Machine Learning Research), Jacob D. Abernethy and Shivani Agarwal (Eds.). Vol. 125. PMLR, 2263–2285.
    [55]
    Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. 2020. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS’20). 1–10.
    [56]
    Marek Karpinski and Angus Macintyre. 1997. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences 54, 1 (1997), 169–176.
    [57]
    Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM Journal on Computing 40, 3 (2011), 793–826.
    [58]
    Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. 2009. Releasing search queries and clicks privately. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, NY, 171–180.
    [59]
    Michael C. Laskowski. 1992. Vapnik-Chervonenkis classes of definable sets. Journal of the London Mathematical Society 2, 2 (1992), 377–384.
    [60]
    Katrina Ligett and Moshe Shenfeld. 2019. A necessary and sufficient stability notion for adaptive generalization. CoRR abs/1906.00930 (2019).
    [61]
    Nick Littlestone. 1987. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2, 4 (1987), 285–318.
    [62]
    Roi Livni and Pierre Simon. 2013. Honest compressions and their application to compression schemes. In Proceedings of the Conference on Learning Theory. 77–92.
    [63]
    David A. McAllester. 1999. Some PAC-Bayesian theorems. Machine Learning 37, 3 (1999), 355–363.
    [64]
    Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE, Los Alamitos, CA, 94–103.
    [65]
    Shlomo Moran, Marc Snir, and Udi Manber. 1985. Applications of Ramsey’s theorem to decision tree complexity. Journal of the ACM 32, 4 (1985), 938–949.
    [66]
D. Mubayi and A. Suk. 2017. A survey of quantitative bounds for hypergraph Ramsey problems. arXiv e-prints arxiv:math.CO/1707.04229 (2017).
    [67]
    Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. 2018. A direct sum result for the information complexity of learning. In Conference on Learning Theory, COLT 2018, Stockholm, Sweden, 6–9 July 2018 (Proceedings of Machine Learning Research), Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (Eds.). Vol. 75. PMLR, 1547–1568. http://proceedings.mlr.press/v75/nachum18a.html.
    [68]
    Ido Nachum and Amir Yehudayoff. 2019. Average-case information complexity of learning. In Algorithmic Learning Theory, ALT 2019, 22–24 March 2019, Chicago, Illinois, USA (Proceedings of Machine Learning Research), Aurélien Garivier and Satyen Kale (Eds.). Vol. 98. PMLR, 633–646. http://proceedings.mlr.press/v98/nachum19a.html.
    [69]
    Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2019. How to use heuristics for differential privacy. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science (FOCS’19). 72–93.
    [70]
Igor Carboni Oliveira and Rahul Santhanam. 2018. Pseudo-derandomizing learning and approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM’18), E. Blais, J. D. P. Rolim, and D. Steurer (Eds.). Leibniz International Proceedings in Informatics. Schloss Dagstuhl–Leibniz-Zentrum fur Informatik, Dagstuhl Publishing, Dagstuhl, Germany, Article 55, 19 pages.
    [71]
Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, New York, NY.
    [72]
    Menachem Sadigurschi and Uri Stemmer. 2021. On the sample complexity of privately learning axis-aligned rectangles. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS’21). 1–14.
    [73]
    Shai Shalev-Shwartz. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4, 2 (Feb. 2012), 107–194.
    [74]
    Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY.
    [75]
    Shai Shalev-Shwartz and Yoram Singer. 2007. A primal-dual perspective of online learning algorithms. Machine Learning 69, 2 (2007), 115–142.
    [76]
    Saharon Shelah. 1978. Classification Theory and the Number of Non-isomorphic Models. North-Holland Publishing, Amsterdam, Netherlands.
    [77]
    Salil Vadhan. 2017. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, Yehuda Lindell (Ed.). Springer International Publishing AG, Cham, Switzerland, 347–450.
    [78]
    Leslie G. Valiant. 1984. A theory of the learnable. Communications of the ACM 27, 11 (1984), 1134–1142.
    [79]
    Vladimir Vapnik and Alexey Chervonenkis. 1974. Theory of Pattern Recognition. Nauka.
