Error Exponent in Agnostic PAC Learning

Adi Hendel and Meir Feder School of Electrical Engineering
Tel-Aviv University, Tel-Aviv, Israel
Email: adihendel@mail.tau.ac.il, meir@tau.ac.il
Abstract

Statistical learning theory and the Probably Approximately Correct (PAC) criterion are the common approach to mathematical learning theory. PAC is widely used to analyze learning problems and algorithms, and has been studied thoroughly. Uniform worst-case bounds on the convergence rate have been well established using, e.g., VC theory or Rademacher complexity. However, in a typical scenario the performance could be much better. In this paper, we consider PAC learning using a somewhat different tradeoff, the error exponent (a well-established analysis method in Information Theory), which describes the exponential behavior of the probability that the risk will exceed a certain threshold as a function of the sample size. We focus on binary classification and find, under some stability assumptions, an improved distribution-dependent error exponent for a wide range of problems, establishing the exponential behavior of the PAC error probability in agnostic learning. Interestingly, under these assumptions, agnostic learning may have the same error exponent as realizable learning. The error exponent criterion can be applied to analyze knowledge distillation, a problem that so far lacks a theoretical analysis.

I Introduction

Statistical machine learning studies the generalization ability and convergence rate of learning algorithms. One of the most popular criteria for learnability is the Probably Approximately Correct (PAC) criterion, suggested in [1, 2], which describes the probability that a learning algorithm outputs a hypothesis that is not too far from the optimal one.

In this work, we will consider the class of Empirical Risk Minimization (ERM) predictors, which is the most prominent method for learning problems. ERM predictors choose the hypothesis achieving minimal loss on a given training sample, and their analysis under the PAC criterion is well established through VC theory [3, 4].

The classical setting divides the learning problem into two cases: realizable learning, in which the target function is taken from the hypothesis class, and agnostic learning, in which the target function could be outside the class. The general worst-case upper bounds for both cases are well established; see [5] for example.

Although VC theory is powerful, it provides a uniform upper bound for the worst-case scenario, whereas in a typical scenario the convergence rate could be much faster, as suggested by [6]. Indeed, the recent rise of deep learning demonstrates that uniform bounds fail to describe many practical situations, and that better characterizations come from non-uniform, possibly distribution-dependent analysis.

In this paper we consider agnostic PAC learning for the case of binary labels and 0-1 loss function. We derive an improved distribution-dependent error exponent for the PAC error probability, using some assumptions, for a wide range of learning problems. Moreover, we show that under the specified assumptions, the derived error exponent can be the same for both agnostic and realizable learning.

I-A Related Work

VC theory and the PAC model provide conditions for uniform consistency and bounds that are achieved, e.g., by ERM predictors [3]. This theory fails to explain the success of recent learning models, such as neural networks, as presented in [7, 8], where practical learning rates can be much faster than the ones predicted by VC theory. Moreover, in [9] different types of over-parameterized models are analyzed and it is proved that any uniform bound would yield a bad generalization bound. This issue motivated theories that provide better, non-uniform learning rates.

In this direction, works such as [10, 11, 12] establish improved bounds for specific cases and algorithms. However, these works do not provide a general theory. [13, 14] developed tighter bounds for distribution-dependent PAC-Bayes priors. [15] showed the existence of classes with faster rates than the classical agnostic bound, but the provided condition for such a rate is impractical for infinite feature spaces. Other works relax the uniformity property, as done by [16], who proposed a relaxed model of PAC in which the bound on the learning rate may depend on a hypothesis, but is uniform over all distributions consistent with that hypothesis. Other works focus on totally non-uniform learning bounds. For example, [17, 18] established a theory for non-uniform consistency, in which an algorithm is considered consistent if it converges to the optimal risk for any ground truth, and showed that such an algorithm exists for separable metric spaces. In [6], a theory for non-uniform PAC learning in the realizable setting is developed, showing that the learning rate can be one of three types: exponential, linear, and arbitrarily slow.

II Problem Formulation

Let the training data be $n$ pairs of data samples and their labels $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, where $x_{i}$ are i.i.d. and drawn from a feature space $\mathbf{X}\subseteq\mathbf{R^{N}}$ according to an unknown distribution $\mathcal{F}$, and the labels $y_{i}\in\{0,1\}$ are generated by some unknown deterministic function $y_{i}=g(x_{i})$, called the ground-truth function. We have a hypothesis class $F_{\Theta}=\{f_{\theta},\theta\in\Theta\}$, and would like to find the closest hypothesis to the ground truth in the class under some loss function. We focus on the setting in which the hypotheses range is binary and the loss function is the 0-1 loss:

$$\ell(y_{a},y_{b})=\begin{cases}0&\quad y_{a}=y_{b}\\ 1&\quad y_{a}\neq y_{b}\end{cases}$$

The following notations are with regard to some arbitrary function $f$, where $f$ can be the ground truth or some other function in discussion. For hypothesis class $F_{\Theta}$, denote the risk between hypothesis $f_{\theta}\in F_{\Theta}$ and some function $f(x)$ as:

$$R_{f}(\theta)=R(f,f_{\theta})=\int_{\mathbf{X}}\ell(f(x),f_{\theta}(x))\,d\mathcal{F}(x),\quad\theta\in\Theta \tag{1}$$

and the empirical risk between $f_{\theta}$ and $f(x)$ on sample $x^{n}$ as:

$$R_{f}^{emp}(\theta,x^{n})=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_{i}),f_{\theta}(x_{i})),\quad\theta\in\Theta \tag{2}$$

In this paper, we analyze the Empirical Risk Minimization (ERM) algorithm, which selects the hypothesis that minimizes the empirical risk on the sample $x^{n}$ out of all the hypotheses in the class. Specifically, the ERM on a sample $x^{n}$ and hypothesis class $\Theta$, with regard to a function $f(x)$, is defined as:

$$\hat{\theta}^{f}_{n}=\hat{\theta}^{f}(x^{n})=\operatorname*{arg\,min}_{\theta\in\Theta}R_{f}^{emp}(\theta,x^{n}) \tag{3}$$

Whenever there are multiple hypotheses with the same minimal empirical risk, we use the convention of choosing the one maximizing the true risk (i.e., the worst one).
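As an illustration of the empirical risk (2), the ERM rule (3) and the worst-case tie-breaking convention, here is a minimal Python sketch. It assumes a hypothetical one-parameter family of threshold classifiers on $[0,1]$ searched over a finite grid, and it takes the true risks as given inputs; the helper names `threshold`, `empirical_risk` and `erm` are illustrative and do not come from the paper.

```python
import numpy as np

def threshold(b):
    """Hypothetical hypothesis f_b(x) = 1 if x >= b else 0."""
    return lambda x: (np.asarray(x) >= b).astype(int)

def empirical_risk(f, f_theta, xs):
    """Eq. (2): average 0-1 loss between f and f_theta on the sample xs."""
    return float(np.mean(f(xs) != f_theta(xs)))

def erm(f, thetas, true_risks, xs):
    """Eq. (3) on a finite grid, breaking ties by the largest true risk."""
    true_risks = np.asarray(true_risks, dtype=float)
    emp = np.array([empirical_risk(f, threshold(b), xs) for b in thetas])
    ties = np.flatnonzero(np.isclose(emp, emp.min()))  # all empirical minimizers
    worst = ties[np.argmax(true_risks[ties])]          # worst one in true risk
    return float(thetas[worst])

# Toy data (assumed): uniform features on [0, 1], ground truth g = f_{0.5}.
xs = np.array([0.12, 0.33, 0.58, 0.91])
g = threshold(0.5)
thetas = np.arange(11) / 10.0          # candidate thresholds 0.0, 0.1, ..., 1.0
true_risks = np.abs(thetas - 0.5)      # R_g(b) = |b - 0.5| under the uniform law
theta_hat = erm(g, thetas, true_risks, xs)
```

On this sample the thresholds 0.4 and 0.5 both reach zero empirical risk, and the convention selects 0.4 because its true risk is larger.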

Denote the hypothesis achieving minimum risk with regard to the ground truth $g$ as $\theta_{opt}$:

$$\theta_{opt}=\operatorname*{arg\,min}_{\theta\in\Theta}R_{g}(\theta),\quad f_{opt}=f_{\theta_{opt}} \tag{4}$$

We will refer to $f_{opt}$ as the projection of $g$ on $F_{\Theta}$. We assume for simplicity that the hypothesis class is non-degenerate in the sense that there is no subset of the feature space $X'\subseteq\mathbf{X}$ for which all hypotheses coincide. That is, for any set $X'\subseteq\mathbf{X}$ with positive probability, we have:

$$\Pr\left(f_{\theta}(x)=\text{const}\ \forall\theta\in\Theta\mid x\in X'\right)=0 \tag{5}$$

This is not restrictive, as any part of the feature space on which all hypotheses coincide will contribute the same risk to all hypotheses, thus not affecting the choice of the ERM. We will also assume that the probability measure is positive on any set of positive (Lebesgue) measure (this is not restrictive, as regions with zero probability have no effect).

II-A PAC Learning

In the context of ERM, we say that the class $F_{\Theta}$ is (agnostic) PAC learnable [19] if there exist a sample size $N(\delta,\eta)$ and an algorithm $\hat{\theta}_{n}^{g}$ such that for every ground truth function $g(x)$, every probability distribution $\mathcal{F}$ on $\mathbf{X}$ and every $\delta,\eta\in(0,1)$, for $n>N$, with probability at least $1-\eta$ we have:

$$R_{g}(\hat{\theta}_{n}^{g})<R_{g}(\theta_{opt})+\delta. \tag{6}$$

PAC actually describes the relationship between three quantities: the deviation from the optimal risk $\delta$, the probability $\eta$ of a deviation larger than $\delta$, and the sample size $n$. We will refer to $\eta$ as the PAC error probability for brevity.

The analysis of learning algorithms using PAC is usually done by writing one parameter as a function of the other two. Most notably, one writes $n$ as a function of $\eta$ and $\delta$ (known as sample complexity) or $\delta$ as a function of $n$ for some fixed value of $\eta$ (known as excess risk). In this way, we can say one algorithm is better than the other if, for example, it has a better sample complexity (i.e., $n$ increases more slowly as a function of $\frac{1}{\delta}$ for a fixed $\eta$). We propose to fix $\delta$ and to look instead at the probability of deviation $\eta$ as a function of the sample size $n$. In this case, we say that one algorithm is better than the other if $\eta$ decays faster as a function of $n$.
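To see what "fix $\delta$ and track $\eta$ as a function of $n$" looks like numerically, the following Monte Carlo sketch estimates $\eta$ in a deliberately simple toy problem. Everything specific here is an assumption made for illustration: the features are uniform on $[0,1]$, the ground truth is a single threshold at 0.5 (a realizable case, not the agnostic setting analyzed in this paper), and the hypothesis class is the set of all thresholds. In that toy case every threshold lying between the nearest samples on either side of 0.5 has zero empirical risk, so with worst-case tie-breaking the excess risk is (as a supremum) the larger of the two gaps.

```python
import numpy as np

def pac_error_probability(n, delta, b_true=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of eta = Pr(excess risk > delta) for sample size n.

    Toy assumptions: x ~ Uniform[0, 1], labels from the threshold b_true,
    ERM over all thresholds with worst-case tie-breaking.
    """
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 1.0, size=(trials, n))
    # Largest sample below the true threshold (0 if there is none).
    below = np.where(xs < b_true, xs, 0.0).max(axis=1)
    # Smallest sample at or above the true threshold (1 if there is none).
    above = np.where(xs >= b_true, xs, 1.0).min(axis=1)
    # Under the uniform law the risk of threshold b is |b - b_true|.
    worst_excess = np.maximum(b_true - below, above - b_true)
    return float(np.mean(worst_excess > delta))

# eta as a function of n for a fixed deviation delta = 0.1
etas = {n: pac_error_probability(n, delta=0.1) for n in (5, 50, 200)}
```

The dictionary `etas` holds one estimate per sample size; comparing its values across $n$ is the kind of comparison the criterion proposed above is built on.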

II-B VC Theory

VC theory [3] provides consistency conditions and uniform (worst-case) bounds for PAC learning of ERM predictors. This is done using the VC dimension of the hypothesis class, denoted $h$, defined as the maximum sample size $h$ for which some sample of that size can be separated into two classes in all $2^{h}$ possible label sequences, using functions from the hypothesis class.

VC theory provides the following well-known results for ERM (see [5] for example). In agnostic learning, for $n>h$, with probability at least $1-\eta$ we have:

$$R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})\leq 4\sqrt{2\frac{h\ln{\frac{2en}{h}}+\ln{\frac{2}{\eta}}}{n}} \tag{7}$$

In realizable learning, for $n>h$, with probability at least $1-\eta$:

$$R_{g}(\hat{\theta}^{g}_{n})\leq 4\frac{h(\ln{\frac{2ne}{h}})-\ln{\frac{\eta}{4}}}{n} \tag{8}$$

These bounds describe the generalization and convergence of the ERM predictor.

Focusing on (7), we can get $\eta$ as a function of $\delta$ and $n$ by setting the right-hand side to $\delta$ and solving for $\eta$:

$$\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)\leq 2e^{-\frac{\delta^{2}}{32}n+h\ln{\frac{2en}{h}}} \tag{9}$$

Similarly, we get the following for the realizable case:

$$\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)\leq 4e^{-\frac{\delta}{4}n+h\ln{\frac{2en}{h}}} \tag{10}$$
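To get a feel for how these two bounds behave, the short sketch below evaluates the right-hand sides of (9) and (10). The function names and the particular values of $n$, $h$ and $\delta$ are illustrative assumptions, not taken from the paper.

```python
import math

def vc_agnostic_bound(n, h, delta):
    """Right-hand side of (9): 2 * exp(-(delta^2 / 32) * n + h * ln(2en / h))."""
    return 2.0 * math.exp(-(delta ** 2 / 32.0) * n + h * math.log(2 * math.e * n / h))

def vc_realizable_bound(n, h, delta):
    """Right-hand side of (10): 4 * exp(-(delta / 4) * n + h * ln(2en / h))."""
    return 4.0 * math.exp(-(delta / 4.0) * n + h * math.log(2 * math.e * n / h))

# Illustrative values (assumed): VC dimension h = 1 and deviation delta = 0.1.
agnostic_1k = vc_agnostic_bound(1_000, 1, 0.1)
realizable_1k = vc_realizable_bound(1_000, 1, 0.1)
agnostic_1m = vc_agnostic_bound(1_000_000, 1, 0.1)
```

With these values the agnostic bound is still larger than 1 (and therefore uninformative) at $n=1000$, while the realizable bound is already far below 1; the agnostic bound only becomes small for much larger $n$, which reflects the gap between the $\delta^{2}/32$ and $\delta/4$ rates.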

This formulation of the upper bound shows that the PAC error probability $\eta$ decays exponentially with $n$ and allows us to explore its error exponent. Recall that the error exponent $d$ of a sequence $a_{n}$ is defined as:

$$d=-\lim_{n\to\infty}\frac{1}{n}\ln{a_{n}} \tag{11}$$

We will use the notation $a_{n}\doteq b_{n}$ to indicate that the sequence $a_{n}$ has the same error exponent as $b_{n}$. The concept of the error exponent (see Section 5.6 in [20], for example) has proven useful in Information Theory for analyzing the rate at which probabilities decay to zero. It allows utilizing powerful mathematical tools such as the method of types [21] and Sanov’s theorem [22]. We can see from (9) that the error exponent of the bound in the agnostic case is $\frac{\delta^{2}}{32}$, and from (10) that the error exponent of the bound in the realizable case is $\frac{\delta}{4}$. We note that the bound in (7) can be manipulated using a chaining technique [23] to get rid of the $\ln{\frac{2en}{h}}$ factor, but the resulting error exponent will be worse. In the next sections, we will derive an improved distribution-dependent bound for the PAC error probability $\eta=\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)$ for the agnostic case. This will be done under some assumptions on the learning problem (i.e., on the hypothesis class and the ground truth) described in the next sections. Under these assumptions the error exponent in the agnostic case can be the same as in the realizable case for small enough $\delta$.
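For readers who want the intermediate step, the exponents quoted from (9) and (10) follow by applying definition (11) directly to the right-hand sides of those bounds, with $h$ and $\delta$ held fixed. This is a routine calculation that characterizes the bounds only, not the true error probability:

```latex
\begin{align*}
-\frac{1}{n}\ln\left(2e^{-\frac{\delta^{2}}{32}n+h\ln\frac{2en}{h}}\right)
 &= \frac{\delta^{2}}{32}-\frac{\ln 2}{n}-\frac{h}{n}\ln\frac{2en}{h}
 \;\xrightarrow[n\to\infty]{}\;\frac{\delta^{2}}{32},\\
-\frac{1}{n}\ln\left(4e^{-\frac{\delta}{4}n+h\ln\frac{2en}{h}}\right)
 &= \frac{\delta}{4}-\frac{\ln 4}{n}-\frac{h}{n}\ln\frac{2en}{h}
 \;\xrightarrow[n\to\infty]{}\;\frac{\delta}{4}.
\end{align*}
```

Both correction terms vanish because $\frac{\ln n}{n}\to 0$, so the logarithmic factor affects the bound at finite $n$ but not its exponent.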

III Preliminaries

In this section we introduce a few key concepts. In order to provide some intuition, we will use the k-boundary hypothesis class as a case study and demonstrate these concepts on it.

Definition 1 (k-boundary hypothesis class)

Let $X\subseteq\mathbf{R}$. The k-boundary hypothesis set is defined as

$$f_{b_{1},\ldots,b_{k}}(x)=\begin{cases}0&\quad x<b_{1}\\ 1&\quad b_{1}\leq x<b_{2}\\ 0&\quad b_{2}\leq x<b_{3}\\ \ldots\\ 1&\quad b_{k}\leq x\end{cases}$$

where $b_{1}\leq b_{2}\leq\ldots\leq b_{k}$ and $x,b_{1},\ldots,b_{k}\in X$. For uniqueness, equality is allowed only between the first 2 parameters or between last parameters (e.g., $b_{1}=b_{2}<b_{3}<\ldots<b_{k-2}=b_{k-1}=b_{k}$).

For example, on the feature space $X=[0,1]$ with uniform distribution, the 2-boundary function with parameters $b_{1}=0.5,b_{2}=0.9$ is

$$g(x)=f_{b_{1},b_{2}}(x)=\begin{cases}0&\quad 0\leq x<0.5\\ 1&\quad 0.5\leq x<0.9\\ 0&\quad 0.9\leq x\leq 1\end{cases}$$
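A k-boundary function can be evaluated by counting how many boundaries lie at or below $x$ and taking the parity of that count. The sketch below does this with NumPy; the helper name `k_boundary` is an assumption made for illustration. The parity rule reproduces the 2-boundary example above, where the last region is labeled 0.

```python
import numpy as np

def k_boundary(boundaries):
    """Return f_{b_1,...,b_k}: label 0 left of b_1, flipping at every boundary."""
    b = np.asarray(boundaries, dtype=float)  # assumed sorted: b_1 <= ... <= b_k

    def f(x):
        x = np.asarray(x, dtype=float)
        # side="right" counts the boundaries b_j with b_j <= x.
        crossed = np.searchsorted(b, x, side="right")
        return crossed % 2

    return f

# The 2-boundary ground truth with b_1 = 0.5 and b_2 = 0.9.
g = k_boundary([0.5, 0.9])
labels = g([0.0, 0.49, 0.5, 0.7, 0.9, 1.0])
```

Here `labels` is the label of each probe point, with the boundary points 0.5 and 0.9 falling into the region to their right, as in the definition.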

We will use this example throughout this section to demonstrate the presented concepts. Another important hypothesis class, which is more closely related to neural networks, is the class of linear classifiers:

Definition 2 (linear hypothesis class)

Let there be a feature space $X\subseteq\mathbf{R^{k}}$. The k-dimensional linear hypothesis set is:

$$f_{b_{0},\ldots,b_{k}}(x)=\mathbf{1}(b_{0}+b_{1}x_{1}+\ldots+b_{k}x_{k}>0)$$

where $(b_{0},\ldots,b_{k})\in\mathbf{R}^{k+1}$, $(x_{1},\ldots,x_{k})\in X$ and $\mathbf{1}(\cdot)$ is the indicator function.
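The linear hypothesis is a thresholded affine function, which takes only a few lines to evaluate. In the sketch below the helper name `linear_hypothesis` and the example coefficients are assumptions chosen for illustration.

```python
import numpy as np

def linear_hypothesis(b):
    """Return f_{b_0,...,b_k}(x) = 1(b_0 + b_1 x_1 + ... + b_k x_k > 0)."""
    b = np.asarray(b, dtype=float)

    def f(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))  # one row per sample
        return (b[0] + x @ b[1:] > 0).astype(int)

    return f

# Example in R^2 (assumed coefficients): label 1 strictly above the line x_1 + x_2 = 1.
f = linear_hypothesis([-1.0, 1.0, 1.0])
labels = f([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
```

Because the inequality is strict, a point lying exactly on the separating hyperplane, such as $(0.5, 0.5)$ here, receives label 0.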

Definition 3 (Generalized Optimum Point)

For hypothesis class $\{f_{\theta},\theta\in\Theta\}$ and ground truth function $g(x)$, we say that $\theta\in\Theta$ is a generalized optimum point (GLP) of $R_{g}(\theta)$ if $\forall\tilde{\theta}\in\Theta\setminus\theta$ there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability, such that $\forall x\in\tilde{X}$ we have $\ell(f_{\theta}(x),g(x))<\ell(f_{\tilde{\theta}}(x),g(x))$.

In simple words, $\theta$ is a GLP if no other hypothesis can beat it uniformly on the feature space $\mathbf{X}$. Notice that the hypothesis $\theta_{opt}$ minimizing the risk is always a GLP, as for every other hypothesis in the class there must exist a set for which $\theta_{opt}$ is uniformly better, otherwise it would not be the minimizer of the risk. We will refer to $\theta_{opt}$ as the global optimum.

Consider for example the 1-boundary hypothesis class with a ground truth $g$ as described above. The GLPs will be $\theta_{0}=0.5$ and $\theta_{1}=1$, as no other hypothesis $\theta\in[0,1]$ achieves a lower loss for all $x\in\mathbf{X}$. Notice that these are the only GLPs, since any other hypothesis $\theta$ is no better (for all $x$) than either $\theta_{0}$ or $\theta_{1}$. This divides the parameter space into two groups: hypotheses that are no better than $\theta_{0}$ and hypotheses that are no better than $\theta_{1}$. We can informally say that when an ERM learns from $g$ using the 1-boundary hypothesis class, there is going to be a competition between these two groups. The following definition generalizes this concept.
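The 1-boundary example can be checked numerically by applying the GLP definition on a discretized problem. The sketch below is an approximation under stated assumptions: the feature space $[0,1]$ is replaced by 1000 equally weighted midpoints, the parameter space by a grid of 21 thresholds, and "positive probability" by "at least one grid point". The function names are illustrative.

```python
import numpy as np

def g(x):
    """Ground truth from the example: 1 on [0.5, 0.9), 0 elsewhere."""
    return ((x >= 0.5) & (x < 0.9)).astype(int)

def loss_profile(theta, x):
    """0-1 loss of the 1-boundary hypothesis 1(x >= theta) against g, per point."""
    return ((x >= theta).astype(int) != g(x)).astype(int)

def generalized_optimum_points(thetas, x):
    """Grid version of the GLP definition."""
    profiles = np.array([loss_profile(t, x) for t in thetas])
    glps = []
    for i, t in enumerate(thetas):
        # theta is a GLP if, against every other hypothesis, it is strictly
        # better on at least one grid point.
        strictly_better_somewhere = all(
            np.any(profiles[i] < profiles[j])
            for j in range(len(thetas)) if j != i
        )
        if strictly_better_somewhere:
            glps.append(float(t))
    return glps

x = (np.arange(1000) + 0.5) / 1000       # midpoints, uniform weight on [0, 1]
thetas = np.arange(21) / 20.0            # thresholds 0.0, 0.05, ..., 1.0
glps = generalized_optimum_points(thetas, x)
risks = {t: float(loss_profile(t, x).mean()) for t in glps}
```

On this grid the check returns $\theta=0.5$ and $\theta=1$, with risks 0.1 and 0.4 respectively, so $\theta=0.5$ is the global optimum and $\theta=1$ is a second GLP.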

For each GLP $\theta^{*}$, denote the set $A_{\theta^{*}}\subseteq\Theta$:

Definition 4 ($A_{\theta}$ region)

Let $\Theta_{opt}$ be the set of GLPs of $R_{g}(\theta)$ and $\theta_{0}\in\Theta_{opt}$ be the global optimum. For every $\theta^{*}\in\Theta_{opt}$ denote the regions:

$$\tilde{A}_{\theta^{*}}=\left\{\theta\in\Theta\mid\Pr\left(\ell\big(f_{\theta^{*}}(x),g(x)\big)\leq\ell\big(f_{\theta}(x),g(x)\big)\right)=1\right\} \tag{12}$$

In order to make these regions disjoint, we handle the intersections in the following way:

  1. Remove all overlaps from $\tilde{A}_{\theta_{0}}$:

     $$A_{\theta_{0}}=\tilde{A}_{\theta_{0}}\setminus\bigcup_{\theta^{*}\in\Theta_{opt},\ \theta^{*}\neq\theta_{0}}A_{\theta^{*}}$$
  2. For the other regions $\tilde{A}_{\theta^{*}},\ \theta^{*}\neq\theta_{0}$, arbitrarily assign the intersection to one of the regions such that there will not be any overlap, to obtain the regions $A_{\theta^{*}}$.

These regions form a complete partitioning of $\Theta$ such that $\cup_{\theta^{*}\in\Theta_{opt}}A_{\theta^{*}}=\Theta$ (see proof in appendix A).

In simple words, $A_{\theta^{*}}$ is the set of hypotheses in $\Theta$ that are no better than the GLP $\theta^{*}$ for any given $x\in\mathbf{X}$ (with probability 1). For the example above we have the sets $A_{\theta_{0}}=(0,0.9)$ and $A_{\theta_{1}}=(0.9,1)$.
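As a minimal sketch of how the $A_{\theta^{*}}$ regions can be checked numerically, the code below tests pointwise dominance on a grid. The ground truth `g`, the orientation of the 1-boundary hypothesis `f`, and the grid sizes are hypothetical choices made only so that the sketch is consistent with the values quoted above; they are not taken from the text.

```python
# Hypothetical setup (an assumption, not from the source): x lies in [0, 1],
# the 1-boundary class predicts 1 to the left of theta, and g flips at 0.5 and 0.9.

def g(x):
    """Assumed ground truth: 1 on [0, 0.5), 0 on [0.5, 0.9), 1 on [0.9, 1]."""
    return 1 if (x < 0.5 or x >= 0.9) else 0

def f(theta, x):
    """Assumed 1-boundary hypothesis with boundary theta."""
    return 1 if x < theta else 0

def loss(theta, x):
    """0-1 loss of hypothesis theta at the point x."""
    return int(f(theta, x) != g(x))

def no_better_than(theta, theta_star, xs):
    """True if theta_star has loss <= the loss of theta at every grid point."""
    return all(loss(theta_star, x) <= loss(theta, x) for x in xs)

xs = [(i + 0.5) / 1000 for i in range(1000)]   # grid midpoints in [0, 1]
thetas = [i / 100 for i in range(1, 100)]      # candidate boundaries 0.01 .. 0.99
glps = {"theta_0": 0.5, "theta_1": 1.0}        # the two GLP's named in the text

# For every candidate, list the GLP's that it is no better than.
region = {
    theta: [name for name, star in glps.items() if no_better_than(theta, star, xs)]
    for theta in thetas
}

only_theta_0 = [t for t in thetas if region[t] == ["theta_0"]]
only_theta_1 = [t for t in thetas if region[t] == ["theta_1"]]
print(min(only_theta_0), max(only_theta_0))
print(min(only_theta_1), max(only_theta_1))
```

Under these assumptions, the boundaries below 0.9 are attributed to $\theta_{0}$ only and those above 0.9 to $\theta_{1}$ only, while the boundary 0.9 itself is no better than either GLP, which is the kind of overlap that the disjointification step resolves.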

Note that in any (non-degenerate) agnostic learning problem we will have at least 2 GLP’s, because if $g$ is outside the class, there must be a set in $\mathbf{X}$ with positive probability on which $g$ differs from $f_{opt}$. Any hypothesis equal to $g$ on this set will be universally better than $f_{opt}$ on this set, and will not belong to $A_{\theta_{0}}$ (this is a consequence of (5)). Thus, there must be other GLP’s in addition to $\theta_{0}$.

Definition 5 (Dominating region)

For a hypothesis class $\{f_{\theta},\theta\in\Theta\}$, the dominating region of $\theta_{a}$ on $\theta_{b}$, where $\theta_{a},\theta_{b}\in\Theta$, with regard to $g(x)$, denoted as $D(\theta_{a},\theta_{b})\subseteq\mathbf{X}$, is defined as

$$D(\theta_{a},\theta_{b})=\big\{x\in\mathbf{X}\mid\Pr\left(\ell\big(f_{\theta_{a}}(x),g(x)\big)<\ell\big(f_{\theta_{b}}(x),g(x)\big)\right)=1\big\} \tag{13}$$

The dominating region is the set in the feature space for which $\theta_{a}$ achieves lower loss than $\theta_{b}$ (i.e., $f_{\theta_{a}}=g$, $f_{\theta_{b}}\neq g$). Using our example, the dominating region of $\theta_{0}$ on $\theta_{1}$ is $D(\theta_{0},\theta_{1})=(0.5,0.9)$.
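The dominating region can be approximated on a grid, as in the following sketch. The ground truth `g`, the orientation of `f`, and the uniform grid are hypothetical choices that reproduce the interval quoted above; the source does not specify them in this section.

```python
# Hypothetical setup (an assumption, not from the source): x uniform on [0, 1],
# f_theta predicts 1 to the left of theta, and g flips at 0.5 and 0.9.

def g(x):
    return 1 if (x < 0.5 or x >= 0.9) else 0

def f(theta, x):
    return 1 if x < theta else 0

def loss(theta, x):
    return int(f(theta, x) != g(x))

def dominating_region(theta_a, theta_b, xs):
    """Grid points where theta_a has strictly lower loss than theta_b."""
    return [x for x in xs if loss(theta_a, x) < loss(theta_b, x)]

xs = [(i + 0.5) / 1000 for i in range(1000)]   # grid midpoints in [0, 1]

d01 = dominating_region(0.5, 1.0, xs)          # supports theta_0 over theta_1
d10 = dominating_region(1.0, 0.5, xs)          # supports theta_1 over theta_0

# Under a uniform distribution the fraction of grid points approximates the mass.
mass01 = len(d01) / len(xs)
mass10 = len(d10) / len(xs)
print(min(d01), max(d01), mass01)
print(min(d10), max(d10), mass10)
```

With these assumed choices the first region spans the grid points between 0.5 and 0.9, in agreement with $D(\theta_{0},\theta_{1})=(0.5,0.9)$, and the reverse region covers the points above 0.9.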

Definition 6 (Stability)

We say that a GLP $\theta^{*}$ is stable if we can define a distance in $\Theta$ and there exists $\epsilon>0$ such that for every $\theta$ with distance $||\theta-\theta^{*}||<\epsilon$ the following holds with probability 1:

$$\ell(f_{\theta^{*}}(x),g(x))\leq\ell(f_{\theta}(x),g(x)) \tag{14}$$

Informally, $\theta^{*}$ is a stable GLP if no hypothesis in its neighborhood has an improved classification ability with regard to any $x\in\mathbf{X}$. Using our example from above, both $\theta_{0}$ and $\theta_{1}$ are stable.

IV Theoretical Results

We consider the following assumptions.

Assumption 1

$f_{opt}$ is a stable GLP.

Assumption 2

$f_{opt}$ is a unique GLP.

Assumption 3

The following is true in probability:

$$\lim_{n\to\infty}\hat{\theta}^{g}_{n}=\theta_{opt}$$

Assumption 4

$F_{\Theta}$ is a complete space, i.e., the limit of any Cauchy sequence in $\Theta$ is also in $\Theta$, and the limit of any sequence $f_{\theta_{m}}\in F_{\Theta}$, such that the series $\int_{\mathbf{X}}\ell(f_{\theta_{m}}(x),f_{\theta_{m+1}}(x))d\mathcal{F}(x)$ has a limit, is also in $F_{\Theta}$.

Assumption 5

$0<\delta<\delta_{max}$, where $\delta_{max}$ is defined as:

$$\delta_{max}=\min\{\min_{\theta\notin A_{\theta_{0}}}R_{f_{opt}}(\theta),\ \min_{\theta\notin A_{\theta_{0}}}R_{g}(\theta)-R_{g}(\theta_{opt})\} \tag{15}$$

Assumption 2 is needed mainly to ease the analysis and can be generalized. Relaxing this assumption will cause the ERM to alternate between multiple $A_{\theta}$ regions of the (non-unique) global GLP’s. Assumption 3 is a non-uniform consistency requirement (this is weaker than finite VC dimension), which is reasonable. Assumption 4 is mainly a mathematical technicality. Assumption 5 means that we are looking at what happens for small enough $\delta$, which is reasonable as we are interested in the behavior in the asymptotic regime. Assumption 1 is the only one that poses a significant constraint. Nevertheless, a wide range of learning problems satisfy it, such as the $k$-boundary class with a ground truth that has a finite number of transition points (see proof in appendix B-A). It is also satisfied by some cases of linear classifiers, which are more closely related to neural networks. An example of a ground truth that satisfies these assumptions with a 2-dimensional linear hypothesis set is shown in Figure 1, where the optimal linear hypothesis is $\mathbf{1}(x_{2}>1.3-x_{1})$. Any change in its parameters will result in misclassification of more features, thus it is a stable GLP.

Figure 1: Ground truth with stable optimal 2-dimensional linear hypothesis (see definition 2). $x_{1}$ and $x_{2}$ are uniformly distributed in $[0,1]$. The optimal hypothesis is achieved by $b_{0}=-1.3,\ b_{1}=1,\ b_{2}=1$.

We can now move to state our main results.

Theorem 1

Given a hypothesis class $\{f_{\theta},\theta\in\Theta\}$, and ground truth function $g$ with projection $f_{opt}$ on the hypothesis class, the following holds under assumptions 1-5:

$$\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)=P_{R}+(1-P_{R})\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right) \tag{16}$$

where $P_{R}=\Pr\left(R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})>\delta\right)$ is the realizable PAC error probability when learning from $f_{opt}$ (see proof in appendix B-C). This theorem decomposes the PAC error probability into the error incurred in realizable learning and the additional error incurred in agnostic learning. Notice that for realizable learning, $g\in F_{\Theta}$, we get $\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta\right)=P_{R}$, as expected.

Denote the KL divergence projection of a distribution $\tilde{Q}$ on a set of distributions $\tilde{\Pi}$ as:

$$D_{KL}(\tilde{\Pi}\;\|\;\tilde{Q})=\inf_{P\in\tilde{\Pi}}\mathcal{D}_{KL}(P\;\|\;\tilde{Q}) \tag{17}$$

where $\mathcal{D}_{KL}(P\;\|\;\tilde{Q})$ is the KL divergence.
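To make the projection in (17) concrete, the following sketch evaluates it for probability mass functions on a finite alphabet. It replaces the infimum by a minimum over a finite list of candidate distributions, which is a simplifying assumption; the names `kl`, `kl_projection`, and the numerical values are illustrative only.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for two pmfs on the same finite alphabet."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf      # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def kl_projection(candidates, q):
    """Smallest divergence from any candidate pmf to q.

    A finite candidate list stands in for the set of distributions, so the
    infimum in the definition becomes a plain minimum (an assumption of this sketch).
    """
    return min(kl(p, q) for p in candidates)

q = [0.5, 0.5]                                     # illustrative reference pmf
candidates = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]]  # illustrative candidate set
d = kl_projection(candidates, q)
print(d)
```

The candidate closest to `q` determines the projection; the further the whole set lies from `q`, the larger the resulting value.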

Theorem 2

Under assumptions 1-5, if $\Theta$ has a finite VC dimension, there exists a positive real number $d\in\mathbf{R}^{+}$ such that the following holds:

$$\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)\doteq e^{-n\cdot\min\{\frac{\delta}{4},d\}} \tag{18}$$

where $d=D_{KL}(\Pi\;\|\;Q)$, $\Pi$ is a set of distributions on some alphabet $\chi$, induced by the distribution on $\mathbf{X}$, for which the ERM will output a hypothesis outside of $A_{\theta_{0}}$, and $Q$ is the true distribution on the alphabet.

The proof of the theorem is provided in appendix B-E. The distribution $Q$ and the set of distributions $\Pi$ will be explicitly derived in the next section.

This theorem establishes the exponential behavior of the PAC error probability and is achieved by showing that the error exponent of $\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)$ is $D_{KL}(\Pi\;\|\;Q)$ and using the uniform realizable learning bound in (10). This implies that any improved realizable bound can be plugged into theorem 2 to get an improved agnostic bound, and the requirement of finite VC dimension might be unnecessary.

The achieved error exponent $\min(\frac{\delta}{4},d)$ is better than the classical error exponent for agnostic learning in (9), which is $\frac{\delta^{2}}{32}$, as it is linear in $\delta$ instead of quadratic in $\delta$.

Notice that because $d$ is independent of $\delta$, for $\delta<\min\{4d,\delta_{max}\}$ the error exponent is $\frac{\delta}{4}$, which is the same as the worst case realizable learning exponent. Thus, not only is the error exponent much better than the general one for agnostic learning, it also shows that agnostic learning might be no harder than realizable learning in some cases. This result can be expressed as a bound on the excess risk:

$$\delta=O\left(\frac{1}{n}\ln{\frac{1}{\eta}}\right) \tag{19}$$
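The gap between the linear and the quadratic exponent can be illustrated by inverting both. The sketch below reads $\eta$ as the target PAC error probability and sets $e^{-n\delta/4}=\eta$ and $e^{-n\delta^{2}/32}=\eta$ respectively, ignoring sub-exponential factors; this reading of $\eta$ and the function names are assumptions of the sketch.

```python
import math

def delta_linear(n, eta):
    """Excess risk solving exp(-n * delta / 4) = eta (exponent delta / 4)."""
    return 4.0 * math.log(1.0 / eta) / n

def delta_quadratic(n, eta):
    """Excess risk solving exp(-n * delta**2 / 32) = eta (classical exponent)."""
    return math.sqrt(32.0 * math.log(1.0 / eta) / n)

eta = 0.01                      # illustrative target error probability
for n in (1_000, 10_000, 100_000):
    print(n, delta_linear(n, eta), delta_quadratic(n, eta))
```

The first expression decays as $1/n$ and the second as $1/\sqrt{n}$. The linear form applies only while the resulting $\delta$ stays below $\min\{4d,\delta_{max}\}$.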

IV-A Derivation of Error Exponent

In this section we provide details on how to construct the set $\Pi$ and the distribution $Q$. The derivation in this section is partial and is done under the assumption that there are $K+1$ GLP’s. However, this is only to simplify the already complex derivation and is not a requirement (see appendices B and C for more details). $\theta_{opt}$ must be one of the $K+1$ GLP’s. Denote $\theta_{0}=\theta_{opt},\ f_{\theta_{0}}=f_{opt}$. For each GLP $\theta_{i},\ i=1,...,K$ of $R_{g}(\theta)$, denote the following regions:

$$D_{i}=D(\theta_{0},\theta_{i}),\ D'_{i}=D(\theta_{i},\theta_{0}) \tag{20}$$

$D_{i}$ is the region that supports choosing $\theta_{0}$ over $\theta_{i}$ and $D'_{i}$ is the opposite. Denote $\#D$ as the number of samples that fall in region $D$. $\theta_{0}$ achieves lower empirical risk than $\theta_{i}$ if $\#D'_{i}<\#D_{i}$. Denote the disjointified regions of $D_{i},D'_{i}$:

$$\begin{aligned}
X_{i_{1},...,i_{r}}&=\{\cap_{j\in\{i_{1},...,i_{r}\}}D_{j}\}\setminus\{\cup_{j\notin\{i_{1},...,i_{r}\}}D_{j}\}\\
X'_{i_{1},...,i_{r}}&=\{\cap_{j\in\{i_{1},...,i_{r}\}}D'_{j}\}\setminus\{\cup_{j\notin\{i_{1},...,i_{r}\}}D'_{j}\}\\
X_{c}&=\mathbf{X}\setminus\{\cup_{j}D'_{j}\cup_{i}D_{i}\},
\end{aligned} \tag{21}$$

where $r=1,..,K$ and $1\leq i_{r}\leq K$. Notice these regions are non-intersecting and $D_{i}=X_{i}\cup\{\cup_{i_{2}}X_{i,i_{2}}\}\cup..\cup\{\cup_{i_{2},..i_{K}}X_{i,i_{2},..i_{K}}\}$. We can write the following equation:

$$\begin{pmatrix}\#D_{1}-\#D'_{1}\\ \vdots\\ \#D_{K}-\#D'_{K}\end{pmatrix}=A\begin{pmatrix}\#X_{1}\\ \vdots\\ \#X_{1,..,K}\\ \vdots\\ \#X'_{1,..,K}\end{pmatrix} \tag{22}$$

where A is a matrix. Denote the alphabet χ𝜒\chiitalic_χ:

χ={X1,..,X1,..,k,X1,..,X1,..,k,Xc}={a1,,a|χ|}\displaystyle\chi=\{X_{1},..,X_{1,..,k},X^{{}^{\prime}}_{1},..,X^{{}^{\prime}}% _{1,..,k},X_{c}\}=\{a_{1},...,a_{|\chi|}\}italic_χ = { italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , . . , italic_X start_POSTSUBSCRIPT 1 , . . , italic_k end_POSTSUBSCRIPT , italic_X start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , . . , italic_X start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 , . . , italic_k end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT } = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT | italic_χ | end_POSTSUBSCRIPT } (23)

Denote the probability mass function Q𝑄Qitalic_Q on χ𝜒\chiitalic_χ such that Q(ai)=Pr(xSi)𝑄subscript𝑎𝑖Pr𝑥subscript𝑆𝑖Q(a_{i})=\Pr\left(x\in S_{i}\right)italic_Q ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = roman_Pr ( italic_x ∈ italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), aiχsubscript𝑎𝑖𝜒a_{i}\in\chiitalic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_χ, where aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the i’th symbol in χ𝜒\chiitalic_χ and Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the region corresponding to it. Denote ΠΠ\Piroman_Π:

Π={(p1,..,p|χ|)|{A(p1p|χ|1)<0}c,i=1|χ|pi=1,pi0}\displaystyle\Pi=\Big{\{}(p_{1},..,p_{|\chi|})\mathrel{\Big{|}}\Big{\{}A\begin% {pmatrix}p_{1}\\ ...\\ p_{|\chi|-1}\end{pmatrix}<0\Big{\}}^{c},\sum_{i=1}^{|\chi|}p_{i}=1,p_{i}\geq 0% \Big{\}}roman_Π = { ( italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , . . , italic_p start_POSTSUBSCRIPT | italic_χ | end_POSTSUBSCRIPT ) | { italic_A ( start_ARG start_ROW start_CELL italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL … end_CELL end_ROW start_ROW start_CELL italic_p start_POSTSUBSCRIPT | italic_χ | - 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) < 0 } start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | italic_χ | end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≥ 0 } (24)

where $\{\cdot\}^{c}$ is the complement set. This set represents all distributions on $\chi$ for which the ERM will output a hypothesis outside of $A_{\theta_{0}}$ (and thus, sub-optimal). Notice that $Q\notin\Pi$, as $Q$ is the true distribution on $\chi$, and under it the ERM must converge to the optimal hypothesis due to assumption 3.

Using a method of types based analysis [21, 24, 22] we get the following exponential behavior (see appendix B-E for details):

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\doteq 2^{-nD_{KL}(\Pi\,\|\,Q)} \tag{25}$$
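As a small illustration of the method-of-types reasoning behind (25), the following sketch computes the exact probability of a single type class under an i.i.d. source and compares it with the standard bounds $(n+1)^{-|\chi|}2^{-nD_{KL}(P\|Q)}\leq Q^{n}(T(P))\leq 2^{-nD_{KL}(P\|Q)}$. The pmf `Q`, the symbol counts and the helper names are illustrative assumptions made for this sketch; they are not taken from appendix B-E.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def type_class_probability(counts, q):
    """Exact probability that n i.i.d. draws from q produce the given symbol counts."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)            # log n!
    for c, qi in zip(counts, q):
        log_p += c * math.log(qi) - math.lgamma(c + 1)
    return math.exp(log_p)

# Illustrative values (assumptions of this sketch only).
Q = (0.3, 0.1, 0.6)
counts = (4, 6, 10)                       # an atypical sample of size n = 20
n = sum(counts)
P = tuple(c / n for c in counts)          # the type (empirical pmf) of the sample

exact = type_class_probability(counts, Q)
upper = 2.0 ** (-n * kl_bits(P, Q))
lower = upper / (n + 1) ** len(Q)
print(f"lower={lower:.3e}  exact={exact:.3e}  upper={upper:.3e}")
```

Since the number of types grows only polynomially in $n$, the probability of a set of types such as $\Pi$ is dominated by its type closest to $Q$ in KL divergence, which is where the exponent $D_{KL}(\Pi\,\|\,Q)$ comes from.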

Furthermore, by using Theorem 1 and the worst case error exponent of realizable learning in (10), we get Theorem 2.

V Example

This example shows how to compute the error exponent and that the empirical exponent converges to the value in the theorem. Let $\mathbf{X}=[0,1]$ with the uniform distribution, and let the ground truth be:

$$g(x)=\begin{cases}0&\quad 0<x<0.6\\ 1&\quad 0.6<x<0.9\\ 0&\quad 0.9<x<1\end{cases}$$

We use the 1-boundary hypothesis class for ERM learning. The optimal hypothesis minimizing the risk is:

$$f_{opt}(x)=\begin{cases}0&\quad 0<x<0.6\\ 1&\quad 0.6<x<1\end{cases}$$
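The ERM rule for this class can be sketched as follows. The parameterization $f_{\theta}(x)=1$ for $x>\theta$ and $0$ otherwise is read off from $f_{opt}$ above; the function names, the tie-breaking rule and the evenly spaced "sample" are assumptions of this sketch, chosen so that the output is reproducible.

```python
def g(x):
    """Ground truth of the example: 1 on (0.6, 0.9), 0 elsewhere."""
    return 1 if 0.6 < x < 0.9 else 0

def erm_one_boundary(xs, ys):
    """Return (theta, empirical risk) minimizing the empirical 0-1 loss of
    f_theta(x) = 1 if x > theta else 0. Ties go to the smallest threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    # Cut k labels the k smallest points 0 and the remaining n - k points 1.
    errors = sum(1 for _, y in pairs if y == 0)   # cut k = 0: everything labelled 1
    best_errors, best_k = errors, 0
    for k in range(1, n + 1):
        y = pairs[k - 1][1]
        errors += 1 if y == 1 else -1             # point k-1 moves to the "0" side
        if errors < best_errors:
            best_errors, best_k = errors, k
    # Place the threshold between neighbouring sample points (or at 0 / 1).
    if best_k == 0:
        theta = 0.0
    elif best_k == n:
        theta = 1.0
    else:
        theta = 0.5 * (pairs[best_k - 1][0] + pairs[best_k][0])
    return theta, best_errors / n

# Deterministic, evenly spaced points stand in for a large uniform sample.
xs = [(i + 0.5) / 1000 for i in range(1000)]
ys = [g(x) for x in xs]
theta_hat, emp_risk = erm_one_boundary(xs, ys)
print(theta_hat, emp_risk)
```

On this idealized sample the ERM recovers the boundary at $0.6$ with empirical risk $0.1$, matching $f_{opt}$; on a random sample the output can instead land near $1$, which is the event analyzed in the rest of the example.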

There are two GLP's: $\theta_{0}=0.6,\ \theta_{1}=1$. Their $A_{\theta}$ regions are $A_{\theta_{0}}=(0,0.9),\ A_{\theta_{1}}=(0.9,1)$. The pairs of regions $D_{1},D'_{1}$, as defined in (20), are $D_{1}=(0.6,0.9),\ D'_{1}=(0.9,1)$. The disjointified regions (as in (21)) are $X_{1}=(0.6,0.9),\ X'_{1}=(0.9,1),\ X_{c}=(0,0.6)$.
We get an alphabet $\chi=\{X_{1},X'_{1},X_{c}\}$ with probabilities $Q=(0.3,0.1,0.6)$. The region $\Pi$ is:

$$\Pi=\Biggl\{(p_{1},p_{2},p_{3})\;\Bigg|\;\begin{pmatrix}-1&1\end{pmatrix}\begin{pmatrix}p_{1}\\ p_{2}\end{pmatrix}\geq 0,\ \sum_{i=1}^{3}p_{i}=1,\ 0\leq p_{i}\Biggr\}$$
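A sample enters the analysis only through its type on $\chi$, so membership in $\Pi$ can be checked directly from the counts in each region. The sketch below does this for the example; the function names and the test points are hypothetical, and points falling exactly on a region boundary (a probability-zero event) are simply counted in $X_{c}$.

```python
def empirical_type(xs):
    """Fractions of sample points in X_1=(0.6,0.9), X'_1=(0.9,1), X_c=(0,0.6)."""
    n = len(xs)
    n1 = sum(1 for x in xs if 0.6 < x < 0.9)
    n2 = sum(1 for x in xs if 0.9 < x < 1.0)
    return (n1 / n, n2 / n, (n - n1 - n2) / n)

def in_Pi(p):
    """Membership test for the region Pi of the example: (-1 1)(p1 p2)^T >= 0."""
    p1, p2, p3 = p
    on_simplex = abs(p1 + p2 + p3 - 1.0) < 1e-12 and min(p) >= 0
    return on_simplex and (-p1 + p2) >= 0

Q = (0.3, 0.1, 0.6)
print(in_Pi(Q))                                              # the true pmf
print(in_Pi(empirical_type([0.1, 0.95, 0.7, 0.97, 0.2])))    # an atypical sample
```

The true distribution fails the test because $p_{1}>p_{2}$, while a sample with more points in $X'_{1}$ than in $X_{1}$ passes it.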

Computing $D_{KL}(\Pi\,\|\,Q)$ is a simple constrained optimization problem with solution $D_{KL}(\Pi\,\|\,Q)=0.0551$. With $\delta=0.1$ (the value used in Figure 2), the error exponent using bound (9) is $\frac{\delta^{2}}{32}=0.0003$, while our improved error exponent is $\min\{\frac{\delta}{4},D_{KL}(\Pi\,\|\,Q)\}=0.025$. Figure 2 shows that the empirical error exponent of the PAC error probability indeed converges to $D_{KL}(\Pi\,\|\,Q)$.
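The constrained minimization can be reproduced in a few lines. Because $Q\notin\Pi$ and the KL divergence is convex, the minimizer lies on the boundary $p_{1}=p_{2}$, where a Lagrangian argument gives the closed form $-\ln\bigl(q_{3}+2\sqrt{q_{1}q_{2}}\bigr)$. The sketch below evaluates this closed form and cross-checks it with a grid search; the grid size and names are assumptions of the sketch. Note that the quoted value $0.0551$ is obtained with natural logarithms; with base-2 logarithms the same minimization gives roughly $0.0795$.

```python
import math

Q = (0.3, 0.1, 0.6)   # true pmf on the alphabet {X_1, X'_1, X_c}

def kl_nats(p, q):
    """KL divergence D(p || q) with natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Closed form on the boundary p1 = p2 of Pi.
d_closed = -math.log(Q[2] + 2.0 * math.sqrt(Q[0] * Q[1]))

# Numerical check: grid search over p1 = p2 = t, p3 = 1 - 2t, with 0 < t < 1/2.
steps = 100_000
d_grid = min(
    kl_nats((t, t, 1.0 - 2.0 * t), Q)
    for t in (0.5 * i / steps for i in range(1, steps))
)

delta = 0.1
print(f"D_KL(Pi || Q) = {d_closed:.4f} nats = {d_closed / math.log(2):.4f} bits")
print(f"grid search   = {d_grid:.4f} nats")
print(f"delta^2 / 32  = {delta ** 2 / 32:.4f}")
print(f"min(delta/4, D_KL) = {min(delta / 4, d_closed):.4f}")
```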

Figure 2: Empirical error exponent of the second term in theorem 1. $\ell=2$, $\delta=0.1$. The empirical exponent (blue) was computed using simulation.
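The convergence shown in the figure can also be examined without Monte Carlo noise. The sketch below is not the simulation used for Figure 2: it computes exactly, for a few sample sizes, the probability that the empirical type falls in $\Pi$ (that is, $n_{2}\geq n_{1}$ for multinomial counts), and reports $-\frac{1}{n}\ln$ of that probability. Ignoring the conditioning in (25), using natural logarithms and the chosen values of $n$ are all assumptions of this sketch.

```python
import math

Q = (0.3, 0.1, 0.6)

def prob_type_in_Pi(n, q=Q):
    """Exact P(n2 >= n1) for multinomial counts (n1, n2, n3) ~ Mult(n, q)."""
    log_q = [math.log(v) for v in q]
    total = 0.0
    for n1 in range(n // 2 + 1):
        for n2 in range(n1, n - n1 + 1):
            n3 = n - n1 - n2
            log_term = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                        - math.lgamma(n2 + 1) - math.lgamma(n3 + 1)
                        + n1 * log_q[0] + n2 * log_q[1] + n3 * log_q[2])
            total += math.exp(log_term)   # work in logs to avoid overflow
    return total

def finite_n_exponent(n):
    """Exponent -(1/n) ln P(type in Pi) at sample size n, in nats."""
    return -math.log(prob_type_in_Pi(n)) / n

d_limit = -math.log(Q[2] + 2.0 * math.sqrt(Q[0] * Q[1]))
for n in (100, 200, 400):
    print(n, round(finite_n_exponent(n), 4), "limit:", round(d_limit, 4))
```

The finite-$n$ exponent stays above the limit, as the Chernoff bound requires, and decreases toward it slowly because of the sub-exponential prefactor hidden by the $\doteq$ notation.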

VI Conclusions And Future Research

We derived an improved error exponent for agnostic PAC learning and showed that in some cases agnostic learning might be no harder than realizable learning. Any new realizable learning bound can be plugged into Theorem 2 to get a better agnostic bound. This result opens new directions for research. One important goal is to find explicit conditions under which practical hypothesis classes (e.g., neural networks) satisfy the conditions of Theorem 2.

Interestingly, the error exponent analysis of PAC learning turns out to be useful in attaining the first theoretical results for the knowledge distillation problem [25], providing conditions that define where the associated teacher-student learning is useful and where it is not.

References

  • [1] L. G. Valiant, “A theory of the learnable.” Commun. ACM, vol. 27, no. 11, pp. 1134–1142, 1984. [Online]. Available: http://dblp.uni-trier.de/db/journals/cacm/cacm27.html#Valiant84
  • [2] L. Valiant, Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World.   Basic Books, 2013.
  • [3] V. Vapnik, Statistical Learning Theory.   Wiley New York, 1998.
  • [4] V. Vapnik and A. Chervonenkis, “Theory of pattern recognition,” 1974.
  • [5] O. Bousquet, S. Boucheron, and G. Lugosi, “Introduction to statistical learning theory,” Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, pp. 169–207, 2004.
  • [6] O. Bousquet, S. Hanneke, S. Moran, R. Van Handel, and A. Yehudayoff, “A theory of universal learning,” in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pp. 532–541.
  • [7] D. Cohn and G. Tesauro, “Can neural networks do better than the vapnik-chervonenkis bounds?” Advances in Neural Information Processing Systems, vol. 3, 1990.
  • [8] ——, “How tight are the vapnik-chervonenkis bounds?” Neural Computation, vol. 4, no. 2, pp. 249–269, 1992.
  • [9] V. Nagarajan and J. Z. Kolter, “Uniform convergence may be unable to explain generalization in deep learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [10] A. Nitanda and T. Suzuki, “Stochastic gradient descent with exponential convergence rates of expected classification errors,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 1417–1426.
  • [11] D. Haussler, M. Kearns, and R. E. Schapire, “Bounds on the sample complexity of bayesian learning using information theory and the vc dimension,” Machine learning, vol. 14, pp. 83–113, 1994.
  • [12] J.-Y. Audibert and A. B. Tsybakov, “Fast learning rates for plug-in classifiers,” 2007.
  • [13] G. Lever, F. Laviolette, and J. Shawe-Taylor, “Distribution-dependent pac-bayes priors,” in International Conference on Algorithmic Learning Theory.   Springer, 2010, pp. 119–133.
  • [14] ——, “Tighter pac-bayes bounds through distribution-dependent priors,” Theoretical Computer Science, vol. 473, pp. 4–28, 2013.
  • [15] S. Ben-David and R. Urner, “The sample complexity of agnostic learning under deterministic labels,” in Conference on Learning Theory.   PMLR, 2014, pp. 527–542.
  • [16] G. M. Benedek and A. Itai, “Nonuniform learnability,” in Automata, Languages and Programming: 15th International Colloquium Tampere, Finland, July 11–15, 1988 Proceedings 15.   Springer, 1988, pp. 82–92.
  • [17] S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss, “Universal bayes consistency in metric spaces,” in 2020 Information Theory and Applications Workshop (ITA).   IEEE, 2020, pp. 1–33.
  • [18] S. Hanneke, “Learning whenever learning is possible: Universal learning under general stochastic processes,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 5751–5866, 2021.
  • [19] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.   Cambridge university press, 2014.
  • [20] R. G. Gallager, Information theory and reliable communication.   Springer, 1968, vol. 588.
  • [21] I. Csiszar, “The method of types [information theory],” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505–2523, 1998.
  • [22] I. N. Sanov, “On the probability of large deviations of random variables,” Selected Translations in Mathematical Statistics and Probability, vol. 1, pp. 213–244, 1961.
  • [23] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations.   Cambridge University Press, 1999, vol. 9.
  • [24] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems.   Cambridge University Press, 2011.
  • [25] A. Hendel, “Improved PAC Learning Bounds with Application to Knowledge Distillation,” M.Sc thesis., Dept. of Electrical Engineering - Systems, Tel-Aviv Univ., Tel-Aviv, Israel., 2023.

Appendix A Proof that the regions $A_{\theta}$ form a complete partitioning of $\Theta$

Lemma 1

Let there be a complete hypothesis set $\{f_{\theta},\theta\in\Theta\}$ (as in assumption 4 in the paper), a ground truth function $g$ and a set of GLP's $\Theta_{opt}$ of $R_{g}(\theta)$. The regions $A_{\theta^{*}},\ \theta^{*}\in\Theta_{opt}$, form a complete partitioning of $\Theta$.

Proof 1

By definition, the regions are disjoint. Assume there exists a distinct set $\tilde{\Theta}$ of hypotheses that don't belong to any $A_{\theta^{*}}$, $\theta^{*}\in\Theta_{opt}$. We will first prove that either all hypotheses in $\tilde{\Theta}$ coincide or there exists $\tilde{\theta}\in\tilde{\Theta}$ such that for all $\theta\in\tilde{\Theta}\setminus\tilde{\theta}$ there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability such that $\forall x\in\tilde{X},\ \ell(f_{\theta}(x),g(x))>\ell(f_{\tilde{\theta}}(x),g(x))$.
Assume by contradiction that this is not true; thus there exists $\tilde{\theta}_{1}\in\tilde{\Theta}$ such that $\ell(g(x),f_{\tilde{\theta}_{1}}(x))\leq\ell(g(x),f_{\tilde{\theta}}(x))$ w.p. 1. For $\tilde{\theta}_{1}$ we can in turn find $\tilde{\theta}_{2}$ such that $\ell(g(x),f_{\tilde{\theta}_{2}}(x))\leq\ell(g(x),f_{\tilde{\theta}_{1}}(x))$ w.p. 1.
By repeating this we get a sequence $\tilde{\theta}_{m}\in\tilde{\Theta}$ such that $\ell(g(x),f_{\tilde{\theta}_{m+1}}(x))\leq\ell(g(x),f_{\tilde{\theta}_{m}}(x))$ w.p. 1.
If this sequence is finite, either the sequence is not distinct and the hypotheses in $\tilde{\Theta}$ coincide, or for any other hypothesis in $\tilde{\Theta}$ we can find a set with positive probability on which the last element in the sequence has lower loss, which is a contradiction.
If the sequence is infinite, then the sequence of risks $\int_{\mathbf{X}}\ell(g(x),f_{\tilde{\theta}_{m}}(x))\,d\mathcal{F}(x)$ has a limit because it is monotone and bounded. By the completeness assumption, this means that the limit hypothesis of $f_{\tilde{\theta}_{m}}$ is in $F_{\Theta}$. If it is also in $F_{\tilde{\Theta}}$, then either all hypotheses in $\tilde{\Theta}$ coincide or for any other hypothesis in $\tilde{\Theta}$ we can find a set with positive probability on which the limit hypothesis has a lower loss, which is a contradiction. If the limit hypothesis is not in $F_{\tilde{\Theta}}$, then the whole sequence $\tilde{\theta}_{m}$ belongs to one of $A_{\theta^{*}}$, $\theta^{*}\in\Theta_{opt}$, which is a contradiction.

To conclude this part, we've shown that either all hypotheses in $\tilde{\Theta}$ coincide with a single hypothesis $\tilde{\theta}$ or there exists $\tilde{\theta}\in\tilde{\Theta}$ for which, for all $\theta\in\tilde{\Theta}\setminus\tilde{\theta}$, there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability such that $\forall x\in\tilde{X},\ \ell(f_{\theta}(x),g(x))>\ell(f_{\tilde{\theta}}(x),g(x))$.

Because $\tilde{\theta}$ doesn't belong to any of the regions $A_{\theta^{*}}$, $\theta^{*}\in\Theta_{opt}$, it means $\tilde{\theta}$ is not a GLP, as every GLP belongs to its own $A_{\theta^{*}}$ region. It also means that for every GLP $\theta^{*}\in\Theta_{opt}$ there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability such that $\ell(f_{\tilde{\theta}}(x),g(x))<\ell(f_{\theta^{*}}(x),g(x))\ \forall x\in\tilde{X}$.
By definition of $A_{\theta^{*}}$, we have $\Pr\left(\ell\big(f_{\theta^{*}}(x),g(x)\big)\leq\ell\big(f_{\theta}(x),g(x)\big)\right)=1\ \forall\theta\in A_{\theta^{*}}$.
Thus, $\forall\theta\in\cup_{\theta^{*}\in\Theta_{opt}}A_{\theta^{*}}$ there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability such that $\ell(f_{\tilde{\theta}}(x),g(x))<\ell(f_{\theta}(x),g(x))\ \forall x\in\tilde{X}$. We get that $\forall\theta\in\Theta\setminus\tilde{\theta}$ there exists a set $\tilde{X}\subseteq\mathbf{X}$ with positive probability such that $\ell(f_{\tilde{\theta}}(x),g(x))<\ell(f_{\theta}(x),g(x))\ \forall x\in\tilde{X}$, which is a contradiction to $\tilde{\theta}$ not being a GLP.
Thus, the set $\tilde{\Theta}$ is empty and the regions $A_{\theta^{*}},\ \theta^{*}\in\Theta_{opt}$, form a complete partitioning of $\Theta$.

Appendix B Supplementary material for "Theoretical Results" section

In this section we'll prove theorems 1 and 2. This will be done gradually: in subsection B-B we'll prove an equivalent formulation of the PAC error probability that will be used in proving the theorems. In subsection B-C, we'll prove theorem 1 along with 2 needed Lemmas. In subsection B-D we'll prove a Lemma about lower and upper bounds on the second term in theorem 1 that will be used in proving theorem 2. Finally, in B-E we'll prove theorem 2 along with 3 needed Lemmas. The proofs of theorems 1 and 2 in this section are provided for the case of a finite number $K+1$ of GLP's. This is generalized to an infinite number of GLP's in section C of the appendix. Note that we will sometimes refer to equations from the paper.

B-A K-boundary has stable global GLP

Lemma 2

Let $g(x)$, $x\in[0,1]$ be a binary ground truth function with at most $M$ transition points between $0$ and $1$. Let $\{f_{\theta},\theta\in\Theta\}$ be the K-boundary class. Then the optimal hypothesis when learning from $g(x)$ with binary loss is a stable GLP.

Proof 2

Let $\theta_{0}$ be the optimal hypothesis when learning from $g$ with binary loss. $\theta_{0}$ is a GLP, as for any other hypothesis in the class there must exist a set in $\mathbf{X}$ for which $\theta_{0}$ is uniformly better (otherwise it wouldn't minimize the risk). Denote the transition points of $g$ as $b_{1},\ldots,b_{M}$ and the transition points of $f_{\theta_{0}}$ as $a_{1},\ldots,a_{K}$. Assume without loss of generality that $b_{i}$ isn't equal to either $0$ or $1$ and denote $b_{0}=0$ and $b_{M+1}=1$.
First, we'll prove that every transition point of $f_{\theta_{0}}$ equals one of $b_{0},\ldots,b_{M+1}$. Assume by contradiction that there exists $a_{i}$ that isn't equal to any of $b_{0},\ldots,b_{M+1}$, thus satisfying $b_{j}<a_{i}<b_{j+1}$, where $j$ is one of $0,\ldots,M$. For $b_{j}<x<b_{j+1}$, $g(x)$ is constant (either 0 or 1). We can generate 2 new hypotheses by changing $a_{i}$ to $b_{j}$ or to $b_{j+1}$; at least one of these new hypotheses has zero loss for $x\in[b_{j},b_{j+1}]$ and coincides with $\theta_{0}$ outside of it.
Notice that $\theta_{0}$ has non-zero loss on $[b_{j},b_{j+1}]$, as it can coincide with $g$ either on $[a_{i},b_{j+1}]$ or on $[b_{j},a_{i}]$, but not on both. Thus, $\theta_{0}$ is not the risk minimizer, which is a contradiction.
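This first claim can be illustrated numerically on the example of Section V with $K=1$. The sketch below evaluates the exact risk of each threshold under the uniform distribution and checks that the grid minimizer sits on a transition point of $g$. The parameterization $f_{\theta}(x)=1$ for $x>\theta$ is taken from $f_{opt}$ in Section V; the grid and the function names are assumptions, and a grid search is of course an illustration rather than a proof.

```python
def risk(theta, ones=((0.6, 0.9),)):
    """Exact 0-1 risk of f_theta(x) = 1 if x > theta else 0 under the uniform
    distribution on [0, 1], for a ground truth equal to 1 on the given intervals."""
    ones_below = sum(max(0.0, min(b, theta) - a) for a, b in ones)  # g = 1, f = 0
    ones_total = sum(b - a for a, b in ones)
    zeros_above = (1.0 - theta) - (ones_total - ones_below)          # g = 0, f = 1
    return ones_below + zeros_above

grid = [i / 1000 for i in range(1001)]
theta_star = min(grid, key=risk)
transition_points = (0.0, 0.6, 0.9, 1.0)   # b_0, b_1, b_2, b_3 of the example
print(theta_star, risk(theta_star))
print(risk(0.95), risk(1.0))               # the second GLP at theta = 1 is a local minimum
```

The minimizer coincides with $b_{1}=0.6$, and the risk also has a local minimum at $b_{3}=1$, in line with the two GLP's reported in Section V.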
We move to prove that for every $[b_{j},b_{j+1}]$, if there exists $i$ such that either $a_{i}=b_{j}$ or $a_{i}=b_{j+1}$, then $f_{\theta_{0}}(x)=g(x)$ for $x\in[b_{j},b_{j+1}]$. Assume by contradiction that there is such a subset $[b_{j},b_{j+1}]$ for which $a_{i}=b_{j}$ (the case of $a_{i}=b_{j+1}$ is analogous) and $f_{\theta_{0}}(x)$ doesn't coincide with $g(x)$.
Because both $g(x)$ and $f_{\theta_{0}}(x)$ don't have any additional transition points between $b_{j}$ and $b_{j+1}$, they are constant in this region and $f_{\theta_{0}}(x)\neq g(x)$ for $b_{j}<x<b_{j+1}$. By changing $a_{i}$ to be $b_{j+1}$ instead of $b_{j}$, we obtain a new K-boundary hypothesis that coincides with $f_{\theta_{0}}$ outside $[b_{j},b_{j+1}]$ and coincides with $g$ on $[b_{j},b_{j+1}]$, thus obtaining better risk than $\theta_{0}$, which is a contradiction to its optimality.
Denote ϵ=mini{0,..,M}bi+1bi/2>0\epsilon=\min_{i\in\{0,..,M\}}||b_{i+1}-b_{i}||/2>0italic_ϵ = roman_min start_POSTSUBSCRIPT italic_i ∈ { 0 , . . , italic_M } end_POSTSUBSCRIPT | | italic_b start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT - italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | | / 2 > 0, and some perturbation θ~0subscript~𝜃0\tilde{\theta}_{0}over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT with transition points a1,,aKsubscriptsuperscript𝑎1subscriptsuperscript𝑎𝐾a^{\prime}_{1},...,a^{\prime}_{K}italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT, where (a1,,aK)(a1,,aK)<ϵnormsubscriptsuperscript𝑎1subscriptsuperscript𝑎𝐾subscript𝑎1subscript𝑎𝐾italic-ϵ||(a^{\prime}_{1},...,a^{\prime}_{K})-(a_{1},...,a_{K})||<\epsilon| | ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) - ( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) | | < italic_ϵ. For every i=1,..,Ki=1,..,Kitalic_i = 1 , . . , italic_K we know that aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is one of bjsubscript𝑏𝑗b_{j}italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. thus, if ai=bjsubscript𝑎𝑖subscript𝑏𝑗a_{i}=b_{j}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, then
bjϵ<ai<bj+ϵsubscript𝑏𝑗italic-ϵsubscriptsuperscript𝑎𝑖subscript𝑏𝑗italic-ϵb_{j}-\epsilon<a^{\prime}_{i}<b_{j}+\epsilonitalic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_ϵ < italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + italic_ϵ Thus, bj1<ai<bj+1subscript𝑏𝑗1subscriptsuperscript𝑎𝑖subscript𝑏𝑗1b_{j-1}<a^{\prime}_{i}<b_{j+1}italic_b start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT < italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT. To show θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is stable, we need to show that any region for which fθ0(x)=g(x)subscript𝑓subscript𝜃0𝑥𝑔𝑥f_{\theta_{0}}(x)=g(x)italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) = italic_g ( italic_x ), we also have fθ~0(x)=g(x)subscript𝑓subscript~𝜃0𝑥𝑔𝑥f_{\tilde{\theta}_{0}}(x)=g(x)italic_f start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) = italic_g ( italic_x ). For every j=0,..,Mj=0,..,Mitalic_j = 0 , . . 
, italic_M, fθ0(x)subscript𝑓subscript𝜃0𝑥f_{\theta_{0}}(x)italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) might have loss on the region [bj,bj+1]subscript𝑏𝑗subscript𝑏𝑗1[b_{j},b_{j+1}][ italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT ] only if there is no i𝑖iitalic_i for which ai=bjsubscript𝑎𝑖subscript𝑏𝑗a_{i}=b_{j}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT or ai=bj+1subscript𝑎𝑖subscript𝑏𝑗1a_{i}=b_{j+1}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT. But this means that for every i=1,..,Ki=1,..,Kitalic_i = 1 , . . , italic_K, if ai=blsubscript𝑎𝑖subscript𝑏𝑙a_{i}=b_{l}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT (where lj𝑙𝑗l\neq jitalic_l ≠ italic_j) then bl1<ai<bl+1subscript𝑏𝑙1subscriptsuperscript𝑎𝑖subscript𝑏𝑙1b_{l-1}<a^{\prime}_{i}<b_{l+1}italic_b start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT < italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < italic_b start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT, thus ai[bj,bj+1]subscriptsuperscript𝑎𝑖subscript𝑏𝑗subscript𝑏𝑗1a^{\prime}_{i}\notin[b_{j},b_{j+1}]italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∉ [ italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT ]. 
Moreover, if ai<bjsubscript𝑎𝑖subscript𝑏𝑗a_{i}<b_{j}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT then ai<bjsubscriptsuperscript𝑎𝑖subscript𝑏𝑗a^{\prime}_{i}<b_{j}italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and if ai>bj+1subscript𝑎𝑖subscript𝑏𝑗1a_{i}>b_{j+1}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT then ai>bj+1subscriptsuperscript𝑎𝑖subscript𝑏𝑗1a^{\prime}_{i}>b_{j+1}italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT. Thus, the amount of transition points of fθ0subscript𝑓subscript𝜃0f_{\theta_{0}}italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT before bjsubscript𝑏𝑗b_{j}italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the same as the amount of transition points of fθ~0subscript𝑓subscript~𝜃0f_{\tilde{\theta}_{0}}italic_f start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT before bjsubscript𝑏𝑗b_{j}italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. We conclude that fθ0subscript𝑓subscript𝜃0f_{\theta_{0}}italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and fθ~0subscript𝑓subscript~𝜃0f_{\tilde{\theta}_{0}}italic_f start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT coincide on any such region [bj,bj+1]subscript𝑏𝑗subscript𝑏𝑗1[b_{j},b_{j+1}][ italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT ]. 
Thus, any region that is missclassified by fθ0subscript𝑓subscript𝜃0f_{\theta_{0}}italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is also missclassified by fθ~0subscript𝑓subscript~𝜃0f_{\tilde{\theta}_{0}}italic_f start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is a stable GLP.
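The perturbation argument can be illustrated numerically. The following is a minimal sketch, not part of the original proof: it assumes the domain $[0, 1]$ with $b_0 = 0$ and $b_{M+1} = 1$, a labeling convention in which a function starts at 0 and flips at each transition point, and hypothetical values for the $b_j$ and $a_i$. It does not verify that the chosen $\theta_0$ is optimal; it only checks that points misclassified by $f_{\theta_0}$ remain misclassified after a perturbation of norm below $\epsilon$.

```python
import bisect
import random

def label(transitions, x):
    """Binary label of a piecewise-constant function that starts at 0
    and flips at every transition point (illustrative convention)."""
    return bisect.bisect_right(sorted(transitions), x) % 2

def stays_misclassified(a, a_pert, b, grid):
    """True if every grid point misclassified by f_a (w.r.t. g, whose
    transition points are b) is also misclassified by f_a_pert."""
    return all(
        label(a_pert, x) != label(b, x)
        for x in grid
        if label(a, x) != label(b, x)
    )

# Hypothetical example on [0, 1].
b = [0.2, 0.4, 0.6, 0.8]   # transition points of g
a = [0.2, 0.8]             # transition points of f_theta0, a subset of b
edges = [0.0] + b + [1.0]  # assumed b_0 = 0 and b_{M+1} = 1
eps = min(r - l for l, r in zip(edges, edges[1:])) / 2

grid = [i / 1000 for i in range(1000)]
rng = random.Random(0)
scale = 0.99 * eps / len(a) ** 0.5  # keeps the Euclidean norm below eps
results = []
for _ in range(200):
    a_pert = [ai + rng.uniform(-scale, scale) for ai in a]
    results.append(stays_misclassified(a, a_pert, b, grid))
print(all(results))
```

In this toy setting the only misclassified region is $[0.4, 0.6)$, and no perturbed transition point can enter it, which mirrors the step $a'_i \notin [b_j, b_{j+1}]$ in the proof.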

B-B Equivalence Lemma

The following Lemma shows the equivalence:

$$\Pr\left(R_g(\hat{\theta}^g_n) - R_g(\theta_{opt}) > \delta\right) = \Pr\left(R_{f_{opt}}(\hat{\theta}^g_n) > \delta\right)$$

This will allow us to use the simpler right hand term instead of the PAC error probability.

Lemma 3

Let there be a hypothesis class $\{f_\theta, \theta \in \Theta\}$ and a ground truth function $g(x)$ with projection $f_{opt}$ on the hypothesis class. Under assumptions 1-5 from the main paper, for every $0 < \delta < \delta_{max}$ and $\theta \in \Theta$ the following holds:

$$R_g(\theta) - R_g(\theta_{opt}) < \delta \Longleftrightarrow R_{f_{opt}}(\theta) < \delta, \quad \theta \in \Theta \tag{26}$$
Proof 3

$\underline{\Longrightarrow}$: We have the following due to $\delta < \delta_{max}$:

$$R_{f_{opt}}(\theta) < \delta < \min_{\tilde{\theta} \notin A_{\theta_0}} R_{f_{opt}}(\tilde{\theta}) \Longrightarrow \theta \in A_{\theta_0}$$

For $\theta \in A_{\theta_0}$ we have:

$$\ell(g(x), f_{opt}(x)) = 1 \ \Longrightarrow \ \ell(g(x), f_\theta(x)) = 1$$

Denote:

$$X_1 = \{x : \ell(g(x), f_{opt}(x)) = 1\}$$
$$X_0 = \{x : \ell(g(x), f_{opt}(x)) = 0\}$$

The following chain of equalities holds for $\theta \in A_{\theta_0}$:

$$\begin{aligned}
R_g(\theta) &= \int_X \ell(g(x), f_\theta(x)) \, d\mathcal{F}(x) \\
&= \int_{X_1} \ell(g(x), f_\theta(x)) \, d\mathcal{F}(x) + \int_{X_0} \ell(g(x), f_\theta(x)) \, d\mathcal{F}(x) \\
&= \int_{X_1} 1 \, d\mathcal{F}(x) + \int_{X_0} \ell(f_{opt}(x), f_\theta(x)) \, d\mathcal{F}(x) \\
&= R_g(\theta_{opt}) + \int_X \ell(f_{opt}(x), f_\theta(x)) \, d\mathcal{F}(x) - \int_{X_1} \ell(f_{opt}(x), f_\theta(x)) \, d\mathcal{F}(x) \\
&= R_g(\theta_{opt}) + R_{f_{opt}}(\theta)
\end{aligned}$$

The third equality uses $g = f_{opt}$ on $X_0$. The last equality holds because, for binary labels, both $f_{opt}$ and $f_\theta$ differ from $g$ on $X_1$ and therefore agree there, so the integral over $X_1$ vanishes.

So, for $\theta \in A_{\theta_0}$ we have $R_g(\theta) - R_g(\theta_{opt}) = R_{f_{opt}}(\theta)$. Since $R_{f_{opt}}(\theta) < \delta$, we get $R_g(\theta) - R_g(\theta_{opt}) < \delta$.
$\underline{\Longleftarrow}$: We have the following due to $\delta < \delta_{max}$:

$$R_g(\theta) - R_g(\theta_{opt}) < \delta < \min_{\tilde{\theta} \notin A_{\theta_0}} \left[ R_g(\tilde{\theta}) - R_g(\theta_{opt}) \right]$$

Thus, $\theta \in A_{\theta_0}$. We have already shown that for $\theta \in A_{\theta_0}$ we have $R_g(\theta) - R_g(\theta_{opt}) = R_{f_{opt}}(\theta)$. So, we get $R_{f_{opt}}(\theta) < \delta$.
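The identity $R_g(\theta) - R_g(\theta_{opt}) = R_{f_{opt}}(\theta)$ used in both directions of the proof can be checked on a small discrete example. The sketch below is only an illustration under stated assumptions: a hypothetical 8-point domain with uniform mass, hand-picked label vectors for $g$, $f_{opt}$ and $f_\theta$, and a $\theta$ that satisfies the property of $A_{\theta_0}$ used above (it errs wherever $f_{opt}$ errs). It does not verify that `f_opt` is actually the projection of `g` on any particular class.

```python
from fractions import Fraction

def risk(target, hyp, prob):
    """0-1 risk of hyp against target under the point masses prob."""
    return sum(p for t, h, p in zip(target, hyp, prob) if t != h)

# Hypothetical 8-point domain with uniform probability mass.
prob = [Fraction(1, 8)] * 8
g       = [0, 1, 1, 0, 0, 1, 1, 0]
f_opt   = [0, 1, 1, 1, 1, 1, 1, 0]  # errs on X_1 = {3, 4}
f_theta = [0, 0, 1, 1, 1, 1, 0, 0]  # agrees with f_opt on X_1

# Excess risk w.r.t. g versus risk w.r.t. f_opt.
excess = risk(g, f_theta, prob) - risk(g, f_opt, prob)
print(excess == risk(f_opt, f_theta, prob))
```

Exact rational arithmetic via `Fraction` avoids floating-point noise when comparing the two sides.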

B-C Proof of Theorem 1

In this subsection we prove Theorem 1 from the main paper. We first prove two lemmas that are used in the proof; the proof of Theorem 1 itself is given at the end of this subsection.

Lemma 4

Given hypothesis set $\{f_\theta, \theta \in \Theta\}$, ground truth function $g$ with projection $f_{opt}$, and a drawn sample $x^n$, the following holds under assumptions 1-5 from the main paper:

$$\Pr\left(R_{f_{opt}}(\hat{\theta}^g(x^n)) > \delta \ \middle|\ R_{f_{opt}}(\hat{\theta}^{f_{opt}}(x^n)) > \delta\right) = 1 \tag{27}$$
Proof 4

From assumption 5 we have $\delta < \delta_{max} \leq \min_{\theta \notin A^g_{\theta_0}} R_{f_{opt}}(\theta)$. Thus, if $\hat{\theta}^g(x^n) \notin A_{\theta_0}$, then we have $R_{f_{opt}}(\hat{\theta}^g(x^n)) > \delta$ and we are done.
Let's focus on the case $\hat{\theta}^g(x^n) \in A_{\theta_0}$. Denote:

$$\tilde{x}^k = \{x \in x^n \mid g(x) \neq f_{opt}(x)\} \tag{28}$$

We have the following:

$$\begin{aligned}
\hat{\theta}^g_n \in A_{\theta_0} &\Longrightarrow \ell(f_{\hat{\theta}^g_n}(x), g(x)) \geq \ell(f_{opt}(x), g(x)) \ \text{w.p. 1} \\
&\Longrightarrow f_{\hat{\theta}^g_n}(x) = f_{opt}(x) \ \forall \ x \in \tilde{x}^k
\end{aligned} \tag{29}$$

The empirical risks for any $\theta$ can be decomposed:

$$\begin{aligned}
R^{emp}_{f_{opt}}(\theta, x^n) &= \frac{1}{n} \sum_{x \in \tilde{x}^k} \ell(f_{opt}(x), f_\theta(x)) + \frac{1}{n} \sum_{x \in x^n \setminus \tilde{x}^k} \ell(f_{opt}(x), f_\theta(x)) \\
R^{emp}_g(\theta, x^n) &= \frac{1}{n} \sum_{x \in \tilde{x}^k} \ell(g(x), f_\theta(x)) + \frac{1}{n} \sum_{x \in x^n \setminus \tilde{x}^k} \ell(g(x), f_\theta(x))
\end{aligned}$$

For $\theta \in A_{\theta_0}$, the empirical risk with regard to $g$ is:

$$R^{emp}_g(\theta, x^n) = \frac{k}{n} + \frac{1}{n} \sum_{x \in x^n \setminus \tilde{x}^k} \ell(f_{opt}(x), f_\theta(x)) \tag{30}$$
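Equation (30) follows from the decomposition of $R^{emp}_g$ in two steps. The short derivation below spells them out; it assumes binary labels with the 0-1 loss and reads $k$ as the number of sample points in $\tilde{x}^k$, which is our interpretation of the notation rather than a statement from the paper.

```latex
\begin{align*}
R^{emp}_{g}(\theta, x^n)
  &= \frac{1}{n} \sum_{x \in \tilde{x}^k}
       \underbrace{\ell(g(x), f_\theta(x))}_{=\,1 \text{ for } \theta \in A_{\theta_0}}
   + \frac{1}{n} \sum_{x \in x^n \setminus \tilde{x}^k}
       \underbrace{\ell(g(x), f_\theta(x))}_{=\,\ell(f_{opt}(x), f_\theta(x)) \text{ since } g = f_{opt} \text{ here}} \\
  &= \frac{k}{n}
   + \frac{1}{n} \sum_{x \in x^n \setminus \tilde{x}^k} \ell(f_{opt}(x), f_\theta(x))
\end{align*}
```

The first underbrace uses the property of $A_{\theta_0}$ from the proof of Lemma 3: wherever $f_{opt}$ misclassifies $g$, so does $f_\theta$.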

Thus, $R^{emp}_g(\theta, x^n)$ can be decomposed into two terms: a fixed term and a term that is minimized by $f_{opt}$. Hence $\theta_{opt}$ minimizes $R^{emp}_g(\theta, x^n)$, so the ERM $f_{\hat{\theta}^g_n}$ will choose a hypothesis that is equal to $f_{opt}$ on the set $x^n \setminus \tilde{x}^k$.
From the equality on $\tilde{x}^k$ derived from (29), $f_{opt}$ and $f_{\hat{\theta}^g_n}$ are also equal on $\tilde{x}^k$, thus they are equal on the entire sample $x^n$. We also have $R^{emp}_{f_{opt}}(\hat{\theta}^{f_{opt}}_n, x^n) = 0$ because the empirical risk is zero in realizable learning, so $f_{opt}$ and $f_{\hat{\theta}^{f_{opt}}_n}$ are equal on the entire sample $x^n$. We get:

$$f_{\hat{\theta}^g_n}(x) = f_{opt}(x) = f_{\hat{\theta}^{f_{opt}}_n}(x) \ \forall \ x \in x^n \tag{31}$$

$\hat{\theta}^g_n$ and $\hat{\theta}^{f_{opt}}_n$ have the same empirical risk. By convention, the ERM is the hypothesis with minimum empirical risk that maximizes the true risk, which is equivalent to maximizing $R_{f_{opt}}(\theta)$ (recall the equivalence in appendix section B-B). Thus, because they have the same empirical risk, $\hat{\theta}^g_n$ and $\hat{\theta}^{f_{opt}}_n$ are equal and we have:

$$\begin{aligned}
\hat{\theta}^g_n \in A_{\theta_0} &\Longrightarrow f_{\hat{\theta}^g_n} = f_{\hat{\theta}^{f_{opt}}_n} \\
&\Longrightarrow R_{f_{opt}}(\hat{\theta}^g_n) = R_{f_{opt}}(\hat{\theta}^{f_{opt}}_n) > \delta
\end{aligned}$$

To conclude, the following holds for $R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta$:

$$\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}(x^{n}))>\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}(x^{n}))>\delta\right)=1$$
Lemma 5

Given hypothesis class $\{f_{\theta},\theta\in\Theta\}$, ground truth $g$ with projection $f_{opt}$ on the class and a drawn sample $x^{n}$, the following holds under assumptions 1-5 from the main paper:

$$\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}(x^{n}))>\delta\ \middle|\ \hat{\theta}^{g}_{n}\in A_{\theta_{0}},\ R_{f_{opt}}(\hat{\theta}^{f_{opt}}(x^{n}))<\delta\right)=0 \tag{32}$$
Proof 5

Given $R_{f_{opt}}(\hat{\theta}^{f_{opt}}(x^{n}))<\delta$, any hypothesis $\theta'$ with $R_{f_{opt}}(\theta')>\delta$ does not achieve minimal empirical risk on $x^{n}$ with regard to $f_{opt}$. This follows from the convention that the ERM is the hypothesis with maximal risk among all the hypotheses with minimal empirical risk, together with the equivalence in section B-B. Denote:

$$\tilde{x}^{k}=\{x\in x^{n}\mid g(x)\neq f_{opt}(x)\} \tag{33}$$
$$\hat{x}^{m}=\{x\in x^{n}\mid f_{\theta'}(x)\neq f_{opt}(x)\} \tag{34}$$

Assume, toward a contradiction, that $\hat{\theta}^{g}_{n}=\theta'$, which means $R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta$.
We are given that $\hat{\theta}^{g}_{n}\in A_{\theta_{0}}$, so we have $\theta'\in A_{\theta_{0}}$, which means, by definition of $A_{\theta_{0}}$, that if $f_{\theta'}(x)=g(x)$ then $f_{opt}(x)=g(x)$. Thus $\tilde{x}^{k}\cap\hat{x}^{m}=\emptyset$: with binary labels, a point in both sets would satisfy $f_{\theta'}(x)=g(x)\neq f_{opt}(x)$, contradicting this implication.
Notice that $\hat{x}^{m}\neq\emptyset$ because $\theta_{opt}$ achieves lower empirical risk with regard to $f_{opt}$ than $\theta'$, so there must be at least one sample of $x^{n}$ on which $\theta_{opt}$ is better. Otherwise $f_{\theta'}$ and $f_{opt}$ would coincide on $x^{n}$ and have the same empirical risk, which contradicts $\theta'$ not being ERM with regard to $f_{opt}$. The empirical risk can be decomposed as:

$$R^{emp}_{g}(\theta,x^{n})=\frac{1}{n}\sum_{x\in\hat{x}^{m}}\ell(g(x),f_{\theta}(x))+\frac{1}{n}\sum_{x\in\tilde{x}^{k}}\ell(g(x),f_{\theta}(x)) \tag{35}$$

By definition of $\hat{x}^{m}$, we have $R_{g}^{emp}(\theta_{opt},\hat{x}^{m})<R_{g}^{emp}(\theta',\hat{x}^{m})$.
From $\theta'\in A_{\theta_{0}}$ we get $R_{g}^{emp}(\theta_{opt},\tilde{x}^{k})=R_{g}^{emp}(\theta',\tilde{x}^{k})$.
Thus we have $R_{g}^{emp}(\theta_{opt},x^{n})<R_{g}^{emp}(\theta',x^{n})$.
We found that $\theta_{opt}$ achieves lower empirical risk with regard to $g$, which contradicts $\hat{\theta}^{g}_{n}=\theta'$. Hence $\hat{\theta}^{g}_{n}\neq\theta'$ for every such $\theta'$, and therefore
$\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}(x^{n}))>\delta\mid\hat{\theta}^{g}_{n}\in A_{\theta_{0}},\ R_{f_{opt}}(\hat{\theta}^{f_{opt}}(x^{n}))<\delta\right)=0$.
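The counting argument in this proof can be checked on a small example. The sketch below is only an illustration under assumed toy data: the label vectors `g`, `f_opt` and `f_prime` (standing for $g$, $f_{opt}$ and $f_{\theta'}$ on the sample) and the helper `partial_risk` are hypothetical and not taken from the paper, and the loss is assumed to be the 0-1 loss. On points outside both sets (33) and (34) the two hypotheses agree with $g$, so the two sums in (35) account for the whole empirical risk of $\theta_{opt}$ and $\theta'$.

```python
from fractions import Fraction

# Toy binary labels on a sample of n = 8 points; all values are illustrative.
g       = [0, 1, 1, 0, 1, 0, 0, 1]  # ground truth g(x)
f_opt   = [0, 1, 1, 0, 1, 0, 1, 0]  # projection f_opt(x); differs from g at 6, 7
f_prime = [1, 0, 1, 0, 1, 0, 1, 0]  # candidate f_theta'(x); differs from f_opt at 0, 1

n = len(g)

def partial_risk(target, pred, indices):
    """0-1 loss summed over `indices`, normalised by the full sample size n."""
    return Fraction(sum(target[i] != pred[i] for i in indices), n)

# Index sets corresponding to (33) and (34).
x_tilde = [i for i in range(n) if g[i] != f_opt[i]]
x_hat = [i for i in range(n) if f_prime[i] != f_opt[i]]

# Condition used in the proof: f_theta'(x) = g(x) implies f_opt(x) = g(x).
in_A = all(f_opt[i] == g[i] for i in range(n) if f_prime[i] == g[i])

# Full empirical risks with respect to g.
risk_opt = partial_risk(g, f_opt, range(n))
risk_prime = partial_risk(g, f_prime, range(n))

# Decomposition (35) evaluated for each of the two hypotheses.
split_opt = partial_risk(g, f_opt, x_hat) + partial_risk(g, f_opt, x_tilde)
split_prime = partial_risk(g, f_prime, x_hat) + partial_risk(g, f_prime, x_tilde)

print(x_tilde, x_hat, in_A)
print(risk_opt, risk_prime)
```

In this toy case the two index sets are disjoint, both hypotheses incur the same loss on $\tilde{x}^{k}$, and $\theta'$ is strictly worse on $\hat{x}^{m}$, which is the pattern the proof relies on.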

Proof of Theorem 1:
By conditioning $\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\right)$ on $R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})$ we get:

$$\begin{aligned}
\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\right)
&=\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\Pr\left(R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\\
&\quad+\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})>\delta\right)\cdot\Pr\left(R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})>\delta\right)\\
&=\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\Pr\left(R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)
\end{aligned}$$

where the last equality is due to Lemma 4. By conditioning on $\{\hat{\theta}^{g}_{n}\in A_{\theta_{0}}\}$, we get the following:

$$\begin{aligned}
\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)
&=\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta,\ \hat{\theta}^{g}_{n}\in A_{\theta_{0}}\right)\\
&\quad\cdot\Pr\left(\hat{\theta}^{g}_{n}\in A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\\
&\quad+\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta,\ \hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\right)\\
&\quad\cdot\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\\
&=\Pr\left(\hat{\theta}^{g}_{n}\in A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)
\end{aligned}$$

where the last equality is due to Lemma 5 and because for $\delta<\delta_{max}$ we have $\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\Longrightarrow R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta$. Lemma 5 makes the conditional probability given $\hat{\theta}^{g}_{n}\in A_{\theta_{0}}$ equal to 1, and the implication makes the conditional probability given $\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}$ equal to 0. We conclude with the following equality:

$$\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})<\delta\right)=\Pr\left(\hat{\theta}^{g}_{n}\in A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\Pr\left(R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)$$

By denoting Pr(Rfopt(θ^nfopt)<δ)=1PRPrsubscript𝑅subscript𝑓𝑜𝑝𝑡subscriptsuperscript^𝜃subscript𝑓𝑜𝑝𝑡𝑛𝛿1subscript𝑃𝑅\Pr\left(R_{f_{{opt}}}(\hat{\theta}^{f_{{opt}}}_{n})<\delta\right)=1-P_{R}roman_Pr ( italic_R start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_o italic_p italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_o italic_p italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) < italic_δ ) = 1 - italic_P start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT and taking the complement probability, we get:

$$\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta\right)=P_{R}+(1-P_{R})\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)$$
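
The complement step can be spelled out as follows. This is a short worked derivation using only the factorization and the definition of $P_{R}$ given in the text; the boundary event $R_{f_{opt}}(\hat{\theta}^{g}_{n})=\delta$ is treated as in the text.

```latex
\begin{align*}
\Pr\left(R_{f_{opt}}(\hat{\theta}^{g}_{n})>\delta\right)
&= 1-(1-P_{R})\Pr\left(\hat{\theta}^{g}_{n}\in A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\\
&= 1-(1-P_{R})\left[1-\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\right]\\
&= P_{R}+(1-P_{R})\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)
\end{align*}
```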

B-D Bounds Lemma

Lemma 6

Under assumptions 1-5 from the main paper, there exists a number $\ell\in\mathbf{N}$ such that the following holds:

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\geq\Pr\left(\cup_{i=1}^{K}\{\#D_{i}+\ell\leq\#D_{i}^{\prime}\}_{n-\ell}\right)$$

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\leq\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\right)$$

Before the proof, we define the set of minimal sequences:

Definition 7

Let there be a hypothesis class $F_{\Theta}$, a function $f_{opt}\in F_{\Theta}$, and a number $0<\delta<\delta_{max}$. The set of minimal sequences with risk lower than $\delta$ with regard to $f_{opt}$ is

$$X_{min}^{\delta}=\left\{\vec{x}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n}(\vec{x}))<\delta,\ R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n}(\vec{x}/x_{i}))>\delta\ \forall\ i\right\}$$

where $\vec{x}/x_{i}$ is $\vec{x}$ without its $i$-th component.

This set is nonempty due to assumption 3. Denote by $\ell$ the maximal length of a sequence in $X_{min}^{\delta}$. For example, in the $k$-boundary hypothesis class we have $\ell\leq 2k$ for any hypothesis. Thus, any i.i.d. sequence $x^{n}$ achieving $R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n}(\vec{x}))<\delta$ can be decomposed into a minimal sequence of length at most $\ell$ and the remaining samples, which are unconstrained. So, if $R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n}(\vec{x}))<\delta$, then there is a constraint on at most $\ell$ samples of $\vec{x}$. We now prove Lemma 6.

Proof 6

We have $\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)=\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)$.
Let the maximum length of a sequence in $X_{min}^{\delta}$ be $\ell$. Then any sequence $x^{n}$ that satisfies $R_{f_{opt}}(\hat{\theta}^{g}_{n}(x^{n}))<\delta$ can be decomposed into a minimal sequence of length at most $\ell$ and the rest of the samples:

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)=\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\mid\text{constraint on at most $\ell$ samples}\right)$$

The lower bound is obtained by assuming that all $\ell$ samples fell in every $D_{i}$ region, for $i=1,\dots,K$:

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\geq\Pr\left(\cup_{i=1}^{K}\{\#D_{i}+\ell\leq\#D_{i}^{\prime}\}_{n-\ell}\right)$$

The upper bound is obtained by assuming that none of the $\ell$ samples fell in any $D_{i}$ region for $i=1,\dots,K$ (i.e., they all fell in $X_{c}$):

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\leq\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\right)$$

B-E Proof of theorem 2

In this subsection we prove Theorem 2 by showing that the error exponent of the bounds in Lemma 6 is $D_{KL}(\Pi\;\|\;Q)$. First, we analyze the second term in Theorem 1. Using Eq. (20), we have:

$$\begin{aligned}\{\hat{\theta}^{g}(x^{n})\notin A_{\theta_{0}}\}&=\cup_{i=1}^{K}\{R_{g}^{emp}(\theta_{0},x^{n})\geq R_{g}^{emp}(\theta_{i},x^{n})\}\\&=\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\end{aligned}\tag{36}$$

The subscript $n$ indicates the length of the sample $x^{n}$. Thus we have:

$$\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)=\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)\tag{37}$$

Denote the vector of non-negative integers $\vec{m}=(m_{1},\dots,m_{1,\dots,k},m^{\prime}_{1},\dots,m^{\prime}_{1,\dots,k},m_{c})$. Using Eq. (21), we have the following:

$$\begin{aligned}\Pr\left(\cup_{i=1}^{K}\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}\right)=\sum_{\vec{m}\in M_{n}}\Pr\big(&\#X_{1}=m_{1},\dots,\#X_{1,\dots,k}=m_{1,\dots,k},\\&\#X^{\prime}_{1}=m^{\prime}_{1},\dots,\#X^{\prime}_{1,\dots,k}=m^{\prime}_{1,\dots,k},\ \#X_{c}=m_{c}\big)\end{aligned}\tag{38}$$

where $M_{n}$ is the set of integer vectors whose entries sum to $n$ and that satisfy at least one of the events $\{\#D_{i}\leq\#D_{i}^{\prime}\}_{n}$:

$$M_{n}=\Bigg\{\vec{m}\ \Bigg|\ \Big\{A\begin{pmatrix}m_{1},\dots,m_{1,\dots,K},\dots,m^{\prime}_{1,\dots,K}\end{pmatrix}^{t}<0\Big\}^{c},\ m_{1}+\dots+m^{\prime}_{1}+\dots+m_{c}=n\Bigg\}\tag{39}$$

$A$ is the matrix from Eq. (22). The type of a sequence $x^{n}$ on alphabet $\chi$ is the empirical distribution of symbols in the sequence:

$$P_{x^{n}}=\left(\frac{\#a_{1}}{n},\frac{\#a_{2}}{n},\dots,\frac{\#a_{r}}{n}\right),\quad a_{1},\dots,a_{r}\in\chi$$
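
As a small illustration of the definition, the type of a finite sequence can be computed by counting symbols. This is a minimal sketch; the alphabet, the sample, and the helper name `sequence_type` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction


def sequence_type(seq, alphabet):
    """Return the type (empirical distribution) of a non-empty sequence.

    The result lists the relative frequency of each alphabet symbol,
    in the order given by `alphabet`, as exact fractions.
    """
    counts = Counter(seq)
    n = len(seq)
    return tuple(Fraction(counts[a], n) for a in alphabet)


# Hypothetical three-symbol alphabet and a sample x^6 over it.
alphabet = ("a", "b", "c")
x = ["a", "b", "a", "c", "a", "b"]

P_x = sequence_type(x, alphabet)
print(P_x)  # relative frequencies of "a", "b", "c"
```

Exact fractions are used so that every type of a length-$n$ sequence has entries that are multiples of $1/n$, which is what distinguishes $\mathcal{P}_{n}$ from the full probability simplex.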

Denote by $\mathcal{P}_{n}$ the set of all types of length-$n$ sequences and by $T(P)$ the set of sequences $x^{n}$ with type $P$. Our problem can be formulated as an i.i.d. sequence over alphabet $\chi$. $M_{n}$ can be formulated as a constraint on types instead of integers, denoted as $\tilde{M}_{n}$:

$$\tilde{M}_{n}=\Bigg\{\left(\frac{\#a_{1}}{n},\dots,\frac{\#a_{|\chi|}}{n}\right)\ \Bigg|\ \Bigg\{A\begin{pmatrix}\frac{\#a_{1}}{n}\\ \vdots\\ \frac{\#a_{|\chi|-1}}{n}\end{pmatrix}<0\Bigg\}^{c},\ \frac{\#a_{1}}{n}+\dots+\frac{\#a_{|\chi|}}{n}=1\Bigg\},\quad a_{i}\in\chi\tag{40}$$

Notice that the sets $\tilde{M}_{n}$ are subsets of the set $\Pi$ defined in Eq. (24). Thus, Eq. (38) is the sum of the probabilities of the type classes of length-$n$ sequences whose types are contained in $\Pi$:

$$\Pr\left(\cup_{j=1}^{K}\{\#D_{j}\leq\#D^{\prime}_{j}\}_{n}\right)=\sum_{P\in\mathcal{P}_{n}\cap\Pi}\Pr\left(T(P)\right)$$
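
The identity above says that the probability of a count-based event equals the sum of type-class probabilities over the types that satisfy the constraint. The sketch below checks this numerically on a toy case. The distribution `Q`, the length `n`, and the constraint `in_Pi` (one symbol at least as frequent as another) are hypothetical stand-ins and not the paper's matrix $A$ or regions $D_{i}$.

```python
from itertools import product
from math import factorial, prod

Q = (0.5, 0.3, 0.2)  # hypothetical source distribution on a 3-symbol alphabet
n = 6                # sample length, small enough to enumerate all sequences


def in_Pi(counts):
    """Toy constraint on counts: symbol 1 is at least as frequent as symbol 0."""
    return counts[1] >= counts[0]


def type_class_prob(counts, Q):
    """Pr(T(P)) for the type with the given symbol counts under i.i.d. Q."""
    multinomial = factorial(sum(counts))
    for m in counts:
        multinomial //= factorial(m)
    return multinomial * prod(q ** m for q, m in zip(Q, counts))


# Sum of type-class probabilities over the types that satisfy the constraint.
by_types = sum(
    type_class_prob((m0, m1, n - m0 - m1), Q)
    for m0 in range(n + 1)
    for m1 in range(n + 1 - m0)
    if in_Pi((m0, m1, n - m0 - m1))
)

# Brute force: add up the probability of every sequence satisfying the constraint.
brute = 0.0
for seq in product(range(3), repeat=n):
    counts = tuple(seq.count(s) for s in range(3))
    if in_Pi(counts):
        brute += prod(Q[s] for s in seq)

print(by_types, brute)
```

The two sums agree up to floating-point error, while the sum over types has only polynomially many terms in $n$ instead of $|\chi|^{n}$.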

Notice that $Q$ is not contained in $\Pi$ because of the consistency assumption. The same formulation can be done for $\Pr\left(\cup_{i=1}^{K}\{\#D_{i}+\ell\leq\#D_{i}^{\prime}\}_{n-\ell}\right)$. Define the set $\Pi_{n,\ell}$:

$$\Pi_{n,\ell}=\Bigg\{(p_{1},\dots,p_{|\chi|})\ \Bigg|\ \Bigg\{A\begin{pmatrix}p_{1}\\ \vdots\\ p_{|\chi|-1}\end{pmatrix}<\frac{\ell}{n-\ell}\Bigg\}^{c},\ \sum_{i=1}^{|\chi|}p_{i}=1,\ p_{i}\geq 0\Bigg\}\tag{41}$$

We have:

$$\Pr\left(\cup_{j=1}^{K}\{\#D_{j}+\ell\leq\#D^{\prime}_{j}\}_{n-\ell}\right)=\sum_{P\in\mathcal{P}_{n-\ell}\cap\Pi_{n,\ell}}\Pr\left(T(P)\right)$$

Theorem 3.3 in [21] states that if a set of probabilities $\Pi$ on $\chi$, which does not contain the underlying distribution $Q$, has the property:

$$\lim_{n\rightarrow\infty}D_{KL}(\Pi\cap\mathcal{P}_{n}\;\|\;Q)=D_{KL}(\Pi\;\|\;Q)$$

Then the following holds:

limn1nlogPr(T(xn)Π)=DKL(ΠQ)subscript𝑛1𝑛Pr𝑇superscript𝑥𝑛Πsubscript𝐷𝐾𝐿conditionalΠ𝑄\displaystyle\lim_{n\rightarrow\infty}\frac{1}{n}\log\Pr\left(T(x^{n})\in\Pi% \right)=-D_{KL}(\Pi\;\|\;Q)roman_lim start_POSTSUBSCRIPT italic_n → ∞ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_n end_ARG roman_log roman_Pr ( italic_T ( italic_x start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) ∈ roman_Π ) = - italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( roman_Π ∥ italic_Q )
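The limit above can be checked numerically in the simplest setting. The sketch below is only an illustration under assumed values that do not come from the paper: a binary alphabet, a hypothetical source $Q=\mathrm{Bern}(0.3)$, and the hypothetical set $\Pi=\{p_1\geq 0.5\}$, for which the minimum divergence is attained on the boundary $p_1=0.5$. It computes $-\frac{1}{n}\log\Pr(T(x^n)\in\Pi)$ exactly from the binomial tail and compares it with $D_{KL}(\Pi\,\|\,Q)$.

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence D(Bern(p) || Bern(q)) in nats."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def log_binom_pmf(n: int, k: int, q: float) -> float:
    """Log of the Binomial(n, q) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1.0 - q))

def empirical_exponent(n: int, q: float, a: float) -> float:
    """-(1/n) log Pr(fraction of ones >= a), summed in the log domain."""
    k_min = math.ceil(a * n)
    logs = [log_binom_pmf(n, k, q) for k in range(k_min, n + 1)]
    m = max(logs)
    log_prob = m + math.log(sum(math.exp(v - m) for v in logs))
    return -log_prob / n

# Assumed (hypothetical) values: Q = Bern(0.3), Pi = {p_1 >= 0.5}.
q, a = 0.3, 0.5
# Q lies outside Pi and a > q, so the closest point of Pi to Q is p_1 = a.
target = kl_bernoulli(a, q)

for n in (10, 100, 1000, 10000):
    print(n, round(empirical_exponent(n, q, a), 5), round(target, 5))
```

The gap between the two printed columns shrinks as $n$ grows, which is the behavior the theorem describes; the remaining gap at finite $n$ comes from sub-exponential factors that the exponent ignores.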

The three lemmas at the end of this section show that this condition is satisfied for both $\Pi$ and $\Pi_{n,\ell}$. We therefore have:

$$\lim_{n\rightarrow\infty}\frac{1}{n}\log\Pr\left(T(x^{n})\in\Pi\right)=-D_{KL}(\Pi\;\|\;Q)$$
$$\lim_{n\rightarrow\infty}\frac{1}{n-\ell}\log\Pr\left(T(x^{n-\ell})\in\Pi_{n,\ell}\right)=-D_{KL}(\Pi\;\|\;Q)$$

This means that the upper and lower bounds from Lemma 6 have the same error exponent $D_{KL}(\Pi\;\|\;Q)$. Thus, the error exponent of $\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\mid R_{f_{opt}}(\hat{\theta}^{f_{opt}}_{n})<\delta\right)$ is also $D_{KL}(\Pi\;\|\;Q)$. Using Theorem 1 and the error exponent for the uniform realizable case, we get:

$$\Pr\left(R_{g}(\hat{\theta}^{g}_{n})-R_{g}(\theta_{opt})>\delta\right)\doteq e^{-n\cdot\min\left\{\frac{\delta}{4},\,D_{KL}(\Pi\;\|\;Q)\right\}}$$
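As a purely illustrative reading of this expression, take the hypothetical value $D_{KL}(\Pi\;\|\;Q)=0.08$ nats (not a number from the paper) and assume both values of $\delta$ below lie in the range for which the result is stated. The exponent switches between its two terms at $\delta=4\,D_{KL}(\Pi\;\|\;Q)=0.32$:

$$\delta=0.2:\ \min\left\{\tfrac{0.2}{4},\,0.08\right\}=0.05,\qquad \delta=0.4:\ \min\left\{\tfrac{0.4}{4},\,0.08\right\}=0.08 .$$

For the smaller threshold the term $\frac{\delta}{4}$ from the uniform realizable case limits the decay rate, while for the larger threshold the distribution-dependent term $D_{KL}(\Pi\;\|\;Q)$ does.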

This proves Theorem 2. The following lemmas establish that the required conditions hold.

Lemma 7

Let $\chi$ be an alphabet with underlying probability $Q$, and let $x^{n}$ be an i.i.d. sequence over $\chi$. Let the set $\Pi$ be as in Eq. (24). For $Q\notin\Pi$, the following holds:

$$\lim_{n\rightarrow\infty}D_{KL}(\Pi\cap\mathcal{P}_{n}\;\|\;Q)=D_{KL}(\Pi\;\|\;Q)$$
Proof 7

$\Pi$ is the outside of a polygon on the probability simplex (including the boundary), thus it is a connected closed set. This means that $D_{KL}(\Pi\;\|\;Q)$ is achieved for some probability $P^{*}\in\Pi$, such that $D_{KL}(\Pi\;\|\;Q)=D_{KL}(P^{*}\;\|\;Q)$.
$D_{KL}(P\;\|\;Q)$ is continuous in $P\in\Pi$, so for every $\epsilon>0$ there exists $\delta>0$ such that if $0<\|P-P^{*}\|<\delta$ then $|D_{KL}(P\;\|\;Q)-D_{KL}(P^{*}\;\|\;Q)|<\epsilon$.
Because $\Pi$ is a closed connected set, Lemma 9 applies, so for every $\delta>0$ there exists $N$ such that for $n>N$ we have an empirical assignment $\tilde{P}_{n}\in\Pi\cap\mathcal{P}_{n}$ satisfying

$$\|\tilde{P}_{n}-P^{*}\|<\delta\ \Longrightarrow\ |D_{KL}(\tilde{P}_{n}\;\|\;Q)-D_{KL}(\Pi\;\|\;Q)|<\epsilon\ \Longrightarrow\ |D_{KL}(\Pi\cap\mathcal{P}_{n}\;\|\;Q)-D_{KL}(\Pi\;\|\;Q)|<\epsilon .$$

The last implication holds because $\tilde{P}_{n}\in\Pi\cap\mathcal{P}_{n}\subseteq\Pi$, so $D_{KL}(\Pi\;\|\;Q)\leq D_{KL}(\Pi\cap\mathcal{P}_{n}\;\|\;Q)\leq D_{KL}(\tilde{P}_{n}\;\|\;Q)$.
Hence, for every $\epsilon>0$ there exists $N$ such that for $n>N$ we have $|D_{KL}(\Pi\cap\mathcal{P}_{n}\;\|\;Q)-D_{KL}(\Pi\;\|\;Q)|<\epsilon$.
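The convergence in Lemma 7 can be observed by brute force on a small example. The following sketch rests on assumptions that are not part of the paper: a ternary alphabet, a hypothetical source $Q=(0.5,0.2,0.3)$, and the hypothetical closed set $\Pi=\{P: p_2\geq p_1\}$, which excludes $Q$. For this particular linear constraint the minimizer is an exponential tilting of $Q$, which gives the closed form $D_{KL}(\Pi\,\|\,Q)=-\log\left(2\sqrt{q_1 q_2}+q_3\right)$ used as the reference value.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def min_kl_over_types(n, q, in_pi):
    """Minimum of D(P || Q) over all types P with denominator n that lie in Pi."""
    best = math.inf
    for k1 in range(n + 1):
        for k2 in range(n + 1 - k1):
            p = (k1 / n, k2 / n, (n - k1 - k2) / n)
            if in_pi(p):
                best = min(best, kl(p, q))
    return best

# Assumed (hypothetical) example: Q and Pi are chosen only for illustration.
Q = (0.5, 0.2, 0.3)
in_pi = lambda p: p[1] >= p[0]          # closed set that does not contain Q

# Closed form for this specific Pi, obtained by exponential tilting of Q.
target = -math.log(2.0 * math.sqrt(Q[0] * Q[1]) + Q[2])

for n in (5, 20, 80, 200):
    print(n, round(min_kl_over_types(n, Q, in_pi), 6), round(target, 6))
```

The restricted minimum is never below the unrestricted one, since $\Pi\cap\mathcal{P}_n\subseteq\Pi$, and it approaches it as the grid of types becomes finer.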

Lemma 8

Let $\chi$ be an alphabet with probability $Q$, and let $x^{n-\ell}$, $\ell\in\mathbf{N}$, be an i.i.d. sequence over $\chi$. Let the set $\Pi$ be as in Eq. (24) and the set $\Pi_{n,\ell}$ as in Eq. (41). For $Q\notin\Pi$, the following holds:

$$\lim_{n\rightarrow\infty}D_{KL}(\Pi_{n,\ell}\cap\mathcal{P}_{n}\;\|\;Q)=D_{KL}(\Pi\;\|\;Q)$$
Proof 8

$\Pi_{n,\ell}$ is the outside of a polygon on $S_{\chi}$ (including the boundary), thus it is a connected closed set. We already saw in Lemma 7 that $D_{KL}(\Pi\;\|\;Q)$ is achieved for some $P^{*}\in\Pi$ such that $D_{KL}(\Pi\;\|\;Q)=D_{KL}(P^{*}\;\|\;Q)$. $D_{KL}(P\;\|\;Q)$ is continuous in $P\in\Pi$, so for every $\epsilon>0$ there exists $\delta>0$ such that if $0<\|P-P^{*}\|<\delta$ then $|D_{KL}(P\;\|\;Q)-D_{KL}(P^{*}\;\|\;Q)|<\epsilon$.
For every $\frac{\delta}{2}>0$ there exists $P'\in\Pi$ satisfying $0<\|P'-P^{*}\|<\frac{\delta}{2}$ such that $P'$ is an interior point of $\Pi$. Notice that the boundaries of $\Pi_{n,\ell}$ converge in $n$ to the boundaries of $\Pi$, so there exists $N_{1}>0$ such that for $n>N_{1}$ we have $P'\in\Pi_{n,\ell}$.
$\Pi_{N_{1},\ell}$ is a closed connected set, thus Lemma 9 applies to it. So, for every $\frac{\delta}{2}>0$ there exists $N_{2}$ such that for $n>N_{2}$ we have an empirical assignment $\tilde{P}_{n}\in\Pi_{N_{1},\ell}$ such that $\|\tilde{P}_{n}-P'\|<\frac{\delta}{2}$. Notice that $\Pi_{n_{1},\ell}\subset\Pi_{n_{2},\ell}$ for $n_{1}<n_{2}$.
So, for $n>\max(N_{1},N_{2})$ we have an empirical assignment $\tilde{P}_{n}\in\Pi_{n,\ell}\cap\mathcal{P}_{n}$ such that $\|\tilde{P}_{n}-P'\|<\frac{\delta}{2}$, and by the triangle inequality it satisfies

$$\|\tilde{P}_{n}-P^{*}\|<\delta\ \Longrightarrow\ |D_{KL}(\tilde{P}_{n}\;\|\;Q)-D_{KL}(\Pi\;\|\;Q)|<\epsilon\ \Longrightarrow\ |D_{KL}(\Pi_{n,\ell}\cap\mathcal{P}_{n}\;\|\;Q)-D_{KL}(\Pi\;\|\;Q)|<\epsilon .$$

Lemma 9

Let $\Pi$ be a closed and connected subset of $\{(p_{1},\ldots,p_{r})\mid 0\leq p_{1}+\ldots+p_{r}\leq 1,\ 0\leq p_{i}\leq 1\}$ and let $\mathcal{P}_{n}$ be the set of types of sequences of length $n$ over an alphabet $\chi$ of size $r$. For all $P^{*}\in\Pi$ the following holds:

$$\lim_{n\rightarrow\infty}\inf_{P\in\Pi\cap\mathcal{P}_{n}}\|P-P^{*}\|=0$$
Proof 9

$\Pi$ is a closed set, thus for any $P^{*}\in\Pi$ and any $\frac{\epsilon}{2}>0$ there exists $P_{q}\in Interior(\Pi)$ such that $\|P_{q}-P^{*}\|<\frac{\epsilon}{2}$ and $P_{q}$ is rational, $P_{q}=(\frac{a_{1}}{b_{1}},\ldots,\frac{a_{r}}{b_{r}}),\ a_{i},b_{i}\in\mathbf{N}$ (because the rational points of $\mathbf{Q}^{r}$ are dense in the interior of $\Pi$). For every $n$, denote the empirical probability $\tilde{P}_{n}\in\mathcal{P}_{n}$:

$$\tilde{P}_{n}=\left(\frac{\lfloor\frac{a_{1}}{b_{1}}n\rfloor}{n},\ldots,\frac{\lfloor\frac{a_{r-1}}{b_{r-1}}n\rfloor}{n},\frac{n-\sum_{i=1}^{r-1}\lfloor\frac{a_{i}}{b_{i}}n\rfloor}{n}\right)$$

Notice that $\tilde{P}_{n}$ converges to $P_{q}$. This means that for any $\frac{\epsilon}{2}>0$ there exists $N_{1}$ such that for $n>N_{1}$ we have $\|\tilde{P}_{n}-P_{q}\|<\frac{\epsilon}{2}$. Because $P_{q}$ is in the interior of $\Pi$ and $\tilde{P}_{n}$ converges to $P_{q}$, there exists $N_{2}$ such that for $n>N_{2}$ we have $\tilde{P}_{n}\in\Pi$.
Using the triangle inequality, for $n>\max(N_{1},N_{2})$ we have $\|\tilde{P}_{n}-P^{*}\|<\epsilon$ and $\tilde{P}_{n}\in\Pi\cap\mathcal{P}_{n}$ $\Longrightarrow\inf_{P\in\Pi\cap\mathcal{P}_{n}}\|P-P^{*}\|<\epsilon$.
This shows that for any $\epsilon>0$ there exists $N$ such that for $n>N$ we have $\inf_{P\in\Pi\cap\mathcal{P}_{n}}\|P-P^{*}\|<\epsilon$.
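The floor construction used in Proof 9 is easy to reproduce with exact rational arithmetic. The sketch below is a minimal illustration: the function name `floor_type` and the rational point `p_q` are hypothetical choices, not taken from the paper. It floors the first $r-1$ coordinates of $n P_q$, assigns the remaining mass to the last coordinate, and reports the sup-norm distance to $P_q$, which is below $(r-1)/n$ because each floored coordinate loses less than $1/n$.

```python
from fractions import Fraction
from math import floor

def floor_type(p_q, n):
    """Type with denominator n: floor the first r-1 coordinates of n*p_q,
    then give the leftover mass to the last coordinate."""
    head = [Fraction(floor(p * n), n) for p in p_q[:-1]]
    return head + [1 - sum(head)]

# Hypothetical rational point P_q on the simplex (r = 3), chosen for illustration.
p_q = [Fraction(1, 3), Fraction(2, 7), Fraction(8, 21)]

for n in (5, 50, 500, 5000):
    t = floor_type(p_q, n)
    dist = max(abs(a - b) for a, b in zip(t, p_q))
    # Each entry of t is a multiple of 1/n, so t is a valid type of length n.
    print(n, [str(x) for x in t], float(dist))
```

Whether $\tilde{P}_n$ also lies in $\Pi$ depends on $P_q$ being an interior point, which is the part of the argument the code does not check.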

Appendix C Generalization to Infinite Amount Of Generalized Optimum Points

In this section we briefly show how to generalize the results to the case of infinitely many GLPs, and how to derive $\Pi$ and $Q$. Let the set of GLPs be $\Theta_{opt}$, with $\theta_{opt}=\theta_{0}$ the global optimum point. We need the loss of the global optimum $\theta_{opt}$ to be bounded away from the loss of the other GLPs:

$$\exists\ \epsilon>0\ \mid\ R_{f_{opt}}(\theta^{*})-R_{f_{opt}}(\theta_{opt})>\epsilon\quad\forall\theta^{*}\in\Theta_{opt}\setminus\theta_{opt} \tag{42}$$

This is necessary for $\delta_{max}>0$. Notice that this follows from Assumptions 1, 2 and 4. Due to the completeness of $\Theta$ and the uniqueness of $\theta_{opt}$, the only hypotheses $\theta\in\Theta$ that can potentially have a risk arbitrarily close to the optimal risk $R_{g}(\theta_{opt})$ are those in the neighborhood of $\theta_{opt}$. Due to the stability assumption, there exists a small enough neighborhood of $\theta_{opt}$ such that any hypothesis in it has a higher loss with probability 1.
For each GLP $\theta^{*}\in\Theta_{opt}\setminus\theta_{opt}$ of $R_{g}(\theta)$, denote:

$$D_{\theta^{*}}=D(\theta_{0},\theta^{*}),\quad D'_{\theta^{*}}=D(\theta^{*},\theta_{0}) \tag{43}$$

We have the following:

$$\begin{aligned}\Pr\left(\hat{\theta}^{g}_{n}\notin A_{\theta_{0}}\right)&=\Pr\left(\bigcup_{\theta^{*}\in\Theta_{opt}\setminus\theta_{0}}\{R_{g}^{emp}(\theta_{0},x^{n})\geq R_{g}^{emp}(\theta^{*},x^{n})\}\right)\\&=\Pr\left(\bigcup_{\theta^{*}\in\Theta_{opt}\setminus\theta_{0}}\{\#D_{\theta^{*}}\leq\#D'_{\theta^{*}}\}_{n}\right)\end{aligned}$$

From this point, generalizing the proof of Theorem 1 is straightforward. Denote the following regions:

$$\begin{aligned}X_{\phi}&=\mathit{disjointify}(\{D_{\theta^{*}}\ ,\ \theta^{*}\in\Theta_{opt}\setminus\theta_{0}\})\\X'_{\phi}&=\mathit{disjointify}(\{D'_{\theta^{*}}\ ,\ \theta^{*}\in\Theta_{opt}\setminus\theta_{0}\})\\X_{c}&=X\setminus\left\{\cup_{\theta'\in\Theta_{opt}\setminus\theta_{0}}D'_{\theta'}\ \cup_{\theta\in\Theta_{opt}\setminus\theta_{0}}D_{\theta}\right\}\end{aligned}$$

Here the $\mathit{disjointify}$ operator takes a collection of sets and returns disjoint sets indexed by a continuous index $\phi$. We get a continuous alphabet $\chi$. Denote $X_{\Phi}=\cup_{\phi}X_{\phi}\cup_{\phi}X'_{\phi}\cup X_{c}$ and let $\chi\subseteq\mathbf{R}$ be generated by a bijective mapping $\Psi:X_{\Phi}\longrightarrow\chi$. We can always find such a mapping because $X_{\Phi}$ is a set of non-intersecting subsets of $\mathbf{X}\subseteq\mathbf{R^{N}}$, so the cardinality of $X_{\Phi}$ is no greater than the cardinality of $\mathbf{X}$ and hence no greater than the cardinality of $\mathbf{R}$. Thus, there exists a subset $\chi$ of $\mathbf{R}$ with the same cardinality as $X_{\Phi}$, which means there exists a bijective mapping from $X_{\Phi}$ to $\chi$, and $Q$ is the distribution on $\chi$.
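For intuition, a finite analogue of the $\mathit{disjointify}$ operator can be sketched as follows. This is one possible reading and an assumption on our part, not the paper's definition: it works on finite sets, and it groups points by the exact sub-collection of input sets that contain them, so that the resulting pieces are pairwise disjoint and every input set is a union of pieces. The paper's operator acts on regions of the sample space and uses a continuous index $\phi$, which this toy version does not model.

```python
from collections import defaultdict

def disjointify(sets):
    """Split a finite family of sets into disjoint pieces.

    Two points fall in the same piece iff they belong to exactly the same
    members of the family. Pieces are keyed by that membership signature.
    """
    pieces = defaultdict(set)
    for x in set().union(*sets):
        signature = frozenset(i for i, s in enumerate(sets) if x in s)
        pieces[signature].add(x)
    return dict(pieces)

# Hypothetical overlapping regions, standing in for the sets D_theta.
family = [{1, 2, 3}, {2, 3, 4}, {5}]
atoms = disjointify(family)

for signature, piece in sorted(atoms.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(signature), sorted(piece))
```

Each original set is recovered as the union of the pieces whose signature contains its index, which mirrors the role of the sets $\Phi(D_{\theta})$ introduced next.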
Denote the following sets:

$$\Phi(D_{\theta})=\{S\in X_{\Phi}\mid S\subseteq D_{\theta}\}\tag{44}$$

Let $F_{n}(r),\ r\in\chi$ be the empirical distribution (CDF) on $\chi$ induced by the drawn sequence $x^{n}$. That is, if $k$ samples from the sequence $x^{n}$ landed in the region $X_{\phi}\in X_{\Phi}$, then $F_{n}(\Psi(X_{\phi}))-\lim_{a\to\Psi(X_{\phi})^{-}}F_{n}(a)=\frac{k}{n}$. Denote the set of all such empirical distribution functions as $F^{n}_{\chi}$. Denote:

$$\begin{aligned}
\chi_{\theta}&=\{r\in\chi\mid r=\Psi(X_{\phi}),\ X_{\phi}\in\Phi(D_{\theta})\}\\
\chi'_{\theta}&=\{r\in\chi\mid r=\Psi(X_{\phi}),\ X_{\phi}\in\Phi(D'_{\theta})\}
\end{aligned}\tag{45}$$

These are the sets of values in the alphabet $\chi$ that correspond to regions in $D_{\theta}$ and $D'_{\theta}$. Denote the following set of empirical distribution functions:

$$\begin{aligned}
\tilde{M}_{n}=\bigg\{&F_{n}\in F^{n}_{\chi}\mid\exists\theta\in\Theta\ \text{s.t.}\\
&\int_{r\in\chi'_{\theta}}\Big(F_{n}(r)-\lim_{a\to r^{-}}F_{n}(a)\Big)\geq\int_{r\in\chi_{\theta}}\Big(F_{n}(r)-\lim_{a\to r^{-}}F_{n}(a)\Big)\bigg\}
\end{aligned}$$

This is the parallel of the corresponding definition of $\tilde{M}_{n}$ for the discrete alphabet. Let $F_{\chi}$ be the set of all distribution functions on $\chi$. We can now define the set $\Pi$:

$$\begin{aligned}
\Pi=\bigg\{&F\in F_{\chi}\mid\exists\theta\in\Theta\ \text{s.t.}\\
&\int_{r\in\chi'_{\theta}}\Big(F(r)-\lim_{a\to r^{-}}F(a)\Big)\geq\int_{r\in\chi_{\theta}}\Big(F(r)-\lim_{a\to r^{-}}F(a)\Big)\bigg\}
\end{aligned}\tag{46}$$

This is the parallel of Eq. (24). The results are generalized to a continuous alphabet by using the continuous version of Sanov's theorem, Theorem 11 of [22].
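When only finitely many regions carry samples, the inequality defining $\tilde{M}_{n}$ reduces to a comparison of empirical masses: the jump of $F_{n}$ at $\Psi(X_{\phi})$ is $k/n$, so each integral is simply the fraction of samples whose label falls in $\chi'_{\theta}$ or in $\chi_{\theta}$. The sketch below checks that condition under these simplifying assumptions; the function names and the dictionary layout used for $\chi_{\theta}$ and $\chi'_{\theta}$ are hypothetical and chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction


def empirical_jumps(labels):
    """Jump of the empirical CDF at each label r: (#samples equal to r) / n."""
    n = len(labels)
    return {r: Fraction(k, n) for r, k in Counter(labels).items()}


def in_M_tilde(labels, chi, chi_prime):
    """Check the defining inequality of M~_n for a finite hypothesis set.

    `chi[theta]` and `chi_prime[theta]` are sets of labels standing in for
    chi_theta and chi'_theta (an assumed layout, not from the paper).
    Returns True if, for some theta, the empirical mass on chi'_theta is at
    least the empirical mass on chi_theta.
    """
    jumps = empirical_jumps(labels)

    def mass(region):
        return sum(jumps.get(r, Fraction(0)) for r in region)

    return any(mass(chi_prime[theta]) >= mass(chi[theta]) for theta in chi)


# Four samples, already mapped to labels in the alphabet by Psi.
sample_labels = [0.0, 0.0, 1.0, 2.0]
result = in_M_tilde(sample_labels, {"theta1": {0.0}}, {"theta1": {1.0, 2.0}})
```

Exact rational arithmetic via `Fraction` is used so that the boundary case of equal masses is decided without floating-point error.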