Error Exponent in Agnostic PAC Learning
Abstract
Statistical learning theory and the Probably Approximately Correct (PAC) criterion are the common approach to mathematical learning theory. PAC is widely used to analyze learning problems and algorithms, and has been studied thoroughly. Uniform worst-case bounds on the convergence rate have been well established using, e.g., VC theory or Rademacher complexity. However, in a typical scenario the performance could be much better. In this paper, we consider PAC learning using a somewhat different tradeoff, the error exponent - a well-established analysis method in Information Theory - which describes the exponential behavior of the probability that the risk will exceed a certain threshold as a function of the sample size. We focus on binary classification and find, under some stability assumptions, an improved distribution-dependent error exponent for a wide range of problems, establishing the exponential behavior of the PAC error probability in agnostic learning. Interestingly, under these assumptions, agnostic learning may have the same error exponent as realizable learning. The error exponent criterion can be applied to analyze knowledge distillation, a problem that so far lacks a theoretical analysis.
I Introduction
Statistical machine learning studies the generalization ability and convergence rate of learning algorithms. One of the most popular criteria for learnability is the Probably Approximately Correct (PAC) criterion, suggested in [1, 2], which describes the probability that a learning algorithm outputs a hypothesis that is not too far from the optimal one.
In this work, we will consider the class of Empirical Risk Minimization (ERM) predictors, which are the most prominent approach to learning problems. ERM predictors choose the hypothesis achieving minimal loss on a given training sample, and their analysis under the PAC criterion is well established through VC theory [3, 4].
The classical setting divides the learning problem into two cases: realizable learning, in which the target function is taken from the hypothesis class, and agnostic learning, in which the target function may lie outside the class. General worst-case upper bounds for both cases are well established; see [5] for example.
Although VC theory is powerful, it provides a uniform upper bound for the worst-case scenario, whereas in a typical scenario the convergence rate could be much faster, as suggested by [6]. Indeed, the recent rise of deep learning demonstrates that uniform bounds fail to describe many practical situations, and that a better characterization comes from non-uniform, possibly distribution-dependent, analysis.
In this paper we consider agnostic PAC learning for the case of binary labels and 0-1 loss function. We derive an improved distribution-dependent error exponent for the PAC error probability, using some assumptions, for a wide range of learning problems. Moreover, we show that under the specified assumptions, the derived error exponent can be the same for both agnostic and realizable learning.
I-A Related Work
VC theory and the PAC model provide conditions for uniform consistency and bounds that are achieved, e.g., by ERM predictors [3]. This theory fails to explain the success of recent learning models, such as neural networks, as presented in [7, 8], where practical learning rates can be much faster than the ones predicted by VC theory. Moreover, [9] analyzes different types of over-parameterized models and proves that any uniform bound would yield a poor generalization bound. This issue motivated theories that provide better, non-uniform learning rates.
In this direction, works such as [10, 11, 12] establish improved bounds for specific cases and algorithms. However, these works do not provide a general theory. The works [13, 14] developed tighter bounds through distribution-dependent PAC-Bayes priors. [15] showed the existence of classes with faster rates than the classical agnostic bound, but the provided condition for such a rate is impractical for infinite feature spaces. Other works relax the uniformity property, as done by [16], who proposed a relaxed model of PAC in which the bound on the learning rate may depend on a hypothesis, but is uniform over all distributions consistent with that hypothesis. Other works focus on totally non-uniform learning bounds. For example, [17, 18] established a theory for non-uniform consistency, in which an algorithm is considered consistent if it converges to the optimal risk for any ground truth, and showed that such an algorithm exists for separable metric spaces. In [6], a theory for non-uniform PAC learning in the realizable setting is developed, showing that the learning rate can be one of three types: exponential, linear, or arbitrarily slow.
II Problem Formulation
Let the training data be pairs of data samples and their labels , where are i.i.d and drawn from a feature space according to an unknown distribution , and the labels are generated by some unknown deterministic function , called the ground-truth function. We have a hypothesis class , and would like to find the closest hypothesis to the ground truth in the class under some loss function. We focus on the setting in which the hypotheses range is binary and the loss function is the 0-1 loss:
The following notations are with regard to some arbitrary function , where can be the ground truth or some other function in discussion. For hypothesis class , denote the risk between hypothesis and some function as:
(1) |
and the empirical risk between and on sample as:
(2) |
In this paper, we analyze the Empirical Risk Minimization (ERM) algorithm, which selects the hypothesis that minimizes the empirical risk on the sample out of all the hypotheses in the class. Specifically, the ERM on a sample and hypothesis class , with regard to a function , is defined as:
(3) |
Whenever there are multiple hypotheses with the same minimal empirical risk, we use the convention of choosing the one maximizing the true risk (i.e., the worst one).
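The ERM rule with this worst-case tie-breaking convention can be sketched for a finite hypothesis class as follows. This is a minimal illustration, not the paper's implementation: the function name `erm_index`, the array layout, and the toy numbers are assumptions made here, and the true risks are passed in explicitly only so that ties can be broken.

```python
import numpy as np

def erm_index(preds, labels, true_risks):
    """Return the index of the ERM hypothesis in a finite class (0-1 loss).

    preds      : (num_hypotheses, n) array of 0/1 predictions on the sample
    labels     : (n,) array of 0/1 labels produced by the target function
    true_risks : (num_hypotheses,) true risks, used only to break ties
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    true_risks = np.asarray(true_risks, dtype=float)
    # Empirical 0-1 risk of every hypothesis: fraction of disagreements.
    emp_risk = (preds != labels).mean(axis=1)
    # All hypotheses achieving the minimal empirical risk.
    candidates = np.flatnonzero(emp_risk == emp_risk.min())
    # Worst-case convention: among the minimizers pick the largest true risk.
    return int(candidates[np.argmax(true_risks[candidates])])

# Toy data (hypothetical): hypotheses 0 and 2 both fit the sample perfectly.
preds = [[0, 1, 1], [0, 1, 0], [0, 1, 1]]
labels = [0, 1, 1]
true_risks = [0.1, 0.3, 0.2]
best = erm_index(preds, labels, true_risks)  # hypothesis 2 wins the tie
```

In the toy data two hypotheses have zero empirical risk, and the convention selects the one with the larger true risk, which is what makes the resulting PAC error probability a worst-case quantity over the empirical minimizers.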
Denote the hypothesis achieving minimum risk with regard to the ground truth as :
(4) |
We will refer to as the projection of on . We assume for simplicity that the hypothesis class is non-degenerate in the sense that there is no subset of the feature space for which all hypotheses coincide. That is, for any set with positive probability, we have:
(5) |
This is not restrictive, as any part of the feature space on which all hypotheses coincide contributes the same risk to all hypotheses and thus does not affect the choice of the ERM. We also assume that for any set of positive (Lebesgue) measure the probability measure is positive (this is not restrictive, as regions with zero probability have no effect).
II-A PAC Learning
In the context of ERM, we say that the class is (agnostic) PAC learnable [19] if there exist a sample size and an algorithm such that for every ground truth function , every probability distribution on and every , for , with probability at least we have:
(6) |
PAC actually describes the relationship between three quantities: the deviation from the optimal risk , the probability for deviation larger than and the size of the sample . We will refer to as the PAC error probability for brevity.
The analysis of learning algorithms using PAC is usually done by writing one parameter as a function of the other two. Most notably, writing as a function of and (known as sample complexity) or writing as a function of for some fixed value of (known as excess risk). In this way, we can say one algorithm is better than another if, for example, it has a better sample complexity (i.e., increases more slowly as a function of for a fixed ). We propose to fix and to look instead at the probability of deviation as a function of the sample size . In this case, we say that one algorithm is better than another if decays faster as a function of .
II-B VC Theory
VC theory [3] provides consistency conditions and uniform (worst-case) bounds for PAC learning of ERM predictors. This is done using the VC dimension of the hypothesis class, denoted , defined as the maximum sample size for which the sample can be separated into two classes in all possible label sequences, using functions from the hypothesis class.
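The shattering condition in the definition of the VC dimension can be checked by brute force on small examples. The sketch below is only an illustration under assumptions made here: the helper names `shatters` and `threshold`, and the use of one-dimensional threshold classifiers as the hypothesis class, are not taken from the paper.

```python
from itertools import product

def shatters(points, hypotheses):
    """True if the hypotheses realize all 2^m labelings of the m points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))

def threshold(t):
    # Hypothetical 1-D threshold classifier: label 1 iff x >= t.
    return lambda x: int(x >= t)

# A small grid of thresholds, including values outside [0, 1].
hyps = [threshold(t) for t in (-0.5, 0.25, 0.5, 0.75, 1.5)]

one_point = shatters([0.4], hyps)        # both labels of a single point occur
two_points = shatters([0.4, 0.6], hyps)  # labeling (1, 0) can never occur
```

A single point is shattered, while no pair of points is, since a threshold cannot label the smaller point 1 and the larger point 0. This is consistent with threshold classifiers having VC dimension 1.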
VC theory provides the following well known results for ERM (see [5] for example). In agnostic learning, for , with probability at least we have:
(7) |
In realizable learning, for , with probability at least :
(8) |
These bounds describe the generalization and convergence of the ERM predictor.
Focusing on (7), we can get as a function of and by setting the right hand side to :
(9) |
Similarly, we get the following for the realizable case:
(10) |
This formulation of the upper bound shows that the PAC error probability decays exponentially with and allows us to explore its error exponent. Recall that the error exponent of a series is defined as:
(11) |
We will use the notation to indicate that series has the same error exponent as . The concept of error exponent (see section 5.6 in [20] for example) has proven useful in Information Theory for analyzing the rate at which probabilities decay to zero. It allows utilizing powerful mathematical tools such as the method of types [21] and Sanov’s theorem [22]. We can see from (9) that the error exponent in the agnostic case is , and from (10) that the error exponent in the realizable case is . We note that the bound in (7) can be manipulated using a chaining technique [23] to get rid of the factor, but the resulting error exponent will be worse. In the next sections, we will derive an improved distribution-dependent bound for the PAC error probability for the agnostic case. This will be done under some assumptions on the learning problem (i.e., on the hypothesis class and the ground truth), described in the next sections. Under these assumptions the error exponent in the agnostic case can be the same as in the realizable case for small enough .
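To make the notion of an error exponent concrete, the following sketch evaluates the finite-sample quantity -(1/n) log a_n for a sequence that decays exponentially with a polynomial prefactor. It assumes the usual definition of the exponent as the limit of that quantity; the sequence a_n = n^3 exp(-2n) and the function names are hypothetical choices made for this illustration and are not bounds from the paper.

```python
import math

def empirical_exponent(log_a_n, n):
    """Finite-n estimate -(1/n) * log(a_n) of the error exponent."""
    return -log_a_n / n

def log_sequence(n, rate=2.0, degree=3):
    # Hypothetical sequence a_n = n**degree * exp(-rate * n), kept in log
    # form to avoid floating-point underflow for large n.
    return degree * math.log(n) - rate * n

# The estimates approach `rate` as n grows.
estimates = {n: empirical_exponent(log_sequence(n), n) for n in (10, 100, 10_000)}
```

The polynomial prefactor only contributes a term of order log(n)/n, so it vanishes in the limit. This is why two bounds that differ by polynomial factors in the sample size share the same error exponent.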
III Preliminaries
In this section we introduce a few key concepts. In order to provide some intuition, we will use the k-boundary hypothesis class as a case study and demonstrate these concepts on it.
Definition 1 (k-boundary hypothesis class)
Let . The k-boundary hypothesis set is defined as
Where . For uniqueness, equality is allowed only between the first 2 parameters or between last parameters (e.g., ).
For example, on the feature space with uniform distribution, the 2-boundary function with parameters is
We will use this example throughout this section to demonstrate the presented concepts. Another important hypothesis class, which is more closely related to neural networks, is the class of linear classifiers:
Definition 2 (linear hypothesis class)
Let there be a feature space . The k-dimensional linear hypothesis set is:
Where , and is the indicator function.
Definition 3 (Generalized Optimum Point)
For hypothesis class and ground truth function , we say that is a generalized optimum point (GLP) of if there exists a set with positive probability, such that we have .
In simple words, is a GLP if no other hypothesis can beat it uniformly on the feature space . Notice that the hypothesis minimizing the risk is always a GLP, as for every other hypothesis in the class there must exist a set for which is uniformly better, otherwise it would not be the minimizer of the risk. We will refer to as the global optimum.
Consider for example the 1-boundary hypothesis class with a ground truth as described above. The GLP’s will be and , as no other hypothesis achieves a lower loss for all . Notice that these are the only GLP’s since any other hypothesis is no better (for all ) than either or . This divides the parameter space into two groups: hypotheses that are no better than and hypotheses that are no better than . We can informally say that when an ERM learns from using the 1-boundary hypothesis class, there is going to be a competition between these 2 groups. The following definition generalizes this concept.
For each GLP , denote the set :
Definition 4 ( region)
Let be the set of GLP’s of and be the global optimum. For every denote the regions:
(12) |
In order to make these regions disjoint, we handle the intersections in the following way:
-
1.
remove all overlaps from :
-
2.
For the other regions , arbitrarily assign the intersection to one of the regions such that there will not be any overlap, to obtain the regions
These regions form a complete partitioning of such that (see proof in appendix A).
In simple words, is the set of hypotheses in that are no better than the GLP for any given (with probability 1). For the example above we have the sets and .
Note that in any (non-degenerate) agnostic learning problem we will have at least 2 GLP’s, because if is outside the class, there must be a set in with positive probability for which is different than . Any hypothesis equal to on this set will be universally better than on this set, and will not belong to (this is a consequence of (5)). Thus, there must be other GLP’s in addition to .
Definition 5 (Dominating region)
For hypothesis class , the Dominating region of on , where , with regard to , denoted as , is defined as
(13) | ||||
The dominating region is the set in the feature space for which achieves lower loss than (i.e., , ). Using our example, the dominating region of on is .
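A dominating region can be approximated numerically by comparing pointwise 0-1 losses on a grid. The sketch below is a minimal illustration under assumptions made here: the three threshold functions, the uniform grid on [0, 1], and the name `dominating_mask` are hypothetical and are not the paper's running example.

```python
import numpy as np

def dominating_mask(h1, h2, f, xs):
    """Boolean mask over xs where h1 has strictly lower 0-1 loss than h2 w.r.t. f."""
    xs = np.asarray(xs)
    loss1 = h1(xs) != f(xs)
    loss2 = h2(xs) != f(xs)
    # With 0-1 loss, "strictly lower" means h1 is right and h2 is wrong.
    return ~loss1 & loss2

# Hypothetical target and hypotheses on [0, 1] (thresholds at 0.3, 0.4, 0.7).
f = lambda x: (x >= 0.3).astype(int)
h1 = lambda x: (x >= 0.4).astype(int)
h2 = lambda x: (x >= 0.7).astype(int)

xs = np.linspace(0.0, 1.0, 1001)
mask = dominating_mask(h1, h2, f, xs)
mass = mask.mean()  # approximate probability of the region under the uniform law
```

Here h1 is correct and h2 is wrong exactly on [0.4, 0.7), so the mask approximates that interval and `mass` approximates its probability under the uniform distribution.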
Definition 6 (Stability)
We say that a GLP is stable if we can define a distance in and there exist such that for every with distance the following holds with probability 1:
(14) |
Informally, is a stable GLP if any hypothesis in its neighborhood does not have an improved classification ability with regard to any . Using our example from above, both and are stable.
IV Theoretical Results
We consider the following assumptions.
Assumption 1
is a stable GLP.
Assumption 2
is a unique GLP.
Assumption 3
The following is true in probability:
Assumption 4
is a complete space, i.e., the limit of any Cauchy sequence in is also in , and the limit of any sequence , such that the series has a limit, is also in .
Assumption 5
, where is denoted as:
(15) |
Assumption 2 is needed mainly to ease the analysis and can be generalized. Relaxing this assumption will cause the ERM to alternate between multiple regions of the (non-unique) global GLP’s. Assumption 3 is a non-uniform consistency requirement (weaker than finite VC dimension), which is reasonable. Assumption 4 is mainly a mathematical technicality. Assumption 5 means that we are looking at what happens for small enough , which is reasonable as we are interested in the behavior in the asymptotic regime. Assumption 1 is the only one that poses a significant constraint. Nevertheless, a wide range of learning problems satisfy it, such as the -boundary class with a ground truth that has a finite number of transition points (see proof in appendix B-A). It is also satisfied by some cases of linear classifiers, which are more closely related to neural networks. An example of a ground truth that satisfies these assumptions, using a 2-dimensional linear hypothesis set, is shown in Figure 1, where the optimal linear hypothesis is , and any change in its parameters results in misclassification of more features; thus it is a stable GLP.
We can now move to state our main results.
Theorem 1
where is the realizable PAC error probability when learning from (see proof in appendix B-C). This theorem decomposes the PAC error probability into the error incurred in realizable learning and the additional error incurred in agnostic learning. Notice that for realizable learning, , we get , as expected.
Denote the KL divergence projection of a distribution on a set of distributions as:
(17) |
where is the KL divergence.
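For finite alphabets, a KL divergence projection can be illustrated by minimizing the divergence over a finite list of candidate distributions. The sketch below assumes the Sanov-style convention of minimizing D(q || p) over q in the set, with p the reference distribution; this direction, the function names, and the numerical values are assumptions made for illustration and should be checked against equation (17).

```python
import numpy as np

def kl_divergence(q, p):
    """D(q || p) in nats for finite distributions; terms with q_i = 0 contribute 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0
    if np.any(p[support] == 0):
        return float("inf")  # q puts mass where p has none
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

def kl_projection(candidates, p):
    """Return (q_star, divergence) minimizing D(q || p) over a finite candidate list."""
    divs = [kl_divergence(q, p) for q in candidates]
    i = int(np.argmin(divs))
    return candidates[i], divs[i]

# Hypothetical reference distribution and candidate set.
p = [0.5, 0.3, 0.2]
candidates = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
q_star, d_star = kl_projection(candidates, p)
```

In the paper the set of distributions is defined by constraints rather than given as a finite list, so the minimization there is a constrained optimization problem; the finite search above only illustrates the quantity being minimized.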
Theorem 2
Under assumptions 1 - 5, if has a finite VC dimension, there exists a positive real number such that the following holds:
(18) |
where , is a set of distributions on some alphabet , induced by the distribution on for which the ERM will output a hypothesis outside of and is the true distribution on the alphabet.
The proof of the Theorem is provided in appendix B-E. The distribution and the set of distributions will be explicitly derived in the next section.
This theorem establishes the exponential behavior of the PAC error probability and is achieved by showing that the error exponent of is and using the uniform realizable learning bound in (10). This implies that any improved realizable bound can be plugged into theorem 2 to get an improved agnostic bound, and the requirement of finite VC dimension might be unnecessary.
The achieved error exponent is better than the classical error exponent for agnostic learning in (9), which is , as it is linear in instead of quadratic in .
Notice that because is independent of , for the error exponent is , which is the same as the worst-case realizable learning exponent. Thus, not only is the error exponent much better than the general one for agnostic learning, it also shows that agnostic learning might be no harder than realizable learning in some cases. This result can be expressed as a bound on the excess risk:
(19) |
IV-A Derivation of Error Exponent
In this section we provide details on how to construct the set and the distribution . The derivation in this section is partial and is done under the assumption that there are K+1 GLP’s. However, this is only to simplify the already complex derivation and is not a requirement (see appendices B and C for more details). must be one of the GLP’s. Denote . For each GLP of , denote the following regions:
(20) |
is the region that supports choosing over and is the opposite. Denote as the number of samples that fall in region . achieves lower empirical risk than if . Denote the disjointified regions of :
(21) | ||||
where and . Notice these regions are non-intersecting and . We can write the following equation:
(22) |
where A is a matrix. Denote the alphabet :
(23) |
Denote the probability mass function on such that , , where is the i’th symbol in and is the region corresponding to it. Denote :
(24) |
where is the complement set. This set represents all distributions on for which the ERM will output a hypothesis outside of (and thus, sub-optimal). Notice that , as is the true distribution on , and under it the ERM must converge to the optimal hypothesis due to assumption 3.
V Example
This example shows how to compute the error exponent and that it empirically converges to the value in the theorem. Let with uniform distribution, and the ground truth is:
We use the 1-boundary hypothesis class for ERM learning. The optimal hypothesis minimizing the risk is:
There are two GLP’s: . Their regions are . The pairs of regions , as defined in (20), are . The disjointified regions (as in (21)) are . We get an alphabet with probabilities . The region is:
Computing is a simple constrained optimization problem with solution . The error exponent using bound (9) is , while our improved error exponent is . Figure 2 shows that the empirical error exponent of the PAC error probability indeed converges to .
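An experiment of this kind can be sketched with a small Monte Carlo simulation. The code below is not the experiment behind Figure 2: the ground truth `f`, the threshold form of the 1-boundary hypotheses (label 1 iff x >= t), the finite grid of thresholds, and the values of `eps`, `trials` and the sample sizes are all assumptions made here for illustration.

```python
import numpy as np

def f(x):
    # Hypothetical ground truth on [0, 1] (NOT the one used in the paper):
    # label 1 on [0.2, 0.3) and on [0.5, 1].
    return (((x >= 0.2) & (x < 0.3)) | (x >= 0.5)).astype(int)

thresholds = np.linspace(0.0, 1.0, 101)      # finite grid of 1-boundary hypotheses
grid = (np.arange(10_000) + 0.5) / 10_000    # midpoints for numerical integration
fg = f(grid)
# True risk of each threshold under the uniform distribution (numerical estimate).
true_risk = np.array([np.mean((grid >= t).astype(int) != fg) for t in thresholds])
best_risk = true_risk.min()

def pac_error_probability(n, eps, trials, rng):
    """Monte Carlo estimate of P(risk(ERM) - best_risk > eps) for sample size n."""
    failures = 0
    for _ in range(trials):
        x = rng.random(n)
        y = f(x)
        # Empirical 0-1 risk of every threshold on this sample.
        preds = (x[None, :] >= thresholds[:, None]).astype(int)
        emp = (preds != y[None, :]).mean(axis=1)
        ties = np.flatnonzero(emp == emp.min())
        chosen = ties[np.argmax(true_risk[ties])]  # worst-case tie-breaking
        failures += int(true_risk[chosen] - best_risk > eps)
    return failures / trials

rng = np.random.default_rng(0)
probs = {n: pac_error_probability(n, eps=0.05, trials=2000, rng=rng)
         for n in (20, 80, 320)}
```

An empirical exponent can then be read off as -log(probs[n]) / n for the sample sizes where the estimated probability is nonzero. Resolving very small probabilities requires many more trials than used here, so the sketch is only suitable for small n.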
VI Conclusions And Future Research
We derived an improved error exponent for agnostic PAC learning and showed that in some cases agnostic learning might be no harder than realizable learning. Any new realizable learning bound can be plugged into Theorem 2 to get a better agnostic bound. This result opens new directions for research. One important goal is to find explicit conditions under which practical hypothesis classes (e.g., neural networks) satisfy the assumptions of Theorem 2.
Interestingly, the error exponent analysis of PAC learning turns out to be useful in attaining the first theoretical results for the knowledge distillation problem [25], providing conditions that define where the associated teacher-student learning is useful and where it is not.
References
- [1] L. G. Valiant, “A theory of the learnable.” Commun. ACM, vol. 27, no. 11, pp. 1134–1142, 1984. [Online]. Available: http://dblp.uni-trier.de/db/journals/cacm/cacm27.html#Valiant84
- [2] L. Valiant, Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books, 2013.
- [3] V. Vapnik, Statistical Learning Theory. Wiley New York, 1998.
- [4] V. Vapnik and A. Chervonenkis, “Theory of pattern recognition,” 1974.
- [5] O. Bousquet, S. Boucheron, and G. Lugosi, “Introduction to statistical learning theory,” Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, pp. 169–207, 2004.
- [6] O. Bousquet, S. Hanneke, S. Moran, R. Van Handel, and A. Yehudayoff, “A theory of universal learning,” in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pp. 532–541.
- [7] D. Cohn and G. Tesauro, “Can neural networks do better than the vapnik-chervonenkis bounds?” Advances in Neural Information Processing Systems, vol. 3, 1990.
- [8] ——, “How tight are the vapnik-chervonenkis bounds?” Neural Computation, vol. 4, no. 2, pp. 249–269, 1992.
- [9] V. Nagarajan and J. Z. Kolter, “Uniform convergence may be unable to explain generalization in deep learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [10] A. Nitanda and T. Suzuki, “Stochastic gradient descent with exponential convergence rates of expected classification errors,” in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1417–1426.
- [11] D. Haussler, M. Kearns, and R. E. Schapire, “Bounds on the sample complexity of bayesian learning using information theory and the vc dimension,” Machine learning, vol. 14, pp. 83–113, 1994.
- [12] J.-Y. Audibert and A. B. Tsybakov, “Fast learning rates for plug-in classifiers,” 2007.
- [13] G. Lever, F. Laviolette, and J. Shawe-Taylor, “Distribution-dependent pac-bayes priors,” in International Conference on Algorithmic Learning Theory. Springer, 2010, pp. 119–133.
- [14] ——, “Tighter pac-bayes bounds through distribution-dependent priors,” Theoretical Computer Science, vol. 473, pp. 4–28, 2013.
- [15] S. Ben-David and R. Urner, “The sample complexity of agnostic learning under deterministic labels,” in Conference on Learning Theory. PMLR, 2014, pp. 527–542.
- [16] G. M. Benedek and A. Itai, “Nonuniform learnability,” in Automata, Languages and Programming: 15th International Colloquium Tampere, Finland, July 11–15, 1988 Proceedings 15. Springer, 1988, pp. 82–92.
- [17] S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss, “Universal bayes consistency in metric spaces,” in 2020 Information Theory and Applications Workshop (ITA). IEEE, 2020, pp. 1–33.
- [18] S. Hanneke, “Learning whenever learning is possible: Universal learning under general stochastic processes,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 5751–5866, 2021.
- [19] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- [20] R. G. Gallager, Information theory and reliable communication. Springer, 1968, vol. 588.
- [21] I. Csiszar, “The method of types [information theory],” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505–2523, 1998.
- [22] I. N. Sanov, “On the probability of large deviations of random variables,” Selected Translations in Mathematical Statistics and Probability, vol. 1, pp. 213–244, 1961.
- [23] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
- [24] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
- [25] A. Hendel, “Improved PAC Learning Bounds with Application to Knowledge Distillation,” M.Sc. thesis, Dept. of Electrical Engineering - Systems, Tel-Aviv Univ., Tel-Aviv, Israel, 2023.
Appendix A Proof form a complete partitioning of
Lemma 1
Let there be a complete hypothesis set (as in assumption 4 in the paper), a ground truth function and a set of GLP’s of . The regions , form a complete partitioning of .
Proof 1
By definition, the regions are disjoint. Assume there exists a distinct set of hypotheses that do not belong to any , . We will first prove that either all hypotheses in coincide, or there exists such that there is a set with positive probability such that . Assume by contradiction that this is not true; thus there exists such that w.p. 1. For we can then find such that w.p. 1. By repeating this we get a series such that w.p. 1.
If this series is finite, either the series is not distinct and the hypotheses in coincide or for any other hypothesis in we can find a set with positive probability for which the last element in the series has lower loss, which is a contradiction.
If the series is infinite, then the series has a limit because it is monotone. By the completeness assumption, this means that the limit hypothesis of is in . If it is also in , then either all hypotheses in coincide or for any other hypothesis in we can find a set with positive probability for which the limit hypothesis has a lower loss, which is a contradiction. If the limit hypothesis is not in , then the whole series belongs to one of , , which is a contradiction.
To conclude this part, we have shown that either all hypotheses in coincide to a single hypothesis, or there exists for which there is a set with positive probability such that .
Because does not belong to any of the regions , , it means is not a GLP, as every GLP belongs to its own region. It also means that for every GLP there exists a set with positive probability such that . By definition of , we have
.
Thus, there exists a set with positive probability such that . We obtain that there exists a set with positive probability such that , which contradicts not being a GLP.
Thus, the set is empty and the regions form a complete partitioning of .
Appendix B Supplemented material for "Theoretical Results" section
In this section we’ll prove theorems 1 and 2. This will be done gradually: in subsection B-B we’ll prove an equivalent formulation of the PAC error probability that will be used in proving the theorems. In subsection B-C, we’ll prove theorem 1 along with 2 needed Lemmas. In subsection B-D we’ll prove a Lemma about lower and upper bounds on the second term in theorem 1 that will be used in proving theorem 2. Finally, in subsection B-E we’ll prove theorem 2 along with 3 needed Lemmas. The proofs of theorems 1 and 2 in this section are provided for the case of a finite number of GLP’s. This is generalized to an infinite number of GLP’s in section C of the appendix. Note that we will sometimes refer to equations from the paper.
B-A K-boundary has stable global GLP
Lemma 2
Let , be a binary ground truth function with at most transition points between and . Let be the K-boundary class. Then the optimal hypothesis when learning from with binary loss is a stable GLP.
Proof 2
Let be the optimal hypothesis when learning from with binary loss. is a GLP as for any other hypothesis in the class there must exist a set in for which is uniformly better (otherwise it wouldn’t minimize the risk). Denote the transition points of as and the transition points of as . Assume without loss of generality that isn’t equal to either or and denote and . First, we’ll prove that every transition point of equals one of - assume by contradiction that there exists that isn’t equal to one of , thus satisfying , where is one of . For , is constant (either 0 or 1). We can generate 2 new hypotheses by changing to or to ; at least one of these new hypotheses has zero loss for and coincides with outside of it. Notice that has non-zero loss on as it can coincide with either on or on . Thus, is not the risk minimizer, which is a contradiction.
We move to prove that for every , if there exists such that either or , then for - assume by contradiction that there is such a subset for which (the case of is analogous) and doesn’t coincide with . Because both and don’t have any additional transition points between and , they are constant in this region and for . By changing to be instead of , we obtain a new K-boundary hypothesis that coincides with outside and coincides with on , thus obtaining a better risk than , which contradicts its optimality.
Denote , and some perturbation with transition points , where .
For every we know that is one of . Thus, if , then
Thus, . To show is stable, we need to show that for any region for which , we also have . For every , might have loss on the region only if there is no for which or . But this means that for every , if (where ) then , thus . Moreover, if then , and if then . Thus, the number of transition points of before is the same as the number of transition points of before . We conclude that and coincide on any such region . Thus, any region that is misclassified by is also misclassified by and is a stable GLP.
B-B Equivalence Lemma
The following Lemma shows the equivalence:
This will allow us to use the simpler right hand term instead of the PAC error probability.
Lemma 3
Proof 3
: We have the following due to :
For we have:
Denote:
The following chain of equalities holds for :
So, for we have .
We got
.
:
We have the following due to :
Thus, . We’ve already shown that for we have . So, we got
B-C Proof of Theorem 1
In this subsection we will prove theorem 1 from the main paper. Before proving it, we first need to prove 2 lemmas that will be used as part of the proof. The proof of theorem 1 is given at the end of this subsection.
Lemma 4
Proof 4
From assumption 5 we have . Thus, if , then we have and we are done.
Let’s focus on the case .
Denote:
(28) |
We have the following:
(29) | ||||
The empirical risks for any can be decomposed:
For , the empirical risk with regard to is:
(30) |
Thus, can be decomposed to 2 terms - a fixed term and a term that is minimized by . Thus, is the minimizer of , so the ERM will choose a hypothesis that is equal to on the set . From equation (LABEL:equality_on_tilde_x_k), and are also equal on , thus they are equal on the entire sample . We also have because the empirical risk is zero in realizable learning, so and are equal on the entire sample . We get:
(31) |
and have the same empirical risk. By the convention the ERM is the hypothesis with minimum empirical risk that maximizes the true risk, which is equivalent to (recall the equivalence in appendix section B-B). Thus, because they have the same empirical risk, and are equal and we have:
To conclude, the following holds for :
Lemma 5
Proof 5
Given , any hypothesis with doesn’t achieve minimal empirical risk on with regard to . This is true due to the convention that the ERM is the hypothesis with maximal risk from all the hypotheses with minimal empirical risk and due to the equivalence in section B-B. Denote:
(33) | |||
(34) |
Let’s assume , which means .
We are given that , so we have , which means by definition of , that if then . Thus .
Notice that because achieves lower empirical risk with regard to than , so there must be at least one sample of on which is better, otherwise
and coincide on
and have the same empirical risk, which is a contradiction to not being ERM with regard to . The empirical risk can be decomposed to:
(35) | ||||
By definition of , we have
.
From we get .
Thus we have .
We got that achieves lower empirical risk with regard to , which is a contradiction to . Thus, if then :
.
Proof of theorem 1:
By conditioning on we get:
Where the last equality is due to Lemma 4. By conditioning on , we get the following:
Where the last equality is due to Lemma 5 and because for we have . We conclude with the following equality:
By denoting and taking the complement probability, we get:
B-D Bounds Lemma
Lemma 6
Before the proof, let’s denote the following concept of the set of minimal sequences:
Definition 7
Let there be a hypothesis class , a function , and a number . The set of minimal sequences with risk lower than with regard to is
Where is without the i’th component.
This set is nonempty due to assumption 3. Denote as the maximal length of a sequence in . For example, in the k-boundary hypothesis class we have for any hypothesis. Thus, any i.i.d sequence achieving can be decomposed into a minimal sequence of length at most and the rest of the samples which have no constraint on them. So, if , then there is a constraint on at most samples of . We’ll now state the proof for Lemma 6.
Proof 6
we have . Let the maximum length of a set in be . So any sequence that satisfies can be decomposed to a minimal sequence of length at most and the rest of the samples:
The lower bound is obtained by assuming that all samples fell in every region for :
The upper bound is obtained by assuming that all samples didn’t fall in any region for (i.e., they all fell in ):
B-E Proof of theorem 2
In this subsection we’ll prove theorem 2. This will be done by showing that the error exponent of the bounds in Lemma 6 is . First, we’ll analyze the second term in theorem 1. Using Eq.(20), we have:
(36)
Subscript indicates the length of the sample . Thus we have:
(37)
Denote the vector of non-negative integers
. Using Eq.(21), we have the following:
(38)
Where is the set of integers with sum that satisfy at least one of :
(39)
is the matrix from Eq.(22). The type of a sequence on alphabet is the empirical distribution of symbols in the sequence:
Denote as the set of all types of sequences of length and as the set of sequences with type . Our problem can be formulated as an i.i.d. sequence over the alphabet . can be formulated as a constraint on types instead of integers, denoted as :
(40)
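As an illustration of the notions of type and type class used here, the following Python sketch computes the type of a sequence, enumerates all types of a given length, and counts the sequences in a type class. The function names and the toy alphabet are assumptions made for this example and do not appear in the paper.

```python
from collections import Counter
from fractions import Fraction
from math import comb, factorial


def sequence_type(seq, alphabet):
    """Type (empirical distribution) of seq, as exact fractions per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return tuple(Fraction(counts[a], n) for a in alphabet)


def all_types(n, k):
    """All types of length-n sequences over an alphabet of size k."""
    def compositions(total, parts):
        # Yield every tuple of `parts` non-negative integers summing to `total`.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [tuple(Fraction(c, n) for c in comp) for comp in compositions(n, k)]


def type_class_size(type_, n):
    """Number of length-n sequences with the given type (multinomial coefficient)."""
    size = factorial(n)
    for p in type_:
        size //= factorial(int(p * n))
    return size


if __name__ == "__main__":
    print(sequence_type("aabac", "abc"))
    print(len(all_types(4, 3)), comb(4 + 3 - 1, 3 - 1))
```

The number of types grows only polynomially in the sequence length, while the type classes partition all sequences of that length; this is what makes it possible to reduce a probability over sequences to a sum over types.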
Notice that the sets are subsets of the set denoted in Eq.(24). Thus, Eq. (38) is the sum of types of sequences of length that are contained in :
Notice is not contained in because of the consistency assumption. The same formulation can be done for . Denote the set :
(41)
We have:
Theorem 3.3 in [21] states that if a set of probabilities on , that doesn’t contain the underlying distribution , has the property:
Then the following holds:
The three lemmas at the end of this section show that this condition is satisfied for both and . We have:
This means that the upper and lower bounds from Lemma 6 have the same error exponent . Thus, the error exponent of is also . By using theorem 1 and the error exponent for the uniform realizable case we get:
This proves theorem 2. The following Lemmas prove the fulfilment of the needed conditions.
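The identification of an error exponent with a minimal divergence over a set of types can be checked numerically in a toy case. The sketch below is not the setting of theorem 2: it assumes a binary alphabet, an underlying Bernoulli(q) distribution, and the set of types whose fraction of ones is at least a > q, for which the divergence minimizer is Bernoulli(a). All names and parameter values are illustrative assumptions.

```python
import math


def kl_bernoulli(a, q):
    """KL divergence D(Bernoulli(a) || Bernoulli(q)) in nats."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)

    return term(a, q) + term(1 - a, 1 - q)


def log_tail_prob(n, q, a):
    """log P(fraction of ones >= a) for n i.i.d. Bernoulli(q) samples."""
    k_min = math.ceil(a * n)
    # Log-probability of each admissible count k, computed via lgamma
    # so that large n does not overflow.
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(q) + (n - k) * math.log(1 - q)
        for k in range(k_min, n + 1)
    ]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))


def empirical_exponent(n, q, a):
    """Normalized exponent -(1/n) log P of the tail event."""
    return -log_tail_prob(n, q, a) / n


if __name__ == "__main__":
    q, a = 0.3, 0.5
    print("divergence:", kl_bernoulli(a, q))
    for n in (50, 500, 5000):
        print(n, empirical_exponent(n, q, a))
```

In this toy case the normalized exponent stays above the divergence and approaches it as the sample size grows, with a gap that shrinks roughly like (log n)/n.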
Lemma 7
Let there be an alphabet with underlying probability Q and an i.i.d. sequence over the alphabet. Denote the set as in Eq.(24): For , the following holds:
Proof 7
is the outside of a polygon on the probability simplex (including the boundary), thus it is a connected closed space. This means that is achieved for some probability , such that .
is continuous in , so for every there exists such that if then .
Because is a closed connected set, Lemma 9 applies, so for every there exists such that for we have an empirical assignment satisfying
.
Thus, for every there exists such that for we have
.
Lemma 8
Let be an alphabet with probability and an i.i.d. sequence , , over . Denote the set as in Eq.(24) and the set as in Eq.(LABEL:eq:pi_n_ell). For , the following holds:
Proof 8
is the outside of a polygon on (including the boundary), thus it is a connected closed set.
We already saw in Lemma 7 that is achieved for some such that .
is continuous in , so for every there exists such that if then .
For every there exists satisfying such that is an interior point of .
Notice that the boundaries of converge in to the boundaries of , so there exists such that for we have .
is a closed connected set, thus Lemma 9 applies to it.
So, for every there exists such that for we have an empirical assignment such that .
Notice that for . So, for we have an empirical assignment such that, and by the triangle inequality, it satisfies
.
Lemma 9
Let be a closed and connected subset of and let be the set of types of sequences of length over an alphabet of size . For all the following holds:
Proof 9
is a closed set, thus for any and any there exists such that and is rational (because is a dense subset of ). For every n denote the empirical probability :
Notice converges to . This means that for any there exists such that for we have .
Because is in the interior of and converges to , there exists such that for we have .
Using the triangle inequality, for we have and .
This shows that for any there exists such that for we have
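The approximation step in this proof, that types of long sequences come arbitrarily close to a given rational distribution, can be sketched in Python. The rounding rule below (largest remainder) is an assumption chosen for the illustration; the proof may construct its empirical probability differently, and the function name is hypothetical.

```python
from fractions import Fraction
from math import floor


def nearest_type(p, n):
    """Round a distribution p (list of Fractions) to a type of length n.

    Uses largest-remainder rounding: every entry of the result differs
    from the corresponding entry of p by less than 1/n.
    """
    scaled = [pi * n for pi in p]
    counts = [floor(s) for s in scaled]
    # Samples still to be assigned after flooring every scaled entry.
    remainder = n - sum(counts)
    order = sorted(range(len(p)), key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return [Fraction(c, n) for c in counts]


if __name__ == "__main__":
    p = [Fraction(1, 3), Fraction(1, 7), Fraction(11, 21)]
    for n in (5, 50, 500):
        t = nearest_type(p, n)
        print(n, t, max(abs(ti - pi) for ti, pi in zip(t, p)))
```

Since the per-symbol distance is below 1/n, any target in the interior of a set is eventually approximated by types that also lie inside the set, which is the role the rational interior point plays in the argument.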
Appendix C Generalization to Infinite Amount Of Generalized Optimum Points
In this section we briefly show how to generalize the results to the case of infinitely many GLPs and how to derive and . Let the set of GLPs be and the global optimum point. We need the loss of the global optimum to be bounded away from the loss of the other GLPs:
(42)
This is necessary for . Notice that this follows from assumptions 1, 2 and 4.
Due to the completeness of and uniqueness of , the only hypotheses that can potentially have a risk that is arbitrarily close to the optimal risk are those that are in the neighborhood . Due to the stability assumption, we know that there exists a small enough neighborhood of , such that any hypothesis in it will have a higher loss w.p. 1.
For each GLP of , denote:
(43)
We have the following:
From this point, generalizing the proof of theorem 1 is straightforward. Denote the following regions:
Where the operator takes a collection of sets and returns disjoint sets indexed by a continuous index . We get a continuous alphabet . Denote and let be generated by a bijective mapping . We can always find such a mapping because is a set of non-intersecting subsets of , so the cardinality of is no greater than the cardinality of and hence no greater than the cardinality of . Thus, there exists a subset of with the same cardinality as , which means there exists a bijective mapping from to , and is the distribution on . Denote the following sets:
(44)
Let be the empirical distribution (CDF) on induced by the drawn sequence . That is, if samples from the sequence landed in the region , then . Denote the set of all such empirical distribution functions as . Denote:
(45)
These are the sets of values in the alphabet that correspond to regions in and . Denote the following set of empirical distribution functions:
This is the parallel of Eq.(LABEL:M_n_Tilde). Let be the set of all distribution functions on . We can now define the set :
(46)