Empirical Bernstein Stopping
Volodymyr Mnih
mnih@cs.ualberta.ca
Csaba Szepesvári
szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8 Canada
Jean-Yves Audibert
audibert@certis.enpc.fr
Certis - Ecole des Ponts, 6 avenue Blaise Pascal, Cité Descartes, 77455 Marne-la-Vallée France
Willow - ENS / INRIA, 45 rue d’Ulm, 75005 Paris, France
Abstract
Sampling is a popular way of scaling up machine learning algorithms to large datasets.
The question often is how many samples
are needed. Adaptive stopping algorithms
monitor performance in an online fashion and can stop early, saving valuable resources. We consider problems where
probabilistic guarantees are desired and
demonstrate how recently-introduced empirical Bernstein bounds can be used to design
stopping rules that are efficient. We provide
upper bounds on the sample complexity of
the new rules, as well as empirical results on
model selection and boosting in the filtering
setting.
1. Introduction
Being able to handle large datasets and streaming data
is crucial to scaling up machine learning algorithms to
many real-world settings. When making even a single pass through the data is prohibitive, sampling may
offer a good solution. In order for the resulting algorithms to be theoretically sound, sampling techniques
that come with probabilistic guarantees are desirable.
For example, when estimating the error of a classifier on a large dataset one may want to sample until
the estimated error is within some small number ǫ of
the true error with probability at least 1 − δ. The key
problem is one of stopping or determining the required
number of samples. Taking too many samples will result in inefficient algorithms, while taking too few may
not be enough to achieve the desired guarantees.
Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
Finite sample bounds, such as Hoeffding’s inequality (Hoeffding, 1963), are the key technique used
by recent, non-parametric stopping algorithms with
probabilistic guarantees. While these stopping algorithms have proved to be effective for scaling up machine learning algorithms (Bradley & Schapire, 2008),
(Domingos & Hulten, 2001), they can be significantly
improved by incorporating variance information in a
principled manner. We show how to employ the recently introduced empirical Bernstein bounds (Audibert et al., 2007a) to improve stopping algorithms and
provide sample complexity bounds and empirical results to demonstrate the effect of incorporating variance information.
Before proceeding, we identify two classes of stopping
problems that will be examined. The first class concerns problems where some unknown quantities have
to be measured either up to some prespecified level
of accuracy or to support making a binary decision.
Examples in this class include stopping with a fixed
relative or absolute accuracy, with applications in hypothesis testing such as deciding on the sign of the
mean, independence tests, and change detection. In
problems belonging to the second group, the task is to
pick the best option from a finite pool while measuring
their performance using samples. Some notable examples include various versions of bandit problems, Hoeffding Races (Maron & Moore, 1993), and the general
framework for scaling up learning algorithms proposed
by Domingos (2001).
The paper is organized as follows. In Section 2 we examine Hoeffding’s inequality and introduce the empirical Bernstein bound. In Section 3, we introduce a new
stopping algorithm for stopping with a predefined relative accuracy and show that it is more efficient than
previous algorithms. Section 4 demonstrates how a
simple application of the empirical Bernstein bound
can result in substantial improvements for problems
from the second class. Conclusions and future work
directions are presented in Section 5.
2. Hoeffding Bounds vs. Empirical
Bernstein Bounds
Let $X_1, \ldots, X_t$ be real-valued i.i.d. random variables with range $R$ and mean $\mu$, and let $\bar{X}_t = \frac{1}{t}\sum_{i=1}^{t} X_i$. Hoeffding's inequality (Hoeffding, 1963) states that with probability at least $1 - \delta$,

$$|\bar{X}_t - \mu| \le R\sqrt{\frac{\log(2/\delta)}{2t}}.$$
Due to its generality, Hoeffding’s inequality has been
widely used in online learning scenarios. A drawback
of the bound is that it scales linearly with the range R
and does not scale with the variance of Xi . If a bound
on the variance is known, Bernstein’s inequality can be
used instead, which can yield significant improvements
when the variance bound is small relative to the range.
Since useful a priori bounds on the variance are rarely
available, this approach is not practical.
An approach that is more suitable to online scenarios is to apply Bernstein’s inequality to the sum of
X1 , . . . , Xt , as well as the sum of the squares to obtain
a single bound on the mean of X1 , . . . , Xt . The resulting bound, which we will refer to as the empirical
Bernstein bound (Audibert et al., 2007a) states that
with probability at least 1 − δ
$$|\bar{X}_t - \mu| \le \bar{\sigma}_t\sqrt{\frac{2\log(3/\delta)}{t}} + \frac{3R\log(3/\delta)}{t},$$

where $\bar{\sigma}_t$ is the empirical standard deviation of $X_1, \ldots, X_t$: $\bar{\sigma}_t^2 = \frac{1}{t}\sum_{i=1}^{t}(X_i - \bar{X}_t)^2$. The term involving the range $R$ decreases at the rate $t^{-1}$ and
quickly becomes negligible when the variance is large,
while the square root term depends on σ t instead of
R. Hence, when σ t ≪ R the empirical Bernstein
bound quickly becomes much tighter than Hoeffding’s
inequality.
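To make the comparison concrete, here is a minimal sketch (function and variable names are ours) that evaluates both confidence radii on low-variance data bounded in [0, 1]:

```python
import math
import random

def hoeffding_radius(R, t, delta):
    """Half-width of Hoeffding's confidence interval."""
    return R * math.sqrt(math.log(2.0 / delta) / (2.0 * t))

def empirical_bernstein_radius(sigma_t, R, t, delta):
    """Half-width of the empirical Bernstein confidence interval."""
    log_term = math.log(3.0 / delta)
    return sigma_t * math.sqrt(2.0 * log_term / t) + 3.0 * R * log_term / t

# Low-variance samples inside [0, 1]: the variance term, not the
# range, should dominate the empirical Bernstein radius.
random.seed(0)
xs = [0.5 + random.uniform(-0.01, 0.01) for _ in range(10000)]
t = len(xs)
mean = sum(xs) / t
sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / t)

h = hoeffding_radius(R=1.0, t=t, delta=0.05)
eb = empirical_bernstein_radius(sigma, R=1.0, t=t, delta=0.05)
# Here eb is roughly an order of magnitude smaller than h.
```

With these numbers the Hoeffding radius is about 0.014 regardless of the variance, while the empirical Bernstein radius shrinks with the small empirical standard deviation.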
3. Stopping Rules
Let X1 , X2 , . . . be i.i.d. random variables with mean
µ and variance σ 2 . We will refer to an algorithm as a
stopping rule if at time t it observes Xt and based on
past observations decides whether to stop or continue
sampling. If a stopping rule S returns µ̂ that satisfies
$$P\left[\,|\hat{\mu} - \mu| \le \epsilon|\mu|\,\right] \ge 1 - \delta, \qquad (1)$$
then S is an (ǫ, δ)-stopping rule and µ̂ is an (ǫ, δ)-approximation of µ. In this section, we develop an
(ǫ, δ)-stopping rule for bounded Xi .
Algorithms proposed for this problem include the
Nonmonotonic Adaptive Sampling (NAS) algorithm,
shown as Algorithm 1, due to Domingo et al. (2000a).
The general idea is to first use Hoeffding's inequality to construct a sequence $\{\alpha_t\}$ such that the event $E = \{|\bar{X}_t - \mu| \le \alpha_t,\ t \in \mathbb{N}^+\}$ occurs with probability at least $1 - \delta$, and then use this sequence to design a stopping criterion that stops only if $|\bar{X}_t - \mu| \le \epsilon|\mu|$ given that $E$ holds.
Algorithm 1 Algorithm NAS
t ← 0
repeat
  t ← t + 1
  Obtain X_t
  α ← R·√(log(t(t + 1)/δ) / (2t))
until |X̄_t| ≥ α(1 + 1/ǫ)
return X̄_t
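A runnable sketch of NAS under these definitions (the function names and the iteration cap are ours; samples come from a user-supplied `draw()` with range R):

```python
import math
import random

def nas(draw, eps, delta, R=1.0, max_t=10**7):
    """Sketch of Nonmonotonic Adaptive Sampling (Domingo et al., 2000a).

    Stops once the confidence radius alpha is small relative to the
    running mean, i.e. when |mean| >= alpha * (1 + 1/eps).
    """
    total, t = 0.0, 0
    while t < max_t:
        t += 1
        total += draw()
        mean = total / t
        alpha = R * math.sqrt(math.log(t * (t + 1) / delta) / (2.0 * t))
        if abs(mean) >= alpha * (1.0 + 1.0 / eps):
            return mean, t
    return total / t, t

# Estimate the mean of Uniform(0.4, 0.6) to 10% relative accuracy.
random.seed(1)
mu_hat, t_stop = nas(lambda: random.uniform(0.4, 0.6), eps=0.1, delta=0.1)
```

Note that the stopping time depends only on |µ| and R, not on the variance, which is exactly the weakness discussed below.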
Domingo et al. (2000a) argue that if T is the number
of samples after which NAS stops, and |µ| > 0, then
there exists a constant C such that with probability at
least 1 − δ
$$T < C \cdot \frac{R^2}{\epsilon^2\mu^2}\left(\log\frac{2}{\delta} + \log\frac{R}{\epsilon\mu}\right). \qquad (2)$$
The assumption that |µ| > 0 is necessary for guaranteeing that the algorithm will indeed stop, and will be
assumed for the rest of the section.
Dagum et al. (2000) introduced the AA algorithm for the case of bounded and nonnegative $X_i$. The AA algorithm is a three-step procedure. In the first step, a $(\max(\sqrt{\epsilon}, 1/2), \delta/3)$-approximation $\tilde{\mu}$ of $\mu$ is obtained. In the second step, $\tilde{\mu}$ is used to determine the number of samples necessary to produce an estimate $\tilde{\sigma}^2$ of $\sigma^2$ such that $\max(\tilde{\sigma}^2, \epsilon\tilde{\mu})$ is a high-probability upper bound on $\max(\sigma^2, \epsilon\mu)/2$. In the last step, $c \cdot \max(\tilde{\sigma}^2, \epsilon\tilde{\mu})\log(1/\delta)/(\epsilon^2\tilde{\mu}^2)$ samples are drawn and their average is returned as $\hat{\mu}$, where $c$ is a universal constant.
Dagum et al. (2000) prove that $\hat{\mu}$ is indeed an $(\epsilon, \delta)$-approximation of $\mu$ and that if $T$ is the number of samples taken by AA, then there exists a constant $C$ such that with probability at least $1 - \delta$

$$T \le C \cdot \max(\sigma^2, \epsilon\mu) \cdot \frac{1}{\epsilon^2\mu^2}\log\frac{2}{\delta}. \qquad (3)$$
In addition, Dagum et al. prove that if $T$ is the number of samples taken by any $(\epsilon, \delta)$-stopping rule, then there exists a constant $C'$ such that with probability at least $1 - \delta$

$$T \ge C' \cdot \max(\sigma^2, \epsilon\mu) \cdot \frac{1}{\epsilon^2\mu^2}\log\frac{2}{\delta}.$$
Hence, for bounded Xi , the AA algorithm requires
a number of samples that is at most a multiplicative
constant larger than that required by any other (ǫ, δ)-stopping rule. In this sense the algorithm achieves
“optimal” efficiency, up to a multiplicative constant.
While the AA algorithm is able to take advantage of
variance, it requires the random variables to be non-negative. A trivial extension of the AA algorithm to the case of signed random variables seems unlikely, since the rule relies heavily on the monotonicity of the partial sums that is present in the nonnegative case. On
the other hand, Equation (2) suggests that the NAS
algorithm is not able to take advantage of variance.
As the first demonstration of how the empirical Bernstein bound can be used to design improved stopping algorithms, we propose a new algorithm, EBStop,
which uses empirical Bernstein Bounds to achieve
nearly the same scaling properties as the AA algorithm and, like the NAS algorithm, only requires the
random variables to be bounded.
3.1. EBStop
Similarly to the NAS algorithm, EBStop relies on a sequence $\{c_t\}$ with the property that the event $E = \{|\bar{X}_t - \mu| \le c_t,\ t \in \mathbb{N}^+\}$ occurs with probability at least $1 - \delta$. Let $\{d_t\}$ be a positive sequence satisfying $\sum_{t=1}^{\infty} d_t \le \delta$ and let

$$c_t = \bar{\sigma}_t\sqrt{\frac{2\log(3/d_t)}{t}} + \frac{3R\log(3/d_t)}{t}.$$
Since {dt } sums to at most δ and (X t − ct , X t + ct ) is a
1 − dt confidence interval for µ obtained from the empirical Bernstein bound, by a union bound argument,
the event E indeed occurs with probability at least
1 − δ. In our work, we use $d_t = c/t^p$ with $c = \delta(p-1)/p$.
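As a quick numerical sanity check on this choice (the concrete values δ = 0.1 and p = 1.1 are ours), the series indeed stays below δ:

```python
# With p = 1.1 and c = delta * (p - 1) / p, the sequence
# d_t = c / t**p sums to at most delta, since
# sum_t t**(-p) <= p / (p - 1) by an integral comparison.
delta, p = 0.1, 1.1
c = delta * (p - 1) / p
partial = sum(c / t ** p for t in range(1, 10**6))
# partial stays below delta even after a million terms
```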
The pseudocode for EBStop is shown as Algorithm 2,
but the general idea is as follows. After drawing t
samples, we set $\mathrm{LB} = \max(0, \max_{1\le s\le t}(|\bar{X}_s| - c_s))$ and $\mathrm{UB} = \min_{1\le s\le t}(|\bar{X}_s| + c_s)$. EBStop terminates as soon as $(1+\epsilon)\mathrm{LB} \ge (1-\epsilon)\mathrm{UB}$ and returns $\hat{\mu} = \mathrm{sgn}(\bar{X}_t)\cdot\frac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1-\epsilon)\mathrm{UB}\right]$.
To see why µ̂ is an (ǫ, δ)-approximation, suppose the
stopping condition has been satisfied and event E
holds. Then
$$|\hat{\mu}| = \tfrac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1-\epsilon)\mathrm{UB}\right] \le \tfrac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1+\epsilon)\mathrm{LB}\right] \le (1+\epsilon)|\mu|,$$

where the first inequality uses the stopping condition and the second uses $\mathrm{LB} \le |\mu|$ on $E$. Similarly, $(1-\epsilon)|\mu| \le |\hat{\mu}|$. From the definition of LB, it also follows that $|\bar{X}_t| > c_t \ge |\bar{X}_t - \mu|$, which implies that $\mathrm{sgn}(\hat{\mu}) = \mathrm{sgn}(\mu)$. Since event $E$ holds with probability at least $1 - \delta$, $\hat{\mu}$ is indeed an $(\epsilon, \delta)$-approximation of $\mu$.
Algorithm 2 Algorithm EBStop
LB ← 0
UB ← ∞
t ← 1
Obtain X_1
while (1 + ǫ)LB < (1 − ǫ)UB do
  t ← t + 1
  Obtain X_t
  LB ← max(LB, |X̄_t| − c_t)
  UB ← min(UB, |X̄_t| + c_t)
end while
return sgn(X̄_t) · (1/2) · [(1 + ǫ)LB + (1 − ǫ)UB]
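A compact Python sketch of EBStop (naming and the iteration cap are ours; it keeps running sums for the mean and variance and uses d_t = c/t^p as above):

```python
import math
import random

def ebstop(draw, eps, delta, R=1.0, p=1.1, max_t=10**6):
    """Sketch of EBStop: maintain running LB/UB from empirical
    Bernstein intervals; stop once (1 + eps) * LB >= (1 - eps) * UB."""
    c = delta * (p - 1) / p            # ensures sum_t c / t**p <= delta
    total, total_sq, t = 0.0, 0.0, 0
    lb, ub, mean = 0.0, float("inf"), 0.0
    while (1 + eps) * lb < (1 - eps) * ub and t < max_t:
        t += 1
        x = draw()
        total += x
        total_sq += x * x
        mean = total / t
        var = max(total_sq / t - mean * mean, 0.0)   # empirical variance
        log_term = math.log(3.0 / (c / t ** p))      # log(3 / d_t)
        c_t = math.sqrt(2.0 * var * log_term / t) + 3.0 * R * log_term / t
        lb = max(lb, abs(mean) - c_t)
        ub = min(ub, abs(mean) + c_t)
    sign = 1.0 if mean >= 0 else -1.0
    return sign * 0.5 * ((1 + eps) * lb + (1 - eps) * ub), t

# Estimate the mean of Uniform(0.45, 0.55) to 10% relative accuracy.
random.seed(2)
mu_hat, t_stop = ebstop(lambda: random.uniform(0.45, 0.55), eps=0.1, delta=0.1)
```

Because the samples have a small variance relative to R = 1, the stop happens after on the order of a thousand samples rather than the tens of thousands a range-only rule would need.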
While we omit the proof due to space constraints¹, we note that if $X$ is a random variable with range $R$, and if $T$ is the number of samples taken by EBStop on $X$, then there exists a constant $C$ such that with probability at least $1 - \delta$

$$T < C \cdot \max\left(\frac{\sigma^2}{\epsilon^2\mu^2}, \frac{R}{\epsilon|\mu|}\right)\left(\log\frac{1}{\delta} + \log\frac{R}{\epsilon|\mu|}\right). \qquad (4)$$
This bound is very similar to the upper bound for the stopping time of the AA algorithm, the only difference being the extra $\log\frac{R}{\epsilon|\mu|}$ term. This term comes from constructing a confidence interval at each $t$ and is not an artifact of our proof techniques. However, this extra term can be reduced to $\log\log\frac{1}{\epsilon|\mu|}$ by applying a geometric grid, as we will see in the next section. Since EBStop does not require the variables to be non-negative, we can say that EBStop combines the best properties of the NAS and AA algorithms for signed random variables.
3.2. Improving EBStop
While EBStop has the desired scaling properties, we
make two simple improvements in order to make it
more efficient in practice.
The first improvement is based on the idea that if the
algorithm is not close to stopping, there is no point
in checking the stopping condition at every point. We
incorporate this idea into EBStop by adopting a geometric sampling schedule, also used by Domingo and
Watanabe (2000a). Instead of testing the stopping criterion after each sample, we perform the kth test after
⌈β k ⌉ samples have been taken for some β > 1. Under this sampling strategy, when EBStop constructs a
1 − d confidence interval after t samples, d is of the order $1/(\log_\beta t)^p$, which is much larger than $1/t^p$. Since this results in tighter confidence intervals, LB and UB will approach each other faster and the stopping condition will be satisfied after fewer samples.

¹A version of the paper containing the proofs will be made available as a technical report.
While geometric sampling can often reduce the number
of required samples, it can also lead to taking roughly
β times too many samples, because testing is only done
at the ends of intervals. Nevertheless, the following
result due to Audibert et al. (2007b) can be used to
test the stopping condition after each sample without
sacrificing the advantages of geometric sampling. Let
t1 ≤ t2 for t1 , t2 ∈ N and let α ≥ t2 /t1 . Then with
probability at least 1 − d, for all t ∈ {t1 , . . . , t2 }
$$|\bar{X}_t - \mu| \le \bar{\sigma}_t\sqrt{\frac{2\alpha\log(3/d)}{t}} + \frac{3R\alpha\log(3/d)}{t}. \qquad (5)$$

We use Equation (5) with $t_1 = \lfloor\beta^k\rfloor + 1$, $t_2 = \lfloor\beta^{k+1}\rfloor$, and $d = c/k^p$ to construct $c_t$ for each $t \in \{t_1, \ldots, t_2\}$. This allows us to test the stopping condition after each sample, while using a $d$ that is of the order $1/(\log_\beta t)^{p\alpha}$ after $t$ samples. A variant of EBStop that incorporates these two improvements is shown as Algorithm 3.
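As a complement to the pseudocode of Algorithm 3, the geometric-grid bookkeeping can be sketched as follows (variable names and the concrete parameter values are ours):

```python
import math

def ebgstop_schedule(t_max, beta=1.5, p=1.1, delta=0.1):
    """Bookkeeping of Algorithm 3: at each sample index t, track the
    geometric-grid index k and x = alpha * log(3 / d_k), where
    d_k = c / k**p and alpha = floor(beta**k) / floor(beta**(k-1))."""
    c = delta * (p - 1) / p
    k, x = 0, 0.0
    out = []
    for t in range(2, t_max + 1):
        if t > math.floor(beta ** k):      # crossed a grid boundary
            k += 1
            alpha = math.floor(beta ** k) / math.floor(beta ** (k - 1))
            x = alpha * math.log(3.0 / (c / k ** p))
        out.append((t, k, x))
    return out

sched = ebgstop_schedule(100)
# k grows like log_beta(t), so the confidence levels d_k shrink only
# polylogarithmically with t.
```

The quantity x is exactly what Algorithm 3 reuses in c_t = σ̄_t·√(2x/t) + 3Rx/t between consecutive grid boundaries.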
Algorithm 3 Algorithm EBGStop
LB ← 0
UB ← ∞
t ← 1
k ← 0
Obtain X_1
while (1 + ǫ)LB < (1 − ǫ)UB do
  t ← t + 1
  Obtain X_t
  if t > ⌊β^k⌋ then
    k ← k + 1
    α ← ⌊β^k⌋ / ⌊β^(k−1)⌋
    x ← α log(3/d_k)
  end if
  c_t ← σ̄_t·√(2x/t) + 3Rx/t
  LB ← max(LB, |X̄_t| − c_t)
  UB ← min(UB, |X̄_t| + c_t)
end while
return sgn(X̄_t) · (1/2) · [(1 + ǫ)LB + (1 − ǫ)UB]

One can show that adding geometric sampling to EBStop reduces the $\log\frac{1}{\epsilon|\mu|}$ term in inequality (4) to $\log\log\frac{1}{\epsilon|\mu|}$. It should be noted that, by the arguments of Dagum et al. (2000), no stopping rule can achieve a better bound than (3) for the case of bounded non-negative random variables. Hence, EBGStop is very close to being "optimal" in this sense. Where it would lose (for non-negative random variables) to AA is when ǫ, µ, and δ are such that log(R/(ǫµ)) becomes significantly larger than 1/δ. We do not expect to see this happening in practice for not too large values of δ. For example, for δ = 0.05 and R = 1, the condition is ǫµ < e^{−20}.

Figure 1. Comparison of stopping rules on averages of uniform random variables with varying variances. Error bars are at one standard deviation.

3.3. Results: Synthetic Data

In this section we evaluate the stopping rules AA, NAS, geometric NAS, EBStop, and EBGStop on the problem of estimating means of various random variables. To make the comparison fair, the geometric version of the NAS algorithm and EBGStop both grew intervals by a factor of 1.5, as this value worked best in previous experiments (Domingo & Watanabe, 2000b). We also used $d_t = (t(t+1))^{-1}$ in EBStop and EBGStop, since this is the sequence implicitly used in NAS for constructing confidence intervals at time t. Since this put EBGStop at a slight disadvantage, we also include results for EBGStop, denoted by EBGStop*, with our default choice of $d_t = c/(\log_\beta t)^p$, p = 1.1, and β = 1.1. In all the experiments we used ǫ = δ = 0.1. We use only non-negative valued random variables, as they allow comparison to AA. Finally, we only compare the number of samples taken, because none of the algorithms produced any estimates with relative error greater than ǫ in any of our experiments.
The first set of experiments was meant to test how
well the various stopping rules are able to exploit
the variance when it is small. Let the average
of n uniform [a, b] random variables be denoted by
U (a, b, n). Note that the expected value and variance of U (a, b, n) are (a + b)/2 and (b − a)2 /(12n),
respectively. For this comparison we fixed a to 0, b
to 1, and varied n to control the variance for a fixed
mean. Figure 1 shows the results of running each
stopping rule 100 times on U (0, 1, n) random variables
for n = 1, 5, 10, 50, 100, 1000. Not surprisingly, NAS
and geometric NAS fail to make use of the variance
and take roughly the same numbers of samples for all values of n. Variants of EBStop improve when the variance decreases, with EBGStop* performing especially well, beating all the other algorithms for all the scenarios tested. AA initially improves with the decreasing variance, but the effect is not as large as with EBGStop* because of the multi-phase structure of AA.

In the second set of experiments we fix n at 10 and b − a at 0.2, keeping the variance fixed, and vary the mean. The variance is small enough that EBStop, its variants, and AA should take a number of samples on the order of R/(ǫµ). The results are presented in Figure 2 and suggest that both variants of NAS require 1/µ times more samples than the variance-adaptive methods. Note that Figure 2 shows the number of samples taken by each method in log scale.

Figure 2. Comparison of stopping rules on averages of uniform random variables with varying means. The number of samples is shown in log scale.

It may be surprising that in both experiments the AA algorithm did not outperform EBStop and EBGStop even though AA offers better guarantees on sample complexity. We believe that EBStop is able to make better use of the data because it uses all samples in its stopping criterion, while AA wastes some samples on intermediate quantities. However, this difference should be reflected in the hidden constants. As discussed earlier, for really small values of µ and ǫ the AA algorithm should stop earlier than EBStop.

Finally, we include a comparison of the stopping rules on Bernoulli random variables. Since Bernoulli random variables have the maximal variance among all bounded random variables, the advantage of variance estimation should be diminished. However, inequality (4) suggests that in the case of Bernoulli random variables EBStop requires O(1/(ǫ²µ)) samples, since σ² = µ(1 − µ). Similarly, inequality (2) suggests that NAS requires O(1/(ǫ²µ²)) samples.

Figure 3 shows the results of running each stopping rule on Bernoulli random variables with means 0.99, 0.9, 0.5, 0.1, 0.05, and 0.01, averaged over 100 runs. As in the previous set of experiments, the variance-adaptive methods seem to require 1/µ times fewer samples to stop. It should also be noted that the geometric version of the NAS algorithm does outperform EBStop for some intermediate values of µ, where the variance is the largest. However, the performance difference is not large, and so we think the price paid for the unboundedly better performance of EBStop at small or large values of µ is not large.

Figure 3. Comparison of stopping rules on Bernoulli random variables. The number of samples is shown in log scale.
3.4. Results: FilterBoost
Boosting by filtering (Bradley & Schapire, 2008) is a
framework for scaling up boosting algorithms to large
or streaming datasets. Instead of working with the
entire training set, all steps, such as finding a weak
learner that has classification accuracy of at least 0.5,
are done through sampling that employs stopping algorithms. Bradley and Schapire showed that such an
approach can lead to a drastic speedup over a batch
boosting algorithm.
We evaluated the suitability of EBGStop and both
variants of the NAS algorithm for the boosting by filtering setting by plugging them into the FilterBoost
algorithm (Bradley & Schapire, 2008). The AA algorithm was not included because it cannot deal with
signed random variables.
Following Bradley and Schapire, the Adult and Covertype datasets from the UCI machine learning repository (Asuncion & Newman, 2007) were used. The
Covertype dataset was converted into a binary classification problem by taking "Lodgepole Pine" as one
class and merging the other classes. In setting up
boosting we followed the procedure of Domingo and
Watanabe (2000b) who also considered the use of stopping rules in the same context. Accordingly, we used
decision stumps as weak learners and we discretized
all continuous attributes by binning their values into
five equal bins. The results for the Adult dataset were
averaged over 10 runs on the training set, while 10-fold
cross-validation was used for the Covertype dataset.
As shown in Figure 4, EBGStop required fewer samples and offered lower variance in stopping times than
either variant of the NAS algorithm on both datasets.
At the same time, the resulting classification accuracies were within 0.2% of each other on the Adult
dataset and within 0.04% of each other on the Covertype dataset.
4. Racing Algorithms
In this section we demonstrate how a general stopping algorithm that makes use of finite sample deviation bounds can be improved with the use of empirical
Bernstein bounds. We consider the Hoeffding races
algorithm (Maron & Moore, 1993) since it is representative of the class of general stopping algorithms.
Racing algorithms aim to reduce the computational burden of performing tasks such as model selection using a hold-out set by discarding poor models quickly
(Maron & Moore, 1993; Ortiz & Kaelbling, 2000).
The context of racing algorithms is that of multi-armed bandit problems. Formally, consider $M$ options. When option $m$ is chosen for the $t$th time, it gives a random value $X_{m,t}$ from an unknown distribution $\nu_m$. The samples $\{X_{m,t}\}_{t\ge1}$ are independent of each other. Let $\mu_m = \int x\,\nu_m(dx)$ be the mean reward of option $m$. The goal is to find an option with the highest mean reward.
Let δ > 0 be the confidence level parameter and N
be the maximal amount of time allowed for deciding
which option leads to the best expected reward. A
racing algorithm either terminates when it runs out
of time (i.e. at the end of the N -th round) or when
it can say that with probability at least 1 − δ, it has
found the best option, i.e. an option m∗ with µm∗ =
maxm∈{1,...,M } µm .
The Hoeffding race is an algorithm based on discarding
options which are likely to have smaller mean than the
optimal one until only one option remains. Precisely,
for each time step and each distribution, δ/(M N ) confidence intervals are constructed for the mean. Options
with upper confidence smaller than the lower confi-
dence bound of another option are discarded. The algorithm samples one by one all the options that have
not been discarded yet.
We assume that the rewards have a bounded range
R. If X m,t denotes the sample mean for option m
after seeing t samples of this option then according to
Hoeffding’s inequality, a δ/(M N ) confidence interval
for the mean of option m is
$$\left[\,\bar{X}_{m,t} - R\sqrt{\frac{\log(2MN/\delta)}{2t}},\ \ \bar{X}_{m,t} + R\sqrt{\frac{\log(2MN/\delta)}{2t}}\,\right].$$
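A minimal sketch of the resulting elimination loop (the structure and names are ours; `options` is a list of callables returning bounded rewards):

```python
import math
import random

def hoeffding_race(options, N, delta, R=1.0):
    """Sketch of the Hoeffding race: sample every surviving option once
    per round, then discard any option whose upper confidence bound
    falls below the best lower confidence bound."""
    M = len(options)
    alive = set(range(M))
    total = [0.0] * M
    count = [0] * M
    budget = N
    while len(alive) > 1 and budget > 0:
        for m in list(alive):
            if budget == 0:
                break
            total[m] += options[m]()
            count[m] += 1
            budget -= 1
        radius = {m: R * math.sqrt(math.log(2 * M * N / delta)
                                   / (2 * count[m])) for m in alive}
        lower = {m: total[m] / count[m] - radius[m] for m in alive}
        upper = {m: total[m] / count[m] + radius[m] for m in alive}
        best_lower = max(lower.values())
        alive = {m for m in alive if upper[m] >= best_lower}
    return alive

# Three Bernoulli "options"; the race should keep only the best one.
random.seed(3)
arms = [lambda q=q: float(random.random() < q) for q in (0.2, 0.5, 0.8)]
survivors = hoeffding_race(arms, N=30000, delta=0.1)
```

The default-argument trick `q=q` binds each success probability at definition time, which is needed when building closures in a loop.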
The Hoeffding race was introduced and studied in (Maron & Moore, 1993; Maron & Moore, 1997) from a slightly different viewpoint: there the target was to find an option with mean at most ǫ below the optimal mean $\max_{m\in\{1,\ldots,M\}} \mu_m$, where ǫ is a given positive parameter. The same problem was also studied
by (Even-Dar et al., 2002, Theorem 3) in the infinite
horizon setting.
By substituting Hoeffding’s inequality with the empirical Bernstein bound we obtain a new algorithm, which
we will refer to as the empirical Bernstein race.
4.1. Analysis of Racing Algorithms
For the analysis we are interested in the expected number of samples taken by the Hoeffding race and the
empirical Bernstein race. Due to space limitations, we
omit the proofs of the following theorems.
Let ∆m = µm∗ − µm , where option m∗ still denotes
an optimal option: µm∗ = maxm∈{1,...,M } µm . Let ⌈u⌉
denote the smallest integer larger or equal to u, and
let ⌊u⌋ denote the largest integer smaller or equal to u.
Theorem 1 (Hoeffding Race). Let $n_H(m) = \left\lceil\frac{8R^2\log(2MN/\delta)}{\Delta_m^2}\right\rceil$. Without loss of generality, assume that $n_H(1) \le n_H(2) \le \ldots \le n_H(M)$. The number of samples, $T$, taken by the Hoeffding race is bounded by

$$2\sum_{m:\,\mu_m < \mu_{m^*}} n_H(m).$$
The probability that no optimal option is returned is bounded by δ. If the algorithm runs out of time, then with probability at least $1 - \delta$, (i) the number of discarded options is at least $d$, where $d$ is the largest integer such that $2\sum_{m=1}^{d} n_H(m) \le N$, and (ii) the non-discarded options satisfy

$$\mu_m \ge \mu_{m^*} - 4R\sqrt{\frac{\log(2MN/\delta)}{2\lfloor N/M\rfloor}}.$$
Figure 4. Comparison of the number of samples required by different stopping rules in FilterBoost. Parameters ǫ and δ were set to 0.1 for both methods, while τ was set to 0.25. Error bars are at one standard deviation. (a) Results on the Adult dataset. (b) Results on the Covertype dataset.

We recall that the principle of the empirical Bernstein race is the same as that of the Hoeffding race: we sample one by one all the options that have not been discarded yet, and the algorithm discards an option as soon as the upper confidence bound on its mean reward is smaller than the lower confidence bound on the mean of some other option.
Theorem 2 (Empirical Bernstein Race). Let $\sigma_m$ denote the standard deviation of $\nu_m$. Introduce $\Sigma_m = \sigma_{m^*} + \sigma_m$ and

$$n(m) = \left\lceil\frac{\left(8\Sigma_m^2 + 18R\Delta_m\right)\log(4MN/\delta)}{\Delta_m^2}\right\rceil.$$

Without loss of generality, assume that $n(1) \le n(2) \le \ldots \le n(M)$. The number of samples taken by the empirical Bernstein race is bounded by

$$2\sum_{m:\,\mu_m < \mu_{m^*}} n(m).$$
The probability that no optimal option is returned is bounded by δ. If the algorithm runs out of time, then with probability at least $1 - \delta$, (i) the number of discarded options is at least $d$, where $d$ is the largest integer such that $2\sum_{m=1}^{d} n(m) \le N$, and (ii) the non-discarded options satisfy

$$\mu_m \ge \mu_{m^*} - \Sigma_m\sqrt{\frac{8\log(4MN/\delta)}{\lfloor N/M\rfloor}} - \frac{9R\log(4MN/\delta)}{\lfloor N/M\rfloor}.$$
As can be seen from the bounds, the result of incorporating the variance estimates is similar to what was
observed in Section 3: The dependence of the number
of required samples on R2 is reduced to a dependence
on R and the variance. Similar results can be expected
when applying the empirical Bernstein bound to other
situations.
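To get a feel for the two theorems, one can compare the per-option counts $n_H(m)$ and $n(m)$ for illustrative values (all concrete numbers here are ours, not from the paper):

```python
import math

def n_hoeffding(R, gap, M, N, delta):
    """Per-option sample count n_H(m) from Theorem 1; gap = Delta_m."""
    return math.ceil(8 * R**2 * math.log(2 * M * N / delta) / gap**2)

def n_bernstein(R, sigma_star, sigma_m, gap, M, N, delta):
    """Per-option sample count n(m) from Theorem 2,
    with Sigma_m = sigma_star + sigma_m."""
    Sigma = sigma_star + sigma_m
    return math.ceil((8 * Sigma**2 + 18 * R * gap)
                     * math.log(4 * M * N / delta) / gap**2)

# Low-variance options relative to the range R = 1 (illustrative).
nH = n_hoeffding(R=1.0, gap=0.01, M=10, N=10**6, delta=0.05)
nB = n_bernstein(R=1.0, sigma_star=0.05, sigma_m=0.05, gap=0.01,
                 M=10, N=10**6, delta=0.05)
# nB is dramatically smaller than nH in this regime.
```

When the standard deviations are small relative to the range, the $R^2/\Delta_m^2$ term of the Hoeffding count is replaced by the much smaller $\Sigma_m^2/\Delta_m^2 + R/\Delta_m$, mirroring the discussion above.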
4.2. Results
Following the procedure of Maron and Moore (1997),
we evaluated the Hoeffding and empirical Bernstein
races on the task of selecting the best k for k-nearest
neighbor regression and classification through leave-one-out cross-validation.² Three datasets of different
types were used for the comparison. The SARCOS
data presents a regression problem which involves predicting the torques at 7 joints of a robot arm based
on the positions, velocities and accelerations at those
joints. We only considered predicting the torque at the
first joint. The Covertype2 dataset consists of 50,000
points sampled from the Covertype dataset from Section 3.4 and is a binary classification task. The Local
dataset presents a regression problem that was created by sampling 10,000 points from a noisy piecewise-linear function defined on the unit interval and having
a range of 1.
The value of the range parameter R was set to 1 for
the Covertype2 and Local datasets. For the SARCOS
dataset, R was set to the range of the target values
in the dataset. This differs from the approach of setting R separately for each option to several times the
standard deviation in the samples observed, suggested
by Maron and Moore (1997). We do not follow this
approach because it invalidates the use of Hoeffding’s
inequality.
²Since leave-one-out cross-validation creates dependencies between the samples, the analysis does not apply to
this case. However, our experiments gave similar results
when we used a separate hold-out set. We decided to
present results for leave-one-out cross-validation to facilitate comparison with the original papers.
Table 1. Percentage of work saved / number of options left after termination.

Data set      Hoeffding    EB
SARCOS        0.0% / 11    44.9% / 4
Covertype2    14.9% / 8    29.3% / 5
Local         6.0% / 9     33.1% / 6

All methods were given the options $k = 2^0, 2^1, 2^2, \ldots, 2^{10}$ to begin with. The results are presented in Table 1. The table shows the percentage of work saved by each method (1 − number of samples taken by the method / MN), as well as the number of options remaining after termination.
The empirical Bernstein racing algorithm, which is denoted by EB, significantly outperforms the Hoeffding
racing algorithm on all three datasets. The advantage
of incorporating variance estimates is the smallest on
the Covertype2 classification dataset. This is expected
because the samples come from Bernoulli distributions
which have the largest possible variance for a bounded
random variable. The advantage of variance estimation is the largest on the SARCOS dataset, where R is
much larger than the variance. While one may argue
that the Hoeffding racing algorithm would do much
better if R was set to a smaller value based on the
standard deviation, the empirical Bernstein algorithm
would also benefit. However, tweaking R this way is
merely an unprincipled way of incorporating variance
estimates into a racing algorithm.
5. Conclusions and Future Work
We showed how variance information can be exploited
in stopping problems in a principled manner. Most
notably, we presented a near-optimal stopping rule
for relative error estimation on bounded random variables, significantly extending the results of Domingo
and Watanabe, and Dagum et al. We also provided
empirical and theoretical results on the effect that can
be expected from incorporating variance estimates into
existing stopping algorithms.
One interesting question that should be addressed is if
the bound achieved by the AA algorithm in the nonnegative case, which is known to be optimal, can be
achieved without the non-negativity condition.
Acknowledgements
This work was supported in part by Agence Nationale
de la Recherche project “Modèles Graphiques et Applications”, the Alberta Ingenuity Fund, iCore, the
Computer and Automation Research Institute of the
Hungarian Academy of Sciences, and NSERC.
References
Asuncion, A., & Newman, D. (2007). UCI machine
learning repository.
Audibert, J. Y., Munos, R., & Szepesvári, C. (2007a).
Tuning bandit algorithms in stochastic environments. ALT (pp. 150–165).
Audibert, J.-Y., Munos, R., & Szepesvári, C. (2007b). Variance estimates and exploration function in multi-armed bandit (Technical Report 07-31). Certis - Ecole des Ponts. http://certis.enpc.fr/~audibert/RR0731.pdf.
Bradley, J. K., & Schapire, R. (2008). FilterBoost: Regression and classification on large datasets. NIPS 20 (pp. 185–192).
Dagum, P., Karp, R., Luby, M., & Ross, S. (2000).
An optimal algorithm for Monte Carlo estimation.
SIAM Journal on Computing, 29, 1484–1496.
Domingo, C., & Watanabe, O. (2000a). MadaBoost: A
modification of AdaBoost. COLT’00 (pp. 180–189).
Domingo, C., & Watanabe, O. (2000b). Scaling
up a boosting-based learner via adaptive sampling.
Pacific-Asia Conference on Knowledge Discovery
and Data Mining (pp. 317–328).
Domingos, P., & Hulten, G. (2001). A general method
for scaling up machine learning algorithms and its
application to clustering. ICML (pp. 106–113).
Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC
bounds for multi-armed bandit and Markov decision
processes. COLT’02 (pp. 255–270).
Hoeffding, W. (1963). Probability inequalities for sums
of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Maron, O., & Moore, A. (1993). Hoeffding races: Accelerating model selection search for classification
and function approximation. NIPS 6 (pp. 59–66).
Maron, O., & Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial
Intelligence Review, 11, 193–225.
Ortiz, L. E., & Kaelbling, L. P. (2000). Sampling
methods for action selection in influence diagrams.
AAAI/IAAI (pp. 378–385).