Empirical Bernstein Stopping
Volodymyr Mnih
mnih@cs.ualberta.ca
Csaba Szepesvári
szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8 Canada
Jean-Yves Audibert
audibert@certis.enpc.fr
Certis - Ecole des Ponts, 6 avenue Blaise Pascal, Cité Descartes, 77455 Marne-la-Vallée France
Willow - ENS / INRIA, 45 rue d’Ulm, 75005 Paris, France
Abstract
Sampling is a popular way of scaling up machine learning algorithms to large datasets.
The question often is how many samples
are needed. Adaptive stopping algorithms
monitor performance in an online fashion and can stop early, saving valuable resources. We consider problems where
probabilistic guarantees are desired and
demonstrate how recently-introduced empirical Bernstein bounds can be used to design
stopping rules that are efficient. We provide
upper bounds on the sample complexity of
the new rules, as well as empirical results on
model selection and boosting in the filtering
setting.
1. Introduction
Being able to handle large datasets and streaming data
is crucial to scaling up machine learning algorithms to
many real-world settings. When making even a single pass through the data is prohibitive, sampling may
offer a good solution. In order for the resulting algorithms to be theoretically sound, sampling techniques
that come with probabilistic guarantees are desirable.
For example, when estimating the error of a classifier on a large dataset one may want to sample until
the estimated error is within some small number ǫ of
the true error with probability at least 1 − δ. The key
problem is one of stopping or determining the required
number of samples. Taking too many samples will result in inefficient algorithms, while taking too few may
not be enough to achieve the desired guarantees.
Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
Finite sample bounds, such as Hoeffding’s inequality (Hoeffding, 1963), are the key technique used
by recent, non-parametric stopping algorithms with
probabilistic guarantees. While these stopping algorithms have proved to be effective for scaling up machine learning algorithms (Bradley & Schapire, 2008),
(Domingos & Hulten, 2001), they can be significantly
improved by incorporating variance information in a
principled manner. We show how to employ the recently introduced empirical Bernstein bounds (Audibert et al., 2007a) to improve stopping algorithms and
provide sample complexity bounds and empirical results to demonstrate the effect of incorporating variance information.
Before proceeding, we identify two classes of stopping
problems that will be examined. The first class concerns problems where some unknown quantities have
to be measured either up to some prespecified level
of accuracy or to support making a binary decision.
Examples in this class include stopping with a fixed
relative or absolute accuracy, with applications in hypothesis testing such as deciding on the sign of the
mean, independence tests, and change detection. In
problems belonging to the second group, the task is to
pick the best option from a finite pool while measuring
their performance using samples. Some notable examples include various versions of bandit problems, Hoeffding Races (Maron & Moore, 1993), and the general
framework for scaling up learning algorithms proposed
by Domingos (2001).
The paper is organized as follows. In Section 2 we examine Hoeffding’s inequality and introduce the empirical Bernstein bound. In Section 3, we introduce a new
stopping algorithm for stopping with a predefined relative accuracy and show that it is more efficient than
previous algorithms. Section 4 demonstrates how a
simple application of the empirical Bernstein bound
can result in substantial improvements for problems
from the second class. Conclusions and future work
directions are presented in Section 5.
2. Hoeffding Bounds vs. Empirical
Bernstein Bounds
Let $X_1, \ldots, X_t$ be real-valued i.i.d. random variables with range $R$ and mean $\mu$, and let $\bar{X}_t = \frac{1}{t}\sum_{i=1}^{t} X_i$. Hoeffding's inequality (Hoeffding, 1963) states that with probability at least $1 - \delta$,

$$|\bar{X}_t - \mu| \le R\sqrt{\frac{\log(2/\delta)}{2t}}.$$
Due to its generality, Hoeffding’s inequality has been
widely used in online learning scenarios. A drawback
of the bound is that it scales linearly with the range R
and does not scale with the variance of Xi . If a bound
on the variance is known, Bernstein’s inequality can be
used instead, which can yield significant improvements
when the variance bound is small relative to the range.
Since useful a priori bounds on the variance are rarely
available, this approach is not practical.
An approach that is more suitable to online scenarios is to apply Bernstein’s inequality to the sum of
X1 , . . . , Xt , as well as the sum of the squares to obtain
a single bound on the mean of X1 , . . . , Xt . The resulting bound, which we will refer to as the empirical
Bernstein bound (Audibert et al., 2007a) states that
with probability at least 1 − δ
$$|\bar{X}_t - \mu| \le \bar{\sigma}_t\sqrt{\frac{2\log(3/\delta)}{t}} + \frac{3R\log(3/\delta)}{t},$$

where $\bar{\sigma}_t$ is the empirical standard deviation of $X_1, \ldots, X_t$: $\bar{\sigma}_t^2 = \frac{1}{t}\sum_{i=1}^{t}(X_i - \bar{X}_t)^2$. The term involving the range $R$ decreases at the rate $t^{-1}$ and
quickly becomes negligible when the variance is large,
while the square root term depends on σ t instead of
R. Hence, when σ t ≪ R the empirical Bernstein
bound quickly becomes much tighter than Hoeffding’s
inequality.
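To make the comparison concrete, here is a minimal sketch (function and variable names are ours) that evaluates both confidence radii on low-variance data bounded in [0, 1]:

```python
import math
import random

def hoeffding_radius(R, t, delta):
    """Half-width of Hoeffding's confidence interval."""
    return R * math.sqrt(math.log(2.0 / delta) / (2.0 * t))

def empirical_bernstein_radius(sigma_t, R, t, delta):
    """Half-width of the empirical Bernstein confidence interval."""
    log_term = math.log(3.0 / delta)
    return sigma_t * math.sqrt(2.0 * log_term / t) + 3.0 * R * log_term / t

# Low-variance samples inside [0, 1]: the variance term, not the
# range, should dominate the empirical Bernstein radius.
random.seed(0)
xs = [0.5 + random.uniform(-0.01, 0.01) for _ in range(10000)]
t = len(xs)
mean = sum(xs) / t
sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / t)

h = hoeffding_radius(R=1.0, t=t, delta=0.05)
eb = empirical_bernstein_radius(sigma, R=1.0, t=t, delta=0.05)
# Here eb is roughly an order of magnitude smaller than h.
```

With these numbers the Hoeffding radius is about 0.014 regardless of the variance, while the empirical Bernstein radius shrinks with the small empirical standard deviation.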
3. Stopping Rules
Let X1 , X2 , . . . be i.i.d. random variables with mean
µ and variance σ 2 . We will refer to an algorithm as a
stopping rule if at time t it observes Xt and based on
past observations decides whether to stop or continue
sampling. If a stopping rule S returns µ̂ that satisfies
$$P\left[\,|\hat{\mu} - \mu| \le \epsilon|\mu|\,\right] \ge 1 - \delta, \qquad (1)$$
then S is an (ǫ, δ)-stopping rule and µ̂ is an (ǫ, δ)-approximation of µ. In this section, we develop an
(ǫ, δ)-stopping rule for bounded Xi .
Algorithms proposed for this problem include the
Nonmonotonic Adaptive Sampling (NAS) algorithm,
shown as Algorithm 1, due to Domingo et al. (2000a).
The general idea is to first use Hoeffding's inequality to construct a sequence $\{\alpha_t\}$ such that the event $E = \{|\bar{X}_t - \mu| \le \alpha_t,\ t \in \mathbb{N}^+\}$ occurs with probability at least $1 - \delta$, and then use this sequence to design a stopping criterion that stops only if $|\bar{X}_t - \mu| \le \epsilon|\mu|$ given that $E$ holds.
Algorithm 1 Algorithm NAS
t ← 0
repeat
  t ← t + 1
  Obtain X_t
  α ← R·√(log(t(t + 1)/δ) / (2t))
until |X̄_t| ≥ α(1 + 1/ǫ)
return X̄_t
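A runnable sketch of NAS under these definitions (the function names and the iteration cap are ours; samples come from a user-supplied `draw()` with range R):

```python
import math
import random

def nas(draw, eps, delta, R=1.0, max_t=10**7):
    """Sketch of Nonmonotonic Adaptive Sampling (Domingo et al., 2000a).

    Stops once the confidence radius alpha is small relative to the
    running mean, i.e. when |mean| >= alpha * (1 + 1/eps).
    """
    total, t = 0.0, 0
    while t < max_t:
        t += 1
        total += draw()
        mean = total / t
        alpha = R * math.sqrt(math.log(t * (t + 1) / delta) / (2.0 * t))
        if abs(mean) >= alpha * (1.0 + 1.0 / eps):
            return mean, t
    return total / t, t

# Estimate the mean of Uniform(0.4, 0.6) to 10% relative accuracy.
random.seed(1)
mu_hat, t_stop = nas(lambda: random.uniform(0.4, 0.6), eps=0.1, delta=0.1)
```

Note that the stopping time depends only on |µ| and R, not on the variance, which is exactly the weakness discussed below.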
Domingo et al. (2000a) argue that if T is the number
of samples after which NAS stops, and |µ| > 0, then
there exists a constant C such that with probability at
least 1 − δ
$$T < C \cdot \frac{R^2}{\epsilon^2\mu^2}\left(\log\frac{2}{\delta} + \log\frac{R}{\epsilon\mu}\right). \qquad (2)$$
The assumption that |µ| > 0 is necessary for guaranteeing that the algorithm will indeed stop, and will be
assumed for the rest of the section.
Dagum et al. (2000) introduced the AA algorithm for the case of bounded and nonnegative $X_i$. The AA algorithm is a three-step procedure. In the first step, a $(\max(\sqrt{\epsilon}, 1/2), \delta/3)$-approximation $\tilde{\mu}$ of $\mu$ is obtained. In the second step, $\tilde{\mu}$ is used to determine the number of samples necessary to produce an estimate $\tilde{\sigma}^2$ of $\sigma^2$ such that $\max(\tilde{\sigma}^2, \epsilon\tilde{\mu})$ is a high-probability upper bound on $\max(\sigma^2, \epsilon\mu)/2$. In the last step, $c \cdot \max(\tilde{\sigma}^2, \epsilon\tilde{\mu})\log(1/\delta)/(\epsilon^2\tilde{\mu}^2)$ samples are drawn and their average is returned as $\hat{\mu}$, where $c$ is a universal constant.
Dagum et al. (2000) prove that $\hat{\mu}$ is indeed an $(\epsilon, \delta)$-approximation of $\mu$ and that if $T$ is the number of samples taken by AA, then there exists a constant $C$ such that with probability at least $1 - \delta$

$$T \le C \cdot \max(\sigma^2, \epsilon\mu) \cdot \frac{1}{\epsilon^2\mu^2}\log\frac{2}{\delta}. \qquad (3)$$
In addition, Dagum et al. prove that if $T$ is the number of samples taken by any $(\epsilon, \delta)$-stopping rule, then there exists a constant $C'$ such that with probability at least $1 - \delta$

$$T \ge C' \cdot \max(\sigma^2, \epsilon\mu) \cdot \frac{1}{\epsilon^2\mu^2}\log\frac{2}{\delta}.$$
Hence, for bounded Xi , the AA algorithm requires
a number of samples that is at most a multiplicative
constant larger than that required by any other (ǫ, δ)-stopping rule. In this sense the algorithm achieves
“optimal” efficiency, up to a multiplicative constant.
While the AA algorithm is able to take advantage of
variance, it requires the random variables to be non-negative. A trivial extension of the AA algorithm to the case of signed random variables seems unlikely, since the rule relies heavily on the monotonicity of the partial sums that is present in the nonnegative case. On
the other hand, Equation (2) suggests that the NAS
algorithm is not able to take advantage of variance.
As the first demonstration of how the empirical Bernstein bound can be used to design improved stopping algorithms, we propose a new algorithm, EBStop,
which uses empirical Bernstein Bounds to achieve
nearly the same scaling properties as the AA algorithm and, like the NAS algorithm, only requires the
random variables to be bounded.
3.1. EBStop
Similarly to the NAS algorithm, EBStop relies on a sequence $\{c_t\}$ with the property that the event $E = \{|\bar{X}_t - \mu| \le c_t,\ t \in \mathbb{N}^+\}$ occurs with probability at least $1 - \delta$. Let $\{d_t\}$ be a positive sequence satisfying $\sum_{t=1}^{\infty} d_t \le \delta$ and let

$$c_t = \bar{\sigma}_t\sqrt{\frac{2\log(3/d_t)}{t}} + \frac{3R\log(3/d_t)}{t}.$$
Since {dt } sums to at most δ and (X t − ct , X t + ct ) is a
1 − dt confidence interval for µ obtained from the empirical Bernstein bound, by a union bound argument,
the event E indeed occurs with probability at least
1 − δ. In our work, we use $d_t = c/t^p$ with $c = \delta(p-1)/p$.
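As a quick numerical sanity check on this choice (the concrete values δ = 0.1 and p = 1.1 are ours), the series indeed stays below δ:

```python
# With p = 1.1 and c = delta * (p - 1) / p, the sequence
# d_t = c / t**p sums to at most delta, since
# sum_t t**(-p) <= p / (p - 1) by an integral comparison.
delta, p = 0.1, 1.1
c = delta * (p - 1) / p
partial = sum(c / t ** p for t in range(1, 10**6))
# partial stays below delta even after a million terms
```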
The pseudocode for EBStop is shown as Algorithm 2,
but the general idea is as follows. After drawing t
samples, we set $\mathrm{LB} = \max(0, \max_{1\le s\le t}(|\bar{X}_s| - c_s))$ and $\mathrm{UB} = \min_{1\le s\le t}(|\bar{X}_s| + c_s)$. EBStop terminates as soon as $(1+\epsilon)\mathrm{LB} \ge (1-\epsilon)\mathrm{UB}$ and returns $\hat{\mu} = \mathrm{sgn}(\bar{X}_t)\cdot\frac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1-\epsilon)\mathrm{UB}\right]$.
To see why µ̂ is an (ǫ, δ)-approximation, suppose the
stopping condition has been satisfied and event E
holds. Then
$$|\hat{\mu}| = \tfrac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1-\epsilon)\mathrm{UB}\right] \le \tfrac{1}{2}\left[(1+\epsilon)\mathrm{LB} + (1+\epsilon)\mathrm{LB}\right] \le (1+\epsilon)|\mu|,$$

where the first inequality uses the stopping condition and the second uses $\mathrm{LB} \le |\mu|$ on $E$. Similarly, $(1-\epsilon)|\mu| \le |\hat{\mu}|$. From the definition of LB, it also follows that $|\bar{X}_t| > c_t \ge |\bar{X}_t - \mu|$, which implies that $\mathrm{sgn}(\hat{\mu}) = \mathrm{sgn}(\mu)$. Since event $E$ holds with probability at least $1 - \delta$, $\hat{\mu}$ is indeed an $(\epsilon, \delta)$-approximation of $\mu$.
Algorithm 2 Algorithm EBStop
LB ← 0
UB ← ∞
t ← 1
Obtain X_1
while (1 + ǫ)LB < (1 − ǫ)UB do
  t ← t + 1
  Obtain X_t
  LB ← max(LB, |X̄_t| − c_t)
  UB ← min(UB, |X̄_t| + c_t)
end while
return sgn(X̄_t) · (1/2) · [(1 + ǫ)LB + (1 − ǫ)UB]
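A compact Python sketch of EBStop (naming and the iteration cap are ours; it keeps running sums for the mean and variance and uses d_t = c/t^p as above):

```python
import math
import random

def ebstop(draw, eps, delta, R=1.0, p=1.1, max_t=10**6):
    """Sketch of EBStop: maintain running LB/UB from empirical
    Bernstein intervals; stop once (1 + eps) * LB >= (1 - eps) * UB."""
    c = delta * (p - 1) / p            # ensures sum_t c / t**p <= delta
    total, total_sq, t = 0.0, 0.0, 0
    lb, ub, mean = 0.0, float("inf"), 0.0
    while (1 + eps) * lb < (1 - eps) * ub and t < max_t:
        t += 1
        x = draw()
        total += x
        total_sq += x * x
        mean = total / t
        var = max(total_sq / t - mean * mean, 0.0)   # empirical variance
        log_term = math.log(3.0 / (c / t ** p))      # log(3 / d_t)
        c_t = math.sqrt(2.0 * var * log_term / t) + 3.0 * R * log_term / t
        lb = max(lb, abs(mean) - c_t)
        ub = min(ub, abs(mean) + c_t)
    sign = 1.0 if mean >= 0 else -1.0
    return sign * 0.5 * ((1 + eps) * lb + (1 - eps) * ub), t

# Estimate the mean of Uniform(0.45, 0.55) to 10% relative accuracy.
random.seed(2)
mu_hat, t_stop = ebstop(lambda: random.uniform(0.45, 0.55), eps=0.1, delta=0.1)
```

Because the samples have a small variance relative to R = 1, the stop happens after on the order of a thousand samples rather than the tens of thousands a range-only rule would need.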
While we omit the proof due to space constraints¹, we note that if $X$ is a random variable with range $R$, and if $T$ is the number of samples taken by EBStop on $X$, then there exists a constant $C$ such that with probability at least $1 - \delta$

$$T < C \cdot \max\left(\frac{\sigma^2}{\epsilon^2\mu^2}, \frac{R}{\epsilon|\mu|}\right)\left(\log\frac{1}{\delta} + \log\frac{R}{\epsilon|\mu|}\right). \qquad (4)$$
This bound is very similar to the upper bound for the stopping time of the AA algorithm, the only difference being the extra $\log\frac{R}{\epsilon|\mu|}$ term. This term comes from constructing a confidence interval at each $t$ and is not an artifact of our proof techniques. However, this extra term can be reduced to $\log\log\frac{1}{\epsilon|\mu|}$ by applying a geometric grid, as we will see in the next section. Since EBStop does not require the variables to be non-negative, we can say that EBStop combines the best properties of the NAS and AA algorithms for signed random variables.
3.2. Improving EBStop
While EBStop has the desired scaling properties, we
make two simple improvements in order to make it
more efficient in practice.
The first improvement is based on the idea that if the
algorithm is not close to stopping, there is no point
in checking the stopping condition at every point. We
incorporate this idea into EBStop by adopting a geometric sampling schedule, also used by Domingo and
Watanabe (2000a). Instead of testing the stopping criterion after each sample, we perform the kth test after
⌈β k ⌉ samples have been taken for some β > 1. Under this sampling strategy, when EBStop constructs a
1 − d confidence interval after t samples, d is of the order $1/(\log_\beta t)^p$, which is much larger than $1/t^p$. Since this results in tighter confidence intervals, LB and UB will approach each other faster and the stopping condition will be satisfied after fewer samples.

¹A version of the paper containing the proofs will be made available as a technical report.
While geometric sampling can often reduce the number
of required samples, it can also lead to taking roughly
β times too many samples, because testing is only done
at the ends of intervals. Nevertheless, the following
result due to Audibert et al. (2007b) can be used to
test the stopping condition after each sample without
sacrificing the advantages of geometric sampling. Let
t1 ≤ t2 for t1 , t2 ∈ N and let α ≥ t2 /t1 . Then with
probability at least 1 − d, for all t ∈ {t1 , . . . , t2 }
$$|\bar{X}_t - \mu| \le \bar{\sigma}_t\sqrt{\frac{2\alpha\log(3/d)}{t}} + \frac{3R\alpha\log(3/d)}{t}. \qquad (5)$$

We use Equation (5) with $t_1 = \lfloor\beta^k\rfloor + 1$, $t_2 = \lfloor\beta^{k+1}\rfloor$, and $d = c/k^p$ to construct $c_t$ for each $t \in \{t_1, \ldots, t_2\}$. This allows us to test the stopping condition after each sample, while using a $d$ that is of the order $1/(\log_\beta t)^{p\alpha}$ after $t$ samples. A variant of EBStop that incorporates these two improvements is shown as Algorithm 3.
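As a complement to the pseudocode of Algorithm 3, the geometric-grid bookkeeping can be sketched as follows (variable names and the concrete parameter values are ours):

```python
import math

def ebgstop_schedule(t_max, beta=1.5, p=1.1, delta=0.1):
    """Bookkeeping of Algorithm 3: at each sample index t, track the
    geometric-grid index k and x = alpha * log(3 / d_k), where
    d_k = c / k**p and alpha = floor(beta**k) / floor(beta**(k-1))."""
    c = delta * (p - 1) / p
    k, x = 0, 0.0
    out = []
    for t in range(2, t_max + 1):
        if t > math.floor(beta ** k):      # crossed a grid boundary
            k += 1
            alpha = math.floor(beta ** k) / math.floor(beta ** (k - 1))
            x = alpha * math.log(3.0 / (c / k ** p))
        out.append((t, k, x))
    return out

sched = ebgstop_schedule(100)
# k grows like log_beta(t), so the confidence levels d_k shrink only
# polylogarithmically with t.
```

The quantity x is exactly what Algorithm 3 reuses in c_t = σ̄_t·√(2x/t) + 3Rx/t between consecutive grid boundaries.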
Algorithm 3 Algorithm EBGStop
LB ← 0
UB ← ∞
t ← 1
k ← 0
Obtain X_1
while (1 + ǫ)LB < (1 − ǫ)UB do
  t ← t + 1
  Obtain X_t
  if t > ⌊β^k⌋ then
    k ← k + 1
    α ← ⌊β^k⌋ / ⌊β^(k−1)⌋
    x ← α log(3/d_k)
  end if
  c_t ← σ̄_t·√(2x/t) + 3Rx/t
  LB ← max(LB, |X̄_t| − c_t)
  UB ← min(UB, |X̄_t| + c_t)
end while
return sgn(X̄_t) · (1/2) · [(1 + ǫ)LB + (1 − ǫ)UB]

One can show that adding geometric sampling to EBStop reduces the $\log\frac{1}{\epsilon|\mu|}$ term in inequality (4) to $\log\log\frac{1}{\epsilon|\mu|}$. It should be noted that, by the arguments of Dagum et al. (2000), no stopping rule can achieve a better bound than (3) for the case of bounded non-negative random variables. Hence, EBGStop is very close to being "optimal" in this sense. Where it would lose (for non-negative random variables) to AA is when ǫ, µ, and δ are such that log(R/(ǫµ)) becomes significantly larger than 1/δ. We do not expect to see this happening in practice for not too large values of δ. For example, for δ = 0.05 and R = 1, the condition is ǫµ < e^{−20}.

Figure 1. Comparison of stopping rules on averages of uniform random variables with varying variances. Error bars are at one standard deviation.

3.3. Results: Synthetic Data

In this section we evaluate the stopping rules AA, NAS, geometric NAS, EBStop, and EBGStop on the problem of estimating means of various random variables. To make the comparison fair, the geometric version of the NAS algorithm and EBGStop both grew intervals by a factor of 1.5, as this value worked best in previous experiments (Domingo & Watanabe, 2000b). We also used $d_t = (t(t+1))^{-1}$ in EBStop and EBGStop, since this is the sequence implicitly used in NAS for constructing confidence intervals at time t. Since this put EBGStop at a slight disadvantage, we also include results for EBGStop, denoted by EBGStop*, with our default choice of $d_t = c/(\log_\beta t)^p$, p = 1.1, and β = 1.1. In all the experiments we used ǫ = δ = 0.1. We use only non-negative valued random variables, as they allow comparison to AA. Finally, we only compare the number of samples taken, because none of the algorithms produced any estimates with relative error greater than ǫ in any of our experiments.
The first set of experiments was meant to test how
well the various stopping rules are able to exploit
the variance when it is small. Let the average
of n uniform [a, b] random variables be denoted by
U (a, b, n). Note that the expected value and variance of U (a, b, n) are (a + b)/2 and (b − a)2 /(12n),
respectively. For this comparison we fixed a to 0, b
to 1, and varied n to control the variance for a fixed
mean. Figure 1 shows the results of running each
stopping rule 100 times on U (0, 1, n) random variables
for n = 1, 5, 10, 50, 100, 1000. Not surprisingly, NAS
and geometric NAS fail to make use of the variance
and take roughly the same numbers of samples for all values of n. Variants of EBStop improve when the variance decreases, with EBGStop* performing especially well, beating all the other algorithms for all the scenarios tested. AA initially improves with the decreasing variance, but the effect is not as large as with EBGStop* because of the multi-phase structure of AA.

In the second set of experiments we fix n at 10 and b − a at 0.2, keeping the variance fixed, and vary the mean. The variance is small enough that EBStop, its variants, and AA should take a number of samples on the order of R/(ǫµ). The results are presented in Figure 2 and suggest that both variants of NAS require 1/µ times more samples than the variance-adaptive methods. Note that Figure 2 shows the number of samples taken by each method in log scale.

Figure 2. Comparison of stopping rules on averages of uniform random variables with varying means. The number of samples is shown in log scale.

It may be surprising that in both experiments the AA algorithm did not outperform EBStop and EBGStop even though AA offers better guarantees on sample complexity. We believe that EBStop is able to make better use of the data because it uses all samples in its stopping criterion, while AA wastes some samples on intermediate quantities. However, this difference should be reflected in the hidden constants. As discussed earlier, for really small values of µ and ǫ the AA algorithm should stop earlier than EBStop.

Finally, we include a comparison of the stopping rules on Bernoulli random variables. Since Bernoulli random variables have the maximal variance among all bounded random variables, the advantage of variance estimation should be diminished. However, inequality (4) suggests that in the case of Bernoulli random variables EBStop requires O(1/(ǫ²µ)) samples, since σ² = µ(1 − µ). Similarly, inequality (2) suggests that NAS requires O(1/(ǫ²µ²)) samples.

Figure 3 shows the results of running each stopping rule on Bernoulli random variables with means 0.99, 0.9, 0.5, 0.1, 0.05, and 0.01, averaged over 100 runs. As in the previous set of experiments, the variance-adaptive methods seem to require 1/µ times fewer samples to stop. It should also be noted that the geometric version of the NAS algorithm does outperform EBStop for some intermediate values of µ, where the variance is the largest. However, the performance difference is not large, and so we think the price paid for the unboundedly better performance of EBStop at small or large values of µ is not large.

Figure 3. Comparison of stopping rules on Bernoulli random variables. The number of samples is shown in log scale.
3.4. Results: FilterBoost
Boosting by filtering (Bradley & Schapire, 2008) is a
framework for scaling up boosting algorithms to large
or streaming datasets. Instead of working with the
entire training set, all steps, such as finding a weak
learner that has classification accuracy of at least 0.5,
are done through sampling that employs stopping algorithms. Bradley and Schapire showed that such an
approach can lead to a drastic speedup over a batch
boosting algorithm.
We evaluated the suitability of EBGStop and both
variants of the NAS algorithm for the boosting by filtering setting by plugging them into the FilterBoost
algorithm (Bradley & Schapire, 2008). The AA algorithm was not included because it cannot deal with
signed random variables.
Following Bradley and Schapire, the Adult and Covertype datasets from the UCI machine learning repository (Asuncion & Newman, 2007) were used. The
Covertype dataset was converted into a binary classification problem by taking "Lodgepole Pine" as one
class and merging the other classes. In setting up
boosting we followed the procedure of Domingo and
Watanabe (2000b) who also considered the use of stopping rules in the same context. Accordingly, we used
decision stumps as weak learners and we discretized
all continuous attributes by binning their values into
five equal bins. The results for the Adult dataset were
averaged over 10 runs on the training set, while 10-fold
cross-validation was used for the Covertype dataset.
As shown in Figure 4, EBGStop required fewer samples and offered lower variance in stopping times than
either variant of the NAS algorithm on both datasets.
At the same time, the resulting classification accuracies were within 0.2% of each other on the Adult
dataset and within 0.04% of each other on the Covertype dataset.
4. Racing Algorithms
In this section we demonstrate how a general stopping algorithm that makes use of finite sample deviation bounds can be improved with the use of empirical
Bernstein bounds. We consider the Hoeffding races
algorithm (Maron & Moore, 1993) since it is representative of the class of general stopping algorithms.
Racing algorithms aim to reduce the computational burden of performing tasks such as model selection using a hold-out set by discarding poor models quickly
(Maron & Moore, 1993; Ortiz & Kaelbling, 2000).
The context of racing algorithms is that of multi-armed bandit problems. Formally, consider $M$ options. When option $m$ is chosen for the $t$th time, it gives a random value $X_{m,t}$ from an unknown distribution $\nu_m$. The samples $\{X_{m,t}\}_{t\ge1}$ are independent of each other. Let $\mu_m = \int x\,\nu_m(dx)$ be the mean reward of option $m$. The goal is to find an option with the highest mean reward.
Let δ > 0 be the confidence level parameter and N
be the maximal amount of time allowed for deciding
which option leads to the best expected reward. A
racing algorithm either terminates when it runs out
of time (i.e. at the end of the N -th round) or when
it can say that with probability at least 1 − δ, it has
found the best option, i.e. an option m∗ with µm∗ =
maxm∈{1,...,M } µm .
The Hoeffding race is an algorithm based on discarding
options which are likely to have smaller mean than the
optimal one until only one option remains. Precisely,
for each time step and each distribution, δ/(M N ) confidence intervals are constructed for the mean. Options
with upper confidence smaller than the lower confi-
dence bound of another option are discarded. The algorithm samples one by one all the options that have
not been discarded yet.
We assume that the rewards have a bounded range
R. If X m,t denotes the sample mean for option m
after seeing t samples of this option then according to
Hoeffding’s inequality, a δ/(M N ) confidence interval
for the mean of option m is
$$\left[\,\bar{X}_{m,t} - R\sqrt{\frac{\log(2MN/\delta)}{2t}},\ \ \bar{X}_{m,t} + R\sqrt{\frac{\log(2MN/\delta)}{2t}}\,\right].$$
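A minimal sketch of the resulting elimination loop (the structure and names are ours; `options` is a list of callables returning bounded rewards):

```python
import math
import random

def hoeffding_race(options, N, delta, R=1.0):
    """Sketch of the Hoeffding race: sample every surviving option once
    per round, then discard any option whose upper confidence bound
    falls below the best lower confidence bound."""
    M = len(options)
    alive = set(range(M))
    total = [0.0] * M
    count = [0] * M
    budget = N
    while len(alive) > 1 and budget > 0:
        for m in list(alive):
            if budget == 0:
                break
            total[m] += options[m]()
            count[m] += 1
            budget -= 1
        radius = {m: R * math.sqrt(math.log(2 * M * N / delta)
                                   / (2 * count[m])) for m in alive}
        lower = {m: total[m] / count[m] - radius[m] for m in alive}
        upper = {m: total[m] / count[m] + radius[m] for m in alive}
        best_lower = max(lower.values())
        alive = {m for m in alive if upper[m] >= best_lower}
    return alive

# Three Bernoulli "options"; the race should keep only the best one.
random.seed(3)
arms = [lambda q=q: float(random.random() < q) for q in (0.2, 0.5, 0.8)]
survivors = hoeffding_race(arms, N=30000, delta=0.1)
```

The default-argument trick `q=q` binds each success probability at definition time, which is needed when building closures in a loop.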
The Hoeffding race was introduced and studied in (Maron & Moore, 1993; Maron & Moore, 1997) from a slightly different viewpoint: there the target was to find an option with mean at most ǫ below the optimal mean $\max_{m\in\{1,\ldots,M\}} \mu_m$, where ǫ is a given positive parameter. The same problem was also studied
by (Even-Dar et al., 2002, Theorem 3) in the infinite
horizon setting.
By substituting Hoeffding’s inequality with the empirical Bernstein bound we obtain a new algorithm, which
we will refer to as the empirical Bernstein race.
4.1. Analysis of Racing Algorithms
For the analysis we are interested in the expected number of samples taken by the Hoeffding race and the
empirical Bernstein race. Due to space limitations, we
omit the proofs of the following theorems.
Let ∆m = µm∗ − µm , where option m∗ still denotes
an optimal option: µm∗ = maxm∈{1,...,M } µm . Let ⌈u⌉
denote the smallest integer larger or equal to u, and
let ⌊u⌋ denote the largest integer smaller or equal to u.
Theorem 1 (Hoeffding Race). Let $n_H(m) = \left\lceil\frac{8R^2\log(2MN/\delta)}{\Delta_m^2}\right\rceil$. Without loss of generality, assume that $n_H(1) \le n_H(2) \le \ldots \le n_H(M)$. The number of samples, $T$, taken by the Hoeffding race is bounded by

$$2\sum_{m:\,\mu_m < \mu_{m^*}} n_H(m).$$
The probability that no optimal option is returned is bounded by δ. If the algorithm runs out of time, then with probability at least $1 - \delta$, (i) the number of discarded options is at least $d$, where $d$ is the largest integer such that $2\sum_{m=1}^{d} n_H(m) \le N$, and (ii) the non-discarded options satisfy

$$\mu_m \ge \mu_{m^*} - 4R\sqrt{\frac{\log(2MN/\delta)}{2\lfloor N/M\rfloor}}.$$
Figure 4. Comparison of the number of samples required by different stopping rules in FilterBoost. Parameters ǫ and δ were set to 0.1 for both methods, while τ was set to 0.25. Error bars are at one standard deviation. (a) Results on the Adult dataset. (b) Results on the Covertype dataset.

We recall that the principle of the empirical Bernstein race is the same as that of the Hoeffding race: we sample one by one all the options that have not been discarded yet, and the algorithm discards an option as soon as the upper confidence bound on its mean reward is smaller than the lower confidence bound on the mean of some other option.
Theorem 2 (Empirical Bernstein Race). Let $\sigma_m$ denote the standard deviation of $\nu_m$. Introduce $\Sigma_m = \sigma_{m^*} + \sigma_m$ and

$$n(m) = \left\lceil\frac{\left(8\Sigma_m^2 + 18R\Delta_m\right)\log(4MN/\delta)}{\Delta_m^2}\right\rceil.$$

Without loss of generality, assume that $n(1) \le n(2) \le \ldots \le n(M)$. The number of samples taken by the empirical Bernstein race is bounded by

$$2\sum_{m:\,\mu_m < \mu_{m^*}} n(m).$$
The probability that no optimal option is returned is bounded by δ. If the algorithm runs out of time, then with probability at least $1 - \delta$, (i) the number of discarded options is at least $d$, where $d$ is the largest integer such that $2\sum_{m=1}^{d} n(m) \le N$, and (ii) the non-discarded options satisfy

$$\mu_m \ge \mu_{m^*} - \Sigma_m\sqrt{\frac{8\log(4MN/\delta)}{\lfloor N/M\rfloor}} - \frac{9R\log(4MN/\delta)}{\lfloor N/M\rfloor}.$$
As can be seen from the bounds, the result of incorporating the variance estimates is similar to what was
observed in Section 3: The dependence of the number
of required samples on R2 is reduced to a dependence
on R and the variance. Similar results can be expected
when applying the empirical Bernstein bound to other
situations.
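To get a feel for the two theorems, one can compare the per-option counts $n_H(m)$ and $n(m)$ for illustrative values (all concrete numbers here are ours, not from the paper):

```python
import math

def n_hoeffding(R, gap, M, N, delta):
    """Per-option sample count n_H(m) from Theorem 1; gap = Delta_m."""
    return math.ceil(8 * R**2 * math.log(2 * M * N / delta) / gap**2)

def n_bernstein(R, sigma_star, sigma_m, gap, M, N, delta):
    """Per-option sample count n(m) from Theorem 2,
    with Sigma_m = sigma_star + sigma_m."""
    Sigma = sigma_star + sigma_m
    return math.ceil((8 * Sigma**2 + 18 * R * gap)
                     * math.log(4 * M * N / delta) / gap**2)

# Low-variance options relative to the range R = 1 (illustrative).
nH = n_hoeffding(R=1.0, gap=0.01, M=10, N=10**6, delta=0.05)
nB = n_bernstein(R=1.0, sigma_star=0.05, sigma_m=0.05, gap=0.01,
                 M=10, N=10**6, delta=0.05)
# nB is dramatically smaller than nH in this regime.
```

When the standard deviations are small relative to the range, the $R^2/\Delta_m^2$ term of the Hoeffding count is replaced by the much smaller $\Sigma_m^2/\Delta_m^2 + R/\Delta_m$, mirroring the discussion above.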
4.2. Results
Following the procedure of Maron and Moore (1997),
we evaluated the Hoeffding and empirical Bernstein
races on the task of selecting the best k for k-nearest
neighbor regression and classification through leave-one-out cross-validation.² Three datasets of different
types were used for the comparison. The SARCOS
data presents a regression problem which involves predicting the torques at 7 joints of a robot arm based
on the positions, velocities and accelerations at those
joints. We only considered predicting the torque at the
first joint. The Covertype2 dataset consists of 50,000
points sampled from the Covertype dataset from Section 3.4 and is a binary classification task. The Local
dataset presents a regression problem that was created by sampling 10,000 points from a noisy piecewise-linear function defined on the unit interval and having
a range of 1.
The value of the range parameter R was set to 1 for
the Covertype2 and Local datasets. For the SARCOS
dataset, R was set to the range of the target values
in the dataset. This differs from the approach of setting R separately for each option to several times the
standard deviation in the samples observed, suggested
by Maron and Moore (1997). We do not follow this
approach because it invalidates the use of Hoeffding’s
inequality.
²Since leave-one-out cross-validation creates dependencies between the samples, the analysis does not apply to
this case. However, our experiments gave similar results
when we used a separate hold-out set. We decided to
present results for leave-one-out cross-validation to facilitate comparison with the original papers.
Table 1. Percentage of work saved / number of options left after termination.

Data set      Hoeffding    EB
SARCOS        0.0% / 11    44.9% / 4
Covertype2    14.9% / 8    29.3% / 5
Local         6.0% / 9     33.1% / 6

All methods were given the options $k = 2^0, 2^1, 2^2, \ldots, 2^{10}$ to begin with. The results are presented in Table 1. The table shows the percentage of work saved by each method (1 − number of samples taken by the method / MN), as well as the number of options remaining after termination.
The empirical Bernstein racing algorithm, which is denoted by EB, significantly outperforms the Hoeffding
racing algorithm on all three datasets. The advantage
of incorporating variance estimates is the smallest on
the Covertype2 classification dataset. This is expected
because the samples come from Bernoulli distributions
which have the largest possible variance for a bounded
random variable. The advantage of variance estimation is the largest on the SARCOS dataset, where R is
much larger than the variance. While one may argue
that the Hoeffding racing algorithm would do much
better if R was set to a smaller value based on the
standard deviation, the empirical Bernstein algorithm
would also benefit. However, tweaking R this way is
merely an unprincipled way of incorporating variance
estimates into a racing algorithm.
5. Conclusions and Future Work
We showed how variance information can be exploited
in stopping problems in a principled manner. Most
notably, we presented a near-optimal stopping rule
for relative error estimation on bounded random variables, significantly extending the results of Domingo
and Watanabe, and Dagum et al. We also provided
empirical and theoretical results on the effect that can
be expected from incorporating variance estimates into
existing stopping algorithms.
One interesting question that should be addressed is if
the bound achieved by the AA algorithm in the nonnegative case, which is known to be optimal, can be
achieved without the non-negativity condition.
Acknowledgements
This work was supported in part by Agence Nationale
de la Recherche project “Modèles Graphiques et Applications”, the Alberta Ingenuity Fund, iCore, the
Computer and Automation Research Institute of the
Hungarian Academy of Sciences, and NSERC.
References
Asuncion, A., & Newman, D. (2007). UCI machine
learning repository.
Audibert, J. Y., Munos, R., & Szepesvári, C. (2007a).
Tuning bandit algorithms in stochastic environments. ALT (pp. 150–165).
Audibert, J.-Y., Munos, R., & Szepesvári, C. (2007b). Variance estimates and exploration function in multi-armed bandit (Technical Report 07-31). Certis - Ecole des Ponts. http://certis.enpc.fr/~audibert/RR0731.pdf.
Bradley, J. K., & Schapire, R. (2008). FilterBoost: Regression and classification on large datasets. NIPS 20 (pp. 185–192).
Dagum, P., Karp, R., Luby, M., & Ross, S. (2000).
An optimal algorithm for Monte Carlo estimation.
SIAM Journal on Computing, 29, 1484–1496.
Domingo, C., & Watanabe, O. (2000a). MadaBoost: A
modification of AdaBoost. COLT’00 (pp. 180–189).
Domingo, C., & Watanabe, O. (2000b). Scaling
up a boosting-based learner via adaptive sampling.
Pacific-Asia Conference on Knowledge Discovery
and Data Mining (pp. 317–328).
Domingos, P., & Hulten, G. (2001). A general method
for scaling up machine learning algorithms and its
application to clustering. ICML (pp. 106–113).
Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC
bounds for multi-armed bandit and Markov decision
processes. COLT’02 (pp. 255–270).
Hoeffding, W. (1963). Probability inequalities for sums
of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Maron, O., & Moore, A. (1993). Hoeffding races: Accelerating model selection search for classification
and function approximation. NIPS 6 (pp. 59–66).
Maron, O., & Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial
Intelligence Review, 11, 193–225.
Ortiz, L. E., & Kaelbling, L. P. (2000). Sampling
methods for action selection in influence diagrams.
AAAI/IAAI (pp. 378–385).