Bootstrap Report
F.W. Scholz
University of Washington
Edited version of a technical report of the same title, issued as bcstech-93-051, Boeing Computer Services, Research and Technology.
Abstract
This report reviews several bootstrap methods with special emphasis on small
sample properties. Only those bootstrap methods are covered which promise
wide applicability. The small sample properties can be investigated ana-
lytically only in parametric bootstrap applications. Thus there is a strong
emphasis on the latter although the bootstrap methods can be applied non-
parametrically as well. The disappointing confidence coverage behavior of several computationally less expensive parametric bootstrap methods should raise equal or even greater concern about the corresponding nonparametric bootstrap versions. The computationally more expensive double bootstrap methods hold great promise in the parametric case and may provide enough assurance for the nonparametric case.
Contents
1 The General Bootstrap Idea 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Setup and Objective . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Bootstrap Samples and Bootstrap Distribution . . . . . . . . . 7
3 Variance Estimation 16
3.1 Jackknife Variance Estimation . . . . . . . . . . . . . . . . . . 16
3.2 Substitution Variance Estimation . . . . . . . . . . . . . . . . 17
3.3 Bootstrap Variance Estimation . . . . . . . . . . . . . . . . . 17
5 Double Bootstrap Confidence Bounds 47
5.1 Prepivot Bootstrap Methods . . . . . . . . . . . . . . . . . . . 48
5.1.1 The Root Concept . . . . . . . . . . . . . . . . . . . . 48
5.1.2 Confidence Sets From Exact Pivots . . . . . . . . . . . 48
5.1.3 Confidence Sets From Bootstrapped Roots . . . . . . . 50
5.1.4 The Iteration or Prepivoting Principle . . . . . . . . . 51
5.1.5 Calibrated Confidence Coefficients . . . . . . . . . . . . 52
5.1.6 An Analytical Example . . . . . . . . . . . . . . . . . . 53
5.1.7 Prepivoting by Simulation . . . . . . . . . . . . . . . . 55
5.1.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . 57
5.2 The Automatic Double Bootstrap . . . . . . . . . . . . . . . . 58
5.2.1 Exact Confidence Bounds for Tame Pivots . . . . . . . 58
5.2.2 The General Pivot Case . . . . . . . . . . . . . . . . . 62
5.2.3 The Prepivoting Connection . . . . . . . . . . . . . . . 66
5.2.4 Sensitivity to Choice of Estimates . . . . . . . . . . . . 68
5.2.5 Approximate Pivots and Iteration . . . . . . . . . . . . 70
5.3 A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Efron’s Percentile Method . . . . . . . . . . . . . . . . 77
5.3.2 Hall’s Percentile Method . . . . . . . . . . . . . . . . . 78
5.3.3 Bias Corrected Percentile Method . . . . . . . . . . . . 82
5.3.4 Percentile-t and Double Bootstrap Methods . . . . . . 89
5.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
1 The General Bootstrap Idea
1.1 Introduction
The bootstrap method was introduced by Efron in 1979. Since then it has
evolved considerably. Efron's paper initiated a large body of hard theoretical
research (much of it of asymptotic or large sample character), and the bootstrap
has found wide acceptance as a data analysis tool. Part of the latter is due
to its considerable intuitive appeal, which is in contrast to the often deep
mathematical intricacies underlying much of statistical analysis methodol-
ogy. The basic bootstrap method is easily grasped by practitioners and by
consumers of statistics.
The popularity of the bootstrap was boosted early on by the very readable
Scientific American article by Diaconis and Efron (1983). Having chosen the
catchy name "bootstrap" certainly has not hurt its popularity. In Germany
the bootstrap method is called "die Münchhausen Methode," named after
Baron von Münchhausen, a fictional character in many fantastic stories. In
one of these he is supposed to have saved his life by pulling himself out of a
swamp by his own hair. The first reference to "die Münchhausen Methode"
can be traced to the German translation of the Diaconis and Efron article,
which appeared in Spektrum der Wissenschaft in the same year. There the
translator recast the above episode into the following image: pull yourself
out of the statistical swamp by your own mathematical hair.
Hall (1992) on page 2 of his extensive monograph on the bootstrap expresses
these contrasting thoughts concerning the “bootstrap” name:
Much of the bootstrap’s strength and acceptance also lies in its versatility. It
can handle a very wide spectrum of data analysis situations with equal ease.
In fact, it facilitates data analyses that heretofore were simply impossible
because the obstacles in the mathematical analysis were just too forbidding.
This gives us the freedom to model the data more accurately and obtain ap-
proximate answers to the right questions instead of right answers to often the
wrong questions. This freedom is bought at the cost of massive simulations
of resampled data sets followed by corresponding data analyses for each such
data set. The variation of results obtained in these alternate data analyses
should provide some insight into the accuracy and uncertainty of the data
analysis carried out on the original data.
This approach has become feasible only because of the concurrent advances
in computing. However, certain offshoots of the bootstrap, such as iterated
bootstrap methods, can still strain current computing capabilities and effi-
cient computing strategies are needed.
As stated above, the bootstrap has evolved considerably and there is no
longer a single preferred method, but a wide spectrum of separate methods,
all with their own strengths and weaknesses. All of these methods share the
same basic bootstrap idea but differ in how it is implemented.
There are two major streams, namely the parametric bootstrap and the non-
parametric bootstrap, but even they can be viewed in a unified fashion. The
primary focus of this report is on parametric bootstrap methods, although
the definitions for the various bootstrap methods are general enough to be
applicable for the parametric and nonparametric case. The main reason for
this focus is that in certain parametric examples one can examine analyti-
cally the small sample properties of the various bootstrap methods. Such an
analysis is not possible for the nonparametric bootstrap.
data set as generic as possible we wish to emphasize the wide applicability
of the bootstrap methods.
Not knowing P is usually expressed by stating that P is one of many possible
probability mechanisms, i.e., we say that P is a member of a family P of
probability models that could have generated X.
In the course of this report we will repeatedly use specific examples for prob-
ability models and for ease of reference we will list most of them here.
The first example is of nonparametric character, because the parameter F
that indexes the various PF ∈ P cannot be fit into some finite dimensional
space. Also, we deal here with a pure random sample, i.e., with i.i.d. random
variables.
The second, third, and fourth examples are of a parametric nature, since there is
a one-to-one correspondence between F and θ = (µ, σ) in Example 2, between
F and θ = (α, β, σ) in Example 3, and between F and θ = (µ1 , µ2 , σ1 , σ2 , ρ)
in Example 4. We could as well have indexed the possible probability mech-
anisms by θ, i.e., write Pθ , with θ varying over some appropriate subset
Θ ⊂ R2 , Θ ⊂ R3 , or Θ ⊂ R5 , respectively. In Example 3 the data are inde-
pendent but not identically distributed, since the mean of Yi changes linearly
with ti .
Of course, we could identify θ with F also in the first example and write
P = {Pθ : θ ∈ Θ}, with Θ = F being of infinite dimensionality in that
case. Because of this we will use the same notation describing any family P,
namely
P = {Pθ : θ ∈ Θ}
and the whole distinction between nonparametric and parametric probability
models recedes into the background, where it is governed by the character of the
indexing set Θ.
Many statistical analyses concern themselves with estimating θ, i.e., with
estimating the probability mechanism that generated the data. We will as-
sume that we are always able to find such estimates, and we denote a generic
estimate of θ by $\hat\theta = \hat\theta(\mathbf X)$, where the emphasized dependence on $\mathbf X$ should
make clear that any reasonable estimation procedure should be based on the
data at hand. Similarly, if we want to emphasize an estimate of P we write
$\hat P = P_{\hat\theta}$. Finding any estimate at all can at times be a tall order, but that
difficulty is not addressed here.
In Example 1 we may estimate θ = F by the empirical distribution function
of the data, i.e., by
$$\hat F(x) = \frac{1}{n}\sum_{i=1}^n I_{(-\infty,x]}(X_i)$$
and we write $\hat\theta = \hat F$. Here $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ if $x \notin A$. Thus
$\hat F(x)$ is that fraction of the sample which does not exceed $x$. $\hat F(x)$ can also be
viewed as the cumulative distribution function of a probability distribution
which places probability mass $1/n$ at each of the $X_i$. If some of the $X_i$
coincide, then that common value will receive the appropriate multiple of the mass $1/n$.
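As a small illustrative sketch (not part of the original report), the empirical distribution function can be computed directly from a sample; the data and seed below are hypothetical:

import numpy as np

def ecdf(sample):
    """Return a function x -> fraction of the sample that does not exceed x."""
    sorted_sample = np.sort(sample)
    n = len(sorted_sample)
    # searchsorted with side="right" counts observations <= x, i.e. n * Fhat(x);
    # tied observations automatically receive multiple mass 1/n.
    return lambda x: np.searchsorted(sorted_sample, x, side="right") / n

rng = np.random.default_rng(0)
x_obs = rng.normal(loc=10.0, scale=2.0, size=25)   # hypothetical sample
F_hat = ecdf(x_obs)
print(F_hat(10.0), F_hat(np.array([8.0, 10.0, 12.0])))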
Often one is content with estimating a particular functional $\psi = \psi(P)$ of $P$.
This will be the situation on which we will focus from now on. A natural
estimate of ψ is then given by $\hat\psi = \psi(\hat P) = \psi(P_{\hat\theta})$.

In Example 1 one may be interested in estimating the mean of the sampled
distribution F. Then
$$\psi = \psi(P_F) = \int_{-\infty}^{\infty} x\, dF(x)$$
and we obtain
$$\hat\psi = \psi\left(P_{\hat F}\right) = \int_{-\infty}^{\infty} x\, d\hat F(x) = \frac{1}{n}\sum_{i=1}^n X_i = \bar X\,,$$
The scatter of these estimates $\hat\psi_1,\ldots,\hat\psi_B$ would be a reflection of the sampling
uncertainty in our original estimate $\hat\psi$. As $B \to \infty$, the distribution of
the $\hat\psi_1,\ldots,\hat\psi_B$ represents the sampling distribution of $\hat\psi$, i.e., we would then
be in a position to evaluate probabilities such as
$$Q_A(\theta) = P_\theta\left(\hat\psi \in A\right)$$
for all appropriate sets A. This follows from the law of large numbers (LLN),
namely
$$\hat Q_A(\theta) = \frac{1}{B}\sum_{i=1}^B I_A(\hat\psi_i) \longrightarrow Q_A(\theta)$$
as $B \to \infty$. This convergence is "in probability" or "almost surely" and we
will not dwell on it further. Since computing power is cheap, we can let B be
quite large and thus get a fairly accurate approximation of $Q_A(\theta)$ by using
$\hat Q_A(\theta)$.

Knowledge of this sampling distribution could then be used to set error limits
on our estimate $\hat\psi$. For example, we could, by trial and error, find $\Delta_1$ and
$\Delta_2$ such that
$$.95 = P_\theta\left(\Delta_1 \le \hat\psi \le \Delta_2\right),$$
i.e., 95% of the time we would expect $\hat\psi$ to fall between $\Delta_1$ and $\Delta_2$. This
still does not express how far $\hat\psi$ is from the true ψ. This can only be judged
if we relate the position of the $\Delta_i$ to that of ψ, i.e., write $\delta_1 = \psi - \Delta_1$ and
$\delta_2 = \Delta_2 - \psi$ and thus
$$.95 = P_\theta\left(\hat\psi - \delta_2 \le \psi \le \hat\psi + \delta_1\right).$$
For each $\mathbf X^\star_i$ obtain the corresponding estimate $\hat\theta^\star_i$ and evaluate $\hat\psi^\star_i = \psi(\hat\theta^\star_i)$.
The bootstrap idea is founded in the hope that the scatter of these $\hat\psi^\star_1,\ldots,\hat\psi^\star_B$
should serve as a reasonable proxy for the scatter of $\hat\psi_1,\ldots,\hat\psi_B$, which we
cannot observe. If we let $B \to \infty$, we would by the LLN be able to evaluate
$$Q_A(\hat\theta) = P_{\hat\theta}\left(\hat\psi^\star \in A\right)$$
for all appropriate sets A. This evaluation can be done to any desired degree
of accuracy by choosing B large enough in our simulations, since
$$\hat Q_A(\hat\theta) = \frac{1}{B}\sum_{i=1}^B I_A(\hat\psi^\star_i) \longrightarrow Q_A(\hat\theta) \quad\text{as } B \to \infty\,.$$
If
$$P_{\hat\theta} \longrightarrow P_\theta \quad\text{as } \hat\theta \longrightarrow \theta\,,$$
in a sense to be left unspecified here, one can then say that the bootstrap
distribution of $\hat\psi^\star$ is a good approximation to the sampling distribution of $\hat\psi$,
i.e.,
$$P_{\hat\theta}\left(\hat\psi^\star \in A\right) \approx P_\theta\left(\hat\psi \in A\right).$$
Research has focused on making this statement more precise by resorting
to limit theory. In particular, research has studied the conditions under
which this approximation is reasonable and, through sophisticated higher order
asymptotic analysis, has tried to reach conclusions that are meaningful
even for moderately small samples. Our main concern in later sections will
be to examine the qualitative behavior of the various bootstrap methods in
small samples.
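To make these mechanics concrete, here is a minimal parametric bootstrap sketch in Python (not from the report). It assumes the normal model of Example 2 with $\psi(\theta) = \sigma^2$; the sample, B, and the set A are illustrative choices:

import numpy as np

rng = np.random.default_rng(1)

def estimate_theta(x):
    # maximum likelihood estimates (mu_hat, sigma_hat) for a normal model
    return x.mean(), x.std(ddof=0)

# observed data (hypothetical) and the functional of interest psi = sigma^2
x = rng.normal(loc=5.0, scale=3.0, size=20)
mu_hat, sigma_hat = estimate_theta(x)
psi_hat = sigma_hat**2

# parametric bootstrap: draw B data sets from P_theta_hat and re-estimate psi
B = 10_000
psi_star = np.empty(B)
for i in range(B):
    x_star = rng.normal(loc=mu_hat, scale=sigma_hat, size=len(x))
    psi_star[i] = estimate_theta(x_star)[1]**2

# the empirical distribution of psi_star approximates the sampling
# distribution of psi_hat; e.g. an approximation of Q_A(theta_hat) for A = [2, 10]
print(np.mean((psi_star >= 2.0) & (psi_star <= 10.0)))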
tribution is not centered on the unknown value ψ(θ), but is off by the bias
amount b(θ).
If we know the functional form of the bias term $b(\theta)$, then the following "bias
reduced" estimate
$$\hat\psi_{br1} = \psi(\hat\theta) - b(\hat\theta)$$
suggests itself. The subscript 1 indicates that this could be just the first
in a sequence of bias reduction iterations, i.e., what we do with $\hat\psi$ for bias
reduction we could repeat on $\hat\psi_{br1}$ and so on; see Section 2.3.

Such a correction will typically reduce the bias of the original estimate $\psi(\hat\theta)$,
but will usually not eliminate it completely, unless of course $b(\hat\theta)$ is itself an
unbiased estimate of $b(\theta)$.
Note that such bias correction often entails more variability in the bias corrected
estimate, due to the additional variability of the subtracted bias correction
term $b(\hat\theta)$. However, it is not clear how the mean squared error of the
estimate will be affected by such a bias reduction, since
$$MSE_\theta(\hat\psi) = E_\theta\left(\hat\psi - \psi\right)^2 = \mathrm{var}_\theta\,\hat\psi + b^2(\theta)\,.$$
The reduction in bias may well be more than offset by the increase in the
variance. In fact, one has the following expression for the difference of the
mean squared errors of $\hat\psi_{br1}$ and $\hat\psi$:
$$E_\theta\left(\hat\psi_{br1} - \psi\right)^2 - E_\theta\left(\hat\psi - \psi\right)^2 = E_\theta\, b(\hat\theta)^2 - 2E_\theta\, b(\hat\theta)\left(\hat\psi - \psi\right).$$
For the maximum likelihood estimate $\hat\sigma^2$ of $\sigma^2$ in Example 2 one has
$$E_\theta\,\hat\sigma^2 = \sigma^2 - \frac{\sigma^2}{n}\,,$$
i.e., the bias is $b(\theta) = -\sigma^2/n$. The bias reduced version is
$$\hat\sigma^2_{br1} = \hat\sigma^2 + \frac{\hat\sigma^2}{n}\,.$$
Here one finds
$$\mathrm{var}_\theta\,\hat\sigma^2_{br1} = \left(\frac{n+1}{n}\right)^2 \mathrm{var}_\theta\,\hat\sigma^2 > \mathrm{var}_\theta\,\hat\sigma^2$$
and
$$MSE(\hat\sigma^2) = E_\theta\left(\hat\sigma^2 - \sigma^2\right)^2 = \frac{\sigma^4}{n^4}\left(2n^3 - n^2\right)\,,$$
$$MSE(\hat\sigma^2_{br1}) = E_\theta\left(\hat\sigma^2_{br1} - \sigma^2\right)^2 = \frac{\sigma^4}{n^4}\left[2(n+1)^2(n-1) + 1\right]$$
and thus
$$MSE(\hat\sigma^2) < MSE(\hat\sigma^2_{br1}) \quad\text{for } n > 1\,,$$
since $2(n+1)^2(n-1) + 1 - (2n^3 - n^2) = 3n^2 - 2n - 1 = (3n+1)(n-1) > 0$ for $n > 1$.

In the second example we estimate $\psi = \mu^2$ by $\hat\psi = \bar X^2$, which has bias $b(\theta) = \sigma^2/n$, so that the bias reduced version is
$$\hat\psi_{br1} = \bar X^2 - \frac{\hat\sigma^2}{n}\,.$$
Here we find
$$\mathrm{var}_\theta\left(\bar X^2 - \hat\sigma^2/n\right) = \mathrm{var}_\theta\left(\bar X^2\right) + \mathrm{var}_\theta\left(\hat\sigma^2/n\right) > \mathrm{var}_\theta\left(\bar X^2\right)$$
and
$$MSE(\hat\psi) = 4\,\frac{\mu^2\sigma^2}{n} + 3\,\frac{\sigma^4}{n^2}\,,$$
$$MSE(\hat\psi_{br1}) = 4\,\frac{\mu^2\sigma^2}{n} + \frac{\sigma^4}{n^2}\left(2 + \frac{2n-1}{n^2}\right)$$
and thus clearly $MSE(\hat\psi_{br1}) < MSE(\hat\psi)$ for $n > 1$, since
$$3 - \left(2 + \frac{2n-1}{n^2}\right) = \frac{n^2 - 2n + 1}{n^2} = \frac{(n-1)^2}{n^2} > 0 \quad\text{for } n > 1\,.$$
2.2 Bootstrap Bias Reduction

In many problems the functional form of the bias term $b(\theta)$ is not known. It
turns out that the bootstrap provides us with just the above bias correction
without requiring any knowledge of the function $b(\theta)$. Generating a bootstrap
sample of estimates $\hat\psi^\star_1,\ldots,\hat\psi^\star_B$ from $P_{\hat\theta}$ we can form their average
$$\bar\psi^\star_B = \frac{1}{B}\sum_{i=1}^B \hat\psi^\star_i\,.$$
By the LLN
$$\bar\psi^\star_B \longrightarrow E_{\hat\theta}\,\hat\psi^\star = \psi(\hat\theta) + b(\hat\theta) \quad\text{as } B\to\infty\,,$$
so that $\bar\psi^\star_B - \psi(\hat\theta)$ serves as an estimate of $b(\hat\theta)$ and $2\psi(\hat\theta) - \bar\psi^\star_B$ as a bias
corrected estimate of ψ. For large enough B the latter will be indistinguishable
from $\hat\psi_{br1}$, for all practical purposes.

In general $\hat\psi_{br1} = \psi_{br1}(\hat\theta)$ will itself still be biased, namely
$$E_\theta\,\psi_{br1}(\hat\theta) = \psi(\theta) + b_1(\theta)\,,$$
and thus we can interpret $b_1(\theta)$ as the bias of $-b(\hat\theta)$ for estimating $-b(\theta)$.
The second order bias reduced estimate thus becomes
$$\hat\psi_{br2} = \psi_{br2}(\hat\theta) = \psi_{br1}(\hat\theta) - b_1(\hat\theta)
= \psi(\hat\theta) - b(\hat\theta) - \left[b(\hat\theta) - E_{\hat\theta}\,b(\hat\theta^\star)\right]
= \psi(\hat\theta) - 2b(\hat\theta) + E_{\hat\theta}\,b(\hat\theta^\star)\,,$$
where the $\hat\theta^\star$ inside the expectation indicates that its distribution is governed
by $\hat\theta$, the subscript on the expectation. Since $\psi_{br2}(\hat\theta)$ is a function of $\hat\theta$, we
can keep on iterating this scheme and even go to the limit with the iterations.

In the two examples of Section 2.1 the respective limits of these iterations
result ultimately in unbiased estimates of $\sigma^2$ and $\mu^2$, respectively. In the case
of the variance estimate the ith iterate gives
$$\hat\sigma^2_{bri} = \hat\sigma^2\left(\frac{1}{n^i} + \frac{1}{n^{i-1}} + \cdots + 1\right)
= \hat\sigma^2\,\frac{1 - 1/n^{i+1}}{1 - 1/n}
\longrightarrow \frac{n}{n-1}\,\hat\sigma^2 = s^2 \quad\text{as } i\to\infty\,,$$
where $s^2$ is the usual unbiased estimate of $\sigma^2$. In the case of estimating $\mu^2$
the ith iterate gives
$$\hat\psi_{bri} = \bar X^2 - \hat\sigma^2\left(\frac{1}{n} + \cdots + \frac{1}{n^i}\right)
= \bar X^2 - \frac{n-1}{n^2}\,\frac{1 - 1/n^i}{1 - 1/n}\,s^2
\longrightarrow \bar X^2 - \frac{s^2}{n} \quad\text{as } i\to\infty\,,$$
the latter being the conventional unbiased estimate of µ2 . In both exam-
ples the resulting limiting unbiased estimate is UMVU, i.e., has uniformly
minimum variance among all unbiased estimates of the respective target.
According to Hall (1992, p. 32) it is not always clear that these bias reduc-
tion iterations should converge to something. He does not give examples.
Presumably one may be able to get such examples from situations, in which
unbiased estimates do not exist. Since the analysis for such examples is com-
plicated and often involves estimates with infinite expectations, we will not
pursue this issue further.
do this only for the case of one iteration since even that can stretch the
simulation capacity of most computers.
Suppose we have generated the ith bootstrap data set $\mathbf X^\star_i$ and from it we have
obtained $\hat\theta^\star_i$. Then we can spawn a second generation or iterated bootstrap
sample $\mathbf X^{\star\star}_{i1},\ldots,\mathbf X^{\star\star}_{iC}$ from $P_{\hat\theta^\star_i}$. Each such iterated bootstrap sample then
results in corresponding estimates
$$\hat\theta^{\star\star}_{i1},\ldots,\hat\theta^{\star\star}_{iC}$$
and thus
$$\hat\psi^{\star\star}_{i1},\ldots,\hat\psi^{\star\star}_{iC}\,,\quad\text{with } \hat\psi^{\star\star}_{ij} = \psi\left(\hat\theta^{\star\star}_{ij}\right).$$
From the LLN we have that
$$\frac{1}{C}\sum_{j=1}^C \hat\psi^{\star\star}_{ij} \longrightarrow E_{\hat\theta^\star_i}\,\psi(\hat\theta^{\star\star}_i) = \psi(\hat\theta^\star_i) + b(\hat\theta^\star_i) \quad\text{as } C\to\infty\,.$$
Here $\hat\theta^{\star\star}_i$ inside the expectation varies randomly as governed by $P_{\hat\theta^\star_i}$, while $\hat\theta^\star_i$
is held fixed, just as $\hat\theta^\star$ would vary randomly as governed by $P_{\hat\theta}$, while $\hat\theta$ is
held fixed, and just as $\hat\theta$ would vary randomly as governed by $P_\theta$, while the
true θ is held fixed.

By the LLN and glossing over double limit issues we have that
$$\hat A_{BC} = \frac{1}{B}\sum_{i=1}^B \frac{1}{C}\sum_{j=1}^C \hat\psi^{\star\star}_{ij}
\approx \frac{1}{B}\sum_{i=1}^B \left[\psi(\hat\theta^\star_i) + b(\hat\theta^\star_i)\right]
\longrightarrow E_{\hat\theta}\left[\psi(\hat\theta^\star) + b(\hat\theta^\star)\right]$$
and hence
$$\hat\psi^\star_{br2} = 3\psi(\hat\theta) - 3\bar\psi^\star_B + \hat A_{BC}
\approx 3\psi(\hat\theta) - 3\left[\psi(\hat\theta) + b(\hat\theta)\right] + \psi(\hat\theta) + b(\hat\theta) + E_{\hat\theta}\,b(\hat\theta^\star)
= \psi(\hat\theta) - 2b(\hat\theta) + E_{\hat\theta}\,b(\hat\theta^\star) = \hat\psi_{br2}\,.$$
3 Variance Estimation
Suppose $\mathbf X \sim P_\theta$ and we are given an estimate $\hat\psi = \hat\psi(\mathbf X)$ of the real valued
functional $\psi = \psi(\theta)$. We are interested in obtaining an estimate of the variance
$\sigma^2_{\hat\psi}(\theta)$ of $\hat\psi$. Such variance estimates are useful in assessing the quality

3.1 Jackknife Variance Estimation

$$\hat\sigma^2_{\hat\psi,J} = \frac{n-1}{n}\sum_{i=1}^n \left(\hat\psi_{(-i)} - \hat\psi_{(\cdot)}\right)^2.$$
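The leave-one-out quantities are not defined in the excerpt above; assuming the standard jackknife convention that $\hat\psi_{(-i)}$ is the estimate recomputed with the ith observation deleted and $\hat\psi_{(\cdot)}$ is their average, a minimal sketch reads:

import numpy as np

def jackknife_variance(x, statistic):
    """Jackknife variance estimate for statistic(x), x a 1-d sample."""
    n = len(x)
    # psi_hat_(-i): statistic recomputed with the i-th observation deleted
    leave_one_out = np.array([statistic(np.delete(x, i)) for i in range(n)])
    psi_dot = leave_one_out.mean()                 # psi_hat_(.)
    return (n - 1) / n * np.sum((leave_one_out - psi_dot) ** 2)

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=30)            # hypothetical sample
# for the sample mean the jackknife variance reproduces s^2/n
print(jackknife_variance(x, np.mean), x.var(ddof=1) / len(x))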
3.2 Substitution Variance Estimation

Another general variance estimation procedure is based on the following
substitution idea. Knowing the functional form of $\sigma^2_{\hat\psi}(\theta)$ (as a function of θ), it
would be very natural to estimate $\sigma^2_{\hat\psi}(\theta)$ by simply replacing the unknown
parameter θ by $\hat\theta$, namely to use as variance estimate
$$\hat\sigma^2_{\hat\psi} = \sigma^2_{\hat\psi}(\hat\theta)\,.$$

This sample variance is an unbiased estimate of $\sigma^2_{\hat\psi}(\hat\theta)$ and its accuracy can
distribution function of the sample, estimating θ = F. From analytical
considerations we know that
$$\sigma^2_{\bar X}(F) = \frac{\sigma^2(F)}{n}\,,$$
where $\sigma(F)$ is the standard deviation of F. The substitution principle would
estimate $\sigma^2(F)/n$ by $\sigma^2(\hat F)/n$, where
$$\sigma^2\left(\hat F\right) = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X\right)^2.$$
This $\sigma^2(\hat F)$ is the variance of $\hat F$, which places probability $1/n$ on each of
the $X_i$, whence the computational formula. Instead of using the analytical
form of $\sigma^2_{\bar X}(F)$ and substitution, the bootstrap variance estimation method
generates B samples, of size n each, from $\hat F$ and computes the B sample
averages $\bar X^\star_1,\ldots,\bar X^\star_B$ of these samples. For large B the sample variance
$$\hat\sigma^2_{\bar X,B} = \frac{1}{B-1}\sum_{i=1}^B\left(\bar X^\star_i - \bar{\bar X}^\star\right)^2\,,\quad\text{where}\quad \bar{\bar X}^\star = \frac{1}{B}\sum_{i=1}^B \bar X^\star_i\,,$$
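A minimal sketch of this simulation alternative (sample, seed, and B are illustrative); for the mean it can be compared directly against the substitution estimate $\sigma^2(\hat F)/n$:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=25)      # hypothetical observed sample

B = 5000
# nonparametric bootstrap: resample n values from F_hat (i.e. from x, with
# replacement) B times and record the B sample means
xbar_star = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(B)])

sigma2_boot = xbar_star.var(ddof=1)    # bootstrap variance estimate of X-bar
sigma2_subst = x.var(ddof=0) / len(x)  # substitution estimate sigma^2(F_hat)/n
print(sigma2_boot, sigma2_subst)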
4 Bootstrap Confidence Bounds
There are many methods for constructing bootstrap confidence bounds. We
will not describe them all in detail. The reason for this is that we wish to
emphasize the basic simplicity of the bootstrap method and its generality of
applicability. Thus we will shy away from any bootstrap modifications which
take advantage of analytical devices that are very problem specific and limit
the generic applicability of the method.
We will start by introducing Efron’s original percentile method, followed by
its bias corrected version. The accelerated bias corrected percentile method is
not covered as it seems too complicated for general application. It makes use
of a certain analytical adjustment, namely the acceleration constant, which is
not easily determined from the bootstrap distribution. It is not entirely clear
to us whether the method is even well defined in general multiparameter sit-
uations not involving maximum likelihood estimates. These three percentile
methods are equivariant under monotone transformations on the parameter
to be estimated.
Next we will discuss what Hall calls the percentile method and the Student-t
percentile method. Finally, we discuss several double bootstrap methods,
namely Beran’s prepivoting, Loh’s calibrated bootstrap, and the automatic
double bootstrap. These, but especially the last one, appear to be most
promising as far as coverage accuracy in small samples is concerned. How-
ever, they also are computationally most intensive. As we go along, we
illustrate the methods with specific examples. In a case study we will further
illustrate the relative merits of all these methods for small sample sizes in the
context of estimating a normal quantile and connect the findings with the
approximation rate results given in the literature. All of these investigations
concentrate on parametric bootstrap methods, but the definitions are gen-
eral enough to allow them to be used in the nonparametric context as well.
However, in nonparametric settings it typically is not feasible to investigate
the small sample coverage properties of the various bootstrap methods, other
than by small sample asymptotic methods or by doubly or triply nested sim-
ulation loops, the latter being prohibitive. We found that the small sample
asymptotics are not very representative of the actual small sample behavior
in the parametric case. Thus the small sample asymptotic results in the
nonparametric case are of questionable value in really small samples.
Throughout our treatment of confidence intervals, whether by simple boot-
strap or by double bootstrap methods, it is often convenient to assume that
the distribution functions $F_\theta$ of the estimates $\hat\psi$ are generally continuous and
strictly increasing on their support $\{x : 0 < F_\theta(x) < 1\}$. These assumptions
allow us to use the probability integral transform result, which states that
$U = F_\theta(\hat\psi) \sim U(0,1)$, and quantities like $F^{-1}_\theta(p)$ are well defined without
complications.
complications. Making this blanket assumption here saves us from repeating
it over and over. In some situations it may well be possible to maintain
greater validity by arguing more carefully, but that would entail inessential
technicalities and distract from getting the basic bootstrap ideas across. It
will be up to the reader to perform the necessary detail work, if such gener-
ality is desired. If we wish to deviate from the above tacit assumption, we
will do so explicitly.
This method was introduced by Efron (1981). Hall (1992) refers to this
also as the “other percentile method,” since he reserves the name “per-
centile method” for another method. In Hall’s scheme of viewing the boot-
strap Efron’s method does not fit in well and he advances various arguments
against this “other percentile method.” However, he admits that the “other
percentile method” performs quite well in the double bootstrap approach.
We seem to have found the reason for this as the section on the automatic
double bootstrap will make clear. For this reason we prefer not to use the
abject term “other percentile method” but instead call it “Efron’s percentile
method.” However, we will usually refer to the percentile method in this
section and only make the distinction when confusion with Hall’s percentile
method is to be avoided. We will first give the method in full generality,
present one simple example illustrating what the method does for us, show
its transformation equivariance and then provide some justification in the
single parameter case.
an appropriately chosen high value of the ordered bootstrap sample
$$\hat\psi^\star_{(1)} \le \ldots \le \hat\psi^\star_{(B)}$$
might serve well as upper confidence bound for ψ. This has some intuitive
appeal, but before completely subscribing to this intuition the reader should
wait until reading the section on Hall's percentile method.

To make the above definition more precise we appeal to the LLN. For
sufficiently large B we can treat the empirical distribution of the bootstrap
sample of estimates
$$\hat G_B(t) = \frac{1}{B}\sum_{i=1}^B I_{[\hat\psi^\star_i \le t]}$$
as a good approximation to the distribution function $G_{\hat\theta}(t)$ of $\hat\psi^\star$, where
$$G_{\hat\theta}(t) = P_{\hat\theta}\left(\hat\psi^\star \le t\right).$$
Solving
$$G_{\hat\theta}(t) = 1-\alpha \quad\text{for}\quad t = \hat\psi_U(1-\alpha) = G^{-1}_{\hat\theta}(1-\alpha)$$
we will consider $\hat\psi_U(1-\alpha)$ as a nominal $100(1-\alpha)\%$ upper confidence bound
for ψ. For large B this upper bound can, for practical purposes, also be
obtained by taking $\hat G^{-1}_B(1-\alpha)$ instead of $G^{-1}_{\hat\theta}(1-\alpha)$. This substitution
amounts to computing $m = (1-\alpha)B$ and taking the mth of the sorted
bootstrap values, $\hat\psi^\star_{(1)} \le \ldots \le \hat\psi^\star_{(B)}$, namely $\hat\psi^\star_{(m)}$, as our upper bound. If
$m = (1-\alpha)B$ is not an integer, one may have to resort to an interpolation
scheme for the two bracketing order statistics $\hat\psi^\star_{(k)}$ and $\hat\psi^\star_{(k+1)}$, where k is the
largest integer $\le m$. In that case define
$$\hat\psi^\star_{(m)} = \hat\psi^\star_{(k)} + (m-k)\left(\hat\psi^\star_{(k+1)} - \hat\psi^\star_{(k)}\right).$$
When B is sufficiently large, this bootstrap sample order statistic $\hat\psi^\star_{(m)}$ is a
good approximation of $G^{-1}_{\hat\theta}(1-\alpha)$. Similarly, one defines
Together these two bounds comprise a nominal 100(1 − 2α)%, equal tailed
confidence interval for ψ. These are the bounds according to Efron’s per-
centile method. The qualifier “nominal” indicates that the actual coverage
probabilities of these bounds may be different from the intended or nominal
confidence level.
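A minimal sketch of Efron's percentile bounds computed from a bootstrap sample of estimates, including the interpolation scheme described above; the stand-in bootstrap sample is purely illustrative:

import numpy as np

def efron_percentile_bounds(psi_star, alpha=0.05):
    """Equal tailed Efron percentile interval from a bootstrap sample of
    estimates psi_star, with interpolation when m is not an integer."""
    z = np.sort(np.asarray(psi_star))
    B = len(z)

    def order_stat(m):
        # psi*_(m) with linear interpolation between bracketing order statistics
        m = min(max(m, 1.0), float(B))
        k = int(np.floor(m))
        lo = z[k - 1]
        hi = z[min(k, B - 1)]
        return lo + (m - k) * (hi - lo)

    return order_stat(alpha * B), order_stat((1 - alpha) * B)

rng = np.random.default_rng(5)
psi_star = rng.chisquare(df=9, size=2000) / 10.0   # stand-in bootstrap sample
print(efron_percentile_bounds(psi_star, alpha=0.05))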
The above construction of upper bound, lower bound, and equal tailed inter-
val shows that generally one only needs to know how to construct an upper
bound. At times we will thus only discuss upper or lower bounds.
In situations where we deal with independent, identically distributed data
samples, i.e., $\mathbf X = (X_1,\ldots,X_n)$ with $X_1,\ldots,X_n$ i.i.d. $\sim F_\theta$, one can show
under some regularity conditions that for large sample size n the coverage
error is proportional to $1/\sqrt n$ for the upper and lower bounds, respectively.
Due to fortuitous error cancellation the coverage error is proportional to $1/n$
for the equal tailed confidence interval. What this may really mean in small
samples will later be illustrated in some concrete examples.
we know that the empirical distribution function of this sample is a good
approximation of
$$G_{\hat\mu}(t) = P_{\hat\mu}\left(\bar X^\star \le t\right) = \Phi\left(\frac{t - \hat\mu}{\sigma_0/\sqrt n}\right),$$
where the latter equation describes the analytical fact that $\bar X^\star \sim N(\hat\mu, \sigma_0^2/n)$
when $\bar X^\star$ is the sample mean of $X^\star_1,\ldots,X^\star_n$ i.i.d. $\sim N(\hat\mu,\sigma_0^2)$. The bootstrap
method does not know this analytical fact. We only refer to it to see what
the bootstrap percentile method generates. The percentile method takes the
$(1-\alpha)$-percentile of the bootstrap sample of estimates as upper bound. For
large B this percentile is an excellent approximation to $G^{-1}_{\hat\mu}(1-\alpha)$, namely
the $(1-\alpha)$-percentile of the $N(\hat\mu,\sigma_0^2/n)$ population, or
$$G^{-1}_{\hat\mu}(1-\alpha) = \hat\mu + \Phi^{-1}(1-\alpha)\,\frac{\sigma_0}{\sqrt n} = \hat\mu_U(1-\alpha)\,.$$
4.1.3 Transformation Equivariance

The property of transformation equivariance is defined as follows. If we have
a "method" for constructing confidence bounds for ψ and if $g(\psi) = \tau$ is a
strictly increasing transformation of ψ, then we could try to obtain upper
confidence bounds for $\tau = \tau(\theta)$ by two methods. On the one hand we can
obtain an upper bound $\hat\psi_U$ for ψ and treat $g(\hat\psi_U)$ as upper bound for τ with
the same coverage probability, since
$$1-\alpha = P\left(\hat\psi_U \ge \psi\right) = P\left(g(\hat\psi_U) \ge \tau\right).$$
On the other hand we can apply the percentile method directly to the estimate
of τ, which is based on
$$\tau(\theta) = g(\psi(\theta))$$
and thus on
$$\hat\tau^\star = \tau(\hat\theta^\star) = g\left(\psi(\hat\theta^\star)\right) = g\left(\hat\psi^\star\right).$$
This in turn implies
$$H_{\hat\theta}(t) = P_{\hat\theta}\left(\hat\tau^\star \le t\right) = P_{\hat\theta}\left(g(\hat\psi^\star) \le t\right) = P_{\hat\theta}\left(\hat\psi^\star \le g^{-1}(t)\right) = G_{\hat\theta}\left(g^{-1}(t)\right)$$
and thus
$$H^{-1}_{\hat\theta}(p) = g\left(G^{-1}_{\hat\theta}(p)\right).$$
The percentile method applied to $\hat\tau$ yields as upper bound
$$\hat\tau_U = H^{-1}_{\hat\theta}(1-\alpha) = g\left(G^{-1}_{\hat\theta}(1-\alpha)\right) = g\left(\hat\psi_U\right),$$
i.e., we have the desired transformation equivariance relation between $\hat\tau_U$ and
$\hat\psi_U$.
4.1.4 A Justification in the Single Parameter Case

In this subsection we will describe conditions under which the percentile
method gives confidence bounds with exact coverage probabilities. In
fact, it is shown that the percentile method agrees with the classical bounds
in such situations.

Let $\hat\theta = \hat\theta(\mathbf X)$ be an estimate of θ and let $\hat\psi = \psi(\hat\theta)$ be the estimate of
ψ, the real valued parameter of interest. Consider the situation in which
the distribution of $\hat\psi$ depends only on ψ and not on any other nuisance
parameters, although these may be present in the model. Thus we essentially
deal with a single parameter problem. Suppose we want to get confidence
bounds for $\psi = \psi(\theta)$. Then $\hat\psi$ has distribution function
$$G_\psi(t) = P_\psi\left(\hat\psi \le t\right).$$
Here we write $P_\psi$ instead of $P_\theta$ because of the assumption made concerning
the distribution of $\hat\psi$. In order to keep matters simple we will assume that
$G_\psi(t)$ is continuous in both t and ψ and that $G_\psi(t)$ is decreasing in ψ for fixed t. The
latter monotonicity assumption is appropriate for reasonable estimates, i.e.,
for responsive estimates that tend to increase as the target ψ increases.

Using the probability integral transform we have that $U = G_\psi(\hat\psi)$ is
distributed uniformly over [0, 1]. Thus
$$1-\alpha = P_\psi\left(G_\psi(\hat\psi) \ge \alpha\right) = P_\psi\left(\psi \le \hat\psi_{[1-\alpha]}\right),$$
where $\hat\psi_{[1-\alpha]}$ denotes the solution in ψ of $G_\psi(\hat\psi) = \alpha$.
possible to transform estimates in this fashion so that the resulting distribution
is approximately standard normal, i.e., Z above would be a standard
normal random variable. The consequence of this transformation assumption
is that the percentile method will yield the same upper bound $\hat\psi_{[1-\alpha]}$,
and it does so without knowing g, τ or H. Only their existence is assumed
in the above transformation.

Under the above assumption we find
$$G_\psi(t) = P\left(\hat\psi \le t\right) = P\left(\tau\{g(\hat\psi) - g(\psi)\} \le \tau\{g(t) - g(\psi)\}\right) = H\left(\tau\{g(t) - g(\psi)\}\right) \tag{1}$$
and thus
$$\hat\psi_{[1-\alpha]} = g^{-1}\left(g(\hat\psi) - H^{-1}(\alpha)/\tau\right) = g^{-1}\left(g(\hat\psi) + H^{-1}(1-\alpha)/\tau\right),$$
where the last equality results from the symmetry of H. From Equation (1)
we obtain further
$$1-\alpha = G_\psi\left(G^{-1}_\psi(1-\alpha)\right) = H\left(\tau\left\{g\left(G^{-1}_\psi(1-\alpha)\right) - g(\psi)\right\}\right)$$
and thus
$$G^{-1}_\psi(1-\alpha) = g^{-1}\left(g(\psi) + H^{-1}(1-\alpha)/\tau\right)$$
and replacing ψ by $\hat\psi$ we have
$$G^{-1}_{\hat\psi}(1-\alpha) = g^{-1}\left(g(\hat\psi) + H^{-1}(1-\alpha)/\tau\right) = \hat\psi_{[1-\alpha]}\,.$$
This means that we can obtain the upper confidence bound $\hat\psi_{[1-\alpha]}$ simply
by simulating the cumulative distribution function $G_{\hat\psi}(t)$ and then solving
$G_{\hat\psi}(t) = 1-\alpha$ for $t = \hat\psi_{[1-\alpha]}$, i.e., generate a large bootstrap sample of
estimates $\hat\psi^\star_1,\ldots,\hat\psi^\star_B$ and for $m = (1-\alpha)B$ take $\hat\psi^\star_{(m)}$, the mth ordered value
of $\hat\psi^\star_{(1)} \le \ldots \le \hat\psi^\star_{(B)}$, as a good approximation to
$$G^{-1}_{\hat\psi}(1-\alpha) = \hat\psi_{[1-\alpha]}\,.$$
4.2 Bias Corrected Percentile Bootstrap

If an estimate $\hat\psi$ satisfies
$$G_\theta(\psi) = P_\theta\left(\hat\psi \le \psi\right) = .5$$
it is called median unbiased. For the bootstrap distribution $G_{\hat\theta}$ this entails
$G_{\hat\theta}(\hat\psi) = .5$. In order to correct for the bias in estimates that are not median
unbiased Efron proposed to compute the following estimated bias correction
$$x_0 = \Phi^{-1}\left(G_{\hat\theta}(\hat\psi)\right)$$
and to take
$$\hat\psi_{U\,bc} = G^{-1}_{\hat\theta}\left(\Phi(2x_0 + z_{1-\alpha})\right)$$
as the nominal $100(1-\alpha)\%$ upper confidence bound, while
$$\hat\psi_{L\,bc} = G^{-1}_{\hat\theta}\left(\Phi(2x_0 + z_{\alpha})\right)$$
is the corresponding lower bound, i.e., $1-\alpha$ is replaced by α as we go from
upper bound to lower bound. Together these two bounds form an equal
tailed confidence interval for ψ, with nominal level $100(1-2\alpha)\%$. Note that
these bounds revert to the Efron percentile bounds when $x_0 = 0$, i.e., when
$G_{\hat\theta}(\hat\psi) = .5$.
For the following it is useful to have the distribution function of $\hat\psi = \hat\sigma^2$
explicitly, namely
$$G_\theta(x) = P_\theta\left(\hat\sigma^2 \le x\right) = \chi_{n-1}\left(nx/\sigma^2\right),$$
where $\chi_{n-1}$ denotes the distribution function of the chi-square distribution with
$n-1$ degrees of freedom. Its inverse is
$$G^{-1}_\theta(p) = \frac{\sigma^2}{n}\,\chi^{-1}_{n-1}(p)\,,$$
where $\chi^{-1}_{n-1}(p)$ is the inverse of $\chi_{n-1}$.
Rather than simulating a bootstrap sample of estimates $\hat\psi^\star_1,\ldots,\hat\psi^\star_B$, we
pretend that B is very large, say $B = \infty$, so that we actually have knowledge
of the exact bootstrap distribution
$$G_{\hat\theta}(x) = P_{\hat\theta}\left(\hat\sigma^{2\star} \le x\right) = \chi_{n-1}\left(nx/\hat\sigma^2\right).$$
This allows us to write down the bias corrected bootstrap confidence bounds
in compact mathematical notation and analyze their coverage properties
without resorting to simulations. However, keep in mind that this is not necessary
in order to get the bounds. They can always be obtained from the bootstrap
sample, as outlined in Section 4.2.1.

The upper confidence bound for $\psi = \sigma^2$ obtained by the bias corrected
percentile method can be expressed as
$$\hat\psi_{U\,bc} = G^{-1}_{\hat\theta}\left(\Phi\left(2\Phi^{-1}\left(G_{\hat\theta}(\hat\psi)\right) + z_{1-\alpha}\right)\right)
= G^{-1}_{\hat\theta}\left(\Phi\left(2\Phi^{-1}\left(\chi_{n-1}(n)\right) + z_{1-\alpha}\right)\right)
= \frac{\hat\sigma^2}{n}\,\chi^{-1}_{n-1}\left(\Phi\left(2\Phi^{-1}\left(\chi_{n-1}(n)\right) + z_{1-\alpha}\right)\right).$$
In comparison, the ordinary Efron percentile upper bound can be expressed
as
$$\hat\psi_U = G^{-1}_{\hat\theta}(1-\alpha) = \frac{\hat\sigma^2}{n}\,\chi^{-1}_{n-1}(1-\alpha)\,.$$
The actual coverage probabilities of both bounds are given by the following
formulas:
$$P_\theta\left(\hat\psi_U \ge \psi\right) = P_\theta\left(\chi^{-1}_{n-1}(1-\alpha)\,\hat\sigma^2/n \ge \sigma^2\right) = P\left(V \ge n^2/\chi^{-1}_{n-1}(1-\alpha)\right),$$
where $V = n\hat\sigma^2/\sigma^2$ has the $\chi_{n-1}$ distribution,
[Figure 1: Actual − Nominal Coverage Probability of 95% Upper & Lower Bounds and Asymptotes; coverage error plotted against $1/\sqrt n$.]
[Figure 2: Actual − Nominal Coverage Probability of 90% Confidence Intervals and Asymptotes; coverage error plotted against $1/n$.]
and
$$P_\theta\left(\hat\psi_{U\,bc} \ge \psi\right)
= P_\theta\left(\chi^{-1}_{n-1}\left(\Phi\left(2\Phi^{-1}(\chi_{n-1}(n)) + z_{1-\alpha}\right)\right)\hat\sigma^2/n \ge \sigma^2\right)$$
$$= P\left(V \ge n^2\big/\chi^{-1}_{n-1}\left(\Phi\left(2\Phi^{-1}(\chi_{n-1}(n)) + z_{1-\alpha}\right)\right)\right)$$
$$= 1 - \chi_{n-1}\left(n^2\big/\chi^{-1}_{n-1}\left(\Phi\left(2\Phi^{-1}(\chi_{n-1}(n)) + z_{1-\alpha}\right)\right)\right).$$
The coverage probabilities for the corresponding lower bounds are the complement
of the above probabilities with $1-\alpha$ replaced by α.
Figure 1 shows the coverage error (actual − nominal coverage rate) of nominally
95% upper and lower confidence bounds for $\sigma^2$, plotted against the
theoretical rate $1/\sqrt n$, for sample sizes $n = 2,\ldots,20, 30, 40, 50, 100, 200, 500,
1000, 2000$. The asymptotes are estimated by drawing lines through (0, 0)
and the points corresponding to $n = 2000$. Note the symmetry of the asymptotes
around the zero line, confirming the error cancellation of order $1/\sqrt n$.
However, the sample size has to be fairly large, say n ≥ 30, before the asymp-
totes reasonably approximate the coverage error. The coverage error of the
upper bounds is negative and quite substantial for moderate n, whereas that
of the lower bounds is positive and small even for moderate n. Through-
out, the coverage error of the bias corrected percentile method appears to be
smaller than that of the Efron percentile method by a factor of at least two.
Figure 2 shows the coverage error (actual − nominal coverage rate) of the
corresponding nominally 90% confidence intervals for σ 2 plotted against the
theoretical rate of 1/n. The approximation to the asymptotes is good for
much smaller n here. Again the bias corrected version is better by a factor
of at least two and for large n by a factor of three.
Thus we have
$$y_0 = \Phi^{-1}\left(G_{\hat\theta}\left(g^{-1}\left(g(\hat\psi)\right)\right)\right) = \Phi^{-1}\left(G_{\hat\theta}(\hat\psi)\right) = x_0$$
and with $H^{-1}_{\hat\theta}(\cdot) = g\left(G^{-1}_{\hat\theta}(\cdot)\right)$ we can write
$$\hat\tau_{U\,bc} = g\left(G^{-1}_{\hat\theta}\left(\Phi(2x_0 + z_{1-\alpha})\right)\right) = g\left(\hat\psi_{U\,bc}\right)\,,$$

$$G_{\hat\psi_{[1-\alpha]}}(\hat\psi) = \alpha\,.$$

$$\tau\{g(\hat\psi) - g(\psi)\} + x_0 \sim Z \quad\text{or}\quad g(\hat\psi) \sim g(\psi) - x_0/\tau + Z/\tau\,,$$
of estimates, transformed in the above fashion, are well approximated by a
standard normal distribution. Given the above transformation assumption
it is shown below that the bias corrected percentile upper bound for ψ agrees
again with ψb[1−α] . A priori knowledge of g and τ is not required, they only
need to exist. The bias correction constant x0 , which figures explicitly in
the definition of the bias corrected percentile method, is already defined in
terms of the accessible bootstrap distribution Gθb(·). The remainder of this
subsection proves the above claim. The argument is somewhat convoluted
and may be skipped.
First we have
$$G_\psi(t) = P_\psi\left(\hat\psi \le t\right)
= P_\psi\left(\tau[g(\hat\psi) - g(\psi)] + x_0 \le \tau[g(t) - g(\psi)] + x_0\right)
= \Phi\left(x_0 + \tau[g(t) - g(\psi)]\right),$$
where the last step uses the above transformation assumption; in particular
$\Phi^{-1}\left(G_\psi(\psi)\right) = x_0$, agreeing with the original definition of the bias correction.
The exact upper bound $\hat\psi_{[1-\alpha]}$ is found by solving
$$G_\psi(\hat\psi) = \alpha\,,$$
i.e.,
$$z_\alpha = \Phi^{-1}(\alpha) = x_0 + \tau[g(\hat\psi) - g(\psi)]$$
or
$$g(\hat\psi) - g(\psi) = -(x_0 - z_\alpha)/\tau = -(x_0 + z_{1-\alpha})/\tau$$
and finally
$$\hat\psi_{[1-\alpha]} = \psi = g^{-1}\left(g(\hat\psi) + \frac{1}{\tau}(x_0 + z_{1-\alpha})\right). \tag{3}$$
On the other hand, using again the relation $G_\psi(t) = \Phi\left(x_0 + \tau[g(t) - g(\psi)]\right)$
(in the second identity below), we have
$$\Phi(2x_0 + z_{1-\alpha}) = G_\psi\left(G^{-1}_\psi\left(\Phi(2x_0 + z_{1-\alpha})\right)\right)
= \Phi\left(x_0 + \tau\left[g\left(G^{-1}_\psi\left[\Phi(2x_0 + z_{1-\alpha})\right]\right) - g(\psi)\right]\right).$$
Equating the arguments of Φ on both sides we have
$$x_0 + z_{1-\alpha} = \tau\left[g\left(G^{-1}_\psi\left[\Phi(2x_0 + z_{1-\alpha})\right]\right) - g(\psi)\right]$$
or
$$\frac{1}{\tau}(x_0 + z_{1-\alpha}) + g(\psi) = g\left(G^{-1}_\psi\left[\Phi(2x_0 + z_{1-\alpha})\right]\right)$$
and
$$g^{-1}\left(\frac{1}{\tau}(x_0 + z_{1-\alpha}) + g(\psi)\right) = G^{-1}_\psi\left[\Phi(2x_0 + z_{1-\alpha})\right].$$
Replacing ψ by $\hat\psi$ on both sides and recalling Equation (3) we obtain
$$\hat\psi_{[1-\alpha]} = G^{-1}_{\hat\psi}\left[\Phi(2x_0 + z_{1-\alpha})\right],$$
i.e., the bias corrected percentile upper bound coincides with the exact upper
bound $\hat\psi_{[1-\alpha]}$.
Hall (1992) calls this method simply the percentile method, whereas he refers
to Efron’s percentile method as “the other percentile method.” Using the
terms “Efron’s percentile method” and “Hall’s percentile method” we pro-
pose to remove any value judgment and eliminate confusion. It is not clear
who first initiated Hall’s percentile method, although Efron (1979) already
discussed bootstrapping the distribution of ψb − ψ, but not in the context of
confidence bounds. The method fits well within the general framework that
Hall (1992) has built for understanding bootstrap methods. We will first give
a direct definition of Hall’s percentile method together with its motivation,
illustrate it with an example and relate it to Efron’s percentile method. The
method is generally not transformation equivariant.
$$H_\theta(x) = P_\theta\left(\hat\psi - \psi \le x\right).$$
This can be done by simulating a bootstrap sample $\hat\psi^\star_1,\ldots,\hat\psi^\star_B$ and forming
$$\hat\psi^\star_1 - \hat\psi,\ \ldots,\ \hat\psi^\star_B - \hat\psi\,.$$
Here $\hat\psi$ is held fixed within the probability statement $P_{\hat\theta}(\cdots)$ and the term
$\hat\psi^\star = \psi(\hat\theta(\mathbf X^\star))$ is random, with $\mathbf X^\star$ generated from the probability model $P_{\hat\theta}$.
The bootstrap method here consists of treating $H_{\hat\theta}(x)$ as a good approximation
to $H_\theta(x)$, the latter being unknown since it usually depends on the
unknown parameter θ. Of course, $\hat H_B(x)$ will serve as our bootstrap approximation
to $H_{\hat\theta}(x)$ and thus to $H_\theta(x)$. The accuracy of the first approximation
($\hat H_B(x) \approx H_{\hat\theta}(x)$) can be controlled by the bootstrap sample size B, but the
accuracy of $H_{\hat\theta}(x) \approx H_\theta(x)$ depends on the accuracy of $\hat\theta$ as estimate of the
unknown θ. The latter accuracy is usually affected by the sample size, which
often is governed by other considerations beyond the control of the analyst.

Hall's percentile method gives the $100(1-\alpha)\%$ upper confidence bound for
ψ as
$$\hat\psi_{HU} = \hat\psi - H^{-1}_{\hat\theta}(\alpha)\,,$$
and similarly the $100(1-\alpha)\%$ lower confidence bound as
$$\hat\psi_{HL} = \hat\psi - H^{-1}_{\hat\theta}(1-\alpha)\,.$$
The remainder of the discussion will focus on upper bounds, since the discussion
for lower bounds would be entirely parallel.

The above upper confidence bound is motivated by the exact $100(1-\alpha)\%$
upper bound
$$\hat U = \hat\psi - H^{-1}_\theta(\alpha)\,,$$
since
$$P_\theta(\hat U > \psi) = P_\theta\left(\hat\psi - H^{-1}_\theta(\alpha) > \psi\right)
= 1 - P_\theta\left(\hat\psi - \psi \le H^{-1}_\theta(\alpha)\right) = 1 - H_\theta\left(H^{-1}_\theta(\alpha)\right) = 1 - \alpha\,.$$
However, $\hat U$ is not a true confidence bound, since it typically depends on
the unknown θ through $H^{-1}_\theta(\alpha)$. The bootstrap step consists in sidestepping
this problem by approximating $H^{-1}_\theta(\alpha)$ by $H^{-1}_{\hat\theta}(\alpha)$. For large enough B,
we can obtain $H^{-1}_{\hat\theta}(\alpha)$ to any accuracy directly from the bootstrap sample
of the $D_i = \hat\psi^\star_i - \hat\psi$. Simply order the $D_i$, i.e., find the order statistics
$D_{(1)} \le D_{(2)} \le \ldots \le D_{(B)}$ and, for $\ell = B\alpha$, take the $\ell$th value $D_{(\ell)}$ as
approximation of $H^{-1}_{\hat\theta}(\alpha)$. If $\ell$ is not an integer, interpolate between the
appropriate bracketing values $D_{(k)}$ and $D_{(k+1)}$. Note that it is not required
that we know the analytical form of $H_\theta$. All we need to know is how to
create new bootstrap samples $\mathbf X^\star_i$ from $P_{\hat\theta}$ and thus estimates $\hat\psi^\star_i$ and finally
$D_i = \hat\psi^\star_i - \hat\psi$.

In the exceptional case where $H^{-1}_\theta(\alpha)$ is independent of θ, we have $H^{-1}_{\hat\theta}(\alpha) =
H^{-1}_\theta(\alpha) = H^{-1}(\alpha)$ and then the resulting confidence bounds have indeed
exact coverage probabilities, if we allow $B \to \infty$.
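A minimal sketch of Hall's percentile upper bound from the differences $D_i$ (illustrative inputs; np.quantile interpolates between bracketing order statistics, which differs only in minor detail from the interpolation convention above):

import numpy as np

def hall_upper_bound(psi_star, psi_hat, alpha=0.05):
    """Hall's percentile upper bound: psi_hat - H^{-1}_theta_hat(alpha)."""
    # D_i = psi*_i - psi_hat; their alpha-quantile approximates H^{-1}_theta_hat(alpha)
    d = np.asarray(psi_star) - psi_hat
    return psi_hat - np.quantile(d, alpha)

rng = np.random.default_rng(7)
psi_star = rng.chisquare(df=9, size=2000) / 10.0   # stand-in bootstrap sample
print(hall_upper_bound(psi_star, psi_hat=0.9, alpha=0.05))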
The basic idea behind this method is to form some kind of pivot, i.e., a
function of the data and the parameter of interest, which has a distribution
independent of θ. This would be successful if indeed Hθ did not depend on
θ. The distribution of ψb will typically depend on θ, but it is hoped that
it depends on θ only through ψ = ψ(θ). Further, it is hoped that this
dependence is of a special form, namely that the distribution of ψb depends
on ψ only as a location parameter, so that the distribution of ψb − ψ is free
of any unknown parameters.
Treating ψ as a location parameter is often justifiable on asymptotic grounds,
i.e., for large samples, but may be very misplaced in small samples. In small
samples there is really no compelling reason for focussing on the location
pivot ψb − ψ as a general paradigm. For example, in the normal variance ex-
ample discussed earlier and revisited below it would be much more sensible to
consider the scale pivot σb 2 /σ 2 instead of the location pivot σb 2 −σ 2 . Similarly,
when dealing with a random sample from the bivariate normal population
of Example 4, parametrized by θ = (µ1 , µ2 , σ1 , σ2 , ρ) and with the correla-
tion coefficient ρ = ψ(θ) as the parameter of interest, it would make little
sense, except in very large samples, to treat ρ as a location parameter for the
maximum likelihood estimate $\hat\rho$.
The focus on $\hat\psi - \psi$ as the proper pivot for Hall's percentile method is mainly
justified on asymptotic grounds. The reason for this is that most theoretical
bootstrap research has focused on the large sample aspects of the various
bootstrap methods.
For other pivots one would have to make appropriate modifications in Hall's
percentile method. This is presented quite generally in Beran (1987) and we
will illustrate it here with the scale pivot $\hat\psi/\psi$, where it is assumed that the
parameter ψ is positive. Suppose the distribution function of $\hat\psi/\psi$ is $H_\theta(x)$;
then
$$1-\alpha = P_\theta\left(\hat\psi/\psi > H^{-1}_\theta(\alpha)\right) = P_\theta\left(\psi < \hat\psi/H^{-1}_\theta(\alpha)\right)$$
and replacing the unknown $H^{-1}_\theta(\alpha)$ by $H^{-1}_{\hat\theta}(\alpha)$ gives us the Beran/Hall
percentile method upper bound for ψ, namely
$$\hat\psi_{HU} = \hat\psi\big/H^{-1}_{\hat\theta}(\alpha)\,.$$
From now on, when no further qualifiers are given, it is assumed that a lo-
cation pivot was chosen in Hall’s percentile method. This simplifies matters,
especially since it is not always easy to see what kind of pivot would be most
appropriate in any given situation, the above normal correlation example
being a case in point. Since large sample considerations give some support
to location pivots, this default is quite natural.
Again we should remind ourselves that this analytic form of $H^{-1}_{\hat\theta}(\alpha)$ is not
required in order to compute the upper bound via the Hall percentile method.
However, it facilitates the analysis of the coverage rates of the method in this
example. This coverage rate can be expressed as
$$P_\theta\left(\hat\sigma^2_{HU} \ge \sigma^2\right)
= P_\theta\left(\hat\sigma^2\left(2 - \frac{\chi^{-1}_{n-1}(\alpha)}{n}\right) \ge \sigma^2\right)
= P\left(V \ge \frac{n^2}{2n - \chi^{-1}_{n-1}(\alpha)}\right)
= 1 - \chi_{n-1}\left(\frac{n^2}{2n - \chi^{-1}_{n-1}(\alpha)}\right).$$

$$G_\theta(\psi + x) = 1 - G_\theta(\psi - x) \quad\text{for all } x.$$
Solving
$$1 - \alpha = H_{\hat\theta}(x) = G_{\hat\theta}(x + \hat\psi)$$
[Figure 1a: Actual − Nominal Coverage Probability of 95% Upper & Lower Bounds.]
Hall's percentile upper bound is
where the second equality results from the assumed symmetry of $H_{\hat\theta}$. Making
use of the dual representation of the above $x_{1-\alpha}$ we find
$$\hat\psi_{HU} = \hat\psi + G^{-1}_{\hat\theta}(1-\alpha) - \hat\psi = G^{-1}_{\hat\theta}(1-\alpha)\,,$$
of the bootstrap distribution of $\bar X^\star$. This distribution is $N(\hat\mu, \hat\sigma^2/n)$ and its
$(1-\alpha)$-quantile is
$$\hat\mu_{zU}(1-\alpha) = \bar X + z_{1-\alpha}\,\frac{\hat\sigma}{\sqrt n} \quad\text{with } z_{1-\alpha} = \Phi^{-1}(1-\alpha)\,.$$
Hence Efron's percentile method results in the same bound as in Section 3.1.2,
with the only difference being that the previously assumed known $\sigma_0$ is
replaced by the estimate $\hat\sigma$. The multiplier $z_{1-\alpha}$ remains unchanged. Compare
this with the classical upper confidence bound given by
$$\hat\mu_{tU}(1-\alpha) = \bar X + t_{n-1}(1-\alpha)\,\frac{\hat\sigma}{\sqrt n}\sqrt{\frac{n}{n-1}} = \bar X + t_{n-1}(1-\alpha)\,\frac{s}{\sqrt n}\,,$$

$$\hat\mu^\star - \hat\mu = \bar X^\star - \bar X$$

$$\hat\mu_{HU}(1-\alpha) = \hat\mu - H^{-1}_{\hat\theta}(\alpha) = \bar X - z_\alpha\,\frac{\hat\sigma}{\sqrt n} = \bar X + z_{1-\alpha}\,\frac{\hat\sigma}{\sqrt n} = \hat\mu_{zU}(1-\alpha)\,.$$
the above formula indicates that the percentile methods act as though $\hat\sigma$
is equal to the true (unknown) standard deviation σ, in which case the use
of the z-factor would be most appropriate. Since $\hat\sigma$ varies around σ from
sample to sample, this sampling variation needs to be accounted for in setting
confidence bounds.

The percentile-t method carries the pivoting step of Hall's percentile method
(of bootstrapping $\bar X - \mu$) one step further by considering a Studentized pivot
$$T = \frac{\bar X - \mu}{\hat\sigma}\,.$$
If we knew the distribution function $K_\theta(x)$ of T we could obtain a $(1-\alpha)$-level
upper confidence bound for µ as
$$\bar X - K^{-1}_\theta(\alpha)\,\hat\sigma\,,$$
since
$$1-\alpha = P\left(\frac{\bar X - \mu}{\hat\sigma} \ge K^{-1}_\theta(\alpha)\right) = P\left(\bar X - K^{-1}_\theta(\alpha)\,\hat\sigma \ge \mu\right).$$
The subscript θ on $K^{-1}_\theta(\alpha)$ allows for the possibility that the distribution
of T may still depend on θ. In this particular example K is independent of
θ and thus $\bar X - K^{-1}(\alpha)\,\hat\sigma$ is an exact $(1-\alpha)$-level upper confidence bound
for µ. To obtain $K^{-1}(\alpha)$ we can either appeal to tables of the Student-t
distribution, because for this particular example we know that
$$K^{-1}(\alpha) = t_{n-1}(\alpha)\big/\sqrt{n-1} = -t_{n-1}(1-\alpha)\big/\sqrt{n-1}\,,$$
or, in a more generic approach, we can simulate the distribution K of T by
generating samples from $N(\mu,\sigma^2)$ for any $\theta = (\mu,\sigma)$, since in this example
K is not sensitive to the choice of θ. However, for reasons to be explained in
the next section, we may as well simulate independent samples $\mathbf X^\star_1,\ldots,\mathbf X^\star_B$
from $N(\hat\mu,\hat\sigma^2)$ and generate $T^\star_1,\ldots,T^\star_B$ with $T^\star_i = (\bar X^\star_i - \bar X)/\hat\sigma^\star_i$ computed
from the ith bootstrap sample $\mathbf X^\star_i$. For very large B this simulation process
will approximate the bootstrap distribution $\hat K = K$ of
$$T^\star = \frac{\bar X^\star - \bar X}{\hat\sigma^\star}\,.$$
The percentile-t method constructs the $(1-\alpha)$-level upper confidence bound
as
$$\hat\mu_{tU} = \bar X - \hat K^{-1}(\alpha)\,\hat\sigma\,.$$
For $\ell = \alpha B$ we can consider the $\ell$th ordered value of $T^\star_{(1)} \le \ldots \le T^\star_{(B)}$, namely
$T^\star_{(\ell)}$, as an excellent approximation to $\hat K^{-1}(\alpha)$. When $\alpha B$ is not an integer
one does the usual interpolation of the appropriate adjacent ordered values
$T^\star_{(k)}$ and $T^\star_{(k+1)}$.
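A minimal sketch of the percentile-t upper bound for the normal mean example, assuming maximum likelihood estimates and illustrative values for the sample and B:

import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(5.0, 3.0, size=12)            # hypothetical observed sample
n = len(x)
mu_hat, sigma_hat = x.mean(), x.std(ddof=0)  # MLEs as in the example above

B, alpha = 4000, 0.05
t_star = np.empty(B)
for i in range(B):
    xs = rng.normal(mu_hat, sigma_hat, size=n)         # parametric resample
    t_star[i] = (xs.mean() - mu_hat) / xs.std(ddof=0)  # T_i* = (Xbar*_i - Xbar)/sigma*_i

# K_hat^{-1}(alpha) from the T* values, then the percentile-t bound
k_inv_alpha = np.quantile(t_star, alpha)
mu_tU = mu_hat - k_inv_alpha * sigma_hat
print(mu_tU)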
By bootstrapping the distribution of the Studentized ratio T we hope that
we capture to a large extent the sampling variability of the scale estimate
used in the denominator of T. That this may not be completely successful is
reflected in the possibility that the distribution $K_\theta$ of T may still depend on
θ.
The above discussion gives rise to a small excursion, which is not an integral
part of the percentile-t method, but represents a rough substitute for it. Since
$\bar X^\star_{(m)} \approx \hat\mu_{zU}(1-\alpha)$, Efron (1982) considered the following t-factor patch to
the Efron percentile method, namely
$$\bar X + \frac{t_{n-1}(1-\alpha)}{z_{1-\alpha}}\sqrt{\frac{n}{n-1}}\left(\bar X^\star_{(m)} - \bar X\right),$$
with $m = (1-\alpha)B$. This patched version of the Efron percentile method
upper bound is approximately equal to the above $\hat\mu_{tU}(1-\alpha)$, as is seen from
$$\bar X^\star_{(m)} \approx \hat\mu_{zU}(1-\alpha) = \bar X + z_{1-\alpha}\,\frac{\hat\sigma}{\sqrt n}$$
$$\Rightarrow\quad \bar X^\star_{(m)} - \bar X \approx z_{1-\alpha}\,\frac{\hat\sigma}{\sqrt n}$$
$$\Rightarrow\quad \frac{t_{n-1}(1-\alpha)}{z_{1-\alpha}}\sqrt{\frac{n}{n-1}}\left(\bar X^\star_{(m)} - \bar X\right) \approx t_{n-1}(1-\alpha)\,\frac{\hat\sigma}{\sqrt n}\sqrt{\frac{n}{n-1}}$$
and thus
$$\bar X + \frac{t_{n-1}(1-\alpha)}{z_{1-\alpha}}\sqrt{\frac{n}{n-1}}\left(\bar X^\star_{(m)} - \bar X\right) \approx \hat\mu_{tU}(1-\alpha)\,.$$
This idea of patching the Efron percentile method can be applied to other
situations as well, especially when estimates are approximately normal. The
effect is to widen the bounds in order to roughly protect the coverage confi-
dence. In this particular example the patch works perfectly in that it results
in the classical bound. The patch is easily applied, provided we have a reason-
able idea of the degrees of freedom to use in the t-factor correction. However,
Efron (1982) warns against its indiscriminate use. Note also that in apply-
ing the patch we lose the transformation equivariance of Efron’s percentile
method.
4.4.2 General Definition

Suppose $\mathbf X \sim P_\theta$ and we are interested in confidence bounds for the real
valued functional $\psi = \psi(\theta)$. We also have available the estimate $\hat\theta$ of θ and
estimate ψ by $\hat\psi = \psi(\hat\theta)$. Furthermore, it is assumed that we have some scale
estimate $\hat\sigma_{\hat\psi}$, so that we can define the Studentized pivot
$$T = \frac{\hat\psi - \psi}{\hat\sigma_{\hat\psi}}\,.$$
In order for T to be a pivot in the strict sense, its distribution would have to
be independent of any unknown parameters. This is not assumed here, but
if this distribution $K_\theta$ depends on θ, it is hoped that it does so only weakly.
The $(1-\alpha)$-level percentile-t upper bound for ψ is defined as $\hat\psi_{tU} = \hat\psi - \hat K^{-1}(\alpha)\,\hat\sigma_{\hat\psi}$,
where $\hat K$ denotes the simulated bootstrap distribution of
$$T^\star = \frac{\hat\psi^\star - \hat\psi}{\hat\sigma^\star_{\hat\psi}}\,.$$
This is done by simulating samples $\mathbf X^\star_1,\ldots,\mathbf X^\star_B$ from $P_{\hat\theta}$, generating $T^\star_1,\ldots,T^\star_B$,
with
$$T^\star_i = \frac{\hat\psi^\star_i - \hat\psi}{\hat\sigma^\star_{\hat\psi i}}$$
computed from the ith bootstrap sample $\mathbf X^\star_i$. For $\ell = \alpha B$ take the $\ell$th ordered
value $T^\star_{(\ell)}$ of the order statistics
$$T^\star_{(1)} \le \ldots \le T^\star_{(B)}$$
of the true, but unknown value of θ, and thus $K^{-1}_{\hat\theta}(\alpha)$ is likely to be more
relevant than taking any value of θ in $K^{-1}_\theta(\alpha)$ and solely appealing to the
insensitivity of $K_\theta$ with respect to θ.
The above definition of percentile-t bounds is for upper bounds, but by
switching from α to 1 − α we are covering 1 − α lower bounds as well.
Combining (1 − α)-level upper and lower bounds we obtain (1 − 2α)-level
confidence intervals for ψ.
5 Double Bootstrap Confidence Bounds
This section introduces two closely related double bootstrap methods for
constructing confidence bounds. Single bootstrapping amounts to generat-
ing B bootstrap samples, where B is quite large, typically B = 1, 000, and
computing estimates for each such bootstrap sample. In double bootstrap-
ping each of these B bootstrap samples spawns itself a set of A second order
bootstrap samples. Thus, all in all, A · B samples will be generated with
the attending data analyses to compute estimates, typically A · B + B of
them. If A = B = 1000 this amounts to 1, 001, 000 such analyses and is thus
computationally very intensive. This is a high computational price to pay,
especially when the computation of the estimate $\hat\theta(\mathbf X)$ is costly to begin with. If
that cost grows with the sample size of $\mathbf X$, one may want to limit its use
to analyses involving small sample sizes, but that is the area where coverage
improvement makes most sense anyway. Before these methods will be used
routinely, progress will need to be made in computational efficiency. We hope
that some time soon clever algorithms will be found that reduce the effort of
A · B simulations to k · B, where k is of the order of ten. Such a reduction
would make these double bootstrap methods definitely the preferred choice
as a general tool for constructing confidence bounds.
It appears that methods based on double bootstrap approaches are most suc-
cessful in maintaining the intended coverage rates for the resulting confidence
bounds. A first application of the double bootstrap method to confidence
bounds surfaced in the last section when discussing the possibility of boot-
strap scale estimates to be used in the bootstrapped Studentized pivots of
the percentile-t method. Here we first discuss Beran’s (1987) method, which
is based on the concept of a root (a generalization of the pivot concept) and
the prepivoting idea. The latter invokes an estimated probability integral
transform in order to obtain improved pivots, which then are bootstrapped.
It is shown that Beran’s method is equivalent to Loh’s (1987) calibration
of confidence coefficients. This calibration uses the bootstrap method to
estimate the coverage error with the aim of correcting for it. The second it-
erated bootstrap method, proposed by Scholz (1992), automatically finds the
proper natural pivot when such pivots exist. This yields confidence bounds
with essentially exact coverage whenever these are possible.
5.1 Prepivot Bootstrap Methods
This subsection introduces the concept of a root, motivates the use of roots
by showing how confidence bounds are derived from special types of roots,
namely from exact pivots. Then single bootstrap confidence bounds, based
on roots, are introduced and seen to be a simple extension of Hall’s percentile
method. These confidence sets are based on an estimated probability integral
transform. This transform can be iterated, which suggests the prepivoting
step. The effect of this procedure is examined analytically in a special exam-
ple, where it results in exact coverage. Since such an analysis is not always
feasible, it is then shown how to accomplish the same by an iterated bootstrap
simulation procedure. This is concluded with remarks about the improved large
sample properties of the prepivot methods and with some critical comments.
then $C(\mathbf X, 1-\alpha)$ can be considered a $(1-\alpha)$-level confidence set for ψ. This
results from

where $\hat\psi = \psi(\hat\theta)$. Note that we have replaced all appearances of θ by $\hat\theta$, i.e.,
in the distribution $P_{\hat\theta}$ generating the bootstrap samples $\mathbf X^\star_i$ and in $\hat\psi = \psi(\hat\theta)$.
For large B this bootstrap sample of roots will give an accurate description
of $F_{\hat\theta}(\cdot)$, namely
$$\frac{1}{B}\sum_{i=1}^B I_{[R(\mathbf X^\star_i,\hat\psi)\le x]} \longrightarrow F_{\hat\theta}(x) \quad\text{as } B\to\infty\,.$$
By sorting the bootstrap sample of roots we can, by the usual process, get a
good approximation to the quantile $r_{1-\alpha}(\hat\theta)$, which is defined by
$$F_{\hat\theta}\left(r_{1-\alpha}(\hat\theta)\right) = 1-\alpha \quad\text{or}\quad r_{1-\alpha}(\hat\theta) = F^{-1}_{\hat\theta}(1-\alpha)\,.$$
The second representation of CB shows that the construction of the confi-
dence set appeals to the probability integral transform. Namely, for con-
tinuous Fθ the random variable U = Fθ (R) has a uniform distribution on
the interval (0, 1) and then $P(U \le 1-\alpha) = 1-\alpha$. Unfortunately, we can
only use the estimated probability integral transform $\hat U = F_{\hat\theta}(R)$, and $\hat U$ is
no longer distributed uniformly on (0, 1). In addition, its distribution will
usually still depend on θ. However, the distribution of $\hat U$ should approximate
that of U(0, 1).

The above method for bootstrap confidence sets is nothing but Hall's percentile
method, provided we take as root the location root
$$R(\mathbf X, \psi) = \hat\psi - \psi = \hat\psi(\mathbf X) - \psi\,.$$
Thus the above bootstrap confidence sets based on roots represent an extension
of Hall's percentile method to other than location roots.
as another root and in applying the above bootstrap confidence set process
with $R_1(\mathbf X,\psi)$ as root. Note that $R_1(\mathbf X,\psi)$ depends on $\mathbf X$ in two ways, once
through $\hat\theta = \hat\theta(\mathbf X)$ in $F_{\hat\theta}$ and once through $\mathbf X$ in $R(\mathbf X,\psi)$. We denote the
distribution function of $R_1$ by $F_{1\theta}$. It is worthwhile to point out again the
double dependence of $F_{1\theta}$ on θ, namely through $P_\theta$ and $\psi(\theta)$ in
$$F_{1\theta}(x) = P_\theta\left(R_1\left(\mathbf X, \psi(\theta)\right) \le x\right).$$
The formal bootstrap procedure consists in estimating $F_{1\theta}(x)$ by $F_{1\hat\theta}(x)$, i.e.,
by replacing θ with $\hat\theta$. When the functional form of $F_{1\theta}$ is not known, one
resorts again to simulation, as will be explained later.
Denoting the $(1-\alpha)$-quantile of $F_{1\hat\theta}(x)$ by
$$r_{1,1-\alpha}(\hat\theta) = F^{-1}_{1\hat\theta}(1-\alpha)$$
with nominal confidence level $1-\alpha$. Again, the second form of the confidence
set $C_{1B}(\mathbf X, 1-\alpha)$ shows the appeal to the estimated probability integral
transform, since $F_{1\theta}\left(R_1(\mathbf X,\psi)\right)$ is exactly U(0, 1), provided $F_{1\theta}$ is continuous.
Actually we are dealing with a repeated estimated probability integral transform,
since $R_1$ already represented such a transform. It is hoped that this
repeated transform
$$R_2(\mathbf X,\psi) = F_{1\hat\theta}\left(R_1(\mathbf X,\psi)\right) = F_{1\hat\theta}\left(F_{\hat\theta}\left(R(\mathbf X,\psi)\right)\right)$$

to choose the original nominal confidence level in the definition of $C_B$, now
denoted by $1-\alpha_1$, such that the estimated exact coverage of $C_B(\mathbf X, 1-\alpha_1)$
becomes $1-\alpha$, i.e.,
$$F_{1\hat\theta}(1-\alpha_1) = 1-\alpha \quad\text{or}\quad 1-\alpha_1 = F^{-1}_{1\hat\theta}(1-\alpha)\,.$$
$$\psi = \psi(\theta) = \psi(\mu,\sigma) = \mu$$
$$R(\mathbf X,\mu) = \hat\mu - \mu = \bar X - \mu\,.$$
Analytically $F_\theta$ is found to be
$$F_\theta(x) = P_\theta\left(\bar X - \mu \le x\right) = \Phi\left(\frac{\sqrt n\, x}{\sigma}\right).$$
This leads to
$$R_1(\mathbf X,\mu) = F_{\hat\theta}\left(R(\mathbf X,\mu)\right) = \Phi\left(\frac{\sqrt n\, R(\mathbf X,\mu)}{\hat\sigma}\right) = \Phi\left(\frac{\sqrt n\,(\bar X - \mu)}{\hat\sigma}\right).$$
Since
$$T_{n-1} = \frac{\sqrt n\,(\bar X - \mu)}{\hat\sigma\sqrt{n/(n-1)}} = \frac{\sqrt n\,(\bar X - \mu)}{S} \sim G_{n-1}\,,$$
where $G_{n-1}$ represents the Student-t distribution function with $n-1$ degrees
of freedom, we find that
$$F_{1\theta}(x) = P_\theta\left(R_1(\mathbf X,\mu) \le x\right)
= P\left(\Phi\left(\sqrt{n/(n-1)}\, T_{n-1}\right) \le x\right)
= G_{n-1}\left(\sqrt{(n-1)/n}\,\Phi^{-1}(x)\right),$$
leading to the classical lower confidence bound for µ, with exact coverage
1 − α. Of course, the above derivation appears rather convoluted in view of
the usual straightforward derivation of the classical bounds. This convoluted
process is not an intrinsic part of the prepivoting method and results only
from the analytical tracking of the prepivoting method. When prepivoting is
done by simulation, see Section 5.1.7, the derivation of the confidence bounds
is conceptually more straightforward, and all the work is in the simulation
effort.
A similar convoluted exercise, still in the context of Example 2 and using the
prepivot method with the location root $R(\mathbf X,\sigma^2) = \hat\sigma^2 - \sigma^2$, leads to the lower
bound $n\hat\sigma^2/\chi^2_{n-1}(1-\alpha)$. This coincides with the classical lower confidence
bound for $\sigma^2$, with exact coverage $1-\alpha$. Here $\chi^2_{n-1}(1-\alpha)$ is the $(1-\alpha)$-quantile
of the chi-square distribution with $n-1$ degrees of freedom. Matters
would have been even better had we used the scale pivot $R(\mathbf X,\sigma^2) = \hat\sigma^2/\sigma^2$
instead. In that case the simple bootstrap confidence set $C_B(\mathbf X, 1-\alpha)$
would immediately lead to the classical bounds and bootstrap iteration would
not be necessary. This particular example shows that a good choice of root
can definitely improve matters.
It turns out that the above examples can be generalized and in doing so the
demonstration of the exact coverage property becomes greatly simplified.
However, the derivation of the confidence bounds themselves may still be
complicated.
The exact coverage in both the above examples is just a special case of the
following general result. In our generic setup let us further assume that
$$R_1(\mathbf X,\psi) = F_{\hat\theta}\left(R(\mathbf X,\psi)\right)$$
is an exact pivot with continuous distribution function $F_1$, which is independent
of θ. This pivot assumption is satisfied in both our previous normal
examples and it is the reason behind the exact coverage there as well as in
this general case. Namely,
$$C_{1B}(\mathbf X, 1-\alpha) = \left\{\psi : F_1\left(F_{\hat\theta}\left(R(\mathbf X,\psi)\right)\right) \le 1-\alpha\right\}$$
has exact coverage since
$$U = F_1\left(F_{\hat\theta}\left(R(\mathbf X,\psi)\right)\right) \sim U(0,1)\,.$$
55
Prepivoting by simulation proceeds by generating bootstrap samples X*_1, ..., X*_B from P_θ̂ and computing the roots R_1(X*_i, ψ(θ̂)), where we postpone for the moment the discussion of how to compute each such root. Note that θ̂ has taken the place of θ in ψ(θ̂) and in P_θ̂, which generated the bootstrap samples.
By the LLN we have that
$$\frac{1}{B}\sum_{i=1}^B I_{\left[R_1(X_i^\star,\,\psi(\hat\theta)) \le x\right]} \longrightarrow F_{1\hat\theta}(x) \quad\text{as } B \to \infty.$$
As for the computation of each R_1(X*_i, ψ(θ̂)), we will need to employ a second level of bootstrap sampling. Recall that R_1(X, ψ) = F_{θ̂(X)}(R(X, ψ)) and thus
$$R_1(X_i^\star, \psi(\hat\theta)) = F_{\hat\theta_i^\star}\left(R(X_i^\star, \psi(\hat\theta))\right),$$
where θ̂*_i = θ̂(X*_i), with X*_i generated by P_θ̂.
For any θ̂*_i generate a second level bootstrap sample
$$X_{ij}^{\star\star}, \quad j = 1, \ldots, A,$$
from P_{θ̂*_i}, and thus, by the LLN,
$$\hat R_{1i} = \frac{1}{A}\sum_{j=1}^A I_{\left[R(X_{ij}^{\star\star},\,\psi(\hat\theta_i^\star)) \le R(X_i^\star,\,\psi(\hat\theta))\right]} \longrightarrow F_{\hat\theta_i^\star}\left(R(X_i^\star, \psi(\hat\theta))\right) = R_1(X_i^\star, \psi(\hat\theta)) \quad\text{as } A \to \infty.$$
Thus, for large A we can consider R̂_{1i} a good approximation to R_1(X*_i, ψ(θ̂)).
In the same vein we can, for large B and A, consider
$$\hat R_2(x) = \frac{1}{B}\sum_{i=1}^B I_{[\hat R_{1i} \le x]} \approx \frac{1}{B}\sum_{i=1}^B I_{\left[R_1(X_i^\star,\,\psi(\hat\theta)) \le x\right]}$$
as a good approximation of F_{1θ̂}(x). In particular, by sorting the R̂_{1i} we can obtain their (1 − α)-quantile γ̂ = r̂_1(1 − α) by the usual method and treat it as a good approximation for γ = F_{1θ̂}^{-1}(1 − α). Sorting the first level bootstrap sample
$$R(X_1^\star, \hat\psi), \ldots, R(X_B^\star, \hat\psi)$$
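A compact sketch of this nested simulation, specialized to the normal-mean example above, is given below. It is only an illustration: the data and the choices of B and A are assumptions, and the final inversion step simply applies the definition of C_{1B} with the estimated quantile γ̂.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=10)              # observed sample (illustrative)
n, alpha, B, A = len(x), 0.05, 2000, 2000
mu_hat, sig_hat = x.mean(), x.std()            # MLEs of (mu, sigma)

r1 = np.empty(B)                               # prepivoted roots R_hat_1i
fr = np.empty(B)                               # first level roots R(X_i*, mu_hat)
for i in range(B):
    xs = rng.normal(mu_hat, sig_hat, size=n)   # first level sample from P_theta_hat
    mu_s, sig_s = xs.mean(), xs.std()
    fr[i] = mu_s - mu_hat
    xss = rng.normal(mu_s, sig_s, size=(A, n)) # second level samples from P_theta_hat_i*
    r1[i] = np.mean(xss.mean(axis=1) - mu_s <= fr[i])

gamma_hat = np.quantile(r1, 1 - alpha)         # estimates F_{1 theta_hat}^{-1}(1 - alpha)
# C_1B = { mu : R(X, mu) <= F_theta_hat^{-1}(gamma_hat) }; approximate the quantile
# of F_theta_hat by the gamma_hat-quantile of the first level roots
mu_lower = mu_hat - np.quantile(fr, gamma_hat)
print(mu_lower)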
When ψ = ρ is the bivariate normal correlation in Example 4, for instance, one can hardly treat ρ as a location or scale parameter, and either of the above two roots is inappropriate. There is of course a natural pivot for ρ, but it is very complicated and difficult to compute. The next section presents a modification of the double bootstrap method which gets around the need to choose a root by automatically generating a canonical root as part of the process.
(ii) R(ψ̂, ψ) is increasing in ψ̂ for fixed ψ.
Note that these assumptions do not preclude the presence of nuisance param-
eters. However, the role of such nuisance parameters is masked in that they
neither appear in the pivot nor influence its distribution F . As such, these
parameters are not really a nuisance. The following two examples satisfy the
above assumptions and in both cases nuisance parameters are present in the
model.
In the first example we revisit Example 4. Here we are interested in confidence bounds for the correlation coefficient ψ = ψ(θ) = ρ. Fortuitously, the distribution function H_ρ(r) of the maximum likelihood estimate ρ̂ is continuous, depends only on the parameter ρ, and is monotone decreasing in ρ for fixed r (see Lehmann 1986, p. 340). Further,
$$R(\hat\rho, \rho) = H_\rho(\hat\rho) \sim U(0,1)$$
is a pivot. Thus (i) and (ii) are satisfied. This example has been examined extensively in the literature, and Hall (1992) calls it the “smoking gun” of bootstrap methods, i.e., any good bootstrap method had better perform reasonably well on this example. The percentile-t method, for example, fails spectacularly here, mainly because Studentizing does not produce a pivot in this case. This question was raised by Reid (1981) in the discussion of Efron (1981).
In the second example we revisit Example 2. Here we are interested in
confidence bounds on ψ = ψ(θ) = σ 2 . Using again maximum likelihood
estimates we have that
$$R(\hat\psi, \psi) = \frac{\hat\psi}{\psi} = \frac{\hat\sigma^2}{\sigma^2}$$
is a pivot and satisfies (i) and (ii).
If we know the pivot distribution function F and the functional form of R, we can construct exact confidence bounds for ψ as follows. From
$$P_\theta\left(R(\hat\psi, \psi) \le F^{-1}(1-\alpha)\right) = 1-\alpha$$
we obtain ψ̂_L by solving R(ψ̂, ψ) = F^{-1}(1 − α) for ψ. Hence we have in ψ̂_L an exact 100(1 − α)% lower confidence bound for ψ. The dependence of ψ̂_L on F and R is apparent.
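As a small illustration (not from the report; the data are made up), the construction above applied to the σ² example gives the classical bound, since the pivot σ̂²/σ² is distributed as χ²_{n−1}/n:

import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 6.0, 4.8, 5.9, 5.2, 4.4, 6.3])   # illustrative data
n, alpha = len(x), 0.05
sig2_hat = x.var()                          # MLE of sigma^2 (divides by n)
# F is the c.d.f. of chi^2_{n-1}/n, so F^{-1}(1-alpha) = chi2_{n-1}(1-alpha)/n;
# solving sig2_hat / sigma^2 = F^{-1}(1-alpha) for sigma^2 gives
sig2_L = n * sig2_hat / stats.chi2.ppf(1 - alpha, df=n - 1)
print(sig2_L)                               # classical lower bound for sigma^2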
It turns out that it is possible in principle to get the same exact confidence
bound without knowing F or R, as long as they exist. This is done at the
expense of performing the double bootstrap. Here exactness holds provided
both bootstrap simulation sample sizes tend to infinity.
The procedure is as follows. First obtain a bootstrap sample of estimates
ψ̂*_1, ..., ψ̂*_B by the usual process from P_θ̂. By the LLN we have
$$\hat G_B(y \,|\, \hat\theta) = \frac{1}{B}\sum_{i=1}^B I_{[\hat\psi_i^\star \le y]} \longrightarrow P_{\hat\theta}\left(\hat\psi^\star \le y\right) \quad\text{as } B \to \infty.$$
Using this empirical distribution function Ĝ_B(y | θ̂) we are able to approximate P_θ̂(ψ̂* ≤ y) to any accuracy by just taking B large enough. With the understanding of this approximation we will thus use Ĝ_B(y | θ̂) and P_θ̂(ψ̂* ≤ y) interchangeably.
From monotonicity property (ii) we then have
$$P_{\hat\theta}\left(\hat\psi^\star \le y\right) = P_{\hat\theta}\left(R(\hat\psi^\star, \hat\psi) \le R(y, \hat\psi)\right) = F\left(R(y, \hat\psi)\right).$$
Next, given a value θ̂*_i and ψ̂*_i = ψ(θ̂*_i), we obtain a second level bootstrap sample of estimates
$$\hat\psi_{i1}^{\star\star}, \ldots, \hat\psi_{iA}^{\star\star} \quad\text{from } P_{\hat\theta_i^\star}.$$
From each such second level sample we compute Ĝ_{1A}(ψ̂ | θ̂*_i) = (1/A) Σ_j I[ψ̂**_ij ≤ ψ̂], i = 1, ..., B, and regard these values as an equivalent proxy for
$$F\left(R(\hat\psi, \hat\psi_1^\star)\right), \ldots, F\left(R(\hat\psi, \hat\psi_B^\star)\right).$$
Sorting these values we find the (1 − α)-quantile by the usual process. The corresponding ψ̂* = ψ̂*_i = ψ̂*_L approximately solves
$$F\left(R(\hat\psi, \hat\psi^\star)\right) \approx 1 - \alpha.$$
This value ψ̂*_L is approximately the same as our previous ψ̂_L, provided A and B are sufficiently large.
The above procedure can be reduced to the following: find that value θ̂* and ψ̂* = ψ(θ̂*) for which
$$P_{\hat\theta^\star}\left(\hat\psi^{\star\star} \le \hat\psi\right) = F\left(R(\hat\psi, \hat\psi^\star)\right) \approx 1 - \alpha.$$
This is then iterated by trying new values of ψ̂*, i.e., ψ̂*_1, ψ̂*_2, .... Since F(R(ψ̂, ψ̂*)) is decreasing in ψ̂*, one should be able to employ efficient root finding algorithms for solving
$$F\left(R(\hat\psi, \hat\psi^\star)\right) = 1 - \alpha,$$
i.e., use far fewer than the originally indicated AB bootstrap samples. It seems reasonable that kA samples will be sufficient, with A ≈ 1000 and k ≈ 10 to 20.
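The following sketch illustrates this root-finding view for the σ² example, where the answer is known to be the classical bound. Since the pivot σ̂²/σ² does not involve µ, the sketch keeps µ fixed at X̄ while searching over ψ* = σ*²; the data, the second-level size A, the bracketing interval, and the use of common random numbers are all assumptions of the illustration, not prescriptions of the report.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
x = rng.normal(10.0, 3.0, size=12)              # observed sample (illustrative)
n, alpha, A = len(x), 0.05, 20_000
psi_hat = x.var()                               # MLE of sigma^2 (divides by n)

z = rng.standard_normal((A, n))                 # common random numbers, reused below
def coverage_at(psi_star):
    """Monte Carlo estimate of P_{theta*}(psi_hat** <= psi_hat)."""
    xss = x.mean() + np.sqrt(psi_star) * z      # second level samples from N(x-bar, psi*)
    return np.mean(xss.var(axis=1) <= psi_hat)

# solve coverage_at(psi*) = 1 - alpha; the solution is the lower bound psi_L*
psi_L = optimize.brentq(lambda p: coverage_at(p) - (1 - alpha),
                        1e-6 * psi_hat, 50.0 * psi_hat)

# classical lower bound n * sigma_hat^2 / chi2_{n-1}(1 - alpha), for comparison
print(psi_L, n * psi_hat / stats.chi2.ppf(1 - alpha, df=n - 1))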
Note that in this procedure we only need to evaluate Ĝ_{1A}(ψ̂ | θ̂*_i), which in turn only requires that we know how to evaluate the estimates ψ̂, ψ̂*, or ψ̂**. No knowledge of the pivot function R or its distribution function F is required. For the previously discussed bivariate normal correlation example there exists a tame pivot. Therefore we can either obtain the exact confidence bound through the above bootstrap process, via massive simulation of computationally simple evaluations of ρ̂, or instead use the computationally difficult analytical process of evaluating the distribution function H_ρ(x) of ρ̂ and solving
$$H_\rho(\hat\rho) = \alpha$$
Motivated by the probability integral transform result that D_{ψ,η}(ψ̂) ∼ U(0,1) for continuous D_{ψ,η}, where D_{ψ,η}(y) = P_{ψ,η}(ψ̂ ≤ y) denotes the distribution function of ψ̂, we make the following general pivot assumption:

(V) D_{ψ,η̂}(ψ̂) is a pivot, i.e., has a distribution function H which does not depend on unknown parameters, and D_{ψ,η̂}(ψ̂) is decreasing in ψ for fixed ψ̂ and η̂.

In the tame pivot case of the previous section we have, for example, D_{ψ,η}(y) = F(R(y, ψ)) and
$$D_{\psi,\hat\eta}(\hat\psi) = F\left(R(\hat\psi, \psi)\right) \sim U(0,1).$$
We can think of θ as reparametrized in terms of ψ and σ, and again we use maximum likelihood estimates ψ̂ and σ̂ for ψ and σ. We have that
$$\frac{\hat\psi - \psi}{\hat\sigma} \quad\text{and}\quad \frac{\hat\psi - \psi}{\sigma}$$
are both pivots, with respective c.d.f.'s G_1 and G_2, and
$$D_\theta(y) = P_\theta\left(\hat\psi \le y\right) = G_2\left(\frac{y - \psi}{\sigma}\right).$$
Thus
$$D_{\psi,\hat\sigma}(\hat\psi) = G_2\left(\frac{\hat\psi - \psi}{\hat\sigma}\right) \sim G_2\left(G_1^{-1}(U)\right),$$
where U ∼ U(0,1).
This example generalizes easily. Assume that there is a function R(ψ̂, ψ, η) which is a pivot, i.e., has distribution function G_2, and is decreasing in ψ and increasing in ψ̂. Suppose further that R(ψ̂, ψ, η̂) is also a pivot, with distribution function G_1. Then again our general pivot assumption (V) is satisfied. This follows from
$$D_{\psi,\eta}(y) = P_{\psi,\eta}\left(\hat\psi \le y\right) = P_{\psi,\eta}\left(R(\hat\psi, \psi, \eta) \le R(y, \psi, \eta)\right) = G_2\left(R(y, \psi, \eta)\right)$$
and thus
$$D_{\psi,\hat\eta}(\hat\psi) = G_2\left(R(\hat\psi, \psi, \hat\eta)\right) = G_2\left(G_1^{-1}(U)\right).$$
Let τ = g(ψ), with g strictly increasing, and let τ̂ = g(ψ̂) be the corresponding estimate of τ. Then the above procedure applied to τ̂, with θ = (ψ, η) reparametrized to ϑ = (τ, η), yields τ̂_L = g(ψ̂_L).
This is seen as follows. Denote the reparametrized probability model by P̃_{τ,η}, which is equivalent to P_{g^{-1}(τ),η}. The distribution function of τ̂ is
$$\tilde D_{\tau,\eta}(y) = \tilde P_{\tau,\eta}(\hat\tau \le y) = \tilde P_{\tau,\eta}\left(g(\hat\psi) \le y\right) = \tilde P_{\tau,\eta}\left(\hat\psi \le g^{-1}(y)\right) = P_{g^{-1}(\tau),\eta}\left(\hat\psi \le g^{-1}(y)\right) = D_{g^{-1}(\tau),\eta}\left(g^{-1}(y)\right),$$
so that
$$\tilde D_{\tau,\hat\eta}(\hat\tau) = D_{g^{-1}(\tau),\hat\eta}\left(g^{-1}(g(\hat\psi))\right) = D_{g^{-1}(\tau),\hat\eta}(\hat\psi),$$
and the defining equation for the lower bound becomes
$$1 - \alpha = H\left(\tilde D_{\tau,\hat\eta}(\hat\tau)\right) = H\left(D_{g^{-1}(\tau),\hat\eta}(\hat\psi)\right).$$
$$(\hat\psi_{ij}^{\star\star}, \hat\eta_{ij}^{\star\star}), \quad j = 1, \ldots, A, \; i = 1, \ldots, B.$$
By the LLN, as A → ∞, we have
$$\hat D_{iA} = \frac{1}{A}\sum_{j=1}^A I_{\left[\hat\psi_{ij}^{\star\star} \le \hat\psi_i^\star\right]} \longrightarrow P_{\psi_0,\hat\eta_i^\star}\left(\hat\psi_i^{\star\star} \le \hat\psi_i^\star\right) = D_{\psi_0,\hat\eta_i^\star}\left(\hat\psi_i^\star\right) \sim H.$$
The latter distributional assertion derives from the pivot assumption (V) and from the fact that (ψ̂*_i, η̂*_i) arises from P_{ψ_0,η_0}. Again appealing to the LLN we have
$$\frac{1}{B}\sum_{i=1}^B I_{\left[D_{\psi_0,\hat\eta_i^\star}(\hat\psi_i^\star) \le y\right]} \longrightarrow H(y) \quad\text{as } B \to \infty,$$
For large A, B, N this solution is practically identical with the exact lower confidence bound ψ̂_L. If this latter process takes k iterations we will have performed AB + kN bootstrap samples. This is by no means efficient, and it is hoped that future work will make the computational aspects of this approach more practical.
5.2.3 The Prepivoting Connection
We now examine the connection to Beran's prepivoting approach and, by equivalence, also to Loh's calibrated confidence sets (see Section 5.1.5). Suppose we have a specified root function R(ψ̂, ψ) = R(ψ̂(X), ψ) with distribution function F_{ψ,η}(x). This is somewhat more special than Beran's general root concept R(X, ψ). Suppose now that the following assumption holds:

(V*) F_{ψ,η̂}(R(ψ̂, ψ)) is a pivot, and F_{ψ,η̂}(R(ψ̂, ψ)) is decreasing in ψ for fixed ψ̂ and η̂.

Then the general pivot assumption (V) is satisfied, since
$$D_{\psi,\hat\eta}(\hat\psi) = F_{\psi,\hat\eta}\left(R(\hat\psi, \psi)\right)$$
is a pivot by assumption.
When F does not depend on ψ, i.e., when the root function is successful in eliminating ψ from the distribution of R, one can replace F_{ψ,η̂}(R(ψ̂, ψ)) by
$$F_{\hat\psi,\hat\eta}\left(R(\hat\psi, \psi)\right).$$
Using the latter
as root, Beran’s prepivoting will lead to exact confidence bounds, since the
distribution of R depends only on the nuisance parameter η = σ.
In contrast, consider Example 4 with ψ = ρ. If we take the root R(ρ̂, ρ) = ρ̂ − ρ, then
$$F_\rho(x) = P_\rho(\hat\rho - \rho \le x) = H_\rho(x + \rho),$$
with H_ρ denoting the c.d.f. of ρ̂. Here the assumption (V*) is satisfied, since
$$F_\rho(\hat\rho - \rho) = H_\rho(\hat\rho)$$
is a pivot. However,
$$F_{\hat\rho}(\hat\rho - \rho) = H_{\hat\rho}(\hat\rho - \rho + \hat\rho)$$
appears not to be a pivot, although we have not verified this. This difference is mostly due to the badly chosen root. If we had taken as root R(ρ̂, ρ) = H_ρ(ρ̂), then the distinction would not arise. In fact, in that case R itself is already a pivot. However, this particular root function is not trivial, and that points out the other difference between Beran's prepivoting and the automatic double bootstrap. In the latter method no knowledge of an “appropriate” root function is required.
As a complementary example consider Example 2 with the root R = √n(s² − σ²), for the purpose of constructing confidence bounds for σ². Let χ_f denote the c.d.f. of a chi-square distribution with f degrees of freedom. Then
$$F_{\mu,\sigma^2}(x) = P_{\mu,\sigma^2}\left(\sqrt{n}(s^2 - \sigma^2) \le x\right) = \chi_{n-1}\left((n-1)\left(1 + \frac{x}{\sqrt{n}\,\sigma^2}\right)\right).$$
Clearly
$$F_{\hat\mu,\sigma^2}(R) = \chi_{n-1}\left((n-1)\left(1 + \frac{\sqrt{n}(s^2 - \sigma^2)}{\sqrt{n}\,\sigma^2}\right)\right) = \chi_{n-1}\left(\frac{(n-1)s^2}{\sigma^2}\right) \sim U(0,1)$$
is a pivot, which will lead to the classical lower bound for σ². On the other hand, the iterated root
$$R_{1,n}(\sigma^2) = F_{\hat\mu,s^2}(R) = \chi_{n-1}\left((n-1)\left(1 + \frac{\sqrt{n}(s^2 - \sigma^2)}{\sqrt{n}\,s^2}\right)\right) = \chi_{n-1}\left((n-1)\left(2 - \frac{\sigma^2}{s^2}\right)\right)$$
is a pivot as well, with distribution function
$$F_{1,n}(x) = \chi_{n-1}\left((n-1)\left(2 - \frac{\chi_{n-1}^{-1}(x)}{n-1}\right)^{-1}\right) \quad\text{for } 0 < x \le \chi_{n-1}(2(n-1)),$$
and F_{1,n}(0) = χ_{n−1}((n − 1)/2), F_{1,n}(x) = 0 for x < 0 and F_{1,n}(x) = 1 for x ≥ χ_{n−1}(2(n − 1)). For γ ≥ χ_{n−1}((n − 1)/2) the set
$$B_{1,n} = \left\{\sigma^2 : F_{1,n}(R_{1,n}) \le \gamma\right\} = \left[\,(n-1)s^2/\chi_{n-1}^{-1}(\gamma),\ \infty\right)$$
yields the classical lower confidence bound, but for γ < χ_{n−1}((n − 1)/2) the set B_{1,n} is empty. This quirk was overlooked in Beran's (1987) treatment of this example. For large n the latter case hardly occurs, unless we deal with small γ's, i.e., with upper confidence bounds.
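A quick numerical look (illustrative n values, using scipy) at the threshold χ_{n−1}((n − 1)/2), below which B_{1,n} is empty, supports the last remark:

from scipy import stats

# smallest usable gamma for B_{1,n}: chi_{n-1}((n-1)/2)
for n in (5, 10, 20, 50):
    print(n, stats.chi2.cdf((n - 1) / 2, df=n - 1))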
Consider now Example 2 with ψ = µ + z_p σ and the estimates ψ̂ = X̄ + ks and σ̂ = rs, for some known constants k and r > 0. In question here is the sensitivity of the resulting automatic double bootstrap lower bound ψ̂_L with respect to k and r. This issue is similar to, but not the same as, that of transformation equivariance.
It turns out that ψ̂_L does not depend on k or r, i.e., the result is always the same, namely the classical lower confidence bound for ψ. For example, it does not matter whether we estimate σ by the m.l.e. or by s. More remarkable is the fact that we could have started with the very biased starting estimate ψ̂ = X̄, corresponding to k = 0, with the same final lower confidence bound. It is possible that there is a general theorem hidden behind this that would more cleanly dispose of the following convoluted argument for this result. This argument fills the remainder of this section and may be skipped.
Recalling ψ = µ + z_p σ, one easily derives
$$D_{\psi,\sigma}(x) = P_{\psi,\sigma}\left(\hat\psi \le x\right) = P_{\psi,\sigma}\left(\bar X + ks \le x\right) = P_{\psi,\sigma}\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} + \frac{\sqrt{n}(\mu - x)}{\sigma} \le -ks\sqrt{n}/\sigma\right)$$
$$= G_{n-1,\,\sqrt{n}(\mu - x)/\sigma}\left(-k\sqrt{n}\right) = G_{n-1,\,\sqrt{n}(\psi - x)/\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right), \qquad (5)$$
where G_{f,δ}(x) denotes the noncentral Student-t distribution function with f degrees of freedom and noncentrality parameter δ.
Next note that
$$D_{\psi,\hat\sigma}(\hat\psi) = G_{n-1,\,\sqrt{n}(\psi - \hat\psi)/\hat\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right) = G_{n-1,\,-z_p\sqrt{n} - V/r}\left(-k\sqrt{n}\right),$$
where
$$V = \frac{\sqrt{n}(\hat\psi - \psi)}{s} = \frac{\sqrt{n}(\bar X - \mu - z_p\sigma)}{s} + k\sqrt{n} = k\sqrt{n} + T_{n-1,\,-z_p\sqrt{n}}$$
and T_{f,δ} is a random variable with distribution function G_{f,δ}(x). The distribution function H of D_{ψ,σ̂}(ψ̂) can be expressed more or less explicitly as
$$H(y) = P\left(D_{\psi,\hat\sigma}(\hat\psi) \le y\right) = P\left(-z_p\sqrt{n} - V/r \ge \delta(n-1, -k\sqrt{n}, y)\right),$$
where δ_y = δ(n − 1, −k√n, y) solves
$$G_{n-1,\,\delta_y}\left(-k\sqrt{n}\right) = y.$$
Using the above representation for V we have
$$H(y) = P\left(T_{n-1,\,-z_p\sqrt{n}} \le -rz_p\sqrt{n} - r\delta_y - k\sqrt{n}\right) = G_{n-1,\,-z_p\sqrt{n}}\left(-\sqrt{n}(rz_p + k) - r\delta(n-1, -k\sqrt{n}, y)\right).$$
Solving H(y_{1−α}) = 1 − α for y_{1−α} = H^{-1}(1 − α) we get
$$t_{n-1,\,-z_p\sqrt{n},\,1-\alpha} = -\sqrt{n}(rz_p + k) - r\delta(n-1, -k\sqrt{n}, y_{1-\alpha})$$
or
$$-\left(\sqrt{n}(rz_p + k) + t_{n-1,\,-z_p\sqrt{n},\,1-\alpha}\right)/r = \delta(n-1, -k\sqrt{n}, y_{1-\alpha}),$$
where t_{f,δ,1−α} is the (1 − α)-quantile of G_{f,δ}(x). Using the defining equation for δ_y we get
$$H^{-1}(1-\alpha) = G_{n-1,\,-\left(\sqrt{n}(rz_p + k) + t_{n-1,-z_p\sqrt{n},1-\alpha}\right)/r}\left(-k\sqrt{n}\right).$$
Solving
$$H^{-1}(1-\alpha) = D_{\psi,\hat\sigma}(\hat\psi),$$
i.e.,
$$G_{n-1,\,-\left(\sqrt{n}(rz_p + k) + t_{n-1,-z_p\sqrt{n},1-\alpha}\right)/r}\left(-k\sqrt{n}\right) = G_{n-1,\,\sqrt{n}(\psi - \hat\psi)/\hat\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right),$$
for ψ gives
$$\hat\psi_L = \psi = \hat\psi - ks - \frac{s}{\sqrt{n}}\,t_{n-1,\,-z_p\sqrt{n},\,1-\alpha} = \bar X - \frac{s}{\sqrt{n}}\,t_{n-1,\,-z_p\sqrt{n},\,1-\alpha},$$
which is the classical lower confidence bound and does not involve k or r.
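The following sketch (illustrative data; the (k, r) pairs and the bracketing interval are arbitrary choices of the illustration) checks this insensitivity numerically: it computes H^{-1}(1 − α) from the closed form just derived and solves (5) for ψ, for two different (k, r) pairs, comparing both answers with the classical bound X̄ − s t_{n−1,−z_p√n,1−α}/√n.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)
x = rng.normal(100.0, 15.0, size=8)            # illustrative sample
n, p, alpha = len(x), 0.10, 0.05
xbar, s = x.mean(), x.std(ddof=1)
zp = stats.norm.ppf(p)
sqn = np.sqrt(n)

def psi_L(k, r):
    psi_hat, sig_hat = xbar + k * s, r * s
    t_q = stats.nct.ppf(1 - alpha, df=n - 1, nc=-zp * sqn)
    # H^{-1}(1 - alpha) from the closed form above
    target = stats.nct.cdf(-k * sqn, df=n - 1,
                           nc=-(sqn * (r * zp + k) + t_q) / r)
    # solve D_{psi, sig_hat}(psi_hat) = H^{-1}(1 - alpha) for psi, using (5)
    d = lambda psi: stats.nct.cdf(-k * sqn, df=n - 1,
                                  nc=sqn * (psi - psi_hat) / sig_hat - zp * sqn) - target
    return optimize.brentq(d, xbar - 20 * s, xbar + 20 * s)

classical = xbar - s * stats.nct.ppf(1 - alpha, df=n - 1, nc=-zp * sqn) / sqn
print(psi_L(zp, 1.0), psi_L(0.0, 0.5), classical)   # all three should agree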
“D_{ψ,η̂}(ψ̂) is approximately distribution free”
holds in a neighborhood of the true unknown parameter θ. Since presumably θ̂ is our best guess at θ, we may as well start our search for H^{-1}(1 − α) as close as possible to θ, namely with θ_0 = (ψ_0, η_0) = θ̂, in order to take greatest advantage of the closeness of the used approximation. To emphasize this we write
$$H_{\hat\theta}\left(D_{\psi,\hat\eta}(\hat\psi)\right) = 1 - \alpha$$
as the equation that needs to be solved for ψ to obtain the 100(1 − α)% lower bound ψ̂_L for ψ. Of course, the left side of this equation will typically no longer have a uniform distribution on (0, 1). Following Beran (1987) one could iterate this procedure further. If
$$H_{\hat\theta}\left(D_{\psi,\hat\eta}(\hat\psi)\right) \sim H_{2,\theta},$$
with H_{2,θ̂}(H_θ̂(D_{ψ,η̂}(ψ̂))) hopefully more uniform than H_θ̂(D_{ψ,η̂}(ψ̂)), one could then try for an adjusted lower bound by solving
$$H_{2,\hat\theta}\left(H_{\hat\theta}\left(D_{\psi,\hat\eta}(\hat\psi)\right)\right) = 1 - \alpha$$
for ψ = ψ̂_{2,L}. This process can be further iterated in obvious fashion, but whether this will be useful in small sample situations is questionable. What would such an iteration converge to in the specific situation to be examined next?
As illustration of the application of our method to an approximate pivot
situation we will consider the Behrens-Fisher problem, which was examined
by Beran (1988) in a testing context from an asymptotic rate perspective.
Let X_1, ..., X_m and Y_1, ..., Y_n be independent random samples from respective N(µ, σ_1²) and N(ν, σ_2²) populations.
for ψ = µ−ν. Since we do not assume σ1 = σ2 we are faced with the classical
Behrens-Fisher problem.
We will examine how the automatic double bootstrap or pivot method attacks
this problem. We can reparametrize the above model in terms of (ψ, η), where
µ = ψ + ν and η = (ν, σ_1, σ_2). As natural estimate of ψ we take ψ̂ = X̄ − Ȳ, and as estimate for η we take η̂ = (Ȳ, s_1, s_2), where s_i² is the usual unbiased estimate of σ_i². The distribution function of ψ̂ is
$$D_{\psi,\eta}(x) = P_{\psi,\eta}\left(\bar X - \bar Y \le x\right) = \Phi\left(\frac{x - \psi}{\sqrt{\sigma_1^2/m + \sigma_2^2/n}}\right).$$
The distribution function H_ρ of
$$D_{\psi,\hat\eta}(\hat\psi) = \Phi\left(\frac{\hat\psi - \psi}{\sqrt{s_1^2/m + s_2^2/n}}\right)$$
depends on the unknown parameters only through
$$\rho = \rho(\sigma_1^2, \sigma_2^2) = \frac{n\sigma_1^2}{n\sigma_1^2 + m\sigma_2^2}.$$
The same is true for the distribution function G_ρ of the Studentized statistic
$$T = \frac{\hat\psi - \psi}{\sqrt{s_1^2/m + s_2^2/n}}.$$
Classical approximate solutions approximate G_ρ and in the process replace the unknown ρ by ρ̂ = ρ(s_1², s_2²). This is done for example in Welch's solution (Welch (1947) and Aspin (1949)), where G_ρ is approximated by a Student t-distribution function F_f(t) with f = f(ρ) degrees of freedom, with
$$f(\rho) = \left(\frac{\rho^2}{m-1} + \frac{(1-\rho)^2}{n-1}\right)^{-1}.$$
There is of course the possibility that the two approximation errors in Welch's solution cancel each other out to some extent.
The second phase of the pivot or automatic double bootstrap method stipulates that we solve
$$1 - \alpha = H_{\hat\rho}\left(D_{\psi,\hat\eta}(\hat\psi)\right) = G_{\hat\rho}\left(\frac{\hat\psi - \psi}{\sqrt{s_1^2/m + s_2^2/n}}\right)$$
for ψ = ψ̂_L, which yields the following 100(1 − α)% lower bound for ψ:
$$\hat\psi_L = \hat\psi - G_{\hat\rho}^{-1}(1-\alpha)\,\sqrt{s_1^2/m + s_2^2/n}.$$
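In practice G_ρ̂ can be approximated by simulation, since it is the distribution of T under normal samples whose variance ratio matches ρ̂. The sketch below (illustrative samples and simulation size; σ_i set to s_i, which fixes the ratio at ρ̂) computes ψ̂_L this way and also the corresponding Welch bound for comparison; as noted in the following paragraph, this bootstrap bound coincides with the one obtained by bootstrapping the Studentized statistic T.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=6)       # X-sample (illustrative)
y = rng.normal(8.0, 5.0, size=9)        # Y-sample (illustrative)
m, n, alpha, nsim = len(x), len(y), 0.05, 100_000

s1, s2 = x.std(ddof=1), y.std(ddof=1)
psi_hat = x.mean() - y.mean()
se = np.sqrt(s1**2 / m + s2**2 / n)

# simulate T under sigma_1 = s1, sigma_2 = s2 (only the ratio rho_hat matters)
xs = rng.normal(0.0, s1, size=(nsim, m))
ys = rng.normal(0.0, s2, size=(nsim, n))
t_star = (xs.mean(axis=1) - ys.mean(axis=1)) / np.sqrt(
    xs.var(axis=1, ddof=1) / m + ys.var(axis=1, ddof=1) / n)

psi_L_boot = psi_hat - np.quantile(t_star, 1 - alpha) * se

# Welch's solution for comparison
rho_hat = (s1**2 / m) / (s1**2 / m + s2**2 / n)
f = 1.0 / (rho_hat**2 / (m - 1) + (1 - rho_hat)**2 / (n - 1))
psi_L_welch = psi_hat - stats.t.ppf(1 - alpha, df=f) * se
print(psi_L_boot, psi_L_welch)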
Beran (1988) arrives at exactly the same bound (although in a testing context) by simple bootstrapping. However, he started out with the Studentized test statistic T, which thus is one step ahead in the game. It is possible to analyze the true coverage probabilities for ψ̂_L and for the corresponding Welch bound ψ̂_WL, although the evaluation of the analytical formulae for these coverage probabilities requires substantial numerical effort.
These analytical formulae are derived by using a well known conditioning device; see Fleiss (1971) for an account of the details. The formula for the exact coverage probability of ψ̂_L is as follows:
$$K_\rho(1-\alpha) = P_\rho\left(\hat\psi_L \le \psi\right) = \int_0^1 b(w)\,F_g\left(G_{\hat\rho(w)}^{-1}(1-\alpha)\,\sqrt{g\,a_1(\rho)\,w + g\,a_2(\rho)\,(1-w)}\right)\,dw$$
with
$$g = m + n - 2, \qquad a_1(\rho) = \frac{\rho}{m-1}, \qquad a_2(\rho) = \frac{1-\rho}{n-1}.$$
Here
$$b(w) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,w^{\alpha-1}(1-w)^{\beta-1}\,I_{[0,1]}(w)$$
is the beta density with α = (m − 1)/2 and β = (n − 1)/2 (not to be confused with the α in the confidence level),
$$\hat\rho(w) = \frac{w\rho(n-1)}{w\rho(n-1) + (1-w)(1-\rho)(m-1)},$$
and G_ρ^{-1}(p) is the inverse of
$$G_\rho(x) = P_\rho(T \le x) = \int_0^1 b(u)\,F_g\left(x\sqrt{g\,a_1(\rho)\,u + g\,a_2(\rho)\,(1-u)}\right)\,du.$$
The corresponding formula for the exact coverage of ψ̂_WL is
$$W_\rho(1-\alpha) = P_\rho\left(\hat\psi_{WL} \le \psi\right) = \int_0^1 b(w)\,F_g\left(F_{f(\hat\rho(w))}^{-1}(1-\alpha)\,\sqrt{g\,a_1(\rho)\,w + g\,a_2(\rho)\,(1-w)}\right)\,dw.$$
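The two formulas can be evaluated numerically, for instance as in the following sketch (scipy quadrature and root finding; m, n, ρ, and α are illustrative choices and the numerical tolerances are left at their defaults):

import numpy as np
from scipy import stats, integrate, optimize

def coverages(m, n, rho, alpha=0.05):
    g = m + n - 2
    a1, a2 = rho / (m - 1), (1 - rho) / (n - 1)
    beta = stats.beta(a=(m - 1) / 2, b=(n - 1) / 2)

    def G(x, r):                 # G_r(x): c.d.f. of T at variance ratio r
        c1, c2 = r / (m - 1), (1 - r) / (n - 1)
        f = lambda u: beta.pdf(u) * stats.t.cdf(
            x * np.sqrt(g * (c1 * u + c2 * (1 - u))), df=g)
        return integrate.quad(f, 0, 1)[0]

    def G_inv(p, r):             # numerical inverse of G_r
        return optimize.brentq(lambda x: G(x, r) - p, -100, 100)

    def rho_w(w):                # rho_hat(w) as defined above
        return w * rho * (n - 1) / (w * rho * (n - 1) + (1 - w) * (1 - rho) * (m - 1))

    def welch_df(r):
        return 1.0 / (r**2 / (m - 1) + (1 - r)**2 / (n - 1))

    def K_term(w):               # integrand of K_rho(1 - alpha)
        q = G_inv(1 - alpha, rho_w(w))
        return beta.pdf(w) * stats.t.cdf(q * np.sqrt(g * (a1 * w + a2 * (1 - w))), df=g)

    def W_term(w):               # integrand of W_rho(1 - alpha)
        q = stats.t.ppf(1 - alpha, df=welch_df(rho_w(w)))
        return beta.pdf(w) * stats.t.cdf(q * np.sqrt(g * (a1 * w + a2 * (1 - w))), df=g)

    return integrate.quad(K_term, 0, 1)[0], integrate.quad(W_term, 0, 1)[0]

print(coverages(m=3, n=3, rho=0.25))     # compare with the nominal 0.95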
When ρ = 0 or ρ = 1, and for any (m, n), one finds that the coverage probabilities are exactly equal to the nominal values 1 − α, i.e., K_ρ(1 − α) = W_ρ(1 − α) = 1 − α. This is seen most directly from the fact that in these cases T ∼ F_{n−1} and T ∼ F_{m−1}, respectively.
Figure 3 displays the exact coverage probabilities K_ρ(.95) and W_ρ(.95) for
equal sample sizes m = n = 2, 3, 5 as a function of ρ ∈ [0, .5]. The full graph
is symmetric around ρ = .5 for m = n. It is seen that both procedures
are highly accurate even for small samples. Mostly the double bootstrap
based bounds are slightly more accurate than Welch’s method. However,
for ρ near zero or one there is a reversal. Note how fast the curve reversal
smoothes out as the sample sizes increase. Figure 4 shows the rate at which
the maximum coverage error for both procedures tends to zero for m = n =
2, . . . , 10, 15, 20, 30, 40, 50. It confirms the rate results given by Beran (1988).
The approximate asymptotes are the lines going through (0, 0) and the last
point, corresponding to m = n = 50. It seems plausible that the true
asymptotes actually coincide.
It may be of interest to find out what effect a further bootstrap iteration
would have on the exact coverage rate. The formulas for these coverage rates
are analogous to the previous ones, with G_{ρ̂(w)}^{-1}(1 − α) and F_{f(ρ̂(w))}^{-1}(1 − α)
replaced by appropriate iterated inverses, adding considerably to the com-
plexity of numerical calculations. We conjecture that such an iteration will
increase the number of oscillations in the coverage curve. This may then ex-
plain why further iterations may lead to highly irregular coverage behavior.
[Figure 3: Coverage Probabilities of 95% Lower Bounds in the Behrens-Fisher Problem. Coverage probability versus ρ for m = n = 2, 3, 5; double bootstrap and Welch approximated d.f.]
[Figure 4: Maximum Coverage Error of 95% Lower Bounds in the Behrens-Fisher Problem. |coverage error| versus (m + n)^{-1}; double bootstrap, Welch approximated d.f., and approximate asymptotes.]
5.3 A Case Study
In this section we examine the small sample performance of various bootstrap
methods in the context of Example 2. In particular, we examine the situation
of obtaining confidence bounds for a normal percentile ψ = µ + zp σ. Using
the notation introduced in Section 5.2.4 we take θ̂ = (ψ̂, σ̂) as estimate of θ = (ψ, η) = (ψ, σ), with
$$\hat\psi = \bar X + ks \quad\text{and}\quad \hat\sigma = rs$$
for some known constants k and r > 0. We will make repeated use of the following expression for the bootstrap distribution of ψ̂*:
$$P_{\hat\psi,\hat\sigma}\left(\hat\psi^\star \le x\right) = D_{\hat\psi,\hat\sigma}(x) = G_{n-1,\,\sqrt{n}(\hat\psi - x)/\hat\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right). \qquad (6)$$
with k″ = k − rz_p − rδ_{1−α}(k, n)/√n, where δ_y(k, n) = δ(n − 1, −k√n, y) is as defined in Section 5.2.4.
From (5) the actual coverage probabilities of these bounds are obtained as
$$P_{\psi,\sigma}\left(\hat\psi_{EL} \le \psi\right) = G_{n-1,\,-z_p\sqrt{n}}\left(r\delta_\alpha(k, n) + rz_p\sqrt{n} - k\sqrt{n}\right),$$
$$P_{\psi,\sigma}\left(\hat\psi_{EU} \ge \psi\right) = 1 - G_{n-1,\,-z_p\sqrt{n}}\left(r\delta_{1-\alpha}(k, n) + rz_p\sqrt{n} - k\sqrt{n}\right)$$
and
$$P_{\psi,\sigma}\left(\hat\psi_{EL} \le \psi \le \hat\psi_{EU}\right) = G_{n-1,\,-z_p\sqrt{n}}\left(r\delta_\alpha(k, n) + rz_p\sqrt{n} - k\sqrt{n}\right) - G_{n-1,\,-z_p\sqrt{n}}\left(r\delta_{1-\alpha}(k, n) + rz_p\sqrt{n} - k\sqrt{n}\right).$$
Figure 5 shows the behavior of the coverage error of the 95% lower bound (with r = 1 and k = z_{.10}) for ψ = µ + z_{.10}σ against the theoretical rate 1/√n. The actual size of the error is quite large even for large n. Also, the 1/√n asymptote is approximated well only for moderately large n, say n ≥ 20. Figure 6 shows the corresponding result for the upper bound. Note that the size of the error is substantially smaller there. Finally, Figure 7 shows the coverage error of the 95% equal tailed confidence interval for ψ against the theoretical rate of 1/n. The asymptote is reasonably approximated for much smaller n here.
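For reference, the coverage-error points of Figure 5 can be reproduced directly from the first formula above; the following sketch does so for a few illustrative n (scipy; δ_y(k, n) is obtained by numerical root finding):

import numpy as np
from scipy import stats, optimize

p, alpha, r = 0.10, 0.05, 1.0
zp = stats.norm.ppf(p)
k = zp                                   # psi_hat = X-bar + z_.10 s

def delta(y, k, n):
    """delta_y(k, n): solves G_{n-1, delta}(-k sqrt(n)) = y."""
    sqn = np.sqrt(n)
    return optimize.brentq(
        lambda d: stats.nct.cdf(-k * sqn, df=n - 1, nc=d) - y, -50, 50)

for n in (5, 10, 20, 50):
    sqn = np.sqrt(n)
    cover = stats.nct.cdf(r * delta(alpha, k, n) + r * zp * sqn - k * sqn,
                          df=n - 1, nc=-zp * sqn)
    print(n, cover - 0.95)               # true minus nominal coverage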
Hall's percentile method takes
$$\hat\psi_{HL} = \hat\psi - x_{1-\alpha}^\star$$
as 100(1 − α)% level lower bound for ψ. Here x*_{1−α} is the (1 − α)-quantile of the bootstrap distribution of ψ̂* − ψ̂. The corresponding 100(1 − α)% level upper bound is
$$\hat\psi_{HU} = \hat\psi - x_\alpha^\star,$$
and jointly these two bounds serve as a 100(1 − 2α)% confidence interval for ψ.
From Equation (6) we obtain
$$P_{\hat\psi,\hat\sigma}\left(\hat\psi^\star - \hat\psi \le x\right) = G_{n-1,\,-x\sqrt{n}/\hat\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right).$$
Thus we have
$$x_{1-\alpha}^\star = -\hat\sigma\left(\delta_{1-\alpha}(k, n)/\sqrt{n} + z_p\right)$$
[Figure 5: Coverage Error of Lower Confidence Bounds Using Efron's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability against 1/√n for n = 2, ..., 50.]
[Figure 6: Coverage Error of Upper Confidence Bounds Using Efron's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability against 1/√n for n = 2, ..., 50.]
[Figure 7: Coverage Error of Confidence Intervals Using Efron's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability against 1/n.]
and thus
$$\hat\psi_{HL} = \hat\psi + \hat\sigma\left(\delta_{1-\alpha}(k, n)/\sqrt{n} + z_p\right) = \bar X + s\left(k + rz_p + r\delta_{1-\alpha}(k, n)/\sqrt{n}\right) = \bar X + k's$$
with k′ = k + rz_p + rδ_{1−α}(k, n)/√n.
From Equation (5) the actual coverage probability of ψ̂_HL is
$$P_{\psi,\sigma}\left(\hat\psi_{HL} \le \psi\right) = G_{n-1,\,-z_p\sqrt{n}}\left(-k\sqrt{n} - rz_p\sqrt{n} - r\delta_{1-\alpha}(k, n)\right).$$
Figures 8-10 show the qualitative behavior of the coverage error of these bounds when using k = z_{.10}, r = 1, and γ = .95. The error is moderately improved over that of Efron's percentile method, but again sample sizes need to be quite large before the theoretical asymptotic behavior takes hold. A clearer comparison between Hall's and Efron's percentile methods can be seen in Figures 11 and 12.
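A companion sketch to the one given for Efron's method evaluates this coverage formula for Hall's lower bound at the same illustrative settings:

import numpy as np
from scipy import stats, optimize

p, alpha, r = 0.10, 0.05, 1.0
zp = stats.norm.ppf(p)
k = zp

def delta(y, k, n):                      # solves G_{n-1, delta}(-k sqrt(n)) = y
    return optimize.brentq(
        lambda d: stats.nct.cdf(-k * np.sqrt(n), df=n - 1, nc=d) - y, -50, 50)

for n in (5, 10, 20, 50):
    sqn = np.sqrt(n)
    arg = -k * sqn - r * zp * sqn - r * delta(1 - alpha, k, n)
    cover = stats.nct.cdf(arg, df=n - 1, nc=-zp * sqn)
    print(n, cover - 0.95)               # true minus nominal coverage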
The bias corrected percentile method takes as bounds
$$\hat\psi_{bcL} = D_{\hat\psi,\hat\sigma}^{-1}\left(\Phi(2u_0 + z_\alpha)\right)$$
and
$$\hat\psi_{bcU} = D_{\hat\psi,\hat\sigma}^{-1}\left(\Phi(2u_0 + z_{1-\alpha})\right),$$
where
$$u_0 = \Phi^{-1}\left(D_{\hat\psi,\hat\sigma}(\hat\psi)\right) = \Phi^{-1}\left(G_{n-1,\,-z_p\sqrt{n}}(-k\sqrt{n})\right)$$
[Figure 8: Coverage Error of Lower Confidence Bounds Using Hall's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability for n = 2, ..., 50.]
[Figure 9: Coverage Error of Upper Confidence Bounds Using Hall's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability for n = 2, ..., 50.]
[Figure 10: Coverage Error of Confidence Intervals Using Hall's Percentile Method with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; true minus nominal coverage probability for n = 2, ..., 50.]
and Φ denotes the standard normal distribution function. When u_0 = 0 these bounds reduce to Efron's percentile bounds. Since x = ψ̂_bcU solves
$$D_{\hat\psi,\hat\sigma}(x) = G_{n-1,\,\sqrt{n}(\hat\psi - x)/\hat\sigma - z_p\sqrt{n}}\left(-k\sqrt{n}\right) = \Phi(2u_0 + z_{1-\alpha}) = \gamma(1-\alpha),$$
we obtain, just as for the percentile bounds, ψ̂_bcU = X̄ + k_U s and ψ̂_bcL = X̄ + k_L s, with
$$\gamma(\alpha) = \Phi(2u_0 + z_\alpha), \qquad k_U = k - rz_p - r\delta_{\gamma(1-\alpha)}(k, n)/\sqrt{n} \qquad\text{and}\qquad k_L = k - rz_p - r\delta_{\gamma(\alpha)}(k, n)/\sqrt{n}.$$
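For completeness, a small sketch (illustrative data; r = 1, k = z_{.10}) computes the bias corrected lower bound directly from (6), without going through the δ notation:

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(6)
x = rng.normal(50.0, 5.0, size=10)       # illustrative sample
n, p, alpha, r = len(x), 0.10, 0.05, 1.0
zp = stats.norm.ppf(p)
k = zp
s = x.std(ddof=1)
psi_hat, sig_hat = x.mean() + k * s, r * s
sqn = np.sqrt(n)

def D(y):                                # bootstrap c.d.f. of psi_hat*, Equation (6)
    return stats.nct.cdf(-k * sqn, df=n - 1,
                         nc=sqn * (psi_hat - y) / sig_hat - zp * sqn)

u0 = stats.norm.ppf(D(psi_hat))          # bias correction constant
target = stats.norm.cdf(2 * u0 + stats.norm.ppf(alpha))
# invert D at the corrected level gamma(alpha) to get psi_bcL
psi_bcL = optimize.brentq(lambda y: D(y) - target,
                          psi_hat - 20 * s, psi_hat + 20 * s)
print(psi_bcL)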
[Figure 11: Coverage Error of Confidence Bounds Comparing Percentile Methods and Bias Correction with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95; upper and lower bound coverage errors shown separately.]
[Figure 12: Coverage Error of Confidence Intervals Comparing Percentile Methods and Bias Correction with X̄ + z_p s Estimating ψ = µ + z_p σ in a Normal Population, p = .1 and Confidence γ = .95.]
5.3.4 Percentile-t and Double Bootstrap Methods
Using T = (ψ̂ − ψ)/σ̂ as the Studentized ratio in the percentile-t method will
result in the classical confidence bounds and thus there will be no coverage
error. This arises because T is an exact pivot.
If we take R = ψb − ψ as a root in Beran’s prepivoting method, we again
arrive at the same classical confidence bounds. This was already pointed out
in Section 5.2.3 and is due to the fact that the distribution of R only depends
on the nuisance parameter σ.
The automatic double bootstrap also arrives at the classical confidence bounds
as was already examined in Section 5.2.4. Thus the automatic double boot-
strap succeeds here without having to make a choice of scale estimate for
Studentization or of a root for prepivoting.
5.4 References
Aspin, A.A. (1949). “Tables for use in comparisons whose accuracy involves
two variances, separately estimated (with an Appendix by B.L. Welch).”
Biometrika 36, 290-296.
Bain, L.J. (1987). Statistical Analysis of Reliability and Life-Testing Models,
Theory and Methods. Marcel Dekker, Inc., New York.
Beran, R. (1987). “Prepivoting to reduce level error of confidence sets.”
Biometrika 74, 457-468.
Beran, R. (1988). “Prepivoting test statistics: A bootstrap view of asymp-
totic refinements.” J. Amer. Statist. Assoc. 83, 687-697.
Diaconis, P. and Efron, B. (1983a). “Computer intensive methods in statis-
tics.” Sci. Amer. 248, 116-130.
Diaconis, P. and Efron, B. (1983b). “Statistik per Computer: der Münchhausen-
Trick.” Spektrum der Wissenschaft, Juli 1983, 56-71. German transla-
tion of Diaconis, P. and Efron, B. (1983a), introducing the German term
Münchhausen for bootstrap.
DiCiccio, T.J. and Romano, J.P. (1988). “A review of bootstrap confidence
intervals.” (With discussion) J. Roy. Statist. Soc. Ser. B 50, 338-354.
Efron, B. (1979). “Bootstrap methods: Another look at the jackknife.” Ann.
Statist. 7, 1-26.
Efron, B. (1981). “Nonparametric standard errors and confidence intervals.”
(With discussion) Canad. J. Statist. 9, 139-172.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
SIAM, Philadelphia.
Efron, B. (1987). “Better bootstrap confidence intervals.” (With discussion)
J. Amer. Statist. Assoc. 82, 171-200.
Fleiss, J.L. (1971). “On the distribution of a linear combination of indepen-
dent chi squares.” J. Amer. Statist. Assoc. 66, 142-144.
Hall, P. (1988a). “Theoretical comparison of bootstrap confidence intervals.”
(With discussion) Ann. Statist. 16, 927-985.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag,
New York.
Lehmann, E.L. (1986). Testing Statistical Hypotheses, Second Edition, John
Wiley & Sons, New York.
Loh, W. (1987). “Calibrating confidence coefficients.” J. Amer. Statist.
Assoc. 82, 155-162.
Reid, N. (1981). Discussion of Efron (1981).
Scholz, F.-W. (1994). “On exactness of the parametric double bootstrap.” Statistica Sinica 4, 477-492.
Welch, B.L. (1947). “The generalization of ‘Student's’ problem when several different population variances are involved.” Biometrika 34, 28-35.