
Bootstrapping Techniques in Statistical Analysis and Approaches in R
MATH 289
Ning Zhao
University of California San Diego, Department of Mathematics

Abstract
The true probability distribution of a test statistic is rarely known. Generally, its asymptotic
law is used as an approximation of the true law. If the sample size is not large enough, the
asymptotic behavior of the statistic can lead to a poor approximation of the true distribution.
Using bootstrap methods, under some regularity conditions, it is possible to obtain a more accurate
approximation of the distribution of the test statistic. The bootstrap is a method for deriving
properties (standard errors, confidence intervals and critical values) of the sampling distribution of
estimators. It takes the sample (the values of the independent and dependent random variables)
as the population and the estimates from the sample as the true values. Instead of drawing from a
specified distribution with a random number generator, the bootstrap draws with replacement from
the sample. This article discusses several bootstrap methods and presents simulations for some of them.

1 Definition
1.1 General illustration in Bootstrap World
Consider a sample of N independent observations, i = 1, ..., N, of a dependent variable y and
M + 1 explanatory variables x. A paired bootstrap sample is obtained by independently drawing N pairs
(x_i, y_i) from the observed sample with replacement. The bootstrap sample has the same number of
observations, but some observations appear several times and others not at all. The
bootstrap involves drawing a large number B of such samples. A single bootstrap sample is
denoted (x*_b, y*_b), where x*_b is an N × (M + 1) matrix and y*_b an N-dimensional column vector of the
data in the b-th bootstrap sample.
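
As a concrete illustration, a single paired bootstrap sample can be drawn in R as follows. This is a minimal sketch; the data frame dat below is simulated purely as a stand-in for whatever observed sample is at hand.

# A hypothetical observed sample: a data frame 'dat' with one regressor x and response y
set.seed(1)
dat <- data.frame(x = runif(20), y = rnorm(20))

N <- nrow(dat)
idx <- sample(1:N, size = N, replace = TRUE)   # draw N observation indices with replacement
boot.sample <- dat[idx, ]                      # paired bootstrap sample: some rows repeat, others never appear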

1.2 Bootstrap Standard Errors


The empirical standard deviation in a series of bootstrap replications of θ̂ can approximate the
standard error se(θ̂ ) of an estimator θ̂.

Here is the approach:


• 1. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). Usually B = 100
replications are sufficient.
• 2. Estimate the parameter θ of interest for each bootstrap sample: θ̂*_b for b = 1, ..., B.
• 3. Estimate se(θ̂) by

    ŝe = sqrt( (1/B) ∑_{b=1}^{B} (θ̂*_b − θ̄*)² ),   where θ̄* = (1/B) ∑_{b=1}^{B} θ̂*_b.
The full covariance matrix V(θ̂) of a vector-valued θ̂ is estimated analogously. If the estimator θ̂
is consistent and asymptotically normally distributed, the bootstrap standard error can be used to
construct approximate confidence intervals and to perform asymptotic tests based on the normal distribution.
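
A minimal R sketch of this algorithm is given below. The sample median of a generic data vector x is used only as an illustrative choice of the estimator θ̂; any other statistic could be substituted.

# Bootstrap standard error of an estimator (here: the sample median)
boot.se <- function(x, estimator = median, B = 100) {
  theta.star <- replicate(B, estimator(sample(x, length(x), replace = TRUE)))
  # empirical standard deviation with 1/B denominator, matching the formula above
  sqrt(mean((theta.star - mean(theta.star))^2))
}

set.seed(1)
x <- rnorm(50)
boot.se(x)   # bootstrap estimate of se(median)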

1.3 Confidence Intervals in Bootstrap Percentiles


We could construct a two-sided equal-tailed (1 − α) confidence interval for an estimate θ̂ from the
empirical distribution function in a series of bootstrap replications. The (1 − α/2) and the (α/2)
empirical percentiles of the bootstrap replications are used as upper and lower confidence bounds.
The procedure of this algorithm is called percentile bootstrap.
Here is the approach:
• 1. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). It is recommended
to use B = 1000 or more replications.
• 2. Estimate the parameter θ for each bootstrap sample: θ̂*_b for b = 1, ..., B.
• 3. Order the bootstrap replications of θ̂ such that θ̂*_(1) ≤ · · · ≤ θ̂*_(B). The lower and upper
confidence bounds are the B·(α/2)-th and B·(1 − α/2)-th ordered elements, respectively; for
B = 1000 and α = 5% these are the 25th and 975th ordered replications. The estimated (1 − α)
confidence interval of θ̂ is [θ̂*_(B·α/2), θ̂*_(B·(1−α/2))].
Note that the confidence intervals are in general not symmetric. The same construction can also be used
for an approximate two-sided test of a null hypothesis of the form H0 : θ = θ0: the null hypothesis is
rejected at significance level α if θ0 lies outside the two-tailed (1 − α) confidence interval.
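
A minimal R sketch of the percentile bootstrap, again using the sample median of a generic data vector x as the illustrative estimator; the quantile() call approximates the B·(α/2)-th and B·(1 − α/2)-th ordered replications.

# Percentile bootstrap confidence interval (here for the sample median)
boot.percentile.ci <- function(x, estimator = median, B = 1000, alpha = 0.05) {
  theta.star <- replicate(B, estimator(sample(x, length(x), replace = TRUE)))
  quantile(theta.star, probs = c(alpha/2, 1 - alpha/2))   # lower and upper percentile bounds
}

set.seed(1)
x <- rexp(40, rate = 2)
boot.percentile.ci(x)   # approximate 95% CI for the median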

1.4 t-Bootstrap
Assume that we have consistent estimates θ̂ and ŝe(θ̂), and that the asymptotic distribution of
the t-statistic is the standard normal:

    t = (θ̂ − θ0) / ŝe(θ̂)  →  N(0, 1).

Then we can calculate approximate critical values from the percentiles of the empirical distribution of the
t-statistic in a series of bootstrap replications.

Here is the algorithm:


• 1. Consistently estimate θ and se(θ̂) from the observed sample: θ̂, ŝe(θ̂).
• 2. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). It is recommended
to use B = 1000 or more replications.
• 3. Estimate the t-value for each bootstrap sample, taking θ0 = θ̂: t*_b = (θ̂*_b − θ̂) / ŝe*_b(θ̂)
for b = 1, ..., B, where θ̂*_b and ŝe*_b(θ̂) are the bootstrap estimates of the parameter θ and of its
standard error.
• 4. Order the bootstrap replications of t such that t*_(1) ≤ · · · ≤ t*_(B). The lower and upper
critical values are then the B·(α/2)-th and B·(1 − α/2)-th elements, respectively:
t_{α/2} = t*_(B·α/2) and t_{1−α/2} = t*_(B·(1−α/2)).
The t-bootstrap procedure can then create confidence intervals using bootstrap critical
values instead of the ones from the standard normal tables:

    [θ̂ + t_{α/2} · ŝe(θ̂),  θ̂ + t_{1−α/2} · ŝe(θ̂)]

The confidence interval from the bootstrap-t is therefore not necessarily better than the one from the
percentile method; however, it is consistent with bootstrap-t hypothesis testing. The bootstrap is typically
used for consistent yet biased estimators. In many cases we know the asymptotic properties of
these estimators and can use asymptotic theory to derive an approximate sampling
distribution; the bootstrap is an alternative way to produce approximations of the true
sampling properties. Sometimes the asymptotic sampling distribution is not simple to
derive, for instance because the derivation is too time consuming and error prone. Another
point is that the bootstrap produces better approximations for some properties: it can be shown that
bootstrap approximations converge faster for some statistics than approximations based on asymptotic
theory. Such bootstrap approximations are called asymptotic refinements.
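
The sketch below illustrates the t-bootstrap construction above for the mean of a data vector x; estimating se(θ̂) by s/sqrt(n) is one simple choice made here, not the only possible one.

# Bootstrap-t confidence interval for the mean of x
boot.t.ci <- function(x, B = 1000, alpha = 0.05) {
  n <- length(x)
  theta.hat <- mean(x)
  se.hat <- sd(x) / sqrt(n)
  t.star <- replicate(B, {
    xb <- sample(x, n, replace = TRUE)
    (mean(xb) - theta.hat) / (sd(xb) / sqrt(n))   # bootstrap t-value with theta0 = theta.hat
  })
  crit <- quantile(t.star, probs = c(alpha/2, 1 - alpha/2))   # bootstrap critical values
  # interval [theta.hat + t_{alpha/2}*se, theta.hat + t_{1-alpha/2}*se] as defined above
  c(theta.hat + crit[1] * se.hat, theta.hat + crit[2] * se.hat)
}

set.seed(1)
x <- rchisq(30, df = 3)   # skewed data, where bootstrap-t can refine the normal approximation
boot.t.ci(x)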

1.5 Wild bootstrap


This method is used when the model exhibits heteroskedasticity. The approach is similar to the
residual bootstrap, but the response variable is resampled based on the residual values: for each
replication one computes a new y from y*_i = ŷ_i + ê_i ν_i, so the residuals are randomly multiplied
by random variables ν_i with mean 0 and variance 1. A 2004 paper, "The choice of smoothing parameter
in nonparametric regression through Wild Bootstrap", focuses on 3D examples and compares asymptotic
theory with bootstrapping approaches; it is worth reading and relevant for real-world simulation.
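
A minimal sketch of one wild-bootstrap replication for a linear model fitted in R. The Rademacher distribution (±1, each with probability 1/2) used here is just one common choice of ν_i with mean 0 and variance 1; other two-point distributions are also used in practice.

# One wild-bootstrap replication of the response for a fitted linear model
# 'fit' is assumed to be an lm() object for the model of interest
wild.boot.y <- function(fit) {
  y.hat <- fitted(fit)
  e.hat <- resid(fit)
  nu <- sample(c(-1, 1), length(e.hat), replace = TRUE)   # Rademacher weights: mean 0, variance 1
  y.hat + e.hat * nu                                      # y_i* = yhat_i + ehat_i * nu_i
}

# Example usage (dat, x and y are hypothetical placeholders for the observed data):
# fit <- lm(y ~ x, data = dat)
# y.star <- wild.boot.y(fit)
# fit.star <- lm(y.star ~ x, data = dat)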

1.6 Smoothed bootstrap


The idea of the smoothed bootstrap is that a small amount of (usually normally distributed) zero-
centered random noise is added to each resampled observation. This is equivalent to sampling
from a kernel density estimate of the data, and it makes the resampling distribution smoother.

Here is the approach:


• 1. Original sample Y_n = {Y_1, ..., Y_n} ⇒ estimator θ̂ of the parameter θ of interest.
• 2. An i.i.d. sample Y*_1, ..., Y*_n is drawn at random (with replacement) from Y_n.
• 3. A smoothing parameter h > 0 is selected, and the final bootstrap sample is generated by adding
noise:
    Ỹ_i = Y*_i + h ε_i,
where ε_1, ..., ε_n are i.i.d. N(0, 1) random variables.
⇒ Bootstrap estimator θ̂*.
Note: for any real number y we have

    P(Ỹ ≤ y | Y_n) = (1/n) ∑_{i=1}^{n} P(Y*_i + hε ≤ y | Y_n) = (1/n) ∑_{i=1}^{n} Φ((y − Y*_i)/h),

where Φ and φ denote the distribution function and density of the standard normal distribution.
This is a continuous distribution with density

    f̂(y) = (1/(nh)) ∑_{i=1}^{n} φ((y − Y*_i)/h).
From the theory of kernel density estimation it is known that asymptotically (n → ∞, h → 0,
nh → ∞) fˆ converges to the true density f of the underlying distribution. The smooth bootstrap
is thus consistent.
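
A minimal R sketch of the smoothed bootstrap. The smoothing parameter h must be chosen by the user; the kernel bandwidth rule bw.nrd0 is used here only as an assumed default, not as a prescribed choice.

# Smoothed bootstrap sample: resample with replacement, then add N(0, h^2) noise
smooth.boot <- function(y, h = bw.nrd0(y)) {
  n <- length(y)
  y.star <- sample(y, n, replace = TRUE)   # ordinary bootstrap resample
  y.star + h * rnorm(n)                    # Y~_i = Y*_i + h * eps_i, eps_i ~ N(0, 1)
}

set.seed(1)
y <- rgamma(100, shape = 2)
y.tilde <- smooth.boot(y)   # one smoothed bootstrap sample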

2 Simulation
2.1 A Regression Model
We first compare bootstrap tests to asymptotic tests, in order to show the performance of
bootstrap tests relative to standard asymptotic tests. To this end, consider the linear
regression model

    y = β_1 + β_2 x_2 + β_3 x_3 + u,   u ~ N(0, σ² I).

In this model we test the null hypothesis H0 : β_3 = 0 and compute a classical
t-statistic for it. On the basis of this test statistic we can perform parametric bootstrap tests for
different numbers of bootstrap replications, and examine how well the bootstrap P-values correspond to
the exact P-value and how this correspondence changes as the number of replications increases.

Here is the code:

## Loading the data set

data = read.table(file="data.dat", header=FALSE)

nObs = length(data$V1)

Y   = data$V1
X.1 = rep(1, nObs)
X.2 = data$V2
X.3 = data$V3

## OLS estimation of unrestricted model and classical t-test for H_0: beta_3 = 0

OLS.unres = lm(Y ~ X.2 + X.3)

t.test = coef(OLS.unres)[3] / sqrt(diag(vcov(OLS.unres))[3])

## two-sided P-value of the classical t-test
P.value.t = 2 * (1 - pt(abs(t.test), df = (nObs - 3)))

## Simulating NB = 99, 999, 9999 bootstrap samples and associated test statistics

NB.1 = 99
NB.2 = 999
NB.3 = 9999

## OLS estimation of the restricted model (under H_0: beta_3 = 0)
OLS.res = lm(Y ~ X.2)

beta.1 = coef(OLS.res)[1]
beta.2 = coef(OLS.res)[2]
sigma  = summary(OLS.res)$sigma

U.matrix = matrix(0, nObs, NB.3)
Y.matrix = matrix(0, nObs, NB.3)
T.vector = rep(0, NB.3)

for (i in 1:NB.3) {
  ## parametric bootstrap: simulate data under the restricted model
  U.matrix[, i] = rnorm(nObs, 0, sigma)
  Y.matrix[, i] = beta.1 * X.1 + beta.2 * X.2 + U.matrix[, i]
  y = Y.matrix[, i]
  ## re-estimate the unrestricted model and compute the bootstrap t-statistic
  ## (using the bootstrap sample's own estimated standard error)
  OLS = lm(y ~ X.2 + X.3)
  T.vector[i] = coef(OLS)[3] / sqrt(diag(vcov(OLS))[3])
}

T.vector.1 = T.vector[1:NB.1]
T.vector.2 = T.vector[1:NB.2]
T.vector.3 = T.vector[1:NB.3]

## Computing the associated bootstrap P-values

P.value.NB.1 = 1/NB.1 * sum( ifelse(abs(T.vector.1) >= abs(t.test), 1, 0) )
P.value.NB.2 = 1/NB.2 * sum( ifelse(abs(T.vector.2) >= abs(t.test), 1, 0) )
P.value.NB.3 = 1/NB.3 * sum( ifelse(abs(T.vector.3) >= abs(t.test), 1, 0) )

[Figure 1: residuals vs. fitted values]

These residuals appear to exhibit approximate normality, homogeneity, and independence, although
there may be a problem with heterogeneity. Most textbooks show a few examples like this and then
residuals with clear patterning, most often residual values that increase with increasing fitted values;
in this figure, however, the residual values seem to decrease with increasing fitted values. (Note that
a log transformation could be tried to remove this problem from the model.)

[Figure 2: normal Q-Q plot]

The Q-Q plot above compares the data on the vertical axis to a standard normal population on the
horizontal axis. The approximate linearity of the points suggests that the data are close to normally
distributed. However, there is some noise beyond the second theoretical normal quantile, which we
should keep an eye on.

[Figure 3: standardized residuals]

After standardization, the errors show essentially the same pattern as in the first figure.

[Figure 4: residuals vs. leverage]

The leverage plot highlights cases that we may want to investigate as possibly having
undue influence on the regression relationship.

2.2 Bootstrap CI
We mentioned this concern in the first part.

The bootstrap distribution and the sample may disagree systematically, in which case bias may
occur. If the bootstrap distribution of an estimator is symmetric, then percentile confidence
intervals are often used; such intervals are appropriate especially for median-unbiased estimators
of minimum risk (with respect to an absolute loss function). Furthermore, the bootstrap is an
appropriate way to control and check the stability of the results. Bias in the bootstrap distribution
will lead to bias in the confidence interval. Conversely, if the bootstrap distribution is non-symmetric,
then percentile confidence intervals are often inappropriate.

The Poisson distribution arises frequently in real-world statistics; for instance, engineers have used it
as a model for counting problems, based on the rationale that events occur at an approximately constant
rate. In this model we need to estimate the parameter λ of the Poisson distribution:

    P(X = x) = λ^x e^{−λ} / x!

The simulation below illustrates the idea.

Here is the code:

# y_orig holds the original count data
y_orig <- c(rep(12,1), rep(0,14), rep(1,30), rep(2,36), rep(5,43), rep(9,6), rep(3,68), rep(4,43),
            rep(7,14), rep(11,1), rep(8,10))

# this y will be changed in the iterations
y <- y_orig
n <- length(y)

# mean negative log-likelihood of the Poisson model at rate p
NegLogLike <- function(p) {
  -(mean(y * log(p)) - p - mean(log(factorial(y))))
}

NB <- 3000  # we will draw 3000 bootstrap samples and estimate lambda 3000 times
# we will save the bootstrap results in this vector, initialized as a zero vector
lambdahat_MLEB <- rep(0, NB)

for (i in 1:NB) {   # repeat the experiment NB times
  # sample from the original data, using sampling with replacement!
  y <- sample(y_orig, n, replace = TRUE)
  # for this new sample, find the MLE by minimizing the negative log-likelihood
  out <- nlm(NegLogLike, p = c(0.5), hessian = TRUE)
  # save the estimation result of the i-th iteration
  lambdahat_MLEB[i] <- out$estimate
}

hist(lambdahat_MLEB, main = "Sampling Distribution of B-MLE for Poisson",
     xlab = "Estimated value", breaks = 70, prob = TRUE)
curve(dnorm(x, mean = mean(lambdahat_MLEB), sd = sd(lambdahat_MLEB)),
      col = 'red', add = TRUE)

[Figure 5: histogram of the bootstrap MLEs of λ with a normal density overlay]

As we can see, the resampling distribution in the figure looks quite good. As above, we can
minimize the negative log-likelihood function in order to obtain the Fisher information for the statistic:

    g(λ) = −(1/n) ∑_{i=1}^{n} log f(x_i | λ) = −(1/n) ∑_{i=1}^{n} l(x_i | λ),

and the Fisher information then follows from

    g''(λ) = −(1/n) ∑_{i=1}^{n} l''(x_i | λ)  →  −E(l''(X | λ)) = I(λ).

> mean(lambdahat_MLEB)
[1] 3.891379
> out$minimum
[1] 2.205347
> out$estimate
[1] 4.013331
> out$gradient
[1] 1.770456e-09
> out$hessian
[,1]
[1,] 0.2491199
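
For comparison with the bootstrap standard error, the observed Fisher information and the corresponding asymptotic standard error can be read off the nlm() output shown above. This is a sketch using the objects n, out and lambdahat_MLEB created earlier in this section; note that out holds the fit from the last bootstrap iteration.

# out$hessian is the second derivative of the mean negative log-likelihood g at the MLE,
# so n * out$hessian approximates the observed Fisher information for the full sample
I.hat <- n * out$hessian
se.asymptotic <- sqrt(1 / I.hat)      # asymptotic standard error of lambda-hat
se.bootstrap  <- sd(lambdahat_MLEB)   # bootstrap standard error for comparison
c(se.asymptotic, se.bootstrap)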

3 Discussion
An advantage of the bootstrap is its simplicity. It is a straightforward way to derive estimates of
standard errors and confidence intervals for complex estimators of complex parameters of the dis-
tribution, such as percentile points, proportions, odds ratios, and correlation coefficients. Moreover,
it is an appropriate way to control and check the stability of the results. The bootstrap is also a very
broadly applicable technique that can be used in different fields; for instance, the block bootstrap is
widely used as a concrete statistical tool in signal processing. However, although bootstrapping is (under
some conditions) asymptotically consistent, it does not provide general finite-sample guarantees, and it
tends to be overly optimistic. The apparent simplicity may conceal the fact that important assumptions
are being made when undertaking the bootstrap analysis (e.g. independence of samples), assumptions
that would be more formally stated in other approaches.

The discussion of the bootstrap for data is not yet finished and is still being carried on. For the comparison
of the proposed resampling schemes a complete understanding is still missing, and theoretical
research is ongoing. Applications to time series analysis will also require new approaches, and
examples increasingly interact with other fields of the academic world.

References
[1] Efron, B. and Tibshirani, R. J. (1993), An Introduction to the Bootstrap.
[2] Brownstone, D. and Valetta, R., "The Bootstrap and Multiple Imputations: Harnessing Increased
Computing Power for Improved Statistical Tests," Journal of Economic Perspectives, 15(4), 129–141.
[3] "A Random Effect Block Bootstrap for Clustered Data," Journal of Computational and Graphical
Statistics, 2012.
[4] Field, C. A. and Welsh, A. H., "Bootstrapping Clustered Data," Journal of the Royal Statistical
Society, Series B, 69, 369–390.
[5] Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer.
[6] Clark, R. G. and Allingham, S. (2011), "Robust Resampling Confidence Intervals for Empirical
Variograms," Mathematical Geosciences, 43(2), 243–259.
[7] Davison, A. C. and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge:
Cambridge University Press.
[8] Caers, J., Beirlant, J. and Vynckier, P. (1998), "Bootstrap Confidence Intervals for Tail Indices,"
Computational Statistics and Data Analysis, 26, 259–277.
[9] DiCiccio, T. J. and Efron, B. (1996), "Bootstrap Confidence Intervals (with Discussion)," Statistical
Science, 11, 189–228.
[10] Varian, H. (2005), "Bootstrap Tutorial," Mathematica Journal, 9, 768–775.
[11] Shakhnarovich, G., El-Yaniv, R. and Baram, Y., "Smoothed Bootstrap and Statistical Data Cloning
for Classifier Evaluation."
