
Komputasi Statistik (Statistical Computing)

Meeting 14 (Pertemuan 14)
RESAMPLING
Background
• The theory was introduced in the 1930s by R.A. Fisher and E.J.G. Pitman.
• In 1966, the resampling method was first tried with graduate students.
• In 1969, the method was presented in Basic Research Methods in Social Science (3rd edition: Julian L. Simon and Paul Burstein, 1985).
• In the late 1970s, Efron began to publish formal analyses of the bootstrap, an important resampling application.
• Since 1970, the bootstrap method has been "hailed by an official American Statistical Association volume as the only 'great breakthrough' in statistics" (Kotz and Johnson, 1992).
• In 1973, Dan Weidenfeld and Julian L. Simon developed the computer language called RESAMPLING STATS (earlier called SIMPLE STATS).
[Timeline, 1930s to 1970s: in 1956, Quenouille suggested a resampling technique; in 1958, Tukey coined the term jackknife.]
Why do we need resampling?

• The purpose of statistics is to estimate parameters and their reliability. Since estimators are functions of the sample points, they are random variables. If we could find the distribution of such a random variable (the sample statistic), we could estimate the reliability of the estimator.
• If we had the sampling distribution of the sample statistic, we could estimate the variance of the estimator, construct confidence intervals, and even test hypotheses.
Why do we need resampling?

• Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive.
  – What is the sampling distribution of:
    • The time since the most recent common ancestor of all humans?
    • The adjusted R-squared?
    • The AIC?
    • The beta coefficient when independence is violated?
    • The number of connections in a neural net?
    • The eigenvalues of a PCA?
    • A bifurcation point in a phylogenetic tree?
Why do we need resampling?

• There are several techniques for approximating these distributions, e.g., the Laplace approximation. These approximations give an analytical form for the approximate distributions.
• With the advent of computers, more computationally intensive methods are emerging, and in many cases they work satisfactorily.
Why do we need resampling?

• The t-distribution and chi-squared distribution are good approximations for sufficiently large and/or normally distributed samples.
• However, when the data come from an unknown distribution or the sample size is small, resampling tests are recommended.
Resampling vs Standard Methods

[Diagram contrasting standard methods with resampling methods.]
Resampling Methods

Resampling covers four main techniques: permutation, bootstrapping, the jackknife, and cross-validation.
Resampling Method | Application | Sampling procedure used
Bootstrap | Standard deviation, confidence interval, hypothesis testing, bias | Samples drawn at random, with replacement
Jackknife | Standard deviation, confidence interval, bias | Samples consist of the full data set with one observation left out
Permutation | Hypothesis testing | Samples drawn at random, without replacement
Cross-validation | Model validation | Data is randomly divided into two or more subsets, with results validated across sub-samples
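As a rough illustration only (toy data and object names of my own, not from the original slides), the four sampling schemes in the table look like this in R:

x <- c(12, 7, 9, 15, 11, 8)                                        # a toy sample
boot_sample <- sample(x, size = length(x), replace = TRUE)         # bootstrap: with replacement
jack_sample <- x[-1]                                               # jackknife: leave one observation out
perm_sample <- sample(x, size = length(x), replace = FALSE)        # permutation: without replacement
cv_folds    <- split(sample(x), rep(1:2, length.out = length(x)))  # cross-validation: two random subsets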
Bootstrap

History
• 1969: Simon publishes the bootstrap as an example in Basic Research Methods in Social Science (the earlier pigfood example).
• 1979: Efron names the method and publishes the first paper on the bootstrap.
What is the bootstrap? In statistics…

• Randomly sampling, with replacement, from an original dataset for use in obtaining statistical estimates.
• A data-based simulation method for statistical inference.
• A computer-based method for assigning measures of accuracy to statistical estimates.
• The method requires modern computer power to simplify the often intricate calculations of traditional statistical theory.
Why use the bootstrap?

• Small sample size.
• Non-normal distribution of the sample.
• A test of means for two samples.
• Not as sensitive to N.
Bootstrap Idea
We avoid the task of taking many samples from the population by instead taking many resamples from a single sample. The values of the statistic (e.g., the sample mean x̄) computed from these resamples form the bootstrap distribution. We use the bootstrap distribution, rather than theory, to learn about the sampling distribution.
• The bootstrap draws samples from the empirical distribution of the data {X1, X2, · · · , Xn} to replicate the statistic θ̂ and obtain its sampling distribution.
• The empirical distribution is just a uniform distribution over {X1, X2, · · · , Xn}.
• Therefore the bootstrap is just drawing i.i.d. samples from {X1, X2, · · · , Xn} (see the short sketch after the figure caption below).
• The procedure is illustrated by the following graph.
(a) The idea of the sampling distribution of the sample mean x̄: take very many samples, collect the value of x̄ from each, and look at the distribution of these values.
(b) The probability theory shortcut: if we know that the population values follow a Normal distribution, theory tells us that the sampling distribution of x̄ is also Normal.
(c) The bootstrap idea: when theory fails and we can afford only one sample, that sample stands in for the population, and the distribution of x̄ in many resamples stands in for the sampling distribution.
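A minimal sketch of that idea (toy data of my own, not from the slides): a single bootstrap resample is simply n i.i.d. draws, with replacement, from the observed values.

x <- c(3.2, 5.1, 4.7, 6.0, 2.9, 4.1)                   # observed sample (hypothetical values)
x_star <- sample(x, size = length(x), replace = TRUE)  # one bootstrap resample
x_star                                                 # some values repeat, others drop out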
Bootstrap Procedure

1. The population has an unknown distribution with parameter θ; draw an i.i.d. sample X1, X2, … , Xn and estimate θ by θ̂.
2. Resample: draw a bootstrap sample Xb1, Xb2, … , Xbn at random, with replacement, from the original sample, and compute the bootstrap replicate θ̂*b.
3. Repeat B times (B ≥ 1000) to obtain the replicates θ̂*1, θ̂*2, … , θ̂*B; their distribution is used for inference about θ̂.
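A minimal sketch of this procedure in R (function and object names are my own, not from the slides); it returns the B bootstrap replicates of any statistic:

boot_replicates <- function(x, statistic, B = 1000) {
  # each replicate: resample x with replacement, then recompute the statistic
  replicate(B, statistic(sample(x, size = length(x), replace = TRUE)))
}
# Example use (hypothetical data): bootstrap distribution and SE of the median
# theta_star <- boot_replicates(my_data, median, B = 2000)
# hist(theta_star); sd(theta_star)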
Bootstrap for Estimating Standard Error

The bootstrap estimate of the standard error of θ̂ is

$SE_{boot}(\hat\theta) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_b^* - \bar\theta^*\right)^2}, \qquad \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_b^*, \qquad b = 1, 2, \ldots, B.$

How many Bootstrap Replications, B?
• A fairly small number, B = 25, is sufficient to be "informative" (Efron).
• B = 50 is typically sufficient to provide a rough estimate of the SE, but B > 200 is generally used.
• Confidence intervals require larger values of B: no less than 500, with B = 1000 recommended (see the short sketch below).
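As a rough, hypothetical illustration (simulated data and code of my own, not from the slides), the bootstrap SE estimate settles down as B grows:

set.seed(123)
x <- rnorm(25, mean = 50, sd = 10)                           # hypothetical sample
for (B in c(25, 50, 200, 1000)) {
  se_b <- sd(replicate(B, mean(sample(x, replace = TRUE))))  # bootstrap SE with B replicates
  cat("B =", B, " bootstrap SE =", round(se_b, 3), "\n")
}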
Bootstrap with R

> female.heights <- c(117, 162, 143, 120, 183, 175, 147, 145, 165,
+ 167, 179, 116)
> mean(female.heights)
[1] 151.5833
>
> sd(female.heights)
[1] 24.14147
>
> hist(female.heights, xlab="Female Heights")
> #bootstrap
> f <- numeric(10000)
> for(i in 1:10000) {
+ f[i] <- mean(sample(female.heights, replace=TRUE))
+ }
>
> #histogram of bootstrap parameter
> hist(f, xlab="bootstrap means",probability = TRUE)
> lines(density(f),col="red",lwd=2)
>
> #standard error
> se.boot<-sd(f)
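For comparison only, a sketch of the same bootstrap using the boot package (my addition; it assumes the package is installed and female.heights is defined as above). The statistic function must accept the data and a vector of resampled indices:

library(boot)
mean_stat <- function(data, idx) mean(data[idx])       # statistic recomputed on each resample
b <- boot(female.heights, statistic = mean_stat, R = 10000)
b                                                      # original estimate, bias, and std. error
boot.ci(b, type = "perc")                              # percentile bootstrap confidence interval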
Exercise
• Suppose we are interested in the wireless network download speed at STIS. It is difficult for us to examine the entire population at STIS, so the idea of bootstrap resampling comes in. The data are given on the next page.
• Find the bootstrap mean and its standard error.
Jackknife
The Jackknife

• Jackknife methods make use of systematic partitions of a data set to estimate properties of an estimator computed from the full sample.

• Quenouille (1949, 1956) suggested the technique to estimate (and, hence, reduce) the bias of an estimator θ̂n.

• Tukey (1958) coined the term jackknife to refer to the method, and also showed that the method is useful in estimating the variance of an estimator.
Jackknife Method
Consider the problem of estimating the standard error of a statistic $t = t(x_1, \ldots, x_n)$ calculated from a random sample from distribution F.
In the jackknife method, resampling is done by deleting one observation at a time. Thus we calculate n values of the statistic, denoted by $t_{-i} = t(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, $i = 1, \ldots, n$. Let $\bar t = \frac{1}{n}\sum_{i=1}^{n} t_{-i}$. Then the jackknife estimate of SE(t) is given by

$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(t_{-i} - \bar t\right)^2} = \frac{(n-1)\, s_{t^*}}{\sqrt{n}}$   (1)

where $s_{t^*}$ is the sample standard deviation of $t_{-1}, t_{-2}, \ldots, t_{-n}$.
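A minimal sketch of formula (1) in R (the function name is mine, not from the slides):

jackknife_se <- function(x, statistic) {
  n   <- length(x)
  t_i <- sapply(1:n, function(i) statistic(x[-i]))     # delete-one replicates t_{-i}
  sqrt((n - 1) / n * sum((t_i - mean(t_i))^2))         # formula (1)
}
# Example (using female.heights from the bootstrap slides): jackknife_se(female.heights, mean)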
Jackknife

1. The unknown distribution F generates the sample x1, x2, … , xn; the statistic t = t(x1, x2, … , xn) is used to estimate the parameter of interest.
2. Resample by deleting one observation at a time: t(x2, x3, … , xn), t(x1, x3, … , xn), … , t(x1, x2, … , xn−1).
3. Repeat n times (once per observation) to obtain the jackknife replicates t−1, t−2, … , t−n, which are used for inference.
The formula is not immediately evident, so let us look at the special case $t = \bar x$. Then

$t_{-i} = \bar x_{-i} = \frac{1}{n-1}\sum_{j \neq i} x_j = \frac{n\bar x - x_i}{n-1}$, and $\bar t = \bar x^* = \frac{1}{n}\sum_{i=1}^{n} \bar x_{-i} = \bar x$.

Using simple algebra it can be shown that

$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar x_{-i} - \bar x^*\right)^2} = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar x\right)^2}{n(n-1)}} = SE(\bar x)$   (2)

Thus the jackknife estimate of the standard error (1) gives an exact result for $\bar x$.
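A quick numerical check of (2), assuming the female.heights vector from the bootstrap example above: the jackknife SE of the mean coincides with the classical standard error.

x   <- female.heights
n   <- length(x)
t_i <- sapply(1:n, function(i) mean(x[-i]))            # delete-one means
jse <- sqrt((n - 1) / n * sum((t_i - mean(t_i))^2))    # formula (1)
c(jackknife = jse, classical = sd(x) / sqrt(n))        # the two values agree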
Limitations of the Jackknife
• The jackknife method of estimation can fail if the statistic t is not smooth. Smoothness implies that relatively small changes to data values cause only small changes in the statistic.
• The jackknife is not a good method for estimating percentiles (such as the median), or when using any other non-smooth estimator (a sketch follows this list).
• An alternative to deleting one observation at a time is to delete d observations at a time (d ≥ 2). This is known as the delete-d jackknife.
• In practice, if n is large and d is chosen such that √n < d < n, then the problems of non-smoothness are removed.
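As a hypothetical illustration (simulated data and code of my own, not from the slides): the delete-one medians take only a few distinct values, so the jackknife SE of the median is unreliable, and a bootstrap SE is usually preferred.

set.seed(42)
x <- rexp(20)                                                    # hypothetical data
n <- length(x)
med_i <- sapply(1:n, function(i) median(x[-i]))                  # delete-one medians: few distinct values
jse_median  <- sqrt((n - 1) / n * sum((med_i - mean(med_i))^2))  # jackknife SE (unreliable here)
boot_median <- sd(replicate(2000, median(sample(x, replace = TRUE))))  # bootstrap SE for comparison
c(jackknife = jse_median, bootstrap = boot_median)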
Jackknife with R
From the female.heights data example:

#JACKKNIFE
x <- female.heights
jack   <- numeric(length(x) - 1)   # holds the leave-one-out sample
pseudo <- numeric(length(x))       # holds the pseudo-values
for (i in 1:length(x)) {
  # build jack = x with the i-th observation removed
  for (j in 1:length(x)) {
    if (j < i) { jack[j] <- x[j] }
    else {
      if (j > i) { jack[j - 1] <- x[j] }
    }
  }
  # pseudo-value for the statistic sd(x)
  pseudo[i] <- length(x) * sd(x) - (length(x) - 1) * sd(jack)
}
t <- mean(pseudo)      # jackknife estimate of the standard deviation
vart <- var(pseudo)    # variance of the pseudo-values
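Assuming the objects defined above, the jackknife standard error of sd(x) follows directly from the pseudo-values and agrees with formula (1):

se.jack <- sqrt(vart / length(x))   # equals (n - 1) * sd(delete-one values) / sqrt(n)
se.jack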
