Statistical Computing: Meeting 14
RESAMPLING
Background
• The theory was introduced in the 1930s by R.A. Fisher & E.J.G. Pitman.
• In 1966, the resampling method was first tried with graduate students.
• In 1969, the method was presented in Basic Research Methods in Social Science (3rd edition: Julian L. Simon and Paul Burstein, 1985).
• In the late 1970s, Efron began to publish formal analyses of the bootstrap, an important resampling application.
• Since 1970, the bootstrap method has been "hailed by an official American Statistical Association volume as the only 'great breakthrough' in statistics" (Kotz and Johnson, 1992).
• In 1973, Dan Weidenfeld and Julian L. Simon developed the computer language called RESAMPLING STATS (earlier called SIMPLE STATS).
[Timeline, 1930s to 1970s: in 1956, Quenouille suggested a resampling technique; in 1958, Tukey coined the term "jackknife".]
Why do we need resampling?
[Diagram: standard methods vs. resampling methods. Resampling methods include Permutation, Bootstrapping, Jackknife, and Cross-validation, distinguished by their application and the sampling procedure used.]
History
• 1969: Simon publishes the bootstrap as an example in Basic Research Methods in Social Science (the earlier pigfood example).
• 1979: Efron names the bootstrap and publishes the first paper on it.
What is the bootstrap? In statistics…
• Randomly sampling, with replacement, from an original dataset for use in obtaining statistical estimates.
• A data-based simulation method for statistical inference.
• A computer-based method for assigning measures of accuracy to statistical estimates.
• The method requires modern computing power to simplify the intricate calculations of traditional statistical theory.
Why use the bootstrap?
Bootstrap Idea
We avoid the task of taking many samples from the population by instead taking many resamples from a single sample. The values of x̄ from these resamples form the bootstrap distribution. We use the bootstrap distribution, rather than theory, to learn about the sampling distribution.
• The bootstrap draws samples from the empirical distribution of the data {X1, X2, ..., Xn} to replicate the statistic θ̂ and obtain its sampling distribution.
• The empirical distribution is just a uniform distribution over {X1, X2, ..., Xn}.
• Therefore the bootstrap is just drawing i.i.d. samples from {X1, X2, ..., Xn}.
• The procedure is illustrated by the following graph.
(a) The idea of the sampling distribution of the sample mean x̄: take very many samples, collect the value of x̄ from each, and look at the distribution of these values.
(b) The probability theory shortcut: if we know that the population values follow a Normal distribution, theory tells us that the sampling distribution of x̄ is also Normal.
(c) The bootstrap idea: when theory fails and we can afford only one sample, that sample stands in for the population, and the distribution of x̄ in many resamples stands in for the sampling distribution.
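A minimal sketch of this idea in R: drawing with replacement from the data is exactly drawing i.i.d. values from the empirical distribution, which sample() does directly. The data vector here is made up for illustration.

# Hypothetical data standing in for the single observed sample
x <- c(4.1, 5.6, 3.8, 6.2, 5.0, 4.7)
# One bootstrap resample: n i.i.d. draws from the empirical
# distribution of x, i.e., sampling x with replacement
x.star <- sample(x, size = length(x), replace = TRUE)
# The mean of the resample is one draw from the bootstrap distribution
mean(x.star)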
Bootstrap Procedures
[Diagram: the bootstrap procedure]
• Population with unknown distribution; the parameter θ is estimated by θ̂ from an i.i.d. sample X1, X2, ..., Xn.
• Resample with replacement from the sample, repeating B times (B ≥ 1000); the b-th resample is Xb1, Xb2, ..., Xbn.
• Compute the statistic on each resample, giving θ̂*1, θ̂*2, ..., θ̂*B.
• Use these bootstrap replicates for inference about the sampling distribution of θ̂.
Bootstrap for Estimating Standard Error
With bootstrap replicates $\hat\theta^*_b$, $b = 1, 2, \ldots, B$, the bootstrap estimate of the standard error of $\hat\theta$ is

$$\widehat{SE}_B(\hat\theta) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta^*_b - \bar\theta^*\right)^2}, \qquad \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b.$$
How many Bootstrap Replications, B?
• A fairly small number, B = 25, is sufficient to be "informative" (Efron).
• B = 50 is typically sufficient to provide a crude estimate of the SE, but B > 200 is generally used.
• Confidence intervals require larger values of B: no less than B = 500, with B = 1000 recommended.
Bootstrap with R
> female.heights <- c(117, 162, 143, 120, 183, 175, 147, 145, 165,
+ 167, 179, 116)
> mean(female.heights)
[1] 151.5833
>
> sd(female.heights)
[1] 24.14147
>
> hist(female.heights, xlab="Female Heights")
> #bootstrap
> f <- numeric(10000)
> for(i in 1:10000) {
+ f[i] <- mean(sample(female.heights, replace=TRUE))
+ }
>
> #histogram of bootstrap parameter
> hist(f, xlab="bootstrap means",probability = TRUE)
> lines(density(f),col="red",lwd=2)
>
> #standard error
> se.boot<-sd(f)
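The same computation can be done with R's boot package; this is a sketch assuming the package is installed. Statistic functions for boot() take the data plus a vector of resampled indices.

library(boot)
# Statistic for boot(): the mean of the resampled observations
boot.mean <- function(data, indices) mean(data[indices])
b <- boot(female.heights, statistic = boot.mean, R = 10000)
b             # prints the original estimate, estimated bias, and std. error
sd(b$t[, 1])  # bootstrap SE of the mean, comparable to se.boot above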
Exercise
• Suppose we are interested in the wireless network download speed at STIS. It is difficult for us to examine the entire population at STIS, so the idea of bootstrap resampling comes in. The data are given on the next page.
• Find the bootstrap mean and its standard error (a template sketch follows below).
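A minimal template for the exercise, assuming the speeds from the next page are entered into a vector named download.speeds (the name and the values shown are placeholders):

# Placeholder values; replace with the data from the next page
download.speeds <- c(12.3, 9.8, 15.1, 11.4, 10.7, 13.9)
B <- 10000
boot.means <- numeric(B)
for (b in 1:B) {
  boot.means[b] <- mean(sample(download.speeds, replace = TRUE))
}
mean(boot.means)  # bootstrap mean
sd(boot.means)    # bootstrap standard error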
The Jackknife
Jackknife Method
Consider the problem of estimating the standard error of a statistic $t = t(x_1, \ldots, x_n)$ calculated from a random sample from distribution $F$. In the jackknife method, resampling is done by deleting one observation at a time. Thus we calculate $n$ values of the statistic, denoted by

$$t_i = t(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), \qquad i = 1, \ldots, n.$$

Let $\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$. Then the jackknife estimate of $SE(t)$ is given by

$$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2} = \frac{n-1}{\sqrt{n}}\, s_{t^*} \qquad (1)$$

where $s_{t^*}$ is the sample standard deviation of $t_1, t_2, \ldots, t_n$.
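A minimal sketch of formula (1) in R, written for a generic statistic passed in as a function; the helper name jack.se and the data are illustrative.

# Jackknife SE of a statistic stat() per formula (1)
jack.se <- function(x, stat) {
  n <- length(x)
  # Leave-one-out replicates: t_i = stat(x with i-th observation deleted)
  t.i <- sapply(1:n, function(i) stat(x[-i]))
  sqrt((n - 1) / n * sum((t.i - mean(t.i))^2))
}
x <- c(4.1, 5.6, 3.8, 6.2, 5.0, 4.7)  # made-up data
jack.se(x, mean)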
Jackknife
[Diagram: the jackknife procedure]
• Unknown distribution F; the statistic t is estimated by t(x1, x2, ..., xn) from the sample.
• Resample by deleting one observation at a time, repeating n times: t(x2, x3, ..., xn), t(x1, x3, ..., xn), ..., t(x1, x2, ..., xn-1).
• The resulting statistics t1, t2, ..., tn are used for inference.
The formula is not immediately evident, so let us look at the special case $t = \bar{x}$. Then

$$t_i = \bar{x}_i^* = \frac{1}{n-1}\sum_{j \ne i} x_j = \frac{n\bar{x} - x_i}{n-1},$$

and

$$\bar{t} = \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i^* = \bar{x}.$$

Since $\bar{x}_i^* - \bar{x} = \frac{\bar{x} - x_i}{n-1}$,

$$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar{x}_i^* - \bar{x}\right)^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n(n-1)}} = \widehat{SE}(\bar{x}) \qquad (2)$$

so the jackknife recovers the usual standard error of the sample mean.
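Using the jack.se sketch from above, a quick numerical check of identity (2): for t = x̄ the jackknife SE equals the usual standard error of the mean.

x <- c(4.1, 5.6, 3.8, 6.2, 5.0, 4.7)  # made-up data
jack.se(x, mean)         # jackknife SE of the sample mean
sd(x) / sqrt(length(x))  # usual SE formula; the two agree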
Limitations of the Jackknife
• The jackknife method of estimation can fail if the statistic t_i is not smooth. Smoothness means that relatively small changes to the data values cause only a small change in the statistic.
• The jackknife is not a good method for estimating percentiles (such as the median), or for any other non-smooth estimator.
• An alternative to deleting one observation at a time is to delete d observations at a time (d ≥ 2); a brief sketch follows this list. This is known as the delete-d jackknife.
• In practice, if n is large and d is chosen such that √n < d < n, the problems of non-smoothness are removed.
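A brief sketch of the delete-d idea, enumerating all delete-d subsets with combn(); the variance formula used is one standard form of the delete-d jackknife estimator, and the data are made up.

# Delete-d jackknife SE of a non-smooth statistic (here: the median)
x <- c(4.1, 5.6, 3.8, 6.2, 5.0, 4.7, 5.9, 4.4)  # made-up data
n <- length(x); d <- 3
keep <- combn(n, n - d)  # each column indexes the retained observations
t.s <- apply(keep, 2, function(idx) median(x[idx]))
# Delete-d jackknife variance: (n-d) / (d * number of subsets) * SS
v <- (n - d) / (d * ncol(keep)) * sum((t.s - mean(t.s))^2)
sqrt(v)                  # delete-d jackknife SE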
Jackknife with R
From the female.heights data example:
#JACKKNIFE
x <- female.heights
n <- length(x)
pseudo <- numeric(n)
for (i in 1:n) {
  jack <- x[-i]  # sample with the i-th observation deleted
  # Pseudo-value for the statistic sd(): n*sd(all) - (n-1)*sd(leave-one-out)
  pseudo[i] <- n * sd(x) - (n - 1) * sd(jack)
}
t <- mean(pseudo)       # jackknife estimate of the sd
vart <- var(pseudo)     # variance of the pseudo-values
se.t <- sqrt(vart / n)  # jackknife standard error of the estimate
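For comparison, the bootstrap package (the companion package to Efron & Tibshirani) ships a ready-made jackknife; this sketch assumes the package is installed.

library(bootstrap)
j <- jackknife(female.heights, sd)
j$jack.se    # jackknife SE of the sd
j$jack.bias  # jackknife estimate of bias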