
Komputasi Statistik (Statistical Computing)

Meeting 14 (Pertemuan 14)
RESAMPLING
Background
• The theory was introduced in the 1930s by R.A. Fisher and E.J.G. Pitman.
• In 1966, the resampling method was first tried with graduate students.
• In 1969, the method was presented in Basic Research Methods in Social Science (3rd edition: Julian L. Simon and Paul Burstein, 1985).
• In the late 1970s, Efron began to publish formal analyses of the bootstrap, an important resampling application.
• Since 1970, the bootstrap method has been "hailed by an official American Statistical Association volume as the only 'great breakthrough' in statistics" (Kotz and Johnson, 1992).
• In 1973, Dan Weidenfeld and Julian L. Simon developed the computer language called RESAMPLING STATS (earlier called SIMPLE STATS).
[Timeline, 1930s to 1970s: in 1956, Quenouille suggested a resampling technique; in 1958, Tukey coined the term jackknife.]
Why do we need resampling?

• The purpose of statistics is to estimate parameters and their reliability. Since estimators are functions of the sample points, they are random variables. If we could find the distribution of such a random variable (the sample statistic), we could estimate the reliability of the estimator.
• If we had the sampling distribution of the sample statistic, we could estimate the variance of the estimator, construct confidence intervals, and even test hypotheses.
Why do we need resampling?

• Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive.
  – What is the sampling distribution of:
    • The time since the most recent common ancestor of all humans?
    • The adjusted R-squared?
    • The AIC?
    • The beta coefficient when independence is violated?
    • The number of connections in a neural net?
    • The eigenvalues of a PCA?
    • A bifurcation point in a phylogenetic tree?
Why do we need resampling?

• There are several techniques for approximating these distributions, e.g., the Laplace approximation. These approximations give an analytical form for the approximate distributions.
• With the advent of computers, more computationally intensive methods are emerging, and in many cases they work satisfactorily.
Why do we need resampling?

• The t-distribution and chi-squared distribution are good approximations for sufficiently large and/or normally distributed samples.
• However, when the data come from an unknown distribution or the sample size is small, resampling tests are recommended.
Resampling vs Standard Methods

[Diagram contrasting standard methods with resampling methods.]
Resampling Methods

Resampling covers four main techniques: permutation, bootstrapping, the jackknife, and cross-validation.
Resampling Method | Application | Sampling procedure used
Bootstrap | Standard deviation, confidence interval, hypothesis testing, bias | Samples drawn at random, with replacement
Jackknife | Standard deviation, confidence interval, bias | Samples consist of the full data set with one observation left out
Permutation | Hypothesis testing | Samples drawn at random, without replacement
Cross-validation | Model validation | Data is randomly divided into two or more subsets, with results validated across sub-samples
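As a rough illustration only (toy data and object names of my own, not from the original slides), the four sampling schemes in the table look like this in R:

x <- c(12, 7, 9, 15, 11, 8)                                        # a toy sample
boot_sample <- sample(x, size = length(x), replace = TRUE)         # bootstrap: with replacement
jack_sample <- x[-1]                                               # jackknife: leave one observation out
perm_sample <- sample(x, size = length(x), replace = FALSE)        # permutation: without replacement
cv_folds    <- split(sample(x), rep(1:2, length.out = length(x)))  # cross-validation: two random subsets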
Bootstrap

History
• 1969: Simon publishes the bootstrap as an example in Basic Research Methods in Social Science (the earlier pigfood example).
• 1979: Efron names the method and publishes the first paper on the bootstrap.
What is the bootstrap? In statistics…

• Randomly sampling, with replacement, from an original dataset for use in obtaining statistical estimates.
• A data-based simulation method for statistical inference.
• A computer-based method for assigning measures of accuracy to statistical estimates.
• The method requires modern computer power to simplify the often intricate calculations of traditional statistical theory.
Why use the bootstrap?

• Small sample size.
• Non-normal distribution of the sample.
• A test of means for two samples.
• Not as sensitive to N.
Bootstrap Idea
We avoid the task of taking many samples from the population by instead taking many resamples from a single sample. The values of the statistic (e.g., the sample mean x̄) computed from these resamples form the bootstrap distribution. We use the bootstrap distribution, rather than theory, to learn about the sampling distribution.
• The bootstrap draws samples from the empirical distribution of the data {X1, X2, · · · , Xn} to replicate the statistic θ̂ and obtain its sampling distribution.
• The empirical distribution is just a uniform distribution over {X1, X2, · · · , Xn}.
• Therefore the bootstrap is just drawing i.i.d. samples from {X1, X2, · · · , Xn} (see the short sketch after the figure caption below).
• The procedure is illustrated by the following graph.
(a) The idea of the sampling distribution of the sample mean x̄: take very many samples, collect the value of x̄ from each, and look at the distribution of these values.
(b) The probability theory shortcut: if we know that the population values follow a Normal distribution, theory tells us that the sampling distribution of x̄ is also Normal.
(c) The bootstrap idea: when theory fails and we can afford only one sample, that sample stands in for the population, and the distribution of x̄ in many resamples stands in for the sampling distribution.
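A minimal sketch of that idea (toy data of my own, not from the slides): a single bootstrap resample is simply n i.i.d. draws, with replacement, from the observed values.

x <- c(3.2, 5.1, 4.7, 6.0, 2.9, 4.1)                   # observed sample (hypothetical values)
x_star <- sample(x, size = length(x), replace = TRUE)  # one bootstrap resample
x_star                                                 # some values repeat, others drop out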
Bootstrap Procedure

1. The population has an unknown distribution with parameter θ; draw an i.i.d. sample X1, X2, … , Xn and estimate θ by θ̂.
2. Resample: draw a bootstrap sample Xb1, Xb2, … , Xbn at random, with replacement, from the original sample, and compute the bootstrap replicate θ̂*b.
3. Repeat B times (B ≥ 1000) to obtain the replicates θ̂*1, θ̂*2, … , θ̂*B; their distribution is used for inference about θ̂.
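A minimal sketch of this procedure in R (function and object names are my own, not from the slides); it returns the B bootstrap replicates of any statistic:

boot_replicates <- function(x, statistic, B = 1000) {
  # each replicate: resample x with replacement, then recompute the statistic
  replicate(B, statistic(sample(x, size = length(x), replace = TRUE)))
}
# Example use (hypothetical data): bootstrap distribution and SE of the median
# theta_star <- boot_replicates(my_data, median, B = 2000)
# hist(theta_star); sd(theta_star)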
Bootstrap for Estimating Standard Error

The bootstrap estimate of the standard error of θ̂ is

$SE_{boot}(\hat\theta) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_b^* - \bar\theta^*\right)^2}, \qquad \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_b^*, \qquad b = 1, 2, \ldots, B.$

How many Bootstrap Replications, B?
• A fairly small number, B = 25, is sufficient to be "informative" (Efron).
• B = 50 is typically sufficient to provide a rough estimate of the SE, but B > 200 is generally used.
• Confidence intervals require larger values of B: no less than 500, with B = 1000 recommended (see the short sketch below).
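As a rough, hypothetical illustration (simulated data and code of my own, not from the slides), the bootstrap SE estimate settles down as B grows:

set.seed(123)
x <- rnorm(25, mean = 50, sd = 10)                           # hypothetical sample
for (B in c(25, 50, 200, 1000)) {
  se_b <- sd(replicate(B, mean(sample(x, replace = TRUE))))  # bootstrap SE with B replicates
  cat("B =", B, " bootstrap SE =", round(se_b, 3), "\n")
}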
Bootstrap with R

> female.heights <- c(117, 162, 143, 120, 183, 175, 147, 145, 165,
+ 167, 179, 116)
> mean(female.heights)
[1] 151.5833
>
> sd(female.heights)
[1] 24.14147
>
> hist(female.heights, xlab="Female Heights")
> #bootstrap
> f <- numeric(10000)
> for(i in 1:10000) {
+ f[i] <- mean(sample(female.heights, replace=TRUE))
+ }
>
> #histogram of bootstrap parameter
> hist(f, xlab="bootstrap means",probability = TRUE)
> lines(density(f),col="red",lwd=2)
>
> #standard error
> se.boot<-sd(f)
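For comparison only, a sketch of the same bootstrap using the boot package (my addition; it assumes the package is installed and female.heights is defined as above). The statistic function must accept the data and a vector of resampled indices:

library(boot)
mean_stat <- function(data, idx) mean(data[idx])       # statistic recomputed on each resample
b <- boot(female.heights, statistic = mean_stat, R = 10000)
b                                                      # original estimate, bias, and std. error
boot.ci(b, type = "perc")                              # percentile bootstrap confidence interval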
Exercise
• Suppose we are interested in the wireless network download speed at STIS. It is difficult for us to examine the entire population at STIS, so the idea of bootstrap resampling comes in. The data are given on the next page.
• Find the bootstrap mean and its standard error.
Jackknife
The Jackknife

• Jackknife methods make use of systematic partitions of a data set to estimate properties of an estimator computed from the full sample.

• Quenouille (1949, 1956) suggested the technique to estimate (and, hence, reduce) the bias of an estimator θ̂n.

• Tukey (1958) coined the term jackknife to refer to the method, and also showed that the method is useful in estimating the variance of an estimator.
Jackknife Method
Consider the problem of estimating the standard error of a statistic $t = t(x_1, \ldots, x_n)$ calculated from a random sample from distribution F.
In the jackknife method, resampling is done by deleting one observation at a time. Thus we calculate n values of the statistic, denoted by $t_{-i} = t(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, $i = 1, \ldots, n$. Let $\bar t = \frac{1}{n}\sum_{i=1}^{n} t_{-i}$. Then the jackknife estimate of SE(t) is given by

$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(t_{-i} - \bar t\right)^2} = \frac{(n-1)\, s_{t^*}}{\sqrt{n}}$   (1)

where $s_{t^*}$ is the sample standard deviation of $t_{-1}, t_{-2}, \ldots, t_{-n}$.
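A minimal sketch of formula (1) in R (the function name is mine, not from the slides):

jackknife_se <- function(x, statistic) {
  n   <- length(x)
  t_i <- sapply(1:n, function(i) statistic(x[-i]))     # delete-one replicates t_{-i}
  sqrt((n - 1) / n * sum((t_i - mean(t_i))^2))         # formula (1)
}
# Example (using female.heights from the bootstrap slides): jackknife_se(female.heights, mean)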
Jackknife

1. The unknown distribution F generates the sample x1, x2, … , xn; the statistic t = t(x1, x2, … , xn) is used to estimate the parameter of interest.
2. Resample by deleting one observation at a time: t(x2, x3, … , xn), t(x1, x3, … , xn), … , t(x1, x2, … , xn−1).
3. Repeat n times (once per observation) to obtain the jackknife replicates t−1, t−2, … , t−n, which are used for inference.
The formula is not immediately evident, so let us look at the special case $t = \bar x$. Then

$t_{-i} = \bar x_{-i} = \frac{1}{n-1}\sum_{j \neq i} x_j = \frac{n\bar x - x_i}{n-1}$, and $\bar t = \bar x^* = \frac{1}{n}\sum_{i=1}^{n} \bar x_{-i} = \bar x$.

Using simple algebra it can be shown that

$JSE(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar x_{-i} - \bar x^*\right)^2} = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar x\right)^2}{n(n-1)}} = SE(\bar x)$   (2)

Thus the jackknife estimate of the standard error (1) gives an exact result for $\bar x$.
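A quick numerical check of (2), assuming the female.heights vector from the bootstrap example above: the jackknife SE of the mean coincides with the classical standard error.

x   <- female.heights
n   <- length(x)
t_i <- sapply(1:n, function(i) mean(x[-i]))            # delete-one means
jse <- sqrt((n - 1) / n * sum((t_i - mean(t_i))^2))    # formula (1)
c(jackknife = jse, classical = sd(x) / sqrt(n))        # the two values agree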
Limitations of the Jackknife
• The jackknife method of estimation can fail if the statistic t is not smooth. Smoothness implies that relatively small changes to data values cause only small changes in the statistic.
• The jackknife is not a good method for estimating percentiles (such as the median), or when using any other non-smooth estimator (a sketch follows this list).
• An alternative to deleting one observation at a time is to delete d observations at a time (d ≥ 2). This is known as the delete-d jackknife.
• In practice, if n is large and d is chosen such that √n < d < n, then the problems of non-smoothness are removed.
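As a hypothetical illustration (simulated data and code of my own, not from the slides): the delete-one medians take only a few distinct values, so the jackknife SE of the median is unreliable, and a bootstrap SE is usually preferred.

set.seed(42)
x <- rexp(20)                                                    # hypothetical data
n <- length(x)
med_i <- sapply(1:n, function(i) median(x[-i]))                  # delete-one medians: few distinct values
jse_median  <- sqrt((n - 1) / n * sum((med_i - mean(med_i))^2))  # jackknife SE (unreliable here)
boot_median <- sd(replicate(2000, median(sample(x, replace = TRUE))))  # bootstrap SE for comparison
c(jackknife = jse_median, bootstrap = boot_median)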
Jackknife with R
From the female.heights data example:

#JACKKNIFE
x <- female.heights
jack   <- numeric(length(x) - 1)   # holds the leave-one-out sample
pseudo <- numeric(length(x))       # holds the pseudo-values
for (i in 1:length(x)) {
  # build jack = x with the i-th observation removed
  for (j in 1:length(x)) {
    if (j < i) { jack[j] <- x[j] }
    else {
      if (j > i) { jack[j - 1] <- x[j] }
    }
  }
  # pseudo-value for the statistic sd(x)
  pseudo[i] <- length(x) * sd(x) - (length(x) - 1) * sd(jack)
}
t <- mean(pseudo)      # jackknife estimate of the standard deviation
vart <- var(pseudo)    # variance of the pseudo-values
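Assuming the objects defined above, the jackknife standard error of sd(x) follows directly from the pseudo-values and agrees with formula (1):

se.jack <- sqrt(vart / length(x))   # equals (n - 1) * sd(delete-one values) / sqrt(n)
se.jack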
