Sampling Distribution and Simulation in R
Research Article
Tofik Mussa Reshid
Department of Statistics, Werabe University, P.O. Box 46 Werabe, Ethiopia
Email: toffamr@gmail.com Tel: +251911878075
Simulation plays an important role in many problems of daily life, and there has been increasing interest in using simulation to teach the concept of a sampling distribution. In this paper we show the sampling distributions of several important statistics commonly encountered in statistical methods, using 10,000 simulation replications. The simulations are presented in the R programming language to help students understand the concept of a sampling distribution. The paper helps students understand the central limit theorem, the law of large numbers, and the simulated distributions of statistics arising in one-sample and two-sample inference. It shows the convergence of the t-distribution to the standard normal distribution; that the sums of squared deviations of observations from the population mean and from the sample mean follow chi-square distributions with different degrees of freedom; and that the ratio of two sample variances follows an F-distribution. It is also interesting that in linear regression the sampling distributions of the estimated parameters are normal.
Key Words: simulation, central limit theorem, law of large numbers, normal distribution, t-distribution, chi-square distribution, F-distribution, regression
INTRODUCTION
Simulation plays an important role in many problems of daily life. Much can be done by simulation, and it is at the heart of statistics. When an experiment is conducted, its validity can be checked by simulation, and when empirical solutions are unattainable the problem can often be solved by simulation. This paper discusses how computer simulations are employed to study sampling distributions and to make inference. The objective of this paper is to show how to make inference for one-sample and two-sample data. It provides researchers and students with a brief description of commonly used statistical distributions via simulation, and it illustrates the distributions of several important statistics by computer simulation. As computers become more readily available to educators, there is wide speculation that teaching inference via dynamic, visual simulations may make statistical inference more accessible to introductory students (Moore, 1997).

In statistical methods we teach students about the sampling distributions of statistics; indeed, sampling distributions are the basis of statistical inference. Using simulation to teach the sampling distribution of the mean is widely recommended but rarely evaluated (Ann E. Watkins et al., 2014). Students find the concept of a sampling distribution, specifically the sampling distribution of the mean and the central limit theorem, difficult to understand (Ann E. Watkins et al., 2014). When it comes to regression, students find the concept of a sampling distribution even more difficult. This paper presents simulation studies focused on the difficulties students experience when learning about sampling distributions, and gives students, researchers and teachers a good understanding of sampling distributions and the central limit theorem using simulation.

This paper explains, through simulation, the sampling distributions of several important statistics often encountered in statistical methods. One way to use simulations is to allow students to experiment with a simulation and to discover the important principles on their own (David M. Lane, 2015). To compute population parameters, the entire population needs to be known; in many cases, however, it is difficult to observe every item in the population, and inference is instead drawn from sample statistics.

Simulation in this paper consists of drawing samples in an experiment over and over again; repeated draws tend to reveal a pattern. We check the validity of the central limit theorem and the law of large numbers, and we examine the distributions of several important sample statistics using simulation.
This paper provides a tool for studying sampling distributions and simulation in R programming, and gives students a working understanding of simulation and R syntax: copying the code and pasting it into R will run each program. Simulation is sometimes impressive in its results; for example, the distribution of the sample variance differs according to whether the population mean is known, and the distribution of the sample mean differs according to whether the population variance is known. We show the convergence of the Student t-distribution, the central limit theorem, the law of large numbers, and sampling distributions in regression.

Statistic and Simulation

A statistic is a function of the sampling units drawn from a population. Any function of the random variables representing the sampling units is itself a random variable, and is used to estimate the corresponding population parameter. Let u denote a statistic; then u is a function of the items in the sample, that is, u = g(x1, x2, ..., xn). The sampling distribution of a statistic is the probability distribution of the statistic over all possible simple random samples from the population. Through simulation we demonstrate the properties of the sampling distributions of statistics such as the mean, the variance and regression parameters.

Simulation in this paper takes n sample items, computes the statistic u, and repeats this N times. The procedure follows these steps:

i. Draw n samples from a population with mean μ and variance σ².
ii. Compute the statistic u; for example, x̄n = (1/n)Σxi and sn² = Σ(xi − x̄)²/(n − 1).
iii. Repeat steps i and ii N times, where N is the number of times a sample is drawn.
iv. Compute the expected value and variance over the N replications; for example, E(x̄n) = (1/N)Σ x̄n,i and Var(x̄n) = Σ(x̄n,i − E(x̄n))²/(N − 1) ≈ σ²/n, where σ² is the population variance.

Let X1, X2, ..., Xn be independently, identically and normally distributed with mean μ and finite variance σ². Inference about the population mean can then be made as follows.

1. If the population variance σ² is known, the sampling distribution of the sample mean is obtained through the normal distribution for any sample size:
Z = (x̄n − μ)/(σ/√n) ~ N(0, 1).
This is the standard normal distribution, denoted by Z (Paul L., 2015).

Figure 1 depicts the density of N = 10,000 simulations from a normal distribution of mean μ = 8 and variance σ² = 36, in which the statistic Z is computed each time from a sample of n = 5 and of n = 20 items. The R code is given below.

f=function(N,n){
  m=matrix(0,N); v=m; z=m
  for(i in 1:N){
    x=rnorm(n,8,6); m[i]=mean(x); v[i]=var(x)
    z[i]=(m[i]-8)/6*sqrt(n)}
  return(z)}
Z=f(10000,20)
plot(density(Z),xlim=c(-5,5),ylim=c(0,0.5),xlab="Z",lwd=1,col="blue",lty=1,main="")
lines(sort(Z),dnorm(sort(Z)),col="green",lwd=2,lty=2)
Z=f(10000,5)
lines(density(Z),col="red",lwd=3,lty=3)
legend(1.6,0.4,c("simulation n=20","normal dist","simulation n=5"),col=c("blue","green","red"),lwd=c(1,2,3),lty=c(1,2,3))

Figure 1: Simulation approaches the standard normal distribution

If the population variance is known, the sampling distribution of Z defined above is standard normal for any sample size; the simulation does not depend on the sample size n. Only the variance of the sample mean decreases as the sample size n increases, because Var(x̄n) = σ²/n.

2. If the population variance σ² is unknown, we use the Student t-distribution. The t-distribution arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population variance is unknown. It was developed by William Sealy Gosset (1908). Its density with n degrees of freedom is

f(t) = [Γ((n + 1)/2) / (Γ(n/2)√(nπ))] × (1 + t²/n)^(−(n+1)/2)

Let X1, X2, ..., Xn be independently, identically and normally distributed with mean μ and variance σ²; then inference on the mean can be made through T = (x̄ − μ)/(S/√n) ~ t(n − 1), which follows the Student t-distribution (Paul L., 2015).

Figure 2 shows the density of N = 10,000 simulations from a normal distribution, taking a random sample of n = 5 items and computing the statistic T each time. The figure shows that the statistic T follows the Student t-distribution, and it compares the t-distribution with the standard normal distribution. The R code for the simulation is given as:
Int. J. Stat. Math. 156
f=function(N,n){
  m=matrix(0,N); v=m; t=m
  for(i in 1:N){
    x=rnorm(n,8,6); m[i]=mean(x)
    v[i]=var(x); t[i]=(m[i]-8)/sd(x)*sqrt(n)}
  return(t)}
t=f(10000,5)
plot(density(t),xlim=c(-5,5),ylim=c(0,0.5),col="green",lty=1,xlab="t")
lines(sort(t),dt(sort(t),1),col="blue",lwd=2,lty=2)
lines(sort(t),dt(sort(t),4),col="black",lwd=2,lty=3)
lines(sort(t),dnorm(sort(t)),col="red",lwd=2,lty=4)
legend(1.7,0.4,c("simulation","t-dist(1)","t-dist(4)","normal dist"),col=c("green","blue","black","red"),lwd=c(1,2,2,2),lty=c(1,2,3,4))

The following code illustrates the law of large numbers: as the sample size increases, the sample mean converges to the population mean μ = 8, and the 1.96σ/√n bounds around μ narrow accordingly.

f=function(N){
  m=matrix(0,N); L=m; U=m
  for(i in 1:N){
    x=rnorm(i,8,6)
    m[i]=mean(x)
    L[i]=8-1.96*6/sqrt(i)
    U[i]=8+1.96*6/sqrt(i)}
  m=data.frame(m,L,U)
  return(m)}
m=f(1000)
plot(m[,1],ty="l",col="blue",xlab="sample size",ylab="sample mean")
lines(m[,2],ty="l",col="black")
lines(m[,3],ty="l",col="black")
lines(c(0,1000),c(8,8),lty=1,lwd=2,col="red")
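The claim Var(x̄n) = σ²/n from the simulation procedure above can be checked numerically. The following sketch is not in the source; it uses the paper's population parameters (μ = 8, σ = 6) and an assumed seed for reproducibility.

```r
# Numeric check of E(xbar) = mu and Var(xbar) = sigma^2 / n,
# using the paper's population N(8, 36) and sample size n = 20.
set.seed(1)
N <- 10000; n <- 20
xbar <- replicate(N, mean(rnorm(n, 8, 6)))
mean(xbar)   # close to mu = 8
var(xbar)    # close to sigma^2 / n = 36/20 = 1.8
```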
The central limit theorem states that the sampling distribution of the standardized sample mean approaches the standard normal distribution regardless of the original distribution of the random variables X1, X2, ..., Xn: the shape is normal if the population is normal, and for other populations with finite mean and variance the shape becomes more normal as n increases (Monica E. Brussolo, 2018).

We examine the central limit theorem by simulating from several non-normal distributions, drawing n samples and standardizing them. We take a symmetric distribution, the binomial with n′ = 50 and p = 0.5; a highly skewed one, the exponential with λ = 50; and a gamma distribution with shape parameter α = 30 and rate parameter β = 0.5.

Figure 4 shows the density of N = 10,000 simulations of a sample of n = 100 from each of the binomial, exponential and gamma distributions, with the statistic Z computed each time.

f=function(N){
  z=matrix(0,N)
  for(i in 1:N){
    b=rbinom(100,50,0.5)
    z[i]=(mean(b)-25)/(sqrt(12.5/100))}
  return(z)}
z=f(10000)
plot(density(z),xlim=c(min(z),max(z)),ylim=c(0,0.5),lty=1,lwd=1,col="black",xlab="Z",main="")
lines(sort(z),dnorm(sort(z)),col="red",lty=2,lwd=3)
f=function(N){
  z=matrix(0,N)
  for(i in 1:N){
    b=rexp(100,50)
    z[i]=(mean(b)-0.02)/(sqrt(0.0004/100))}
  return(z)}
z=f(10000)
lines(density(z),lty=2,lwd=2,col="green")
f=function(N,n,a,b){
  g=matrix(0,N)
  for(i in 1:N){
    g[i]=(mean(rgamma(n,a,b))-a*1/b)/sqrt(a*1/b^2/n)}
  return(g)}
g=f(10000,100,30,0.5)
lines(density(g),col="purple",lty=3,lwd=3)
legend(1.4,0.45,c("bin simulation","exp simulation","gamma simulation","stnd. norm"),lwd=c(1,2,3,3),lty=c(1,2,3,2),col=c("black","green","purple","red"))

For a large number of simulations, here N = 10,000, Z becomes approximately standard normal for all three distributions. If the sample size is sufficiently large, the sampling distribution approaches the standard normal distribution regardless of the shape of the original distribution.

Another important statistic often encountered in statistical methods is the sample variance. Inference about a population variance is made through the chi-square distribution, one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing and the construction of confidence intervals. The chi-square distribution is used in the goodness-of-fit test, the test of independence, the likelihood ratio test, the log-rank test and the Cochran-Mantel-Haenszel test.

Let z1, z2, ..., zn be independent standard normal random variables; then the sum of their squares follows a chi-square distribution with n degrees of freedom. Suppose X1, X2, ..., Xn are identically and independently normally distributed and we need to make inference on the population variance σ². Then the quantity
Q = Σ((xi − x̄)/σ)² = (n − 1)s²/σ²
follows a chi-square distribution with n − 1 degrees of freedom (R. Lyman Ott et al., 2010).

Figure 5 shows the density of a large simulation of N = 10,000 items from a normal distribution; for a random sample of n = 10, the sampling distribution of Q is exactly chi-square with 9 degrees of freedom. The R code is given below.

f=function(N,n){
  C=matrix(0,N)
  for(i in 1:N){
    x=rnorm(n,8,6)
    C[i]=sum((x-mean(x))^2)/36}
  return(C)}
C=f(10000,10)
plot(density(C),xlim=c(min(C),max(C)),ylim=c(0,0.12),xlab="chi-sq",lwd=1,col="blue",lty=1)
lines(sort(C),dchisq(sort(C),9),col="red",lwd=2,lty=2)
lines(sort(C),dchisq(sort(C),10),col="purple",lwd=2,lty=2)
legend(17,0.12,c("simulation","chi-sq 9 df","chi-sq 10 df"),col=c("blue","red","purple"),lwd=c(1,2,2),lty=c(1,2,2))
Now let X1, X2, ..., Xn be identically and independently normally distributed with known mean μ, and suppose we need to make inference on the population variance σ². Then the quantity
Q = Σ((xi − μ)/σ)²
follows a chi-square distribution with n degrees of freedom (R. Lyman Ott et al., 2010).

Figure 6 shows the density of N = 10,000 simulations for a random sample of n = 10 selected from a normal distribution with known mean μ = 8, with the statistic Q calculated each time. As the figure shows, Q follows the chi-square distribution with n degrees of freedom. The R code is

f=function(N,n){
  C=matrix(0,N)
  for(i in 1:N){
    x=rnorm(n,8,6)
    C[i]=sum((x-8)^2)/36}
  return(C)}
C=f(10000,10)
plot(density(C),xlim=c(min(C),max(C)),ylim=c(0,0.12),xlab="chi-sq",lwd=1,col="blue")
lines(sort(C),dchisq(sort(C),10),col="red",lwd=2,lty=2)
lines(sort(C),dchisq(sort(C),9),col="purple",lwd=2,lty=2)
legend(17,0.1,c("simulation","chi-sq 10 df","chi-sq 9 df"),col=c("blue","red","purple"),lwd=c(1,2,2),lty=c(1,2,2))

Figure 7 shows the density of 10,000 simulations from two populations with equal means and equal known variances. We compute the statistic Z for random samples of n1 = 10 and n2 = 12; the simulated density is approximately the same as the standard normal distribution. The following is the R code of the simulation.

f=function(N,n1,n2){
  z=matrix(0,N)
  for(i in 1:N){
    z[i]=(mean(rnorm(n1,8,6))-mean(rnorm(n2,8,6)))/(6*sqrt(1/n1+1/n2))}
  return(z)}
z=f(10000,10,12)
plot(density(z),col="red",lwd=1,lty=1,xlab="Z")
lines(sort(z),dnorm(sort(z)),col="green",lwd=3,lty=3)
legend(1.7,0.3,c("simulation","normal dist"),col=c("red","green"),lwd=c(1,3),lty=c(1,3))
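A quick numeric check, not part of the source, makes the degrees-of-freedom difference between the two versions of Q concrete: their expected values are n and n − 1 respectively. The parameters (μ = 8, σ² = 36, n = 10) follow the paper; the seed is an assumption for reproducibility.

```r
# E(Q) = n when deviations are taken from the known mean mu,
# and E(Q) = n - 1 when taken from the sample mean.
set.seed(3)
N <- 10000; n <- 10
Q_known   <- replicate(N, sum((rnorm(n, 8, 6) - 8)^2) / 36)
Q_unknown <- replicate(N, {x <- rnorm(n, 8, 6); sum((x - mean(x))^2) / 36})
mean(Q_known)     # close to n = 10
mean(Q_unknown)   # close to n - 1 = 9
```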
When the variances of the two populations are equal but unknown, we have shown that the density of the simulation is the same as the Student t-distribution with n1 + n2 − 2 degrees of freedom (R. Lyman Ott et al., 2010). Then the (1 − α)100% CI for μ1 − μ2 is
X̄ − Ȳ ± t(α/2, n1+n2−2) sp√(1/n1 + 1/n2),
where sp is the pooled standard deviation. Note: if n1 + n2 − 2 > 30, inference based on the Z-distribution and the t-distribution is essentially the same.

Figure 8: simulation of the t-distribution for two populations

If the two populations have different but known variances, inference is made using the standard normal distribution. For interval estimation of the mean difference μ1 − μ2 when σ1² ≠ σ2², we assume that both populations are normally distributed and that the samples are independent. Then
σ²(X̄ − Ȳ) = σ1²/n1 + σ2²/n2
and the statistic
Z = ((X̄ − Ȳ) − Δ)/√(σ1²/n1 + σ2²/n2),
where Δ is the hypothesized mean difference, has the standard normal distribution. Then the (1 − α)100% CI for μ1 − μ2 is
ȳ1 − ȳ2 ± Z(α/2)√(σ1²/n1 + σ2²/n2)

If the population variances σ1² and σ2² are unknown and different, we estimate the variance of the sample mean difference from the samples,
σ̂²(ȳ1 − ȳ2) = s1²/n1 + s2²/n2,
and then the statistic
T = ((X̄ − Ȳ) − Δ)/√(s1²/n1 + s2²/n2) ~ t(v)
with Satterthwaite degrees of freedom
v = (n1 − 1)(n2 − 1)/[(n2 − 1)c1² + (n1 − 1)c2²],
where c1 = (s1²/n1)/(s1²/n1 + s2²/n2) and c2 = (s2²/n2)/(s1²/n1 + s2²/n2); note c1 + c2 = 1 (R. Lyman Ott et al., 2010).

Figure 9 shows the density of the simulation from two populations with the same mean and unequal, unknown variances. We take random samples of n1 = 10 and n2 = 12; the density of 10,000 simulations is approximately the same as the Student t-distribution with v degrees of freedom (R. Lyman Ott et al., 2010). The R code is

f=function(N,n1,n2){
  t=matrix(0,N)
  for(i in 1:N){
    x1=rnorm(n1,8,6)
    x2=rnorm(n2,8,4)
    t[i]=(mean(x1)-mean(x2))/(sqrt(var(x1)/n1+var(x2)/n2))}
  return(t)}
t=f(10000,10,12)
plot(density(t),col="red",lwd=1,lty=1,xlab="t")
n1=10; n2=12
c1=(36/n1)/(36/n1+16/n2)
c2=1-c1
df=(n1-1)*(n2-1)/((n2-1)*c1^2+(n1-1)*c2^2)
lines(sort(t),dt(sort(t),df),col="green",lwd=2,lty=2)
legend(1.2,0.36,c("simulation",paste0("t-dist(",round(df,1)," df)")),col=c("red","green"),lwd=c(1,2),lty=c(1,2))

Figure 9: the density of the simulations compared with the t-distribution

Then the (1 − α)100% CI for μ1 − μ2 is
X̄ − Ȳ ± t(α/2, v)√(s1²/n1 + s2²/n2)
Note: if v > 30 then t(v) ≈ Z, so inference based on either is the same.

Paired samples

When we have paired samples X and Y, we make inference on the mean of their differences. First find the difference for each data pair, di = Xi − Yi. The mean of the differences is d̄ = Σdi/n, with sd = √(Σ(di − d̄)²/(n − 1)), and the distribution of
T = (d̄ − Δ)/(sd/√n) ~ t(n − 1)

Figure 10 shows that the density of N = 10,000 simulations for samples of size n = 10 from two normal populations is identical to the Student t-distribution with n − 1 degrees of freedom. The following R code does the simulation.

f=function(N){
  t=matrix(0,N)
  for(i in 1:N){
    x=rnorm(10,12,6)
    y=rnorm(10,12,8)
    d=x-y
    t[i]=mean(d)/sd(d)*sqrt(10)}
  return(t)}
t=f(10000)
plot(density(t),col="blue",lty=1,xlab="t")
lines(sort(t),dt(sort(t),9),col="red",lty=2,lwd=3)
legend(2,0.3,c("simulation","t-dist(9)"),col=c("blue","red"),lty=c(1,2),lwd=c(1,3))

Figure 10: Simulation for the paired sample t-distribution

Then the (1 − α)100% CI for μ1 − μ2 is
d̄ ± t(α/2, n−1) sd/√n
If the sample size is large, the statistic T turns to the standard normal distribution.

Now we make inference on the variances of two populations. Inference for two population variances is made through the F-distribution, also known as the Snedecor distribution, named after Ronald Fisher and George W. Snedecor.

Let X1, X2, ..., Xn1 be a simple random sample from a normal distribution with mean μ1 and variance σ1², let Y1, Y2, ..., Yn2 be a simple random sample from a normal distribution with mean μ2 and variance σ2², and suppose the two samples are independent. Then the sampling distribution of
F = (σ2² S1²)/(σ1² S2²)
is the F-distribution with n1 − 1 and n2 − 1 degrees of freedom.

Figure 11 presents the simulation density compared with the F-distribution. We take a first population with μ1 = 8, variance σ1² = 36 and sample size n1 = 10, and a second population with μ2 = 12, variance σ2² = 49 and sample size n2 = 14, and compute the statistic F each time. The simulation density of F is identical to the F-distribution with 9 and 13 degrees of freedom; from the figure we can see that the ratio of two sample variances is F-distributed with n1 − 1 and n2 − 1 degrees of freedom.

Another important statistic is the correlation. Let X and Y be two random variables; the correlation between them is defined as
ρ = Cov(X, Y)/(√Var(X)√Var(Y))
The estimate of ρ from the sample is r, defined as
r = (Σ xi yi − n x̄ ȳ)/(√(Σ xi² − n x̄²) √(Σ yi² − n ȳ²))
For independent simple random samples from normal populations, the sampling distribution of
T = r√(n − 2)/√(1 − r²)
is the Student t-distribution with n − 2 degrees of freedom (R. Lyman Ott et al., 2010).

Figure 12 shows the simulation density of T for random samples of n = 10 taken from two normal populations; the sampling distribution of T is the Student t-distribution with 8 degrees of freedom. We use the following R code.

f=function(N,n){
  t=matrix(0,N)
  for(i in 1:N){
    x=rnorm(n,8,6); y=rnorm(n,7,4)
    r=cor(x,y); t[i]=r*sqrt(n-2)/sqrt(1-r^2)}
  return(t)}
t=f(10000,10)
plot(density(t),xlim=c(min(t),max(t)),ylim=c(0,0.6),xlab="T",lwd=1,col="blue",main="")
lines(sort(t),dt(sort(t),8),col="red",lwd=2,lty=2)
legend(1.5,0.4,c("simulation","t-dist(8 df)"),col=c("blue","red"),lwd=c(1,2),lty=c(1,2))
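The source gives no code for the Figure 11 F-distribution simulation, so the following is a hypothetical reconstruction following the stated parameters (μ1 = 8, σ1² = 36, n1 = 10; μ2 = 12, σ2² = 49, n2 = 14); the seed and variable names are ours.

```r
# Sketch of an F simulation: F = (sigma2^2 S1^2) / (sigma1^2 S2^2)
# should follow F(n1 - 1, n2 - 1) = F(9, 13).
set.seed(4)
f <- function(N, n1, n2){
  Fstat <- numeric(N)
  for(i in 1:N){
    s1 <- var(rnorm(n1, 8, 6))   # S1^2 from population 1
    s2 <- var(rnorm(n2, 12, 7))  # S2^2 from population 2
    Fstat[i] <- (49 * s1) / (36 * s2)
  }
  return(Fstat)
}
Fv <- f(10000, 10, 14)
plot(density(Fv), xlab = "F", main = "")
lines(sort(Fv), df(sort(Fv), 9, 13), col = "red", lty = 2, lwd = 2)
legend("topright", c("simulation", "F-dist(9,13)"), col = c("black", "red"), lty = c(1, 2))
```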
Inference on regression

For simple regression, β = (β0, β1)′ and
β̂0 ~ N(β0, σ²[1/n + x̄²/Σ(xi − x̄)²]),  β̂1 ~ N(β1, σ²/Σ(xi − x̄)²)
If the population variance σ² is known, the statistic
Z = (β̂ − β)/√(Var(β̂))
has the standard normal distribution; if σ² is unknown, the studentized statistic follows the t-distribution with n − 2 degrees of freedom.
####REGRESSION
x=c(13.1259289863519, 9.97280258173123, 3.60393659118563, 9.76076080370694,
13.4466646108776, 18.6615084544756, 7.07075995905325, 10.734301129356,
11.0934200785123, 6.00321174040437, 5.06187068019062, 8.30410914169624,
15.6528919194825, 19.340568847023, 17.6761783366092, 15.3968393956311,
8.13349168887362, 2.47491842834279, 13.7670393986627, 12.9030547216535,
9.61479113763198, 11.0504995626397, 8.52958074864, 6.73342874459922,
14.1264660977758, 18.4355074968189, 16.915554122068, 11.0317060784437,
4.52571967476979, 8.37619267497212, 4.15179930301383, 13.6165081835352,
15.2277578259818, 5.87729318765923, 8.9937518001534, 6.91483139339834,
4.37928855139762, 13.8287743031979, 8.48079221323133, 17.2897333060391)
y=c(18.3284716199676, 11.9983356793452, 3.19005389478917, 17.9857143032932,
17.5261241893626, 24.2918720781243, 10.5747370562273, 6.15778938503531,
9.64142880997099, 14.0942357409376, 6.40587956995636, 11.9280460218608,
14.980217684097, 21.1935663470971, 23.6971419993407, 22.438711803928,
10.7544176247085, 12.7370736926116, 17.9085304531413, 12.7612311923915,
14.6016582112767, 14.2730117991017, 11.4572381322187, 11.6969638058108,
11.0486753962352, 21.0292074375214, 24.6398452952169, 11.1345173872645,
3.26861680487316, 10.6531134626035, 6.25943117619691, 17.994832240113,
23.4506507135199, 17.3030244749145, 14.8902683145381, 11.5692487487054,
6.80099139708624, 21.4242608411097, 4.92690380123521, 24.9058315809615)
mm=lm(y~x); summary(mm)
f=function(n){
  b0=matrix(0,n); b1=b0; v0=b0; v1=b0; vb0=b0; vb1=b0
  for(i in 1:n){
    s=sample(1:40,20); Y=y[s]; X=x[s]
    m=lm(Y~X)
    b0[i]=m$coeff[1]; b1[i]=m$coeff[2]; r=residuals(m)
    # estimated variances of b0 and b1 (var(r)*19/18 = SSE/(n-2) with n=20)
    vb0[i]=(1/20+(mean(X))^2/sum((X-mean(X))^2))*var(r)*19/18
    vb1[i]=1/sum((X-mean(X))^2)*var(r)*19/18
    # running Monte Carlo variances of the estimates
    v0[i]=var(b0[1:i]); v1[i]=var(b1[1:i])}
  b=data.frame(b0,b1,vb0,vb1,v0,v1)
  return(b)}
b=f(10000)
mean(b[,1]); mean(b[,2]); mean(b[,3]); mean(b[,4]); mean(b[2:10000,5]); mean(b[2:10000,6])
t0=(b[,1]-2.965)/sqrt(b[,5]); t0=t0[2:10000]
z0=(b[,1]-mean(b[,1]))/sqrt(2.916358)
t1=(b[,2]-mean(b[,2]))/sqrt(b[,6]); t1=t1[2:10000]
z1=(b[,2]-mean(b[,2]))/0.1298
########## sigma is unknown
plot(density(t0),xlim=c(min(t0),max(t0)),ylim=c(0,0.7),lty=2,lwd=3,col="green",main=expression(density~of~studentized(hat(beta[0]))),xlab=expression(studentized(hat(beta[0]))))
lines(sort(t0),dt(sort(t0),18),col="black",lwd=1)
legend(-2,0.6,c("simulation","t-dist(18)"),col=c("green","black"),lwd=c(3,1),lty=c(2,1))
plot(density(t1),xlim=c(min(t1),max(t1)),ylim=c(0,0.6),lty=2,lwd=3,col="green",main=expression(density~of~studentized(hat(beta[1]))),xlab=expression(studentized(hat(beta[1]))))
lines(sort(t1),dt(sort(t1),18),col="black",lwd=1)
legend(-2,0.6,c("simulation","t-dist(18)"),col=c("green","black"),lwd=c(3,1),lty=c(2,1))
########## sigma is known
plot(density(z0),xlim=c(min(z0),max(z0)),ylim=c(0,0.7),lty=2,lwd=3,col="blue",main=expression(density~of~standardized(hat(beta[0]))),xlab=expression(standardized(hat(beta[0]))))
lines(sort(z0),dnorm(sort(z0)),col="black",lwd=1)
legend(-2,0.7,c("simulation","stand. normal"),col=c("blue","black"),lwd=c(3,1),lty=c(2,1))
plot(density(z1),xlim=c(min(z1),max(z1)),ylim=c(0,0.7),lty=2,lwd=2,col="blue",main=expression(density~of~standardized(hat(beta[1]))),xlab=expression(standardized(hat(beta[1]))))
lines(sort(z1),dnorm(sort(z1)),col="black",lwd=1)
legend(-2.5,0.7,c("simulation","stand. normal"),col=c("blue","black"),lwd=c(2,1),lty=c(2,1))
plot(b[,5],ty="l",col="blue",lty=1,ylim=c(2.5,3.5),ylab=expression(var~of~hat(beta[0])))
lines(c(0,10000),c(2.9106,2.9106),lwd=1,col="red")
plot(b[,6],ty="l",col="blue",lty=1,ylim=c(0.015,0.025),ylab=expression(var~of~hat(beta[1])))
lines(c(0,10000),c(0.018952,0.018952),lwd=1,col="red")
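As a cross-check on the analytic variance formulas for β̂0 and β̂1 (this snippet is ours, not from the source), replacing σ² by its estimate SSE/(n − 2) reproduces the standard errors reported by summary(lm(...)); the simulated data and seed below are assumptions for illustration.

```r
# Analytic standard errors of the regression coefficients agree with lm():
# Var(b0) = s^2 * (1/n + xbar^2/Sxx), Var(b1) = s^2 / Sxx, s^2 = SSE/(n-2).
set.seed(2)
n <- 20
x <- runif(n, 0, 20)
y <- 3 + 1.2 * x + rnorm(n, 0, 2)
m <- lm(y ~ x)
Sxx <- sum((x - mean(x))^2)
s2 <- sum(residuals(m)^2) / (n - 2)
se_b0 <- sqrt(s2 * (1/n + mean(x)^2 / Sxx))
se_b1 <- sqrt(s2 / Sxx)
# se_b0 and se_b1 match coef(summary(m))[, "Std. Error"]
```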