Lab 8 Activities Solution
Solution
Instructions:
• Fill in your name in line 3.
• Knit to pdf to read the questions in a more readable format.
• Fill in the code chunks below and answer the questions with text responses. It is recommended that you
knit to pdf after you fill in each code chunk. When adding text responses, never copy-paste
symbols from outside of the document.
• If you install any packages, do so in the R console, not in code chunks. The library() function must
appear in a code chunk if you will use a function from a package that is not part of base R.
• Your responses must use code that was covered in class; other methods to solve the problems will not
be accepted.
• Submit your knit pdf file to Crowdmark.
A reminder that the R code we have covered in class is available on our STAT 2150 A01 UM Learn page,
under Content > Course Material.
Your knit pdf file should show the result answering each question. To do this, after creating an R object, you
should also print it on a new line within the code chunk.
Question 1:
Suppose that a book editor finds that the number of typos per 100 pages of a manuscript follows a Poisson
distribution with λ = 3.
(a) Write the R code that calculates the probability of more than 3 typos in a 100-page manuscript.
1 - ppois(3,3)
## [1] 0.3527681
# or:
1 - dpois(0,3) - dpois(1,3) - dpois(2,3) - dpois(3,3)
## [1] 0.3527681
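As a cross-check on part (a), the same upper-tail probability can also be requested directly through the `lower.tail` argument of `ppois()`. This is only a verification sketch and not necessarily a method covered in class; the names `p_complement` and `p_upper` are illustrative.

```r
# P(X > 3) for X ~ Poisson(3), computed two ways
p_complement = 1 - ppois(3, 3)              # complement of P(X <= 3)
p_upper = ppois(3, 3, lower.tail = FALSE)   # upper tail P(X > 3) requested directly
all.equal(p_complement, p_upper)            # checks that the two values agree
```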
(b) Write the R code that generates the number of typos in 20 sets of 100-page manuscripts. Store the
generated data in a vector called data20. Find the proportion of the 20 sets that have more than 3
typos. Do not change the seed in the below code chunk from 123.
set.seed(123)
data20 = rpois(20,3)
length(which(data20 > 3))/20
## [1] 0.4
(c) Repeat part (b), this time generating the number of typos in 100 sets of 100-page manuscripts. Store
the generated data in a vector called data100. Find the proportion of the 100 sets that have more than
3 typos. Again, do not change the seed from 123.
set.seed(123)
data100 = rpois(100,3)
length(which(data100 > 3))/100
## [1] 0.37
(d) Explain how your results in parts (b) and (c) relate to your result from part (a).
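The sample proportions in parts (b) and (c), 0.4 and 0.37, are both estimates of the true probability
0.3528 from part (a). With a larger sample size (part (c) compared to part (b)), the sample proportion
gets closer to the true probability.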
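The pattern described in part (d) can be explored over a wider range of sample sizes. The following is a minimal sketch; the sample sizes in `sizes` and the reuse of seed 123 are illustrative assumptions, and the results will not reproduce `data100` because all draws come from one continuing random number stream.

```r
set.seed(123)                       # illustrative seed, following the lab's convention
true_prob = 1 - ppois(3, 3)         # exact P(X > 3) from part (a)
sizes = c(20, 100, 1000, 100000)    # hypothetical sample sizes
props = numeric(length(sizes))
for(i in 1:length(sizes)){
  draws = rpois(sizes[i], 3)        # simulate typo counts for this sample size
  props[i] = length(which(draws > 3))/sizes[i]
}
props                               # sample proportions with more than 3 typos
abs(props - true_prob)              # distance from the true probability
```

The distance from the true probability tends to shrink as the sample size grows, although it is not guaranteed to decrease at every step.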
(e) We would like to determine the number of typos per 100 pages such that 75% of the time, the number
of typos is less than or equal to this number. Write the R code that determines this number for the
population of all 100-page manuscripts.
qpois(0.75,3)
## [1] 4
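To see why the answer is 4, recall that `qpois()` returns the smallest value x for which P(X ≤ x) is at least the requested probability. A short sketch that checks this against the cumulative probabilities (the names `x_values`, `cdf`, and `smallest` are illustrative):

```r
x_values = 0:10
cdf = ppois(x_values, 3)                       # P(X <= x) for x = 0, 1, ..., 10
smallest = x_values[min(which(cdf >= 0.75))]   # first x whose CDF reaches 0.75
smallest
ppois(3, 3)                                    # still below 0.75
ppois(4, 3)                                    # at least 0.75
```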
(f) Write the R code that finds the data value x in the data20 vector such that 75% of the values in the
vector are less than or equal to x. Repeat with the data100 vector.
sorted_data20 = sort(data20)
sorted_data20[15] # 15 is 75% of 20
## [1] 5
# or:
quantile(data20,0.75)
## 75%
## 5
sorted_data100 = sort(data100)
sorted_data100[75] # 75 is 75% of 100
## [1] 4
# or:
quantile(data100,0.75)
## 75%
## 4
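The two approaches in part (f) agree for these data, but they are not the same calculation in general. By default, `quantile()` interpolates between neighbouring sorted values (type 7), whereas indexing the sorted vector always returns an actual data value. The sketch below uses a small hypothetical vector `x` to show the difference; `type = 1` is the option that uses the inverse of the empirical distribution function.

```r
x = c(2, 9, 4, 7, 1, 6, 3, 12)                          # hypothetical data, n = 8
sorted_x = sort(x)                                      # 1 2 3 4 6 7 9 12
by_index = sorted_x[6]                                  # 6 is 75% of 8
q_default = quantile(x, 0.75, names = FALSE)            # type 7: interpolates
q_type1 = quantile(x, 0.75, type = 1, names = FALSE)    # type 1: a data value
by_index     # 7
q_default    # 7.5
q_type1      # 7
```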
(g) Write the R code that calculates the probability of finding 3 typos in a 200-page manuscript.
dpois(3,6) # 200 pages is two sets of 100 pages, so lambda = 2*3 = 6
## [1] 0.08923508
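The rate of 6 used in part (g) relies on the fact that the sum of independent Poisson counts is again Poisson, with the rates added. As a rough check, assuming the two 100-page halves of the manuscript are independent, the probability of 3 typos in total can be built from every way of splitting 3 typos between the halves:

```r
# A total of 3 typos can split between the halves as (0,3), (1,2), (2,1), (3,0)
k = 0:3
p_convolution = sum(dpois(k, 3) * dpois(3 - k, 3))   # sum over all splits
p_direct = dpois(3, 6)                               # Poisson with lambda = 6
all.equal(p_convolution, p_direct)                   # checks that the two values agree
```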
(h) Simulate the number of typos in 75 such 200-page manuscripts. Then calculate what proportion of the
75 generated values have 3 typos. Do not change the seed in the below code chunk from 123.
set.seed(123)
data = rpois(75,6)
length(which(data == 3))/75
## [1] 0.09333333
(i) Suppose a random sample of 5 authors each have 100-page manuscripts. Write the R code that calculates
the probability that 3 of the 5 authors each have 1 typo in their manuscripts.
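prob = dpois(1,3) # probability that one author has exactly 1 typo
dbinom(3,5,prob) # probability that exactly 3 of the 5 authors have 1 typo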
## [1] 0.02411037
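Part (i) combines two distributions: the Poisson gives the chance that a single author has exactly 1 typo, and the binomial counts how many of the 5 authors do. Assuming the authors are independent (they are a random sample), the `dbinom()` result can be checked by writing out the binomial formula; the names `by_formula` and `by_dbinom` are illustrative.

```r
prob = dpois(1, 3)                                  # P(exactly 1 typo) for one author
by_formula = choose(5, 3) * prob^3 * (1 - prob)^2   # binomial probability written out
by_dbinom = dbinom(3, 5, prob)                      # same probability from dbinom()
all.equal(by_formula, by_dbinom)                    # checks that the two values agree
```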
Question 2:
Suppose we have a normally distributed variable X with mean 100 and standard deviation 10.
(a) Generate a sample of 500 observations from this distribution and calculate how many of the 500
observations are between 105 and 115. Do not change the seed in the below code chunk from 123.
set.seed(123)
data = rnorm(500,100,10)
length(which(data < 115 & data > 105))
## [1] 112
(b) Calculate the probability that a randomly selected observation is between 105 and 115. Based on this
probability, calculate how many observations in a sample of 500 you expect to be between 105 and
115.
prob = pnorm(115,100,10)-pnorm(105,100,10)
prob
## [1] 0.2417303
prob*500
## [1] 120.8652
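The simulated count from part (a), 112, differs from the expected count of 120.8652. One way to judge whether a gap of this size is ordinary sampling variation is to treat the count as a Binomial(500, prob) variable and measure the gap in standard deviations. This goes beyond what the question asks and is only a sketch under that binomial assumption; the names `expected_count`, `sd_count`, and `observed_count` are illustrative.

```r
prob = pnorm(115, 100, 10) - pnorm(105, 100, 10)   # P(105 < X < 115)
expected_count = 500 * prob                        # mean of Binomial(500, prob)
sd_count = sqrt(500 * prob * (1 - prob))           # standard deviation of the count
observed_count = 112                               # simulated count reported in part (a)
(observed_count - expected_count) / sd_count       # gap in standard deviations
```

The observed count falls less than one standard deviation below the expected count, so the two results are consistent with each other.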
Question 3:
A student wrote the following code to sample from a discrete distribution with some probabilities for the
various values of X, using the inversion method of sampling:
u = runif(1000,0,1)
x = numeric(1000)
for(i in 1:1000){
  if(u[i] < 0.35){
    x[i] = -10
  } else if(u[i] < 0.60){
    x[i] = 0
  } else if(u[i] < 0.90){
    x[i] = 10
  } else{
    x[i] = 15
  }
}
Select a sample of the same size from a discrete distribution, using the same support and same probabilities
that were used above, this time using the sample() function. Store the sample in a vector, but do not print
the vector, as the output will be long.
data = sample(c(-10,0,10,15),1000,replace=TRUE,prob=c(0.35,0.25,0.30,0.10))
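The probabilities passed to `sample()` are the widths of the intervals defined by the cutoffs in the student's if/else chain. The sketch below recovers them with `diff()` and also shows a vectorized form of the same inversion using `findInterval()`; neither function is necessarily covered in class, and the names `cutoffs`, `probs`, `values`, and `x_inversion` are illustrative.

```r
cutoffs = c(0.35, 0.60, 0.90)      # thresholds from the if/else chain
probs = diff(c(0, cutoffs, 1))     # interval widths: 0.35 0.25 0.30 0.10
probs
values = c(-10, 0, 10, 15)         # support of X
u = runif(1000, 0, 1)
# findInterval() counts how many cutoffs are <= u, so adding 1 gives the index
# of the value that the if/else chain would assign
x_inversion = values[findInterval(u, cutoffs) + 1]
```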
Question 4:
Suppose we take a random sample of size n from a normal distribution with unknown mean $\mu$ and unknown
variance $\sigma^2$. Consider two different estimators of $\sigma^2$:
$$\hat{\sigma}_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
and
$$\hat{\sigma}_2^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Note that $\hat{\sigma}_1^2$ is the same as the well-known sample variance $s^2$, implemented in R with the var() function.
Let us explore why we prefer to use $\hat{\sigma}_1^2$ as an estimator of $\sigma^2$ rather than $\hat{\sigma}_2^2$.
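The two estimators differ only in the divisor. As a minimal sketch, both can be computed by hand for a single sample and compared with `var()`; the seed and the variable names here are illustrative assumptions.

```r
set.seed(123)                                      # illustrative seed
x = rnorm(25, 0, 1)                                # one sample of size n = 25
n = length(x)
sigma1squared_hat = sum((x - mean(x))^2)/(n - 1)   # divides by n - 1
sigma2squared_hat = sum((x - mean(x))^2)/n         # divides by n
all.equal(sigma1squared_hat, var(x))               # first estimator matches var()
all.equal(sigma2squared_hat, (n - 1)/n * var(x))   # second is a rescaled var()
```

Because the second estimator divides the same sum of squares by a larger number, it is always the smaller of the two.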
Consider taking a sample of size 25 from the standard normal distribution (where we know the population
variance is 1):
rnorm(25,0,1)
Now repeat this process over and over again 1,000 times, and for each of these samples, calculate $s^2 = \hat{\sigma}_1^2$ and
$\hat{\sigma}_2^2$. (Note that $\hat{\sigma}_2^2 = \frac{n-1}{n}\hat{\sigma}_1^2$.) We then have 1,000 estimates of $\sigma^2 = 1$ using $s^2 = \hat{\sigma}_1^2$ and 1,000 estimates of
$\sigma^2 = 1$ using $\hat{\sigma}_2^2$. (Of course, if we already know $\sigma^2 = 1$, there would be no point in estimating it, but we are
trying to assess which of these two estimators performs better so we have to assume we know the value of
$\sigma^2$.) The code is provided in the below code chunk:
set.seed(123)
samples = vector("list",length=1000)
for(i in 1:1000){
  samples[[i]] = rnorm(25,0,1)
}
s2 = sapply(samples,var) # 1000 estimates using s^2
n = 25
sigma2squared_hat = (n-1)/n*s2 # 1000 estimates using the other estimator
Now find the average of the 1,000 estimates of $\sigma^2$ using $s^2 = \hat{\sigma}_1^2$ and the average of the 1,000 estimates of $\sigma^2$
using $\hat{\sigma}_2^2$.
mean(s2)
## [1] 1.00403
mean(sigma2squared_hat)
## [1] 0.963869
What do these results indicate about the performance of $s^2 = \hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ for estimating $\sigma^2$?
The average value of $s^2$ (1.00403) is about 9 times closer to $\sigma^2 = 1$ than the average value of the other
estimator (0.963869), which falls noticeably below 1. So it seems $s^2$ is a better estimator of $\sigma^2$.
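The gap between the two averages is what theory predicts. Using the standard fact that the sample variance is unbiased, $E[s^2] = \sigma^2$, together with the relation between the two estimators noted earlier, the expected value of the second estimator can be sketched as:

```latex
\begin{aligned}
E[\hat{\sigma}_2^2] &= E\!\left[\frac{n-1}{n}\,s^2\right] = \frac{n-1}{n}\,\sigma^2, \\
\operatorname{Bias}(\hat{\sigma}_2^2) &= \frac{n-1}{n}\,\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}.
\end{aligned}
```

With n = 25 and $\sigma^2 = 1$, the expected value is 24/25 = 0.96, which is close to the simulated average of 0.963869. The second estimator therefore underestimates $\sigma^2$ on average, while $s^2$ does not.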