Tutorial 5
Please download the t5e1, t5e2, t5e3 and t5e4 Excel data files from the subject website and
save them to your USB flash drive. Read this handout and try to complete the tutorial
exercises before your tutorial class, so that you can ask your tutor for help during the Zoom
session if necessary.
After you have completed the tutorial exercises attempt the “Exercises for assessment”.
Save your solutions and answers along with the relevant R/Rstudio scripts and printouts in
a Word document and email it to your tutor by the next tutorial in order to get the tutorial
mark.
Let’s start with a single population variance. Suppose we take all possible random samples
of the same size (n) from a normal population, $X: N(\mu, \sigma^2)$, calculate the sample variance
($s^2$) from each, and develop the relative frequency distribution, i.e. the sampling distribution,
of the sample variance estimator.
It can be shown that the sample variance is an unbiased estimator of the population
variance, i.e. $E(s^2) = \sigma^2$, and that the sum of squared deviations from the sample mean
divided by the population variance follows a chi-square distribution with n-1 degrees of
freedom, i.e.
$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
From this result, (i) the $(1-\alpha)100\%$ confidence interval estimate of $\sigma^2$ is
$$\left( \frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \; , \; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}} \right)$$
where $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$ are the $(1-\alpha/2)100\%$ and $(\alpha/2)100\%$ percentiles of the
chi-square distribution with n-1 degrees of freedom, and (ii) the test statistic for $H_0: \sigma^2 = \sigma_0^2$ is
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} \sim \chi^2_{n-1}$$
L. Kónya, 2020, Semester 1 ECON20003 - Tutorial 5
Exercise 1 (Selvanathan, p. 581, ex. 14.10)
A company manufactures steel shafts for use in engines. One method for judging
inconsistencies in the production process is to determine the variance of the lengths of the
shafts. A random sample of 10 shafts was taken and their lengths measured in centimetres.
These measurements are saved in the t5e1 file. Do the calculations first manually and then
with R.
a) Find a 90% confidence interval for the population variance σ2, assuming that the lengths
of the steel shafts are normally distributed, at least approximately.
Use your calculator to find the sample standard deviation of length and square it to
obtain s2 = 0.493.
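The sample size is 10 and the confidence level is $(1-\alpha)100\% = 90\%$, so the required
chi-square percentiles from Table 5 of Selvanathan (Appendix B, p. 1077) are¹
$$\chi^2_{\alpha/2,n-1} = \chi^2_{0.05,9} = 16.9 \quad \text{and} \quad \chi^2_{1-\alpha/2,n-1} = \chi^2_{0.95,9} = 3.33$$
and the confidence interval limits are
$$\frac{(n-1)s^2}{\chi^2_{0.05,9}} = \frac{9 \times 0.493}{16.9} = 0.263 \quad \text{and} \quad \frac{(n-1)s^2}{\chi^2_{0.95,9}} = \frac{9 \times 0.493}{3.33} = 1.332$$
Hence, with 90% confidence, the population variance of the lengths of the shafts is
somewhere between 0.263 and 1.332 square centimetres.²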
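The manual interval can be cross-checked in R from the summary statistics alone. The following is a minimal sketch that assumes only n = 10 and s² = 0.493 from the text; the object names (ci_var and so on) are chosen for illustration and do not come from the handout. Because qchisq returns exact percentiles rather than rounded table values, the limits may differ from the manual ones in the third decimal place.

```r
# Summary statistics quoted in the text (not recomputed from the t5e1 data)
n  <- 10
s2 <- 0.493
conf_level <- 0.90
alpha <- 1 - conf_level

# qchisq() works with left-tail probabilities, so the upper-tail table value
# chi-square(alpha/2, n-1) corresponds to qchisq(1 - alpha/2, df = n - 1)
chi_upper <- qchisq(1 - alpha / 2, df = n - 1)
chi_lower <- qchisq(alpha / 2, df = n - 1)

# Confidence interval limits for the population variance
ci_var <- c(lower = (n - 1) * s2 / chi_upper,
            upper = (n - 1) * s2 / chi_lower)
print(round(ci_var, 3))
```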
b) In order to meet quality requirements, the production process of steel shafts has to be
suspended and the machines adjusted as soon as possible when the variance of the
lengths of the shafts is larger than 0.4 square centimetres. At the 5% level of significance, can
we conclude that the population variance $\sigma^2$ is greater than 0.4 square centimetres and thus
some urgent adjustment is required? Assume again that the lengths of the steel shafts
are normally distributed.
$$H_0: \sigma^2 = 0.4 \; , \quad H_A: \sigma^2 > 0.4$$
This is a right-tail test and the 5% critical value with 9 degrees of freedom is the same
as the upper $\chi^2$ table value in part (a), i.e. 16.9, and H0 is rejected if the observed test
statistic exceeds this critical value.
¹ These chi-square percentiles can also be obtained by applying the R quantile function qchisq(alpha, df = ),
i.e. by executing the qchisq(0.05, df = 9) and qchisq(0.95, df = 9) commands. Note that qchisq works with
left-tail probabilities, so qchisq(0.05, df = 9) returns the lower percentile (3.33) and qchisq(0.95, df = 9) the upper one (16.9).
² The question is about the population variance. If it were about the population standard deviation, we should
take the square roots of these confidence interval limits.
The test statistic value calculated from the sample at hand is
$$\chi^2_{obs} = \frac{(n-1)s^2}{\sigma_0^2} = \frac{9 \times 0.493}{0.4} = 11.0925$$
Since it is smaller than the critical value, at the 5% significance level we cannot reject
H0 and cannot conclude that urgent adjustment is needed.
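The same right-tail test can be sketched in base R from the summary statistics. This is only an illustration under the assumption that n = 10, s² = 0.493 and σ0² = 0.4 as stated in the text; the object names are hypothetical.

```r
# Right-tail chi-square test of H0: sigma^2 = 0.4 against HA: sigma^2 > 0.4
n         <- 10
s2        <- 0.493   # sample variance quoted in the text
sigma0_sq <- 0.4     # hypothesized population variance
alpha     <- 0.05

chi_obs  <- (n - 1) * s2 / sigma0_sq                         # observed test statistic
chi_crit <- qchisq(1 - alpha, df = n - 1)                    # 5% right-tail critical value
p_value  <- pchisq(chi_obs, df = n - 1, lower.tail = FALSE)  # right-tail p-value

reject_H0 <- chi_obs > chi_crit
print(c(statistic = chi_obs, critical = chi_crit, p.value = p_value))
print(reject_H0)
```

The decision based on the critical value and the decision based on the p-value are equivalent: H0 is rejected only if the p-value is below the significance level.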
To complete parts (a) and (b) with R, launch RStudio, create a new project and script,
name them t5e1, and import the data saved in the t5e1 Excel data file to RStudio. In
Tutorial 3 you already installed the DescTools package and used its SignTest function.
Another function in this package is VarTest. Execute the
library(DescTools)
VarTest(length, sigma.squared = 0.4, alternative = "greater")
commands. On the resulting printout,
the variable of interest is length and the alternative hypothesis is that the true variance is greater
than 0.4. The test statistic value is 11.103, about the same as the one calculated
manually. The p-value is 0.2687, so H0 is maintained at the 5% significance level.
The printout also shows the 95% confidence interval, but since we performed a right-tail
test, it is open from above. To obtain a two-sided confidence interval like the one in part (a),
you need to request a two-tail test.
The two-tail version of the command returns the printout shown below. It shows the 95% confidence
interval, (0.233, 1.645), which is wider than the 90% confidence interval obtained in part (a).
³ Like other tests, this test is by default performed at the 5% significance level. Use the optional conf.level
argument if your significance level is different.
One Sample Chi-Square test on variance
data: length
X-squared = 11.103, df = 9, p-value = 0.5375
alternative hypothesis: true variance is not equal to 0.4
95 percent confidence interval:
0.2334571 1.6445776
sample estimates:
variance of x
0.4934444
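The numbers on this printout can be reproduced in base R from the reported sample variance, which also makes it easy to switch to the 90% confidence level used in part (a). The sketch below assumes n = 10 and s² = 0.4934444 as shown on the printout, and it assumes that the two-tail p-value is twice the smaller tail probability, which is consistent with the reported value; the object names are illustrative only.

```r
# Values taken from the printout above
n         <- 10
s2        <- 0.4934444
sigma0_sq <- 0.4

chi_obs <- (n - 1) * s2 / sigma0_sq

# Two-tail p-value: twice the smaller of the two tail probabilities (assumption)
p_right <- pchisq(chi_obs, df = n - 1, lower.tail = FALSE)
p_two   <- 2 * min(p_right, 1 - p_right)

# Two-sided confidence intervals for the population variance
ci95 <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)
ci90 <- (n - 1) * s2 / qchisq(c(0.95, 0.05), df = n - 1)

print(round(c(statistic = chi_obs, p.value = p_two), 4))
print(round(ci95, 4))
print(round(ci90, 4))
```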
c) What assumption did we have to make to answer parts (a) and (b)?
The confidence interval estimator and the chi-square test of $\sigma^2$ require that the sampled
population be normally distributed. Although these techniques remain valid for some
moderate deviations from normality, they are less robust than the confidence interval
estimator and the t-test of $\mu$. Unfortunately, given the small sample size, this time we
cannot rely on the standard checks of normality.
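Suppose now that we are interested in the comparison of the variances of two normally
distributed populations, $X_1: N(\mu_1, \sigma_1^2)$ and $X_2: N(\mu_2, \sigma_2^2)$. Assuming that we draw
independent random samples of sizes $n_1$ and $n_2$ from these populations, and calculate the
sample variances ($s_1^2$, $s_2^2$) from each,
$$\frac{(n_1-1)s_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1} \; , \quad \frac{(n_2-1)s_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1}$$
The ratio of two independent chi-square random variables divided by their respective
degrees of freedom has an F distribution, so
$$\frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \sim F_{n_1-1,n_2-1}$$
From this result, (i) the $(1-\alpha)100\%$ confidence interval estimate of $\sigma_1^2/\sigma_2^2$ is
$$\left( \frac{s_1^2}{s_2^2} \cdot \frac{1}{F_{\alpha/2,n_1-1,n_2-1}} \; , \; \frac{s_1^2}{s_2^2} \cdot \frac{1}{F_{1-\alpha/2,n_1-1,n_2-1}} \right)$$
where $F_{\alpha/2,n_1-1,n_2-1}$ and $F_{1-\alpha/2,n_1-1,n_2-1}$ are the $(1-\alpha/2)100\%$ and $(\alpha/2)100\%$ percentiles of the
F-distribution with df1 = n1 - 1 numerator degrees of freedom and df2 = n2 - 1 denominator
degrees of freedom, and (ii) the test statistic for $H_0: \sigma_1^2/\sigma_2^2 = 1$ against a one-sided or
two-sided alternative hypothesis is
$$F = \frac{s_1^2}{s_2^2} \sim F_{n_1-1,n_2-1}$$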
Exercise 2 (Selvanathan, p. 596, ex. 14.37)
a) The sample variances are s12 = 3.346 and s22 = 10.950. Estimate the ratio of the two
population variances with 95% confidence.
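Assuming that both service times, X1 and X2, are normally distributed, we can use the
confidence interval estimator given above.
Both sample sizes are 100 and the confidence level is $(1-\alpha)100\% = 95\%$, so the
required F percentiles from Table 6(b) of Selvanathan (Appendix B, pp. 1080-81)⁴ are
$$F_{\alpha/2,n_1-1,n_2-1} = F_{0.025,99,99} \approx F_{0.025,100,100} = 1.48$$
and
$$F_{1-\alpha/2,n_1-1,n_2-1} = F_{0.975,99,99} \approx F_{0.975,100,100} = \frac{1}{F_{0.025,100,100}} = \frac{1}{1.48} = 0.676$$
so the confidence interval limits are
$$\frac{3.346}{10.950} \times \frac{1}{1.48} = 0.206 \quad \text{and} \quad \frac{3.346}{10.950} \times \frac{1}{0.676} = 0.452$$
Hence, with 95% confidence, the ratio of the variances of the populations of the service
times at the two tellers is somewhere between 0.206 and 0.452.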
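As a cross-check, the interval can be computed in R with the exact F percentiles for 99 and 99 degrees of freedom instead of the rounded table values. This is a minimal sketch that assumes the sample variances 3.346 and 10.950 and the sample sizes of 100 given in the text; the object names are illustrative. It also checks the reciprocal relationship between left-tail and right-tail F percentiles.

```r
# Summary statistics quoted in the text
s1_sq <- 3.346
s2_sq <- 10.950
n1 <- 100
n2 <- 100
alpha <- 0.05

ratio <- s1_sq / s2_sq

# qf() works with left-tail probabilities
f_upper <- qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1)  # table value F(alpha/2)
f_lower <- qf(alpha / 2,     df1 = n1 - 1, df2 = n2 - 1)  # table value F(1 - alpha/2)

ci_ratio <- c(lower = ratio / f_upper, upper = ratio / f_lower)
print(round(ci_ratio, 3))

# Left-tail percentile equals the reciprocal of the right-tail percentile
# with the degrees of freedom swapped
reciprocal_ok <- isTRUE(all.equal(f_lower,
                                  1 / qf(1 - alpha / 2, df1 = n2 - 1, df2 = n1 - 1)))
print(reciprocal_ok)
```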
b) Do the data allow us to infer at the 10% significance level that the variances in service
times differ between the two tellers?
$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1 \; , \quad H_A: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$$
⁴ Tables 6(a)-6(d) provide F values only under the right tails of the various F distributions. F values under the
left tails can be determined from right-tail F values using the following formula: $F_{1-\alpha,df_1,df_2} = 1 / F_{\alpha,df_2,df_1}$. Note
also that we need to round df1 and df2 up to 100 because the F-tables provide the percentiles for selected
degrees of freedom only. The exact F percentiles could be obtained by applying the R quantile function
qf(alpha, df1 = , df2 = ), i.e. by executing the qf(0.025, df1 = 99, df2 = 99) and qf(0.975, df1 = 99, df2 = 99)
commands.
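Assume again that both service times are at least approximately normally distributed,
so that we can perform the F-test described above. The critical values are
$$F_{\alpha/2,n_1-1,n_2-1} = F_{0.05,99,99} \approx F_{0.05,100,100} = 1.39$$
and
$$F_{1-\alpha/2,n_1-1,n_2-1} = F_{0.95,99,99} \approx F_{0.95,100,100} = \frac{1}{F_{0.05,100,100}} = \frac{1}{1.39} = 0.719$$
H0 is rejected if the observed test statistic is smaller than the lower critical value (0.719)
or larger than the upper critical value (1.39). The observed test statistic is
$$F_{obs} = \frac{s_1^2}{s_2^2} = \frac{3.346}{10.950} = 0.306$$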
Since it is smaller than the lower critical value, we reject H0 at the 10% significance level
and conclude that the variances in service times differ between the two tellers.
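The two-tail F-test can also be sketched in base R from the summary statistics. The code below assumes the sample variances and sample sizes quoted in the exercise and uses hypothetical object names; it is an illustration rather than the handout's own script.

```r
# Two-tail F-test of H0: sigma1^2 / sigma2^2 = 1 at the 10% significance level
s1_sq <- 3.346
s2_sq <- 10.950
df1 <- 99
df2 <- 99
alpha <- 0.10

F_obs <- s1_sq / s2_sq

# Exact lower and upper critical values (the tables give roughly 0.719 and 1.39)
F_crit <- qf(c(alpha / 2, 1 - alpha / 2), df1 = df1, df2 = df2)

# Two-tail p-value: twice the smaller tail probability
p_two <- 2 * min(pf(F_obs, df1, df2),
                 pf(F_obs, df1, df2, lower.tail = FALSE))

reject_H0 <- F_obs < F_crit[1] || F_obs > F_crit[2]
print(round(c(F_obs = F_obs, lower = F_crit[1], upper = F_crit[2]), 3))
print(p_two)
print(reject_H0)
```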
Performing this test manually, it is possible to simplify the procedure a bit by labelling
the population with the larger sample variance as population 1. By doing so we ensure
that the sample statistic is not smaller than one and hence we need only the upper
critical value straight from the F tables. The observed test statistic is
$$F_{obs} = \frac{10.950}{3.346} = 3.273$$
and since it is larger than the upper critical value, we would again reject H0.⁵
To complete parts (a) and (b) with R, you can use the same VarTest command as in
Exercise 1, but with a slightly different set of arguments. This time, the general form of
the command is
VarTest(x, y, ratio = , alternative = )
where x and y are the variables to be compared, ratio is the hypothesized ratio of the
two population variances (by default, it is 1), and alternative is like before.
⁵ A word of caution is in order: if the test were a one-tail test, swapping the sample variances would imply that
the alternative hypothesis must be swapped as well, for example, from $H_A: \sigma_1^2/\sigma_2^2 < 1$ to $H_A: \sigma_2^2/\sigma_1^2 > 1$.
Launch RStudio, create a new project and script, name them t5e2, import the data
saved in the t5e2 Excel data file to RStudio, and execute the following commands:
library(DescTools)
VarTest(Teller1, Teller2)
The variable of interest is the variance of x (Teller1) divided by the variance of y (Teller2), and
the alternative hypothesis is that the true ratio of variances is not equal to 1. The ratio of the
sample variances, and hence the test statistic value, is about 0.306. The numerator and
denominator degrees of freedom are both 99, and the p-value is practically zero, so H0
can be rejected at any reasonable significance level.
c) What assumption did you have to make in order to answer parts (a) and (b)? Try to
verify whether that assumption is reasonable this time.
In parts (a) and (b) we assumed that the samples are random and independent and that
both sampled populations are normally distributed, at least approximately. Take the first
two requirements for granted and perform the usual diagnostics for normality.
The histograms of Teller1 and Teller2 are shown below. The superimposed normal curves seem
to fit the relative frequency distributions.
The normal QQ plots are also displayed below. On both plots most of the points fall
close to the reference line.
[Figure: Histogram of Teller1. Horizontal axis: Teller1; vertical axis: Density.]
[Figure: Histogram of Teller2. Horizontal axis: Teller2; vertical axis: Density.]
[Figure: Normal Q-Q Plot for Teller1. Horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles.]
[Figure: Normal Q-Q plot for Teller2. Horizontal axis: Theoretical Quantiles.]
The
library(pastecs)
round(stat.desc(Teller1, basic = FALSE, desc = TRUE, norm = TRUE), 3)
round(stat.desc(Teller2, basic = FALSE, desc = TRUE, norm = TRUE), 3)
commands return the descriptive statistics and the normality diagnostics for Teller1 and
for Teller2.
For both samples, the mean and the median are fairly similar, skewness and excess
kurtosis are only insignificantly different from zero, and p-values of the Shapiro-Wilk test
are above 0.5.
All things considered, the normality assumption is supported by all diagnostic checks.⁶
Consider a binary population X, which has only two possible elements, “success” coded as
1 and “failure” coded as 0. Suppose we are interested in the proportion (relative frequency)
of successes in this population, denoted as p. It is given by the usual population mean
formula,
$$p = \frac{1}{N} \sum_{i=1}^{N} X_i$$
It can be estimated from a random sample of n observations with the sample proportion,
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
⁶ When the normality assumption is unreasonable, one can use the nonparametric Siegel-Tukey test for
equality in variability. This test is also available in the DescTools package (SiegelTukeyTest), but we do not
learn about it in this course.
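Depending on whether sampling is with or without replacement, the sample proportion
(p-hat) is a binomial or hypergeometric random variable. When the sample size is large ($np \geq 5$
and $nq \geq 5$, where q = 1 - p, and in the case of sampling without replacement, n < 0.05N as well), the
sampling distribution of p-hat can be approximated with a normal distribution,
$$\hat{p} \sim N(\mu_{\hat{p}}, \sigma_{\hat{p}}) \; , \quad \mu_{\hat{p}} = p \; , \quad \sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$
From this result, (i) an approximate $(1-\alpha)100\%$ confidence interval estimate of p is
$$\hat{p} \pm z_{\alpha/2} s_{\hat{p}} \; , \quad s_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
and (ii) the test statistic for H0: p = p0 against a one-sided or two-sided alternative hypothesis
is
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \sim N(0,1)$$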
Exercise 3
Suppose that in a survey 600 employers are asked whether they have used a recruitment
service within the past two months to find new staff. The responses are saved in the t5e3
Excel file.
a) Construct a 99% confidence interval for the population proportion of employers who
have used a recruitment service within the past two months to find new staff.
Launch RStudio, create a new project and script, name them t5e3, import the data
saved in the t5e3 Excel data file to RStudio, and execute the following commands:
attach(t5e3)
table(used)
used
no yes
474 126
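In general, the
table()
function returns the frequency distribution of a variable. To obtain the relative frequencies,
divide the frequencies by the sample size, which can be found with the
length()
function, which returns the length of vectors and other R objects. In this case, the length
of the used vector is the sample size, so execute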
table(used) / length(used)
used
no yes
0.79 0.21
Alternatively, the same relative frequencies can be obtained with the prop.table() function:
prop.table(table(used))
These frequency and relative frequency distributions show that in the given sample 126
employers out of 600 (i.e. 21%) used a recruitment service within the past two months
to find new staff, so the sample proportion is 0.21. Using this sample proportion,
⁷ n(p-hat) and n(q-hat) are the expected numbers of successes and failures in the sample, granted that the
probability of success is equal to p-hat and the probability of failure is equal to q-hat, and they are the same
as the frequencies of yes and no.
⁸ The sample proportion is always a small number between zero and one, and the estimate of its standard
error is even smaller. In order to avoid unreasonable loss of precision, it is recommended to do the manual
calculations with a precision of 4 or even more decimal places. Once you have determined the required confidence
interval, you can round its limits if you wish.
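$$s_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.21 \times 0.79}{600}} = 0.0166$$
The confidence level is 99% and from the Standard Normal table $z_{\alpha/2} = z_{0.005} = 2.576$, so the
confidence interval is
$$\hat{p} \pm z_{\alpha/2} s_{\hat{p}} = 0.21 \pm 2.576 \times 0.0166 = (0.167, 0.253)$$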
It implies that, with 99% confidence, the proportion of employers who have used a
recruitment service within the past two months to find new staff is between 16.7% and
25.3%.
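The manual calculation can be mirrored in a few lines of base R. This sketch assumes only the counts reported in the frequency table (126 successes out of 600); the object names are illustrative.

```r
# Normal-approximation (large-sample) confidence interval for a proportion
x <- 126          # number of "yes" answers in the sample
n <- 600          # sample size
conf_level <- 0.99

p_hat  <- x / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
z      <- qnorm(1 - (1 - conf_level) / 2)     # z(alpha/2)

ci_wald <- p_hat + c(-1, 1) * z * se_hat
print(round(c(p_hat = p_hat, se = se_hat, z = z), 4))
print(round(ci_wald, 4))
```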
With R, this confidence interval can be generated by the binom.test command that you
already used in Tutorial 3.
The reported 99% confidence interval is (0.169, 0.256). It is slightly wider than the one
we got manually because it is based on the correct binomial distribution while ours was
based on the normal approximation of the binomial distribution.
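A call along the following lines should produce the exact interval. Note that this is an assumed form of the command, built from the counts in the frequency table (126 successes out of 600) and the 99% confidence level; the object names are illustrative.

```r
# Exact (binomial) 99% confidence interval for the population proportion.
# The hypothesized proportion is left at its default (0.5) because only the
# confidence interval is of interest here.
exact_fit <- binom.test(x = 126, n = 600, conf.level = 0.99)
ci_exact  <- as.numeric(exact_fit$conf.int)
print(round(ci_exact, 3))
```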
b) Based on the survey data, is there sufficient evidence at the 0.05 level of significance
that more than 20% of all employers have used a recruitment service within the past two
months to find new staff?
$$H_0: p = 0.2 \; , \quad H_A: p > 0.2$$
The hypothesized value of the population proportion is 0.20, and it implies the following
standard error under the null hypothesis:
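$$\sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{0.2 \times 0.8}{600}} = 0.0163$$
The critical value is
$$z_{\alpha} = z_{0.05} = 1.645$$
and the observed test statistic is
$$z_{obs} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{0.21 - 0.20}{0.0163} = 0.613$$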
Since it is smaller than the critical value, we cannot reject H0 at the 5% significance
level. Hence, there is not enough evidence to conclude that more than 20% of all
employers have used a recruitment service within the past two months to find new staff.
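The manual z-test can be sketched in base R as follows, assuming the sample proportion 0.21, n = 600 and p0 = 0.2 from the text; the object names are illustrative. Since the standard error is not rounded here, the test statistic differs slightly from the manual value in the third decimal place.

```r
# Right-tail z-test of H0: p = 0.2 against HA: p > 0.2
n     <- 600
p_hat <- 126 / 600
p0    <- 0.2
alpha <- 0.05

se0     <- sqrt(p0 * (1 - p0) / n)           # standard error under H0
z_obs   <- (p_hat - p0) / se0                # observed test statistic
z_crit  <- qnorm(1 - alpha)                  # right-tail critical value
p_value <- pnorm(z_obs, lower.tail = FALSE)  # right-tail p-value

print(round(c(se0 = se0, z_obs = z_obs, z_crit = z_crit, p.value = p_value), 4))
```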
By default, the binom.test function assumes that the hypothesized population proportion
is 0.5. In part (a) this was fine because we were interested only in the confidence interval
estimate, which does not depend on the hypothesized population proportion. This time,
however, we are interested in a right-tail test at the 5% significance level with p0 = 0.2,
so we need to execute the binom.test command with these settings.⁹
On the resulting printout the p-value is 0.2849, implying that H0 cannot be rejected at the 5% significance level.
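A call of the following form performs this exact right-tail test. It is an assumed form of the command, based on the counts from the frequency table and the settings described above, and the object name is illustrative.

```r
# Exact binomial right-tail test of H0: p = 0.2 against HA: p > 0.2.
# conf.level = 0.95 is the default and is spelled out only for clarity.
exact_test <- binom.test(x = 126, n = 600, p = 0.2,
                         alternative = "greater", conf.level = 0.95)
print(round(exact_test$p.value, 4))
```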
⁹ Since the default confidence level is 0.95, this time the last argument could be dropped from the command.
¹⁰ This command performs a chi-square test. The chi-square distribution was introduced in the Week 4 lecture
and you will learn about the chi-square test in the lectures next week.
In this case, as you can see, the reported p-value is almost the same as on the previous printout.
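For comparison, a large-sample version of the same test can be run with the prop.test function, which reports a chi-square statistic. The call below is an assumption about what such a command could look like, not a reproduction of the handout's command; the object name is illustrative.

```r
# Large-sample right-tail test of H0: p = 0.2 with prop.test.
# By default prop.test applies a continuity correction.
approx_test <- prop.test(x = 126, n = 600, p = 0.2, alternative = "greater")
print(round(unname(approx_test$statistic), 4))
print(round(approx_test$p.value, 4))
```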
Let’s now turn our attention to the two binary populations case. Suppose that the population
proportions are $p_1$ and $p_2$, that we independently draw a random sample from each
population, and that the sample sizes, $n_1$ and $n_2$, are large enough so that $n_1p_1$, $n_1q_1$, $n_2p_2$
and $n_2q_2$ are all at least 5 (and $n_i < 0.05N_i$ if sampling is without replacement).
Under these conditions, (i) an approximate $(1-\alpha)100\%$ confidence interval of the difference
between the two population proportions, $p_1 - p_2$, is
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} s_{\hat{p}_1-\hat{p}_2} \; , \quad s_{\hat{p}_1-\hat{p}_2} = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
and (ii) the test statistic for $H_0: p_1 - p_2 = D_0$ against a one-sided or two-sided alternative
hypothesis follows the standard normal distribution reasonably well, but its actual formula
depends on $D_0$.
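Namely, on the one hand, if $D_0 = 0$ and hence $p_1 = p_2$ under $H_0$, the common population
proportion is best estimated from the pooled sample,
$$\hat{p} = \frac{f_1 + f_2}{n_1 + n_2}$$
where $f_1$ and $f_2$ are the numbers of successes in the two samples, and the test statistic is
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{s_{\hat{p}_1-\hat{p}_2}} \sim N(0,1)$$
where
$$s_{\hat{p}_1-\hat{p}_2} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$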
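On the other hand, if $D_0 \neq 0$ and hence $p_1 \neq p_2$ under $H_0$, the two population proportions
must be estimated separately,
$$\hat{p}_1 = \frac{f_1}{n_1} \; , \quad \hat{p}_2 = \frac{f_2}{n_2}$$
the standard error is
$$s_{\hat{p}_1-\hat{p}_2} = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
and the test statistic is
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{s_{\hat{p}_1-\hat{p}_2}} \sim N(0,1)$$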
Exercise 4
The impact of the accumulation of carbon dioxide in the atmosphere caused by the burning of
fossil fuels such as oil, coal and natural gas has been hotly debated for more than a decade.
Some environmentalists and scientists have predicted that the excess carbon dioxide will
increase the Earth’s temperature over the next 50 to 100 years with disastrous
consequences.
To gauge the public’s opinion on the subject, a random sample of 400 people was asked
two years ago whether they believed in the greenhouse effect. This year, 500 people were
asked the same question. The results are recorded as 1 = believe in greenhouse effect and
0 = do not believe in greenhouse effect, the variables concerning belief are denoted as X1
(first sample, i.e. two years ago) and X2 (second sample, i.e. this year) and the data are
saved in the t5e4 Excel file.
a) Estimate the real change in the public’s opinion about the subject, using a 90%
confidence level.
¹¹ To save time, these sample proportions were obtained by R, like in Exercise 3.
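Using the sample proportions¹¹ $\hat{p}_1 = 248/400 = 0.62$ and $\hat{p}_2 = 260/500 = 0.52$ as estimates of
the corresponding population proportions,
$$n_1\hat{p}_1 = 248 \; , \quad n_1(1-\hat{p}_1) = 152 \; , \quad n_2\hat{p}_2 = 260 \; , \quad n_2(1-\hat{p}_2) = 240$$
They are all much bigger than 5, so the normal approximation is a reasonable option. The
estimated standard error is
$$s_{\hat{p}_1-\hat{p}_2} = \sqrt{\frac{0.62 \times 0.38}{400} + \frac{0.52 \times 0.48}{500}} = 0.033$$
the critical value is
$$z_{\alpha/2} = z_{0.05} = 1.645$$
and the confidence interval is
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} s_{\hat{p}_1-\hat{p}_2} = (0.62 - 0.52) \pm 1.645 \times 0.033 = (0.0457, 0.1543)$$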
It implies that, with 90% confidence, the proportion of believers in the greenhouse effect
has decreased by 4.6 to 15.4 percentage points.¹²
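The manual interval can be reproduced in base R from the four counts alone. This is a minimal sketch that assumes 248 successes out of 400 in the first sample and 260 out of 500 in the second, as reported in the exercise; the object names are illustrative.

```r
# Normal-approximation confidence interval for p1 - p2
x1 <- 248; n1 <- 400   # first sample (two years ago)
x2 <- 260; n2 <- 500   # second sample (this year)
conf_level <- 0.90

p1 <- x1 / n1
p2 <- x2 / n2

# Unpooled standard error: the proportions are estimated separately
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z       <- qnorm(1 - (1 - conf_level) / 2)

ci_diff <- (p1 - p2) + c(-1, 1) * z * se_diff
print(round(c(p1 = p1, p2 = p2, se = se_diff), 4))
print(round(ci_diff, 4))
```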
In R, this confidence interval can be generated like the one in Exercise 3. The
table(X1)
table(X2)
commands return
X1
0 1
152 248
and
X2
0 1
240 260
There are 248 successes in the first sample and 260 in the second.
The confidence interval for the difference between the two population proportions is
provided by the prop.test function, but this time its x and n arguments (the number of
successes and the sample size) have two elements and they need to be specified by
the c() function as c(x1, x2) and c(n1, n2), respectively.
¹² Recall that the confidence interval has been developed for the change in the proportion of believers between
the first and second surveys, so a positive p1 - p2 value indicates that the true proportion of believers was
larger two years ago than it is today.
The reported 90% confidence interval, (0.0435, 0.1565), is almost identical to the one
we obtained manually. The difference between them is due to the continuity correction
that is referred to in the heading of the printout. The prop.test function applies this
correction, called Yates’ correction for continuity, by default, though it only makes a
practical difference when the sample sizes are small (i.e. at least one of $n_1p_1$, $n_1q_1$, $n_2p_2$ and
$n_2q_2$ is smaller than 5). To see its impact, add the correct = FALSE argument to the
previous command.
The 90% confidence interval is then (0.0457, 0.1543), the same as the one we obtained
manually.
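The two intervals discussed above can be obtained with calls of the following form. The exact commands are assumptions reconstructed from the description of the x, n and correct arguments and from the counts in the frequency tables; the object names are illustrative.

```r
# 90% confidence intervals for p1 - p2 with and without continuity correction
successes <- c(248, 260)   # c(x1, x2)
sizes     <- c(400, 500)   # c(n1, n2)

ci_corrected <- as.numeric(
  prop.test(x = successes, n = sizes, conf.level = 0.90)$conf.int)

ci_uncorrected <- as.numeric(
  prop.test(x = successes, n = sizes, conf.level = 0.90,
            correct = FALSE)$conf.int)

print(round(ci_corrected, 4))
print(round(ci_uncorrected, 4))
```

The corrected interval is wider on each side by half of (1/n1 + 1/n2), which is only 0.00225 with these sample sizes.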
b) Can we infer at the 10% significance level that there has been a decrease in belief in
the greenhouse effect?
Since X1 denotes the belief in the greenhouse effect in the first sample (drawn two years
ago) and X2 denotes the belief in the greenhouse effect in the second sample (drawn
this year), the hypotheses are
$$H_0: p_1 - p_2 = 0 \; , \quad H_A: p_1 - p_2 > 0$$
In this case D0 = 0, so the common population proportion is estimated from the pooled
sample,
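$$\hat{p} = \frac{f_1 + f_2}{n_1 + n_2} = \frac{248 + 260}{400 + 500} = 0.564$$
the standard error is
$$s_{\hat{p}_1-\hat{p}_2} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.564 \times 0.436 \times \left(\frac{1}{400} + \frac{1}{500}\right)} = 0.0333$$
and the observed test statistic is
$$z_{obs} = \frac{\hat{p}_1 - \hat{p}_2}{s_{\hat{p}_1-\hat{p}_2}} = \frac{0.62 - 0.52}{0.0333} = 3.003$$
The critical value is
$$z_{\alpha} = z_{0.1} = 1.282$$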
and since it is smaller than the test statistic, we reject the null hypothesis and conclude
at the 10% significance level that there has been a decrease in belief in the greenhouse
effect.
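The pooled z-test can be sketched in base R from the same four counts. The object names are illustrative; because no intermediate rounding takes place, the test statistic comes out marginally larger than the manual value.

```r
# Right-tail z-test of H0: p1 - p2 = 0 against HA: p1 - p2 > 0
x1 <- 248; n1 <- 400
x2 <- 260; n2 <- 500
alpha <- 0.10

p1 <- x1 / n1
p2 <- x2 / n2

# D0 = 0, so the common proportion is estimated from the pooled sample
p_pool  <- (x1 + x2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z_obs   <- (p1 - p2) / se_pool
z_crit  <- qnorm(1 - alpha)
p_value <- pnorm(z_obs, lower.tail = FALSE)

print(round(c(p_pool = p_pool, se = se_pool, z_obs = z_obs,
              z_crit = z_crit, p.value = p_value), 4))
```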
With R, the prop.test function reports a chi-square test statistic, $\chi^2 = 9.039$. How does it
compare to the standard normal test statistic we calculated manually, z = 3.003? As the
printout shows, this chi-square random variable has one degree of freedom,
and you learnt in the Week 4 lecture that the square of a standard normal
random variable is a chi-square random variable with df = 1. And indeed, as you can
check easily, apart from some rounding error,
$$z_{obs}^2 = 3.003^2 = 9.018 \approx \chi^2_{1,obs}$$
For the sake of comparison, we applied the prop.test function without continuity
correction. Note, however, that in general, there is no reason to override the default
option. If you run the test with the default continuity correction,
the new test statistic is smaller and hence its p-value is larger, but since the sample
sizes are fairly large, the differences are negligible.
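The effect of the continuity correction on the test can be checked with two prop.test calls. These calls are assumptions reconstructed from the description above and from the counts in the frequency tables, and the object names are illustrative.

```r
# Right-tail test of H0: p1 - p2 = 0 with and without continuity correction
successes <- c(248, 260)
sizes     <- c(400, 500)

test_uncorrected <- prop.test(x = successes, n = sizes,
                              alternative = "greater", correct = FALSE)
test_corrected   <- prop.test(x = successes, n = sizes,
                              alternative = "greater")   # correct = TRUE by default

chi_uncorrected <- unname(test_uncorrected$statistic)
chi_corrected   <- unname(test_corrected$statistic)

print(round(c(uncorrected = chi_uncorrected, corrected = chi_corrected), 3))
print(c(uncorrected = test_uncorrected$p.value,
        corrected   = test_corrected$p.value))

# Without the correction, the square root of the chi-square statistic
# reproduces the standard normal test statistic
print(round(sqrt(chi_uncorrected), 3))
```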
Exercises for Assessment
Exercise 5
a) Estimate the ratio of the two population variances with 95% confidence.
b) Can we conclude at the 5% significance level that the population variances differ? What
do you conclude if the significance level is increased to 10%?
In parts (a) and (b) alike, do the calculations both manually and with R.
Exercise 6
In a public opinion survey, 60 out of a sample of 100 high-income voters and 40 out of a
sample of 75 low-income voters supported the introduction of a new national security tax.
Can we conclude at the 5% level of significance that there is a difference in the proportion
of high- and low-income voters favouring a new national security tax? Do the calculations
both manually and with R.