Module 06 - One Population Parameter Estimation - Topic 4A
PARAMETER ESTIMATION
In the real world, population parameters are almost always unknown, because
they represent summary measurements about large populations.
The following topics will discuss the estimation of population parameters which
are unknown using known sample statistics.
We will be estimating the population mean by the sample mean, and the
population proportion by the sample proportion.
Estimation
(i) Point estimation: This is a single number (or value) calculated from
available sample data and used to estimate a population parameter.
Confidence Intervals
Since a sample statistic such as x̄ varies from sample to sample, this variation can be taken into
consideration when estimating the true population mean. Hence an “Interval Estimate” for the
population mean is obtained.
The interval that is constructed has a specified confidence or probability of correctly estimating
the true value of the population mean.
As an example, an interval can be constructed for which there will be a 95% confidence that this
interval includes the true value of the population mean.
In other words, if we repeat the process of obtaining samples and constructing intervals, in the
long run 95% of these intervals will contain the true population mean.
These intervals are called confidence intervals. The “level of confidence” is given as 100 (1 – α) %
for a given value 0 < α < 1. The choice of the confidence level is somewhat arbitrary.
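The long-run interpretation can be checked with a small simulation. The sketch below is illustrative only: the population values (mean 50, standard deviation 10), the sample size of 30 and the 2000 repetitions are assumptions chosen for the demonstration, and it uses the known-σ interval covered later in this topic.

```python
import random
from statistics import NormalDist, mean

def coverage(mu=50.0, sigma=10.0, n=30, reps=2000, conf=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical value z_{alpha/2}
    half_width = z * sigma / n ** 0.5              # sigma is treated as known
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

# Should be close to 0.95; the exact figure depends on the seed.
print(coverage())
```

Any single interval either contains µ or it does not; the 95% describes the procedure over many repetitions.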
Confidence Intervals
The basic form for all confidence intervals, including one and two populations,
is as follows:
point estimate ± (critical value) × (standard error)
The choice of the critical value and standard error depends on what
information we have about the population.
Confidence Intervals
In this unit we will only be estimating the population mean or proportion for one population,
or the difference between population means or proportions for two populations.
The following table shows the point estimator (the sample statistic) that will be used when
calculating the confidence intervals.
Parameter               Symbol   Point estimator
Population mean         µ        x̄
Population proportion   p        p̂
Topic 6 introduced the Normal Distribution, by far the most commonly used continuous
distribution, as many phenomena follow it. It often gives a good approximation even
when the data are not exactly normal, so it is widely used as an approximation,
especially in modern finance.
The normal distribution has such useful properties that even in cases when the data is far from
“normal” (for example, when it has a marked positive skew), a transformation (taking
logarithms or square root are the most commonly used transformations) may be applied to the
original data to make it more symmetric (and therefore more like a normal distribution). The
transformed data is then analysed as if it came from a normal distribution.
We also know that the powerful result known as the Central Limit Theorem guarantees that
the sampling distribution of the mean tends to a Normal Distribution as the sample size
increases.
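A quick simulation can make the Central Limit Theorem concrete. The sketch below assumes a strongly skewed exponential population with mean 1 and standard deviation 1 (an illustrative choice, not taken from the slides) and shows the sample means becoming less spread out and less skewed as n grows.

```python
import random
from statistics import mean, stdev

def sample_means(n, reps=5000, seed=2):
    """Means of `reps` samples of size n from an exponential population (mean 1, sd 1)."""
    rng = random.Random(seed)
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

def skewness(values):
    """Simple moment-based skewness; 0 for a perfectly symmetric distribution."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

for n in (2, 10, 50):
    means = sample_means(n)
    # The spread should shrink roughly like 1/sqrt(n) and the skewness should fall towards 0.
    print(n, round(stdev(means), 3), round(skewness(means), 2))
```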
Estimating µ with σ known
Hence the confidence interval to estimate the population mean given that the
population standard deviation is known is as follows:
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
Assumptions of the population:
1. The population is normally distributed; or
2. If the population is not normal, then the sample size must be large (n ≥ 30)
3. The population standard deviation σ is known.
This formula can also be used if σ is unknown and the sample size n is large,
i.e., n ≥ 30. If n ≥ 30, then s is a good approximation for σ.
Example
Construct a 95% confidence interval for the average online time for all users of
this particular ISP.
Example
So what do we know?
• 95% confidence interval: 100 (1 – α) = 95, so α = 0.05 and $z_{\alpha/2} = z_{0.025}$ = 1.96.
• An ISP conducted a survey of 250 customers: so n = 250.
• Average amount of time spent online was 10.5 hours per week: so x̄ = 10.5.
• Population standard deviation is 5.2 hours: so σ = 5.2.
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 10.5 \pm 1.96 \times \frac{5.2}{\sqrt{250}} = 10.5 \pm 0.645$$
$$9.855 \le \mu \le 11.145$$
Thus we can say that we are 95% confident that the average online time for all users of
this particular ISP is between 9.855 and 11.145 hours per week.
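The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch: the function name `z_interval` is made up for illustration, and the critical value comes from the standard library's `NormalDist` rather than a printed table, so it is 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """Confidence interval for a mean when the population sd (sigma) is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    margin = z * sigma / sqrt(n)                   # critical value x standard error
    return xbar - margin, xbar + margin

lcl, ucl = z_interval(xbar=10.5, sigma=5.2, n=250)
print(round(lcl, 3), round(ucl, 3))
```

The two returned values are the lower and upper confidence limits, and they agree with the worked figures to three decimal places.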
Confidence Intervals
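The left hand end of a confidence interval (CI) is called the lower confidence
limit (LCL).
The right hand end is called the upper confidence limit (UCL).
$$\mathrm{LCL} = \bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
$$\mathrm{UCL} = \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$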
Estimating µ with σ unknown
Just as the mean μ of the population is usually unknown, the population standard
deviation σ is also usually unknown. Then we estimate σ by s, which is the sample
standard deviation.
In this case, the standard error of x̄ will have to be estimated as $s/\sqrt{n}$.
Then for small sample sizes, our CI will look like
$$\bar{x} \pm\ ?\ \frac{s}{\sqrt{n}}$$
But now “?” can no longer be taken from a standard normal distribution.
Student t-distributions
The t-distributions were discovered by William S. Gosset in 1908.
Gosset was a statistician employed by the Guinness brewing company, which had stipulated that he
not publish under his own name. He therefore wrote under the pen name “Student”. The Student t-
distributions were named after him; and these distributions arise in the following situation.
Suppose we have a simple random sample of size n drawn from a normal population with mean μ
and standard deviation σ.
Let x̄ denote the sample mean, and s the sample standard deviation.
Then the quantity
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
has a t-distribution with n – 1 degrees of freedom (d.f.).
A t-distribution variable is denoted by “t”, and the d.f. is specified as a subscript. For example, $t_8$.
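To see why the standard normal critical value no longer works for small samples, the following sketch simulates the t quantity for samples of size 9 from a normal population and counts how often it falls beyond ±1.96. The population (mean 0, standard deviation 1), the number of repetitions and the seed are assumptions made for the illustration.

```python
import random
from statistics import mean, stdev

def tail_fraction(n=9, cutoff=1.96, reps=10000, seed=3):
    """Fraction of simulated t statistics with |t| > cutoff."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    extreme = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        t = (mean(sample) - mu) / (stdev(sample) / n ** 0.5)
        if abs(t) > cutoff:
            extreme += 1
    return extreme / reps

# For a standard normal variable this fraction would be about 0.05.
# With n = 9 (8 d.f.) it comes out noticeably larger, because the t-distribution has heavier tails.
print(tail_fraction())
```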
Properties of the t-distribution
• The graph of a t-distribution extends indefinitely to the left and right, and is
“mound-shaped”.
• The high point of the t-distribution occurs at its mean, which is always equal to
zero.
• When the d.f. increase, the t-distribution gets closer to a standard normal
distribution.
t-distribution table
Example
Find $t_{0.05,\,8}$.
Example
Construct a 90% confidence interval for the true average fibre content for this
breakfast cereal.
Example
So what do we know?
• 90% confidence interval: 100 (1 – α) = 90, so α = 0.10.
• A random sample of 9 packets: so n = 9.
• Average fibre content of 3.6 grams: so x̄ = 3.6.
• A random sample reveals … a standard deviation of 0.9 grams: so s = 0.9.
Since σ is unknown, $t_{\alpha/2}$ is the critical value. From the t-table, with n – 1 = 8 degrees of freedom, $t_{0.05,\,8}$ = 1.860.
$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 3.6 \pm 1.860 \times \frac{0.9}{\sqrt{9}} = 3.6 \pm 0.56$$
$$3.04 \le \mu \le 4.16$$
Thus, we can say that a 90% confidence interval for the true average fibre content for this
breakfast cereal is between 3.04 and 4.16 grams.
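The same calculation can be written as a short function. In this sketch the critical value is passed in by hand, exactly as it would be read from the t-table, because the Python standard library has no t-distribution quantile function. The name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean when sigma is unknown.

    t_crit is the table value t_{alpha/2} with n - 1 degrees of freedom.
    """
    margin = t_crit * s / sqrt(n)   # critical value x estimated standard error
    return xbar - margin, xbar + margin

# Fibre example: n = 9, x-bar = 3.6, s = 0.9, t_{0.05, 8} = 1.860 from the table.
lcl, ucl = t_interval(xbar=3.6, s=0.9, n=9, t_crit=1.860)
print(round(lcl, 2), round(ucl, 2))
```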
Estimating p
When we are dealing with a categorical variable (e.g. gender, preference, etc.), the parameter we
are most interested in is the proportion, p. Consider a population, of which only some individuals
possess the desired characteristic. We may be interested in estimating the proportion possessing
that characteristic.
If we take a sample of size n and count the number which have the characteristic
("number of successes"), x, then x will have a binomial distribution with parameters n and p
(approximately so when sampling without replacement from a population that is large relative to n).
If n is sufficiently large (np and nq both greater than 5, where q = 1 – p), the binomial
distribution of x can be approximated by a normal distribution.
The same principles used for the confidence interval for the mean are used for the confidence
interval of the population proportion. Here we want to obtain a plausible range of values for the
population proportion, p. Keep in mind that p must lie between 0 and 1. However, when
we use the sample proportion in constructing confidence intervals for p, the resulting
interval can contain values outside the range 0 to 1.
Estimating p – the population proportion
When we have a large sample, the critical value is $z_{\alpha/2}$.
Hence the confidence interval to estimate the population proportion given that the
sample is large is as follows:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
Example
A survey was taken of women in a major city to determine what factor was the
most important in deciding where to shop. The results appear below.
Factor Percentage
Price and Value 40%
Quality and Selection of Merchandise 30%
Service 15%
Shopping Environment 15%
If the sample size was 1200, estimate with 95% confidence the proportion of
women who identified "price and value" as the most important factor.
Example
So, what do we know?
• 95% confidence interval: so α = 0.05.
• The sample size was 1200: so n = 1200.
• Price and Value most important: so p̂ = 0.4 and q̂ = 1 – p̂ = 0.6.
Firstly, determine the critical z-value: $z_{0.025}$ = 1.96.
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.4 \pm 1.96\sqrt{\frac{0.4 \times 0.6}{1200}} = 0.4 \pm 0.028$$
$$0.372 \le p \le 0.428$$
Thus we can say that a 95% confidence interval for the true proportion of women who identified
"price and value" as the most important factor is between 37.2% and 42.8%.
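As a check on the survey example, here is a minimal sketch of the large-sample interval for a proportion. The function name `proportion_interval` is an assumption for illustration, and the size check uses the "np and nq both greater than 5" rule stated earlier.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, conf=0.95):
    """Large-sample confidence interval for a population proportion."""
    q_hat = 1 - p_hat
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

lcl, ucl = proportion_interval(p_hat=0.4, n=1200)
print(round(lcl, 3), round(ucl, 3))
```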
Estimating the difference between Two Population Means and Two Population Proportions
So far, all discussion of confidence intervals has centred on samples drawn from one
population. However, it is often important to compare two populations to see
whether there is a significant difference between the two.
Some examples:
Does the new safety program reduce accidents?
Are boys any better or worse than girls at Mathematics?
Is the new process better than the old one?
To conduct inference about two population parameters, we must first determine the
sampling distribution of the difference between the two corresponding sample statistics.
Independent v Dependent Samples
Similarly to inferences about one population, we base our conclusions on samples
taken from each population. These samples can be either independent or dependent.
Two samples are independent if the sample values selected from one population are
not related to the sample values selected from the other population.
Difference Between Two Independent Population Means
In this section we will consider methods for using sample data from two independent
samples to construct confidence interval estimates of the difference between two
population means.
We are examining the difference between two population means (µ1 – µ2)
by examining the difference in the sample means (x̄1 – x̄2).
We will look at two cases:
1. σ1 and σ2 are known
2. σ1 and σ2 are unknown but assumed equal.
In both cases, we will assume that the populations are normally distributed.
Difference Between Two Independent Population Means – σ1 and σ2 are known
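The standard error of the difference in sample means is
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Hence the 100(1 – α)% confidence interval to estimate the difference between
population means is as follows:
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$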
Example
Two factories, located 100 kilometres apart, were measured for the amount
of time lost due to accidents. A sample of 45 days from the first factory had an
average time lost of 81 minutes. The sample from the other factory comprised 36
days, and had an average time lost of 76 minutes.
The population standard deviations of the time lost of the two factories are known
to be σ1 = 5.2 and σ2 = 3.4.
Construct a 95% confidence interval for the difference in the average time lost
between the two factories.
Example
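Let Population 1 = Factory 1,
and Population 2 = Factory 2.
What do we know?
                 Factory 1   Factory 2
Sample size n    45          36
Sample mean x̄    81          76
σ                5.2         3.4
α = 0.05, so $z_{\alpha/2} = z_{0.025}$ = 1.96.
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (81 - 76) \pm z_{0.025}\sqrt{\frac{5.2^2}{45} + \frac{3.4^2}{36}} = 5 \pm 1.96 \times 0.960 = 5 \pm 1.88$$
That is, we are 95% confident that the difference in the average time lost between the
two factories lies between 3.12 and 6.88 minutes.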
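The factory calculation can be sketched as follows. The function name `two_mean_z_interval` is hypothetical, and the critical value is computed with `NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_z_interval(xbar1, xbar2, sigma1, sigma2, n1, n2, conf=0.95):
    """CI for mu1 - mu2 from independent samples with known population sds."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)   # standard error of x1-bar minus x2-bar
    diff = xbar1 - xbar2
    return diff - z * se, diff + z * se

lo, hi = two_mean_z_interval(81, 76, sigma1=5.2, sigma2=3.4, n1=45, n2=36)
print(round(lo, 2), round(hi, 2))
```

Because the whole interval lies above zero, the data suggest that Factory 1 loses more time on average than Factory 2.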
Difference Between Two Independent Population Means –
σ1 and σ2 are unknown but assumed equal
If the population variances are unknown, we will consider only the situation in which the
population variances are assumed to be equal.
In nearly all practical situations, the common variance σ² is unknown, and must be estimated using the
variances of two independent random samples selected from the populations.
The estimate used is called the pooled sample variance, denoted by $s_p^2$.
It is calculated as follows:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Here $s_1^2$ and $s_2^2$ are the sample variances for each sample.
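The pooled variance is a weighted average of the two sample variances, with weights given by the degrees of freedom. The sketch below uses made-up sample standard deviations (3.0 and 4.0) purely for illustration; they do not come from any example in these notes.

```python
def pooled_variance(n1, s1, n2, s2):
    """Pooled sample variance from two sample sizes and sample standard deviations."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

# Hypothetical inputs: s1 = 3.0 and s2 = 4.0 are illustrative values only.
sp2 = pooled_variance(n1=20, s1=3.0, n2=25, s2=4.0)
print(round(sp2, 3))   # lies between 3.0**2 = 9 and 4.0**2 = 16
```

The result always lies between the two sample variances and sits closer to the variance of the larger sample.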
Difference Between Two Independent Population Means –
σ1 and σ2 are unknown but assumed equal
If σ1 and σ2 are unknown, the critical value is $t_{\alpha/2}$, and the standard error is:
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
The degrees of freedom for the critical value t is d.f. = (n1 – 1) + (n2 – 1) = n1 + n2 – 2.
Assuming that both populations follow a normal distribution, the confidence interval to
estimate the difference between population means, given that both the population standard
deviations are unknown and assumed equal, is as follows:
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Example
Two plastics produced by different processes were subjected to tests in order to determine
their breaking strengths.
Construct a 95% confidence interval for the difference in their mean breaking strengths.
You may also assume that the populations are normal, and their variances are equal.
Example
Let Population 1 = Plastic X,
and Population 2 = Plastic Y.
What do we know?
n1 = 20, n2 = 25, α = 0.05, and $t_{0.025,\,43}$ = 2.018.
Hence the confidence interval, with pooled standard deviation $s_p$ = 3.647, is:
$$(28.3 - 26.7) \pm 2.018 \times 3.647 \times \sqrt{\frac{1}{20} + \frac{1}{25}} = 1.6 \pm 2.207$$
That is, we are 95% confident that the true difference in the mean breaking strengths
lies between – 0.607 and 3.807 pounds.
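The interval above can be reproduced from the summary figures. This sketch assumes that 3.647 in the worked solution is the pooled standard deviation, and it takes the critical value 2.018 as printed rather than recomputing it. The function name is hypothetical.

```python
from math import sqrt

def pooled_t_interval(xbar1, xbar2, sp, n1, n2, t_crit):
    """CI for mu1 - mu2 with unknown but equal variances.

    sp is the pooled standard deviation; t_crit is the table value
    t_{alpha/2} with n1 + n2 - 2 degrees of freedom.
    """
    se = sp * sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - t_crit * se, diff + t_crit * se

lo, hi = pooled_t_interval(28.3, 26.7, sp=3.647, n1=20, n2=25, t_crit=2.018)
print(round(lo, 3), round(hi, 3))
```

The limits agree with the worked figures to within rounding in the third decimal place. Since the interval contains zero, the data do not show a clear difference between the two plastics at this confidence level.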
Difference Between Two Dependent Population Means
In situations where the samples are dependent, we take the differences between each pair of
observations, and then use these “differences” to construct the confidence interval as if it were
one sample.
Let $(x_i, y_i)$ be the pair of values for the i-th individual in the data set, and let $d_i = x_i - y_i$ be the difference for that pair.
The sample mean of the differences is d̄, while the standard deviation of the differences is $s_d$.
We assume that the population of differences is normally distributed, and the sample of
differences represents a random sample from the population of differences.
Difference Between Two Dependent Population Means
If the population standard deviation is unknown, then the critical value is $t_{\alpha/2}$, while the
standard error of the sample mean difference is $s_d/\sqrt{n}$.
Hence the confidence interval to estimate the population mean difference, given that the population
standard deviation is unknown, is as follows:
$$\bar{d} \pm t_{\alpha/2}\,\frac{s_d}{\sqrt{n}}$$
where $s_d$ is the sample standard deviation of the differences, and
$t_{\alpha/2}$ is the critical value of the t-distribution with (n – 1) degrees of freedom.
Example
Eleven students were randomly selected from a population of 1000 students. The sampling
method was simple random sampling.
All of the students were given a standardised English test and a standardised maths test. Test
results are summarized below.
Test Score
English 85 87 85 85 68 81 84 71 46 75 80
Mathematics 83 83 83 82 65 79 83 60 47 77 83
Find the 90% confidence interval for the mean difference between student scores on the
maths and English tests.
Example
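Taking each difference as d = English score – Mathematics score:
d̄ = 2, $s_d$ = 3.715, n = 11, and $t_{\alpha/2} = t_{0.05,\,10}$ = 1.812.
$$\bar{d} \pm t_{\alpha/2}\,\frac{s_d}{\sqrt{n}} = 2 \pm 1.812 \times \frac{3.715}{\sqrt{11}} = 2 \pm 2.03$$
The 90% confidence interval for the mean difference between student scores on the math
and English tests is from – 0.03 to 4.03.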
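The paired calculation can be carried out directly from the raw scores. In this sketch each difference is taken as English minus Mathematics, which is the direction that gives a mean difference of 2, and the critical value 1.812 is entered from the t-table.

```python
from math import sqrt
from statistics import mean, stdev

english = [85, 87, 85, 85, 68, 81, 84, 71, 46, 75, 80]
maths = [83, 83, 83, 82, 65, 79, 83, 60, 47, 77, 83]

# One difference per student: the two samples are dependent (paired).
diffs = [e - m for e, m in zip(english, maths)]
d_bar = mean(diffs)      # sample mean of the differences
s_d = stdev(diffs)       # sample standard deviation of the differences

t_crit = 1.812           # t_{0.05, 10} from the t-table
margin = t_crit * s_d / sqrt(len(diffs))
lcl, ucl = d_bar - margin, d_bar + margin
print(d_bar, round(s_d, 3), round(lcl, 2), round(ucl, 2))
```

Treating the two columns as independent samples would ignore the pairing and give a much wider interval, which is why the differences are analysed as a single sample.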
Difference Between Two Population Proportions
Using the same ideas for proportions as we used for two means:
Since the population proportions are unknown, we use the sample proportions to
find the standard error:
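$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
For large samples, by the Central Limit Theorem, the critical value is $z_{\alpha/2}$ for
confidence intervals. Large means that
$$n_1\hat{p}_1,\ n_1\hat{q}_1,\ n_2\hat{p}_2,\ n_2\hat{q}_2 \ge 5.$$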
Difference Between Two Population Proportions
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
Example
A random sample of 400 male university students shows that 160 of them catch
a train to university.
Using this data, construct a 99% confidence interval to estimate the difference in
population proportions of male and female students who take a train to
university.
Example
Let Population 1 = Male, and
Population 2 = Female.
What do we know?
              Male (1)   Female (2)
n             400        400
x             160        120
p̂ = x / n     0.4        0.3
α = 0.01, so $z_{\alpha/2} = z_{0.005}$ = 2.576.
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} = (0.4 - 0.3) \pm 2.576\sqrt{\frac{0.4 \times 0.6}{400} + \frac{0.3 \times 0.7}{400}} = 0.10 \pm 0.086$$
That is, we are 99% confident that the true difference in population proportions of male and female
students who take a train to university is between 0.014 and 0.186, or between 1.4% and 18.6%.
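A short sketch of the same calculation, starting from the counts (160 of 400 males, 120 of 400 females) used in the worked solution. The function name `two_proportion_interval` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, conf=0.99):
    """Large-sample CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

lo, hi = two_proportion_interval(x1=160, n1=400, x2=120, n2=400)
print(round(lo, 3), round(hi, 3))
```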
Sample Size Determination
The sampling procedure, together with the sample size, controls the total amount of
relevant information in a sample. At this point in our study, we are concerned with the
simplest sampling situation – in other words, random sampling from a relatively large
population – and devote our attention to the selection of the sample size n.
We need to recognise that the confidence intervals we have covered are basically the
sample statistic ± an error value. If this error value (known as the maximum error or B)
is known, we can then calculate the sample size needed. (The “B” stands for “error
bound” here.)
Sample Size Determination
The sample size determination formulae we are about to give you come from the
formulae for the maximum desired error of the estimates. Basically, they come from
the corresponding confidence interval formulae! The formula is then solved for n.
Be sure to round the answer obtained UP to the next whole number, not off to the
nearest whole number. If you round off, then you will exceed your maximum error
of the estimate in some cases.
By rounding up, you will have a smaller maximum error of the estimate than allowed,
but this is better than having a larger one than desired.
Sample Size Determination – Estimating µ
When estimating the population mean, we can use the following formula to
determine the sample size, assuming we know σ (the population standard
deviation).
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2$$
If the population standard deviation is unknown, then we have to estimate σ so we
can determine the sample size. In such cases, it is acceptable to use the following
estimate.
$$\sigma \approx \frac{1}{4} \times \text{range}$$
Example
A fast food company wants to determine the average number of times that fast
food users visit fast food restaurants per week.
They have decided that their estimate needs to be accurate to within one-tenth
of a visit, and they want to be 95% sure that their estimate does not differ from
the true number of visits by more than one-tenth of a visit.
Previous research has shown that the standard deviation is 0.7 visits. What is the
required sample size?
Example
What do we know?
… that their estimate does not differ from the true number of visits by
more than one-tenth of a visit… so B = 0.1.
Previous research has shown that the standard deviation is 0.7 visits…
so σ = 0.7.
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2 = \left(\frac{1.96 \times 0.7}{0.1}\right)^2 = 188.2384$$
Hence a sample size of 189 or more is needed.
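The formula and the round-up rule can be combined in one small function. This is a sketch with a hypothetical name, `sample_size_mean`; it uses `math.ceil` so that the result is always rounded up, never to the nearest whole number.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, bound, conf=0.95):
    """Smallest n for which the margin of error for a mean is at most `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sigma / bound) ** 2)   # always round UP

print(sample_size_mean(sigma=0.7, bound=0.1))
```

For the fast food example this gives 189, matching the hand calculation. Halving the bound B roughly quadruples the required sample size, because B is squared in the denominator.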
Sample Size Determination – Estimating p
The formula for the sample size here is obtained by solving for n the maximum
error of the estimate formula for the population proportion. Again, we get this
from the corresponding confidence interval! Note that p is taken from a previous
study, if one is available.
$$n = \left(\frac{z_{\alpha/2}}{B}\right)^2 pq$$
If there is no previous study or estimate available, then use 0.5 for p and q, as these
are the values which will give the largest sample size. It is better to have too large
a sample size and come in under the maximum error of the estimate. In this case, the
formula simplifies to
$$n = \frac{1}{4}\left(\frac{z_{\alpha/2}}{B}\right)^2$$
Example
Secondary data (that is several years old) indicates that 22% of the
population is retired.
They are willing to accept an error rate of 5%, and they want to be 95%
certain that their finding does not differ from the true rate by more
than 5%.
Example
What do we know?
Secondary data (that is several years old) indicates that 22% of the population is
retired… so p = 0.22.
… that their finding does not differ from the true rate by more than 5%... so B = 0.05.
$$n = \left(\frac{z_{\alpha/2}}{B}\right)^2 pq = \left(\frac{1.96}{0.05}\right)^2 (0.22)(0.78) = 263.687$$
Hence a sample size of 264 or more is needed.
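The same idea for a proportion is sketched below. The function name `sample_size_proportion` is hypothetical, and the default p = 0.5 implements the conservative choice used when no prior estimate is available.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(bound, conf=0.95, p=0.5):
    """Smallest n for which the margin of error for a proportion is at most `bound`.

    p is a prior estimate of the proportion; 0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z / bound) ** 2 * p * (1 - p))

print(sample_size_proportion(bound=0.05, p=0.22))   # with the prior estimate of 22%
print(sample_size_proportion(bound=0.05))           # conservative choice, p = q = 0.5
```

With the prior estimate the answer is 264, as in the example. Without it the requirement rises to 385, which shows how much a reasonable prior estimate can save.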
Sample Size Determination – Estimating µ1 – µ2
When dealing with two populations we use the same method as with one population. Basically
we rearrange the error part of the confidence intervals to solve for n. Since we have n1 and n2
we add the condition that n1 = n2
$$n_1 = n_2 = \left(\frac{z_{\alpha/2}}{B}\right)^2\left(\sigma_1^2 + \sigma_2^2\right)$$
If the population standard deviations are unknown, then we have to estimate them so we can
determine the sample size. In such cases, it is acceptable to use the following estimate for
both populations.
$$\sigma \approx \frac{1}{4} \times \text{range}$$
Example
Previous research has shown that the standard deviation is $3 for both males and
females.
Example
What do we know?
… that their estimate does not differ from the true value by more than $1 … so B = 1.
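Previous research has shown that the standard deviation is $3 for both males and females … so
σ1 = σ2 = 3.
$$n_1 = n_2 = \left(\frac{z_{\alpha/2}}{B}\right)^2\left(\sigma_1^2 + \sigma_2^2\right) = \left(\frac{1.96}{1}\right)^2\left(3^2 + 3^2\right) = 69.149$$
Rounding up, a sample size of 70 or more is needed from each group.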
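A sketch of the two-sample version, assuming equal sample sizes from each population as stated above. The function name `sample_size_two_means` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_means(sigma1, sigma2, bound, conf=0.95):
    """Equal per-group n so the margin of error for mu1 - mu2 is at most `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z / bound) ** 2 * (sigma1 ** 2 + sigma2 ** 2))

# Both standard deviations are $3 and the error bound is $1.
print(sample_size_two_means(sigma1=3, sigma2=3, bound=1))
```

The unrounded value is about 69.15, so rounding up gives 70 observations from each group.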
Sample Size Determination – Estimating p1 – p2
When dealing with two populations we use the same method as with one population. Basically
we rearrange the error part of the confidence intervals to solve for n. Since we have n1 and n2
we add the condition that n1 = n2
$$n_1 = n_2 = \left(\frac{z_{\alpha/2}}{B}\right)^2\left(\hat{p}_1\hat{q}_1 + \hat{p}_2\hat{q}_2\right)$$
If there is no previous study or estimate available, then use 0.5 for p and q, as these are
the values which will give the largest sample size.
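The formula for two proportions can be sketched in the same way. The function name and the default values of 0.5 are assumptions for illustration; the call below shows the conservative case only and is not meant to reproduce any worked example, which may rely on prior estimates of the two proportions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(bound, conf=0.95, p1=0.5, p2=0.5):
    """Equal per-group n so the margin of error for p1 - p2 is at most `bound`.

    p1 and p2 are prior estimates; 0.5 for each is the worst case.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z / bound) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)))

# Conservative case: 90% confidence, error bound of 5%, no prior estimates.
print(sample_size_two_proportions(bound=0.05, conf=0.90))
```

Supplying prior estimates away from 0.5 reduces the required sample size, since pq is largest at p = 0.5.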
Example
They are willing to accept an error rate of 5%, and they want to be 90% certain that
their finding does not differ from the true rate by more than 5%.
What is the required sample size assuming an equal sample size from each state?
Example
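$$n_1 = n_2 = \left(\frac{z_{\alpha/2}}{B}\right)^2\left(\hat{p}_1\hat{q}_1 + \hat{p}_2\hat{q}_2\right)$$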
Hence a sample size of 329 is needed from NSW and from Queensland.