Workbook 40: Sampling Distributions and Estimation

Contents
40.1 Sampling Distributions
40.2 Interval Estimation for the Variance

Learning outcomes
You will learn about the distributions which are created when a population is sampled. For
example, every sample will have a mean value; this gives rise to a distribution of mean
values. We shall look at the behaviour of this distribution. We shall also look at the
problem of estimating the true value of a population mean (for example) from a given
sample.

40.1 Sampling Distributions
Introduction
When you are dealing with large populations, for example populations created by manufacturing
processes, it is impossible, or very difficult indeed, to deal with the whole population and know
the parameters of that population. Items such as car components, electronic components, aircraft
components or ordinary everyday items such as light bulbs, cycle tyres and cutlery effectively form
infinite populations. Hence we have to deal with samples taken from a population and estimate
those population parameters that we need. This Workbook will show you how to calculate single-number
estimates of parameters (called point estimates) and interval estimates of parameters
(called interval estimates or confidence intervals). In the latter case you will be able to calculate a
range of values and state the confidence that the true value of the parameter you are estimating lies
in the range you have found.
Prerequisites
Before starting this Section you should . . .
• understand and be able to calculate means and variances
• be familiar with the results and concepts met in the study of probability
• be familiar with the normal distribution
Learning Outcomes
On completion you should be able to . . .
• understand what is meant by the terms sample and sampling distribution
• explain the importance of sampling in the application of statistics
• explain the terms point estimate and interval estimate
• calculate point estimates of means and variances
• find interval estimates of population parameters for given levels of confidence
1. Sampling
Why sample?
Considering samples from a distribution enables us to obtain information about a population where
we cannot, for reasons of practicality, economy, or both, inspect the whole of the population. For
example, it is impossible to check the complete output of some manufacturing processes. Items such
as electric light bulbs, nuts, bolts, springs and light emitting diodes (LEDs) are produced in their
millions and the sheer cost of checking every item as well as the time implications of such a checking
process render it impossible. In addition, testing is sometimes destructive - one would not wish to
destroy the whole production of a given component!
Populations and samples
If we choose $n$ items from a population, we say that the size of the sample is $n$. If we take many
samples, the means of these samples will themselves have a distribution which may be different from
the population from which the samples were chosen. Much of the practical application of sampling
theory is based on the relationship between the parent population from which samples are drawn
and the summary statistics (mean and variance) of the offspring population of sample means. Not
surprisingly, in the case of a normal parent population, the distribution of the population and the
distribution of the sample means are closely related. What is surprising is that even in the case of a
non-normal parent population, the offspring population of sample means is usually (but not always)
normally distributed provided that the samples taken are large enough. In practice the term large
is usually taken to mean about 30 or more. The behaviour of the distribution of sample means is
based on the following result from mathematical statistics.
The central limit theorem
In what follows, we shall assume that the members of a sample are chosen at random from a
population. This implies that the members of the sample are independent. We have already met the
Central Limit Theorem. Here we will consider it in more detail and illustrate some of the properties
resulting from it.
Much of the theory (and hence the practice) of sampling is based on the Central Limit Theorem.
While we will not be looking at the proof of the theorem (it will be illustrated where practical) it is
necessary that we understand what the theorem says and what it enables us to do. Essentially, the
Central Limit Theorem says that if we take large samples of size $n$ with mean $\bar{X}$ from a population
which has a mean $\mu$ and standard deviation $\sigma$, then the distribution of sample means $\bar{X}$ is normally
distributed with mean $\mu$ and standard deviation $\dfrac{\sigma}{\sqrt{n}}$.

That is, the sampling distribution of the mean $\bar{X}$ follows the distribution

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Strictly speaking we require $\sigma^2 < \infty$, and it is important to note that no claim is made about the
way in which the original distribution behaves, and it need not be normal. This is why the Central
Limit Theorem is so fundamental to statistical practice. One implication is that a random variable
which takes the form of a sum of many components which are random but not necessarily normal
will itself be normal provided that the sum is not dominated by a small number of components. This
explains why many biological variables, such as human heights, are normally distributed.
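As a quick numerical check of this statement, the short simulation below (a sketch, not part of the original Workbook; it assumes Python with NumPy is available) draws many samples of size $n = 30$ from a markedly non-normal parent distribution and confirms that the sample means cluster around $\mu$ with standard deviation close to $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Non-normal parent population: exponential with mean 2 (its standard deviation is also 2)
mu, sigma, n, n_samples = 2.0, 2.0, 30, 100_000

# Draw many samples of size n and record the mean of each sample
samples = rng.exponential(scale=mu, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("mean of the sample means:", sample_means.mean())   # close to mu = 2
print("sd of the sample means  :", sample_means.std())    # close to sigma/sqrt(n)
print("sigma/sqrt(n)           :", sigma / np.sqrt(n))    # 0.3651...
```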
In the case where the original distribution is normal, the relationship between the original distribution
$X \sim N(\mu, \sigma^2)$ and the distribution of sample means $\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$ is shown below.
[Figure 1: the distributions of $X$ and $\bar{X}$]

Figure 1

The distributions of $X$ and $\bar{X}$ have the same mean $\mu$ but $\bar{X}$ has the smaller standard deviation $\dfrac{\sigma}{\sqrt{n}}$.
The theorem says that we must take large samples. If we take small samples, the theorem only
holds if the original population is normally distributed.
Standard error of the mean
You will meet this term often if you read statistical texts. It is the name given to the standard
deviation of the population of sample means. The name stems from the fact that there is some
uncertainty in the process of predicting the original population mean from the mean of a sample or
samples.
Key Point 1
For a sample of $n$ independent observations from a population with variance $\sigma^2$, the standard error
of the mean is

$$\sigma_n = \frac{\sigma}{\sqrt{n}}$$
Remember that this quantity is simply the standard deviation of the distribution of sample means.
Finite populations
When we sample without replacement from a population which is not infinitely large, the observations
are not independent. This means that we need to make an adjustment to the standard error of the
mean. In this case the standard error of the sample mean is given by the related but more complicated
formula

$$\sigma_{n,N} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$

where $\sigma_{n,N}$ is the standard error of the sample mean, $N$ is the population size and $n$ is the sample
size.
Note that, in cases where the size of the population $N$ is large in comparison to the sample size $n$,
the quantity

$$\sqrt{\frac{N-n}{N-1}} \approx 1$$

so that the standard error of the mean is approximately $\sigma/\sqrt{n}$.
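A minimal sketch of the two standard-error formulae follows (the function name and interface are our own, not part of the Workbook; only Python's standard library is assumed).

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the sample mean.

    sigma -- population standard deviation
    n     -- sample size
    N     -- population size; if supplied, apply the finite-population
             correction sqrt((N - n) / (N - 1)) for sampling without
             replacement, otherwise treat the population as infinite.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# For large N the correction factor is close to 1 and has little effect
print(standard_error(1.0, 10))          # 0.3162..., infinite population
print(standard_error(1.0, 10, N=100))   # 0.3015..., correction factor 0.9535
print(standard_error(1.0, 10, N=1000))  # 0.3148..., correction factor 0.9955
```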
Illustration - a distribution of sample means
It is possible to illustrate some of the above results by setting up a small population of numbers
and looking at the properties of small samples drawn from it. Notice that setting up a small
population, say of size 5, and taking samples of size 2 enables us to deal with the totality of samples:
there are

$$\binom{5}{2} = \frac{5!}{2!\,3!} = 10$$

distinct samples possible, whereas if we take a population of 100 and
draw samples of size 10, there are

$$\binom{100}{10} = \frac{100!}{10!\,90!} = 17{,}310{,}309{,}456{,}440 \approx 1.7 \times 10^{13}$$

possible distinct samples and, from a practical point of view, we could not possibly list them all, let alone work with them!
Suppose we take a population consisting of the five numbers 1, 2, 3, 4 and 5 and draw samples of
size 2 to work with. The complete set of possible samples is:
(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)
For the parent population, since we know that the mean $\mu = 3$, we can calculate the standard
deviation as

$$\sigma = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}} = \sqrt{\frac{10}{5}} = 1.4142$$
For the population of sample means,
1.5, 2, 2.5, 3, 2.5, 3, 3.5, 3.5, 4, 4.5
their mean and standard deviation are given by the calculations:

$$\frac{1.5 + 2 + 2.5 + 3 + 2.5 + 3 + 3.5 + 3.5 + 4 + 4.5}{10} = 3$$

and

$$\sqrt{\frac{(1.5-3)^2 + (2-3)^2 + \cdots + (4-3)^2 + (4.5-3)^2}{10}} = \sqrt{\frac{7.5}{10}} = 0.8660$$
We can immediately conclude that the mean of the population of sample means is the same as the
population mean $\mu$.
Using the results given above, the value of $\sigma_{n,N}$ should be given by the formula

$$\sigma_{n,N} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$

with $\sigma = 1.4142$, $N = 5$ and $n = 2$. Using these numbers gives:

$$\sigma_{2,5} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \frac{1.4142}{\sqrt{2}}\sqrt{\frac{5-2}{5-1}} = \sqrt{\frac{3}{4}} = 0.8660 \text{ as predicted.}$$
Note that in this case the correction factor

$$\sqrt{\frac{N-n}{N-1}} \approx 0.8660$$

and is significant. If we take samples of size 10 from a population of 100, the factor becomes

$$\sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{90}{99}} \approx 0.9535$$

and for samples of size 10 taken from a population of 1000, the factor becomes

$$\sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{990}{999}} \approx 0.9955$$

Thus as $\sqrt{\dfrac{N-n}{N-1}} \to 1$, its effect on the value of $\dfrac{\sigma}{\sqrt{n}}$ reduces to insignificance.
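The whole illustration can be reproduced with a few lines of code; the sketch below (our own, using only Python's standard library) enumerates all $\binom{5}{2} = 10$ samples and checks the values 3 and 0.8660 found above.

```python
from itertools import combinations
import math

population = [1, 2, 3, 4, 5]
N, n = len(population), 2

# Parent population mean and standard deviation (divide by N: this is the whole population)
mu = sum(population) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)   # 1.4142...

# Means of all C(5, 2) = 10 possible samples of size 2
sample_means = [sum(s) / n for s in combinations(population, n)]

mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = math.sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means))

print(mean_of_means)   # 3.0, equal to the population mean mu
print(sd_of_means)     # 0.8660...
print(sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1)))   # 0.8660..., as predicted
```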
Task
Two-centimetre number 10 woodscrews are manufactured in their millions but
packed in boxes of 200 to be sold to the public or trade. If the length of the
screws is known to be normally distributed with a mean of 2 cm and variance
0.05 cm$^2$, find the mean and standard deviation of the sample mean of 200 boxed
screws. What is the probability that the sample mean length of the screws in a
box of 200 is greater than 2.02 cm?
Your solution
Answer
Since the population is very large indeed, we are effectively sampling from an infinite population.
The mean and standard deviation are given by

$$\mu = 2 \text{ cm} \qquad\text{and}\qquad \sigma_{200} = \sqrt{\frac{0.05}{200}} = 0.016 \text{ cm}$$

Since the parent population is normally distributed, the means of samples of 200 will be normally
distributed as well. Hence

$$P(\text{sample mean length} > 2.02) = P\left(z > \frac{2.02 - 2}{0.016}\right) = P(z > 1.25) = 0.5 - 0.3944 = 0.1056$$
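The probability in this Task can also be checked numerically; the sketch below assumes SciPy is available and is not part of the Workbook. Note that the solution above rounds the standard error to 0.016 cm, which gives $z = 1.25$ and a probability of 0.1056; keeping the unrounded value 0.0158 cm gives approximately 0.103.

```python
from math import sqrt
from scipy.stats import norm

mu, var, n = 2.0, 0.05, 200
se = sqrt(var / n)   # standard error of the sample mean, about 0.0158 cm

# P(sample mean > 2.02) when the sample mean is N(mu, se^2)
print("standard error:", se)
print("P(mean > 2.02):", norm.sf(2.02, loc=mu, scale=se))   # about 0.103
```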
2. Statistical estimation
When we are dealing with large populations (the production of items such as LEDs, light bulbs,
piston rings etc.) it is extremely unlikely that we will be able to calculate population parameters such
as the mean and variance directly from the full population.
We have to use processes which enable us to estimate these quantities. There are two basic methods,
called point estimation and interval estimation. The essential difference is that point estimation
gives single numbers which, in the sense defined below, are best estimates of population parameters,
while interval estimates give a range of values together with a figure, called the confidence, that the
true value of a parameter lies within the calculated range. Such ranges are usually called confidence
intervals.
Statistically, the word estimate implies a defined procedure for finding population parameters. In
statistics, the word estimate does not mean a guess, something which is rough-and-ready. What
the word does mean is that an agreed precise process has been (or will be) used to find the required
values and that these values are best values in some sense. Often this means that the procedure
used, which is called the estimator, is:

(a) consistent, in the sense that the difference between the true value and the estimate
approaches zero as the sample size used to do the calculation increases;

(b) unbiased, in the sense that the expected value of the estimator is equal to the true value;

(c) efficient, in the sense that the variance of the estimator is small.

Expectation is covered in Workbooks 37 and 38. You should note that it is not always possible to
find a best estimator. You might have to decide (for example) between one which is

consistent, biased and efficient

and one which is

consistent, unbiased and inefficient

when what you really want is one which is

consistent, unbiased and efficient.
Point estimation
We will look at the point estimation of the mean and variance of a population and use the following
notation.
Notation

              Population              Sample        Estimator
Size          $N$                     $n$
Mean          $\mu$ or $E(X)$         $\bar{x}$     $\hat{\mu}$ for $\mu$
Variance      $\sigma^2$ or $V(X)$    $s^2$         $\hat{\sigma}^2$ for $\sigma^2$
Estimating the mean
This is straightforward.
$$\hat{\mu} = \bar{x}$$

is a sensible estimate since the difference between the population mean and the sample mean
disappears with increasing sample size. We can show that this estimator is unbiased. Symbolically
we have:

$$\hat{\mu} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

so that

$$E(\hat{\mu}) = \frac{E(x_1) + E(x_2) + \cdots + E(x_n)}{n} = \frac{E(X) + E(X) + \cdots + E(X)}{n} = E(X) = \mu$$

Note that the expected value of $x_1$ is $E(X)$, i.e. $E(x_1) = E(X)$. Similarly for $x_2, \ldots, x_n$.
Estimating the variance
This is a little more difficult. The true variance of the population is

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

which suggests that the estimator, calculated from a sample, should be

$$\hat{\sigma}^2 = \frac{\sum (x - \mu)^2}{n}$$

However, we do not know the true value of $\mu$, but we do have the estimator $\hat{\mu} = \bar{x}$.
Replacing $\mu$ by the estimator $\hat{\mu} = \bar{x}$ gives

$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n}$$

This can be written in the form

$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - (\bar{x})^2$$
Hence

$$E(\hat{\sigma}^2) = \frac{E\left(\sum x^2\right)}{n} - E\{(\bar{X})^2\} = E(X^2) - E\{(\bar{X})^2\}$$

Since $E(X^2) = \sigma^2 + \mu^2$ and $E\{(\bar{X})^2\} = \dfrac{\sigma^2}{n} + \mu^2$, it follows that

$$E(\hat{\sigma}^2) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
This result is biased; for an unbiased estimator the result should be $\sigma^2$, not $\dfrac{n-1}{n}\,\sigma^2$.

Fortunately, the remedy is simple: we just multiply by the so-called Bessel's correction, namely
$\dfrac{n}{n-1}$, and obtain the result

$$s_{n-1}^2 = \frac{n}{n-1}\,\frac{\sum (x - \bar{x})^2}{n} = \frac{\sum (x - \bar{x})^2}{n-1}$$
There are two points to note here. Firstly (and rather obviously) you should not take samples of
size 1, since the variance cannot be estimated from such samples. Secondly, you should check the
operation of any hand calculators (and spreadsheets!) that you use to find out exactly what you are
calculating when you press the button for standard deviation. You might find that you are calculating
either

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad\text{or}\qquad s_{n-1}^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

It is just as well to know which, as the first formula assumes that you are calculating the variance of
a population while the second assumes that you are estimating the variance of a population from a
random sample of size $n$ taken from that population.

From now on we will assume that we divide by $n-1$ in the sample variance and we will simply write
$s^2$ for $s_{n-1}^2$.
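The effect of Bessel's correction is easy to see in a simulation; the sketch below (our own, assuming NumPy is available) averages both estimators over many samples drawn from a population with $\sigma^2 = 4$.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
sigma2, n, n_samples = 4.0, 5, 200_000

# Many samples of size n from a normal population with variance sigma^2 = 4
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_samples, n))

biased = samples.var(axis=1, ddof=0)     # divide by n
unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (Bessel's correction)

print("E(divide by n)    :", biased.mean())     # close to (n-1)/n * sigma^2 = 3.2
print("E(divide by n - 1):", unbiased.mean())   # close to sigma^2 = 4
```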
Interval estimation
We will look at the process of finding an interval estimate of the mean and variance of a population
and use the notation introduced above.
Interval estimation for the mean
This interval is commonly called the Confidence Interval for the Mean.
Firstly, we know that the sample mean $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$ is a good estimator of the
population mean $\mu$. We also know that the calculated mean $\bar{x}$ of a sample of size $n$ is unlikely to be
exactly equal to $\mu$. We will now construct an interval around $\bar{x}$ in such a way that we can quantify
the confidence that the interval actually contains the population mean $\mu$.

Secondly, we know that for sufficiently large samples taken from a large population, $\bar{x}$ follows a
normal distribution with mean $\mu$ and standard deviation $\dfrac{\sigma}{\sqrt{n}}$.
Thirdly, looking at the following extract from the normal probability tables,

$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$   0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
1.9                                            .4713   .4719   .4726   .4732   .4738   .4744   .4750   .4756   .4762   .4767

we can see that $2 \times 47.5\% = 95\%$ of the values in the standard normal distribution lie within 1.96
standard deviations either side of the mean.
So before we see the data we may say that

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

After we see the data we say with 95% confidence that

$$\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}$$

which leads to

$$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$$

This interval is called a 95% confidence interval for the mean $\mu$.
Note that while the 95% level is very commonly used, there is nothing sacrosanct about this level.
If we go through the same argument but demand that we need to be 99% certain that $\mu$ lies within
the confidence interval developed, we obtain the interval

$$\bar{x} - 2.58\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 2.58\frac{\sigma}{\sqrt{n}}$$

since an inspection of the standard normal tables reveals that 99% of the values in a standard normal
distribution lie within 2.58 standard deviations of the mean.
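The multipliers 1.96 and 2.58 can be recovered directly from the standard normal distribution; the check below (not part of the Workbook) assumes SciPy is available.

```python
from scipy.stats import norm

# Central 95% of a standard normal: 2.5% in each tail
print(norm.ppf(0.975))   # 1.9599... ~ 1.96

# Central 99%: 0.5% in each tail
print(norm.ppf(0.995))   # 2.5758... ~ 2.58
```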
The above argument assumes that we know the population variance. In practice this is often not the
case and we have to estimate the population variance from a sample. From the work we have seen
above, we know that the best estimate of the population variance from a sample of size $n$ is given
by the formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

It follows that if we do not know the population variance, we must use the estimate $s$ in place of $\sigma$.
Our 95% and 99% confidence intervals (for large samples) become

$$\bar{x} - 1.96\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{s}{\sqrt{n}}
\qquad\text{and}\qquad
\bar{x} - 2.58\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 2.58\frac{s}{\sqrt{n}}$$

where

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

When we do not know the population variance, we need to estimate it. Hence we need to gauge the
confidence we can have in the estimate.

In small samples, when we need to estimate the variance, the values 1.96 and 2.58 need to be replaced
by values from the Student's t-distribution. See Workbook 41.
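As a brief illustration of this last remark (a sketch assuming SciPy; the t-distribution itself is treated in Workbook 41, not here), the 97.5% point of the t-distribution is noticeably larger than 1.96 for small samples and approaches it as $n$ grows.

```python
from scipy.stats import norm, t

print("normal:", norm.ppf(0.975))   # 1.96
for n in (5, 10, 30, 100):
    # n - 1 degrees of freedom when the variance is estimated from the sample
    print(f"t, n = {n:3d}:", t.ppf(0.975, df=n - 1))
```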
Example 1
After 1000 hours of use the weight loss, in gm, due to wear in certain rollers in
machines, is normally distributed with mean $\mu$ and variance $\sigma^2$. Fifty independent
observations are taken. (This may be regarded as a large sample.) If observation
$i$ is $y_i$, then

$$\sum_{i=1}^{50} y_i = 497.2 \qquad\text{and}\qquad \sum_{i=1}^{50} y_i^2 = 5473.58$$

Estimate $\mu$ and $\sigma^2$ and give a 95% confidence interval for $\mu$.
Solution
We estimate $\mu$ using the sample mean:

$$\bar{y} = \frac{\sum y_i}{n} = \frac{497.2}{50} = 9.944 \text{ gm}$$

We estimate $\sigma^2$ using the sample variance:

$$s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2
= \frac{1}{n-1}\left\{\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right\}
= \frac{1}{49}\left\{5473.58 - \frac{1}{50}(497.2)^2\right\} = 10.8046 \text{ gm}^2$$

The estimated standard error of the mean is

$$\sqrt{\frac{s^2}{n}} = \sqrt{\frac{10.8046}{50}} = 0.4649 \text{ gm}$$

The 95% confidence interval for $\mu$ is $\bar{y} \pm 1.96\sqrt{\dfrac{s^2}{n}}$. That is, $9.03 < \mu < 10.86$.
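The arithmetic in Example 1 can be reproduced directly from the two sums; the sketch below is our own check in plain Python, not part of the original solution.

```python
from math import sqrt

n, sum_y, sum_y2 = 50, 497.2, 5473.58

y_bar = sum_y / n                          # 9.944
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)   # 10.8046...
se = sqrt(s2 / n)                          # 0.4649...

print(y_bar, s2, se)
print("95% CI:", (y_bar - 1.96 * se, y_bar + 1.96 * se))   # about (9.03, 10.86)
```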
Exercises
1. The voltages of sixty nominally 10 volt cells are measured. Assuming these to be independent
observations from a normal distribution with mean $\mu$ and variance $\sigma^2$, estimate $\mu$ and $\sigma^2$.
Regarding this as a large sample, find a 99% confidence interval for $\mu$. The data are:
10.3 10.5 9.6 9.7 10.6 9.9 10.1 10.1 9.9 10.5
10.1 10.1 9.9 9.8 10.6 10.0 9.9 10.0 10.3 10.1
10.1 10.3 10.5 9.7 10.1 9.7 9.8 10.3 10.2 10.2
10.1 10.5 10.0 10.0 10.6 10.9 10.1 10.1 9.8 10.7
10.3 10.4 10.4 10.3 10.4 9.9 9.9 10.5 10.0 10.7
10.1 10.6 10.0 10.7 9.8 10.4 10.3 10.0 10.5 10.1
2. The natural logarithms of the times in minutes taken to complete a certain task are normally
distributed with mean $\mu$ and variance $\sigma^2$. Seventy-five independent observations are taken.
(This may be regarded as a large sample.) If the natural logarithm of the time for observation
$i$ is $y_i$, then

$$\sum y_i = 147.75 \qquad\text{and}\qquad \sum y_i^2 = 292.8175$$

Estimate $\mu$ and $\sigma^2$ and give a 95% confidence interval for $\mu$.

Use your confidence interval to find a 95% confidence interval for the median time to complete
the task.
Answers
1. $\sum y_i = 611.0$, $\sum y_i^2 = 6227.34$ and $n = 60$. We estimate $\mu$ using the sample mean:

$$\bar{y} = \frac{\sum y_i}{n} = \frac{611.0}{60} = 10.1833 \text{ V}$$

We estimate $\sigma^2$ using the sample variance:

$$s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2
= \frac{1}{n-1}\left\{\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right\}
= \frac{1}{59}\left\{6227.34 - \frac{1}{60}(611.0)^2\right\} = 0.090226$$

The estimated standard error of the mean is

$$\sqrt{\frac{s^2}{n}} = \sqrt{\frac{0.090226}{60}} = 0.03878 \text{ V}$$

The 99% confidence interval for $\mu$ is $\bar{y} \pm 2.58\sqrt{s^2/n}$. That is,

$$10.08 < \mu < 10.28$$
2. We estimate $\mu$ using the sample mean:

$$\bar{y} = \frac{\sum y_i}{n} = \frac{147.75}{75} = 1.97$$

We estimate $\sigma^2$ using the sample variance:

$$s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2
= \frac{1}{n-1}\left\{\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right\}
= \frac{1}{74}\left\{292.8175 - \frac{1}{75}(147.75)^2\right\} = 0.02365$$

The estimated standard error of the mean is

$$\sqrt{\frac{s^2}{n}} = \sqrt{\frac{0.02365}{75}} = 0.01776$$

The 95% confidence interval for $\mu$ is $\bar{y} \pm 1.96\sqrt{s^2/n}$. That is,

$$1.935 < \mu < 2.005$$

The 95% confidence interval for the median time, in minutes, to complete the task is

$$e^{1.935} < M < e^{2.005}$$

That is,

$$6.93 < M < 7.42$$
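The same calculation for Exercise 2, including the back-transformation to the median, can be sketched as follows (plain Python, our own check rather than part of the original answer):

```python
from math import sqrt, exp

n, sum_y, sum_y2 = 75, 147.75, 292.8175

y_bar = sum_y / n                          # 1.97
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)   # 0.02365...
se = sqrt(s2 / n)                          # 0.01776...

lo, hi = y_bar - 1.96 * se, y_bar + 1.96 * se
print("95% CI for mu          :", (lo, hi))            # about (1.935, 2.005)
print("95% CI for median time :", (exp(lo), exp(hi)))  # about (6.93, 7.42)
```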