Chapter Three - Estimation
ESTIMATION
Learning objectives
At the end of this chapter the student will be able to:
1. Understand the concepts of sample statistics and population parameters
2. Understand the principles of sampling distributions of means and
proportions and calculate their standard errors
3. Understand the principles of estimation and differentiate between point
and interval estimations
4. Compute appropriate confidence intervals for population means and
proportions and interpret the findings
5. Describe methods of sample size calculation for cross-sectional studies.
Introduction
• In this chapter the concepts of sample statistics and population
parameters are described.
• The sample from a population is used to provide the estimates of the
population parameters.
• The standard error, one of the most important concepts in statistical
inference, is introduced.
• Methods for calculating confidence intervals for population means
and proportions are given.
• The importance of the normal distribution (Z distribution) is stressed
throughout the chapter.
Point Estimation
• Definition: A parameter is a numerical descriptive measure of a
population (μ is an example of a parameter).
• A statistic is a numerical descriptive measure of a sample (X̄ is an
example of a statistic).
• To each sample statistic there corresponds a population parameter.
We use X̄, S², S, p, etc. to estimate μ, σ², σ, P (or π), etc.
Sample statistic                 Corresponding population parameter
X̄ (sample mean)                  μ (population mean)
S² (sample variance)             σ² (population variance)
S (sample standard deviation)    σ (population standard deviation)
p (sample proportion)            P or π (population proportion)
• We have already seen that the mean X̄ of a sample can be used to
estimate μ. This does not, of course, indicate that the mean of every
sample will equal the population mean.
• Definition: A point estimate of some population parameter θ is a
single value θ̂ of a sample statistic.
• E.g. The mean survival time of 91 laboratory rats after removal of the
thyroid gland was 82 days with a standard deviation of 10 days
(assume the rats were randomly selected).
• In the above example, the point estimates for the population
parameters μ and σ (with regard to the survival time of all laboratory
rats after removal of the thyroid gland) are 82 days and 10 days
respectively.
Sampling Distribution of Means
• The sampling distribution of means is one of the most fundamental
concepts of statistical inference, and it has remarkable properties.
Since it is a frequency distribution, it has its own mean and standard
deviation.
One may generate the sampling distribution of means as follows:
1) Obtain a sample of n observations selected completely at random from
a large population. Determine their mean and then replace the
observations in the population.
2) Obtain another random sample of n observations from the population,
determine their mean and again replace the observations.
3) Repeat the sampling procedure indefinitely, calculating the mean of
the random sample of n each time and subsequently replacing the
observations in the population.
4) The result is a series of means of samples of size n. If each mean in the
series is now treated as an individual observation and arrayed in a
frequency distribution, one determines the sampling distribution of
means of samples of size n.
• Because the scores (X̄s) in the sampling distribution of means are
themselves means (of individual samples), we shall use the notation
σX̄ for the standard deviation of the distribution. The standard
deviation of the sampling distribution of means is called the standard
error of the mean.
• E.g. Obtain repeated random samples of 25 from a large population of males.
• Determine the mean serum uric acid level in each sample, replacing
the 25 observations in the population each time.
• Array the means into a distribution.
• The result is the sampling distribution of mean serum
uric acid levels for samples of size 25.
Properties
1. The mean of the sampling distribution of means is the same as the
population mean, μ .
2. The SD of the sampling distribution of means is σ / √n .
3. The shape of the sampling distribution of means is approximately a
normal curve, regardless of the shape of the population distribution
and provided n is large enough (Central limit theorem).
• In practice, the approximation is a workable one if n is 30 or more.
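The sampling procedure and the first two properties can be checked with a small simulation. The sketch below is only an illustration: the skewed population, the sample size of 30, and the number of repetitions are assumptions chosen for the demonstration, not data from this chapter. Samples are drawn with replacement, and a finite number of repetitions stands in for "repeating indefinitely".

```python
import random
import statistics


def simulate_sample_means(population, n, repetitions, seed=0):
    """Draw `repetitions` random samples of size n (with replacement)
    from `population` and return the list of their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(repetitions)]


# Hypothetical, deliberately skewed population (an assumption for this sketch).
population_rng = random.Random(1)
population = [population_rng.expovariate(1 / 5.0) for _ in range(10_000)]

n = 30
means = simulate_sample_means(population, n, repetitions=5_000)

mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population standard deviation

# Property 1: the mean of the sample means should be close to mu.
print("population mean      :", round(mu, 3))
print("mean of sample means :", round(statistics.fmean(means), 3))
# Property 2: the SD of the sample means should be close to sigma / sqrt(n).
print("sigma / sqrt(n)      :", round(sigma / n ** 0.5, 3))
print("SD of sample means   :", round(statistics.pstdev(means), 3))
```

A histogram of `means` would also look roughly bell-shaped even though the population itself is skewed, which is what property 3 describes.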
• Eg 1. Suppose you have a population of four members with
values 10, 20, 30 and 40.
If you take all conceivable samples of size 2 with replacement:
a) What is the frequency distribution of the sample means?
b) Find the mean and standard deviation of the distribution (standard
error of the mean).
Possible samples             x̄i (sample mean)
• (10, 20) or (20, 10)       15
• (10, 30) or (30, 10)       20
• (10, 40) or (40, 10)       25
• (20, 30) or (30, 20)       25
• (20, 40) or (40, 20)       30
• (30, 40) or (40, 30)       35
• (10, 10)                   10
• (20, 20)                   20
• (30, 30)                   30
• (40, 40)                   40
a) Frequency distribution of sample means
Sample mean (x̄i)    Frequency (fi)
10                   1
15                   2
20                   3
25                   4
30                   3
35                   2
40                   1
• b) i) The mean of the sampling distribution = ∑ x̄i fi / ∑ fi
= 400 / 16 = 25
ii) The standard deviation of the sampling distribution (the standard
error of the mean) = σX̄ = √{ ∑ fi (x̄i - 25)² / ∑ fi }
= √[ {1(10 - 25)² + 2(15 - 25)² + … + 1(40 - 25)²} / 16 ]
= √(1000 / 16) = √62.5 = 7.9
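The enumeration in Eg 1 can also be reproduced mechanically. The following sketch lists all 16 ordered samples of size 2 drawn with replacement, builds the frequency distribution of their means, and recomputes the mean and standard deviation of that distribution. The variable names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import sqrt

population = [10, 20, 30, 40]
n = 2

# All ordered samples of size 2 drawn with replacement: 4 x 4 = 16 samples.
sample_means = [sum(sample) / n for sample in product(population, repeat=n)]

# a) Frequency distribution of the sample means.
freq = Counter(sample_means)
total = sum(freq.values())

# b) Mean and standard deviation of the sampling distribution.
mean_of_means = sum(x * f for x, f in freq.items()) / total
variance = sum(f * (x - mean_of_means) ** 2 for x, f in freq.items()) / total
standard_error = sqrt(variance)

for x in sorted(freq):
    print(x, freq[x])
print(mean_of_means, round(standard_error, 1))  # 25.0 7.9
```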
Exercise: For the population given above (10, 20, 30 and 40):
f) Find the population mean. Show that the population mean (μ) equals the
mean of the sampling distribution.
g) Find the population standard deviation (σ) and show that the standard
error of the mean is σX̄ = σ / √n
(that is, the standard error of the mean is equal to the population
standard deviation divided by the square root of the sample size).
Interval Estimation (large samples)
• A point estimate gives no indication of how far it may lie from the
parameter. A more useful method of estimation is to compute an
interval which has a high probability of containing the parameter.
Definition: An interval estimate is a statement that a population
parameter has a value lying between two specified limits.
Confidence interval for a single mean
• Consider the standard normal distribution and the statement
Pr(-1.96 ≤ Z ≤ 1.96) = 0.95
This is merely a shorthand algebraic statement that 95% of the
standard normal curve lies between -1.96 and +1.96. If one chooses
the sampling distribution of means (a normal curve with mean μ and
standard deviation σ/√n), then
Pr(-1.96 ≤ (X̄ - μ)/(σ/√n) ≤ 1.96) = 0.95
• A little manipulation without altering the probability value of 95
percent gives
Pr(X̄ - 1.96(σ/√n) ≤ μ ≤ X̄ + 1.96(σ/√n)) = 0.95
The range X̄ - 1.96(σ/√n) to X̄ + 1.96(σ/√n) is called the 95%
confidence interval;
X̄ - 1.96(σ/√n) is the lower confidence limit while X̄ + 1.96(σ/√n) is
the upper confidence limit.
• The confidence, expressed as a proportion, that the interval
X̄ - 1.96(σ/√n) to X̄ + 1.96(σ/√n) contains the unknown population mean
is called the confidence coefficient.
When this coefficient is 0.95 as given above, the following formal
definition of confidence interval is given. If many different random
samples are taken, and if the confidence interval for each is determined,
then it is expected that 95% of these computed intervals will contain the
population mean ( μ ) .
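This repeated-sampling definition can be illustrated by simulation. The sketch below assumes a hypothetical normal population (μ = 100, σ = 15) and samples of size 36. None of these values come from the chapter. It counts how often the interval X̄ ± 1.96(σ/√n) contains μ.

```python
import random
import statistics


def coverage(mu, sigma, n, repetitions, z=1.96, seed=0):
    """Share of simulated intervals (sample mean +/- z * sigma / sqrt(n))
    that contain the true population mean mu."""
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / repetitions


# Hypothetical population parameters, chosen only for illustration.
share = coverage(mu=100.0, sigma=15.0, n=36, repetitions=2_000)
print(share)  # expected to be close to 0.95, but not exactly 0.95
```

The observed share varies a little from run to run (here it is fixed by the seed), which is itself a reminder that 95% describes the long-run behaviour of the procedure.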
Clearly, there is no practical reason to take repeated samples of size n
and determine the corresponding confidence intervals.
However, knowledge of the properties of the sampling distribution
of means (as if one had hypothetically obtained these repeated samples)
permits one to draw a conclusion based upon a single sample, as was shown
repeatedly in the previous sections.
• From the above definition of a confidence interval (C.I.), the widely
used definition is derived. That is, when one claims X̄ ± 1.96(σ/√n) as
the limits on μ, there is a 95% chance that the statement is correct
(that μ is contained within the interval).
If more than 95% certainty regarding the population mean were desired,
say a 99% C.I., the only change needed is to use ±2.58 (the
point enclosing 99% of the standard normal curve), which gives X̄ ±
2.58(σ/√n).
• Example
• Eg 1. The mean reading speed of a random sample of 81 adults is 325
words per minute. Find a 90% C.I. for the mean reading speed of all
adults (μ) if it is known that the standard deviation for all adults is 45
words per minute.
• Given n = 81, σ = 45, x̄ = 325, Z = ±1.64 (the point enclosing 90% of
the standard normal curve)
A 90% C.I. for μ is x̄ ± 1.64(σ/√n) = 325 ± (1.64 × 45/√81)
= 325 ± (1.64 × 5) = 325 ± 8.2
= (316.8, 333.2)
Therefore, a 90% C.I. for μ is 316.8 to 333.2 words per minute.
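The same calculation can be written as a small function. This is a sketch under the chapter's assumptions (known σ, sampling distribution of means treated as normal). The function name `z_interval_mean` is made up for illustration. It takes Z from the standard normal distribution rather than from a table, so it uses about 1.645 where the text rounds to 1.64. The limits agree to one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_mean(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known:
    sample_mean +/- Z * sigma / sqrt(n)."""
    # Z is the point enclosing the central `confidence` share of the
    # standard normal curve (e.g. about 1.96 for 0.95).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Reading-speed example: n = 81, sigma = 45, sample mean = 325, 90% C.I.
low, high = z_interval_mean(325, 45, 81, confidence=0.90)
print(round(low, 1), round(high, 1))  # 316.8 333.2
```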
• Exercise
• A random sample of 100 drug-treated patients has a mean survival
time of 46.9 months. If the SD of the population is 43.3 months, find a
95% confidence interval for the population mean.
• (The population consists of survival times of cancer patients who have
been treated with a new drug)
Confidence interval for the difference of means
• Consider two different populations. The first population (X) has
mean μx and standard deviation σx; the second (Y) has mean μy and
standard deviation σy. From the first population take a sample of size
nx and compute its mean x̄; from the second population take
independently a sample of size ny and compute ȳ; then determine
x̄ - ȳ. Do this for all pairs of samples that can be chosen independently
from the two populations. The differences, x̄ - ȳ, are a new set of
scores which form the sampling distribution of differences of means.
• The characteristics of the sampling distribution of differences of
means are:
1) The mean of the sampling distribution of differences of means
equals the difference of the population means (Mean = μx - μy).
2) The standard deviation of the sampling distribution of differences of
means, also called the standard error of differences of means, is
denoted by σ(x̄ - ȳ).
σ(x̄ - ȳ) = √(σx̄² + σȳ²), where σx̄ is the standard error of the mean
for samples from the first population and σȳ is the standard error of
the mean for samples from the second population
(σx̄² = σx²/nx; σȳ² = σy²/ny).
3) The sampling distribution is normal if both populations are normal,
and is approximately normal if the samples are large enough (even if
the populations aren’t normal).
In practice, it is assumed that the sampling distribution of
differences of means is normal if both nx and ny are ≥ 30.
A formula for the C.I. is found by solving Z = {(x̄ - ȳ) - (μx - μy)} / σ(x̄ - ȳ)
for μx - μy; hence the C.I. for the difference of means is (x̄ - ȳ) ± Z·σ(x̄ - ȳ).
• Example: If a random sample of 50 non-smokers has a mean lifetime of
76 years with a standard deviation of 8 years, and a random sample of
65 smokers has a mean lifetime of 68 years with a standard deviation of
9 years,
A) What is the point estimate for the difference of the population
means?
B) Find a 95% C.I. for the difference of mean lifetime of non-smokers
and smokers.
Given: Population x (non-smokers): nx = 50, x̄ = 76, Sx = 8,
σx̄² = Sx²/nx = 8²/50 = 1.28 years²
• Population y (smokers): ny = 65, ȳ = 68, Sy = 9,
σȳ² = Sy²/ny = 9²/65 = 1.25 years²
(Both samples are large, so the sample standard deviations are used in
place of σx and σy.)
A) A point estimate for the difference of population means (μx - μy)
= x̄ - ȳ = 76 - 68 = 8 years
B) At a 95% confidence level, Z = ±1.96, σ(x̄ - ȳ) = √(σx̄² + σȳ²)
= √(1.28 + 1.25) = 1.59 years
Hence, 95% C.I. for μx - μy = (x̄ - ȳ) ± 1.96 σ(x̄ - ȳ) = 8 ± 1.96(1.59)
= 8 ± 3.12 = (4.88 to 11.12 years)
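The non-smoker/smoker example can be checked with a short function. This is a sketch that assumes large independent samples, so the sample standard deviations stand in for the population values. The function name is hypothetical, and Z comes from the standard normal distribution (about 1.96 for 95%).

```python
from math import sqrt
from statistics import NormalDist


def z_interval_mean_difference(mean_x, sd_x, n_x, mean_y, sd_y, n_y,
                               confidence=0.95):
    """Large-sample confidence interval for mu_x - mu_y.
    Returns (lower limit, upper limit, standard error of the difference)."""
    # Standard error of the difference: sqrt(sd_x^2 / n_x + sd_y^2 / n_y).
    se = sqrt(sd_x ** 2 / n_x + sd_y ** 2 / n_y)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = mean_x - mean_y
    return diff - z * se, diff + z * se, se


# Non-smokers: n = 50, mean 76, SD 8; smokers: n = 65, mean 68, SD 9.
low, high, se = z_interval_mean_difference(76, 8, 50, 68, 9, 65)
print(round(se, 2))                   # 1.59
print(round(low, 2), round(high, 2))  # 4.88 11.12
```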
Confidence interval for a single proportion
• Notation: P (or π) = proportion of “successes” in a population
(parameter)
Q = 1 - P = proportion of “failures” in a population
p = proportion of “successes” in a sample
q = 1 - p = proportion of “failures” in a sample
σp = standard deviation of the sampling distribution of proportions
= standard error of proportions
n = size of the sample
• The population represents categorical data, while the scores in the
sampling distribution are proportions between 0 and 1. This set of
proportions has a mean and standard deviation.
The sampling distribution of proportions has the following
characteristics:
1. Its mean = P, the proportion in the population.
2. σp = √(PQ / n)
3. The shape is approximately normal provided n is sufficiently large;
in this case, nP ≥ 5 and nQ ≥ 5 are the requirements for sufficiently large
n (central limit theorem for proportions).
• The confidence interval for the population proportion (P) is given by
the formula p ± Z σp (that is, p - Z σp and p + Z σp), where p = sample
proportion and σp = standard error of the proportion, √(PQ / n).
• Example: An epidemiologist is worried about the ever-increasing
trend of malaria in a certain locality and wants to estimate the
proportion of persons infected in the peak malaria transmission
period. If he takes a random sample of 150 persons in that locality
during the peak transmission period and finds that 60 of them are
positive for malaria, find
• a) 95% b) 90% c) 99% confidence intervals for
the proportion of all people in that locality who are infected during the
peak malaria transmission period.
• Sample proportion p = 60 / 150 = 0.4. The standard error of the proportion
depends on the population proportion P. However, P
is unknown. In such situations, √(pq / n) can be used as an approximation
to σp: √(pq / n) = √(0.4 × 0.6 / 150) = 0.04
• a) A 95% C.I. for the population proportion (the proportion of
all people in that locality who are infected) = 0.4 ± 1.96(0.04)
= (0.4 ± 0.078)
= (0.322, 0.478)
b) A 90% C.I. for the population proportion
= 0.4 ± 1.64(0.04)
= (0.4 ± 0.066)
= (0.334, 0.466)
c) A 99% C.I. for the population proportion
= 0.4 ± 2.58(0.04)
= (0.4 ± 0.103)
= (0.297, 0.503)
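The three malaria intervals can be recomputed with one function. This is a sketch and the function name is hypothetical. It uses √(pq/n) as the standard error (the approximation described in the example), checks the np ≥ 5 and nq ≥ 5 condition on the sample values, and takes Z from the standard normal distribution instead of the rounded table values. The limits agree with the worked results to three decimal places.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_proportion(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion:
    p +/- Z * sqrt(p * q / n)."""
    p = successes / n
    q = 1 - p
    # Requirement for the normal approximation, checked on the sample values.
    if n * p < 5 or n * q < 5:
        raise ValueError("normal approximation needs np >= 5 and nq >= 5")
    se = sqrt(p * q / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p - z * se, p + z * se


# Malaria example: 60 positives in a random sample of 150 persons.
for level in (0.95, 0.90, 0.99):
    low, high = z_interval_proportion(60, 150, confidence=level)
    print(f"{level:.0%} C.I.: ({low:.3f}, {high:.3f})")
```

Comparing the three lines of output shows the trade-off already visible in the worked example: a higher confidence coefficient gives a wider interval.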
Confidence interval for the difference of two proportions
• By the same analogy, the C.I. for the difference of proportions (Px -
Py) is given by the following formula.
• C.I. for Px - Py = (px - py) ± Z σ(px - py), where Z is determined by the
confidence coefficient and σ(px - py) = √{(px qx)/nx + (py qy)/ny}
• Example: Each of two groups consists of 100 patients who have
leukaemia. A new drug is given to the first group but not to the
second (the control group). It is found that in the first group 75 people
have remission for 2 years; but only 60 in the second group. Find 95%
confidence limits for the difference in the proportion of all patients
with leukaemia who have remission for 2 years.
• Note that nxpx = 100 × 0.75 = 75 > 5
nxqx = 100 × 0.25 = 25 > 5
nypy = 100 × 0.60 = 60 > 5
nyqy = 100 × 0.40 = 40 > 5
• px = 0.75,
• qx = 0.25,
• nx = 100,
• σpx² = pxqx / nx = 0.75 × 0.25 / 100
= 0.001875
• py = 0.60,
• qy = 0.40,
• ny = 100,
• σpy² = pyqy / ny = 0.60 × 0.40 / 100
= 0.0024
• Hence, σ(px - py) = √(σpx² + σpy²) = √{(pxqx / nx) + (pyqy / ny)}
= √(0.001875 + 0.0024) = 0.065
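The standard error above, and the interval (px - py) ± Z σ(px - py) that follows from the formula given earlier, can be computed as follows. This is a sketch with a hypothetical function name. It assumes the two groups are independent random samples and takes Z from the standard normal distribution (about 1.96 for 95%), so the limits may differ slightly from a hand calculation that uses the rounded standard error.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_proportion_difference(successes_x, n_x, successes_y, n_y,
                                     confidence=0.95):
    """Large-sample confidence interval for Px - Py.
    Returns (lower limit, upper limit, standard error of the difference)."""
    p_x = successes_x / n_x
    p_y = successes_y / n_y
    # Standard error: sqrt(px * qx / nx + py * qy / ny).
    se = sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_x - p_y
    return diff - z * se, diff + z * se, se


# Leukaemia example: 75 of 100 treated and 60 of 100 control patients
# had remission for 2 years.
low, high, se = z_interval_proportion_difference(75, 100, 60, 100)
print(round(se, 3))  # 0.065
print(round(low, 3), round(high, 3))
```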