Chapter Three

CHAPTER THREE
ESTIMATION
Learning objectives
At the end of this chapter the student will be able to:
1. Understand the concepts of sample statistics and population parameters
2. Understand the principles of sampling distributions of means and
proportions and calculate their standard errors
3. Understand the principles of estimation and differentiate between point
and interval estimations
4. Compute appropriate confidence intervals for population means and
proportions and interpret the findings
5. Describe methods of sample size calculation for cross-sectional studies.
Introduction
• In this chapter the concepts of sample statistics and population
parameters are described.
• The sample from a population is used to provide the estimates of the
population parameters.
• The standard error, one of the most important concepts in statistical
inference, is introduced.
• Methods for calculating confidence intervals for population means
and proportions are given.
• The importance of the normal distribution (Z distribution) is stressed
throughout the chapter.
Point Estimation
• Definition: A parameter is a numerical descriptive measure of a
population ( μ is an example of a parameter).
• A statistic is a numerical descriptive measure of a sample ( X̄ is an
example of a statistic).
• To each sample statistic there corresponds a population parameter.
We use X̄, S², S, p, etc. to estimate μ, σ², σ, P (or π), etc.
Sample statistic                 Corresponding population parameter
X̄ (sample mean)                  μ (population mean)
S² (sample variance)             σ² (population variance)
S (sample standard deviation)    σ (population standard deviation)
p (sample proportion)            P or π (population proportion)
• We have already seen that the mean X̄ of a sample can be used to
estimate μ. This does not, of course, indicate that the mean of every
sample will equal the population mean.
• Definition: A point estimate of some population parameter θ is a
single value θ̂ of a sample statistic.
• Eg. The mean survival time of 91 laboratory rats after removal of the
thyroid gland was 82 days with a standard deviation of 10 days
(assume the rats were randomly selected).
• In the above example, the point estimates for the population
parameters μ and σ ( with regard to the survival time of all laboratory
rats after removal of the thyroid gland) are 82 days and 10 days
respectively.
Sampling Distribution of Means
• The sampling distribution of means is one of the most fundamental
concepts of statistical inference, and it has remarkable properties.
Since it is a frequency distribution it has its own mean and standard
deviation.
One may generate the sampling distribution of means as follows:
1) Obtain a sample of n observations selected completely at random from
a large population. Determine their mean and then replace the
observations in the population.
2) Obtain another random sample of n observations from the population,
determine their mean and again replace the observations.
3) Repeat the sampling procedure indefinitely, calculating the mean of
the random sample of n each time and subsequently replacing the
observations in the population.
4) The result is a series of means of samples of size n. If each mean in the
series is now treated as an individual observation and arrayed in a
frequency distribution, one determines the sampling distribution of
means of samples of size n.
• Because the scores ( X̄s ) in the sampling distribution of means are
themselves means (of individual samples), we shall use the notation
σ_X̄ for the standard deviation of the distribution. The standard
deviation of the sampling distribution of means is called the standard
error of the mean.
• Eg. • Obtain repeated samples of 25 from a large population of males.
• Determine the mean serum uric acid level in each sample, replacing
the 25 observations in the population each time.
• Array the means into a distribution.
• Then you will generate the sampling distribution of mean serum
uric acid levels of samples of size 25.
Properties
1. The mean of the sampling distribution of means is the same as the
population mean, μ .
2. The SD of the sampling distribution of means is σ / √n .
3. The shape of the sampling distribution of means is approximately a
normal curve, regardless of the shape of the population distribution
and provided n is large enough (Central limit theorem).
• In practice, the approximation is a workable one if n is 30 or more.
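The three properties above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical skewed one (exponential, with mean 1 and standard deviation 1), "repeat indefinitely" is approximated by 5000 samples, and the seed and variable names are arbitrary choices, not part of the chapter.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30                   # size of each sample
num_samples = 5000       # stand-in for "repeat indefinitely"

# Hypothetical skewed population: exponential with mean 1 and SD 1.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

# Property 1: the mean of the sample means should be close to mu = 1.
print("mean of sample means:", round(mean(sample_means), 3))
# Property 2: their SD should be close to sigma / sqrt(n).
print("SD of sample means:  ", round(pstdev(sample_means), 3))
print("sigma / sqrt(n):     ", round(1 / sqrt(n), 3))
```

A histogram of sample_means would look roughly bell-shaped even though the population itself is strongly skewed, which is property 3.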
• Eg 1. Suppose you have a population having four members with
values 10,20,30 and 40 .
If you take all conceivable samples of size 2 with replacement:
a) What is the frequency distribution of the sample means ?
b) Find the mean and standard deviation of the distribution (standard
error of the mean).
Possible samples              x̄i (sample mean)
• (10, 20) or (20, 10)        15
• (10, 30) or (30, 10)        20
• (10, 40) or (40, 10)        25
• (20, 30) or (30, 20)        25
• (20, 40) or (40, 20)        30
• (30, 40) or (40, 30)        35
• (10, 10)                    10
• (20, 20)                    20
• (30, 30)                    30
• (40, 40)                    40
a) frequency distribution of sample means
sample mean ( x̄i ) frequency (fi)
10 1
15 2
20 3
25 4
30 3
35 2
40 1
• b) i) The mean of the sampling distribution = ∑ x̄i fi / ∑fi
= 400 / 16 = 25
ii) The standard deviation of the sampling distribution (the standard
error of the mean) = σ_X̄ = √{ ∑ fi (x̄i - 25)² / ∑fi }
= √{ [1(10 - 25)² + 2(15 - 25)² + …. + 1(40 - 25)²] / 16 } = √(1000 / 16)
= √62.5 = 7.9
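The enumeration in Eg 1 can also be checked mechanically. The following is a minimal sketch, with illustrative variable names, that lists every ordered sample of size 2 drawn with replacement and recomputes the frequency distribution, its mean and its standard deviation.

```python
from collections import Counter
from itertools import product
from math import sqrt

population = [10, 20, 30, 40]
n = 2

# All 16 ordered samples of size 2 drawn with replacement.
sample_means = [sum(s) / n for s in product(population, repeat=n)]

# a) Frequency distribution of the sample means.
freq = Counter(sample_means)
for value in sorted(freq):
    print(value, freq[value])

# b) Mean and standard deviation of the sampling distribution.
mean_of_means = sum(sample_means) / len(sample_means)
std_error = sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)
)
print(mean_of_means, round(std_error, 1))
```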
Exercise: For the population given above (10, 20, 30 and 40)
a) Find the population mean. Show that the population mean ( μ ) = the
mean of the sampling distribution.
b) Find the population standard deviation ( σ ) and show that the standard
error of the mean σ_X̄ = σ/√n
(that is, the standard error of the mean is equal to the population
standard deviation divided by the square root of the sample size)
Interval Estimation (large samples)
• A point estimate does not give any indication on how far away the
parameter lies. A more useful method of estimation is to compute an
interval which has a high probability of containing the parameter.
Definition: An interval estimate is a statement that a population
parameter has a value lying between two specified limits.
Confidence interval for a single mean
• Consider the standard normal distribution and the statement
Pr(-1.96 ≤ Z ≤ 1.96) = 0.95
This is merely a shorthand algebraic statement that 95% of the
standard normal curve lies between -1.96 and +1.96. If one chooses
the sampling distribution of means (a normal curve with mean μ and
standard deviation σ/√n), then
Pr(-1.96 ≤ (X̄ - μ)/(σ/√n) ≤ 1.96) = 0.95
• A little manipulation without altering the probability value of 95
percent gives
Pr(X̄ - 1.96(σ/√n) ≤ μ ≤ X̄ + 1.96(σ/√n)) = 0.95
The range X̄ - 1.96(σ/√n) to X̄ + 1.96(σ/√n) is called the 95%
confidence interval;
X̄ - 1.96(σ/√n) is the lower confidence limit while X̄ + 1.96(σ/√n) is
the upper confidence limit.
• The confidence, expressed as a proportion, that the interval
X̄ - 1.96(σ/√n) to X̄ + 1.96(σ/√n) contains the unknown population mean
is called the confidence coefficient.
When this coefficient is 0.95 as given above, the following formal
definition of confidence interval is given. If many different random
samples are taken, and if the confidence interval for each is determined,
then it is expected that 95% of these computed intervals will contain the
population mean ( μ ) .
Clearly, there appears to be no rationale (logical basis) for actually
taking repeated samples of size n and determining the corresponding
confidence intervals.
However, the knowledge of the properties of these sampling distributions
of means (if one hypothetically obtained these repeated samples) permits
one to draw a conclusion based upon one sample and this was shown
repeatedly in the previous sections.
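The repeated-sampling definition can be made concrete with a simulation. This is a sketch only: the population values (μ = 50, σ = 10), the sample size of 36, the number of trials and the seed are all hypothetical choices made for illustration.

```python
import random
from math import sqrt

rng = random.Random(1)            # fixed seed so the run is repeatable
mu, sigma, n = 50.0, 10.0, 36     # hypothetical normal population and sample size
trials = 2000                     # number of repeated samples

half_width = 1.96 * sigma / sqrt(n)
covered = 0
for _ in range(trials):
    xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    # Does this sample's 95% interval contain the true mean?
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / trials
print("proportion of intervals containing mu:", coverage)
```

The printed proportion should come out close to 0.95, which is what the confidence coefficient describes.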
• From the above definition of confidence interval (C.I.), the widely
used definition is derived. That is, when one claims X̄ ± 1.96 (σ/√n) as
the limits on μ, there is a 95% chance that the statement is correct
(that μ is contained within the interval).
If more than 95% certainty regarding the population mean were
desired (say, a 99% C.I.), the only change needed is to use ±2.58 (the
point enclosing 99% of the standard normal curve), which gives X̄ ±
2.58 (σ/√n).
• Example
• Eg 1. The mean reading speed of a random sample of 81 adults is 325
words per minute. Find a 90% C.I. for the mean reading speed of all
adults (μ) if it is known that the standard deviation for all adults is 45
words per minute.
• Given n = 81, σ = 45, x̄ = 325, Z = ± 1.64 (the point enclosing 90% of
the standard normal curve)
A 90% C.I. for μ is x̄ ± 1.64 (σ/√n) = 325 ± (1.64 × 5) = 325 ± 8.2
= (316.8, 333.2)
Therefore, a 90% C.I. for μ is 316.8 to 333.2 words per minute.
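The calculation in Eg 1 can be written as a small helper. This is a minimal sketch; the function name z_interval_mean is made up for illustration, and the critical value 1.64 is the one used in the text.

```python
from math import sqrt

def z_interval_mean(xbar, sigma, n, z):
    """Return (lower, upper) limits of xbar ± z * sigma / sqrt(n)."""
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Reading-speed example: n = 81, sigma = 45, sample mean = 325, Z = 1.64.
low, high = z_interval_mean(325, 45, 81, 1.64)
print(round(low, 1), round(high, 1))   # 316.8 333.2
```

Passing z = 1.96 or z = 2.58 instead gives the 95% and 99% intervals in the same way.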
• Exercise
• A random sample of 100 drug-treated patients has a mean survival
time of 46.9 months. If the SD of the population is 43.3 months, find a
95% confidence interval for the population mean.
• (The population consists of survival times of cancer patients who have
been treated with a new drug)
Confidence interval for the difference of means
• Consider two different populations. The first population ( X ) has
mean μx and standard deviation σx, the second ( Y ) has mean μy and
standard deviation σy. From the first population take a sample of size
nx and compute its mean x̄; from the second population take
independently a sample of size ny and compute ȳ; then determine
x̄ - ȳ. Do this for all pairs of samples that can be chosen independently
from the two populations. The differences, x̄ - ȳ, are a new set of
scores which form the sampling distribution of differences of means.
• The characteristics of the sampling distribution of differences of
means are:
1) The mean of the sampling distribution of differences of means
equals the difference of the population means ( Mean = μx - μy ).
2) The standard deviation of the sampling distribution of differences of
means, also called the standard error of differences of means, is
denoted by σ(x̄ - ȳ).
σ(x̄ - ȳ) = √( σ_x̄² + σ_ȳ² ), where σ_x̄ is the standard error of the mean
of the first population and σ_ȳ is the standard error of the mean of the
second population ( σ_x̄² = σx²/nx ; σ_ȳ² = σy²/ny ).
3) The sampling distribution is normal if both populations are normal,
and is approximately normal if the samples are large enough (even if
the populations aren’t normal).
In practice, it is assumed that the sampling distribution of
differences of means is normal if both nx and ny are ≥ 30.
A formula for the C.I. is found by solving Z = {(x̄ - ȳ) - (μx - μy)} / σ(x̄ - ȳ)
for μx - μy; hence the C.I. for the difference of means is (x̄ - ȳ) ± Z σ(x̄ - ȳ)
• Example: If a random sample of 50 non-smokers have a mean life of
76 years with a standard deviation of 8 years, and a random sample of
65 smokers have a mean life of 68 years with a standard deviation of 9 years,
A) What is the point estimate for the difference of the population
means?
B) Find a 95% C.I. for the difference of mean lifetime of non-smokers
and smokers.
Given: Population x (non-smokers): nx = 50, x̄ = 76, Sx = 8, σ_x̄² = Sx²/nx
= 8²/50 = 1.28 years²
• Population y (smokers): ny = 65, ȳ = 68, Sy = 9, σ_ȳ² = Sy²/ny = 9²/65
= 1.25 years²
(Because both samples are large, Sx and Sy are used in place of σx and σy.)
A) A point estimate for the difference of population means (μx - μy)
= x̄ - ȳ = 76 - 68 = 8 years
B) At a 95% confidence level, Z = ± 1.96, σ(x̄ - ȳ) = √( σ_x̄² + σ_ȳ² )
= √(1.28 + 1.25) = 1.59 years
Hence, 95% C.I. for μx - μy = (x̄ - ȳ) ± 1.96 σ(x̄ - ȳ) = 8 ± 1.96 (1.59) = 8
± 3.12 = (4.88 to 11.12 years)
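The smokers example can be reproduced in a few lines. This is a sketch under the same large-sample assumption as the example (sample standard deviations stand in for the population ones); the function name diff_means_interval is illustrative.

```python
from math import sqrt

def diff_means_interval(xbar, sx, nx, ybar, sy, ny, z):
    """Large-sample C.I. for mu_x - mu_y; returns (lower, upper, std_error)."""
    std_error = sqrt(sx ** 2 / nx + sy ** 2 / ny)
    diff = xbar - ybar
    return diff - z * std_error, diff + z * std_error, std_error

# Non-smokers: n = 50, mean 76, SD 8. Smokers: n = 65, mean 68, SD 9.
low, high, se = diff_means_interval(76, 8, 50, 68, 9, 65, 1.96)
print(round(se, 2), round(low, 2), round(high, 2))
```

The limits differ from the hand calculation only in the last decimal place, because the text rounds the intermediate values to two decimals.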
Confidence interval for a single proportion
• Notation: P (or π) = proportion of “successes” in a population
(parameter)
Q = 1-P = proportion of “failures” in a population
p = proportion of successes in a sample
q = 1-p proportion of “failures” in a sample
σp= Standard deviation of the sampling distribution of proportions
= Standard error of proportions
n = size of the sample
• The population represents categorical data while the scores in the
sampling distribution are proportions between 0 and 1. This set of
proportions has a mean and standard deviation.
The sampling distribution of proportions has the following
characteristics:
1. Its mean = P, the proportion in the population.
2. σp = √(PQ / n)
3. The shape is approximately normal provided n is sufficiently large -
in this case, nP ≥ 5 and nQ ≥ 5 are the requirements for sufficiently large
n (central limit theorem for proportions).
• The confidence interval for the population proportion (P) is given by
the formula: p ± Z σp (that is, p - Z σp and p + Z σp), where p = sample
proportion and σp = standard error of the proportion, √(PQ / n).
• Example: An epidemiologist is worried about the ever increasing
trend of malaria in a certain locality and wants to estimate the
proportion of persons infected in the peak malaria transmission
period. If he takes a random sample of 150 persons in that locality
during the peak transmission period and finds that 60 of them are
positive for malaria, find
• a) 95% b) 90% c) 99% confidence intervals for
the proportion of all people in that locality who are infected during the
peak malaria transmission period.
• Sample proportion = 60 / 150 = 0.4. The standard error of the proportion
depends on the population P. However, the population proportion (P)
is unknown. In such situations, √(pq/n) can be used as an approximation
to σp: √(pq/n) = √(0.4 × 0.6 / 150) = 0.04
• a) A 95% C.I. for the population proportion (the proportion of
infected people in the whole locality) = 0.4 ± 1.96 (0.04)
= (0.4 ± 0.078)
= (0.322, 0.478)
• b) A 90% C.I. for the population proportion (the proportion of
infected people in the whole locality)
= 0.4 ± 1.64 (0.04)
= (0.4 ± 0.066)
= (0.334, 0.466)
• c) A 99% C.I. for the population proportion (the proportion of
infected people in the whole locality)
= 0.4 ± 2.58 (0.04)
= (0.4 ± 0.103)
= (0.297, 0.503)
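The three intervals can be computed together. This is a minimal sketch that uses the sample proportion in the standard error, as the example does; the function name proportion_interval is illustrative and the critical values are those used in the text.

```python
from math import sqrt

def proportion_interval(successes, n, z):
    """Return (lower, upper) limits of p ± z * sqrt(p * q / n)."""
    p = successes / n
    std_error = sqrt(p * (1 - p) / n)
    return p - z * std_error, p + z * std_error

# Malaria example: 60 positives out of 150 persons.
intervals = {}
for level, z in [("95%", 1.96), ("90%", 1.64), ("99%", 2.58)]:
    intervals[level] = proportion_interval(60, 150, z)
    low, high = intervals[level]
    print(level, round(low, 3), round(high, 3))
```

Note how the interval widens as the confidence level rises: more confidence is bought with less precision.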
Confidence interval for the difference of two proportions
• By the same analogy, the C.I. for the difference of proportions (Px -
Py) is given by the following formula.
• C.I. for Px - Py = (px - py) ± Z σ(px - py), where Z is determined by the
confidence coefficient and σ(px - py) = √{ (px qx)/nx + (py qy)/ny }
• Example: Each of two groups consists of 100 patients who have
leukaemia. A new drug is given to the first group but not to the
second (the control group). It is found that in the first group 75 people
have remission for 2 years; but only 60 in the second group. Find 95%
confidence limits for the difference in the proportion of all patients
with leukaemia who have remission for 2 years.
• Note that nxpx = 100 × 0.75 = 75 > 5
nxqx = 100 × 0.25 = 25 > 5
nypy = 100 × 0.60 = 60 > 5
nyqy = 100 × 0.40 = 40 > 5
• px = 0.75,
• qx = 0.25,
• nx = 100,
• σ_px² = pxqx / nx = 0.75 × 0.25 / 100
= 0.001875
• py = 0.60,
• qy = 0.40,
• ny = 100,
• σ_py² = pyqy / ny = 0.60 × 0.40 / 100
= 0.0024
• Hence, σ(px - py) = √( σ_px² + σ_py² ) = √{ (pxqx / nx) + (pyqy / ny) }
= √(0.001875 + 0.0024) = 0.065
At a 95% confidence level, Z = ± 1.96 and the difference of the two
sample proportions is (0.75 - 0.60) = 0.15.
Therefore, a 95% C.I. for the difference in the proportion with 2-year
remission is (0.15 ± 1.96 (0.065)) = (0.15 ± 0.13) = (0.02 to 0.28).
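The leukaemia example follows the same pattern as the single-proportion case. The sketch below is illustrative only (the function name diff_proportions_interval is made up) and uses the sample proportions in the standard error, as the example does.

```python
from math import sqrt

def diff_proportions_interval(x_success, nx, y_success, ny, z):
    """Large-sample C.I. for Px - Py; returns (lower, upper, std_error)."""
    px, py = x_success / nx, y_success / ny
    std_error = sqrt(px * (1 - px) / nx + py * (1 - py) / ny)
    diff = px - py
    return diff - z * std_error, diff + z * std_error, std_error

# Treated group: 75 of 100 in remission. Control group: 60 of 100.
low, high, se = diff_proportions_interval(75, 100, 60, 100, 1.96)
print(round(se, 3), round(low, 2), round(high, 2))
```

Because the whole interval lies above zero, the data are consistent with a higher remission proportion in the treated group.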
Sample Size Estimation in cross-sectional studies
• In planning any investigation we must decide how many people need
to be studied in order to answer the study objectives. If the study is
too small we may fail to detect important effects, or may estimate
effects too imprecisely. If the study is too large then we will waste
resources.
• In general, it is much better to increase the accuracy of data collection
(by improving the training of data collectors and data collection tools)
than to increase the sample size after a certain point.
• The eventual sample size is usually a compromise between what is
desirable and what is feasible.
• The feasible sample size is determined by the availability of
resources.
• It is also important to remember that resources are not only needed
to collect the information, but also to analyze it.
Estimating a proportion
• estimate how big the proportion might be (p)
• choose the margin of error you will allow in the estimate of the
proportion (say ± w)
• choose the level of confidence that the proportion in the whole
population is indeed between (p-w) and (p+w).
We can never be 100% sure. Do you want to be 95% sure?
• the minimum sample size required, for a very large population (N ≥ 10,000)
is:
n = Z² p(1-p) / w²
Show how the above formula is obtained.
A 95% C.I. for P = p ± 1.96 se. If we want our confidence interval to have a
maximum width of ± w,
1.96 se = w
1.96 √(p(1-p)/n) = w
(1.96)² p(1-p)/n = w²
Hence, n = (1.96)² p(1-p) / w²
• Example:
• a) p = 0.26, w = 0.03, Z = 1.96 (i.e., for a 95% C.I.)
n = (1.96)² (0.26 × 0.74) / (0.03)² = 821.25 ≈ 822. Thus, the study should
include at least 822 subjects.
• b) If the above sample is to be taken from a relatively small
population (say N = 3000), the required minimum sample is obtained
from the above estimate by adjusting for the population size:
n / (1 + n/N) = 821.25 / (1 + (821.25/3000)) = 644.7 ≈ 645 subjects
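Both parts of the example can be checked with a short script. This is a sketch, not a definitive implementation: the function names are illustrative, and the finite-population adjustment is the one used in part b) above, n / (1 + n/N).

```python
from math import ceil

def sample_size_proportion(p, w, z=1.96):
    """Unrounded sample size for estimating a proportion to within ± w."""
    return z ** 2 * p * (1 - p) / w ** 2

def adjust_for_population(n0, population_size):
    """Adjust an initial sample size n0 for a relatively small population."""
    return n0 / (1 + n0 / population_size)

n0 = sample_size_proportion(0.26, 0.03)      # about 821.25
n_small = adjust_for_population(n0, 3000)    # about 644.7

# Round up, since a fraction of a subject cannot be sampled.
print(ceil(n0), ceil(n_small))               # 822 645
```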
Estimating a mean
The same approach is used but SE = σ/√n.
The required (minimum) sample size for a very large population is
given by:
n = Z² σ² / w²
• Eg. A health officer wishes to estimate the mean hemoglobin level in a
defined community. From preliminary contact he thinks this mean is
about 150 mg/l with a standard deviation of 32 mg/l. If he is willing to
tolerate a sampling error of up to 5 mg/l in his estimate,
how many subjects should be included in his study? (α = 5%, two
sided)
• - If the population size is assumed to be very large, the required
sample size would be:
n = (1.96)² (32)² / (5)² = 157.4 ≈ 158 persons
• - If the population size is, say, 2000, the required sample size would
be 157.4 / (1 + (157.4/2000)) ≈ 146 persons.
• NB: σ can be estimated from previous similar studies or could be
obtained by conducting a small pilot study
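The hemoglobin example can be verified in the same way. This is a minimal sketch with an illustrative function name; it applies n = Z²σ²/w² and then the same small-population adjustment, n / (1 + n/N), that was used for proportions.

```python
from math import ceil

def sample_size_mean(sigma, w, z=1.96):
    """Unrounded sample size for estimating a mean to within ± w."""
    return z ** 2 * sigma ** 2 / w ** 2

# Hemoglobin example: sigma = 32 mg/l, tolerated error w = 5 mg/l.
n_large = sample_size_mean(32, 5)            # very large population
n_2000 = n_large / (1 + n_large / 2000)      # population of 2000

print(ceil(n_large), ceil(n_2000))           # 158 146
```

Note that the anticipated mean (150 mg/l) does not enter the calculation; only the standard deviation and the tolerated error do.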
Comparison of two Proportions (sample size in each region)
• n = (p1q1 + p2q2) f(α,β) / (p1 - p2)²
• α = type I error (level of significance)
• β = type II error ( 1-β = power of the study)
• power = the probability of getting a significant result when a true
difference exists
• f(α,β) = (Zα + Zβ)² = 10.5, when the power = 90% and the level of
significance = 5%
• Eg. The proportion of nurses leaving the health service is compared
between two regions. In one region 30% of nurses are estimated to
leave the service within 3 years of graduation. In the other region it is
probably 15%.
• Solution: The required sample to show, with a 90% likelihood
(power), that the percentage of nurses leaving the service is different in
these two regions would be: (assume a confidence level of 95%)
• n = (1.28 + 1.96)² ((0.30 × 0.70) + (0.15 × 0.85)) / (0.30 - 0.15)² = 157.5 ≈ 158
158 nurses are required in each region
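The nurses example can be reproduced as follows. This is a sketch only; the function name is illustrative, and the default critical values 1.96 and 1.28 are the ones quoted in the text for a 5% two-sided significance level and 90% power.

```python
from math import ceil

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=1.28):
    """Unrounded sample size per group for comparing two proportions."""
    f = (z_alpha + z_beta) ** 2          # f(alpha, beta), about 10.5
    return f * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

n_per_region = sample_size_two_proportions(0.30, 0.15)
print(ceil(n_per_region))                # 158
```

The required size grows quickly as the two proportions get closer, because the squared difference sits in the denominator.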
Comparison of two means (sample size in each group)
• n = (s1² + s2²) f(α,β) / (m1 - m2)²
• m1 and s1² are the mean and variance of group 1 respectively.
• m2 and s2² are the mean and variance of group 2 respectively.
• Eg. The birth weights in districts A and B will be compared. In district
A the mean birth weight is expected to be 3000 grams with a
standard deviation of 500 grams. In district B the mean is expected to
be 3200 grams with a standard deviation of 500 grams.
• The required sample size to demonstrate (with a likelihood of 90%,
that is with a power of 90%) a significant difference between the
mean birth weights in districts A and B would be: n = (1.96 + 1.28)²
(500² + 500²) / (3200 – 3000)² = 131.2 ≈ 131 newborn babies in each district
Note that f(α,β) = (1.96 + 1.28)² ≈ 10.5
• That is, α = 0.05 (two sided) ⇒ Z = 1.96; β = (1 - 0.9) = 0.1 (one sided) ⇒
Z = 1.28
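The birth-weight calculation can be checked with a short sketch. The function name is illustrative, and the default critical values are the ones quoted in the text; the function returns the unrounded value so the rounding choice stays visible.

```python
def sample_size_two_means(m1, s1, m2, s2, z_alpha=1.96, z_beta=1.28):
    """Unrounded sample size per group for comparing two means."""
    f = (z_alpha + z_beta) ** 2          # f(alpha, beta), about 10.5
    return f * (s1 ** 2 + s2 ** 2) / (m1 - m2) ** 2

# District A: mean 3000 g, SD 500 g. District B: mean 3200 g, SD 500 g.
n_per_district = sample_size_two_means(3000, 500, 3200, 500)
print(round(n_per_district, 1))          # 131.2
```

Note that the two variances are added (500² + 500²); the standard deviations are not summed before squaring.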
