
Sampling and Estimation

This chapter discusses sampling and estimation. It defines key concepts such as populations, parameters, samples, and statistics. The chapter explains that samples are used to estimate unknown population parameters. It also introduces the central limit theorem and how it relates to the sampling distribution of statistics. Statistical inference is discussed as a way to draw conclusions about populations from sample data using concepts like unbiased estimators, accuracy, and precision. Methods for estimating standard errors are also presented.

Uploaded by

Dr P

Chapter 4 SAMPLING and ESTIMATION

Objectives
After completing this chapter you should be able to
(a) distinguish between a population distribution and the sampling distribution of a
statistic
(b) explain the relevance of the Central Limit Theorem
(c) estimate population parameters using sample statistics
(d) calculate confidence limits for population parameters from sample data and give
the correct interpretation of these limits

4.1 Introduction
A sample is drawn from a population to infer something about the population. For
example, suppose the mean systolic blood pressure (mm Hg) of all students at
Queen’s is of interest. Rather than collect information from every student at Queen’s
(a time-consuming and costly task), a sample of students may be randomly selected
and the mean blood pressure in this sample used to estimate the mean blood pressure
of all students at Queen’s, the population. The cost of using the sample is that the
mean blood pressure estimate may differ slightly from the population mean blood
pressure. The statistical theory presented in this chapter should enable us to determine
the precision (for definition see Section 4.4) of this sample estimate, i.e. how far from
the population mean blood pressure the sample estimate is likely to be.

4.2 Populations and Population Parameters


A population is any collection of individuals (or measurements made on those
individuals) in which we are interested. The frequency distribution of a variable in the
population is referred to as the population distribution. Summary values (e.g. means,
proportions) calculated in populations are referred to as population parameters.
These quantities are fixed but their values are generally unknown, and typically it is in
these that our interest lies.

Example 1: Blood pressure in male Queen’s postgraduate medical students

To illustrate some definitions we shall use some fictional data from the entirety of a
population. We shall consider the systolic blood pressure (in mm Hg) of the
population of all 121 male Queen’s postgraduate medical students in 2015. In
practice, we are seldom in a position where we have data from the entire population.
Identification numbers (ID) and systolic blood pressure measurements (BP, in mm
Hg) from all 121 male Queen’s postgraduate medical students in 2015 are shown
below.

ID    1    2    3    4    5    6    7   …  118  119  120  121
BP  129  130  141  120  128  129  123   …  120  108  125  140

The distribution of blood pressure in the population of male postgraduate Queen’s
students is simply the frequency distribution of the variable blood pressure. It can be
shown in a histogram.

The population mean blood pressure (denoted μ) in male Queen’s postgraduate
medical students is 128.05 mm Hg and the population standard deviation (denoted σ) is
12.7 mm Hg.

4.3 Samples and Sample Statistics


A sample is any subset of a population and is ideally selected to be representative of
the population. Arguably the most likely way to achieve this is by taking a random
sample. A random sample is drawn from a population in such a way that each
member of the population has an equal chance of being selected.

Summary values calculated in samples (e.g. means and proportions) are called sample
statistics. These sample statistics are used to estimate population parameters. Sample
statistics vary from sample to sample as they depend upon the individuals included in
the sample selected. The standard notation for sample statistics and population
parameters is shown below.

Summary measure            Sample statistic   Population parameter
Mean                       𝑥̅                  μ (mu)
Standard deviation         s                  σ (sigma)
Proportion                 p                  π (pi)
Intercept¹                 a                  α (alpha)
Slope¹                     b                  β (beta)
Correlation coefficient¹   r                  ρ (rho)

¹ Covered in Chapter 9.

Example 1 (continued)

A sample of size 35 was randomly selected from the population of male Queen’s
postgraduate medical students. This sample included individuals 1 (129 mm Hg), 4
(120 mm Hg), 11 (124 mm Hg), 16 (140 mm Hg), …, 120 (125 mm Hg). In this
sample the mean blood pressure (𝑥̅ ) was 127.7 mm Hg and the sample standard
deviation (s) was 11.6 mm Hg. This sample mean is an estimate of the population
mean (128.05 mm Hg); in this instance it has slightly under-estimated the
population mean.

The sample mean blood pressure varies depending upon the individuals included in
the sample. If we collected another sample it would include different individuals and
therefore the mean would probably differ.
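
The sampling step in this example can be sketched in Python. The full list of 121 blood pressures is not reproduced in these notes, so the sketch below simulates a stand-in population; the simulated values, the seed and the variable names are assumptions for illustration only and will not reproduce the figures quoted in Example 1.

```python
import random
import statistics

# Hypothetical stand-in for the population: 121 simulated blood pressures (mm Hg).
# These are NOT the Example 1 values, which are not listed in full.
rng = random.Random(2015)
population = [round(rng.gauss(128, 12.7)) for _ in range(121)]

# Simple random sample without replacement: every individual is equally
# likely to be selected.
sample = rng.sample(population, 35)

pop_mean = statistics.mean(population)    # population parameter (mu)
sample_mean = statistics.mean(sample)     # sample statistic (x-bar)
sample_sd = statistics.stdev(sample)      # sample SD (s), divisor n - 1

print(f"population mean = {pop_mean:.2f} mm Hg")
print(f"sample mean     = {sample_mean:.2f} mm Hg, sample SD = {sample_sd:.2f} mm Hg")
```

Re-running the sampling line with a different seed selects different individuals and so gives a different sample mean, while the population mean stays fixed.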

4.4 Statistical Inference


Statistical inference is the use of probability theory as a basis for drawing reasonable
conclusions from sample data. The statistical theory discussed in this, and subsequent
chapters, depends upon the sample being selected at random. It will also be generally
assumed that the values in the sample are independent (and not, for example, repeated
observations on the same individuals).

The two main forms of statistical inference are estimation (covered in this chapter)
and hypothesis testing (introduced in Chapter 5). In this chapter we are interested in
estimating a population parameter from sample data e.g. using a sample mean to
estimate the population mean. Two desirable properties of an estimator are:

1) Accuracy or absence of bias i.e. if samples were repeatedly drawn from the
population and the mean of each sample calculated, we would like the sample means
to be “centred” about the population mean. This property is illustrated below.

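[Figure: sample means (X) of randomly selected samples plotted against the population mean (the ‘true’ value). Accurate: the sample means are centred on the population mean. Inaccurate: the sample means are centred away from the population mean; the offset is the bias.]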

The estimators covered in this chapter (such as the sample mean and sample
proportion) are all unbiased. In practice, bias is often incurred when, out of necessity,
a restricted population is investigated (for instance due to non-response, exclusion
criteria or practical considerations) rather than the actual population of interest.

2) Precision or repeatability i.e. if samples were repeatedly drawn from the
population and the mean of each sample calculated, we would like these sample
means to show little variation. This property is illustrated below.

[Figure: sample means (X) of randomly selected samples from the population. Precise: the sample means are tightly clustered. Imprecise: the sample means are widely spread.]

The precision of the estimators covered in this chapter can be estimated and depends
upon a number of factors (such as sample size and population variance);
see Section 4.6.

4.5 Sampling Distribution


The sampling distribution of a statistic is the frequency distribution of that statistic
over all possible samples of a given size selected from the population.

Example 1 (continued)

To illustrate a sampling distribution, suppose samples of size 35 were repeatedly
drawn from the population of Queen’s postgraduate medical students (until every
possible sample of size 35 had been drawn) and for each sample the mean was
calculated. The means of eight samples are shown below but many other samples
could be selected.

Sample number     1      2      3      4      5      6      7      8    …
Sample mean     127.7  126.6  129.3  128.1  130.0  128.9  127.4  127.9  …

Notice how the sample mean varies from one sample to the next. The frequency
distribution of these sample means is the sampling distribution of the mean and is
presented in the histogram below.

In this histogram each observation is a sample mean. The mean of these sample
means is 128.05 mm Hg, which is the population mean. Also notice that the sampling
distribution of the mean is approximately Normal. This sampling distribution shows
how far from the population mean the mean of one randomly selected sample is likely
to lie.
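
A sampling distribution of this kind can be approximated by simulation. The sketch below is illustrative only: it uses a simulated population of 121 values (an assumption, not the Example 1 data) and draws 5,000 random samples of size 35 rather than every possible sample, since enumerating them all is not feasible.

```python
import random
import statistics

rng = random.Random(4)

# Hypothetical population of 121 blood pressures (mm Hg); not the Example 1 data.
population = [rng.gauss(128, 12.7) for _ in range(121)]
mu = statistics.mean(population)        # population mean
sigma = statistics.pstdev(population)   # population SD (divisor N)

n = 35
n_samples = 5000   # a large subset of all possible samples of size 35

# One sample mean per randomly selected sample.
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(n_samples)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"population mean       = {mu:.2f}")
print(f"mean of sample means  = {mean_of_means:.2f}")
print(f"population SD         = {sigma:.2f}")
print(f"SD of the sample means = {sd_of_means:.2f}")
```

The mean of the simulated sample means should sit very close to the population mean, and the sample means should vary far less than the individual blood pressures do.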

4.6 Standard Error
The standard deviation of the sampling distribution of a statistic is referred to as its
standard error. This quantity is of particular interest as it is a measure of how much a
sample statistic varies from one sample to another. Basically, it allows the degree of
precision (or imprecision) in a sample statistic to be estimated.

In practice the population is not known (and therefore the sampling distribution of the
mean cannot be determined exactly). Typically, one sample would be randomly
selected from the population. The sample mean would then be used to estimate the
unknown population parameter and the sample standard deviation of this sample
would be used to estimate the standard error.

The standard error of the mean is given by:

$SE(\bar{x}) = \sigma / \sqrt{n}$

where n is the sample size. The population standard deviation (𝜎) is approximated by
the sample standard deviation (s):

$SE(\bar{x}) = s / \sqrt{n}$, where $s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$.

Note that the larger the sample size n the smaller the standard error and the more
precise the estimate.
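
The effect of sample size on the standard error can be seen in a short calculation. This is a minimal sketch: the function name `se_mean` is made up for illustration, s = 11.6 mm Hg is taken from the first sample in Example 1, and the sample of 140 is hypothetical (it assumes the same s).

```python
import math

def se_mean(s: float, n: int) -> float:
    """Estimated standard error of a sample mean: s / sqrt(n)."""
    if n < 2:
        raise ValueError("n must be at least 2 for s to be defined")
    return s / math.sqrt(n)

s = 11.6                     # sample SD from Example 1 (mm Hg)
se_35 = se_mean(s, 35)       # the sample size used in Example 1
se_140 = se_mean(s, 140)     # hypothetical: four times as many students, same s

print(f"n = 35:  SE = {se_35:.2f} mm Hg")
print(f"n = 140: SE = {se_140:.2f} mm Hg")
```

Because n enters through its square root, the sample size has to be quadrupled to halve the standard error.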

Formulae for other sample statistics are shown below. Section 4.10 gives details of
how to calculate standard errors using MINITAB and SPSS.

The standard error of a proportion when n is large is approximated by:

$SE(p) = \sqrt{p(1 - p) / n}$.

We are commonly more interested in the difference between two population
parameters than in their absolute values. The formulae for the standard errors of
a difference in two means and two proportions when samples are large (n1, n2 ≥ 30)
are given below. The subscripts denote the population from which the sample was
selected.

The standard error of a difference of means is approximated by:

$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$

The standard error of a difference of proportions is approximated by:

$SE(p_1 - p_2) = \sqrt{p_1(1 - p_1) / n_1 + p_2(1 - p_2) / n_2}$
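
The two large-sample formulae for differences translate directly into code. The sketch below is illustrative only: the function names and all of the input values are hypothetical and do not come from any example in this chapter.

```python
import math

def se_diff_means(s1: float, n1: int, s2: float, n2: int) -> float:
    """Large-sample SE of a difference of two means."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def se_diff_props(p1: float, n1: int, p2: float, n2: int) -> float:
    """Large-sample SE of a difference of two proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical summary data for two independent samples.
se_means = se_diff_means(s1=12.0, n1=50, s2=10.0, n2=40)
se_props = se_diff_props(p1=0.30, n1=200, p2=0.20, n2=200)

print(f"SE of difference in means       = {se_means:.3f}")
print(f"SE of difference in proportions = {se_props:.4f}")
```

In both cases the variances of the two estimates are added, so the standard error of a difference is always larger than either individual standard error.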

4.7 The Central Limit Theorem
When the sample size is large (as a rule of thumb n≥ 30), the sampling distribution of
the mean of a variable (for instance blood pressure) may be approximated by a
Normal distribution, regardless of the shape of the distribution of that variable in
the population. Furthermore, the mean of the sampling distribution is equal to 𝜇 and
the standard deviation (the standard error) of the sampling distribution is equal to
𝜎 ⁄ √𝑛, where n is the sample size. This result is referred to as the central limit
theorem and is of fundamental importance in statistics.

This theorem allows the sampling distribution of a statistic to be approximated from a
sample.

Example 1 (continued)

In our example above, one of the samples drawn from the population had the
following characteristics: 𝑛 = 35, 𝑥̅ = 129.2 and s = 12.5. Using this sample
𝑥̅ = 129.2 would be the best estimate of 𝜇. Furthermore the standard error of 𝑥̅ would
be estimated as 𝑠 ⁄ √𝑛 = 12.5/√35 = 2.1. As the sample (𝑛 = 35) was large,
according to the central limit theorem the sampling distribution of 𝑥̅ may be
approximated by N(129.2, 2.1²).

The sampling distribution is approximately Normal. A property of the Normal
distribution is that 95% of observations lie within 1.96 standard deviations of the
mean (from Chapter 2). Therefore about 95% of sample means will lie within
1.96 × 2.1 = 4.1 mm Hg of the population mean, which gives an estimate of the
precision of the sample mean.
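
The multiplier 1.96 used here can be checked with Python's standard library, which provides the Normal distribution through `statistics.NormalDist`. The sketch below is only a check on the arithmetic, using n = 35 and s = 12.5 from the sample above; the variable names are arbitrary.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

# Upper 2.5% point of the standard Normal distribution.
z = std_normal.inv_cdf(0.975)

# Proportion of a Normal distribution lying within z SDs of its mean.
central_area = std_normal.cdf(z) - std_normal.cdf(-z)

se = 12.5 / math.sqrt(35)            # estimated standard error of the mean
margin = z * se                      # distance within which ~95% of sample means lie

print(f"z = {z:.2f}, central area = {central_area:.3f}")
print(f"SE = {se:.1f} mm Hg, 1.96 x SE = {margin:.1f} mm Hg")
```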

4.8 Confidence Intervals

A sample statistic may be considered a ‘best guess’ of the value of some population
parameter; using it in this way is referred to as point estimation. There is no reason to suppose that the
sample mean will be exactly equal to a population mean. It is likely to be close to it
and the degree to which it may differ can be assessed from the standard error. A point
estimate gives no indication of the precision of the estimate and it is therefore more
informative to give a range of values centred on the point estimate which are likely to
include the population parameter, with a given probability. This is referred to as
interval estimation. Conventionally the probability chosen is 95%. The range of
values is called the 95% confidence interval (95% CI) and the ends of this interval
are the 95% confidence limits (95% CLs). Often the population parameter is referred
to simply as the true value.

The confidence limits that we shall study are of the form

estimate ± M × SE(estimate)

where the multiplier M depends on the sampling distribution of the estimator.

The probability that a confidence interval based upon a random sample will capture
the value of the population parameter depends on the confidence level chosen. For
instance, if we calculate 95% CIs then 95% of repeated samples from the
population will produce limits which include the population parameter, and only 5%
will not.

Equations for the 95% confidence intervals for several sample statistics based upon
large samples (n ≥ 30) are shown below. Section 4.10 gives details of how to
calculate CLs using MINITAB and SPSS. N.B. in this course you are expected to be
able to calculate CLs (and SEs) for a population mean and population proportion
using a hand calculator, but the relevant equation would be given in the question.

The 95% CLs for a population mean (μ) when n is large (≥ 30) are given by:

$\bar{x} \pm z_{0.025} SE(\bar{x})$ or $\bar{x} \pm 1.96\, s / \sqrt{n}$

The multiplier $z_{0.025} = 1.96$ is the upper 2.5% point of the standard Normal distribution.

The 95% CLs for a population proportion when n is large (≥ 30) are given by:

$p \pm z_{0.025} SE(p)$ or $p \pm 1.96 \sqrt{p(1 - p) / n}$

The 95% CLs for a difference in population means (μ1 − μ2) when n1, n2 are large
(≥ 30) are given by:

$(\bar{x}_1 - \bar{x}_2) \pm z_{0.025} SE(\bar{x}_1 - \bar{x}_2) = (\bar{x}_1 - \bar{x}_2) \pm 1.96 \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$

The 95% CLs for a difference in population proportions (π1 − π2) when n1, n2 are
large (≥ 30) are given by:

$(p_1 - p_2) \pm z_{0.025} SE(p_1 - p_2) = (p_1 - p_2) \pm 1.96 \sqrt{p_1(1 - p_1) / n_1 + p_2(1 - p_2) / n_2}$

Example 1 (continued)

Consider again the sample selected from the population of male postgraduate medical
students (n = 35, 𝑥̅ = 129.2, SE(𝑥̅ ) = 2.1). The 95% CI for the population mean would
be calculated by:
$[\bar{x} - 1.96\, SE(\bar{x}),\ \bar{x} + 1.96\, SE(\bar{x})]$
[129.2 − 1.96 × 2.1, 129.2 + 1.96 × 2.1]
[125.1, 133.3]
The estimate of the population mean based upon this sample would be commonly
presented as 129.2 mm Hg (95% CI 125.1, 133.3).
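
The same interval can be computed from the summary statistics in a few lines. This is a minimal sketch, not a replacement for the hand calculation: the function name `ci_mean_large` is made up for illustration, and it uses the unrounded standard error s/√n (with s = 12.5) rather than the rounded value 2.1.

```python
import math

def ci_mean_large(xbar: float, s: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Large-sample CI for a population mean: xbar +/- z * s / sqrt(n)."""
    se = s / math.sqrt(n)
    return xbar - z * se, xbar + z * se

# Summary statistics of the sample in Example 1 (mm Hg).
lower, upper = ci_mean_large(xbar=129.2, s=12.5, n=35)

print(f"129.2 mm Hg (95% CI {lower:.1f}, {upper:.1f})")
```

To one decimal place the limits agree with those obtained by hand above.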

A further 19 samples were randomly selected from the population of postgraduate
medical students and for each sample a mean and 95% CI was calculated. The means
and confidence intervals are shown below. Also plotted on this graph is the actual
population mean blood pressure 128 mm Hg, which in this instance is known but in
most practical situations is unknown.

Figure 2: The mean (𝑥̅ ) and confidence interval of 20 randomly selected samples from
the population of postgraduate medical students

[Figure not reproduced. Axis: blood pressure (mm Hg); reference line at the population mean 𝜇 = 128.]

Notice that 95% (19/20) of the randomly selected samples have produced CLs which
have captured the population mean. Also notice that one sample, sample 15, produced
CLs which did not capture the population mean.

In this sense, if only one sample is drawn from the population we are 95% confident
that the population parameter will be contained within that sample’s CLs.
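
This repeated-sampling interpretation can be explored by simulation. The sketch below is illustrative only and rests on stated assumptions: blood pressure is simulated as Normal with mean 128 and SD 12.7 mm Hg, observations are drawn independently, and 2,000 samples of size 35 are used instead of 20.

```python
import math
import random
import statistics

rng = random.Random(20)

MU, SIGMA = 128.0, 12.7      # assumed population mean and SD (mm Hg)
N, N_SAMPLES = 35, 2000      # sample size and number of repeated samples

covered = 0
for _ in range(N_SAMPLES):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]   # independent draws
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% CI capture the population mean?
    if xbar - 1.96 * se <= MU <= xbar + 1.96 * se:
        covered += 1

coverage = covered / N_SAMPLES
print(f"{coverage:.1%} of the {N_SAMPLES} intervals captured the population mean")
```

The proportion of intervals capturing the population mean should come out at roughly 95%, although any single run of the simulation will differ a little from that figure.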

4.9 The Student's t Distribution


As shown earlier when the sample size is large (n≥ 30) the central limit theorem may
be applied and the sampling distribution of the mean may be approximated by the
Normal distribution regardless of the distribution of the variable in the population.
Consequently, the normal distribution may be used in the calculation of CLs.

When the sample size is small (n<30) the central limit theorem is no longer
applicable. In this circumstance, when we are dealing with means the CLs may be
calculated by substituting the Student’s t distribution (with n-1 df) for the Normal
distribution in the previous equations. However, the t distribution may only be used if
the variable is normally distributed in the population from which the sample is
taken. The shape of the t distribution depends on the degrees of freedom, the number
of observations minus one.

The 95% CLs for a population mean (μ) when n is small (< 30) are given by:

$\bar{x} \pm t_{n-1, 0.025} SE(\bar{x})$ or $\bar{x} \pm t_{n-1, 0.025}\, s / \sqrt{n}$

where t has n-1 degrees of freedom.
The multiplier is the upper 2.5% point of the t distribution.
N.B. the percentage points of the t distribution for given degrees of freedom can be
obtained from MS Excel. In an exam question, any required multiplier from the t
distribution will be given in the question.
Example 1 (continued)
Say a sample of size 10 was drawn from the population of male postgraduate medical
students and the following characteristics of the sample calculated: n=10, 𝑥̅ =123.1,
s=12.1 and consequently SE(𝑥̅ )=3.8. As blood pressure measurements are normally
distributed in the population it is possible to proceed to calculate a CI. The 95% CI for
the population mean based upon this sample would be calculated by:
$[\bar{x} - t_{n-1, 0.025} SE(\bar{x}),\ \bar{x} + t_{n-1, 0.025} SE(\bar{x})]$
[123.1 − 2.26 × 3.8, 123.1 + 2.26 × 3.8]
[114.5, 131.7]
The estimate of the population mean based upon this sample may be presented as
123.1 mm Hg (95%CI 114.5, 131.7). Notice how much wider this CI is compared
with the CI based on the sample of size 35. This reflects the increased uncertainty in
this estimate as it is based upon a much smaller sample.
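
The small-sample calculation differs from the large-sample one only in the multiplier. The sketch below takes the t multiplier as an input (2.26 for 9 degrees of freedom, as used above) rather than looking it up; the function name is made up for illustration, and the interval using 1.96 is shown purely for comparison and would not be appropriate for a sample of size 10.

```python
import math

def ci_mean(xbar: float, s: float, n: int, multiplier: float) -> tuple[float, float]:
    """CI for a population mean: xbar +/- multiplier * s / sqrt(n)."""
    se = s / math.sqrt(n)
    return xbar - multiplier * se, xbar + multiplier * se

# Sample of size 10 from Example 1 (mm Hg).
t_lower, t_upper = ci_mean(123.1, 12.1, 10, multiplier=2.26)   # t, 9 df
z_lower, z_upper = ci_mean(123.1, 12.1, 10, multiplier=1.96)   # comparison only

print(f"t-based 95% CI: {t_lower:.1f}, {t_upper:.1f}")
print(f"Normal-based:   {z_lower:.1f}, {z_upper:.1f} (too narrow for n = 10)")
```

The t multiplier is larger than 1.96, so the t-based interval is wider; this allows for the extra uncertainty in estimating the standard deviation from a small sample.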

4.10 Calculating CLs (and standard errors) using MINITAB and SPSS
MINITAB allows confidence limits (and standard errors) to be calculated for means
and proportions from summarised sample data (i.e. means and standard deviations)
and from raw sample data (i.e. individual values for each member of the sample).

Confidence limits for a population mean when n is large


Using summarised data, confidence limits for a population mean when n is large may
be calculated using the commands Stat>Basic Statistics>1-Sample Z. Then, click on
Summarised data and enter the sample size, sample mean and sample standard
deviation. Using raw data, enter the data into a column first and then use the
commands Stat>Basic Statistics>1-Sample Z.

Confidence limits for a population mean when n is small


A similar command for calculating confidence limits for a population mean when n is
small is Stat>Basic Statistics>1-Sample t.

Confidence limits for a difference in population means
Using summarised data, the commands to calculate confidence limits for a difference
in population means are Stat>Basic Statistics>2-Sample t and then click on
Summarised data and enter summary statistics from the two samples. Using raw
data, this command can be used after creating a column containing the data from both
samples and a group column indicating the sample to which each individual belongs.
On the Stat>Basic Statistics>2-Sample t screen, the data column should be entered
as the Samples column and the group column as the Subscripts column.

Confidence limits for a population proportion


Confidence limits for a population proportion may be calculated using the commands
Stat>Basic Statistics>1-Proportion. Then, click on Summarised data and enter
sample size and the number of individuals in which an event occurred.

Confidence limits for a difference in two population proportions


The confidence limits for the difference in two population proportions may be
calculated using the commands Stat>Basic Statistics>2-Proportions.

In each instance, the confidence level may be changed by clicking on the Options…
tab on the screen into which summarised data are entered.

SPSS also allows the calculation of confidence intervals but only from raw sample
data (i.e. individual values for each member of the sample). Confidence limits for a
population mean and for the difference in two population means are calculated by the
commands Analyze>Compare Means>One-Sample T Test and
Analyze>Compare Means>Independent-Samples T Test, respectively.

4.11 Recommended Reading


Bland: Chapter 8 - Estimation (sections 8.1 - 8.6)
Kirkwood and Sterne: Chapter 6 – Confidence interval for a mean

4.12 Further examples


2. In a simple random sample of fifteen patients with ischaemic heart disease the
serum cholesterol levels (mg/100 ml) were determined. The mean serum
cholesterol was 228mg/100ml and standard deviation was 45.8 mg/100ml.
What are the 95% confidence limits for the mean cholesterol level of all patients
with ischaemic heart disease? (N.B. the upper 2.5% point of the t distribution with
14 degrees of freedom is 2.145.)
From the question
n = 15, 𝑥̅ = 228 mg/100 ml, and s = 45.8 mg/100 ml.
Then $SE(\bar{x}) = s / \sqrt{n} = 45.8 / \sqrt{15} = 11.83$ mg/100 ml, and 95% confidence limits
for the population mean are:
$\bar{x} \pm t_{14, 0.025} SE(\bar{x}) = 228 \pm 2.145 \times 11.83$, i.e. 203 and 253 mg/100 ml.

Alternatively, in MINITAB use the commands Stat>Basic Statistics>1-Sample t,
click on Summarised data and enter the sample data to generate the following
output:

The result is interpreted as follows. Consider the population of all patients with
ischaemic heart disease. Then in 95% of random samples of size n = 15 from this
population the derived confidence limits will include the population mean.
3. In a random sample of 1165 men in the 25 - 64 age-group from the Belfast area
(Ulster Medical Journal 1989; 58: 60-68), 115 men were found to be obese (body
mass index exceeding 30 kg/m²).
Calculate 95% confidence limits for the proportion of the population of men who
are obese.
p = 115/1165 = 0.099
$SE(p) = \sqrt{p(1 - p) / n} = \sqrt{0.099(1 - 0.099) / 1165} = 0.0087$
95% confidence limits for the population proportion are:
$p \pm 1.96\, SE(p)$
$0.099 \pm 1.96(0.0087)$
0.082 and 0.116 or 8.2% and 11.6%
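
The hand calculation above can be checked with a short script. This is a minimal sketch of the large-sample formula only; the function name `ci_proportion` is made up for illustration, and it is not the method used by any particular statistics package.

```python
import math

def ci_proportion(events: int, n: int, z: float = 1.96) -> tuple[float, float, float, float]:
    """Return (p, SE(p), lower, upper) using p +/- z * sqrt(p(1 - p) / n)."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se

# 115 obese men in a random sample of 1165.
p, se, lower, upper = ci_proportion(events=115, n=1165)

print(f"p = {p:.3f}, SE(p) = {se:.4f}")
print(f"95% confidence limits: {lower:.3f} and {upper:.3f}")
```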

Alternatively, in MINITAB use the commands Stat>Basic Statistics>1-Proportion.
Then, click on Summarised data and enter sample size
and the number of events (number of individuals who are obese) to generate the
following output:

Consider the population of Belfast men in the 25-64 year age-group. In 95% of
random samples from this population the confidence limits derived from the
sample will include the true proportion of men who are obese.

4.13 Practical
1. The following clotting times of plasma (minutes) were measured for a random
sample of 11 students in a physiology practical.
7.9, 8.6, 9.3, 10.7, 11.1, 11.4, 11.6, 11.7, 11.9, 12.7, and 15.0.
a) Calculate 95% confidence limits for the mean plasma clotting time in the
population from which this sample was drawn. Check your answer using
MINITAB. (N.B. the upper 2.5% point of the t distribution with 10 degrees of
freedom is 2.23.)
b) Using MINITAB, calculate 99% confidence limits for the mean plasma clotting
time in the population from which this sample was drawn.
c) What has happened to the width of this confidence interval? Why has this
occurred?

2. In 1976 a random sample of dental surgeries in England and Wales was selected
and a questionnaire sent by post to 1,500 surgeries. Of the 1,181 replies received,
222 did not have three basic facilities for resuscitation. (British Dental Journal
1978; 144: 271-279).
a) Estimate the proportion of all surgeries in England and Wales without these
facilities (i.e. the population proportion) and calculate 95% confidence limits for
this proportion. Check your answer using MINITAB.
b) Comment on other potential sources of error in your estimate besides sampling
error.

3. In a large trial to compare timolol against placebo in patients who had just
recovered from a heart attack, 98 out of a sample of 945 patients treated with
timolol died in an 18 month follow-up period. The corresponding figure for a
sample of 939 patients who received placebo was 152 deaths. (New England
Journal of Medicine, 1981; 304: 801-7).
Using MINITAB, calculate 95% confidence limits for the difference in the
proportions dying on timolol and placebo. Interpret these limits.
