
MDA511

Lecture 8:
Sampling and Confidence Interval
Estimation

Recommended Text:
Albright and Winston, “Business Analytics”
6th Edition. 2017 Copyright © Cengage Learning

Compiled by Prof. Paul Kwan


Motivations for Sampling
In a typical statistical inference problem, you want to
discover one or more characteristics of a given
population.

• It is generally difficult or even impossible to contact each member of the population.
• Solution: identify a sample of the population and then obtain information from members of the sample.

2
Lecture Objectives
• Discuss the sampling schemes generally used in real
sampling applications

• See how the information from a sample of the population can be used to infer the properties of the entire population

3
Sampling Terminology
A population is the set of all members about which a
study intends to make inferences.
• An inference is a statement about a numerical
characteristic of the population.
A frame is a list of all members of the population. The
potential sample members are called sampling units.
A probability sample is a sample in which the
sampling units are chosen from the population
according to a random mechanism.
A judgmental sample is a sample in which the
sampling units are chosen according to the sampler’s
judgment.
4
Methods for Selecting Random
Samples
Different types of sampling schemes have different
properties.

• There is typically a trade-off between cost and accuracy.
• Some sampling schemes are cheaper and easier to administer, whereas others are more costly but provide more accurate information.

5
Simple Random Sampling
The simplest type of sampling scheme is called simple
random sampling.

A simple random sample of size n is one where each possible sample of size n has the same chance of being chosen.
• Simple random samples are the easiest to understand, and their statistical properties are the most straightforward.
• More complex random samples are often used in real applications.
6
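The definition above maps directly onto Python's standard library. Below is a minimal sketch, using a hypothetical frame of customer IDs (the frame and its size are illustrative, not from the lecture):

```python
import random

# Hypothetical frame: a list of 1,000 customer IDs (illustrative only)
frame = [f"cust-{i:04d}" for i in range(1000)]

random.seed(42)  # for reproducibility
# random.sample draws without replacement, and every subset of size n
# is equally likely -- exactly the definition of a simple random sample
sample = random.sample(frame, k=30)

print(len(sample))         # 30
print(len(set(sample)))    # 30 -- no member appears twice
```

Note that the entire frame must exist in memory before sampling, which previews one of the practical drawbacks discussed next.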
Simple Random Sampling
Simple random samples are used infrequently in real
applications. There are several reasons for this:

• Because each sampling unit has the same chance of being sampled, simple random sampling can result in samples that are spread over a large geographical region.
• This can make sampling extremely expensive, especially if personal interviews are used.

7
Simple Random Sampling (cont’d)
• Simple random sampling requires that all sampling
units be identified prior to sampling. Sometimes this is
infeasible.

• Simple random sampling can result in underrepresentation or overrepresentation of certain segments of the population.

8
Systematic Sampling
A systematic sample provides a convenient way to
choose the sample.
• First, divide the population size by the sample size, creating equal-sized “blocks.”
• Next, use a random mechanism to choose one number between 1 and the number of members in each “block.”
• In general, one of the first k members is selected randomly, and then every kth member after this one is selected.
• The value k is called the sampling interval and equals the ratio N/n, where N is the population size and n is the desired sample size.
9
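The steps above can be sketched in a few lines of Python. The frame and sizes below are illustrative assumptions:

```python
import random

def systematic_sample(frame, n):
    """Choose one of the first k members at random, then every kth
    member after it, where k = N // n is the sampling interval."""
    k = len(frame) // n
    start = random.randrange(k)
    return [frame[start + i * k] for i in range(n)]

random.seed(1)
frame = list(range(1000))                 # N = 1000
sample = systematic_sample(frame, n=50)   # k = 1000 / 50 = 20
print(len(sample))                                           # 50
print(all(b - a == 20 for a, b in zip(sample, sample[1:])))  # True
```

Only one random number is needed, which is what makes the scheme so convenient in practice.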
Systematic Sampling (Cont’d)

10
Stratified Sampling
• Suppose various subpopulations within the total
population can be identified. These subpopulations are
called strata.
• Instead of taking a simple random sample from the
entire population, it might make more sense to select a
simple random sample from each stratum separately.
• This sampling method is called stratified sampling.

11
Stratified Sampling (Cont’d)

12
Stratified Sampling (Cont’d)
Advantages of stratified sampling:
• Separate estimates can be obtained within each stratum,
which would not be obtained with a simple random sample
from the entire population.
• The accuracy of the resulting population estimates can be
increased by using appropriately defined strata.
• Define the strata such that there is less variability within the
individual strata than in the population as a whole.

13
Proportional Sample Sizes
There are many ways to choose sample sizes from
each stratum, but the most popular method is to use
proportional sample sizes.

• With proportional sample sizes, the proportion of a stratum in the sample is the same as the proportion of that stratum in the population.
• The advantage of proportional sample sizes is that they are very easy to determine.
• The disadvantage is that they ignore differences in variability among the strata.

14
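Proportional allocation is easy to sketch in Python. The two-stratum population below (600 “urban” and 400 “rural” units) is a hypothetical example, not from the lecture:

```python
import random
from collections import defaultdict

def proportional_stratified_sample(frame, stratum_of, n):
    """Take a simple random sample from each stratum, sized in
    proportion to the stratum's share of the population."""
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    N = len(frame)
    sample = []
    for members in strata.values():
        n_h = round(n * len(members) / N)    # proportional allocation
        sample.extend(random.sample(members, n_h))
    return sample

random.seed(0)
# Hypothetical population: 600 "urban" and 400 "rural" units
frame = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]
sample = proportional_stratified_sample(frame, lambda u: u[0], n=100)
urban = sum(1 for u in sample if u[0] == "urban")
print(urban, len(sample) - urban)    # 60 40
```

With n = 100 and a 60/40 population split, the allocation is exactly 60 urban and 40 rural units, mirroring the population proportions.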
Proportional Sample Sizes (Cont’d)

15
Cluster Sampling
In cluster sampling, the population is separated into
clusters, such as cities or city blocks, and then a random
sample of the clusters is selected.
• The primary advantage of cluster sampling is sampling
convenience (and possibly lower cost).
• The downside is that the inferences drawn from a cluster
sample can be less accurate for a given sample size than
other sampling plans.

16
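A single-stage cluster sample can be sketched as follows; the “city blocks of households” population is an illustrative assumption:

```python
import random

def cluster_sample(clusters, m):
    """Single-stage cluster sampling: select m whole clusters at
    random and include every unit in the chosen clusters."""
    chosen = random.sample(clusters, m)
    return [unit for cluster in chosen for unit in cluster]

random.seed(3)
# Hypothetical population: 20 city blocks of 50 households each
clusters = [[(block, hh) for hh in range(50)] for block in range(20)]
sample = cluster_sample(clusters, m=4)

print(len(sample))                            # 200 households
print(len({block for block, _ in sample}))    # drawn from only 4 blocks
```

The convenience is visible in the output: 200 households, but interviewers only need to visit 4 blocks, which is also why units within a cluster tend to be similar and the estimates less accurate.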
Multistage Sampling Schemes
The cluster sampling scheme is an example of a
single-stage sampling scheme.
Real applications are often more complex than this,
resulting in multistage sampling schemes.
• For example, in ABC’s nationwide surveys, a random
sample of approximately 300 locations is chosen in
the first stage of the sampling process.
• City blocks or other geographical areas are then
randomly sampled from the first-stage locations in the
second stage of the process.
• This is followed by a systematic sampling of
households from each second-stage area.
17
Multistage Sampling Schemes
(Cont’d)

18
An Introduction to Estimation
The purpose of any random sample, simple or
otherwise, is to estimate properties of a population from
the data observed in the sample.

The mathematical procedures appropriate for performing this estimation depend on which properties of the population are of interest and which type of random sampling scheme is used.
For both simple random samples and more complex sampling schemes, the concepts are the same.
19
Sources of Estimation Errors
There are two basic sources of errors that can occur
when you sample randomly from a population:
• Sampling error
• Nonsampling error

Sampling error is the inevitable result of basing an inference on a random sample rather than on the entire population.

20
Sources of Estimation Errors (Cont’d)
Nonsampling error is quite different and can occur for
a variety of reasons:
• Nonresponse bias occurs when a portion of the sample
fails to respond to the survey.

• Nontruthful responses are particularly a problem when there are sensitive questions in a questionnaire.
• Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind (e.g., when questions are poorly worded).

21
Sources of Estimation Errors (cont’d)
• Voluntary response bias occurs when the subset
of people who respond to a survey differs in some
important respect from all potential respondents.

• The potential for nonsampling error is enormous.
• However, unlike sampling error, it cannot be measured with probability theory.
• It can be controlled only by using appropriate sampling procedures and designing good survey instruments.

22
Key Terms in Sampling
A point estimate is a single numeric value, a “best
guess” of a population parameter, based on the data in
a random sample.

The sampling error (or estimation error) is the difference between the point estimate and the true value of the population parameter being estimated.
The sampling distribution of any point estimate is the distribution of the point estimates from all possible samples (of a given sample size) from the population.
23
Key Terms in Sampling
A confidence interval is an interval around the point
estimate, calculated from the sample data, that is very
likely to contain the true value of the population
parameter.
An unbiased estimate is a point estimate such that the
mean of its sampling distribution is equal to the true
value of the population parameter being estimated.
The standard error of an estimate is the standard
deviation of the sampling distribution of the estimate.
• It measures how much estimates vary from sample to
sample.
24
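The sampling distribution and standard error defined above can be made concrete by simulation. The population below (100,000 values with mean 50 and standard deviation 10) is a hypothetical example:

```python
import random
import statistics

random.seed(2)
# Hypothetical population of 100,000 values with mean 50 and sd 10
population = [random.gauss(50, 10) for _ in range(100_000)]

# Empirical sampling distribution of the sample mean for n = 25:
# draw many samples and record the point estimate from each one.
point_estimates = [statistics.mean(random.sample(population, 25))
                   for _ in range(5_000)]

# The standard error is the standard deviation of this distribution;
# theory predicts sigma / sqrt(n) = 10 / 5 = 2.
print(round(statistics.stdev(point_estimates), 2))
```

The printed value should land very close to 2, matching the σ/√n formula on the next slide.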
Sampling Distribution of the Sample
Mean
The sampling distribution of the sample mean X̄ has the following properties:
• It is an unbiased estimate of the population mean, as indicated in this equation: E(X̄) = μ
• The standard error of the sample mean is given by SE(X̄) = σ/√n, where σ is the standard deviation of the population, and n is the sample size.
• It is customary to approximate the standard error by substituting the sample standard deviation, s, for σ, which leads to SE(X̄) ≈ s/√n.
• If you go out two standard errors on either side of the sample mean, you are approximately 95% confident of capturing the population mean: X̄ ± 2s/√n.
25
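A minimal Python sketch of these formulas, using a hypothetical sample of 50 observations drawn from a population with mean 100 and standard deviation 15 (illustrative values):

```python
import math
import random
import statistics

random.seed(7)
# Hypothetical sample: 50 draws from a population with mean 100, sd 15
data = [random.gauss(100, 15) for _ in range(50)]

n = len(data)
xbar = statistics.mean(data)     # point estimate of the population mean
s = statistics.stdev(data)       # sample standard deviation
se = s / math.sqrt(n)            # estimated standard error, s / sqrt(n)

# Two standard errors on either side gives roughly 95% confidence
lo, hi = xbar - 2 * se, xbar + 2 * se
print(f"mean = {xbar:.2f}, SE = {se:.2f}, approx 95% CI = ({lo:.2f}, {hi:.2f})")
```

The two-standard-error rule is the rough version of the t-based confidence interval developed later in the lecture.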
The Finite Population Correction
Generally, the sample size is small relative to the population size.
There are situations, however, when the sample size is greater than 5% of the population.
In this case, the formula for the standard error of the mean should be modified with a finite population correction, or fpc, factor:
fpc = √((N − n)/(N − 1))
The standard error of the mean is multiplied by fpc in order to make the correction:
SE(X̄) = fpc × σ/√n
26
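The correction is a one-line formula; the numbers below (s = 20, a sample of 100 from a population of 500) are illustrative assumptions:

```python
import math

def standard_error_with_fpc(s, n, N):
    """Standard error of the mean with the finite population correction,
    used when n exceeds about 5% of the population size N."""
    fpc = math.sqrt((N - n) / (N - 1))
    return fpc * s / math.sqrt(n)

# Hypothetical numbers: s = 20, sample of 100 from a population of 500
se_plain = 20 / math.sqrt(100)                   # 2.0, no correction
se_fpc = standard_error_with_fpc(20, 100, 500)   # ~1.79
print(round(se_plain, 2), round(se_fpc, 2))      # 2.0 1.79
```

Because n/N = 20% here, ignoring the correction would overstate the standard error by about 12%.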
The Central Limit Theorem
For any population distribution with mean μ and standard deviation σ, the sampling distribution of the sample mean X̄ is approximately normal with mean μ and standard deviation σ/√n, and the approximation improves as n increases. This is called the central limit theorem.
The important part of this result is the normality of the sampling distribution.
• When you sum or average n randomly selected values from any distribution, normal or otherwise, the distribution of the sum or average is approximately normal, provided that n is sufficiently large.
• This is the primary reason why the normal distribution is relevant in so many real-world applications.
27
Example 2: Average Winnings from
a Wheel of Fortune
Objective: To illustrate the central limit theorem by a simulation
of winnings in a game of chance.
Solution: The population is the set of all outcomes you could
obtain from a single spin of the wheel—that is, all dollar values
from $0 to $1000.
Each spin results in one randomly sampled dollar value from this
population.
Each replication of the experiment simulates n spins of the wheel
and calculates the average—that is, the winnings—from these n
spins.
A histogram of the winnings is formed for any value of n, where n is the number of spins.
As the number of spins increases, the histogram takes on more and more of a bell shape.
28
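The simulation described above can be sketched directly in Python (the replication count is an illustrative choice):

```python
import random
import statistics

random.seed(11)

def average_winnings(n_spins):
    """Average winnings from n uniform spins of a $0-$1000 wheel."""
    return statistics.mean(random.uniform(0, 1000) for _ in range(n_spins))

# Simulate many replications for several numbers of spins; the spread of
# the averages shrinks roughly like 1/sqrt(n), and a histogram of each
# list becomes increasingly bell-shaped.
for n in (1, 3, 6, 10):
    winnings = [average_winnings(n) for _ in range(10_000)]
    print(n, round(statistics.stdev(winnings), 1))
```

Plotting each `winnings` list as a histogram reproduces the four panels on the next slide: flat for a single spin, and progressively more normal-looking as n grows.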
Example 2: Average Winnings from
a Wheel of Fortune
[Histograms of simulated winnings after a single spin, three spins, six spins, and ten spins, showing the distribution becoming increasingly bell-shaped]

29
Sample Size Selection
The problem of selecting the appropriate sample size in
any sampling context is not an easy one, but it must be
faced in the planning stages, before any sampling is
done.
• The sampling error tends to decrease as the sample
size increases, so the desire to minimize sampling
error encourages us to select larger sample sizes.
• However, several other factors encourage us to select
smaller sample sizes, including:
• Cost
• Timely collection of data
• Increased chance of nonsampling error, such as
nonresponse bias
30
Summary of Key Ideas for Simple
Random Sampling
• To estimate a population mean with a simple random sample, the sample mean X̄ is typically used as a “best guess.” This estimate is called a point estimate.
• The accuracy of the point estimate is measured by its standard error. It is the standard deviation of the sampling distribution of the point estimate.
• A confidence interval (with 95% confidence) for the population mean extends to approximately two standard errors on either side of the sample mean.
• From the central limit theorem, the sampling distribution of X̄ is approximately normal when n is reasonably large.
• There is approximately a 95% chance that any particular X̄ will be within two standard errors of the population mean μ.
• The sampling error can be reduced by increasing the sample size n.
31
Confidence Interval Estimation
Statistical inferences are always based on an
underlying probability model, which means that some
type of random mechanism must generate the data.
• Two random mechanisms are generally used:
• Random sampling from a larger population
• Randomized experiments
Generally, statistical inferences are of two types:
• Confidence interval estimation uses the data to obtain
a point estimate and a confidence interval around
this point estimate.
• Hypothesis testing determines whether the observed
data provide support for a particular hypothesis.
32
Sampling Distributions
Most confidence intervals are of the form:
point estimate ± multiple × standard error
In general, whenever you make inferences about one or more population parameters, you always base this inference on the sampling distribution of a point estimate, such as the sample mean.
An equivalent statement of the central limit theorem is that the standardized quantity Z, as defined below, is approximately normal with mean 0 and standard deviation 1:
Z = (X̄ − μ)/(σ/√n)
• However, the population standard deviation σ is rarely known, so it is replaced by its sample estimate s in the formula for Z.
• When the replacement is made, a new source of variability is introduced, and the sampling distribution is no longer normal. Instead, it is called the t distribution.
33
The t Distribution
If we are interested in estimating a population mean μ with a sample of size n, we assume the population distribution is normal with unknown standard deviation σ.
σ is replaced by the sample standard deviation s, as shown in this equation:
t = (X̄ − μ)/(s/√n)
• The standardized value in this equation has a t distribution with n − 1 degrees of freedom.
• The degrees of freedom is a numerical parameter of the t distribution that defines the precise shape of the distribution.
• The t-value in this equation is very much like a typical Z-value.
• That is, the t-value indicates the number of standard errors by which the sample mean differs from the population mean.
34
The t Distribution
The t distribution looks very much like the standard normal
distribution.
• It is bell-shaped and centered at 0.
• The only difference is that it is slightly more spread out, and
this increase in spread is greater for small degrees of
freedom.
• When n is large, so that the degrees of freedom is large,
the t distribution and the standard normal distribution are
practically indistinguishable, as shown below.

35
Other Sampling Distributions
The t distribution, a close relative of the normal
distribution, is used to make inferences about a
population mean when the population standard
deviation is unknown.

Two other close relatives of the normal distribution are the chi-square and F distributions.
• These are used primarily to make inferences about variances (or standard deviations), as opposed to means.

36
Confidence Interval for a Mean
To obtain a confidence interval for μ, first specify a
confidence level, usually 90%, 95%, or 99%.

Then use the sampling distribution of the point estimate to determine the multiple of the standard error (SE) to go out on either side of the point estimate to achieve the given confidence level.
• If the confidence level is 95%, the value used most frequently in applications, the multiple is approximately 2. More precisely, it is a t-value.
• A typical confidence interval for μ is of the form:
X̄ ± t × SE(X̄), where SE(X̄) = s/√n
37
Confidence Interval for a Mean
To obtain the correct t-multiple, let α be 1 minus the
confidence level (expressed as a decimal).
• For example, if the confidence level is 90%, then α =
0.10.
Then the appropriate t-multiple is the value that cuts off
probability α/2 in each tail of the t distribution with n−1
degrees of freedom.
As the confidence level increases, the length of the
confidence interval also increases.
As n increases, the standard error s/√n decreases, so
the length of the confidence interval tends to decrease
for any confidence level.
38
Example 3: Customer Response to
a New Sandwich
Objective: To obtain a 95% confidence interval for the mean
satisfaction rating of the new sandwich.
Solution: A random sample of 40 customers who ordered the new sandwich was surveyed. Each was asked to rate the sandwich on a scale of 1 to 10.
The results appear in column B of the accompanying spreadsheet.
This method, using only Excel®, is shown by the formulas in column G.

39
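The same calculation can be sketched in Python. The ratings below are randomly generated stand-ins, not the actual survey data from the lecture's Excel file, and SciPy is assumed to be available for the t quantile:

```python
import math
import random
import statistics
from scipy import stats  # assumes SciPy is available for the t quantile

random.seed(5)
# Hypothetical ratings for 40 customers on a 1-10 scale; the actual
# survey data lives in the lecture's Excel file
ratings = [random.randint(1, 10) for _ in range(40)]

n = len(ratings)
xbar = statistics.mean(ratings)                  # point estimate
se = statistics.stdev(ratings) / math.sqrt(n)    # standard error s/sqrt(n)
t_mult = stats.t.ppf(0.975, df=n - 1)            # ~2.023 for 39 df

lo, hi = xbar - t_mult * se, xbar + t_mult * se
print(f"95% CI for the mean rating: ({lo:.2f}, {hi:.2f})")
```

With real survey data in place of the simulated ratings, this is exactly the interval the Excel formulas in column G compute.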
