
CSA - Sampling Techniques & Survey Methods PDF

This document discusses sampling techniques and survey methods. It begins by explaining key terms like population, sample, census, and sampling. The main advantages of sampling over a full census are that sampling takes less time and costs less, can provide more reliable results due to fewer errors, and allows for more detailed information to be collected. Probability and non-probability sampling methods are described. The document outlines learning outcomes related to understanding sampling terminology, differentiating sampling techniques, appreciating representative samples, and distinguishing between sampling and non-sampling errors.

Uploaded by

kuyu2000
Copyright
© © All Rights Reserved

SAMPLING TECHNIQUES AND SURVEY METHODS

September 20, 2015


Unit 1

Sampling Methods and Distribution


CONTENT
1.1 Basic terms used in sampling
1.2 Probability and non-probability sampling methods
1.3 Sampling error and non-sampling error
1.4 Sampling distribution

1.1 Introduction
When secondary data are not available for a problem under study, a decision may be
taken to collect primary data by using any of the methods discussed in this first unit of the
module. The required information may be obtained by either conducting a census
(involving full enumeration of units under study) or taking a sample. With a census, data
are collected for each and every unit (person, household, cattle, shop, factory, etc.) of the
population or universe under study. Suppose we need the average wage of workers
engaged in the sugar industry. In this case wage figures would be obtained about each
and every worker. The average wage is then obtained by taking the ratio of the total
wages earned by all the workers to the total number of workers engaged in the sugar
industry.

In practice, researchers often use sample surveys to obtain information about a larger
population by selecting and measuring a sample (or fraction of units) from the population.
Conducting a full census can be quite costly. In addition, sampling theory, developed a
century ago, has shown that one does not need to conduct a census to obtain reliable
information about a population; a well-designed sample survey will often do just as well.
Even hospitals extract only blood samples from patients for medical tests (rather than
extracting all the blood of the patient to determine whether or not the patient has a
clean bill of health).

When a population is too large to study, or the cost of a census is prohibitive, we may
be able to rely on the data collected from a sample. Inferences about
a population are based on the information from the sample drawn from that population,
particularly if the sample is designed to be representative of the population. Due to the
variability in the characteristics of the population, scientific sample designs should be
applied in the selection of a representative sample. If not, there is a high risk of distorting
the information about the population.

A sample is a collection of individuals selected from a larger population. For example, we
may have a sample of 50 persons representing a population of 1,000 people. Sampling
enables us to estimate the characteristic of a population by directly observing a portion of
the population. Researchers may not be interested in the sample itself, but in what can be
learned from the sample and how the knowledge gained from the sample can be
considered as information regarding the entire population.

Sampling involves the selection of a number of study units from a defined population.
The thinking here is that the population is too large to consider collecting information from
all its members. If the whole population is taken, then there is no need for statistical
inference. Usually, a representative subgroup of the population (sample) is used in the
investigation. A representative sample has all the important characteristics of the
population from which it is drawn.

1.2 Learning Outcomes


After completing this module, learners/trainees will be able to:
• define a population and a sample, and understand sampling terminology.
• differentiate between probability and non-probability sampling methods.
• appreciate the importance of a representative sample.
• understand the advantages and limitations of the different sampling methods.
• distinguish between sampling and non-sampling errors.
• apply different techniques of sampling.

Basic Terms Used in Sampling

A population is the totality of all subjects that possess a certain common characteristic
that will be studied.
A sample is part of a population which is selected to reasonably represent the population
from which it is drawn.
A census is information on the whole population; enumeration of the entire population.
Sampling is the process of learning about the population on the basis of a sample drawn
from the population. Thus, when employing sampling methods, instead of taking every
unit of the universe, only a part of the universe is studied and conclusions are drawn on
that basis for the entire population. The process of sampling involves three elements:
• Selecting the sample,
• Collecting the data, and
• Making inference about the population.
The three elements cannot generally be considered in isolation from one another. Sample
selection, data collection and estimation are all interwoven and each has an impact on
the others. Sampling should not involve haphazard selection. Rather, it should embody
definite rules for selecting the sample. Because the sample is selected according to a set
of rules, the estimation process cannot be considered independently of the manner in
which the sample has been selected.
It should be noted that a sample is not studied for its own sake. The basic objective of its
study is to draw inference about the population. In other words, sampling is a tool which
helps to know the characteristics of the universe or population by examining only a small
part of it. The values obtained from the study of a sample, such as the average and
dispersion, are known as ‘statistics’. The corresponding values for a population are
called ‘parameters’.
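
The distinction between a parameter and a statistic can be illustrated with a short sketch. The wage figures below are made up for illustration only (they are not data from this module), and the sample of 50 out of 1,000 simply mirrors the example used in the introduction.

```python
import random
import statistics

# Hypothetical population: monthly wages of 1,000 workers (made-up values).
rng = random.Random(42)
population = [rng.randint(2000, 8000) for _ in range(1000)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)

# Statistics are computed from a sample drawn from that population.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
print(f"parameter (population s.d.): {population_sd:.1f}")
print(f"statistic (sample s.d.):     {sample_sd:.1f}")
```

The sample values will generally be close to, but not equal to, the population values; that gap is what sampling error refers to.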

Advantages of sampling
Sampling techniques have the following advantages over a complete enumeration:
i. Less time: Since the sample is a study of a part of the population, considerable
time and labor are saved when a sample survey is carried out. Time is saved not
only in collecting data but also in processing the data. For these reasons, a
sample provides more timely data in practice than a census.
ii. Less cost: The total financial burden of a sample survey is generally much less
than that of a complete census. In sampling, we study only a part of a population
and the total expense of collecting data is less than that required when the
census method is adopted. This is a great advantage particularly in an
underdeveloped economy where much of the information would be difficult to
collect through a census due to lack of adequate resources.
iii. More reliable results: Although the sampling technique involves certain
inaccuracies owing to sampling errors (i.e., imprecision in results arising from the
use of samples as different sets of samples could be chosen under the same
protocol), the result obtained from sampling is generally more reliable than that
obtained from a complete count. There are several reasons for it. Firstly, it is
always possible to determine the extent of sampling errors. Secondly, other
types of errors to which a survey is subject, such as inaccuracy of information,
incompleteness of returns, etc., are likely to be more serious in a complete
census than in a sample survey. This is because more effective precautions can
be taken in a sample survey to ensure that information gathered is accurate and
complete. For these reasons, not only can the total error be expected to be small
in a sample survey, but the sample result can also be used with a greater degree
of confidence because of our knowledge of the probable size of error. Thirdly, it
is possible to engage the services of experts, and to impart thorough training to the
investigators in a sample survey. This further reduces the possibility of errors.
Follow up work can also be undertaken much more effectively in the sampling
method. Indeed, even a complete census can only be tested for accuracy by
some type of sampling checks.
iv. Better timeliness: Data can be collected and summarized more quickly.
v. More detailed information: Since the use of sampling saves time and money, it
is possible to collect more detailed information in a sample survey. For example,
if the population consists of 1,000 persons and a survey of the consumption
pattern is undertaken, the two alternative techniques available are as follows:
a. We may collect the necessary data from each one of the 1,000 people
through a questionnaire containing, say, 10 questions (census method); or
b. We may take a sample of 100 persons (i.e., 10% of the population) and prepare
a questionnaire containing as many as 100 questions.
Both alternatives involve about the same number of responses (1,000 × 10 = 100 × 100
= 10,000), but the sample survey yields far more detail about each respondent.
vi. Sampling method is the only method that can be used in certain cases.
There are some cases in which the census method is inapplicable and the only
practicable means is provided by the sample method. For example, if one is
interested in testing the breaking strength of chalks manufactured in a factory,
under a full enumeration, all the chalks would be broken in the process of testing.

vii. The sample method is often used to judge the accuracy of the data obtained on a
census basis.
Disadvantages of sampling
i. There are always errors in sampling, both sampling and non-sampling errors. A
sample survey must be carefully planned and executed; otherwise, the results
obtained may be inaccurate and misleading. Care must be taken even in a complete
count, and in a sample survey a faulty sampling procedure may give rise to
serious errors.
ii. Sampling generally requires the services of experts. In the absence of qualified
and experienced persons, the information obtained from sample surveys cannot
be relied upon.
iii. At times, the sampling plan may be so complicated that it requires more time,
labor and money than a complete enumeration. This is so if the size of the
sample is a large proportion of the total population and if complicated weighting
procedures are used. With each additional complication in the survey, the
chances of error multiply and greater care has to be taken, which in turn means
more time and labor.
iv. If the information is required for each and every unit in the domain of study, a
complete enumeration survey is necessary.

Sampling units (elements) are the units of the population that are considered for selection.

A sampling frame is the list of all individual sampling units (elements) in the population.
The sample is selected from this list.
Example 1: If somebody studies the socio-economic status of households, then
• a household is the sampling unit.
• the list of households is the sampling frame.

Representative sample: A sample that represents the population accurately.

1.3 Essentials of Sampling


• Representativeness: A sample should be so selected that it truly represents the
universe; otherwise, the results obtained may be misleading. To ensure
representativeness, the random method of selection should be used.
• Adequacy: The size of the sample should be adequate; otherwise, it may not represent
the characteristics of the universe.
• Homogeneity: When we talk of homogeneity we mean that there is no basic
difference in the nature of the units of the universe and that of the sample. If two samples
from the same universe are taken, they should give more or less the same results.

1.4 Steps in Sampling Process

a) Identifying the relevant population: Determine the relevant population from which
the sample is going to be drawn.

Example 2: If the study concerns income, then the definition of the population as
individuals or households can make a difference.
b) Determining the method of sampling: Whether a probability sampling procedure
or a non-probability sampling procedure has to be used is also very important.
c) Securing a sampling frame: A list of elements from which the sample is actually
drawn is necessary.
d) Identifying parameters of interest: What specific population characteristics
(variables and attributes) may be of interest?
e) Determining the sample size: The determination of the sample size depends on
several factors.

1.5 Types of Sampling Techniques


The methods of sampling can be grouped under two broad divisions:
1. Probability (random) sampling methods, and
2. Non-probability (non-random) sampling methods.
A brief description of these sampling methods is given below.

Sampling Methods

Non-probability sampling methods:
• Judgment sampling
• Quota sampling
• Convenience sampling

Probability sampling methods:
• Simple or unrestricted random sampling
• Restricted random sampling:
   - Stratified sampling
   - Systematic sampling
   - Cluster sampling

1.5.1 Probability Sampling Methods


Probability sampling methods involve selection of a sample from a population, based on
chance. Probability sampling is more complex, more time-consuming and usually more
costly than non-probability sampling. However, because study samples are randomly
selected and their probability of inclusion can be calculated, reliable estimates can be
produced and inferences can be made about the population. There are different ways by
which a probability sample can be selected. The method chosen depends on a number of
factors, such as the available sampling frame, how spread out the population is, and how
costly it is to survey members of the population.
Advantages of probability sampling methods
1. Probability sampling doesn’t depend upon the existence of detailed information
about the universe for its effectiveness.
2. Probability sampling provides estimates which are essentially unbiased and have
measurable precision.
3. It is possible to evaluate the relative efficiency of various sample designs only when
probability sampling is used.
Limitations of probability sampling methods:
1. Probability sampling requires a very high level of skill and experience for its use.
2. It requires a lot of time to plan and execute a probability sample.
3. The costs involved in probability sampling are generally high compared to non-
probability sampling.
The most common types of probability sampling methods are:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling

a) Simple Random Sampling


In what follows we denote the size of a population by N. The size of a sample drawn is
denoted by n. Simple random sampling is a method of selecting n units out of the N such
that all possible samples of size n are equally likely to be chosen. In simple
random sampling, which items are selected is purely a matter of chance. It should be
noted that the word random does not mean ‘haphazard’ or ‘hit-or-miss’; rather, it means
that the selection process is such that chance alone determines which items are
included in the sample.
The above definition of simple random sampling underscores that all n items of the
sample are selected independently of one another and all N items in the population have
the same chance of being included in the sample. By independence of selection we mean
that the selection of a particular item in one draw has no influence on the probabilities of
selection in any other draw.
It should also be noted that at each selection, all-remaining items in the population have
the same chance of being drawn. If sampling is made with replacement, i.e., when each
unit drawn from the population is returned prior to drawing the next unit, each item has a
probability of 1/N of being drawn at each selection. If sampling is without replacement,
i.e., when each unit drawn from the population is not returned prior to drawing the next
unit, the probability of selection of each item remaining in the population at the first draw
is 1/N, at the second draw is 1/(N-1), at the third draw is 1/(N-2), and so on. It should be
noted that sampling with replacement has very limited and special use in statistics. We
are mostly concerned with sampling without replacement, such that all the possible
samples of a given size n are equally likely to be selected.
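
The draw-by-draw probabilities described above can be tabulated with a small sketch. The function name `draw_probabilities` is hypothetical (chosen here for illustration), and exact fractions are used so that 1/N, 1/(N-1), 1/(N-2) appear as written in the text.

```python
from fractions import Fraction

def draw_probabilities(N, n, with_replacement):
    """Probability that any one unit still in the population is picked
    at each of the draws 1, 2, ..., n."""
    if with_replacement:
        # The unit drawn is returned, so N units are available at every draw.
        return [Fraction(1, N)] * n
    # Without replacement, one unit fewer is available at each later draw.
    return [Fraction(1, N - draw) for draw in range(n)]

print(draw_probabilities(5, 3, with_replacement=True))
print(draw_probabilities(5, 3, with_replacement=False))
```

For N = 5 and n = 3 the first call gives 1/5 at every draw, while the second gives 1/5, 1/4 and 1/3.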

What methods are there to ensure randomness?


To ensure randomness of selection one may adopt either
1. The Lottery Method, or
2. Table of random numbers.
Lottery method: This is a very popular method of taking a random sample. Under this
method, all items of the universe are numbered or named on separate slips of paper of
identical size and shape. These slips are then folded and mixed up in a container or
drum. A blindfold selection is then made of the number of slips required to constitute the
desired sample size. The selection of items thus depends entirely on chance. This
method would be quite clear with the help of an example. If we want to take a sample of
10 persons out of a population of 100, the procedure is to write the names of the 100
persons on separate slips of paper, fold these slips, mix them thoroughly and then make
a blindfold selection of 10 slips.
The above method is very popular in lottery draws where a decision about prizes is to be
made. However, while adopting the method, it is absolutely essential to see that the slips
are of identical size, shape and color; otherwise, there is a possibility of personal
prejudice and bias affecting the results.
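
As a rough illustration, the lottery method can be imitated in software: each unit gets one "slip", the slips are shuffled, and the required number is taken. This is only a sketch; the function name `lottery_draw`, the placeholder person labels and the seed are assumptions made for the example.

```python
import random

def lottery_draw(names, sample_size, seed=None):
    """Imitate the lottery method: one slip per unit, mix, then pick blindly."""
    slips = list(names)              # one slip of paper per unit
    rng = random.Random(seed)        # a seed only makes the draw repeatable
    rng.shuffle(slips)               # mix the slips thoroughly
    return slips[:sample_size]       # blindfold selection of the required number

# A population of 100 persons (placeholder labels) and a sample of 10.
persons = [f"Person {i}" for i in range(1, 101)]
chosen = lottery_draw(persons, 10, seed=2015)
print(chosen)
```

Because every slip is identical to the program, the concern about slips of differing size, shape or color does not arise here; the quality of the draw depends instead on the random number generator.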
Table of random numbers: The lottery method discussed above becomes quite
cumbersome as the size of population increases. An alternative method of random
selection is that of using the table of random numbers. The random numbers are
generally obtained by some mechanism which, when repeated a large number of times,
ensures approximately equal frequencies for the numbers from 0 to 9 and also proper
frequencies for the various combinations of numbers (such as 00, 01, …, 999, etc.) that are
expected in a random sequence of the digits 0 to 9.
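
The idea of "approximately equal frequencies" can be checked informally by generating a long run of digits and counting them. The sketch below uses Python's pseudo-random generator in place of a printed table; the seed and the run length of 10,000 digits are arbitrary choices made for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)                     # arbitrary seed, for repeatability
digits = [rng.randrange(10) for _ in range(10_000)]

# Frequency of each single digit 0..9 (about 1,000 each is expected).
digit_counts = Counter(digits)

# Frequency of each two-digit combination 00..99 (about 50 each is expected).
pairs = list(zip(digits[::2], digits[1::2]))
pair_counts = Counter(pairs)

for d in range(10):
    print(d, digit_counts[d])
print("distinct two-digit combinations seen:", len(pair_counts))
```

The counts will not be exactly equal; what matters is that no digit or combination is systematically favored.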
Advantages of simple random sampling
• Since the selection of items in the sample depends entirely on chance, there is no
possibility of personal bias affecting the results.
• Compared to judgment sampling, a random sample represents the universe in a
better way. As the size of the sample increases, it becomes increasingly
representative of the population.
• The analyst can easily assess the accuracy of the estimate because sampling
errors follow the principle of chance.
Limitations of simple random sampling
• The use of simple random sampling necessitates a completely catalogued
population from which to draw a sample. It is often difficult for the investigator to
have an up-to-date list of all the items of the population. That restricts the use of this
method in economic and business data where very often we have to employ
restricted random sampling designs.
• The size of the sample required to ensure statistical reliability is usually larger under
random sampling than in stratified sampling (see below).
• From the point of view of field survey operations, it has been claimed that cases
selected by random sampling tend to be too widely dispersed geographically, and
that the time and cost of collecting data become large.
• Random sampling may produce the most non-random looking results. For example,
thirteen cards from a well-shuffled pack of playing cards may consist of one suit.
However, the probability of such an occurrence is very low.
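
To give a sense of how low that probability is, the card example can be worked out directly. The sketch assumes a standard 52-card pack with four suits of 13 cards and a hand of 13 cards dealt without replacement.

```python
from fractions import Fraction
from math import comb

# Number of distinct 13-card hands from a 52-card pack.
hands = comb(52, 13)

# Exactly four of those hands consist of a single suit (one per suit).
p_one_suit = Fraction(4, hands)

print(hands)               # 635013559600
print(float(p_one_suit))   # roughly 6.3e-12
```

In other words, about one hand in 158,753,389,900 is made up of a single suit.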
b) Stratified random sampling
When a population can be classified in such a way that responses to questions asked in
a survey are more homogeneous within groups than between these groups, then it may be
a good idea to employ a stratified technique for sampling. Using stratified sampling, the
population is divided into homogeneous, mutually exclusive groups called strata, and
stratification can be done by any variable that is available for all units prior to sampling
(e.g., age, sex, province of residence, income, etc.). If the stratification was done
correctly, data within a stratum would be more homogeneous than data coming from
different strata.
The sampling procedure is such that a separate sample is taken independently from each
stratum, and a designated number of items are chosen from each stratum.
Selection of stratified random sampling: Some of the issues involved in setting up a
stratified random sample are:
Basis of stratification: What characteristic should be used to subdivide the universe into
different strata? As a general rule, strata are created on the basis of a variable known to
be correlated with the variable of interest and for which information on each universe
element is known. Strata should be constructed in a way that minimizes differences
among sampling units within each stratum and maximizes differences among strata.
In other words the purpose of stratification is to increase the efficiency of sampling by
dividing a heterogeneous universe in such a way that (i) there is as great a homogeneity
as possible within each stratum, and (ii) a marked difference is possible between the
strata.
For example, if we are interested in studying the consumption pattern of people in Addis
Ababa, the city may be divided into various parts (such as zones or weredas) and from
each part a sample may be taken at random. Before deciding on stratification we must
have knowledge of the traits of the population. Such knowledge may be based upon
expert judgment, past experience, preliminary observations from pilot studies, etc.
Number of strata: How many strata should be constructed? Practical considerations limit
the number of strata that is feasible; the cost of adding more strata may outweigh the benefits.
Sample size within strata: How many observations should be taken from each stratum?
Our decision in this situation depends on using either a proportional or a disproportional
allocation. In proportional allocation, we sample each stratum in proportion to its relative
weight. In disproportional allocation this is not the case. It is worthwhile pointing out that
the proportional allocation approach is simple. If all one knows about each stratum is the
number of items in that stratum, it is generally also the preferred procedure. In
disproportional sampling, the different strata are sampled at different rates. As a
general rule, when variability among observations within a stratum is high, we sample that
stratum at a higher rate than strata with less internal variation.
Proportional and Disproportional Stratified Sampling
In a proportional stratified sampling plan, the number of items drawn from each stratum is
proportional to the size of the stratum. For example, if the population is divided into five
groups, their respective sizes being 10, 15, 20, 30 and 25 percent of the population and a
sample of 5,000 is drawn, the desired proportional sample may be obtained in the
following manner:
From stratum one:    5,000 × 0.10 =   500 items
From stratum two:    5,000 × 0.15 =   750 items
From stratum three:  5,000 × 0.20 = 1,000 items
From stratum four:   5,000 × 0.30 = 1,500 items
From stratum five:   5,000 × 0.25 = 1,250 items
Total:                              5,000 items
Proportional stratification yields a sample that represents the universe with respect to the
proportion in each stratum in the population. This procedure is satisfactory if there is no
great difference in dispersion from stratum to stratum. But it is certainly not the most
efficient procedure, especially when there is considerable variation in different strata.
In disproportional stratified sampling an equal number of cases is taken from each
stratum regardless of how the stratum is represented in the population. Thus, in the
above example, an equal number of items (1,000) from each stratum may be drawn. In
practice disproportional sampling is common when sampling from a highly variable
population, wherein the variation of the measurements differs greatly from stratum to
stratum.
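
The two allocations in the five-stratum example can be reproduced with a short sketch. The function names are hypothetical, and the equal allocation follows the description of disproportional sampling given in this section (the same number of items from every stratum).

```python
from fractions import Fraction

def proportional_allocation(sample_size, shares_percent):
    """Allocate the sample to strata in proportion to their population shares.
    Any fractional result is truncated by int(); none occurs in this example."""
    return [int(sample_size * Fraction(share, 100)) for share in shares_percent]

def equal_allocation(sample_size, number_of_strata):
    """Take the same number of items from every stratum."""
    return [sample_size // number_of_strata] * number_of_strata

shares = [10, 15, 20, 30, 25]   # stratum sizes as percentages of the population

print(proportional_allocation(5000, shares))   # [500, 750, 1000, 1500, 1250]
print(equal_allocation(5000, len(shares)))     # [1000, 1000, 1000, 1000, 1000]
```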
Illustration: The following table gives the distribution of a faculty by rank and
length of service:

Length of service    Lecturers   Readers   Professors   Total
Less than 5 yrs          2,000       250           50   2,300
5-10 yrs                 3,000       220           80   3,300
10-15 yrs                1,500       170           30   1,700
More than 15 yrs           880        80           40   1,000
Total                    7,380       720          200   8,300

Work out how many lecturers, readers and professors would be selected from each
category if:

(i) we follow the proportional stratified sampling method and take a sample equal to 10%
of the population,
(ii) the size of the sample is 10% of the population but the lecturers, readers and
professors are to be in the ratio of 5:3:2 and the weights for length of service are to be in
the ratio of 4:3:2:1.
Solution to (i). The sample size is 10% of the universe, hence 830 persons would be
included in the sample. Since 12 strata are formed and we want to follow proportional
stratified sampling method, we will take 10% from each stratum. The number of
persons selected shall be as follows:

Length of service    Lecturers   Readers   Professors   Total
Less than 5 yrs            200        25            5     230
5-10 yrs                   300        22            8     330
10-15 yrs                  150        17            3     170
More than 15 yrs            88         8            4     100
Total                      738        72           20     830

Solution to (ii). In the second case the size of the sample is 830 but the lecturers, readers
and professors are to be in the ratio of 5:3:2, i.e., we take 830 × 5/10 =
415 lecturers, 830 × 3/10 = 249 readers, and 830 × 2/10 = 166 professors. Since the
weights for length of service are 4:3:2:1, the number selected from each category shall be
as given in the table below:

Length of service    Lecturers              Readers               Professors            Total
Less than 5 yrs      415 × 4/10 = 166       249 × 4/10 = 99.6     166 × 4/10 = 66.4       332
5-10 yrs             415 × 3/10 = 124.5     249 × 3/10 = 74.7     166 × 3/10 = 49.8       249
10-15 yrs            415 × 2/10 = 83        249 × 2/10 = 49.8     166 × 2/10 = 33.2       166
More than 15 yrs     415 × 1/10 = 41.5      249 × 1/10 = 24.9     166 × 1/10 = 16.6        83
Total                415                    249                   166                     830

Note: In practice rounding-off of sample size is applied.
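
The arithmetic in this illustration can be checked with a short script. The sketch below is only a verification aid, not part of the original solution; the variable and function names are chosen for illustration, and exact fractions are used so that values such as 74.7 are not affected by floating-point rounding.

```python
from fractions import Fraction

# Population counts from the illustration (length of service by rank).
population = {
    "Less than 5 yrs":  {"Lecturers": 2000, "Readers": 250, "Professors": 50},
    "5-10 yrs":         {"Lecturers": 3000, "Readers": 220, "Professors": 80},
    "10-15 yrs":        {"Lecturers": 1500, "Readers": 170, "Professors": 30},
    "More than 15 yrs": {"Lecturers": 880,  "Readers": 80,  "Professors": 40},
}

total = sum(sum(row.values()) for row in population.values())
sample_size = total // 10            # 10% of the population

# (i) Proportional allocation: 10% of every one of the 12 strata.
proportional = {
    service: {rank: count // 10 for rank, count in row.items()}
    for service, row in population.items()
}

# (ii) Allocation by the stated ratios 5:3:2 (rank) and 4:3:2:1 (service).
rank_ratio = {"Lecturers": 5, "Readers": 3, "Professors": 2}
service_ratio = dict(zip(population, (4, 3, 2, 1)))

def by_ratio(service, rank):
    rank_share = Fraction(rank_ratio[rank], sum(rank_ratio.values()))
    service_share = Fraction(service_ratio[service], sum(service_ratio.values()))
    return sample_size * rank_share * service_share

ratio_based = {
    service: {rank: by_ratio(service, rank) for rank in rank_ratio}
    for service in population
}

print("population:", total, "sample:", sample_size)
for service, row in ratio_based.items():
    cells = {rank: float(value) for rank, value in row.items()}
    print(service, cells, "row total:", float(sum(row.values())))
```

The unrounded cell values are fractional, which is why rounding is applied in practice before the sample is drawn.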


Advantages of stratified sampling
• More representative: Since the population is first divided into various strata and
then a sample is drawn from each stratum, there is little chance of exclusion of a
group of the population.
• Greater accuracy: Stratified sampling ensures greater accuracy. The accuracy is
maximum if each stratum is so formed that it consists of uniform or homogeneous
items.
• Greater geographical concentration: Compared with a random sample, stratified
samples can be more concentrated geographically, i.e., the units from the different
strata may be selected in such a way that all of them are localized in one
geographical area. This would greatly reduce the time and expenses of
interviewing.
Disadvantages of stratified sampling
• Utmost care must be exercised in dividing the population into various strata. Each
stratum must contain homogeneous items, as far as possible; otherwise, the
results may not be reliable. If proper stratification of the population is not done, the
sample may be biased.
• The items from each stratum should be selected at random, but this may be
difficult to achieve in the absence of skilled sampling supervisors.
• A stratified sample is likely to be more widely distributed
geographically than a simple random sample; thus, the cost per observation may
be quite high.
c) Systematic Random Sampling
A systematic sample is formed by selecting one unit at random and then selecting
additional units at evenly spaced intervals until the required sample has been formed.
This method is popularly used in situations where a complete list of the population from
which a sample is to be drawn is available. The list may be prepared in alphabetical,
geographical, numerical or some other order. The items are serially numbered. The first
item is selected at random generally by following the Lottery method. Subsequent items
are selected by taking every kth item from the list, where k refers to the sampling interval
or sampling ratio, i.e., the ratio of population size to the size of the sample, that is k=N/n.
Remark: This method of sampling is also known as quasi-random sampling method.
Once the initial starting point is determined, the remainder of the items selected for the
sample are pre-determined by the sampling interval.
While calculating k, it is possible that we get a fractional value. In such a case we should
round k to a whole number: if the fraction is less than 0.5 it should be
omitted and if it is more than 0.5 it should be taken as 1. If it is exactly 0.5 it should be
omitted if the whole-number part is even, and taken as 1 if the whole-number part is odd.
This is based on the principle that the number after approximation should preferably be even.
For example, if the number of students is 1,020, 1,150 or 1,100 and we want to take a
sample of 200, k shall be:
k = 1,020/200 = 5.1, or 5;   k = 1,150/200 = 5.75, or 6;   k = 1,100/200 = 5.5, or 6.
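
The rounding rule for k can be sketched in a few lines. The function name `sampling_interval` is hypothetical; the sketch relies on the fact that Python's built-in `round()` applied to a `Fraction` rounds an exact half to the nearest even integer, which is the convention described above.

```python
from fractions import Fraction

def sampling_interval(N, n):
    """Sampling interval k = N/n, rounded to a whole number.
    An exact half is rounded to the nearest even integer."""
    return round(Fraction(N, n))

for N in (1020, 1150, 1100):
    print(N, "/ 200 ->", sampling_interval(N, 200))
```

With a sample of 200 this gives k = 5, 6 and 6 for populations of 1,020, 1,150 and 1,100 respectively.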

Steps in systematic random sampling

1) Number the units on your frame from 1 to N (where N is the total population size).
2) Determine the sampling interval (k) by dividing the number of units in the population
by the desired sample size.
3) Select a number between one and k at random. This number is called the random
start and would be the first number included in your sample.
4) Select every kth unit after that first number.
Note: Systematic sampling should not be used when a cyclic repetition is inherent in the
sampling frame.
Example 3: In a class there are 96 students with Roll Nos. from 1 to 96. It is desired to
take a sample of 10 students. Use the systematic sampling method to select the
sample. The solution is

k = N/n = 96/10 = 9.6; take k = 10.

Based on the Roll Nos. 1 to 96, the first student is selected at random from Roll Nos. 1 to 10,
and then we go on taking every 10th student. Suppose the first student happens to be
the 4th. The sample would then consist of the following Roll Nos.: 4, 14, 24, 34, 44, 54, 64,
74, 84 and 94.
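
The steps of systematic sampling, applied to the class of 96 students, can be sketched as follows. The function name `systematic_sample` and its `start` and `seed` parameters are assumptions made for this illustration; passing `start=4` reproduces the random start used in the example.

```python
import random
from fractions import Fraction

def systematic_sample(N, n, start=None, seed=None):
    """Return the selected unit numbers for a frame numbered 1..N."""
    k = round(Fraction(N, n))                 # sampling interval, rounded
    if start is None:
        # Random start: a number between 1 and k chosen at random.
        start = random.Random(seed).randint(1, k)
    # Take the random start and every k-th unit after it.
    return list(range(start, N + 1, k))

print(systematic_sample(96, 10, start=4))
# [4, 14, 24, 34, 44, 54, 64, 74, 84, 94]
```

Because k was rounded up from 9.6 to 10, the size of the sample depends on the random start: a start of 7, for instance, yields only 9 roll numbers (7, 17, ..., 87).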
Systematic sampling is a relatively simple technique and may be more efficient
statistically than simple random sampling provided the list is arranged wholly at random.
However, this requirement is rarely fulfilled. The nearest approach to
randomness is provided by alphabetical lists, such as those found in a telephone directory,
although even these may have certain non-random characteristics.
Advantages of systematic sampling
1. The systematic sampling design is simple and convenient.
2. The time and work involved in sampling using this method are relatively low.
3. The results obtained are also found to be generally satisfactory, provided care is
taken to see that there are no periodic features associated with the sampling
interval.
4. If populations are sufficiently large, systematic sampling can often be expected to
yield results similar to those obtained by proportional stratified sampling.
Disadvantages of systematic sampling
1. The main limitation of the method is that it becomes less representative if we are
dealing with populations having “hidden periodicities”.
2. If the population is ordered in a systematic way with respect to the characteristics
the investigator is interested in, then it is possible that only certain types of items
will be included in the sample, or at least more of certain types than others.

d) Cluster Sampling
Under this scheme, the selection is done from primary, intermediate and final (or the
ultimate) units from a given population or stratum. There are several stages in which the
sampling process is carried out. At first, the first stage units are sampled by some
suitable method, such as simple random sampling. Then, a sample of second stage units
is selected from each of the selected first stage units, again by some suitable method,
which may be the same as, or different from the method employed for the first stage
units. Further stages may be added as required.

The fundamental difference between a cluster sample and a stratified sample is that the
clusters are ideally each representative of the entire population, whereas the stratified
groups are internally homogeneous and individually are not. The procedure may be illustrated as follows:
Suppose we want to take a sample of 5,000 households from the city of Addis Ababa. At
the first stage, the city can be divided into a number of districts and a few districts are
selected at random. At the second stage, each district may be subdivided into a number
of sub-districts and a sample of sub-districts may be taken at random. At the third stage,
a number of households may be selected from each of the sub-districts selected at the
second stage. To take another example, suppose in a particular survey, we wish to take a
sample of 10,000 students from Addis Ababa University. We may take departments as
primary units at the first stage, then draw a sample of departments in the second stage,
and choose students in the third and last stage.
Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.
• A number of clusters are selected randomly to represent the total population; all
units within selected clusters are included in the sample.
• No units from non-selected clusters are included in the sample; they are
represented by those from selected clusters.
• This differs from stratified sampling, where some units are selected from each
group.
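The steps above can be sketched as a one-stage cluster sample in Python. The district and household labels are hypothetical, as is the function name `one_stage_cluster_sample`; the sketch only assumes that a list of clusters is available.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng=None):
    """clusters maps a cluster label to the list of units it contains."""
    rng = rng or random.Random()
    # Select whole clusters at random from the list of cluster labels.
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every unit in a selected cluster enters the sample; other clusters contribute none.
    sample = [unit for label in chosen for unit in clusters[label]]
    return chosen, sample

# Hypothetical frame: six districts holding 3 to 8 households each.
clusters = {
    f"district_{d}": [f"d{d}_household_{h}" for h in range(1, d + 3)]
    for d in range(1, 7)
}
chosen, sample = one_stage_cluster_sample(clusters, 2, random.Random(1))
print(chosen, len(sample))
```

Since the clusters differ in size, the number of households in the final sample depends on which districts happen to be drawn.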
Advantages of Cluster Sampling
1. Cost reduction.
2. It creates 'pockets' of sampled units instead of spreading the sample over the
whole territory.
3. Sometimes a list of all units in the population is not available, while a list of all
clusters is either available or easy to create.
Disadvantages of Cluster Sampling
• The approach creates some loss of efficiency when compared with simple random
sampling.
• It is usually better to survey a large number of small clusters instead of a small
number of large clusters. This is because neighboring units tend to be more alike,
resulting in a sample that does not represent the whole spectrum of situations
present in the population.
• In cluster sampling, we do not have total control over the final sample size.
e) Multi-stage sampling

The surveys discussed thus far are referred to as one-stage samples. A number of
surveys conducted by the Central Statistical Agency (CSA) involve what is known as
multi-stage sampling, where the elements in the targeted population are grouped into
some sort of hierarchy of units, and sampling is done successively. For instance, a
sample of villages may be taken, and then a sampling of dwellings is done within the
selected villages. Such a sample survey may provide reliable results at national level. It
might be possible to obtain estimates for all districts in the country only if we could find
some variables, say, in a census, which correlate with the variable being measured in the
sample. This concern is called small area estimation. Small area estimates may not
always be reliable especially if the correlates do not establish a strong link with the
variable of interest.
In performing the sampling process, the sample size to be used is considered carefully. In
the second unit of this module it will be shown that the accuracy of estimates increases
with the square root of the sample size. For instance, a sample must be increased 25
times in order to get an increase of 5 times in the accuracy of an estimate. For opinion
polls, typically a poll of slightly more than 1,000 respondents gives a “margin of
error” of about 3 percentage points (at the 95 percent confidence level). Non-statisticians
find it hard to believe that if we are trying to gauge the political views of millions, all we
need is about 1,000 respondents nationwide. In practice, the size of the sample depends
both on the degree of accuracy required and on the variability inherent in the population.
The same degree of precision may be obtained from a smaller sample when the population
is homogeneous. If we wish to find out the total amount of money everyone in a certain room
has in their wallets, all we need to do is to ask one person if everybody has the same
amount of money in their wallets. But with varying amounts of money, more persons have
to be asked to get a better estimate of the total amount of money.
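The square-root relationship can be checked numerically. The sketch below uses the usual large-sample margin of error for a proportion, $z\sqrt{p(1-p)/n}$, under assumptions that are not stated in the text: simple random sampling, a 95 percent confidence level (z = 1.96) and the worst case p = 0.5. The function name `margin_of_error` is illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

moe_1000 = margin_of_error(1000)     # poll of 1,000 respondents
moe_25000 = margin_of_error(25000)   # sample made 25 times larger
ratio = moe_1000 / moe_25000         # how much the margin shrinks

print(round(100 * moe_1000, 1), "percentage points")
print(round(ratio, 2), "times smaller with 25 times the sample")
```

Under these assumptions the margin for 1,000 respondents is roughly 3 percentage points, and multiplying the sample size by 25 divides it by 5.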

1.5.2 Non-Probability Sampling Methods


In some cases, there may be no way of obtaining a frame or listing of the entire
population. In such cases, we have to resort to non-probability sampling with the aid of
experts. Non-probability sampling is a process of sample selection without the use of
randomization, and every item has an unknown chance of being selected. The most
important difference between probability and non-probability sampling is that the pattern
of sampling variability can be ascertained in case of probability sampling. In non-
probability sampling, there is no way of knowing the pattern of variability. Also, no
assurance is given that each item has a chance of being included, making it impossible
either to estimate sampling variability or to identify possible bias. Reliability cannot be
measured in non-probability sampling; the only way to address data quality is to compare
some of the survey results with available information about the population. Still, there is
no assurance that the estimates will meet an acceptable level of error.

Researchers are reluctant to use these methods because there is no way to measure the
precision of the resulting sample. Despite these drawbacks, non-probability sampling
methods can be useful when only descriptive comments about the sample itself are
desired. Secondly, they are quick, generally inexpensive and convenient. There are also
circumstances in which it is not feasible or practical to
conduct probability sampling, as in the case of internet polls, or a sample survey of
persons with disabilities (when there is no administrative list of such persons).

The selection process in non-probability sampling is at least partially subjective. Some of the most
common types of non-probability sampling methods that are used in practice are:
i. Judgment sampling
ii. Quota sampling and
iii. Convenience sampling

i. Judgment sampling
This approach is used when a sample is taken based on certain judgments about the
overall population. The underlying assumption is that the investigator will select units that
are characteristic of the population. The critical issue here is objectivity: how much can
judgment be relied upon to arrive at a typical sample? For example, if a sample of ten
students is to be selected from a class of sixty to analyze the spending habits of students,
the investigator would select 10 students who, in her/his opinion, are representative of the
class.
Judgment sampling is subject to the researcher's bias and is perhaps even more biased
than haphazard sampling. Since any preconceptions the researcher may have are
reflected in the sample, large biases can be introduced if these preconceptions are
inaccurate.
Researchers often use this method in exploratory studies like pre-testing of
questionnaires and focus groups. They also prefer to use this method in laboratory
settings where the choice of experimental subjects (i.e., animals, humans, etc.) reflects
the investigator's pre-existing beliefs about the population.
Advantages of judgment sampling
1) In solving everyday business problems and making public policy decisions,
executives and public officials are often pressed for time and cannot wait for
probability sample designs. Judgment sampling is then the only practical method to
arrive at solutions to their urgent problems.
2) When we want to study some unknown traits of a population, some of whose
characteristics are known, we may then stratify the population according to these
known properties and select sampling units from each stratum on the basis of
judgment. This method is used to obtain a more representative sample.
Disadvantages of judgment sampling method
This method is not scientific because the choice of population units to sample may be affected by
the personal prejudice or bias of the investigator. Thus, judgment sampling involves the
risk that the investigator may establish foregone conclusions by including in the sample those
units which conform to her/his preconceived notions. For example, if an investigator holds the view
that the wages of workers in a certain establishment are very low, and if she/he adopts
the judgment sampling method, she/he may include in the sample only those workers
whose wages are low and thereby establish her/his point of view, which may be far from the
truth. Since an element of subjectivity is possible, this method cannot be recommended
for general use.

There is no objective way of evaluating the reliability of sample results. The success of
this method depends upon the excellence of judgment. If the individual making decisions
is knowledgeable about the population and has good judgment, then the resulting sample
may be representative, otherwise the inferences based on sample may be erroneous. It
may be noted that even if a judgment sample is reasonably representative, there is no
objective method for determining the size or likelihood of sampling error.

ii. Quota Sampling


Quota sampling is a type of judgment sampling and is perhaps the most commonly used
sampling technique in the non-probability category. In a quota sample, quotas are set up
according to some specified characteristics such as so many in each of several income
groups, so many in each age group, so many with certain political or religious affiliations, and so
on. Each interviewer is then told to interview a certain number of persons, who constitute
her/his quota. Within the quota, the selection of sample items depends on personal judgment.
For example, in a radio listening survey, the interviewers may be told to interview 500
people living in a certain area, and that out of every 100 persons interviewed 60 are to be
housewives, 25 farmers and 15 children under the age of 15. Within these quotas the
interviewer is free to select the people to be interviewed. The cost per person interviewed
may be relatively small for a quota sample, but there are numerous opportunities for bias,
which may invalidate the results. For example, interviewers may miss farmers working in
the fields or talk only with those housewives who are at home. If a person refuses to
respond, the interviewer simply selects someone else. Because of the risk of personal
prejudice and bias entering the process of selection, quota sampling is not widely used in
practical work.
Quota sampling and stratified random sampling are similar inasmuch as in both methods
the population is divided into parts and the total is allocated among the parts. However,
the two procedures differ entirely. In stratified random sampling the sample in each
stratum is chosen at random. In quota sampling, the sampling within each cell is not done
at random.
Quota sampling is often used in public opinion studies. It occasionally provides
satisfactory results if the interviewers are carefully trained and if they follow instructions
closely. It is often found that since the choice of respondents within a cell is left to the
field representatives, the more accessible and articulate people within a cell will usually
be the ones who are interviewed. Even with alert and conscientious field representatives
it is often difficult to determine control categories such as age, income, educational
qualifications, etc.

iii. Convenience sampling


A convenience sample is obtained by selecting ‘convenient’ population units. The method
of convenience sampling is also called the chunk. A chunk refers to that fraction of the
population being investigated which is selected neither by probability nor by judgment but
by convenience. A sample obtained from readily available lists such as automobile
registrations, telephone directories, etc., is a convenience sample and not a random

sample even if the sample is drawn at random from the lists. If a person is to submit a
project report on labor-management relations in textile industry and he takes a textile mill
close to his office and interviews some people over there, he is following the convenience
sampling method. Convenience samples are prone to bias by their very nature: selecting
population elements that are convenient to choose almost always makes them special or
different from the rest of the elements in the population in some way.

Hence the results obtained by following convenience sampling method can hardly be
representative of the population - they are generally biased and unsatisfactory. However,
convenience sampling is often used for making pilot studies. Questions may be tested
and the chunk may provide preliminary information before the final sampling design is
decided upon.
Selection of appropriate methods of sampling
Having discussed the various methods of sampling, the question now arises as to which
method to adopt in a particular situation. It should be noted that no one method can be
regarded as best under all circumstances - each method has its own specialty. A number
of factors such as the nature of the problem, size of universe, size of sample, availability
of finance, time, etc., would affect the choice of a particular method of sampling.

1.6 Problems in Sampling


To appreciate the need for sample surveys, it is necessary to understand clearly the role
of sampling and non-sampling errors in complete enumeration and sample surveys. The
error arising from drawing inferences about the population on the basis of a few
observations (sampling) is termed sampling error. Clearly, sampling error is non-existent
in a complete enumeration survey, since the whole population is surveyed.
However, the errors arising mainly at the stage of ascertainment and processing of data,
which are termed non-sampling errors, are common to both complete enumeration and
sample surveys.

1.6.1 Sampling Errors (Random Errors)


Even if utmost care has been taken in selecting a sample, the results derived from a
sample study may not be exactly equal to the true value of a population. The reason is
that an estimate is based on a part and not on the whole, and samples are seldom, if
ever, perfect miniatures of the population. Hence sampling gives rise to certain errors
known as sampling errors (or sampling fluctuations). These errors would not be present
in a complete enumeration survey. However, the errors can be controlled. Modern
sampling theory helps in designing the survey in such a manner that the sampling errors
can be made small.

Sampling errors are of two types: biased and unbiased.


i. Biased errors: These errors arise from any bias in selection, estimation, etc.
For example, if in place of simple random sampling, deliberate sampling has
been used in a particular case some bias is introduced in the result and hence
such errors are called biased sampling errors.

ii. Unbiased errors: These errors arise due to chance differences between the
members of population included in the sample and those not included.
Sampling error in statistics is the difference between the value of a statistic and
that of the corresponding parameter.
Thus, the total sampling error is made up of errors due to bias, if any, and the random
sampling error. The essence of bias is that it forms a constant component of error that
does not decrease in a large population as the number in the sample increases. Such
error is, therefore, known as cumulative or non-compensating error.
On the other hand, the random sampling error decreases on average as the sample
size increases. Such error is, therefore, also known as non-cumulative or
compensating error.
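The contrast between the two components can be seen in a small simulation. Everything in it is hypothetical: the population is the numbers 1 to 1,000, and the "faulty" selection is assumed to draw only from the lowest 800 values, so its bias is about -100 by construction. The names `error_summary` and `results` are illustrative.

```python
import random

rng = random.Random(42)                     # fixed seed so the run is repeatable
population = list(range(1, 1001))           # hypothetical values 1..1000
true_mean = sum(population) / len(population)
faulty_frame = population[:800]             # selection that never reaches the top 20%

def error_summary(frame, n, reps=2000):
    """Average error (bias) and spread of the sample mean over repeated samples."""
    errors = [sum(rng.sample(frame, n)) / n - true_mean for _ in range(reps)]
    bias = sum(errors) / reps
    spread = (sum((e - bias) ** 2 for e in errors) / reps) ** 0.5
    return bias, spread

results = {}
for n in (10, 250):
    results[("random", n)] = error_summary(population, n)
    results[("faulty", n)] = error_summary(faulty_frame, n)
    print(n, results[("random", n)], results[("faulty", n)])
```

In runs of this sketch the spread of the errors shrinks as n grows for both procedures, while the average error of the faulty selection stays near -100 however large the sample is.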
Causes of bias: Bias may arise due to -
i. Faulty process of selection;
ii. Faulty work during the collection; and
iii. Faulty methods of analysis.
i. Faulty selection: Faulty selection of the sample may give rise to bias in a number of
ways, such as:
a) Deliberate selection of a representative sample.
b) Conscious or unconscious bias in the selection of a random sample.
c) The randomness of selection may not really exist, even though the investigator
claims that she/he had a random sample, if she/he allows her/his desire to obtain a
certain result to influence her/his selection.
d) Non-response: If not all the items to be included in the sample are covered, there
will be bias even though no substitution has been attempted. This fault particularly
occurs in mailed questionnaires, which are returned incomplete. Moreover, the
information supplied by the informants may also be biased.
e) An appeal to the vanity of the person questioned may give rise to yet another kind
of bias. For example, the question ‘Are you a good student?’ is such that most of
the students would succumb to vanity and answer ‘Yes’.
ii. Bias due to faulty collection of data: Any consistent error in measurement will give
rise to bias whether the measurements are carried out on a sample or on all the
units of the population. The danger is, however, likely to be greater in sampling work
since the units measured are often smaller. Bias may arise due to improper
formulation of the decision, securing an inadequate frame, and so on. Biased
observations may result from a poorly designed questionnaire, an ill-trained
interviewer, failure of a respondent’s memory, etc. Bias in the flow of the data may
be due to unorganized collection procedure, faulty editing or coding of responses.
iii. Bias in analysis: In addition to bias which arises from faulty process of selection and
faulty collection of information, faulty methods of analysis may also introduce bias.
Such bias can be avoided by adopting the proper methods of analysis.
Avoidance of Bias: If possibilities of bias exist, a fully objective conclusion cannot be
drawn. The first essential of any sampling or census procedure must, therefore, be the
elimination of all sources of bias. The simplest and the only certain way of avoiding bias
in the selection process is for the sample to be drawn either entirely at random, or at

random subject to restrictions which, while improving the accuracy, are of such a nature
that they do not introduce bias in the results. In certain cases, systematic selection may
also be permissible.
Method of reducing sampling errors
Once the absence of bias has been ensured, attention should be given to the random
sampling errors. Such errors must be reduced to the minimum so as to attain the desired
accuracy.
Apart from reducing errors of bias, the simplest way of increasing the accuracy of a
sample is to increase its size. The sampling error usually decreases with increase in
sample size, and in fact in many situations the sampling error is inversely proportional to the
square root of the sample size.

1.6.2 Non-Sampling Errors


When a complete enumeration of units in the universe is made, one would expect that it
would give rise to data free from errors. However, in practice it is not so. For example, it
is difficult to completely avoid errors of observation or ascertainment. In the processing of
data tabulation errors may also be committed affecting the final results. Errors arising in
this manner are termed non-sampling errors, as they are due to factors other than the
inductive process of inferring about the population from a sample. Data obtained in an
investigation by complete enumeration are free from sampling error, whereas the results
of a sample survey would be subject to sampling error as well as non-sampling error.
Non-sampling errors are likely to be larger in a complete
enumeration survey than in a sample survey, since in a sample survey it is possible to reduce
them to a great extent by using better organization and suitably trained
personnel at the field and tabulation stages. The behavior of the non-sampling errors with
increase in the sample size is likely to be the opposite of that of sampling error, that is,
the non-sampling error is likely to increase with increase in sample size.

In many situations, it is quite possible that the non-sampling error in a complete
enumeration survey is greater than both the sampling and non-sampling errors taken
together in a sample survey, and naturally in such situations the latter is to be preferred to
the former.
Some sources of non-sampling errors
• Data specification is inadequate and inconsistent with respect to the objective of the
census or survey.
• Inaccurate or inappropriate methods of interview, observation or measurement due
to inadequate or ambiguous schedules, definitions or instructions.
• Lack of trained and experienced investigators.
• Lack of adequate inspection and supervision of primary staff.
• Errors due to non-response, i.e., incomplete coverage in respect of units.
• Errors in data processing operations such as coding, punching, verification, etc.
• Errors committed during presentation and printing of tabulated results.
• Defective frame and faulty selection of sampling units.

These sources are not exhaustive, but are given to indicate some of the possible sources
of error.
Controlling non-sampling errors
In some situations the non-sampling errors may be large and deserve greater attention
than sampling errors. While, in general sampling errors decrease with increase in sample
size, non-sampling errors tend to increase with the sample size. In the case of sample
surveys both sampling and non-sampling errors have to be controlled and reduced to a
level at which their presence does not vitiate the use of final results.
Reliability of samples
The reliability of samples can be tested in the following ways:
• More samples of the same size should be taken from the same population and their
results need to be compared. If the results are similar, the sample will be reliable.
• If the measurements of the universe are known, then they should be compared with
the measurements of the sample. In case of similarity of measurements, the sample is reliable.
• A sub-sample should be taken from the samples and studied. If the results of sample and
sub-sample study show similarity, the sample should be considered reliable.

1.7 Sampling Distribution

So far, we have made a distinction between a population and a sample, stating that a
population consists of all conceivably possible (or hypothetically possible) observations of
a given characteristic, while a sample is simply part of a population. In this section, it is
important to note the difference between finite populations and infinite populations.
A population is finite if it consists of a finite or fixed number of elements, measurements,
or observations, whereas a population is infinite if it contains, at least hypothetically,
infinitely many elements. Observing the totals obtained in repeated rolls of a pair of dice
gives an infinite population. Sampling with replacement from a finite population is also
treated as sampling from an infinite population.
When data are produced by random sampling or randomized experimentation, a statistic
is a random variable that obeys the laws of probability theory. The link between
probability and data is formed by the sampling distribution of a statistic. A sampling
distribution shows how a statistic would vary in a repeated data production. It is the basic
concept in statistical inference.
If we draw a sample of size n from a given finite population of size N without replacement,
then the total number of possible samples is: $\binom{N}{n} = \frac{N!}{n!\,(N-n)!} = k$.

Suppose we have a random sample $X_1, X_2, \ldots, X_n$ of size n drawn from a population of
size N. For each of the k samples we can compute some statistic $T = t(x_1, x_2, x_3, \ldots, x_n)$,
like the mean $\bar{X}$, the variance $S^2$, etc. The set of the values of the statistic so obtained,
one for each sample, constitutes the sampling distribution of the statistic.
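As a quick check of the counting formula, the number of possible samples can be computed and the samples themselves listed for a small case. The values N = 5 and n = 2 are chosen here only for illustration.

```python
import math
from itertools import combinations

N, n = 5, 2
# Number of possible samples without replacement: N! / (n! (N - n)!).
k = math.comb(N, n)
k_from_factorials = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

# Listing the samples of unit labels 1..N confirms the count.
samples = list(combinations(range(1, N + 1), n))
print(k, k_from_factorials, len(samples))
print(samples)
```

The count grows very quickly with N and n, which is why sampling distributions are normally studied through theoretical results rather than by listing every sample.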

Sampling distribution of the mean
This section focuses on the sample mean and its sampling distribution. The sampling
distribution of the sample mean, $\bar{X}$, is determined by the design used to produce the
data, the sample size n and the population distribution having a mean $\mu$ and variance
$\sigma^2$. It can be shown that the variance of $\bar{X}$ is

$$\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$

If $N \to \infty$, $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}$ and $\mu_{\bar{x}} = \mu$.
The same holds true if SRS with replacement is taken. The shape of the distribution of $\bar{x}$
depends on the shape of the population distribution. If the population is
normal, $N(\mu, \sigma^2)$, then the sample mean $\bar{X}$ is $N\left(\mu, \frac{\sigma^2}{n}\right)$. In general, any linear
combination of independent normal random variables is also normally distributed. To
elaborate this concept, let us start with the following illustration.
Example 4: Suppose a population consists of five numbers 2, 4, 6, 8, and 10. Consider
all possible samples of size 2 which can be drawn with replacement from this population.
Find the
a) mean of the population.
b) variance of the population.
c) mean and standard deviation of the sampling distribution of the means.
Solution: a) mean of the population: $\mu = 6$; b) variance of the population: $\sigma^2 = 8$; c)
the number of possible samples (taken with replacement) of size n = 2 from N = 5 is
given by $N^n = 5^2 = 25$. The samples, means and frequency distribution are given below.
Step 1: Draw all possible samples.

          2          4          6          8          10
2      (2, 2)     (2, 4)     (2, 6)     (2, 8)     (2, 10)
4      (4, 2)     (4, 4)     (4, 6)     (4, 8)     (4, 10)
6      (6, 2)     (6, 4)     (6, 6)     (6, 8)     (6, 10)
8      (8, 2)     (8, 4)     (8, 6)     (8, 8)     (8, 10)
10     (10, 2)    (10, 4)    (10, 6)    (10, 8)    (10, 10)

Step 2: Calculate the mean for each sample.

        2     4     6     8     10
2       2     3     4     5     6
4       3     4     5     6     7
6       4     5     6     7     8
8       5     6     7     8     9
10      6     7     8     9     10

Step 3: Summarize the means obtained in step 2 in terms of a frequency distribution.

$x_i$    2    3    4    5    6    7    8    9    10
$f_i$    1    2    3    4    5    4    3    2    1

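i) The mean of $\bar{X}$: $\mu_{\bar{x}} = \frac{\sum x_i f_i}{\sum f_i} = \frac{150}{25} = 6 = \mu$.

ii) The variance of $\bar{X}$: $\sigma_{\bar{x}}^2 = \frac{\sum (x_i - \mu_{\bar{x}})^2 f_i}{\sum f_i} = \frac{100}{25} = 4$, so the standard deviation is $\sigma_{\bar{x}} = 2$.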
Activity
Repeat the exercise if sampling is drawn without replacement.
We observe that the frequency distribution of the sample means given below can also be
used to calculate $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ (please verify):
Means:        3    4    5    6    7    8    9
Frequency:    1    1    2    2    2    1    1
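Readers who want to verify Example 4 and the activity by machine can enumerate every sample directly. The following is a minimal sketch; the helper name `mean_and_variance` is an assumption, and the variance is computed with divisor equal to the number of values, as in the example.

```python
from itertools import combinations, product

population = [2, 4, 6, 8, 10]
n = 2
N = len(population)

def mean_and_variance(values):
    """Mean and variance (dividing by the number of values) of a list of numbers."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

mu, sigma2 = mean_and_variance(population)   # population mean and variance

# With replacement: all N**n ordered samples.
means_wr = [sum(s) / n for s in product(population, repeat=n)]
mu_wr, var_wr = mean_and_variance(means_wr)

# Without replacement: all C(N, n) unordered samples.
means_wor = [sum(s) / n for s in combinations(population, n)]
mu_wor, var_wor = mean_and_variance(means_wor)

print(mu, sigma2)
print(len(means_wr), mu_wr, var_wr)
print(len(means_wor), sorted(means_wor), mu_wor, var_wor)
print(sigma2 / n * (N - n) / (N - 1))        # variance formula for a finite population
```

The mean of the sample means equals the population mean in both cases, while the variance of the sample means is smaller when sampling is without replacement.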

Case 1: Sampling distribution of the mean (finite population or sampling without replacement)
In most practical conditions, we cannot enumerate all possible samples as in the above
example to observe how close a sample mean might be to the mean of the population
from which the sample came. Instead, we can rely on some essential theoretical results
about sampling distributions of the mean.
Theoretical Result 1: finite population
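For random samples of size n taken from a finite population of size N with the mean $\mu$
and variance $\sigma^2$, the sampling distribution of $\bar{x}$ has mean and variance given by
a) $E(\bar{x}) = \mu_{\bar{x}} = \mu$
b) $\mathrm{Var}(\bar{x}) = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) \;\Rightarrow\; \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$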

The standard error (s.e.) of the mean
Since there are many problems in which we are interested in the standard deviation of
the sample mean rather than its variance, we define the standard error of the mean.

Definition 1.1: The standard deviation of the sample mean is called the standard error
(s.e.) of the mean, $\sigma_{\bar{x}}$.
The standard error of the mean tells how much sample means vary from the mean of the
same population. It depends on two things: how large a sample we take and how much
variability there is in the population. Means based on large numbers of cases vary less
than means based on small number of cases.
Case 2: Sampling distribution of the mean (Infinite populations or sampling with
replacement)

Observe the factor $\frac{N-n}{N-1}$ when $N \to \infty$: $\frac{N-n}{N-1} = \frac{1 - n/N}{1 - 1/N}$, which approaches 1 as
$N \to \infty$. Thus, Theoretical Result 1 is modified to $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, so that the s.e. of $\bar{X}$ for an infinite
population is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. This is also the case if a random sample is drawn with replacement from a
finite population, which is illustrated below.
Theoretical Result 2: Infinite populations or sampling with replacement
For random samples of size n taken from an infinite population, or when sampling is with
replacement from a population with mean $\mu$ and variance $\sigma^2$, we get
a) $E(\bar{x}) = \mu_{\bar{x}} = \mu$.
b) $\mathrm{Var}(\bar{x}) = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, and the standard error (s.e.) of $\bar{x}$ is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.

Therefore, collecting these results from both cases, we have the standard error of the
mean

a) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, if sampling is with replacement or from an infinite population.
b) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$, if sampling is without replacement from a finite population.

Both formulae show that the standard error of the mean, $\sigma_{\bar{x}}$:

• Increases with an increase in $\sigma$ (i.e., it is directly proportional to $\sigma$).
• Decreases with an increase in n (more specifically, it is inversely proportional to
the square root of n); in the case of a finite population it decreases faster, since n
also appears in the factor $\sqrt{\frac{N-n}{N-1}}$, which is a fraction.

Definition: Finite population correction factor

The factor $\sqrt{\frac{N-n}{N-1}}$ is called the finite population correction factor (fpc).

Note.
1. When N is large compared to n, the difference between the two formulae for $\sigma_{\bar{x}}$
is usually negligible, and $\frac{\sigma}{\sqrt{n}}$ will be taken as an approximation when we are
sampling from a large, finite population.
2. The fpc is omitted (in practice) unless the sample constitutes at least 5 percent of
the population. This is not a hard-and-fast rule, but only a rule of thumb.
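The effect of the correction factor can be seen numerically. The sketch below takes the fpc in its square-root form, as it enters the standard error; the function names and the values of $\sigma$, N and n are hypothetical and chosen only to contrast a 2 percent sample with a 20 percent sample.

```python
import math

def fpc(N, n):
    """Finite population correction applied to the standard error."""
    return math.sqrt((N - n) / (N - 1))

def standard_error(sigma, n, N=None):
    """Standard error of the mean; pass N for sampling without replacement."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= fpc(N, n)
    return se

sigma, N = 10.0, 1000                      # hypothetical population values
for n in (20, 200):                        # 2 percent and 20 percent of the population
    print(n, round(fpc(N, n), 4),
          round(standard_error(sigma, n), 4),
          round(standard_error(sigma, n, N), 4))
```

For the small sampling fraction the two standard errors are almost identical, which is the reasoning behind omitting the fpc in that case; for the larger fraction the correction is no longer negligible.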
Activity
a) Find the fpc for n =100 and N =10,000.
b) When we take a sample from an infinite population, what happens to the standard
error of the mean when we use a sample mean to estimate the mean of a
population, if n is increased from 50 to 200?

The Central Limit Theorem


The sampling distribution of $\bar{X}$ is normal if the underlying population is normal. Even if
the population distribution is not normal, as the sample size increases, the distribution of
$\bar{x}$ gets closer to a normal distribution. This is true no matter what shape the population
distribution has, as long as the population has a finite standard deviation $\sigma$. This fact of
probability theory is called the Central Limit Theorem (CLT). By using this result it has
been proved that the sampling distributions of most statistics, like the sample proportion ($p$),
difference of sample proportions ($p_1 - p_2$), difference of sample means ($\bar{x}_1 - \bar{x}_2$),
difference of sample standard deviations ($s_1 - s_2$), etc., are asymptotically normal.
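A small simulation can make the theorem concrete. The population used here is an assumption made for illustration: an exponential distribution with mean 1 and standard deviation 1, which is strongly skewed and far from normal. The sample size, number of repetitions and seed are likewise arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n, reps = 36, 5000              # sample size and number of repeated samples

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
se = 1.0 / n ** 0.5             # sigma / sqrt(n) with sigma = 1

# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - 1.0) <= 1.96 * se for m in sample_means) / reps
print(round(mean_of_means, 3), round(sd_of_means, 3), round(se, 3), round(within, 3))
```

If the normal approximation is reasonable, the standard deviation of the sample means should be close to $\sigma/\sqrt{n}$ and roughly 95 percent of them should fall within 1.96 standard errors of the population mean.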

Theoretical Result 3: Central Limit Theorem (CLT)


Let $x_1, x_2, \ldots, x_n$ constitute a random sample of size n from any population with mean $\mu$
and finite standard deviation $\sigma$. When n is large, the sampling distribution of the sample
mean $\bar{x}$ is approximately normal with mean $\mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.

Mathematically, if n is large (usually $n \ge 30$), $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, from which we get the
standardized form of $\bar{X}$, given by

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

$N(0, 1)$ is known as the standard normal distribution, i.e. $\mu = 0$ and $\sigma^2 = 1$. The Central
Limit Theorem justifies the use of normal-curve methods for many problems, because
(regardless of the actual shape of the population sampled) as long as we take a large
sample, the mean will be approximately normally distributed.
When the parent population is normal, however, regardless of the size of n, the mean is
normally distributed.
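The Central Limit Theorem can be checked informally by simulation. The sketch below is only an illustration under stated assumptions: the exponential population (mean 1, standard deviation 1), the sample size n = 40, the number of repetitions and the seed are all choices made here, not taken from the text.

```python
import random
import statistics

random.seed(12345)  # fixed seed so the run is repeatable

n = 40        # size of each sample
reps = 5000   # number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / n ** 0.5   # sigma / sqrt(n)

# Share of sample means within one theoretical standard error of the
# population mean; for a normal distribution this share is about 0.68.
within_one_se = sum(abs(m - 1.0) <= theoretical_se for m in sample_means) / reps

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
print(f"share within one SE:  {within_one_se:.3f}")
```

Although the population is far from normal, the sample means should centre near 1 with a spread near $\sigma/\sqrt{n}$, and roughly two-thirds of them should fall within one standard error of the population mean.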
Theoretical Result 4: Sampling from a normal population
If $\bar{X}$ is the mean of a random sample of size n from a normal distribution with mean $\mu$
and variance $\sigma^2$, its sampling distribution is a normal distribution with mean $\mu$ and
variance $\frac{\sigma^2}{n}$.
Symbolically, $X \sim N(\mu, \sigma^2) \Rightarrow \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.

Exercises
1. (a) Define sampling. Explain the different methods of sampling.
(b) State the advantages of adopting sampling procedure in carrying out large-scale
surveys.
2. (a) Sampling is necessary under certain conditions. Explain this with suitable
examples.
(b) Point out the importance of sampling in solving business and economic problems.
3. Distinguish between ‘census’ and ‘sampling’ methods of collection of data, and
compare their merits and demerits.
4. Explain the terms ‘population’ and ‘sample’. Is it sometimes necessary and often
desirable to collect information about the population by conducting a sample survey
instead of complete enumeration?
5. Define a random sample and show how you would achieve randomness. How do you
select a random sample from a finite population? Point out the advantages of a
stratified random sample.
6. Prepare a sample survey scheme for ascertaining the percentage of income that
households spend on food. In what respects will your scheme be optimal?
7. What are the main steps involved in a sample survey? Discuss the various sources
of errors in such surveys. Discuss briefly how these errors can be controlled.
8. (a) What is random sampling? How can a random sample be selected? Is random
sampling always better than other forms of sampling in the context of socio-economic
survey?

(b) A sample may be large yet worthless because it is not random; or it may be
random but unreliable because it is small. Comment upon this statement.
9. A population consists of the four numbers 3, 7, 11, and 15. Consider all possible
samples of size 2 drawn from this population without replacement. Find
a) the population mean $\mu$,
b) the population standard deviation $\sigma$,
c) the mean of the sampling distribution of means,
d) the standard deviation of the sampling distribution of means.
Verify (c) and (d) from (a) and (b) using suitable formulae.

REFERENCE
Suggested textbook
Ott, R.L. and Longnecker, M.T. (2008). An Introduction to Statistical Methods and Data
Analysis. Duxbury Press, New York.

Reference books
Freund, R.J. and Wilson, W.J. (2003). Statistical Methods (2nd Ed.). Academic Press.

Gupta, C.B. and Gupta, V. (2004). An Introduction to Statistical Methods. Vikas
Publishing House Pvt. Ltd., India.

Snedecor, G.W. and Cochran, W.G. (1980). Statistical Methods (8th Ed.). Iowa State
University.

Unit 2
Sample Survey Methods
CONTENT
2.1 Introduction
2.2 Basic Survey Design
2.3 Steps in Sampling Design
2.4 Sampling and Non-sampling Errors
2.5 Sample Survey Techniques

2.1 Introduction
In a broad sense, sampling theory can be considered as coextensive with modern
statistical methods. Almost all of the modern developments in statistics relate to the
inferences that can be made about a population when information is available from only a
sample of the elements of the population. Some of the ways in which this is reflected in
statistical programs are mentioned below.
Survey Work
In most survey work, the population consists of all persons (or housing units, households,
industrial establishments, farms, etc.) in a city or other area. Information is obtained or
desired from a sample of the population, but inferences are required on characteristics of
the whole population.
Design and Analysis of Experiments

In the design and analysis of experiments, the population represents all possible
applications of several alternative techniques which can be used. For example, the
experiment may be agricultural, in which a number of fertilizers are being tested. The
population is infinite because it represents the use of the fertilizers in all possible farms
over all time. The problem is to design experiments so that the maximum amount of
information can be made available for inferences about the full population, estimated from
a sample of limited size.
Quality Control

In the application of quality control methods in an industrial establishment, for example,
the population is all of the products coming out of a machine. Inferences are needed on
how well the products conform to specifications. The term "quality control" is also applied
to a sample check on the quality of field work done in a sample survey; the sample check
is carried out after the actual survey is completed. Office operations such as editing and
coding are also subject to quality control; a sample of the work is checked to determine if
it meets acceptable standards.
2.2 Learning Outcomes

After studying this module, the learner would be able to:

• Distinguish between population parameters and sample statistics
• Take random samples from populations
• Explain why sample statistics are good estimators of population parameters
• Judge one estimator as better than another based on desirable properties of
estimators
• Understand basic survey designs
• Explain survey sampling:
   ◦ Probability samples
   ◦ Non-probability samples
• Explain sampling and non-sampling errors

2.3 Reasons for the use of Survey Sampling
There are six basic reasons for the use of samples:
• A sample may save money (as compared with the cost of a complete census)
when absolute precision is not necessary.
• A sample may make it possible to concentrate attention on individual cases.
• A sample saves time, when data are desired more quickly than would be possible
with a complete census.
• In industrial uses, some tests are destructive (for example, testing the length of
time an electric bulb will last) and can only be performed on a sample of items.
• Some populations can be considered as infinite, and can, therefore, only be
sampled. A simple example is an agricultural experiment for testing fertilizers. In
one sense, a census can be considered as a sample at one instant of time of an
underlying causal system which has random features in it.
• Where non-sampling errors are necessarily large, a sample may give better results
than a complete census because non-sampling errors are easier to control in
smaller-scale operations.
2.4 Limitations of Survey Sampling
Under certain conditions, the usefulness of sampling becomes questionable. Three
principal conditions can be mentioned.
1) If data are needed for very small areas, disproportionately large samples are required
since precision of a sample depends largely on the sample size and not on the
sampling rate. In this case, sampling may be almost as expensive as a complete
census.
2) If data are needed at regular intervals of time, and it is important to measure very
small changes from one period to the next, very large samples may be necessary.
3) If there are unusually high overhead costs connected with a sample survey, caused
by work involved in sample selection, control, etc., sampling may be impractical. For
example, in a country with many small villages it may be more economical to
enumerate all the households in the sample villages than to enumerate a sample of
households within the sample villages. For office processing, however, a sample of
the enumerated households may be used to reduce the work and costs of producing
tabulations.

2.5 Basic survey designs


Several types of research designs can be considered in quantitative analysis.
1. Experimental design: identifies effects by random assignment between control
and treatment groups.
2. Quasi-experimental: uses a comparison group similar to the treatment group (no
random assignment).

3. Cross-sectional: information gathered only at one point in time; uses variation
between subjects; is less expensive and the most common type of design.
4. Longitudinal: same subjects observed in more than one period; uses both cross-
sectional as well as within subject variation; useful for capturing changes over
time.
5. Case study: detailed analysis of individual cases.

2.6 Steps in Sampling Design


a) Identifying the relevant population
Determine the relevant population from which the sample is going to be drawn.

Example 5: if the study concerns income, then the definition of the population as
individuals or households can make a difference.

b) Determining the method of sampling


Decide whether a probability sampling procedure or a non-probability sampling procedure
is to be used.
c) Securing a sampling frame

A list of elements from which the sample is actually drawn is important and necessary.
d) Identifying parameters of interest

Decide what specific population characteristics (variables and attributes) may be of
interest.
e) Determining the sample size

2.7 Criteria and Definitions

2.7.1 Criteria for the acceptability of a sampling method


It has been demonstrated repeatedly in practical applications that sampling methods can
provide data of known reliability on an efficient and economical basis. However, although
a sample includes only part of a population, it would be misleading to call a collection of
numbers a "sample" merely because it includes part of a population.
To be acceptable for statistical analysis, a sample must represent the population and
must have measurable reliability. In addition, the sampling plan should be practical and
efficient.
Chance of Selection for Each Unit
The sample must be selected so that it properly represents the population that is to be
covered. This means that each unit (farm, household, person, or whatever unit is being
sampled) must have a nonzero probability (chance) of being selected.

Measurable Reliability
It should be possible to measure the reliability of the estimates made from the sample.
That is, in addition to the desired estimates of characteristics of the population (totals,
averages, percentages, etc.) the sample should give measures of the precision of these
estimates. As we shall see later, these measures of precision can be used to indicate the
maximum error that may reasonably be expected in the estimates, if the procedures are
carried out as specified, and if the sample is moderately large. The estimation of
precision is not possible unless the selection is carried out so that the chance of selection
of each unit is known in advance and random sampling is used.
Feasibility
A third characteristic is that the sampling plan must be practical. It must be sufficiently
simple and straightforward so that it can be carried out substantially as planned; that is,
the sampling theory and practice will be the same. A plan for selecting a sample, no
matter how attractive it may appear on paper, is useful only to the extent that it can be
carried out in practice. When the methods actually followed are the same (or substantially
the same) as specified in the sampling plan, then known sampling theory provides the
necessary measures of reliability. In addition, the measures of reliability computed from
the survey results will serve as powerful guides for future improvement in important
aspects of the sample design.

Economy and Efficiency


Finally, the design should be efficient. Among the various sampling methods that meet
the three criteria stated above, we would naturally choose the method which, to the best
of our knowledge, produces the most information at the smallest cost. Although this is not
an essential feature of an acceptable sampling plan, it is clearly a highly desirable one. It
implies that the most effective possible use will be made of all available facilities and
resources, such as maps, other statistical data, personal knowledge, sampling theory,
etc.
We shall consider only sampling methods that conform to the above criteria. We shall
present basic theory for various alternative designs which are possible as well as
methods of measuring their precision. We shall also stress practical methods of
application and considerations of efficiency.

2.7.2 Definitions of Terms


A statistical survey is an investigation involving the collection of data. Observations or
measurements are taken on a sample of elements for making statistical inferences about
a defined group of elements. Surveys are conducted in many ways.

Unit of analysis is the unit for which we wish to obtain statistical data. The most common
units of analysis are persons, households, farms, and business firms. They may also be
products coming out of some machine process. The unit of analysis is frequently called
an element of the population. There may be more than one unit of analysis in the same
survey; for example, households and persons; or number of farms and hectares (or
acres) harvested.

A characteristic is a general term for any variable or attribute having different possible
values for different individual units of sampling or analysis. In a sample survey, we
observe or measure the values of one or more characteristics for the units in the sample.
For example, we observe (or ask about) the area of land for rice crop, the number of
cattle on a farm, the age and sex of a person, the number of children per family, etc. So,
we observe a unit, but we measure several characteristics of that unit.

A population or universe is the entire group of all the units of analysis whose
characteristics are to be estimated. The chapters in this sampling manual will deal
primarily with a finite population, having N units.

A probability sample is a sample obtained by application of the theory of probability. In
probability sampling, every element in a defined population has a known, nonzero
probability of being selected. It should be possible to consider any element of the
population and state its probability of selection.
Sampling with Replacement and without Replacement. A simple way of obtaining a
probability sample is to draw the units one by one with a known probability of selection
assigned to each unit of the population at the first and each subsequent draw. The
successive draws may be made with or without replacing the units selected in the
preceding draws. The former is called the procedure of sampling with replacement, and
the latter, sampling without replacement.

A sampling frame is the totality of the sampling units from which the sample is to be
selected. The frame may be a list of persons or of housing units; it may be a subdivided
map, or it may be a directory of names and addresses stored in some kind of electronic
medium, such as a file on a hard disk or a database.

A parameter is a quantity computed from all values in a population set. That is, a
parameter is a descriptive measure of a population. For example, consider a population
consisting of N elements. Then the population total, the population average or any other
quantity computed from measurements including all elements of the population is a
parameter. The objective of sampling is to estimate the parameters of a population.

A statistic is a quantity computed from sample observations of a characteristic, usually
for the purpose of making inference about the characteristic in the population. The
characteristic may be any variable which is associated with a member of the population,
such as age, income, employment status, etc.; the quantity may be a total, an average, a
median, or other quantiles. It may also be a rate of change, a percentage, a standard
deviation, or it may be any other quantity whose value we wish to estimate for the
population.

Note that the term statistic refers to a sample estimate and the term parameter refers to a
population value.

An estimate is a numerical quantity computed from sample observations of a
characteristic and intended to provide information about an unknown population value.

An estimator is a mathematical formula or rule which uses sample results to produce an
estimate for the entire population. For example, the sample average,
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, is an estimator. It provides an estimate of the parameter, the
population average, $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$; that is, the sample average is an
estimate of the population average.
Therefore, the estimator refers to a mathematical formula. When numbers are plugged
into the formula, an estimate is produced. However, in common statistical language, the
words estimate and estimator are used interchangeably.

The probability of selection is the chance that each unit in the population has of being
included in the sample. Probability values range from 0 to 1, inclusive.

A random variable is a variable which, by chance, can be equal to any value in a
specified set. The probability that it equals any given value (or falls between two limits) is
either known, can be determined, or can be approximated or estimated. A chance
mechanism determines the value which a random variable takes. For example, in flipping
a coin, we can define the random variable X which can take the value 1 if the coin lands
'heads' and the value 0 if the coin lands 'tails'. Therefore, the variable X, as was just
defined, can take either one of two values after the coin is flipped.

The probability distribution gives the probabilities associated with the values which a
random variable can equal. If there are N values that a random variable X can take, say
$X_1, X_2, \ldots, X_N$, then there are N probabilities associated with the $X_i$ values, namely
$P_1, P_2, \ldots, P_N$. The probabilities and the values the random variable takes constitute the
probability distribution of X.
2.8 Sampling Distribution

The expected value is the average value for a single characteristic over all possible
samples. Mathematically, we define the expected value (or mean) of a discrete random
variable Y as $E(y) = \sum_{y} y\,p(y)$.

The Greek letter $\Sigma$ is used to indicate the sum of the products of all possible values of y
and their associated probabilities p(y), where $\sum_{y} p(y) = 1$. The small y denotes a particular
value of Y. An analogous definition of the expected value can be provided for a continuous
random variable using an integral in place of the summation operator. For now we limit our
interest to the discrete situation, which is illustrated below.

The expected value is a weighted average of the possible outcomes, with the probability
weights reflecting the likelihood of occurrence of each outcome. Thus, the expected value
should be interpreted as the long-run average value of Y, if the frequency with which
each outcome occurs is in accordance with its probability.

For example, consider the case in which the random variable Y is used to represent the size of a
U.S. household selected at random. We write the expected value of Y as:

E(y) = (1)(0.225) + (2)(0.313) + (3)(0.175) + (4)(0.158) + (5)(0.076) + (6)(0.032) + (7.7)(0.022) = 2.75.
The expected value of Y is not the most likely or the most typical value of Y. It is the long-
run average value of Y, if we repeatedly select households at random. Some households
have fewer than 2.75 people; some have more. The average of these different household
sizes is 2.75.
Note that the category "7 or more" aggregates data for households where Y = 7, 8, 9, ...;
so, it would be misleading to use Y = 7. Instead, we have put this .022 probability at Y =
7.7, which is the average size of households with seven or more persons.
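The weighted-average calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the variable names are assumptions for the illustration, and the values and probabilities are those quoted in the text.

```python
# Household-size distribution quoted in the text; 7.7 stands in for "7 or more".
sizes = [1, 2, 3, 4, 5, 6, 7.7]
probs = [0.225, 0.313, 0.175, 0.158, 0.076, 0.032, 0.022]

# Expected value: sum of each value times its probability.
expected_size = sum(y * p for y, p in zip(sizes, probs))

print(round(expected_size, 2))  # 2.75
print(round(sum(probs), 3))     # 1.001
```

The quoted probabilities sum to 1.001 rather than exactly 1, presumably because each was rounded to three decimals.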
A sampling distribution is the probability distribution of all possible values that an
estimate might take under a specified sampling plan.
In this section we will show by examples that the sample average (mean) is both an
unbiased and a consistent estimate of the true population average.
Let us first present the idea of a sampling distribution of the mean by actually listing all
possible random samples of size n = 2 which can be drawn from a hypothetical
population of N = 5 housing units (HUs). We wish to estimate the average household
(HH) size of these HUs from a sample.
Household size per household

Household Unit (HU)    HH Size ($Y_i$)
U1                     3
U2                     5
U3                     7
U4                     9
U5                     11

The total number of persons in the population is $Y = \sum_{i=1}^{N} Y_i = 35$. The average number of
persons in a household (or average household size) is $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = 7$.

5
If we take a sample of size 2 from this population, there are    10 possibilities, and
 2
they are (3,5), (3, 7) (3, 9) (3, 11) (5, 7) (5, 9) (5, 11) (7, 9) (7, 11) (9, 11).

The means of these samples are 4, 5, 6, 7, 6, 7, 8, 8, 9, and 10, respectively. If
sampling is random, so that each sample has probability 1/10, the possible samples of
two HUs from the population of 5 HUs and their means are as shown in the first table
below. The sampling distribution of the mean is presented in the second table.

Samples of two HUs from a population of 5 HUs

Sample (n = 2)    Value of $\bar{y}$    Probability $P(\bar{y})$
3, 5              4                     1/10
3, 7              5                     1/10
3, 9              6                     1/10
3, 11             7                     1/10
5, 7              6                     1/10
5, 9              7                     1/10
5, 11             8                     1/10
7, 9              8                     1/10
7, 11             9                     1/10
9, 11             10                    1/10

Sampling distribution of the mean

Mean $\bar{y}$    Probability $P(\bar{y})$
4                 1/10
5                 1/10
6                 2/10
7                 2/10
8                 2/10
9                 1/10
10                1/10

An examination of this sampling distribution reveals some pertinent information relative to
the problem of estimating the mean of the given population using a random sample of
size 2. For instance, we see that, corresponding to $\bar{y}$ = 6, 7, or 8, the probability is 6/10
that a sample mean will not differ from the population mean (which is 7) by more than 1,
and, corresponding to $\bar{y}$ = 5, 6, 7, 8, or 9, the probability is 8/10 that a sample mean will
not differ from it by more than 2.

Further useful information about this sampling distribution of the mean can be obtained
by calculating its expected value as follows:

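$$E(\bar{y}) = \left(4 \times \tfrac{1}{10}\right) + \left(5 \times \tfrac{1}{10}\right) + \left(6 \times \tfrac{2}{10}\right) + \left(7 \times \tfrac{2}{10}\right) + \left(8 \times \tfrac{2}{10}\right) + \left(9 \times \tfrac{1}{10}\right) + \left(10 \times \tfrac{1}{10}\right) = 7.$$

Note that the same result would be obtained for samples of any size. Recall the definition
of the expected value, which is the average of a single characteristic over all possible
samples.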
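The enumeration above can be reproduced mechanically. The following Python sketch lists every sample of size 2 from the five household sizes, builds the sampling distribution of the mean and computes its expected value; the variable names are assumptions for the illustration, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [3, 5, 7, 9, 11]   # household sizes of the five HUs
n = 2

# All possible samples of size n drawn without replacement.
samples = list(combinations(population, n))
means = [Fraction(sum(s), n) for s in samples]

# Each sample is equally likely, so P(mean) = count / number of samples.
counts = Counter(means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}
expected_mean = sum(m * p for m, p in distribution.items())

for m, p in distribution.items():
    print(m, p)
print("E(y-bar) =", expected_mean)
```

Changing `n` to 3 or 4 gives a different sampling distribution but the same expected value of 7, which is what unbiasedness of the sample mean asserts.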
We will now compare the distribution of the sample estimates to show that: As the sample
size increases, the means of the samples tend to concentrate more and more around the
true average value. In other words, the estimates tend to become more and more reliable
as the sample size increases.
The percentage distributions of the sample estimates can be used to predict the chance
of obtaining a sample estimate within specified ranges of the true value. To see the
above statements, consider a hypothetical population of 12 individuals. We wish to make
different estimates from a sample of 1, 2, 3, 4, 5, 6 and 7 individuals.

Predicting reliability of sample estimates (confidence interval)

We have seen that the precision of a sample can be predicted if we have the distribution
of all sample estimates of a given size for the population. In a real situation, we cannot
select all possible samples and examine the estimates derived from them. We must
depend upon a single sample. Therefore, it is necessary to find some measure of the
extent to which the estimates made from various samples differ from the true value; this
measure, if it is to be useful, must be one that can be estimated from the sample itself.
Before showing how and why we can do this, we shall introduce certain definitions and
relationships which are derived from the theory of sampling.
Standard deviation is the measure of variability in the population; in subsequent
discussion and consideration this is the measure of dispersion we will be using. The
square of the standard deviation is called the population variance, and is designated by the
symbol $\sigma^2$. The variance of the population is defined as the average of the squares of
the deviations of all the individual observations from their mean value. Thus, it would be
computed as follows, if all the values in the universe could be observed:

$$\sigma^2 = \frac{1}{N}\left[(Y_1 - \bar{Y})^2 + \cdots + (Y_N - \bar{Y})^2\right] = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2,$$

where the Y's with subscripts are individual observations and $\bar{Y}$ is the mean of the N
observations for the N elements in the universe. Note that it has become fairly general
practice to denote the population variance by $\sigma^2$ when dividing by N, and by $S^2$ when
dividing by N - 1; symbolically,

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2.$$

Its sample equivalent is given by:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2,$$

where n is the sample size, $y_i$ is the sample measurement of a characteristic and $\bar{y}$ is the
sample mean. We will use $S^2$ throughout the text because $s^2$ is an unbiased estimate of
$S^2$. Note that all results are equivalent in either notation.
The variance of the sample means is the average of the squares of the deviations of the
means of all possible samples of size n from the true mean. The variance of $\bar{y}$ is denoted
by $S^2(\bar{y})$:

$$S^2(\bar{y}) = \frac{S^2}{n}\left(\frac{N-n}{N}\right) = \frac{S^2}{n}(1 - f),$$

where $f = \frac{n}{N}$ is called the sampling fraction.
The reciprocal of f, that is N/n, is known as the sampling weight or inverse sampling fraction.

Standard error of sample means. The square root of the variance of $\bar{y}$ is called the
standard error of the means of samples of size n, that is,

$$S(\bar{y}) = \frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}.$$
It is important to note that the standard error varies with the size of the sample, as we
would expect it to be. As the sample size increases, the standard error becomes smaller
and smaller. This is shown in the following:

The factor $\frac{N-n}{N}$ in the formula for the variance of $\bar{y}$ is called the finite population
correction (fpc). As a rule of thumb, if $n \le 0.05N$ we can ignore $\frac{N-n}{N}$, since its value will
be close to 1. Otherwise, we should include it in the formula in order not to severely
overestimate the variance of $\bar{y}$.
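The variance formula with the finite population correction can be checked against the small population of five household sizes used earlier. The sketch below is an illustration only (variable names are assumptions): it computes the variance of the sample mean once from the formula $(S^2/n)(1 - f)$ and once directly, by averaging squared deviations over all possible samples.

```python
from itertools import combinations
from statistics import fmean, variance

population = [3, 5, 7, 9, 11]   # household sizes of the five HUs
N, n = len(population), 2

S2 = variance(population)        # divisor N - 1, i.e. S^2
f = n / N                        # sampling fraction
formula_var = (S2 / n) * (1 - f)

# Direct route: mean squared deviation of every possible sample mean
# from the population mean.
pop_mean = fmean(population)
sample_means = [fmean(s) for s in combinations(population, n)]
direct_var = fmean((m - pop_mean) ** 2 for m in sample_means)

print("S^2 =", S2)
print("formula:", formula_var)
print("direct: ", direct_var)
```

Both routes give a variance of 3 for this population. Without the correction, $S^2/n = 5$ would overstate it considerably, because here the sample is 40 percent of the population.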

Approach to Normal Distribution


It is a theoretical fact that as the sample size increases, the sample estimates differ less
and less from the expected value, and at the same time the standard error becomes
smaller and smaller. In practical sampling problems, where a reasonably large sample is
used (generally 30 or more cases), the distribution of sample results over all possible
samples approximates very closely the normal distribution - the familiar bell-shaped
curve.
For this distribution, the probabilities of being within a fixed range of the average value
are well known and have been published. These probabilities depend solely on the value
of the standard error. For example, the probability of being within one standard error is
about 68 percent; for two standard errors, it is about 95 percent; for three standard errors,
it is about 99.7 percent.

The implications are of fundamental importance to sampling theory. Suppose we have
drawn a simple random sample from a population, have computed the mean from the
sample ($\bar{y}$), and have estimated the true standard error of the mean, $S(\bar{y})$, by means of
$s(\bar{y})$. How can we infer the precision of this particular sample result? If we set an interval
based on $s(\bar{y})$ around the sample estimate, the interval $\bar{y} \pm s(\bar{y})$ will cover the true
mean about two-thirds of the time. Similarly, $\bar{y} \pm 2s(\bar{y})$ gives a confidence interval that
covers the true mean about 95 percent of the time, and $\bar{y} \pm 3s(\bar{y})$ does so about 99.7
percent of the time.
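A minimal Python sketch of this calculation follows. The population size N = 100 and the ten sample values are hypothetical, chosen only to keep the arithmetic visible; with so small a sample the normal approximation is rough, since the text recommends about 30 or more cases.

```python
import math
import statistics

N = 100                                   # hypothetical population size
sample = [4, 6, 5, 7, 3, 5, 6, 4, 5, 5]   # hypothetical SRS drawn without replacement
n = len(sample)

y_bar = statistics.fmean(sample)          # sample mean
s2 = statistics.variance(sample)          # sample variance, divisor n - 1
se = math.sqrt((s2 / n) * (N - n) / N)    # estimated standard error s(y-bar), with fpc

for k, coverage in ((1, "about 68%"), (2, "about 95%"), (3, "about 99.7%")):
    lower, upper = y_bar - k * se, y_bar + k * se
    print(f"{coverage}: {lower:.3f} to {upper:.3f}")
```

Here the sample mean is 5 and the estimated variance of the mean is (12/9)/10 × 0.9 = 0.12, so each interval is 5 plus or minus a multiple of the square root of 0.12.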
2.9 Sampling and Non-sampling Errors

Estimates are subject to both sampling errors and non-sampling errors. Sampling error
arises because information is not collected from the entire target population, but rather
from some portion of it. Through the use of scientific sampling procedures, however, it is
possible to estimate from the sample data the range within which the true population
value (parameter) is likely to be with a known probability.

Non-sampling error, on the other hand, is defined as a residual category consisting of all
other errors which are not the result of the data having been collected from only a
sample. These include errors made by respondents, enumerators, supervisors, office
clerical staff, key coding operators, etc.
Total Error (Mean Square Error). The total error is the sum of all errors about a sample
estimate, both sampling and non-sampling, both variable and systematic. An illustration
of the composition of the total error follows:

$$MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + (Bias)^2.$$

In practice, the bulk of sampling error consists of variable error, and by contrast the bulk
of non-sampling error is bias.
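The decomposition of the mean square error follows by adding and subtracting $E(\hat{\theta})$ inside the square. A short derivation, using only the definitions $Var(\hat{\theta}) = E[(\hat{\theta} - E(\hat{\theta}))^2]$ and $Bias = E(\hat{\theta}) - \theta$, is sketched here:

```latex
\begin{align*}
MSE(\hat{\theta})
  &= E\big[(\hat{\theta} - \theta)^2\big] \\
  &= E\big[\big(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\big)^2\big] \\
  &= E\big[(\hat{\theta} - E(\hat{\theta}))^2\big]
     + 2\,\big(E(\hat{\theta}) - \theta\big)\,E\big[\hat{\theta} - E(\hat{\theta})\big]
     + \big(E(\hat{\theta}) - \theta\big)^2 \\
  &= Var(\hat{\theta}) + (Bias)^2 .
\end{align*}
```

The cross term vanishes because $E[\hat{\theta} - E(\hat{\theta})] = 0$.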
2.10 Sample survey Techniques

2.10.1 Simple Random Sampling (SRS)

• It is a process for selecting n sampling units one at a time from a population of N
sampling units so that each sampling unit has an equal chance of being in the
sample.

• Every possible combination of n sampling units has the same chance of being
chosen.

• Selection of one sampling unit at a time with equal probability may be
accomplished by either sampling with replacement or without replacement. Almost
all, if not all, samples are selected without replacement.

• Using a table of random numbers to select the units satisfies this definition of
simple random sampling.
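In place of a printed table of random numbers, a pseudo-random number generator can make the selection. The sketch below uses Python's standard `random` module; the frame of 20 household identifiers, the sample size and the seed are hypothetical and serve only to illustrate the two selection modes.

```python
import random

rng = random.Random(2015)                      # seeded only to make the example repeatable
frame = [f"HH{i:03d}" for i in range(1, 21)]   # hypothetical frame of N = 20 households
n = 5

# Simple random sampling without replacement: no unit can appear twice.
without_replacement = rng.sample(frame, n)

# Sampling with replacement: each draw is made from the full frame,
# so the same unit may be selected more than once.
with_replacement = rng.choices(frame, k=n)

print(without_replacement)
print(with_replacement)
```

`random.sample` gives every subset of n units the same chance of selection, which matches the definition of simple random sampling without replacement given above.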

Advantages of SRS
• It is free of classification error.
• It requires minimum advance knowledge of the population.
• It best suits situations where not much information is available about the
population and data collection can be efficiently conducted on randomly distributed
items.
• If these conditions are not true, stratified sampling or cluster sampling may be a
better choice.

Disadvantages of SRS
• If the population is widely dispersed, it may be extremely costly to reach the units.
• A current list of the whole population we are interested in (sampling frame) may
not be readily available.
• Or perhaps, the population itself is not homogeneous and the sub-groups are very
different in size. In such a case, precision can be increased through stratified
sampling.

2.10.2 Survey sampling for Mean


Notation. The notation defined in this section is appropriate not only for simple random
sampling, but also for most survey designs. Capital letters refer to population values and
lower case (small) letters denote corresponding sample values. A bar (¯) over a letter
denotes an average or mean value and a hat (^) over a letter indicates an estimate.

N = the total number of units in a population; n = the total number of units in a
sample.
$Y_i$ = value of the characteristic as measured on the ith unit in the population, for i = 1, 2,
..., N; $y_i$ = value of the characteristic as measured on the ith unit in the sample, for i =
1, 2, ..., n.
$Y = \sum_{i=1}^{N} Y_i$ = total value of a characteristic in the population; $y = \sum_{i=1}^{n} y_i$ = total value of
a characteristic in the sample.
$\bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ = population mean; $\bar{y} = \frac{y}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$ = sample mean.

The population variance and sample variance have been introduced above.
Note. In simple random sampling, $s^2$ is an unbiased estimate of $S^2$.

Population values, their respective estimates and measure of precision
The sample estimate of the population total value, Y, is denoted by $\hat{Y}$, and can be written
as:

$$\hat{Y} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i,$$

where $\bar{y}$ is the estimate of the population average, $\bar{Y}$, and is given by
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. The sampling error of the estimate $\hat{Y}$ is

$$S(\hat{Y}) = \frac{NS}{\sqrt{n}}\sqrt{\frac{N-n}{N}},$$

and the sampling error of $\bar{y}$ is

$$S(\bar{y}) = \frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}.$$

The corresponding formulas for the estimated sampling errors are

$$s(\hat{Y}) = \frac{Ns}{\sqrt{n}}\sqrt{\frac{N-n}{N}}, \qquad s(\bar{y}) = \frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N}}.$$

Relative Standard Error


Often we wish to consider not the absolute value of the standard error, but rather its value
in relation to the magnitude of the statistic to be estimated (mean, total, etc.). For this
purpose, one can express the standard error as a proportion (or a percent) of the value
being estimated. This form is called the relative standard error or coefficient of variation
and is denoted by the symbol CV when the relation of the standard error is to the
absolute value of the mean (with parentheses to indicate the statistic to which the error
applies). One advantage of expressing errors as CVs is that they are unitless, unlike absolute
measures such as the standard deviation and the standard error. The CV is useful when
making comparisons because no units enter into play. The population CV refers to the
relative standard error of means of samples of 1 unit (that is, the population standard
deviation expressed as a proportion of the population mean) and it's denoted simply by
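CV (not followed by a parenthesis). Thus, for the estimate of the total, the true coefficient
of variation and its sample estimate are:

\[CV(\hat{Y}) = \frac{S(\hat{Y})}{Y} \quad\text{and}\quad cv(\hat{Y}) = \frac{s(\hat{Y})}{\hat{Y}} = \frac{s(\bar{y})}{\bar{y}}.\]

For the estimate of the mean, they are:

\[CV(\bar{y}) = \frac{S(\bar{y})}{\bar{Y}} \quad\text{and}\quad cv(\bar{y}) = \frac{s(\bar{y})}{\bar{y}}.\]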
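
The estimators of this section can be sketched in a few lines of Python. This is only an illustration under simple random sampling without replacement; the function name `srs_estimates` and the five sample values are hypothetical and do not come from the notes.

```python
import math
import statistics


def srs_estimates(sample, N):
    """Estimates from a simple random sample drawn without replacement.

    `sample` holds the observed values y_i and `N` is the population size.
    """
    n = len(sample)
    y_bar = statistics.mean(sample)       # sample mean, estimates the population mean
    s = statistics.stdev(sample)          # sample standard deviation (divisor n - 1)
    fpc = math.sqrt((N - n) / N)          # finite population correction factor
    se_mean = s / math.sqrt(n) * fpc      # estimated sampling error of the mean
    return {
        "mean": y_bar,
        "total": N * y_bar,               # estimated population total
        "se_mean": se_mean,
        "se_total": N * se_mean,          # estimated sampling error of the total
        "cv": se_mean / y_bar,            # relative standard error (same for mean and total)
    }


# Hypothetical data: 5 units sampled from a population of N = 100.
estimates = srs_estimates([12, 15, 9, 20, 14], N=100)
print(estimates)
```

Because the total is just N times the mean, its estimated sampling error is N times that of the mean, which is why a single cv serves both estimates.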

2.10.3 Sampling for Proportions

Types of statistics for which proportions are used


Proportions arise in two ways in statistical analysis. First of all, we are frequently
interested in a statistic that is a proportion, rather than a total or an average; for example,
the proportion of the population that is unemployed, or the percentage of families with

income greater than a certain amount, or the proportion of business firms interested in
purchasing a particular product. Secondly, it may be desired to classify a population into
a number of groups, and to find the percentage of the total population in each of these
groups. The groups may have a natural ordering as in distribution by age (0 to 4 years, 5
to 9, 10 to 14, etc.) or income classes; or they may be groups having no natural order,
such as those in an industrial classification of business firms, where the groups can be
arranged in a number of ways. The analysis is the same whenever the proportion of the
total in each group is the statistic to be measured.

Relationship to Previous Theory


Suppose we think of the total population and the sample in the following way. Consider a
particular class of units in which we are interested, and use the following notation:
\(A\) = total number of units in that class in the population.
\(a\) = number of units in that class in the sample.
\(P\) = true proportion of units in that class in the population.
\(p\) = proportion in that class in the sample.
\(Q\) = population proportion not in that class (\(Q = 1 - P\)).
\(q\) = proportion not in that class in the sample (\(q = 1 - p\)).
Note. \(P = \frac{A}{N}\), \(p = \frac{a}{n}\).

All of the formulas discussed in previous sections can be applied to this particular case by
considering each member of the population as having a characteristic which can have
only one of two values, either 0 or 1. If the member is in a particular class in which we are
interested, the value assigned is 1; if the member is not in the class, the value is 0.
Examining the entire population, we can see that the A members of the class each have
a value of 1; the rest have a value of 0. Adding up the values for all elements of the
population, we get A. In other words, \(A\) can be considered as the equivalent of
\(Y = \sum_{i=1}^{N} Y_i\). Similarly, \(P = \frac{A}{N}\) can be considered in the same way as
\(\bar{Y} = \frac{Y}{N}\).

Applicable formulas
In sampling for proportions, the following formulas are applicable (with simple random
sampling):

\[\hat{P} = p = \frac{a}{n} \quad\text{and}\quad \hat{A} = pN.\]

That is, an estimate of the proportion in the population is obtained by using the sample
proportion, and an estimate of the total number of units having the characteristic is
obtained by multiplying the sample proportion by the total number of units in the
population. Also,

\[\sigma^2 = PQ, \qquad S^2 = \frac{NPQ}{N-1}.\]

The population variance is \(PQ\). Note that it is the variance of the population distribution
giving the value of 1 or 0 to an element depending on whether or not it is in the class
(whether it has the attribute in question). It can still be estimated by \(pq\), unless \(n\) is very
small (for example \(n < 30\)), in which case the formula is \(s^2 = \frac{n}{n-1}\,pq\).

The variance of the sample proportion is

\[\sigma^2(\hat{P}) = \frac{N-n}{N-1}\cdot\frac{PQ}{n}.\]

An unbiased estimate of the variance of \(p\), derived from the sample, is

\[s^2(\hat{p}) = \frac{pq}{n-1}\cdot\frac{N-n}{N}.\]
Simplification for large populations
With a large population and a small sampling rate, the finite population correction (fpc) can
be ignored and the formulas become simpler.
Simplified formulas

| Quantity | True value | Estimate |
|---|---|---|
| Variance of the mean | \(S^2(\bar{y}) = \frac{S^2}{n}\) | \(s^2(\bar{y}) = \frac{s^2}{n}\) |
| Variance of a proportion | \(S^2(\hat{P}) = \frac{PQ}{n}\) | \(s^2(\hat{p}) = \frac{pq}{n-1}\) |
| Coefficient of variation of the mean | \(CV^2(\bar{y}) = \frac{CV^2}{n}\) | \(cv^2(\bar{y}) = \frac{cv^2}{n}\) |
| Coefficient of variation of a proportion | \(CV^2(\hat{P}) = \frac{Q}{Pn}\) | \(cv^2(\hat{p}) = \frac{q}{p(n-1)}\) |
| Variance of a total | \(S^2(\hat{Y}) = N^2\frac{S^2}{n}\) | \(s^2(\hat{Y}) = N^2\frac{s^2}{n}\) |
| Variance of total number of units having an attribute | \(S^2(\hat{A}) = N^2\frac{PQ}{n}\) | \(s^2(\hat{A}) = N^2\frac{pq}{n-1}\) |
| CV of a total | \(CV^2(\hat{Y}) = CV^2(\bar{y}) = \frac{CV^2}{n}\) | \(cv^2(\hat{Y}) = cv^2(\bar{y}) = \frac{cv^2}{n}\) |
| CV of total number of units having an attribute | \(CV^2(\hat{A}) = CV^2(\hat{P}) = \frac{Q}{nP}\) | \(cv^2(\hat{A}) = cv^2(\hat{p}) = \frac{q}{p(n-1)}\) |

2.10.4 Sample size determination and selection of sample points under SRS

One of the first questions which a statistician is called upon to answer in planning a
sample survey refers to the size of the sample required for estimating a population
parameter with a specified precision. Making a decision about the size of the sample for
the survey is important. Too large a sample implies a waste of resources, and too small a

sample diminishes the utility of the results.
Specific considerations for determining the sample size

When considering sample size determination, there are three very important concerns:
accuracy, practicality, and efficiency.
Accuracy can be defined as an inverse measure of the total error. Total error is the sum
of sampling error (SE) and non-sampling error, (NSE). Sampling error arises because
only a part of the population is observed, and not all of it. The terms PRECISION
and RELIABILITY are associated with sampling error. Estimator A is more precise or
more reliable than estimator B if the sampling error of A is smaller than the sampling error
of B. Non-sampling errors are usually biases which are very often due to poor quality
control of the survey operations (poor questionnaire design; interviewers that are not well
trained; response errors; etc.)
Practicality. To obtain an accurate estimate, both sampling and non-sampling errors
must be reduced. However, accuracy may come into conflict with practicality because:
1. to reduce sampling errors and increase precision, the sample size must be large.
2. too large a sample can impose an excessive burden on the limited resources
available (and resources are usually very limited) and increase the likelihood of
non-sampling errors.

Efficiency. A further concern is that a given sample size can produce different levels of
precision depending on which sampling techniques are chosen. This concept is known as
the statistical efficiency of the design. The most efficient design is the one that gives the
most precision for the same sample size. Therefore, expert sample design is needed in
the determination of the optimal sample size.

Example 6: A population consists of N = 5000 persons. A simple random sample without
replacement (WOR) of size n = 50 included 10 persons of Chinese descent, so p = 10/50 = 0.20.
A 95% confidence interval for P, the proportion of persons of Chinese descent in the
population, is:

\[p \pm 2\sqrt{(1-f)\frac{pq}{n}} = 0.20 \pm 2\sqrt{\left(1-\frac{50}{5000}\right)\frac{(0.20)(0.80)}{50}} = (0.087,\ 0.313),\]

where \(f = n/N\) is the sampling fraction.

The conclusion is that between 8.7% and 31.3% of the population is of Chinese descent;
this interval is too wide to be useful. There are two ways in which a narrower interval
could be obtained:

• by lowering the confidence level, or
• by increasing the sample size.
There is a point at which lowering the confidence level is not attractive. We shall consider
the problem of determining the sample size necessary to produce a fixed level of
precision.
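
The interval in Example 6 can be reproduced with a short Python sketch. The function name `proportion_interval` is hypothetical; it follows the example in using pq/n (rather than pq/(n - 1)) and a multiplier of k = 2 for roughly 95% confidence.

```python
import math


def proportion_interval(a, n, N, k=2):
    """Interval p +/- k * sqrt((1 - f) * p * q / n) for a population proportion."""
    p = a / n                 # sample proportion
    q = 1 - p
    f = n / N                 # sampling fraction
    half_width = k * math.sqrt((1 - f) * p * q / n)
    return p - half_width, p + half_width


# Example 6: 10 of the n = 50 sampled persons, from a population of N = 5000.
low, high = proportion_interval(a=10, n=50, N=5000)
print(f"{low:.3f} to {high:.3f}")
```

Calling the function again with a larger n (and a matching value of a) shows how quickly the half-width shrinks as the sample grows.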
The following eight considerations are taken into account when determining the sample size.
We will study each one in detail.
• Degree of precision desired
• Formula to connect n with desired precision
• Advance estimates of variability in population
• Cost and operational constraints
• Expected sample loss due to non-response
• Number of different characteristics for which specified precision is required
• Population subdivisions for which separate estimates of a given precision are
required; these are also called domains of estimation.
• Expected gain or loss in efficiency

Degree of precision
Precision of an estimate refers to the amount of variable error, mainly sampling error,
contained in an estimate. To lower the sampling error, that is, to increase the precision,
we want n to be sufficiently large. Therefore, we decide on a target value for the precision
of the estimate. The degree of precision desired can be stated in terms of:

The absolute error (E) for the estimate \(\hat{\theta}\) is expected to satisfy the probability relation:

\[P\left(|\hat{\theta} - \theta| \le E\right) = 1 - \alpha.\]

In the above, \(\hat{\theta}\) is an estimate of \(\theta\) and \((1 - \alpha)\) is the degree of confidence desired. The
absolute error E is measured in the same unit used to measure the variable.
For example, E = 5 hectares or E = $10,000 or E = 25 persons.

The relative error (RE) for the estimate \(\hat{\theta}\) is expected to satisfy the probability relation:

\[P\left(\frac{|\hat{\theta} - \theta|}{\theta} \le RE\right) = 1 - \alpha.\]

This is E expressed as a proportion (or percentage) of the true value of the parameter
being estimated. For example, if E = 5 hectares and the true value of the parameter is
100 hectares, then RE = 5/100 = 0.05 or 5%.

The target coefficient of variation (cv) for the estimate (\(v_0\))
We set the cv (also known as the relative standard error) for the estimate equal to a
target value \(v_0\). For example, we can have:

\[\frac{\sqrt{Var(\hat{\theta})}}{\theta} = 0.05 = 5\%.\]

Depending on which of the three ways we use to specify the precision, the formula for n
will be different. The values of E, RE and \(v_0\) are usually decided by the user of the data in
conjunction with a statistician.

Formula that connects n (sample size) with desired degree of precision
\(S^2\) = the population variance; \(\sigma^2\) could be used instead.
\(n\) = the desired sample size.
\(CV\) = the population coefficient of variation, \(S/\bar{Y}\), where \(\bar{Y}\) is the population mean.
\(N\) = number of units in the population.
\(k\) = 1 for 68% confidence, 2 for 95% confidence, 3 for 99.7% confidence.
\(cv(\hat{\theta})\) = the coefficient of variation of the estimator, where \(\hat{\theta} = \bar{y}\) or \(\hat{Y}\) or \(p\).
\(v_0\) = specified target value for the estimate's cv, \(\sqrt{Var(\hat{\theta})}/\theta\).
E = absolute error, RE = relative error.

Note: The level of confidence states the probability that the n determined will provide the
degree of precision specified. For example, a 95% level of confidence means that, except
for a small chance (5%), the precision specified will be reached with the calculated n.
This is equivalent to accepting a 5% risk that the true value will lie outside the range
specified in the confidence interval.
The sample size needed to estimate a mean with absolute error E.
The sampling error of a mean using a simple random sample is given by

\[S(\bar{y}) = \frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \tag{1}\]

Now, \(E = kS(\bar{y})\), where k is a multiple of the sampling error, selected to achieve the
specified degree of confidence. Therefore, if we substitute \(E/k\) for \(S(\bar{y})\) in equation (1), we get:

\[E = k\,\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \tag{2}\]

If we solve for n in equation (2), we get:

\[n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N}. \tag{3}\]

If the population size is large and \(n \le 0.05N\), the finite population correction factor in
equation (2) can be ignored because its effect would be minimal. In this case, we have:

\[n = \frac{k^2 S^2}{E^2}.\]

Example 7: Consider a population consisting of 1,000 farms for which the population
variance of the number of cattle per farm is 250 (N = 1,000 and \(S^2 = 250\)). Suppose we
want to estimate the average number of cattle per farm from a sample. We wish to have
reasonable confidence that the estimate will be close to the true value. Suppose the
sample estimate is to be in error by no more than 1 (one head of cattle) from the true
average, and we require an assurance of 95 chances out of 100 that the error will be no
larger than 1. In this case, E = 1, N = 1,000, and \(S^2 = 250\).
Applying equation (3) to the case with k = 2 gives a value of n greater than or equal to

\[n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N} = \frac{4(1000)(250)}{4(250) + 1(1000)} = 500.\]

If in the same situation we are satisfied with an error of not more than 3, with a
confidence level of 95 percent, the only change in the formula would be in the values of E
and \(E^2\), as follows: E = 3 and \(E^2 = 9\).

\[n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N} = \frac{4(1000)(250)}{4(250) + 9(1000)} = 100.\]

Example 8: We wish to estimate the average age of 2,000 seniors on a particular college
campus. How large an SRS must be taken if we wish to estimate the age within 2 years
of the true average, with 95% confidence? Assume \(S^2 = 30\).

With E = 2 and k = 2 (95% confidence), and ignoring the fpc,

\[n = \frac{k^2 S^2}{E^2} = \frac{2^2(30)}{2^2} = 30 \text{ seniors}.\]
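
Equation (3) and its simplified form can be checked numerically. The sketch below is a hypothetical helper (`sample_size_mean` is not a name from the notes) that rounds the result up to a whole unit; it takes k = 2 for 95% confidence, as in Examples 7 and 8.

```python
import math


def sample_size_mean(S2, E, k=2, N=None):
    """Sample size for estimating a mean to within absolute error E.

    With N given, uses n = k^2 N S^2 / (k^2 S^2 + E^2 N) (equation 3);
    with N omitted, ignores the fpc and uses n = k^2 S^2 / E^2.
    """
    if N is None:
        n = k**2 * S2 / E**2
    else:
        n = k**2 * N * S2 / (k**2 * S2 + E**2 * N)
    # Round up to a whole unit; the inner round() guards against float noise.
    return math.ceil(round(n, 6))


n_cattle_e1 = sample_size_mean(S2=250, E=1, N=1000)   # Example 7 with E = 1
n_cattle_e3 = sample_size_mean(S2=250, E=3, N=1000)   # Example 7 with E = 3
n_seniors = sample_size_mean(S2=30, E=2)              # Example 8, fpc ignored
print(n_cattle_e1, n_cattle_e3, n_seniors)
```

Tripling the tolerated error from 1 to 3 cuts the required sample to a fifth here, which illustrates how strongly n depends on E.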

2.10.5 Sample size needed to estimate a proportion with absolute error E

The sample size n to estimate a population proportion P is obtained from equation (3) by
setting \(S^2 = PQ\):

\[n = \frac{k^2 NPQ}{k^2 PQ + E^2 N}. \tag{4}\]

If the population size is large and \(n \le 0.05N\), the fpc in equation (4) can be ignored
because its effect would be minimal. In this case, we have:

\[n = \frac{k^2 PQ}{E^2}.\]

Suppose we would like to estimate P, the proportion of persons of Chinese descent, to
within ±3%, with 95% confidence. What sample size do we have to choose to achieve
this target? Assume P to be no larger than 1/4, so that PQ is at most (0.25)(0.75).

\[n = \frac{k^2 PQ}{E^2} = \frac{2^2(0.25)(0.75)}{(0.03)^2} \approx 834.\]
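
As a rough check of the proportion formulas, the following sketch evaluates equation (4) and its simplified form. The helper name `sample_size_proportion` is hypothetical. It is called with P = 0.25, the value used in the calculation above, and with P = 0.5, which maximizes PQ and is therefore the most conservative planning value when nothing is known about P.

```python
import math


def sample_size_proportion(P, E, k=2, N=None):
    """Sample size for estimating a proportion to within absolute error E."""
    Q = 1 - P
    if N is None:
        n = k**2 * P * Q / E**2                               # fpc ignored
    else:
        n = k**2 * N * P * Q / (k**2 * P * Q + E**2 * N)      # equation (4)
    # Round up to a whole unit; the inner round() guards against float noise.
    return math.ceil(round(n, 6))


n_planned = sample_size_proportion(P=0.25, E=0.03)            # as in the calculation above
n_worst_case = sample_size_proportion(P=0.5, E=0.03)          # most conservative choice of P
n_with_fpc = sample_size_proportion(P=0.25, E=0.03, N=5000)   # hypothetical N = 5000
print(n_planned, n_worst_case, n_with_fpc)
```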

Sample size needed to estimate a total with absolute error E.

Using \(S(\hat{Y}) = \frac{NS}{\sqrt{n}}\sqrt{\frac{N-n}{N}}\) and letting \(E = kS(\hat{Y})\), we get the following formula:

\[n = \frac{k^2 N^3 S^2}{k^2 N^2 S^2 + E^2 N} = \frac{k^2 N^2 S^2}{k^2 N S^2 + E^2}. \tag{a}\]

If we ignore the fpc, we have

\[n = \frac{k^2 N^2 S^2}{E^2}. \tag{b}\]

Sample size needed to estimate the number of units that possess a certain
attribute with absolute error E

To obtain the n necessary to estimate A, the number of units that possess a certain
characteristic, substitute PQ in place of \(S^2\) in equations (a) and (b).
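
A small sketch of equations (a) and (b) follows. The helper name `sample_size_total` is hypothetical, and the inputs reuse the farm population of Example 7 (N = 1,000, variance 250) with an assumed tolerance of 1,000 head on the total, which corresponds to 1 head per farm on the mean.

```python
import math


def sample_size_total(S2, E, N, k=2, use_fpc=True):
    """Sample size for estimating a population total to within absolute error E."""
    if use_fpc:
        n = k**2 * N**2 * S2 / (k**2 * N * S2 + E**2)   # equation (a)
    else:
        n = k**2 * N**2 * S2 / E**2                      # equation (b)
    # Round up to a whole unit; the inner round() guards against float noise.
    return math.ceil(round(n, 6))


n_total = sample_size_total(S2=250, E=1000, N=1000)
n_total_no_fpc = sample_size_total(S2=250, E=1000, N=1000, use_fpc=False)
print(n_total, n_total_no_fpc)
```

The first result matches the sample size for the mean with E = 1, as expected, since an error of E on the mean is an error of NE on the total. The second shows that dropping the fpc is not harmless when the sample is a large share of the population.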

2.10.6 Sample size formulas when the error is expressed in relative terms (RE)

We can obtain formulas for estimates when the desired error is expressed in relative
terms instead of absolute terms. If the relative error (RE) is a proportion of the quantity
being estimated, substitute \(\frac{RE}{k} = cv(\bar{y})\) in (a) or (b) above. To avoid confusing the
estimated coefficient of variation with the population coefficient of variation (CV), we
denote the estimated coefficient of variation by cv. We have:

\[n = \frac{k^2 N (CV)^2}{k^2 (CV)^2 + N(RE)^2}.\]

Note 1: \(\frac{RE}{k} = cv(\bar{y}) = \frac{s(\bar{y})}{\bar{y}}\). This applies to both the mean and the total. If we ignore
the fpc, the above equation becomes

\[n = \frac{k^2 (CV)^2}{(RE)^2}.\]

Note 2: In actual practice, we usually do not know \(S^2\) or \((CV)^2\) in advance of the
survey. Instead, we use rough estimates of \(S^2\) or \((CV)^2\)
obtained by the methods discussed earlier.
Note 3: For the mean and the total, it is better to express the variance in relative rather
than absolute terms, for two reasons:

1. Most importantly, because a population's relative variance is more stable than its
absolute variance. A guess or estimate of the population coefficient of variation CV

(from past data or from similar populations) is likely to be closer to the true value
than a guess or estimate of the variance.
2. The formula for n is the same for estimator of means or totals when it is expressed
in terms of the coefficient of variation.
Note 4: To estimate the proportion P, it is preferable to use the absolute error previously
discussed because the proportion is itself a relative quantity, so that taking the
percentage of a percentage can become confusing.
To obtain the formula for the sample size required to estimate a population proportion
when the error is expressed as relative error (RE), use the equation above:

\[n = \frac{k^2 N (CV)^2}{k^2 (CV)^2 + N(RE)^2}.\]

Replacing \((CV)^2 = \frac{Q}{P}\), we get

\[n = \frac{k^2 NQ}{k^2 Q + NP(RE)^2}.\]

If we ignore the fpc, this equation yields

\[n = \frac{k^2 Q}{(RE)^2 P}.\]

Example 9: We would like to carry out a survey to estimate the total area in hectares of
the farms used by a population. We desire that the estimate be within 10% of the true
value. How many farms should be surveyed? (In a pilot survey, we estimated the
population coefficient of variation, CV, of the variable farm size to be 1.2). Use 95%
confidence.

\[n = \frac{k^2 (CV)^2}{(RE)^2} = \frac{2^2 (1.2)^2}{(0.1)^2} = 576 \text{ farms}.\]
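
The relative-error formulas can be sketched as a single helper that takes the squared coefficient of variation, so that the same code covers a mean or total (pass CV squared) and a proportion (pass Q/P). The name `sample_size_relative`, the population size of 2,000 farms, and the value P = 0.25 are assumptions made for illustration only.

```python
import math


def sample_size_relative(cv_squared, RE, k=2, N=None):
    """Sample size for relative error RE, given the squared population CV."""
    n0 = k**2 * cv_squared / RE**2        # fpc ignored
    if N is None:
        n = n0
    else:
        # Algebraically the same as k^2 N CV^2 / (k^2 CV^2 + N RE^2).
        n = n0 / (1 + n0 / N)
    # Round up to a whole unit; the inner round() guards against float noise.
    return math.ceil(round(n, 6))


n_farms = sample_size_relative(cv_squared=1.2**2, RE=0.1)            # Example 9
n_farms_fpc = sample_size_relative(cv_squared=1.2**2, RE=0.1, N=2000)  # hypothetical N
P = 0.25                                                             # hypothetical proportion
n_prop = sample_size_relative(cv_squared=(1 - P) / P, RE=0.1)
print(n_farms, n_farms_fpc, n_prop)
```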

Sample size formulas when the error is expressed in terms of CV

The equation \(n = \frac{k^2 N (CV)^2}{k^2 (CV)^2 + N(RE)^2}\) can be expressed in terms of the coefficient of
variation. If \(cv(\hat{\theta}) = \frac{RE}{k} = v_0\) is a specified target value for the estimated coefficient of
variation, then dividing the numerator and denominator by \(N(RE)^2\) gives

\[n = \frac{\left(\frac{k}{RE}\right)^2 (CV)^2}{1 + \frac{1}{N}\left(\frac{k}{RE}\right)^2 (CV)^2} = \frac{\left(\frac{CV}{v_0}\right)^2}{1 + \frac{1}{N}\left(\frac{CV}{v_0}\right)^2}. \tag{c}\]

If we ignore the fpc,

\[n = \left(\frac{k}{RE}\right)^2 (CV)^2 = \left(\frac{CV}{v_0}\right)^2. \tag{d}\]

Equations (c) and (d) work for both mean and total.
Example 10: A survey is to be carried out to estimate the total area in hectares of the farms
in a population. The estimate should be within 10 percent of the true value with 95
percent confidence. How many farms should be surveyed? [In a pilot survey, the estimated
population coefficient of variation CV of the variable "farm size" was found to be 1.2.]

In this case, k = 2, CV = 1.2, and RE = 0.1, so \(v_0 = \frac{RE}{k} = \frac{0.1}{2} = 0.05\). Thus
\(n = \left(\frac{1.2}{0.05}\right)^2 = 576\).
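
Equations (c) and (d) reduce to a one-line calculation once the target cv is fixed. The sketch below uses a hypothetical helper, `sample_size_target_cv`, and a hypothetical population size of 5,000 to show the effect of the fpc on the figure from Example 10.

```python
import math


def sample_size_target_cv(CV, v0, N=None):
    """Sample size needed for the estimate's cv to reach the target v0."""
    n = (CV / v0) ** 2                 # equation (d), fpc ignored
    if N is not None:
        n = n / (1 + n / N)            # equation (c), with the fpc
    # Round up to a whole unit; the inner round() guards against float noise.
    return math.ceil(round(n, 6))


n_example_10 = sample_size_target_cv(CV=1.2, v0=0.1 / 2)        # v0 = RE / k
n_with_fpc = sample_size_target_cv(CV=1.2, v0=0.05, N=5000)     # hypothetical N
print(n_example_10, n_with_fpc)
```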

Example 11: The results from a pilot test used to estimate \(\bar{Y}\) and S for the variable
'income' in a population of 5,000 households showed that \(\bar{Y}\) = $14,852 per household and
S = $12,300. A full scale survey is planned. What should be the sample size for this
survey if we want to estimate the mean income per household with a cv no larger than
5%?

The estimated population coefficient of variation cv and the resulting sample size n are:

\[cv = \frac{12300}{14852} = 0.828 = 82.8\%\]

\[n = \left(\frac{cv}{v_0}\right)^2 = \left(\frac{0.828}{0.05}\right)^2 \approx 275 \text{ households}.\]

2.10.7 Advance Estimate of Population Variance

In the preceding section, we noted that most of the sample size formulas are written in
terms of the population variance. In practice the population variance is unknown and it
must be estimated or guessed. There are five ways of estimating population variances for
sample size determination.
Method 1: Select the sample in two steps, the first being a simple random sample of
size \(n_1\) (the first sample) from which estimates \(s_1^2\) and \(p_1\) of \(S^2\) and P,
respectively, are obtained. Then use this information to determine the
required n (the final sample size).
Method 2: Use results of a pilot survey. This is one of the more commonly used
methods.
Method 3: Use the results of previous samples of the same or similar population.
Method 4: Guess about the structure of the population and use some mathematical
results.
Method 5: (Only for qualitative characteristics.) If the statistic to be measured is a
proportion, then make a fairly good guess of P (the proportion in the population).
Method 1 carries out the survey in two steps. In the first step, only a subsample (a
random part of the total sample) is enumerated. An analysis of this part permits one to

estimate the variance and to make revisions in the total size of the sample, if necessary.
In the second step, the remainder of the sample is enumerated in accordance with these
changes, if any. This method gives the most reliable estimates of \(S^2\) or P, but it is not
often used, since it slows down the completion of the survey.
Method 2 is one of the more commonly used methods. It serves many purposes,
especially if the feasibility of the main survey is in doubt. If the pilot survey is itself a
simple random sample, the preceding methods apply. But often the pilot work is restricted
to a part of the population that is convenient to handle or that will reveal the magnitude of
certain problems.
Method 3 is also a very commonly used method. This method points to the value of
making available, or at least keeping accessible, any data on standard errors obtained in
previous surveys. Unfortunately, the cost of computing standard errors in complex
surveys is high, and frequently only those standard errors needed to give a rough idea of
the precision of the principal estimates are computed and recorded. If suitable past data
are found, the value of \(S^2\) may require adjustment for time change. Experience indicates
that the variance of an item tends to change much more slowly over time than the mean
value of the item itself. Even if the mean value changes, the relative error may be quite
stable.

2.10.8 Expected Sample Loss Due to Non-response

If past experience indicates that a certain level of non-response can be present, we may
want to inflate the calculated sample size to compensate. This is because our
calculations were based on a 100 percent response. If we do not obtain all the interviews,
then the estimates will be based on a number smaller than the calculated n and will,
therefore, have a greater variance than expected.

Inflating Procedure
We compute the inflated sample size n* from the following relationship: n* = n/r, where r is
an estimate of the expected response rate and it can be obtained from previous rounds of
the same survey, previous experience with similar surveys, a pilot (pre-test), etc.
For example, we calculate n to be 1,000 units. Based on the results of a pilot survey, we
anticipate the response rate to be 70 percent.

Our inflated n will be: n* = 1,000/.70 = 1,429. If our assumption was correct, we should
get back 70% of 1,429 = 1,000.
Therefore, our estimates will be based on the same number of units as expected and the
target precision will be attained.
Note. Inflating the sample size when there is non-response only helps compensate for
the resulting loss in precision. It does nothing for diminishing the resulting non-response
bias.
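
The inflating procedure is a single division followed by rounding up. A minimal sketch, with a hypothetical function name and a basic check on the response rate:

```python
import math


def inflate_for_nonresponse(n, response_rate):
    """Inflated sample size n* = n / r, rounded up to a whole unit."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be greater than 0 and at most 1")
    # The inner round() guards against float noise before rounding up.
    return math.ceil(round(n / response_rate, 6))


n_star = inflate_for_nonresponse(1000, 0.70)   # figures from the example above
print(n_star)
```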

2.11 Systematic Sampling

Systematic sampling is a statistical method involving the selection of every kth element
from a sampling frame, where k, the sampling interval, is calculated as k = N/n. Using this
procedure, each element in the population has a known and equal probability of selection.
This makes systematic sampling functionally similar to simple random sampling. It is,
however, more efficient when the variance within the systematic samples is larger than the
variance of the population, and it is much less expensive to carry out.

The researcher must ensure that the chosen sampling interval does not hide a pattern.
Any pattern would threaten randomness. A random starting point must also be selected.

Systematic sampling is to be applied only if the given population is logically


homogeneous, because systematic sample units are uniformly distributed over the
population.
Example 12: Suppose a supermarket wants to study buying habits of their customers.
Using systematic sampling the management can choose every 10th or 15th customer
entering the supermarket and conduct the study on this sample.

General procedure for selecting a sample under systematic sampling

1. Assign serial numbers from 1 through N to the population units.
2. Calculate the sampling interval, SI = N/n:
   • for exactness, carry as many decimals as possible
   • you may round if you are doing this without a calculator, but you would be
     sacrificing exactness for convenience
3. Select a random number (RN) from a table of random numbers between 0 and the SI.
   This is called a random start (RS):
   • in the permitted range, exclude zero, but include the sampling interval; use as
     many digits as SI has, including decimals
   • if you are searching through a RN table, pretend the decimal point is not there
   • if you are using a calculator which only provides random numbers between zero
     and one, multiply this random number by the value of SI in order to get a random
     number between zero and SI. Remember to keep the decimals; do not round yet.
4. Begin the series of cumulated numbers with RS. Add SI to this first number to
   determine the second. Then, add SI to the second number to get the third, and so on.
   • Do not round decimals during the addition process.
5. Stop cumulating when the last cumulated number exceeds N (discard this last
   number):
   • this should occur when you have cumulated n numbers
   • if you rounded SI before adding, you may not have exactly n.
6. Now go back and round all the cumulated numbers up to the next integer.
   • On the list of population units, circle the serial numbers that correspond to these
     integers. These are the selected units.
Example 13: Suppose that a village has 285 housing units (HUs) and we wish to select a
systematic sample of 12 HUs for a survey. Assume the list is randomly ordered.
We want to determine the HUs that will be in the sample.
1. SI = N/n = 285/12 = 23.75
2. RN between 0001 and 2375 is 1979
3. RS = 19.79
4. Series of cumulated numbers (each rounded up to give the selected HU):

No.   Cumulated number           Selected HU
1     19.79                      20
2     19.79 + 23.75 = 43.54      44
3     43.54 + 23.75 = 67.29      68
4     91.04                      92
5     114.79                     115
6     138.54                     139
7     162.29                     163
8     186.04                     187
9     209.79                     210
10    233.54                     234
11    257.29                     258
12    281.04                     282
13    304.79                     Discarded
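
The selection procedure can be sketched in Python as follows. The function name `systematic_sample` is hypothetical. The sketch keeps the fractional sampling interval, draws the random start from the range above zero up to and including SI when none is supplied, and rounds each cumulated number up to the next integer, as in step 6. Multiplying the interval by the position, instead of adding it repeatedly, is a small design choice that avoids accumulating rounding error.

```python
import math
import random


def systematic_sample(N, n, random_start=None):
    """Select n serial numbers from 1..N using a fractional sampling interval."""
    si = N / n                                     # sampling interval, decimals kept
    if random_start is None:
        # random.random() lies in [0, 1), so this start lies in (0, SI].
        random_start = si * (1 - random.random())
    if not 0 < random_start <= si:
        raise ValueError("random start must be greater than 0 and at most SI")
    # Cumulated numbers RS, RS + SI, RS + 2*SI, ...; exactly n of them are at most N.
    cumulated = [random_start + i * si for i in range(n)]
    # Round each one up to the next integer; min() guards against float noise at N.
    return [min(N, math.ceil(value)) for value in cumulated]


# Example 13: N = 285 housing units, n = 12, random start RS = 19.79.
selected = systematic_sample(285, 12, random_start=19.79)
print(selected)
```

With the random start of Example 13 the function returns the twelve housing units listed in the last column of the table; calling it without `random_start` draws a fresh start each time.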

2.12 Stratified sampling

In simple random sampling, we do not try to force the sample to be representative of


different groups in the population. The tendency to be representative is inherent in the
procedure itself and the sampling error can be reduced only by increasing the size of
sample. However, if something is known in advance about a population, it may be
possible to use the information in stratification and thus reduce the sampling error. The
judgment of experts may be useful here.
Stratified random sampling is a method in which the elements of the population are
divided into groups (strata), and a simple random sample is selected for each group,
taking at least one element from each group (stratum). One element from each group is
sufficient to estimate the mean, but two are needed to estimate its reliability; generally
many more than two are needed to make the estimates sufficiently precise. The process
of establishing these groups is called stratification and the groups are called strata. The
strata may reflect regions of a country, densely populated or sparsely populated areas,
various ethnic or other groups.

In stratification we group together elements which are similar, so that the population
variance \(S_h^2\) within stratum h is small; at the same time, it is desirable that the means of
the several strata (\(\bar{Y}_h\)) be as different as possible. The letter h will be used to identify the
strata so that if L strata are created, h will go from 1 to L.

In stratified sampling, the probabilities of selection may be the same from group to group,
or they may be different. It is not necessary that all elements have the same chance of
selection, but the chance of each must be known. Under stratified random sampling all
the elements in a particular stratum have equal chances of being selected. While not
every combination of elements is possible, all of the possible samples (that is,
combinations of elements) that might be drawn have the same chance of occurring.
• Stratification is the process of grouping members of the population into relatively
homogeneous subgroups before sampling.
• When sub-populations vary considerably, it is advantageous to sample each
subpopulation (stratum) independently.
• The strata should be mutually exclusive: every element in the population must be
assigned to only one stratum. The strata should also be collectively exhaustive: no
population element can be excluded.
• Then random or systematic sampling is applied within each stratum.
• This often improves the representativeness of the sample by reducing sampling
error.
• It can produce a weighted mean that has less variability than the arithmetic mean
of a simple random sample of the population
Advantages of stratified sampling
• focuses on important subpopulations but ignores irrelevant ones.
• improves the precision.
• sampling equal number of observations from strata that vary widely in size may be
used to equate the statistical power of tests of differences between strata.
Disadvantages of stratified sampling
• can be difficult to select relevant stratification variables.
• not useful when there are no homogeneous subgroups.
• can be expensive.
• requires accurate information about the population; otherwise introduces bias.

Notation. We use the same notation as for simple random sampling, except that there
will be a subscript to indicate a particular stratum when we refer to information regarding
this stratum. Thus, N will represent the total number of elements in the population, as
before; but \(N_1\) will be the number in the first stratum, \(N_2\) will be the number in the second
stratum, etc. Similarly, n will be the total sample size; \(n_1\) will be the size of the sample in
the first stratum, \(n_2\) will be the size of the sample in the second stratum, etc. The
subscript h denotes the stratum and i the unit within the stratum. As in the case of simple
random sampling, capital letters refer to population values and lower case letters denote
corresponding sample values. The notation given in the following table will be used.

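| Measurement | Population | Sample | Sample estimate |
|---|---|---|---|
| Total number of elements | \(N\) | \(n\) | … |
| Number of strata | \(L\) | \(L\) | … |
| Number of elements in the hth stratum | \(N_h\) | \(n_h\) | … |
| Total for a certain variable (characteristic) | \(Y\) | \(y\) | \(\hat{Y}_{st}\) |
| Total of the variable in stratum h | \(Y_h\) | \(y_h\) | \(\hat{Y}_h\) |
| Average over all strata (population mean) | \(\bar{Y}\) | … | \(\bar{y}_{st}\) |
| Average for hth stratum (stratum mean) | \(\bar{Y}_h\) | \(\bar{y}_h\) | … |
| Proportion having attribute | \(P\) | \(p\) | \(p_{st}\) |
| Proportion in the hth stratum | \(P_h\) | \(p_h\) | … |
| Population variance | \(S^2\) | … | … |
| Population variance for the hth stratum | \(S_h^2\) | \(s_h^2\) | … |
| Variance of an estimated total | \(S^2(\hat{Y}_{st})\) | \(s^2(\hat{Y}_h)\) | \(s^2(\hat{Y}_{st})\) |
| Variance of estimated mean | \(S^2(\bar{y}_{st})\) | \(s^2(\bar{y}_h)\) | \(s^2(\bar{y}_{st})\) |
| Value of specific unit | \(Y_{hi}\) | \(y_{hi}\) | … |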

Illustration for a Whole Population

Suppose we have a universe of eight farms with known value of land and buildings as
follows:

Farm   Value of land and buildings in $
A      2026
B      6854
C      1532
D      2180
E      5408
F      9284
G      1438
H      8836

Let us compute the average (mean) and the standard deviation of these values. In terms
of the notation above, we would have: N = 8; \(\bar{Y}\) = $4,694.75, and S = $3,326.04.

Now let us arrange the farms into two strata, so that the groupings of values are as
follows:

Stratum 1    Stratum 2
$1,438       $5,408
1,532        6,854
2,026        8,836
2,180        9,284

If we compute the average and standard deviation of each group of four farms separately,
we would have:

Stratum 1                   Stratum 2
\(N_1\) = 4                 \(N_2\) = 4
\(\bar{Y}_1\) = $1,794      \(\bar{Y}_2\) = $7,595.50
\(S_1\) = $364.33           \(S_2\) = $1,800.45
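
The figures in this illustration can be recomputed from the eight farm values with Python's standard `statistics` module, whose `stdev` uses the divisor N - 1, as the text does. The cut-off of $5,000 used to form the two strata is an assumption chosen only because it reproduces the grouping shown above.

```python
import statistics

farm_values = {"A": 2026, "B": 6854, "C": 1532, "D": 2180,
               "E": 5408, "F": 9284, "G": 1438, "H": 8836}

values = list(farm_values.values())
overall_mean = statistics.mean(values)
overall_sd = statistics.stdev(values)          # divisor N - 1

cut = 5000                                     # hypothetical cut-off between the strata
stratum_1 = sorted(v for v in values if v < cut)
stratum_2 = sorted(v for v in values if v >= cut)

print("All farms:", len(values), overall_mean, round(overall_sd, 2))
for name, stratum in (("Stratum 1", stratum_1), ("Stratum 2", stratum_2)):
    print(name, len(stratum), statistics.mean(stratum),
          round(statistics.stdev(stratum), 2))
```

The within-stratum standard deviations are far smaller than the overall one, which is exactly the situation in which stratification pays off.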

Estimates from a stratified sample


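The population mean can be expressed in terms of the stratum totals, as follows:

\[\bar{Y} = \frac{1}{N}\sum_{h=1}^{L} Y_h = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{Y}_h, \quad\text{where the population total is } Y = \sum_{h=1}^{L} Y_h.\]

The corresponding estimate from a stratified sample is

\[\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h = \frac{1}{N}\sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h, \quad\text{where } y_h \text{ is the sample total for the } h\text{th stratum.}\]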

Illustration of estimate of mean


A stratified sample is drawn from a population of 1,000 farms to estimate average
expenditure by farm operators for hired labor. There are three strata - the total number of
farms in the first is 300; in the second, also 300; and in the third, 400. The selected
samples have 30, 30, and 40 farms in the three strata, respectively. The average
expenditure for the 30 farms in the first stratum is $12.20; for the 30 farms in the second
stratum, $25.60; and for the 40 farms in the third stratum, $48.70. For the sample
estimate of the average expenditure for all farms in the population we would have

\[ \bar{y}_{st} = \frac{N_1\bar{y}_1 + N_2\bar{y}_2 + N_3\bar{y}_3}{N} = \frac{300(12.20) + 300(25.60) + 400(48.70)}{1{,}000} = \$30.82 . \]
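
As a check on the arithmetic, the weighted mean can be computed directly. This is a minimal sketch; the list names are illustrative and the numbers are those given in the illustration.

```python
# Stratum sizes N_h and stratum sample means (average expenditure, $).
N_h = [300, 300, 400]
ybar_h = [12.20, 25.60, 48.70]

N = sum(N_h)
# Stratified estimate of the mean: each stratum mean weighted by N_h / N.
ybar_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N

print(round(ybar_st, 2))
```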

Estimate of total
As with simple random sampling, we make an estimate of the population total by
multiplying the estimate of the mean by the total number of elements in the population:

\[ \hat{Y}_{st} = N\bar{y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h = \sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h . \]

Estimate of proportion
To estimate a proportion for the population, the procedure is similar to that for the mean,
because a proportion, \(P_{st}\), is simply a special case of the mean \(\bar{Y}\) when the only
possible values of \(Y_i\) are 0 and 1. In this case,

\[ \bar{Y}_h = P_h = \frac{1}{N_h}\sum_{i=1}^{N_h} Y_{hi} , \quad \text{where } Y_{hi} = 0 \text{ or } 1 . \]

Hence for stratified sampling the population proportion \(P_{st}\) is

\[ P_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h P_h , \]

and the estimate of this is

\[ p_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h p_h , \quad \text{where } p_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi} . \]
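
A small numerical sketch of the proportion estimate follows. The 0/1 observations and the stratum sizes are hypothetical, made up purely to show the calculation; they do not come from the text.

```python
# Hypothetical stratum sizes and 0/1 sample observations (1 = has the attribute).
N_h = [300, 300, 400]
samples = [
    [1, 0, 0, 1, 0],   # stratum 1: 2 of 5 have the attribute
    [1, 1, 0, 1],      # stratum 2: 3 of 4
    [0, 0, 1, 0, 0],   # stratum 3: 1 of 5
]

# Stratum sample proportions p_h are just the means of the 0/1 values.
p_h = [sum(s) / len(s) for s in samples]

N = sum(N_h)
# Stratified estimate of the proportion: p_h weighted by N_h / N.
p_st = sum(Nh * ph for Nh, ph in zip(N_h, p_h)) / N

print(p_h, round(p_st, 3))
```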

Sampling Error of a Stratified Sample


The formula for the sampling error in a stratified sample is

\[ S(\bar{y}_{st}) = \sqrt{\frac{1}{N^2}\sum_{h=1}^{L} N_h (N_h - n_h)\frac{S_h^2}{n_h}} , \quad \text{where } S_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 . \qquad (5) \]

Illustration
Let us apply the equation above to the case of the eight farms in the illustration in
this section. Suppose we took a sample of four farms out of the eight, two from each
stratum, and we have computed:

Stratum 1: \(N_1 = 4\), \(n_1 = 2\)                    Stratum 2: \(N_2 = 4\), \(n_2 = 2\)

\(S_1 = 364.33\), \(S_1^2 = 132{,}736.35\)             \(S_2 = 1{,}800.45\), \(S_2^2 = 3{,}241{,}620.2\)

Thus we have:

\[ S(\bar{y}_{st}) = \sqrt{\frac{1}{64}\left[4(4-2)\left(\frac{132{,}736.35}{2}\right) + 4(4-2)\left(\frac{3{,}241{,}620.2}{2}\right)\right]} = \sqrt{210{,}897.28} = \$459.24 . \]
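
The same calculation can be written as a small function. This is a sketch under the assumption that the sampling error is the square root of the variance expression in equation (5); the function name `stratified_se` is illustrative.

```python
import math

def stratified_se(N_h, n_h, S2_h):
    """Sampling error of the stratified mean, following equation (5)."""
    N = sum(N_h)
    variance = sum(
        Nh * (Nh - nh) * S2 / nh for Nh, nh, S2 in zip(N_h, n_h, S2_h)
    ) / N**2
    return math.sqrt(variance)

# Eight-farm illustration: two strata of four farms, two farms sampled from each.
se = stratified_se(N_h=[4, 4], n_h=[2, 2], S2_h=[132736.35, 3241620.2])

print(round(se, 2))
```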

It is interesting to compare this sampling error with the corresponding sampling error of
the mean for a simple random sample of four farms. For a simple random sample of four
farms, we would have

\[ S(\bar{y}) = \sqrt{\left(\frac{N-n}{N}\right)\frac{S^2}{n}} = \sqrt{\left(\frac{8-4}{8}\right)\frac{(3{,}326.04)^2}{4}} = \$1{,}175.93 . \]

In this example, the sampling error of the stratified sample is much smaller than that of
the simple random sample, less than half. In fact, it would require a sample of seven farms,
using simple random sampling, to achieve the same reliability (that is, as small a
sampling error) as we obtained with a stratified sample of four farms: by the same
formula, six farms would still give a sampling error of about $679, whereas seven give
about $444.
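
The comparison can be checked by tabulating the simple random sampling error for every possible sample size. This is only a sketch; the function name `srs_se` is illustrative, and it uses the population values N = 8 and S = $3,326.04 from the illustration.

```python
import math

def srs_se(N, n, S):
    """Sampling error of the mean under simple random sampling without replacement."""
    return math.sqrt((N - n) / N * S**2 / n)

N, S = 8, 3326.04
stratified = 459.24  # sampling error of the stratified sample of four farms

# Sampling error for each possible simple random sample size.
se_by_n = {n: srs_se(N, n, S) for n in range(1, N + 1)}

# Smallest n whose sampling error is no larger than the stratified one.
smallest_n = min(n for n, se in se_by_n.items() if se <= stratified)

for n, se in se_by_n.items():
    print(n, round(se, 2))
print("smallest n matching the stratified sample:", smallest_n)
```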
Sample allocation
• Proportionate allocation uses the same sampling fraction in every stratum, so that
each stratum's share of the sample equals its share of the total population.
• For example, if the population consists of 60% in the male stratum and 40% in the
female stratum, then the relative size of the two samples (three males for every
two females) should reflect this proportion.
• Optimum allocation (or disproportionate allocation) makes the sampling fraction in
each stratum proportional to the standard deviation of the variable within that
stratum. Larger samples are taken in the strata with the greater variability to
generate the least possible sampling variance.
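
The two allocation rules can be compared on the eight-farm strata. This is a sketch under stated assumptions: optimum allocation is taken here in its usual Neyman form, with stratum sample sizes proportional to \(N_h S_h\) and equal cost per unit in every stratum, which is not spelled out in the text. The function names are illustrative, and the results are left unrounded.

```python
def proportional_allocation(n, N_h):
    """Stratum sample sizes proportional to the stratum sizes N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def optimum_allocation(n, N_h, S_h):
    """Stratum sample sizes proportional to N_h * S_h (equal unit costs assumed)."""
    weights = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Eight-farm illustration: two strata of four farms, total sample of four.
N_h = [4, 4]
S_h = [364.33, 1800.45]

prop = proportional_allocation(4, N_h)
opt = optimum_allocation(4, N_h, S_h)

print(prop, [round(x, 2) for x in opt])
```

In practice the optimum sizes have to be rounded to whole units, and at least two units per stratum are needed if the stratum variance is to be estimated from the sample.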
Choice of sample size for each stratum
In general the size of the sample in each stratum is taken in proportion to the size of the
stratum. This is called proportional allocation. Suppose that a company has the following
staff: male, full time = 90; male, part time = 18; female, full time = 9; female, part
time = 63 (a total of 180 staff). We are asked to take a sample of 40 staff, stratified
according to the above categories.
• The first step is to find the total number of staff (180) and calculate the percentage
in each group.
   - % male full time = (90 / 180) x 100 = 0.5 x 100 = 50.
   - % male part time = (18 / 180) x 100 = 0.1 x 100 = 10.
   - % female full time = (9 / 180) x 100 = 0.05 x 100 = 5.
   - % female part time = (63 / 180) x 100 = 0.35 x 100 = 35.
• The second step is to take the same percentage of the sample of 40 from each group.
   - 50% of 40 is 20.
   - 10% of 40 is 4.
   - 5% of 40 is 2.
   - 35% of 40 is 14.
Note. Sometimes there is greater variability in some strata compared with others. In this
case, a larger sample should be drawn from those strata with greater variability.
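
The proportional allocation for the 180 staff can be checked with a few lines of code. This is a minimal sketch; the dictionary keys simply restate the four categories from the example.

```python
# Staff counts for the four categories in the example.
staff = {
    "male, full time": 90,
    "male, part time": 18,
    "female, full time": 9,
    "female, part time": 63,
}
sample_size = 40

total = sum(staff.values())
# Each group's share of the sample equals its share of the staff.
allocation = {group: sample_size * count / total for group, count in staff.items()}

for group, size in allocation.items():
    print(group, size)
```

Here every group size comes out as a whole number; in general the results need rounding, with a check that the rounded sizes still add up to the required sample.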

Exercises

1) A botanical researcher wishes to design a survey to estimate the number of birch
trees in a study area. The study area has been divided into 1000 units or plots.
From previous experience, the variance in the number of stems per plot is known
to be approximately 45. Using simple random sampling, what sample size
should be used to estimate the total number of trees in the study area to within 500
trees of the true value with 95% confidence? To within 1000 trees? To within 2000
trees?

2) What sample size is required to estimate the proportion of people with blood type
O in a population of 1500 people to within 0.02 of the true proportion with 95%
confidence? Assume no prior knowledge about the proportion.

REFERENCE

Suggested textbook

Cochran, W. G. (1977). Sampling Techniques (3rd Ed.). John Wiley & Sons, New York.
Reference books

Kumar, R. S. (1996). Practical Sampling Technique (2nd Ed.). Marcel Dekker, New York.
Leedy, P. D. (1997). Practical Research: Planning and Design (6th Ed.). Prentice-Hall,
Inc., New Jersey.
Sukhatme, P. V. and Sukhatme, B. V. (1992). Sampling Theory of Surveys with
Applications. Iowa State University Press & IARS.
Thompson, S.K. (2002). Sampling (2nd Ed.). John Wiley & Sons, New York.
