CSA - Sampling Techniques & Survey Methods PDF
TECHNIQUES AND
SURVEY METHODS
1.1 Introduction
When secondary data are not available for a problem under study, a decision may be
taken to collect primary data by using any of the methods discussed in this first unit of the
module. The required information may be obtained by either conducting a census
(involving full enumeration of units under study) or taking a sample. With a census, data
are collected for each and every unit (person, household, cattle, shop, factory, etc.) of the
population or universe under study. Suppose we need the average wage of workers
engaged in the sugar industry. In this case wage figures would be obtained about each
and every worker. The average wage is then obtained by taking the ratio of the total
wages earned by all the workers to the total number of workers engaged in the sugar
industry.
In practice, researchers often use sample surveys to obtain information about a larger
population by selecting and measuring a sample (or fraction of units) from the population.
Conducting a full census can be quite costly. In addition, sampling theory, developed a
century ago, has shown that one does not need to conduct a census to obtain reliable
information about a population: a well-designed sample survey will often do just as well.
In the same way, hospitals extract only a blood sample from a patient for medical tests
(rather than extracting all the blood of the patient to determine whether or not the
patient has a clean bill of health).
When a population is too large to study in full, or the cost of a census is
prohibitive, we may be able to rely on the data collected from a sample. Inferences about
a population are based on the information from the sample drawn from that population,
particularly if the sample is designed to be representative of the population. Due to the
variability in the characteristics of the population, scientific sample designs should be
applied in the selection of a representative sample. If not, there is a high risk of distorting
the information about the population.
Sampling involves the selection of a number of study units from a defined population.
The thinking here is that the population is too large to consider collecting information from
all its members. If the whole population is taken, then there is no need for statistical
inference. Usually, a representative subgroup of the population (sample) is used in the
investigation. A representative sample has all the important characteristics of the
population from which it is drawn.
A population is the totality of all subjects that possess a certain common characteristic
that will be studied.
A sample is part of a population which is selected to reasonably represent the population
from which it is drawn.
A census is information on the whole population; enumeration of the entire population.
Sampling is the process of learning about the population on the basis of a sample drawn
from the population. Thus, when employing sampling methods, instead of taking every
unit of the universe, only a part of the universe is studied and conclusions are drawn on
that basis for the entire population. The process of sampling involves three elements:
Selecting the sample,
Collecting the data, and
Making inference about the population.
The three elements cannot generally be considered in isolation from one another. Sample
selection, data collection and estimation are all interwoven and each has an impact on
the others. Sampling should not involve haphazard selection. Rather, it should embody
definite rules for selecting the sample. Following a set of rules for sample selection, we
cannot consider the estimation process independent of the manner in which the sample
has been selected.
It should be noted that a sample is not studied for its own sake. The basic objective of its
study is to draw inference about the population. In other words, sampling is a tool which
helps to know the characteristics of the universe or population by examining only a small
part of it. The values obtained from the study of a sample, such as the average and
dispersion, are known as ‘statistics’. On the other hand, such values for a population are
called ‘parameters’.
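The distinction between a parameter and a statistic can be illustrated with a minimal sketch. The wage figures below are hypothetical values generated only for illustration; the names `population`, `mu`, `x_bar` and so on are assumptions of this example, not part of the module.

```python
import random
import statistics

# Hypothetical population of 1,000 wage figures (illustrative values only).
random.seed(42)
population = [random.randint(2000, 8000) for _ in range(1000)]

# Parameters: computed from every unit in the population (a census).
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Statistics: computed from a simple random sample of 100 units.
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

print(f"parameter mean = {mu:.1f}, statistic mean = {x_bar:.1f}")
print(f"parameter sd   = {sigma:.1f}, statistic sd   = {s:.1f}")
```

The sample values will generally be close to, but not equal to, the population values; a different sample would give slightly different statistics, while the parameters stay fixed.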
Advantages of sampling
Sampling techniques have the following advantages over a complete enumeration:
i. Less time: Since the sample is a study of a part of the population, considerable
time and labor are saved when a sample survey is carried out. Time is saved not
only in collecting data but also in processing the data. For these reasons, a
sample provides more timely data in practice than a census.
ii. Less cost: The total financial burden of a sample survey is generally much less
than that of a complete census. In sampling, we study only a part of a population
and the total expense of collecting data is less than that required when the
census method is adopted. This is a great advantage particularly in an
underdeveloped economy where much of the information would be difficult to
collect through a census due to lack of adequate resources.
iii. More reliable results: Although the sampling technique involves certain
inaccuracies owing to sampling errors (i.e., imprecision in results arising from the
use of samples as different sets of samples could be chosen under the same
protocol), the result obtained from sampling is generally more reliable than that
obtained from a complete count. There are several reasons for it. Firstly, it is
always possible to determine the extent of sampling errors. Secondly, other
types of errors to which a survey is subject, such as inaccuracy of information,
incompleteness of returns, etc., are likely to be more serious in a complete
census than in a sample survey. This is because more effective precautions can
be taken in a sample survey to ensure that information gathered is accurate and
complete. For these reasons, not only can the total error be expected to be small
in a sample survey, but the sample result can also be used with a greater degree
of confidence because of our knowledge of the probable size of error. Thirdly, it
is possible to engage the services of experts, and to impart thorough training to the
investigators in a sample survey. This further reduces the possibility of errors.
Follow up work can also be undertaken much more effectively in the sampling
method. Indeed, even a complete census can only be tested for accuracy by
some type of sampling checks.
iv. Better timeliness: Data can be collected and summarized more quickly.
v. More detailed information: Since the use of sampling saves time and money, it
is possible to collect more detailed information in a sample survey. For example,
if the population consists of 1,000 persons and a survey of the consumption
pattern is undertaken, the two alternative techniques available are as follows:
a. We may collect the necessary data from each one of the 1,000 people
through a questionnaire containing, say, 10 questions (census method); or
b. We may take a sample of 100 persons (i.e., 10% of the population) and prepare
a questionnaire containing as many as 100 questions.
vi. Sampling method is the only method that can be used in certain cases.
There are some cases in which the census method is inapplicable and the only
practicable means is provided by the sample method. For example, if one is
interested in testing the breaking strength of chalks manufactured in a factory,
under a full enumeration, all the chalks would be broken in the process of testing.
vii. The sample method is often used to judge the accuracy of the data obtained on a
census basis.
Disadvantages of sampling
i. There are always errors in sampling, both sampling and non-sampling errors. A
sample survey must be carefully planned and executed; otherwise, the results
obtained may be inaccurate and misleading. Care must be taken even for a
complete count; in a sample survey, serious errors may arise if the sampling
procedure is not properly designed and followed.
ii. Sampling generally requires the services of experts. In the absence of qualified
and experienced persons, the information obtained from sample surveys cannot
be relied upon.
iii. At times, the sampling plan may be so complicated that it requires more time,
labor and money than a complete enumeration. This is so if the size of the
sample is a large proportion of the total population, and, if complicated weighted
procedures are used. With each additional complication in the survey, the
chances of error multiply and greater care has to be taken, which in turn, means
more time and labor.
iv. If the information is required for each and every unit in the domain of study, a
complete enumeration survey is necessary.
A sampling frame is the list of all individual sampling units (elements) in the population.
The sample is selected from this list.
Example 1: If somebody studies the socio-economic status of households, then
a household is the sampling unit, and
the list of households is the sampling frame.
a) Identifying the relevant population: Determine the relevant population from which
the sample is going to be drawn.
Example 2: If the study concerns income, then the definition of the population as
individuals or households can make a difference.
b) Determining the method of sampling: Whether a probability sampling procedure
or a non-probability sampling procedure has to be used is also very important.
c) Securing a sampling frame: A list of elements from which the sample is actually
drawn is necessary.
d) Identifying parameters of interest: What specific population characteristics
(variables and attributes) may be of interest?
e) Determining the sample size: The determination of the sample size depends on
several factors.
Sampling Methods
A) Non-probability sampling methods: judgment sampling, quota sampling, convenience sampling.
B) Probability sampling methods: simple or unrestricted random sampling; restricted random
sampling (stratified sampling, systematic sampling, cluster sampling).
which a probability sample can be selected. The method chosen depends on a number of
factors, such as the available sampling frame, how spread-out the population is, and how
costly it is to survey members of the population.
Advantages of probability sampling methods
1. Probability sampling doesn’t depend upon the existence of detailed information
about the universe for its effectiveness.
2. Probability sampling provides estimates which are essentially unbiased and have
measurable precision.
3. It is possible to evaluate the relative efficiency of various sample designs only when
probability sampling is used.
Limitations of probability sampling methods:
1. Probability sampling requires a very high level of skill and experience for its use.
2. It requires a lot of time to plan and execute a probability sample.
3. The costs involved in probability sampling are generally high compared to non-
probability sampling.
The most common types of probability sampling methods are:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
are mostly concerned with sampling without replacement, such that all the possible
samples of a given size n are equally likely to be selected.
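The idea that all possible samples of size n are equally likely under sampling without replacement can be checked by enumeration on a very small population. This is only a sketch: the five units labelled A to E and the sample size of 2 are hypothetical.

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D", "E"]  # hypothetical units, N = 5
n = 2                                   # sample size

# Every possible sample of size n drawn without replacement.
all_samples = list(combinations(population, n))
assert len(all_samples) == comb(len(population), n)  # C(5, 2) = 10

# Under simple random sampling each sample has the same selection probability.
prob_each = 1 / len(all_samples)
print(len(all_samples), prob_each)  # 10 possible samples, each with probability 0.1

# Each unit appears in the same share of samples, so its inclusion probability is n/N.
inclusion = {
    unit: sum(unit in s for s in all_samples) / len(all_samples)
    for unit in population
}
print(inclusion)
```

For realistic population sizes the number of possible samples is far too large to list, which is why a random mechanism is used to pick one of them.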
From the point of view of field survey operations, it has been claimed that cases
selected by random sampling tend to be too widely dispersed geographically, and
that the time and cost of collecting data become large.
Random sampling may sometimes produce results that look highly non-random. For example,
thirteen cards from a well-shuffled pack of playing cards may all be of one suit.
However, the probability of this type of occurrence is very low.
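How low this probability is can be worked out directly. The sketch below assumes a standard 52-card pack with four suits of thirteen cards each and a hand of thirteen cards dealt without replacement.

```python
from math import comb

hands = comb(52, 13)      # number of distinct 13-card hands from 52 cards
p_one_suit = 4 / hands    # exactly one single-suit hand exists for each of the 4 suits

print(hands)        # 635013559600
print(p_one_suit)   # roughly 6.3e-12
```

In other words, such a hand is expected roughly once in every 159 billion deals, so it is possible but practically never observed.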
b) Stratified random sampling
When a population can be classified in such a way that responses to questions asked in
a survey are more homogenous within groups than between these groups, then it may be
a good idea to employ a stratified technique for sampling. Using stratified sampling, the
population is divided into homogeneous, mutually exclusive groups called strata, and
stratification can be done by any variable that is available for all units prior to sampling
(e.g., age, sex, province of residence, income, etc.). If the stratification was done
correctly, data within a stratum would be more homogeneous than data coming from
different strata.
The sampling procedure is such that a separate sample is taken independently from each
stratum, and a designated number of items are chosen from each stratum.
Selection of stratified random sampling: Some of the issues involved in setting up a
stratified random sample are:
Basis of stratification: What characteristic should be used to subdivide the universe into
different strata? As a general rule, strata are created on the basis of a variable known to
be correlated with the variable of interest and for which information on each universe
element is known. Strata should be constructed in a way, which will minimize differences
among sampling units within each stratum, and maximize differences among strata.
In other words the purpose of stratification is to increase the efficiency of sampling by
dividing a heterogeneous universe in such a way that (i) there is as great a homogeneity
as possible within each stratum, and (ii) a marked difference is possible between the
strata.
For example, if we are interested in studying the consumption pattern of people in Addis
Ababa, the city may be divided into various parts (such as zones or weredas) and from
each part a sample may be taken at random. Before deciding on stratification we must
have knowledge of the traits of the population. Such knowledge may be based upon
expert judgment, past experience, preliminary observations from pilot studies, etc.
Number of strata: How many strata should be constructed? Practical considerations limit the
number of strata that is feasible; the costs of adding more strata may outrun the benefits.
Sample size within strata: How many observations should be taken from each stratum?
Our decision in this situation depends on using either a proportional or a disproportional
allocation. In proportional allocation, we sample each stratum proportional to its relative
weight. In disproportional allocation this is not the case. It is worthwhile pointing out that
proportional allocation approach is simple. If all one knows about each stratum is the
number of items in that stratum, it is generally also the preferred procedure. In
disproportional sampling, the different strata are sampled at different rates. As a
general rule, when variability among observations within a stratum is high, we sample that
stratum at a higher rate than strata with less internal variation.
Proportional and Disproportional Stratified Sampling
In a proportional stratified sampling plan, the number of items drawn from each stratum is
proportional to the size of the stratum. For example, if the population is divided into five
groups, their respective sizes being 10, 15, 20, 30 and 25 percent of the population and a
sample of 5,000 is drawn, the desired proportional sample may be obtained in the
following manner:
From stratum one 5,000 (0.10) = 500 items
From stratum two 5,000 (0.15) = 750 items
From stratum three 5,000 (0.20) = 1,000 items
From stratum four 5,000 (0.30) = 1,500 items
From stratum five 5,000 (0.25) = 1,250 items
Total = 5,000 items.
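The same allocation can be reproduced with a few lines of code. This is a minimal sketch; the function name `proportional_allocation` is an assumption of this example, and the percentages are those given in the example above.

```python
def proportional_allocation(total_sample, shares_percent):
    """Return the number of items to draw from each stratum.

    shares_percent: each stratum's share of the population, in percent.
    Integer arithmetic is used so that the results are exact.
    """
    return [total_sample * pct // 100 for pct in shares_percent]


shares = [10, 15, 20, 30, 25]   # stratum sizes as percentages of the population
allocation = proportional_allocation(5000, shares)

print(allocation)       # [500, 750, 1000, 1500, 1250]
print(sum(allocation))  # 5000
```

When the products are not whole numbers, the floor division used here can leave a few units unallocated; in that case the remainder has to be assigned by some rounding rule chosen by the survey designer.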
Proportional stratification yields a sample that represents the universe with respect to the
proportion in each stratum in the population. This procedure is satisfactory if there is no
great difference in dispersion from stratum to stratum. But it is certainly not the most
efficient procedure, especially when there is considerable variation in different strata.
In disproportional stratified sampling an equal number of cases is taken from each
stratum regardless of how the stratum is represented in the population. Thus, in the
above example, an equal number of items (1,000) from each stratum may be drawn. In
practice, disproportional sampling is common when sampling from a highly variable
population, wherein the variation of the measurements differs greatly from stratum to
stratum.
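One practical consequence of disproportional allocation is that the pooled sample no longer mirrors the population, so stratum results must be weighted by the population shares when an overall estimate is formed. The sketch below uses the five population shares from the example above with 1,000 items per stratum; the stratum means are hypothetical numbers chosen only to show the effect, and the weighted mean is the standard stratified estimator rather than a formula stated in this module.

```python
shares = [0.10, 0.15, 0.20, 0.30, 0.25]        # population shares of the five strata
n_h = [1000, 1000, 1000, 1000, 1000]           # equal (disproportional) allocation
stratum_means = [12.0, 15.0, 20.0, 26.0, 40.0] # hypothetical sample means per stratum

# Pooling all 5,000 observations treats every stratum as if it were 20% of the population.
pooled_mean = sum(n * m for n, m in zip(n_h, stratum_means)) / sum(n_h)

# Weighting each stratum mean by its population share restores the correct proportions.
weighted_mean = sum(w * m for w, m in zip(shares, stratum_means))

print(round(pooled_mean, 2))    # about 22.6
print(round(weighted_mean, 2))  # about 25.25
```

The gap between the two figures shows why an unweighted average of a disproportional sample can misrepresent the population.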
Illustration: The following table provides data about the faculty of
a community by length of service:
Work out how many lecturers, readers and professors would be selected from each
category if:
(i) we follow stratified proportional sampling method and take 10% of the population
equivalent to the sample size,
(ii) the size of the sample is 10% of the population but the lecturers, readers and
professors are to be in the ratio of 5:3:2 and the weights for length of service are to be in
the ratio of 4:3:2:1.
Solution to (i). The sample size is 10% of the universe, hence 830 persons would be
included in the sample. Since 12 strata are formed and we want to follow proportional
stratified sampling method, we will take 10% from each stratum. The number of
persons selected shall be as follows:
Solution to (ii). In the second case the size of the sample is 830, but the lecturers, readers
and professors are to be in the ratio of 5:3:2, i.e., we take 830(5/10) =
415 lecturers, 830(3/10) = 249 readers, and 830(2/10) = 166 professors. Since the
weights for length of service are in the ratio 4:3:2:1, the number selected from each
category shall be as given in the table below:
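The arithmetic behind this allocation can be sketched as follows. The function name `ratio_allocation` is an assumption of this example, the four length-of-service classes are simply taken in the order of the 4:3:2:1 ratio, and the cell values are left unrounded because the rounding convention is not stated in the text.

```python
from fractions import Fraction


def ratio_allocation(sample_size, rank_ratio, service_ratio):
    """Split a sample by rank ratio, then split each rank by length-of-service ratio."""
    rank_total = sum(rank_ratio.values())
    service_total = sum(service_ratio)
    result = {}
    for rank, r in rank_ratio.items():
        n_rank = Fraction(sample_size * r, rank_total)              # e.g. 830 * 5/10
        cells = [float(n_rank * s / service_total) for s in service_ratio]
        result[rank] = (float(n_rank), cells)
    return result


allocation = ratio_allocation(
    830,
    {"lecturers": 5, "readers": 3, "professors": 2},
    [4, 3, 2, 1],
)
for rank, (n_rank, cells) in allocation.items():
    print(rank, n_rank, cells)
```

Several cells come out fractional (for example 124.5), so whole numbers of persons have to be obtained by rounding in a way that keeps the total at 830.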
Greater accuracy: Stratified sampling ensures greater accuracy. The accuracy is
maximum if each stratum is so formed that it consists of uniform or homogeneous
items.
Greater geographical concentration: Compared with a simple random sample, stratified
samples can be more concentrated geographically, i.e., the units from the different
strata may be selected in such a way that all of them are localized in one
geographical area. This would greatly reduce the time and expenses of
interviewing.
Disadvantages of stratified sampling
Utmost care must be exercised in dividing the population into various strata. Each
stratum must contain homogeneous items, as far as possible; otherwise, the
results may not be reliable. If proper stratification of the population is not done, the
sample may be biased.
The items from each stratum should be selected at random; but this may be
difficult to achieve in the absence of skilled sampling supervisors.
A stratified sample is likely to be more widely distributed
geographically than a simple random sample; thus, the cost per observation may
be quite high.
c) Systematic Random Sampling
A systematic sample is formed by selecting one unit at random and then selecting
additional units at evenly spaced intervals until the required sample has been formed.
This method is popularly used in situations where a complete list of the population from
which a sample is to be drawn is available. The list may be prepared in alphabetical,
geographical, numerical or some other order. The items are serially numbered. The first
item is selected at random generally by following the Lottery method. Subsequent items
are selected by taking every kth item from the list, where k refers to the sampling interval
or sampling ratio, i.e., the ratio of population size to the size of the sample, that is k=N/n.
Remark: This method of sampling is also known as quasi-random sampling method.
Once the initial starting point is determined, the remainder of the items selected for the
sample are pre-determined by the sampling interval.
While calculating k, it is possible that we get a fractional value. In such a case we should
use an approximation procedure, i.e., if the fraction is less than 0.5 it should be
omitted and if it is more than 0.5 it should be taken as 1. If it is exactly 0.5 it should be
omitted if the number is even, and should be taken as 1 if the number is odd. This is
based on the principle that the number after approximation should preferably be even.
For example, if the number of students is 1,020, 1,150 or 1,100 and we want to take a
sample of 200, k shall be:
k = 1,020/200 = 5.1, or 5; k = 1,150/200 = 5.75, or 6; k = 1,100/200 = 5.5, or 6.
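The approximation rule described above is the "round half to even" rule, which can be sketched in code. The function name `sampling_interval` is an assumption of this example; exact fractions are used so that a value such as 5.5 is not disturbed by floating-point error.

```python
from fractions import Fraction


def sampling_interval(N, n):
    """Return k = N/n rounded to the nearest integer, with ties going to the even number."""
    return round(Fraction(N, n))   # rounding a Fraction uses round-half-to-even


for N in (1020, 1150, 1100):
    print(N, sampling_interval(N, 200))   # 5, 6 and 6 respectively
```

A tie with an even whole part is rounded down, e.g. `sampling_interval(900, 200)` gives 4 because 4.5 lies between 4 and 5 and 4 is even.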
1) Number the units on your frame from 1 to N (where N is the total population size).
2) Determine the sampling interval (k) by dividing the number of units in the population
by the desired sample size.
3) Select a number between one and k at random. This number is called the random
start and would be the first number included in your sample.
4) Select every kth unit after that first number.
Note: Systematic sampling should not be used when a cyclic repetition is inherent in the
sampling frame.
Example 3: In a class there are 96 students with Roll Nos. from 1 to 96. It is desired to
take a sample of 10 students. Use the systematic sampling method to select the
sample. The solution is
k = N/n = 96/10 = 9.6; take k = 10.
From Roll Nos. 1 to 10, the first student will be selected at random,
and then we will go on taking every 10th student. Suppose the first student happens to be
the 4th. The sample would then consist of the following Roll Nos.: 4, 14, 24, 34, 44, 54, 64,
74, 84 and 94.
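The selection procedure in this example can be sketched as a short function. The name `systematic_sample` and its `start` argument are assumptions of this example; passing `start=4` reproduces the case worked out above, while leaving it out draws the random start between 1 and k.

```python
import random
from fractions import Fraction


def systematic_sample(N, n, start=None):
    """Select units numbered 1..N at a fixed interval k after a random start."""
    k = round(Fraction(N, n))          # sampling interval, ties rounded to even
    if start is None:
        start = random.randint(1, k)   # random start between 1 and k inclusive
    return list(range(start, N + 1, k))


print(systematic_sample(96, 10, start=4))
# [4, 14, 24, 34, 44, 54, 64, 74, 84, 94]
```

Because 96 is not an exact multiple of 10, a random start of 7, 8, 9 or 10 would yield only 9 students; this is a known side effect of rounding k.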
Systematic sampling is a relatively simple technique and may be more efficient
statistically than simple random sampling provided the list is arranged wholly at random.
However, this requirement is rarely fulfilled. The nearest approach to
randomness is provided by alphabetical lists such as those found in a telephone directory,
although even these may have certain non-random characteristics.
Advantages of systematic sampling
1. The systematic sampling design is simple and convenient.
2. The time and work involved in sampling using this method are relatively low.
3. The results obtained are also found to be generally satisfactory, provided care is
taken to see that there are no periodic features associated with the sampling
interval.
4. If populations are sufficiently large, systematic sampling can often be expected to
yield results similar to those obtained by proportional stratified sampling.
Disadvantages of systematic sampling
1. The main limitation of the method is that it becomes less representative if we are
dealing with populations having “hidden periodicities”.
2. If the population is ordered in a systematic way with respect to the characteristics
the investigator is interested in, then it is possible that only certain types of items
will be included in the sample, or at least more of certain types than others.
d) Cluster Sampling
Under this scheme, the selection is done from primary, intermediate and final (or the
ultimate) units from a given population or stratum. There are several stages in which the
sampling process is carried out. At first, the first stage units are sampled by some
suitable method, such as simple random sampling. Then, a sample of second stage units
is selected from each of the selected first stage units, again by some suitable method,
which may be the same as, or different from the method employed for the first stage
units. Further stages may be added as required.
The fundamental difference between a cluster sample and a stratified sample is that the
clusters are themselves representative of the entire population, whereas the stratified
groups are not. The procedure may be illustrated as follows:
Suppose we want to take a sample of 5,000 households from the city of Addis Ababa. At
the first stage, the city can be divided into a number of districts and a few districts are
selected at random. At the second stage, each district may be subdivided into a number
of sub-districts and a sample of sub-districts may be taken at random. At the third stage,
a number of households may be selected from each of the sub-districts selected at the
second stage. To take another example, suppose in a particular survey, we wish to take a
sample of 10,000 students from Addis Ababa University. We may take colleges as
primary units at the first stage, then draw a sample of departments in the second stage,
and choose students in the third and last stage.
Steps in cluster sampling
Cluster sampling divides the population into groups or clusters.
A number of clusters are selected randomly to represent the total population; all
units within selected clusters are included in the sample.
No units from non-selected clusters are included in the sample; they are
represented by those from selected clusters.
This differs from stratified sampling, where some units are selected from each
group.
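The steps above can be sketched for a single-stage cluster sample. The district and household labels are hypothetical, and the function name `cluster_sample` is an assumption of this example.

```python
import random


def cluster_sample(clusters, m, seed=None):
    """Randomly select m clusters and include every unit inside the selected clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)                # select clusters, not units
    units = [u for c in chosen for u in clusters[c]]        # take all units in each chosen cluster
    return chosen, units


# Hypothetical frame: districts (clusters) and the households they contain.
clusters = {
    "district_A": ["hh1", "hh2", "hh3"],
    "district_B": ["hh4", "hh5"],
    "district_C": ["hh6", "hh7", "hh8", "hh9"],
    "district_D": ["hh10"],
}

chosen, units = cluster_sample(clusters, 2, seed=1)
print(chosen)
print(units)
```

Note that only a list of clusters is needed to draw the sample, and that the number of households obtained depends on which districts happen to be selected.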
Advantages of Cluster Sampling
1. Cost reduction.
2. It creates 'pockets' of sampled units instead of spreading the sample over the
whole territory.
3. Sometimes a list of all units in the population is not available, while a list of all
clusters is either available or easy to create.
Disadvantages of Cluster Sampling
The approach creates some loss of efficiency when compared with simple random
sampling.
It is usually better to survey a large number of small clusters instead of a small
number of large clusters. This is because neighboring units tend to be more alike,
resulting in a sample that does not represent the whole spectrum of situations
present in the population.
In cluster sampling, we do not have total control over the final sample size.
e) Multi-stage sampling
The surveys discussed thus far are referred to as one-stage samples. A number of
surveys conducted by the Central Statistical Agency (CSA) involve what is known as
multi-stage sampling, where the elements in the targeted population are grouped into
some sort of hierarchy of units, and sampling is done successively. For instance, a
sample of villages may be taken, and then a sampling of dwellings is done within the
selected villages. Such a sample survey may provide reliable results at national level. It
might be possible to obtain estimates for all districts in the country only if we could find
some variables, say, in a census, which correlate with the variable being measured in the
sample. This concern is called small area estimation. Small area estimates may not
always be reliable especially if the correlates do not establish a strong link with the
variable of interest.
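The village-and-dwelling example can be sketched as a two-stage selection. This is only an illustration under simple assumptions: simple random sampling is used at both stages, the village and dwelling labels are hypothetical, and the function name `two_stage_sample` is not taken from any CSA procedure.

```python
import random


def two_stage_sample(villages, n_villages, n_dwellings, seed=None):
    """Stage 1: sample villages. Stage 2: sample dwellings within each selected village."""
    rng = random.Random(seed)
    selected = rng.sample(sorted(villages), n_villages)
    return {
        v: rng.sample(villages[v], min(n_dwellings, len(villages[v])))
        for v in selected
    }


# Hypothetical frame: villages and the dwellings listed in each.
villages = {
    "village_1": ["d1", "d2", "d3", "d4"],
    "village_2": ["d5", "d6", "d7"],
    "village_3": ["d8", "d9", "d10", "d11", "d12"],
}

sample = two_stage_sample(villages, n_villages=2, n_dwellings=2, seed=3)
print(sample)
```

A dwelling list is needed only for the selected villages, which is one reason multi-stage designs are attractive when no complete list of dwellings exists.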
In performing the sampling process, the sample size to be used is considered carefully. In
the second unit of this module it will be shown that the accuracy of estimates increases
with the square root of the sample size. For instance, a sample must be increased 25
times in order to get an increase of 5 times in the accuracy of an estimate. For opinion
polls, typically even a poll of slightly more than 1000 respondents would give “margins of
error” of about 3 percentage points. Non-statisticians find it hard to believe that if we are
trying to take political views of millions, all we need to do is have about 1000 respondents
nationwide. In practice, the size of the sample is dependent both on the degree of
accuracy required as well as on the variability inherent in the population. The same
degree of precision may be obtained from using a small sample from a homogeneous
population. If we wish to find out the total amount of money everyone in a certain room
has in their wallets, all we need to do is to ask one person if everybody has the same
amount of money in their wallets. But with varying amounts of money, more persons have
to be asked to get a better estimate of the total amount of money.
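The square-root relationship can be made concrete with the usual approximate margin of error for an estimated proportion. The sketch assumes simple random sampling from a large population, a 95% confidence level (z = 1.96) and the most conservative proportion p = 0.5; the function name `margin_of_error` is an assumption of this example.

```python
from math import sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from n respondents."""
    return z * sqrt(p * (1 - p) / n)


for n in (100, 400, 1000, 2500, 10000):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")
```

Under these assumptions a sample of 100 gives a margin of roughly 10 percentage points and a sample of 2,500 roughly 2 points: a 25-fold increase in sample size buys only a 5-fold gain in precision. A sample of about 1,000 gives roughly 3 percentage points, regardless of whether the population numbers thousands or millions.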
Researchers are reluctant to use these methods because there is no way to measure the
precision of the resulting sample. Despite these drawbacks, non-probability sampling
methods can be useful when only descriptive comments about the sample itself are
desired. Secondly, they are quick, generally inexpensive and convenient. There are also
other circumstances, such as research in which it is not feasible or practical to
conduct probability sampling, as in the case of internet polls, or taking a sample survey of
persons with disability (when there is no administrative list of such persons).
The selection process in non-probability sampling is, at least, partially subjective. Some of the most
common types of non-probability sampling methods that are used in practice are:
i. Judgment sampling
ii. Quota sampling and
iii. Convenience sampling
i. Judgment sampling
This approach is used when a sample is taken based on certain judgments about the
overall population. The underlying assumption is that the investigator will select units that
are characteristic of the population. The critical issue here is objectivity: how much can
judgment be relied upon to arrive at a typical sample? For example, if a sample of ten
students is to be selected from a class of sixty to analyze the spending habits of students,
the investigator would select 10 students who, in her/his opinion, are representative of the
class.
Judgment sampling is subject to the researcher's bias and is perhaps even more biased
than haphazard sampling. Since any preconceptions the researcher may have are
reflected in the sample, large biases can be introduced if these preconceptions are
inaccurate.
Researchers often use this method in exploratory studies like pre-testing of
questionnaires and focus groups. They also prefer to use this method in laboratory
settings where the choice of experimental subjects (i.e., animals, humans, etc.) reflects
the investigator's pre-existing beliefs about the population.
Advantages of judgment sampling
1) In solving everyday business problems and making public policy decisions,
executives and public officials are often pressed for time and cannot wait for
probability sample designs. Judgment sampling is then the only practical method to
arrive at solutions to their urgent problems.
2) When we want to study some unknown traits of a population, some of whose
characteristics are known, we may then stratify the population according to these
known properties and select sampling units from each stratum on the basis of
judgment. This method is used to obtain a more representative sample.
Disadvantages of judgment sampling method
This method is not scientific because the population units to sample may be affected by
the personal prejudice or bias of the investigator. Thus, judgment sampling involves the
risk that the investigator may establish foregone conclusions by including in the sample those
units which conform to her/his preconceived notions. For example, if an investigator holds the view
that the wages of workers in a certain establishment are very low, and if she/he adopts
the judgment sampling method, she/he may include in the sample only those workers
whose wages are low and thereby establish her/his point of view, which may be far from the
truth. Since an element of subjectivity is possible, this method cannot be recommended
for general use.
There is no objective way of evaluating the reliability of sample results. The success of
this method depends upon the excellence of judgment. If the individual making decisions
is knowledgeable about the population and has good judgment, then the resulting sample
may be representative, otherwise the inferences based on sample may be erroneous. It
may be noted that even if a judgment sample is reasonably representative, there is no
objective method for determining the size or likelihood of sampling error.
sample even if the sample is drawn at random from the lists. If a person is to submit a
project report on labor-management relations in textile industry and he takes a textile mill
close to his office and interviews some people over there, he is following the convenience
sampling method. Convenience samples are prone to bias by their very nature - selecting
population elements that are convenient to choose almost always makes them special or
different from the rest of the elements in the population in some way.
Hence the results obtained by following convenience sampling method can hardly be
representative of the population - they are generally biased and unsatisfactory. However,
convenience sampling is often used for making pilot studies. Questions may be tested
and the chunk may provide preliminary information before the final sampling design is
decided upon.
Selection of appropriate methods of sampling
Having discussed the various methods of sampling, the question now arises as to which
method to adopt in a particular situation. It should be noted that no one method can be
regarded as best under all circumstances - each method has its own specialty. A number
of factors such as the nature of the problem, size of universe, size of sample, availability
of finance, time, etc., would affect the choice of a particular method of sampling.
ii. Unbiased errors: These errors arise due to chance differences between the
members of population included in the sample and those not included.
Sampling error in statistics is the difference between the value of a statistic and
that of the corresponding parameter.
Thus, the total sampling error is made up of errors due to bias, if any, and the random
sampling error. The essence of bias is that it forms a constant component of error that
does not decrease in a large population as the number of units in the sample increases.
Such error is, therefore, known as cumulative or non-compensating error.
On the other hand, the random sampling error decreases on average as the sample size
increases. Such error is, therefore, also known as non-cumulative or
compensating error.
Causes of bias: Bias may arise due to -
i. Faulty process of selection;
ii. Faulty work during the collection; and
iii. Faulty methods of analysis.
i. Faulty selection: Faulty selection of the sample may give rise to bias in a number of
ways, such as:
a) Deliberate selection of a representative sample.
b) Conscious or unconscious bias in the selection of a random sample.
c) The randomness of selection may not really exist, even though the investigator
claims that she/he had a random sample, if she/he allows her/his desire to obtain a
certain result to influence her/his selection.
d) Non-response: If all the items to be included in the sample are not covered there
will be bias even though no substitution has been attempted. This fault particularly
occurs in mailed questionnaires, which are returned incomplete. Moreover, the
information supplied by the informants may also be biased.
e) An appeal to the vanity of the person questioned may give rise to yet another kind
of bias. For example, the question ‘Are you a good student?’ is such that most of
the students would succumb to vanity and answer ‘Yes’.
ii. Bias due to faulty collection of data: Any consistent error in measurement will give
rise to bias whether the measurements are carried out on a sample or on all the
units of the population. The danger is, however, likely to be greater in sampling work
since the units measured are often smaller. Bias may arise due to improper
formulation of the decision, securing an inadequate frame, and so on. Biased
observations may result from a poorly designed questionnaire, an ill-trained
interviewer, failure of a respondent’s memory, etc. Bias in the flow of the data may
be due to unorganized collection procedure, faulty editing or coding of responses.
iii. Bias in analysis: In addition to bias which arises from faulty process of selection and
faulty collection of information, faulty methods of analysis may also introduce bias.
Such bias can be avoided by adopting the proper methods of analysis.
Avoidance of Bias: If possibilities of bias exist, a fully objective conclusion cannot be
drawn. The first essential of any sampling or census procedure must, therefore, be the
elimination of all sources of bias. The simplest and the only certain way of avoiding bias
in the selection process is for the sample to be drawn either entirely at random, or at
random subject to restrictions which, while improving the accuracy, are of such a nature
that they do not introduce bias in the results. In certain cases, systematic selection may
also be permissible.
Method of reducing sampling errors
Once the absence of bias has been ensured, attention should be given to the random
sampling errors. Such errors must be reduced to the minimum so as to attain the desired
accuracy.
Apart from reducing errors of bias, the simplest way of increasing the accuracy of a
sample is to increase its size. The sampling error usually decreases with increase in
sample size, and in fact in many situations the decrease is inversely proportional to the
square root of the sample size.
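The square-root relationship can be illustrated with a minimal sketch. It assumes the familiar form of the standard error of the mean, sigma divided by the square root of n; the population standard deviation of 10 and the sample sizes are hypothetical values chosen only for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for sample size n: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation (assumed for illustration)
for n in (25, 100, 400):
    # Each fourfold increase in n halves the standard error.
    print(n, standard_error(sigma, n))
```

Under these assumptions, quadrupling the sample size only halves the random sampling error, which is why very high precision can be expensive to buy with sample size alone.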
These sources are not exhaustive, but are given to indicate some of the possible sources
of error.
Controlling non-sampling errors
In some situations the non-sampling errors may be large and deserve greater attention
than sampling errors. While sampling errors in general decrease with an increase in sample
size, non-sampling errors tend to increase with the sample size. In the case of sample
surveys both sampling and non-sampling errors have to be controlled and reduced to a
level at which their presence does not vitiate the use of final results.
Reliability of samples
The reliability of samples can be tested in the following ways:
a) More samples of the same size should be taken from the same population and their
results compared. If the results are similar, the sample is reliable.
b) If the measurements of the universe are known, they should be compared with
the measurements of the sample. If the measurements are similar, the sample is reliable.
c) A sub-sample should be taken from the sample and studied. If the results of the sample
and sub-sample study are similar, the sample should be considered reliable.
So far, we have made a distinction between a population and a sample, stating that a
population consists of all conceivably possible (or hypothetically possible) observations of
a given characteristic, while a sample is simply part of a population. In this section, it is
important to note the difference between finite populations and infinite populations.
A population is finite if it consists of a finite or fixed number of elements, measurements,
or observations, whereas a population is infinite if it contains, at least hypothetically,
infinitely many elements. Observing the totals obtained in repeated rolls of a pair of dice
gives an infinite population. Sampling with replacement from a finite population is also
treated as sampling from an infinite population, since the draws can be repeated
indefinitely.
When data are produced by random sampling or randomized experimentation, a statistic
is a random variable that obeys the laws of probability theory. The link between
probability and data is formed by the sampling distribution of a statistic. A sampling
distribution shows how a statistic would vary in a repeated data production. It is the basic
concept in statistical inference.
If we draw a sample of size n from a given finite population of size N without replacement,
then the total number of possible samples is:
$$ {}^{N}C_{n} = \frac{N!}{n!\,(N-n)!} = k. $$
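As a quick check of the counting formula, the sketch below compares the formula with a direct enumeration. The five-unit population labelled A to E is hypothetical.

```python
import math
from itertools import combinations

def number_of_samples(N, n):
    """Number of distinct samples of size n drawn without replacement from N units."""
    return math.comb(N, n)

population = ["A", "B", "C", "D", "E"]  # hypothetical population of N = 5 units
samples = list(combinations(population, 2))  # every possible sample of size 2

# The formula and the enumeration should agree.
print(number_of_samples(len(population), 2), len(samples))
```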
Sampling distribution of the mean
This section focuses on the sample mean and its sampling distribution. The sampling
distribution of the sample mean, $\bar{X}$, is determined by the design used to produce the
data, the sample size n and the population distribution having a mean $\mu$ and variance
$\sigma^2$. It can be shown that the variance of $\bar{X}$ is
$$ \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}. $$
If $N \to \infty$, $V(\bar{x}) = \frac{\sigma^2}{n}$ and $\mu_{\bar{x}} = \mu$.
The same holds true if SRS with replacement is taken. The shape of the distribution of
$\bar{x}$ depends on the shape of the population distribution. If the population is
normal, $N(\mu, \sigma^2)$, then the sample mean $\bar{X}$ is $N\left(\mu, \frac{\sigma^2}{n}\right)$.
In general, any linear combination of independent normal random variables is also normally
distributed. To elaborate this concept, let us start with the following illustration.
Example 4: Suppose a population consists of five numbers 2, 4, 6, 8, and 10. Consider
all possible samples of size 2 which can be drawn with replacement from this population.
Find the
a) mean of the population.
b) variance of the population.
c) mean and standard deviation of the sampling distribution of the means.
Solution: a) mean of the population, $\mu = 6$; b) variance of the population, $\sigma^2 = 8$;
c) The number of possible samples (taken with replacement) of size n = 2 from N = 5 is
given by $N^n = 5^2 = 25$. The samples, means and frequency distribution are given below.
Step 1: Draw all possible samples.
          2          4          6          8          10
 2     (2, 2)     (2, 4)     (2, 6)     (2, 8)     (2, 10)
 4     (4, 2)     (4, 4)     (4, 6)     (4, 8)     (4, 10)
 6     (6, 2)     (6, 4)     (6, 6)     (6, 8)     (6, 10)
 8     (8, 2)     (8, 4)     (8, 6)     (8, 8)     (8, 10)
10     (10, 2)    (10, 4)    (10, 6)    (10, 8)    (10, 10)
Step 2: Calculate the mean for each sample.
         2     4     6     8    10
 2       2     3     4     5     6
 4       3     4     5     6     7
 6       4     5     6     7     8
 8       5     6     7     8     9
10       6     7     8     9    10
The frequency distribution of the sample means is:
xi     2     3     4     5     6     7     8     9    10
fi     1     2     3     4     5     4     3     2     1
i) The mean of $\bar{X}$:
$$ \mu_{\bar{x}} = \frac{\sum x_i f_i}{\sum f_i} = \frac{150}{25} = 6. $$
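The enumeration in Example 4 can be reproduced with a short sketch. It lists all 25 samples drawn with replacement, then computes the mean and the standard deviation of the sampling distribution of the means; the variable names are illustrative only.

```python
import math
from itertools import product
from statistics import mean, pvariance

population = [2, 4, 6, 8, 10]
n = 2

# All ordered samples of size n drawn with replacement (N**n = 25 of them).
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)              # population mean
sigma2 = pvariance(population)     # population variance (divisor N)

mean_of_means = mean(sample_means)
var_of_means = pvariance(sample_means)
sd_of_means = math.sqrt(var_of_means)

print(len(samples), mu, sigma2)
print(mean_of_means, var_of_means, sd_of_means)
```

The mean of the sample means equals the population mean, and the variance of the sample means equals the population variance divided by n, in line with the formulas stated before the example.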
The standard error (s.e.) of the mean
Since there are many problems in which we are interested in the standard deviation of
the sample mean rather than its variance, we define the standard error of the mean.
Definition 1.1: The standard deviation of the sample mean is called the standard error
(s.e.) of the mean, $\sigma_{\bar{x}}$.
The standard error of the mean tells how much sample means vary from the mean of the
same population. It depends on two things: how large a sample we take and how much
variability there is in the population. Means based on large numbers of cases vary less
than means based on small numbers of cases.
Case 2: Sampling distribution of the mean (Infinite populations or sampling with
replacement)
Therefore, collecting these results from both cases, we have the standard error of the
mean:
a) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, if sampling is with replacement or from an infinite population.
b) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$, if sampling is without replacement from a finite population.
Decreases with an increase in n (more specifically, it is inversely proportional to
the square root of n); in the case of a finite population it decreases faster, since n
also appears in $\frac{N-n}{N-1}$, which is a fraction.
Note.
1. When N is large compared to n, the difference between the two formulae for $\sigma_{\bar{x}}$
is usually negligible, and $\frac{\sigma}{\sqrt{n}}$ will be taken as an approximation when we are
sampling from a large, finite population.
2. The finite population correction (fpc) is omitted (in practice) unless the sample
constitutes at least 5 percent of the population. This is not a hard-and-fast rule, but
only a rule of thumb.
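The two formulae for the standard error can be compared with a small sketch. The function names and the numerical inputs (a standard deviation of 12, n = 36, N = 1000) are hypothetical and serve only to show how little the correction factor matters when n is a small fraction of N.

```python
import math

def fpc(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def standard_error(sigma, n, N=None):
    """Standard error of the mean; the correction is applied only when N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= fpc(N, n)
    return se

sigma, n, N = 12.0, 36, 1000  # hypothetical values (assumed for illustration)
print(standard_error(sigma, n))      # infinite population or with replacement
print(standard_error(sigma, n, N))   # finite population, without replacement
print(fpc(N, n))                     # close to 1 because n/N is only 3.6 percent
```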
Activity
a) Find the fpc for n =100 and N =10,000.
b) When we take a sample from an infinite population, what happens to the standard
error of the mean when we use a sample mean to estimate the mean of a
population, if n is increased from 50 to 200?
Mathematically, if n is large (usually $n \ge 30$), $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, from which we
get the standardized form of $\bar{X}$, given by
$$ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). $$
$N(0, 1)$ is known as the standard normal distribution, i.e. $\mu = 0$ and $\sigma^2 = 1$. The Central
Limit Theorem justifies the use of normal-curve methods for many problems, because
(regardless of the actual shape of the population sampled) as long as we take a large
sample, the mean will be approximately normally distributed.
When the parent population is normal, however, regardless of the size of n, the mean is
normally distributed.
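A small simulation can make this result concrete. The sketch below assumes a strongly skewed population, an exponential distribution with mean 1 and standard deviation 1, which is not taken from the text; the seed, the sample size of 30 and the 2000 replications are likewise arbitrary choices.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so that the run is repeatable

def sample_mean_exponential(n, rate=1.0):
    """Mean of n draws from an exponential population (population mean = sd = 1/rate)."""
    return mean([random.expovariate(rate) for _ in range(n)])

n = 30
means = [sample_mean_exponential(n) for _ in range(2000)]

# The sample means should centre on the population mean (1.0) with a spread
# close to sigma / sqrt(n), even though the population itself is far from normal.
print(mean(means), pstdev(means), 1 / math.sqrt(n))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve, while a histogram of the raw exponential draws would not.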
Theoretical Result 4: Sampling from a normal population
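If $\bar{X}$ is the mean of a random sample of size n from a normal distribution with mean $\mu$
and variance $\sigma^2$, its sampling distribution is a normal distribution with mean $\mu$ and
variance $\frac{\sigma^2}{n}$.
Symbolically, $X \sim N(\mu, \sigma^2) \Rightarrow \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.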
Exercises
1. (a) Define sampling. Explain the different methods of sampling.
(b) State the advantages of adopting sampling procedure in carrying out large-scale
surveys.
2. (a) Sampling is necessary under certain conditions. Explain this with suitable
examples.
(b) Point out the importance of sampling in solving business and economic problems.
3. Distinguish between ‘census’ and ‘sampling’ methods of collection of data, and
compare their merits and demerits.
4. Explain the terms ‘population’ and ‘sample’. Is it sometimes necessary and often
desirable to collect information about the population by conducting a sample survey
instead of complete enumeration?
5. Define a random sample and show how you would achieve randomness. How do you
select a random sample from a finite population? Point out the advantages of a
stratified random sample.
6. Prepare a sample survey scheme for ascertaining the percentage of income that
households spend on food. In what respects will your scheme be optimal?
7. What are the main steps involved in a sample survey? Discuss the various sources
of errors in such surveys. Discuss briefly how these errors can be controlled.
8. (a) What is random sampling? How can a random sample be selected? Is random
sampling always better than other forms of sampling in the context of socio-economic
survey?
(b) A sample may be large yet worthless because it is not random; or it may be
random but unreliable because it is small. Comment upon this statement.
9. A population consists of the four numbers 3, 7, 11, and 15. Consider all possible
samples of size 2 drawn from this population without replacement. Find
a) the mean of the population.
b) the standard deviation of the population.
c) the mean of the sampling distribution of means.
d) the standard deviation of the sampling distribution of means.
Verify (c) and (d) from (a) and (b) using suitable formulae.
REFERENCE
Suggested textbook
Ott, R.L. and Longnecker, M.T. (2008). An Introduction to Statistical Methods and Data
Analysis. Duxbury Press, New York.
Reference books
Freund, R.J. and Wilson, W.J. (2003). Statistical Methods (2nd Ed.). Academic Press.
Snedecor, G.W. and Cochran, W.G. (1980). Statistical Methods (8th Ed.). Iowa State
University.
Unit 2
Sample Survey Methods
CONTENT
2.1 Introduction
2.2 Basic Survey Design
2.3 Steps in Sampling Design
2.4 Sampling and Non-sampling Errors
2.5 Sample Survey Techniques
2.1 Introduction
In a broad sense, sampling theory can be considered as coextensive with modern
statistical methods. Almost all of the modern developments in statistics relate to the
inferences that can be made about a population when information is available from only a
sample of the elements of the population. Some of the ways in which this is reflected in
statistical programs are mentioned below.
Survey Work
In most survey work, the population consists of all persons (or housing units, households,
industrial establishments, farms, etc.) in a city or other area. Information is obtained or
desired from a sample of the population, but inferences are required on characteristics of
the whole population.
Design and Analysis of Experiments
In the design and analysis of experiments, the population represents all possible
applications of several alternative techniques which can be used. For example, the
experiment may be agricultural, in which a number of fertilizers are being tested. The
population is infinite because it represents the use of the fertilizers in all possible farms
over all time. The problem is to design experiments so that the maximum amount of
information can be made available for inferences about the full population, estimated from
a sample of limited size.
Quality Control
Take random samples from populations
Explain why sample statistics are good estimators of population parameters
Judge one estimator as better than another based on desirable properties of
estimators
Understand basic survey designs
Explain Survey Sampling
Probability samples
Non-probability samples
2.3 Reasons for the use of Survey Sampling
There are six basic reasons for the use of samples:
A sample may save money (as compared with the cost of a complete census)
when absolute precision is not necessary.
A sample may make it possible to concentrate attention on individual cases.
A sample saves time, when data are desired more quickly than would be possible
with a complete census.
In industrial uses, some tests are destructive (for example, testing the length of
time an electric bulb will last) and can only be performed on a sample of items.
Some populations can be considered as infinite, and can, therefore, only be
sampled. A simple example is an agricultural experiment for testing fertilizers. In
one sense, a census can be considered as a sample at one instant of time of an
underlying causal system which has random features in it.
Where non-sampling errors are necessarily large, a sample may give better results
than a complete census because non-sampling errors are easier to control in
smaller-scale operations.
2.4 Limitations of Survey Sampling
Under certain conditions, the usefulness of sampling becomes questionable. Three
principal conditions can be mentioned.
1) If data are needed for very small areas, disproportionately large samples are required
since precision of a sample depends largely on the sample size and not on the
sampling rate. In this case, sampling may be almost as expensive as a complete
census.
2) If data are needed at regular intervals of time, and it is important to measure very
small changes from one period to the next, very large samples may be necessary.
3) If there are unusually high overhead costs connected with a sample survey, caused
by work involved in sample selection, control, etc., sampling may be impractical. For
example, in a country with many small villages it may be more economical to
enumerate all the households in the sample villages than to enumerate a sample of
households within the sample villages. For office processing, however, a sample of
the enumerated households may be used to reduce the work and costs of producing
tabulations.
3. Cross-sectional: information gathered only at one point in time; uses variation
between subjects; is less expensive and the most common type of design.
4. Longitudinal: same subjects observed in more than one period; uses both cross-
sectional as well as within subject variation; useful for capturing changes over
time.
5. Case study: detailed analysis of individual cases.
Example 5: If the study concerns income, then the definition of the population as
individuals or households can make a difference.
A list of elements from which the sample is actually drawn is important and necessary.
d) Identifying parameters of interest
Measurable Reliability
It should be possible to measure the reliability of the estimates made from the sample.
That is, in addition to the desired estimates of characteristics of the population (totals,
averages, percentages, etc.) the sample should give measures of the precision of these
estimates. As we shall see later, these measures of precision can be used to indicate the
maximum error that may reasonably be expected in the estimates, if the procedures are
carried out as specified, and if the sample is moderately large. The estimation of
precision is not possible unless the selection is carried out so that the chance of selection
of each unit is known in advance and random sampling is used.
Feasibility
A third characteristic is that the sampling plan must be practical. It must be sufficiently
simple and straightforward so that it can be carried out substantially as planned; that is,
the sampling theory and practice will be the same. A plan for selecting a sample, no
matter how attractive it may appear on paper, is useful only to the extent that it can be
carried out in practice. When the methods actually followed are the same (or substantially
the same) as specified in the sampling plan, then known sampling theory provides the
necessary measures of reliability. In addition, the measures of reliability computed from
the survey results will serve as powerful guides for future improvement in important
aspects of the sample design.
Unit of analysis is the unit for which we wish to obtain statistical data. The most common
units of analysis are persons, households, farms, and business firms. They may also be
products coming out of some machine process. The unit of analysis is frequently called
an element of the population. There may be more than one unit of analysis in the same
survey; for example, households and persons; or number of farms and hectares (or
acres) harvested.
A characteristic is a general term for any variable or attribute having different possible
values for different individual units of sampling or analysis. In a sample survey, we
observe or measure the values of one or more characteristics for the units in the sample.
For example, we observe (or ask about) the area of land for rice crop, the number of
cattle on a farm, the age and sex of a person, the number of children per family, etc. So,
we observe a unit, but we measure several characteristics of that unit.
A population or universe is the entire group of all the units of analysis whose
characteristics are to be estimated. The chapters in this sampling manual will deal
primarily with a finite population, having N units.
A sampling frame is the totality of the sampling units from which the sample is to be
selected. The frame may be a list of persons or of housing units; it may be a subdivided
map, or it may be a directory of names and addresses stored in some kind of electronic
medium, such as a file in a hard disk or a data base.
A parameter is a quantity computed from all values in a population set. That is, a
parameter is a descriptive measure of a population. For example, consider a population
consisting of N elements. Then the population total, the population average or any other
quantity computed from measurements including all elements of the population is a
parameter. The objective of sampling is to estimate the parameters of a population.
Note that the term statistic refers to a sample estimate and the term parameter refers to a
population value.
An estimator is a mathematical formula or rule which uses sample results to produce an
estimate for the entire population. For example, the sample average,
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, is an estimator. It provides an estimate of the parameter, the
population average, $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$; that is, the sample average is an estimate of the
population average.
Therefore, the estimator refers to a mathematical formula. When numbers are plugged
into the formula, an estimate is produced. However, in common statistical language, the
words estimate and estimator are used interchangeably.
The probability of selection is the chance that each unit in the population has of being
included in the sample. Probability values range from 0 to 1, inclusive.
The probability distribution gives the probabilities associated with the values which a
random variable can equal. If there are N values that a random variable X can take, say
$X_1, X_2, \ldots, X_N$, then there are N probabilities associated with the $X_i$ values, namely
$P_1, P_2, \ldots, P_N$. The probabilities and the values the random variable takes constitute the
probability distribution of X.
2.8 Sampling Distribution
The expected value is the average value for a single characteristic over all possible
samples. Mathematically, we define the expected value (or mean) of a discrete random
variable Y as:
$$ E(y) = \sum_{y} y\,p(y). $$
The Greek letter $\Sigma$ is used to indicate the sum of the products of all possible values of y
and their associated probabilities p(y); and $\sum_{y} p(y) = 1$. The small y denotes a particular
value of Y. An analogous definition of the expected value can be provided for a continuous
random variable using an integral in place of a summation operator. For now we limit our
interest to the discrete situation. Below we provide an illustration.
The expected value is a weighted average of the possible outcomes, with the probability
weights reflecting the likelihood of occurrence of each outcome. Thus, the expected value
should be interpreted as the long-run average value of Y, if the frequency with which
each outcome occurs is in accordance with its probability.
For example, consider 1 in which the random variable Y is used to represent the size of a
U.S. household selected at random. We write the expected value of Y as:
Household Unit (HU)     HH Size (Yi)
U1                      3
U2                      5
U3                      7
U4                      9
U5                      11
The total number of persons in the population is: $Y = \sum_{i=1}^{N} Y_i = 35$. The average number of
persons in a household (or average household size) is: $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = 7$.
If we take a sample of size 2 from this population, there are $\binom{5}{2} = 10$ possibilities, and
they are (3, 5), (3, 7), (3, 9), (3, 11), (5, 7), (5, 9), (5, 11), (7, 9), (7, 11), (9, 11).
The means of these samples are 4, 5, 6, 7, 6, 7, 8, 8, 9, and 10, respectively. If
sampling is random, so that each sample has the probability 1/10, we obtain all the
possible samples of size two HUs from a population of 5 HUs, as shown in the table
below. The sampling distribution of the mean is also presented.
Sample     Mean     Probability
3,5        4        1/10
3,7        5        1/10
3,9        6        1/10
3,11       7        1/10
5,7        6        1/10
5,9        7        1/10
5,11       8        1/10
7,9        8        1/10
7,11       9        1/10
9,11       10       1/10
Mean ($\bar{y}$)     Probability ($p(\bar{y})$)
4          1/10
5          1/10
6          2/10
7          2/10
8          2/10
9          1/10
10         1/10
and the corresponding $\bar{y}$ = 5, 6, 7, 8, 9, the probability is 8/10 that the sample mean will
not differ from the population mean by more than 2.
Further useful information about this sampling distribution of the mean can be obtained
by calculating its expected value as follows:
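$$ E(\bar{y}) = \left(4 \times \tfrac{1}{10}\right) + \left(5 \times \tfrac{1}{10}\right) + \left(6 \times \tfrac{2}{10}\right) + \left(7 \times \tfrac{2}{10}\right) + \left(8 \times \tfrac{2}{10}\right) + \left(9 \times \tfrac{1}{10}\right) + \left(10 \times \tfrac{1}{10}\right) = 7. $$
Note that the same result would be obtained for samples of any size. Recall the definition
of the expected value, which is the average of a single characteristic over all possible
samples.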
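The sampling distribution and its expected value can be rebuilt by enumeration. The sketch below uses the five household sizes from the example; exact fractions are used so that the probabilities add without rounding, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

household_sizes = [3, 5, 7, 9, 11]
n = 2

# Every possible sample of n households drawn without replacement.
samples = list(combinations(household_sizes, n))
prob = Fraction(1, len(samples))  # each sample is equally likely under random sampling

# Sampling distribution of the mean: sample mean -> probability.
distribution = Counter()
for s in samples:
    distribution[Fraction(sum(s), n)] += prob

expected_value = sum(m * p for m, p in distribution.items())
within_two = sum(p for m, p in distribution.items() if abs(m - 7) <= 2)

for m in sorted(distribution):
    print(m, distribution[m])
print(expected_value, within_two)
```

The expected value of the sample mean equals the population mean of 7, and the probability that a sample mean falls within 2 of it is 8/10.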
We will now compare the distribution of the sample estimates to show that: As the sample
size increases, the means of the samples tend to concentrate more and more around the
true average value. In other words, the estimates tend to become more and more reliable
as the sample size increases.
The percentage distributions of the sample estimates can be used to predict the chance
of obtaining a sample estimate within specified ranges of the true value. To see the
above statements, consider a hypothetical population of 12 individuals. We wish to make
different estimates from a sample of 1, 2, 3, 4, 5, 6 and 7 individuals.
We have seen that the precision of a sample can be predicted if we have the distribution
of all sample estimates of a given size for the population. In a real situation, we cannot
select all possible samples and examine the estimates derived from them. We must
depend upon a single sample. Therefore, it is necessary to find some measure of the
extent to which the estimates made from various samples differ from the true value; this
measure, if it is to be useful, must be one that can be estimated from the sample itself.
Before showing how and why we can do this, we shall introduce certain definitions and
relationships which are derived from the theory of sampling.
Standard deviation is the measure of variability in the population; in subsequent
discussion and consideration this is the measure of dispersion we will be using. The
square of the standard deviation is called the population variance, and is designated by the
symbol $\sigma^2$. The variance of the population is defined as the average of the squares of
the deviations of all the individual observations from their mean value. Thus, it would be
computed as follows, if all the values in the universe could be observed:
$$ \sigma^2 = \frac{1}{N}\left[(Y_1 - \bar{Y})^2 + \cdots + (Y_N - \bar{Y})^2\right] = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2, $$
where the Y's with subscripts are individual observations and $\bar{Y}$ is the mean of the N
observations for the N elements in the universe. Note that it has become fairly general
practice to denote the population variance by $\sigma^2$ when dividing by N, and by $S^2$ when
dividing by N-1; symbolically,
$$ S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2. $$
The corresponding sample variance is
$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2, $$
where n is the sample size, $y_i$ is the sample measurement of a characteristic and $\bar{y}$ is the
sample mean. We will use $S^2$ throughout the text because $s^2$ is an unbiased estimate of
$S^2$. Note that all results are equivalent in either notation.
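The unbiasedness of the sample variance can be checked by brute force on a small population. The sketch below reuses the five household sizes from the earlier example and averages the sample variance over every possible simple random sample of size 2 drawn without replacement; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

population = [3, 5, 7, 9, 11]  # household sizes from the earlier example
n = 2

sigma2 = pvariance(population)  # divisor N
S2 = variance(population)       # divisor N - 1

# Sample variance (divisor n - 1) for every possible sample of size n.
sample_variances = [variance(s) for s in combinations(population, n)]
average_s2 = mean(sample_variances)

print(sigma2, S2, average_s2)
```

The average of the sample variances over all possible samples equals the population variance computed with divisor N-1, not the one computed with divisor N.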
The variance of the sample means is the average of the squares of the deviations of the
means of all possible samples of size n from the true mean. The variance of $\bar{y}$ is denoted
by $S^2(\bar{y})$:
$$ S^2(\bar{y}) = \frac{S^2}{n} \cdot \frac{N-n}{N} = \frac{S^2}{n}(1 - f), $$
where $f = \frac{n}{N}$ is called the sampling fraction.
The reciprocal of f, that is N/n, is known as the sampling weight or inverse sampling fraction.
The standard error of the sample means is the square root of the variance of $\bar{y}$; for
samples of size n, that is,
$$ S(\bar{y}) = \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N}}. $$
It is important to note that the standard error varies with the size of the sample, as we
would expect it to be. As the sample size increases, the standard error becomes smaller
and smaller. This is shown in the following:
The factor $\frac{N-n}{N}$ in the formula for the variance of $\bar{y}$ is called the finite population
correction (fpc). As a rule of thumb, if $n \le 0.05N$ we can ignore $\frac{N-n}{N}$ since its value will
be close to 1. Otherwise, we should include it in the formula in order not to severely
overestimate the variance of $\bar{y}$.
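The effect of the correction can be seen on the small household population used earlier, where the sample is 40 percent of the population. The sketch below compares the variance of the sample mean obtained by enumerating all samples with the formula values computed with and without the fpc; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

population = [3, 5, 7, 9, 11]  # household sizes from the earlier example
N, n = len(population), 2

S2 = variance(population)  # population variance with divisor N - 1

# Variance of the sample mean obtained directly from all possible samples.
sample_means = [mean(s) for s in combinations(population, n)]
direct = pvariance(sample_means)

with_fpc = (S2 / n) * (N - n) / N  # formula including the fpc
without_fpc = S2 / n               # formula ignoring the fpc

print(direct, with_fpc, without_fpc)
```

With the fpc the formula reproduces the enumerated variance exactly; without it the variance is overstated, because here n is far more than 5 percent of N.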
percent; for two standard errors, it is 95 percent; for three standard errors, it is 99.7
percent.
Estimates are subject to both sampling errors and non-sampling errors. Sampling error
arises because information is not collected from the entire target population, but rather
from some portion of it. Through the use of scientific sampling procedures, however, it is
possible to estimate from the sample data the range within which the true population
value (parameter) is likely to be with a known probability.
Non-sampling error, on the other hand, is defined as a residual category consisting of all
other errors which are not the result of the data having been collected from only a
sample. These include errors made by respondents, enumerators, supervisors, office
clerical staff, key coding operators, etc.
Total Error (Mean Square Error). The total error is the sum of all errors about a sample
estimate, both sampling and non-sampling, both variable and systematic. An illustration
of the composition of the total error follows:
In practice, the bulk of sampling error consists of variable error, and by contrast the bulk
of non-sampling error is bias.
2.10 Sample survey Techniques
• Every possible combination of n sampling units has the same chance of being
chosen.
• Selection of one sampling unit at a time with equal probability may be
accomplished by either sampling with replacement or without replacement. Almost all,
if not all, samples are selected without replacement.
• Using a table of random numbers to select the units satisfies this definition of
simple random sampling.
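In practice a pseudo-random number generator usually takes the place of a printed table of random numbers. The sketch below is one possible way to draw a simple random sample without replacement; the frame of 20 numbered units, the function name and the seed are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, every subset of size n being equally likely."""
    rng = random.Random(seed)     # a seed makes the selection reproducible
    return rng.sample(frame, n)   # sampling without replacement

frame = list(range(1, 21))  # hypothetical sampling frame: units numbered 1 to 20
selected = simple_random_sample(frame, 5, seed=2024)
print(selected)
```

Recording the seed alongside the frame allows the selection to be repeated and audited later.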
Advantages of SRS
• It is free of classification error.
• It requires minimum advance knowledge of the population.
• It best suits situations where not much information is available about the
population and data collection can be efficiently conducted on randomly distributed
items.
• If these conditions are not true, stratified sampling or cluster sampling may be a
better choice.
Disadvantages of SRS
• If the population is widely dispersed, it may be extremely costly to reach the units.
• A current list of the whole population we are interested in (sampling frame) may
not be readily available.
• Or perhaps, the population itself is not homogeneous and the sub-groups are very
different in size. In such a case, precision can be increased through stratified
sampling.
$\bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ = population mean; $\bar{y} = \frac{y}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$ = sample mean.
The population variance and sample variance have been introduced above.
Note. In simple random sampling, $s^2$ is an unbiased estimate of $S^2$.
Population values, their respective estimates and measure of precision
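The sample estimate of the population total value, \(Y\), is denoted by \(\hat{Y}\), and can be written
as:
\[ \hat{Y} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i, \]
where \(\bar{y}\) is the estimate of the population average, \(\bar{Y}\), and is given by
\(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\). The sampling error of the estimate \(\hat{Y}\) is
\[ S(\hat{Y}) = \frac{NS}{\sqrt{n}}\sqrt{\frac{N-n}{N}}, \]
and the sampling error of \(\bar{y}\) is
\[ S(\bar{y}) = \frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \]
Replacing \(S\) by its sample estimate \(s\) gives the estimated sampling errors:
\[ s(\hat{Y}) = \frac{Ns}{\sqrt{n}}\sqrt{\frac{N-n}{N}}, \qquad s(\bar{y}) = \frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \]
The coefficients of variation and their sample estimates are:
\[ CV(\hat{Y}) = \frac{S(\hat{Y})}{Y} \quad \text{and} \quad cv(\hat{Y}) = \frac{s(\hat{Y})}{\hat{Y}} = \frac{s(\bar{y})}{\bar{y}}, \]
\[ CV(\bar{y}) = \frac{S(\bar{y})}{\bar{Y}} \quad \text{and} \quad cv(\bar{y}) = \frac{s(\bar{y})}{\bar{y}}. \]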
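The estimates above can be computed directly from a sample. The following is a sketch only: the five data values, the population size N = 50 and the function name `srs_estimates` are hypothetical and do not come from the module.

```python
import math
import statistics


def srs_estimates(sample, N):
    """Estimates and estimated sampling errors under SRS without replacement."""
    n = len(sample)
    y_bar = statistics.mean(sample)     # sample mean, estimates the population mean
    s = statistics.stdev(sample)        # sample standard deviation (divisor n - 1)
    fpc = math.sqrt((N - n) / N)        # finite population correction
    se_mean = s / math.sqrt(n) * fpc    # s(y_bar)
    return {
        "mean": y_bar,
        "total": N * y_bar,             # Y_hat = N * y_bar
        "se_mean": se_mean,
        "se_total": N * se_mean,        # s(Y_hat) = N * s(y_bar)
        "cv": se_mean / y_bar,          # identical for the mean and the total
    }


# Hypothetical data: 5 sampled values from a population of N = 50 units.
result = srs_estimates([4, 6, 8, 10, 12], N=50)
print(result)
```

Because the total is just N times the mean, its sampling error scales by N as well, which is why the coefficient of variation is the same for both estimates.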
income greater than a certain amount, or the proportion of business firms interested in
purchasing a particular product. Secondly, it may be desired to classify a population into
a number of groups, and to find the percentage of the total population in each of these
groups. The groups may have a natural ordering as in distribution by age (0 to 4 years, 5
to 9, 10 to 14, etc.) or income classes; or they may be groups having no natural order,
such as those in an industrial classification of business firms, where the groups can be
arranged in a number of ways. The analysis is the same whenever the proportion of the
total in each group is the statistic to be measured.
All of the formulas discussed in previous sections can be applied to this particular case by
considering each member of the population as having a characteristic which can have
only one of two values, either 0 or 1. If the member is in a particular class in which we are
interested, the value assigned is 1; if the member is not in the class, the value is 0.
Examining the entire population, we can see that the A members of the class each have
a value of 1; the rest have a value of 0. Adding up the values for all elements of the
population, we get A. In other words, A can be considered as the equivalent of
\(Y = \sum_{i=1}^{N} Y_i\). Similarly, \(P = \frac{A}{N}\) can be considered in the same way as \(\bar{Y} = \frac{Y}{N}\).
Applicable formulas
In sampling for proportions, the following formulas are applicable (with simple random
sampling):
\[ \hat{P} = p = \frac{a}{n} \quad \text{and} \quad \hat{A} = pN. \]
That is, an estimate of the proportion in the population is obtained by using the sample
proportion, and an estimate of the total number of units having the characteristic is
obtained by multiplying the sample proportion by the total number of units in the
population. Also
\[ \sigma^2 = PQ, \qquad S^2 = \frac{NPQ}{N-1}. \]
The population variance is PQ. Note that it is the variance of the population distribution
giving the value of 1 or 0 to an element depending on whether or not it is in the class
(whether it has the attribute in question). It can still be estimated by \(pq\), unless n is very
small (for example n < 30), in which case the formula is \(s^2 = \frac{n}{n-1}pq\).
The variance of the sample proportion is
\[ \sigma^2(\hat{P}) = \frac{N-n}{N-1}\cdot\frac{PQ}{n}. \]
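A short sketch of these proportion formulas follows. The 0/1 indicator data, the population size N = 200 and both function names are hypothetical, chosen only to show the arithmetic.

```python
def proportion_estimates(indicators, N):
    """Return (p, A_hat, s2) from 0/1 indicators observed in a simple random sample."""
    n = len(indicators)
    a = sum(indicators)          # number of sampled units in the class
    p = a / n                    # estimate of the population proportion P
    q = 1 - p
    A_hat = p * N                # estimated number of units in the class
    s2 = n / (n - 1) * p * q     # small-sample variance estimate
    return p, A_hat, s2


def variance_of_sample_proportion(P, N, n):
    """Variance of the sample proportion under SRS without replacement."""
    Q = 1 - P
    return (N - n) / (N - 1) * P * Q / n


# Hypothetical sample: 3 of 10 sampled units have the attribute; N = 200.
p, A_hat, s2 = proportion_estimates([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], N=200)
print(p, A_hat, s2)
print(variance_of_sample_proportion(P=0.3, N=200, n=10))
```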
2.10.4 Sample size determination and selection of sample points under SRS
One of the first questions which a statistician is called upon to answer in planning a
sample survey refers to the size of the sample required for estimating a population
parameter with a specified precision. Making a decision about the size of the sample for
the survey is important. Too large a sample implies a waste of resources, and too small a
sample diminishes the utility of the results.
Specific considerations for determining the sample size
When considering sample size determination, there are three very important concerns:
accuracy, practicality, and efficiency.
Accuracy can be defined as an inverse measure of the total error. Total error is the sum
of sampling error (SE) and non-sampling error (NSE). Sampling error arises because
only a part of the population is observed, and not all of it. The terms PRECISION
and RELIABILITY are associated with sampling error. Estimator A is more precise or
more reliable than estimator B if the sampling error of A is smaller than the sampling error
of B. Non-sampling errors are usually biases which are very often due to poor quality
control of the survey operations (poor questionnaire design; interviewers that are not well
trained; response errors; etc.)
Practicality. To obtain an accurate estimate, both sampling and non-sampling errors
must be reduced. However, accuracy may come into conflict with practicality because:
1. to reduce sampling errors and increase precision, the sample size must be large.
2. too large a sample can impose an excessive burden on the limited resources
available (and resources are usually very limited) and increase the likelihood of
non-sampling errors.
Efficiency. A further concern is that a given sample size can produce different levels of
precision depending on which sampling techniques are chosen. This concept is known as
the statistical efficiency of the design. The most efficient design is the one that gives the
most precision for the same sample size. Therefore, expert sample design is needed in
the determination of the optimal sample size.
\[ p \pm 2\sqrt{(1-f)\frac{pq}{n}} = 0.20 \pm 2\sqrt{\left(1-\frac{50}{500}\right)\frac{(0.20)(0.80)}{50}} = (0.093, 0.307). \]
The conclusion is that between 9.3% and 30.7% of the population is of Chinese descent;
this interval is too wide to be useful. There are two ways in which a narrower interval
could be obtained:
will study each one in detail.
• Degree of precision desired
• Formula to connect n with desired precision
• Advance estimates of variability in population
• Cost and operational constraints
• Expected sample loss due to non-response
• Number of different characteristics for which specified precision is required
• Population subdivisions for which separate estimates of a given precision are
required; these are also called domains of estimation.
• Expected gain or loss in efficiency
Degree of precision
Precision of an estimate refers to the amount of variable error, mainly sampling error,
contained in an estimate. To lower the sampling error, that is, to increase the precision,
we want n to be sufficiently large. Therefore, we decide on a target value for the precision
of the estimate. The degree of precision desired can be stated in terms of:
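The absolute error (E) for the estimate \(\hat{\theta}\) is expected to satisfy the probability relation:
\[ P\left(|\hat{\theta} - \theta| \le E\right) = 1 - \alpha. \]
The relative error (RE) for the estimate \(\hat{\theta}\) is expected to satisfy the probability relation:
\[ P\left(\frac{|\hat{\theta} - \theta|}{\theta} \le RE\right) = 1 - \alpha. \]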
This is E expressed as a proportion (or percentage) of the true value of the parameter
being estimated. For example, if E = 5 hectares and the true value of the parameter is
100, then RE = 5/100 = 0.05 or 5%.
The target coefficient of variation (cv) for the estimate (\(v_0\))
We set the cv (also known as the relative standard error) for the estimate equal to a
target value \(v_0\). For example, we can have:
\[ \frac{\sqrt{VAR(\hat{\theta})}}{\theta} = 0.05 = 5\%. \]
Depending on which of the three ways we use to specify the precision, the formula for n
will be different. The values of E, RE and \(v_0\) are usually decided by the user of the data in
conjunction with a statistician.
Formula that connects n (sample size) with desired degree of precision
\(S^2\) = the population variance; \(\sigma^2\) could be used instead.
n = the desired sample size.
CV = the population coefficient of variation, \(\frac{S}{\bar{Y}}\), where \(\bar{Y}\) is the population mean.
N = number of units in the population.
Note: The level of confidence states the probability that the n determined will provide the
degree of precision specified. For example, a 95% level of confidence means that, except
for a small chance (5%), we can be 95% certain that the precision specified will be
reached with the calculated n. This is equivalent to saying that the acceptable risk is 5%
that the value will lie outside of the range specified in the confidence interval.
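The sample size needed to estimate a mean with absolute error E.
The sampling error of a mean using simple random sampling is given by
\[ S(\bar{y}) = \frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \qquad (1) \]
Setting the absolute error E equal to k sampling errors, where k is the multiplier for the
chosen level of confidence (k = 2 for approximately 95% confidence), gives
\[ E = k\,\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N}}. \qquad (2) \]
Solving equation (2) for n gives
\[ n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N}. \qquad (3) \]
If the population size is large and \(n \le 0.05N\), the finite population correction factor (fpc) in
equation (2) can be ignored because its effect would be minimal. In this case, we have:
\[ n = \frac{k^2 S^2}{E^2}. \]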
Example 7: Consider a population consisting of 1,000 farms for which the population
variance of the number of cattle per farm is 250 (N = 1,000 and \(S^2\) = 250). Suppose we
want to estimate the average number of cattle per farm from a sample. We wish to have
reasonable confidence that the estimate will be close to the true value. Suppose the
sample estimate is to be in error by no more than 1 (one head of cattle) from the true
average, and we require an assurance of 95 chances out of 100 that the error will be no
larger than 1. In this case, E = 1, N = 1,000, and \(S^2\) = 250.
Applying equation (3) with k = 2 gives a value of n greater than or equal to
\[ n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N} = \frac{4(1000)(250)}{4(250) + 1(1000)} = 500. \]
If in the same situation we are satisfied with an error of not more than 3, with a
confidence level of 95 percent, the only change in the formula would be in the values of E
and \(E^2\), as follows: E = 3 and \(E^2\) = 9.
\[ n = \frac{k^2 N S^2}{k^2 S^2 + E^2 N} = \frac{4(1000)(250)}{4(250) + 9(1000)} = 100. \]
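The calculation in Example 7 can be checked with a few lines of Python. This is a sketch under the assumption that the result is rounded up to the next whole unit; the function name `sample_size_mean` and the small tolerance used to guard against floating-point noise are illustrative choices, not part of the module.

```python
import math


def sample_size_mean(S2, E, k=2.0, N=None):
    """Sample size for estimating a mean to within absolute error E.

    S2: population variance (or an advance estimate of it)
    k:  confidence multiplier (k = 2 for roughly 95% confidence)
    N:  population size; if None, the fpc is ignored
    """
    if N is None:
        n = k**2 * S2 / E**2                              # large-population formula
    else:
        n = k**2 * N * S2 / (k**2 * S2 + E**2 * N)        # equation (3)
    return math.ceil(n - 1e-9)                            # round up, tolerating float noise


print(sample_size_mean(S2=250, E=1, k=2, N=1000))  # Example 7 with E = 1
print(sample_size_mean(S2=250, E=3, k=2, N=1000))  # Example 7 with E = 3
```

Tripling the tolerated error from 1 to 3 cuts the required sample from 500 to 100 farms, which shows how strongly n depends on the precision demanded.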
Example 8: We wish to estimate the average age of 2,000 seniors on a particular college
campus. How large an SRS must be taken if we wish to estimate the age to within 2 years
of the true average, with 95% confidence? Assume \(S^2\) = 30.
With E = 2 and k = 2, and ignoring the finite population correction:
\[ n = \frac{k^2 S^2}{E^2} = \frac{2^2(30)}{2^2} = 30 \text{ seniors.} \]
The sample size n to estimate a population proportion P is obtained from equation (3). In
the equation, \(S^2 = PQ\):
\[ n = \frac{k^2 N PQ}{k^2 PQ + E^2 N}. \qquad (4) \]
If the population size is large and \(n \le 0.05N\), the fpc in equation (4) can be ignored
because its effect would be minimal. In this case, we have:
\[ n = \frac{k^2 PQ}{E^2}. \]
\[ n = \frac{k^2 PQ}{E^2} = \frac{2^2(0.25)(0.75)}{(0.03)^2} \approx 834. \]
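The same arithmetic for proportions can be sketched as follows. The function name `sample_size_proportion` is hypothetical, the result is assumed to be rounded up, and the population size of 10,000 in the second call is an invented value used only to show the effect of the fpc.

```python
import math


def sample_size_proportion(P, E, k=2.0, N=None):
    """Sample size for estimating a proportion P to within absolute error E."""
    PQ = P * (1 - P)                                      # plays the role of S^2
    if N is None:
        n = k**2 * PQ / E**2                              # fpc ignored
    else:
        n = k**2 * N * PQ / (k**2 * PQ + E**2 * N)        # equation (4)
    return math.ceil(n - 1e-9)                            # round up, tolerating float noise


print(sample_size_proportion(P=0.25, E=0.03, k=2))            # fpc ignored
print(sample_size_proportion(P=0.25, E=0.03, k=2, N=10_000))  # hypothetical N
```

When nothing is known about P in advance, using P = 0.5 gives the largest value of PQ and therefore the most conservative sample size.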
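For an estimate of the population total with absolute error E, the sampling error
\(S(\hat{Y}) = N\,S(\bar{y})\) leads to:
\[ n = \frac{k^2 N^3 S^2}{k^2 N^2 S^2 + E^2 N} = \frac{k^2 N^2 S^2}{k^2 N S^2 + E^2}. \qquad (a) \]
If we ignore fpc, we have
\[ n = \frac{k^2 N^2 S^2}{E^2}. \qquad (b) \]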
Sample size needed to estimate the number of units that possess a certain
attribute with absolute error E
To obtain the n necessary to estimate A, the number of units that possess a certain
characteristic, substitute PQ in place of \(S^2\) in equations (a) and (b).
2.10.6 Sample size formulas when the error is expressed in relative terms (RE)
We can obtain formulas for estimates when the desired error is expressed in relative
terms instead of absolute terms. For relative errors (RE), if RE is a proportion of the
estimate, substitute \(cv(\bar{y}) = \frac{RE}{k}\) in (a) or (b) above. To avoid confusing the
estimated coefficient of variation with the population coefficient of variation (CV), we
shall denote the estimated coefficient of variation by cv. We have:
\[ n = \frac{k^2 N (CV)^2}{k^2 (CV)^2 + N (RE)^2}. \]
Note 1: \(cv(\bar{y}) = \frac{RE}{k} = \frac{s(\bar{y})}{\bar{y}}\). This applies to both the mean and the total. If we ignore fpc, the
above equation becomes
\[ n = \frac{k^2 (CV)^2}{(RE)^2}. \]
Note 2: In actual practice, we usually do not know \(S^2\) or \((CV)^2\) in advance of the
survey. Instead, we use rough estimates of \(S^2\) or \((CV)^2\)
obtained by the methods discussed earlier.
Note 3: For the mean and the total, it is better to express the variance in relative rather
than absolute terms, for two reasons:
1. Most importantly, because a population's relative variance is more stable than its
absolute variance. A guess or estimate of the population coefficient of variation CV
(from past data or from similar populations) is likely to be closer to the true value
than a guess or estimate of the variance.
2. The formula for n is the same for estimator of means or totals when it is expressed
in terms of the coefficient of variation.
Note 4: To estimate the proportion P, it is preferable to use the absolute error previously
discussed because the proportion is itself a relative quantity, so that taking the
percentage of a percentage can become confusing.
To obtain the formula for the sample size required to estimate a population proportion
when the error is expressed as relative error (RE), use equation (a) above.
\[ n = \frac{k^2 N (CV)^2}{k^2 (CV)^2 + N (RE)^2}. \]
Replacing \((CV)^2 = \frac{Q}{P}\), we get
\[ n = \frac{k^2 N Q}{k^2 Q + N P (RE)^2}. \]
If we ignore the fpc, this equation will yield
\[ n = \frac{k^2 Q}{(RE)^2 P}. \]
Example 9: We would like to carry out a survey to estimate the total area in hectares of
the farms used by a population. We desire that the estimate be within 10% of the true
value. How many farms should be surveyed? (In a pilot survey, we estimated the
population coefficient of variation, CV, of the variable farm size to be 1.2). Use 95%
confidence.
\[ n = \frac{k^2 (CV)^2}{(RE)^2} = \frac{2^2 (1.2)^2}{(0.1)^2} = 576 \text{ farms.} \]
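\[ n = \frac{\left(\dfrac{k}{RE}\right)^2 (CV)^2}{1 + \dfrac{1}{N}\left(\dfrac{k}{RE}\right)^2 (CV)^2} = \frac{\left(\dfrac{CV}{v_0}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{CV}{v_0}\right)^2}. \qquad (c) \]
If we ignore the fpc,
\[ n = \left(\frac{k}{RE}\right)^2 (CV)^2 = \left(\frac{CV}{v_0}\right)^2. \qquad (d) \]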
Equations (c) and (d) work for both mean and total.
Example 10: A survey is to be carried out to estimate the total area in hectares of the farms
in a population. The estimate should be within 10 percent of the true value with 95
percent confidence. How many farms should be surveyed? [In a pilot survey, the estimated
population coefficient of variation CV of the variable "farm size" was found to be 1.2.]
In this case, k = 2, CV = 1.2, and RE = 0.1, so \(v_0 = \frac{RE}{k} = \frac{0.1}{2} = 0.05\). Thus \(n = \left(\frac{1.2}{0.05}\right)^2 = 576\).
Example 11: The results of a pilot test used to estimate \(\bar{Y}\) and S for the variable
'income' in a population of 5,000 households showed that \(\bar{Y}\) = $14,852 per household and
S = $12,300. A full scale survey is planned. What should be the sample size for this
survey if we want to estimate the mean income per household with a cv no larger than
5%?
The estimated population coefficient of variation cv and the resulting sample size n, ignoring the fpc, are:
\[ cv = \frac{12300}{14852} = 0.828 = 82.8\% \]
\[ n = \left(\frac{cv}{v_0}\right)^2 = \left(\frac{0.828}{0.05}\right)^2 \approx 275 \text{ households.} \]
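Examples 10 and 11 both reduce to the ratio of the population coefficient of variation to the target value \(v_0\). The sketch below assumes the result is rounded up to a whole unit; the function name `sample_size_cv` and the floating-point tolerance are illustrative choices rather than anything prescribed by the module.

```python
import math


def sample_size_cv(CV, v0, N=None):
    """Sample size for a target coefficient of variation v0 (v0 = RE / k).

    CV: population coefficient of variation (or a pilot estimate of it)
    N:  population size; if None, the fpc is ignored
    """
    n0 = (CV / v0) ** 2                   # formula without the fpc
    n = n0 if N is None else n0 / (1 + n0 / N)   # with N, apply the fpc
    return math.ceil(n - 1e-9)            # round up, tolerating float noise


# Example 10: RE = 0.1 and k = 2, so v0 = 0.05; CV = 1.2.
print(sample_size_cv(CV=1.2, v0=0.1 / 2))

# Example 11: pilot estimates S = 12300 and mean = 14852; target cv of 5%.
print(sample_size_cv(CV=12300 / 14852, v0=0.05))
```

Passing `N=5000` in the second call applies the finite population correction, which lowers the requirement somewhat because the uncorrected n is just over 5% of the population.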
In the preceding section, we noted that most of the sample size formulas are written in
terms of the population variance. In practice the population variance is unknown and it
must be estimated or guessed. There are five ways of estimating population variances for
sample size determination.
Method 1: Select the sample in two steps, the first being a simple random sample of
size \(n_1\) (the first sample) from which estimates \(s_1^2\) and \(p_1\) of \(S^2\) and P,
respectively, are obtained. Then use this information to determine the
required n (the final sample size).
Method 2: Use results of a pilot survey. This is one of the more commonly used
methods.
Method 3: Use the results of previous samples of the same or similar population.
Method 4: Guess about the structure of the population and use some mathematical
results.
Method 5: (Only for qualitative characteristics.) If the statistic to be measured is a
proportion, then make a fairly good guess of P (the proportion in the population).
Method 1 carries out the survey in two steps. In the first step, only a subsample (a
random part of the total sample) is enumerated. An analysis of this part permits one to
estimate the variance and to make revisions in the total size of the sample, if necessary.
In the second step, the remainder of the sample is enumerated in accordance with these
changes, if any. This method gives the most reliable estimates of \(S^2\) or P, but it is not
often used, since it slows down the completion of the survey.
Method 2 is one of the more commonly used methods. It serves many purposes,
especially if the feasibility of the main survey is in doubt. If the pilot survey is itself a
simple random sample, the preceding methods apply. But often the pilot work is restricted
to a part of the population that is convenient to handle or that will reveal the magnitude of
certain problems.
Method 3 is also a very commonly used method. This method points to the value of
making available, or at least keeping accessible, any data on standard errors obtained in
previous surveys. Unfortunately, the cost of computing standard errors in complex
surveys is high, and frequently only those standard errors needed to give a rough idea of
the precision of the principal estimates are computed and recorded. If suitable past data
are found, the value of \(S^2\) may require adjustment for time change. Experience indicates
that the variance of an item tends to change much more slowly over time than the mean
value of the item itself. Even if the mean value changes, the relative error may be quite
stable.
If past experience indicates that a certain level of non-response can be present, we may
want to inflate the calculated sample size to compensate. This is because our
calculations were based on a 100 percent response. If we do not obtain all the interviews,
then the estimates will be based on a number smaller than the calculated n and will,
therefore, have a greater variance than expected.
Inflating Procedure
We compute the inflated sample size n* from the following relationship: n* = n/r, where r is
an estimate of the expected response rate and it can be obtained from previous rounds of
the same survey, previous experience with similar surveys, a pilot (pre-test), etc.
For example, we calculate n to be 1,000 units. Based on the results of a pilot survey, we
anticipate the response rate to be 70 percent.
Our inflated n will be: n* = 1,000/0.70 ≈ 1,429. If our assumption is correct, we should
get back 70% of 1,429 ≈ 1,000.
Therefore, our estimates will be based on the same number of units as expected and the
target precision will be attained.
Note. Inflating the sample size when there is non-response only helps compensate for
the resulting loss in precision. It does nothing for diminishing the resulting non-response
bias.
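The inflating procedure is a one-line calculation. The sketch below assumes the inflated size is rounded up to a whole unit; the function name `inflate_for_nonresponse` is hypothetical.

```python
import math


def inflate_for_nonresponse(n, response_rate):
    """Inflate a calculated sample size n for an expected response rate r (n* = n / r)."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(n / response_rate)   # round up so the target n is still reached


n_star = inflate_for_nonresponse(1000, 0.70)
print(n_star)  # 1429
```

As the note above says, this restores the expected number of completed interviews but does nothing about bias caused by differences between respondents and non-respondents.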
2.11 Systematic Sampling
Systematic sampling is a statistical method involving the selection of every kth element
from a sampling frame, where k, the sampling interval, is calculated as k = N/n. Using this
procedure each element in the population has a known and equal probability of selection.
This makes systematic sampling functionally similar to simple random sampling. It is,
however, much more efficient (if the variance within the systematic samples is greater than
the variance of the population) and much less expensive to carry out.
The researcher must ensure that the chosen sampling interval does not hide a pattern.
Any pattern would threaten randomness. A random starting point must also be selected.
• You may round if you are doing this without a calculator, but you would be
sacrificing exactness for convenience.
3. Select a random number (RN) from a table of random numbers between 0 and the SI.
This is called a random start (RS):
• In the permitted range, exclude zero, but include the sampling interval; use as
many digits as SI has, including decimals.
• If you are searching through a RN table, pretend the decimal point is not there.
• If you are using a calculator which only provides random numbers between zero
and one, multiply this random number by the value of SI in order to get a random
number between zero and SI. Remember to keep the decimals; do not round yet.
4. Begin the series of cumulated numbers with RS. Add SI to this first number to
determine the second. Then, add SI to the second number to get the third, and so on.
5. Stop cumulating when the last cumulated number exceeds N (discard this last
number):
• If you rounded SI before adding, you may not have exactly n.
6. Now go back and round all the cumulated numbers up to the next integer.
On the list of population units, circle the serial numbers that correspond to these
integers. These are the selected units.
Example 13: Suppose that a village has 285 housing units (HUs) and we wish to select a
systematic sample of 12 HUs for a survey. Assume the list is randomly ordered.
We want to determine the HUs that will be in the sample.
1. SI = N/n = 285/12 = 23.75
2. RN between 0001 and 2375 is 1979
3. RS = 19.79
4. Series of cumulated numbers, each rounded up to the next integer:
   No.   Cumulated number          Selected HU
   1     19.79                     20
   2     19.79 + 23.75 = 43.54     44
   3     43.54 + 23.75 = 67.29     68
   4     91.04                     92
   5     114.79                    115
   6     138.54                    139
   7     162.29                    163
   8     186.04                    187
   9     209.79                    210
   10    233.54                    234
   11    257.29                    258
   12    281.04                    282
   13    304.79                    Discarded (exceeds N = 285)
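The fractional-interval procedure can be sketched in a few lines. This is an illustration only: the function name `systematic_sample` is hypothetical, and when no random start is supplied the sketch draws one from Python's pseudo-random generator rather than from a table of random numbers.

```python
import math
import random


def systematic_sample(N, n, random_start=None, seed=None):
    """Systematic selection with a fractional sampling interval SI = N / n."""
    si = N / n                                     # sampling interval, decimals kept
    if random_start is None:
        rng = random.Random(seed)
        random_start = si * (1.0 - rng.random())   # random start in (0, SI]
    selected = []
    i = 0
    while True:
        value = random_start + i * si              # i-th cumulated number
        if value > N:                              # stop once the series passes N
            break
        selected.append(math.ceil(value))          # round up to the next integer
        i += 1
    return selected


# Example 13: N = 285 housing units, n = 12, random start RS = 19.79.
units = systematic_sample(285, 12, random_start=19.79)
print(units)
```

Rounding only at the end, as the steps above recommend, is what keeps the number of selected units equal to n when the interval is not a whole number.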
In stratification we group together elements which are similar, so that the population
variance \(S_h^2\) within stratum h is small; at the same time, it is desirable that the means of
the several strata (\(\bar{Y}_h\)) be as different as possible. The letter h will be used to identify the
strata so that if L strata are created, h will go from 1 to L.
In stratified sampling, the probabilities of selection may be the same from group to group,
or they may be different. It is not necessary that all elements have the same chance of
selection, but the chance of each must be known. Under stratified random sampling all
the elements in a particular stratum have equal chances of being selected. While not
every combination of elements is possible, all of the possible samples (that is,
combinations of elements) that might be drawn have the same chance of occurring.
• Stratification is the process of grouping members of the population into relatively
homogeneous subgroups before sampling.
• When sub-populations vary considerably, it is advantageous to sample each
subpopulation (stratum) independently.
• The strata should be mutually exclusive: every element in the population must be
assigned to only one stratum. The strata should also be collectively exhaustive: no
population element can be excluded.
• Then random or systematic sampling is applied within each stratum.
• This often improves the representativeness of the sample by reducing sampling
error.
• It can produce a weighted mean that has less variability than the arithmetic mean
of a simple random sample of the population
Advantages of stratified sampling
• focuses on important subpopulations but ignores irrelevant ones.
• improves the precision.
• sampling equal number of observations from strata that vary widely in size may be
used to equate the statistical power of tests of differences between strata.
Disadvantages of stratified sampling
• can be difficult to select relevant stratification variables.
• not useful when there are no homogeneous subgroups.
• can be expensive.
• requires accurate information about the population; otherwise introduces bias.
Notation. We use the same notation as for simple random sampling, except that there
will be a subscript to indicate a particular stratum when we refer to information regarding
this stratum. Thus, N will represent the total number of elements in the population, as
before; but N1 will be the number in the first stratum, N2 will be the number in the second
stratum, etc. Similarly, n will be the total sample size; n1 will be the size of the sample in
the first stratum, n2 will be the size of the sample in the second stratum, etc. The
subscript h denotes the stratum and i the unit within the stratum. As in the case of simple
random sampling, capital letters refer to population values and lower case letters denote
corresponding sample values. The notation given in the following table will be used.
Measurement                                          Population              Sample                Sample estimate
Total number of elements                             N                       n                     …
Number of strata                                     L                       L                     …
Number of elements in the hth stratum                \(N_h\)                 \(n_h\)               …
Total for a certain variable (characteristic)        Y                       y                     \(\hat{Y}_{st}\)
Total of the variable in stratum h                   \(Y_h\)                 \(y_h\)               \(\hat{Y}_h\)
Average over all strata (population mean)            \(\bar{Y}\)             …                     \(\bar{y}_{st}\)
Average for hth stratum (stratum mean)               \(\bar{Y}_h\)           \(\bar{y}_h\)         …
Proportion having attribute                          P                       p                     \(P_{st}\)
Proportion in the hth stratum                        \(P_h\)                 \(p_h\)               …
Population variance                                  \(S^2\)                 …                     …
Population variance for the hth stratum              \(S_h^2\)               \(s_h^2\)             …
Variance of an estimated total                       \(S^2(\hat{Y}_{st})\)   \(s^2(\hat{Y}_h)\)    \(s^2(\hat{Y}_{st})\)
Variance of estimated mean                           \(S^2(\bar{y})\)        \(s^2(\bar{y}_h)\)    \(s^2(\bar{y}_{st})\)
Suppose we have a universe of eight farms with known values of land and buildings as
follows: $1,438; $1,532; $2,026; $2,180; $5,408; $6,854; $8,836; $9,284.
Let us compute the average (mean) and the standard deviation of these values. In terms
of the notation above, we would have: N = 8; \(\bar{Y}\) = $4,694.75, and S = $3,326.04.
Now let us arrange the farms into two strata, so that the groupings of values are as
follows:
Stratum 1 Stratum 2
$1,438 $5,408
1,532 6,854
2,026 8,836
2,180 9,284
If we compute the average and standard deviation of each group of four farms separately,
we would have:
Stratum 1                        Stratum 2
\(N_1\) = 4                      \(N_2\) = 4
\(\bar{Y}_1\) = $1,794           \(\bar{Y}_2\) = $7,595.50
\(S_1\) = $364.33                \(S_2\) = $1,800.45
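These summary figures can be reproduced from the eight farm values listed in the two strata. The sketch below uses Python's `statistics` module, whose `stdev` applies the divisor N - 1, matching the definition of S used in this module; the variable names are illustrative.

```python
import statistics

# Farm values (in dollars) as listed in the two strata.
stratum_1 = [1438, 1532, 2026, 2180]
stratum_2 = [5408, 6854, 8836, 9284]
population = stratum_1 + stratum_2

summary = {}
for label, values in [("population", population),
                      ("stratum 1", stratum_1),
                      ("stratum 2", stratum_2)]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)       # divisor is (number of values - 1)
    summary[label] = (len(values), mean, sd)
    print(f"{label}: N = {len(values)}, mean = {mean:.2f}, S = {sd:.2f}")
```

The standard deviation within each stratum is far smaller than that of the whole population, which is exactly the situation in which stratification pays off.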
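\[ \bar{Y} = \frac{1}{N}\sum_{h=1}^{L} Y_h = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{Y}_h, \quad \text{where the population total } Y = \sum_{h=1}^{L} Y_h. \]
\[ \bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h = \frac{1}{N}\sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h, \]
where \(y_h\) is the sample total for the hth stratum.
\[ \bar{y}_{st} = \frac{N_1\bar{y}_1 + N_2\bar{y}_2 + N_3\bar{y}_3}{N} = \dots = \$30.82. \]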
Estimate of total
As with simple random sampling, we make an estimate of the population total by
multiplying the estimate of the mean by the total number of elements in the population:
\[ \hat{Y}_{st} = N\bar{y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h = \sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h. \]
Estimate of proportion
To estimate a proportion for the population, the procedure is similar to that for the mean
because a proportion, \(P_{st}\), is simply a special case of the mean \(\bar{Y}\) when the only possible
values of \(Y_i\) are 0 and 1. In this case,
\[ \bar{Y}_h = P_h = \frac{1}{N_h}\sum_{i=1}^{N_h} Y_{hi}, \quad \text{where } Y_{hi} = 0 \text{ or } 1. \]
Illustration
Let us apply the equation above to the case of the eight farms in the illustration in
this section. Suppose we took a sample of four farms out of the eight - two from each
stratum - and that the stratum variances are \(S_1^2 = (364.33)^2 = 132736.35\) and
\(S_2^2 = (1800.45)^2 = 3241620.2\).
Thus we have:
\[ S(\bar{y}_{st}) = \sqrt{\frac{1}{64}\left[4(4-2)\frac{132736.35}{2} + 4(4-2)\frac{3241620.2}{2}\right]} = \sqrt{210897.28} = \$459.24. \]
It is interesting to compare this sampling error with the corresponding sampling error of
the mean for a simple random sample of four farms. For a simple random sample of four
farms, we would have
\[ S(\bar{y}) = \sqrt{\frac{N-n}{N}\cdot\frac{s^2}{n}} = \sqrt{\frac{8-4}{8}\cdot\frac{(3326.04)^2}{4}} = \$1,175.93. \]
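The comparison can be reproduced numerically. The sketch below assumes the usual variance formula for a stratified mean, \(\frac{1}{N^2}\sum_h N_h (N_h - n_h) S_h^2 / n_h\), which is the general form of the numerical expression used in this illustration; the function names are hypothetical.

```python
import math
import statistics


def se_stratified_mean(strata_sizes, sample_sizes, strata_variances):
    """Sampling error of the stratified mean, SRS without replacement in each stratum."""
    N = sum(strata_sizes)
    total = sum(Nh * (Nh - nh) * S2h / nh
                for Nh, nh, S2h in zip(strata_sizes, sample_sizes, strata_variances))
    return math.sqrt(total) / N


def se_srs_mean(N, n, S2):
    """Sampling error of the mean under simple random sampling without replacement."""
    return math.sqrt((N - n) / N * S2 / n)


# Farm values (in dollars) from the illustration.
stratum_1 = [1438, 1532, 2026, 2180]
stratum_2 = [5408, 6854, 8836, 9284]

se_st = se_stratified_mean(
    strata_sizes=[4, 4],
    sample_sizes=[2, 2],
    strata_variances=[statistics.variance(stratum_1), statistics.variance(stratum_2)],
)
se_srs = se_srs_mean(N=8, n=4, S2=statistics.variance(stratum_1 + stratum_2))
print(round(se_st, 2), round(se_srs, 2))
```

Calling `se_srs_mean` with other values of n shows how large a simple random sample would have to be to match the stratified design.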
In this example, the sampling error of the stratified sample is much smaller than that of
the simple random sample, less than half. In fact, it would require a sample of seven farms,
using simple random sampling, to achieve the same reliability (that is, as small a
sampling error) as we obtained with a stratified sample of four farms; with six farms the
sampling error would still be about $679.
Sample allocation
• Proportionate allocation uses a sampling fraction in each of the strata that is
proportional to that of the total population.
• If the population consists of 60% in the male stratum and 40% in the female
stratum, then the relative size of the two samples (three males, two females)
should reflect this proportion.
• Optimum allocation (or disproportionate allocation). The sampling fraction in each
stratum is proportional to the standard deviation of the distribution of the variable.
Larger samples are taken in the strata with the greater variability to generate the least
possible sampling variance.
Choice of sample size for each stratum
In general the size of the sample in each stratum is taken in proportion to the size of the
stratum. This is called proportional allocation. Suppose that a company has the following
staff: male, full time = 90; male, part time = 18; female, full time = 9; female, part time = 63
(a total of 180 staff). We are asked to take a sample of 40 staff, stratified according to the
above categories.
• The first step is to find the total number of staff (180) and calculate the percentage
in each group.
% male full time = (90 / 180) x 100 = 0.5 x 100 = 50.
% male part time = (18 / 180) x 100 = 0.1 x 100 = 10.
% female full time = (9 / 180) x 100 = 0.05 x 100 = 5.
% female part time = (63 / 180) x 100 = 0.35 x 100 = 35.
• The second step is to take the same percentage of the sample of 40 from each group.
50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.
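The allocation for the staff example can be sketched as follows. The function name `proportional_allocation` is hypothetical, and simple rounding is an assumption: in other examples the rounded stratum sizes may not add up exactly to n and would need a small manual adjustment.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample of size n to strata in proportion to stratum size."""
    N = sum(strata_sizes.values())
    # n_h = n * N_h / N, rounded to the nearest whole unit
    return {name: round(n * Nh / N) for name, Nh in strata_sizes.items()}


staff = {
    "male, full time": 90,
    "male, part time": 18,
    "female, full time": 9,
    "female, part time": 63,
}
allocation = proportional_allocation(staff, n=40)
print(allocation)
```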
Note. Sometimes there is greater variability in some strata compared with others. In this
case, a larger sample should be drawn from those strata with greater variability.
Exercises
2) What sample size is required to estimate the proportion of people with blood type
O in a population of 1500 people to within 0.02 of the true proportion with 95%
confidence? Assume no prior knowledge about the proportion.
REFERENCE
Suggested textbook
Cochran, W. G. (1977). Sampling Techniques (3rd Ed.). John Wiley & Sons, New York.
Reference books
Kumar, R. S. (1996). Practical Sampling Technique (2nd Ed.). Marcel Dekker, New York.
Leedy, P. D. (1997). Practical Research: Planning and Design (6th Ed.). Prentice-Hall,
Inc., New Jersey.
Sukhatme, P. V. and Sukhatme, B. V. (1992). Sampling Theory of Surveys with
Applications. Iowa State University Press & IARS.
Thompson, S.K. (2002). Sampling (2nd Ed.). John Wiley & Sons, New York.