Chapter 3 - Sampling Design and Data Collection
• Summary:
− Selection Bias: Results from how the sample is chosen. The
sample is not representative of the population.
− Data Bias: Results from errors in the data itself. The data used in
training or analysis is systematically skewed in some way.
Sample Size and Representativeness
• Obviously, the sample has to be representative of the
population to make generalizations possible.
• But, how is it possible to ensure representativeness?
– One way to ensure representativeness is to use the procedure of
randomization.
– According to the Law of Statistical Regularity, a reasonably
small sample can represent the population well if its subjects
are selected at random. The basic premise of sampling is:
“Randomize whenever possible”: take cases at random, assign
them to groups at random if this is required, and assign
experimental treatments at random in experimental research.
• What if randomization is impossible?
− Sometimes randomization is difficult to practice. Under this
condition, you ought to rely on the second law, the Law of
Inertia of the Large Sample, which serves as a consequence of
the Law of Statistical Regularity.
− The Law of Inertia of the Large Sample says that a large sample is
more stable and more representative than a small one. The
sampling error shrinks as the sample size grows: for a sample
mean, the standard error falls in proportion to $1/\sqrt{n}$, so
the larger the sample, the smaller the error.
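The effect of sample size on error can be sketched numerically. This is a minimal illustration, assuming simple random sampling and a known population standard deviation; the function name and the value of `SIGMA` are hypothetical and chosen only for the example.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sigma / math.sqrt(n)


# Hypothetical population standard deviation (illustration only).
SIGMA = 10.0

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(SIGMA, n))
```

The error keeps falling as n grows, but with diminishing returns: each halving of the error costs four times as many observations.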
• There are general suggestions to be followed in making
decisions regarding sample size:
✓ If the population under study is homogeneous, a small sample is
sufficient. On the other hand, a much larger sample is necessary
if there is greater variability in the units of the population.
✓ A small sample is often satisfactory in an intensive laboratory
study in which greater precision is desired.
✓ The sample size may be adequate or inadequate depending on
how the sample is drawn. For example, a sample size of 27 may
be satisfactory in simple random sampling, but this may not be
adequate in stratified and cluster sampling because each stratum
or cluster must be represented.
▪ As a general rule, it may be said that a researcher has to consider
the number of factors or independent variables and the number
of groups or categories in each factor to generally decide on
sample size.
− According to Draper and Smith, sample size (n) is a function of
the factors (Xi) and their categories (Ck), such that a minimum
of 10 observations is required for each combination of
categories across the factors.
$n = 10 \times (C_{f1} \times C_{f2} \times C_{f3} \times \dots \times C_{fm})$
Where:
▪ $n$ = sample size
▪ $C_{f1}$ = number of categories of factor 1
▪ $C_{f2}$ = number of categories of factor 2
▪ $C_{f3}$ = number of categories of factor 3
▪ $C_{fm}$ = number of categories of factor m (the last of m factors)
For example, if the researcher has two factors in the research
(say, sex and grade level) such that there are 2 categories (male
and female) in the first factor and 4 categories in the second
factor (say, grades 5, 7, 9 and 11), then the minimum sample size
this researcher has to draw is $10 \times 2 \times 4 = 80$ students.
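The rule of 10 observations per combination of categories can be sketched as a small helper. This is an illustrative sketch only; the function name and the `per_cell` parameter are hypothetical, not part of the original rule's notation.

```python
from math import prod


def min_sample_size(categories_per_factor, per_cell=10):
    """Minimum n = per_cell * (product of category counts across factors)."""
    counts = list(categories_per_factor)
    if not counts or any(c < 1 for c in counts):
        raise ValueError("each factor needs at least one category")
    return per_cell * prod(counts)


# Two factors: sex (2 categories) and grade level (4 categories).
print(min_sample_size([2, 4]))
```

Adding a third factor multiplies the requirement again, so the minimum sample grows quickly with the number of factors.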
❖ If the sample is directly drawn from a known population, then the
minimum required sample size can be determined using
Yamane’s formula for sample size estimation from a single
population:
$n = \dfrac{N}{1 + N e^{2}}$
Where: n = sample size, N = population size, e = sampling error or
precision level (margin of error), usually set to 0.05.
− Example: If you have a population of 35,000 cases and a
margin of error of 5%, then applying this formula gives a
sample size of about 396 cases.
− Solution: $n = \dfrac{N}{1 + N e^{2}} = \dfrac{35000}{1 + 35000 \times 0.05^{2}} = \dfrac{35000}{88.5} \approx 395.5$, rounded up to 396.
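The calculation above can be checked with a short function. This is a minimal sketch of Yamane's formula; the function name is hypothetical, and rounding up to the next whole case is an assumption that matches the worked example.

```python
import math


def yamane_sample_size(population_size: int, margin_of_error: float = 0.05) -> int:
    """Yamane's formula n = N / (1 + N * e^2), rounded up to a whole case."""
    if population_size <= 0 or not 0 < margin_of_error < 1:
        raise ValueError("need N > 0 and 0 < e < 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


print(yamane_sample_size(35_000, 0.05))
```

As N grows very large, the result levels off near 1 / e^2 (400 for e = 0.05), so very large populations do not need proportionally larger samples.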
❖ If the actual size of the population is not known, then the
sample size for a single population can be estimated with the
following formula:
$n = \dfrac{p(1-p)Z^{2}}{E^{2}}$
Where:
n is the sample size,
p is the proportion of the population having the characteristic of
major interest (assumed to be 50% or 0.5 when unknown),
Z is the z-score for the chosen confidence level (1.96 for 95%), and
E is the margin of error.
− Exercise: How many respondents (n) do you need when you plan
to conduct research on the efficiency and usability of a certain
mobile application if the margin of error is 5%?
− Ans: $n \approx 384$ (taking p = 0.5 and a 95% confidence level,
Z = 1.96).
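The exercise can be reproduced with a few lines of code. This is a sketch under stated assumptions: p = 0.5 and a 95% confidence level are not given in the exercise but are the values that yield the answer of about 384. The function names are hypothetical; `statistics.NormalDist` is from the Python standard library.

```python
from statistics import NormalDist


def z_for_confidence(level: float) -> float:
    """Two-sided z-score for a confidence level, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def unknown_population_sample_size(p: float = 0.5, z: float = 1.96,
                                   margin_of_error: float = 0.05) -> float:
    """n = p(1 - p) Z^2 / E^2, returned unrounded."""
    return p * (1 - p) * z ** 2 / margin_of_error ** 2


n = unknown_population_sample_size(p=0.5, z=1.96, margin_of_error=0.05)
print(round(n, 2))                 # unrounded estimate
print(z_for_confidence(0.95))      # z-score behind the 1.96 used above
```

The unrounded value is slightly above 384; the answer above reports it as 384, while rounding up to guarantee the stated margin would give 385.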
Types of Sampling Methods
• There are two major types of sampling methods:
probability and non-probability sampling.
– Probability sampling is a method of sample selection that uses
randomization instead of deliberate choice. Each member of
the population has a known, non-zero chance of being selected.
– In non-probability sampling, the researcher deliberately picks
items or individuals for the sample based on non-random factors
such as convenience, geographic availability, or cost.
• Types of probability sampling - every member of the
population has a chance of being selected.
1) In simple random sampling, each individual has an equal
probability of being chosen, and each selection is independent
of the others.
✓ To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.
2) Systematic sampling involves selecting units or elements at
regular intervals from an ordered list of the population.
3) Stratified sampling divides the population into subgroups
(strata), and random samples are drawn from each stratum in
proportion to its size in the population.
▪ To use this sampling method, you divide the population into
subgroups (called strata) based on the relevant characteristic (e.g.,
gender identity, age range, income bracket, job role, etc).
▪ Based on the overall proportions of the population, you calculate
how many people should be sampled from each subgroup. Then you
use random or systematic sampling to select a sample from each
subgroup.
4) Cluster sampling also involves dividing the population into
subgroups, but each subgroup should have similar
characteristics to the whole population.
▪ Instead of sampling individuals from each subgroup, you randomly
select entire subgroups.
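The four probability sampling methods above can be sketched in code. This is a minimal illustration using Python's standard `random` module; the function names, the student IDs, the "lower"/"upper" strata, and the school clusters are all hypothetical and exist only for the example.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has the same chance of selection (without replacement)."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th unit from the ordered list."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start within the first interval
    return [population[start + i * k] for i in range(n)]


def stratified_sample(population, stratum_of, n, rng):
    """Draw from each stratum in proportion to its share of the population."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n.
        share = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def level_of(student_id):
    """Hypothetical stratum rule: IDs 1-60 are 'lower', 61-100 are 'upper'."""
    return "lower" if student_id <= 60 else "upper"


# Hypothetical population: 100 student IDs, also grouped into 4 schools of 25.
rng = random.Random(2024)         # fixed seed so runs are repeatable
students = list(range(1, 101))
schools = {f"school_{i}": students[i * 25:(i + 1) * 25] for i in range(4)}

print(simple_random_sample(students, 10, rng))
print(systematic_sample(students, 10, rng))
print(stratified_sample(students, level_of, 10, rng))
print(len(cluster_sample(schools, 2, rng)))
```

Note the contrast between the last two functions: stratified sampling takes some units from every subgroup, while cluster sampling takes every unit from some subgroups.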
Stratified Sampling