LESSON 4 Sampling and Sampling Distribution
Research studies are conducted on a small number of subjects, and the results are generalized to the
class of all subjects. For example, the results of a study on the effects of insulin conducted on a
randomly selected sample of patients would be applicable to all patients with diabetes mellitus on
insulin in the world. Thus the inference derived from the ‘study group’ is generalized to all such
groups. This type of inference is called inductive inference. But the question is, ‘Are we justified in
making such an inference?’
Important Terms
We study the sample, but it is the population’s characteristics that we want to know. We usually
cannot study specific characteristics of the entire population because of constraints in time, money and
human resources such as data collectors; only with enough resources could the whole population be
studied. That is why quality control officers in field surveys who want to know the quality of the data
collected during field interviews would randomly select perhaps 5% of all the completed questionnaires
and check the quality of the responses. They then make an inductive inference about all the
questionnaires submitted.
We therefore need to be specific about the population to be studied. For example, if it is patients,
what type of patients, and at what disease stage? This helps eliminate fallacies (drawing wrong
conclusions) which may affect interpretation of the data.
Types of Populations
It is important for you to understand and differentiate between these three types of population.
Finite: A population with a known total number of members. For example, the number of community
members, or the number of TB patients in a hospital.
Infinite: A population with an unknown number of members. For example, all possible heights
within a given range, 150 cm to 160 cm, which has an infinite number of values.
Hypothetical: A population that is assumed for theoretical purposes. For example, guinea pigs
given a Vitamin A deficient diet. The expected outcome is Vitamin A deficiency symptoms.
The results would be observed and generalized to all such groups of guinea pigs. Such a group
is a population, but an imaginary one.
Sampling Variations and Bias
-The principal object of sampling is to get maximum information about the population with
minimum effort/resources.
-The sample should therefore be representative. In other words, it should be an unbiased,
random sample.
For example, a study on hemoglobin content requires only a drop of blood from each study
participant. One drop of blood is well mixed and contains all the blood constituents.
-But in most populations there is variation among individuals as well as in the environment they live
in. Thus one sample will differ from the next one drawn from that population. Imagine a class of
40 students. If we are interested in drawing a sample of 5 and recording their weights to get the mean
class weight, each separate group of five students will give different individual weights and a
different overall mean weight. This is the form of variation we are talking about.
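The class example above can be tried out with a short Python sketch. The weights are simulated, hypothetical values (not real data), and the number of repeated samples (6) is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical weights (kg) for a class of 40 students; simulated, not real data.
class_weights = [round(random.gauss(60, 8), 1) for _ in range(40)]
class_mean = statistics.mean(class_weights)

# Draw several independent samples of 5 students and record each sample mean.
sample_means = []
for _ in range(6):
    sample = random.sample(class_weights, 5)  # simple random sample, no repeats
    sample_means.append(statistics.mean(sample))

print(f"Class mean: {class_mean:.1f} kg")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean: {m:.1f} kg")
```

Each run of five students gives its own mean, scattered around the class mean. That scatter is the sampling variation described above.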
Variation between one sample and another can be due to: non-sampling
variation or sampling variation (or bias).
Bias is defined as a systematic difference between the results obtained
from a study and the true state of affairs.
-Non-probability sampling can be defined as sampling in which members of the population do
not have an equal chance of being selected to participate in a study or project.
-This method is useful in studies that are not concerned with parameters of the entire population.
As a researcher, you should therefore not assume that the sample fully represents the target
population.
-You can apply non-probability sampling in pilot studies, case studies, hypothesis development
and qualitative research.
Convenience Sampling
Quota Sampling
-The quota sampling approach combines both judgment and convenience.
-It is more structured than straight accessibility or judgmental sampling.
-The technique is similar to stratified random sampling, which is discussed under probability
sampling.
-The objective is to include various groups or quotas in the population.
-For example, if a researcher wants to include certain religions in the sample, he may pick quotas
of each (30% African Inland Church, 40% Catholics, 20% Muslims, 20% Anglican Church of
Kenya). Selection of subjects is not random since subjects are picked as they fit into the identified
quotas.
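A minimal Python sketch of how quotas get filled is given below. The function name `quota_sample`, the groups A, B and C, and the quota sizes are all hypothetical, chosen only for illustration. Subjects are taken in the order they happen to be available, which is why the selection is not random.

```python
from collections import Counter

def quota_sample(arrivals, quotas):
    """Fill each group's quota from subjects in the order they become available.

    arrivals: iterable of (subject_id, group) pairs in order of accessibility.
    quotas:   dict mapping group -> number of subjects wanted.
    """
    counts = Counter()
    chosen = []
    for subject_id, group in arrivals:
        # Accept the subject only if that group's quota is not yet full.
        if counts[group] < quotas.get(group, 0):
            chosen.append((subject_id, group))
            counts[group] += 1
        # Stop as soon as every quota has been filled.
        if all(counts[g] >= q for g, q in quotas.items()):
            break
    return chosen

# Hypothetical stream of accessible subjects, each labelled with group A, B or C.
arrivals = [(i, "ABC"[i % 3]) for i in range(1, 31)]
quotas = {"A": 3, "B": 4, "C": 2}

sample = quota_sample(arrivals, quotas)
print(sample)
```

Every quota is met exactly, but who ends up in the sample depends entirely on who was reached first.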
Snowball Sampling
-In this method, initial subjects with the desired characteristics are identified using a purposive
sampling technique.
-Each identified subject names others that s/he knows have the required characteristics, until the
researcher gets the number of subjects required.
-The method is useful when the population with the characteristics under study is not well
known.
-For example, a study on why nurses want to out-migrate from Kenya would attempt to identify
one nurse who has applied for verification of certificates at the Nursing Council of Kenya. S/he
would then identify the next nurse, and so on until the required sample is attained.
-Literally, as a ball of snow rolls, it becomes bigger and bigger, much like the growing number
of respondents that will have been interviewed by the end of data collection.
a) Non-probability sampling can lead to unrepresentative samples. Thus we cannot generalize our
results to the entire population.
b) Results are unconvincing because there is no yardstick against which to measure
‘representativeness’.
-It can lead to self-selection of participants into the study.
-Only those with the required characteristic may volunteer to take part in the study. This is more
common in convenience sampling, where participants are sampled because they are conveniently available.
-Where the population being studied is not homogeneous, which is a common occurrence, the
sample should still be representative.
-Consider different sections of the population which are homogeneous within themselves.
-For example, when studying the whole population, you can stratify it by age group so that
each stratum contains a specific age set.
-In this probability sampling method (stratified sampling), you divide the population into strata
or sections. The sample is then drawn independently from each stratum by simple random
sampling. Since a random sample is taken from each stratum, the whole population is adequately
represented. In this case, the heterogeneous population is divided into several sections, each of
which is homogeneous. Thus the variability that exists in the population is represented in the sample.
-Ensure that appropriate stratification is done before selecting the sample.
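The stratified procedure can be sketched in Python as follows. The population, the age bands and the 10% sampling fraction are hypothetical, and `stratified_sample` is an illustrative helper, not a library function. Taking the same fraction from every stratum (proportional allocation) is one common choice and is assumed here.

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    # Step 1: divide the population into strata.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Step 2: sample independently within each stratum.
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

def age_group(person):
    """Hypothetical age bands used as strata."""
    age = person[1]
    if age < 18:
        return "under 18"
    if age < 40:
        return "18-39"
    return "40 and over"

# Hypothetical population of 600 people as (person_id, age) pairs, ages 10 to 69.
population = [(i, 10 + i % 60) for i in range(600)]

sample = stratified_sample(population, age_group, fraction=0.10)
for group in ("under 18", "18-39", "40 and over"):
    print(group, sum(1 for p in sample if age_group(p) == group))
```

Because each age group contributes in proportion to its size, no group can be missed by chance, which is the point of stratifying first.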
Systematic Sampling
In this method every nth case is chosen for the study from a list of cases.
-In this method all sections of the population are adequately represented. The method is simple
and convenient, even where a list of the population is not available.
-For example, to select 150 houses from 1500 houses you can apply this procedure:
-The sampling interval is 1500/150 = 10, so select one house in every 10.
-The researcher can start selection of the houses from any direction.
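The house example can be sketched in Python. The helper `systematic_sample` is illustrative, and choosing the starting house at random within the first interval is an assumption added here (a common convention), since the notes leave the starting point open.

```python
import random

def systematic_sample(units, sample_size, seed=0):
    """Pick a random start within the first interval, then take every k-th unit."""
    k = len(units) // sample_size              # sampling interval, here 1500 // 150 = 10
    start = random.Random(seed).randrange(k)   # assumed: random start among the first k units
    return units[start::k][:sample_size]

houses = list(range(1, 1501))   # houses numbered 1 to 1500
chosen = systematic_sample(houses, 150)

print(len(chosen), "houses selected")
print("First five:", chosen[:5])
```

Whatever the starting house, the selected houses are exactly 10 apart and spread across the whole list.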
• Less accurate results due to higher sampling error than simple random sampling (cluster unit
members may influence the outcome).
• It is expensive.
Multistage Sampling
-This is a modification of cluster sampling that involves drawing samples from within
selected clusters (sub-sampling).
-It can be two-stage, three-stage, and so on.
-For example, a sample may be drawn from selected districts, then locations, then villages, and
finally individuals from these villages, who are the respondents in the study, with villages used as
the clusters.
-The probability sampling used at each stage is simple random sampling, which helps to
eliminate bias.
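The district, location, village and individual stages can be sketched in Python as below. The sampling frame is entirely hypothetical (6 districts, 4 locations each, 5 villages each, 20 residents each), and so are the numbers drawn at each stage. Simple random sampling is used at every stage, as described above.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 6 districts x 4 locations x 5 villages x 20 residents.
frame = {
    f"District {d}": {
        f"Location {d}.{loc}": {
            f"Village {d}.{loc}.{v}": [
                f"Resident {d}.{loc}.{v}.{r}" for r in range(1, 21)
            ]
            for v in range(1, 6)
        }
        for loc in range(1, 5)
    }
    for d in range(1, 7)
}

def multistage_sample(frame, n_districts, n_locations, n_villages, n_people, rng):
    """Simple random sampling at each stage: districts, locations, villages, people."""
    respondents = []
    for district in rng.sample(sorted(frame), n_districts):
        locations = frame[district]
        for location in rng.sample(sorted(locations), n_locations):
            villages = locations[location]
            for village in rng.sample(sorted(villages), n_villages):
                respondents.extend(rng.sample(villages[village], n_people))
    return respondents

respondents = multistage_sample(frame, 2, 2, 2, 5, rng)
print(len(respondents), "respondents, for example:", respondents[:3])
```

Only the selected districts, locations and villages need to be listed in detail, which is what makes the approach practical for large areas.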
The importance of probability sampling is that choosing a sample with these techniques keeps
the investigator on the safe side when s/he extends the findings from the sample to the population
for purposes of generalization.
-The units must be defined clearly. If the unit is an individual, the characteristics an
individual must have to be included in the population should be stated clearly. In the case of a
house, it should be known whether churches, schools or vacant houses are to be excluded from the
population.
-The guiding principle in the definition is that there should be no overlapping of units and that all
units put together constitute the population. Even where there is no list, a map with serially
numbered houses can be used to choose a sample of houses. Such maps are used as substitutes for a
list (sampling frame).
SAMPLING DISTRIBUTION
b. When the distribution in the population is not normal, the distribution of the sample mean may
also not be normal.
The second of these effects is of practical importance when the sample size is very small (< 15)
and when the distribution in the population is extremely non-normal. This leads us to the Central
Limit Theorem.
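A small simulation can make this effect visible. The sketch below assumes a strongly right-skewed population (an exponential distribution, chosen only for illustration) and compares the skewness of the sample means for small samples (n = 5) and larger ones (n = 50). A skewness near zero indicates a symmetric, more normal-looking distribution.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, repeats=5000):
    """Means of `repeats` samples of size n from a right-skewed population."""
    means = []
    for _ in range(repeats):
        sample = [rng.expovariate(1.0) for _ in range(n)]  # assumed skewed population
        means.append(statistics.fmean(sample))
    return means

skew_small = skewness(sample_means(5))
skew_large = skewness(sample_means(50))

print(f"Skewness of sample means, n = 5:  {skew_small:.2f}")
print(f"Skewness of sample means, n = 50: {skew_large:.2f}")
```

With the small samples the means are still clearly skewed. With the larger samples the skewness shrinks towards zero, as the Central Limit Theorem predicts.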
NORMAL DISTRIBUTION
Introduction
-The normal distribution is the most important and most widely used distribution in statistics. It is
sometimes called the "bell curve".
-It is also called the "Gaussian curve" after the mathematician Carl Friedrich Gauss, who played an
important role in its history.
-The shape of a histogram reflects what is known as the data's frequency distribution, that is, how
often each value appears in the data.
Normal Distribution Curve
The normal curve can be described as the statistical representation of the data collected whereby
most values are clustered around a central value with smaller and smaller frequencies moving further
away from the center.
Properties of a Normal Distribution
The normal distribution and its histogram have the following properties:
-There is a single highest bar (the mode).
-There are as many values above the mode as there are below it (it is in the middle).
-The shape of the histogram is symmetrical about the mode, so the left side is a mirror image of the
right.
- Most values are close to the mean - values that are small or large relative to the mean occur
infrequently (extreme values).
-It represents an infinitely large population: neither of its tails touches the baseline of the graph (x-axis).
-The frequency of values gets lower as you move further from the mode in a way that produces a bell
shape.
These characteristics can further be simplified as follows:
It is a smooth, bell-shaped curve.
It is symmetric about the mean (centre).
It has asymptotic tails (taper off but never touch).
It is marked in standard deviations on both sides of the mean.
The median, mean, mode and midrange are the same value.
The area under the curve is equal to one or 100%.
The sample that produced the histogram below has a range from 0 to 8, and there are 100 values in
the sample.
-The value 4 appears 30 times in the data, more often than any other, so 4 is the mode.
-You should be able to see how it is shaped like a bell: the frequencies drop off
as you move away from the mode.
-In normally distributed data, the mean, mode and median are all the same (or very close).
-In normally distributed data, the mode is near the centre of the range.
Normal Curve
The normal curve is described by two parameters:
a. Arithmetic mean: the population mean (μ), estimated by the sample mean (x̄)
b. Standard deviation: the population standard deviation (σ), estimated by the sample standard
deviation (SD)
In the standard normal distribution, the mean, mode and median are assigned a value of zero (0), while
the SD is given a value of 1.00.
There are two very important things to mention here.
The total area under the curve is 1. The area under the curve represents frequency (whole
area represents 100%). Thus area under curve represents some percentage of the observations
which we will illustrate using the above figure.
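As a quick numerical check of the idea that area corresponds to a percentage of observations, the sketch below uses Python's standard library `statistics.NormalDist` to compute the area within 1, 2 and 3 standard deviations of the mean of a standard normal curve (mean 0, SD 1).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve between -k and +k standard deviations from the mean.
areas = {}
for k in (1, 2, 3):
    areas[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"Within {k} SD of the mean: {areas[k] * 100:.2f}% of observations")
```

The results are about 68%, 95% and 99.7%, and the total area under the whole curve is 100%.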
Normal Curve
Definition of z-score
A Z-score is the number of standard deviations that a given x value is above or below the mean.
-The standard score is the difference between a person’s or observation’s raw score and the mean of
distribution expressed in SD units.
-Knowing standard score enables us to know how many standard deviations the score is above or
below the mean.
-z is based on a normal distribution.
-When scores are changed to z scores, scores below the mean are negative while scores above the
mean are positive.
The weight of men has a bell-shaped distribution with a mean of 69.0 kg and a standard deviation of
2.8 kg. What percentage of men have weights between 65.4 kg and 72.3 kg?
Solution: The Z-score associated with 65.4 is (65.4 - 69.0)/2.8 = -1.29. For 72.3, the Z-score is 1.18
(why?). Since the respective Z-scores are -1.29 and 1.18, the area under the bell-shaped curve between
them is 78.25%, so about 78% of men have weights in this range.
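The worked example can be checked with a few lines of Python using the standard library `statistics.NormalDist`. The Z-scores are rounded to two decimal places, as they would be when reading a printed Z table. This is a sketch of the calculation, not a replacement for understanding the table.

```python
from statistics import NormalDist

mean, sd = 69.0, 2.8   # weight of men, in kg

# Z-score: number of standard deviations a value lies above or below the mean.
z_low = round((65.4 - mean) / sd, 2)
z_high = round((72.3 - mean) / sd, 2)

# Area under the standard normal curve between the two Z-scores.
standard_normal = NormalDist()
area = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"Z for 65.4 kg: {z_low}")
print(f"Z for 72.3 kg: {z_high}")
print(f"Percentage between: {area * 100:.2f}%")
```

The negative Z-score marks a weight below the mean and the positive one a weight above it.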
The idea of Z-scores is very valuable for understanding what you are calculating when finding
probabilities.
-In estimation, the main interest in statistical inference is to estimate the population parameter
from a sample of observations.
-This should be an unbiased estimate.
-The estimate can be a single statistic such as the mean (point estimate) or a range with an attached
probability, called a confidence interval.
Assume that a large sample of cases is used to estimate the population Hb value, and that the sample
mean and standard error are 11.4 g/100ml and 0.7 g/100ml, respectively. Theoretical statistics states
that the sample mean is the best estimate of the population mean. So, the point estimate of the
population mean is 11.4 g/100ml.
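To go from the point estimate to an interval estimate, the sketch below computes a confidence interval from the same mean and standard error. The 95% confidence level and the use of the normal curve (reasonable because the sample is described as large) are assumptions made for this illustration.

```python
from statistics import NormalDist

mean_hb = 11.4        # sample mean, g/100ml
standard_error = 0.7  # standard error of the mean, g/100ml
confidence = 0.95     # assumed confidence level for this sketch

# Z value that leaves 2.5% in each tail of the normal curve (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = mean_hb - z * standard_error
upper = mean_hb + z * standard_error

print(f"Point estimate: {mean_hb} g/100ml")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} g/100ml")
```

The interval runs from roughly 10.0 to 12.8 g/100ml. The point estimate sits at its centre, and the width reflects the standard error.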
You may recall that with sampling, we EXPECT error in our statistics. Thus, calculated statistics
would not be equal to the actual parameters being measured, due to random (chance) errors. We
therefore require unbiased sampling, where no factors systematically push the estimate in a particular
direction. The larger the sample, the less error is expected.
Consider (conceptualize) a distribution of sample means drawn from a population. Repeated
sampling (calculating the mean each time) from the same population produces a distribution of
sample means. This distribution of sample means (the sampling distribution of means) will be
approximately normal, provided the samples are reasonably large or the population itself is normal.
NB
SEm = SD population/√n
Calculate the standard error of the mean using the following values:
Mean = 75
SDp = 16
n = 64
SEm = 16/√64 = 16/8 = 2
Note that as ‘n’ increases, SEm decreases. Similarly, as SD decreases, SEm decreases,
and your confidence in the estimate of the population mean increases.
The smaller the standard error, the more confidence we can have in the statistic. The standard error
can be decreased by increasing the sample size.
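The formula SEm = SD population/√n and the effect of sample size can be checked with a short Python sketch. The function name `standard_error` and the extra sample sizes in the loop are illustrative choices.

```python
import math

def standard_error(sd_population, n):
    """Standard error of the mean: SD of the population divided by the square root of n."""
    return sd_population / math.sqrt(n)

# The exercise values: SDp = 16, n = 64.
print(standard_error(16, 64))  # 2.0

# The same SD with increasing sample sizes: the standard error keeps shrinking.
for n in (16, 64, 256, 1024):
    print(f"n = {n:4d}  SEm = {standard_error(16, n):.2f}")
```

Note that the sample size has to be multiplied by four to halve the standard error, because n enters through its square root.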
a. The degree of precision required between the sample and the general population.
This is the error that can be tolerated in the estimate (e.g. 1% or 5%, i.e. 0.01 or 0.05).
b. The level of confidence required that the error will not be greater than the tolerable
limit.
c. The variability of the population (expressed as the standard deviation).
1.96 × √(pq/n) ≤ 0.03
What is the anticipated value of p? If no estimate is available, use the conservative value p = 0.5
(so that q = 1 − p = 0.5). Using p = 0.5 maximizes the required sample size.
Thus 1.96 × √((0.5)(0.5)/n) ≤ 0.03
Squaring both sides:
3.84 × (0.5)(0.5)/n ≤ 0.0009
n ≥ 3.84 × 0.25/0.0009
n ≥ 1066.7
Rounding up, a minimum of 1067 children are needed in the sample.
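The same calculation can be written as a small Python function. The name `sample_size_for_proportion` is illustrative. The default z = 1.96 corresponds to 95% confidence, and the result is always rounded up because a fraction of a subject cannot be sampled.

```python
import math

def sample_size_for_proportion(p, margin, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Conservative choice p = 0.5 with a tolerable error of 0.03.
print(sample_size_for_proportion(0.5, 0.03))

# Other anticipated values of p need smaller samples than p = 0.5.
for p in (0.1, 0.3, 0.5):
    print(f"p = {p}: n = {sample_size_for_proportion(p, 0.03)}")
```

Carrying 1.96² = 3.8416 without rounding gives about 1067.1, which rounds up to 1068, so the result can differ by one or two from a hand calculation that rounds 1.96² to 3.84.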