LESSON 4 Sampling and Sampling Distribution

The document discusses sampling methods used in research studies, emphasizing the importance of selecting representative samples to generalize findings to a larger population. It differentiates between non-probability and probability sampling methods, detailing various techniques and their advantages and disadvantages. Additionally, it covers concepts such as sampling error and the significance of statistical inference in understanding population characteristics based on sample data.

Uploaded by

Evans Mogaka

SAMPLING METHODS

Research studies are conducted on a small number of subjects and the results generalized to the class of all such subjects. For example, the results of a study on the effects of insulin conducted on a randomly sampled population would be applicable to all patients with diabetes mellitus on insulin in the world. Thus the inference derived from the ‘study group’ is generalized to all such groups. This type of inference is called inductive inference. But the question is, ‘Are we justified in making such an inference?’

Important Terms

a. Sample: the small group chosen for study

b. Population: the whole group which the sample represents

c. Sample size: the number of individuals or observations in the sample

d. Sampling: the choosing of a sample from a population

We study the sample, but it is the population’s characteristics that we want to know. However, we usually cannot study specific characteristics of the entire population due to constraints in time, money and human resources such as data collectors; given enough resources, the whole population could be studied. That is why quality control officers in field surveys who want to know the quality of the data collected during field interviews randomly select perhaps 5% of all the completed questionnaires and check the quality of the responses. They then make an inductive inference about all the questionnaires submitted.
We therefore need to be specific about the population to be studied. For example, if it is patients, what type of patients and what disease stage? This helps eliminate fallacies (drawing wrong conclusions) which may affect interpretation of the data.

Types of Populations

It is important for you to understand and differentiate between these three types of population.
• Finite: a population with a known total number of members. For example, the number of community members, or the number of TB patients in a hospital.

• Infinite: a population with an unknown number of members. For example, all possible heights within a given range (150 cm – 160 cm) – there is an infinite number of values.

• Hypothetical: a population that is assumed for theoretical purposes. For example, guinea pigs given a Vitamin A deficient diet, where the expected outcome is Vitamin A deficiency symptoms. The results would be observed and generalized to all such groups of guinea pigs. Such a group is a population, but an imaginary one.
Sampling Variations and Bias

-The principal object of sampling is to get maximum information about the population with minimum effort/resources.
-The sample should therefore be representative. In other words, the sample should be an unbiased, random sample.
For example, a study on hemoglobin content requires only a drop of blood from each study participant: one drop of well-mixed blood contains all the blood constituents.
-But in most populations there is variation among individuals as well as in the environment they live in. Thus one sample will differ from the next drawn from the same population. Imagine a class of 40 students. If we are interested in drawing a sample of 5 and recording their weights to get the mean class weight, each separate group of five students will produce different individual weights and a different overall mean weight. This is the form of variation we are talking about.
• Variation between one sample and another can be due to non-sampling variation (bias) or sampling variation.
• Bias is defined as a systematic difference between the results obtained from a study and the true state of affairs.

What are the causes of non-sampling variations?

• Different types of sampling methods: non-probability or probability methods.

• Inconsistencies in definition: the definition of the clinical presentation of malaria in children differs from one area to another depending on the endemicity of malaria.

• Tools: the instruments used, such as a standardized weighing scale versus an unbalanced one.

• Measurement process: measuring the height of patients requires that they remove their shoes. In some cases, measurements are taken wrongly with shoes on, and this may vary from one study group to the other depending on the training of the research assistants who take the measurements.

• Non-response: sensitive questions on sexual practices may draw mixed responses from interviewees depending on the level of privacy accorded them. Married women with multiple sexual partners may refuse to respond to such sensitive questions in the presence of other people. Such refusal is what is called non-response. In the end, you may have a large number of women with similar characteristics failing to answer that particular question, introducing bias into the study.

• Carelessness: some interviewers may simply be careless in asking questions as expected.

• Recording practices: the way one records 0 and 8 may look similar, so a 0 can easily be mistaken for an 8.

This is what you can do to remove non-sampling variations or biases:
• Adequate planning
• Training of research assistants
• Standardization of tools or instruments
• Supervision and monitoring of the study

-You can see that non-sampling variation or bias is something that we can do something about and prevent from happening.
-There are two main sampling methods, namely non-probability sampling and probability sampling. Let us discuss each of these methods in detail.

NON-PROBABILITY SAMPLING METHODS

-This type of sampling can be defined as sampling in which members of the population do not have a known, equal chance of being selected to participate in a study or project.
-This method is useful in studies that are not concerned with parameters of the entire population, and therefore as a researcher you should not assume that the sample fully represents the target population.
-You can apply non-probability sampling in pilot studies, case studies, hypothesis development and qualitative research.

Examples of this method include:
a. Convenience sampling
b. Purposive sampling
c. Quota sampling
d. Snowball sampling

Convenience sampling

-Convenience sampling is also called accidental, accessibility, haphazard or volunteer sampling.
-The main reason for adopting convenience sampling techniques is administrative convenience to the researcher.
-The sample is chosen with ease of access being the sole concern. Cases are selected as they become available.
-The method lacks representativeness since only those who can be conveniently accessed are interviewed, yet there are others in the targeted population who will never have a chance to be interviewed.
-A good example is radio programmes where listeners are asked questions and instructed to send answers via SMS.
-The method is biased because only those who have radios and mobile phones and listen to that particular radio station will be able to respond. The sample is limited to respondents who are easily and conveniently available.
Purposive Sampling
-Purposive sampling is also known as judgmental sampling.
-Although the researcher recognizes that the population contains different types of individuals with differing measures and ease of access, he exercises deliberate subjective choice in drawing what he regards as a ‘representative’ sample.
-It aims at the elimination of anticipated sources of distortion. However, there is always a risk of distortion due to prejudices or lack of knowledge of certain crucial features in the structure of the population.
-An example of purposive sampling is the selection of a certain age range (breastfeeding children aged 0-23 months) or a religious sect/group who are vegetarians, among others.
-This approach is useful where it would be expensive to interview a large sample only to end up with a small number of the group you are interested in. There is no need to interview mothers without children aged 0-59 months when you are mainly interested in immunization coverage.

Quota Sampling
-In the quota sampling approach, both judgment and convenience are combined.
-It is more structured than straight accessibility or judgmental sampling.
-The technique is similar to stratified random sampling, which we will discuss shortly under probability sampling.
-The objective is to include various groups or quotas in the population.
-For example, if a researcher wants to include certain religions in the sample he may pick quotas of each (30% African Inland Church, 40% Catholics, 20% Muslims, 10% Anglican Church of Kenya). Selection of subjects is not random since subjects are picked as they fit into the identified quotas.

Snowball Sampling

-In this method, initial subjects with the desired characteristics are identified using the purposive sampling technique.
-Each identified subject names others that s/he knows to have the required characteristics, until the researcher gets the numbers he requires.
-The method is useful when the population with the characteristics under study is not well known.
-For example, a study on why nurses want to out-migrate from Kenya would attempt to identify one nurse who has applied for verification of certificates at the Nursing Council of Kenya. S/he would then identify the next nurse, and so on until the required sample is attained.
Literally, as a ball of snow rolls it becomes bigger and bigger, similar to the increasing number of respondents that will have been interviewed by the end of data collection.

Disadvantages of Non-Probability Sampling

a. They lead to unrepresentative samples; thus we cannot generalize our results to the entire population.
b. Results are unconvincing because of the lack of a yardstick against which to measure ‘representativeness’.
c. They can lead to self-selection of participants into the study.
d. Only those with the required characteristic may volunteer to take part in the study; this is most pronounced in convenience sampling, where participants are conveniently sampled.

PROBABILITY (RANDOM) SAMPLING METHODS


-To solve the problem of sampling variations, there is a need to introduce the element of ‘randomness’ into sampling procedures.
-Samples are thus drawn according to some probability mechanism using probability (random) sampling techniques.
-Probability sampling can therefore be defined as a sampling procedure in which every member of the population has an equal chance of being selected, without bias, to participate in your research study or project.
-All investigations are carried out to ascertain a particular characteristic of the population. In that case, unbiased samples must be drawn from that population using probability random sampling techniques.
-Examples of probability sampling include:
a. Simple random sampling
b. Stratified random sampling
c. Systematic sampling
d. Cluster sampling
e. Multistage sampling

Simple Random Sampling


-In simple random sampling, each member (sampling unit) of the population has an equal chance of being selected in the sample.
-The method is best applicable where a list of the sampling units (the sampling frame) is available.
-Examples of sampling frames are a students’ class list, names of schools in a district with their corresponding codes, or a list of house numbers in an estate.
-Here randomness is assured by use of a sampling procedure, for example a lottery or a table of random numbers.
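The procedure can be sketched in a few lines of Python; the class list and sample size are hypothetical, and `random.sample` plays the role of the lottery or table of random numbers.

```python
import random

# Hypothetical sampling frame: a class list of 40 student IDs
sampling_frame = [f"student_{i:02d}" for i in range(1, 41)]

random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(sampling_frame, k=5)  # each student has an equal chance

print(sample)
```

Every subset of size 5 is equally likely, which is exactly the defining property of simple random sampling.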

Stratified Random Sampling

-Where the population being studied is not homogeneous, which is a common occurrence, the sample should still be representative.
-Consider different sections of the population which are homogeneous within themselves.
-For example, when studying a whole population you can stratify it by age group so that each stratum contains a specific age set.
-In this probability sampling method you divide the population into strata or sections, then select the study participants from each. The sample is drawn independently from each stratum by the simple random sampling method. Since a random sample is taken from each stratum, the whole population is adequately represented. In this case, the heterogeneous population is divided into several sections, each of which is homogeneous. Thus the variability that exists in the population is represented in the sample through each stratum.
-Ensure that appropriate stratification is done before selecting the sample.
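As a rough illustration, the following Python sketch divides a hypothetical population into age strata and draws a simple random sample independently from each. The proportional 10% allocation is an assumption for the example, not something the text specifies.

```python
import random

# Hypothetical population stratified by age group (stratum -> members)
strata = {
    "0-14":  [f"child_{i}" for i in range(200)],
    "15-49": [f"adult_{i}" for i in range(500)],
    "50+":   [f"elder_{i}" for i in range(300)],
}

random.seed(1)
sampling_fraction = 0.10  # assumed proportional allocation: 10% per stratum

# Simple random sample drawn independently within every stratum
sample = {
    name: random.sample(members, k=int(len(members) * sampling_fraction))
    for name, members in strata.items()
}

for name, drawn in sample.items():
    print(name, len(drawn))
```

Because every stratum contributes, the variability across age groups is guaranteed to appear in the combined sample.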

Systematic Sampling
-In this method every nth case is chosen for the study from a list of cases.
-All sections of the population are adequately represented. The method is simpler and more convenient, even where a list of the population is not available.
-For example, to select 150 houses from 1500 houses you can apply this procedure:
-Select one house in every 10.
-The initial house is chosen at random from within the first 10.
-The researcher can start selection of the houses from any direction.
-A problem arises when a locked or abandoned house is chosen. In such a case, you move to the next structure which is inhabited and continue with the interval you were using.
-In this method there is no need to prepare a sampling frame for selection. However, the researcher bears sole responsibility for how the systematic sampling procedure is used in order to give valid results.
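The house-selection procedure above can be sketched as follows; the house numbering is illustrative, and the random start within the first interval is what makes the draw probabilistic.

```python
import random

houses = list(range(1, 1501))   # 1500 houses, numbered 1..1500
interval = len(houses) // 150   # want 150 houses -> select every 10th

random.seed(7)
start = random.randrange(interval)   # random start within the first 10 houses
sample = houses[start::interval]     # then take every 10th house from there

print(len(sample), sample[:3])
```

Note that once the start is fixed the whole sample is determined, which is why a genuinely random start matters.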
Cluster Sampling
-This is the procedure used for administrative convenience when the whole population consists of many natural groups (clusters).
-In an immunization study, the school can be used as a sampling unit and a random sample of schools selected. All the class one pupils in each selected school then enter the sample. Another natural group or cluster can be a village.
-Cluster sampling therefore means any method of sampling in which a group is taken as the sampling unit.
-It is used when it is too expensive to spread your sample across the population as a whole. For example, transport can be expensive if interviewers must travel between people spread out all over the country.
The selection criteria in cluster sampling involve the following:
i. Get the total number of clusters in the study area
ii. Randomly select the number of clusters that represent the population
iii. Include all the units within the selected clusters in the sample
iv. Include no units from non-selected clusters, as they are represented by those from the selected clusters
v. This differs from stratified sampling, where some units are selected from each group
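Steps i–iv can be sketched in Python; the villages and their sizes are hypothetical.

```python
import random

# Hypothetical clusters: 20 villages of 30 residents each
villages = {f"village_{v}": [f"v{v}_person_{p}" for p in range(30)]
            for v in range(20)}

random.seed(3)
chosen = random.sample(list(villages), k=4)   # step ii: randomly pick clusters

# step iii: every unit in a selected cluster enters the sample
# step iv: units in non-selected clusters are excluded entirely
sample = [person for v in chosen for person in villages[v]]

print(len(chosen), len(sample))
```

Contrast this with the stratified sketch earlier: there, *every* group contributes a few units; here, a few whole groups contribute *all* their units.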

Advantages of Cluster Sampling

• Reduced costs
• Simplified field work
• Administration is more convenient – the sample is localised in relatively few centres

Disadvantages of Cluster Sampling

• Less accurate results due to higher sampling error than simple random sampling (cluster members may influence one another’s outcomes)
• A larger sample is needed for the same precision, which can be expensive
Multistage Sampling

-This is a modification of cluster sampling that involves drawing samples from within the selected clusters (sub-sampling).
-It can be two-stage, three-stage, and so on.
-For example, a sample may be drawn from selected districts, then locations, then villages, and finally individuals from those villages, who are the respondents in the study, using villages as the clusters.
-The probability sampling used at each stage is simple random sampling, which helps to eliminate bias.
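A minimal sketch of a three-stage design (districts → villages → residents); the frame sizes and the number sampled at each stage are made up for illustration.

```python
import random

# Hypothetical three-stage frame: district -> village -> residents
frame = {
    f"district_{d}": {
        f"village_{d}_{v}": [f"res_{d}_{v}_{r}" for r in range(20)]
        for v in range(5)
    }
    for d in range(4)
}

random.seed(11)
respondents = []
for district in random.sample(list(frame), k=2):                # stage 1
    for village in random.sample(list(frame[district]), k=2):   # stage 2
        respondents += random.sample(frame[district][village], k=5)  # stage 3

print(len(respondents))  # 2 districts x 2 villages x 5 residents
```

Simple random sampling is applied at every stage, matching the note above about eliminating bias.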

The importance of probability sampling is that choosing the sample with these techniques keeps the investigator on the safe side when s/he extends the findings from the sample to the population for purposes of generalization.

Listing of the Population


-Sampling techniques necessitate the use of a list of the individuals in the population, called the sampling frame.

-A population consists of many units – individuals or observations – and the sampling units may be households, villages, and so on.

-The units must be defined clearly. If the unit is an individual, the characteristics of the individuals to be included in the population should be stated clearly. In the case of a house, it should be known whether churches, schools or vacant houses are to be excluded from the population.

-The guiding principle in the definition is that there should be no overlapping of units, and all units put together constitute the population. Even where there is no list, a map with serially numbered houses can be used to choose a sample of houses. These serve as substitutes for a sampling frame.

SAMPLING DISTRIBUTION

A sample is supposed to provide a description of the population from which it is drawn.

-Is what is found in a sample of patients with a disease like diabetes mellitus true of all other patients with the same disease? Would another sample present the same picture?
-If samples are subject to chance findings, and if two samples are not alike, with how much confidence can we speak of the population from studying one sample?
-Statistical inference is the procedure that is followed in drawing conclusions from sample values. Rarely does the sample value coincide with the population value.
-The discrepancy between the sample value and the population value is called sampling error if the result is due to random sampling.
-However, these errors behave systematically and have a characteristic distribution.
Example:
Assume a situation with 500 Hb values where the calculated mean = 10.96 g/100ml. Take a random sample of size 20 and calculate its mean (the sample mean). This is an estimate of the population mean. Take another sample of 20 and get its mean. Repeat the process 25 times, each time calculating and recording the mean. If in a group of 18 students each of them draws 25 samples, the total number of sample means will be 450 (25 x 18), all of which are estimates of the population mean. We can work out the frequency distribution of the estimates in this sampling experiment.
-The distribution is called the sampling distribution of means. Most of the sample means cluster around the population mean (10.96 g/100ml) while very few deviate far from it.
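The classroom experiment described above can be simulated. The population of 500 Hb values is generated here rather than measured, assuming a roughly normal shape with an SD of 1.5 g/100ml (a value the text does not give).

```python
import random
import statistics

random.seed(0)
# Simulated population of 500 Hb values (g/100ml); SD of 1.5 is assumed
population = [random.gauss(10.96, 1.5) for _ in range(500)]
pop_mean = statistics.mean(population)

# Draw 450 samples of size 20 (25 samples x 18 students) and record each mean
sample_means = [statistics.mean(random.sample(population, 20))
                for _ in range(450)]

# The sample means cluster tightly around the population mean
print(round(pop_mean, 2), round(statistics.mean(sample_means), 2))
```

Plotting a histogram of `sample_means` would show the sampling distribution of the mean described in the text.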
The standard deviation of the sampling distribution is called the standard error. We mentioned this earlier.

s.e. = S/√n

Where:
S = sample SD
n = number of subjects in the sample
s.e. = the standard error, e.g. of the mean or proportion

The standard error tells us about the dispersion (or precision) of the estimates. Note that the dispersion measured here is that of the sample means/proportions around the population mean/proportion: the standard error describes how the sample mean/proportion varies about the population mean/proportion.
Example:
A class sample has n = 40, mean score = 75 and SD = 5, so SEM = 5/√40 ≈ 0.8.
Thus the sample mean varies from the population mean by about:
75 ± 0.8 (i.e. between 74.2 and 75.8) scores.
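The arithmetic of this example, as a quick check:

```python
import math

n, sd = 40, 5            # class example: n = 40 students, SD = 5 scores
sem = sd / math.sqrt(n)  # standard error of the mean

print(round(sem, 1))     # ≈ 0.8
```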
For a large number of samples, the distribution of sample statistics will resemble the normal distribution curve. This happens not only when the parent population is normally distributed but also when it is not.
-It is known that the sampling distribution closely resembles the normal curve whenever the sample size is greater than 30. When the sample size is not large, two things may happen:
a. The sample standard deviation may not be a reliable estimate of the population standard deviation (σ)

b. When the distribution in the population is not normal, the distribution of the sample mean may also not be normal

The second of these effects is of practical importance when the sample size is very small (< 15) and the distribution in the population is extremely non-normal. This leads us to the Central Limit Theorem.

NORMAL DISTRIBUTION
Introduction
-The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the "bell curve".
-It is also called the "Gaussian curve" after the mathematician Carl Friedrich Gauss, who played an important role in its history.
-The shape of a histogram reflects what is known as the data's frequency distribution, that is, how often each value appears in the data.
Normal Distribution Curve
The normal curve can be described as the statistical representation of the data collected whereby
most values are clustered around a central value with smaller and smaller frequencies moving further
away from the center.
Properties of a Normal Distribution
The normal distribution and its histogram have the following properties:
-There is a single highest bar (the mode).
-There are as many values above the mode as there are below it (it is in the middle).
-The shape of the histogram is symmetrical about the mode, so the left side is a mirror image of the
right.
-Most values are close to the mean; values that are small or large relative to the mean (extreme values) occur infrequently.
-It represents an infinitely large population – neither of its tails touches the baseline of the graph (x-axis).
-The frequency of values gets lower as you move further from the mode, in a way that produces a bell shape.
These characteristics can further be simplified as follows:
• It is a smooth, bell-shaped curve.
• It is symmetric about the mean (centre).
• It has asymptotic tails (they taper off but never touch).
• It is marked in standard deviations on both sides of the mean.
• The median, mean, mode and midrange are the same value.
• The area under the curve is equal to one or 100%.

The sample that produced the histogram below has a range from 0 to 8, and there are 100 values in the sample.
-The value 4 appears 30 times in the data – more often than any other – so 4 is the mode.
-The mode (4) is at the centre of the histogram.
-The shape is symmetrical (not perfectly, but close enough).
-You should be able to see how it is shaped like a bell – the frequencies drop off with increasing speed as you move away from the mode.

Figure below shows a histogram produced from normally distributed data.


Remember, it is the shape of the histogram that is important, not the raw data.

-In normally distributed data, the mean, mode and median are all the same (or very close);
-In normally distributed data, the mode is near the centre of the range.

Probability and Normal Curve


In this sub-section we deal with the idea of a continuous random variable.
-By this it is meant that the random variable can take on an infinite number of values, with the property that there are no gaps between the values.
-Examples of continuous random variables are the time of day and the speed of a car.
-A random variable is said to have a normal distribution if it has a probability distribution that is symmetric and bell-shaped (see figure below).

Normal Curve
The normal curve is described by two parameters:
a. The arithmetic mean: population mean (μ) or sample mean (x̄)

b. The standard deviation: population standard deviation (σ) or sample standard deviation (SD)

For the standard normal distribution, the mean, mode and median are assigned the value zero (0) while the SD is given the value 1.00.
There are two very important things to mention here.

• The total area under the curve is 1. The area under the curve represents frequency (the whole area represents 100%). Thus an area under the curve represents some percentage of the observations, which we will illustrate using the figure.

• Secondly, area is used to measure probabilities.

The area under the curve is marked off at 1σ, 2σ or 3σ from the mean. Points to the right of the mean are indicated as +1σ, +2σ and +3σ, while points to the left of the mean are −1σ, −2σ and −3σ.
The characteristics of the normal curve:
• Approx. 68.26% of observations lie within 1σ on either side of the mean (±1σ)
• Approx. 95.44% of observations lie within 2σ on either side of the mean (±2σ)
• Approx. 99.74% of observations lie within 3σ on either side of the mean (±3σ)
• 0.26% of observations lie beyond 3σ on either side

Normal Curve
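These areas can be checked with Python's `statistics.NormalDist`; the exact values (68.27%, 95.45%, 99.73%) differ from the rounded figures above only in the last decimal place.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)   # area between -kσ and +kσ
    print(f"±{k}σ: {within * 100:.2f}%")
```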

Usefulness of a Normal Curve


The properties of the curve can be used to make inferences and test hypotheses.
-By knowing the mean and SD of a data set, the proportion of observations lying between two values can be determined.
-Tables have been developed that give the percentage of observations that fall in various parts of the normal curve.
-The table is constructed on the assumption that the normal distribution has a mean of zero (μ = 0) and an SD of one (σ = 1). But observed data may have any mean or SD. Thus raw data are converted to standard scores (z values) in order for the normal tables to be applicable.
A normal distribution is intimately connected to Z-scores. The main idea is to standardize all the given data by using Z-scores. These Z-scores can then be used to find the area (and thus the probability) under the normal curve.

Definition of z-score
A Z-score is the number of standard deviations that a given x value is above or below the mean.

-If z represents the Z-score for a given x value, then z = (x − μ)/σ

-The standard score is also referred to as the Z score.
-It is the value obtained after transforming an observed score to fit the standard normal distribution.
-The way of transforming an observed score is given by the formula above, which can be written as:
Standard score = (score obtained − mean of distribution) / standard deviation of distribution

-The standard score is the difference between a person's or observation's raw score and the mean of the distribution, expressed in SD units.
-Knowing the standard score enables us to know how many standard deviations the score is above or below the mean.
-z is based on a normal distribution.
-When scores are changed to z scores, scores below the mean are negative while scores above the mean are positive.

Using Z-scores to find probabilities

The weight of men has a bell-shaped distribution with a mean of 69.0 kg and a standard deviation of 2.8 kg. What percentage of men have weights between 65.4 kg and 72.3 kg?
Solution: The Z-score associated with 65.4 is (65.4 − 69.0)/2.8 = −1.29. For 72.3, the Z-score is (72.3 − 69.0)/2.8 = 1.18. With Z-scores of −1.29 and 1.18, the area under the bell-shaped curve between them is 78.25%.

The idea of Z-scores is very valuable for understanding what you are calculating when finding
probabilities.
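The worked example can be verified in Python; using unrounded Z-scores gives an area of about 78.1%, versus the 78.25% obtained from a table lookup with the rounded values −1.29 and 1.18.

```python
from statistics import NormalDist

weights = NormalDist(mu=69.0, sigma=2.8)   # men's weight in kg

z_low  = (65.4 - 69.0) / 2.8   # ≈ -1.29
z_high = (72.3 - 69.0) / 2.8   # ≈  1.18

# Area between the two weights = P(65.4 < X < 72.3)
area = weights.cdf(72.3) - weights.cdf(65.4)
print(round(z_low, 2), round(z_high, 2), f"{area * 100:.1f}%")
```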

CENTRAL LIMIT THEOREM


Even when a normally distributed population (e.g. population blood pressure) is replaced by a non-normally distributed population, if the sample taken is large the sample mean will be approximately normally distributed. This is a very remarkable mathematical property known as the central limit theorem.
“It states that even when a variable is NOT normally distributed, the sample mean will tend to be normally distributed.”
The sample size needed to give a close approximation to normality depends on how non-normal the population is. In most circumstances a sample size of 15 or more is enough.
Statistical inference consists of two aspects:
i) Estimation of a population value (parameter);
ii) Testing of hypotheses

-In estimation, the main interest in statistical inference is to estimate the population parameter from a sample of observations.
-This should be an unbiased estimate.
-The estimate can be a single statistic such as the mean (a point estimate) or a range with an attached probability, called a confidence interval.
Assume the mean and standard error of a large sample of cases used to estimate the population Hb values are 11.4 g/100ml and 0.7, respectively. Theoretical statistics states that the sample mean is the best estimate of the population mean. So the point estimate of the population mean is 11.4 g/100ml.
You may recall that with sampling we EXPECT error in our statistics. Thus, calculated statistics will not be equal to the actual parameters being measured, due to random (chance) errors. We therefore require unbiased sampling, where no factors systematically push the estimate in a particular direction. The larger the sample, the less error is expected.
Consider (conceptualize) a distribution of sample means drawn from a population. Repeated sampling from the same population, calculating the mean each time, produces a distribution of sample means. This distribution (the sampling distribution of means) will be a normal distribution.
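A small simulation illustrates the theorem: even for a strongly right-skewed population, the means of repeated samples centre on the population mean. The exponential-like population below is an assumed example, not data from the text.

```python
import random
import statistics

random.seed(0)
# A clearly non-normal (right-skewed) population, e.g. exponential-like
population = [random.expovariate(1 / 10) for _ in range(10_000)]

# Sampling distribution of the mean for samples of size 60
sample_means = [statistics.mean(random.sample(population, 60))
                for _ in range(1_000)]

# The sample means centre on the population mean and are far less skewed
print(round(statistics.mean(population), 1),
      round(statistics.mean(sample_means), 1))
```

A histogram of `sample_means` would look roughly bell-shaped despite the skewed parent population.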

NB

In the Central Limit Theorem:

1. The mean of the distribution of sample means equals the population mean if the number of means is large. This holds even when the population is skewed, provided the sample is large (n > 60).
2. The SD of the distribution of sample means is the Standard Error of the Mean.
The standard error of the mean can be estimated using the following formula:

SEm = SD of population/√n

Calculate the standard error of the mean using the following values:
Mean = 75
SDp = 16
n = 64
SEm = 16/√64 = 2
Note that as ‘n’ increases, SEm decreases. Similarly, as the SD decreases, the SEm decreases, and your confidence in estimating the population mean increases.

Confidence Interval, Limits and Levels


The confidence interval is very closely related to the SEm.
-The confidence interval refers to the band of scores or the range of values within which a population parameter (value) is estimated to lie.
-The upper and lower limits of the range of values are called the confidence limits.
-Confidence level: refers to the estimated probability that a population parameter lies within a given confidence interval. We can choose the confidence level we wish to attain.
The general convention is to use a confidence level of 95% or 99%.
Approximately 95% of the sample means in the distribution obtained by repeated sampling would lie within two standard errors above or below the population mean. The assumption is that the sample means are normally distributed, where the mean is the population mean μ (mu) and the standard deviation σ (sigma) is the standard error of the sample means, s/√n.
-This is justified where the sample is large, i.e. more than 30 (n > 30), and the distribution is normal.
-In that case the sample standard deviation, s, is a reliable estimate of the population standard deviation σ.
-From the normal distribution curve, 95% of the observations should lie within 2 SD of the mean. More precisely, 95% of the sample means should lie within 1.96 standard errors above or below the population mean; 1.96 is the 5% point of the standard normal distribution. Thus there is a 95% probability that a particular sample mean lies within 1.96 standard errors above or below the population mean.
This result is used to estimate, from the observed sample mean and its standard error (s.e. = s/√n), a range within which the population mean is likely to lie. Since there is a 95% probability that the sample mean lies within 1.96 s.e. above or below the population mean:
There is a 95% probability that the interval between x̄ − 1.96 s.e. and x̄ + 1.96 s.e. contains the unknown population mean.
The interval from x̄ − 1.96 s.e. to x̄ + 1.96 s.e. therefore represents likely values for the population mean. This is called the 95% confidence interval (C.I.) for the population mean, and therefore
x̄ + 1.96 s.e. and x̄ − 1.96 s.e. are the upper and lower 95% limits for the population mean.
Thus for a large sample, 95% C.I. = x̄ ± (1.96 × s/√n).
Confidence intervals for percentages other than 95% are calculated in the same way, using the appropriate percentage point, Z, of the standard normal distribution in place of 1.96.

The confidence interval is computed as follows:

CI = observed statistic (e.g. the mean) ± (Z-score of the confidence level) × (standard error)
Where the observed statistic is the sample mean.
Confidence Intervals for Proportions and Percentages
-Confidence intervals can also be set up for proportions and percentages. For proportions, the formula for calculating the s.e. of the proportion is:
Sp = √(pq/n)
Where:
Sp = standard error of the proportion
p = sample proportion (proportion of interest)
q = 1 – p
n = sample size
NOTE: VALUES FOR CONFIDENCE LEVELS
99% C.I. is x̄ ± 2.58 × s.e.
95% C.I. is x̄ ± 1.96 × s.e.
90% C.I. is x̄ ± 1.64 × s.e.
-We can say that we are 99% sure or confident that the true population mean of the test for this particular group of students lies between 73.2 and 76.8.
-The confidence level used reflects the degree of risk that the researcher is willing to take of being wrong.
-With a 95% confidence level, the researcher accepts the probability of being wrong 5 times out of 100.
-A 99% confidence level sets the risk at only 1%.
Illustration of Confidence Intervals, Limits and Levels using the Normal Curve

The smaller the standard error the more confidence we can have in the statistic. The standard error
can be decreased by increasing the sample size.
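The confidence-level multipliers can be applied directly. The sketch below assumes a sample mean of 75 and a standard error of 0.7, values inferred from the quoted 99% band of 73.2 to 76.8 rather than stated explicitly in the text.

```python
# Assumed values consistent with the quoted 99% band of 73.2 to 76.8
mean, se = 75.0, 0.7

# Conventional z multipliers for the three common confidence levels
for label, z in (("90%", 1.64), ("95%", 1.96), ("99%", 2.58)):
    lower, upper = mean - z * se, mean + z * se
    print(f"{label} C.I.: {lower:.1f} to {upper:.1f}")
```

The 99% interval is the widest: higher confidence is bought at the price of a less precise range.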

ESTIMATING THE SIZE OF THE SAMPLE


The larger the sample size, the better. However, there are set guidelines for arriving at the required sample size.
Factors to be considered when estimating the sample size include:

a. The degree of precision required between the sample and the general population. This is the error that can be tolerated in the estimate (e.g. 1% or 5%, i.e. 0.01 or 0.05).

b. The level of confidence required so that the error will not be greater than the tolerable limit.

c. The variability of the population (expressed as the standard deviation).

d. The method of sampling to be employed.

SAMPLE SIZE CALCULATION


Sample size is usually calculated by the project’s statistician. However, as you will be doing research and designing your own study, it is important to be familiar with the calculation, using the following example.
Calculate the minimum sample size required to determine p, the proportion of children in the population vaccinated against tetanus.
The methodology includes the following steps:
a. Assume the tolerable limit (error) is 0.03 (3%).
b. We wish to be 95% sure that the error will not be greater than 0.03.
c. Use the general C.I. for proportions and consider this in the light of our chosen confidence level.
d. The error term of our confidence interval is:
e. z value × √(pq/n)
f. Decision made: the error is not to exceed 0.03, so z√(pq/n) ≤ 0.03
g. At a confidence level of 95%, the appropriate z is 1.96, so 1.96√(pq/n) ≤ 0.03

What is the anticipated value of p? If no estimate is available, use the conservative choice and set p at 0.5. Using p = 0.5 maximizes pq and hence the sample size.
Thus 1.96√((0.5)(0.5)/n) ≤ 0.03
Squaring both sides: 3.84 × (0.5)(0.5)/n ≤ 0.0009
n ≥ 3.84(0.25)/0.0009 ≈ 1066.7
Thus a minimum of 1067 children are needed in the sample.
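The calculation can be reproduced in Python, following the lesson's rounding of z² = 1.96² to 3.84 (using the unrounded 3.8416 gives n ≈ 1067.1, which rounds up to 1068):

```python
import math

p = 0.5       # conservative proportion (maximises p*q)
e = 0.03      # tolerable error (3%)
z_sq = 3.84   # the lesson rounds 1.96**2 = 3.8416 to 3.84

# From z*sqrt(pq/n) <= e, squaring and rearranging: n >= z^2 * p * q / e^2
n = z_sq * p * (1 - p) / e**2
print(round(n, 1), math.ceil(n))  # round up to get the minimum whole sample
```

Rounding up (rather than to the nearest integer) is what guarantees the error stays within the tolerable limit.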
