Lesson 3
LESSON OBJECTIVES
• Develop data collection instruments and
procedures
• Outline probability and non-probability
sampling methods
• Outline different scales of measurement
• Outline types of variables
• Outline different methods used to control for
confounding
• Apply data gathering techniques to various
epidemiological designs
Sources of data in epidemiology
• JOURNALS
• OTHER DEPARTMENTS OR ORGANISATIONS
• INTERNET
• TEXT BOOKS
• RESEARCH PROJECTS
• HOSPITAL CLINIC RECORDS
• SURVEYS/CENSUS
• ANY OTHER?
Who to question
• When collecting data you usually cannot question each and
every person in the community. Doing that would
prove tedious, time-consuming and
expensive.
• Instead, you can select a small sample (study
population) that represents the entire
community.
• The selection follows statistically proven methods
that give a representative sample.
• These methods are called "sampling methods"
and can be classified into two groups:
– Probability sampling methods
– Non-probability sampling methods
Sampling
• Sampling: the process of selecting a group of
people who are representative of the
population being studied
• That is, the selection of study subjects from the
community of interest to represent the
population
• Sample: the people selected
Advantages of samples
• Easier to study than the whole population
• Data analysis is easier on a sample
• Less costly (on budget)
• Saves time
• Random samples reduce selection bias in the study
Types of sampling
• There are two major types of sampling:
– Probability sampling
– Non-probability sampling
Probability samples
– Every subject in the population has a known,
non-zero probability of being selected to take part
– Usually selected randomly from the population
Probability sampling methods include:
• Simple random sample
• Systematic random sample
• Cluster sampling
• Stratified sampling
Simple random
• Every unit in the population has an equal
chance of being selected
• Techniques used include:
– Lottery method
– Table of random numbers
– Computer program
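The "computer program" technique can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the sampling frame of household IDs and the function name `simple_random_sample` are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; each unit has an equal chance."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: household IDs 1 to 3000
households = list(range(1, 3001))
sample = simple_random_sample(households, 75, seed=42)
print(len(sample))  # 75
```

Fixing the seed is optional; it simply lets someone else repeat the same draw.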
Systematic sample
• Every nth element is selected, e.g. if you want
a 10% sample, you would select every 10th
element
• To get the sampling interval:
Sampling interval = Total population / Sample size
• E.g. in a population of 3000, you want n = 75
Sampling interval = 3000 / 75 = 40
• You will pick every 40th element
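The interval calculation above can be sketched in Python as follows. The random starting point within the first interval is an assumption added for the example (the slide only says to pick every 40th element), and the function name is hypothetical.

```python
import random

def systematic_sample(population, n, seed=None):
    """Select every k-th element, where k = population size // sample size."""
    k = len(population) // n            # sampling interval, e.g. 3000 // 75 = 40
    rng = random.Random(seed)
    start = rng.randrange(k)            # assumed: random start in the first interval
    return population[start::k][:n]

# Hypothetical numbered list of 3000 people
people = list(range(1, 3001))
chosen = systematic_sample(people, 75, seed=1)
print(len(chosen), chosen[1] - chosen[0])  # 75 40
```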
Stratified Samples
• Selection of study elements from a target
population that can be divided into
groups/categories e.g. according to age
• Elements are then selected randomly from each
stratum
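A minimal sketch of stratified sampling in Python, assuming the same sampling fraction is applied in every stratum (proportional allocation, which the slide does not specify). The population, the age groups and the function name are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Group units into strata, then sample randomly within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 people under 18 and 200 aged 18 and over
population = [
    {"id": i, "age_group": "under 18" if i % 3 == 0 else "18 and over"}
    for i in range(300)
]
chosen_people = stratified_sample(population, lambda p: p["age_group"], 0.10, seed=1)
print(len(chosen_people))  # 30
```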
Cluster Sampling
• Elements are sampled in groups rather than
individuals
• A group is selected and all its members
included in the sample
Used when:
• Constructing a complete list of population
elements is difficult, costly, or impossible.
• The population is concentrated in "natural"
clusters (city blocks, schools, hospitals, etc.)
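Cluster sampling as described above (select whole groups, then include all their members) might look like this in Python. The schools, pupil labels and function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and include every member of each."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in picked for m in clusters[name]]
    return picked, members

# Hypothetical clusters: five schools with 20 pupils each
schools = {f"School {c}": [f"{c}-{i}" for i in range(20)] for c in "ABCDE"}
picked, pupils = cluster_sample(schools, 2, seed=7)
print(len(picked), len(pupils))  # 2 40
```

Only a list of clusters is needed here, not a list of every individual, which is why the method suits situations where a complete population list is hard to build.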
Non-Probability Sampling
• Not every element has the opportunity
to be selected
• Subjective judgment is used in the
selection of the sample
• Used where it is not possible to use
probability sampling, e.g. because of limited
resources
• Or when the researcher is not interested in
selecting a representative sample
Types of non-probability samples
• Convenience sample
• Snowball sample
• Quota sample
• Purposive sample
Convenience Sample
• Also called accidental sampling
• Uses any available elements of the
population that meet the inclusion
criteria
• Subjects are included because they are in the
right place at the right time
Purposive Sample
• Also called judgemental sampling
• The researcher selects elements/subjects
that best represent the phenomenon
under study in the population (selective
sampling)
Snowball Sample
• Initial members of the group identify other
possible members and those members in turn
identify other members
• The sample grows like a snowball
• Used where the researcher is unable to identify
members of a population in advance, e.g.:
– Homeless people
– Commercial sex workers
– Drug addicts, etc.
Quota Sample
• Used where there are subgroups of the
population to be studied
• Subjects are selected to fit in identified quotas
e.g. certain religion or social class
NOTE
• You may choose to use one or several
approaches but not necessarily all.
• The sampling method used is dependent on
your objective and design of study.
Development & testing tools
• The tools used to measure a community’s
health status
• In a community survey they include:
– Questionnaires
– Focus group discussion guides
– Tools for measuring physical health status, e.g.
weighing scales, tape measures
– Key informant interview schedule
Questionnaire
• A questionnaire is a set of standardised questions
designed to collect information about a specific issue
in the community
• A tool for collecting information
• Made up of open-ended and/or closed-ended questions
• Open-ended: allow respondents to provide their own answers
• Closed-ended: offer respondents a list of possible answers to
choose from
Qualities of a Good Questionnaire
What is statistics?
Inferential & Descriptive Statistics
• Two broad branches in statistics
Descriptive statistics
Once data have been collected, the next step is normally to
summarize them, if possible with one or two summary
statistics. Summary or descriptive statistics describe the original
data set (the set of responses for each question) by using just one
or two numbers, typically an average and a measure of
dispersion.
Inferential Statistics
This is the branch of statistics that uses sample data to make
generalizations about population parameters. Theoretical
distributions are useful here.
Analysis of Data
• Why Analyze Data?
Quantitative and Qualitative Data
NUMERICAL VARIABLES
1. CONTINUOUS
• Continuous variables can take any value within a
range; in clinical research they usually take
positive values
• With this type of data, one can obtain more
and more precise measurements depending
on the instrument used, e.g.:
– height in centimetres (2.5 cm or 2.546 cm or
2.543216 cm)
– temperature in degrees Celsius (37.2 °C or
37.19999 °C etc.)
• Continuous variables can be transformed into
categorical or binary variables in clinical research
studies.
2. DISCRETE
• These are variables that can only take
whole-number values, e.g.:
CATEGORICAL
• Ordinal variables. These are grouped
variables that are ordered or ranked in
increasing or decreasing order:
• For example:
High income (above $300 per month);
Middle income ($100-$300 per month); and
Low income (less than $100 per month).
• Nominal variables. The groups in these
variables do not have an order or ranking.
For example:
• Sex: male, female
• Main food crops: maize, millet, rice, etc.
• Religion: Christian, Muslim, Hindu,
Buddhist, etc.
BINARY VARIABLES (DICHOTOMOUS)
• Binary variables are often used as indicator
variables, which take the value of 1 if a specific
characteristic or disease is present, and 0 if it
is not.
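The variable types above can be illustrated with a short Python sketch that turns a continuous value into an ordinal category and codes a binary indicator. The income cut-offs come from the ordinal example in these notes; the function names and the test values are assumptions made for illustration.

```python
def income_group(monthly_income_usd):
    """Convert a continuous income into the ordinal groups used above."""
    if monthly_income_usd > 300:
        return "High"
    if monthly_income_usd >= 100:
        return "Middle"
    return "Low"

def indicator(present):
    """Binary (dichotomous) coding: 1 if the characteristic is present, else 0."""
    return 1 if present else 0

print(income_group(350), income_group(150), income_group(80))  # High Middle Low
print(indicator(True), indicator(False))                       # 1 0
```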
CAUSE AND EFFECT
• When studying the relationship between
variables, e.g. establishing a determinant of a
disease or the relationship between a
disease (outcome) and an exposure, variables are
often labelled as dependent or independent.
• Dependent variable: a variable that is
influenced or changed by another variable
• Independent variable: a variable that is
manipulated (or observed) in order to determine
its effect or influence on another variable.
Confounding
STEP 3
• Weight each stratum-specific relative risk by
the proportion of subjects in the particular
stratum and then combine the two weighted
risks
FORMULA
• (RELATIVE RISK IN STRATUM 1 × WEIGHT 1) + (RELATIVE RISK IN
STRATUM 2 × WEIGHT 2)
• Weighting
• Relative risk (in detail in another lesson)
• The advantage of stratification, compared to
restriction, is that the full study population is
analyzed, preserving study power and
maintaining generalizability.
• The primary disadvantage of stratification is
the inability to deal with multiple confounding
factors simultaneously, because a separate
stratum must be created for each combination
of factors
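The weighting in Step 3 can be sketched in Python as below. It follows the simple scheme described in these notes (weight = proportion of subjects in the stratum) and is not a Mantel-Haenszel estimator. The stratum relative risks and sizes are invented purely for illustration, and the function name is hypothetical.

```python
def weighted_relative_risk(strata):
    """strata: list of (relative_risk, number_of_subjects), one pair per stratum."""
    total = sum(n for _, n in strata)
    # Each weight is the proportion of all subjects that fall in the stratum
    return sum(rr * (n / total) for rr, n in strata)

# Hypothetical strata: RR 2.0 among 300 subjects, RR 1.5 among 100 subjects
adjusted = weighted_relative_risk([(2.0, 300), (1.5, 100)])
print(round(adjusted, 3))  # 1.875  (2.0 x 0.75 + 1.5 x 0.25)
```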
3) MATCHING
• Matching involves identifying groups of
subjects within a study population who are
the same with respect to a confounder of
interest.
3. An interval scale: unlike the ordinal scale, the interval
scale has equally spaced units; in other words, some
rule has been established as a basis for making
the units equal.
The Fahrenheit scale is an example of an interval scale:
an increase in temperature from 10 to 20 degrees
Fahrenheit is the same size as an increase from
60 to 70 degrees Fahrenheit. The interval scale, however,
does not have a true zero point. The mean is the
appropriate measure of central tendency, while the
standard deviation is the widely used measure of
dispersion.
4. A ratio scale: an interval scale with a
true zero point, such as height in centimetres.
Generally, all statistical techniques can be used
with ratio-scale data.
Organizing Data for interpretation
line list or line listing
• The line listing is one type of epidemiologic
database, and is organized like a spreadsheet
with rows and columns.
• Typically, each row is called a record or
observation and represents one person or
case of disease. Each column is called a
variable and contains information about one
characteristic of the individual, such as race or
date of birth.
Frequency Distributions
HISTOGRAM
What can we see in the graph?
1. Central location of the data (where the
distribution has its peak)
• Note that the data in Figure 2.1 seem to cluster
around a central value, with progressively fewer
persons on either side of this central value.
• This type of symmetric distribution is the
classic bell-shaped curve, also known as a
normal distribution.
• Three measures of central location are
commonly used in epidemiology: arithmetic
mean, median, and mode.
2. The spread (how widely the data are dispersed
from the highest peak)
3. The shape (how the data are distributed on both sides
of the peak). The shape can be symmetrical or
asymmetrical.
• A distribution that has a central location to the
right and a tail to the left is said to be
negatively skewed or skewed to the left.
• A distribution that has a central location to the
left and a tail off to the right is said to be
positively skewed or skewed to the right
[Figure: histogram of a distribution skewed to the right; y-axis: Frequency]
[Figure: histogram of a distribution skewed to the left; x-axis: grades (50 to 100), y-axis: Percent]
MEASURES OF CENTRAL LOCATION
• Mode
• The mode is the value that occurs most often
in a set of data.
• It can be determined simply by tallying the
number of times each value occurs.
• Consider, for example, the number of doses
of diphtheria-pertussis-tetanus (DPT) vaccine
each of seventeen 2-year-old children in a
particular village received: 0, 0, 1, 1, 2, 2, 2, 3,
3, 3, 3, 3, 3, 4, 4, 4, 4
Median
Mean = (4 + 23 + 28 + 31 + 32) / 5
Mean = 118 / 5
Mean = 23.6
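The mean of the five values above, together with their median, can be verified with Python's standard `statistics` module. This is only a check of the arithmetic; the variable names are chosen for the example.

```python
import statistics

values = [4, 23, 28, 31, 32]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
print(mean, median)  # 23.6 28
```

The median (28) is higher than the mean (23.6) because the single low value, 4, pulls the mean down.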
MEASURES OF SPREAD
1. RANGE
• The range of a set of data is the difference
between its largest (maximum) value and its
smallest (minimum) value.
• In the epidemiologic community, the range is
usually reported as “from (the minimum) to
(the maximum),” that is, as two numbers
rather than one.
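Both ways of reporting the range can be shown with a small Python sketch. The five values are reused from the mean calculation earlier in these notes purely as sample data.

```python
values = [4, 23, 28, 31, 32]

smallest, largest = min(values), max(values)
print(largest - smallest)                # 28 (range as a single number)
print(f"from {smallest} to {largest}")   # from 4 to 32 (as usually reported)
```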
3. Quartiles
FORMULA FOR SD
NORMAL DISTRIBUTION
• There is a 95 per cent chance that we choose a
sample whose mean lies in the interval
µ − 1.96σ/√n to µ + 1.96σ/√n
• In order to calculate this CI, we first need the
sample mean of a random sample, the sample
size n, and the population standard deviation
σ.
• We know the first two after collecting
information about the members of the
sample, but we do not know the standard
deviation of the population.
EXAMPLE
• The following are summary statistics from a
study that was done to estimate the mean
weight of males in Nakuru County:
• Sample size = 200 people
• Mean weight = 69.7 kg
• Standard deviation = 15 kg
Using these data, calculate a 95 per cent CI for the
mean weight of the population (10 000
people)
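One way to work this example is sketched below in Python. It assumes the sample standard deviation (15 kg) can stand in for the unknown population σ because the sample is large, and it ignores any finite population correction; both are assumptions of this sketch rather than steps given in the notes. The function name `ci95` is hypothetical.

```python
import math

def ci95(sample_mean, sd, n, z=1.96):
    """95% confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Assumption: the sample SD of 15 kg is used in place of sigma
low, high = ci95(69.7, 15, 200)
print(f"{low:.2f} kg to {high:.2f} kg")  # 67.62 kg to 71.78 kg
```

The standard error here is 15 / √200 ≈ 1.06 kg, so the margin is about 1.96 × 1.06 ≈ 2.08 kg on either side of the sample mean.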
• NEXT EXAMPLE IF YOU HAVE S.E
ESTIMATING SAMPLE SIZE FOR THE MEAN