Lesson 3

LESSON THREE
LESSON OBJECTIVES
• Develop data collection instruments and
procedures
• Outline probability and non probability
sampling methods
• Outline different scales of measurement
• Outline types of variables
• Outline different methods used to control for
confounding
• Apply data gathering techniques to various
epidemiological designs
Sources of data in epidemiology
• JOURNALS
• OTHER DEPARTMENTS OR ORGANISATIONS
• INTERNET
• TEXT BOOKS
• RESEARCH PROJECTS
• HOSPITAL CLINIC RECORDS
• SURVEYS/CENSUS
• ANY OTHER?
Who to question
• When collecting data you may not ask each and
every person in the community. Doing that may
prove to be tedious, time consuming and
expensive.
• As a result you can select a small sample (study
population) that will represent the entire
community.
• The process follows statistically proven methods
that give a representative sample.
• These methods are called ‘sampling methods’
and can be classified into two groups, i.e.
– Probability sampling methods
– Non-probability sampling methods
Sampling
• Sampling is the process of selecting a group of
people that is representative of the
population being studied
• It involves selecting study subjects from the
community of interest to represent the
population
• Sample: the selected people
Advantages of samples
• Easy to study
• Data analysis easier on sample
• Less costly (on budget)
• Saves on time
• Random samples reduce selection bias in the study
Types of sampling
• Two major types of sampling
• Probability sampling
• Non probability sampling
Probability samples
– Every subject in the population has a known,
non-zero probability of being selected to take part
– Subjects are usually selected randomly from the population
Probability sampling methods include:
• Simple random sample
• Systematic random sample
• Cluster sampling
• Stratified sampling
Simple random
• Every unit in the population has an equal
chance of being selected
• Techniques used include:-
• Lottery method
• Table of random numbers
• Computer program
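The computer-program technique can be sketched in Python as follows. This is a minimal illustration only: the sampling frame of 3000 numbered people, the sample size of 75 and the seed value are assumptions made for the example.

```python
import random

# Hypothetical sampling frame: people numbered 1..3000 (an assumption for illustration).
population = list(range(1, 3001))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(2011)

# random.sample draws without replacement, so every unit has an equal
# chance of selection and nobody is picked twice.
simple_random_sample = rng.sample(population, k=75)

print(sorted(simple_random_sample)[:10])
```

In a real survey the list would hold household or person identifiers taken from the sampling frame rather than plain numbers.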
Systematic sample
• Every nth element is selected, e.g. if you want
a 10% sample, you would select every 10th
element
• To get the sampling interval:
Sampling interval = Total population ÷ Sample size
• E.g. in a population of 3000, you want n = 75
Sampling interval = 3000 ÷ 75 = 40
You will pick every 40th number
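The interval calculation and the selection itself can be sketched as below. The function name `systematic_sample` is hypothetical, and the random starting point within the first interval is a common refinement that the slide does not mention.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return the selected positions (1-based) for a systematic sample."""
    interval = population_size // sample_size   # sampling interval, e.g. 3000 // 75 = 40
    start = rng.randint(1, interval)            # assumed: random start within the first interval
    return [start + i * interval for i in range(sample_size)]

rng = random.Random(1)                          # arbitrary seed for repeatability
selected = systematic_sample(3000, 75, rng)
print(selected[:5])
```

Every selected position is exactly 40 places after the previous one, which is what "every 40th number" means in the example.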
Stratified Samples
• Selection of study elements from a target
population that can be divided into
groups/categories e.g. according to age
• Elements are selected from each stratum
randomly
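A minimal sketch of stratified selection, assuming three made-up age strata and the same 10% sampling fraction in each (both are assumptions for illustration, not figures from the lesson):

```python
import random

# Hypothetical target population grouped by age band (strata); the names are invented.
strata = {
    "under_18": [f"U{i}" for i in range(1, 201)],   # 200 people
    "18_to_59": [f"A{i}" for i in range(1, 601)],   # 600 people
    "60_plus":  [f"S{i}" for i in range(1, 201)],   # 200 people
}

rng = random.Random(7)        # arbitrary seed
sampling_fraction = 0.10      # assumed: the same 10% is drawn from every stratum

# Draw a simple random sample inside each stratum separately.
stratified_sample = {
    name: rng.sample(members, k=round(len(members) * sampling_fraction))
    for name, members in strata.items()
}

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```

Because each stratum is sampled on its own, every age group is guaranteed to appear in the final sample.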
Cluster Sampling
• Elements are sampled in groups rather than
individuals
• A group is selected and all its members
included in the sample
Used when:
• Constructing a complete list of population
elements is difficult, costly, or impossible.
• The population is concentrated in "natural"
clusters (city blocks, schools, hospitals, etc.)
Non Probability Sampling
• Not every element has the opportunity
to be selected
• Subjective judgment is used in the
selection of the sample
• Used where it is not possible to use
probability sampling, e.g. because of limited
resources
• Or when the researcher is not interested in
selecting a representative sample
Types of non probability samples
• Convenience sample
• Snowball sample
• Quota sample
• Purposive sample
Convenience Sample
• Also called accidental sampling
• Use any available elements of the
population that meet the inclusion
criteria
• Subjects being in the right place at the
right time
Purposive Sample
• Also called judgemental sampling
• The researcher selects elements/subjects
that best represent the phenomenon
under study in the population (selective
sampling)
Snowball Sample
• Initial members of the group identify other
possible members and those members in turn
identify other members
• The sample grows like a snowball
• Used where the researcher is unable to identify
members of a population in advance, e.g.
– Homeless people
– Commercial sex workers
– Drug addicts etc.
Quota Sample
• Used where there are subgroups of the
population to be studied
• Subjects are selected to fit in identified quotas
e.g. certain religion or social class
NOTE
• You may choose to use one or several
approaches but not necessarily all.
• The sampling method used is dependent on
your objective and design of study.
Development & testing tools
• The tools used to measure a community’s
health status
• In a community survey they include:
– Questionnaires
– Focus group discussion guides
– Tools for measuring physical health status e.g.
weighing scales, tape measures
– Key informant interview schedules
Questionnaire
• A questionnaire is a set of standardised questions
designed to collect information about a specific issue
in the community
• A tool for collecting information
• Made up of open-ended and/or closed-ended questions
• Open-ended questions allow respondents to provide their own answers
• Closed-ended questions offer respondents a list of possible answers to
choose from
Qualities of a Good Questionnaire
• Has simple and specific questions. Avoids
wording that is above the vocabulary or reading
skills of the respondents.
• Has short and precise questions. The number of
questions should not be too many or else they
will put off the person being interviewed. In
other words, keep it short and
simple (KISS).
• Avoids use of abbreviations or jargon.
• Avoids questions that are too demanding and
time consuming
• Avoids bias in questions. Biased questions
influence people to answer in a way that does
not accurately reflect their position. For
example, a question like ’Do you agree with
the majority of the people that health
standards are falling?’ implies that the
respondent should agree.
• Avoids making assumptions.
– Questions such as ’How many children do you have?’ assume that the
respondent has children. You should only ask this question after
establishing the situation with the question ’Do you have children?’
• Avoids double questions.
– For example, ’Did the MCH talk help to identify ways to improve the
sanitation and nutrition of your children?’ It is better to ask about
sanitation and nutrition separately.
• Has clear wording.
– Words such as majority, older people, regularly, might mean different
things to different people and so should be avoided.
• Questions ask about simple common happenings.
• Questions range from known to unknown and from simple to
complex.
• All the questions should relate to the purpose of study.
– Eliminate ’nice to know’ questions, you may end up with 'information
overload’.
• Questions are acceptable to the people
included in the survey. You should view the
questions through the respondent’s eyes and
ask yourself the following:
- Will the question be seen as reasonable?
- Will it infringe on the respondent’s privacy?
- Will the respondent be able and willing to
answer the question?
• Questions should not screen disease if no
effective treatment can be offered for the
cases found or if the condition is rare.
• Type of question should either be open-
or closed-ended.
• Questionnaire must be pre-tested before
executing the survey. This helps to
identify and eliminate questions that are
defective or may lead to wrong
information. You may even need to
rephrase the questionnaire so that it can
elicit the correct responses.
Pretesting the questionnaire
Done to establish:
• that each question measures what it is intended to
measure
• whether respondents understand the questions
• Whether the questionnaire creates a positive
impression
• Whether questions are too long
• any bias on the part of the researcher
Pre-testing
• Pre-testing can be done by:
– Administering the questionnaire to a group
similar to your target population
– Asking colleagues to review it critically
– Simulating the actual data collection
procedure
FGDs
• Group is focused on a particular area of
interest
Focus Group Discussions (FGDs)
• This is a group discussion that gathers
together people from similar backgrounds or
experiences to discuss a specific topic of
interest to the researcher.
• The group of participants is guided by a
moderator (or group facilitator), who
introduces topics for discussion and helps the
group to participate in a lively and natural
discussion amongst themselves.
Data management
• Data cleaning;
– Ensure the data collected is consistent and
relevant
• Data entry;
– Data collection tools should have been coded well.
This enables easy data entry into a package for
analysis.
• Data presentation;
– You arrange the data into tally sheets, then into
frequency tables, bar graphs, pie charts etc.
Data handling
Involves
• Checking all forms for completeness
• Ensuring questionnaires are numbered
• Storing information/data securely
• Labelling record forms clearly
Data analysis
• This is separation & categorization of the raw
data obtained from the field into groups to
help us understand its meaning
• Done to:
– Summarise the data
– Make inferences about the data
Steps in data analysis
• Data cleaning
• Sorting/tallying
• Coding/entering the data
• Analysis of results
Data cleaning
• Ensure that questionnaires are complete
• Handle missing data: if a question is left unanswered
in the majority of questionnaires, exclude it from the
analysis
• Correct any mistakes made by the interviewers
after confirming with them
• Ensure the data collected is consistent and relevant
Coding & data entry
• Coding involves the conversion of data into
numerical codes which represent the
measured attributes
• After data is coded, entries can be made for
analysis e.g. using computer packages
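Coding can be illustrated with a short sketch. The codebook values (1, 2 and 9 for a missing answer) and the sample answers are hypothetical; a real survey would define its own codes before data collection.

```python
# Hypothetical codebook for one closed-ended question; the codes are illustrative.
SEX_CODES = {"male": 1, "female": 2}
MISSING = 9   # assumed code for a blank or unreadable answer

raw_answers = ["Male", "female", "", "FEMALE", "male"]

def code_answer(answer):
    """Convert a written answer to its numeric code."""
    return SEX_CODES.get(answer.strip().lower(), MISSING)

coded = [code_answer(a) for a in raw_answers]
print(coded)  # [1, 2, 9, 2, 1]
```

Giving missing answers their own code keeps them visible during data cleaning instead of silently dropping them.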
Data presentation
Data can be presented in:
• Frequency tables
• Histograms
• Frequency polygons
• Bar graphs
• Pie charts
• Maps
NOTE: read & make notes on each of these
Bias
Bias: Deviation of results or inferences from the truth, or
processes leading to such deviation
• Can occur in the design, implementation or analysis stages of
a study
• Occurs when there is an error in how the study is conducted,
e.g. asking the wrong questions, asking the right questions in
the wrong way, failing to get an appropriate control group, or
failing to achieve adequate follow-up.
• To reduce the role of bias in a study, we examine and try to
eliminate its potential sources or take it into account when
interpreting the results.
Study Designs April 2011 40

Bias cont’d
• Systematic error built into the study design:
– Selection bias: this occurs when the subjects studied
are not representative of the target population about
which conclusions are to be drawn.
– Information bias: results from differences in the quality of
information and errors in obtaining and classifying
information.

Types of selection bias
• Response bias: Those who agree to be in a study may
be in some way different from those who refuse to
participate.
– Volunteers may be different from those who are enlisted.

Types of selection bias cont’d
• Membership bias (healthy worker effect): observed in
occupational medicine, whereby there is a
disproportionately healthy group of workers in
particularly exposed workplaces.
• Migration bias: results from migration of diseased
subjects from an exposed status to an unexposed
status during the course of a study.

Types of information bias
• Interviewer bias: An interviewer’s knowledge may
influence the structure of questions and the manner of
presentation, which may influence responses.
• Recall bias: Those with a particular outcome or exposure
may remember events more clearly or amplify their
recollections.
• Diagnostic suspicion bias: when potentially exposed
subjects are subjected to more and in-depth diagnostic
procedures and tests (cohort studies?).

Types of information bias cont’d
• Observer bias: Observers may have
preconceived expectations of what they
should find in an examination.
• Loss to follow-up: Those that are lost to
follow-up or who withdraw from the study
may be different from those who are followed
for the entire study.

Types of information bias cont’d
• Hawthorne effect: An effect first documented at a
Hawthorne manufacturing plant; people act
differently if they know they are being watched.
• Surveillance bias: The group with the known
exposure or outcome may be followed more closely
or longer than the comparison group.
• Misclassification bias: Errors are made in classifying
either disease or exposure status.

Introduction to Biostatistics
What is statistics?
• Statistics is the summary of information (data) in
a meaningful fashion, and its appropriate
presentation.
• Statistics is the postulation of a plausible model
explaining the mechanism that generates the data,
with the ultimate goal to extrapolate and predict
data under circumstances beyond the current
experiment.
Inferential & Descriptive Statistics
• Two broad branches in statistics
Descriptive statistics
Once data has been collected, normally the step that follows is to
summarize the data, if possible, with one or two summary
statistics. Summary or descriptive statistics describe the original
data set (the set of responses for each question) by using just one
or two numbers – typically an average and a measure of
dispersion.
Inferential Statistics
This is the branch of statistics that makes use of sample data to make
generalizations concerning the population parameters. Here
theoretical distributions come in handy.
Analysis of Data
• Why Analyze Data?
The purpose of analyzing data is to obtain usable and
useful information. The analysis, irrespective of
whether the data is qualitative or quantitative, may:
– describe and summarize the data
– identify relationships between variables
– compare variables
– identify the difference between variables
– forecast outcomes.
Quantitative and Qualitative Data
Qualitative data: It arises from qualitative variables. A variable is
a characteristic of interest that varies from one item to the
other and may take any one of a specified set of values or
attributes. Qualitative variables generate non-numerical data.
The most common form of qualitative data is derived from
semi-structured or unstructured interviews, although other
sources can include observations, life histories and journals
and documents of all kinds including newspapers. Qualitative
data from interviews can be analyzed for content (content
analysis) or for the language used (discourse analysis).
Qualitative data is difficult to analyze and often opportunities
to achieve high marks are lost because the data is treated
casually and without rigor.
Quantitative data arises from quantitative variables. They
generate numerical data.
VARIABLES
• In order to understand the
determinants, distribution, frequency, etc. of
health and disease in a given population, an
epidemiologist must look for population
characteristics that are measurable.
• A measurable characteristic in a population
that assumes different values is termed a
variable.
• These may be in the form of numbers (e.g.,
age) or non-numerical characteristics (e.g.,
sex).
Types of variables
• TWO MAIN TYPES
• Variables that are expressed in terms of
numbers are called NUMERICAL VARIABLES
• Variables that are expressed in categories
are called CATEGORICAL VARIABLES
NUMERICAL VARIABLES
1. CONTINUOUS
• Continuous variables can take any value within a
range; in clinical research they usually take positive values
• With this type of data, one can develop more
and more accurate measurements depending
on the instrument used, e.g.:
– height in centimetres (2.5 cm or 2.546 cm or
2.543216 cm)
– temperature in degrees Celsius (37.2 °C or
37.19999 °C etc.)
• Continuous variables can be transformed into
categorical or binary variables in clinical research
studies.
2. DISCRETE
• These are variables that can only take
whole-number values, e.g.:
– number of visits to a clinic (0, 1, 2, 3, 4, etc.)
– number of sexual partners (0, 1, 2, 3, 4, 5, etc.)
– number of students in a class
CATEGORICAL
• Ordinal variables. These are grouped
variables that are ordered or ranked in
increasing or decreasing order:
• For example:
High income (above $300 per month);
Middle income ($100-$300 per month); and
Low income (less than $100 per month).
• Nominal variables. The groups in these
variables do not have an order or ranking.
For example:
• Sex: male, female
• Main food crops: maize, millet, rice, etc.
• Religion: Christian, Muslim, Hindu,
Buddhist, etc.
• BINARY (DICHOTOMOUS) VARIABLES
• Binary variables are often used as indicator
variables, which take the value of 1 if a specific
characteristic or disease is present, and 0 if it
is not.
CAUSE AND EFFECT
• When studying the relationship between
variables, e.g. establishing a determinant of a
disease or the relationship between a
disease (outcome) and an exposure, variables are
often labelled as dependent or independent.
• Dependent variable: a variable that is
influenced or changed by another variable
• Independent variable: a variable that is
manipulated or observed in order to determine
its effect or influence on another variable.
Confounding
• Confounding is an apparent association
between disease and exposure caused by a
third factor not taken into consideration.
• A confounder is a variable that is associated
with the exposure and, independent of that
exposure, is a risk factor for the disease.

Confounding cont’d
• To be a confounder, the extraneous variable must
have the following three characteristics:
a) Must be a risk factor for the disease.
b) Must be associated with the exposure under study
in the population from which the cases are derived.
c) Must not be an intermediate step in the causal
path between the exposure and the disease.

Methods to control for confounding
• 1) RESTRICTION- this means
removing all subjects in the study
‘with’ the confounding factor.
• This approach would “break” the link
between the confounding factor and
the outcome by restricting the study
to only subjects ‘without’ the
confounding factor
DISADVANTAGES OF RESTRICTION
• 1. LOSS OF STUDY POWER:- a researcher might
remove too many subjects and leave too few
remaining subjects to detect a statistically
significant association.
• 2. LOSS OF GENERALIZABILITY:- if restriction is
used to deal with many confounding
factors, a researcher may remove a substantial
proportion of the study population.
• Results from such a study may not readily
generalize to clinical practice where all types of
patients are encountered.
2) STRATIFICATION
STEP ONE
• An alternative strategy to excluding all
individuals ‘with’ the confounding factor is to
separate study subjects according to
their confounding status.
• The result would be two subpopulations,
or strata: one stratum consisting
exclusively of subjects ‘with’ the
confounding factor, and another consisting
exclusively of subjects ‘without’ it.
STEP 2
• The next step is to calculate the relative risk of
developing an outcome within each stratum
STEP 3
• Weight each stratum-specific relative risk by
the proportion of subjects in the particular
stratum and then combine the two weighted
risks
FORMULA
• Adjusted relative risk = (RELATIVE RISK 1 × WEIGHT 1) + (RELATIVE RISK 2 ×
WEIGHT 2)
• Weight: the proportion of study subjects in the stratum
• Relative risk (in detail in another lesson)
• The advantage of stratification, compared to
restriction, is that the full study population is
analyzed, preserving study power and
maintaining generalizability.
• The primary disadvantage of stratification is
the inability to deal with multiple confounding
factors simultaneously, because a separate
stratum must be created for each combination
of factors.
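The three stratification steps can be sketched as follows. All counts are invented for illustration, and the weighting is the simple proportion-of-subjects scheme described in this lesson, not a Mantel-Haenszel estimate.

```python
# Hypothetical counts per stratum: (cases, total) for exposed and unexposed subjects.
strata = {
    "with_confounder":    {"exposed": (20, 100), "unexposed": (10, 100)},
    "without_confounder": {"exposed": (45, 300), "unexposed": (30, 300)},
}

def relative_risk(exposed, unexposed):
    """Risk in the exposed divided by risk in the unexposed."""
    cases_e, total_e = exposed
    cases_u, total_u = unexposed
    return (cases_e / total_e) / (cases_u / total_u)

# Total number of subjects across all strata (used for the weights).
total_subjects = sum(s["exposed"][1] + s["unexposed"][1] for s in strata.values())

adjusted_rr = 0.0
for name, s in strata.items():
    rr = relative_risk(s["exposed"], s["unexposed"])              # STEP 2: stratum-specific RR
    weight = (s["exposed"][1] + s["unexposed"][1]) / total_subjects  # STEP 3: proportion of subjects
    adjusted_rr += rr * weight
    print(f"{name}: RR = {rr:.2f}, weight = {weight:.2f}")

print(f"Adjusted relative risk = {adjusted_rr:.3f}")
```

With these made-up numbers the stratum-specific relative risks are 2.0 and 1.5, the weights are 0.25 and 0.75, and the weighted sum is 1.625.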
3) MATCHING
• Matching involves identifying groups of
subjects within a study population who are
the same with respect to a confounder of
interest.
• Matching permits adjustment for multiple
confounding factors, provided that
appropriate control subjects can be identified.
• Advantages to matching are that it is an
intuitive process and that it can address
several confounders simultaneously
DISADVANTAGES
• The matching procedure must be specified as
part of the initial study design (once the cohort
has been created it is not easy to go back).
• It can be difficult to find suitable matches for
multiple confounding factors.
• Factors selected to be matching variables can
no longer be evaluated as disease risk factors.
4) REGRESSION
• Regression is a mathematical model that can
estimate the independent association between
many exposure variables and an outcome
variable.
• Regression utilizes all of the study data and can
account for multiple confounders
simultaneously.
• (THIS WILL BE DISCUSSED IN DETAIL LATER
DURING THIS COURSE)
5) RANDOMIZATION
• Randomization has the important advantage
of balancing both measured and unmeasured
characteristics.
• This removes uncertainty as to whether the
observed associations might be confounded
by factors that were not measured in the
study.
• REFER TO YOUR PREVIOUS NOTES
(EXPERIMENTAL DESIGNS).
MEASUREMENT SCALES
• How is data measured?
- There are four scales of measurement that are used in
statistics.
1. A nominal scale: a nominal scale is simply a system of
assigning number symbols to events in order to label
them.
Nominal scales provide a convenient way of keeping
track of events and people.
Nominal data are thus counted data, and therefore the
mode is the only measure of central
tendency that can be used in analysis.
2. An ordinal scale: has values that can be
ranked but are not necessarily evenly spaced,
such as stage of cancer.
• For example, a student’s rank in a class involves
the use of an ordinal scale. Data is only
ordered from the highest to the lowest, and
therefore the appropriate measure of central
tendency is the median. Measures
of dispersion that can be applied are
percentiles and quintiles.
3. An interval scale: unlike the ordinal scale, the interval
scale has equally spaced units;
in other words, there is some rule that has been
established as a basis for making the units equal.
The Fahrenheit scale is an example of an interval scale. In
this case one can say that an increase in temperature
from 10 to 20 degrees Fahrenheit involves the
same increase in temperature as from 60 to 70
degrees Fahrenheit. The interval scale, however,
does not have a true zero point. The mean is the
appropriate measure of central tendency, while
the standard deviation is the widely used measure of
dispersion.
4. A ratio scale: is an interval scale with a
true zero point, such as height in centimeters.
Generally all statistical techniques are usable
with a ratio scale.
Organizing Data for interpretation
• Whether you are conducting routine
surveillance, investigating an outbreak, or
conducting a study, you must first compile
information in an organized manner.
• There are several ways in which data can be
organized for easier interpretation
line list or line listing
• The line listing is one type of epidemiologic
database, and is organized like a spreadsheet
with rows and columns.
• Typically, each row is called a record or
observation and represents one person or
case of disease. Each column is called a
variable and contains information about one
characteristic of the individual, such as race or
date of birth.
Frequency Distributions
• Look again at the data in Table 2.1.
• How many of the cases (or case-patients) are
male?
• This may be easy to do, but if the database is
large, interpretation becomes difficult.
A frequency distribution then becomes more
convenient.
• A frequency distribution is thus a table that
summarizes a variable.
• It displays the values a variable can take and the
number of persons or records with each value
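Building a frequency distribution from a line list can be sketched as below. The records are invented for illustration and are not the data from the lesson's table.

```python
from collections import Counter

# Hypothetical line list: one record (row) per case-patient; all values are invented.
line_list = [
    {"id": 1, "sex": "M", "age": 34},
    {"id": 2, "sex": "F", "age": 27},
    {"id": 3, "sex": "M", "age": 45},
    {"id": 4, "sex": "M", "age": 19},
    {"id": 5, "sex": "F", "age": 52},
    {"id": 6, "sex": "M", "age": 61},
    {"id": 7, "sex": "F", "age": 38},
    {"id": 8, "sex": "M", "age": 23},
]

# Count how many records take each value of the variable "sex".
sex_counts = Counter(record["sex"] for record in line_list)

print("Sex\tNumber")
for value, count in sorted(sex_counts.items()):
    print(f"{value}\t{count}")
print(f"Total\t{sum(sex_counts.values())}")
```

The same tally works for any column of the line list, for example an age group derived from the age variable.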
GRAPHS
• Data in a frequency distribution table can be
graphed.
• There are several types of graphs that are put
to use in the display/interpretation of data.
• Figure 2.1 in the next slide shows an
outbreak of salmonella by dates
HISTOGRAM
What can we see in the graph?
1. Central location of the data (where the
distribution has its peak)
• Note that the data in Figure 2.1 seem to cluster
around a central value, with progressively fewer
persons on either side of this central value.
• This type of symmetric distribution
is the classic bell-shaped curve,
also known as a normal distribution.
• Three measures of central location are
commonly used in epidemiology: arithmetic
mean, median, and mode.
2. The spread (how widely dispersed the data are
from the highest peak)
• Two measures of spread commonly used in
epidemiology are range and standard
deviation.
3. Shape (how data is distributed on both sides
of the peak). The shape can be symmetrical or
asymmetrical.
• A distribution that has a central location to the
right and a tail to the left is said to be
negatively skewed or skewed to the left.
• A distribution that has a central location to the
left and a tail off to the right is said to be
positively skewed or skewed to the right.
Skewed to the right
[Figure: histogram, "Number of Music CDs of Spring 1998 Stat 250 Students"; x-axis: Number of Music CDs (0 to 400); y-axis: Frequency]
Skewed to the left
[Figure: histogram; x-axis: grades (50 to 100); y-axis: Percent]
MEASURES OF CENTRAL LOCATION
• Mode
• The mode is the value that occurs most often
in a set of data.
• It can be determined simply by tallying the
number of times each value occurs.
• Consider, for example, the number of doses
of diphtheria-pertussis-tetanus (DPT) vaccine
each of seventeen 2-year-old children in a
particular village received: 0, 0, 1, 1, 2, 2, 2, 3,
3, 3, 3, 3, 3, 4, 4, 4, 4
• The mode is 3, because more children (six) received
3 doses than any other number of doses.
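The tallying approach for the DPT example can be sketched in a few lines. The variable names are illustrative only.

```python
from collections import Counter

# Number of DPT doses received by each of the seventeen children in the example.
doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]

# Tally how many times each value occurs.
tally = Counter(doses)
for value in sorted(tally):
    print(f"{value} doses: {tally[value]} children")

# The mode is the value with the highest tally.
mode_value, mode_count = tally.most_common(1)[0]
print(mode_value, mode_count)  # 3 6
```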
Median
• The median is the middle value of a set of data
that has been put into rank order.
• Suppose you had the following ages in years
for patients with a particular illness: 4, 23, 28,
31, 32
• The median age is 28 years, because it is the
middle value, with two values smaller than 28
and two values larger than 28.
• The median is also the 50th percentile of the
distribution.
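A short sketch of the median calculation for the ages above. The second list, with a sixth patient aged 40, is a hypothetical addition to show the usual convention for an even number of values (the average of the two middle values).

```python
import statistics

ages = [4, 23, 28, 31, 32]
median_age = statistics.median(ages)        # sorts the data and picks the middle value
print(median_age)                           # 28

# Hypothetical extra patient aged 40: with six values there is no single middle value,
# so the median is the average of the two middle values, 28 and 31.
ages_even = ages + [40]
median_even = statistics.median(ages_even)
print(median_even)                          # 29.5
```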
ARITHMETIC MEAN
• The arithmetic mean is a more technical name
for what is more commonly called the mean
or average. The arithmetic mean is the value
that is closest to all the other values in a
distribution.
Method for calculating the mean
• Step 1. Add all of the observed values in the
distribution.
• Step 2. Divide the sum by the number of
observations.
Mean-Example
Formula: $\bar{X} = \frac{\sum X_i}{n}$
Consider the data set 4, 23, 28, 31, 32
mean = (4 + 23 + 28 + 31 + 32) / 5
mean = 118 / 5
Mean = 23.6
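The two steps for calculating the mean can be checked with a short sketch using the same five ages:

```python
ages = [4, 23, 28, 31, 32]

total = sum(ages)        # Step 1: add all of the observed values
n = len(ages)            # number of observations
mean_age = total / n     # Step 2: divide the sum by the number of observations

print(total, n, mean_age)  # 118 5 23.6
```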
MEASURES OF SPREAD
1. RANGE
• The range of a set of data is the difference
between its largest (maximum) value and its
smallest (minimum) value.
• In the epidemiologic community, the range is
usually reported as “from (the minimum) to
(the maximum),” that is, as two numbers
rather than one.
R = largest observation − smallest observation
2. Percentiles
• Percentiles divide the data in a distribution
into 100 equal parts.
• The Pth percentile (P ranging from 0 to 100) is
the value that has P percent of the
observations falling at or below it.
• In other words, the 90th percentile has 90%
of the observations at or below it.
3. Quartiles
• Sometimes, epidemiologists group data into four
equal parts, or quartiles.
• Each quartile includes 25% of the data. The cut-off
for the first quartile is the 25th percentile.
• The cut-off for the second quartile is the 50th percentile
• The cut-off for the third quartile is the 75th percentile
• The cut-off for the fourth quartile is the 100th percentile
• The interquartile range (the difference between the 75th and
25th percentiles) is generally used in
conjunction with the median.
• Together they are useful for characterizing the central
location and spread of any frequency distribution
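The range, quartile cut-offs and interquartile range can be sketched with Python's standard library, reusing the five ages from the median example. Quartile cut-offs depend on the interpolation convention; the `method="inclusive"` option used here is one choice among several, so other software may report slightly different values.

```python
import statistics

ages = [4, 23, 28, 31, 32]

# Range: largest observation minus smallest observation.
data_range = max(ages) - min(ages)

# Cut-offs for the first, second and third quartiles
# (25th, 50th and 75th percentiles), using the inclusive convention.
cut_offs = statistics.quantiles(ages, n=4, method="inclusive")
q1, q2, q3 = cut_offs

# Interquartile range: distance between the 75th and 25th percentiles.
iqr = q3 - q1

print(data_range)   # 28
print(cut_offs)     # [23.0, 28.0, 31.0]
print(iqr)          # 8.0
```

The second cut-off (28) is the median, which matches the earlier worked example.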
STANDARD DEVIATION
• It is a measure of how spread out the data are
around the mean,
• so we only use the standard deviation as the
measure of spread in conjunction with the
mean as the measure of location.
FORMULA FOR SD
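One common form, assuming the sample standard deviation with an n − 1 denominator, is $s = \sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}$. The sketch below applies it step by step to the five ages used in the mean example and compares the result with Python's `statistics.stdev`, which uses the same definition.

```python
import math
import statistics

ages = [4, 23, 28, 31, 32]
n = len(ages)
mean_age = sum(ages) / n                                  # 23.6

# Squared distance of each observation from the mean.
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Sample variance uses n - 1 in the denominator (an assumption about which form is intended).
variance = sum(squared_deviations) / (n - 1)
sd_manual = math.sqrt(variance)

sd_library = statistics.stdev(ages)                       # same definition, for comparison
print(round(sd_manual, 2), round(sd_library, 2))          # 11.5 11.5
```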
NORMAL DISTRIBUTION
• Consider the normal curve illustrated in Figure 2.9.
The mean is at the center, and data are equally
distributed on either side of this mean. The points
that show ±1, 2, and 3 standard deviations are
marked on the x axis. For normally distributed data,
approximately two-thirds (68.3%, to be exact) of the
data fall within one standard deviation of either side
of the mean; 95.5% of the data fall within two
standard deviations of the mean; and 99.7% of the
data fall within three standard deviations.
• Exactly 95.0% of the data fall within 1.96 standard
deviations of the mean.
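These percentages can be checked numerically. For a normal distribution, the share of data within k standard deviations of the mean equals erf(k/√2); the helper name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.96):
    print(f"within {k} SD: {share_within(k) * 100:.1f}%")
```

The output reproduces the figures quoted above: 68.3%, 95.4% (95.45% before rounding, hence the 95.5% on the slide), 99.7%, and 95.0% for 1.96 standard deviations.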
STANDARD ERROR
• The standard error of the mean refers to
variability we might expect in the arithmetic
means of repeated samples taken from the
same population.
• The standard error quantifies the variation in
those sample means.
• The standard deviation describes variability in
a set of data.
• Standard error is important in calculating the
confidence interval/limit
Confidence interval/limit
• The confidence level is the expected
percentage of times that the actual value will
fall within the stated precision limits; the confidence
interval is the range between those limits.
• Thus if we take a confidence level of 95% then
we mean that there are 95 chances in 100 (or
0.95 in 1) that the sample results represent the
true condition of the population within a
specified precision range, against 5 chances in
100 that they do not.
• There is a 95 per cent chance that we choose a
sample whose mean lies in the interval µ −
1.96 σ/√n to µ + 1.96 σ/√n
• In order to calculate this CI, we first need the
sample mean of a random sample, the sample
size n, and the population standard deviation
σ.
• We know the first two after collecting
information about the members of the
sample, but we do not know the standard
deviation of the population. With a large sample, the
sample standard deviation is used as an estimate of σ.
EXAMPLE
• The following are summary statistics from a
study that was done to estimate the mean
weight for males in Nakuru County.
• Sample size = 200 people
• Mean weight = 69.7 kg
• Standard deviation = 15 kg
Using this data, calculate a 95 per cent CI for the
mean weight of the population (10 000
people)
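One way to work this example is sketched below. It assumes the sample standard deviation (15 kg) stands in for σ, uses the 1.96 multiplier from the normal distribution, and ignores any finite-population correction for the 10 000 people.

```python
import math

n = 200              # sample size
mean_weight = 69.7   # sample mean, kg
sd = 15.0            # sample standard deviation, kg (used as an estimate of sigma)
z = 1.96             # multiplier for 95% confidence

standard_error = sd / math.sqrt(n)   # standard error of the mean
margin = z * standard_error          # half-width of the interval

lower = mean_weight - margin
upper = mean_weight + margin

print(f"SE = {standard_error:.2f} kg")              # SE = 1.06 kg
print(f"95% CI: {lower:.2f} kg to {upper:.2f} kg")  # 95% CI: 67.62 kg to 71.78 kg
```

Under these assumptions the interval is roughly 67.6 kg to 71.8 kg.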
• NEXT EXAMPLE IF YOU HAVE S.E
ESTIMATING SAMPLE SIZE FOR THE MEAN
• The 95 per cent CI allows us to quantify the
precision of a population estimate measured
in a sample.
• This means that it is possible to calculate the sample size
required to estimate a population value (mean
or other characteristic) to a desired level of
accuracy.
In a study of the effects of smoking on birth weight,
it was required to estimate the mean birth weight
of baby girls to within ±100 g. Previous studies have
shown that the birth weights of baby girls have a
standard deviation of about 520 g. Calculate the
sample size required to estimate the mean to
within ±100 g
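A possible working, assuming 95 per cent confidence and the usual rearrangement of the CI half-width d = 1.96 σ/√n into n = (1.96 σ / d)² (the lesson does not state the formula explicitly, so this form is an assumption):

```python
import math

sigma = 520.0   # standard deviation of birth weights from previous studies, g
d = 100.0       # required precision: mean to within +/- 100 g
z = 1.96        # multiplier for 95% confidence (assumed level)

n_exact = (z * sigma / d) ** 2      # about 103.9
n_required = math.ceil(n_exact)     # always round up to a whole subject

print(round(n_exact, 1), n_required)  # 103.9 104
```

Rounding up gives a required sample of 104 baby girls.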
SAMPLE SIZE FOR A PROPORTION
• Suppose now that we want to estimate the
proportion P of a population with some
characteristic, such as a disease;
• that is, we want to estimate the disease
prevalence in the population.
• The estimate of P is the proportion of the
sample with the characteristic.
Calculate the sample size required to estimate the
percentage of smokers in some population to within ±4
per cent of the true value, if the population percentage
of smokers is 30 per cent.
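A possible working, assuming 95 per cent confidence and the standard formula n = 1.96² × P(1 − P) / d², where d is the required precision expressed as a proportion (the lesson does not state the formula, so this form is an assumption):

```python
import math

p = 0.30    # anticipated proportion of smokers
d = 0.04    # required precision: within +/- 4 percentage points
z = 1.96    # multiplier for 95% confidence (assumed level)

n_exact = z ** 2 * p * (1 - p) / d ** 2   # about 504.2
n_required = math.ceil(n_exact)           # round up to a whole person

print(round(n_exact, 2), n_required)      # 504.21 505
```

Rounding up gives a required sample of 505 people.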
• END
