Section 1: Organizing Data: Line List Line Listing
Whether you are conducting routine surveillance, investigating an outbreak, or conducting a study,
you must first compile information in an organized manner. One common method is to create a line
list or line listing. Table 2.1 is a typical line listing from an epidemiologic investigation of an
apparent cluster of hepatitis A.
A variable can be any characteristic that differs from person to person, such as height, sex,
smallpox vaccination status, or physical activity pattern. The value of a variable is the number or
descriptor that applies to a particular person, such as 5'6" (168 cm), female, and never vaccinated.
The line listing is one type of epidemiologic database, and is organized like a spreadsheet with rows
and columns. Typically, each row is called a record or observation and represents one person or
case of disease. Each column is called a variable and contains information about one characteristic
of the individual, such as race or date of birth. The first column or variable of an epidemiologic
database usually contains the person's name, initials, or identification number. Other columns might
contain demographic information, clinical details, and exposures possibly related to illness.
Table 2.1 Line Listing of Hepatitis A Cases, County Health Department, January–February 2004
ID  Date of Diagnosis  Town  Age (Years)  Sex  Hosp  Jaundice  Outbreak  IV Drugs  IgM Pos  Highest ALT*
01 01/05 B 74 M Y N N N Y 232
02 01/06 J 29 M N Y N Y Y 285
03 01/08 K 37 M Y Y N N Y 3250
04 01/19 J 3 F N N N N Y 1100
05 01/30 C 39 M N Y N N Y 4146
06 02/02 D 23 M Y Y N Y Y 1271
07 02/03 F 19 M Y Y N N Y 300
08 02/05 I 44 M N Y N N Y 766
09 02/19 G 28 M Y N N Y Y 23
10 02/22 E 29 F N Y Y N Y 543
11 02/23 A 21 F Y Y Y N Y 1897
12 02/24 H 43 M N Y Y N Y 1220
13 02/26 B 49 F N N N N Y 644
14 02/26 H 42 F N N Y N Y 2581
15 02/27 E 59 F Y Y Y N Y 2892
16 02/27 E 18 M Y N Y N Y 814
17 02/27 A 19 M N Y Y N Y 2812
18 02/28 E 63 F Y Y Y N Y 4218
19 02/28 E 61 F Y Y Y N Y 3410
20 02/29 A 40 M N Y Y N Y 4297
* ALT = Alanine aminotransferase
Some epidemiologic databases, such as line listings for a small cluster of disease, may have only a
few rows (records) and a limited number of columns (variables). Such small line listings are
sometimes maintained by hand on a single sheet of paper. Other databases, such as birth or death
records for the entire country, might have thousands of records and hundreds of variables and are
best handled with a computer. However, even when records are computerized, a line listing with key
variables is often printed to facilitate review of the data.
One computer software package that is widely used by epidemiologists to manage data is Epi Info, a
free package developed at CDC. Epi Info allows the user to design a questionnaire, enter data right
into the questionnaire, edit the data, and analyze the data. Two versions are available:
Epi Info 3 (formerly Epi Info 2000 or Epi Info 2002) is Windows-based, and continues to be
supported and upgraded. It is the recommended version and can be downloaded from the CDC
website: http://www.cdc.gov/epiinfo/downloads.htm.
This lesson includes Epi Info commands for creating frequency distributions and calculating some of
the measures of central location and spread described in the lesson. Since Epi Info 3 is the
recommended version, only commands for this version are provided in the text; corresponding
commands for Epi Info 6 are offered at the end of the lesson.
Notice that for certain variables in Table 2.1, the values are numeric; for others, the values
are descriptive. The type of values influences the way in which the variables can be summarized.
Variables can be classified into one of four types, depending on the type of scale used to
characterize their values (Table 2.2).
Table 2.2 Types of Variables
Scale     Example               Values
"categorical" or "qualitative"
Nominal   disease status        yes / no
Ordinal   ovarian cancer        Stage I, II, III, or IV
"continuous" or "quantitative"
Interval  date of birth         any date from recorded time to current
Ratio     tuberculin skin test  0 – ??? of induration
A nominal-scale variable is one whose values are categories without any numerical ranking,
such as county of residence. In epidemiology, nominal variables with only two categories are
very common: alive or dead, ill or well, vaccinated or unvaccinated, or did or did not eat the
potato salad. A nominal variable with two mutually exclusive categories is sometimes called a
dichotomous variable.
An ordinal-scale variable has values that can be ranked but are not necessarily evenly spaced,
such as stage of cancer (see Table 2.3).
An interval-scale variable is measured on a scale of equally spaced units, but without a true
zero point, such as date of birth.
A ratio-scale variable is an interval variable with a true zero point, such as height in centimeters
or duration of illness.
Nominal- and ordinal-scale variables are considered qualitative or categorical variables, whereas
interval- and ratio-scale variables are considered quantitative or continuous variables. Sometimes
the same variable can be measured using both a nominal scale and a ratio scale. For example, the
tuberculin skin tests of a group of persons potentially exposed to a co-worker with tuberculosis can
be measured as "positive" or "negative" (nominal scale) or in millimeters of induration (ratio scale).
When a database contains only a limited number of records, you can easily pick out the information you need
directly from the raw data. By scanning the 5th column, you can see that 12 of the 20 case-patients are male.
With larger databases, however, picking out the desired information at a glance becomes increasingly difficult.
To facilitate the task, the variables can be summarized into tables called frequency distributions.
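As a minimal sketch of the tally described above, the Sex column of Table 2.1 can be counted in Python. The variable names are illustrative assumptions; the values are transcribed from the table in ID order (01 through 20).

```python
from collections import Counter

# Sex column (5th column) of Table 2.1, listed in ID order 01 through 20.
sex = ["M", "M", "M", "F", "M", "M", "M", "M", "M", "F",
       "F", "M", "F", "F", "F", "M", "M", "F", "F", "M"]

# Counter tallies how many records carry each value of the variable.
frequency = Counter(sex)

# Print a small frequency distribution with a total row.
for value, count in sorted(frequency.items()):
    print(f"{value}: {count}")
print(f"Total: {sum(frequency.values())}")
```

The same approach scales to databases too large to scan by eye, which is the situation in which a frequency distribution becomes most useful.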
A frequency distribution displays the values a variable can take and the number of persons or records with
each value. For example, suppose you have data from a study of women with ovarian cancer and wish to look
at parity, that is, the number of times each woman has given birth. To construct a frequency distribution that
displays these data:
First, list all the values that the variable parity can take, from the lowest possible value to the highest.
Then, for each value, record the number of women who had that number of births (twins and other multiple-
birth pregnancies count only once).
Table 2.4 displays what the resulting frequency distribution would look like. Notice that the frequency
distribution includes all values of parity between the lowest and highest observed, even though there were no
women for some values. Notice also that each column is clearly labeled, and that the total is given in the
bottom row.
Table 2.4 Distribution of Case-Subjects by Parity (Ratio-Scale Variable), Ovarian Cancer Study, CDC
Parity Number of Cases
0 45
1 25
2 43
3 32
4 22
5 8
6 2
7 0
8 1
9 0
10 1
Total 179
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.
Table 2.4 displays the frequency distribution for a continuous variable. Continuous variables are often further
summarized with measures of central location and measures of spread. Distributions for ordinal and nominal
variables are illustrated in Tables 2.5 and 2.6, respectively. Categorical variables are usually further
summarized as ratios, proportions, and rates (discussed in Lesson 3).
Table 2.5 Distribution of Cases by Stage of Disease (Ordinal-Scale Variable), Ovarian Cancer Study,
CDC
Cases
Stage Number Percent
I 45 20
II 11 5
III 104 58
IV 30 17
Total 179 100
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.
Table 2.6 Distribution of Cases by Enrollment Site (Nominal-Scale Variable), Ovarian Cancer Study,
CDC
Cases
Enrollment
Site Number Percent
Atlanta 18 10
Connecticut 39 22
Detroit 35 20
Iowa 30 17
New Mexico 7 4
San Francisco 33 18
Seattle 9 5
Utah 8 4
Total 179 100
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.
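The Percent column of Table 2.6 can be reproduced from the counts. The sketch below assumes rounding to the nearest whole percent; the table does not state its rounding rule, but this assumption matches the published values. The dictionary and variable names are illustrative.

```python
# Counts by enrollment site, copied from Table 2.6.
cases_by_site = {
    "Atlanta": 18, "Connecticut": 39, "Detroit": 35, "Iowa": 30,
    "New Mexico": 7, "San Francisco": 33, "Seattle": 9, "Utah": 8,
}

total = sum(cases_by_site.values())

# Percent of all cases at each site, rounded to the nearest whole percent
# (the rounding rule is an assumption, not stated in the table).
percent_by_site = {
    site: round(100 * count / total) for site, count in cases_by_site.items()
}

for site, count in cases_by_site.items():
    print(f"{site:<14}{count:>5}{percent_by_site[site]:>5}")
print(f"{'Total':<14}{total:>5}")
```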
The data in a frequency distribution can be graphed. We call this type of graph a histogram. Figure 2.1 is a
graph of the number of outbreak-related salmonellosis cases by date of illness onset.
Figure 2.1 Number of Outbreak-Related Salmonellosis Cases by Date of Onset of Illness — United
States, June–July 2004
Source: Centers for Disease Control and Prevention. Outbreaks of Salmonella infections associated with eating Roma tomatoes–United States and
Central location
Note that the data in Figure 2.1 seem to cluster around a central value, with progressively fewer persons on
either side of this central value. This type of symmetric distribution, as illustrated in Figure 2.2, is the classic
bell-shaped curve — also known as a normal distribution. The clustering at a particular value is known as
the central location or central tendency of a frequency distribution. The central location of a distribution is
one of its most important properties. Sometimes it is cited as a single value that summarizes the entire
distribution. Figure 2.3 illustrates the graphs of three frequency distributions identical in shape but with different
central locations.
Depending on the shape of the frequency distribution, all measures of central location can be identical or
different. Additionally, measures of central location can be in the middle or off to one side or the other.
Spread
A second property of a frequency distribution is spread (also called variation or dispersion). Spread refers to
how widely the values are distributed around a central value. Two measures of spread commonly used in epidemiology
are range and standard deviation. For most distributions seen in epidemiology, the spread of a frequency
distribution is independent of its central location. Figure 2.4 illustrates three theoretical frequency distributions
that have the same central location but different amounts of spread. Measures of spread will be discussed later
in this lesson.
Figure 2.4 Three Distributions with Same Central Location but Different Spreads
Shape
Skewness refers to the tail, not the hump. So a distribution that is skewed to the left has a long left tail.
A third property of a frequency distribution is its shape. The graphs of the three theoretical frequency
distributions in Figure 2.4 were completely symmetrical. Frequency distributions of some characteristics of
human populations tend to be symmetrical. On the other hand, the data on parity in Figure 2.5
are asymmetrical or, as such distributions are more commonly described, skewed.
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.
A distribution that has a central location to the left and a tail off to the right is said to be positively
skewed or skewed to the right. In Figure 2.6, distribution A is skewed to the right. A distribution that has a
central location to the right and a tail to the left is said to be negatively skewed or skewed to the left. In
Figure 2.6, distribution C is skewed to the left.
Question: How would you describe the parity data in Figure 2.5?
Answer: Figure 2.5 is skewed to the right. Skewing to the right is common in distributions that begin with zero,
such as number of servings consumed, number of sexual partners in the past month, and number of hours
spent in vigorous exercise in the past week.
One distribution deserves special mention — the Normal or Gaussian distribution. This is the classic
symmetrical bell-shaped curve like the one shown in Figure 2.2. It is defined by a mathematical equation and is
very important in statistics. Not only do the mean, median, and mode coincide at the central peak, but the area
under the curve helps determine measures of spread such as the standard deviation and confidence interval
covered later in this lesson.
Nominal yes no no
Ordinal yes no no
Measure of central location: a single, usually central, value that best represents an entire distribution of data.
Measures of central location include the mode, median, arithmetic mean, midrange, and geometric mean.
Selecting the best measure to use for a given distribution depends largely on two factors: the shape of the
distribution and the intended use of the measure.
Each measure — what it is, how to calculate it, and when best to use it — is described in this section.
Mode
Definition of mode
The mode is the value that occurs most often in a set of data. It can be determined simply by tallying the
number of times each value occurs. Consider, for example, the number of doses of diphtheria-pertussis-tetanus
(DPT) vaccine each of seventeen 2-year-old children in a particular village received:
0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4
Two children received no doses; two children received 1 dose; three received 2 doses; six received 3 doses;
and four received all 4 doses. Therefore, the mode is 3 doses, because more children received 3 doses than
any other number of doses.
Step 1. Arrange the observations into a frequency distribution, indicating the values of the variable and
the frequency with which each value occurs. (Alternatively, for a data set with only a few values,
arrange the actual values in ascending order, as was done with the DPT vaccine doses above.)
Step 2. Identify the value that occurs most often.
Example A: Table 2.8 (below) provides data from 30 patients who were hospitalized and received
antibiotics. For the variable “length of stay” (LOS) in the hospital, identify the mode.
LOS Frequency
0 1
1 0
2 1
3 1
4 1
5 2
6 1
7 1
8 1
9 3
10 5
11 1
12 3
13 1
14 1
15 0
16 1
17 0
18 2
19 1
20 0
21 0
22 1
. 0
. 0
27 1
. 0
. 0
49 1
Most values appear once, but the distribution includes two 5s, three 9s, five 10s, three 12s, and two
18s.
Because 10 appears most frequently, the mode is 10.
Example B: Find the mode of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22
days.
None
Note: When no value occurs more than once, the distribution is said to have no mode.
Example C: Find the mode of the following incubation periods for Bacillus cereus food poisoning:
2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 11, 11, 12, 12, 12, 12, 12, 14, 14, 15, 17, 18, 20, 21 hours
The values 3 and 12 each occur five times, more often than any other value, so the modes are 3 and 12.
Example C illustrates the fact that a frequency distribution can have more than one mode. When it has two,
the distribution is said to be bimodal. Indeed, Bacillus cereus is known to cause two syndromes with different
incubation periods: a short-incubation-period (1–6 hours) syndrome characterized by vomiting; and a
long-incubation-period (6–24 hours) syndrome characterized by diarrhea.
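The three situations described above (one mode, no mode, and more than one mode) can be sketched in Python. The function name find_modes is a hypothetical helper, not part of any package; it treats a data set in which no value repeats as having no mode, as the text does.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No value occurs more than once: the distribution has no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)

# DPT doses received by seventeen 2-year-old children.
dpt_doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
# Hepatitis A incubation periods, in days.
hepatitis_a_days = [27, 31, 15, 30, 22]
# Bacillus cereus incubation periods, in hours.
b_cereus_hours = [2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 11, 11, 12, 12, 12,
                  12, 12, 14, 14, 15, 17, 18, 20, 21]

print(find_modes(dpt_doses))
print(find_modes(hepatitis_a_days))
print(find_modes(b_cereus_hours))
```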
Table 2.8 Sample Data from the Northeast Consortium Vancomycin Quality Improvement Project
ID Admission Date Discharge Date LOS DOB (mm/dd) DOB (year) Age Sex ESRD
Epi Info does not have a Mode command. Thus, the best way to identify the mode is to create a histogram and
look for the tallest column(s).
NOTE: The Means command provides a mode, but only the lowest value if a distribution has more than one
mode.
The mode is the preferred measure of central location for addressing which value is the most popular or
the most common. For example, the mode is used to describe which day of the week people most prefer to
come to the influenza vaccination clinic, or the “typical” number of doses of DPT the children in a particular
community have received by their second birthday.
As demonstrated, a distribution can have a single mode. However, a distribution has more than one mode
if two or more values tie as the most frequent values. It has no mode if no value appears more than once.
The mode is used almost exclusively as a “descriptive” measure. It is almost never used in statistical
manipulations or analyses.
The mode is not typically affected by one or two extreme values (outliers).
Exercise 2.3
Using the same vaccination data as in Exercise 2.2, find the mode. (If you answered Exercise 2.2, find the
mode from your frequency distribution.)
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
Median
Definition of median
The median is the middle value of a set of data that has been put into rank order. Similar to the median on a
highway that divides the road in two, the statistical median is the value that divides the data into two halves,
with one half of the observations being smaller than the median value and the other half being larger. The
median is also the 50th percentile of the distribution. Suppose you had the following ages in years for patients
with a particular illness:
The median age is 28 years, because it is the middle value, with two values smaller than 28 and two values
larger than 28.
Step 1. Arrange the observations in increasing or decreasing order.
Step 2. Find the middle position of the distribution by using the following formula:
Middle position = (n + 1) / 2
a. If the number of observations (n) is odd, the middle position falls on a single observation.
b. If the number of observations is even, the middle position falls between two observations.
Step 3. Identify the value at the middle position.
a. If the number of observations (n) is odd and the middle position falls on a single observation, the median
equals the value of that observation.
b. If the number of observations is even and the middle position falls between two observations, the median
equals the average of the two values.
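The middle-position rule above can be sketched in Python. The function name median is an illustrative choice; the odd-count examples use values quoted elsewhere in this lesson, and the four-value list is a hypothetical example (the hepatitis A incubation periods with the largest value dropped) included only to show the even case.

```python
def median(values):
    """Median by the middle-position rule: position = (n + 1) / 2, counted from 1."""
    ordered = sorted(values)
    n = len(ordered)
    middle_position = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: the middle position falls on a single observation.
        return ordered[int(middle_position) - 1]
    # Even n: the middle position falls between two observations,
    # so average the observations on either side of it.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([27, 31, 15, 30, 22]))   # hepatitis A incubation periods, odd n
print(median([4, 23, 28, 31, 131]))   # ages with one extreme value, odd n
print(median([27, 15, 30, 22]))       # hypothetical even-n example
```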
Properties and uses of the median
The median is a good descriptive measure, particularly for data that are skewed, because it is the central
point of the distribution.
The median is relatively easy to identify. It is equal to either a single observed value (if odd number of
observations) or the average of two observed values (if even number of observations).
The median, like the mode, is not generally affected by one or two extreme values (outliers). For example,
if the values on the previous page had been 4, 23, 28, 31, and 131 (instead of 31), the median would still
be 28.
The median has less-than-ideal statistical properties. Therefore, it is not often used in statistical
manipulations and analyses.
Exercise 2.4
Determine the median for the same vaccination data used in Exercises 2.2 and 2.3.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
Arithmetic mean
Definition of mean
The arithmetic mean is a more technical name for what is more commonly called the mean or average. The
arithmetic mean is the value that is closest to all the other values in a distribution.
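For example, for the hepatitis A incubation periods of 27, 31, 15, 30, and 22 days, sum the values and divide by
the number of observations:
27 + 31 + 15 + 30 + 22 = 125
125 / 5 = 25.0
Therefore, the mean incubation period is 25.0 days.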
The mean has excellent statistical properties and is commonly used in additional statistical manipulations
and analyses. One such property is called the centering property of the mean. When the mean is
subtracted from each observation in the data set, the sum of these differences is zero (i.e., the negative
sum is equal to the positive sum). For the data in the previous hepatitis A example:
15 − 25.0 = −10.0
22 − 25.0 = −3.0
27 − 25.0 = +2.0
30 − 25.0 = +5.0
31 − 25.0 = +6.0
Sum of differences = −13.0 + 13.0 = 0
This demonstrates that the mean is the arithmetic center of the distribution.
Because of this centering property, the mean is sometimes called the center of gravity of a frequency
distribution. If the frequency distribution is plotted on a graph, and the graph is balanced on a fulcrum, the
point at which the distribution would balance would be the mean.
The arithmetic mean is the best descriptive measure for data that are normally distributed.
On the other hand, the mean is not the measure of choice for data that are severely skewed or have
extreme values in one direction or another. Because the arithmetic mean uses all of the observations in the
distribution, it is affected by any extreme value. Suppose that the last value in the previous distribution was
131 instead of 31. The mean would be 225 / 5 = 45.0 rather than 25.0. As a result of one extremely large
value, the mean is much larger than all values in the distribution except the extreme value (the “outlier”).
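The mean, its centering property, and its sensitivity to an extreme value can all be checked with a few lines of Python. This is a minimal sketch using the hepatitis A incubation periods from the text; the function and variable names are illustrative.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    return sum(values) / len(values)

incubation_days = [27, 31, 15, 30, 22]
mean_days = mean(incubation_days)

# Centering property: the differences from the mean sum to zero.
deviations = [day - mean_days for day in incubation_days]
deviation_sum = sum(deviations)

# Replace 31 with the extreme value 131 to see how far the mean moves.
with_outlier = [27, 131, 15, 30, 22]
mean_with_outlier = mean(with_outlier)

print(mean_days, deviation_sum, mean_with_outlier)
```

One extreme value shifts the mean from 25.0 to 45.0 days, whereas the median of both lists remains 27 days.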
Your Turn: What is the mean number of cigarettes smoked per day? [Answer: 17]
Exercise 2.5
Determine the mean for the same set of vaccination data.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
To identify the midrange (the midpoint between the smallest and largest observations):
1. Identify the smallest (minimum) observation and the largest (maximum) observation.
2. Add the minimum plus the maximum, then divide by two.
Exception: Age differs from most other variables because age does not follow the usual rules for rounding to
the nearest integer. Someone who is 17 years and 360 days old cannot claim to be 18 years old for at least 5
more days. Thus, to identify the midrange for age (in years) data, you must add the smallest (minimum)
observation plus the largest (maximum) observation plus 1, then divide by two.
For descriptive purposes, a reasonable answer is 2. However, recall that the midrange is usually calculated as
an intermediate step in other calculations. Therefore, more precision is necessary.
Consider that children born in August have just turned 2 years old. Others, born in September the previous
year, are almost but not quite 3 years old. Ignoring seasonal trends in births and assuming a very large room of
children, birthdays are expected to be uniformly distributed throughout the year. The youngest child, born on
September 1, is exactly 2.000 years old. The oldest child, whose birthday is September 2 of the previous year,
is 2.997 years old. For statistical purposes, the mean and midrange of this theoretical group of 2-year-olds are
both 2.5 years.
Example B: Find the midrange of the grouping 15–24 (e.g., number of alcoholic beverages consumed in one
week).
This calculation assumes that the grouping 15–24 really covers 14.50–24.49…. Since the midrange of
14.50–24.49… = 19.49…, the midrange can be reported as 19.5.
Example C: Find the midrange of the age group 15–24 years.
Age differs from the majority of other variables because age does not follow the usual rules for rounding to the
nearest integer. For most variables, 15.99 can be rounded to 16. However, an adolescent who is 15 years and
360 days old cannot claim to be 16 years old (and hence get his driver’s license or learner’s permit) for at least
5 more days. Thus, the interval of 15–24 years really spans 15.0–24.99… years. The midrange of 15.0 and
24.99… = 19.99… = 20.0 years.
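The two midrange rules, the general one and the exception for age in years, can be sketched as a single Python function. The name midrange and the is_age flag are illustrative assumptions, not part of any package.

```python
def midrange(minimum, maximum, is_age=False):
    """Midpoint between the smallest and largest observations.

    For age in years, 1 is added to the sum before dividing by two,
    because ages are not rounded up to the next integer.
    """
    extra = 1 if is_age else 0
    return (minimum + maximum + extra) / 2

# Example B: grouping 15-24 for a variable such as drinks per week.
print(midrange(15, 24))
# Example C: age group 15-24 years.
print(midrange(15, 24, is_age=True))
```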
Geometric mean
To calculate the geometric mean, you need a scientific calculator with log and y^x keys.
To what power would you need to raise a base of 10 to get a value of 100?
Because 10 times 10 or 10^2 equals 100, the log of 100 at base 10 equals 2. Similarly, the log of 16 at base 2
equals 4, because 2^4 = 2 x 2 x 2 x 2 = 16.
An antilog raises the base to the power (logarithm). For example, the antilog of 2 at base 10 is 10^2, or 100. The
antilog of 4 at base 2 is 2^4, or 16. The majority of titers are reported as multiples of 2 (e.g., 2, 4, 8, etc.);
therefore, base 2 is typically used when dealing with titers.
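Method A
1. Take the logarithm of each value.
2. Calculate the mean of the log values by summing them and dividing by the number of observations.
3. Take the antilog of the mean of the log values to get the geometric mean.
Method B
1. Calculate the product of the values by multiplying all of the values together.
2. Take the nth root of the product (where n is the number of observations) to get the geometric mean.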
For example, to calculate the geometric mean of the following values by using logarithms:
10, 10, 100, 100, 100, 100, 10,000, 100,000, 100,000, 1,000,000
Because these values are all multiples of 10, it makes sense to use logs of base 10.
1. Take the log (in this case, to base 10) of each value.
log10(xi) = 1, 1, 2, 2, 2, 2, 4, 5, 5, 6
2. Calculate the mean of the log values by summing and dividing by the number of observations (in this case, 10).
Mean of log10(xi) = (1 + 1 + 2 + 2 + 2 + 2 + 4 + 5 + 5 + 6) / 10 = 30 / 10 = 3
3. Take the antilog of the mean of the log values to get the geometric mean.
Antilog10(3) = 10^3 = 1,000.
The geometric mean of the set of data is 1,000.
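Calculate the geometric mean from the following 95% confidence limits of an odds ratio: 1.0, 9.0
1. Calculate the product of the values by multiplying all values together.
1.0 x 9.0 = 9.0
2. Take the nth root of the product (here the square root, because n = 2).
Square root of 9.0 = 3.0
The geometric mean is 3.0.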
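Both ways of calculating a geometric mean can be sketched in Python: the log-based calculation shown in the worked example above (log, mean of logs, antilog) and the product-and-root calculation. The function names are hypothetical, and the sketch assumes all values are positive. Because logarithms are computed in floating point, results are compared with a tolerance rather than for exact equality.

```python
import math

def geometric_mean_logs(values, base=10):
    """Log each value, average the logs, then take the antilog of that mean."""
    logs = [math.log(value, base) for value in values]
    mean_log = sum(logs) / len(logs)
    return base ** mean_log

def geometric_mean_root(values):
    """Multiply all values together, then take the nth root of the product."""
    product = math.prod(values)
    return product ** (1 / len(values))

# Ten values spanning several orders of magnitude, from the worked example.
sample_values = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]
# Lower and upper 95% confidence limits of an odds ratio.
confidence_limits = [1.0, 9.0]

print(round(geometric_mean_logs(sample_values), 6))
print(round(geometric_mean_root(confidence_limits), 6))
```

The two functions are mathematically equivalent; the log-based form is usually the safer choice for long lists, because the product of many large values can overflow floating-point arithmetic.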
The geometric mean is the average of logarithmic values, converted back to the base. The geometric mean
tends to dampen the effect of extreme values and is never larger than the corresponding arithmetic mean (the two
are equal only when all values are identical). In that sense, the geometric mean is less sensitive than the
arithmetic mean to one or a few extreme values.
The geometric mean is the measure of choice for variables measured on an exponential or logarithmic
scale, such as dilutional titers or assays.
The geometric mean is often used for environmental samples, when levels can range over several orders
of magnitude. For example, levels of coliforms in samples taken from a body of water can range from less
than 100 to more than 100,000.
Exercise 2.6
Using the dilution titers shown below, calculate the geometric mean titer of convalescent antibodies against
tularemia among 10 residents of Martha’s Vineyard. [Hint: Use only the second number in the ratio, i.e., for
1:640, use 640.]
ID # Acute Convalescent
1 1:16 1:512
2 1:16 1:512
3 1:32 1:128
5 1:32 1:1024
6 “negative” 1:1024
7 1:256 1:2048
8 1:32 1:128
9 “negative” 1:4096
10 1:16 1:1024
The mode and median are useful as descriptive measures. However, they are not often used for further
statistical manipulations. In contrast, the mean is not only a good descriptive measure, but it also has good
statistical properties. The mean is used most often in additional statistical manipulations.
While the arithmetic mean is the measure of choice when data are normally distributed, the median is the
measure of choice for data that are not normally distributed. Because epidemiologic data tend not to be
normally distributed (incubation periods, doses, ages of patients), the median is often preferred. The geometric
mean is used most commonly with laboratory data, particularly dilution titers or assays and environmental
sampling data.
The arithmetic mean uses all the data, which makes it sensitive to outliers. Although the geometric mean also
uses all the data, it is not as sensitive to outliers as the arithmetic mean. The midrange, which is based on the
minimum and maximum values, is more sensitive to outliers than any other measures. The mode and median
tend not to be affected by outliers.
In summary, each measure of central location — mode, median, mean, midrange, and geometric mean — is a
single value that is used to represent all of the observed values of a distribution. Each measure has its
advantages and limitations. The selection of the most appropriate measure requires judgment based on the
characteristics of the data (e.g., normally distributed or skewed, with or without outliers, arithmetic or log scale)
and the reason for calculating the measure (e.g., for descriptive or analytic purposes).
Spread, or dispersion, is the second important feature of frequency distributions. Just as measures of central
location describe where the peak is located, measures of spread describe the dispersion (or variation) of values
from that peak in the distribution. Measures of spread include the range, interquartile range, and standard
deviation.
Range
Definition of range
The range of a set of data is the difference between its largest (maximum) value and its smallest (minimum)
value. In the statistical world, the range is reported as a single number and is the result of subtracting the
minimum from the maximum value. In the epidemiologic community, the range is usually reported as “from (the
minimum) to (the maximum),” that is, as two numbers rather than one.
Step 1. Identify the smallest (minimum) observation and the largest (maximum) observation.
Step 2. Epidemiologically, report the minimum and maximum values. Statistically, subtract the
minimum from the maximum value.
For an epidemiologic or lay audience, you could report that “incubation periods ranged from 15 to 31 days.”
Statistically, that range is 16 days.
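Both ways of reporting the range can be produced together. The following minimal Python sketch uses the hepatitis A incubation periods quoted earlier in the lesson; the function name describe_range is illustrative.

```python
def describe_range(values):
    """Return (minimum, maximum, maximum - minimum) for a set of observations."""
    minimum = min(values)
    maximum = max(values)
    return minimum, maximum, maximum - minimum

incubation_days = [27, 31, 15, 30, 22]
low, high, width = describe_range(incubation_days)

# Epidemiologic style: two numbers. Statistical style: one number.
print(f"Incubation periods ranged from {low} to {high} days.")
print(f"Range = {width} days")
```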
Percentiles
Percentiles divide the data in a distribution into 100 equal parts. The Pth percentile (P ranging from 0 to 100) is
the value that has P percent of the observations falling at or below it. In other words, the 90th percentile has
90% of the observations at or below it. The median, the halfway point of the distribution, is the 50th percentile.
The maximum value is the 100th percentile, because all values fall at or below the maximum.
Quartiles
Sometimes, epidemiologists group data into four equal parts, or quartiles. Each quartile includes 25% of the
data. The cut-off for the first quartile is the 25th percentile. The cut-off for the second quartile is the
50th percentile, which is the median. The cut-off for the third quartile is the 75th percentile. And the cut-off for the
fourth quartile is the 100th percentile, which is the maximum.
Interquartile range
The interquartile range is a measure of spread used most commonly with the median. It represents the central
portion of the distribution, from the 25th percentile to the 75th percentile. In other words, the interquartile range
includes the second and third quartiles of a distribution. The interquartile range thus includes approximately
one half of the observations in the set, leaving one quarter of the observations on each side.
To calculate the interquartile range:
Step 1. Arrange the observations in increasing order.
Step 2. Find the position of the 1st and 3rd quartiles by using the following formulas:
Position of 1st quartile (Q1) = (n + 1) / 4
Position of 3rd quartile (Q3) = 3(n + 1) / 4
Step 3. Identify the value of the 1st and 3rd quartiles.
a. If a quartile lies on an observation (i.e., if its position is a whole number), the value of the quartile
is the value of that observation. For example, if the position of a quartile is 20, its value is the
value of the 20th observation.
b. If a quartile lies between observations, the value of the quartile is the value of the lower
observation plus the specified fraction of the difference between the observations. For example, if
the position of a quartile is 20¼, it lies between the 20th and 21st observations, and its value is the
value of the 20th observation, plus ¼ the difference between the value of the 20th and
21st observations.
Step 4. Epidemiologically, report the values at Q1 and Q3. Statistically, calculate the interquartile range
as Q3 minus Q1.
Figure 2.7 The Middle Half of the Observations in a Frequency Distribution Lie within the Interquartile
Range
Find the interquartile range for the length of stay data in Table 2.8.
Step 1. Arrange the observations in increasing order.
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Step 2. Find the position of the 1st and 3rd quartiles. Note that the distribution has 30 observations.
Position of Q1 = (n + 1) / 4 = (30 + 1) / 4 = 7¾
Position of Q3 = 3(n + 1) / 4 = 3(30 + 1) / 4 = 23¼
Step 3. Identify the value of the 1st and 3rd quartiles (Q1 and Q3).
Value of Q1: The position of Q1 is 7¾; therefore, the value of Q1 is equal to the value of the
7th observation plus ¾ of the difference between the values of the 7th and 8th observations:
Q1 = 6 + ¾(7 − 6) = 6.75
Value of Q3: The position of Q3 is 23¼; thus, the value of Q3 is equal to the value of the
23rd observation plus ¼ of the difference between the values of the 23rd and 24th observations:
Q3 = 14 + ¼(16 − 14) = 14.5
Step 4. Calculate the interquartile range as Q3 minus Q1.
Q3 = 14.5
Q1 = 6.75
Interquartile range = 14.5 − 6.75 = 7.75
As indicated above, the median for the length of stay data is 10. Note that the distance between Q1 and the
median is 10 − 6.75 = 3.25. The distance between Q3 and the median is 14.5 − 10 = 4.5. This indicates that the
length of stay data is skewed slightly to the right (to the longer lengths of stay).
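The quartile calculation for the length of stay data can be sketched in Python. The position formulas (n + 1) / 4 and 3(n + 1) / 4 are assumed here because they reproduce the positions 7¾ and 23¼ quoted above; other software may use different quartile conventions and return slightly different values. The 30 observations are expanded from the length of stay frequency distribution, the function names are hypothetical, and the sketch assumes at least three observations.

```python
import math

def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower_index = math.floor(position)
    fraction = position - lower_index
    lower_value = ordered[lower_index - 1]
    if fraction == 0:
        # The position falls exactly on an observation.
        return lower_value
    # Otherwise add the fraction of the gap to the next observation.
    upper_value = ordered[lower_index]
    return lower_value + fraction * (upper_value - lower_value)

def quartiles(values):
    """Return (Q1, Q3) using positions (n + 1) / 4 and 3(n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3

length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
                  9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
                  12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

q1, q3 = quartiles(length_of_stay)
interquartile_range = q3 - q1
print(q1, q3, interquartile_range)
```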
Question:
In the data set named SMOKE, what is the interquartile range for the weight of the participants?
Answer:
In Epi Info:
Select Analyze Data.
Select Read (Import). The default data set should be Sample.mdb. Under Views, scroll down to view SMOKE, and
double click, or click once and then click OK.
Click on Select. Then type in weight < 770, or select weight from available values, then type < 770, and click on OK.
Select Means. Then click on the down arrow beneath Means of, scroll down and select WEIGHT, then click OK.
Scroll to the bottom of the output to find the first quartile (25% = 130) and the third quartile (75% = 180). So the
interquartile range runs from 130 to 180 pounds, for a range of 50 pounds.
Your Turn:
What is the interquartile range of height of study participants? [Answer: 506 to 777]
The interquartile range is generally used in conjunction with the median. Together, they are useful for
characterizing the central location and spread of any frequency distribution, but particularly those that are
skewed.
For a more complete characterization of a frequency distribution, the 1st and 3rd quartiles are sometimes
used with the minimum value, the median, and the maximum value to produce a five-number summary of
the distribution. For example, the five-number summary for the length of stay data is:
Minimum value = 0,
Q1 = 6.75,
Median = 10,
Q3 = 14.5, and
Maximum value = 49.
Together, the five values provide a good description of the center, spread, and shape of a distribution.
These five values can be used to draw a graphical illustration of the data, as in the boxplot in Figure 2.8.
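As a rough sketch, the five-number summary for the length of stay data can be produced with Python's standard `statistics` module. For these data, the default `"exclusive"` method of `statistics.quantiles` gives the same quartiles as the hand calculation. That agreement is an observation about this data set, not a guarantee for every quartile convention. The observations are expanded from Table 2.10.

```python
import statistics

# Length-of-stay observations (days), expanded from the frequency distribution.
length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10,
                  11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(length_of_stay, n=4)

five_number_summary = {
    "minimum": min(length_of_stay),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": max(length_of_stay),
}

for name, value in five_number_summary.items():
    print(f"{name}: {value}")
```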
Some statistical analysis software programs such as Epi Info produce frequency distributions with three output
columns: the number or count of observations for each value of the distribution, the percentage of observations
for that value, and the cumulative percentage. The cumulative percentage, which represents the percentage of
observations at or below that value, gives you the percentile (see Table 2.10).
Table 2.10 Frequency Distribution of Length of Hospital Stay, Sample Data, Northeast Consortium
Vancomycin Quality Improvement Project
Length of Stay (Days) Frequency Percent Cumulative Percent
0 1 3.3 3.3
2 1 3.3 6.7
3 1 3.3 10.0
4 1 3.3 13.3
5 2 6.7 20.0
6 1 3.3 23.3
7 1 3.3 26.7
8 1 3.3 30.0
9 3 10.0 40.0
10 5 16.7 56.7
11 1 3.3 60.0
12 3 10.0 70.0
13 1 3.3 73.3
14 1 3.3 76.7
16 1 3.3 80.0
18 2 6.7 86.7
19 1 3.3 90.0
22 1 3.3 93.3
27 1 3.3 96.7
49 1 3.3 100.0
Total 30 100.0
A shortcut to calculating Q1, the median, and Q3 by hand is to look at the tabular output from these software
programs and note which values include 25%, 50%, and 75% of the data, respectively. This shortcut method
gives slightly different results than those you would calculate by hand, but usually the differences are minor.
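The shortcut can be illustrated with a short Python sketch that builds the three output columns and then reads off the first value whose cumulative percent reaches 25%, 50%, and 75%. The function names `frequency_table` and `shortcut_percentile` are hypothetical, and the observations are expanded from Table 2.10.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * cumulative / n))
    return rows

def shortcut_percentile(rows, percent):
    """Return the first value whose cumulative percent includes `percent`."""
    for value, _frequency, _percent, cumulative_percent in rows:
        if cumulative_percent >= percent:
            return value
    raise ValueError("percent must be between 0 and 100")

length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10,
                  11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

rows = frequency_table(length_of_stay)
shortcut_q1 = shortcut_percentile(rows, 25)
shortcut_median = shortcut_percentile(rows, 50)
shortcut_q3 = shortcut_percentile(rows, 75)
print(shortcut_q1, shortcut_median, shortcut_q3)
```

For these data the shortcut gives 7, 10, and 14, compared with 6.75, 10, and 14.5 from the hand calculation.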
Exercise 2.8
Determine the first and third quartiles and interquartile range for the same vaccination data as in the previous
exercises.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
Standard deviation
Definition of standard deviation
The standard deviation is the measure of spread used most commonly with the arithmetic mean. Earlier, the
centering property of the mean was described — subtracting the mean from each observation and then
summing the differences adds to 0. This concept of subtracting the mean from each observation is the basis for
the standard deviation. However, the difference between the mean and each observation is squared to
eliminate negative numbers. Then the average is calculated and the square root is taken to get back to the
original units.
Click OK
→ You should see the list of the frequency by the variable you selected. Scroll down until you see the
Standard Deviation (Std Dev) and other data.
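1. Step 1. Calculate the arithmetic mean.
2. Step 2. Subtract the mean from each observation. Square the difference.
3. Step 3. Sum the squared differences.
4. Step 4. Divide the sum of the squared differences by (n − 1). This is the variance.
5. Step 5. Take the square root of the value obtained in Step 4. The result is the standard deviation.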
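The calculation can be sketched in Python as follows: the mean is subtracted from each observation, the differences are squared and summed, the sum is divided by n − 1 to give the variance, and the square root is taken. The function name `standard_deviation` is hypothetical. The length of stay observations (expanded from Table 2.10) are used for illustration so that the exercise data are left for you to work by hand.

```python
import math
import statistics

def standard_deviation(values):
    """Sample standard deviation, using n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n                                    # arithmetic mean
    squared_differences = [(x - mean) ** 2 for x in values]   # subtract mean, then square
    variance = sum(squared_differences) / (n - 1)             # sum, then divide by n - 1
    return math.sqrt(variance)                                # back to the original units

length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10,
                  11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

sd = standard_deviation(length_of_stay)
print(round(sd, 3))

# Cross-check against the standard library's sample standard deviation.
print(round(statistics.stdev(length_of_stay), 3))
```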
The numeric value of the standard deviation does not have an easy, non-statistical interpretation, but
similar to other measures of spread, the standard deviation conveys how widely or tightly the observations
are distributed from the center. From the previous example, the mean incubation period was 25 days, with
a standard deviation of 6.6 days. If the standard deviation in a second outbreak had been 3.7 days (with
the same mean incubation period of 25 days), you could say that the incubation periods in the second
outbreak showed less variability than did the incubation periods of the first outbreak.
Standard deviation is usually calculated only when the data are more-or-less “normally distributed,” i.e., the
data fall into a typical bell-shaped curve. For normally distributed data, the arithmetic mean is the
recommended measure of central location, and the standard deviation is the recommended measure of
spread. In fact, means should never be reported without their associated standard deviation.
2. Step 2. Subtract the mean from each observation. Square the difference.
4. Step 4: Divide the sum of the squared differences by (n − 1). This is the variance.
5. Step 5: Take the square root of the variance. The result is the standard deviation.
±1 SD includes 68.3%
±2 SD includes 95.5%
±3 SD includes 99.7%
Consider the normal curve illustrated in Figure 2.9. The mean is at the center, and data are equally distributed
on either side of this mean. The points that show ±1, 2, and 3 standard deviations are marked on the x-axis.
For normally distributed data, approximately two-thirds (68.3%, to be exact) of the data fall within one standard
deviation of either side of the mean; 95.5% of the data fall within two standard deviations of the mean; and
99.7% of the data fall within three standard deviations. Exactly 95.0% of the data fall within 1.96 standard
deviations of the mean.
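These areas can be checked numerically. The sketch below relies on the standard result that, for a normal distribution, the proportion of data within k standard deviations of the mean equals erf(k ⁄ √2). The function name `area_within` is hypothetical.

```python
import math

def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.96):
    print(f"+/-{k} SD includes {100 * area_within(k):.2f}%")
```

To two decimal places the areas are 68.27%, 95.45%, 99.73%, and 95.00%, which match the rounded figures quoted above.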
Figure 2.9 Area Under Normal Curve within 1, 2 and 3 Standard Deviations
Exercise 2.9
Calculate the standard deviation for the same set of vaccination data.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
The standard deviation is sometimes confused with another measure with a similar name — the standard error
of the mean. However, the two are not the same. The standard deviation describes variability in a set of data.
The standard error of the mean refers to variability we might expect in the arithmetic means of repeated
samples taken from the same population.
The standard error assumes that the data you have are actually a sample from a larger population. According to
the assumption, your sample is just one of an infinite number of possible samples that could be taken from the
source population. Thus, the mean for your sample is just one of an infinite number of other sample means.
The standard error quantifies the variation in those sample means.
1. Step 1. Calculate the standard deviation.
2. Step 2. Divide the standard deviation by the square root of the number of observations (n).
Properties and uses of the standard error of the mean
The primary practical use of the standard error of the mean is in calculating confidence intervals around the
arithmetic mean. (Confidence intervals are addressed in the next section.)
Find the standard error of the mean for the length-of-stay data in Table 2.10, given that the standard deviation
is 9.188.
n = 30
Standard error of the mean = 9.188 ⁄ √30 = 9.188 ⁄ 5.477 = 1.68
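The same division can be written as a small Python sketch. The function name `standard_error_of_mean` is hypothetical. The inputs are the standard deviation (9.188) and the number of observations (30) from the length of stay example.

```python
import math

def standard_error_of_mean(standard_deviation, n):
    """Divide the standard deviation by the square root of the number of observations."""
    return standard_deviation / math.sqrt(n)

sem = standard_error_of_mean(9.188, 30)
print(f"Standard error of the mean = {sem:.4f}")
```

Because the square root of n is in the denominator, the standard error shrinks as the sample grows: quadrupling the number of observations halves the standard error for the same standard deviation.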
Confidence intervals are calculated for some but not all epidemiologic measures. The two measures covered in
this lesson for which confidence intervals are often presented are the mean and the geometric mean.
Confidence intervals can also be calculated for some of the epidemiologic measures covered in Lesson 3, such
as a proportion, risk ratio, and odds ratio.
The confidence interval for a mean is based on the mean itself and some multiple of the standard error of the
mean. Recall that the standard error of the mean refers to the variability of means that might be calculated from
repeated samples from the same population. Fortunately, regardless of how the data are distributed, means
(particularly from large samples) tend to be normally distributed. (This result is known as the
Central Limit Theorem.) So we can use Figure 2.9 to show that the range from the mean minus one standard
deviation to the mean plus one standard deviation includes 68.3% of the area under the curve.
Consider a population-based sample survey in which the mean total cholesterol level of adult females was 206,
with a standard error of the mean of 3. If this survey were repeated many times, 68.3% of the means would be
expected to fall between the mean minus 1 standard error and the mean plus 1 standard error, i.e., between
203 and 209. One might say that the investigators are 68.3% confident those limits contain the actual mean of
the population.
In public health, investigators generally want to have a greater level of confidence than that, and usually set the
confidence level at 95%. Although the statistical definition of a confidence interval is that 95% of the confidence
intervals from an infinite number of similarly conducted samples would include the true population values, this
definition has little meaning for a single study. More commonly, epidemiologists interpret a 95% confidence
interval as the range of values consistent with the data from their study.
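1. Step 1. Calculate the mean and its standard error.
Mean = 206; standard error of the mean = 3
2. Step 2. Multiply the standard error by 1.96.
3 × 1.96 = 5.88
3. Step 3. Lower limit of the 95% confidence interval = mean minus 1.96 × standard error.
206 − 5.88 = 200.12
Upper limit of the 95% confidence interval = mean plus 1.96 × standard error.
206 + 5.88 = 211.88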
Rounding to one decimal, the 95% confidence interval is 200.1 to 211.9. In other words, this study's best
estimate of the true population mean is 206, but the data are consistent with values from as low as 200.1 to as
high as 211.9. Thus, the confidence interval indicates how precise the estimate is. (This confidence interval is
narrow, indicating that the sample mean of 206 is fairly precise.) It also indicates how confident the researchers
should be in drawing inferences from the sample to the entire population.
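The cholesterol example can be reproduced with a few lines of Python. This is a sketch under the same assumption used above, namely that sample means are approximately normally distributed, so the multiplier for 95% confidence is 1.96. The function name `confidence_interval_95` is hypothetical.

```python
def confidence_interval_95(mean, standard_error, multiplier=1.96):
    """Return (lower limit, upper limit) as mean -/+ multiplier x standard error."""
    margin = multiplier * standard_error
    return mean - margin, mean + margin

# Mean total cholesterol of 206 with a standard error of the mean of 3.
lower, upper = confidence_interval_95(206, 3)
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
```

Passing `multiplier=1` instead gives the 68.3% limits of 203 and 209 described earlier.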
The mean is not the only measure for which a confidence interval can or should be calculated. Confidence
intervals are also commonly calculated for proportions, rates, risk ratios, odds ratios, and other
epidemiologic measures when the purpose is to draw inferences from a sample survey or study to the
larger population.
Most epidemiologic studies are not performed under the ideal conditions required by the theory behind a
confidence interval. As a result, most epidemiologists take a common-sense approach rather than a strict
statistical approach to the interpretation of a confidence interval, i.e., the confidence interval represents the
range of values consistent with the data from a study, and is simply a guide to the variability in a study.
Confidence intervals for means, proportions, risk ratios, odds ratios, and other measures all are calculated
using different formulas. The formula for a confidence interval of the mean is well accepted, as is the
formula for a confidence interval for a proportion. However, a number of different formulas are available for
risk ratios and odds ratios. Since different formulas can sometimes give different results, this supports
interpreting a confidence interval as a guide rather than as a strict range of values.
Regardless of the measure, the interpretation of a confidence interval is the same: the narrower the
interval, the more precise the estimate; and the range of values in the interval is the range of population
values most consistent with the data from the study.
How, then, do you choose the most appropriate measures? A partial answer to this question is to select the
measure of central location on the basis of how the data are distributed, and then use the corresponding
measure of spread. Table 2.11 summarizes the recommended measures.
Table 2.11 Recommended Measures of Central Location and Spread by Type of Data
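Type of Distribution          Measure of Central Location   Measure of Spread
Normal                        Arithmetic mean               Standard deviation
Asymmetrical or skewed        Median                        Range or interquartile range
Exponential or logarithmic    Geometric mean                Geometric standard deviation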
In statistics, the arithmetic mean is the most commonly used measure of central location, and is the measure
upon which the majority of statistical tests and analytic techniques are based. The standard deviation is the
measure of spread most commonly used with the mean. But as noted previously, one disadvantage of the
mean is that it is affected by the presence of one or a few observations with extremely high or low values. The
mean is "pulled" in the direction of the extreme values. You can tell the direction in which the data are skewed
by comparing the values of the mean and the median; the mean is pulled away from the median in the direction
of the extreme values. If the mean is higher than the median, the distribution of data is skewed to the right. If
the mean is lower than the median, as in the right side of Figure 2.10, the distribution is skewed to the left.
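The comparison of mean and median can be sketched in Python. This is only the rule of thumb described above, not a formal test of skewness, and the function name `skew_direction` is hypothetical. The first data set is the length of stay data expanded from Table 2.10; the second is a made-up illustration.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: the mean is pulled away from the median toward the extreme values."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none indicated"

length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10,
                  11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Mean is 12 and median is 10, so the long stays pull the mean to the right.
print(skew_direction(length_of_stay))

# Hypothetical values with one extremely low observation.
print(skew_direction([1, 8, 9, 10, 10]))
```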
The advantage of the median is that it is not affected by a few extremely high or low observations. Therefore,
when a set of data is skewed, the median is more representative of the data than is the mean. For descriptive
purposes, and to avoid making any assumption that the data are normally distributed, many epidemiologists
routinely present the median for incubation periods, duration of illness, and age of the study subjects.
Two measures of spread can be used in conjunction with the median: the range and the interquartile range.
Although many statistics books recommend the interquartile range as the preferred measure of spread, most
practicing epidemiologists use the simpler range instead.
The mode is the least useful measure of central location. Some sets of data have no mode; others have more
than one. The most common value may not be anywhere near the center of the distribution. Modes generally
cannot be used in more elaborate statistical calculations. Nonetheless, even the mode can be helpful when one
is interested in the most common value or most popular choice.
The geometric mean is used for exponential or logarithmic data such as laboratory titers, and for environmental
sampling data whose values can span several orders of magnitude. The measure of spread used with the
geometric mean is the geometric standard deviation. Analogous to the geometric mean, it is the antilog of the
standard deviation of the log of the values.
The geometric standard deviation is substituted for the standard deviation when incorporating logarithms of
numbers. Examples include describing environmental particle size based on mass, or variability of blood lead
concentrations.1
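As a hedged illustration, the geometric mean and geometric standard deviation can be computed by taking logs, applying the ordinary mean and standard deviation, and then taking antilogs. The titer values below are hypothetical and chosen only because each is double the previous one. The function names are also hypothetical, and the sample (n − 1) standard deviation is assumed.

```python
import math
import statistics

def geometric_mean(values):
    """Antilog of the mean of the log10 values."""
    logs = [math.log10(x) for x in values]
    return 10 ** statistics.mean(logs)

def geometric_standard_deviation(values):
    """Antilog of the standard deviation of the log10 values."""
    logs = [math.log10(x) for x in values]
    return 10 ** statistics.stdev(logs)

# Hypothetical reciprocal titers (not taken from any table in this lesson).
titers = [10, 20, 40, 80, 160]

gm = geometric_mean(titers)
gsd = geometric_standard_deviation(titers)
print(f"Geometric mean = {gm:.1f}, geometric standard deviation = {gsd:.2f}")
```

The geometric standard deviation is a multiplicative factor with no units, so it is read as "times or divided by" rather than "plus or minus."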
Table 2.12 Self-Reported Average Number of Cigarettes Smoked Per Day, Survey of Students (n = 200)
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 3
4 6 7 7 8 8 9 10 12 12 13 13
14 15 15 15 15 15 16 17 17 18 18 18
18 19 19 20 20 20 20 20 20 20 20 20
20 20 21 21 22 22 23 24 25 25 26 28
29 30 30 30 30 32 35 40
Mean = 5.4
Median = 0
Mode = 0
Minimum value = 0
Maximum value = 40
Range = 0–40
Interquartile range = 8.8 (0.0–8.8)
Standard deviation = 9.5
These results are correct, but they do not summarize the data well. Almost three fourths of the students,
representing the mode, do not smoke at all. Separating the 58 smokers from the 142 nonsmokers yields a
more informative summary of the data. Among the 58 (29%) who do smoke:
Mean = 18.5
Median = 19.5
Mode = 20
Minimum value = 2
Maximum value = 40
Range = 2–40
Interquartile range = 8.5 (13.75–22.25)
Standard deviation = 8.0
Thus, a more informative summary of the data might be "142 (71%) of the students do not smoke at all. Of the
58 students (29%) who do smoke, mean consumption is just under a pack* a day (mean = 18.5, median =
19.5). The range is from 2 to 40 cigarettes smoked per day, with approximately half the smokers smoking from
14 to 22 cigarettes per day."
* a typical pack contains 20 cigarettes
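The effect of separating smokers from nonsmokers can be sketched in Python using the values in Table 2.12. This is an illustration only. The helper name `summarize` is hypothetical, and the quartiles come from the default `"exclusive"` method of `statistics.quantiles`, which for these data agrees with the (n + 1) ⁄ 4 position method.

```python
import statistics

# Table 2.12: 142 students reported 0 cigarettes per day; 58 reported more.
smokers = [2, 3,
           4, 6, 7, 7, 8, 8, 9, 10, 12, 12, 13, 13,
           14, 15, 15, 15, 15, 15, 16, 17, 17, 18, 18, 18,
           18, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20,
           20, 20, 21, 21, 22, 22, 23, 24, 25, 25, 26, 28,
           29, 30, 30, 30, 30, 32, 35, 40]
all_students = [0] * 142 + smokers

def summarize(values):
    """Return selected measures of central location and spread."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "mode": statistics.mode(values),
        "minimum": min(values),
        "maximum": max(values),
        "Q1": q1,
        "Q3": q3,
        "standard deviation": statistics.stdev(values),
    }

overall = summarize(all_students)
among_smokers = summarize(smokers)

for label, summary in (("All students", overall), ("Smokers only", among_smokers)):
    print(label)
    for name, value in summary.items():
        print(f"  {name}: {value:.2f}")
```

Summarizing the two groups separately keeps the large block of zeros from dominating the median, mode, and first quartile.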
Exercise 2.11
The data in Table 2.13 are from an investigation of an outbreak of severe abdominal pain,
persistent vomiting, and generalized weakness among residents of a rural village. The cause of the outbreak
was eventually identified as flour unintentionally contaminated with lead dust.
Table 2.13 Age and Blood Lead Levels (BLLs) of Ill Villagers and Family Members — Country X, 1996
ID Age (Years) BLL† Log10BLL
1 3 69 1.84
2 4 45 1.66
3 6 49 1.69
4 7 84 1.92
5 9 48 1.68
6 10 58 1.77
7 11 17 1.23
8 12 76 1.88
9 13 61 1.79
10 14 78 1.89
11 15 48 1.68
12 15 57 1.76
13 16 68 1.83
14 16 ? ?
15 17 26 1.42
16 19 78 1.89
17 19 56 1.75
18 20 54 1.73
19 22 73 1.86
20 26 74 1.87
21 27 63 1.80
22 33 103 2.01
23 33 46 1.66
24 35 78 1.89
25 35 50 1.70
26 36 64 1.81
27 36 67 1.83
28 38 79 1.90
29 40 58 1.76
30 45 86 1.93
31 47 76 1.88
32 49 58 1.76
33 56 ? ?
34 60 26 1.41
35 65 104 2.02
36 65 39 1.59
37 65 35 1.54
38 70 72 1.86
39 70 57 1.76
40 76 38 1.58
41 78 44 1.64
? Missing value
Data Source: Nasser A, Hatch D, Pertowski C, Yoon S. Outbreak investigation of an unknown illness in a rural village, Egypt (case study). Cairo: Field