
Section 1: Organizing Data

Whether you are conducting routine surveillance, investigating an outbreak, or conducting a study,
you must first compile information in an organized manner. One common method is to create a line
list or line listing. Table 2.1 is a typical line listing from an epidemiologic investigation of an
apparent cluster of hepatitis A.
A variable can be any characteristic that differs from person to person, such as height, sex,
smallpox vaccination status, or physical activity pattern. The value of a variable is the number or
descriptor that applies to a particular person, such as 5'6" (168 cm), female, and never vaccinated.

The line listing is one type of epidemiologic database, and is organized like a spreadsheet with rows
and columns. Typically, each row is called a record or observation and represents one person or
case of disease. Each column is called a variable and contains information about one characteristic
of the individual, such as race or date of birth. The first column or variable of an epidemiologic
database usually contains the person's name, initials, or identification number. Other columns might
contain demographic information, clinical details, and exposures possibly related to illness.

Table 2.1 Line Listing of Hepatitis A Cases, County Health Department, January–February 2004
ID  Date of Diagnosis  Town  Age (Years)  Sex  Hosp  Jaundice  Outbreak  IV Drugs  IgM Pos  Highest ALT*
01  01/05              B     74           M    Y     N         N         N         Y        232
02  01/06              J     29           M    N     Y         N         Y         Y        285
03  01/08              K     37           M    Y     Y         N         N         Y        3250
04  01/19              J     3            F    N     N         N         N         Y        1100
05  01/30              C     39           M    N     Y         N         N         Y        4146
06  02/02              D     23           M    Y     Y         N         Y         Y        1271
07  02/03              F     19           M    Y     Y         N         N         Y        300
08  02/05              I     44           M    N     Y         N         N         Y        766
09  02/19              G     28           M    Y     N         N         Y         Y        23
10  02/22              E     29           F    N     Y         Y         N         Y        543
11  02/23              A     21           F    Y     Y         Y         N         Y        1897
12  02/24              H     43           M    N     Y         Y         N         Y        1220
13  02/26              B     49           F    N     N         N         N         Y        644
14  02/26              H     42           F    N     N         Y         N         Y        2581
15  02/27              E     59           F    Y     Y         Y         N         Y        2892
16  02/27              E     18           M    Y     N         Y         N         Y        814
17  02/27              A     19           M    N     Y         Y         N         Y        2812
18  02/28              E     63           F    Y     Y         Y         N         Y        4218
19  02/28              E     61           F    Y     Y         Y         N         Y        3410
20  02/29              A     40           M    N     Y         Y         N         Y        4297
* ALT = Alanine aminotransferase

Some epidemiologic databases, such as line listings for a small cluster of disease, may have only a
few rows (records) and a limited number of columns (variables). Such small line listings are
sometimes maintained by hand on a single sheet of paper. Other databases, such as birth or death
records for the entire country, might have thousands of records and hundreds of variables and are
best handled with a computer. However, even when records are computerized, a line listing with key
variables is often printed to facilitate review of the data.

One computer software package that is widely used by epidemiologists to manage data is Epi Info, a
free package developed at CDC. Epi Info allows the user to design a questionnaire, enter data right
into the questionnaire, edit the data, and analyze the data. Two versions are available:

Epi Info 3 (formerly Epi Info 2000 or Epi Info 2002) is Windows-based, and continues to be
supported and upgraded. It is the recommended version and can be downloaded from the CDC
website: http://www.cdc.gov/epiinfo/downloads.htm.

Epi Info 6 is DOS-based, widely used, but being phased out.

This lesson includes Epi Info commands for creating frequency distributions and calculating some of
the measures of central location and spread described in the lesson. Since Epi Info 3 is the
recommended version, only commands for this version are provided in the text; corresponding
commands for Epi Info 6 are offered at the end of the lesson.

Section 2: Types of Variables


Look again at the variables (columns) and values (individual entries in each column) in Table 2.1. If
you were asked to summarize these data, how would you do it?

First, notice that for certain variables, the values are numeric; for others, the values
are descriptive. The type of value influences the way in which a variable can be summarized.
Variables can be classified into one of four types, depending on the type of scale used to
characterize their values (Table 2.2).
Table 2.2 Types of Variables
Scale     Example               Values
Nominal   disease status        yes / no
Ordinal   ovarian cancer        Stage I, II, III, or IV
          (nominal and ordinal scales: "categorical" or "qualitative")
Interval  date of birth         any date from recorded time to current
Ratio     tuberculin skin test  0 – ??? of induration
          (interval and ratio scales: "continuous" or "quantitative")

• A nominal-scale variable is one whose values are categories without any numerical ranking,
such as county of residence. In epidemiology, nominal variables with only two categories are
very common: alive or dead, ill or well, vaccinated or unvaccinated, or did or did not eat the
potato salad. A nominal variable with two mutually exclusive categories is sometimes called a
dichotomous variable.
• An ordinal-scale variable has values that can be ranked but are not necessarily evenly spaced,
such as stage of cancer (see Table 2.3).
• An interval-scale variable is measured on a scale of equally spaced units, but without a true
zero point, such as date of birth.
• A ratio-scale variable is an interval variable with a true zero point, such as height in centimeters
or duration of illness.

Nominal- and ordinal-scale variables are considered qualitative or categorical variables, whereas
interval- and ratio-scale variables are considered quantitative or continuous variables. Sometimes
the same variable can be measured using both a nominal scale and a ratio scale. For example, the
tuberculin skin tests of a group of persons potentially exposed to a co-worker with tuberculosis can
be measured as "positive" or "negative" (nominal scale) or in millimeters of induration (ratio scale).

Table 2.3 Example of Ordinal-Scale Variable: Stages of Breast Cancer*
Stage  Tumor Size          Lymph Node Involvement        Metastasis (Spread)
I      Less than 2 cm      No                            No
II     Between 2 and 5 cm  No or in same side of breast  No
III    More than 5 cm      Yes, on same side of breast   No
IV     Not applicable      Not applicable                Yes
* This table describes the stages of breast cancer. Note that each stage is more extensive than the
previous one and generally carries a less favorable prognosis, but you cannot say that the difference
between Stages I and III is the same as the difference between Stages II and IV.

Section 3: Frequency Distributions


Look again at the data in Table 2.1. How many of the cases (or case-patients) are male?

When a database contains only a limited number of records, you can easily pick out the information you need
directly from the raw data. By scanning the 5th column, you can see that 12 of the 20 case-patients are male.

With larger databases, however, picking out the desired information at a glance becomes increasingly difficult.
To facilitate the task, the variables can be summarized into tables called frequency distributions.
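As a rough illustration of the column scan described above, the following Python sketch copies the rows of Table 2.1 into a string and tallies the fifth column (Sex). The names `LINE_LISTING` and `column` are invented for this example; they are not Epi Info commands.

```python
from collections import Counter

# Rows copied from Table 2.1. Column order: ID, date of diagnosis, town,
# age, sex, hosp, jaundice, outbreak, IV drugs, IgM pos, highest ALT.
LINE_LISTING = """
01 01/05 B 74 M Y N N N Y 232
02 01/06 J 29 M N Y N Y Y 285
03 01/08 K 37 M Y Y N N Y 3250
04 01/19 J 3 F N N N N Y 1100
05 01/30 C 39 M N Y N N Y 4146
06 02/02 D 23 M Y Y N Y Y 1271
07 02/03 F 19 M Y Y N N Y 300
08 02/05 I 44 M N Y N N Y 766
09 02/19 G 28 M Y N N Y Y 23
10 02/22 E 29 F N Y Y N Y 543
11 02/23 A 21 F Y Y Y N Y 1897
12 02/24 H 43 M N Y Y N Y 1220
13 02/26 B 49 F N N N N Y 644
14 02/26 H 42 F N N Y N Y 2581
15 02/27 E 59 F Y Y Y N Y 2892
16 02/27 E 18 M Y N Y N Y 814
17 02/27 A 19 M N Y Y N Y 2812
18 02/28 E 63 F Y Y Y N Y 4218
19 02/28 E 61 F Y Y Y N Y 3410
20 02/29 A 40 M N Y Y N Y 4297
"""


def column(listing, index):
    """Return one column (variable) from a whitespace-separated line listing."""
    return [row.split()[index] for row in listing.strip().splitlines()]


# Index 4 is the 5th column (Sex), because Python counts from zero.
sex_counts = Counter(column(LINE_LISTING, 4))
male_count = sex_counts["M"]
print(f"{male_count} of {sum(sex_counts.values())} case-patients are male")
```

The tally of a single column is already a small frequency distribution: each value of the variable paired with the number of records that have it.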

A frequency distribution displays the values a variable can take and the number of persons or records with
each value. For example, suppose you have data from a study of women with ovarian cancer and wish to look
at parity, that is, the number of times each woman has given birth. To construct a frequency distribution that
displays these data:

• First, list all the values that the variable parity can take, from the lowest possible value to the highest.
• Then, for each value, record the number of women who had that number of births (twins and other
multiple-birth pregnancies count only once).
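The two steps above can be sketched in Python. The parity values below are invented for illustration and are not taken from the ovarian cancer study; `frequency_distribution` is a hypothetical helper name. The sketch lists every value between the lowest and highest observed, so a value that no one has still appears with a count of zero.

```python
from collections import Counter

# Hypothetical parity values for eight women (invented for this example).
parities = [0, 2, 1, 0, 3, 2, 2, 5]


def frequency_distribution(values):
    """Pair every value from the lowest to the highest observed with its count."""
    counts = Counter(values)
    return [(v, counts[v]) for v in range(min(values), max(values) + 1)]


distribution = frequency_distribution(parities)

print(f"{'Parity':>6} {'Number':>6}")
for value, count in distribution:
    print(f"{value:>6} {count:>6}")
print(f"{'Total':>6} {len(parities):>6}")
```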

To create a frequency distribution from a data set in Analysis Module:

Select frequencies, then choose variable.

Table 2.4 displays what the resulting frequency distribution would look like. Notice that the frequency
distribution includes all values of parity between the lowest and highest observed, even though there were no
women for some values. Notice also that each column is clearly labeled, and that the total is given in the
bottom row.

Table 2.4 Distribution of Case-Subjects by Parity (Ratio-Scale Variable), Ovarian Cancer Study, CDC
Parity  Number of Cases
0       45
1       25
2       43
3       32
4       22
5       8
6       2
7       0
8       1
9       0
10      1
Total   179
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with oral contraceptive use. N Engl J Med 1987;316:650–5.
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.

Table 2.4 displays the frequency distribution for a continuous variable. Continuous variables are often further
summarized with measures of central location and measures of spread. Distributions for ordinal and nominal
variables are illustrated in Tables 2.5 and 2.6, respectively. Categorical variables are usually further
summarized as ratios, proportions, and rates (discussed in Lesson 3).

Table 2.5 Distribution of Cases by Stage of Disease (Ordinal-Scale Variable), Ovarian Cancer Study, CDC
Stage  Number of Cases  Percent of Cases
I      45               20
II     11               5
III    104              58
IV     30               17
Total  179              100
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with oral contraceptive use. N Engl J Med 1987;316:650–5.
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.

Table 2.6 Distribution of Cases by Enrollment Site (Nominal-Scale Variable), Ovarian Cancer Study, CDC
Enrollment Site  Number of Cases  Percent of Cases
Atlanta          18               10
Connecticut      39               22
Detroit          35               20
Iowa             30               17
New Mexico       7                4
San Francisco    33               18
Seattle          9                5
Utah             8                4
Total            179              100
Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with oral contraceptive use. N Engl J Med 1987;316:650–5.
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.
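The Percent column of a categorical frequency distribution is each count divided by the total, multiplied by 100. The sketch below recomputes it from the counts in Table 2.6, assuming the percentages were rounded to the nearest whole number; the variable names are invented for the example.

```python
# Counts copied from Table 2.6 (cases by enrollment site).
site_counts = {
    "Atlanta": 18,
    "Connecticut": 39,
    "Detroit": 35,
    "Iowa": 30,
    "New Mexico": 7,
    "San Francisco": 33,
    "Seattle": 9,
    "Utah": 8,
}

total = sum(site_counts.values())

# Percent = count / total * 100, rounded to the nearest whole number.
site_percents = {site: round(100 * n / total) for site, n in site_counts.items()}

for site, n in site_counts.items():
    print(f"{site:<15} {n:>4} {site_percents[site]:>4}")
print(f"{'Total':<15} {total:>4}")
```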

Section 4: Properties of Frequency Distributions


Graphing will be covered in Lesson 4

The data in a frequency distribution can be graphed. We call this type of graph a histogram. Figure 2.1 is a
graph of the number of outbreak-related salmonellosis cases by date of illness onset.

Figure 2.1 Number of Outbreak-Related Salmonellosis Cases by Date of Onset of Illness, United States, June–July 2004

Source: Centers for Disease Control and Prevention. Outbreaks of Salmonella infections associated with eating Roma tomatoes–United States and Canada, 2004. MMWR 54;325–8.

Even a quick look at this graph reveals three features:

• Where the distribution has its peak (central location),
• How widely dispersed it is on both sides of the peak (spread), and
• Whether it is more or less symmetrically distributed on the two sides of the peak (shape).

Central location
Note that the data in Figure 2.1 seem to cluster around a central value, with progressively fewer persons on
either side of this central value. This type of symmetric distribution, as illustrated in Figure 2.2, is the classic
bell-shaped curve — also known as a normal distribution. The clustering at a particular value is known as
the central location or central tendency of a frequency distribution. The central location of a distribution is
one of its most important properties. Sometimes it is cited as a single value that summarizes the entire
distribution. Figure 2.3 illustrates the graphs of three frequency distributions identical in shape but with different
central locations.

Figure 2.2 Bell-Shaped Curve

Figure 2.3 Three Identical Curves with Different Central Locations

Three measures of central location are commonly used in epidemiology: arithmetic mean, median,
and mode. Two other measures that are used less often are the midrange and geometric mean. All of these
measures will be discussed later in this lesson.

Depending on the shape of the frequency distribution, all measures of central location can be identical or
different. Additionally, measures of central location can be in the middle or off to one side or the other.

Spread
A second property of frequency distribution is spread (also called variation or dispersion). Spread refers to the
distribution out from a central value. Two measures of spread commonly used in epidemiology
are range and standard deviation. For most distributions seen in epidemiology, the spread of a frequency
distribution is independent of its central location. Figure 2.4 illustrates three theoretical frequency distributions
that have the same central location but different amounts of spread. Measures of spread will be discussed later
in this lesson.

Figure 2.4 Three Distributions with Same Central Location but Different Spreads

Shape
Skewness refers to the tail, not the hump. So a distribution that is skewed to the left has a long left tail.

A third property of a frequency distribution is its shape. The graphs of the three theoretical frequency
distributions in Figure 2.4 were completely symmetrical. Frequency distributions of some characteristics of
human populations tend to be symmetrical. On the other hand, the data on parity in Figure 2.5
are asymmetrical or, as such distributions are more commonly described, skewed.

Figure 2.5 Distribution of Case-Subjects by Parity, Ovarian Cancer Study, CDC

Data Sources: Lee NC, Wingo PA, Gwinn ML, Rubin GL, Kendrick JS, Webster LA, Ory HW. The reduction in risk of ovarian cancer associated with oral contraceptive use. N Engl J Med 1987;316:650–5.
Centers for Disease Control Cancer and Steroid Hormone Study. Oral contraceptive use and the risk of ovarian cancer. JAMA 1983;249:1596–9.

A distribution that has a central location to the left and a tail off to the right is said to be positively
skewed or skewed to the right. In Figure 2.6, distribution A is skewed to the right. A distribution that has a
central location to the right and a tail to the left is said to be negatively skewed or skewed to the left. In
Figure 2.6, distribution C is skewed to the left.

Figure 2.6 Three Distributions with Different Skewness

Question: How would you describe the parity data in Figure 2.5?

Answer: Figure 2.5 is skewed to the right. Skewing to the right is common in distributions that begin with zero,
such as number of servings consumed, number of sexual partners in the past month, and number of hours
spent in vigorous exercise in the past week.

One distribution deserves special mention — the Normal or Gaussian distribution. This is the classic
symmetrical bell-shaped curve like the one shown in Figure 2.2. It is defined by a mathematical equation and is
very important in statistics. Not only do the mean, median, and mode coincide at the central peak, but the area
under the curve helps determine measures of spread such as the standard deviation and confidence interval
covered later in this lesson.

Section 5: Methods for Summarizing Data


Knowing the type of variable helps you decide how to summarize the data. Table 2.7 displays the ways in
which different variables might be summarized.

Table 2.7 Methods for Summarizing Different Types of Variables
Scale     Ratio or Proportion                 Measure of Central Location  Measure of Spread
Nominal   yes                                 no                           no
Ordinal   yes                                 no                           no
Interval  yes, but might need to group first  yes                          yes
Ratio     yes, but might need to group first  yes                          yes

Measure of central location: a single, usually central, value that best represents an entire distribution of data.

Section 6: Measures of Central Location


A measure of central location provides a single value that summarizes an entire distribution of data. Suppose
you had data from an outbreak of gastroenteritis affecting 41 persons who had recently attended a wedding. If
your supervisor asked you to describe the ages of the affected persons, you could simply list the ages of each
person. Alternatively, your supervisor might prefer one summary number — a measure of central location.
Saying that the mean (or average) age was 48 years rather than reciting 41 ages is certainly more efficient, and
most likely more meaningful.

Measures of central location include the mode, median, arithmetic mean, midrange, and geometric mean.
Selecting the best measure to use for a given distribution depends largely on two factors:

• The shape or skewness of the distribution, and
• The intended use of the measure.

Each measure — what it is, how to calculate it, and when best to use it — is described in this section.

Mode
Definition of mode

The mode is the value that occurs most often in a set of data. It can be determined simply by tallying the
number of times each value occurs. Consider, for example, the number of doses of diphtheria-pertussis-tetanus
(DPT) vaccine each of seventeen 2-year-old children in a particular village received:
0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4

Two children received no doses; two children received 1 dose; three received 2 doses; six received 3 doses;
and four received all 4 doses. Therefore, the mode is 3 doses, because more children received 3 doses than
any other number of doses.

Method for identifying the mode

Step 1. Arrange the observations into a frequency distribution, indicating the values of the variable and
the frequency with which each value occurs. (Alternatively, for a data set with only a few values,
arrange the actual values in ascending order, as was done with the DPT vaccine doses above.)

Step 2. Identify the value that occurs most often.

EXAMPLES: Identifying the Mode

Example A: Table 2.8 (below) provides data from 30 patients who were hospitalized and received
antibiotics. For the variable “length of stay” (LOS) in the hospital, identify the mode.

Step 1. Arrange the data in a frequency distribution.

LOS  Frequency
0    1
1    0
2    1
3    1
4    1
5    2
6    1
7    1
8    1
9    3
10   5
11   1
12   3
13   1
14   1
15   0
16   1
17   0
18   2
19   1
20   0
21   0
22   1
.    0
.    0
27   1
.    0
.    0
49   1

Alternatively, arrange the values in ascending order.

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Step 2. Identify the value that occurs most often.

Most values appear once, but the distribution includes two 5s, three 9s, five 10s, three 12s, and two
18s.
Because 10 appears most frequently, the mode is 10.

Example B: Find the mode of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22
days.

Step 1. Arrange the values in ascending order.

15, 22, 27, 30, and 31 days

Step 2. Identify the value that occurs most often.

None

Note: When no value occurs more than once, the distribution is said to have no mode.

Example C: Find the mode of the following incubation periods for Bacillus cereus food poisoning:
2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 11, 11, 12, 12, 12, 12, 12, 14, 14, 15, 17, 18, 20, 21 hours

Step 1. Arrange the values in ascending order.

Done

Step 2. Identify the values that occur most often.

Five 3s and five 12s

Example C illustrates the fact that a frequency distribution can have more than one mode. When a distribution
has two modes, it is said to be bimodal. Indeed, Bacillus cereus is known to cause two syndromes with different
incubation periods: a short-incubation-period (1–6 hours) syndrome characterized by vomiting, and a
long-incubation-period (6–24 hours) syndrome characterized by diarrhea.

Table 2.8 Sample Data from the Northeast Consortium Vancomycin Quality Improvement Project
ID  Admission Date  Discharge Date  LOS  DOB (mm/dd)  DOB (year)  Age  Sex  ESRD
1   1/01            1/10            9    11/18        1928        66   M    Y     3   N
2   1/08            1/30            22   01/21        1916        78   F    N     10  Y
3   1/16            3/06            49   04/22        1920        74   F    N     32  Y
4   1/23            2/04            12   05/14        1919        75   M    N     5   Y
5   1/24            2/01            8    08/17        1929        65   M    N     4   N
6   1/27            2/14            18   01/11        1918        77   M    N     6   Y
7   2/06            2/16            10   01/09        1920        75   F    N     2   Y
8   2/12            2/22            10   06/12        1927        67   M    N     1   N
9   2/22            3/04            10   05/09        1915        79   M    N     8   N
10  2/22            3/08            14   04/09        1920        74   F    N     10  N
11  2/25            3/04            7    07/28        1915        79   F    N     4   N
12  3/02            3/14            12   04/24        1928        66   F    N     8   N
13  3/11            3/17            6    11/09        1925        69   M    N     3   N
14  3/18            3/23            5    04/08        1924        70   F    N     2   N
15  3/19            3/28            9    09/13        1915        79   F    N     1   Y
16  3/27            4/01            5    01/28        1912        83   F    N     4   Y
17  3/31            4/02            2    03/14        1921        74   M    N     2   Y
18  4/12            4/24            12   02/07        1927        68   F    N     3   N
19  4/17            5/06            19   03/04        1921        74   F    N     11  Y
20  4/29            5/26            27   02/23        1921        74   F    N     14  N
21  5/11            5/15            4    05/05        1923        72   M    N     4   Y
22  5/14            5/14            0    01/03        1911        84   F    N     1   N
23  5/20            5/30            10   11/11        1922        72   F    N     9   Y
24  5/21            6/08            18   08/08        1912        82   M    N     14  Y
25  5/26            6/05            10   09/28        1924        70   M    Y     5   N
26  5/27            5/30            3    05/14        1899        96   F    N     2   N
27  5/28            6/06            9    07/22        1921        73   M    N     1   Y
28  6/07            6/20            13   12/30        1896        98   F    N     3   N
29  6/07            6/23            16   08/31        1906        88   M    N     1   N
30  6/16            6/27            11   07/07        1917        77   F    N     7   Y

To identify the mode from a data set in Analysis Module:

Epi Info does not have a Mode command. Thus, the best way to identify the mode is to create a histogram and
look for the tallest column(s).

Select graphs, then choose histogram under Graph Type.

The tallest column(s) is(are) the mode(s).

NOTE: The Means command provides a mode, but only the lowest value if a distribution has more than one
mode.

Properties and uses of the mode


• The mode is the easiest measure of central location to understand and explain. It is also the easiest to identify,
and requires no calculations.
• The mode is the preferred measure of central location for addressing which value is the most popular or
the most common. For example, the mode is used to describe which day of the week people most prefer to
come to the influenza vaccination clinic, or the “typical” number of doses of DPT the children in a particular
community have received by their second birthday.
• As demonstrated, a distribution can have a single mode. However, a distribution has more than one mode
if two or more values tie as the most frequent values. It has no mode if no value appears more than once.
• The mode is used almost exclusively as a “descriptive” measure. It is almost never used in statistical
manipulations or analyses.
• The mode is not typically affected by one or two extreme values (outliers).

Exercise 2.3
Using the same vaccination data as in Exercise 2.2, find the mode. (If you answered Exercise 2.2, find the
mode from your frequency distribution.)

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Median
Definition of median

The median is the middle value of a set of data that has been put into rank order. Similar to the median on a
highway that divides the road in two, the statistical median is the value that divides the data into two halves,
with one half of the observations being smaller than the median value and the other half being larger. The
median is also the 50th percentile of the distribution. Suppose you had the following ages in years for patients
with a particular illness:

4, 23, 28, 31, 32

The median age is 28 years, because it is the middle value, with two values smaller than 28 and two values
larger than 28.

Method for identifying the median


Step 1. Arrange the observations into increasing or decreasing order.

Step 2. Find the middle position of the distribution by using the following formula:

Middle position = (n + 1) / 2

a. If the number of observations (n) is odd, the middle position falls on a single observation.
b. If the number of observations is even, the middle position falls between two observations.

Step 3. Identify the value at the middle position.

a. If the number of observations (n) is odd and the middle position falls on a single observation, the median
equals the value of that observation.
b. If the number of observations is even and the middle position falls between two observations, the median
equals the average of the two values.
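The three steps above can be sketched in Python. The function name `median` is invented for this illustration; it applies the middle-position formula (n + 1) / 2 directly, using the five ages from the definition for the odd case and the 30 lengths of stay from Table 2.8 for the even case.

```python
def median(values):
    """Identify the median using the middle-position formula (n + 1) / 2."""
    ordered = sorted(values)                # Step 1: put the data in rank order
    n = len(ordered)
    middle = (n + 1) / 2                    # Step 2: 1-based middle position
    if n % 2 == 1:
        # Step 3a: odd n, so the middle position falls on one observation.
        return ordered[int(middle) - 1]
    # Step 3b: even n, so the middle position falls between two observations.
    below = ordered[int(middle) - 1]
    above = ordered[int(middle)]
    return (below + above) / 2


ages = [4, 23, 28, 31, 32]                  # n = 5, middle position = 3
length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
                  9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
                  12, 13, 14, 16, 18, 18, 19, 22, 27, 49]  # n = 30, position 15.5

median_age = median(ages)
median_los = median(length_of_stay)
print(median_age, median_los)
```

For the length-of-stay data, the 15th and 16th ordered values are both 10, so their average is also 10.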
Properties and uses of the median

• The median is a good descriptive measure, particularly for data that are skewed, because it is the central
point of the distribution.
• The median is relatively easy to identify. It is equal to either a single observed value (if odd number of
observations) or the average of two observed values (if even number of observations).
• The median, like the mode, is not generally affected by one or two extreme values (outliers). For example,
if the values above had been 4, 23, 28, 31, and 131 (instead of 32), the median would still
be 28.
• The median has less-than-ideal statistical properties. Therefore, it is not often used in statistical
manipulations and analyses.

Exercise 2.4
Determine the median for the same vaccination data used in Exercises 2.2 and 2.3.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Arithmetic mean
Definition of mean
The arithmetic mean is a more technical name for what is more commonly called the mean or average. The
arithmetic mean is the value that is closest to all the other values in a distribution.

Method for calculating the mean


Step 1. Add all of the observed values in the distribution.

Step 2. Divide the sum by the number of observations.

EXAMPLE: Finding the Mean


Find the mean of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

Step 1. Add all of the observed values in the distribution.

27 + 31 + 15 + 30 + 22 = 125

Step 2. Divide the sum by the number of observations.

125 / 5 = 25.0
Therefore, the mean incubation period is 25.0 days.
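The same two-step calculation can be written as a minimal Python sketch; `arithmetic_mean` is an illustrative name, not an Epi Info command.

```python
def arithmetic_mean(values):
    """Step 1: add the observed values. Step 2: divide by their number."""
    return sum(values) / len(values)


incubation_days = [27, 31, 15, 30, 22]      # hepatitis A example above
total_days = sum(incubation_days)           # Step 1
mean_days = arithmetic_mean(incubation_days)  # Step 2
print(f"{total_days} / {len(incubation_days)} = {mean_days:.1f} days")
```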

Properties and uses of the arithmetic mean

• The mean has excellent statistical properties and is commonly used in additional statistical manipulations
and analyses. One such property is called the centering property of the mean. When the mean is
subtracted from each observation in the data set, the sum of these differences is zero (i.e., the negative
sum is equal to the positive sum). For the data in the previous hepatitis A example:

Value minus Mean   Difference
15 – 25.0          –10.0
22 – 25.0          –3.0
27 – 25.0          +2.0
30 – 25.0          +5.0
31 – 25.0          +6.0
125 – 125.0 = 0    +13.0 – 13.0 = 0

This demonstrates that the mean is the arithmetic center of the distribution.

• Because of this centering property, the mean is sometimes called the center of gravity of a frequency
distribution. If the frequency distribution is plotted on a graph, and the graph is balanced on a fulcrum, the
point at which the distribution would balance would be the mean.
• The arithmetic mean is the best descriptive measure for data that are normally distributed.
• On the other hand, the mean is not the measure of choice for data that are severely skewed or have
extreme values in one direction or another. Because the arithmetic mean uses all of the observations in the
distribution, it is affected by any extreme value. Suppose that the last value in the previous distribution was
131 instead of 31. The mean would be 225 / 5 = 45.0 rather than 25.0. As a result of one extremely large
value, the mean is much larger than all values in the distribution except the extreme value (the “outlier”).
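Two of the properties above, the centering property and the sensitivity of the mean to an extreme value, can be checked with a short Python sketch. It uses the hepatitis A incubation periods from the text and the standard-library `statistics.median` function for the comparison with the median; the variable names are invented for the example.

```python
from statistics import median

incubation_days = [15, 22, 27, 30, 31]
mean_days = sum(incubation_days) / len(incubation_days)

# Centering property: the differences from the mean sum to zero.
differences = [value - mean_days for value in incubation_days]
sum_of_differences = sum(differences)

# Replace the last value (31) with an outlier (131), as in the text.
with_outlier = [15, 22, 27, 30, 131]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(differences, sum_of_differences)
print(mean_days, mean_with_outlier)                    # the mean shifts
print(median(incubation_days), median(with_outlier))   # the median does not
```

In this example the single outlier moves the mean from 25.0 to 45.0, while the median stays at 27.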

Epi Info Demonstration: Finding the Mean


Question: In the data set named SMOKE, what is the mean weight of the participants?

Answer: In Epi Info:


Select Analyze Data.
Select Read (Import). The default data set should be Sample.mdb. Under Views, scroll down to view SMOKE,
and double click, or click once and then click OK. Note that 9 persons have a weight of 777, and 10 persons
have a weight of 999. These are codes for “refused” and “missing.” To exclude these records, enter the following
commands:
Click on Select. Then type weight < 770, or select weight from available values, then type < 770, and
click on OK.
Select Means. Then click on the down arrow beneath Means of, scroll down and select WEIGHT, then click
OK.

The resulting output should indicate a mean weight of 158.116 pounds.

Your Turn: What is the mean number of cigarettes smoked per day? [Answer: 17]
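The Select step in the demonstration above matters because coded values such as 777 and 999 would otherwise be averaged as if they were real weights. The Python sketch below shows the effect. The six weights are invented for illustration and are not taken from the SMOKE data set; only the two codes and the cutoff of 770 are described in the text.

```python
# Hypothetical weights in pounds (invented for this example).
# 777 = "refused" and 999 = "missing" are the codes described in the text.
weights = [150, 777, 172, 999, 131, 165]

# Equivalent of the Select step: keep only records with weight < 770.
valid_weights = [w for w in weights if w < 770]

mean_weight = sum(valid_weights) / len(valid_weights)
naive_mean = sum(weights) / len(weights)  # wrongly includes the codes

print(f"Mean of valid weights: {mean_weight:.1f}")
print(f"Mean with codes left in: {naive_mean:.1f}")
```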

Exercise 2.5
Determine the mean for the same set of vaccination data.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

The midrange (midpoint of an interval)


Definition of midrange
The midrange is the half-way point or the midpoint of a set of observations. The midrange is usually calculated
as an intermediate step in determining other measures.

Method for identifying the midrange

1. Identify the smallest (minimum) observation and the largest (maximum) observation.
2. Add the minimum plus the maximum, then divide by two.

Exception: Age differs from most other variables because age does not follow the usual rules for rounding to
the nearest integer. Someone who is 17 years and 360 days old cannot claim to be 18 years old for at least 5
more days. Thus, to identify the midrange for age (in years) data, you must add the smallest (minimum)
observation plus the largest (maximum) observation plus 1, then divide by two.

Midrange (most types of data) = (minimum + maximum) / 2


Midrange (age data) = (minimum + maximum + 1) / 2

Consider the following example:


In a particular pre-school, children are assigned to rooms on the basis of age on September 1. Room 2 holds
all of the children who were at least 2 years old but not yet 3 years old as of September 1. In other words, every
child in room 2 was 2 years old on September 1. What is the midrange of ages of the children in room 2 on
September 1?

For descriptive purposes, a reasonable answer is 2. However, recall that the midrange is usually calculated as
an intermediate step in other calculations. Therefore, more precision is necessary.

Consider that children born in August have just turned 2 years old. Others, born in September the previous
year, are almost but not quite 3 years old. Ignoring seasonal trends in births and assuming a very large room of
children, birthdays are expected to be uniformly distributed throughout the year. The youngest child, born on
September 1, is exactly 2.000 years old. The oldest child, whose birthday is September 2 of the previous year,
is 2.997 years old. For statistical purposes, the mean and midrange of this theoretical group of 2-year-olds are
both 2.5 years.

Properties and uses of the midrange

• The midrange is not commonly reported as a measure of central location.
• The midrange is more commonly used as an intermediate step in other calculations, or for plotting graphs
of data collected in intervals.

EXAMPLES: Identifying the Midrange


Example A: Find the midrange of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

1. Identify the minimum and maximum values.


Minimum = 15, maximum = 31
2. Add the minimum plus the maximum, then divide by two.
Midrange = (15 + 31) / 2 = 46 / 2 = 23 days

Example B: Find the midrange of the grouping 15–24 (e.g., number of alcoholic beverages consumed in one
week).

1. Identify the minimum and maximum values.


Minimum = 15, maximum = 24
2. Add the minimum plus the maximum, then divide by two.
Midrange = (15 + 24) / 2 = 39 / 2 = 19.5

This calculation assumes that the grouping 15–24 really covers 14.50–24.49…. Since the midrange of 14.50–
24.49… = 19.49…, the midrange can be reported as 19.5.
Example C: Find the midrange of the age group 15–24 years.

1. Identify the minimum and maximum values.


Minimum = 15, maximum = 24
2. Add the minimum plus the maximum plus 1, then divide by two.

Midrange = (15 + 24 + 1) / 2 = 40 / 2 = 20 years

Age differs from the majority of other variables because age does not follow the usual rules for rounding to the
nearest integer. For most variables, 15.99 can be rounded to 16. However, an adolescent who is 15 years and
360 days old cannot claim to be 16 years old (and hence get his driver’s license or learner’s permit) for at least
5 more days. Thus, the interval of 15–24 years really spans 15.0–24.99… years. The midrange of 15.0 and
24.99… = 19.99… = 20.0 years.
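
The three midrange examples can be reproduced with a short Python sketch. The function name `midrange` and the `age_interval` flag are illustrative choices, not part of the lesson; the flag simply applies the "plus 1" rule described for age groups.

```python
def midrange(minimum, maximum, age_interval=False):
    """Return the midpoint between a minimum and a maximum value.

    With age_interval=True, 1 is added to the upper bound, because an age
    group such as 15-24 years really spans 15.0 up to (not including) 25.0.
    """
    upper = maximum + 1 if age_interval else maximum
    return (minimum + upper) / 2


incubation_days = [27, 31, 15, 30, 22]
print(midrange(min(incubation_days), max(incubation_days)))  # 23.0 (Example A)
print(midrange(15, 24))                                      # 19.5 (Example B)
print(midrange(15, 24, age_interval=True))                   # 20.0 (Example C)
```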

Geometric mean
To calculate the geometric mean, you need a scientific calculator with log and yˣ keys.

Definition of geometric mean


The geometric mean is the mean or average of a set of data measured on a logarithmic scale. The geometric
mean is used when the logarithms of the observations are distributed normally (symmetrically) rather than the
observations themselves. The geometric mean is particularly useful in the laboratory for data from serial
dilution assays (1/2, 1/4, 1/8, 1/16, etc.) and in environmental sampling data.

More About Logarithms


A logarithm is the power to which a base is raised.

To what power would you need to raise a base of 10 to get a value of 100?
Because 10 times 10 or 10² equals 100, the log of 100 at base 10 equals 2. Similarly, the log of 16 at base 2
equals 4, because 2⁴ = 2 x 2 x 2 x 2 = 16.

2⁰ = 1 (anything raised to the 0 power is 1)
2¹ = 2
2² = 2 x 2 = 4
2³ = 2 x 2 x 2 = 8
2⁴ = 2 x 2 x 2 x 2 = 16
2⁵ = 2 x 2 x 2 x 2 x 2 = 32
2⁶ = 2 x 2 x 2 x 2 x 2 x 2 = 64
2⁷ = 2 x 2 x 2 x 2 x 2 x 2 x 2 = 128
and so on.

10⁰ = 1 (anything raised to the 0 power is 1)
10¹ = 10
10² = 100
10³ = 1,000
10⁴ = 10,000
10⁵ = 100,000
10⁶ = 1,000,000
10⁷ = 10,000,000
and so on.

An antilog raises the base to the power (logarithm). For example, the antilog of 2 at base 10 is 10², or 100. The
antilog of 4 at base 2 is 2⁴, or 16. The majority of titers are reported as powers of 2 (e.g., 2, 4, 8, etc.);
therefore, base 2 is typically used when dealing with titers.
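
For readers working in software rather than on a calculator, the same logs and antilogs can be checked with Python's standard `math` module. This is only a quick illustration of the definitions above.

```python
import math

print(math.log10(100))  # 2.0, because 10**2 == 100
print(math.log2(16))    # 4.0, because 2**4 == 16
print(10 ** 2)          # 100, the antilog of 2 at base 10
print(2 ** 4)           # 16, the antilog of 4 at base 2
```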

Method for calculating the geometric mean


There are two methods for calculating the geometric mean.

Method A

1. Take the logarithm of each value.


2. Calculate the mean of the log values by summing the log values, then dividing by the number of
observations.
3. Take the antilog of the mean of the log values to get the geometric mean.

Method B

1. Calculate the product of the values by multiplying all of the values together.
2. Take the nth root of the product (where n is the number of observations) to get the geometric mean.

EXAMPLES: Calculating the Geometric Mean

Example A: Using Method A


Calculate the geometric mean from the following set of data.

10, 10, 100, 100, 100, 100, 10,000, 100,000, 100,000, 1,000,000

Because these values are all powers of 10, it makes sense to use logs of base 10.

1. Take the log (in this case, to base 10) of each value.

log₁₀(xᵢ) = 1, 1, 2, 2, 2, 2, 4, 5, 5, 6

2. Calculate the mean of the log values by summing and dividing by the number of observations (in this case, 10).

Mean of log₁₀(xᵢ) = (1+1+2+2+2+2+4+5+5+6) / 10 = 30 / 10 = 3

3. Take the antilog of the mean of the log values to get the geometric mean.

Antilog₁₀(3) = 10³ = 1,000

The geometric mean of the set of data is 1,000.

Example B: Using Method B

Calculate the geometric mean of the following 95% confidence limits of an odds ratio: 1.0, 9.0

1. Calculate the product of the values by multiplying all values together.

1.0 x 9.0 = 9.0

2. Take the square root of the product.

The geometric mean = square root of 9.0 = 3.0.
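
Both methods can be sketched in Python. The function names are illustrative, and the sketch uses natural logs for Method A (any base gives the same geometric mean). Results are floating-point values, so they may differ from the hand calculation in the last decimal places.

```python
import math


def geometric_mean_from_logs(values):
    """Method A: take logs, average them, then take the antilog."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def geometric_mean_from_product(values):
    """Method B: multiply all values, then take the nth root."""
    return math.prod(values) ** (1 / len(values))


example_a = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]
example_b = [1.0, 9.0]

print(geometric_mean_from_logs(example_a))     # about 1,000 (Example A)
print(geometric_mean_from_product(example_b))  # 3.0 (Example B)
```

Method A is usually the safer choice in software, because the product in Method B can become very large when there are many observations.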

Properties and uses of the geometric mean

The geometric mean is the average of logarithmic values, converted back to the base. The geometric mean
tends to dampen the effect of extreme values and is never larger than the corresponding arithmetic mean (the
two are equal only when every value is the same). In that sense, the geometric mean is less sensitive than the
arithmetic mean to one or a few extreme values.

• The geometric mean is the measure of choice for variables measured on an exponential or logarithmic
scale, such as dilutional titers or assays.
• The geometric mean is often used for environmental samples, when levels can range over several orders
of magnitude. For example, levels of coliforms in samples taken from a body of water can range from less
than 100 to more than 100,000.

Exercise 2.6

Using the dilution titers shown below, calculate the geometric mean titer of convalescent antibodies against
tularemia among 10 residents of Martha’s Vineyard. [Hint: Use only the second number in the ratio, i.e., for
1:640, use 640.]

ID #   Acute        Convalescent
1      1:16         1:512
2      1:16         1:512
3      1:32         1:128
4      not done     1:512
5      1:32         1:1024
6      “negative”   1:1024
7      1:256        1:2048
8      1:32         1:128
9      “negative”   1:4096
10     1:16         1:1024

Check your answers

Selecting the appropriate measure


Measures of central location are single values that summarize the observed values of a distribution. The mode
provides the most common value, the median provides the central value, the arithmetic mean provides the
average value, the midrange provides the midpoint value, and the geometric mean provides the logarithmic
average.

The mode and median are useful as descriptive measures. However, they are not often used for further
statistical manipulations. In contrast, the mean is not only a good descriptive measure, but it also has good
statistical properties. The mean is used most often in additional statistical manipulations.

While the arithmetic mean is the measure of choice when data are normally distributed, the median is the
measure of choice for data that are not normally distributed. Because epidemiologic data tend not to be
normally distributed (incubation periods, doses, ages of patients), the median is often preferred. The geometric
mean is used most commonly with laboratory data, particularly dilution titers or assays and environmental
sampling data.
The arithmetic mean uses all the data, which makes it sensitive to outliers. Although the geometric mean also
uses all the data, it is not as sensitive to outliers as the arithmetic mean. The midrange, which is based on the
minimum and maximum values, is more sensitive to outliers than any other measures. The mode and median
tend not to be affected by outliers.
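
A small Python sketch can make the comparison concrete. It adds one extreme value to the hepatitis A incubation periods used in this lesson and reports how far each measure moves. The value 150 is made up purely for illustration, and `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(values):
    """Return the four measures compared in the paragraph above."""
    return {
        "mean": statistics.mean(values),
        "geometric mean": statistics.geometric_mean(values),
        "median": statistics.median(values),
        "midrange": (min(values) + max(values)) / 2,
    }


incubation_days = [27, 31, 15, 30, 22]   # hepatitis A data from this lesson
with_outlier = incubation_days + [150]   # 150 is a made-up extreme value

before = summarize(incubation_days)
after = summarize(with_outlier)
for name in before:
    shift = after[name] - before[name]
    print(f"{name:>14}: {before[name]:6.1f} -> {after[name]:6.1f} (shift {shift:+.1f})")
```

With these numbers the midrange moves the most, followed by the arithmetic mean, the geometric mean, and the median, which matches the ordering described above.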

In summary, each measure of central location — mode, median, mean, midrange, and geometric mean — is a
single value that is used to represent all of the observed values of a distribution. Each measure has its
advantages and limitations. The selection of the most appropriate measure requires judgment based on the
characteristics of the data (e.g., normally distributed or skewed, with or without outliers, arithmetic or log scale)
and the reason for calculating the measure (e.g., for descriptive or analytic purposes).

Section 7: Measures of Spread

Spread, or dispersion, is the second important feature of frequency distributions. Just as measures of central
location describe where the peak is located, measures of spread describe the dispersion (or variation) of values
from that peak in the distribution. Measures of spread include the range, interquartile range, and standard
deviation.

Range
Definition of range

The range of a set of data is the difference between its largest (maximum) value and its smallest (minimum)
value. In the statistical world, the range is reported as a single number and is the result of subtracting the
minimum from the maximum value. In the epidemiologic community, the range is usually reported as “from (the
minimum) to (the maximum),” that is, as two numbers rather than one.

Method for identifying the range

1. Step 1. Identify the smallest (minimum) observation and the largest (maximum) observation.

2. Step 2. Epidemiologically, report the minimum and maximum values. Statistically, subtract the
minimum from the maximum value.

EXAMPLE: Identifying the Range


Find the range of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.
1. Step 1. Identify the minimum and maximum values.

Minimum = 15, maximum = 31

2. Step 2. Subtract the minimum from the maximum value.

Range = 31 − 15 = 16 days

For an epidemiologic or lay audience, you could report that “incubation periods ranged from 15 to 31 days.”
Statistically, that range is 16 days.
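
The two ways of reporting the range can be sketched in a few lines of Python, using the same incubation periods:

```python
incubation_days = [27, 31, 15, 30, 22]
lowest, highest = min(incubation_days), max(incubation_days)

# Epidemiologic convention: report both ends of the range.
print(f"Incubation periods ranged from {lowest} to {highest} days")
# Statistical convention: report the difference as a single number.
print(f"Range = {highest - lowest} days")
```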

Percentiles
Percentiles divide the data in a distribution into 100 equal parts. The Pth percentile (P ranging from 0 to 100) is
the value that has P percent of the observations falling at or below it. In other words, the 90th percentile has
90% of the observations at or below it. The median, the halfway point of the distribution, is the 50th percentile.
The maximum value is the 100th percentile, because all values fall at or below the maximum.

Quartiles
Sometimes, epidemiologists group data into four equal parts, or quartiles. Each quartile includes 25% of the
data. The cut-off for the first quartile is the 25th percentile. The cut-off for the second quartile is the
50th percentile, which is the median. The cut-off for the third quartile is the 75th percentile. And the cut-off for the
fourth quartile is the 100th percentile, which is the maximum.

Interquartile range
The interquartile range is a measure of spread used most commonly with the median. It represents the central
portion of the distribution, from the 25th percentile to the 75th percentile. In other words, the interquartile range
includes the second and third quartiles of a distribution. The interquartile range thus includes approximately
one half of the observations in the set, leaving one quarter of the observations on each side.

Method for determining the interquartile range

1. Step 1. Arrange the observations in increasing order.


2. Step 2. Find the position of the 1st and 3rd quartiles with the following formulas, where n is the number of
observations.

Position of 1st quartile (Q1) = 25th percentile = (n + 1) ⁄ 4

Position of 3rd quartile (Q3) = 75th percentile = 3(n + 1) ⁄ 4 = 3 × position of Q1

3. Step 3. Identify the value of the 1st and 3rd quartiles.

a. If a quartile lies on an observation (i.e., if its position is a whole number), the value of the quartile
is the value of that observation. For example, if the position of a quartile is 20, its value is the
value of the 20th observation.
b. If a quartile lies between observations, the value of the quartile is the value of the lower
observation plus the specified fraction of the difference between the observations. For example, if
the position of a quartile is 20¼, it lies between the 20th and 21st observations, and its value is the
value of the 20th observation, plus ¼ the difference between the value of the 20th and
21st observations.
4. Step 4. Epidemiologically, report the values at Q1 and Q3. Statistically, calculate the interquartile range
as Q3 minus Q1.

Figure 2.7 The Middle Half of the Observations in a Frequency Distribution Lie within the Interquartile
Range

EXAMPLE: Finding the Interquartile Range

Find the interquartile range for the length of stay data in Table 2.8.

1. Step 1. Arrange the observations in increasing order.

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

2. Step 2. Find the position of the 1st and 3rd quartiles. Note that the distribution has 30 observations.

Position of Q1 = (n + 1) ⁄ 4 = (30 + 1) ⁄ 4 = 7.75

Position of Q3 = 3(n + 1) ⁄ 4 = 3(30 + 1) ⁄ 4 = 23.25

Thus, Q1 lies ¾ of the way between the 7th and 8th observations, and Q3 lies ¼ of the way between the
23rd and 24th observations.

3. Step 3. Identify the value of the 1st and 3rd quartiles (Q1 and Q3).

Value of Q1: The position of Q1 is 7¾; therefore, the value of Q1 is equal to the value of the
7th observation plus ¾ of the difference between the values of the 7th and 8th observations:

Value of the 7th observation: 6
Value of the 8th observation: 7

Q1 = 6 + ¾(7 − 6) = 6 + ¾(1) = 6.75

Value of Q3: The position of Q3 is 23¼; thus, the value of Q3 is equal to the value of the
23rd observation plus ¼ of the difference between the values of the 23rd and 24th observations:

Value of the 23rd observation: 14
Value of the 24th observation: 16

Q3 = 14 + ¼(16 − 14) = 14 + ¼(2) = 14 + (2 ⁄ 4) = 14.5

4. Step 4. Calculate the interquartile range as Q3 minus Q1.

Q3 = 14.5
Q1 = 6.75
Interquartile range = 14.5 − 6.75 = 7.75

As indicated above, the median for the length of stay data is 10. Note that the distance between Q1 and the
median is 10 − 6.75 = 3.25. The distance between Q3 and the median is 14.5 − 10 = 4.5. This indicates that the
length of stay data are skewed slightly to the right (to the longer lengths of stay).
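
The four steps of this example can be sketched in Python as follows. The function names are illustrative. The sketch assumes enough observations that the quartile positions fall inside the data; it does not guard against very small data sets.

```python
import math


def value_at_position(ordered, position):
    """Step 3: value at a 1-based position that may be fractional."""
    lower_index = math.floor(position)
    fraction = position - lower_index
    lower = ordered[lower_index - 1]
    if fraction == 0:
        return lower
    upper = ordered[lower_index]
    # Lower observation plus the specified fraction of the difference.
    return lower + fraction * (upper - lower)


def quartiles(values):
    ordered = sorted(values)                               # Step 1
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)           # Steps 2 and 3
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3


length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
                  9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
                  12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

q1, q3 = quartiles(length_of_stay)
print(q1, q3, q3 - q1)  # 6.75 14.5 7.75 (Step 4)
```

Statistical packages use several different quartile conventions, so their output may not match this hand method exactly. Python's `statistics.quantiles` with its default "exclusive" method is documented as using (n + 1)-based positions and should agree with the values above, but that is worth confirming against the library documentation before relying on it.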

Epi Info Demonstration: Finding the Interquartile Range

Question:

In the data set named SMOKE, what is the interquartile range for the weight of the participants?
Answer:

In Epi Info:
Select Analyze Data.

Select Read (Import). The default data set should be Sample.mdb. Under Views, scroll down to view SMOKE, and
double click, or click once and then click OK.

Click on Select. Then type in weight < 770, or select weight from available values, then type < 770, and click on OK.

Select Means. Then click on the down arrow beneath Means of, scroll down and select WEIGHT, then click OK.

Scroll to the bottom of the output to find the first quartile (25% = 130) and the third quartile (75% = 180). So the
interquartile range runs from 130 to 180 pounds, for a range of 50 pounds.

Your Turn:

What is the interquartile range of height of study participants? [Answer: 506 to 777]

Properties and uses of the interquartile range

• The interquartile range is generally used in conjunction with the median. Together, they are useful for
characterizing the central location and spread of any frequency distribution, but particularly those that are
skewed.
• For a more complete characterization of a frequency distribution, the 1st and 3rd quartiles are sometimes
used with the minimum value, the median, and the maximum value to produce a five-number summary of
the distribution. For example, the five-number summary for the length of stay data is:

Minimum value = 0,
Q1 = 6.75,
Median = 10,
Q3 = 14.5, and
Maximum value = 49.

• Together, the five values provide a good description of the center, spread, and shape of a distribution.
These five values can be used to draw a graphical illustration of the data, as in the boxplot in Figure 2.8.

Figure 2.8 Interquartile Range from Cumulative Frequencies



Some statistical analysis software programs such as Epi Info produce frequency distributions with three output
columns: the number or count of observations for each value of the distribution, the percentage of observations
for that value, and the cumulative percentage. The cumulative percentage, which represents the percentage of
observations at or below that value, gives you the percentile (see Table 2.10).

Table 2.10 Frequency Distribution of Length of Hospital Stay, Sample Data, Northeast Consortium
Vancomycin Quality Improvement Project
Length of Stay (Days)   Frequency   Percent   Cumulative Percent
0 1 3.3 3.3
2 1 3.3 6.7
3 1 3.3 10.0
4 1 3.3 13.3
5 2 6.7 20.0
6 1 3.3 23.3
7 1 3.3 26.7
8 1 3.3 30.0
9 3 10.0 40.0
10 5 16.7 56.7
11 1 3.3 60.0
12 3 10.0 70.0
13 1 3.3 73.3
14 1 3.3 76.7
16 1 3.3 80.0
18 2 6.7 86.7
19 1 3.3 90.0
22 1 3.3 93.3
27 1 3.3 96.7
49 1 3.3 100.0
Total 30 100.0

A shortcut to calculating Q1, the median, and Q3 by hand is to look at the tabular output from these software
programs and note which values include 25%, 50%, and 75% of the data, respectively. This shortcut method
gives slightly different results than those you would calculate by hand, but usually the differences are minor.
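
The shortcut can be sketched in Python by building the cumulative percentages and reading off the first value that reaches 25%, 50%, and 75%. The helper names `cumulative_table` and `first_value_reaching` are hypothetical; the data are the 30 lengths of stay from Table 2.10.

```python
from collections import Counter


def cumulative_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * running / n))
    return rows


def first_value_reaching(rows, percent):
    """Return the first value whose cumulative percent reaches `percent`."""
    for value, _frequency, _percent, cumulative in rows:
        if cumulative >= percent:
            return value


length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
                  9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
                  12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

rows = cumulative_table(length_of_stay)
for cutoff in (25, 50, 75):
    print(cutoff, first_value_reaching(rows, cutoff))  # 25 -> 7, 50 -> 10, 75 -> 14
```

Here the shortcut gives 7 and 14 for Q1 and Q3, compared with 6.75 and 14.5 from the hand calculation, which illustrates the minor differences mentioned above.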

Exercise 2.8
Determine the first and third quartiles and interquartile range for the same vaccination data as in the previous
exercises.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Check your answer.

Standard deviation
Definition of standard deviation

The standard deviation is the measure of spread used most commonly with the arithmetic mean. Earlier, the
centering property of the mean was described — subtracting the mean from each observation and then
summing the differences adds to 0. This concept of subtracting the mean from each observation is the basis for
the standard deviation. However, the difference between the mean and each observation is squared to
eliminate negative numbers. Then the average is calculated and the square root is taken to get back to the
original units.


Method for calculating the standard deviation

To calculate the standard deviation from a data set in the Epi Info Analysis Module:

Click on the Means command under the Statistics folder.

In the Means Of drop-down box, select the variable of interest.
→ Select Variable

Click OK.
→ You should see the list of the frequency by the variable you selected. Scroll down until you see the
Standard Deviation (Std Dev) and other data.

To calculate the standard deviation by hand:

1. Step 1. Calculate the arithmetic mean.

2. Step 2. Subtract the mean from each observation. Square the difference.

3. Step 3. Sum the squared differences.

4. Step 4. Divide the sum of the squared differences by n − 1.

5. Step 5. Take the square root of the value obtained in Step 4. The result is the standard deviation.

Properties and uses of the standard deviation

• The numeric value of the standard deviation does not have an easy, non-statistical interpretation, but
similar to other measures of spread, the standard deviation conveys how widely or tightly the observations
are distributed from the center. In the hepatitis A example in this section, the mean incubation period is 25 days,
with a standard deviation of 6.6 days. If the standard deviation in a second outbreak had been 3.7 days (with
the same mean incubation period of 25 days), you could say that the incubation periods in the second
outbreak showed less variability than did the incubation periods of the first outbreak.
• Standard deviation is usually calculated only when the data are more-or-less “normally distributed,” i.e., the
data fall into a typical bell-shaped curve. For normally distributed data, the arithmetic mean is the
recommended measure of central location, and the standard deviation is the recommended measure of
spread. In fact, means should never be reported without their associated standard deviation.

EXAMPLE: Calculating the Standard Deviation


Find the standard deviation of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

1. Step 1. Calculate the arithmetic mean.

Mean = (27 + 31 + 15 + 30 +22) ⁄ 5 = 125 ⁄ 5 = 25.0

2. Step 2. Subtract the mean from each observation. Square the difference.

Value Minus Mean   Difference   Difference Squared
27 − 25.0          + 2.0          4.0
31 − 25.0          + 6.0         36.0
15 − 25.0          −10.0        100.0
30 − 25.0          + 5.0         25.0
22 − 25.0          − 3.0          9.0

3. Step 3. Sum the squared differences.

Sum = 4 + 36 + 100 + 25 + 9 = 174

4. Step 4: Divide the sum of the squared differences by (n − 1). This is the variance.

Variance = 174 ⁄ (5 − 1) = 174 ⁄ 4 = 43.5 days squared

5. Step 5: Take the square root of the variance. The result is the standard deviation.

Standard deviation = square root of 43.5 = 6.6 days
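
The five steps translate directly into Python. This is a minimal sketch that follows the hand calculation step by step; the variable names are illustrative.

```python
import math
import statistics

incubation_days = [27, 31, 15, 30, 22]

mean = sum(incubation_days) / len(incubation_days)                # Step 1
squared_differences = [(x - mean) ** 2 for x in incubation_days]  # Step 2
sum_of_squares = sum(squared_differences)                         # Step 3
variance = sum_of_squares / (len(incubation_days) - 1)            # Step 4
standard_deviation = math.sqrt(variance)                          # Step 5

print(mean, sum_of_squares, variance)   # 25.0 174.0 43.5
print(round(standard_deviation, 1))     # 6.6
```

The standard library function `statistics.stdev` also divides by n − 1, so `statistics.stdev(incubation_days)` should give the same result.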

Areas included in normal distribution:

±1 SD includes 68.3%

±1.96 SD includes 95.0%

±2 SD includes 95.5%

±3 SD includes 99.7%
Consider the normal curve illustrated in Figure 2.9. The mean is at the center, and data are equally distributed
on either side of this mean. The points that show ±1, 2, and 3 standard deviations are marked on the x-axis.
For normally distributed data, approximately two-thirds (68.3%, to be exact) of the data fall within one standard
deviation of either side of the mean; 95.5% of the data fall within two standard deviations of the mean; and
99.7% of the data fall within three standard deviations. Exactly 95.0% of the data fall within 1.96 standard
deviations of the mean.
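
These percentages can be checked numerically. For a normal distribution, the proportion of values within k standard deviations of the mean equals erf(k ⁄ √2), where erf is the error function available in Python's `math` module. The function name `area_within` is an illustrative choice.

```python
import math


def area_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2, 3):
    print(f"±{k} SD includes {100 * area_within(k):.2f}%")
# prints 68.27%, 95.00%, 95.45%, and 99.73%
```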

Figure 2.9 Area Under Normal Curve within 1, 2 and 3 Standard Deviations

Exercise 2.9
Calculate the standard deviation for the same set of vaccination data.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Check your answer.

Standard error of the mean


Definition of standard error

The standard deviation is sometimes confused with another measure with a similar name — the standard error
of the mean. However, the two are not the same. The standard deviation describes variability in a set of data.
The standard error of the mean refers to variability we might expect in the arithmetic means of repeated
samples taken from the same population.

The standard error assumes that the data you have is actually a sample from a larger population. According to
the assumption, your sample is just one of an infinite number of possible samples that could be taken from the
source population. Thus, the mean for your sample is just one of an infinite number of other sample means.
The standard error quantifies the variation in those sample means.

Method for calculating the standard error of the mean

1. Step 1. Calculate the standard deviation.

2. Step 2. Divide the standard deviation by the square root of the number of observations (n).
Properties and uses of the standard error of the mean

• The primary practical use of the standard error of the mean is in calculating confidence intervals around the
arithmetic mean. (Confidence intervals are addressed in the next section.)

EXAMPLE: Finding the Standard Error of the Mean

Find the standard error of the mean for the length-of-stay data in Table 2.10, given that the standard deviation
is 9.188.

1. Step 1. Calculate the standard deviation.

Standard deviation (given) = 9.188

2. Step 2. Divide the standard deviation by the square root of n.

n = 30
Standard error of the mean = 9.188 ⁄ √30 = 9.188 ⁄ 5.477 = 1.68
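
The two-step method is a one-line calculation in Python. The function name `standard_error` is illustrative; the inputs are the standard deviation and sample size given in this example.

```python
import math


def standard_error(standard_deviation, n):
    """Standard deviation divided by the square root of n."""
    return standard_deviation / math.sqrt(n)


sem = standard_error(9.188, 30)
print(sem)  # unrounded standard error of the mean for the length-of-stay data
```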

Confidence limits (confidence interval)


Definition of a confidence interval
Often, epidemiologists conduct studies not only to measure characteristics in the subjects studied, but also to
make generalizations about the larger population from which these subjects came. This process is called
inference. For example, political pollsters use samples of perhaps 1,000 or so people from across the country
to make inferences about which presidential candidate is likely to win on Election Day. Usually, the inference
includes some consideration about the precision of the measurement. (The results of a political poll may be
reported to have a margin of error of, say, plus or minus three points.) In epidemiology, a common way to
indicate a measurement’s precision is by providing a confidence interval. A narrow confidence interval indicates
high precision; a wide confidence interval indicates low precision.

Confidence intervals are calculated for some but not all epidemiologic measures. The two measures covered in
this lesson for which confidence intervals are often presented are the mean and the geometric mean.
Confidence intervals can also be calculated for some of the epidemiologic measures covered in Lesson 3, such
as a proportion, risk ratio, and odds ratio.

The confidence interval for a mean is based on the mean itself and some multiple of the standard error of the
mean. Recall that the standard error of the mean refers to the variability of means that might be calculated from
repeated samples from the same population. Fortunately, regardless of how the data are distributed, means
(particularly from large samples) tend to be normally distributed. (This is from an argument known as the
Central Limit Theorem). So we can use Figure 2.9 to show that the range from the mean minus one standard
deviation to the mean plus one standard deviation includes 68.3% of the area under the curve.

Consider a population-based sample survey in which the mean total cholesterol level of adult females was 206,
with a standard error of the mean of 3. If this survey were repeated many times, 68.3% of the means would be
expected to fall between the mean minus 1 standard error and the mean plus 1 standard error, i.e., between
203 and 209. One might say that the investigators are 68.3% confident those limits contain the actual mean of
the population.

In public health, investigators generally want to have a greater level of confidence than that, and usually set the
confidence level at 95%. Although the statistical definition of a confidence interval is that 95% of the confidence
intervals from an infinite number of similarly conducted samples would include the true population values, this
definition has little meaning for a single study. More commonly, epidemiologists interpret a 95% confidence
interval as the range of values consistent with the data from their study.

Method for calculating a 95% confidence interval for a mean

1. Step 1. Calculate the mean and its standard error.

2. Step 2. Multiply the standard error by 1.96.

3. Step 3. Lower limit of the 95% confidence interval =

mean minus 1.96 × standard error.


Upper limit of the 95% confidence interval =
mean plus 1.96 × standard error.

EXAMPLE: Calculating a 95% Confidence Interval for a Mean


Find the 95% confidence interval for a mean total cholesterol level of 206, standard error of the mean of 3.

1. Step 1. Calculate the mean and its standard error.

Mean = 206, standard error of the mean = 3 (both given)

2. Step 2. Multiply the standard error by 1.96.

3 × 1.96 = 5.88

3. Step 3. Lower limit of the 95% confidence interval = mean minus 1.96 × standard error.
206 − 5.88 = 200.12
Upper limit of the 95% confidence interval = mean plus 1.96 × standard error.
206 + 5.88 = 211.88

Rounding to one decimal, the 95% confidence interval is 200.1 to 211.9. In other words, this study’s best
estimate of the true population mean is 206, but the data are consistent with values ranging from as low as
200.1 to as high as 211.9. Thus, the confidence interval indicates how precise the estimate is. (This confidence interval is
narrow, indicating that the sample mean of 206 is fairly precise.) It also indicates how confident the researchers
should be in drawing inferences from the sample to the entire population.
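
The same calculation can be sketched in Python. The function name `confidence_interval_95` is illustrative, and the multiplier 1.96 is the value used in this lesson, which relies on the sample mean being approximately normally distributed.

```python
def confidence_interval_95(mean, standard_error, z=1.96):
    """Return (lower, upper) limits: mean minus and plus z × standard error."""
    margin = z * standard_error
    return mean - margin, mean + margin


lower, upper = confidence_interval_95(206, 3)
print(round(lower, 2), round(upper, 2))  # 200.12 211.88
```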

Properties and uses of confidence intervals

• The mean is not the only measure for which a confidence interval can or should be calculated. Confidence
intervals are also commonly calculated for proportions, rates, risk ratios, odds ratios, and other
epidemiologic measures when the purpose is to draw inferences from a sample survey or study to the
larger population.
• Most epidemiologic studies are not performed under the ideal conditions required by the theory behind a
confidence interval. As a result, most epidemiologists take a common-sense approach rather than a strict
statistical approach to the interpretation of a confidence interval, i.e., the confidence interval represents the
range of values consistent with the data from a study, and is simply a guide to the variability in a study.
• Confidence intervals for means, proportions, risk ratios, odds ratios, and other measures all are calculated
using different formulas. The formula for a confidence interval of the mean is well accepted, as is the
formula for a confidence interval for a proportion. However, a number of different formulas are available for
risk ratios and odds ratios. Since different formulas can sometimes give different results, this supports
interpreting a confidence interval as a guide rather than as a strict range of values.
• Regardless of the measure, the interpretation of a confidence interval is the same: the narrower the
interval, the more precise the estimate; and the range of values in the interval is the range of population
values most consistent with the data from the study.

Section 8: Choosing the Right Measure of Central Location and Spread

Measures of central location and spread are useful for summarizing a distribution of data. They also facilitate
the comparison of two or more sets of data. However, not every measure of central location and spread is well
suited to every set of data. For example, because the normal distribution (or bell-shaped curve) is perfectly
symmetrical, the mean, median, and mode all have the same value (as illustrated in Figure 2.10). In practice,
however, observed data rarely approach this ideal shape. As a result, the mean, median, and mode usually
differ.

Figure 2.10 Effect of Skewness on Mean, Median, and Mode

How, then, do you choose the most appropriate measures? A partial answer to this question is to select the
measure of central location on the basis of how the data are distributed, and then use the corresponding
measure of spread. Table 2.11 summarizes the recommended measures.

Table 2.11 Recommended Measures of Central Location and Spread by Type of Data
Type of Distribution         Measure of Central Location   Measure of Spread
Normal                       Arithmetic mean               Standard deviation
Asymmetrical or skewed       Median                        Range or interquartile range
Exponential or logarithmic   Geometric mean                Geometric standard deviation

In statistics, the arithmetic mean is the most commonly used measure of central location, and is the measure
upon which the majority of statistical tests and analytic techniques are based. The standard deviation is the
measure of spread most commonly used with the mean. But as noted previously, one disadvantage of the
mean is that it is affected by the presence of one or a few observations with extremely high or low values. The
mean is "pulled" in the direction of the extreme values. You can tell the direction in which the data are skewed
by comparing the values of the mean and the median; the mean is pulled away from the median in the direction
of the extreme values. If the mean is higher than the median, the distribution of data is skewed to the right. If
the mean is lower than the median, as in the right side of Figure 2.10, the distribution is skewed to the left.
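
This rule of thumb is easy to express in code. The sketch below applies it to the 30 lengths of stay from Table 2.10; `skew_direction` is a hypothetical helper name, and the comparison is only the rough guide described above, not a formal test of skewness.

```python
import statistics


def skew_direction(values):
    """Compare mean and median to suggest the direction of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none apparent"


length_of_stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
                  9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
                  12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Mean is 12 and median is 10, so the mean is pulled toward the long stays.
print(skew_direction(length_of_stay))  # right
```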

The advantage of the median is that it is not affected by a few extremely high or low observations. Therefore,
when a set of data is skewed, the median is more representative of the data than is the mean. For descriptive
purposes, and to avoid making any assumption that the data are normally distributed, many epidemiologists
routinely present the median for incubation periods, duration of illness, and age of the study subjects.

Two measures of spread can be used in conjunction with the median: the range and the interquartile range.
Although many statistics books recommend the interquartile range as the preferred measure of spread, most
practicing epidemiologists use the simpler range instead.

The mode is the least useful measure of central location. Some sets of data have no mode; others have more
than one. The most common value may not be anywhere near the center of the distribution. Modes generally
cannot be used in more elaborate statistical calculations. Nonetheless, even the mode can be helpful when one
is interested in the most common value or most popular choice.

The geometric mean is used for exponential or logarithmic data such as laboratory titers, and for environmental
sampling data whose values can span several orders of magnitude. The measure of spread used with the
geometric mean is the geometric standard deviation. Analogous to the geometric mean, it is the antilog of the
standard deviation of the log of the values.

The geometric standard deviation is substituted for the standard deviation when incorporating logarithms of
numbers. Examples include describing environmental particle size based on mass, or variability of blood lead
concentrations.1
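
A minimal Python sketch of the geometric standard deviation follows the definition given above: take logs, compute the standard deviation of the logs, then take the antilog. It assumes the sample standard deviation (dividing by n − 1), consistent with the method used earlier in this lesson; the function name `geometric_sd` is illustrative.

```python
import math
import statistics


def geometric_sd(values):
    """Antilog of the standard deviation of the log values."""
    logs = [math.log(v) for v in values]
    return math.exp(statistics.stdev(logs))  # stdev divides by n - 1


# Data from the earlier geometric mean example (geometric mean = 1,000).
titers = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]
print(geometric_sd(titers))  # roughly 67
```

Unlike the ordinary standard deviation, the geometric standard deviation is a multiplicative factor with no units.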

Sometimes, a combination of these measures is needed to adequately describe a set of data.

EXAMPLE: Summarizing Data


Consider the smoking histories of 200 persons (Table 2.12) and summarize the data.

Table 2.12 Self-Reported Average Number of Cigarettes Smoked Per Day, Survey of Students (n = 200)

Number of Cigarettes Smoked Per Day

0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 3
4 6 7 7 8 8 9 10 12 12 13 13
14 15 15 15 15 15 16 17 17 18 18 18
18 19 19 20 20 20 20 20 20 20 20 20
20 20 21 21 22 22 23 24 25 25 26 28
29 30 30 30 30 32 35 40

Analyzing all 200 observations yields the following results:

Mean = 5.4
Median = 0
Mode = 0
Minimum value = 0
Maximum value = 40
Range = 0–40
Interquartile range = 8.8 (0.0–8.8)
Standard deviation = 9.5

These results are correct, but they do not summarize the data well. Almost three-fourths of the students (71%),
representing the mode, do not smoke at all. Separating the 58 smokers from the 142 nonsmokers yields a
more informative summary of the data. Among the 58 (29%) who do smoke:

Mean = 18.5
Median = 19.5
Mode = 20
Minimum value = 2
Maximum value = 40
Range = 2–40
Interquartile range = 8.5 (13.75–22.25)
Standard deviation = 8.0

Thus, a more informative summary of the data might be "142 (71%) of the students do not smoke at all. Of the
58 students (29%) who do smoke, mean consumption is just under a pack* a day (mean = 18.5, median =
19.5). The range is from 2 to 40 cigarettes smoked per day, with approximately half the smokers smoking from
14 to 22 cigarettes per day."
* A typical pack contains 20 cigarettes.
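
For readers who want to check the figures by computer, the following Python sketch enters the Table 2.12 values and summarizes all students and smokers separately. The `summarize` helper is a name made up for this illustration. The quartiles come from the default "exclusive" method of `statistics.quantiles`, which places them at positions based on n + 1; other software may use a different rule and give slightly different quartiles.

```python
import statistics

# Table 2.12: 142 students reported 0 cigarettes per day; 58 reported 1 or more.
nonsmokers = [0] * 142
smokers = [
    2, 3,
    4, 6, 7, 7, 8, 8, 9, 10, 12, 12, 13, 13,
    14, 15, 15, 15, 15, 15, 16, 17, 17, 18, 18, 18,
    18, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20,
    20, 20, 21, 21, 22, 22, 23, 24, 25, 25, 26, 28,
    29, 30, 30, 30, 30, 32, 35, 40,
]
everyone = nonsmokers + smokers

def summarize(values):
    """Return the measures of central location and spread used in this example."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default "exclusive" method
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "sd": statistics.stdev(values),  # sample (n - 1) standard deviation
    }

for label, group in (("All students", everyone), ("Smokers only", smokers)):
    s = summarize(group)
    print(f"{label}: n={s['n']}, mean={s['mean']:.2f}, median={s['median']}, "
          f"mode={s['mode']}, range={s['min']}-{s['max']}, "
          f"IQR={s['q3'] - s['q1']:.2f} ({s['q1']}-{s['q3']}), sd={s['sd']:.1f}")
```

Computed this way, the smokers' mean is about 18.55 (reported above as 18.5) and their first quartile is 13.75.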

Exercise 2.11
The data in Table 2.13 are from an investigation of an outbreak of severe abdominal pain,
persistent vomiting, and generalized weakness among residents of a rural village. The cause of the outbreak
was eventually identified as flour unintentionally contaminated with lead dust.

1. Summarize the blood lead level data with a frequency distribution.


2. Calculate the arithmetic mean. [Hint: Sum of known values = 2,363]
3. Identify the median and interquartile range.
4. Calculate the standard deviation. [Hint: Sum of squares = 157,743]
5. Calculate the geometric mean using the log lead levels provided. [Hint: Sum of log lead levels = 68.45]
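
The hints in items 2, 4, and 5 match the usual computational formulas, sketched below. The notation is assumed for this illustration: x_i is each known blood lead level and n is the number of non-missing values (not the number of rows in the table). The standard deviation is written in its sample (n − 1) form.

```latex
\bar{x} = \frac{\sum x_i}{n}
\qquad
s = \sqrt{\frac{\sum x_i^{2} - \left(\sum x_i\right)^{2} / n}{n - 1}}
\qquad
\bar{x}_{\mathrm{geo}} = 10^{\left(\sum \log_{10} x_i\right) / n}
```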


Table 2.13 Age and Blood Lead Levels (BLLs) of Ill Villagers and Family Members — Country X, 1996
ID   Age (Years)   BLL†   Log10BLL
1    3             69     1.84
2    4             45     1.66
3    6             49     1.69
4    7             84     1.92
5    9             48     1.68
6    10            58     1.77
7    11            17     1.23
8    12            76     1.88
9    13            61     1.79
10   14            78     1.89
11   15            48     1.68
12   15            57     1.76
13   16            68     1.83
14   16            ?      ?
15   17            26     1.42
16   19            78     1.89
17   19            56     1.75
18   20            54     1.73
19   22            73     1.86
20   26            74     1.87
21   27            63     1.80
22   33            103    2.01
23   33            46     1.66
24   35            78     1.89
25   35            50     1.70
26   36            64     1.81
27   36            67     1.83
28   38            79     1.90
29   40            58     1.76
30   45            86     1.93
31   47            76     1.88
32   49            58     1.76
33   56            ?      ?
34   60            26     1.41
35   65            104    2.02
36   65            39     1.59
37   65            35     1.54
38   70            72     1.86
39   70            57     1.76
40   76            38     1.58
41   78            44     1.64

† Blood lead levels measured in micrograms per deciliter (mcg/dL)
? Missing value

Data Source: Nasser A, Hatch D, Pertowski C, Yoon S. Outbreak investigation of an unknown illness in a rural
village, Egypt (case study). Cairo: Field Epidemiology Training Program, 1999.

References (This Section)


1. Griffin S, Marcus A, Schulz T, Walker S. Calculating the interindividual geometric standard deviation for use in
the integrated exposure uptake biokinetic model for lead in children. Environ Health Perspect 1999;107:481–7.
