Stats Chapter 2
Chapter 2
Summarizing data
In this section we will explore techniques for summarizing numerical variables. For example,
consider the loan amount variable from the loan50 data set, which represents the loan size for all
50 loans in the data set. This variable is numerical since we can sensibly discuss the numerical
difference of the size of two loans. On the other hand, area codes and zip codes are not numerical,
but rather they are categorical variables.
Throughout this section and the next, we will apply these methods using the loan50 and
county data sets, which were introduced in Section 1.2. If you’d like to review the variables from
either data set, see Figures 1.3 and 1.5.
[Scatterplot: Total Income on the horizontal axis ($0 to $300k), Loan Amount on the vertical axis ($0 to $40k)]
Figure 2.1: A scatterplot of total income versus loan amount for the loan50
data set.
Looking at Figure 2.1, we see that there are many borrowers with an income below $100,000
on the left side of the graph, while there are a handful of borrowers with income above $250,000.
EXAMPLE 2.1
Figure 2.2 shows a plot of median household income against the poverty rate for 3,142 counties.
What can be said about the relationship between these variables?
The relationship is evidently nonlinear, as highlighted by the dashed line. This is different from
previous scatterplots we’ve seen, which show relationships that do not show much, if any, curvature
in the trend.
42 CHAPTER 2. SUMMARIZING DATA
[Scatterplot: Poverty Rate (Percent) on the horizontal axis (0% to 50%), median household income on the vertical axis ($0 to $120k)]
Figure 2.2: A scatterplot of the median household income against the poverty rate
for the county data set. A statistical model has also been fit to the data and is
shown as a dashed line.
[Dot plot: Interest Rate on the horizontal axis]
Figure 2.3: A dot plot of interest rate for the loan50 data set. The distribution’s mean is shown as a red triangle.
1 Answers may vary. Scatterplots are helpful in quickly spotting associations relating variables, whether those
associations come in the form of simple trends or whether those relationships are more complex.
2 Consider the case where your vertical axis represents something “good” and your horizontal axis represents
something that is only good in moderation. Health and water consumption fit this description: we require some water
to survive, but consume too much and it becomes toxic and can kill a person.
2.1. EXAMINING NUMERICAL DATA 43
Figure 2.4: A stacked dot plot of interest rate for the loan50 data set. The rates
have been rounded to the nearest percent in this plot, and the distribution’s mean
is shown as a red triangle.
The mean, often called the average, is a common way to measure the center of a distribution
of data. To compute the mean interest rate, we add up all the interest rates and divide by the number
of observations:
$$\bar{x} = \frac{10.90\% + 9.92\% + 26.30\% + \cdots + 6.08\%}{50} = 11.57\%$$
The sample mean is often labeled x̄. The letter x is being used as a generic placeholder for the
variable of interest, interest rate, and the bar over the x communicates we’re looking at the
average interest rate, which for these 50 loans was 11.57%. It is useful to think of the mean as the
balancing point of the distribution, and it’s shown as a triangle in Figures 2.3 and 2.4.
MEAN
The sample mean can be computed as the sum of the observed values divided by the number
of observations:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, x_2, \ldots, x_n$ represent the n observed values.
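The definition above translates directly into a few lines of Python. This is a minimal sketch: the function name sample_mean is our own, and the list holds only the four rates quoted in the text, so its mean is not the 11.57% obtained from all 50 loans.

```python
def sample_mean(values):
    """Sum of the observed values divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)

# Illustrative subset only: four of the 50 rates quoted in the text.
rates = [10.90, 9.92, 26.30, 6.08]
print(round(sample_mean(rates), 2))  # mean of these four values, not of all 50
```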
The loan50 data set represents a sample from a larger population of loans made through
Lending Club. We could compute a mean for this population in the same way as the sample mean.
However, the population mean has a special label: µ. The symbol µ is the Greek letter mu and
represents the average of all observations in the population. Sometimes a subscript, such as x, is used
to indicate which variable the population mean refers to, e.g. $\mu_x$. It is often too expensive
to measure the population mean precisely, so we typically estimate µ using the sample mean, x̄.
3 $x_1$ corresponds to the interest rate for the first loan in the sample (10.90%), $x_2$ to the second loan’s interest rate
(9.92%), and $x_i$ corresponds to the interest rate for the ith loan in the data set. For example, if i = 4, then we’re
examining $x_4$, which refers to the fourth observation in the data set.
4 The sample size was n = 50.
EXAMPLE 2.6
The average interest rate across all loans in the population can be estimated using the sample data.
Based on the sample of 50 loans, what would be a reasonable estimate of $\mu_x$, the mean interest rate
for all loans in the full data set?
The sample mean, 11.57%, provides a rough estimate of $\mu_x$. While it’s not perfect, this is our single
best guess of the average interest rate of all the loans in the population under study.
In Chapter 5 and beyond, we will develop tools to characterize the accuracy of point estimates like
the sample mean. As you might have guessed, point estimates based on larger samples tend to be
more accurate than those based on smaller samples.
EXAMPLE 2.7
The mean is useful because it allows us to rescale or standardize a metric into something more easily
interpretable and comparable. Provide 2 examples where the mean is useful for making comparisons.
1. We would like to understand if a new drug is more effective at treating asthma attacks than the
standard drug. A trial of 1500 adults is set up, where 500 receive the new drug, and 1000 receive a
standard drug in the control group:
                        New drug    Standard drug
Number of patients      500         1000
Total asthma attacks    200         300
Comparing the raw counts of 200 to 300 asthma attacks would make it appear that the new drug is
better, but this is an artifact of the imbalanced group sizes. Instead, we should look at the average
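number of asthma attacks per patient in each group:
New drug: 200/500 = 0.4 asthma attacks per patient. Standard drug: 300/1000 = 0.3 asthma attacks per patient.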
The standard drug has a lower average number of asthma attacks per patient than the average in
the treatment group.
2. Emilio opened a food truck last year where he sells burritos, and his business has stabilized
over the last 3 months. Over that 3 month period, he has made $11,000 while working 625 hours.
Emilio’s average hourly earnings provides a useful statistic for evaluating whether his venture is,
at least from a financial perspective, worth it:
$$\frac{\$11000}{625 \text{ hours}} = \$17.60 \text{ per hour}$$
By knowing his average hourly wage, Emilio now has put his earnings into a standard unit that is
easier to compare with many other jobs that he might consider.
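Both comparisons in Example 2.7 come down to the same arithmetic: divide a total by the number of units it is spread over. The sketch below uses only the counts stated in the example; the helper name per_unit_average is our own choice for illustration.

```python
def per_unit_average(total, units):
    """Return the average amount per unit (total divided by number of units)."""
    return total / units

# Asthma trial: attacks per patient in each group.
new_drug_rate = per_unit_average(200, 500)        # 0.4 attacks per patient
standard_drug_rate = per_unit_average(300, 1000)  # 0.3 attacks per patient

# Emilio's food truck: dollars earned per hour worked.
hourly_earnings = per_unit_average(11000, 625)    # 17.6 dollars per hour

print(new_drug_rate, standard_drug_rate, hourly_earnings)
```

Once the totals are on a per-unit scale, the two drug groups can be compared directly even though their sizes differ.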
EXAMPLE 2.8
Suppose we want to compute the average income per person in the US. To do so, we might first
think to take the mean of the per capita incomes across the 3,142 counties in the county data set.
What would be a better approach?
The county data set is special in that each county actually represents many individual people. If
we were to simply average across the income variable, we would be treating counties with 5,000 and
5,000,000 residents equally in the calculations. Instead, we should compute the total income for each
county, add up all the counties’ totals, and then divide by the number of people in all the counties.
If we completed these steps with the county data, we would find that the per capita income for the
US is $30,861. Had we computed the simple mean of per capita income across counties, the result
would have been just $26,093!
This example used what is called a weighted mean. For more information on this topic, check out
the following online supplement regarding weighted means: openintro.org/d?file=stat_wtd_mean.
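To see why weighting matters, consider a sketch with two hypothetical counties. The populations and incomes below are made up for illustration and are not taken from the county data set; weighted_mean is our own helper name.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical counties (made-up numbers, not from the county data set).
per_capita_income = [20_000, 40_000]
population = [5_000, 5_000_000]

simple = sum(per_capita_income) / len(per_capita_income)
weighted = weighted_mean(per_capita_income, population)
print(simple, round(weighted, 2))
```

The simple mean treats both counties equally, while the weighted mean is pulled almost entirely toward the county where nearly all of the people live.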
Interest Rate   5.0% - 7.5%   7.5% - 10.0%   10.0% - 12.5%   12.5% - 15.0%   ···   25.0% - 27.5%
Count           11            15             8               4               ···   1
[Histogram: Interest Rate on the horizontal axis (5% to 25%), Frequency on the vertical axis]
Histograms provide a view of the data density. Higher bars represent where the data are
relatively more common. For instance, there are many more loans with rates between 5% and 10%
than loans with rates between 20% and 25% in the data set. The bars make it easy to see how the
density of the data changes relative to the interest rate.
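The counts behind a histogram can be tabulated by assigning each observation to a bin of fixed width. The following is a rough sketch, not the procedure used to build the figures: the rates are hypothetical, the name bin_counts is ours, and we assume that a value falling exactly on a boundary is placed in the higher bin.

```python
import math
from collections import Counter

def bin_counts(values, start, width):
    """Count observations per bin [start + k*width, start + (k+1)*width).

    Assumption: a value on a boundary goes to the higher bin.
    """
    counts = Counter()
    for x in values:
        k = math.floor((x - start) / width)
        lower = start + k * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical interest rates in percent (not the loan50 values).
rates = [5.3, 6.1, 7.4, 7.9, 8.8, 9.9, 10.9, 12.6, 17.1, 26.3]
for (lo, hi), n in bin_counts(rates, start=5.0, width=2.5).items():
    print(f"{lo:.1f}% - {hi:.1f}%: {n}")
```

Bins with no observations are simply absent from the result, which corresponds to a gap between bars in the histogram.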
Histograms are especially convenient for understanding the shape of the data distribution.
Figure 2.6 suggests that most loans have rates under 15%, while only a handful of loans have rates
above 20%. When data trail off to the right in this way and have a longer right tail, the shape is said
to be right skewed.5
Data sets with the reverse characteristic – a long, thinner tail to the left – are said to be left
skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal
trailing off in both directions are called symmetric.
5 Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed
Figure 2.7: Counting only prominent peaks, the distributions are (left to right)
unimodal, bimodal, and multimodal. Note that we’ve said the left plot is unimodal
intentionally. This is because we are counting prominent peaks, not just any peak.
EXAMPLE 2.11
Figure 2.6 reveals only one prominent mode in the interest rate. Is the distribution unimodal,
bimodal, or multimodal?
Unimodal. Remember that uni stands for 1 (think unicycles). Similarly, bi stands for 2 (think
bicycles). We’re hoping a multicycle will be invented to complete this analogy.
Looking for modes isn’t about finding a clear and correct answer about the number of modes in
a distribution, which is why prominent is not rigorously defined in this book. The most important
part of this examination is to better understand your data.
6 The skew is visible in all three plots, though the flat dot plot is the least useful. The stacked dot plot and
If we square these deviations and then take an average, the result is equal to the sample variance,
denoted by $s^2$:
$$s^2 = \frac{(-0.67)^2 + (-1.65)^2 + (14.73)^2 + \cdots + (-5.49)^2}{50 - 1} = 25.52$$
We divide by n − 1, rather than dividing by n, when computing a sample’s variance; there’s some
mathematical nuance here, but the end result is that doing this makes this statistic slightly more
reliable and useful.
Notice that squaring the deviations does two things. First, it makes large values relatively
much larger, seen by comparing $(-0.67)^2$, $(-1.65)^2$, $(14.73)^2$, and $(-5.49)^2$. Second, it gets rid of
any negative signs.
The standard deviation is defined as the square root of the variance:
$$s = \sqrt{25.52} = 5.05$$
While often omitted, a subscript of x may be added to the variance and standard deviation, i.e.
$s_x^2$ and $s_x$, if it is useful as a reminder that these are the variance and standard deviation of the
observations represented by $x_1, x_2, \ldots, x_n$.
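The variance and standard deviation calculations can be sketched in Python as follows. The sample is hypothetical (not the loan50 rates) and sample_variance is our own helper name; the final line only checks the square root quoted in the text.

```python
import math
import statistics

def sample_variance(values):
    """Average squared deviation from the mean, dividing by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical sample (not the loan50 rates).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(sample)
s = math.sqrt(s2)

# The standard library uses the same n - 1 convention.
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))

# Check of the value quoted in the text: sqrt(25.52) rounds to 5.05.
print(round(math.sqrt(25.52), 2))
```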
The standard deviation represents the typical deviation of observations from the mean. Usually
about 70% of the data will be within one standard deviation of the mean and about 95% will
be within two standard deviations. However, as seen in Figures 2.8 and 2.9, these percentages
are not strict rules.
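One way to check these rough percentages on a given data set is to count the observations that fall within k standard deviations of the mean. The sketch below uses a small hypothetical sample, and the function name fraction_within is ours.

```python
import statistics

def fraction_within(values, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)

# Hypothetical sample (not the loan50 rates).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(fraction_within(sample, 1), fraction_within(sample, 2))
```

For this small sample the fractions are 0.75 and 1.0, which illustrates that the 70% and 95% figures are guides rather than guarantees.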
Like the mean, the population values for variance and standard deviation have special symbols:
$\sigma^2$ for the variance and σ for the standard deviation. The symbol σ is the Greek letter sigma.
Figure 2.8: For the interest rate variable, 34 of the 50 loans (68%) had interest
rates within 1 standard deviation of the mean, and 48 of the 50 loans (96%) had
rates within 2 standard deviations. Usually about 70% of the data are within
1 standard deviation of the mean and 95% within 2 standard deviations, though
this is far from a hard rule.
Figure 2.9: Three very different population distributions with the same mean µ = 0
and standard deviation σ = 1.
EXAMPLE 2.14
Describe the distribution of the interest rate variable using the histogram in Figure 2.6. The
description should incorporate the center, variability, and shape of the distribution, and it should
also be placed in context. Also note any especially unusual cases.
The distribution of interest rates is unimodal and skewed to the high end. Many of the rates fall
near the mean at 11.57%, and most fall within one standard deviation (5.05%) of the mean. There
are a few exceptionally large interest rates in the sample that are above 20%.
In practice, the variance and standard deviation are sometimes used as a means to an end, where
the “end” is being able to accurately estimate the uncertainty associated with a sample statistic.
For example, in Chapter 5 the standard deviation is used in calculations that help us understand
how much a sample mean varies from one sample to the next.
9 Figure 2.9 shows three distributions that look quite different, but all have the same mean, variance, and standard
deviation. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using
skewness, we can distinguish between the last plot (right skewed) and the first two. While a picture, like a histogram,
tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about
a distribution.
[Box plot labels visible in the figure: lower whisker, Q1 (first quartile), median; vertical axis ticks at 5% and 10%]
Figure 2.10: A vertical dot plot, where points have been horizontally stacked, next
to a labeled box plot for the interest rates of the 50 loans.
The first step in building a box plot is drawing a dark line denoting the median, which splits the
data in half. Figure 2.10 shows 50% of the data falling below the median and the other 50% falling above
the median. There are 50 loans in the data set (an even number) so the data are perfectly split into
two groups of 25. We take the median in this case to be the average of the two observations closest to
the 50th percentile, which happen to be the same value in this data set: (9.93% + 9.93%)/2 = 9.93%.
When there are an odd number of observations, there will be exactly one observation that splits the
data into two halves, and in such a case that observation is the median (no average needed).
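The even and odd cases described above can be written out as a short function. This is an illustrative sketch with our own function name; Python’s standard library offers statistics.median, which behaves the same way.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))        # odd n: the single middle value
print(median([4, 1, 3, 2]))     # even n: average of the two middle values
print(median([9.93, 9.93]))     # the tie described in the text
```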
The second step in building a box plot is drawing a rectangle to represent the middle 50% of
the data. The total length of the box, shown vertically in Figure 2.10, is called the interquartile
range (IQR, for short). It, like the standard deviation, is a measure of variability in data. The more
variable the data, the larger the standard deviation and IQR tend to be. The two boundaries of the
box are called the first quartile (the 25th percentile, i.e. 25% of the data fall below this value) and
the third quartile (the 75th percentile), and these are often labeled $Q_1$ and $Q_3$, respectively.
$$\text{IQR} = Q_3 - Q_1$$
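Quartiles and the IQR can be computed with Python’s standard library, as sketched below on a hypothetical sample. Software packages use slightly different conventions for locating quartiles, so the values from statistics.quantiles with method="inclusive" (our choice here) may differ a little from those produced by other tools.

```python
import statistics

# Hypothetical sample (not the loan50 rates).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, med, q3, iqr)
```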
Extending out from the box, the whiskers attempt to capture the data outside of the box.
However, their reach is never allowed to be more than 1.5 × IQR. They capture everything within
this reach. In Figure 2.10, the upper whisker does not extend to the last two points, which are beyond
$Q_3$ + 1.5 × IQR, and so it extends only to the last point below this limit. The lower whisker stops
at the lowest value, 5.31%, since there are no additional data to reach; the lower whisker’s limit is not
shown in the figure because the plot does not extend down to $Q_1$ − 1.5 × IQR. In a sense, the box is
like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.
Any observation lying beyond the whiskers is labeled with a dot. The purpose of labeling these
points – instead of extending the whiskers to the minimum and maximum observed values – is to help
identify any observations that appear to be unusually distant from the rest of the data. Unusually
distant observations are called outliers. In this case, it would be reasonable to classify the interest
rates of 24.85% and 26.30% as outliers since they are numerically distant from most of the data.
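The whisker limits for the interest rate data can be checked with a short calculation. The sketch below takes the quartiles reported in the text (Q1 = 7.96% and Q3 = 13.72%) as given; the function names are our own.

```python
def whisker_limits(q1, q3, reach=1.5):
    """Return (lower, upper) limits: Q1 - reach*IQR and Q3 + reach*IQR."""
    iqr = q3 - q1
    return q1 - reach * iqr, q3 + reach * iqr

def is_outlier(x, q1, q3, reach=1.5):
    """True if x lies beyond the maximum whisker reach."""
    lower, upper = whisker_limits(q1, q3, reach)
    return x < lower or x > upper

# Quartiles of the interest rates as reported in the text.
Q1, Q3 = 7.96, 13.72
lower, upper = whisker_limits(Q1, Q3)
print(round(lower, 2), round(upper, 2))
for rate in (5.31, 24.85, 26.30):
    print(rate, is_outlier(rate, Q1, Q3))
```

The upper limit works out to about 22.36%, so the rates of 24.85% and 26.30% fall beyond it, while the lowest rate, 5.31%, is well inside the lower limit of about −0.68%.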
10 Since $Q_1$ and $Q_3$ capture the middle 50% of the data and the median splits the data in the middle, 25% of the
data fall between $Q_1$ and the median, and another 25% fall between the median and $Q_3$.
11 These visual estimates will vary a little from one person to the next: $Q_1$ = 8%, $Q_3$ = 14%, IQR = $Q_3 - Q_1$ = 6%.
(The true values: $Q_1$ = 7.96%, $Q_3$ = 13.72%, IQR = 5.76%.)
[Dot plots, three panels: Original; 26.3% to 15%; 26.3% to 35%]
Figure 2.11: Dot plots of the original interest rate data and two modified data sets.
Figure 2.12: A comparison of how the median, IQR, mean (x̄), and standard deviation
(s) would change had an extreme observation from the interest rate variable been different.
The median and IQR are called robust statistics because extreme observations have little
effect on their values: moving the most extreme value generally has little influence on these statistics.
On the other hand, the mean and standard deviation are more heavily influenced by changes in
extreme observations, which can be important in some situations.
EXAMPLE 2.18
The median and IQR did not change under the three scenarios in Figure 2.12. Why might this be
the case?
The median and IQR are only sensitive to numbers near $Q_1$, the median, and $Q_3$. Since values in
these regions are stable in the three data sets, the median and IQR estimates are also stable.
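The same behavior can be reproduced on a small scale. In the sketch below, two hypothetical data sets differ only in their largest value; the numbers are made up and are not the loan50 rates, and the quartile convention (method="inclusive") is our own choice.

```python
import statistics

# Hypothetical data; only the largest value differs between the two lists.
original = [5, 6, 7, 8, 9, 10, 26]
modified = [5, 6, 7, 8, 9, 10, 35]

for name, data in (("original", original), ("modified", modified)):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    print(name,
          "median:", statistics.median(data),
          "IQR:", q3 - q1,
          "mean:", round(statistics.mean(data), 2),
          "SD:", round(statistics.stdev(data), 2))
```

The median and IQR are identical for the two lists because the values near the quartiles and the median did not move, while the mean and standard deviation both increase with the larger extreme value.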
12 (a) Mean is affected more. (b) Standard deviation is affected more. Complete explanations are provided in the
probably more useful. However, if the goal is to understand something that scales well, such as the total amount of
money we might need to have on hand if we were to offer 1,000 loans, then the mean would be more useful.
[Two histograms with Frequency on the vertical axis: (a) Population (m = millions) on the horizontal axis; (b) log10(Population) on the horizontal axis]
Figure 2.13: (a) A histogram of the populations of all US counties. (b) A histogram
of $\log_{10}$-transformed county populations. For this plot, the x-value corresponds to
the power of 10, e.g. “4” on the x-axis corresponds to $10^4$ = 10,000.
EXAMPLE 2.20
Consider the histogram of county populations shown in Figure 2.13(a), which shows extreme skew.
What isn’t useful about this plot?
Nearly all of the data fall into the left-most bin, and the extreme skew obscures many of the
potentially interesting details in the data.
There are some standard transformations that may be useful for strongly right skewed data
where much of the data is positive but clustered near zero. A transformation is a rescaling of
the data using a function. For instance, a plot of the logarithm (base 10) of county populations
results in the new histogram in Figure 2.13(b). The transformed data are symmetric, and any potential outliers
appear much less extreme than in the original data set. By reining in the outliers and extreme
skew, transformations like this often make it easier to build statistical models against the data.
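Applying a base 10 logarithm is a one-line operation in most languages. The sketch below uses a handful of hypothetical populations (not values from the county data set) to show how observations spanning several orders of magnitude are pulled onto a compact scale.

```python
import math

# Hypothetical county populations (not from the county data set).
populations = [1_000, 10_000, 250_000, 10_000_000]
log_populations = [math.log10(p) for p in populations]
print([round(v, 2) for v in log_populations])

# A value of 4 on the log10 scale corresponds to 10**4 = 10,000 people.
assert 10 ** 4 == 10_000
```

On the original scale the largest value is 10,000 times the smallest; after the transformation the values range only from 3 to 7.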
Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of
the population change from 2010 to 2017 against the population in 2010 is shown in Figure 2.14(a).
In this first scatterplot, it’s hard to decipher any interesting patterns because the population variable
is so strongly skewed. However, if we apply a log10 transformation to the population variable, as
shown in Figure 2.14(b), a positive association between the variables is revealed. In fact, we may
be interested in fitting a trend line to the data when we explore methods around fitting regression
lines in Chapter 8.
Transformations other than the logarithm can be useful, too. For instance, the square root
($\sqrt{\text{original observation}}$) and inverse ($\frac{1}{\text{original observation}}$) are commonly used by data scientists.
Common goals in transforming data are to see the data structure differently, reduce skew, assist in
modeling, or straighten a nonlinear relationship in a scatterplot.
[Two scatterplots with Population Change on the vertical axis (−20% to 40%): (a) Population Before Change (m = millions) on the horizontal axis; (b) log10(Population Before Change) on the horizontal axis]
Figure 2.14: (a) Scatterplot of population change against the population before
the change. (b) A scatterplot of the same data but where the population size has
been log-transformed.
EXAMPLE 2.21
What interesting features are evident in the poverty and unemployment rate intensity maps?
Poverty rates are evidently higher in a few locations. Notably, the deep south shows higher poverty
rates, as does much of Arizona and New Mexico. High poverty rates are evident in the Mississippi
flood plains a little north of New Orleans and also in a large section of Kentucky.
The unemployment rate follows similar trends, and we can see correspondence between the two
variables. In fact, it makes sense for higher rates of unemployment to be closely related to poverty
rates. One observation that stands out when comparing the two maps: the poverty rate is much
higher than the unemployment rate, meaning that while many people may be working, they are not
making enough to break out of poverty.
14 Note: answers will vary. There is some correspondence between high earning and metropolitan areas, where we
can see darker spots (higher median household income), though there are several exceptions. You might look for large
cities you are familiar with and try to spot them on the map as dark spots.
[Intensity maps. Legend (a): Poverty, 2% to >25%. Legend (b): Unemployment Rate, 2% to >7%.]
Figure 2.15: (a) Intensity map of poverty rate (percent). (b) Map of the unemployment rate (percent).
[Intensity maps. Legend (a): Homeownership Rate, <55% to 91%. Legend (b): $19 to >$75.]
Figure 2.16: (a) Intensity map of homeownership rate (percent). (b) Intensity map
of median household income ($1000s).
Exercises
2.1 Mammal life spans. Data were collected on life spans (in years) and gestation lengths (in days) for 62
mammals. A scatterplot of life span versus length of gestation is shown below.15
[Scatterplot: Gestation (days) on the horizontal axis (0 to 600), life span on the vertical axis]
[...]versed, i.e. if we plotted length of gestation versus life span?
(c) Are life span and length of gestation inde-
2.2 Associations. Indicate which of the plots show (a) a positive association, (b) a negative association,
or (c) no association. Also determine if the positive and negative associations are linear or nonlinear. Each
part may refer to more than one plot.
2.3 Reproducing bacteria. Suppose that there is only sufficient space and nutrients to support one million
bacterial cells in a petri dish. You place a few bacterial cells in this petri dish, allow them to reproduce freely,
and record the number of bacterial cells in the dish over time. Sketch a plot representing the relationship
between number of bacterial cells and time.
2.4 Office productivity. Office productivity is relatively low when the employees feel no stress about their
work or job security. However, high levels of stress can also lead to reduced employee productivity. Sketch
a plot to represent the relationship between stress and productivity.
2.5 Parameters and statistics. Identify which value represents the sample mean and which value represents
the claimed population mean.
(a) American households spent an average of about $52 in 2007 on Halloween merchandise such as costumes,
decorations and candy. To see if this number had changed, researchers conducted a new survey in 2008
before industry numbers were reported. The survey included 1,500 households and found that average
Halloween spending was $58 per household.
(b) The average GPA of students in 2001 at a private university was 3.37. A survey on a sample of 203
students from this university yielded an average GPA of 3.59 a decade later.
2.6 Sleeping in college. A recent article in a college newspaper stated that college students get an average
of 5.5 hrs of sleep each night. A student who was skeptical about this value decided to conduct a survey
by randomly sampling 25 students. On average, the sampled students slept 6.25 hours per night. Identify
which value represents the sample mean and which value represents the claimed population mean.
15 T. Allison and D.V. Cicchetti. “Sleep in mammals: ecological and constitutional correlates”. In: Arch. Hydrobiol
75 (1975), p. 442.
2.7 Days off at a mining plant. Workers at a particular mining site receive an average of 35 days paid
vacation, which is lower than the national average. The manager of this plant is under pressure from a
local union to increase the amount of paid time off. However, he does not want to give more days off to
the workers because that would be costly. Instead he decides he should fire 10 employees in such a way as
to raise the average number of days off that are reported by his employees. In order to achieve this goal,
should he fire employees who have the most number of days off, least number of days off, or those who have
about the average number of days off?
2.8 Medians and IQRs. For each part, compare distributions (1) and (2) based on their medians and IQRs.
You do not need to calculate these statistics; simply state how the medians and IQRs compare. Make sure
to explain your reasoning.
2.9 Means and SDs. For each part, compare distributions (1) and (2) based on their means and standard
deviations. You do not need to calculate these statistics; simply state how the means and the standard
deviations compare. Make sure to explain your reasoning. Hint: It may be useful to sketch dot plots of the
distributions.
(b) (1) -20, 0, 0, 0, 15, 25, 30, 30
    (2) -40, 0, 0, 0, 15, 25, 30, 30
(d) (1) 100, 200, 300, 400, 500
    (2) 0, 50, 300, 550, 600
2.10 Mix-and-match. Describe the distribution in the histograms below and match them to the box plots.
[Histograms (a), (b), (c) and box plots (1), (2), (3)]
2.11 Air quality. Daily air quality is measured by the air quality index (AQI) reported by the Environ-
mental Protection Agency. This index reports the pollution level and what associated health effects might
be a concern. The index is calculated for five major air pollutants regulated by the Clean Air Act and takes
values from 0 to 300, where a higher value indicates lower air quality. AQI was reported for a sample of
91 days in 2011 in Durham, NC. The relative frequency histogram below shows the distribution of the AQI
values on these days.16
[Relative frequency histogram of daily AQI values]
2.12 Median vs. mean. Estimate the median for the 400 observations shown in the histogram, and note
whether you expect the mean to be higher or lower than the median.
[Histogram of the 400 observations: horizontal axis from 40 to 100, vertical axis from 0 to 80]
2.13 Histograms vs. box plots. Compare the two plots below. What characteristics of the distribution
are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot
but not in the histogram?
[Histogram and box plot of the same data]
2.14 Facebook friends. Facebook data indicate that 50% of Facebook users have 100 or more friends,
and that the average friend count of users is 190. What do these findings suggest about the shape of the
distribution of number of friends of Facebook users?17
2.15 Distributions and appropriate statistics, Part I. For each of the following, state whether you expect
the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median
would best represent a typical observation in the data, and whether the variability of observations would be
best represented using the standard deviation or IQR. Explain your reasoning.
(a) Number of pets per household.
(b) Distance to work, i.e. number of miles between work and home.
(c) Heights of adult males.
2.16 Distributions and appropriate statistics, Part II. For each of the following, state whether you expect
the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median
would best represent a typical observation in the data, and whether the variability of observations would be
best represented using the standard deviation or IQR. Explain your reasoning.
(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below
$450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that
cost more than $6,000,000.
(b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below
$600,000, 75% of the houses cost below $900,000 and there are very few houses that cost more than $1,200,000.
(c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these
students don’t drink since they are under 21 years old, and only a few drink excessively.
(d) Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn
much higher salaries than all the other employees.
2.17 Income at the coffee shop. The first histogram below shows the distribution of the yearly incomes of
40 patrons at a college coffee shop. Suppose two new people walk into the coffee shop: one making $225,000
and the other $250,000. The second histogram shows the new income distribution. Summary statistics are
also provided.
[Histograms (1) and (2) of yearly income: (1) horizontal axis from $60k to $70k; (2) horizontal axis from $60k to $260k]
           (1)       (2)
n          40        42
Min.       60,680    60,680
1st Qu.    63,620    63,710
Median     65,240    65,350
Mean       65,090    73,300
3rd Qu.    66,160    66,540
Max.       69,890    250,000
SD         2,122     37,321
(a) Would the mean or the median best represent what we might think of as a typical income for the 42
patrons at this coffee shop? What does this say about the robustness of the two measures?
(b) Would the standard deviation or the IQR best represent the amount of variability in the incomes of the
42 patrons at this coffee shop? What does this say about the robustness of the two measures?
2.18 Midrange. The midrange of a distribution is defined as the average of the maximum and the minimum
of that distribution. Is this statistic robust to outliers and extreme skew? Explain your reasoning.
2.19 Commute times. The US census collects data on time it takes Americans to commute to work, among
many other variables. The histogram below shows the distribution of average commute times in 3,142 US
counties in 2010. Also shown below is a spatial intensity map of the same data.
[Histogram of mean work travel (in min), horizontal axis from 10 to 40; intensity map with legend from 4 to >33]
(a) Describe the numerical distribution and comment on whether or not a log transformation may be
advisable for these data.
(b) Describe the spatial distribution of commuting times using the map below.
2.20 Hispanic population. The US census collects data on race and ethnicity of Americans, among many
other variables. The histogram below shows the distribution of the percentage of the population that is
Hispanic in 3,142 counties in the US in 2010. Also shown is a histogram of logs of these values.
[Histograms of Hispanic % (0 to 100) and log(% Hispanic) (−2 to 4); intensity map with legend up to >40]
(a) Describe the numerical distribution and comment on why we might want to use log-transformed values
in analyzing or modeling these data.
(b) What features of the distribution of the Hispanic population in US counties are apparent in the map
but not in the histogram? What features are apparent in the histogram but not the map?
(c) Is one visualization more appropriate or helpful than the other? Explain your reasoning.
2.2. CONSIDERING CATEGORICAL DATA 61
In this section, we will introduce tables and other basic tools for categorical data that are
used throughout this book. The loan50 data set represents a sample from a larger loan data set
called loans. This larger data set contains information on 10,000 loans made through Lending Club.
We will examine the relationship between homeownership, which for the loans data can take a value
of rent, mortgage (owns but has a mortgage), or own, and app type, which indicates whether the
loan application was made with a partner or whether it was an individual application.
                         homeownership
                   rent   mortgage    own    Total
app type
    individual     3496       3839   1170     8505
    joint           362        950    183     1495
    Total          3858       4789   1353    10000

homeownership    Count
rent              3858
mortgage          4789
own               1353
Total            10000
Figure 2.18: A table summarizing the frequencies of each value for the
homeownership variable.
A bar plot is a common way to display a single categorical variable. The left panel of Figure 2.19
shows a bar plot for the homeownership variable. In the right panel, the counts are converted into
proportions, showing the proportion of observations that are in each level (e.g. 3858/10000 = 0.3858
for rent).
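The conversion from counts to proportions can be sketched in a few lines of Python. This is only a minimal illustration using the counts from Figure 2.18; the variable names are our own and are not part of the loans data set.

```python
# Counts for the homeownership variable, copied from Figure 2.18.
counts = {"rent": 3858, "mortgage": 4789, "own": 1353}

total = sum(counts.values())  # all loans in the data set

# Proportion of loans in each level: the count divided by the overall total.
proportions = {level: count / total for level, count in counts.items()}

for level, p in proportions.items():
    print(f"{level:>8}: {p:.4f}")
```

Because every loan falls into exactly one level, the three proportions add up to 1.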
[Figure: two bar plots of homeownership (rent, mortgage, own); the left panel's vertical axis is Frequency and the right panel's is Proportion.]
Figure 2.19: Two bar plots of homeownership. The left panel shows the counts, and the
right panel shows the proportions in each group.
Sometimes it is useful to understand the fractional breakdown of one variable in another, and we
can modify our contingency table to provide such a view. Figure 2.20 shows the row proportions
for Figure 2.17, which are computed as the counts divided by their row totals. The value 3496 at
the intersection of individual and rent is replaced by 3496/8505 = 0.411, i.e. 3496 divided by
its row total, 8505. So what does 0.411 represent? It corresponds to the proportion of individual
applicants who rent.
Figure 2.20: A contingency table with row proportions for the app type and
homeownership variables. The row total is off by 0.001 for the joint row due
to a rounding error.
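As a rough sketch of how row proportions are obtained, the Python below divides each count by its row total and rounds to three decimals. The counts are those given for app type and homeownership; the dictionary layout and names are our own choices for illustration.

```python
# Counts of loans: rows are app type, columns are homeownership levels.
levels = ["rent", "mortgage", "own"]
table = {
    "individual": [3496, 3839, 1170],
    "joint": [362, 950, 183],
}

# Row proportions: each count divided by the total of its own row.
row_props = {}
for app_type, row in table.items():
    row_total = sum(row)
    row_props[app_type] = [round(count / row_total, 3) for count in row]

for app_type, props in row_props.items():
    print(app_type, dict(zip(levels, props)), "row sum:", round(sum(props), 3))
```

The rounded proportions in the joint row sum to 0.999 rather than 1, which is the rounding discrepancy mentioned in the caption.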
A contingency table of the column proportions is computed in a similar way, where each column
proportion is computed as the count divided by the corresponding column total. Figure 2.21 shows
such a table, and here the value 0.906 indicates that 90.6% of renters applied as individuals for the
loan. This rate is higher compared to loans from people with mortgages (80.2%) or who own their
home (86.5%). Because these rates vary between the three levels of homeownership (rent, mortgage,
own), this provides evidence that the app type and homeownership variables are associated.
Figure 2.21: A contingency table with column proportions for the app type and
homeownership variables. The total for the last column is off by 0.001 due to a
rounding error.
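Column proportions can be sketched the same way, except that each count is divided by its column total. The Python below is a minimal illustration using the app type and homeownership counts; the names are ours.

```python
# Counts of loans by homeownership level for each application type.
levels = ["rent", "mortgage", "own"]
individual = [3496, 3839, 1170]
joint = [362, 950, 183]

# Column proportions: each count divided by the total of its own column.
col_props = {}
for level, n_ind, n_joint in zip(levels, individual, joint):
    col_total = n_ind + n_joint
    col_props[level] = {
        "individual": round(n_ind / col_total, 3),
        "joint": round(n_joint / col_total, 3),
    }

for level, props in col_props.items():
    print(f"{level:>8}: {props}")
```

Reading across the individual entries gives the three rates compared in the text: 0.906 for renters, 0.802 for borrowers with a mortgage, and 0.865 for owners.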
We could also have checked for an association between app type and homeownership in Figure 2.20
using row proportions. When comparing these row proportions, we would look down
columns to see if the fraction of loans where the borrower rents, has a mortgage, or owns varied
between the individual and joint application types.
EXAMPLE 2.25
Data scientists use statistics to filter spam from incoming email messages. By noting specific char-
acteristics of an email, a data scientist may be able to classify some emails as spam or not spam with
high accuracy. One such characteristic is whether the email contains no numbers, small numbers, or
big numbers. Another characteristic is the email format, which indicates whether or not an email
has any HTML content, such as bolded text. We’ll focus on email format and spam status using the
email data set, and these variables are summarized in a contingency table in Figure 2.22. Which
would be more helpful to someone hoping to classify email as spam or regular email for this table:
row or column proportions?
A data scientist would be interested in how the proportion of spam changes within each email
format. This corresponds to column proportions: the proportion of spam in plain text emails and
the proportion of spam in HTML emails.
If we generate the column proportions, we can see that a higher fraction of plain text emails are spam
(209/1195 = 17.5%) than of HTML emails (158/2726 = 5.8%). This information on its
own is insufficient to classify an email as spam or not spam, as over 80% of plain text emails are not
spam. Yet, when we carefully combine this information with many other characteristics, we stand a
reasonable chance of being able to classify some emails as spam or not spam with confidence.
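The two spam rates quoted above can be reproduced with a short calculation. The sketch below uses only the four counts that appear in the text; the variable names are our own.

```python
# Counts quoted in the text: spam messages and total messages for each format.
spam_plain, total_plain = 209, 1195   # plain text emails
spam_html, total_html = 158, 2726     # HTML emails

# Proportion of spam within each email format (a column proportion).
p_spam_plain = spam_plain / total_plain
p_spam_html = spam_html / total_html

print(f"plain text: {p_spam_plain:.1%} spam, {1 - p_spam_plain:.1%} not spam")
print(f"HTML:       {p_spam_html:.1%} spam, {1 - p_spam_html:.1%} not spam")
```

The not-spam share for plain text emails is what makes format alone a weak classifier: most plain text messages are still legitimate.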
Example 2.25 points out that row and column proportions are not equivalent. Before settling
on one form for a table, it is important to consider each to ensure that the most useful table is
constructed. However, sometimes it simply isn’t clear which, if either, is more useful.
EXAMPLE 2.26
Look back to Figures 2.20 and 2.21. Are there any obvious scenarios where one might be more useful
than the other?
None that we thought were obvious! What is distinct about app type and homeownership vs the
email example is that these two variables don’t have a clear explanatory-response variable relation-
ship that we might hypothesize (see Section 1.2.4 for these terms). Usually it is most useful to
“condition” on the explanatory variable. For instance, in the email example, the email format was
seen as a possible explanatory variable of whether the message was spam, so we would find it more
interesting to compute the relative frequencies (proportions) for each email format.
18 (a) 0.451 represents the proportion of individual applicants who have a mortgage. (b) 0.802 represents the
[Figure: three bar plots of homeownership (rent, mortgage, own) broken down by app type (individual, joint). Panels (a) and (b) show Frequency; panel (c) shows Proportion.]
Figure 2.23: (a) Stacked bar plot for homeownership, where the counts have been
further broken down by app type. (b) Side-by-side bar plot for homeownership
and app type. (c) Standardized version of the stacked bar plot.
EXAMPLE 2.27
Examine the three bar plots in Figure 2.23. When is the stacked, side-by-side, or standardized
stacked bar plot the most useful?
The stacked bar plot is most useful when it’s reasonable to assign one variable as the explanatory
variable and the other variable as the response, since we are effectively grouping by one variable first
and then breaking it down by the others.
Side-by-side bar plots are more agnostic in their display about which variable, if any, represents the
explanatory and which the response variable. It is also easy to discern the number of cases in each of the
six different group combinations. However, one downside is that they tend to require more horizontal
space; the narrowness of Figure 2.23(b) makes the plot feel a bit cramped. Additionally, when two
groups are of very different sizes, as we see in the own group relative to either of the other two
groups, it is difficult to discern if there is an association between the variables.
The standardized stacked bar plot is helpful if the primary variable in the stacked bar plot is relatively
imbalanced, e.g. the own category has only a third of the observations in the mortgage category,
making the simple stacked bar plot less useful for checking for an association. The major downside
of the standardized version is that we lose all sense of how many cases each of the bars represents.
[Figure: two mosaic plots, panels (a) and (b); the segments in panel (b) are labeled indiv. and joint.]
Figure 2.24: (a) The one-variable mosaic plot for homeownership. (b) Two-variable
mosaic plot for both homeownership and app type.
To create a completed mosaic plot, the single-variable mosaic plot is further divided into pieces
in Figure 2.24(b) using the app type variable. Each column is split proportional to the number
of loans from individual and joint borrowers. For example, the second column represents loans
where the borrower has a mortgage, and it was divided into individual loans (upper) and joint loans
(lower). As another example, the bottom segment of the third column represents loans where the
borrower owns their home and applied jointly, while the upper segment of this column represents
borrowers who are homeowners and filed individually. We can again use this plot to see that the
homeownership and app type variables are associated, since some columns are divided in different
vertical locations than others, which was the same technique used for checking an association in the
standardized stacked bar plot.
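The geometry of the two-variable mosaic plot can be sketched numerically: each column's width is its share of all loans, and each column is cut at a height equal to the share of its loans that came from individual applicants. The Python below computes these dimensions from the app type and homeownership counts; it does not draw anything, and the field names are our own.

```python
# Counts of loans by homeownership level for each application type.
levels = ["rent", "mortgage", "own"]
individual = [3496, 3839, 1170]
joint = [362, 950, 183]

grand_total = sum(individual) + sum(joint)

# Column width: that level's share of all loans.
# Segment heights: the split of the column between individual and joint loans.
mosaic = []
for level, n_ind, n_joint in zip(levels, individual, joint):
    column_total = n_ind + n_joint
    mosaic.append({
        "level": level,
        "width": column_total / grand_total,
        "individual_height": n_ind / column_total,
        "joint_height": n_joint / column_total,
    })

for column in mosaic:
    print(f"{column['level']:>8}: width={column['width']:.3f}, "
          f"individual={column['individual_height']:.3f}, "
          f"joint={column['joint_height']:.3f}")
```

If the two variables were unrelated, the individual heights would be about the same in every column; here they differ, which is why the columns are divided at different vertical locations.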
In Figure 2.24, we chose to first split by the homeowner status of the borrower. However, we
could have instead first split by the application type, as in Figure 2.25. As with the bar plots, it’s
common to use the explanatory variable to represent the first split in a mosaic plot, and then for the
response to break up each level of the explanatory variable, if these labels are reasonable to attach
to the variables under consideration.
[Figure: mosaic plot with columns for indiv. and joint, each divided into rent, mortgage, and own.]
Figure 2.25: Mosaic plot where loans are grouped by the homeownership variable
after they’ve been divided into the individual and joint application types.
2.2.5 The only pie chart you will see in this book
A pie chart is shown in Figure 2.26 alongside a bar plot representing the same information.
Pie charts can be useful for giving a high-level overview to show how a set of cases breaks down.
However, it is also difficult to decipher details in a pie chart. For example, it takes a couple seconds
longer to recognize that there are more loans where the borrower has a mortgage than rent when
looking at the pie chart, while this detail is very obvious in the bar plot. While pie charts can be
useful, we prefer bar plots for their ease in comparing groups.
[Figure 2.26: a pie chart and a bar plot of homeownership (rent, mortgage, own); the bar plot's vertical axis is Frequency.]
Figure 2.27: In this table, median household income (in $1000s) from a random
sample of 100 counties that had population gains are shown on the left. Median
incomes from a random sample of 50 counties that had no population gain are
shown on the right.
[Figure: side-by-side box plots and hollow histograms of median household income for the Gain and No Gain groups.]
Figure 2.28: Side-by-side box plot (left panel) and hollow histograms (right panel)
for med hh income, where the counties are split by whether there was a population
gain or loss.
The side-by-side box plot is a traditional tool for comparing across groups. An example is
shown in the left panel of Figure 2.28, where there are two box plots, one for each group, placed
into one plotting window and drawn on the same scale.
Another useful plotting method uses hollow histograms to compare numerical data across
groups. These are just the outlines of histograms of each group put on the same plot, as shown in
the right panel of Figure 2.28.
20 Answers may vary a little. The counties with population gains tend to have higher income (median of about
$45,000) versus counties without a gain (median of about $40,000). The variability is also slightly larger for the
population gain group. This is evident in the IQR, which is about 50% bigger in the gain group. Both distributions
show slight to moderate right skew and are unimodal. The box plots indicate there are many observations far above
the median in each group, though we should anticipate that many observations will fall beyond the whiskers when
examining any data set that contains more than a couple hundred data points.
21 Answers will vary. The side-by-side box plots are especially useful for comparing centers and spreads, while the
hollow histograms are more useful for seeing distribution shape, skew, and potential anomalies.
Exercises
2.21 Antibiotic use in children. The bar plot and the pie chart below show the distribution of pre-existing
medical conditions of children involved in a study on the optimal duration of antibiotic use in treatment of
tracheitis, which is an upper respiratory infection.
[Figure: bar plot and pie chart of pre-existing conditions, with categories Prematurity, Cardiovascular, Respiratory, Trauma, Neuromuscular, Gastrointestinal, Genetic/metabolic, and Immunocompromised.]
(a) What features are apparent in the bar plot but not in the pie chart?
(b) What features are apparent in the pie chart but not in the bar plot?
(c) Which graph would you prefer to use for displaying these categorical data?
2.22 Views on immigration. 910 randomly sampled registered voters from Tampa, FL were asked if they
thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for
US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US
citizenship, or (iii) lose their jobs and have to leave the country. The results of the survey by political
ideology are shown below.22
                                      Political ideology
Response                      Conservative   Moderate   Liberal   Total
(i) Apply for citizenship               57        120       101     278
(ii) Guest worker                      121        113        28     262
(iii) Leave the country                179        126        45     350
(iv) Not sure                           15          4         1      20
Total                                  372        363       175     910
2.23 Views on the DREAM Act. A random sample of registered voters from Tampa, FL were asked
if they support the DREAM Act, a proposed law which would provide a path to citizenship for people
brought illegally to the US as children. The survey also collected information on the political ideology of the
respondents. Based on the mosaic plot shown below, do views on the DREAM Act and political ideology
appear to be independent? Explain your reasoning.23
[Figure: mosaic plot of views on the DREAM Act (Support, Not support, Not sure) by political ideology.]
2.24 Raise taxes. A random sample of registered voters nationally were asked whether they think it’s
better to raise taxes on the rich or raise taxes on the poor. The survey also collected information on the
political party affiliation of the respondents. Based on the mosaic plot shown below, do views on raising
taxes and political affiliation appear to be independent? Explain your reasoning.24
EXAMPLE 2.30
Suppose your professor splits the students in class into two groups: students on the left and students
on the right. If p̂L and p̂R represent the proportion of students who own an Apple product on the
left and right, respectively, would you be surprised if p̂L did not exactly equal p̂R ?
While the proportions would probably be close to each other, it would be unusual for them to be
exactly the same. We would probably observe a small difference due to chance.
                          outcome
                 infection   no infection   Total
treatment
    vaccine              5              9      14
    placebo              6              0       6
    Total               11              9      20
In this study, a smaller proportion of patients who received the vaccine showed signs of an
infection (35.7% versus 100%). However, the sample is very small, and it is unclear whether the
difference provides convincing evidence that the vaccine is effective.
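The infection rates and their difference follow directly from the table. The short Python sketch below reproduces the arithmetic; the dictionary names are ours.

```python
# Counts from the study: infections and group sizes.
infected = {"vaccine": 5, "placebo": 6}
group_size = {"vaccine": 14, "placebo": 6}

# Infection rate in each group: infections divided by group size.
rate = {group: infected[group] / group_size[group] for group in infected}

# Difference in infection rates, control (placebo) minus treatment (vaccine).
difference = rate["placebo"] - rate["vaccine"]

print(f"vaccine: {rate['vaccine']:.1%}, placebo: {rate['placebo']:.1%}")
print(f"difference: {difference:.1%}")
```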
EXAMPLE 2.33
Data scientists are sometimes called upon to evaluate the strength of evidence. When looking at
the rates of infection for patients in the two groups in this study, what comes to mind as we try to
determine whether the data show convincing evidence of a real difference?
The observed infection rates (35.7% for the treatment group versus 100% for the control group)
suggest the vaccine may be effective. However, we cannot be sure if the observed difference represents
the vaccine’s efficacy or is just from random chance. Generally there is a little bit of fluctuation in
sample data, and we wouldn’t expect the sample proportions to be exactly equal, even if the truth
was that the infection rates were independent of getting the vaccine. Additionally, with such small
samples, perhaps it’s common to observe such large differences when we randomly split a group due
to chance alone!
Example 2.33 is a reminder that the observed outcomes in the data sample may not perfectly
reflect the true relationships between variables since there is random noise. While the observed
difference in rates of infection is large, the sample size for the study is small, making it unclear if
this observed difference represents efficacy of the vaccine or whether it is simply due to chance. We
label these two competing claims, H0 and HA , which are spoken as “H-nought” and “H-A”:
H0 : Independence model. The variables treatment and outcome are independent. They have
no relationship, and the observed difference between the proportion of patients who developed
an infection in the two groups, 64.3%, was due to chance.
HA : Alternative model. The variables are not independent. The difference in infection rates of
64.3% was not due to chance, and the vaccine affected the rate of infection.
What would it mean if the independence model, which says the vaccine had no influence on the
rate of infection, is true? It would mean 11 patients were going to develop an infection no matter
which group they were randomized into, and 9 patients would not develop an infection no matter
which group they were randomized into. That is, if the vaccine did not affect the rate of infection,
the difference in the infection rates was due to chance alone in how the patients were randomized.
Now consider the alternative model: infection rates were influenced by whether a patient re-
ceived the vaccine or not. If this was true, and especially if this influence was substantial, we would
expect to see some difference in the infection rates of patients in the groups.
We choose between these two competing claims by assessing if the data conflict so much with
H0 that the independence model cannot be deemed reasonable. If this is the case, and the data
support HA , then we will reject the notion of independence and conclude the vaccine was effective.
We’re going to implement simulations, where we will pretend we know that the malaria vaccine
being tested does not work. Ultimately, we want to understand if the large difference we observed
is common in these simulations. If it is common, then maybe the difference we observed was purely
due to chance. If it is very uncommon, then the possibility that the vaccine was helpful seems more
plausible.
Figure 2.29 shows that 11 patients developed infections and 9 did not. For our simulation,
we will suppose the infections were independent of the vaccine and we were able to rewind back
to when the researchers randomized the patients in the study. If we happened to randomize the
patients differently, we may get a different result in this hypothetical world where the vaccine doesn’t
influence the infection. Let’s complete another randomization using a simulation.
2.3. CASE STUDY: MALARIA VACCINE 73
In this simulation, we take 20 notecards to represent the 20 patients, where we write down
“infection” on 11 cards and “no infection” on 9 cards. In this hypothetical world, we believe each
patient that got an infection was going to get it regardless of which group they were in, so let’s see
what happens if we randomly assign the patients to the treatment and control groups again. We
thoroughly shuffle the notecards and deal 14 into a vaccine pile and 6 into a placebo pile. Finally,
we tabulate the results, which are shown in Figure 2.30.
                          outcome
                 infection   no infection   Total
treatment (simulated)
    vaccine              7              7      14
    placebo              4              2       6
    Total               11              9      20
Figure 2.30: Simulation results, where any difference in infection rates is purely
due to chance.
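The notecard procedure translates almost directly into code. The following Python is a minimal sketch of a single shuffle; the seed is arbitrary and the variable names are our own, so the piles it deals will generally differ from those in Figure 2.30.

```python
import random

# 20 notecards: 11 marked "infection" and 9 marked "no infection".
cards = ["infection"] * 11 + ["no infection"] * 9

rng = random.Random(2024)   # arbitrary seed so the shuffle is repeatable
rng.shuffle(cards)

# Deal 14 cards into the vaccine pile and the remaining 6 into the placebo pile.
vaccine_pile, placebo_pile = cards[:14], cards[14:]

vaccine_rate = vaccine_pile.count("infection") / len(vaccine_pile)
placebo_rate = placebo_pile.count("infection") / len(placebo_pile)
difference = placebo_rate - vaccine_rate   # control rate minus treatment rate

print(f"vaccine: {vaccine_pile.count('infection')} of 14 infected")
print(f"placebo: {placebo_pile.count('infection')} of 6 infected")
print(f"simulated difference: {difference:+.3f}")
```

Since the cards are dealt without regard to what is written on them, any difference this produces is due to chance alone.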
We computed one possible difference under the independence model in Guided Practice 2.34,
which represents one difference due to chance. While in this first simulation, we physically dealt out
notecards to represent the patients, it is more efficient to perform this simulation using a computer.
Repeating the simulation on a computer, we get another difference due to chance:
2/6 − 9/14 = −0.310
And another:
3/6 − 8/14 = −0.071
And so on until we repeat the simulation enough times that we have a good idea of what represents
the distribution of differences from chance alone. Figure 2.31 shows a stacked plot of the differences
found from 100 simulations, where each dot represents a simulated difference between the infection
rates (control rate minus treatment rate).
Note that the distribution of these simulated differences is centered around 0. We simulated
these differences assuming that the independence model was true, and under this condition, we
expect the difference to be near zero with some random fluctuation, where near is pretty generous
in this case since the sample sizes are so small in this study.
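To see this centering numerically, the shuffle can be repeated many times on a computer. The sketch below is one way to do so in Python; it uses 10,000 repetitions rather than 100 so that the average is stable, and the function name and seed are our own assumptions.

```python
import random
from statistics import mean, stdev

def simulated_difference(rng):
    """Shuffle the 20 outcome cards; return placebo rate minus vaccine rate."""
    cards = [1] * 11 + [0] * 9          # 1 = infection, 0 = no infection
    rng.shuffle(cards)
    vaccine, placebo = cards[:14], cards[14:]
    return sum(placebo) / 6 - sum(vaccine) / 14

rng = random.Random(7)                  # arbitrary seed for repeatability
differences = [simulated_difference(rng) for _ in range(10_000)]

print(f"mean of simulated differences: {mean(differences):+.3f}")
print(f"standard deviation:            {stdev(differences):.3f}")
```

The mean lands close to zero, while the standard deviation is sizable, which reflects how generous "near zero" has to be with groups this small.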
EXAMPLE 2.35
How often would you observe a difference of at least 64.3% (0.643) according to Figure 2.31? Often,
sometimes, rarely, or never?
It appears that a difference of at least 64.3% due to chance alone would only happen about 2% of
the time according to Figure 2.31. Such a low probability indicates a rare event.
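As a cross-check on reading the dot plot, the chance of such a difference can also be computed exactly by counting, which is a different technique from the simulation used in this section. Under the independence model every choice of 6 cards for the placebo pile is equally likely, and a difference of at least 0.643 requires all 6 of them to be infection cards. The Python below is a sketch of that count; the variable names are ours.

```python
from math import comb

n_patients, n_infected, n_placebo = 20, 11, 6

# A difference of at least 0.643 requires all 6 placebo cards to say "infection"
# (with only 5, the difference would be 5/6 - 6/14, about 0.405).
ways_all_infected = comb(n_infected, n_placebo)   # pick the 6 from the 11 infection cards
ways_total = comb(n_patients, n_placebo)          # pick any 6 of the 20 cards

probability = ways_all_infected / ways_total
print(f"{ways_all_infected} / {ways_total} = {probability:.4f}")
```

The exact value is a little over 1%, which is in line with seeing such a difference in only a couple of 100 simulations.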
27 4/6 − 7/14 = 0.167 or about 16.7% in favor of the vaccine. This difference due to chance is much smaller than
Figure 2.31: A stacked dot plot of differences from 100 simulations produced under
the independence model, H0 , where in these simulations infections are unaffected
by the vaccine. Two of the 100 simulations had a difference of at least 64.3%, the
difference observed in the study.
The difference of 64.3% being a rare event suggests two possible interpretations of the results
of the study:
H0 Independence model. The vaccine has no effect on infection rate, and we just happened to
observe a difference that would only occur on a rare occasion.
HA Alternative model. The vaccine has an effect on infection rate, and the difference we
observed was actually due to the vaccine being effective at combatting malaria, which explains
the large difference of 64.3%.
Based on the simulations, we have two options. (1) We conclude that the study results do not
provide strong evidence against the independence model. That is, we do not have sufficiently strong
evidence to conclude the vaccine had an effect in this clinical setting. (2) We conclude the evidence
is sufficiently strong to reject H0 and assert that the vaccine was useful. When we conduct formal
studies, usually we reject the notion that we just happened to observe a rare event.28 So in this case,
we reject the independence model in favor of the alternative. That is, we are concluding the data
provide strong evidence that the vaccine provides some protection against malaria in this clinical
setting.
One field of statistics, statistical inference, is built on evaluating whether such differences are
due to chance. In statistical inference, data scientists evaluate which model is most reasonable given
the data. Errors do occur, just like rare events, and we might choose the wrong model. While we
do not always choose correctly, statistical inference gives us tools to control and evaluate how often
these errors occur. In Chapter 5, we give a formal introduction to the problem of model selection.
We spend the next two chapters building a foundation of probability and theory necessary to make
that discussion rigorous.
28 This reasoning does not generally extend to anecdotal observations. Each of us observes incredibly rare events
every day, events we could not possibly hope to predict. However, in the non-rigorous setting of anecdotal evidence,
almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is
treacherous. For example, we might look at the lottery: there was only a 1 in 292 million chance that the Powerball
numbers for the largest jackpot in history (January 13th, 2016) would be (04, 08, 19, 27, 34) with a Powerball of
(10), but nonetheless those numbers came up! However, no matter what numbers had turned up, they would have
had the same incredibly rare odds. That is, any set of numbers we could have observed would ultimately be incredibly
rare. This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we
consider every alternative, those outcomes are also incredibly rare. We should be cautious not to misinterpret such
anecdotal evidence.
Exercises
2.25 Side effects of Avandia. Rosiglitazone is the active ingredient in the controversial type 2 diabetes
medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke,
heart failure, and death. A common alternative treatment is pioglitazone, the active ingredient in a diabetes
medicine called Actos. In a nationwide retrospective observational study of 227,571 Medicare beneficiaries
aged 65 years or older, it was found that 2,593 of the 67,593 patients using rosiglitazone and 5,386 of
the 159,978 using pioglitazone had serious cardiovascular problems. These data are summarized in the
contingency table below.29
                    Cardiovascular problems
Treatment              Yes        No      Total
Rosiglitazone        2,593    65,000     67,593
Pioglitazone         5,386   154,592    159,978
Total                7,979   219,592    227,571
(a) Determine if each of the following statements is true or false. If false, explain why. Be careful: The
reasoning may be wrong even if the statement’s conclusion is correct. In such cases, the statement
should be considered false.
i. Since more patients on pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude
that the rate of cardiovascular problems for those on a pioglitazone treatment is higher.
ii. The data suggest that diabetic patients who are taking rosiglitazone are more likely to have cardio-
vascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8% for patients on this
treatment, while it was only (5,386 / 159,978 = 0.034) 3.4% for patients on pioglitazone.
iii. The fact that the rate of incidence is higher for the rosiglitazone group proves that rosiglitazone
causes serious cardiovascular problems.
iv. Based on the information provided so far, we cannot tell if the difference between the rates of
incidences is due to a relationship between the two variables or due to chance.
(b) What proportion of all patients had cardiovascular problems?
(c) If the type of treatment and having cardiovascular problems were independent, about how many patients
in the rosiglitazone group would we expect to have had cardiovascular problems?
(d) We can investigate the relationship between outcome and treatment in this study using a randomization
technique. While in reality we would carry out the simulations required for randomization using statisti-
cal software, suppose we actually simulate using index cards. In order to simulate from the independence
model, which states that the outcomes were independent of the treatment, we write whether or not each
patient had a cardiovascular problem on cards, shuffle all the cards together, then deal them into two
groups of size 67,593 and 159,978. We repeat this simulation 1,000 times and each time record the num-
ber of people in the rosiglitazone group who had cardiovascular problems. Use the relative frequency
histogram of these counts to answer (i)-(iii).
i. What are the claims being tested?
ii. Compared to the number calculated in part (b), which would provide more support for the
alternative hypothesis, more or fewer patients with cardiovascular problems in the rosiglitazone group?
iii. What do the simulation results suggest about the relationship between taking rosiglitazone and
having cardiovascular problems in diabetic patients?
[Figure: relative frequency histogram of simulated rosiglitazone cardiovascular events; horizontal axis from about 2250 to 2450, vertical axis from 0 to 0.2.]
29 D.J. Graham et al. “Risk of acute myocardial infarction, stroke, heart failure, and death in elderly Medicare
patients treated with rosiglitazone or pioglitazone”. In: JAMA 304.4 (2010), p. 411. issn: 0098-7484.
2.26 Heart transplants. The Stanford University Heart Transplant Study was conducted to determine
whether an experimental heart transplant program increased lifespan. Each patient entering the program
was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely
benefit from a new heart. Some patients got a transplant and some did not. The variable transplant
indicates which group the patients were in; patients in the treatment group got a transplant and those in the
control group did not. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment
group, 45 died. Another variable called survived was used to indicate whether or not the patient was alive
at the end of the study.30
[Figure: mosaic plot of survival status (alive, dead) by group (control, treatment), and side-by-side box plots for the control and treatment groups.]
(a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain
your reasoning.
(b) What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment?
(c) What proportion of patients in the treatment group and what proportion of patients in the control group
died?
(d) One approach for investigating whether or not the treatment is effective is to use a randomization
technique.
i. What are the claims being tested?
ii. The paragraph below describes the setup for such an approach, if we were to do it without using
statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
We write alive on ______ cards representing patients who were alive at the end of
the study, and dead on ______ cards representing patients who were not. Then,
we shuffle these cards and split them into two groups: one group of size ______
representing treatment, and another group of size ______ representing control. We
calculate the difference between the proportion of dead cards in the treatment and control
groups (treatment - control) and record this value. We repeat this 100 times to build a
distribution centered at ______. Lastly, we calculate the fraction of simulations
where the simulated differences in proportions are ______. If this fraction is low,
we conclude that it is unlikely to have observed such an outcome by chance and that the
null hypothesis should be rejected in favor of the alternative.
iii. What do the simulation results shown below suggest about the effectiveness of the transplant pro-
gram?
[Figure: stacked dot plot of simulated differences in proportions.]
30 B. Turnbull et al. “Survivorship of Heart Transplant Data”. In: Journal of the American Statistical Association
Chapter exercises
2.27 Make-up exam. In a class of 25 students, 24 of them took an exam in class and 1 student took a
make-up exam the following day. The professor graded the first batch of 24 exams and found an average
score of 74 points with a standard deviation of 8.9 points. The student who took the make-up the following
day scored 64 points on the exam.
(a) Does the new student’s score increase or decrease the average score?
(b) What is the new average?
(c) Does the new student’s score increase or decrease the standard deviation of the scores?
2.28 Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live
births. This rate is often used as an indicator of the level of health in a country. The relative frequency
histogram below shows the distribution of estimated infant death rates for 224 countries for which such data
were available in 2014.31
(a) Estimate Q1, the median, and Q3 from the histogram.
(b) Would you expect the mean of this data set to be smaller or larger than the median?
Explain your reasoning.
[Figure: relative frequency histogram of infant mortality (per 1000 live births), from 0 to 120; vertical axis: fraction of countries, from 0 to 0.4.]
2.29 TV watchers. Students in an AP Statistics class were asked how many hours of television they
watch per week (including online streaming). This sample yielded an average of 4.71 hours, with a standard
deviation of 4.18 hours. Is the distribution of number of hours students watch television weekly symmetric?
If not, what shape would you expect this distribution to have? Explain your reasoning.
2.30 A new statistic. The statistic x̄/median can be used as a measure of skewness. Suppose we have a
distribution where all observations are greater than 0, xi > 0. What is the expected shape of the distribution
under the following conditions? Explain your reasoning.
(a) x̄/median = 1
(b) x̄/median < 1
(c) x̄/median > 1
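To build intuition for this ratio, it can help to compute it on a few tiny data sets whose shape is
obvious. The sketch below does that. The three data sets are made up for illustration only, and the
function name mean_median_ratio is not from the text.

```python
from statistics import mean, median

def mean_median_ratio(values):
    """Ratio of the sample mean to the sample median (all values must be > 0)."""
    if min(values) <= 0:
        raise ValueError("all observations must be greater than 0")
    return mean(values) / median(values)

# Small made-up data sets, chosen only to illustrate each shape.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]    # one large value stretches the right tail
left_skewed = [1, 17, 18, 19, 20]  # one small value stretches the left tail

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(name, round(mean_median_ratio(data), 2))
```

The mean is pulled toward the long tail while the median barely moves, so the ratio moves away from 1 in
the direction of the skew.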
2.31 Oscar winners. The first Oscar awards for best actor and best actress were given out in 1929. The
histograms below show the age distribution for all of the best actor and best actress winners from 1929 to
2018. Summary statistics for these distributions are also provided. Compare the distributions of ages of
best actor and actress winners.32
[Figure: histograms of age (in years, 20 to 80) for best actress and best actor winners.]
        Best Actress    Best Actor
Mean    36.2            43.8
SD      11.9            8.83
n       92              92
31 CIA Factbook, Country Comparisons, 2014.
32 Oscar winners from 1929 – 2012, data up to 2009 from the Journal of Statistics Education data archive and more
current data from wikipedia.org.
2.32 Exam scores. The average on a history exam (scored out of 100 points) was 85, with a standard
deviation of 15. Is the distribution of the scores on this exam symmetric? If not, what shape would you
expect this distribution to have? Explain your reasoning.
2.33 Stats scores. Below are the final exam scores of twenty introductory statistics students.
57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94
Create a box plot of the distribution of these scores. The five number summary provided below may be
useful.
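The five-number summary can also be recomputed directly from the scores listed in the exercise. The sketch
below assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, and it
assumes the common rule that whiskers reach no farther than 1.5 × IQR from the box. Other software may use
a different quartile convention and give slightly different values. The function name
five_number_summary is chosen for this example.

```python
from statistics import median

scores = [57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
          79, 79, 81, 81, 82, 83, 83, 88, 89, 94]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are the medians of the lower and upper halves of the sorted
    data; for an odd count the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary(scores)
print(summary)  # (57, 72.5, 78.5, 82.5, 94)

# Assumed 1.5 x IQR rule for deciding which points are drawn individually.
q1, q3 = summary[1], summary[3]
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(outliers)  # [57]
```

With these conventions the box runs from 72.5 to 82.5 with the median line at 78.5, the score of 57 is
plotted as a separate point, and the whiskers end at 66 and 94.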
2.34 Marathon winners. The histogram and box plots below show the distribution of finishing times for
male and female winners of the New York Marathon between 1970 and 1999.
[Figure: histogram of marathon times (horizontal axis 2.0 to 3.2, counts 0 to 20) and box plot of
marathon times (2.0 to 3.2).]
(a) What features of the distribution are apparent in the histogram and not the box plot? What features
are apparent in the box plot but not in the histogram?
(b) What may be the reason for the bimodal distribution? Explain.
(c) Compare the distribution of marathon times for men and women based on the box plot shown below.
[Figure: side-by-side box plots of marathon times for men and women.]
(d) The time series plot shown below is another way to look at these data. Describe what is visible in this
plot but not in the others.
[Figure: time series plot of marathon times (2.0 to 3.2) by year (1970 to 2000), with separate series for
women and men.]