Contents

Preface

1 Introduction

2 Statistical Learning
  2.1 What Is Statistical Learning?
    2.1.1 Why Estimate f?
    2.1.2 How Do We Estimate f?
    2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
    2.1.4 Supervised Versus Unsupervised Learning
    2.1.5 Regression Versus Classification Problems
  2.2 Assessing Model Accuracy
    2.2.1 Measuring the Quality of Fit
    2.2.2 The Bias-Variance Trade-Off
    2.2.3 The Classification Setting
  2.3 Lab: Introduction to Python
    2.3.1 Getting Started
    2.3.2 Basic Commands
    2.3.3 Introduction to Numerical Python
    2.3.4 Graphics
    2.3.5 Sequences and Slice Notation
    2.3.6 Indexing Data
    2.3.7 Loading Data
    2.3.8 For Loops
    2.3.9 Additional Graphical and Numerical Summaries
  2.4 Exercises

3 Linear Regression
  3.1 Simple Linear Regression
    3.1.1 Estimating the Coefficients
    3.1.2 Assessing the Accuracy of the Coefficient Estimates
    3.1.3 Assessing the Accuracy of the Model
  3.2 Multiple Linear Regression
    3.2.1 Estimating the Regression Coefficients

4 Classification
  4.1 An Overview of Classification
  4.2 Why Not Linear Regression?
  4.3 Logistic Regression
    4.3.1 The Logistic Model
    4.3.2 Estimating the Regression Coefficients
    4.3.3 Making Predictions
    4.3.4 Multiple Logistic Regression
    4.3.5 Multinomial Logistic Regression
  4.4 Generative Models for Classification
    4.4.1 Linear Discriminant Analysis for p = 1
    4.4.2 Linear Discriminant Analysis for p > 1
    4.4.3 Quadratic Discriminant Analysis
    4.4.4 Naive Bayes
  4.5 A Comparison of Classification Methods
    4.5.1 An Analytical Comparison
    4.5.2 An Empirical Comparison
  4.6 Generalized Linear Models
    4.6.1 Linear Regression on the Bikeshare Data
    4.6.2 Poisson Regression on the Bikeshare Data
    4.6.3 Generalized Linear Models in Greater Generality
  4.7 Lab: Logistic Regression, LDA, QDA, and KNN
    4.7.1 The Stock Market Data
    4.7.2 Logistic Regression
    4.7.3 Linear Discriminant Analysis
    4.7.4 Quadratic Discriminant Analysis
    4.7.5 Naive Bayes
    4.7.6 K-Nearest Neighbors
    4.7.7 Linear and Poisson Regression on the Bikeshare Data
  4.8 Exercises

Index
1
Introduction
Wage Data
In this application (which we refer to as the Wage data set throughout this
book), we examine a number of factors that relate to wages for a group of
men from the Atlantic region of the United States. In particular, we wish
to understand how an employee’s age and education, as well as the calendar
year, are associated with his wage. Consider, for example, the left-hand
panel of Figure 1.1, which displays wage versus age for each of the individu-
als in the data set. There is evidence that wage increases with age but then
decreases again after approximately age 60. The blue line, which provides
an estimate of the average wage for a given age, makes this trend clearer.
Given an employee’s age, we can use this curve to predict his wage. However,
it is also clear from Figure 1.1 that there is a significant amount of vari-
ability associated with this average value, and so age alone is unlikely to
provide an accurate prediction of a particular man’s wage.
FIGURE 1.1. Wage data, which contains income survey information for men
from the central Atlantic region of the United States. Left: wage as a function of
age. On average, wage increases with age until about 60 years of age, at which
point it begins to decline. Center: wage as a function of year. There is a slow
but steady increase of approximately $10,000 in the average wage between 2003
and 2009. Right: Boxplots displaying wage as a function of education, with 1
indicating the lowest level (no high school diploma) and 5 the highest level (an
advanced graduate degree). On average, wage increases with the level of education.
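As a rough illustration of how a panel like the left-hand one might be produced, here is a minimal Python sketch. It assumes the ISLP package used in this book’s labs to load the Wage data; the lowess smoother from statsmodels is an assumed stand-in for the blue curve’s estimate of average wage given age, not necessarily the book’s exact method.

import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data

# Load the Wage data: income survey information for men
# in the central Atlantic region of the United States.
Wage = load_data('Wage')

# Estimate average wage as a function of age with a lowess smoother
# (an assumed stand-in for the blue curve in Figure 1.1).
smoothed = sm.nonparametric.lowess(Wage['wage'], Wage['age'])

fig, ax = plt.subplots()
ax.scatter(Wage['age'], Wage['wage'], s=10, alpha=0.3)
ax.plot(smoothed[:, 0], smoothed[:, 1], color='blue')
ax.set_xlabel('Age')
ax.set_ylabel('Wage')
plt.show()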
Stock Market Data
FIGURE 1.2. Left: Boxplots of the previous day’s percentage change in the S&P
index for the days for which the market increased or decreased, obtained from the
Smarket data. Center and Right: Same as left panel, but the percentage changes
for 2 and 3 days previous are shown.
In this application, we instead wish to predict whether a given day’s
stock market performance will fall into the Up bucket or the Down
bucket. This is known as a classification problem. A model that could
accurately predict the direction in which the market will move would be
very useful!
The left-hand panel of Figure 1.2 displays two boxplots of the previous
day’s percentage changes in the stock index: one for the 648 days for which
the market increased on the subsequent day, and one for the 602 days for
which the market decreased. The two plots look almost identical, suggest-
ing that there is no simple strategy for using yesterday’s movement in the
S&P to predict today’s returns. The remaining panels, which display box-
plots for the percentage changes 2 and 3 days previous to today, similarly
indicate little association between past and present returns. Of course, this
lack of pattern is to be expected: in the presence of strong correlations be-
tween successive days’ returns, one could adopt a simple trading strategy
to generate profits from the market. Nevertheless, in Chapter 4, we explore
these data using several different statistical learning methods. Interestingly,
there are hints of some weak trends in the data that suggest that, at least
for this 5-year period, it is possible to correctly predict the direction of
movement in the market approximately 60% of the time (Figure 1.3).
[Figure 1.3: boxplots of the predicted probability (vertical axis, approximately 0.46 to 0.52) for each of today’s directions, Down and Up.]
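For concreteness, the following minimal sketch fits one such method, logistic regression, to the Smarket data loaded via the ISLP package. This is just one of the approaches explored in Chapter 4, and the training accuracy it prints need not match the roughly 60% figure quoted above, which comes from the book’s fuller analysis.

from ISLP import load_data
from sklearn.linear_model import LogisticRegression

Smarket = load_data('Smarket')

# Previous days' percentage changes as predictors,
# today's market direction as the response.
X = Smarket[['Lag1', 'Lag2', 'Lag3']]
y = Smarket['Direction']              # 'Up' or 'Down'

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))                # fraction of days classified correctly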
Gene Expression Data

In our third application, the goal is instead to group observations into
clusters; this is known as a clustering problem. Unlike in the previous
examples, here we are not trying to predict an output variable.
We devote Chapter 12 to a discussion of statistical learning methods
for problems in which no natural output variable is available. We consider
the NCI60 data set, which consists of 6,830 gene expression measurements
for each of 64 cancer cell lines. Instead of predicting a particular output
variable, we are interested in determining whether there are groups, or
clusters, among the cell lines based on their gene expression measurements.
This is a difficult question to address, in part because there are thousands
of gene expression measurements per cell line, making it hard to visualize
the data.
The left-hand panel of Figure 1.4 addresses this problem by represent-
ing each of the 64 cell lines using just two numbers, Z1 and Z2. These
are the first two principal components of the data, which summarize the
6,830 expression measurements for each cell line down to two numbers or
dimensions. While it is likely that this dimension reduction has resulted in
some loss of information, it is now possible to visually examine the data
for evidence of clustering. Deciding on the number of clusters is often a
difficult problem. But the left-hand panel of Figure 1.4 suggests at least
four groups of cell lines, which we have represented using separate colors.
In this particular data set, it turns out that the cell lines correspond
to 14 different types of cancer. (However, this information was not used
to create the left-hand panel of Figure 1.4.) The right-hand panel of Fig-
ure 1.4 is identical to the left-hand panel, except that the 14 cancer types
are shown using distinct colored symbols. There is clear evidence that cell
lines with the same cancer type tend to be located near each other in this
two-dimensional representation. In addition, even though the cancer infor-
mation was not used to produce the left-hand panel, the clustering obtained
does bear some resemblance to some of the actual cancer types observed
in the right-hand panel. This provides some independent verification of the
accuracy of our clustering analysis.
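A minimal sketch of the dimension reduction behind Figure 1.4 follows, assuming the NCI60 data loads (as in the Chapter 12 lab) as a dictionary containing a 'data' matrix; principal component analysis from scikit-learn computes the two summary dimensions Z1 and Z2.

import matplotlib.pyplot as plt
from ISLP import load_data
from sklearn.decomposition import PCA

# 64 cancer cell lines, each with 6,830 gene expression measurements.
NCI60 = load_data('NCI60')
X = NCI60['data']                     # assumed key, as in the Chapter 12 lab

# The first two principal components summarize each cell line
# with just two numbers, Z1 and Z2.
Z = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(Z[:, 0], Z[:, 1])
ax.set_xlabel('Z1')
ax.set_ylabel('Z2')
plt.show()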