
First Printing: July 5, 2023

To our parents:

Alison and Michael James

Chiara Nappi and Edward Witten

Valerie and Patrick Hastie

Vera and Sami Tibshirani

John and Brenda Taylor

and to our families:

Michael, Daniel, and Catherine

Tessa, Theo, Otto, and Ari

Samantha, Timothy, and Lynda

Charlie, Ryan, Julie, and Cheryl

Lee-Ann and Isobel


Preface

Statistical learning refers to a set of tools for making sense of complex
datasets. In recent years, we have seen a staggering increase in the scale and
scope of data collection across virtually all areas of science and industry.
As a result, statistical learning has become a critical toolkit for anyone who
wishes to understand data — and as more and more of today’s jobs involve
data, this means that statistical learning is fast becoming a critical toolkit
for everyone.
One of the first books on statistical learning — The Elements of Statisti-
cal Learning (ESL, by Hastie, Tibshirani, and Friedman) — was published
in 2001, with a second edition in 2009. ESL has become a popular text not
only in statistics but also in related fields. One of the reasons for ESL’s
popularity is its relatively accessible style. But ESL is best-suited for indi-
viduals with advanced training in the mathematical sciences.
An Introduction to Statistical Learning, With Applications in R (ISLR)
— first published in 2013, with a second edition in 2021 — arose from
the clear need for a broader and less technical treatment of the key topics
in statistical learning. In addition to a review of linear regression, ISLR
covers many of today’s most important statistical and machine learning
approaches, including resampling, sparse methods for classification and re-
gression, generalized additive models, tree-based methods, support vector
machines, deep learning, survival analysis, clustering, and multiple testing.
Since it was published in 2013, ISLR has become a mainstay of un-
dergraduate and graduate classrooms worldwide, as well as an important
reference book for data scientists. One of the keys to its success has been
that, beginning with Chapter 2, each chapter contains an R lab illustrating
how to implement the statistical learning methods seen in that chapter,
providing the reader with valuable hands-on experience.
However, in recent years Python has become an increasingly popular lan-
guage for data science, and there has been increasing demand for a Python-
based alternative to ISLR. Hence, this book, An Introduction to Statistical
Learning, With Applications in Python (ISLP), covers the same materials
as ISLR but with labs implemented in Python — a feat accomplished by the
addition of a new co-author, Jonathan Taylor. Several of the labs make use
of the ISLP Python package, which we have written to facilitate carrying out
the statistical learning methods covered in each chapter in Python. These
labs will be useful both for Python novices and for experienced users.
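
To give a flavor of the labs, the short sketch below shows how a data set might be loaded and inspected with the ISLP package. It is a minimal illustration, not taken from the book's lab code, and it assumes the package has been installed (for example via pip) and that load_data accepts the data set names used in the labs.

from ISLP import load_data

# Load the Wage data set (used in Chapters 3 and 7) as a data frame.
Wage = load_data('Wage')

# Quick checks: dimensions, first rows, and numerical summaries.
print(Wage.shape)
print(Wage.head())
print(Wage.describe())
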
The intention behind ISLP (and ISLR) is to concentrate more on the
applications of the methods and less on the mathematical details, so it is
appropriate for advanced undergraduates or master’s students in statistics
or related quantitative fields, or for individuals in other disciplines who
wish to use statistical learning tools to analyze their data. It can be used
as a textbook for a course spanning two semesters.
We are grateful to these readers for providing valuable comments on the
first edition of ISLR: Pallavi Basu, Alexandra Chouldechova, Patrick Dana-
her, Will Fithian, Luella Fu, Sam Gross, Max Grazier G’Sell, Courtney
Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, Xin Lu
Tan. We thank these readers for helpful input on the second edition of ISLR:
Alan Agresti, Iain Carmichael, Yiqun Chen, Erin Craig, Daisy Ding, Lucy
Gao, Ismael Lemhadri, Bryan Martin, Anna Neufeld, Geoff Tims, Carsten
Voelkmann, Steve Yadlowsky, and James Zou. We are immensely grateful
to Balasubramanian “Naras” Narasimhan for his assistance on both ISLR
and ISLP.
It has been an honor and a privilege for us to see the considerable impact
that ISLR has had on the way in which statistical learning is practiced, both
in and out of the academic setting. We hope that this new Python edition
will continue to give today’s and tomorrow’s applied statisticians and data
scientists the tools they need for success in a data-driven world.

It’s tough to make predictions, especially about the future.

- Yogi Berra

Contents

Preface vii

1 Introduction 1

2 Statistical Learning 15
2.1 What Is Statistical Learning? . . . . . . . . . . . . . . . . . 15
2.1.1 Why Estimate f? . . . . . . . . . . . . . . . . . . . 17
2.1.2 How Do We Estimate f? . . . . . . . . . . . . . . . 20
2.1.3 The Trade-Off Between Prediction Accuracy
and Model Interpretability . . . . . . . . . . . . . . 23
2.1.4 Supervised Versus Unsupervised Learning . . . . . 25
2.1.5 Regression Versus Classification Problems . . . . . 27
2.2 Assessing Model Accuracy . . . . . . . . . . . . . . . . . . 27
2.2.1 Measuring the Quality of Fit . . . . . . . . . . . . 28
2.2.2 The Bias-Variance Trade-Off . . . . . . . . . . . . . 31
2.2.3 The Classification Setting . . . . . . . . . . . . . . 34
2.3 Lab: Introduction to Python . . . . . . . . . . . . . . . . . 40
2.3.1 Getting Started . . . . . . . . . . . . . . . . . . . . 40
2.3.2 Basic Commands . . . . . . . . . . . . . . . . . . . 40
2.3.3 Introduction to Numerical Python . . . . . . . . . 42
2.3.4 Graphics . . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.5 Sequences and Slice Notation . . . . . . . . . . . . 51
2.3.6 Indexing Data . . . . . . . . . . . . . . . . . . . . . 51
2.3.7 Loading Data . . . . . . . . . . . . . . . . . . . . . 55
2.3.8 For Loops . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.9 Additional Graphical and Numerical Summaries . . 61
2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3 Linear Regression 69
3.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . 70
3.1.1 Estimating the Coefficients . . . . . . . . . . . . . 71
3.1.2 Assessing the Accuracy of the Coefficient
Estimates . . . . . . . . . . . . . . . . . . . . . . . 72
3.1.3 Assessing the Accuracy of the Model . . . . . . . . 77
3.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . 80
3.2.1 Estimating the Regression Coefficients . . . . . . . 81


3.2.2 Some Important Questions . . . . . . . . . . . . . . 83


3.3 Other Considerations in the Regression Model . . . . . . . 91
3.3.1 Qualitative Predictors . . . . . . . . . . . . . . . . 91
3.3.2 Extensions of the Linear Model . . . . . . . . . . . 94
3.3.3 Potential Problems . . . . . . . . . . . . . . . . . . 100
3.4 The Marketing Plan . . . . . . . . . . . . . . . . . . . . . . 109
3.5 Comparison of Linear Regression with K-Nearest
Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.6 Lab: Linear Regression . . . . . . . . . . . . . . . . . . . . 116
3.6.1 Importing packages . . . . . . . . . . . . . . . . . . 116
3.6.2 Simple Linear Regression . . . . . . . . . . . . . . . 117
3.6.3 Multiple Linear Regression . . . . . . . . . . . . . . 122
3.6.4 Multivariate Goodness of Fit . . . . . . . . . . . . 123
3.6.5 Interaction Terms . . . . . . . . . . . . . . . . . . . 124
3.6.6 Non-linear Transformations of the Predictors . . . 125
3.6.7 Qualitative Predictors . . . . . . . . . . . . . . . . 126
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

4 Classification 135
4.1 An Overview of Classification . . . . . . . . . . . . . . . . . 135
4.2 Why Not Linear Regression? . . . . . . . . . . . . . . . . . 136
4.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 138
4.3.1 The Logistic Model . . . . . . . . . . . . . . . . . . 139
4.3.2 Estimating the Regression Coefficients . . . . . . . 140
4.3.3 Making Predictions . . . . . . . . . . . . . . . . . . 141
4.3.4 Multiple Logistic Regression . . . . . . . . . . . . . 142
4.3.5 Multinomial Logistic Regression . . . . . . . . . . . 144
4.4 Generative Models for Classification . . . . . . . . . . . . . 146
4.4.1 Linear Discriminant Analysis for p = 1 . . . . . . . 147
4.4.2 Linear Discriminant Analysis for p > 1 . . . . . . . 150
4.4.3 Quadratic Discriminant Analysis . . . . . . . . . . 156
4.4.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . 158
4.5 A Comparison of Classification Methods . . . . . . . . . . 161
4.5.1 An Analytical Comparison . . . . . . . . . . . . . . 161
4.5.2 An Empirical Comparison . . . . . . . . . . . . . . 164
4.6 Generalized Linear Models . . . . . . . . . . . . . . . . . . 167
4.6.1 Linear Regression on the Bikeshare Data . . . . . . 167
4.6.2 Poisson Regression on the Bikeshare Data . . . . . 169
4.6.3 Generalized Linear Models in Greater Generality . 172
4.7 Lab: Logistic Regression, LDA, QDA, and KNN . . . . . . 173
4.7.1 The Stock Market Data . . . . . . . . . . . . . . . 173
4.7.2 Logistic Regression . . . . . . . . . . . . . . . . . . 174
4.7.3 Linear Discriminant Analysis . . . . . . . . . . . . 179
4.7.4 Quadratic Discriminant Analysis . . . . . . . . . . 181
4.7.5 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . 182
4.7.6 K-Nearest Neighbors . . . . . . . . . . . . . . . . . 183
4.7.7 Linear and Poisson Regression on the Bikeshare Data 188
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

5 Resampling Methods 201


5.1 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . 202
5.1.1 The Validation Set Approach . . . . . . . . . . . . 202
5.1.2 Leave-One-Out Cross-Validation . . . . . . . . . . 204
5.1.3 k-Fold Cross-Validation . . . . . . . . . . . . . . . 206
5.1.4 Bias-Variance Trade-Off for k-Fold
Cross-Validation . . . . . . . . . . . . . . . . . . . 208
5.1.5 Cross-Validation on Classification Problems . . . . 209
5.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.3 Lab: Cross-Validation and the Bootstrap . . . . . . . . . . 215
5.3.1 The Validation Set Approach . . . . . . . . . . . . 216
5.3.2 Cross-Validation . . . . . . . . . . . . . . . . . . . 217
5.3.3 The Bootstrap . . . . . . . . . . . . . . . . . . . . 220
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

6 Linear Model Selection and Regularization 229


6.1 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . 231
6.1.1 Best Subset Selection . . . . . . . . . . . . . . . . . 231
6.1.2 Stepwise Selection . . . . . . . . . . . . . . . . . . 233
6.1.3 Choosing the Optimal Model . . . . . . . . . . . . 235
6.2 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . 240
6.2.1 Ridge Regression . . . . . . . . . . . . . . . . . . . 240
6.2.2 The Lasso . . . . . . . . . . . . . . . . . . . . . . . 244
6.2.3 Selecting the Tuning Parameter . . . . . . . . . . . 252
6.3 Dimension Reduction Methods . . . . . . . . . . . . . . . . 253
6.3.1 Principal Components Regression . . . . . . . . . . 254
6.3.2 Partial Least Squares . . . . . . . . . . . . . . . . . 260
6.4 Considerations in High Dimensions . . . . . . . . . . . . . 262
6.4.1 High-Dimensional Data . . . . . . . . . . . . . . . . 262
6.4.2 What Goes Wrong in High Dimensions? . . . . . . 263
6.4.3 Regression in High Dimensions . . . . . . . . . . . 265
6.4.4 Interpreting Results in High Dimensions . . . . . . 266
6.5 Lab: Linear Models and Regularization Methods . . . . . . 267
6.5.1 Subset Selection Methods . . . . . . . . . . . . . . 268
6.5.2 Ridge Regression and the Lasso . . . . . . . . . . . 273
6.5.3 PCR and PLS Regression . . . . . . . . . . . . . . 280
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

7 Moving Beyond Linearity 289


7.1 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . 290
7.2 Step Functions . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.3 Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . 293
7.4 Regression Splines . . . . . . . . . . . . . . . . . . . . . . . 294
7.4.1 Piecewise Polynomials . . . . . . . . . . . . . . . . 294
7.4.2 Constraints and Splines . . . . . . . . . . . . . . . 296
7.4.3 The Spline Basis Representation . . . . . . . . . . 296
7.4.4 Choosing the Number and Locations
of the Knots . . . . . . . . . . . . . . . . . . . . . . 297
7.4.5 Comparison to Polynomial Regression . . . . . . . 299

7.5 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . 300


7.5.1 An Overview of Smoothing Splines . . . . . . . . . 300
7.5.2 Choosing the Smoothing Parameter λ . . . . . . . 301
7.6 Local Regression . . . . . . . . . . . . . . . . . . . . . . . . 303
7.7 Generalized Additive Models . . . . . . . . . . . . . . . . . 305
7.7.1 GAMs for Regression Problems . . . . . . . . . . . 306
7.7.2 GAMs for Classification Problems . . . . . . . . . . 308
7.8 Lab: Non-Linear Modeling . . . . . . . . . . . . . . . . . . 309
7.8.1 Polynomial Regression and Step Functions . . . . . 310
7.8.2 Splines . . . . . . . . . . . . . . . . . . . . . . . . . 315
7.8.3 Smoothing Splines and GAMs . . . . . . . . . . . . 317
7.8.4 Local Regression . . . . . . . . . . . . . . . . . . . 324
7.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

8 Tree-Based Methods 331


8.1 The Basics of Decision Trees . . . . . . . . . . . . . . . . . 331
8.1.1 Regression Trees . . . . . . . . . . . . . . . . . . . 331
8.1.2 Classification Trees . . . . . . . . . . . . . . . . . . 337
8.1.3 Trees Versus Linear Models . . . . . . . . . . . . . 341
8.1.4 Advantages and Disadvantages of Trees . . . . . . . 341
8.2 Bagging, Random Forests, Boosting, and Bayesian Additive
Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . 343
8.2.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . 343
8.2.2 Random Forests . . . . . . . . . . . . . . . . . . . . 346
8.2.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . 347
8.2.4 Bayesian Additive Regression Trees . . . . . . . . . 350
8.2.5 Summary of Tree Ensemble Methods . . . . . . . . 353
8.3 Lab: Tree-Based Methods . . . . . . . . . . . . . . . . . . . 354
8.3.1 Fitting Classification Trees . . . . . . . . . . . . . . 355
8.3.2 Fitting Regression Trees . . . . . . . . . . . . . . . 358
8.3.3 Bagging and Random Forests . . . . . . . . . . . . 360
8.3.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . 361
8.3.5 Bayesian Additive Regression Trees . . . . . . . . . 362
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

9 Support Vector Machines 367


9.1 Maximal Margin Classifier . . . . . . . . . . . . . . . . . . 367
9.1.1 What Is a Hyperplane? . . . . . . . . . . . . . . . . 368
9.1.2 Classification Using a Separating Hyperplane . . . 368
9.1.3 The Maximal Margin Classifier . . . . . . . . . . . 370
9.1.4 Construction of the Maximal Margin Classifier . . 372
9.1.5 The Non-separable Case . . . . . . . . . . . . . . . 372
9.2 Support Vector Classifiers . . . . . . . . . . . . . . . . . . . 373
9.2.1 Overview of the Support Vector Classifier . . . . . 373
9.2.2 Details of the Support Vector Classifier . . . . . . . 374
9.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . 377
9.3.1 Classification with Non-Linear Decision
Boundaries . . . . . . . . . . . . . . . . . . . . . . 378
9.3.2 The Support Vector Machine . . . . . . . . . . . . 379

9.3.3 An Application to the Heart Disease Data . . . . . 382


9.4 SVMs with More than Two Classes . . . . . . . . . . . . . 383
9.4.1 One-Versus-One Classification . . . . . . . . . . . . 384
9.4.2 One-Versus-All Classification . . . . . . . . . . . . 384
9.5 Relationship to Logistic Regression . . . . . . . . . . . . . 384
9.6 Lab: Support Vector Machines . . . . . . . . . . . . . . . . 387
9.6.1 Support Vector Classifier . . . . . . . . . . . . . . . 387
9.6.2 Support Vector Machine . . . . . . . . . . . . . . . 390
9.6.3 ROC Curves . . . . . . . . . . . . . . . . . . . . . . 392
9.6.4 SVM with Multiple Classes . . . . . . . . . . . . . 393
9.6.5 Application to Gene Expression Data . . . . . . . . 394
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

10 Deep Learning 399


10.1 Single Layer Neural Networks . . . . . . . . . . . . . . . . 400
10.2 Multilayer Neural Networks . . . . . . . . . . . . . . . . . . 402
10.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . 406
10.3.1 Convolution Layers . . . . . . . . . . . . . . . . . . 407
10.3.2 Pooling Layers . . . . . . . . . . . . . . . . . . . . 410
10.3.3 Architecture of a Convolutional Neural Network . . 410
10.3.4 Data Augmentation . . . . . . . . . . . . . . . . . . 411
10.3.5 Results Using a Pretrained Classifier . . . . . . . . 412
10.4 Document Classification . . . . . . . . . . . . . . . . . . . . 413
10.5 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 416
10.5.1 Sequential Models for Document Classification . . 418
10.5.2 Time Series Forecasting . . . . . . . . . . . . . . . 420
10.5.3 Summary of RNNs . . . . . . . . . . . . . . . . . . 424
10.6 When to Use Deep Learning . . . . . . . . . . . . . . . . . 425
10.7 Fitting a Neural Network . . . . . . . . . . . . . . . . . . . 427
10.7.1 Backpropagation . . . . . . . . . . . . . . . . . . . 428
10.7.2 Regularization and Stochastic Gradient Descent . . 429
10.7.3 Dropout Learning . . . . . . . . . . . . . . . . . . . 431
10.7.4 Network Tuning . . . . . . . . . . . . . . . . . . . . 431
10.8 Interpolation and Double Descent . . . . . . . . . . . . . . 432
10.9 Lab: Deep Learning . . . . . . . . . . . . . . . . . . . . . . 435
10.9.1 Single Layer Network on Hitters Data . . . . . . . 437
10.9.2 Multilayer Network on the MNIST Digit Data . . . 444
10.9.3 Convolutional Neural Networks . . . . . . . . . . . 448
10.9.4 Using Pretrained CNN Models . . . . . . . . . . . 452
10.9.5 IMDB Document Classification . . . . . . . . . . . 454
10.9.6 Recurrent Neural Networks . . . . . . . . . . . . . 458
10.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465

11 Survival Analysis and Censored Data 469


11.1 Survival and Censoring Times . . . . . . . . . . . . . . . . 470
11.2 A Closer Look at Censoring . . . . . . . . . . . . . . . . . . 470
11.3 The Kaplan–Meier Survival Curve . . . . . . . . . . . . . . 472
11.4 The Log-Rank Test . . . . . . . . . . . . . . . . . . . . . . 474
11.5 Regression Models With a Survival Response . . . . . . . . 476

11.5.1 The Hazard Function . . . . . . . . . . . . . . . . . 476


11.5.2 Proportional Hazards . . . . . . . . . . . . . . . . . 478
11.5.3 Example: Brain Cancer Data . . . . . . . . . . . . 482
11.5.4 Example: Publication Data . . . . . . . . . . . . . 482
11.6 Shrinkage for the Cox Model . . . . . . . . . . . . . . . . . 484
11.7 Additional Topics . . . . . . . . . . . . . . . . . . . . . . . 486
11.7.1 Area Under the Curve for Survival Analysis . . . . 486
11.7.2 Choice of Time Scale . . . . . . . . . . . . . . . . . 487
11.7.3 Time-Dependent Covariates . . . . . . . . . . . . . 488
11.7.4 Checking the Proportional Hazards Assumption . . 488
11.7.5 Survival Trees . . . . . . . . . . . . . . . . . . . . . 488
11.8 Lab: Survival Analysis . . . . . . . . . . . . . . . . . . . . . 489
11.8.1 Brain Cancer Data . . . . . . . . . . . . . . . . . . 489
11.8.2 Publication Data . . . . . . . . . . . . . . . . . . . 493
11.8.3 Call Center Data . . . . . . . . . . . . . . . . . . . 494
11.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498

12 Unsupervised Learning 503


12.1 The Challenge of Unsupervised Learning . . . . . . . . . . 503
12.2 Principal Components Analysis . . . . . . . . . . . . . . . . 504
12.2.1 What Are Principal Components? . . . . . . . . . . 505
12.2.2 Another Interpretation of Principal Components . 508
12.2.3 The Proportion of Variance Explained . . . . . . . 510
12.2.4 More on PCA . . . . . . . . . . . . . . . . . . . . . 512
12.2.5 Other Uses for Principal Components . . . . . . . . 515
12.3 Missing Values and Matrix Completion . . . . . . . . . . . 515
12.4 Clustering Methods . . . . . . . . . . . . . . . . . . . . . . 520
12.4.1 K-Means Clustering . . . . . . . . . . . . . . . . . 521
12.4.2 Hierarchical Clustering . . . . . . . . . . . . . . . . 525
12.4.3 Practical Issues in Clustering . . . . . . . . . . . . 532
12.5 Lab: Unsupervised Learning . . . . . . . . . . . . . . . . . 535
12.5.1 Principal Components Analysis . . . . . . . . . . . 535
12.5.2 Matrix Completion . . . . . . . . . . . . . . . . . . 539
12.5.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . 542
12.5.4 NCI60 Data Example . . . . . . . . . . . . . . . . . 546
12.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552

13 Multiple Testing 557


13.1 A Quick Review of Hypothesis Testing . . . . . . . . . . . 558
13.1.1 Testing a Hypothesis . . . . . . . . . . . . . . . . . 558
13.1.2 Type I and Type II Errors . . . . . . . . . . . . . . 562
13.2 The Challenge of Multiple Testing . . . . . . . . . . . . . . 563
13.3 The Family-Wise Error Rate . . . . . . . . . . . . . . . . . 565
13.3.1 What is the Family-Wise Error Rate? . . . . . . . 565
13.3.2 Approaches to Control the Family-Wise Error Rate 567
13.3.3 Trade-Off Between the FWER and Power . . . . . 572
13.4 The False Discovery Rate . . . . . . . . . . . . . . . . . . . 573
13.4.1 Intuition for the False Discovery Rate . . . . . . . 573
13.4.2 The Benjamini–Hochberg Procedure . . . . . . . . 575

13.5 A Re-Sampling Approach to p-Values and False Discovery


Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
13.5.1 A Re-Sampling Approach to the p-Value . . . . . . 578
13.5.2 A Re-Sampling Approach to the False Discovery Rate 579
13.5.3 When Are Re-Sampling Approaches Useful? . . . . 581
13.6 Lab: Multiple Testing . . . . . . . . . . . . . . . . . . . . . 583
13.6.1 Review of Hypothesis Tests . . . . . . . . . . . . . 583
13.6.2 Family-Wise Error Rate . . . . . . . . . . . . . . . 585
13.6.3 False Discovery Rate . . . . . . . . . . . . . . . . . 588
13.6.4 A Re-Sampling Approach . . . . . . . . . . . . . . 590
13.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593

Index 597

1 Introduction

An Overview of Statistical Learning


Statistical learning refers to a vast set of tools for understanding data. These
tools can be classified as supervised or unsupervised. Broadly speaking,
supervised statistical learning involves building a statistical model for pre-
dicting, or estimating, an output based on one or more inputs. Problems of
this nature occur in fields as diverse as business, medicine, astrophysics, and
public policy. With unsupervised statistical learning, there are inputs but
no supervising output; nevertheless we can learn relationships and struc-
ture from such data. To provide an illustration of some applications of
statistical learning, we briefly discuss three real-world data sets that are
considered in this book.

Wage Data
In this application (which we refer to as the Wage data set throughout this
book), we examine a number of factors that relate to wages for a group of
men from the Atlantic region of the United States. In particular, we wish
to understand the association between an employee’s age and education, as
well as the calendar year, on his wage. Consider, for example, the left-hand
panel of Figure 1.1, which displays wage versus age for each of the individu-
als in the data set. There is evidence that wage increases with age but then
decreases again after approximately age 60. The blue line, which provides
an estimate of the average wage for a given age, makes this trend clearer.
Given an employee’s age, we can use this curve to predict his wage. However,
it is also clear from Figure 1.1 that there is a significant amount of vari-
ability associated with this average value, and so age alone is unlikely to
provide an accurate prediction of a particular man’s wage.



FIGURE 1.1. Wage data, which contains income survey information for men
from the central Atlantic region of the United States. Left: wage as a function of
age. On average, wage increases with age until about 60 years of age, at which
point it begins to decline. Center: wage as a function of year. There is a slow
but steady increase of approximately $10,000 in the average wage between 2003
and 2009. Right: Boxplots displaying wage as a function of education, with 1
indicating the lowest level (no high school diploma) and 5 the highest level (an
advanced graduate degree). On average, wage increases with the level of education.

We also have information regarding each employee's education level and
the year in which the wage was earned.
of Figure 1.1, which display wage as a function of both year and education,
indicate that both of these factors are associated with wage. Wages increase
by approximately $10,000, in a roughly linear (or straight-line) fashion,
between 2003 and 2009, though this rise is very slight relative to the vari-
ability in the data. Wages are also typically greater for individuals with
higher education levels: men with the lowest education level (1) tend to
have substantially lower wages than those with the highest education level
(5). Clearly, the most accurate prediction of a given man’s wage will be
obtained by combining his age, his education, and the year. In Chapter 3,
we discuss linear regression, which can be used to predict wage from this
data set. Ideally, we should predict wage in a way that accounts for the
non-linear relationship between wage and age. In Chapter 7, we discuss a
class of approaches for addressing this problem.
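
As a rough sketch of the two modeling ideas just mentioned, the code below fits a simple linear model relating wage to age, year, and education, and then a model with a quadratic term in age. It uses statsmodels' formula interface rather than the book's own lab code, and it assumes the column names wage, age, year, and education in the Wage data loaded through the ISLP package.

from ISLP import load_data
import statsmodels.formula.api as smf

Wage = load_data('Wage')

# Chapter 3 idea: a linear model relating wage to age, year, and education
# (the education column is non-numeric, so it is treated as categorical).
linear_fit = smf.ols('wage ~ age + year + education', data=Wage).fit()
print(linear_fit.params)

# Chapter 7 idea: adding a quadratic term in age lets the fitted wage curve
# rise with age and then decline around age 60, as described above.
quad_fit = smf.ols('wage ~ age + I(age**2) + year + education', data=Wage).fit()
print(quad_fit.params)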

Stock Market Data


The Wage data involves predicting a continuous or quantitative output value.
This is often referred to as a regression problem. However, in certain cases
we may instead wish to predict a non-numerical value—that is, a categorical
or qualitative output. For example, in Chapter 4 we examine a stock market
data set that contains the daily movements in the Standard & Poor’s 500
(S&P) stock index over a 5-year period between 2001 and 2005. We refer
to this as the Smarket data. The goal is to predict whether the index will
increase or decrease on a given day, using the past 5 days’ percentage
changes in the index. Here the statistical learning problem does not involve
predicting a numerical value. Instead it involves predicting whether a given
day's stock market performance will fall into the Up bucket or the Down
bucket. This is known as a classification problem. A model that could
accurately predict the direction in which the market will move would be
very useful!

FIGURE 1.2. Left: Boxplots of the previous day's percentage change in the S&P
index for the days for which the market increased or decreased, obtained from the
Smarket data. Center and Right: Same as left panel, but the percentage changes
for 2 and 3 days previous are shown.

The left-hand panel of Figure 1.2 displays two boxplots of the previous
day’s percentage changes in the stock index: one for the 648 days for which
the market increased on the subsequent day, and one for the 602 days for
which the market decreased. The two plots look almost identical, suggest-
ing that there is no simple strategy for using yesterday’s movement in the
S&P to predict today’s returns. The remaining panels, which display box-
plots for the percentage changes 2 and 3 days previous to today, similarly
indicate little association between past and present returns. Of course, this
lack of pattern is to be expected: in the presence of strong correlations be-
tween successive days’ returns, one could adopt a simple trading strategy
to generate profits from the market. Nevertheless, in Chapter 4, we explore
these data using several different statistical learning methods. Interestingly,
there are hints of some weak trends in the data that suggest that, at least
for this 5-year period, it is possible to correctly predict the direction of
movement in the market approximately 60% of the time (Figure 1.3).
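
The following sketch illustrates, in the spirit of the analysis behind Figure 1.3, how one might train a quadratic discriminant analysis classifier on the 2001-2004 data and evaluate it on 2005. It uses scikit-learn's QuadraticDiscriminantAnalysis as a stand-in for the lab code in Chapter 4, and it assumes the Smarket data contains the lag variables Lag1 through Lag5, a Year column, and the Direction response.

from ISLP import load_data
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

Smarket = load_data('Smarket')
lags = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']

# Train on 2001-2004 and hold out 2005 as a test period.
train = Smarket['Year'] < 2005
X_train, y_train = Smarket.loc[train, lags], Smarket.loc[train, 'Direction']
X_test, y_test = Smarket.loc[~train, lags], Smarket.loc[~train, 'Direction']

qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# Fraction of 2005 trading days whose direction is predicted correctly.
print(qda.score(X_test, y_test))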

Gene Expression Data


The previous two applications illustrate data sets with both input and
output variables. However, another important class of problems involves
situations in which we only observe input variables, with no corresponding
output. For example, in a marketing setting, we might have demographic
information for a number of current or potential customers. We may wish to
understand which types of customers are similar to each other by grouping
individuals according to their observed characteristics. This is known as a
clustering problem. Unlike in the previous examples, here we are not trying
to predict an output variable.

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset
of the Smarket data corresponding to the 2001–2004 time period, and predicted
the probability of a stock market decrease using the 2005 data. On average, the
predicted probability of decrease is higher for the days in which the market does
decrease. Based on these results, we are able to correctly predict the direction of
movement in the market 60% of the time.

We devote Chapter 12 to a discussion of statistical learning methods
for problems in which no natural output variable is available. We consider
the NCI60 data set, which consists of 6,830 gene expression measurements
for each of 64 cancer cell lines. Instead of predicting a particular output
variable, we are interested in determining whether there are groups, or
clusters, among the cell lines based on their gene expression measurements.
This is a difficult question to address, in part because there are thousands
of gene expression measurements per cell line, making it hard to visualize
the data.
The left-hand panel of Figure 1.4 addresses this problem by represent-
ing each of the 64 cell lines using just two numbers, Z1 and Z2 . These
are the first two principal components of the data, which summarize the
6,830 expression measurements for each cell line down to two numbers or
dimensions. While it is likely that this dimension reduction has resulted in
some loss of information, it is now possible to visually examine the data
for evidence of clustering. Deciding on the number of clusters is often a
difficult problem. But the left-hand panel of Figure 1.4 suggests at least
four groups of cell lines, which we have represented using separate colors.
In this particular data set, it turns out that the cell lines correspond
to 14 different types of cancer. (However, this information was not used
to create the left-hand panel of Figure 1.4.) The right-hand panel of Fig-
ure 1.4 is identical to the left-hand panel, except that the 14 cancer types
are shown using distinct colored symbols. There is clear evidence that cell
lines with the same cancer type tend to be located near each other in this
two-dimensional representation. In addition, even though the cancer infor-
mation was not used to produce the left-hand panel, the clustering obtained
does bear some resemblance to some of the actual cancer types observed
in the right-hand panel. This provides some independent verification of the
accuracy of our clustering analysis.
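
A hedged sketch of the two steps described above, dimension reduction followed by clustering, is given below. It assumes that load_data('NCI60') returns a dictionary-like object whose 'data' entry is the 64 x 6,830 expression matrix, and it uses scikit-learn's PCA and K-means as generic stand-ins for the methods developed in Chapter 12.

from ISLP import load_data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

NCI60 = load_data('NCI60')
X = StandardScaler().fit_transform(NCI60['data'])

# Summarize the 6,830 expression measurements per cell line with the first
# two principal components (Z1 and Z2 in the text).
Z = PCA(n_components=2).fit_transform(X)

# Look for four groups of cell lines in the two-dimensional representation.
labels = KMeans(n_clusters=4, n_init=20, random_state=0).fit_predict(Z)
print(Z.shape, labels)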
