Module07 - Model Selection and Regularization
Linear Model Selection and Regularization
• Despite its simplicity, the linear model has distinct advantages in terms of interpretability and often shows good predictive performance.
• The simple linear model can be improved by replacing ordinary least squares fitting with alternative fitting procedures.
Why consider alternatives to least squares?
• Prediction Accuracy:
• If the true relationship between the predictors and the response is approximately linear, the least squares estimates have low bias.
• For 𝑛 ≫ 𝑝 (number of observations much greater than number of predictors), the least squares estimates also have low variance.
• When 𝑛 is not much larger than 𝑝, least squares fits can have high variability, resulting in poor predictions.
• When 𝑝 > 𝑛, there is no longer a unique least squares estimate; the variance is infinite, so the method cannot be used at all.
• Model Interpretability:
• By removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
Three classes of methods
• Subset Selection: Identify a subset of the 𝑝 predictors believed to be related to the response, then fit a model using least squares on the reduced set of variables.
• Shrinkage (Regularization): Fit a model involving all 𝑝 predictors, but shrink the estimated coefficients toward zero; some methods, such as the lasso, can set coefficients exactly to zero.
• Dimension Reduction: Project the 𝑝 predictors onto an 𝑀-dimensional subspace with 𝑀 < 𝑝, then fit a least squares model on the 𝑀 projections.
For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
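A minimal sketch of the exhaustive search behind the figure above, assuming `X` is a NumPy array of predictors and `y` the response; the function name and the use of RSS to compare models of the same size follow the discussion, everything else is illustrative.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Exhaustive best subset selection: for each size k, fit least squares
    on every subset of k predictors and keep the one with the smallest RSS.
    Fits all 2**p - 1 non-empty models, so it is only feasible for small p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in itertools.combinations(range(p), k):
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)   # best model of each size, as in the figure
    return best
```

Comparing the winners across different sizes then requires a criterion such as Cp, BIC, adjusted R², or cross-validation, since RSS always decreases as predictors are added.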
Forward Stepwise Selection (FSS)
• In FSS, a total of 1 + 𝑝(𝑝 + 1)/2 models are evaluated (a sketch of this greedy search follows below).
• For 𝑝 = 20, best subset selection potentially requires fitting over 1 million (2²⁰) models, whereas FSS fits only 211.
• FSS is not guaranteed to find the best possible model out of all 2^𝑝 possibilities.
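A sketch of the greedy search that FSS performs, assuming the same `X` and `y` as before; scikit-learn's `SequentialFeatureSelector` provides a similar built-in search.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedy forward stepwise selection: start with no predictors and, at
    each step, add the predictor that most reduces the RSS. This fits only
    1 + p(p + 1)/2 models instead of the 2**p required by best subset."""
    n, p = X.shape
    remaining = list(range(p))
    selected, path = [], []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))   # best model of each size on the path
    return path
```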
Backward Stepwise Selection
[Figure: Cp, BIC, and adjusted R² plotted against the number of predictors.]
Credit data example
[Figure: validation set error and cross-validation error on the Credit data, plotted against the number of predictors.]
One Standard Error Rule
• Rather than choosing the model size with the lowest estimated test error, select the most parsimonious model whose estimated error is within one standard error of that minimum.
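A small sketch of the rule, assuming hypothetical arrays `cv_mean` and `cv_se` holding the mean and standard error of the cross-validation error for models with 1, 2, ..., p predictors.

```python
import numpy as np

def one_se_rule(cv_mean, cv_se):
    """Return the smallest model size whose mean CV error is within one
    standard error of the lowest mean CV error."""
    cv_mean, cv_se = np.asarray(cv_mean), np.asarray(cv_se)
    best = np.argmin(cv_mean)                    # size with the lowest CV error
    threshold = cv_mean[best] + cv_se[best]      # one standard error above it
    candidates = np.where(cv_mean <= threshold)[0]
    return int(candidates.min()) + 1             # index 0 corresponds to 1 predictor
```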
Ridge Regression
• Recall that the least squares fitting procedure estimates the coefficients by minimizing
$$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$
• In contrast, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$
• Like least squares, ridge regression seeks coefficients that fit the data well by making the RSS small.
• The second term, $\lambda\sum_{j=1}^{p}\beta_j^2$, called the shrinkage penalty, penalizes coefficients that are far from zero.
• For 𝜆 = 0, ridge regression is identical to least squares.
• For 𝜆 > 0, the coefficients are shrunk towards zero.
• A suitable value of 𝜆 can be chosen using cross-validation (see the sketch below).
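A sketch using scikit-learn, assuming predictors `X` and response `y`; `Ridge` and `RidgeCV` implement the penalized criterion above, with `alpha` playing the role of 𝜆, and the grid of candidate values here is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge with a fixed lambda (called alpha in scikit-learn); alpha=0 would
# reproduce ordinary least squares, larger values shrink the coefficients
# toward zero.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))

# Choosing lambda by 10-fold cross-validation over a grid of candidates.
lambdas = np.logspace(-2, 4, 100)
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=lambdas, cv=10))
# After ridge_cv.fit(X, y), ridge_cv[-1].alpha_ holds the selected lambda.
```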
Credit data example
[Figure: standardized ridge regression coefficients for Income, Limit, Rating, and Student on the Credit data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right).]
• Unlike least squares, the ridge regression coefficient estimates can change substantially when a predictor is rescaled, so it is best to apply ridge regression after standardizing the predictors using
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}}$$
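A tiny NumPy sketch of the standardization formula above, assuming `X` is an n × p array; the function name is illustrative.

```python
import numpy as np

def standardize_predictors(X):
    """Divide each column by its standard deviation computed with 1/n,
    matching the formula above, so all predictors are on a common scale."""
    X = np.asarray(X, dtype=float)
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd
```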
Why Does Ridge Regression Improve Over Least
Squares?
The Bias-Variance tradeoff
[Figure: mean squared error of ridge regression on a simulated data set, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right), illustrating the bias-variance tradeoff.]
[Figure: standardized lasso coefficients for Income, Limit, Rating, and Student on the Credit data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1$ (right).]
• One can show that the lasso and ridge regression coefficient estimates solve the problems
$$\underset{\beta}{\text{minimize}}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le s$$
and
$$\underset{\beta}{\text{minimize}}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le s,$$
respectively.
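A small simulation sketch, not from the slides, contrasting the two penalties: the ℓ1 constraint sets some coefficients exactly to zero, while the ℓ2 constraint only shrinks them. The data-generating setup (two relevant predictors) and all parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]                       # only two predictors matter
y = X @ beta + rng.normal(scale=1.0, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # l1 penalty: sparse solution
ridge = Ridge(alpha=10.0).fit(X, y)          # l2 penalty: shrinks, never exactly zero

print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
```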
The Lasso Picture
Comparing the Lasso and Ridge Regression
Left: plots of squared bias (black), variance (green), and test MSE (purple) on a simulated data set in which all predictors are related to the response. Right: the same quantities on a simulated data set in which only two of the predictors are related to the response.
Choice of Tuning Parameter
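A sketch of selecting 𝜆 by cross-validation with scikit-learn's `LassoCV`, assuming predictors `X` and response `y`; the grid of candidate values is illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 10-fold cross-validation over a grid of candidate lambdas (alphas).
lambdas = np.logspace(-3, 1, 50)
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=lambdas, cv=10))
# After lasso_cv.fit(X, y):
#   lasso_cv[-1].alpha_     -> lambda with the lowest mean CV error
#   lasso_cv[-1].mse_path_  -> per-fold CV errors, usable for the one-SE rule
```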
Dimension Reduction Methods
• Dimension reduction methods transform the predictors into 𝑀 linear combinations
$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \qquad (1)$$
where the $\phi_{1m}, \ldots, \phi_{pm}$ are constants, and then fit a least squares model to $Z_1, \ldots, Z_M$ with coefficients $\theta_0, \theta_1, \ldots, \theta_M$. The coefficients on the original predictors are then
$$\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm} \qquad (3)$$
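A sketch illustrating (1) and (3), using principal components as one choice of the constants $\phi_{jm}$; the simulated data and all parameter values are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.2]) + rng.normal(size=200)

M = 2                                        # number of components kept
pca = PCA(n_components=M).fit(X)
phi = pca.components_.T                      # p x M matrix of loadings phi_{jm}
Z = X @ phi                                  # Z_m = sum_j phi_{jm} X_j, as in (1)
# (pca.transform(X) would also center X first; X @ phi matches (1) directly,
#  with the difference absorbed by the intercept.)

theta = LinearRegression().fit(Z, y).coef_   # regress y on Z_1, ..., Z_M
beta = phi @ theta                           # beta_j = sum_m theta_m phi_{jm}, as in (3)
```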
The population size (pop) and ad spending (ad) for 100 different cities are shown as
purple circles. The green solid line indicates the first principal component, and the
blue dashed line indicates the second principal component.
Pictures of PCA: continued
A subset of the advertising data. Left: the first principal component, chosen to minimize the sum of the squared perpendicular distances to each point, is shown in green. These distances are represented by the black dashed line segments. Right: the left-hand panel has been rotated so that the first principal component lies on the x-axis.
Principal Components Regression
Plots of the first principal component scores 𝑧𝑖1 versus pop and ad. The relationships
are strong.
Principal Component Regression
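A minimal sketch of principal components regression as a pipeline, assuming predictors `X` and response `y`: standardize, project onto the first 𝑀 principal components, then fit least squares on the component scores, with 𝑀 chosen by cross-validation. The grid of component counts is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Principal components regression: scale, reduce to M components, then OLS.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Choose the number of components M by 10-fold cross-validation.
search = GridSearchCV(pcr,
                      param_grid={"pca__n_components": list(range(1, 11))},
                      scoring="neg_mean_squared_error",
                      cv=10)
# search.fit(X, y); search.best_params_["pca__n_components"] is the chosen M
```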