Module07 - Model Selection and Regularization

1. This module discusses linear model selection and regularization methods that can improve predictive performance over ordinary least squares, particularly when the number of observations is not much larger than the number of predictors.
2. It describes three main classes of methods: subset selection, shrinkage, and dimension reduction. Subset selection identifies a subset of important predictors, shrinkage fits all predictors but shrinks their coefficients, and dimension reduction projects the predictors into a lower-dimensional space.
3. Stepwise selection methods such as forward and backward stepwise selection are introduced as computationally feasible alternatives to best subset selection when the number of predictors is large. Metrics such as Cp, AIC, BIC, and adjusted R² are discussed for selecting the optimal model size.


Model Selection and Regularization
Linear Model Selection and Regularization
• Despite its simplicity, the linear model has distinct
advantages in terms of its interpretability and often shows
good predictive performance.
• The simple linear model can be improved by replacing ordinary least squares fitting with alternative fitting procedures.
Why consider alternatives to least squares?
• Prediction Accuracy:
• If the true relationship between the predictors and the response is approximately linear, the least squares estimates have low bias.
• For n ≫ p (number of observations much greater than the number of predictors), the least squares estimates also have low variance.
• When n is not much larger than p, least squares fits can have a lot of variability, resulting in poor prediction.
• When p > n, the variance is infinite, so the method cannot be used at all.
• Model Interpretability:
• By removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
Three classes of methods
• Subset Selection: Identify a subset of the p predictors that is believed to be related to the response, then fit a model using least squares on the reduced set of variables.

• Shrinkage: Fit a model involving all p predictors, but with the estimated coefficients shrunken towards zero relative to the least squares estimates.

• Dimension Reduction: Project the p predictors into an M-dimensional subspace, where M < p, by computing M different linear combinations, or projections, of the variables.
Subset Selection
A. Best Subset Selection

1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, . . . , p:
   a) Fit all (p choose k) models that contain exactly k predictors.
   b) Pick the best among these (p choose k) models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
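To make the procedure concrete, here is a minimal best subset selection sketch in Python (scikit-learn and NumPy assumed; `X` is a generic n × p predictor matrix and `y` the response, not the Credit data):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return the best model M_k of each size k = 1..p, judged by training RSS.

    A sketch of best subset selection; feasible only for small p,
    since it fits all 2^p - 1 non-empty models.
    """
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):        # all (p choose k) subsets
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, cols
        best_per_size[k] = (best_vars, best_rss)         # this is M_k
    return best_per_size
```

A final model among M0, . . . , Mp would then be chosen with Cp, AIC, BIC, adjusted R², or cross-validation, as discussed below.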
Example: Credit data set

[Figure: RSS (left) and R² (right) versus the number of predictors.]

For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
B. Stepwise Selection

• For computational reasons, best subset selection cannot be applied with very large p: there are potentially 2^p models to choose from.
• Best subset selection can also suffer from statistical problems when p is large: an enormous search space can lead to overfitting and high variance of the coefficient estimates.
• For both of these reasons, stepwise methods are attractive, computationally feasible alternatives to best subset selection.
Forward Stepwise Selection

1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, . . . , p − 1:
   a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
   b) Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
Forward Stepwise Selection

• In FSS, a total of 1 + p(p + 1)/2 models are evaluated.
• For p = 20, best subset selection potentially requires fitting over 1 million (2^20) models, whereas FSS fits only 211.
• FSS is not guaranteed to find the best possible model out of the 2^p possibilities.
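A forward stepwise sketch along the same lines (again with a generic `X` and `y`; each step adds the predictor that most reduces the training RSS):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily build M_0, M_1, ..., M_p by adding one predictor at a time."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    models = {0: []}                             # M_0: the null model
    for k in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                      # the p - k candidate augmentations
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        models[k + 1] = list(selected)           # M_{k+1}
    return models
```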
Backward Stepwise Selection

1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, . . . , 1:
   a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
   b) Choose the best among these k models, and call it Mk−1. Here best is defined as having the smallest RSS or highest R².
3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
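In practice, a stepwise search can also be run with scikit-learn's SequentialFeatureSelector. Note that it scores candidate models by cross-validation rather than by training RSS within each size, so it is a variant of the algorithm above rather than a literal implementation; the data and target model size below are synthetic stand-ins:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (the slides use the Credit data, not reproduced here).
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# Backward elimination: start from all p predictors and drop one at a time.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,      # target model size, chosen here only for illustration
    direction="backward",
    scoring="r2",
    cv=5,
).fit(X, y)

print(selector.get_support(indices=True))   # indices of the retained predictors
```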
Choosing the Optimal Model
• Hybrid approaches that combine forward and backward stepwise selection are also used.
• The model containing all of the predictors will always have the smallest RSS and the largest R², since these quantities are related to the training error.
• We want a model with low test error, not merely a model with low training error.
• Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.
Estimating test error: two approaches
• Indirectly estimate test error by making an adjustment to
the training error to account for the bias due to overfitting.
• Directly estimate the test error, using either a validation
set approach or a cross-validation approach.
• There are techniques that adjust the training error for the
model size, and can be used to select among a set of
models with different numbers of variables.
Cp, AIC, BIC, and Adjusted R²

• Mallow's Cp (an estimate of the test MSE):

    Cp = (1/n) (RSS + 2 d σ̂²)

  where d is the total number of parameters used and σ̂² is an estimate of the variance of the error ε associated with each response measurement.
• If σ̂² is an unbiased estimator of σ², then Cp is an unbiased estimator of the test MSE.
AIC

• The AIC criterion is defined for a large class of models fit by maximum likelihood.
• For a linear model with Gaussian errors, AIC is given by:

    AIC = (1/(n σ̂²)) (RSS + 2 d σ̂²)

• In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, so Cp and AIC are equivalent.
BIC

    BIC = (1/n) (RSS + log(n) d σ̂²)

• Like Cp, the BIC will tend to take on a small value for a model with a low test error.
• Since log(n) > 2 for n > 7, the BIC penalizes models with more variables more heavily than Cp does, i.e., it puts a larger penalty on model size.
• For a least squares model with d variables, the adjusted R² statistic is calculated as:

    Adjusted R² = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ]

  where TSS is the total sum of squares.
• Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R² indicates a model with a low test error.
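As an illustration, the four criteria could be computed for a fitted least squares model as follows (a sketch of the formulas above; `sigma2_hat` would typically be estimated from the residuals of the full model containing all p predictors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def selection_criteria(X_subset, y, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 for a least squares fit on X_subset.

    d is the number of predictors in the model; sigma2_hat is an estimate
    of Var(eps), e.g. from the full model with all p predictors.
    """
    n, d = X_subset.shape
    fit = LinearRegression().fit(X_subset, y)
    rss = np.sum((y - fit.predict(X_subset)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```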
Credit data example

[Figure: Cp, BIC, and adjusted R² for the best model of each size on the Credit data, plotted against the number of predictors.]

Adjusted R²

• Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator.
• Unlike the R² statistic, the adjusted R² statistic pays a price for the inclusion of unnecessary variables in the model.
Validation and Cross-Validation
• Each of the procedures above returns a sequence of models Mk indexed by model size k = 0, 1, 2, . . .
• We compute the validation set error or the cross-validation error for each model Mk under consideration, and then select the k for which the resulting estimated test error is smallest.
• This procedure has an advantage relative to AIC, BIC, Cp, and adjusted R², in that it provides a direct estimate of the test error.
• It does not require an estimate of the error variance σ² (which can be difficult to obtain).
• It is also applicable in a wider range of settings where it is difficult to pin down the model's degrees of freedom or the number of parameters; CV is a good alternative in such cases.
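For instance, the cross-validation error for each model size returned by the forward stepwise sketch above could be estimated like this (the scoring choice and fold count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_error_by_size(X, y, models, cv=10):
    """Estimate the test MSE of each model M_k by k-fold cross-validation.

    `models` maps size k to the list of selected column indices, as produced
    by the forward_stepwise sketch earlier; k = 0 is the intercept-only model.
    """
    errors = {0: np.mean((y - y.mean()) ** 2)}   # null model predicts the sample mean
    for k, cols in models.items():
        if k == 0:
            continue
        scores = cross_val_score(LinearRegression(), X[:, cols], y,
                                 scoring="neg_mean_squared_error", cv=cv)
        errors[k] = -scores.mean()
    return errors   # choose the k with the smallest estimated test error
```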
Credit data example

[Figure: square root of BIC, validation set error, and cross-validation error for the best model of each size, plotted against the number of predictors.]
One Standard Error Rule

• If several models have estimated test errors within one standard error of the lowest point on the curve, it is better to choose the simplest one (the one with the fewest parameters).
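A sketch of the rule, given per-size cross-validation error estimates and their standard errors (both hypothetical inputs, e.g. computed across the CV folds):

```python
import numpy as np

def one_standard_error_rule(sizes, cv_errors, cv_se):
    """Pick the smallest model whose CV error is within one SE of the minimum."""
    sizes, cv_errors, cv_se = map(np.asarray, (sizes, cv_errors, cv_se))
    best = np.argmin(cv_errors)
    threshold = cv_errors[best] + cv_se[best]
    return sizes[cv_errors <= threshold].min()   # simplest model within one SE
```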
Shrinkage Methods
• The subset selection methods use least squares to fit a
linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
Ridge regression
• Recall that the least squares fitting procedure estimates β_0, β_1, . . . , β_p using the values that minimize

    RSS = Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )²

• In contrast, the ridge regression coefficient estimates β̂^R are the values that minimize

    Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )² + λ Σ_{j=1}^{p} β_j² = RSS + λ Σ_{j=1}^{p} β_j²

  where λ ≥ 0 is a tuning parameter, to be determined separately.
Ridge Regression

• Like least squares, ridge regression seeks coefficients that fit the data well, i.e., that make the RSS small.
• The second term, λ Σ_j β_j² (the shrinkage penalty), is small when the coefficients are close to zero, so it penalizes large coefficients.
• For λ = 0, ridge regression is the same as least squares.
• For λ > 0, the coefficients are shrunk towards zero.
• A suitable value of λ can be chosen using cross-validation (a sketch follows).
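A minimal sketch of the shrinkage effect (scikit-learn's Ridge calls the tuning parameter `alpha`; the data are synthetic, not the Credit data, for which the figure below shows the same effect):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=8, noise=15.0, random_state=1)

for lam in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    # As lambda grows, the l2 norm of the coefficient vector shrinks towards zero;
    # as lambda -> 0 the fit approaches ordinary least squares.
    print(lam, np.linalg.norm(ridge.coef_))
```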
Credit data example

[Figure: standardized ridge regression coefficient paths for the Credit data, with Income, Limit, Rating, and Student highlighted, plotted against λ (left) and against ||β̂_λ^R||₂ / ||β̂||₂ (right).]
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: multiplying X_j by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c.
• For ridge regression, however, scaling the predictors does change the solution.
• It is therefore best to apply ridge regression after standardizing the predictors, using the formula

    x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^{n} ( x_ij − x̄_j )² )
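Because of this scale dependence, a common pattern is to standardize inside a pipeline before fitting ridge regression; a sketch (the λ grid is arbitrary, and StandardScaler also centers the predictors, which does not affect the slope estimates when an intercept is fit):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Standardize each predictor, then fit ridge regression, choosing lambda (alpha)
# from a grid by RidgeCV's built-in cross-validation.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-2, 4, 50)))
model.fit(X, y)
print(model[-1].alpha_)      # the selected value of lambda
```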
Why Does Ridge Regression Improve Over Least Squares?
The Bias-Variance Tradeoff

[Figure] Simulated data with n = 50 observations and p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions are shown as a function of λ (left) and of ||β̂_λ^R||₂ / ||β̂||₂ (right). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
The Lasso
• Ridge regression does not set any coefficient exactly to zero, so the final model always contains all p predictors.
• The lasso is a relatively recent alternative to ridge regression. The lasso coefficients β̂_λ^L minimize the quantity

    Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )² + λ Σ_{j=1}^{p} |β_j| = RSS + λ Σ_{j=1}^{p} |β_j|

• In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ||β||₁ = Σ_j |β_j|.
• The lasso can set coefficients exactly to zero when the corresponding features are unimportant, much like best subset selection.
The Lasso

• The lasso uses an ℓ1 penalty, ||β||₁ = Σ_j |β_j|.
• The lasso yields sparsity: only a subset of the variables is selected.
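A minimal sketch of this sparsity (scikit-learn's Lasso minimizes (1/(2n))·RSS + α·||β||₁, a rescaled version of the objective above; synthetic data in which only 3 of 10 predictors matter):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=2)
X = StandardScaler().fit_transform(X)       # standardize, as with ridge

lasso = Lasso(alpha=5.0).fit(X, y)          # alpha plays the role of lambda (rescaled)
print(lasso.coef_)                          # several coefficients are exactly zero
print(np.flatnonzero(lasso.coef_))          # the selected variables
```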
Example: Credit dataset

[Figure: standardized lasso coefficient paths for the Credit data, with Income, Limit, Rating, and Student highlighted, plotted against λ (left) and against the ratio of the ℓ1 norm of the lasso coefficients to that of the least squares coefficients (right).]
The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient estimates solve the problems

    minimize over β:  Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )²   subject to   Σ_{j=1}^{p} |β_j| ≤ s

and

    minimize over β:  Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )²   subject to   Σ_{j=1}^{p} β_j² ≤ s

respectively.
The Lasso Picture
Comparing the Lasso and Ridge Regression

[Figure: squared bias (black), variance (green), and test MSE (purple) for the lasso on simulated data.]

Left: a simulated data set in which all predictors are related to the response. Right: only two predictors are related to the response.
Choice of Tuning Parameter

• In the lasso or ridge regression, a value for the tuning parameter λ (or equivalently the budget s) has to be chosen.
• Cross-validation is used to solve this problem:
• Choose a grid of λ values and compute the CV error for each of them.
• Pick the λ for which the CV error is smallest.
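A sketch of this grid search (LassoCV and RidgeCV in scikit-learn implement exactly this pattern; the grid and data below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=3)

# Evaluate a grid of lambda (alpha) values by 10-fold cross-validation and
# keep the one with the smallest estimated test error.
lasso_cv = LassoCV(alphas=np.logspace(-3, 2, 100), cv=10).fit(X, y)
print(lasso_cv.alpha_)        # the selected tuning parameter
print(lasso_cv.coef_)         # coefficients refit at that value
```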
Dimension Reduction Methods
• The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors X_1, X_2, . . . , X_p.
• We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
Dimension Reduction Methods
• Let Z_1, Z_2, . . . , Z_M represent M < p linear combinations of our original p predictors. That is,

    Z_m = Σ_{j=1}^{p} φ_jm X_j                                    (1)

  for some constants φ_1m, . . . , φ_pm, m = 1, . . . , M.
• We can then fit the linear regression model

    y_i = θ_0 + Σ_{m=1}^{M} θ_m z_im + ε_i ,   i = 1, . . . , n    (2)

  using ordinary least squares.
• If the constants φ_1m, . . . , φ_pm are chosen well, this approach can often outperform least squares regression on the original predictors.
Dimension Reduction Methods
• Notice that, from definition (1),

    Σ_{m=1}^{M} θ_m z_im = Σ_{m=1}^{M} θ_m Σ_{j=1}^{p} φ_jm x_ij = Σ_{j=1}^{p} ( Σ_{m=1}^{M} θ_m φ_jm ) x_ij = Σ_{j=1}^{p} β_j x_ij

  where

    β_j = Σ_{m=1}^{M} θ_m φ_jm                                    (3)

• Hence model (2) can be thought of as a special case of the original linear regression model.
• Dimension reduction serves to constrain the estimated β_j coefficients, since now they must take the form (3).
• In cases where M ≪ p, dimension reduction techniques are very useful.
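A small numerical check of identity (3), using the first M principal components as the linear combinations Z_m (synthetic data; M = 2):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=4)
X = X - X.mean(axis=0)                      # center the predictors

M = 2
pca = PCA(n_components=M).fit(X)
Z = pca.transform(X)                        # z_im = sum_j phi_jm * x_ij
theta = LinearRegression().fit(Z, y)        # regress y on Z_1, ..., Z_M

# Implied coefficients on the original predictors: beta_j = sum_m theta_m * phi_jm,
# i.e. a beta vector constrained to an M-dimensional subspace.
beta_implied = pca.components_.T @ theta.coef_
print(beta_implied)
```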
Pictures of PCA

[Figure] The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.
Pictures of PCA: continued

[Figure] A subset of the advertising data. Left: the first principal component, chosen to minimize the sum of the squared perpendicular distances to each point, is shown in green. These distances are represented by the black dashed line segments. Right: the left-hand panel has been rotated so that the first principal component lies on the x-axis.
Principal Components Regression

• PCR constructs the first M principal components Z_1, . . . , Z_M and then uses these components as the predictors in a linear regression model fit by least squares.
• Often a small number of principal components suffices to explain most of the variability in the data.
• The first principal component is the direction along which the observations vary the most (largest variance).
• The second principal component has the largest variance subject to being uncorrelated with the first, and so on.
• Hence many correlated original variables are replaced with a small set of principal components that capture their joint variation.
• The assumption is that the directions in which X_1, . . . , X_p show the most variation are the directions associated with Y.
• This is not always true, but it is often a reasonable assumption.
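A sketch of PCR as a pipeline: standardize, keep the first M principal components, then run ordinary least squares on the component scores, with M itself chosen by cross-validation (the grid and synthetic data are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=6)

pcr = Pipeline([
    ("scale", StandardScaler()),   # standardize before extracting components
    ("pca", PCA()),                # Z_1, ..., Z_M: the first M principal components
    ("ols", LinearRegression()),   # least squares on the component scores
])

# Choose the number of components M by cross-validated MSE.
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 16))},
                      scoring="neg_mean_squared_error", cv=10).fit(X, y)
print(search.best_params_["pca__n_components"])
```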
Pictures of PCA: continued

[Figure] Plots of the first principal component scores z_i1 versus pop and ad. The relationships are strong.
Principal Components Regression

[Figure] PCR applied to simulated data. Left: all predictors are related to the response. Right: only two of them are related to the response.
Partial Least Squares

• Like PCR, PLS is a dimension reduction method: it first identifies a new set of features Z_1, Z_2, . . . , Z_M that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
• Unlike PCR, however, PLS identifies these new features in a supervised way.
• It makes use of the response Y to identify new features that not only approximate the old features well, but are also related to the response.
• Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
Details of Partial Least Squares
• After standardizing the p predictors, PLS computes the first direction Z_1 by setting each φ_j1 in (1) equal to the coefficient from the simple linear regression of Y onto X_j.
• One can show that this coefficient is proportional to the correlation between Y and X_j.
• Hence, in computing Z_1 = Σ_{j=1}^{p} φ_j1 X_j, PLS places the highest weight on the variables that are most strongly related to the response.
• Subsequent directions are found by taking residuals and then repeating the above procedure.
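scikit-learn's PLSRegression implements this kind of procedure; a minimal sketch (the Hitters data referenced below are not reproduced here, so synthetic data stand in):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10.0, random_state=5)
X = StandardScaler().fit_transform(X)           # PLS directions depend on the scaling

pls = PLSRegression(n_components=3).fit(X, y)   # M = 3 supervised directions
print(pls.x_weights_.shape)                     # (p, M): weights defining Z_1, ..., Z_M
y_hat = pls.predict(X)
```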
Example: Hitters Data