
Technometrics

ISSN: 0040-1706 (Print) 1537-2723 (Online) Journal homepage: http://amstat.tandfonline.com/loi/utch20

Better Subset Regression Using the Nonnegative Garrote

Leo Breiman

To cite this article: Leo Breiman (1995) Better Subset Regression Using the Nonnegative
Garrote, Technometrics, 37:4, 373-384

To link to this article: http://dx.doi.org/10.1080/00401706.1995.10484371



© 1995 American Statistical Association and the American Society for Quality Control

Better Subset Regression Using the Nonnegative Garrote
Leo BREIMAN

Statistics Department
University of California, Berkeley
Berkeley, CA 94720

A new method, called the nonnegative (nn) garrote, is proposed for doing subset regression. It both
shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error
than ordinary subset selection. It is also compared to ridge regression. If the regression equations
generated by a procedure do not change drastically with small changes in the data, the procedure is
called stable. Subset selection is unstable, ridge is very stable, and the nn-garrote is intermediate.
Simulation results illustrate the effects of instability on prediction error.

KEY WORDS: Little bootstrap; Model error; Prediction; Stability

1. INTRODUCTION

One of the most frequently used statistical procedures is subset-selection regression. That is, given data of the form {(y_n, x_1n, ..., x_Mn), n = 1, ..., N}, some of the predictor variables x_1, ..., x_M are eliminated and the prediction equation for y is based on the remaining set of variables. The selection of the included variables uses either the best subset method or a forward/backward stepwise method. These procedures give a sequence of subsets of {x_1, ..., x_M} of dimension 1, 2, ..., M. Then some other method is used to decide which of the M subsets to use.

Subset selection is useful for two reasons, variance reduction and simplicity. It is well known that each additional coefficient estimated adds to the variance of the regression equation. The fewer coefficients estimated, the lower the variance. Unfortunately, using too few variables leads to increased bias. But, if a regression equation based on 40 variables, say, can be reduced (without loss of accuracy) to one based on 5 variables, then not only is the equation simpler but we may also have learned something about which variables are important in predicting y.

Using prediction accuracy as our "gold standard," the hope is that subset regression will produce a regression equation simpler and more accurate than the equation based on all variables. If M is large, it usually succeeds. But, if M is moderate to small (that is, M ≤ 10), then there is evidence that often the full regression equation is more accurate than the selected subset regression. Roecker (1991) did recent work on this issue and gave references to relevant past work.

The prime competitor to subset regression in terms of variance reduction is ridge regression. Here the coefficients are estimated by (X'X + λI)^{-1} X'y, where λ is a shrinkage parameter. Increasing λ shrinks the coefficient estimates, but none are set equal to zero. Gruber (1990) gave a recent overview of ridge methods. Some studies (i.e., Frank and Friedman 1993; Hoerl, Schuenemeyer, and Hoerl 1986) have shown that ridge regressions give more accurate predictions than subset regressions unless, assuming that y is of the form

y = Σ_k β_k x_k + ε,

all but a few of the {β_k} are nearly zero and the rest are large. Thus, although subset regression can improve accuracy if M is large, it is usually not as accurate as ridge.

Ridge has its own drawbacks. It gives a regression equation no simpler than the original ordinary least squares (OLS) equation. Furthermore, it is not scale invariant. If the scales used to express the individual predictor variables are changed, then the ridge coefficients do not change inversely proportionally to the changes in the variable scales. The usual recipe is to standardize the {x_m} to mean 0, variance 1 and then apply ridge. But the recipe is arbitrary; that is, interquartile ranges could be used to normalize instead, giving a different regression predictor. For a spirited discussion of this issue, see Smith and Campbell (1980).

Another aspect of subset regression is its instability with respect to small perturbations in the data. Say that N = 100, M = 40, and that using stepwise deletion of variables a sequence of subsets of variables {x_m; m ∈ I_k}, of dimension k (|I_k| = k), k = 1, ..., M, has been selected. Now remove a single data case (y_n, x_n), and use the same selection procedure, getting a sequence of subsets {x_m; m ∈ I'_k}. Usually the {I_k} and {I'_k} are different, so that for some k a slight data perturbation leads to a drastic change in the prediction equation. On the other hand, if one uses ridge estimates and deletes a single data case, the new ridge estimates, for the same λ, will be close to the old.

Much work and research have gone into subset-selection regression, but the basic method remains flawed

by its relative lack of accuracy and instability. Subset regression either zeroes a coefficient, if it is not in the selected subsets, or inflates it. Ridge regression gains its accuracy by selective shrinking. Methods that select subsets, are stable, and shrink are needed. Here is one: Let {β̂_k} be the original OLS estimates. Take {c_k} to minimize

Σ_n (y_n − Σ_k c_k β̂_k x_kn)²

under the constraints

c_k ≥ 0,   Σ_k c_k ≤ s.

The β̃_k(s) = c_k β̂_k are the new predictor coefficients. As the garrote is drawn tighter by decreasing s, more of the {c_k} become zero and the remaining nonzero β̃_k(s) are shrunken.

This procedure is called the nonnegative (nn) garrote. The garrote eliminates some variables, shrinks others, and is relatively stable. It is also scale invariant. I show that it is almost always more accurate than subset selection and that its accuracy is competitive with ridge. In general nn-garrote produces regression equations having more nonzero coefficients than subset regression. But the loss in simplicity is offset by substantial gains in accuracy.
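For readers who want to experiment, here is a minimal Python sketch of the constrained minimization above. It is not the article's FORTRAN routine; it leans on scipy's general-purpose SLSQP solver rather than a nonnegative-least-squares algorithm, and the function name, starting values, and zero tolerance are my own choices.

```python
import numpy as np
from scipy.optimize import minimize

def nn_garrote(X, y, s):
    """Nonnegative garrote: find c_k >= 0 with sum(c_k) <= s minimizing
    sum_n (y_n - sum_k c_k * beta_hat_k * x_kn)^2, then return c_k * beta_hat_k."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * beta_ols                                   # column k of Z is beta_hat_k * x_k
    objective = lambda c: np.sum((y - Z @ c) ** 2)
    constraint = {"type": "ineq", "fun": lambda c: s - c.sum()}   # sum(c) <= s
    M = X.shape[1]
    res = minimize(objective, x0=np.full(M, s / M), method="SLSQP",
                   bounds=[(0.0, None)] * M, constraints=[constraint])
    c = np.where(res.x < 1e-8, 0.0, res.x)             # snap tiny values to exact zeros
    return c * beta_ols, c
```

Sweeping s downward traces out the garrote path: coefficients drop to zero one after another while the survivors shrink.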
The organization of this article is as follows: Section 2 on model selection gives definitions of prediction and model error together with a brief outline of useful estimates of these errors. These estimates are used to determine the value of the garrote parameter s, the ridge parameter λ, and the dimensionality of the subset regression. In Section 3, nn-garrote is compared to subset regression on two well-known data sets. The first is the stackloss data given by Daniel and Wood (1980). The second is an ozone data set used by Breiman and Friedman (1985). In Section 4, I assume that X'X = I. The action of the nn-garrote becomes clear, and it can be compared to ridge and subset selection over an interesting range of {β_k} distributions. Section 5 reports on a simulation comparison of methods. Conclusions and concluding remarks are given in Section 6.

In many regression problems the number of predictor variables is a substantial fraction of the sample size, and variable subset selection is used to reduce complexity and variance. The large ratio of variables to sample size often reflects the experimenter's inclusion of nonlinear terms in search of a better fit. For instance, the stackloss data has three x variables and 17 cases (after removal of four outliers). To get a better fit, Daniel and Wood (1980) introduced quadratic and interaction terms, going from three to nine variables. Then subset selection was used to arrive at a three-variable model. The ozone data set has 330 cases and eight variables but is known to have strong nonlinearities. The analysis in Section 3 includes quadratic and interaction terms for a total of 44 variables.

Useful analytical results are not available when the number of variables is comparable to the sample size. In this area empirical results, good heuristics, and simulations are the only general tools available. Properly used, they can give valuable insights. For instance, the concept of stability was nurtured by the simulation results reported in Section 5 and previous work using simulations to study subset selection.

Sorting out how to reduce complexity and prediction error is a complicated problem. There are few relevant studies in the statistical literature. The book by Miller (1990) summarizes work on variable subset selection and gives an extensive bibliography, but it is primarily concerned with low-dimensional issues. The work in this article came mainly out of a combination of the ideas from Breiman (1993), who used nonnegativity and sum constraints in the context of combining regressions, and the previous explorations of subset selection of Breiman (1992) and Breiman and Spector (1992). Tibshirani (1994), stimulated by the results in a preprint of this article, devised another method for shrinking and subset selection.

The constrained least squares minimization used in the nn-garrote can be solved rapidly even for numerous x variables. I used a modification of the elegant nonnegative least squares algorithm given by Lawson and Hanson (1974). No stability problems were encountered, and computation times increased only moderately as the number of x variables increased. A FORTRAN subroutine that outputs the values of the {c_k} for any value of s, 0 < s < M, is available by ftp to stat-ftp.berkeley.edu in the directory /pub/user/breiman.

2. MODEL SELECTION

2.1 Prediction and Model Error

The prediction error is defined as the average error in predicting y from x for future cases not used in the construction of the prediction equation. The data on hand are of the form {(y_n, x_n), n = 1, ..., N}, where x_n = (x_1n, ..., x_Mn) and the symbols y, x_1, ..., x_M are used as generic notation for the response and M predictor variables.

There are two regression situations, X-controlled and X-random. In the controlled situation, the {x_n} are selected by the experimenter and only y is random. In the X-random situation, both y and x are randomly selected. Different definitions of prediction error are appropriate. In the controlled situation, future data are assumed gathered using the same {x_n} as in the present data and thus have the form {(y_n^new, x_n), n = 1, ..., N}. If μ̂(x) is the prediction equation derived from the present data, then define the prediction error as

PE(μ̂) = E Σ_n (y_n^new − μ̂(x_n))²,

where the expectation is over {y_n^new}.


If the data are generated by the mechanism y_n = μ(x_n) + ε_n, where the {ε_n} are mean-zero uncorrelated with average variance σ², then

PE(μ̂) = Nσ² + Σ_n (μ(x_n) − μ̂(x_n))².

The first component is the inherent prediction error due to the noise. The second component is the prediction error due to lack of fit to the underlying model. This component is called model error and denoted by ME(μ̂). The size of the model error reflects different methods of model estimation. If μ = Σ_m β_m x_mn and μ̂ = Σ_m β̂_m x_mn, then

ME(μ̂) = (β̂ − β)'(X'X)(β̂ − β).

If the {x_n} are random, then it is assumed that the (y_n, x_n) are iid selections from the parent distribution (Y, X). Then if μ̂(x) is the prediction equation constructed using the present data, PE(μ̂) = E(Y − μ̂(X))². Assuming that Y = μ(X) + ε, where E(ε | X) = 0, then PE(μ̂) = Eε² + E(μ̂(X) − μ(X))². Again, the relevant error is the second component. To put model error in this situation on the same scale as in the X-controlled case, define ME(μ̂) = N·E(μ(X) − μ̂(X))², and similarly for P̂E(μ̂). If μ = Σ_m β_m x_m and μ̂ = Σ_m β̂_m x_m, then ME(μ̂) = (β̂ − β)'(N·Γ)(β̂ − β), where Γ_ij = E X_i X_j.
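As a small illustration (the helper names are mine, not the article's), the two model-error formulas translate directly into code:

```python
import numpy as np

def model_error_controlled(beta_hat, beta, X):
    """X-controlled model error: ME = (beta_hat - beta)' X'X (beta_hat - beta)."""
    d = beta_hat - beta
    return d @ (X.T @ X) @ d

def model_error_random(beta_hat, beta, Gamma, N):
    """X-random model error on the same scale: ME = N (beta_hat - beta)' Gamma (beta_hat - beta),
    where Gamma_ij = E[X_i X_j]."""
    d = beta_hat - beta
    return N * d @ Gamma @ d
```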
2.2 Estimating Error
Each regression procedure that we study produces a sequence of models {μ̂_k(x)}. Variable selection gives a sequence of subsets of variables {x_m, m ∈ I_k}, |I_k| = k, k = 1, ..., M, and μ̂_k(x) is the OLS linear regression based on {x_m, m ∈ I_k}. In nn-garrote, a sequence of s-parameter values s_1, ..., s_K is selected and μ̂_k(x), k = 1, ..., K, is the prediction equation using parameter s_k. In ridge, a sequence of λ-parameter values λ_1, ..., λ_K is selected, and μ̂_k(x) is the ridge regression based on λ_k.

If we knew the true value of PE(μ̂_k), the model selected would be the minimizer of PE(μ̂_k). We refer to these selections as the crystal-ball models. Otherwise, the selection process constructs an estimate P̂E(μ̂_k) and selects that μ̂_k that minimizes P̂E. The estimation methods differ for X-controlled and X-random.

2.2.1 X-Controlled Estimates. The most widely used estimate in subset selection is Mallows C_p. If k is the number of variables in the subset, RSS(k) is the residual sum of squares using μ̂_k, and σ̂² is the noise variance estimate derived from the full model (all variables), then the C_p estimate is P̂E(μ̂_k) = RSS(k) + 2kσ̂². But Breiman (1992) showed that this estimate is heavily biased and does poorly in model selection.

It was shown in the same article that a better estimate for PE(μ̂_k) is

RSS(k) + 2B_t(k),

where B_t(k) is defined as follows: Let σ² be the noise variance, and add iid N(0, t²σ²), 0 < t ≤ 1, noise {ε̃_n} to the {y_n}, getting {ỹ_n}. Using the data {(ỹ_n, x_n)}, repeat the subset-selection process, getting a new sequence of OLS predictors {μ̃_k, k = 1, ..., M}. Then

B_t(k) = (1/t²) E Σ_n ε̃_n μ̃_k(x_n),

where the expectation is on the {ε̃_n} only. This is made computable by replacing σ² by the noise variance estimate σ̂² and the expectation over the {ε̃_n} by the average over many repetitions. This procedure is called the little bootstrap.

Little bootstrap can also be applied to nn-garrote and ridge. Suppose that the nn-garrote predictor μ̂_k has been computed for parameter values s_k with resulting residual sum of squares RSS(s_k), k = 1, ..., K. Now add {ε̃_n} to the {y_n}, getting {ỹ_n}, where the {ε̃_n} are iid N(0, t²σ̂²). Using the (ỹ_n, x_n) data, derive the nn-garrote predictor μ̃_k for the parameter value s_k, and compute the quantity (1/t²) Σ_n ε̃_n μ̃_k(x_n). Repeat several times, average these quantities, and denote the result by B_t(s_k). The PE estimate is P̂E(μ̂_k) = RSS(s_k) + 2B_t(s_k).

In ridge regression, denote by μ̂_k the predictor using parameter λ_k. The little bootstrap estimate is RSS(λ_k) + 2B_t(λ_k), where B_t(λ_k) is computed just as in subset selection and nn-garrote. It was shown by Breiman (1992) that, for subset selection, the bias of the little bootstrap estimate is small for t small. The same proof holds, almost word for word, for the nn-garrote and ridge. But what happens in subset selection is that as t ↓ 0, the variance of B_t increases rapidly, and B_t has no sensible limiting value. Experiments by Breiman (1992) indicated that the best range for t is [.6, .8] and that averaging over 25 repetitions to form B_t is usually sufficient.

On the other hand, in ridge regression the variance of B_t does not increase appreciably as t ↓ 0, and taking this limit results in the more unbiased estimate

P̂E(μ̂_k) = RSS(λ_k) + 2σ̂² tr(X'X(X'X + λ_k I)^{-1}).   (2.1)

This turns out to be an excellent PE estimate that selects regression equations μ̂_k with PE(μ̂_k) close to min_k PE(μ̂_k). The estimate (2.1) was proposed on other grounds by Mallows (1973). See also Hastie and Tibshirani (1990).

The situation in nn-garrote is intermediate between subset selection and ridge. The variance of B_t increases as t gets small, but a finite variance limit exists. It does not perform as well as using t in the range [.6, .8], however. Therefore, our preferred PE estimates for subset selection and nn-garrote use t ∈ [.6, .8] and (2.1) for the ridge PE estimate. The behavior of B_t for t small is a reflection of the stability of the regression procedures used. This was explored further by Breiman (1994).
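The little bootstrap is straightforward to sketch in code. The function below is a generic outline, assuming the caller supplies a fit_predict routine that reruns the entire selection procedure (backward deletion to a fixed subset size, nn-garrote at a fixed s, or ridge at a fixed λ) on the perturbed responses; the names and defaults are illustrative only.

```python
import numpy as np

def little_bootstrap_pe(X, y, rss, fit_predict, sigma2_hat, t=0.6, reps=25, rng=None):
    """Little bootstrap estimate PE_hat = RSS + 2 * B_t.

    fit_predict(X, y_tilde) must rerun the whole procedure on the perturbed
    responses and return the fitted values mu_tilde(x_n) at the rows of X."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(y)
    B = 0.0
    for _ in range(reps):
        eps = rng.normal(0.0, t * np.sqrt(sigma2_hat), size=N)   # iid N(0, t^2 * sigma_hat^2)
        mu_tilde = fit_predict(X, y + eps)
        B += (eps @ mu_tilde) / t ** 2
    return rss + 2.0 * B / reps
```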

2.2.2 X-Random Estimates. For subset regressions {μ̂_k} in the X-random situation, the most frequently encountered PE estimate is

P̂E(μ̂_k) = (1 − k/N)^{-2} RSS(k).

The results of Breiman and Spector (1992) show that this estimate can be strongly biased and does poorly in selecting accurate models. What does work is cross-validation. V-fold CV is used to estimate PE(μ̂_k) for subset selection and nn-garrote. The data L = {(y_n, x_n), n = 1, ..., N} are split into V subsets L_1, ..., L_V. Let L^(v) = L − L_v. Using subset selection (nn-garrote) and the data in L^(v), form the predictors {μ̂_k^(v)(x)}. The CV estimate is

P̂E(μ̂_k) = Σ_v Σ_{(y_n, x_n) ∈ L_v} (y_n − μ̂_k^(v)(x_n))²,

and M̂E(μ̂_k) = P̂E(μ̂_k) − Nσ̂². Taking V in the range 5 to 10 gives satisfactory results.
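A generic sketch of this V-fold CV estimate follows; fit_predict again stands in for whichever procedure (subset selection or nn-garrote at a fixed parameter value) is being evaluated, and the interface is assumed for illustration.

```python
import numpy as np

def vfold_estimates(X, y, fit_predict, V=10, sigma2_hat=None, rng=None):
    """V-fold CV estimate of PE; the ME estimate is PE_hat - N * sigma2_hat.

    fit_predict(X_train, y_train, X_test) reruns the full selection procedure
    on the training fold and returns predictions for the held-out rows."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(y)
    pe = 0.0
    for fold in np.array_split(rng.permutation(N), V):
        train = np.setdiff1d(np.arange(N), fold)
        pred = fit_predict(X[train], y[train], X[fold])
        pe += np.sum((y[fold] - pred) ** 2)
    me = None if sigma2_hat is None else pe - N * sigma2_hat
    return pe, me
```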
To get an accurate PE estimate for the ridge regression μ̂_λ, remove the nth case (y_n, x_n) from the data and recompute μ̂_λ(x), getting μ̂_λ^(-n)(x). Then the estimate is

P̂E(μ̂_λ) = Σ_n (y_n − μ̂_λ^(-n)(x_n))².

This is the leave-one-out CV estimate. If r_n(λ) = y_n − μ̂_λ(x_n) and h_n(λ) = x_n'(X'X + λI)^{-1} x_n, then

P̂E(λ) = Σ_n (r_n(λ)/(1 − h_n(λ)))².

Usually, h_n(λ) ≃ h̄(λ) is a good approximation, where h̄(λ) = tr(X'X(X'X + λI)^{-1})/N. With this approximation

P̂E(λ) = RSS(λ)/(1 − h̄(λ))².   (2.2)

This estimate was first derived by Golub, Heath, and Wahba (1979) and is called the GCV (generalized cross-validation) estimate of PE. Its accuracy is confirmed in the simulation in Section 5.
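Formula (2.2) is easy to compute directly; the following sketch (helper name assumed) evaluates it for a single λ:

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """GCV estimate (2.2): RSS(lambda) / (1 - h_bar(lambda))^2,
    with h_bar = tr(X'X (X'X + lambda I)^{-1}) / N."""
    N, M = X.shape
    XtX = X.T @ X
    A = XtX + lam * np.eye(M)
    beta = np.linalg.solve(A, X.T @ y)                 # ridge coefficients
    rss = np.sum((y - X @ beta) ** 2)
    h_bar = np.trace(np.linalg.solve(A, XtX)) / N      # tr((X'X + lam I)^{-1} X'X) / N
    return rss / (1.0 - h_bar) ** 2
```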
Breiman and Spector (1992) found that the "infinitesimal" version of CV, that is, leave-one-out, gave poorer results in subset selection than five- or tenfold CV [for theoretical work on this issue, see Shao (1993) and Zhang (1992)]. But leave-one-out works well in ridge regression. Simulation results show that tenfold CV is slightly better for nn-garrote than leave-one-out. This again reflects the relative stabilities of the three procedures.

3. TWO EXAMPLES

The use of the nn-garrote is illustrated in two well-known data sets. One is X-controlled data, and the other I put into the X-random context.

3.1 The Stackloss Data

These data are the three-variable stackloss data studied in chapter 5 of Daniel and Wood (1980). By including quadratic terms along with the linear, it becomes a nine-variable problem. Eliminating the outliers identified by Daniel and Wood leaves 17 cases.

I compare nn-garrote to subset selection using backward deletion. Daniel and Wood gave two possible fitting equations, stating that there is little to choose between them. Backward deletion and 250 repetitions of little bootstrap pick the second of these equations,

ŷ = 14.1 + .71 x_1 + .51 x_2 + .0254 x_1x_2.   (3.1)

Garrote picks an equation using the same variables,

ŷ = 14.1 + .77 x_1 + .40 x_2 + .0152 x_1x_2.   (3.2)

The estimated model errors are 3.0 and 1.0, respectively (with estimated prediction errors 41.4 and 39.3). The two equations appear similar, but each pair of coefficients differs by almost .5 if the x variables are put on standardized scales.

The value of s selected is .25M (M = 9). Because s = 9 corresponds to the full OLS regression, this could be interpreted as meaning that the coefficients were shrunk to 25% of the OLS values. The sum of the coefficients in the garrote equation (3.2) is a bit smaller than in (3.1), but the major effect is the redistribution of emphasis on the three variables included.

3.2 Ozone Data

The ozone data were also used by Friedman and Silverman (1989), Hastie and Tibshirani (1990), and Cook (1993). It consists of daily ozone and meteorological data for the Los Angeles Basin in 1976. There are 330 cases with no missing data. The dependent variable is ozone. There are eight meteorological predictor variables:

x1: 500 mb height
x2: wind speed
x3: humidity
x4: surface temperature
x5: inversion height
x6: pressure gradient
x7: inversion temperature
x8: visibility

These data are known to be nonlinear in some of the variables, so, after subtracting means, interactions and quadratic terms were added, giving 44 variables. Subset selection was done using backward deletion of variables. To get the estimates of the best subset size, garrote parameter, and prediction errors, tenfold CV was used. The tenfold CV was repeated five times using different random divisions of the data and the results averaged.

Subset selection chooses the five-variable equation

ŷ = 6.2 + 4.6 x_6 + 2.4 x_2x_4 − 1.3 x_2x_8 + 5.5 x_4² − 4.2 x_5²,   (3.3)

whereas nn-garrote chooses the seven-variable equation

ŷ = 6.2 + 3.9 x_4 − 1.7 x_5 − .3 x_5² + .6 x_2x_4 + 5.2 x_4² + .8 x_5x_7 − .4 x_7².   (3.4)


(All variables, including interactions and quadratic terms, are standardized to mean 0, variance 1.)

The estimated mean prediction error for the subset equation (3.3) is 10.0, with mean model error 3.3. The nn-garrote equation (3.4) has an estimated mean prediction error of 9.0 with mean model error of 2.3. Each equation has a strong temperature term x_4² with about the same coefficient. Otherwise, they are dissimilar and include different variables. All of the coefficients in the subset-selection equation (3.3) are substantial in size. But due to the shrinking nature of nn-garrote, some of the coefficients in (3.4) are small.

The value .26 is selected for the nn-garrote parameter. This is surprisingly small because s = 44 corresponds to the full OLS regression. Thus the coefficients have been shrunk to less than 1% of their OLS values.

Equation (3.4) includes some quadratic terms without the corresponding linear terms. One referee objected to the nonhierarchical form of this model, and an associate editor asked me to comment, noting that many statisticians prefer hierarchical regression models. My model-selection approach is based on minimizing prediction error, and if a nonhierarchical model has smaller prediction error, so be it. I agree, however, that in some situations hierarchical models make more physical sense.
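The design expansion used for the ozone analysis (center the raw predictors, add interactions and squares, standardize every column) can be sketched as follows; the ozone data themselves are not bundled here, and scikit-learn is used purely for convenience:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def expand_design(X_raw):
    """Center the raw predictors, add all pairwise interactions and squares,
    then standardize each column to mean 0, variance 1.
    With 8 raw variables this gives 8 linear + 28 interaction + 8 squared = 44 columns."""
    Xc = X_raw - X_raw.mean(axis=0)
    X_full = PolynomialFeatures(degree=2, include_bias=False).fit_transform(Xc)
    return StandardScaler().fit_transform(X_full)
```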
4. X ORTHONORMAL

In the X-controlled case, assume that X'X = I and that y is generated as

y_n = Σ_m β_m x_mn + ε_n,

where the {ε_n} are iid N(0, 1). Then OLS gives β̂_m = β_m + Z_m, where the Z_m are iid N(0, 1). Although this is a highly simplified situation, it can give interesting insights into the comparative behavior of subset selection, ridge regression, and the nn-garrote regressions. The best subset of k variables consists of those x_m corresponding to the k largest |β̂_m|, so that the coefficients of a best subset regression are

β̃_m = I(|β̂_m| > λ) β̂_m,   m = 1, ..., M,   (4.1)

for some λ ≥ 0, where I(·) is the indicator function.

In nn-garrote, the expression

Σ_n (y_n − Σ_m c_m β̂_m x_mn)²

is minimized under the constraints c_m ≥ 0, all m, and Σ_m c_m = s. The solution is of the form

c_m = (1 − λ²/β̂_m²)^+,

where λ is determined from s by the condition Σ_m c_m = s and the superscript + indicates the positive part of the expression. The nn-garrote coefficients are

β̃_m = (1 − λ²/β̂_m²)^+ β̂_m.   (4.2)

The ridge coefficients are

β̃_m = β̂_m/(1 + λ).   (4.3)

All three of these estimates are of the form β̃ = θ(β̂, λ)β̂, where θ is a shrinkage factor. OLS estimates correspond to θ ≡ 1. Ridge regression gives a constant shrinkage, θ = 1/(1 + λ). Subset selection is 0 for |β̂| ≤ λ and 1 otherwise. The nn-garrote shrinkage is continuous, 0 if |β̂| ≤ λ and then increasing to 1. The nn-garrote shrinkage factor is graphed in Figure 1 for λ = 1.

Figure 1. Shrinkage Factor for nn-Garrote.

If the {β̃_m} are any estimates of the {β_m}, then the model error is

ME({β̃_m}) = Σ_m (β_m − β̃_m)².

For estimates of the form θβ̂,

ME(λ) = Σ_m (β_m − θ(λ, β̂_m) β̂_m)².

I denote the minimum loss by ME* = min_λ ME(λ). Assume that M is large and that the {β_m} are iid selections from a distribution P(dβ). Then

ME(λ) = Σ_m (β_m − θ(λ, β_m + Z_m)(β_m + Z_m))² ≃ M · E[β − θ(λ, β + Z)(β + Z)]²,

giving the approximation

ME* = M min_λ E[β − θ(λ, β + Z)(β + Z)]²,   (4.4)

where Z has an N(0, 1) distribution independent of β. To simplify notation, put MĒ* = ME*/M. For the ridge shrink,

MĒ* = Eβ²/(1 + Eβ²).

The other minimizations are not analytically tractable but are easy to compute numerically.
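The three shrinkage factors (4.1)-(4.3) are simple functions of the OLS coefficient; here is a short sketch (function names mine) that also reproduces the garrote curve of Figure 1:

```python
import numpy as np

def theta_subset(b_hat, lam):
    """Best-subset hard threshold: 1 if |beta_hat| > lambda, else 0."""
    return (np.abs(b_hat) > lam).astype(float)

def theta_garrote(b_hat, lam):
    """nn-garrote factor: (1 - lambda^2 / beta_hat^2)_+, continuous in beta_hat."""
    b_hat = np.asarray(b_hat, dtype=float)
    with np.errstate(divide="ignore"):
        return np.where(np.abs(b_hat) <= lam, 0.0, 1.0 - lam ** 2 / b_hat ** 2)

def theta_ridge(lam):
    """Ridge factor: constant 1 / (1 + lambda)."""
    return 1.0 / (1.0 + lam)

# The curve of Figure 1 (lambda = 1): shrunken coefficient versus OLS coefficient.
b = np.linspace(-6.0, 6.0, 241)
garrote_coef = theta_garrote(b, 1.0) * b
```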


I wanted to look at the MĒ* values for a "revealing" family of distributions of β. It is known that ridge is "optimal" if β has an N(0, σ²) distribution. Subset selection is best if many of the coefficients are 0 and the rest large. This led to use of the family P(dβ) = p δ₀(dβ) + q Q(dβ, σ), where δ₀(dβ) is a mass concentrated at 0 and Q(dβ, σ) is N(0, σ²). The range of p is [0, 1], and σ ∈ [0, 5].

Figure 2 plots MĒ* versus σ for p = 0, .3, .6, .9 for subset selection, ridge, and nn-garrote. The scaling is such that the OLS MĒ* is 1. Note that the MĒ* for nn-garrote is always lower than the subset-selection MĒ* and is usually lower than the ridge MĒ* except at p = 0.

Another question is how many variables are included in the regressions by subset selection compared to nn-garrote. If λ_S and λ_G are the values of λ that minimize the respective model errors, then the proportions P_S and P_G of β̂'s zeroed are

P_S = P(|β + Z| ≤ λ_S),
P_G = P(|β + Z| ≤ λ_G).

Figure 3 gives plots of P_S, P_G versus p for σ = 1.0, 1.5, 3.0.

Figure 2. MĒ* Versus Sigma (panels for p = 0, .3, .6, .9): Garrote; Subset; Ridge.

Figure 3. Proportion Zeroed by Procedure Versus Proportion Zero in Distribution (panels for σ = 1.0, 1.5, 3.0): Garrote; Subset.
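The curves in Figure 2 can be approximated by Monte Carlo using (4.4): draw β from the mixture p δ₀ + (1 − p) N(0, σ²), set β̂ = β + Z with Z ~ N(0, 1), and minimize the average loss over a grid of λ for each shrinkage factor. The sketch below is my own approximation of that calculation, not the article's original program; the grid and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def me_bar_star(theta, p, sigma, lambdas, n=200_000):
    """Monte Carlo version of (4.4): min over lambda of E[beta - theta(beta_hat, lambda) * beta_hat]^2,
    with beta ~ p*delta_0 + (1-p)*N(0, sigma^2) and beta_hat = beta + Z, Z ~ N(0, 1)."""
    beta = np.where(rng.random(n) < p, 0.0, rng.normal(0.0, sigma, n))
    beta_hat = beta + rng.normal(0.0, 1.0, n)
    return min(np.mean((beta - theta(beta_hat, lam) * beta_hat) ** 2) for lam in lambdas)

# One point on a Figure 2-style curve (p and sigma chosen arbitrarily here):
hard_threshold = lambda b, lam: (np.abs(b) > lam).astype(float)     # subset selection
print(me_bar_star(hard_threshold, p=0.3, sigma=2.0, lambdas=np.linspace(0.0, 5.0, 101)))
```

The garrote and ridge factors plug into the same interface; for ridge, wrap the constant factor as lambda b, lam: np.full_like(b, 1.0 / (1.0 + lam)).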

In regard to simplicity, that is, how many variables are included in the regression, Figure 3 shows that nn-garrote is comparable to subset selection. Subset selection has a discontinuity at σ = 1. For σ < 1, it deletes all variables and P_S = 1. For σ > 1, it settles down to the behavior shown in the σ = 1.5 and 3.0 graphs.

5. SIMULATION RESULTS

Because analytic results are difficult to come by in this area, the major proving ground is testing on simulated data.

5.1 Simulation Structure

I did two simulations, one with 20 variables and 40 cases in the X-controlled case and the other with 40 variables and 80 cases in the X-random case. The major purpose was to compare the accuracies of subset selection, nn-garrote, and ridge regression. The secondary purpose was to learn as much as possible about other interesting characteristics.

The data were generated by

y = Σ_m β_m x_m + ε,

with {ε} iid N(0, 1). The ticklish specifications are the coefficients {β_m} and the X design. What is clear is that the results are sensitive to the proportion of the {β_m} that are 0. To create a "level playing field," five different sets of coefficients were used in each simulation. At one extreme, almost all of the coefficients are 0. At the other, most are nonzero.

A coefficient cluster centered at j̄ of radius r_c is defined as

β(j̄ + i) = (r_c − |i|)²,  |i| ≤ r_c,
         = 0,  otherwise.

Each cluster has 2r_c − 1 nonzero coefficients. In the X-controlled case with 20 variables, the coefficients were in two clusters centered at 5 and 15; in the X-random case, in three clusters centered at 10, 20, and 30. The values of the coefficients were renormalized so that in the X-controlled case the average R² was around .85, and in the X-random case about .75.

Each simulation consisted of five runs with 250 iterations in each run. Each of the five runs used a different cluster radius with r_c = 1, 2, 3, 4, 5. This gave the results shown in Table 1. The X distribution was generated by sampling from N(0, Γ), where Γ_ij = ρ^|i−j|. In each iteration ρ was selected at random from [−1, 1].

In each X-controlled iteration, subset selection was done using backward variable deletion. nn-garrote used s values 1, 2, ..., 20, and ridge regression searched over λ values such that tr(X'X(X'X + λI)^{-1}) = 1, 2, ..., 20. The ME values for subset selection and nn-garrote were estimated using the average of 25 repetitions of little bootstrap with t = .6. The ME values for ridge were estimated using (2.1). The true ME for each predictor was computed as (β̂ − β)'X'X(β̂ − β).

Table 1. Results of Simulation Consisting of Five Runs, 250 Iterations Each

Cluster radius    X-controlled #nonzero coeff.    X-random #nonzero coeff.
1                 2                               3
2                 6                               9
3                 10                              15
4                 14                              21
5                 18                              27
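A sketch of this data-generating setup for the X-controlled case follows (the renormalization of the coefficients to hit a target R² is omitted, and the helper names are mine):

```python
import numpy as np

def cluster_coefficients(M, centers, r_c):
    """beta(center + i) = (r_c - |i|)^2 for |i| < r_c (1-based centers), zero elsewhere."""
    beta = np.zeros(M)
    for c in centers:
        for i in range(-(r_c - 1), r_c):
            beta[c - 1 + i] = (r_c - abs(i)) ** 2
    return beta

def draw_X(N, M, rho, rng):
    """Rows iid N(0, Gamma) with Gamma_ij = rho^|i-j|."""
    Gamma = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
    return rng.multivariate_normal(np.zeros(M), Gamma, size=N), Gamma

rng = np.random.default_rng(1)
beta = cluster_coefficients(M=20, centers=(5, 15), r_c=3)      # 10 nonzero coefficients
rho = rng.uniform(-1.0, 1.0)                                   # a fresh rho each iteration
X, Gamma = draw_X(N=40, M=20, rho=rho, rng=rng)
y = X @ beta + rng.normal(size=40)                             # noise is N(0, 1)
```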
The X-random runs had a similar structure, using backward deletion, s values 1, ..., 40, and λ values such that tr(X'X(X'X + λI)^{-1}) = 1, ..., 40. The ME values for subset selection and nn-garrote were estimated using tenfold CV. The ME values for ridge regression were estimated using GCV (2.2). The true ME was computed as N(β̂ − β)'Γ(β̂ − β).

5.2 Simulation Results

In each run, various summary statistics for the 250 iterations were computed.

5.2.1 Accuracy. The most important results were the average true model errors for the predictors selected by the various methods. Figure 4(a) plots these values versus the cluster radius for the X-controlled simulation. Figure 5(a) plots the average true ME values versus the cluster radius for the X-random case (the nn-garrote estimate is chosen using tenfold CV). The two graphs give the same message: nn-garrote is always more accurate than variable selection. If there are many nonzero coefficients, ridge is more accurate. If there are only a few, nn-garrote wins.

An important issue is how much of the differences between the ME values for subset selection, garrote, and ridge [plotted in Figs. 4(a) and 5(a)] can be attributed to random fluctuation. In the simulation, standard errors are estimated for these differences at each cluster radius. Table 2 gives these estimates averaged over the five cluster radii.

5.2.2 Using a Crystal Ball. I have been comparing the estimated best of M subset regressions to the estimated best of M nn-garrote and ridge regressions. That is, PE estimates are constructed and the prediction equation having minimum estimated PE selected. A natural question is what would happen if we had a crystal ball, that is, if we selected the best predictor based on "inside" knowledge of the true ME. For instance, what is the minimum ME among all subset regression predictors? Among all nn-garrote predictors? Among ridge predictors?

This separates the issue of how good a predictor there is among the set of M candidates from the issue of how well we can recognize the best. Figure 4(b) gives a plot of the minimum true ME's for the subset selection, nn-garrote, and ridge predictors versus cluster radius for the


X-controlled simulation. Figure 5(b) gives the analogous plot for the X-random simulation. Figures 4(c) and 5(c) show how much larger the fallible-knowledge ME is than the crystal-ball ME. Table 3 gives the estimated SE's for the differences of the crystal-ball ME's plotted in Figures 4(b) and 5(b) (averaged over the cluster radii).

The differences between the minimum true ME's for the three methods are smaller than the ME differences using the predictors selected by the ME estimates. The implications are interesting. The crystal-ball subset-selection predictors are close (in ME) to the crystal-ball nn-garrote predictors. The problem is that it is difficult to find the minimum ME subset-selection model. On the other hand, the crystal-ball ridge predictors are not as good as the other two, but the ridge ME estimates do better selection.

Figure 4. ME in X-Controlled Simulation Versus Cluster Radius (a: ME's of predictors selected using PE estimates; b: ME's of predictors selected using crystal ball; c: ME differences, fallible minus crystal ball): Subset; Garrote; Ridge.

Figure 5. ME in X-Random Simulation Versus Cluster Radius (a: ME's of predictors selected using PE estimates; b: ME's of predictors selected using crystal ball; c: ME differences, fallible minus crystal ball): Subset; Garrote; Ridge.


Table 2. Estimated SE's for ME Differences

Difference        X-controlled    X-random
Subset-garrote    .3              .9
Subset-ridge      .4              1.2
Garrote-ridge     .3              .8

Table 3. Estimated SE's for Crystal-Ball ME Differences

Difference        X-controlled    X-random
Subset-garrote    .2              .4
Subset-ridge      .3              .8
Garrote-ridge     .3              .6

Better methods to select low ME subset regressions could make that procedure more competitive in accuracy. But I believe that the intrinsic instability of the method will not allow better selection.

5.2.3 Accuracy of the ME Estimates. The ME estimates are used both to select a prediction equation and to supply an ME estimate for the selected predictor. For the selected predictors I computed the average absolute value of the difference between the estimated and true ME's. The results are graphed in Figure 6(a) versus cluster radius for the X-controlled simulation and in 6(b) for the X-random simulation.

The ME estimates for subset selection are considerably worse than those for nn-garrote or ridge regression. Part of the lack of accuracy is downward bias, given in Table 4 as averaged over all values of cluster radius. But downward bias is not the only source of error. The standard deviation of the ME estimates for subset regression is also considerably larger than for nn-garrote and ridge regression.

Table 4. Downward Bias as Averaged Over All Values of Cluster Radius

Downward bias       X-controlled    X-random
Subset selection    5.8             13.0
nn-Garrote          3.2             5.0
Ridge               1.9             4.1

Figure 6. Average Absolute Error in ME Estimate for Selected Predictor Versus Cluster Radius (a: X-controlled; b: X-random): Subset; Garrote; Ridge.

5.2.4 Number of Variables. I kept track of the average number of variables in the selected predictors for subset selection and nn-garrote. Figure 7(a) plots these values versus cluster radius for the X-controlled simulation. Figure 7(b) is a plot for the X-random simulation. In the X-controlled situation, not many more variables are used by nn-garrote than by subset selection. In the X-random simulation, nn-garrote uses almost twice the number of variables as subset selection.

5.2.5 Best Subsets Versus Variable Deletion. In the best-subsets procedure, the selected subset I_k of k variables is such that the regression of y on {x_m, m ∈ I_k} has minimum RSS among all k-variable regressions. Our simulations did subset selection using backward deletion of variables. The question (raised by an associate editor) is how much our results reflect the use of deletion rather than best subsets.

Certainly, the subsets selected by the best-subsets procedure have lower RSS than those found using deletion. But, as exemplified in Section 5.2.7, lower RSS does not necessarily translate into lower prediction error. To explore the difference, I used the same data as in the 20-variable X-controlled simulation. In each of 250 iterations, two sequences of subsets were formed, one by deletion, the other by the Furnival and Wilson (1974) best-subsets algorithm, Leaps.

Then 25 repetitions of little bootstrap were done using deletion and the result used to select one subset out of the deletion sequence. Another 25 little bootstraps were done using Leaps and the results used to select one of the best subsets. The ME's were computed for each of the selected subsets and then averaged over the 250 repetitions. The results are plotted in Figure 8. The differences are small.
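For reference, backward variable deletion as used in the simulations can be sketched as below: at each step the variable whose removal increases RSS least is dropped, giving a nested sequence of subsets. This is a plain illustrative implementation, written for clarity rather than speed, and it is not the Leaps algorithm.

```python
import numpy as np

def backward_deletion(X, y):
    """Nested subsets by backward deletion: at each step drop the variable whose
    removal gives the smallest increase in residual sum of squares."""
    def rss(cols):
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        return np.sum((y - X[:, cols] @ beta) ** 2)

    current = list(range(X.shape[1]))
    subsets = [tuple(current)]
    while len(current) > 1:
        _, drop = min((rss([c for c in current if c != j]), j) for j in current)
        current.remove(drop)
        subsets.append(tuple(current))
    return subsets[::-1]          # subsets of size 1, 2, ..., M
```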

Figure 7. Average Number of Variables Used in Predictors Selected Versus Cluster Radius (a: X-controlled; b: X-random): Subset; Garrote.

Figure 8. ME in X-Controlled Simulation, Leaps Versus Deletion: Best Subsets; Deletion.

Figure 9. RSS Versus ME for 100 "Best" 10-Variable Subset Regressions.

5.2.6 Nesting of the nn-Garrote Subsets. Stepwise variable deletion or addition produces nested subsets of variables. But the sequence of best (lowest RSS) subsets of dimension 1, 2, ..., M is generally not nested. A natural question is whether the subsets of variables


produced by nn-garrote as s decreases are nested. The answer is "almost always, but not always." For instance, in the 1,250 iterations of nn-garrote in the 20-variable X-controlled simulation, 17 were not nested. Of the 1,250 in the 40-variable X-random simulation, 68 were not nested.

5.2.7 RSS Versus ME Instability in Subset Selection. To illustrate the instability of subset selection, I generated an X-controlled data set with r_c = 3 and ρ = .7. Leaps was used to find the 100 subset regressions based on 10 variables having lowest RSS. For each of these regressions, the true ME was computed. Figure 9 is a graph of RSS versus ME for the 100 equations. The 100 lowest RSS values are tightly packed. But the ME spreads over a wide range. Shifting from one of these models to another would result in only a small RSS difference but could give a large ME change.

6. CONCLUSIONS AND REMARKS

I have given evidence that the nn-garrote is a worthy competitor to subset selection methods. It provides simple regression equations with better predictive accuracy. Unless a large proportion of the "true" coefficients are nonnegligible, it gives accuracy better than or comparable to ridge methods. Data reuse methods such as little bootstrap or V-fold CV do well in estimating good values of the garrote parameter.

Some simulation results can be viewed as intriguing aspects of stability. Each regression procedure chooses from a collection of regression equations. Instability is intuitively defined as meaning that small changes in the data can bring large changes in the equations selected. If, by use of a crystal ball, we could choose the lowest PE equations among the subset-selection, nn-garrote, and ridge collections, the differences in accuracy between the three procedures are sharply reduced. But the more unstable a procedure is, the more difficult it is to accurately estimate PE or ME. Thus, subset-selection accuracy is severely affected by the relatively poor performance of the PE estimates in picking a low PE subset.

On the other hand, ridge regression, which offers only a small diversity of models but is very stable, sometimes wins because the PE estimates are able to accurately locate low PE ridge predictors. nn-garrote is intermediate. Its crystal-ball selection is usually somewhat better than the crystal-ball subset selection, but its increased stability allows a better location of low PE nn-garrote predictions, and this increases its edge.

The work in this article raises interesting questions. For instance, can the concept of stability be formalized and applied to the general issue of selection from a family of predictors? Can one use a formal definition to get a numerical measure of stability for procedures such as the three dealt with here? In another area, why is it that the nn-garrote produces "almost always, but not always" nested sequences of variable subsets?

The nn-garrote results may have profitable application to tree-structured classification and regression. The present method for finding the "best" tree resembles stepwise variable deletion using V-fold CV. Specifically, a large tree is grown and pruned upward using V-fold CV to estimate the optimal amount of pruning. I am experimenting with the use of a procedure analogous to nn-garrote to replace pruning. The results, to date, have been as encouraging as in the linear regression case.

Another possible application is to selection of more accurate autoregressive models in time series. Picking the order of the autoregressive scheme is similar to estimating the best subset regression. The nn-garrote methodology should carry over to this area and may provide increased prediction accuracy.

The ideas used in the nn-garrote can be applied to get other regression shrinkage schemes. For instance, let {β̂_k} be the original OLS estimates. Take {c_k} to minimize

Σ_n (y_n − Σ_k c_k β̂_k x_kn)²

under the constraint Σ_k c_k² ≤ s. This version leads to a procedure intermediate between nn-garrote and ridge regression. In the X'X = I case, its shrinkage factor is

θ(β̂, λ) = β̂²/(β̂² + λ²).

Unlike ridge, it is scale invariant. Our expectation is that it will be uniformly more accurate than ridge regression while being almost as stable. Like ridge regression, it does not zero coefficients or produce simplified predictors. Study of this version of the garrote is left to future research.
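In the X'X = I case this variant's shrinkage factor is a one-liner (a sketch under that assumption; the function name is mine), useful for comparing it against the three factors of Section 4:

```python
import numpy as np

def theta_garrote_sq(b_hat, lam):
    """Shrinkage factor of the sum-of-squares-constrained garrote variant
    (X'X = I case): theta = beta_hat^2 / (beta_hat^2 + lambda^2).
    It never zeros a coefficient, but unlike ridge it is scale invariant."""
    b2 = np.asarray(b_hat, dtype=float) ** 2
    return b2 / (b2 + lam ** 2)
```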
ACKNOWLEDGMENTS

It is a pleasure to acknowledge the many illuminating conversations on regression regularization that I have had with Jerry Friedman over the years and particularly during our recent collaboration on methods for predicting multiple correlated responses. This work doubtless stimulated some of my thinking about the nn-garrote. Phil Spector did the S run that produced the data used in Figure 9, and I gratefully acknowledge his assistance. Research was supported by National Science Foundation Grant DMS-9212419.

[Received December 1993. Revised March 1995.]

REFERENCES

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738-754.
——— (1993), "Stacked Regressions," Technical Report 367, University of California, Berkeley, Statistics Dept.
——— (1994), "The Heuristics of Instability in Model Selection," Technical Report 416, University of California, Berkeley, Statistics Dept.


Breiman, L., and Friedman, J. (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation" (with discussion), Journal of the American Statistical Association, 80, 580-619.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression: The X-Random Case," International Statistical Review, 60, 291-319.
Cook, R. (1993), "Exploring Partial Residual Plots," Technometrics, 35, 351-362.
Daniel, C., and Wood, F. (1980), Fitting Equations to Data, New York: John Wiley.
Frank, I., and Friedman, J. (1993), "A Statistical View of Some Chemometrics Regression Tools" (with discussion), Technometrics, 35, 109-148.
Friedman, J., and Silverman, B. (1989), "Flexible Parsimonious Smoothing and Additive Modeling" (with discussion), Technometrics, 31, 3-40.
Furnival, G., and Wilson, R. (1974), "Regression by Leaps and Bounds," Technometrics, 16, 499-511.
Golub, G., Heath, M., and Wahba, G. (1979), "Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter," Technometrics, 21, 215-224.
Gruber, M. (1990), Regression Estimators: A Comparative Study, Boston: Academic Press.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman and Hall.
Hoerl, R., Schuenemeyer, J., and Hoerl, A. (1986), "A Simulation of Biased Estimation and Subset Selection Regression Techniques," Technometrics, 28, 369-380.
Lawson, C., and Hanson, R. (1974), Solving Least Squares Problems, Englewood Cliffs, NJ: Prentice-Hall.
Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
Miller, A. (1990), Subset Selection in Regression, London: Chapman and Hall.
Roecker, E. (1991), "Prediction Error and Its Estimation for Subset-Selected Models," Technometrics, 33, 459-468.
Shao, J. (1993), "Linear Model Selection via Cross-validation," Journal of the American Statistical Association, 88, 486-494.
Smith, G., and Campbell, F. (1980), "A Critique of Some Ridge Regression Methods" (with discussion), Journal of the American Statistical Association, 75, 74-103.
Tibshirani, R. (1994), "Regression Shrinkage and Selection via the Lasso," Technical Report 9401, University of Toronto, Dept. of Statistics.
Zhang, P. (1992), "Model Selection via Multifold Cross-validation," Technical Report 257, University of California, Berkeley, Statistics Dept.
