Better Subset Regression Using the Nonnegative Garrote

Leo Breiman
Statistics Department, University of California, Berkeley, Berkeley, CA 94720

Technometrics, 37:4, 373-384 (1995)
A new method, called the nonnegative (nn) garrote, is proposed for doing subset regression. It both
shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error
than ordinary subset selection. It is also compared to ridge regression. If the regression equations
generated by a procedure do not change drastically with small changes in the data, the procedure is
called stable. Subset selection is unstable, ridge is very stable, and the nn-garrote is intermediate.
Simulation results illustrate the effects of instability on prediction error.
Subset regression is handicapped by its relative lack of accuracy and instability. Subset regression either zeroes a coefficient, if it is not in the selected subsets, or inflates it. Ridge regression gains its accuracy by selective shrinking. Methods that select subsets, are stable, and shrink are needed. Here is one: Let {β̂_k} be the original OLS estimates. Take {c_k} to minimize

    \sum_n \Big( y_n - \sum_k c_k \hat\beta_k x_{kn} \Big)^2

under the constraints

    c_k \ge 0, \qquad \sum_k c_k \le s.

The β̃_k(s) = c_k β̂_k are the new predictor coefficients. As the garrote is drawn tighter by decreasing s, more of the {c_k} become zero and the remaining nonzero β̃_k(s) are shrunken. This procedure is called the nonnegative (nn) garrote.
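As a concrete illustration, here is a minimal computational sketch of the garrote fit just defined. It is not the paper's code; the function name nn_garrote, the use of a general-purpose SLSQP solver, and the starting point are illustrative assumptions (the same constrained problem could be solved with a dedicated nonnegative least squares routine).

    # Minimal sketch (not the paper's implementation) of the nn-garrote:
    # fit shrink factors c_k by least squares under c_k >= 0 and sum_k c_k <= s.
    import numpy as np
    from scipy.optimize import minimize

    def nn_garrote(X, y, s):
        """Return the shrunken coefficients c_k * beta_hat_k for garrote parameter s."""
        n, p = X.shape
        beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # original OLS estimates
        Z = X * beta_ols                                   # column k is beta_hat_k * x_k
        rss = lambda c: np.sum((y - Z @ c) ** 2)           # residual sum of squares in c
        res = minimize(rss, x0=np.full(p, s / p), method="SLSQP",
                       bounds=[(0.0, None)] * p,           # c_k >= 0
                       constraints=[{"type": "ineq",       # s - sum_k c_k >= 0
                                     "fun": lambda c: s - np.sum(c)}])
        return res.x * beta_ols                            # garrote coefficients

Decreasing s in repeated calls drives more of the fitted c_k to zero and shrinks the rest, reproducing the behavior described above.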
Useful analytical results are not available when the number of variables is comparable to the sample size. In this area empirical results, good heuristics, and simulations are the only general tools available. Properly used, they can give valuable insights. For instance, the concept of stability was nurtured by the simulation results reported in Section 5 and previous work using simulations to study subset selection.

Sorting out how to reduce complexity and prediction error is a complicated problem. There are few relevant studies in the statistical literature. The book by Miller (1990) summarizes work on variable subset selection and gives an extensive bibliography, but it is primarily concerned with low-dimensional issues. The work in this article came mainly out of a combination of the ideas from Breiman (1993), who used nonnegativity and sum constraints in the context of combining regressions, and the previous explorations of subset selection of Breiman (1992) and Breiman and Spector (1992).

If the data are generated by the mechanism y_n = μ(x_n) + ε_n, where the {ε_n} are mean-zero uncorrelated with average variance σ², then PE(μ̂) = σ² + E(μ̂(X) − μ(X))². Again, the relevant error is the second component. To put model error in this situation on the same scale as in the X-controlled case, define ME(μ̂) = N · E(μ̂(X) − μ(X))², and similarly for PE(μ̂). If μ = Σ_m β_m x_m and μ̂ = Σ_m β̂_m x_m, then ME = (β̂ − β)'(N·Γ)(β̂ − β), where Γ_ij = EX_iX_j.

2.2 Estimating Error

Each regression procedure that we study produces a sequence of models {μ̂_k(x)}. Variable selection gives a sequence of subsets of variables {x_m, m ∈ [k]}, |[k]| = k, k = 1, ..., M, and μ̂_k(x) is the OLS linear regression based on {x_m, m ∈ [k]}. In nn-garrote, a sequence of s-parameter values s_1, ..., s_K is selected, and μ̂_{s_k}(x), k = 1, ..., K, is the prediction equation using parameter s_k. In ridge, a sequence of λ-parameter values λ_1, ..., λ_K is selected, and μ̂_{λ_k}(x) is the ridge regression based on λ_k.

If we knew the true value of PE(μ̂_k), the model selected would be the minimizer of PE(μ̂_k). We refer to these selections as the crystal-ball models. Otherwise, the selection process constructs an estimate P̂E(μ̂_k) and selects the μ̂_k that minimizes P̂E. The estimation methods differ for X-controlled and X-random data.

2.2.1 X-Controlled Estimates. The most widely used estimate in subset selection is Mallows' C_p. If k is the number of variables in the subset, RSS(k) is the residual sum of squares using μ̂_k, and σ̂² is the noise variance estimate derived from the full model (all variables), then the C_p estimate is P̂E(μ̂_k) = RSS(k) + 2kσ̂². But Breiman (1992) showed that this estimate is heavily biased and does poorly in model selection.

It was shown in the same article that a better estimate for PE(μ̂_k) is

    RSS(k) + 2 B_t(k),

where B_t(k) is defined as follows: Let σ̂² be the noise variance estimate, and add iid N(0, t²σ̂²), 0 < t ≤ 1, noise {ξ_n} to the {y_n}, getting {ỹ_n}. Using the data {(ỹ_n, x_n)}, repeat the subset-selection process, getting a new sequence of OLS predictors {μ̃_k, k = 1, ..., M}. Then compute t^{-2} Σ_n ξ_n μ̃_k(x_n). Repeat several times, average these quantities, and denote the result by B_t(k). The same construction applied to the garrote sequence gives B_t(s_k), and the PE estimate is P̂E(μ̂_{s_k}) = RSS(s_k) + 2 B_t(s_k).

In ridge regression, denote by μ̂_{λ_k} the predictor using parameter λ_k. The little bootstrap estimate is RSS(λ_k) + 2 B_t(λ_k), where B_t(λ_k) is computed just as in subset selection and nn-garrote. It was shown by Breiman (1992) that, for subset selection, the bias of the little bootstrap estimate is small for small t. The same proof holds, almost word for word, for the nn-garrote and ridge. But what happens in subset selection is that as t ↓ 0, the variance of B_t increases rapidly, and B_t has no sensible limiting value. Experiments by Breiman (1992) indicated that the best range for t is [.6, .8] and that averaging over 25 repetitions to form B_t is usually sufficient.

On the other hand, in ridge regression the variance of B_t does not increase appreciably as t ↓ 0, and taking this limit results in the more nearly unbiased estimate

    \widehat{PE}(\hat\mu_{\lambda_k}) = RSS(\lambda_k) + 2\hat\sigma^2 \, \mathrm{tr}\big( X'X (X'X + \lambda_k I)^{-1} \big).    (2.1)

This turns out to be an excellent PE estimate that selects regression equations μ̂_{λ_k} with PE(μ̂_{λ_k}) close to min_k PE(μ̂_{λ_k}). The estimate (2.1) was proposed on other grounds by Mallows (1973). See also Hastie and Tibshirani (1990).

The situation in nn-garrote is intermediate between subset selection and ridge. The variance of B_t increases as t gets small, but a finite variance limit exists. It does not perform as well as using t in the range [.6, .8], however. Therefore, our preferred PE estimates for subset selection and nn-garrote use t ∈ [.6, .8], with (2.1) for the ridge PE estimate. The behavior of B_t for small t is a reflection of the stability of the regression procedures used. This was explored further by Breiman (1994).
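The little bootstrap quantity B_t is simple to compute. The sketch below is an illustration rather than the paper's code: fit_predict stands in for whichever procedure (subset of size k, garrote parameter s_k, or ridge λ_k) is being evaluated, and it is assumed to refit on the noisy responses and return fitted values at the design points.

    # Hedged sketch of the little bootstrap quantity B_t described above.
    import numpy as np

    def little_bootstrap(X, y, fit_predict, sigma2_hat, t=0.7, reps=25, seed=0):
        """Average over reps of t^-2 * sum_n xi_n * mu_tilde(x_n)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        vals = []
        for _ in range(reps):
            xi = rng.normal(0.0, t * np.sqrt(sigma2_hat), size=n)  # N(0, t^2 sigma_hat^2)
            mu_tilde = fit_predict(X, y + xi)     # refit the procedure on the noisy data
            vals.append(np.sum(xi * mu_tilde) / t**2)
        return float(np.mean(vals))

    # The PE estimate for a model with residual sum of squares rss is then
    #     rss + 2 * little_bootstrap(X, y, fit_predict, sigma2_hat, t=0.7)

The defaults t = 0.7 and reps = 25 mirror the ranges reported above but are otherwise arbitrary choices.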
2.2.2 X-Random Estimates. For subset regressions {μ̂_k} in the X-random situation, the most frequently used estimate of PE is cross-validation.
In ridge regression, for each λ remove the nth case (y_n, x_n) from the data and recompute μ̂_λ(x), getting μ̂_λ^{(n)}(x). Then the estimate is

    \widehat{PE}(\lambda) = \sum_n \big( y_n - \hat\mu_{\lambda}^{(n)}(x_n) \big)^2.

This is the leave-one-out CV estimate. If r_n(λ) = y_n − μ̂_λ(x_n) and h_n(λ) = x_n'(X'X + λI)^{-1} x_n, then

    \widehat{PE}(\lambda) = \sum_n \big( r_n(\lambda) / (1 - h_n(\lambda)) \big)^2.

Usually, h_n(λ) ≃ h̄(λ) is a good approximation, where h̄(λ) = tr(X'X(X'X + λI)^{-1})/N. With this approximation,

    \widehat{PE}(\lambda) = RSS(\lambda) / (1 - \bar h(\lambda))^2.    (2.2)

This estimate was first derived by Golub, Heath, and Wahba (1979) and is called the GCV (generalized cross-validation) estimate of PE. Its accuracy is confirmed in the simulation in Section 7.

Breiman and Spector (1992) found that the "infinitesimal" version of CV, that is, leave-one-out, gave poorer results in subset selection than five- or tenfold CV [for theoretical work on this issue, see Shao (1993) and Zhang (1992)]. But leave-one-out works well in ridge regression. Simulation results show that tenfold CV is slightly better for nn-garrote than leave-one-out. This again reflects the relative stabilities of the three procedures.
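A direct transcription of the GCV formula (2.2) is given below. It is a sketch under the assumption of a fixed design matrix X with no separate intercept handling, and the function name is illustrative.

    # Minimal sketch of the GCV estimate (2.2) for ridge regression.
    import numpy as np

    def ridge_gcv(X, y, lam):
        """Return (GCV estimate of PE, ridge coefficient vector) for penalty lam."""
        n, p = X.shape
        XtX = X.T @ X
        beta = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)    # ridge coefficients
        rss = np.sum((y - X @ beta) ** 2)                         # RSS(lambda)
        h_bar = np.trace(XtX @ np.linalg.inv(XtX + lam * np.eye(p))) / n
        return rss / (1.0 - h_bar) ** 2, beta

Scanning a grid of λ values and keeping the minimizer of the first return value is the selection rule implied by (2.2).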
3. TWO EXAMPLES

The use of the nn-garrote is illustrated in two well-known data sets. One is X-controlled data, and the other I put into the X-random context.

... to 25% of the OLS values. The sum of the coefficients in the garrote equation (3.2) is a bit smaller than those in (3.1), but the major effect is the redistribution of emphasis on the three variables included.

3.2 Ozone Data

The ozone data were also used by Friedman and Silverman (1989), Hastie and Tibshirani (1990), and Cook (1993). It consists of daily ozone and meteorological data for the Los Angeles Basin in 1976. There are 330 cases with no missing data. The dependent variable is ozone. There are eight meteorological predictor variables:

x1: 500 mb height
x2: wind speed
x3: humidity
x4: surface temperature
x5: inversion height
x6: pressure gradient
x7: inversion temperature
x8: visibility

These data are known to be nonlinear in some of the variables, so, after subtracting means, interactions and quadratic terms were added, giving 44 variables. Subset selection was done using backward deletion of variables. To get the estimates of the best subset size, garrote parameter, and prediction errors, tenfold CV was used. The tenfold CV was repeated five times using different random divisions of the data and the results averaged. Subset selection chooses the five-variable equation ...
A simplified analysis gives interesting insights into the comparative behavior of subset selection, ridge regression, and the nn-garrote regressions. The best subset of k variables consists of those x_m corresponding to the k largest |β̂_m|, so that the coefficients of a best subset regression are

    \tilde\beta_m = I(|\hat\beta_m| \ge \lambda)\,\hat\beta_m

for some λ ≥ 0, where I(·) is the indicator function. In nn-garrote, the expression

    \sum_n \Big( y_n - \sum_m c_m \hat\beta_m x_{mn} \Big)^2

is minimized under the constraints c_m ≥ 0, all m, and Σ_m c_m = s. The solution is of the form

    c_m = \Big( 1 - \frac{\lambda^2}{\hat\beta_m^2} \Big)^{+},

where λ is determined from s by the condition Σ_m c_m = s and the superscript + indicates the positive part of the expression in parentheses.

I denote the minimum loss by ME* = min_λ ME(λ). Assume that M is large and that the {β_m} are iid selections from a distribution P(dβ). Then, writing θ(λ, ·) for the shrink factor of the procedure,

    ME(\lambda) = \sum_m \big( \beta_m - \theta(\lambda, \beta_m + Z_m)(\beta_m + Z_m) \big)^2,

giving the approximation

    ME^{*} = M \min_{\lambda} E\big[ \beta - \theta(\lambda, \beta + Z)(\beta + Z) \big]^2,    (4.4)

where Z has an N(0, 1) distribution independent of β. To simplify notation, ME* will hereafter denote ME*/M. For the ridge shrink the minimization can be done analytically; with the N(0, 1) noise normalization it gives

    ME^{*} = \frac{E\beta^2}{1 + E\beta^2}.

The other minimizations are not analytically tractable but are easy to compute numerically.
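Those numerical minimizations are easy to set up by Monte Carlo. The sketch below is illustrative only (not from the paper): it takes draws of β as input, adds the N(0, 1) noise Z, and scans a user-supplied grid of λ values for each of the three shrink factors defined above.

    # Monte Carlo sketch of the minimization in (4.4); names and grids are illustrative.
    import numpy as np

    def me_star(shrink, beta_draws, lambdas, seed=0):
        """Approximate min over lambda of E[beta - theta(lambda, beta+Z)(beta+Z)]^2."""
        rng = np.random.default_rng(seed)
        bhat = beta_draws + rng.normal(0.0, 1.0, beta_draws.shape)   # beta + Z, Z ~ N(0, 1)
        return min(np.mean((beta_draws - shrink(bhat, lam) * bhat) ** 2)
                   for lam in lambdas)

    subset  = lambda b, lam: (np.abs(b) >= lam).astype(float)        # keep-or-kill
    ridge   = lambda b, lam: 1.0 / (1.0 + lam)                       # proportional shrink
    garrote = lambda b, lam: np.maximum(1.0 - lam**2 / b**2, 0.0)    # (1 - lam^2 / b^2)^+

    # e.g. me_star(garrote, beta_draws, np.linspace(0.0, 5.0, 101))

With the noise variance equal to 1, the OLS value of this quantity is 1, which matches the normalization in which the OLS ME* is 1.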
I wanted to look at the ME* values for a "revealing" family of distributions of β. It is known that ridge is "optimal" if β has an N(0, σ²) distribution. Subset selection is best if many of the coefficients are 0 and the rest are large. This led to use of the family P(dβ) = p δ(dβ) + q Q(dβ, σ), where δ(dβ) is a mass concentrated at 0 and Q(dβ, σ) is N(0, σ²). The range of p is [0, 1], and σ ∈ [0, 5].

Figure 2 plots ME* versus σ for p = 0, .3, .6, .9 for subset selection, ridge, and nn-garrote. The scaling is such that the OLS ME* is 1. Note that the ME* for nn-garrote is always lower than the subset-selection ME* and is usually lower than the ridge ME*, except at p = 0.

Another question is how many variables are included in the regressions by subset selection compared to nn-garrote. If λ_S and λ_G are the values of λ that minimize the respective model errors, then the proportions P_S and P_G of coefficients zeroed are

    P_S = P(|\beta + Z| \le \lambda_S),
    P_G = P(|\beta + Z| \le \lambda_G).
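Continuing the sketch above, draws from the family P(dβ) and the proportions zeroed by the subset and garrote rules can be computed in a few lines. The particular p, σ, grid, and sample sizes are arbitrary illustrations, not values taken from the paper's figures.

    # Draws from P(d beta) = p * delta_0 + q * N(0, sigma^2), and the proportion
    # zeroed, P(|beta + Z| <= lambda*), at the ME*-minimizing lambda of a shrink rule.
    import numpy as np

    def draw_beta(p, sigma, n=200_000, seed=1):
        rng = np.random.default_rng(seed)
        return np.where(rng.random(n) < p, 0.0, rng.normal(0.0, sigma, n))

    def prop_zeroed(shrink, beta_draws, lambdas, seed=0):
        rng = np.random.default_rng(seed)
        bhat = beta_draws + rng.normal(0.0, 1.0, beta_draws.shape)
        losses = [np.mean((beta_draws - shrink(bhat, lam) * bhat) ** 2) for lam in lambdas]
        lam_star = lambdas[int(np.argmin(losses))]        # lambda minimizing ME*
        return float(np.mean(np.abs(bhat) <= lam_star))   # fraction of coefficients zeroed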
Figure 2. ME* Versus Sigma (panels for p = 0, .3, .6, .9): Garrote; Subset; Ridge.

Figure 3. Proportion Zeroed by Procedure Versus Proportion Zero in Distribution (panels for σ = 1.0, 1.5, 3.0): Garrote; Subset.
In regard to simplicity (that is, how many variables are included in the regression), Figure 3 shows that nn-garrote is comparable to subset selection. Subset selection has a discontinuity at σ = 1. For σ < 1, it deletes all variables and P_S = 1. For σ > 1, it settles down to the behavior shown in the σ = 1.5 and 3.0 graphs.

5. SIMULATION RESULTS

Because analytic results are difficult to come by in this area, the major proving ground is testing on simulated data.
5.1 Simulation Structure

I did two simulations, one with 20 variables and 40 cases in the X-controlled case and the other with 40 variables and 80 cases in the X-random case. The major purpose was to compare the accuracies of subset selection, nn-garrote, and ridge regression. The secondary purpose was to study the effect of instability on prediction error.

Table 1. Results of Simulation Consisting of Five Runs, 250 Iterations Each

Cluster radius    #nonzero coeff. (X-controlled)    #nonzero coeff. (X-random)
1                  2                                  3
2                  6                                  9
3                 10                                 15
4                 14                                 21
5                 18                                 27

The X-random runs had a similar structure, using backward deletion, s values 1, ..., 40, and λ values such that tr(X'X(X'X + λI)^{-1}) = 1, ..., 40. The ME values for subset selection and nn-garrote were estimated using tenfold CV. The ME values for ridge regression were estimated using GCV (2.2). The true ME was computed as (β̂ − β)'(N·Γ)(β̂ − β).
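For reference, the true model error used to score a fitted coefficient vector can be computed directly once Γ is known. The helper below is an illustrative sketch; in a simulation Γ would be the known population moment matrix, and falling back to the empirical X'X/N, as done here for convenience, is an assumption.

    # Sketch of the true model error ME = (beta_hat - beta)' (N * Gamma) (beta_hat - beta).
    import numpy as np

    def true_me(beta_hat, beta, X, gamma=None):
        """Model error of beta_hat against the true beta, on the scale used above."""
        n = X.shape[0]
        if gamma is None:
            gamma = (X.T @ X) / n      # fallback: empirical Gamma_ij ~ E X_i X_j
        d = beta_hat - beta
        return float(d @ (n * gamma) @ d)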
Figure 4. ME in X-Controlled Simulation Versus Cluster Radius: Subset; Garrote; Ridge. Panels: (a) ME's of predictors selected using PE estimates; (b) ME's of predictors selected using crystal ball; (c) ME differences, fallible minus crystal ball.

Figure 5. ME in X-Random Simulation Versus Cluster Radius: Subset; Garrote; Ridge. Panels (a)-(c) as in Figure 4.
Figure 4(b) gives the crystal-ball ME's for the X-controlled simulation. Figure 5(b) gives the analogous plot for the X-random simulation. Figures 4(c) and 5(c) show how much larger the fallible-knowledge ME is than the crystal-ball ME. Table 3 gives the estimated SE's for the differences of the crystal-ball ME's plotted in Figures 4(b) and 5(b) (averaged over the cluster radii).

The differences between the minimum true ME's for the three methods are smaller than the ME differences using the predictors selected by the ME estimates. The implications are interesting. The crystal-ball subset-selection predictors are close (in ME) to the crystal-ball nn-garrote predictors. The problem is that it is difficult to find the minimum-ME subset-selection model. On the other hand, the crystal-ball ridge predictors are not as good as the other two, but the ridge ME estimates do better selection.
Table 2. Estimated SE's for ME Differences

Difference        X-controlled    X-random
Subset-garrote     .3              .9
Subset-ridge       .4             1.2
Garrote-ridge      .3              .8
Better methods to select low ME subset regressions ...

Figure 7. Average Number of Variables Used in Predictors Selected Versus Cluster Radius: Subset; Garrote. Panels: (a) X-controlled; (b) X-random.

Figure 8. ME in X-Controlled Simulation, Leaps Versus Deletion: Best Subsets; Deletion.

Figure 9. RSS Versus ME for 100 "Best" 10-Variable Subset Regressions.
A natural question is whether the subsets produced by nn-garrote as s decreases are nested. The answer is "almost always, but not always." For instance, in the 1,250 iterations of nn-garrote in the 20-variable X-controlled simulation, 17 were not nested. Of the 1,250 in the 40-variable X-random simulation, 68 were not nested.

5.2.7 RSS Versus ME Instability in Subset Selection. To illustrate the instability of subset selection, I generated an X-controlled data set with cluster radius 3 and p = .7. Leaps was used to find the 100 subset regressions based on 10 variables having lowest RSS. For each of these regressions, the true ME was computed. Figure 9 is a graph of RSS versus ME for the 100 equations. The 100 lowest RSS values are tightly packed, but the ME spreads over a wide range. Shifting from one of these models to another would result in only a small RSS difference but could give a large ME change.

The nn-garrote results may have profitable application to tree-structured classification and regression. The present method for finding the "best" tree resembles stepwise variable deletion using V-fold CV. Specifically, a large tree is grown and pruned upward using V-fold CV to estimate the optimal amount of pruning. I am experimenting with the use of a procedure analogous to nn-garrote to replace pruning. The results, to date, have been as encouraging as in the linear regression case.

Another possible application is to the selection of more accurate autoregressive models in time series. Picking the order of the autoregressive scheme is similar to estimating the best subset regression. The nn-garrote methodology should carry over to this area and may provide increased prediction accuracy.

The ideas used in the nn-garrote can be applied to get other regression shrinkage schemes. For instance, let {β̂_k} be the original OLS estimates and take {c_k} to minimize the same residual sum of squares under a different constraint on the {c_k}.
REFERENCES

Breiman, L., and Friedman, J. (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation" (with discussion), Journal of the American Statistical Association, 80, 580-619.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression: The X-Random Case," International Statistical Review, 60, 291-319.
Cook, R. (1993), "Exploring Partial Residual Plots," Technometrics, 35, 351-362.
Daniel, C., and Wood, F. (1980), Fitting Equations to Data, New York: John Wiley.
Frank, I., and Friedman, J. (1993), "A Statistical View of Some Chemometrics Regression Tools" (with discussion), Technometrics, 35, 109-148.
Friedman, J., and Silverman, B. (1989), "Flexible Parsimonious Smoothing and Additive Modeling" (with discussion), Technometrics, 31, 3-40.
Furnival, G., and Wilson, R. (1974), "Regression by Leaps and Bounds," Technometrics, 16, 499-511.
Golub, G., Heath, M., and Wahba, G. (1979), "Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter," Technometrics, 21, 215-224.
Gruber, M. (1990), Regression Estimators: A Comparative Study, Boston: Academic Press.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman and Hall.
Hoerl, R., Schuenemeyer, J., and Hoerl, A. (1986), "A Simulation of Biased Estimation and Subset Selection Regression Techniques," Technometrics, 28, 369-380.
Lawson, C., and Hanson, R. (1974), Solving Least Squares Problems, Englewood Cliffs, NJ: Prentice-Hall.
Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
Miller, A. (1990), Subset Selection in Regression, London: Chapman and Hall.
Roecker, E. (1991), "Prediction Error and Its Estimation for Subset-Selected Models," Technometrics, 33, 459-468.
Shao, J. (1993), "Linear Model Selection via Cross-validation," Journal of the American Statistical Association, 88, 486-494.
Smith, G., and Campbell, F. (1980), "A Critique of Some Ridge Regression Methods" (with discussion), Journal of the American Statistical Association, 75, 74-103.
Tibshirani, R. (1994), "Regression Shrinkage and Selection via the Lasso," Technical Report 9401, Dept. of Statistics, University of Toronto.
Zhang, P. (1992), "Model Selection via Multifold Cross-validation," Technical Report 257, Dept. of Statistics, University of California, Berkeley.