An Introduction to Ensemble Methods for Data Analysis

Author: Berk, Richard
Publication Date: 2005-03-27
Permalink: https://escholarship.org/uc/item/059919k4
Abstract

This paper provides an introduction to ensemble statistical procedures as a special case of algorithmic methods. The discussion begins with classification and regression trees (CART) as a didactic device to introduce many of the key issues. Following the material on CART is a consideration of cross-validation, bagging, random forests, and boosting. Major points are illustrated with analyses of real data.
1 Introduction
There are a growing number of new statistical procedures that Leo Breiman (2001b) has called “algorithmic.” Coming from work primarily in statistics, applied mathematics, and computer science, these techniques are sometimes linked to “data mining,” “machine learning,” and “statistical learning.”
With algorithmic methods, there is no statistical model in the usual sense; no effort is made to represent how the data were generated. And no apologies are offered for the absence of a model. There is a practical data analysis problem to solve that is attacked directly with procedures designed specifically for that purpose. If, for example, the goal is to determine which prison inmates are likely to engage in some form of serious misconduct while in prison (Berk and Baek, 2003), there is a classification problem to be addressed. Should the goal be to minimize some function of classification errors, procedures are applied with that minimization task paramount. There is no need to represent how the data were generated if it is possible to accurately classify inmates by other means.

∗ Support for work on this paper was provided by the National Science Foundation (SES-0437169), “Ensemble Methods for Data Analysis in the Behavioral, Social and Economic Sciences.” The support is gratefully acknowledged. Matthias Schonlau, Greg Ridgeway, and two reviewers provided a number of helpful suggestions after reading an earlier version of this paper. Their assistance is also gratefully acknowledged.
Algorithmic methods have been applied to a wide variety of data analysis problems, particularly in the physical and biomedical sciences (Sutton and Barto, 1999; Witten and Frank, 2000; Cristianini and Shawe-Taylor, 2000; Breiman, 2001b; Hastie et al., 2001; Dasu and Johnson, 2003): predicting next-day ozone levels, finding fraudulent credit card telephone calls among many millions of transactions, and determining the predictors of survival for hepatitis patients. There is growing evidence that a number of algorithmic methods perform better than conventional statistical procedures for the tasks algorithmic methods are designed to undertake (Breiman, 2001a; Hastie et al., 2001). In addition, they raise important issues about statistical modeling and the nature of randomness more generally.
Among the great variety of algorithmic approaches, there is a group that
depends on combining the fitted values from a number of fitting attempts;
fitted values are said to be “combined” or “bundled” (Hothorn, 2003). For
example, one might combine the fitted values from several regression analyses
that differ in how nuisance parameters are handled. Another example would
be to average the fitted values from nonparametric regression applied to a
large number of single-subject experimental trials (Faraway, 2004).
The term “ensemble methods” is commonly reserved for bundled fits pro-
duced by a stochastic algorithm, the output of which is some combination
of a large number of passes through the data. Such methods are loosely re-
lated to iterative procedures on the one hand and to bootstrap procedures on
the other. An example is the average of a large number of kernel smoothes
of a given variable, each based on a bootstrap sample from the same data
set. The idea is that a “weak” procedure can be strengthened if given an opportunity to operate “by committee.” Ensemble methods often perform extremely well and in many cases can be shown to have desirable statistical properties (Breiman, 2001a; 2001c; Buehlmann and Yu, 2002; Mannor et al., 2002; Grandvalet, 2004).
This paper provides a brief review of ensemble methods likely to be espe-
cially useful for the analysis of social science data. Illustrations and software
are considered. The purpose of the discussion is to introduce ensemble meth-
ods, although some deeper conceptual issues will be briefly considered. For
didactic purposes, techniques for categorical response variables will be em-
phasized. The step to equal interval response variables is then easily made.
Figure 1 reveals that cases with z-values of 3 or greater and x-values of -4
or less are always “B.” Cases with z-values less than 3 and x-values greater
than 6 are always “A.”
[Figure 1: Recursive Partitioning of a Binary Outcome (where Y = A or B and the predictors are Z and X). The scatterplot of A’s and B’s is partitioned at Z = 3 and at X = -4 and X = 6.]
[Figure: a simple tree diagram in which the root node is split on x > c1, with one branch leading to Terminal Node 1 and the other to an internal node that is then split on z > c2.]
3. A way is needed to influence the size of the tree so that only stable
terminal nodes are constructed.
4. A way is needed to consider how “good” the tree is.
5. A way is needed to protect against overfitting.
6. A way is needed to interpret and communicate the results.
More formally, for a binary response the impurity of a node τ is defined as I(τ) = φ(p), where p is the proportion of cases in τ for which y = 1, and where φ ≥ 0, φ(p) = φ(1 − p), and φ(0) = φ(1) < φ(p); impurity is non-negative, and symmetrical, with a minimum when τ contains all 0’s or all 1’s.
There are three popular options for φ: the Bayes error, φ(p) = min(p, 1 − p); the entropy function, φ(p) = −p log(p) − (1 − p) log(1 − p); and the Gini index, φ(p) = p(1 − p).
Table 1: Counts for a binary split on x at the value c

                     Failure   Success   Total
Left Node: x ≤ c     n11       n12       n1.
Right Node: x > c    n21       n22       n2.
Total                n.1       n.2       n..
The change in impurity produced by split s at node τ is then

∆I(s, τ) = i(τ) − p(τL) i(τL) − p(τR) i(τR),

where i(τ) is the value of the parent node’s impurity, p(τR) is the probability of being in the right daughter, p(τL) is the probability of being in the left daughter, and the rest is defined as before. The two probabilities can be estimated from a table such as Table 1; they are simply the marginal proportions n1./n.. and n2./n.. .3 CART computes ∆I(s, τ) for all splits on each variable and chooses the variable and split with the largest value. The same operation is undertaken for all subsequent nodes.

3. ∆I(s, τ) is effectively the reduction in the deviance, and thus there is a clear link to the generalized linear model.
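To make the bookkeeping concrete, here is a minimal sketch (not from the paper) that computes ∆I(s, τ) from the four counts in Table 1, using the Gini index for φ; the function name and the example counts are hypothetical.

delta_impurity <- function(n11, n12, n21, n22) {
  gini <- function(p) p * (1 - p)                 # one choice of phi
  n1 <- n11 + n12; n2 <- n21 + n22; n <- n1 + n2
  p_parent <- (n12 + n22) / n                     # proportion of successes in the parent node
  p_left   <- n12 / n1                            # proportion of successes in the left daughter
  p_right  <- n22 / n2                            # proportion of successes in the right daughter
  gini(p_parent) - (n1 / n) * gini(p_left) - (n2 / n) * gini(p_right)
}
delta_impurity(40, 10, 15, 35)                    # hypothetical counts for one candidate split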
In principle, the CART algorithm can keep partitioning until there is
only one case in each node. Such a tree is called “saturated.” However, well
before a tree is saturated, there will usually be far too many terminal nodes
to interpret, and the number of cases in each will be quite small. One option is to instruct CART not to construct terminal nodes with samples smaller than some specified value. A second option will be considered shortly.
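For readers who want to try this, a hedged sketch of growing such a tree in R with the rpart implementation of CART; the data frame inmates, the response misconduct, and the predictor names are hypothetical stand-ins for the variables described below.

library(rpart)
fit <- rpart(misconduct ~ gang + term + age.arrest + age.rec,
             data = inmates, method = "class",
             control = rpart.control(minbucket = 100))  # no terminal node smaller than 100 cases
plot(fit); text(fit)                                     # draw and label the tree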
Figure 3 shows a classification tree for an analysis of which prison inmates
engage in some form of reportable misconduct while in prison. A minimum
node sample size of 100 was imposed for reasons to be explained shortly. The
variables in Figure 3, selected by CART from a larger set of 12 predictors,
are defined as follows.
specified such a model in logistic regression and when the more likely all-
main-effect model was applied to these data, the fit was dramatically worse
and led to somewhat different conclusions.4
However, CART can sometimes get things very wrong. If the appropriate response function is linear and additive, CART will do no better than a conventional parametric analysis and most likely a lot worse. Fortunately, this is rarely a problem for the ensemble procedures to be considered shortly.
More formally, the overall quality of a tree T can be written as

R(T) = Σ_{τ ∈ T̃} p(τ) r(τ),

where T̃ is the set of terminal nodes of T, p(τ) is the probability that a case will fall in terminal node τ, and r(τ) is a measure of the quality of that node, which has some parallels to the error sum of squares in linear regression. More details will be provided shortly. The purpose of pruning is to select the best subtree T∗, starting with the saturated tree T0, so that R(T) is minimized. To get that job done, one needs r(τ). That, in turn, will depend on two factors: classification errors and the complexity of the tree. And there will be a tradeoff between the two.
[Figure 3: CART classification tree for serious 115s (reportable misconduct). The splits shown involve gang membership (gang.f = a), term length (term < 3.5), and age at reception (agerec.f = c,d); each node is labeled N or Y and shows the corresponding class counts (7571/2040 at the root node).]
in the data analysis task we want to accomplish. For this, some function of
classification errors is often exploited.5
Suppose the problem is whether a beach is classified as closed because of
pollution. Closed is coded 1 and not closed is coded 0. We then examine each
terminal node and “predict” all of the cases to be 1’s if the majority of cases
in that node are 1. This is an application of a “majority vote” rule. If the
majority of cases in that node are 0, we “predict” that all of the cases will be
0’s. This too is an application of a majority vote rule. For example, if there
are 17 cases in a given terminal node and 10 are 1’s, we predict 1 (closed) for
all of the cases in that node. We then necessarily have 7 misclassifications.
If 5 cases in that node are 1’s, we “predict” that all of the cases are 0, and
then there are 5 misclassifications.
There are two kinds of classification errors: false positives and false negatives. Here, a false positive would occur if the model classifies a given beach as closed when in fact it is not. A false negative would occur if the
model classifies a given beach as open when it is not. Clearly, one would like
both false positives and false negatives to be as small as possible.
However, the consequences of the two kinds of errors may not be the
same. Are the social and health costs the same for telling the public that a
beach is closed when it is not as for telling the public that a beach is open
when it is not? Some might argue that the costs of falsely claiming a beach
is open are higher because people might travel to that beach and then be
turned away and/or not know the beach is polluted and swim there. The
general point is that one needs to consider building in the costs of both kinds
of errors (for an example, see Berk et al., 2005a).
In practice, all one needs is the relative costs of the two kinds of errors,
such as 5 to 1. But with that in hand, one can for each node determine the
costs of predicting either y = 1 or y = 0.
Suppose that there are 7 cases in a terminal node, 3 coded as 1 (the beach
is closed) and 4 are coded as 0 (the beach is open). For now, the costs of
both false positives and false negatives are 1. Then, the “expected cost” of
a false positive is 1 × 4/7 = .57 while the expected cost of a false negative is
1 × 3/7 = .43. (4/7 is the probability — assuming random sampling — of a
false positive and 3/7 is the probability of a false negative.) For these costs (and any costs that are the same for false positives and false negatives), we are better off predicting for all of the cases in this node that the beach is open; the expected cost (of false negatives) is lower: .43 compared to .57. The value of .43 is often called the within-node misclassification cost or the conditional (on τ) misclassification cost.

5. We will see later that boosting goes directly for a minimization of classification error. There is no path through a measure of impurity. One cost is that links to more conventional statistical modeling are lost, and with them a number of desirable properties of estimators. This has led to a concerted effort to reformulate boosting within a statistical framework (Friedman et al., 2000; Friedman, 2001; Friedman, 2002).
But suppose now that the costs of a false negative are 3 times the cost
of a false positive. So, the costs would be 1 unit for a false positive and
3 units for a false negative. Then, the expected cost of a false positive is
1 × 4/7 = .57, while the expected cost of a false negative is 3 × 3/7 = 1.29.
Clearly, it is better for this node to predict that the beach will be closed: .57
compared to 1.29.
But there is more. One does not want to consider just the within-node costs, but also the costs aggregated over all of the terminal nodes. So
one needs to multiply the conditional (within node) misclassification cost
for each terminal node by the probability of a case falling in that node. For
example, if there are 500 cases in the root node and 50 cases in terminal node
τ , that probability is .10. If the expected cost of that node is, say, 1.29, the
unconditional cost is .10×1.29 = .129. To arrive at the costs over all terminal
nodes, and hence for that tree, one simply adds all of the unconditional costs
for each terminal node.
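A minimal sketch (not from the paper) of this cost bookkeeping for the hypothetical beach node discussed above, with a false negative costing three times a false positive:

n1 <- 3; n0 <- 4; n <- n1 + n0          # 3 closed (1) and 4 open (0) beaches in the node
cost_fp <- 1; cost_fn <- 3
cost_if_closed <- cost_fp * n0 / n      # expected cost of predicting "closed": 1 * 4/7 = 0.57
cost_if_open   <- cost_fn * n1 / n      # expected cost of predicting "open":   3 * 3/7 = 1.29
node_cost <- min(cost_if_closed, cost_if_open)   # within-node (conditional) cost
p_node <- 50 / 500                      # probability of a case landing in this node
p_node * node_cost                      # this node's unconditional contribution to the tree's cost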
Notice that all of the data have been used to build the tree, and then these same data are “dropped” down that tree to get the overall costs of that tree. The same data used to build the model are used to evaluate it. And given all of the data manipulations, overfitting is a real problem. As a result, expected costs are usually too low. The notation Rs(τ) is meant to make clear that the costs are probably too optimistic. The “s” stands for “resubstitution.” Measures of model quality are derived when the data used to build the model are “resubstituted” back into the model. Overfitting will be addressed at length shortly.
way to specify costs when the procedure being used, or the software, does
not allow for specification of the costs directly.
The basic logic is this: if the cost of a false positive is 5 times greater
than the cost of a false negative, it is much the same as saying that for every
false negative there are 5 false positives. Costs translate into numbers of
cases. The same kind of logic applies when using the prior probabilities.
While the data may indicate that there are 3 successes for each failure, if a
false success is twice as important as a false failure, it is as if there were 6
successes for each failure.7 In short, one can get to the same place by using
prior probabilities or costs. For the formal details see Breiman et al., (1984:
Section 4.4).
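In R’s rpart, both routes are available through the parms argument; a hedged sketch with hypothetical variables for the beach example (the factor-level ordering determines which cell of the loss matrix applies, so the values here are illustrative only):

library(rpart)
# Route 1: a loss matrix making one kind of error 5 times as costly as the other
fit1 <- rpart(closed ~ rainfall + temperature, data = beaches, method = "class",
              parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
# Route 2: the same idea expressed through the prior distribution of the response
fit2 <- rpart(closed ~ rainfall + temperature, data = beaches, method = "class",
              parms = list(prior = c(0.40, 0.60)))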
and Singer, 1999, section 4.2.3). Still, in some cases, one or more of the
sample sizes for terminal nodes may be too small. The results, then, can
be unstable. One option is to specify through trial and error a value of α
for pruning that leads to terminal nodes each with a sufficient number of
cases. Alternatively, it is often possible to set the minimum sample size that
is acceptable as another tuning parameter.8
One useful way to think about a fitting function is as a linear combination of basis functions hm(X),

f(X) = β1 h1(X) + β2 h2(X) + · · · + βM hM(X),

where βm is the weight given to the mth term. Note that once the basis functions hm are defined, the model is linear in these transformations of X. For example, a fourth-order polynomial would have four basis functions of X (here a single variable): X, X², X³, X⁴. These basis functions in X could be entered into a conventional linear regression model.11
8. The use in CART of a penalty for complexity can be placed in a broader framework of imposing complexity penalties on a wide variety of fitting functions. For example, ridge regression can be usefully viewed in this context. A very accessible discussion can be found in Hastie et al., 2001, sections 2.7 and 3.4. Complexity penalties are also closely related to the variance-bias tradeoff that will be considered later in this paper.

9. If the new data are from a different population, the fit is likely to be even worse.

10. A “training sample” is the data used to build the model.

11. See Hastie et al. (2001, section 5.4.1) for a related discussion of “effective degrees of freedom.”
The basis function formulation can be applied to CART at three points
in the fitting process. First, for any given predictor being examined for its
optimal split, overfitting will increase with the number of splits possible. In
effect, a greater number of basis functions are being screened (where a given
split is a basis function).
Second, for each split, CART considers all predictors as inputs. An optimal split is chosen over all possible splits of all possible predictors. This
defines the optimal basis function for that stage. Hence within each stage,
overfitting increases as the number of candidate predictors increases.
Then for each new stage, a new optimal basis function is chosen and
applied. Consequently, overfitting increases with the number of stages, which
for CART means the number of optimal basis functions, typically represented
by the number of nodes in the tree.
Once a node is defined, it is unchanged by splits farther down the tree.
New basis functions are just introduced along with the old ones. There-
fore, CART is a forward stagewise additive model that can produce overfitting
within predictors, within stages and over stages.
The overfitting can be misleading in at least two ways. First, measures meant to reflect how well the model fits the data are likely to be too optimistic. Thus, for example, the number of classification errors may be too
small. Second, the model itself may have a structure that will not generalize
well. For example, one or more predictors may be included in a tree that
really do not belong.
Ideally, there would be two random samples from the same population.
One would be a training data set and one would be a testing data set. A tree
would be built using the training data set and some measure of fit obtained.
A simple measure might be the fraction of cases classified correctly. A more
complicated measure might take costs into account. Then with the tree
fixed, cases from the testing data set would be “dropped down” the tree and
the fit computed again. It is almost certain that the fit would degrade, and
how much is a measure of overfitting. The fit measure from the test data set
is a better indicator of how accurate the classification process really is.
Often there is only a single data set. An alternative strategy is to split
the data up into several randomly chosen, non-overlapping parts. Ten such
subsets are common. The tree is built on nine of the splits and evaluated
with the remaining one. So, if there are 1000 observations, one would build
the tree on 900 randomly selected observations and evaluate the tree using
the other 100 observations. This would be done ten times, once for each
non-overlapping split, and for each a measure of fit computed. The proper
measure of fit is the average of the fit measure over the ten splits. The
procedure is called 10-fold cross-validation and is routinely available in many
CART programs. Extensions on this basic idea using bootstrap samples are
also available (Efron and Tibshirani, 1993, Chapter 17).
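A hedged, minimal sketch of the 10-fold procedure in R; the data frame dat and the response y are hypothetical, and rpart’s own xval argument and printcp output could be used instead.

library(rpart)
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(dat)))      # random, non-overlapping folds
cv_error <- sapply(1:10, function(k) {
  fit  <- rpart(y ~ ., data = dat[folds != k, ], method = "class")
  pred <- predict(fit, newdata = dat[folds == k, ], type = "class")
  mean(pred != dat$y[folds == k])                        # misclassification rate in fold k
})
mean(cv_error)                                           # the 10-fold cross-validated error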
Unfortunately, cross-validation neglects that the model itself is potentially misleading. A method is needed to correct the model for overfitting, not just its goodness-of-fit measure. For this we turn to “ensemble methods.”
3 Ensemble methods
The idea of averaging over ten measures of fit constructed from random, non-overlapping subsets of the data can be generalized in a very powerful way.
If averaging over measures of fit can help correct for overly optimistic fit
assessments, then “averaging” over a set of CART trees might help correct
for overly optimistic trees.
3.1 Bagging
The idea of combining fitted values from a number of fitting attempts has been suggested by several authors (LeBlanc and Tibshirani, 1996; Mojirsheibani, 1997; 1999; Mertz, 1999). In an important sense, the whole becomes
more than the sum of its parts. Perhaps the earliest procedure to exploit a
combination of “random trees” is bagging (Breiman, 1996). Bagging stands
for “bootstrap aggregation” and may be best understood initially as nothing
more than an algorithm.
Consider the following steps in a fitting algorithm with a data set having n observations and a binary response variable.

1. Take a random sample of size n with replacement from the data (i.e., a bootstrap sample).

2. Construct a classification tree from the bootstrap sample in the usual manner.

3. Repeat steps 1 and 2 a large number of times.

4. For each case in the data set, count the number of times over trees that it is classified in one category and the number of times over trees it is classified in the other category.
5. Assign each case to a category by a majority vote over the set of trees.12
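A minimal sketch (not the paper’s code) of these steps in R, with rpart as the base classifier; the data frame dat and the binary response y are hypothetical.

library(rpart)
set.seed(1)
B <- 100                                                 # number of bootstrap samples
n <- nrow(dat)
votes <- replicate(B, {
  boot <- dat[sample(n, n, replace = TRUE), ]            # step 1: a bootstrap sample of size n
  fit  <- rpart(y ~ ., data = boot, method = "class")    # step 2: a tree for that sample
  as.character(predict(fit, newdata = dat, type = "class"))
})
# Steps 4 and 5: majority vote over the B trees for each case
bagged_class <- apply(votes, 1, function(v) names(which.max(table(v))))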
that sense, the associations revealed by algorithmic methods can be seen as
descriptive. Algorithmic models are not causal models.
At the same time, there remains the very real numerical dependence on
CART as a critical part of the algorithm. Consequently, certain features of
the CART procedure need to be specified for the bagged trees to be properly
constructed. Within algorithmic models, these arguments can be seen as
“tuning parameters,” affecting how the algorithm functions. For example,
specification of a prior distribution for a binary response can be seen as a
way to compensate for highly skewed response distributions, rather than as
a belief about what the marginal distribution of the response really is.
The basic output from bagging is simply the predicted classes for each
case. Commonly there is also an estimate of the classification error and a
cross-tabulation of the classes predicted by the classes observed. The cross-
tabulation can be useful for comparing the number of false positives to the
number of false negatives. Sometimes the software stores each of the trees
as well, although they are rarely of any interest because the amount of infor-
mation is typically overwhelming. One effective implementation of bagging
is the package “ipred” available in R.15
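A hedged sketch of what a call to ipred might look like; the formula and data frame are hypothetical, and coob = TRUE requests an out-of-bag estimate of the error rate.

library(ipred)
bag <- bagging(misconduct ~ ., data = inmates, nbagg = 100, coob = TRUE)
print(bag)                              # includes the out-of-bag error estimate
head(predict(bag, newdata = inmates))   # predicted classes by majority vote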
Once one considers bagging to be an example of an algorithmic method,
the door is opened to a wide variety of other approaches. An algorithmic
procedure called “random forests” is the next step through that door.
By appropriate measures of fit, bagging can be an improvement over
CART.16 And by these same criteria, random forests can be an improve-
ment over bagging. Indeed, random forests is among the very best classifiers
invented to date (Breiman, 2001a). The improvement goes well beyond com-
pensation for overfitting and capitalizes on some new principles that are not
yet fully understood.
Consider Figure 4. It was constructed from data drawn as 4 random
samples of 15 from a bivariate Poisson distribution in which the two variables
were moderately related. For each sample, the 15 points are simply connected
by straight lines, which is nothing more than a set of linear interpolations.
Because there is no smoothing, each fitted value for each value of x is an
unbiased estimate of the relationship between x and y. Yet, one can see
that the fitted lines vary substantially from sample to sample; the variance
over fits is large. However, were one to take the average of y for each value
of x and connect those points, the fit would still be unbiased and much
smoother. Moreover, if one repeated the same procedures with 4 new random
samples, the averaged fits from the 2 sets of 5 samples would likely be far
more similar than the fits that were constructed from a single sample. The
variance estimated from a set of 4 samples will be smaller than the variance
estimated from a single sample.
Note that the same argument would apply had the response been categor-
ical and the fitted values were interpreted as probabilities. The logic works
for classification problems as well. It is just harder to see what is going on
in a graph such as Figure 4.
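A small simulation in the spirit of Figure 4 (the data-generating setup here is a hypothetical stand-in, not the one used for the figure): averaging four interpolated fits shrinks the variance of the fitted values relative to a single fit.

set.seed(1)
xgrid <- 2:12
one_fit <- function() {
  x <- sort(sample(0:15, 15, replace = TRUE))
  y <- rpois(15, lambda = 1 + 0.5 * x)          # y and x moderately related
  approx(x, y, xout = xgrid, ties = mean)$y     # linear interpolation, no smoothing
}
single   <- replicate(200, one_fit())                                        # single weak fits
averaged <- replicate(200, rowMeans(replicate(4, one_fit()), na.rm = TRUE))  # averages of 4 fits
mean(apply(single, 1, var, na.rm = TRUE))       # variance of a single fit at each x
mean(apply(averaged, 1, var, na.rm = TRUE))     # roughly a quarter the size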
One can formalize these ideas a bit (Breiman, 2001c) by recalling that
for conventional regression analysis and given set of predictor values x0 , the
variance over repeated independent samples of an observed value of the re-
sponse variable y around the fitted value fˆ(x0 ) is the sum of three parts: 1)
an irreducible error, 2) the bias in the fitted value, and 3) the variance of the
fitted value. More explicitly (Hastie et al., 2001: 197),

E[(y − fˆ(x0))² | x = x0] = σε² + [E fˆ(x0) − f(x0)]² + E[fˆ(x0) − E fˆ(x0)]².  (11)

There is nothing that can be done about σε², which results from the model’s error term ε. In general, the more complex the model fˆ, the smaller the squared bias, but the larger the variance; there is a tradeoff between the two that constrains conventional modeling. We were implicitly facing this tradeoff earlier when the impact of model complexity was addressed.

16. The relative performance of CART compared to bagged trees depends on the size and kinds of influence associated with each observation (Grandvalet, 2004), but the issues are beyond the scope of this paper.

[Figure 4: Example of a Weak Learner. Y plotted against X (both on a 0 to 15 scale) for the linearly interpolated samples described in the text.]
However, ensemble predictors can in principle sever the link between the
bias and the variance. In an ideal world, the bias would be zero, and the
variance would go to zero as the number of independent fits in the ensemble
is increased without limit. In practice, one can make the fˆ very complex in
an effort to achieve unbiasedness while shrinking the variance by averaging
over many sets of fitted values, each based on a bootstrap sample of the data.
The sampling of predictors is also very important. Working with random samples of predictors at each split increases the independence between fitted values over trees. This makes the averaging more effective. In addition,
sampling predictors implies that competition between predictors can be dra-
matically reduced. For example, if there is in CART a single predictor that
can fit very well only a few observations, it will never be chosen to define
a split when there are other predictors that can improve the fit for a large
number of observations. Sometimes this will not matter. But if the first pre-
dictor is the only variable available that can fit those few observations, those
observations will not be well characterized by the fitted values. When predic-
tors are sampled, however, the competition that highly specialized predictors
have to face can be substantially reduced. From time to time, they will be
chosen to define a split. By sampling predictors, random forests makes the
fitting function more flexible.
The goal of finding a role for highly specialized predictors is an argument
for growing very large, unpruned trees. Generally, this seems to be a wise strategy. However, large trees can lead to very unstable results when there are a substantial number of predictors that are at best weakly related to the response and correlated substantially with one another (Segal, 2003).
In effect, this becomes a problem with multicollinearity that the averaging
over trees is unlikely to overcome. In practice, therefore, it can be useful to
work with smaller trees, especially when there are a large number of weak
predictors that are associated with one another. Alternatively, one can screen
out at least some of these predictors before random forests is applied to the
data.
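A hedged sketch of a random forests run in R; the formula, data frame, and tuning values are hypothetical illustrations of the arguments discussed above, not the paper’s settings.

library(randomForest)
set.seed(1)
rf <- randomForest(misconduct ~ ., data = inmates,
                   ntree = 500,        # number of trees in the forest
                   mtry = 3,           # predictors sampled at each split
                   nodesize = 5,       # governs how large each tree is allowed to grow
                   importance = TRUE)  # store predictor importance measures
print(rf)                              # out-of-bag error rate and confusion table
varImpPlot(rf)                         # importance of each predictor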
Just like bagging, random forests leaves no tree behind to interpret. As
a result, there is no way to show how inputs are related to the output. One
response is to record the decrease in the fitting measure (e.g., Gini Index) each
time a given variable is used to define a split. The sum of these reductions
for a given tree is a measure of “importance” for that variable. One can then
average this measure over the set of trees.
Like variance partitions, however, reductions in the fitting criterion ig-
nore the forecasting ability of a model, perhaps the most important asset of
ensemble methods. Breiman (2001a) has suggested another form of random-
ization to assess the role of each predictor, which is implemented in the latest
versions of random forests in R.17
1. Grow a forest.
2. Suppose x is the predictor of interest, and it has V distinct values in
the training data. Construct V data sets as follows.
(a) For each of the V values of x, make up a new data set where x
only takes on that value, leaving the rest of the data the same.
17. The original version of random forests was written in Fortran by Leo Breiman and Adele Cutler. The R port was done by Andy Liaw and Matthew Weiner. Andy Liaw (andy liaw@merck.com) is the maintainer.
(b) For each of the V data sets, predict the response using the random forest.
(c) Average these predictions over the trees.
(d) Plot the average prediction for each x against the values of x.
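The steps just listed amount to something very close to a partial dependence plot. Under that assumption, a one-line sketch with the randomForest package, reusing the hypothetical forest rf from the earlier sketch and an assumed class label “yes”:

library(randomForest)
partialPlot(rf, pred.data = inmates, x.var = "term", which.class = "yes")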
The three rely at least implicitly on differential costs, but will not necessarily
lead to exactly the same results.
Consider the following illustration. Recall the earlier CART classification
example using data from the California Department of Corrections. Now we
make the problem more difficult by trying to forecast the very small number
of inmates (about 2.5%) who are likely to commit offenses that would likely be felonies if committed outside of prison: assault, rape, drug trafficking and
the like. The predictors are the following.
2. Age at the time of the earliest arrest in years (AgeArr) — 0-17 (30%);
18-21 (41%); 22-29 (19%); 30-35 (6%); 36 or older (4%)
Nevertheless, one can get some additional leverage on the problem of finding
the few very high risk inmates by treating the prior distribution as a tuning
parameter (For details, see Berk and Baek, 2003). Of the 18 true positives in
the test data set, 5 are correctly identified, and the number of false negatives
is from a policy point of view acceptable.
All 18 of these inmates were missed by a logistic regression model using the same predictors (Berk and Baek, 2003). CART did no better until a prior
reflecting relative costs was introduced and even then did not perform as well
as random forests.
Figure 5 shows one measure of importance for each of the predictors used. The height of each bar represents the mean reduction over trees in the random forest fitting criterion (the Gini index) when a specified predictor becomes a new splitting criterion. The pattern is somewhat similar for other measures of importance and has the didactic advantage of close links to deviance partitions familiar to social scientists. Clearly, term length, age at arrest, and age at reception into CDC are substantially driving the fit. Partial dependence plots (not shown) indicate that the signs of the relationships are just what one would expect. In short, random forests provides useful results when other methods are likely to fail.

[Figure 5: Average Reduction in Gini Criterion for Each Predictor for Very Serious Misconduct. Bars show the mean reduction in the Gini criterion (0 to 500) for Age at Arrest, Age at Rec CDC, CYA, Gang, Jail, Psych, and Term Length.]
4 Boosting
Like CART, boosting is a forward stagewise additive model (Hastie et al.,
2001: 305-306). But where CART works with smaller and smaller partitions
of the data at each stage, boosting uses the entire data set at each stage.
Boosting gets its name from its ability to take a “weak learning algorithm”
(which by definition performs just a bit better than random guessing) and “boost” it into an arbitrarily “strong” learning algorithm (Schapire, 1999: 1). It “combines the outputs from many ‘weak’ classifiers to produce a powerful ‘committee’” (Hastie et al., 2001: 299). But boosting formally has no stochastic components and so may not seem to be an ensemble method (at least as defined here).
Consider, for example, the AdaBoost algorithm (Hastie et al., 2001: 301; Freund and Schapire, 1996; Schapire, 1999).19

1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N.

2. For m = 1 to M:

(a) Fit a classifier Gm(x) to the training data using the weights wi.

(b) Compute the weighted error rate errm = Σi wi I(yi ≠ Gm(xi)) / Σi wi.

(c) Compute αm = log[(1 − errm)/errm].

(d) Update the weights: wi ← wi · exp[αm · I(yi ≠ Gm(xi))], i = 1, 2, . . . , N.

3. Output the classifier G(x) = sign[Σm αm Gm(x)].
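To make the steps concrete, a minimal sketch in R (not the paper’s code) with rpart stumps as the weak classifier; the data frame dat, with a response y coded 1 or -1 and two predictors x1 and x2, is hypothetical.

library(rpart)
M <- 100
n <- nrow(dat)
w <- rep(1 / n, n)                                       # step 1: equal starting weights
alpha <- numeric(M)
stumps <- vector("list", M)
for (m in 1:M) {
  stumps[[m]] <- rpart(factor(y) ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1))   # a "weak" stump
  pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
  err  <- sum(w * (pred != dat$y)) / sum(w)              # step 2(b): weighted error rate
  alpha[m] <- log((1 - err) / err)                       # step 2(c)
  w <- w * exp(alpha[m] * (pred != dat$y))               # step 2(d): upweight misclassified cases
}
# Step 3: the weighted "committee" vote
score <- rowSums(sapply(1:M, function(m) {
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
}))
boosted_class <- sign(score)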
Friedman and his colleagues (2000) show that AdaBoost is equivalent to forward stagewise additive modeling based on an exponential loss, exp(−y f(x)), where y is the response coded 1 or −1, and f(x) here is the fitted value. They also show that the overall additive expansion constructed by AdaBoost is the same as estimating one half the log odds that y = 1 given x.
While the exponential loss function can be computationally attractive, it may not be ideal. Cases that are atypical can lead AdaBoost in a misleading direction. For example, work by Friedman and his colleagues (2000) suggests that logistic loss is more robust and less vulnerable to overfitting, and in fact, AdaBoost can be treated as a special case of a wide variety of possible loss functions (Friedman, 2001; 2002).
There is no formal stopping rule for AdaBoost and, as a result, AdaBoost can overfit (Jiang, 2004). The number of passes over the data is a tuning parameter that in practice depends on trial and error, often indexed by a measure of fit. One such measure is the cross-validation statistic, but there are several others that each penalize model complexity a bit differently. Often the number of classification errors will decline up to a particular number of passes over the data and then begin to increase. That turning point can sometimes be treated as a useful stopping point. But there is nothing in boosting implying convergence.20
Still, there is a broad consensus that AdaBoost performs well. Part of the reason may be found in a conjecture from Breiman (2001a: 20-21) that AdaBoost is really a random forest. Recall that random number generators are deterministic computer algorithms. Breiman’s conjecture is that AdaBoost behaves as if the weights constructed at each pass over the data were stochastic. If this is correct, formal theory explaining the success of random forests may apply to AdaBoost. For example, as the data are reweighted, highly specialized predictors can play a role. Thus, AdaBoost, and boosting more generally, may actually represent ensemble methods.
The key output from boosting is much the same as the key output from
bagging: predicted classifications, error rates, and “confusion tables.” As with bagging software, information on the importance of predictors is not directly available, but lots of good ideas are being implemented. For example,
“Generalized Boosted Models” (gbm) in R has the measure of importance
that lies behind partial dependence plots (Friedman, 2001) and a measure of importance much like the one Breiman implements in random forests.21

20. Indeed, for a given sample size, “boosting forever” is not consistent (Mannor et al., 2002). But for a given stopping point, Zhang and Yu (2005) show that under fairly general conditions, boosting is consistent as the number of observations increases without limit. One of the complications in this literature is that the term “consistency” can be used for rather different thought experiments.
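A hedged sketch of what a gbm run might look like; the formula (with the response recoded 0/1 as misconduct01), the data frame, and the tuning values are illustrative assumptions rather than the paper’s settings.

library(gbm)
set.seed(1)
boost <- gbm(misconduct01 ~ ., data = inmates,
             distribution = "bernoulli",      # logistic loss; "adaboost" is another option
             n.trees = 2000, interaction.depth = 3,
             shrinkage = 0.01, cv.folds = 5)
best <- gbm.perf(boost, method = "cv")        # choose the number of passes by cross-validation
summary(boost, n.trees = best)                # relative influence of each predictor
plot(boost, i.var = "term", n.trees = best)   # a partial dependence plot for one predictor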
For a quantitative response, the impurity of a node can be defined as the within-node sum of squares, i(τ) = Σ (yi − ȳ(τ))², where the summation is over all cases in the node, and ȳ(τ) is the mean of those cases. As before, the split s is chosen so that it maximizes the reduction in impurity, ∆I(s, τ).
No cost weights are used because there are no false positives and false nega-
tives. Then, to get the impurity for the entire tree, one sums over all terminal
nodes to arrive at R(T ). There are pruning procedures akin to the categori-
cal case. Bagging, out-of-bag estimates, random forests and boosting follow
pretty much as before, but there are often generalizations to loss functions
associated with the generalized linear model.
With CART, prediction is based on some feature of each terminal node,
such as the mean or median. That value is assigned to each case in the
node. For ensemble methods, prediction for each case is the average over
replications of the summary statistic.
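A hedged sketch of the ensemble version for a quantitative response, with random forests and hypothetical variable names; each case’s prediction is the average over trees (here, over the trees for which the case was out-of-bag).

library(randomForest)
rf_reg <- randomForest(term.length ~ ., data = inmates, ntree = 500)
print(rf_reg)             # includes the percent of variance explained, from out-of-bag cases
head(predict(rf_reg))     # out-of-bag predictions: averages over trees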
When the response is quantitative rather than categorical, the current options for representing predictor importance are fewer. They boil down to the contribution of each predictor to
the fitting criterion, much like partitions of the total sum of squares in linear
regression, or partitions of the deviance under the generalized linear model.
It is also possible to work with partitions of the total prediction error, which
has much the same flavor, but is based on test data or out-of-bag (OOB)
observations.
21. The package gbm in R is written by Greg Ridgeway. He is also the maintainer (gregr@rand.org).
6 Uses of Ensemble Methods
It should already be clear that ensemble methods can be used in social sci-
ence analyses for classification and forecasting. Many can also be useful for
describing the relationships between inputs and outputs. But there are
several other applications as well.
of membership in each treatment group estimated with less bias. More
credible estimates of intervention effects would follow.
4. More generally, one could use ensemble methods to implement the co-
variance adjustments inherent in multiple regression and related pro-
cedures. One would “residualize” the response and the predictors of
interest with ensemble methods. The desired regression coefficients
would then be estimated from the sets of residuals. For example, if
there is a single predictor on which interest centers (e.g., participation
in a social program), one would simply regress the residualized response
on the single residualized predictor. Both would have been constructed
as the difference between the observed values and the estimated values
from the ensemble output. It is likely that the adjustments for confounding would be more complete than with conventional covariance adjustments (a sketch of this residualizing strategy follows below).
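A hedged sketch of that residualizing strategy in R with random forests; the data frame dat, the quantitative response y, the single predictor of interest d (treated here as numeric), and the confounders x1 and x2 are all hypothetical.

library(randomForest)
set.seed(1)
rf_y <- randomForest(y ~ x1 + x2, data = dat)    # adjust the response for the confounders
rf_d <- randomForest(d ~ x1 + x2, data = dat)    # adjust the predictor of interest
res_y <- dat$y - predict(rf_y)                   # predict() with no newdata gives out-of-bag fits
res_d <- dat$d - predict(rf_d)
summary(lm(res_y ~ res_d))                       # the coefficient on res_d is the adjusted estimate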
3. determining which variables in the data set are to be inputs and which
are to be outputs;
However, none of these activities are formal or deductive, and they leave lots
of room for interpretation. If the truth be told, theory plays much the same
role in data mining as it does in most conventional analyses. But in data
mining, there is far less pretense.
8 Software Considerations
As noted earlier, CART is widely available in most popular statistical pack-
ages. Bagging, random forests and boosting are available in R (along with
support vector machines and other powerful data mining tools). A web
search will readily locate lots of other data mining shareware although not
within the same kind of broad computing environment that one can find in
R. There are also a growing number of private sector software providers who
offer a wide range of data mining procedures. For example, Salford Systems
(www.salford-systems.com/) sells user-friendly versions of CART, random
forests, and MARS.
There are important tradeoffs. The procedures available in R will likely
be cutting edge, even experimental, but users sometimes have to live with
bugs and incomplete documentation. In addition, the code is likely to be
updated relatively frequently (e.g., twice in a year) and in a manner that is
not necessarily downwardly compatible. Also, in R there is no point-and-
click. Most work is done on the command line.
Private sector products will generally be somewhat behind on the latest
developments, often several years behind. But if one can afford the pur-
chase price or licensing fees, the rewards are likely to be reliable code, good
documentation, and a user friendly interface.
9 Conclusions
Ensemble methods and related procedures are not simply an elaboration on
conventional statistical and causal modeling. They can represent a funda-
mental break with current traditions in applied social research that has been
dominated by causal modeling. There is no doubt that for many kinds of applications, ensemble methods are the strongest procedures known. Moreover, data mining more generally is evolving very rapidly. Performance will likely improve further.
10 References
Berk, R.A. (2003) Regression Analysis: A Constructive Critique. Sage Pub-
lications, Newbury Park, CA.
Berk, R.A., and J. Baek. (2003) “Ensemble Procedures for Finding High
Risk Prison Inmates.” Department of Statistics, UCLA (under review).
Berk, R.A., Li, A., and L.J. Hickman (2005b) “Statistical Difficulties in
Determining the Role of Race in Capital Cases: A Re-analysis of Data
from the State of Maryland,” Quantitative Criminology, forthcoming.
Breiman, L., Friedman, J.H., Olshen, R.A., and C.J. Stone (1984) Classification and Regression Trees. Monterey, CA: Wadsworth.
Breiman, L. (2001d) “Wald Lecture II: Looking Inside the Black Box,” at ftp://ftp.stat.berkeley.edu/pub/users/breiman/
Chipman, H.A., George, E.I., and R.E. McCulloch (1998) “Bayesian CART Model Search” (with discussion). Journal of the American Statistical Association 93: 935-960.
Dasu, T., and T. Johnson (2003) Exploratory Data Mining and Data Clean-
ing. New York: John Wiley and Sons.
Fan G., and B. Gray (2005) “Regression Tree Analysis Using TARGET.”
Journal of Computational and Graphical Statistics 14: 206-218.
Hastie, T., Tibshirani, R., and J. Friedman (2001) The Elements of Statistical Learning. Springer-Verlag.
Loh, W.-Y. (2002) “Regression Trees with Unbiased Variable Selection and
Interaction Detection.” Statistica Sinica 12: 361-386.
McCaffrey, D., Ridgeway, G., and A. Morral (2004) “Propensity Score Esti-
mation with Boosted Regression for Evaluating Adolescent Substance
Abuse Treatment.” Psychological Methods 9: 403-425.
Segal, M.R. (2003) “Machine Learning Benchmarks and Random Forest Regression.” Working paper, Department of Biostatistics, University of California, San Francisco (http://www.ics.uci.edu/mlearn/MLRepository.html).
Su, X., Wang, M., and J. Fan (2004) “Maximum Likelihood Regression Trees.” Journal of Computational and Graphical Statistics 13: 586-598.

Sutton, R.S. and A.G. Barto (1999) Reinforcement Learning. MIT Press.
Venables, W.N. and B.D. Ripley (2002) Modern Applied Statistics with S, fourth edition. Springer-Verlag, Chapter 9.