Random Forest
Usage in R
The user interface to random forest is consistent with that of other classification functions such as nnet() (in the nnet package) and svm() (in the e1071 package). (We actually borrowed some of the interface code from those two functions.) There is a formula interface, and predictors can be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. If the response is a factor, randomForest performs classification; if the response is continuous (that is, not a factor), randomForest performs regression. If the response is unspecified, randomForest performs unsupervised learning (see below). Currently randomForest does not handle ordinal categorical responses. Note that categorical predictor variables must also be specified as factors (or else they will be wrongly treated as continuous).

The randomForest function returns an object of class "randomForest". Details on the components of such an object are provided in the online documentation. Methods provided for the class include predict and print.

A classification example

The Forensic Glass data set was used in Chapter 12 of MASS4 (Venables and Ripley, 2002) to illustrate various classification algorithms. We use it here to show how random forests work:

> library(randomForest)
> library(MASS)
> data(fgl)
> set.seed(17)
> fgl.rf <- randomForest(type ~ ., data = fgl,
+     mtry = 2, importance = TRUE,
+     do.trace = 100)
100: OOB error rate=20.56%
200: OOB error rate=21.03%
300: OOB error rate=19.63%
400: OOB error rate=19.63%
500: OOB error rate=19.16%
> print(fgl.rf)
Call:
 randomForest.formula(formula = type ~ .,
     data = fgl, mtry = 2, importance = TRUE,
     do.trace = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

We can compare random forests with support vector machines by doing ten repetitions of 10-fold cross-validation, using the errorest function in the ipred package:

> library(ipred)
> set.seed(131)
> error.RF <- numeric(10)
> for (i in 1:10) error.RF[i] <-
+     errorest(type ~ ., data = fgl,
+     model = randomForest, mtry = 2)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1869  0.1974  0.2009  0.2009  0.2044  0.2103
> library(e1071)
> set.seed(563)
> error.SVM <- numeric(10)
> for (i in 1:10) error.SVM[i] <-
+     errorest(type ~ ., data = fgl,
+     model = svm, cost = 10, gamma = 1.5)$error
> summary(error.SVM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2430  0.2453  0.2523  0.2561  0.2664  0.2710

We see that the random forest compares quite favorably with SVM: its mean cross-validated error is about 20%, against about 26% for the SVM.

We have found that the variable importance measures produced by random forests can sometimes be useful for model reduction (e.g., use the important variables to build simpler, more readily interpretable models). Figure 1 shows the variable importance of the Forensic Glass data set, based on the fgl.rf object created above. Roughly, it is created by

> par(mfrow = c(2, 2))
> for (i in 1:4)
+     plot(sort(fgl.rf$importance[,i], decreasing = TRUE),
+     type = "h", main = paste("Measure", i))
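As a rough illustration of the model-reduction idea, the sketch below ranks the predictors by one importance measure and refits a forest on the top few. It assumes a current version of the randomForest package, in which the importance() extractor with type = 1 returns the permutation-based mean decrease in accuracy. The cut-off of five variables and the names imp, top.vars, and fgl.rf.small are our own choices for illustration, not part of the analysis above.

```r
library(randomForest)
library(MASS)                     # provides the fgl data

data(fgl)
set.seed(17)
fgl.rf <- randomForest(type ~ ., data = fgl, mtry = 2, importance = TRUE)

# Permutation-based importance (mean decrease in accuracy), one row per predictor.
imp <- importance(fgl.rf, type = 1)

# Keep the five highest-ranked predictors (five is an arbitrary, illustrative choice).
top.vars <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:5]

# Refit on the reduced predictor set using the x/y interface.
fgl.rf.small <- randomForest(x = fgl[, top.vars], y = fgl$type, mtry = 2)

# Compare the final OOB error rates of the full and the reduced forest.
oob.full  <- unname(fgl.rf$err.rate[fgl.rf$ntree, "OOB"])
oob.small <- unname(fgl.rf.small$err.rate[fgl.rf.small$ntree, "OOB"])
print(c(full = oob.full, reduced = oob.small))
```

Whether the reduced forest is good enough depends on the data; the comparison of OOB error rates is only a first check.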
[Figure 1: Variable importance for the Forensic Glass data. Four panels, Measure 1 to Measure 4, each showing the predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe) sorted by decreasing importance.]

A regression example

Regression forests differ from classification forests in a few respects:

The default mtry is $p/3$, as opposed to $p^{1/2}$ for classification, where $p$ is the number of predictors.

The default nodesize is 5, as opposed to 1 for classification. (In the tree building algorithm, nodes with fewer than nodesize observations are not split.)

There is only one measure of variable importance, instead of four.

> data(Boston)
> set.seed(1341)
> BH.rf <- randomForest(medv ~ ., Boston)
> print(BH.rf)
Call:
 randomForest.formula(formula = medv ~ .,
     data = Boston)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4
          Mean of squared residuals: 10.64615
                    % Var explained: 87.39

The mean of squared residuals is computed as

$$\mathrm{MSE}_{OOB} = n^{-1} \sum_{i=1}^{n} \{ y_i - \hat{y}_i^{OOB} \}^2,$$

where $\hat{y}_i^{OOB}$ is the average of the OOB predictions for the $i$th observation. The percent variance explained is computed as

$$1 - \frac{\mathrm{MSE}_{OOB}}{\hat{\sigma}_y^2},$$

where $\hat{\sigma}_y^2$ is the variance of the response.

Figure 2: Comparison of the predictions from random forest and a linear model with the actual response of the Boston Housing data.

An unsupervised learning example

Because random forests are collections of classification or regression trees, it is not immediately apparent how they can be used for unsupervised learning. The trick is to call the data class 1 and construct synthetic class 2 data, then try to classify the combined data with a random forest. There are two ways to simulate the class 2 data:

1. The class 2 data are sampled from the product of the marginal distributions of the variables (by independent bootstrap of each variable separately).
2. The class 2 data are sampled uniformly from the hypercube containing the data (by sampling uniformly within the range of each variable).

The idea is that real data points that are similar to one another will frequently end up in the same terminal node of a tree, which is exactly what is measured by the proximity matrix that can be returned using the proximity=TRUE option of randomForest. Thus the proximity matrix can be taken as a similarity measure, and clustering or multi-dimensional scaling using this similarity can be used to divide the original data points into groups for visual exploration.

We use the crabs data in MASS4 to demonstrate the unsupervised learning mode of randomForest. We scaled the data as suggested on pages 308-309 of MASS4 (also found in lines 28-29 and 63-68 in $R_HOME/library/MASS/scripts/ch11.R), resulting in the dslcrabs data frame below. We then run randomForest to get the proximity matrix, and use cmdscale() (in package mva; in current versions of R it is part of the stats package) to visualize 1 - proximity, as shown in Figure 3. As can be seen in the figure, the two color forms are fairly well separated.

> ## In current versions of R, cmdscale() is in stats, so library(mva)
> ## is not needed; as.integer() replaces the defunct codes().
> set.seed(131)
> crabs.prox <- randomForest(dslcrabs,
+     ntree = 1000, proximity = TRUE)$proximity
> crabs.mds <- cmdscale(1 - crabs.prox)
> plot(crabs.mds, col = c("blue",
+     "orange")[as.integer(crabs$sp)], pch = c(1,
+     16)[as.integer(crabs$sex)], xlab="", ylab="")

[Figure 3: Plot of crabs.mds, the multi-dimensional scaling of 1 - proximity for the crabs data.]

[...] data set. This measure of outlyingness for the jth observation is calculated as the reciprocal of the sum of squared proximities between that observation and all other observations in the same class. The Example section of the help page for randomForest shows the measure of outlyingness for the Iris data (assuming they are unlabelled).

Some notes for practical use

The number of trees necessary for good performance grows with the number of predictors. The best way to determine how many trees are necessary is to compare predictions made by a forest to predictions made by a subset of a forest. When the subsets work as well as the full forest, you have enough trees.

For selecting mtry, Prof. Breiman suggests trying the default, half of the default, and twice the default, and picking the best. In our experience, the results generally do not change dramatically. Even mtry = 1 can give very good performance for some data! If one has a very large number of variables but expects only very few to be important, using larger mtry may give better performance.

A lot of trees are necessary to get stable estimates of variable importance and proximity. However, our experience has been that even though the variable importance measures may vary from run to run, the ranking of the importances is quite stable.
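One way to check the stability of the importance ranking on a given data set is to grow two forests that differ only in the random seed and compare the rankings. The sketch below does this for the Forensic Glass data. It assumes a current version of the randomForest package and its importance() extractor; the helper importance_for_seed, the seeds, and the use of Spearman's rank correlation are our own illustrative choices.

```r
library(randomForest)
library(MASS)                     # provides the fgl data
data(fgl)

# Grow a forest with a given seed and return the permutation-based
# importance (type = 1, mean decrease in accuracy) of each predictor.
# importance_for_seed is a hypothetical helper written for this sketch.
importance_for_seed <- function(seed, ntree = 500) {
  set.seed(seed)
  rf <- randomForest(type ~ ., data = fgl, ntree = ntree, importance = TRUE)
  importance(rf, type = 1)[, 1]
}

imp.a <- importance_for_seed(1)
imp.b <- importance_for_seed(2)

# The raw values differ between runs, so compare the rankings instead
# (rank 1 = most important predictor).
print(rbind(run.a = rank(-imp.a), run.b = rank(-imp.b)))

# Spearman's rank correlation close to 1 indicates a stable ranking.
rank.cor <- cor(imp.a, imp.b, method = "spearman")
print(rank.cor)
```

If the rank correlation is low, growing more trees (a larger ntree) is the first thing to try.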
For classification problems where the class frequencies are extremely unbalanced (e.g., 99% class 1 and 1% class 2), it may be necessary to change the prediction rule to other than majority votes. For example, in a two-class problem with 99% class 1 and 1% class 2, one may want to predict the 1% of the observations with largest class 2 probabilities as class 2, and use the smallest of those probabilities as threshold for prediction of test data (i.e., use the type="prob" argument in the predict method and threshold the second column of the output).
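The thresholding rule for unbalanced classes can be sketched in base R as follows. The two helper functions and the toy probabilities are made up for illustration; in practice p.train and p.test would be the second column of the matrix returned by predict() with type="prob" for the training and test data.

```r
# Smallest class 2 probability among the top `frac` share of training cases.
# class2_threshold is a hypothetical helper, not part of randomForest.
class2_threshold <- function(p.train, frac = 0.01) {
  k <- max(1, round(frac * length(p.train)))   # number of cases to call class 2
  sort(p.train, decreasing = TRUE)[k]
}

# Predict class 2 whenever the class 2 probability reaches the threshold.
predict_with_threshold <- function(p, threshold) {
  ifelse(p >= threshold, 2L, 1L)
}

# Toy class 2 probabilities (made up); frac = 0.2 keeps the example small.
p.train <- c(0.02, 0.10, 0.45, 0.01, 0.30, 0.05, 0.00, 0.03, 0.08, 0.20)
thr <- class2_threshold(p.train, frac = 0.2)   # top 2 of 10 cases: 0.45 and 0.30

p.test <- c(0.25, 0.30, 0.50)
pred.test <- predict_with_threshold(p.test, thr)
print(thr)
print(pred.test)
```

With these toy values the threshold is 0.30, so the second and third test cases are assigned to class 2 even though none of the three would win a majority vote at a probability of 0.5 except the last.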
[...] memory (and potentially execution time) can be saved.

Since the algorithm falls into the embarrassingly parallel category, one can run several random forests on different machines and then aggregate the votes component to get the final result.

Bibliography

L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf.

T. Bylander. Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning, 48:287-297, 2002.

Andy Liaw
Matthew Wiener
Merck Research Laboratories
andy_liaw@merck.com
matthew_wiener@merck.com