Random Forest
Algorithm class: Non-parametric
Mechanism: Average predictions of many trees (de-correlated)
Applicable: Both classification and regression problems
Random Forest is a generalization of bagging, and typically achieves much
better performance
• Essentially, it improves on bagging with one small tweak: de-correlating the trees
• This further reduces the variance when we average the trees
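The variance-reduction claim can be made concrete with the standard formula for the variance of an average of B identically distributed trees with pairwise correlation ρ (a base-R sketch; the values of rho, sigma2 and B below are purely illustrative):

```r
# Variance of an average of B trees, each with variance sigma2 and
# pairwise correlation rho:  rho*sigma2 + (1 - rho)*sigma2 / B.
# Averaging shrinks only the second term; de-correlating the trees
# (lowering rho) shrinks the first term, which averaging alone cannot.
avg_var <- function(rho, sigma2, B) rho * sigma2 + (1 - rho) * sigma2 / B

avg_var(rho = 0.8, sigma2 = 1, B = 500)   # bagging-like: 0.8004
avg_var(rho = 0.2, sigma2 = 1, B = 500)   # RF-like:      0.2016
```

Even with many trees, highly correlated trees leave most of the variance in place; this is why split-variable randomization helps.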
Idea
Split-variable randomization
• Follow a similar bagging process, but …
• each time a split is to be performed, only a random subset of m out of the p predictors is considered as split candidates
  - regression trees: m = p/3
  - classification trees: m = √p
  - m is commonly referred to as mtry
[Figures: trees produced by bagging vs. trees produced by RF]
Essentially
• Bagging introduces randomness into rows of the data
• Random forest introduces randomness into the columns (split variables) as well
• This provides a more diverse set of trees that almost always lowers the
prediction error
Out of bag (OOB) Performance
• For large enough N, on average 63% of the original records end up in any
bootstrap sample
• i.e. 37% of the observations are not used in the construction of a particular tree
• These observations are considered OOB and can be used for efficient assessment
of model performance (unstructured, but free, cross validation)
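The 63%/37% figures follow from the bootstrap: the chance a given record is never drawn in N draws with replacement is (1 − 1/N)^N, which converges to e⁻¹ ≈ 0.368. A quick base-R check:

```r
# Probability a given record is NOT drawn in N draws with replacement:
# (1 - 1/N)^N -> exp(-1) ≈ 0.368 as N grows, so ~63% of records appear
# in a bootstrap sample and ~37% are out-of-bag (OOB).
N <- 10000
p_oob <- (1 - 1/N)^N
p_oob        # ≈ 0.368
exp(-1)      # the limiting value, ≈ 0.368

# Simulation: fraction of unique records appearing in one bootstrap sample
set.seed(123)
in_bag <- mean(seq_len(N) %in% sample(N, N, replace = TRUE))
in_bag       # ≈ 0.632
```

The ~37% OOB records act as a held-out set for each tree, which is what makes the "free" performance assessment possible.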
• RF typically has the least variability in prediction accuracy when tuning
• Let’s now look at how to implement RF
Implementation of Random Forest
• Simple way: ranger, full grid search
• More advanced: h2o, random grid search & early stopping rules
Ames Housing Example (RF), with ranger package
Direct implementation of RF, no tuning
…
For regression tree
Baseline RF model, RMSE ≈ 25,500
Next, we will look at how to tune hyperparameters to improve the model
Tuning Hyperparameters
Random forests provide good "out-of-the-box" performance, but there are a few hyperparameters
we can tune to increase performance.
• # Trees and mtry: typically have the largest impact on predictive accuracy
• Min node size / max depth (tree complexity) and sampling scheme: some impact on predictive accuracy, but tuning them can also increase computational efficiency
Tuning Hyperparameters: # Trees
• Needs to be sufficiently large to stabilize the error rate
• Rule of thumb: start with 10 × p trees and adjust as necessary
• More trees provide more robust and stable error estimates and variable-importance measures
• Computation time increases linearly with the number of trees
Tuning Hyperparameters: mtry (# split vars)
• Balance low tree correlation against reasonable predictive strength
• Rule-of-thumb defaults:
  - regression: p/3
  - classification: √p
• Start with 5 values evenly spaced from 2 to p, including the default rule-of-thumb value
• Few relevant predictors: should we increase or decrease mtry? (with many irrelevant features, a larger mtry tends to perform better)
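The "5 evenly spaced values from 2 to p, plus the default" rule can be sketched in base R (p = 80 is an illustrative feature count, not taken from the slides):

```r
# Candidate mtry values: 5 evenly spaced values from 2 to p, combined
# with the rule-of-thumb default for regression (floor(p/3)).
p <- 80  # assumed number of features, for illustration only
mtry_grid <- unique(sort(c(round(seq(2, p, length.out = 5)), floor(p / 3))))
mtry_grid  # evenly spaced candidates including the default of 26
```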
Tuning Hyperparameters: Min node size/Max depth (Tree Complexity)
• Controls the complexity of individual trees
• Rule of thumb:
  - regression default: 5
  - classification default: 1
  - start with 3 values (1, 5, 10)
• A study (Segal, 2004) has shown: with few relevant (i.e. many noisy) predictors, increasing node size can help
• Very large data sets: increase node size to reduce run time
[Figure: impact of node size on error growth and run-time reduction]
• If run time is a concern, it can be reduced substantially by increasing node size
Tuning Hyperparameters: Sampling scheme
1. Sample size (default: 100%)
2. Sampling with replacement / without replacement (default: with replacement)
Rationale:
• Decreasing the sample size lowers between-tree correlation
• Sampling without replacement produces trees that are less biased
  - ensures observations with low-frequency categories are more likely to be selected
  - especially important when the data has imbalanced categories
Rule of thumb:
• Try 3-4 sample sizes ranging from 25% to 100%
• Try sampling both with and without replacement
Ames Housing Example (RF), with ranger package (cont’d)
Tuning Strategy Illustration: build a grid over mtry, min node size, and sampling scheme
Note: [Link] returns a dataframe with columns mtry,
[Link], replace, [Link], rmse (values to be filled)
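The exact helper used on the slide is not shown, but `expand.grid` is the standard base-R way to build such a dataframe, and it matches the column names shown above (the p = 80 feature count and the specific candidate values below are illustrative assumptions):

```r
# Hedged sketch of the hyperparameter grid; column names follow ranger's
# argument names, candidate values are illustrative.
hyper_grid <- expand.grid(
  mtry            = floor(80 * c(.05, .15, .25, .333, .4)),  # p = 80 assumed
  min.node.size   = c(1, 5, 10),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(.5, .63, .8),
  rmse            = NA                                       # to be filled in
)
nrow(hyper_grid)  # 5 * 3 * 2 * 3 = 90 combinations
```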
Tuning Strategy Illustration: loop over the grid (# trees, mtry, node size, sampling scheme)
Fills in the rmse column of hyper_grid
(created by [Link])
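A sketch of what this tuning loop typically looks like with ranger: fit one forest per grid row and record the OOB RMSE. It assumes `ames_train` from the earlier slides and a grid like the one above (the candidate values here are illustrative), and is guarded so it degrades gracefully when ranger or the data is unavailable:

```r
# Illustrative grid (values are assumptions, not the slide's exact grid)
hyper_grid <- expand.grid(
  mtry            = c(20, 26, 32),
  min.node.size   = c(1, 5, 10),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(.5, .8),
  rmse            = NA
)

# Fit one RF per grid row; skip if ranger / ames_train are unavailable
if (requireNamespace("ranger", quietly = TRUE) && exists("ames_train")) {
  for (i in seq_len(nrow(hyper_grid))) {
    fit <- ranger::ranger(
      formula         = Sale_Price ~ .,
      data            = ames_train,
      num.trees       = 800,                        # ~10 * p, p = 80 assumed
      mtry            = hyper_grid$mtry[i],
      min.node.size   = hyper_grid$min.node.size[i],
      replace         = hyper_grid$replace[i],
      sample.fraction = hyper_grid$sample.fraction[i],
      seed            = 123,
      verbose         = FALSE
    )
    # For regression, ranger stores the OOB MSE in prediction.error
    hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
  }
}
```

Because the RMSE comes from the OOB records, no separate cross-validation loop is needed.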
Tuning Strategy Illustration: % improvement of RMSE w.r.t. the baseline model
• RMSE improves slightly over the baseline model
Observations
1. The default mtry = 26 (# features / 3) is nearly sufficient
2. Smaller node sizes (deeper trees) perform better
3. Sampling < 100% and sampling without replacement consistently perform better
   • probably because the data has many high-cardinality and imbalanced categorical features
Ames Housing Example (RF), with h2o package
Benefits of the h2o package:
• Random grid search
  - a full Cartesian hyperparameter search can be computationally expensive
  - instead, randomly jump from one hyperparameter combination to another
• Can specify early stopping rules
  - e.g. stop once # models trained ≥ a threshold, or after a certain runtime elapses
Baseline h2o RF
• Syntax and result are very similar to the baseline ranger RF
h2o RF with Random Grid Search + Early Stopping Rule (Optional)
Recall that in ranger, we built the hyperparameter grid using the following syntax
In h2o, we instead specify the hyperparameter grid as a list
Random grid-search strategy: "RandomDiscrete"
• Randomly jumps from one hyperparameter combination to another
Early stopping criteria for the grid search
• Stop if the last 10 RF models do NOT improve RMSE by at least 0.1%
• Stop if run time exceeds 5 minutes
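The two lists above can be sketched as follows. The list-entry names (`mtries`, `min_rows`, `sample_rate`, and the `search_criteria` fields) follow h2o's interface for its random forest and grid search; the candidate hyperparameter values are illustrative assumptions, while the stopping settings encode the rules stated above:

```r
# Hyperparameter grid as an h2o-style list (h2o uses min_rows for
# min node size and sample_rate for the sampling fraction)
hyper_params <- list(
  mtries      = c(20, 26, 32),
  min_rows    = c(1, 5, 10),
  sample_rate = c(.5, .63, .8)
)

# Random-search strategy and early stopping for the grid search
search_criteria <- list(
  strategy           = "RandomDiscrete",
  stopping_metric    = "RMSE",
  stopping_tolerance = 0.001,    # last models must improve RMSE by 0.1%
  stopping_rounds    = 10,       # ... over the last 10 models
  max_runtime_secs   = 60 * 5    # stop after 5 minutes
)

# These lists would then be passed to h2o.grid("randomForest", ...,
# hyper_params = hyper_params, search_criteria = search_criteria)
```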
Early stopping criteria for building one RF
• Stop if the last 10 trees added do NOT improve RMSE by 0.5%
Note: with early stopping, results may NOT be
reproducible (# models searched will differ across
machines of different speeds)
• Assessed 66 models; best CV RMSE = 24,670
• This is near-optimal, and the random grid search is more efficient than a full Cartesian search
Feature Interpretation
For RF: two approaches to variable importance
(At this point you do not need to know the details; just know there are two measures)
Impurity (same as CART)
• Based on the average total reduction in MSE from splits on the feature
Permutation (applicable to all ML models; discussed in more detail later)
• Randomly permute a feature's values and see how much the MSE worsens
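The permutation idea is model-agnostic, so it can be illustrated with base R alone. This toy sketch uses a linear model on the built-in mtcars data rather than an RF (purely to keep it self-contained): shuffle one feature, leave the fitted model untouched, and measure how much the MSE worsens:

```r
# Toy permutation importance: break one feature's link to the response
# and measure the resulting increase in MSE (larger increase = more
# important feature). The model and data here are illustrative only.
set.seed(42)
fit  <- lm(mpg ~ wt + hp, data = mtcars)
mse  <- function(model, d) mean((d$mpg - predict(model, d))^2)
base <- mse(fit, mtcars)

perm_importance <- function(feature) {
  d <- mtcars
  d[[feature]] <- sample(d[[feature]])   # permute the feature's values
  mse(fit, d) - base                     # increase in MSE
}

perm_importance("wt")   # typically a large increase: wt matters for mpg
perm_importance("hp")
```

An RF analogue simply swaps the `lm` fit for the forest; ranger can also compute this directly via its `importance = "permutation"` option.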
E.g. using ranger
Typically, similar variables appear at the top under both approaches
• We can conclude the top 3 important variables are Overall_Qual, Gr_Liv_Area, Neighborhood
Summary
Method: CART
• Hyperparameters: tree depth, node size, cp
• Unique features: simple to interpret
• RMSE: -
• Packages demonstrated: rpart; caret (method = "rpart")

Method: Random Forest
• Hyperparameters: # trees (~10p); mtry (# split vars: p/3 or √p); node size (tree complexity); sampling scheme (sample size, with/without replacement)
• Unique features: subsamples rows/columns; early stopping (in adding trees)
• RMSE: ~24,000
• Packages demonstrated: ranger; h2o (algorithm = "randomForest")
End