Lecture 15: Tree-Based Algorithms — Applied ML
Contents
Lecture 15: Tree-Based Algorithms
15.1. Decision Trees
15.2. Learning Decision Trees
15.3. Bagging
15.4. Random Forests
In this lecture, we will cover a new class of supervised machine learning models, namely
tree-based models, which can be used for both regression and classification tasks.
We start by building some intuition about decision trees and apply this algorithm to a familiar
dataset we have already seen.
15.1.1. Intuition
Decision trees are machine learning models that mimic how a human might approach a
prediction problem, by applying a sequence of simple rules.
Training Dataset + Learning Algorithm → Predictive Model
Attributes + Features Model Class + Objective + Optimizer
Let’s start by loading this dataset and looking at some rows from the data.
https://kuleshov-group.github.io/aml-book/contents/lecture15-decision-trees.html Page 1 of 17
Lecture 15: Tree-Based Algorithms — Applied ML 20/10/24, 6:23 PM
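import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
from sklearn import datasets

# Load the diabetes dataset as pandas objects, print its description,
# and show the first few rows of the features.
diabetes = datasets.load_diabetes(as_frame=True)
print(diabetes.DESCR)
diabetes.data.head()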
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times `n_samples` (i.e. the sum of squares of each
column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641
Let’s break down this visualization a bit more. The visualized output above depicts the
process by which our trained decision tree classifier makes predictions. Each of the non-leaf
nodes in this diagram corresponds to a decision that the algorithm makes based on the data,
or alternatively a rule that it applies. Namely, for a new data point, starting at the root of this
tree, the classifier looks at a specific feature and asks whether the value for this feature is
above or below some threshold. If it is below, we proceed down the tree to the left, and if it is
above, we proceed to the right. We continue this process at each depth of the tree until we
reach one of the leaf nodes (of which we have four in this example).
Let’s follow these decisions down to one of the terminal leaf nodes to see how this works.
Starting at the top, the algorithm asks whether a patient has high or low bmi (defined by the
threshold 0.009). For the high bmi patients, we proceed down the right and look at their
blood pressure. For patients with high blood pressure (values exceeding 0.017), the
algorithm has identified a ‘high’ risk sub-population of our dataset, as we see that 76 out of the
86 patients that fall into this category have a high risk of diabetes.
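A tree of this shape can be reproduced approximately with scikit-learn. The sketch below is an illustration under stated assumptions rather than the code behind the figure: the binary "high risk" label is defined here by thresholding the progression score at 150 (a cutoff chosen only for this example), and only the bmi and bp features are used, so the learned thresholds need not match those quoted above. `export_text` prints the learned rules as nested threshold tests.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the features and the quantitative disease-progression target.
diabetes = datasets.load_diabetes(as_frame=True)
X = diabetes.data[["bmi", "bp"]]

# ASSUMPTION: call a patient "high risk" when progression exceeds 150.
# This cutoff is illustrative and is not taken from the lecture.
y_risk = (diabetes.target > 150).astype(int)

# A depth-2 binary tree has at most 3 internal nodes and 4 leaves.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y_risk)

# Print the learned rules, one threshold test per internal node.
print(export_text(clf, feature_names=["bmi", "bp"]))
```

Each indented line of the printed output corresponds to following one branch of the tree, which mirrors the left/right walk described above.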
A decision rule r : X → {True, False} is a partition of the feature space into two
disjoint regions, e.g.:
In theory, we could have more complex rules with multiple thresholds, but this would add
unnecessary complexity to the algorithm, since we can simply break rules with multiple
thresholds into several sequential rules.
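To make the idea of a rule concrete, here is a minimal sketch that represents a single-threshold rule as a Python function and shows how a two-threshold rule can be evaluated as two sequential single-threshold rules. The helper names (`make_threshold_rule`, `in_band`) and the patient values are hypothetical; the thresholds 0.009 and 0.017 are the ones quoted in the diabetes example.

```python
def make_threshold_rule(feature, threshold):
    """Return a rule r(x) that is True when x[feature] <= threshold."""
    def rule(x):
        return x[feature] <= threshold
    return rule

# Thresholds 0.009 (bmi) and 0.017 (bp) are the ones quoted in the text.
low_bmi = make_threshold_rule("bmi", 0.009)
low_bp = make_threshold_rule("bp", 0.017)

def in_band(x, feature, lower, upper):
    """Evaluate the two-threshold rule lower < x[feature] <= upper
    as two sequential single-threshold rules."""
    below_upper = make_threshold_rule(feature, upper)
    below_lower = make_threshold_rule(feature, lower)
    if not below_upper(x):       # first decision: x[feature] <= upper ?
        return False
    return not below_lower(x)    # second decision: x[feature] > lower ?

patient = {"bmi": 0.0617, "bp": 0.0219}   # hypothetical feature values
print(low_bmi(patient), low_bp(patient))
print(in_band(patient, "bmi", 0.009, 0.1))
```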
With this definition of decision rules, we can now more formally specify decision trees as
(usually binary) trees, where:
Returning to the visualization from our diabetes example, we have 3 internal nodes
corresponding to the top two layers of the tree and four leaf nodes at the bottom of the tree.
We can illustrate decision regions via this figure from Hastie et al.
Setting aside the top left image for a moment, the other 3 depictions from Hastie et al. show
how we can visualize decision trees. For an input space of two features X1, X2, we first have
the actual tree, similar to what we’ve seen thus far (bottom left). This tree has a maximum
depth of 3 and first splits the inputs along the X1 dimension. On the left branch we have one
more internal decision node based on values of X2 resulting in 2 leaf nodes that each
correspond to a region of the input space. On the right branch of the tree, we see two more
decision nodes that result in an additional 3 leaf nodes, each corresponding to a different
region. Note that as mentioned above, rather than using more complicated decision rules
such as t1 < X1 ≤ t3, we can separate this rule into two sequential and simpler decisions,
i.e., the root node and the first decision node to its right.
We can equivalently view this decision tree in terms of how it partitions the input space (top
right image). Note that each of the partition lines on the X1 and X2 axes corresponds to one of
our decision nodes, and the partitioned regions correspond to the leaf nodes in the tree depicted
in the bottom left image.
Finally, if each region corresponds to a predicted value for our target, then we can draw
these decision regions with their corresponding prediction values on a third axis (bottom
right image).
Returning to the top left image, this image actually depicts a partition of the input space that
would not be possible using decision trees, which highlights the point that not every partition
corresponds to a decision tree. Specifically, for this image the central region that is non-
rectangular cannot be created from a simple branching procedure of a decision tree.
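A decision tree is a model f : X → Y of the form
$$ f(x) = \sum_{R \in \mathcal{R}} y_R \, \mathbb{I}\{x \in R\}. $$
Here I{⋅} is an indicator function (one if {⋅} is true, else zero) and the values yR ∈ Y are
the outputs for each region R.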
The set R is a collection of decision regions. They are obtained by a recursive learning
procedure (recursive binary splitting).
The rules defining the regions R can be organized into a tree, with one rule per internal
node and regions being the leaves.
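The region view can be sketched directly in code: a prediction is a sum over regions of the region's output times an indicator of membership. The example below is a toy illustration, assuming three axis-aligned boxes that correspond to the tree "X1 ≤ 0.5, then X2 ≤ 0.3"; the thresholds and outputs are made up.

```python
import numpy as np

# Each region is an axis-aligned box: (lower bounds, upper bounds, output y_R).
# Boxes use the convention lower < x <= upper so that they do not overlap.
# The thresholds and outputs below are made up for illustration.
regions = [
    (np.array([-np.inf, -np.inf]), np.array([0.5, 0.3]), 1.0),
    (np.array([-np.inf, 0.3]), np.array([0.5, np.inf]), 2.0),
    (np.array([0.5, -np.inf]), np.array([np.inf, np.inf]), 3.0),
]

def indicator(x, lower, upper):
    """I{x in R}: 1.0 if x lies in the box, else 0.0."""
    return float(np.all(x > lower) and np.all(x <= upper))

def predict(x):
    """f(x) = sum over regions of y_R * I{x in R}."""
    return sum(y_R * indicator(x, lo, hi) for lo, hi, y_R in regions)

print(predict(np.array([0.2, 0.1])))  # falls in the first box
print(predict(np.array([0.2, 0.9])))  # falls in the second box
print(predict(np.array([0.8, 0.9])))  # falls in the third box
```

Because the boxes are disjoint and cover the whole plane, exactly one indicator is non-zero for any input, so the sum simply selects the output of the region containing x.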
We saw how decision trees are represented. How do we now learn them from data?
At a high level, decision trees are grown by adding nodes one at a time, as shown in the
following pseudocode:
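def build_tree():
    # grow the tree one rule at a time until it is complete
    while not tree.is_complete():
        # pick a leaf together with the training data that falls into it
        leaf, leaf_data = tree.get_leaf()
        # choose a rule for splitting that data
        new_rule = create_rule(leaf_data)
        # attach the rule to the leaf, turning it into an internal node
        tree.append_rule(leaf, new_rule)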
Most often, we build the tree until it reaches a maximum number of nodes. The crux of the
algorithm is in create_rule.
When x has continuous features, the rules have the following form:
$$ r(x) = \begin{cases} \text{True} & \text{if } x_j \leq t \\ \text{False} & \text{if } x_j > t \end{cases} $$
When x has categorical features, rules may have the following form:
$$ r(x) = \begin{cases} \text{True} & \text{if } x_j = t_k \\ \text{False} & \text{if } x_j \neq t_k \end{cases} $$
Thus each rule r : X → {T, F} maps inputs to either true (T) or false (F) evaluations.
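For a finite training set, the set of rules worth considering is itself finite. The sketch below enumerates candidate rules for one feature under a common convention, assumed here rather than stated in the lecture: midpoints between consecutive distinct values as thresholds for a continuous feature, and each observed category for a categorical feature. The function names and data are hypothetical.

```python
import numpy as np

def threshold_rules(values):
    """Candidate thresholds t for rules x_j <= t on a continuous feature.

    Midpoints between consecutive distinct sorted values are enough:
    any other threshold splits the training data in the same way.
    """
    distinct = np.unique(values)            # sorted, duplicates removed
    return (distinct[:-1] + distinct[1:]) / 2.0

def category_rules(values):
    """Candidate categories t_k for rules x_j == t_k on a categorical feature."""
    return sorted(set(values))

bmi = np.array([0.06, -0.05, 0.04, -0.01, 0.04])   # made-up values
sex = ["F", "M", "M", "F", "M"]                     # made-up values

print(threshold_rules(bmi))   # three midpoints between four distinct values
print(category_rules(sex))
```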
How does the create_rule function choose a new rule r? Given a dataset
$\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$, let’s say that R is the region of a leaf and
$\mathcal{D}_R = \{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in R\}$ is the data for R. We will greedily choose the rule that
minimizes a loss function.
Specifically, we add to the leaf a new rule r : X → {T, F} that minimizes a loss:
$$ \min_{r \in \mathcal{U}} \left( \underbrace{L(\{(x, y) \in \mathcal{D}_R \mid r(x) = \text{T}\})}_{\text{left subtree}} + \underbrace{L(\{(x, y) \in \mathcal{D}_R \mid r(x) = \text{F}\})}_{\text{right subtree}} \right) $$
where L is a loss function over a subset of the data flagged by the rule and U is the set of
possible rules.
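For classification, we can use the misclassification rate as the loss:
$$ L(\mathcal{D}_R) = \frac{1}{|\mathcal{D}_R|} \sum_{(x,y) \in \mathcal{D}_R} \mathbb{I}\{y \neq \text{most-common-y}(\mathcal{D}_R)\}. $$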
At a leaf node with region R, we predict most-common-y(DR), the most common class y
in the data. Notice that this loss incentivizes the algorithm to cluster together points that
have similar class labels.
Other losses that can be used for classification include entropy or the Gini index. These all
optimize for a split in which different classes do not mix.
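The greedy choice of a classification rule can be sketched as a search over thresholds on a single feature, scoring each candidate by the loss on the two sides of the split. The code below is a minimal illustration with toy data, implementing the misclassification rate and the Gini index; the function names are hypothetical and entropy is omitted for brevity.

```python
import numpy as np

def misclassification(y):
    """Fraction of labels that differ from the most common label."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def gini(y):
    """Gini index: 1 - sum_k p_k^2 over the class proportions p_k."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y, loss):
    """Greedy search over rules x <= t for one feature.
    Returns (t, total), where total = loss(left side) + loss(right side)."""
    distinct = np.unique(x)
    candidates = (distinct[:-1] + distinct[1:]) / 2.0
    scored = [(loss(y[x <= t]) + loss(y[x > t]), t) for t in candidates]
    total, t = min(scored)
    return t, total

x = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8])   # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])               # toy class labels
print(best_threshold(x, y, misclassification))
print(best_threshold(x, y, gini))
```

On this toy data both losses select the threshold between 0.3 and 0.6, since that split leaves each side with a single class.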
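For regression, a common choice is the squared error around the average target in the region:
$$ L(\mathcal{D}_R) = \sum_{(x,y) \in \mathcal{D}_R} \left( y - \text{average-y}(\mathcal{D}_R) \right)^2. $$
If this were a leaf node, we would predict average-y(DR), the average y in the data. The
above loss measures the resulting squared error.
Selecting a rule then amounts to solving the following optimization problem:
$$ \min_{r \in \mathcal{U}} \left[ \sum_{(x,y) \in \mathcal{D}_R \mid r(x) = \text{T}} \left( y - p_{\text{true}}(r) \right)^2 + \sum_{(x,y) \in \mathcal{D}_R \mid r(x) = \text{F}} \left( y - p_{\text{false}}(r) \right)^2 \right] $$
where ptrue(r) and pfalse(r) denote the average of y over the points of DR for which r(x) is
true and false, respectively.
Notice that this loss incentivizes the algorithm to cluster together all the data points that
have similar y values into the same region.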
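The regression version of the split search can be sketched in the same way: each candidate threshold is scored by the squared error of the two sides around their own means. This is a minimal single-feature illustration with toy data; the function names are hypothetical.

```python
import numpy as np

def squared_error(y):
    """Sum of squared deviations from the mean of y (0 for an empty set)."""
    if len(y) == 0:
        return 0.0
    return float(np.sum((y - y.mean()) ** 2))

def best_regression_split(x, y):
    """Search rules x <= t for the threshold that minimizes the squared
    error of the two sides around their own means."""
    distinct = np.unique(x)
    candidates = (distinct[:-1] + distinct[1:]) / 2.0
    best_t, best_loss = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        loss = squared_error(left) + squared_error(right)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # toy inputs
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])   # toy targets
t, loss = best_regression_split(x, y)
print(t, loss)
```

The selected threshold separates the two clusters of targets, which is the grouping of similar y values that the loss rewards.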
Finally, we conclude this section with a few additional comments on the above training
procedure:
Nodes are added until the tree reaches a maximum depth or the leaves can’t be split
anymore.
In practice, trees are also often pruned in order to reduce overfitting.
There exist alternative algorithms, including ID3, C4.5, C5.0. See Hastie et al. for
details.
To summarize, combining our definition of trees with the optimization objective and
procedure defined in this section gives a decision tree algorithm that can be used for
both classification and regression and is known as CART (Classification and Regression Trees).
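The stopping and pruning remarks above can be tried out with scikit-learn's tree implementation. The sketch below compares a fully grown regression tree, a depth-limited tree, and a tree pruned with cost-complexity pruning on the diabetes data; the settings `max_depth=3` and `ccp_alpha=50.0` are arbitrary values chosen for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)

# Fully grown tree: nodes are added until the leaves cannot be split further.
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Stop early by capping the depth of the tree.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Cost-complexity pruning: a larger ccp_alpha removes more of the tree.
# The value 50.0 is arbitrary and chosen only for this illustration.
pruned = DecisionTreeRegressor(ccp_alpha=50.0, random_state=0).fit(X, y)

for name, tree in [("full", full), ("shallow", shallow), ("pruned", pruned)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```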
Next, we are going to see a general technique to improve the performance of machine
learning algorithms.
A very expressive model (e.g., a high degree polynomial) fits the training dataset
perfectly.
The model also makes wildly incorrect predictions outside this dataset and doesn’t
generalize.
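def true_fn(X):
    # noiseless target function that the data is sampled from
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 40
# inputs drawn uniformly from [0, 1]; targets are noisy evaluations of true_fn
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1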
All of these models perform poorly on regions outside of the data on which they were
trained.
Even though we trained the same class of model with the same hyperparameters on
each of these subsampled datasets, we obtain a very different trained model for each
subsample.
An algorithm that has a tendency to overfit is also called high-variance, because it outputs a
predictive model that varies a lot if we slightly perturb the dataset.
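One way to see this variance numerically is to fit the same model class on several random subsets of the data and measure how much the predictions disagree. The sketch below does this for a low-degree and a high-degree polynomial using NumPy's `Polynomial.fit`; the degrees, subset size, and number of subsets are illustrative choices, and the data is regenerated here with its own random generator so that the block is self-contained.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

rng = np.random.default_rng(0)
X = np.sort(rng.random(40))
y = true_fn(X) + rng.normal(scale=0.1, size=40)
X_line = np.linspace(0, 1, 100)

def prediction_spread(degree, n_subsets=20, subset_size=30):
    """Average standard deviation of predictions across models,
    each fit on a different random subset of the data."""
    preds = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=False)
        model = Polynomial.fit(X[idx], y[idx], deg=degree)
        preds.append(model(X_line))
    return np.mean(np.std(preds, axis=0))

print("degree 3: ", prediction_spread(3))
print("degree 15:", prediction_spread(15))
```

The higher-degree model should show a noticeably larger spread, which is the sensitivity to perturbations of the dataset that the term high-variance refers to.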
The term ‘bagging’ stands for ‘bootstrap aggregation.’ The data samples that are taken with
replacement are known as bootstrap samples.
The idea of bagging is that the random errors of any given model trained on a specific
bootstrapped sample will ‘cancel out’ if we combine many models trained on different
bootstrapped samples and average their outputs. Combining models in this way is known as
ensembling in ML terminology.
ensemble = []
for i in range(n_models):
    # collect data samples and fit models
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)
We are going to train a large number of polynomial regressions on random subsets of the
dataset of points that we created earlier.
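A minimal sketch of this experiment is given below, assuming polynomial models fit with NumPy's `Polynomial.fit` and bootstrap samples drawn by sampling indices with replacement. The degree, the number of models, and the use of a separate random generator are illustrative choices; the names `X_line`, `y_lines`, and `y_line` follow the plotting code later in this section.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

rng = np.random.default_rng(0)
n_samples, n_models, degree = 40, 100, 12   # illustrative settings
X = np.sort(rng.random(n_samples))
y = true_fn(X) + rng.normal(scale=0.1, size=n_samples)
X_line = np.linspace(X.min(), X.max(), 200)

y_lines = np.zeros((len(X_line), n_models))
for i in range(n_models):
    # bootstrap sample: draw n_samples indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # fix the domain so that every model uses the same input scaling
    model = Polynomial.fit(X[idx], y[idx], deg=degree, domain=[0, 1])
    y_lines[:, i] = model(X_line)

# bagged prediction: average the outputs of all models
y_line = y_lines.mean(axis=1)

mse_each = np.mean((y_lines - true_fn(X_line)[:, None]) ** 2, axis=0)
mse_bagged = np.mean((y_line - true_fn(X_line)) ** 2)
print("average MSE of individual models:", mse_each.mean())
print("MSE of bagged model:             ", mse_bagged)
```

Because the squared error is convex, the error of the averaged prediction can never exceed the average error of the individual models, which is one way to see why averaging helps.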
Let’s visualize the prediction of the bagged model on each random dataset sample and
compare to the predictions from an un-bagged model.
# visualize them
ax.plot(X_line, true_fn(X_line), label="True function")
ax.plot(X_line, y_lines[:,i], label="Model Trained on Samples")
ax.plot(X_line, y_line, label="Bagged Model")
ax.scatter(Xs[i], ys[i], edgecolor='b', s=20, label="Samples",
           alpha=0.2)
ax.set_xlim((0, 1))
ax.set_ylim((-2, 2))
ax.legend(loc="best")
ax.set_title('Random sample %d' % i)
Compared to the un-bagged model, we see that our bagged model varies less from one
sample to the next and also generalizes better to unseen points.
To summarize what we have seen in this section, bagging is a general technique that can be
used with high-variance ML algorithms.
It averages predictions from multiple models trained on random subsets of the data.
Next, let’s see how bagging can be applied to decision trees. This will also provide us with a
new algorithm.
To distinguish random forests from decision trees, we will follow a running example.
Recall that the Iris flower dataset is a classical dataset originally published by R. A. Fisher
in 1936. Nowadays, it’s widely used for demonstrating machine learning algorithms.
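import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset as pandas objects and print its description
iris = datasets.load_iris(as_frame=True)
print(iris.DESCR)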
.. _iris_dataset:
:Summary Statistics:
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
.. topic:: References
The code below will be used to visualize predictions from decision trees on this dataset.
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
from sklearn.tree import DecisionTreeClassifier
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings("ignore")

def make_grid(X):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X.iloc[:, 0].min() - 0.1, X.iloc[:, 0].max() + 0.1
    y_min, y_max = X.iloc[:, 1].min() - 0.1, X.iloc[:, 1].max() + 0.1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    return xx, yy, x_min, x_max, y_min, y_max
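A mesh of this kind can be passed to a fitted classifier to obtain a prediction for every grid point, which is what a decision-region plot colors in. The sketch below is one hypothetical way to do this using the first two Iris features; the helper name `predict_on_grid` is introduced here for illustration and is not the lecture's own plotting helper, and the simplified `make_grid` returns only the mesh.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris(as_frame=True)
X = iris.data.iloc[:, :2]     # first two features (sepal length and width)
y = iris.target

def make_grid(X, step=0.02):
    """Mesh covering the range of the two feature columns of X."""
    x_min, x_max = X.iloc[:, 0].min() - 0.1, X.iloc[:, 0].max() + 0.1
    y_min, y_max = X.iloc[:, 1].min() - 0.1, X.iloc[:, 1].max() + 0.1
    return np.meshgrid(np.arange(x_min, x_max, step),
                       np.arange(y_min, y_max, step))

def predict_on_grid(clf, xx, yy):
    """Hypothetical helper: predicted class for every mesh point."""
    points = np.c_[xx.ravel(), yy.ravel()]
    return clf.predict(points).reshape(xx.shape)

# Fit on a plain array so that predicting on grid points raises no
# feature-name warnings.
clf = DecisionTreeClassifier(random_state=0).fit(X.to_numpy(), y)
xx, yy = make_grid(X)
Z = predict_on_grid(clf, xx, yy)
print(Z.shape, np.unique(Z))
```

The array `Z` has the same shape as the mesh, so it can be handed to a function such as matplotlib's `pcolormesh` or `contourf` to draw the decision regions.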
We see two problems with the output of the decision tree on the Iris dataset:
The decision boundary between the two classes is very non-smooth and blocky.
The decision tree overfits the data and the decision regions are highly fragmented.
Recall that this is called the high-variance problem, because small perturbations of the data
lead to large changes in model predictions.
Consider the performance of a decision tree classifier on 3 random subsets of the data.
clf = DecisionTreeClassifier()
clf.fit(X_random, y_random)
Z = make_2d_preds(clf, X_random)
make_2d_plot(ax, Z, X_random, y_random)
ax.set_title('Random sample %d' % i)
Instantiating our definition of bagging with decision trees, we obtain the following
pseudocode definition of random forests:
random_forest = []
for i in range(n_models):
    # collect data samples and fit models
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = DecisionTree().fit(X_i, y_i)
    random_forest.append(model)
Now, each prediction is the average over the set of bagged decision trees.
# visualize predictions
make_2d_plot(ax, np.rint(Z_avg), X_all, y_all)
The boundaries are much smoother and better behaved than those we saw for individual
decision trees. We also clearly see less overfitting.
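The bagged-trees procedure can be sketched end to end with scikit-learn's `DecisionTreeClassifier`. The code below is an illustration under stated assumptions: it uses all four Iris features, 50 trees, and combines trees by averaging per-class votes and taking the most voted class, which is a swapped-in alternative to rounding an average of predicted labels. Note also that library implementations such as scikit-learn's `RandomForestClassifier` additionally consider only a random subset of features at each split, which this sketch does not do.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_models, n_samples = 50, len(X)   # illustrative settings

random_forest = []
for i in range(n_models):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx])
    random_forest.append(tree)

def forest_predict(forest, X_new, n_classes=3):
    """Average per-class votes over all trees and return the top class."""
    votes = np.zeros((len(X_new), n_classes))
    for tree in forest:
        votes[np.arange(len(X_new)), tree.predict(X_new)] += 1
    return (votes / len(forest)).argmax(axis=1)

y_hat = forest_predict(random_forest, X)
print("training accuracy:", np.mean(y_hat == y))
```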
As with decision trees, random forests require little data preparation (no rescaling is
needed, they handle continuous and discrete features, and they work well for both
classification and regression).
They are often quite accurate.
By Cornell University
© Copyright 2023.