Lecture 15: Tree-Based Algorithms — Applied ML

This lecture focuses on tree-based algorithms in supervised machine learning, specifically decision trees, bagging, and random forests. It explains the intuition behind decision trees, their structure, and how they can be applied to datasets like the UCI diabetes dataset for both regression and classification tasks. The lecture also discusses the pros and cons of decision trees, including their interpretability and potential for overfitting.

Uploaded by

thepalaceartisanfoundation


Lecture 15: Tree-Based Algorithms — Applied ML 20/10/24, 6:23 PM

Lecture 15: Tree-Based Algorithms

Contents
Lecture 15: Tree-Based Algorithms
15.1. Decision Trees
15.2. Learning Decision Trees
15.3. Bagging
15.4. Random Forests

In this lecture, we will cover a new class of supervised machine learning models, namely
tree-based models, which can be used for both regression and classification tasks.

We start by building some intuition about decision trees and apply this algorithm to a familiar
dataset we have already seen.

15.1.1. Intuition
Decision trees are machine learning models that mimic how a human would approach a
prediction problem.

1. We start by picking a feature (e.g., age).
2. Then we branch on the feature based on its value (e.g., age > 65?).
3. We select and branch on one or more additional features (e.g., is it a man?).
4. Then we return an output that depends on all the features we’ve seen (e.g., a man over 65).
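The four steps above amount to a chain of nested if/else tests. Here is a minimal hand-written sketch; the function name, the threshold, and the returned labels are invented for illustration and are not learned from any data.

```python
def predict_risk(age, is_man):
    """Hand-written decision tree; the threshold and outputs are illustrative only."""
    if age > 65:              # steps 1-2: pick a feature and branch on its value
        if is_man:            # step 3: branch on another feature
            return "high"     # step 4: the output depends on every test along the path
        return "medium"
    return "low"


print(predict_risk(age=70, is_man=True))
print(predict_risk(age=40, is_man=False))
```

A learned decision tree has the same shape; the difference is that the features, thresholds, and outputs are chosen from data rather than by hand.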

15.1.2. Decision Trees: Example

Let’s see an example on the diabetes dataset.

15.1.2.1. Review: Components of A Supervised Machine Learning Problem

Recall that a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_{\text{Attributes + Features}} + \underbrace{\text{Learning Algorithm}}_{\text{Model Class + Objective + Optimizer}} \to \text{Predictive Model}$$

15.1.2.2. The UCI Diabetes Dataset

To explain what a decision tree is, we are going to use the UCI diabetes dataset that we
worked with earlier.

Let’s start by loading this dataset and looking at some rows from the data.

https://kuleshov-group.github.io/aml-book/contents/lecture15-decision-trees.html Page 1 of 17

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
from sklearn import datasets

# Load the diabetes dataset

diabetes = datasets.load_diabetes(as_frame=True)
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times `n_samples` (i.e. the sum of squares of each
column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

# Separate the features from the target

diabetes_X, diabetes_y = diabetes.data, diabetes.target

# create a binary risk label (1 if disease progression exceeds 150, else 0)

diabetes_y_risk = diabetes_y.copy()
diabetes_y_risk[:] = 0
diabetes_y_risk[diabetes_y > 150] = 1

# Print part of the dataset

diabetes_X.head()

age sex bmi bp s1 s2 s3 s4 s5 s6

0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646

1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204

2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930

3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362

4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641


15.1.2.3. Using sklearn to train a Decision Tree Classifier

We will train a decision tree using its implementation in sklearn. We will import
DecisionTreeClassifier from sklearn, fit it on the UCI Diabetes Dataset, and visualize the
tree using the plot_tree function from sklearn.

from matplotlib import pyplot as plt

from sklearn.tree import DecisionTreeClassifier, plot_tree

# create and fit the model

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(diabetes_X.iloc[:,:4], diabetes_y_risk)

# visualize the model

plot_tree(clf, feature_names=diabetes_X.columns[:4], impurity=False);

Let’s break down this visualization a bit more. The visualized output above depicts the
process by which our trained decision tree classifier makes predictions. Each of the non-leaf
nodes in this diagram corresponds to a decision that the algorithm makes based on the data,
or alternatively a rule that it applies. Namely, for a new data point, starting at the root of this
tree, the classifier looks at a specific feature and asks whether the value of this feature is
above or below some threshold. If it is at or below the threshold, we proceed down the tree to
the left, and if it is above, we proceed to the right. We continue this process at each depth of
the tree until we reach one of the leaf nodes (of which we have four in this example).

Let’s follow these decisions down to one of the terminal leaf nodes to see how this works.
Starting at the top, the algorithm asks whether a patient has high or low bmi (defined by the
threshold 0.009). For the high bmi patients, we proceed down the right and look at their
blood pressure. For patients with high blood pressure (values exceeding 0.017), the
algorithm has identified a ‘high’ risk sub-population of our dataset, as we see that 76 out of
the 86 patients that fall into this category have high risk of diabetes.
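The same walk from root to leaf can be printed as text. The sketch below refits the tree from scratch so that it is self-contained; the `random_state=0` argument and the choice of the first patient are our own assumptions, and the exact thresholds printed may differ slightly across sklearn versions.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the classifier on the first four features (age, sex, bmi, bp)
diabetes = datasets.load_diabetes(as_frame=True)
X = diabetes.data.iloc[:, :4]
y_risk = (diabetes.target > 150).astype(int)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y_risk)

# Text rendering of the rules: one line per test, indented by depth
rules = export_text(clf, feature_names=list(X.columns))
print(rules)

# Trace the nodes that a single patient passes through
patient = X.iloc[[0]]
node_ids = clf.decision_path(patient).indices
print("nodes visited:", list(node_ids), "-> predicted class:", clf.predict(patient)[0])
```

Each indented line in the printed rules corresponds to one branch of the plotted tree, and the visited node ids list the internal nodes and the final leaf for that patient.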

15.1.3. Decision Trees: Formal Definition

Let’s now define a decision tree a bit more formally.

15.1.3.1. Decision Rules

To do so, we introduce a core concept for decision trees, that of decision rules:

A decision rule r : X → {True, False} is a partition of the feature space into two
disjoint regions, e.g.:

$$r(x) = \begin{cases} \text{True} & \text{if } x_\text{bmi} \le 0.009 \\ \text{False} & \text{if } x_\text{bmi} > 0.009 \end{cases}$$

Normally, a rule applies to only one feature or attribute xj of x.


If xj is continuous, the rule normally separates inputs xj into disjoint intervals
(−∞, c], (c, ∞).

In theory, we could have more complex rules with multiple thresholds, but this would add
unnecessary complexity to the algorithm, since we can simply break rules with multiple
thresholds into several sequential rules.

With this definition of decision rules, we can now more formally specify decision trees as
(usually binary) trees, where:

Each internal node n corresponds to a rule rn

The j-th edge out of n is associated with a rule value vj, and we follow the j-th edge if
rn(x) = vj
Each leaf node l contains a prediction f(l)
Given input x, we start at the root, apply its rule, follow the edge that corresponds to
the outcome, and repeat recursively.
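This definition maps directly onto a small recursive data structure. The following is a sketch under our own naming (`Node`, `predict`); the thresholds come from the diabetes example above, while the leaf labels other than the high bmi / high bp leaf are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Node:
    rule: Optional[Callable[[dict], bool]] = None   # r_n, set for internal nodes only
    children: Optional[Dict[bool, "Node"]] = None   # rule value v_j -> child along edge j
    prediction: Any = None                          # f(l), set for leaves only


def predict(node, x):
    """Start at the given node, apply its rule, follow the matching edge, and recurse."""
    if node.rule is None:          # leaf: return its prediction
        return node.prediction
    return predict(node.children[node.rule(x)], x)


tree = Node(
    rule=lambda x: x["bmi"] <= 0.009,
    children={
        True: Node(prediction="low risk"),            # placeholder label
        False: Node(
            rule=lambda x: x["bp"] <= 0.017,
            children={
                True: Node(prediction="low risk"),    # placeholder label
                False: Node(prediction="high risk"),
            },
        ),
    },
)

print(predict(tree, {"bmi": 0.05, "bp": 0.03}))
```

Because every rule is binary here, each internal node has exactly two outgoing edges, one for True and one for False.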

plot_tree(clf, feature_names=diabetes_X.columns[:4], impurity=False);

Returning to the visualization from our diabetes example, we have 3 internal nodes
corresponding to the top two layers of the tree and four leaf nodes at the bottom of the tree.

15.1.3.2. Decision Regions

The next important concept in decision trees is that of decision regions.

Decision trees partition the space of features into regions:

A decision region R ⊆ X is a subset of the feature space defined by the application of
a set of rules r1, r2, … , rm and their values v1, v2, … , vm ∈ {True, False}, i.e.:

R = {x ∈ X ∣ r1(x) = v1 and … and rm(x) = vm}

For example, a decision region in the diabetes problem is:

R = {x ∈ X ∣ xbmi ≤ 0.009 and xbp > 0.004}

These regions correspond to the leaves of a decision tree.
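One way to see that the leaves partition the data is to ask the fitted tree which leaf each training point lands in. This sketch refits the depth-2 classifier (with `random_state=0` as our own assumption) and uses sklearn's `apply` method, which returns the leaf index of each sample.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

diabetes = datasets.load_diabetes(as_frame=True)
X = diabetes.data.iloc[:, :4]
y_risk = (diabetes.target > 150).astype(int)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y_risk)

# Every patient falls into exactly one leaf, i.e. one decision region
leaf_ids = clf.apply(X)
regions, counts = np.unique(leaf_ids, return_counts=True)
for leaf, count in zip(regions, counts):
    print(f"leaf {leaf}: {count} patients")
print("total:", counts.sum())
```

The counts add up to the size of the dataset because the regions are disjoint and together cover the whole feature space.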

We can illustrate decision regions via this figure from Hastie et al.


The illustrations are as follows:

Top left: regions that cannot be represented by a tree

Top right: regions that can be represented by a tree
Bottom left: tree generating the top right regions
Bottom right: function values assigned to the regions

Setting aside the top left image for a moment, the other 3 depictions from Hastie et al. show
how we can visualize decision trees. For an input space of two features X1, X2, we first have
the actual tree, similar to what we’ve seen thus far (bottom left). This tree has a maximum
depth of 3 and first splits the inputs along the X1 dimension. On the left branch we have one
more internal decision node based on values of X2 resulting in 2 leaf nodes that each
correspond to a region of the input space. On the right branch of the tree, we see two more
decision nodes that result in an additional 3 leaf nodes, each corresponding to a different
region. Note that as mentioned above, rather than using more complicated decision rules
such as t1 < X1 ≤ t3, we can separate this rule into two sequential and simpler decisions,
i.e., the root node and the first decision node to its right.

We can equivalently view this decision tree in terms of how it partitions the input space (top
right image). Note that each of the partition lines on the X1 and X2 axes corresponds to one of
our decision nodes, and the partitioned regions correspond to the leaf nodes in the tree
depicted in the bottom left image.

Finally, if each region corresponds to a predicted value for our target, then we can draw
these decision regions with their corresponding prediction values on a third axis (bottom
right image).

Returning to the top left image, this image actually depicts a partition of the input space that
would not be possible using decision trees, which highlights the point that not every partition
corresponds to a decision tree. Specifically, for this image the central region that is non-
rectangular cannot be created from a simple branching procedure of a decision tree.

15.1.3.3. Decision Trees

With the concept of regions, we can define a decision tree as a model f : X → Y of the
form

$$f(x) = \sum_{R \in \mathcal{R}} y_R \, \mathbb{I}\{x \in R\}.$$

The I{⋅} is an indicator function (one if {⋅} is true, else zero) and values yR ∈ Y are
the outputs for that region.


The set R is a collection of decision regions. They are obtained by a recursive learning
procedure (recursive binary splitting).
The rules defining the regions R can be organized into a tree, with one rule per internal
node and regions being the leaves.
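We can check this formula numerically on a regression tree. In the sketch below (our own construction, not part of the lecture code), the regions are the leaves of a depth-2 `DecisionTreeRegressor`, each output y_R is taken to be the mean training target in that leaf, and the sum of indicator terms is compared with the library's own predictions.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each leaf index identifies one region R; y_R is the mean target inside it
leaf_of = reg.apply(X)
regions = np.unique(leaf_of)
y_R = {R: y[leaf_of == R].mean() for R in regions}

# f(x) = sum over regions of y_R * I{x in R}; exactly one indicator is 1 per point
f_x = sum(y_R[R] * (leaf_of == R) for R in regions)

print("matches reg.predict:", np.allclose(f_x, reg.predict(X)))
```

Since the regions are disjoint, exactly one term of the sum is nonzero for any input, so f(x) simply looks up the output of the region containing x.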

15.1.4. Pros and Cons of Decision Trees

Decision trees are important models in machine learning:

They are highly interpretable.
They require little data preparation (no rescaling, handle continuous and discrete features).
They can be used for classification and regression equally well.

Their main disadvantages are that:

If they stay small and interpretable, they are not as powerful.

If they are large, they easily overfit and are hard to regularize.

We saw how decision trees are represented. How do we now learn them from data?

At a high level, decision trees are grown by adding nodes one at a time, as shown in the
following pseudocode:

def build_tree(tree):
    # grow the tree greedily, adding one rule (internal node) at a time
    while not tree.is_complete():
        leaf, leaf_data = tree.get_leaf()
        new_rule = create_rule(leaf_data)
        tree.append_rule(leaf, new_rule)
    return tree

Most often, we build the tree until it reaches a maximum number of nodes. The crux of the
algorithm is in create_rule.

15.2.1. Learning New Decision Rules

What is the set of possible rules that create_rule can add to the tree?

When x has continuous features, the rules have the following form:

$$r(x) = \begin{cases} \text{True} & \text{if } x_j \le t \\ \text{False} & \text{if } x_j > t \end{cases}$$

for a feature index j and threshold t ∈ R.

When x has categorical features, rules may have the following form:

$$r(x) = \begin{cases} \text{True} & \text{if } x_j = t_k \\ \text{False} & \text{if } x_j \ne t_k \end{cases}$$

for a feature index j and possible value tk for xj.

Thus each rule r : X → {T, F} maps inputs to either true (T) or false (F) evaluations.

How does the create_rule function choose a new rule r? Given a dataset
$\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$, let’s say that R is the region of a leaf and
$\mathcal{D}_R = \{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in R\}$ is the data for R. We will greedily choose the rule that
minimizes a loss function.

Specifically, we add to the leaf a new rule r : X → {T, F} that minimizes a loss:

$$\min_{r \in \mathcal{U}} \left( \underbrace{L(\{(x, y) \in \mathcal{D}_R \mid r(x) = \text{T}\})}_{\text{left subtree}} + \underbrace{L(\{(x, y) \in \mathcal{D}_R \mid r(x) = \text{F}\})}_{\text{right subtree}} \right)$$

where L is a loss function over a subset of the data flagged by the rule and U is the set of
possible rules.
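For continuous features, this minimization can be carried out by brute force over every feature and every observed threshold. The sketch below is one possible `create_rule`, written under our own assumptions: it takes NumPy arrays rather than a `leaf_data` object, and it uses the number of misclassified points (a count rather than a rate) as the loss L so that the two terms can be added directly.

```python
import numpy as np


def misclassified(y):
    """Number of points that disagree with the most common label (0 for an empty set)."""
    if len(y) == 0:
        return 0
    _, counts = np.unique(y, return_counts=True)
    return len(y) - counts.max()


def create_rule(X, y, loss=misclassified):
    """Return (feature index j, threshold t, total loss) for the best rule x_j <= t."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # the largest value is skipped because it would leave the right side empty
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            total = loss(y[left]) + loss(y[~left])   # L(left subtree) + L(right subtree)
            if total < best[2]:
                best = (j, t, total)
    return best


X = np.array([[1.0, 5.0], [2.0, 1.0], [10.0, 4.0], [11.0, 2.0]])
y = np.array([0, 0, 1, 1])
j, t, total = create_rule(X, y)
print("feature:", j, "threshold:", t, "loss:", total)
```

On this toy data the first feature separates the two classes perfectly, so the search selects it with zero loss. Practical implementations usually sort each feature once and update the loss incrementally instead of recomputing it for every threshold.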

15.2.2. Objectives for Trees

What loss functions might we want to use when training a decision tree?

15.2.2.1. Objectives for Trees: Classification

In classification, we can use the misclassification rate

$$L(\mathcal{D}_R) = \frac{1}{|\mathcal{D}_R|} \sum_{(x,y) \in \mathcal{D}_R} \mathbb{I}\{y \ne \text{most-common-y}(\mathcal{D}_R)\}.$$

At a leaf node with region R, we predict most-common-y(DR), the most common class y
in the data. Notice that this loss incentivizes the algorithm to cluster together points that
have similar class labels.

A few other perspectives on the classification objective above:

The above loss measures the resulting misclassification error.
For ‘optimal’ leaves, all data points have the same class (hence this loss will separate the
classes well), and the leaves are maximally pure/accurate.
This loss can therefore also be seen as a measure of leaf ‘purity’.

Other losses that can be used for classification include entropy or the Gini index. These all
optimize for a split in which different classes do not mix.
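To compare these purity measures side by side, here is a small sketch that evaluates each of them on a few hand-made label sets. The function names are our own, and the Gini index and entropy are written in their standard forms (one minus the sum of squared class proportions, and the base-2 Shannon entropy of the class proportions).

```python
import numpy as np


def class_proportions(y):
    """Fraction of points in each class that is present in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()


def misclassification_rate(y):
    return 1.0 - class_proportions(y).max()


def gini(y):
    p = class_proportions(y)
    return 1.0 - np.sum(p ** 2)


def entropy(y):
    p = class_proportions(y)   # only observed classes, so p > 0 and log2 is defined
    return -np.sum(p * np.log2(p))


for labels in ([0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]):
    print(labels,
          "misclassification:", round(misclassification_rate(labels), 3),
          "gini:", round(gini(labels), 3),
          "entropy:", round(entropy(labels), 3))
```

All three measures are zero for a pure leaf and largest when the classes are evenly mixed, which is why they can be used interchangeably as split objectives.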

15.2.2.2. Objectives for Trees: Regression

In regression, it is common to minimize the L2 error between the data and the single best
prediction we can make on this data:

$$L(\mathcal{D}_R) = \sum_{(x,y) \in \mathcal{D}_R} \left(y - \text{average-y}(\mathcal{D}_R)\right)^2.$$

If this was a leaf node, we would predict average-y(DR), the average y in the data. The
above loss measures the resulting squared error.

This yields the following optimization problem for selecting a rule:

$$\min_{r \in \mathcal{U}} \left[ \sum_{(x,y) \in \mathcal{D}_R \mid r(x) = \text{T}} \left(y - p_\text{true}(r)\right)^2 + \sum_{(x,y) \in \mathcal{D}_R \mid r(x) = \text{F}} \left(y - p_\text{false}(r)\right)^2 \right]$$

where $p_\text{true}(r) = \text{average-y}(\{(x, y) \mid (x, y) \in \mathcal{D}_R \text{ and } r(x) = \text{True}\})$ and
$p_\text{false}(r) = \text{average-y}(\{(x, y) \mid (x, y) \in \mathcal{D}_R \text{ and } r(x) = \text{False}\})$ are the average
predictions on each part of the data split.

Notice that this loss incentivizes the algorithm to cluster together all the data points that
have similar y values into the same region.
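As a sanity check on this objective, we can search for the best single split by brute force and compare it with a depth-1 regression tree from sklearn. The synthetic step-function data and the helper name `split_sse` are our own assumptions; thresholds are taken as midpoints between consecutive sorted inputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data: a noisy step function with a jump at x = 0.5
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=30))
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=30)


def split_sse(t):
    """Squared error when each side of the split predicts its own average y."""
    left, right = y[x <= t], y[x > t]
    return np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)


# Brute force over all midpoints between neighbouring points
candidates = (x[:-1] + x[1:]) / 2
best_t = candidates[np.argmin([split_sse(t) for t in candidates])]

# A depth-1 tree (a 'stump') learns exactly one rule
stump = DecisionTreeRegressor(max_depth=1).fit(x[:, None], y)
sklearn_t = stump.tree_.threshold[0]

print("brute-force threshold:", best_t, "loss:", split_sse(best_t))
print("sklearn threshold:    ", sklearn_t, "loss:", split_sse(sklearn_t))
```

Both searches should arrive at a split with the same squared error, located near the jump in the data, since points on either side of the jump have similar y values.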

15.2.2.3. Other Practical Considerations

Finally, we conclude this section with a few additional comments on the above training
procedure:

Nodes are added until the tree reaches a maximum depth or the leaves can’t be split
anymore.
In practice, trees are also often pruned in order to reduce overfitting.
There exist alternative algorithms, including ID3, C4.5, C5.0. See Hastie et al. for
details.

15.2.4. Algorithm: Classification and Regression Trees (CART)

To summarize and combine our definition of trees with the optimization objective and
procedure defined in this section, we have a decision tree algorithm that can be used for
both classification and regression and is known as CART.

Type: Supervised learning (regression and classification).

Model family: Decision trees.
Objective function: Squared error, misclassification error, Gini index, etc.
Optimizer: Greedy addition of rules, followed by pruning.
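To make the pruning step concrete, the sketch below uses the cost-complexity pruning that sklearn exposes through the `ccp_alpha` parameter. This is one particular pruning method and may not be the exact procedure intended above; the choice of a mid-range alpha is arbitrary and only for illustration.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_diabetes(return_X_y=True)
y_risk = (y > 150).astype(int)

# Grow a full tree, then list the alphas at which subtrees get pruned away
full = DecisionTreeClassifier(random_state=0).fit(X, y_risk)
path = full.cost_complexity_pruning_path(X, y_risk)

# Pick a mid-range alpha (arbitrary choice); clip at 0 to guard against rounding
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y_risk)

print("leaves before pruning:", full.get_n_leaves())
print("leaves after pruning: ", pruned.get_n_leaves())
```

Larger values of alpha remove more of the tree; in practice alpha would be chosen by cross-validation rather than picked from the middle of the path.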

Next, we are going to see a general technique to improve the performance of machine
learning algorithms.

We will then apply it to decision trees to define an improved algorithm.

15.3.1. Overfitting and High-Variance

Recall that overfitting is one of the most common failure modes of machine learning.

A very expressive model (e.g., a high degree polynomial) fits the training dataset
perfectly.
The model also makes wildly incorrect predictions outside this dataset and doesn’t
generalize.

To demonstrate overfitting, we return to a previous example, in which we take random
samples around some ‘true’ function relationship y = f(x).

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 40
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1

X_test = np.linspace(0, 1, 100)

plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples");


15.3.1.1. Fitting High-Degree Polynomials

Let’s see what happens if we fit a high degree polynomial to random samples of 20 points
from this dataset. That is, we fit the same degree polynomial on different random draws of
20 points from our original dataset of 40 points and visualize the resulting trained polynomial
model.

n_plots, X_line = 3, np.linspace(0, 1, 20)

plt.figure(figsize=(14, 5))
for i in range(n_plots):
    ax = plt.subplot(1, n_plots, i + 1)
    # draw 20 indices with replacement; because X is sorted, indices 0-19
    # cover only the 20 smallest x values
    random_idx = np.random.randint(0, 20, size=(20,))
    X_random, y_random = X[random_idx], y[random_idx]

    # fit a degree-6 polynomial regression to this sample
    polynomial_features = PolynomialFeatures(degree=6, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X_random[:, np.newaxis], y_random)

    # plot the true function, the fitted model, and the sampled points
    ax.plot(X_line, true_fn(X_line), label="True function")
    ax.plot(X_line, pipeline.predict(X_line[:, np.newaxis]), label="Model")
    ax.scatter(X_random, y_random, edgecolor='b', s=20, label="Samples", alpha=0.2)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title('Random sample %d' % i)

Some things to notice from these plots:

All of these models perform poorly on regions outside of data on which they were
trained.
Even though we trained the same class of model with the same hyperparameters on
each of these subsampled datasets, we attain very different realizations of trained
model for each subsample.

15.3.1.2 High-Variance Models

This phenomenon seen above is also known as the high variance problem, which can be
summarized as follows: each small subset on which we train yields a very different model.


An algorithm that has a tendency to overfit is also called high-variance, because it outputs a
predictive model that varies a lot if we slightly perturb the dataset.

15.3.2. Bagging: Bootstrap Aggregation

The idea of bagging is to reduce model variance by averaging many models trained on
random subsets of the data.

The term ‘bagging’ stands for ‘bootstrap aggregation.’ The data samples that are taken with
replacement are known as bootstrap samples.

The idea of bagging is that the random errors of any given model trained on a specific
bootstrapped sample, will ‘cancel out’ if we combine many models trained on different
bootstrapped samples and average their outputs, which is known as ensembling in ML
terminology.

In pseudocode, this can be performed as follows:

ensemble = []
for i in range(n_models):
    # collect data samples and fit models
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = np.mean([model.predict(x_test) for model in ensemble], axis=0)

There exist a few closely related techniques to bagging:

Pasting is when samples are taken without replacement.

Random features are when we randomly sample the features.
Random patching is when we do both of the above.
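In sklearn, these variants can be expressed as different settings of a single bagging estimator. The sketch below is an illustration on the diabetes data with decision trees as the base model; the fractions (0.5) and the number of estimators are arbitrary choices of ours, and the printed scores are on the training set only.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)

# bootstrap=True samples rows with replacement; max_features < 1.0 samples columns
variants = {
    "bagging": dict(bootstrap=True, max_samples=0.5),
    "pasting": dict(bootstrap=False, max_samples=0.5),
    "random features": dict(bootstrap=False, max_samples=1.0, max_features=0.5),
    "random patching": dict(bootstrap=False, max_samples=0.5, max_features=0.5),
}

scores = {}
for name, kwargs in variants.items():
    model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25,
                             random_state=0, **kwargs)
    model.fit(X, y)
    scores[name] = model.score(X, y)   # R^2 on the training data
    print(f"{name}: training R^2 = {scores[name]:.2f}")
```

Training scores say little about generalization; a held-out set or cross-validation would be needed to compare the variants fairly.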

15.3.2.1. Example: Bagged Polynomial Regression

Let’s apply bagging to our polynomial regression problem.

We are going to train a large number of polynomial regressions on random subsets of the
dataset of points that we created earlier.

We start by training an ensemble of bagged models, essentially implementing the
pseudocode we saw above.

n_models, n_subset = 10000, 30

ensemble, Xs, ys = [], [], []
for i in range(n_models):
    # take a random subset of the data (with replacement); because X is sorted,
    # indices 0..n_subset-1 cover only the n_subset smallest x values
    random_idx = np.random.randint(0, n_subset, size=(n_subset,))
    X_random, y_random = X[random_idx], y[random_idx]

    # train a polynomial regression model
    polynomial_features = PolynomialFeatures(degree=6, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X_random[:, np.newaxis], y_random)

    # add it to our set of bagged models
    ensemble += [pipeline]
    Xs += [X_random]
    ys += [y_random]


Let’s visualize the prediction of the bagged model on each random dataset sample and
compare to the predictions from an un-bagged model.

n_plots, X_line = 3, np.linspace(0, 1, 25)

# generate average predictions (computed once, since they do not depend on the plot)
y_lines = np.zeros((25, n_models))
for j, model in enumerate(ensemble):
    y_lines[:, j] = model.predict(X_line[:, np.newaxis])
y_line = y_lines.mean(axis=1)

plt.figure(figsize=(14, 5))
for i in range(n_plots):
    ax = plt.subplot(1, n_plots, i + 1)

    # visualize the i-th individual model against the bagged average
    ax.plot(X_line, true_fn(X_line), label="True function")
    ax.plot(X_line, y_lines[:, i], label="Model Trained on Samples")
    ax.plot(X_line, y_line, label="Bagged Model")
    ax.scatter(Xs[i], ys[i], edgecolor='b', s=20, label="Samples", alpha=0.2)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title('Random sample %d' % i)

Compared to the un-bagged model, we see that our bagged model varies less from one
sample to the next and also better generalizes to unseen points in the sample.

To summarize what we have seen in this section, bagging is a general technique that can be
used with high-variance ML algorithms.

It averages predictions from multiple models trained on random subsets of the data.

Next, let’s see how bagging can be applied to decision trees. This will also provide us with a
new algorithm.

15.4.1. Motivating Random Forests with Decision Trees

To motivate the random forests algorithm, we will see pitfalls of decision trees using the
language of high-variance models that we just defined.

To distinguish random forests from decision trees, we will follow a running example.

15.4.1.1. Review: Iris Flowers Classification Dataset

We will re-use the Iris flowers dataset, which we have previously seen.

Recall that this is a classical dataset originally published by R. A. Fisher in 1936. Nowadays,
it’s widely used for demonstrating machine learning algorithms.


import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset

iris = datasets.load_iris(as_frame=True)

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset

--------------------

Data Set Characteristics:

:Number of Instances: 150 (50 in each of three classes)

:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================

Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None

:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is
taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the

pattern recognition literature. Fisher's paper is a classic in the field
and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to
a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"

Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene
Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE
Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...


# print part of the dataset

iris_X, iris_y = iris.data, iris.target
pd.concat([iris_X, iris_y], axis=1).head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

# Plot the data points, colored by class

p1 = plt.scatter(iris_X.iloc[:, 0], iris_X.iloc[:, 1], c=iris_y, s=50,
                 cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(handles=p1.legend_elements()[0],
           labels=['Setosa', 'Versicolour', 'Virginica'],
           loc='lower right');

15.4.1.2. Decision Trees on the Flower Dataset

Let’s now consider what happens when we train a decision tree on the Iris flower dataset.

The code below will be used to visualize predictions from decision trees on this dataset.


# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
from sklearn.tree import DecisionTreeClassifier
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings("ignore")

def make_grid(X):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X.iloc[:, 0].min() - 0.1, X.iloc[:, 0].max() + 0.1
    y_min, y_max = X.iloc[:, 1].min() - 0.1, X.iloc[:, 1].max() + 0.1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    return xx, yy, x_min, x_max, y_min, y_max

def make_2d_preds(clf, X):
    # Predict the class at every point of the mesh
    xx, yy, x_min, x_max, y_min, y_max = make_grid(X)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return Z

def make_2d_plot(ax, Z, X, y):
    # Create color maps
    cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
    cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

    xx, yy, x_min, x_max, y_min, y_max = make_grid(X)

    # Put the result into a color plot
    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=plt.cm.Paired,
               edgecolor='k', s=50)
    ax.set_xlabel('Sepal Length')
    ax.set_ylabel('Sepal Width')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())

We may now train and visualize a decision tree on this dataset.

# Train a Decision Tree Model

ax = plt.gca()
X = iris_X.iloc[:,:2]
clf = DecisionTreeClassifier()
clf.fit(X, iris_y)
Z = make_2d_preds(clf, X)
make_2d_plot(ax, Z, X, iris_y)

We see two problems with the output of the decision tree on the Iris dataset:

The decision boundary between the classes is very non-smooth and blocky.
The decision tree overfits the data and the decision regions are highly fragmented.

15.4.1.3. High-Variance Decision Trees

When the trees have sufficiently large depth, they can quickly overfit the data.


Recall that this is called the high-variance problem, because small perturbations of the data
lead to large changes in model predictions.

Consider the performance of a decision tree classifier on 3 random subsets of the data.

n_plots, n_flowers, n_samples = 3, iris_X.shape[0], 40

plt.figure(figsize=(14, 5))
for i in range(n_plots):
    ax = plt.subplot(1, n_plots, i + 1)
    # draw n_samples flowers at random (with replacement)
    random_idx = np.random.randint(0, n_flowers, size=(n_samples,))
    X_random, y_random = iris_X.iloc[random_idx, :2], iris_y[random_idx]

    clf = DecisionTreeClassifier()
    clf.fit(X_random, y_random)
    Z = make_2d_preds(clf, X_random)
    make_2d_plot(ax, Z, X_random, y_random)
    ax.set_title('Random sample %d' % i)

15.4.2 Random Forests

In order to reduce the variance of the basic decision tree, we apply bagging – the variance
reduction technique that we have seen earlier.

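We refer to bagged decision trees as Random Forests. (Standard random forest
implementations additionally restrict each split to a random subset of the features, which
further decorrelates the trees; here we focus on the bagging component.)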

Instantiating our definition of bagging with decision trees, we obtain the following
pseudocode definition of random forests:

random_forest = []
for i in range(n_models):
    # collect data samples and fit models
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = DecisionTree().fit(X_i, y_i)
    random_forest.append(model)

# output average prediction at test time:
y_test = np.mean([model.predict(x_test) for model in random_forest], axis=0)

We may implement random forests in Python as follows:

n_models, n_flowers, n_subset = 300, iris_X.shape[0], 10

random_forest = []
for i in range(n_models):
    # sample the data with replacement
    random_idx = np.random.randint(0, n_flowers, size=(n_subset,))
    X_random, y_random = iris_X.iloc[random_idx, :2], iris_y[random_idx]

    # train a decision tree model
    clf = DecisionTreeClassifier()
    clf.fit(X_random, y_random)

    # append it to our ensemble
    random_forest += [clf]

15.4.2.1 Random Forests on the Flower Dataset

Consider now what happens when we deploy random forests on the same dataset as before.

Each prediction is now the average over the set of bagged decision trees.

# Visualize predictions from a random forest
ax = plt.gca()

# compute average predictions from all the models in the ensemble
X_all, y_all = iris_X.iloc[:, :2], iris_y
Z_list = []
for clf in random_forest:
    Z_clf = make_2d_preds(clf, X_all)
    Z_list.append(Z_clf)
Z_avg = np.stack(Z_list, axis=2).mean(axis=2)

# visualize predictions, rounding the averaged class indices to the nearest label
# (a simplification: with more than two classes, a majority vote is the usual choice)
make_2d_plot(ax, np.rint(Z_avg), X_all, y_all)

The boundaries are much smoother and better behaved than those we saw for individual
decision trees. We also clearly see less overfitting.
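
Rounding the mean of integer class indices, as `np.rint(Z_avg)` does above, treats the labels as if they were ordered numbers. With three classes, an ensemble split between class 0 and class 2 can then round to class 1, which no tree predicted. A minimal sketch of the more common alternative, a majority vote, is shown below; the `majority_vote` helper and the toy predictions are hypothetical and not part of the lecture.

```python
import numpy as np

def majority_vote(preds, n_classes):
    """preds: (n_models, n_points) integer labels -> (n_points,) most voted label."""
    # counts[c, j] = number of models that predicted class c for point j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)  # ties go to the lowest class index

# toy predictions from 5 hypothetical models on 2 points
preds = np.array([[0, 1],
                  [0, 1],
                  [2, 1],
                  [2, 1],
                  [2, 0]])

avg_label = np.rint(preds.mean(axis=0)).astype(int)
vote_label = majority_vote(preds, n_classes=3)
print("rounded average:", avg_label)
print("majority vote:  ", vote_label)
```

For the first point, the rounded average yields class 1 even though three of the five models chose class 2; the vote returns class 2.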

15.4.3 Algorithm: Random Forests

Summarizing random forests, we have:

Type: Supervised learning (regression and classification).

Model family: Bagged decision trees.
Objective function: Squared error, misclassification error, Gini index, etc.
Optimizer: Greedy addition of rules, followed by pruning.
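
In practice, the ensemble is rarely assembled by hand. The following is a sketch using scikit-learn's `RandomForestClassifier`, assuming scikit-learn is installed; `load_iris`, the 70/30 split, and the hyperparameter values are illustrative choices made here, not settings from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: load_iris stands in for the lecture's iris_X / iris_y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[:, :2], y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=300,     # number of bagged trees
    criterion="gini",     # split quality measure (Gini index)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than by averaging class indices.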

15.4.4 Pros and Cons of Random Forests

Random forests remain a popular machine learning algorithm:

As with decision trees, they require little data preparation (no rescaling, handle
continuous and discrete features, work well for classification and regression).
They are often quite accurate.

Their main disadvantages are that:

They are not interpretable due to bagging (unlike decision trees).

They do not work well with unstructured data (e.g., images, audio).

By Cornell University
© Copyright 2023.
