
Knowledge Discovery and Data Mining

Lecture 11 - Tree methods - Introduction

Tom Kelsey

School of Computer Science


University of St Andrews
http://tom.host.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk

Tom Kelsey ID5059-11-TM 11th Jan 2021 1 / 49


Recap

y = f (X) + e

Covariates/attributes/predictor variables are columns of X


can be ordinal, categorical and/or numeric data
The response y can be a class or a number
Analysed several workflows
Covered the overfit/underfit tradeoff (regularisation)
Looked at model fit measures for classifiers...
...and how to balance precision & recall
Covered in detail how to solve regression problems
analytically using the normal equation
iteratively using variations on gradient descent

Tom Kelsey ID5059-11-TM 11th Jan 2021 2 / 49


Tree methods

We now start looking in more detail at the actual methods that


the code has been calling.
Start with a relatively straightforward model class: Trees
Need-to-knows
1 How recursive binary partitioning of Rp works.
2 How to sketch a partitioning of R2 on the basis of a series of
simple binary splits.
3 How to go from a series of binary splitting rules to a tree
representation and vice versa.

Tom Kelsey ID5059-11-TM 11th Jan 2021 3 / 49


Worked example – the Titanic

Take data about passengers on the Titanic


Covariates in X are age, sex, fare, socio-economic status, and
number of siblings/spouses aboard
a typical mix of categorical and numeric data
we don’t care about distributions, variance, etc.
There are others (parents/children aboard, port of
embarkation, cabin), but I don’t think these will help
The response y is binary: 1 for survival, 0 for death
Our function f (X, θ) will be a tree (or a set of if-then-else clauses),
and error e will be percentage of incorrect classifications.

Full details, along with R and Python tutorials, are at Kaggle

Tom Kelsey ID5059-11-TM 11th Jan 2021 4 / 49


Worked example – the Titanic

Methodology (we’ll fill in the details later):


1 Learn a tree using the training data
2 Prune this tree to guard against overfit
3 Assess the tree using holdback data – i.e. X data with no
response
4 Send the predicted responses to Kaggle
5 They know the actual survival values, and return a percent
correct

Full details, along with R and Python tutorials, are at Kaggle .


There is also code and data in L11 on Moodle & Studres
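A minimal sketch of this workflow in Python using scikit-learn, assuming a local copy of the Kaggle files train.csv and test.csv; the covariate choices and tree settings below are illustrative, not the Moodle/Studres code:

# Illustrative sketch only (assumed Kaggle file and column names)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Age", "Sex", "Fare", "Pclass", "SibSp"]        # lecture covariates
X = pd.get_dummies(train[features], dtype=int)              # encode Sex as 0/1 columns
X = X.fillna(X.median())                                    # crude fill for missing ages
y = train["Survived"]                                       # 1 = survived, 0 = died

# 1. learn a tree; 2. keep it small to guard against overfit
#    (proper cost-complexity pruning is covered next lecture)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X, y)

# 3. predict for the holdback covariates (no response supplied)
X_test = pd.get_dummies(test[features], dtype=int).reindex(columns=X.columns, fill_value=0)
X_test = X_test.fillna(X.median())
pred = clf.predict(X_test)

# 4./5. write a submission file; Kaggle returns the percent correct
pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": pred}).to_csv(
    "submission.csv", index=False)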

Tom Kelsey ID5059-11-TM 11th Jan 2021 5 / 49


Women and children first!

Tom Kelsey ID5059-11-TM 11th Jan 2021 6 / 49


Interpretation

Node 1:
62% of passengers die; 38% survive
100% of the data is below this node
if we pruned to here, we’d predict non-survival for everyone
Node 2:
81% of males die; 19% survive
65% of the data is below this node
if we pruned to here, we’d predict non-survival
Node 4:
83% of males older than 6.5 years die; 17% survive
62% of the test data follows this path down the tree
this is a leaf, so we predict non-survival

Tom Kelsey ID5059-11-TM 11th Jan 2021 7 / 49


Again, but with more covariates

Tom Kelsey ID5059-11-TM 11th Jan 2021 8 / 49


Worked example – the Titanic

Pruning has no effect – on this tree, with these stopping and


splitting conditions – so we predict the test passengers using the
full tree.
Sending to Kaggle gives me 79% accuracy:

mean(ŷ ≡ y) ≈ 0.79
To improve upon this – using this method – we can
1 Use more or fewer covariates
it turns out that where a passenger got on affected their
survival chances
2 Modify the splitting, stopping & pruning settings for the
code

Tom Kelsey ID5059-11-TM 11th Jan 2021 9 / 49


Historical perspective

1960 Automatic Interaction Detection (AID) related to the


clustering literature.
THAID, CHAID
ID3, C4.5, C5.0
CART 1984 Breiman et al.

Tom Kelsey ID5059-11-TM 11th Jan 2021 10 / 49


Recursive partitioning on Rp

Take an n × p matrix X, defining a p-dimensional space Rp. We


wish to apply a simple rule recursively:
1 Select a variable xi and split on the basis of a single value
xi = a. We now have two spaces: xi ≤ a and xi > a.
2 Select one of the current sub-spaces, select a variable xj , and
split this sub-space on the basis of a single value xj = b.
3 Repeatedly select sub-spaces and split in two.

Note that this process can extend beyond just the two
dimensions represented by x1 and x2. If this were 3-dimensional
(i.e. we included an x3) then the partitions would be cubes. Beyond
this the partitions are conceptually hyper-cubes.

Tom Kelsey ID5059-11-TM 11th Jan 2021 11 / 49


An arbitrary 2-D space

[Figure: an empty 2-D space with axes X1 and X2]

An arbitrary 2-D space

Tom Kelsey ID5059-11-TM 11th Jan 2021 12 / 49


Space splitting
[Figure: the space split at X1 = a into regions Y = f(X1 < a) and Y = f(X1 >= a)]

A single split

Tom Kelsey ID5059-11-TM 11th Jan 2021 13 / 49


Space splitting

[Figure: the left region (X1 < a) split at X2 = b, giving Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a); the right region Y = f(X1 >= a) is unchanged]

Split of a subspace

Tom Kelsey ID5059-11-TM 11th Jan 2021 14 / 49


Space splitting

[Figure: the right region (X1 >= a) split at X2 = c, giving Y = f(X2 >= c & X1 >= a) and Y = f(X2 < c & X1 >= a)]

Further splitting of a subspace

Tom Kelsey ID5059-11-TM 11th Jan 2021 15 / 49


Space splitting

[Figure: a further split at X1 = d within the upper-right region, giving Y = f(X2 >= c & X1 >= a & X1 < d) and Y = f(X2 >= c & X1 >= d)]

Further splitting

Tom Kelsey ID5059-11-TM 11th Jan 2021 16 / 49


Space splitting

[Figure: the resulting piecewise-constant response surface over X1 and X2, with the fitted value on a third (Z) axis]

Potential 3-D surface

Tom Kelsey ID5059-11-TM 11th Jan 2021 17 / 49


Binary partitioning process as a tree

[Figure: the partitioning drawn as a binary tree. The root splits on X1 at a; the left child splits on X2 at b into leaves C1 and C2; the right child splits on X2 at c into leaf C3 and a node that splits again at d into leaves C4 and C5]

An example tree diagram for a contrived partitioning

Tom Kelsey ID5059-11-TM 11th Jan 2021 18 / 49


Tree representation

The splitting points are called nodes - these have a binary


splitting rule associated with them
The two new spaces created by the split are represented by
lines leaving the nodes; these are referred to as the branches.
A tree with one split is a stump.
The nodes at the bottom of the diagram are referred to as the
terminal nodes and collectively represent all the final
partitions/subspaces of the data.
You can ‘drop’ a vector x down the tree to determine which
subspace this coordinate falls into.
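As a sketch, ‘dropping’ a coordinate down the contrived tree above is just nested if/else on the splitting rules; the thresholds a, b, c, d and the leaf labels C1–C5 are the hypothetical ones from the diagram:

def drop(x1, x2, a, b, c, d):
    """Return the leaf (final subspace) that the point (x1, x2) falls into."""
    if x1 < a:                              # root: split on X1 at a
        return "C1" if x2 < b else "C2"     # left child: split on X2 at b
    if x2 < c:                              # right child: split on X2 at c
        return "C3"
    return "C4" if x2 < d else "C5"         # further split at d

print(drop(x1=2.0, x2=7.0, a=5, b=4, c=3, d=6))   # 'C2' with these made-up thresholds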

Tom Kelsey ID5059-11-TM 11th Jan 2021 19 / 49


Exercise

The following is the summary of a series of splits in R2 :

(x1 > 10)


(x1 ≤ 10) & (x2 ≤ 5)
(x1 ≤ 10) & (x2 > 5) & (x2 ≤ 10)

1 Sketch the progression of splits in 2-dimensions.


2 Produce a tree that summarises this series of splits.

Tom Kelsey ID5059-11-TM 11th Jan 2021 20 / 49


Tree construction

We can model the response as a constant for each region (or


equivalently, leaf)
If we are minimising sums of squares, the optimal constant
for a region/leaf is the average of the observed outputs for
all inputs associated with the region/leaf
Computing the optimal binary partition for given inputs
and output is computationally intractable, in general
A greedy algorithm is used that finds an optimal variable
and split point given an initial choice (or guess), then
continues for sub-regions
This is quick to compute (sums of averages) but errors at the
root lead to errors at the leaves
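As a sketch of one greedy step (plain Python, illustrative names): for each variable and each candidate split value, the optimal constant on either side is that side's mean, and we keep the (variable, value) pair with the smallest total sum of squares.

def best_split(X, y):
    """Find the (feature index, threshold) minimising the residual sum of
    squares when each side of the split is predicted by its mean."""
    def sse(vals):
        m = sum(vals) / len(vals)                     # optimal constant = mean
        return sum((v - m) ** 2 for v in vals)

    best = None                                       # (total_sse, j, s)
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):       # candidate split points
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            if not left or not right:
                continue
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best

X = [[1, 5], [2, 6], [8, 5], [9, 7]]
y = [1.0, 1.2, 3.9, 4.1]
print(best_split(X, y))   # (approx. 0.04, 0, 8): split feature 0 at 8, separating {1, 2} from {8, 9}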

Tom Kelsey ID5059-11-TM 11th Jan 2021 21 / 49


How big should the tree be?

Tradeoff between bias and variance


Small tree – high bias, low variance
Not big enough to capture the correct model structure
Large tree – low bias, high variance
Overfitting – in the extreme case each input is in exactly one
region
Optimal size should be adaptively chosen from the data
We could stop splitting based on a threshold for decreases in sum
of squares, but this might rule out a useful split further down the
tree.
Instead we construct a tree that is probably too large, and prune
it by cost-complexity calculations – next lecture

Tom Kelsey ID5059-11-TM 11th Jan 2021 22 / 49


Regression trees

Consider our general regression problem (note can be


classification):
y = f (X) + e
and the usual approximation model (linear in its parameters):

y = Xβ + e

‘Standard’ interactions of form βp (X1 X2)


These are simple in form and quite hard to interpret
succinctly
What is probably the simplest interaction form to interpret?
Recursive binary splitting rules for the Covariate space

Tom Kelsey ID5059-11-TM 11th Jan 2021 23 / 49


Advantages of tree models

These apply to tree models in general, and to CART in particular


Nonparametric
no probabilistic assumptions
Automatically performs variable selection
important variables at or near the root
Any combination of continuous/discrete variables allowed
so we can automatically bin massively categorical variables
into a few categories
e.g. zip code, make/model, etc.
No need to scale or centre the data

Tom Kelsey ID5059-11-TM 11th Jan 2021 24 / 49


Advantages of tree models

Discovers interactions among variables


Handles missing values automatically
using surrogate splits
Invariant to monotonic transformations of predictive
variable
Not sensitive to outliers in predictive variables
Easy to spot when CART is struggling to capture a linear
relationship (and therefore might not be suitable)
repeated splits on the same variable
Good for data exploration, visualisation, multidisciplinary
discussion
in the Titanic example gives hard values for "child" to
support the heuristic "women & children first"

Tom Kelsey ID5059-11-TM 11th Jan 2021 25 / 49


Disadvantages of tree models

Discrete output values, rather than continuous


one response value per leaf node, and only finitely many leaves
Trees can be large and hence hard to interpret
Can be unstable when covariates are correlated
slightly different data gives completely different trees
Not good for describing linear relationships
trees are inherently nonlinear
Not always the best predictive model
might be outperformed by NN, RF, SVM, etc.
Sensitive to data rotation (see later)

Tom Kelsey ID5059-11-TM 11th Jan 2021 26 / 49


Terminology

Training data
Validation data
Resubstitution error
the error rate of a tree on the cases from which it was
constructed
Generalisation error
the error rate of a tree on unseen cases
Purity
how mixed-up the training data is at a node

Tom Kelsey ID5059-11-TM 11th Jan 2021 27 / 49


Tree Generation

All tree methods follow this basic scheme:


We have a set of mixed-up data
so immediately we need some measure of how mixed-up the
data is
Find the covariate-value pair (xj , sj ) that produces the most
separation in the data X, y
split the data into two subsets: rows in which values of
xj < sj , the other rows in which xj ≥ sj
the data in each is less mixed-up
each split forms the root of a new tree
Recurse by repeating for each subtree
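As a sketch of that scheme for classification (not any particular package's algorithm): Gini impurity as the ‘mixed-up’ measure, a simple ‘stop at 4 or fewer items’ rule, and the majority class as the condensed output at each leaf; all names are illustrative.

from collections import Counter

def gini(labels):
    """How mixed-up a set of class labels is (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, labels, min_leaf=4):
    """Recursively split on the (feature, value) pair giving the biggest
    drop in weighted Gini impurity; stop on small or pure subsets."""
    if len(labels) <= min_leaf or gini(labels) == 0.0:
        return Counter(labels).most_common(1)[0][0]       # leaf: majority class
    best = None                                           # (weighted impurity, j, s)
    for j in range(len(rows[0])):
        for s in {r[j] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[j] < s]
            right = [l for r, l in zip(rows, labels) if r[j] >= s]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[0]:
                best = (w, j, s)
    if best is None:                                      # no useful split found
        return Counter(labels).most_common(1)[0][0]
    _, j, s = best
    keep_left = [r[j] < s for r in rows]
    return {"feature": j, "split": s,
            "left": grow([r for r, k in zip(rows, keep_left) if k],
                         [l for l, k in zip(labels, keep_left) if k], min_leaf),
            "right": grow([r for r, k in zip(rows, keep_left) if not k],
                          [l for l, k in zip(labels, keep_left) if not k], min_leaf)}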

Tom Kelsey ID5059-11-TM 11th Jan 2021 28 / 49


Tree Generation

The methods differ based on three choices:


How "mixed-up" is measured
we need to measure randomness
or, equivalently, levels of node purity
How we decide when to stop splitting (see the Shiny example)
the heuristics for this are common to all methods
often as simple as "stop when 4 or fewer items are in a subset"
How we condense the instances that fall into each split
i.e. what is the actual output at a terminal node (predictions)

Tom Kelsey ID5059-11-TM 11th Jan 2021 29 / 49


Regression Trees

Classical statistical approach:


Mixed-up is measured by standard deviation (or any
measure of variability)
In ANOVA terms, find nodes with minimal within variance
and hence maximal between variance
Condense using the average of the instances that fall into
each split (i.e. predict with the mean)

Tom Kelsey ID5059-11-TM 11th Jan 2021 30 / 49


Classification Trees
Information Theory approach:
Mixed-up is measured by amount of (im)purity
Condense via majority class (predict the most common)
Purity can be measured by entropy, Gini index or the
twoing rule
Gini can produce pure but small nodes – the twoing rule is a
tradeoff between purity and equality of data on either side
of the split
Twoing isn’t considered in detail here
We also need pruning criteria
plan to construct a big (overfitting) tree
then reduce tree complexity to get a good tradeoff between
resubstitution error and generalisation error
Before we look at tree construction we need to learn more
about randomness and differences between categorical and
numeric data
Tom Kelsey ID5059-11-TM 11th Jan 2021 31 / 49
Purity

A table or subtable is pure if it contains only one class


In regression tree terminology, the SD of the outputs is zero
In classification tree terminology, only one category is
present
We split and resplit in order to increase node purity
Complete tree purity is analogous to overfitting

Tom Kelsey ID5059-11-TM 11th Jan 2021 32 / 49


Purity of nodes

Intuitively, we want to optimise for some measure of the


purity of nodes
Consider a J-level categorical response variable. A node gives
a vector of proportions p = [p1 , . . . , pJ ] for the levels of our
response
∑_{j=1}^{J} pj = 1, so the vector p is a probability distribution of the
response classes within the node

Tom Kelsey ID5059-11-TM 11th Jan 2021 33 / 49


Node purity

We can list some desirable/necessary properties of an


impurity measure, which will be a function of these
proportions, φ(p)
φ(p) will be a maximum when p = [1/J, . . . , 1/J]. This is our
definition of the least pure node, i.e. there is an equal
mixture of all classes
φ(p) will be a minimum when pj = 1 (and therefore all the
others are zero). This is our definition of our most pure node:
only one class exists

Tom Kelsey ID5059-11-TM 11th Jan 2021 34 / 49


Node purity

Our measure of the impurity of a node t will be given by

i(t) = φ((p1 |t), . . . , (pJ |t))

A measure of the decrease of impurity resulting from


splitting node t into a left node tL and a right node tR will be
given by
δi(t) = i(t) − pL i(tL ) − pR i(tR )
where pL and pR are the proportion of points in t that go to
the left and right respectively

Tom Kelsey ID5059-11-TM 11th Jan 2021 35 / 49


Logarithms: refresher
b^x = y if and only if log_b(y) = x
To be precise one needs to distinguish special cases, for example
y cannot be 0. The log of a negative number is complex; not
needed here.
Inverse of exponentiation:

log_b(b^x) = x = b^(log_b(x))

Base change:

log_a(x) = log_b(x) / log_b(a)

(This is why you don’t need a log2 button on your calculator.)


Useful identity:
log_b(x^(-1)) = − log_b(x)

Tom Kelsey ID5059-11-TM 11th Jan 2021 36 / 49


Entropy Definition

We work in base 2, taking the bit as our unit.


We can now precisely define the entropy of a set of output
classes as:
H(p1 , . . . , pn ) = ∑ pi log2 (1/pi ) = − ∑ pi log2 pi

Tom Kelsey ID5059-11-TM 11th Jan 2021 37 / 49


Example

Class        Bus   Car   Train
Probability  0.4   0.3   0.3

H (0.4, 0.3, 0.3) = −0.4 log2 0.4 − 0.3 log2 0.3 − 0.3 log2 0.3
≈ 1.571

We say our output class data has entropy about 1.57 bits per class.

Tom Kelsey ID5059-11-TM 11th Jan 2021 38 / 49


Example

Class        Bus   Car   Train
Probability  0     1     0

H (0, 1, 0) = −1 log2 1
= 0

We say our output class data has zero entropy, meaning zero
randomness

Tom Kelsey ID5059-11-TM 11th Jan 2021 39 / 49


Example

Class        Bus   Car   Train
Probability  1/3   1/3   1/3

H(1/3, 1/3, 1/3) = −(1/3) log2 (1/3) − (1/3) log2 (1/3) − (1/3) log2 (1/3)
                 ≈ −(−1.584963)/3 − (−1.584963)/3 − (−1.584963)/3
                 ≈ 0.528321 + 0.528321 + 0.528321
                 ≈ 1.584963

We say our output class data has maximum entropy for this
number of classes, meaning the most randomness
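A quick sketch reproducing the three entropy values above:

from math import log2

def entropy(probs):
    """Entropy in bits: sum of p * log2(1/p), skipping zero probabilities."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(round(entropy([0.4, 0.3, 0.3]), 3))     # 1.571
print(round(entropy([0, 1, 0]), 3))           # 0.0
print(round(entropy([1/3, 1/3, 1/3]), 3))     # 1.585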

Tom Kelsey ID5059-11-TM 11th Jan 2021 40 / 49


Gini Index

One minus the sum of the squared output probabilities:

1 − ∑j pj²

In our example, 1 − (0.4² + 0.3² + 0.3²) = 0.660.


Minimum Gini index is zero
Maximum Gini index is 1 − n(1/n)² = 1 − 1/n, two thirds in our
example
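A corresponding sketch for the Gini index, using the same transport example:

def gini(probs):
    """Gini index: one minus the sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in probs)

print(round(gini([0.4, 0.3, 0.3]), 3))    # 0.66
print(round(gini([0, 1, 0]), 3))          # 0.0  (pure)
print(round(gini([1/3, 1/3, 1/3]), 3))    # 0.667 = 1 - 1/3, the maximum for three classes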

Tom Kelsey ID5059-11-TM 11th Jan 2021 41 / 49


Entropy and Gini compared

Node impurity measures versus class proportion for 2-class problem

Tom Kelsey ID5059-11-TM 11th Jan 2021 42 / 49


Misclassification error rate

Defined as the number of incorrect classifications divided by the


number of all classifications
Hence, using terminology from earlier lectures, equal to 1 minus
the accuracy of a classification predictor:

MER = 1 − (a + d) / (a + b + c + d)

In the context of analysing nodes, this is 1 minus the maximum
proportion in p = [p1 , . . . , pJ ]:

MER = 1 − maxj (pj )

From the chart, Entropy and Gini capture more of the notion of
node impurity, and so are preferred measures for tree growth
Misclassification is used extensively in tree pruning

Tom Kelsey ID5059-11-TM 11th Jan 2021 43 / 49


Sample Gini gain calculation for age 38.6

Original Gini G0 = 1 − p(A)² − p(M)², i.e.


G0 = 1 − (55/75)² − (20/75)²

For M, 8 are above and 12 are below
For A, 52 are above and 3 are below
In all, 60 are above and 15 are below

Above Gini Ga = 1 − (52/60)² − (8/60)²

Below Gini Gb = 1 − (3/15)² − (12/15)²

Gini gain is G0 − (60/75) × Ga − (15/75) × Gb ≈ 0.142
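A short check of this arithmetic (counts as given on the slide):

def gini2(p):
    """Gini index of a two-class node with proportion p in the first class."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2

g0 = gini2(55 / 75)                  # root: 55 A, 20 M
ga = gini2(52 / 60)                  # above the split: 52 A, 8 M
gb = gini2(3 / 15)                   # below the split: 3 A, 12 M
print(round(g0 - (60 / 75) * ga - (15 / 75) * gb, 3))   # 0.142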

Tom Kelsey ID5059-11-TM 11th Jan 2021 44 / 49


Gini gain for each age

Tom Kelsey ID5059-11-TM 11th Jan 2021 45 / 49


Sample information gain calculation for age 38.6

Information gain at a node is the entropy of the node minus the


weighted average of the entropies above and below the split
       
Original entropy H0 = (55/75) log2 (75/55) + (20/75) log2 (75/20)
For M, 8 are above and 12 are below
For A, 52 are above and 3 are below
In all, 60 are above and 15 are below
Above entropy Ha = (52/60) log2 (60/52) + (8/60) log2 (60/8)
Below entropy Hb = (3/15) log2 (15/3) + (12/15) log2 (15/12)
Information gain is H0 − (60/75) × Ha − (15/75) × Hb ≈ 0.24
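The same check for information gain: with the formula above it comes out at roughly 0.24 bits (the weighted average of Ha and Hb on its own is about 0.6):

from math import log2

def entropy2(p):
    """Entropy in bits of a two-class node with proportion p in the first class."""
    return sum(q * log2(1 / q) for q in (p, 1 - p) if q > 0)

h0 = entropy2(55 / 75)               # root: 55 A, 20 M
ha = entropy2(52 / 60)               # above the split
hb = entropy2(3 / 15)                # below the split
print(round(h0 - (60 / 75) * ha - (15 / 75) * hb, 3))   # 0.239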

Tom Kelsey ID5059-11-TM 11th Jan 2021 46 / 49


Information gain for each age

Tom Kelsey ID5059-11-TM 11th Jan 2021 47 / 49


Summary

Recursive binary partitioning gives a binary tree for numeric


and/or binary category covariates and either numeric or
categorical responses
covariates are either above or below a split, or in one of two
classes
numeric responses are the average of the included values
information gain and Gini gain are similar metrics for
optimal splits
We go on to consider multi-categorical covariates, and
discretisation of numeric covariates

Tom Kelsey ID5059-11-TM 11th Jan 2021 48 / 49


26 response values, but binary splits at each level

Tree for the Letter Recognition dataset, restricted to depth 3 for ease of visualisation

Tom Kelsey ID5059-11-TM 11th Jan 2021 49 / 49
