Knowledge Discovery and Data Mining: Lecture 11 - Tree Methods - Introduction
Tom Kelsey
y = f(X) + e
Node 1:
62% of passengers die; 38% survive
100% of the data is below this node
if we pruned to here, we’d predict non-survival for everyone
Node 2:
81% of males die; 19% survive
65% of the data is below this node
if we pruned to here, we’d predict non-survival for all males
Node 4:
83% of males older than 6.5 years die; 17% survive
62% of the test data follows this path down the tree
this is a leaf, so we predict non-survival
mean(ŷ = y) ≈ 0.79, i.e. about 79% of cases are classified correctly
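To make this concrete, here is a minimal sketch of fitting such a tree and computing the accuracy. Python with scikit-learn and seaborn's bundled copy of the Titanic data are assumptions made for the sketch, not necessarily what produced the figures in this lecture.

```python
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a copy of the Titanic data (seaborn's bundled version is assumed here)
titanic = sns.load_dataset("titanic")[["survived", "sex", "age"]].dropna()

# Encode sex as 0/1 so the tree can split on it
X = titanic[["sex", "age"]].assign(sex=lambda d: (d["sex"] == "male").astype(int))
y = titanic["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree, comparable to the sex/age splits described above
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# mean(y_hat == y) on held-out cases; typically close to the 0.79 quoted above
print((tree.predict(X_test) == y_test).mean())
```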
To improve upon this – using this method – we can:
1. Use more or fewer covariates: it turns out that where a passenger got on (embarked) affected their survival chances.
2. Modify the splitting, stopping & pruning settings for the code, as sketched below.
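For the second point, the following shows where such settings live if one were using scikit-learn (an assumption for illustration; other tree implementations expose analogous controls under different names, and the values chosen here are arbitrary).

```python
from sklearn.tree import DecisionTreeClassifier

# Splitting, stopping and pruning settings one might vary (values are illustrative):
tuned_tree = DecisionTreeClassifier(
    criterion="entropy",     # splitting measure: "gini" or "entropy"
    max_depth=4,             # stopping: maximum depth of the tree
    min_samples_split=20,    # stopping: smallest node that may be split
    min_samples_leaf=7,      # stopping: smallest leaf allowed
    ccp_alpha=0.01,          # pruning: cost-complexity penalty (larger = more pruning)
)
```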
Note that this process extends beyond just the two dimensions represented by x1 and x2. If the space were 3-dimensional (i.e. we included an x3), the partitions would be rectangular boxes; beyond this, the partitions are conceptually hyper-rectangles (often loosely called hyper-cubes).
[Figure: recursive partitioning of the (X1, X2) covariate space]
A single split: the cut X1 = a divides the space into the regions Y = f(X1 < a) and Y = f(X1 >= a).
Split of a subspace: the region X1 < a is split again at X2 = b.
Further splitting: the region X1 >= a is split at X2 = c, and its upper part is split again at X1 = d, giving five regions in total.
[Figure: the corresponding decision tree, with internal nodes testing X1 and X2 and leaves C1–C5]
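To make the recursive construction explicit, here is an illustrative sketch of growing such a partition for a regression response, where each hyper-rectangle predicts the mean of the training cases falling in it. The function names and the splitting criterion (residual sum of squares) are choices made for this sketch, not prescribed by the lecture.

```python
import numpy as np

def grow_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Recursively partition the covariate space with axis-aligned splits.

    X is a 2-D numpy array of covariates, y a 1-D numpy array of responses.
    Each split X[:, j] < s divides the current hyper-rectangle in two;
    leaves predict the mean of y in their rectangle.
    """
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {"leaf": True, "prediction": float(np.mean(y))}

    best = None
    for j in range(X.shape[1]):                 # try every covariate
        for s in np.unique(X[:, j])[1:]:        # and every candidate cut point
            left = X[:, j] < s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            # residual sum of squares of the two-mean fit after this split
            rss = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or rss < best["rss"]:
                best = {"rss": rss, "feature": j, "threshold": float(s), "mask": left}

    if best is None:                            # no admissible split found
        return {"leaf": True, "prediction": float(np.mean(y))}

    return {
        "leaf": False,
        "feature": best["feature"],
        "threshold": best["threshold"],
        "left": grow_tree(X[best["mask"]], y[best["mask"]], depth + 1, max_depth, min_leaf),
        "right": grow_tree(X[~best["mask"]], y[~best["mask"]], depth + 1, max_depth, min_leaf),
    }

def predict_one(node, x):
    """Follow the splits down to a leaf and return its mean prediction."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["prediction"]
```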
y = Xβ + e
Training data
Validation data
Resubstitution error
the error rate of a tree on the cases from which it was
constructed
Generalisation error
the error rate of a tree on unseen cases
Purity
how mixed up the class labels of the training data are at a node (a completely pure node contains cases from a single class)
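As a sketch of the distinction between resubstitution and generalisation error, using a held-out validation set (scikit-learn and its bundled iris data are assumed purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Error on the cases the tree was constructed from vs. error on unseen cases
resubstitution_error = 1 - tree.score(X_train, y_train)
generalisation_error = 1 - tree.score(X_valid, y_valid)
print(resubstitution_error, generalisation_error)
```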
$\log_b b^x = x = b^{\log_b x}$
Base change:
$\log_a x = \dfrac{\log_b x}{\log_b a}$
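A quick numerical check of the base-change rule, as a minimal Python sketch:

```python
import math

x = 0.4
# log_2(x) computed directly, and via the change-of-base formula ln(x) / ln(2)
print(math.log2(x))                 # approx -1.3219
print(math.log(x) / math.log(2))    # the same value via base change
```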
$H(0.4, 0.3, 0.3) = -0.4\log_2 0.4 - 0.3\log_2 0.3 - 0.3\log_2 0.3 \approx 1.571$
We say our output class data has an entropy of about 1.57 bits.
$H(0, 1, 0) = -1\log_2 1 = 0$ (the $p = 0$ terms vanish by the convention $0\log_2 0 = 0$)
We say our output class data has zero entropy, meaning zero randomness.
$H(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}) = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}$
$\approx -\tfrac{-1.584963}{3} - \tfrac{-1.584963}{3} - \tfrac{-1.584963}{3}$
$\approx 0.528321 + 0.528321 + 0.528321$
$\approx 1.584963$
We say our output class data has the maximum entropy possible for three classes ($\log_2 3 \approx 1.585$), meaning the most randomness.
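The three worked values above can be verified with a small helper (an illustrative Python sketch; the convention $0\log_2 0 = 0$ is applied by skipping zero proportions):

```python
import math

def entropy(proportions):
    """Shannon entropy in bits of a class distribution, using 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.4, 0.3, 0.3]))      # approx 1.571
print(entropy([0, 1, 0]))            # 0.0
print(entropy([1/3, 1/3, 1/3]))      # approx 1.585
```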
Gini index: $1 - \sum_{j} p_j^2$
$\mathrm{MER} = 1 - \dfrac{a + d}{a + b + c + d}$
In the context of analysing nodes, this is 1 minus the maximum proportion in $p = [p_1, \ldots, p_J]$:
$\mathrm{MER} = 1 - \max_j(p_j)$
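The Gini index and the node-wise misclassification error rate can be computed in the same style (an illustrative sketch; the proportions are assumed to sum to 1):

```python
def gini(proportions):
    """Gini index: 1 minus the sum of squared class proportions at a node."""
    return 1 - sum(p ** 2 for p in proportions)

def misclassification_error(proportions):
    """MER: 1 minus the proportion of the majority class at a node."""
    return 1 - max(proportions)

print(gini([0.4, 0.3, 0.3]))                     # 0.66
print(misclassification_error([0.4, 0.3, 0.3]))  # 0.6
```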
From the chart, Entropy and Gini capture more of the notion of
node impurity, and so are preferred measures for tree growth
Misclassification is used extensively in tree pruning
Tree for the Letter Recognition dataset, restricted to depth 3 for ease of visualisation