
Foundations of Machine Learning

Module 2: Linear Regression and Decision Tree


Part C: Learning Decision Tree

Sudeshna Sarkar
IIT Kharagpur
Top-Down Induction of Decision Trees ID3

Two questions drive the procedure: (1) which attribute to test at the next node, and (2) when to stop.

1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the
   attribute value of the branch
5. If all training examples are perfectly classified (same value of
   the target attribute), stop; else iterate over the new leaf nodes.
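This recursion can be sketched in a few lines of Python. The sketch below is illustrative and not from the slides: the data format (a list of dicts with a "label" key) and the choose_best_attribute helper (e.g. picking the attribute with the highest information gain, defined on the following slides) are assumptions.

```python
from collections import Counter

def id3(examples, attributes, choose_best_attribute):
    """Minimal sketch of top-down decision tree induction (ID3-style)."""
    labels = [ex["label"] for ex in examples]
    # Stop: all examples have the same target value -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> majority-class leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: pick the "best" attribute A as this node's test (assumed helper)
    A = choose_best_attribute(examples, attributes)
    node = {"attribute": A, "branches": {}}
    # Steps 3-4: one branch per value of A; sort examples down the branches
    for v in {ex[A] for ex in examples}:
        subset = [ex for ex in examples if ex[A] == v]
        remaining = [a for a in attributes if a != A]
        # Step 5: recurse on each new leaf until examples are perfectly classified
        node["branches"][v] = id3(subset, remaining, choose_best_attribute)
    return node
```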
Choices
• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split

• Which test to split on
  – the split that gives the smallest error
  – with multi-valued features:
    • split on all values, or
    • split the values into two subsets
Which Attribute is "best"?

  A1=? splits [29+,35-] into  True: [21+,5-]   False: [8+,30-]
  A2=? splits [29+,35-] into  True: [18+,33-]  False: [11+,2-]

Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• Information gain
  – measures how well a given attribute separates the training
    examples according to their target classification
  – this measure is used to select among the candidate
    attributes at each step while growing the tree
  – gain is a measure of how much we can reduce
    uncertainty (its value lies between 0 and 1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: an optimal-length code assigns (-log2 p) bits to a
  message having probability p
• S is a sample of training examples
  – p+ is the proportion of positive examples in S
  – p- is the proportion of negative examples in S
• Entropy of S: the average optimal number of bits to encode
  information about the certainty/uncertainty of S

  Entropy(S) = -p+ log2 p+ - p- log2 p-
Entropy

• The entropy is 0 if the outcome is "certain".
• The entropy is maximum (1 bit for two classes) if we have no
  knowledge of the system (any outcome is equally likely).

• S is a sample of training examples
  – p+ is the proportion of positive examples
  – p- is the proportion of negative examples
• Entropy measures the impurity of S:

  Entropy(S) = -p+ log2 p+ - p- log2 p-
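As a quick check of these properties, here is a small Python sketch of the binary entropy formula (illustrative, not from the slides; the function name entropy is my own):

```python
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # convention: 0 log2 0 = 0
            e -= p * log2(p)
    return e

print(entropy(5, 5))    # no knowledge (50/50): 1.0, the maximum
print(entropy(10, 0))   # certain outcome: 0.0
print(entropy(29, 35))  # the [29+,35-] sample used below: about 0.99
```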
Information Gain
Gain(S,A): expected reduction in entropy due to partitioning S
on attribute A:

  Gain(S,A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

Entropy([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64
                   = 0.99

  A1=? splits [29+,35-] into  True: [21+,5-]   False: [8+,30-]
  A2=? splits [29+,35-] into  True: [18+,33-]  False: [11+,2-]


Information Gain
Entropy([21+,5-])  = 0.71     Entropy([18+,33-]) = 0.94
Entropy([8+,30-])  = 0.74     Entropy([11+,2-])  = 0.62

Gain(S,A1) = Entropy(S) - 26/64 * Entropy([21+,5-]) - 38/64 * Entropy([8+,30-]) = 0.27
Gain(S,A2) = Entropy(S) - 51/64 * Entropy([18+,33-]) - 13/64 * Entropy([11+,2-]) = 0.12

  A1=? splits [29+,35-] into  True: [21+,5-]   False: [8+,30-]
  A2=? splits [29+,35-] into  True: [18+,33-]  False: [11+,2-]
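These numbers can be reproduced with a few lines of Python. The sketch below is illustrative (not from the slides); it works on (pos, neg) class counts and the helper names are my own.

```python
from math import log2

def entropy(pos, neg):
    e, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            e -= (c / total) * log2(c / total)
    return e

def gain(parent, children):
    """Gain = Entropy(parent) - weighted sum of child entropies.
    parent and each child are (pos, neg) count pairs."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1: about 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2: about 0.12
```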


Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
ID3 Algorithm (resulting tree)

Outlook
  Sunny:    Humidity
              High:   No   [D1,D2,D8]
              Normal: Yes  [D9,D11]
  Overcast: Yes  [D3,D7,D12,D13]
  Rain:     Wind
              Strong: No   [D6,D14]
              Weak:   Yes  [D4,D5,D10]

Selecting the Next Attribute

S = [9+,5-],  E = 0.940

Outlook
  Sunny:    [2+,3-]  E = 0.971
  Overcast: [4+,0-]  E = 0.0
  Rain:     [3+,2-]  E = 0.971

Gain(S, Outlook)
  = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971
  = 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0 log2 0 = 0
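These four gains (and, further below, the gains on the Outlook = Sunny subset) can be reproduced from the training table. The following Python sketch is my own illustration, not from the slides; the tuple-based data layout and helper names are assumptions, and printed values may differ from the slides in the last digit due to rounding.

```python
from math import log2

# PlayTennis training examples (Day column omitted): (Outlook, Temp, Humidity, Wind, Tennis?)
DATA = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    e, n = 0.0, len(rows)
    for label in set(r[-1] for r in rows):
        p = sum(r[-1] == label for r in rows) / n
        e -= p * log2(p)
    return e

def gain(rows, attr):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    i, n = ATTRS[attr], len(rows)
    g = entropy(rows)
    for v in set(r[i] for r in rows):
        sv = [r for r in rows if r[i] == v]
        g -= len(sv) / n * entropy(sv)
    return g

for a in ATTRS:                          # Outlook, Temp, Humidity, Wind:
    print(a, round(gain(DATA, a), 3))    # about 0.247, 0.029, 0.151, 0.048

sunny = [r for r in DATA if r[0] == "Sunny"]       # the Ssunny subset
for a in ("Humidity", "Temp", "Wind"):             # about 0.970, 0.570, 0.019
    print(a, round(gain(sunny, a), 3))
```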
ID3 Algorithm (partial tree)

[D1,D2,…,D14]  [9+,5-]
Outlook
  Sunny:    Ssunny = [D1,D2,D8,D9,D11]  [2+,3-]  → ?  (test needed for this node)
  Overcast: [D3,D7,D12,D13]             [4+,0-]  → Yes
  Rain:     [D4,D5,D6,D10,D14]          [3+,2-]  → ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp)     = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
Selecting the Next Attribute

S = [9+,5-],  E = 0.940

Humidity
  High:   [3+,4-]  E = 0.985
  Normal: [6+,1-]  E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind
  Weak:   [6+,2-]  E = 0.811
  Strong: [3+,3-]  E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Humidity provides greater information gain than Wind, w.r.t. the target classification.
Splitting Rule: GINI Index
• GINI Index
– Measure of node impurity

GINInode(Node) 1 [ p(c)] 2

c  classes

Sv
GINIsplit (A)   S
GINI(N v )
v Values(A )
Splitting Based on Continuous Attributes

(i)  Binary split:    Taxable Income > 80K?  → Yes / No
(ii) Multi-way split: Taxable Income ∈ { <10K, [10K,25K), [25K,50K), [50K,80K), >80K }


Continuous Attribute – Binary Split
• For a continuous attribute:
  – Partition the continuous values of attribute A into a
    discrete set of intervals
  – Create a new boolean attribute Ac by looking for a threshold c:

      Ac = true   if A >= c
           false  otherwise

How to choose c?
• Consider all possible splits and find the best cut.
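The brute-force search for the best cut can be sketched as follows. This is an illustrative Python sketch, not from the slides: candidate thresholds are taken midway between consecutive sorted values and scored by information gain, and the taxable-income data at the bottom is hypothetical.

```python
from math import log2

def entropy(labels):
    e, n = 0.0, len(labels)
    for l in set(labels):
        p = labels.count(l) / n
        e -= p * log2(p)
    return e

def best_threshold(values, labels):
    """Try every candidate cut c (midpoints of consecutive distinct sorted values)
    and return the c with the highest information gain for the test A >= c."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best_gain:
            best_c, best_gain = c, g
    return best_c, best_gain

# Hypothetical taxable-income example
income = [60, 70, 75, 85, 90, 95, 100, 120]
label  = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]
print(best_threshold(income, label))   # picks the cut at 80 here
```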
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees, which can
  represent all possible discrete functions.

• Goal: to find the best decision tree.

• Finding a minimal decision tree consistent with a set of data
  is NP-hard.

• Perform a greedy heuristic search: hill climbing without
  backtracking.

• Statistics-based decisions using all the data.

Bias and Occam’s Razor
Prefer short hypotheses.
Argument in favor:
– Fewer short hypotheses than long hypotheses
– A short hypothesis that fits the data is unlikely to
be a coincidence
– A long hypothesis that fits the data might be a
coincidence
