Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
Decision Trees (2)
Entropy, Information Gain, Gain Ratio
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
Acknowledgements
• Weka’s slides
• Wikipedia and other websites
• Witten et al. (2011: 99-108; 195-203; 192-203)
Decision Trees (Part 2) 2
Outline
• Attribute selection
• Entropy
• Surprisal
• Information Gain
• Gain Ratio
• Pruning
• Rules
Decision Trees (Part 2) 3
4 Decision Trees (Part 2)
Constructing decision trees
 Strategy: top down
Recursive divide-and-conquer fashion
 First: select attribute for root node
Create branch for each possible attribute value
 Then: split instances into subsets
One for each branch extending from the node
 Finally: repeat recursively for each branch, using
only instances that reach the branch
 Stop if all instances have the same class (see the sketch below)
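A minimal Python sketch of this top-down, divide-and-conquer procedure (an illustration only, not Weka's implementation; it assumes each instance is a dictionary with a "class" key, and select_attribute is a placeholder for the attribute-scoring criteria discussed in the rest of the lecture):

```python
from collections import Counter

def build_tree(instances, attributes, select_attribute):
    """Recursively grow a decision tree from dict-shaped instances."""
    classes = [inst["class"] for inst in instances]
    # Stop if all instances have the same class (or no attributes are left to split on).
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": Counter(classes).most_common(1)[0][0]}

    # First: select the attribute for this node (e.g. by information gain or gain ratio).
    best = select_attribute(instances, attributes)
    node = {"attribute": best, "branches": {}}

    # Then: split the instances into subsets, one per value of the chosen attribute,
    # and repeat recursively using only the instances that reach each branch.
    for value in sorted({inst[best] for inst in instances}):
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```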
Play or not?
• The weather dataset
Decision Trees (Part 2) 5
6 Decision Trees (Part 2)
Which attribute to select?
Computing purity: the information measure
• Information is a measure of the reduction of uncertainty.
• It represents the expected amount of information that would be needed to “place” a new instance in the branch.
Decision Trees (Part 2) 7
8 Decision Trees (Part 2)
Which attribute to select?
9 Decision Trees (Part 2)
Final decision tree
 Splitting stops when data can’t be split any further
10 Decision Trees (Part 2)
Criterion for attribute selection
 Which is the best attribute?
 Want to get the smallest tree
 Heuristic: choose the attribute that produces the
“purest” nodes
11 Decision Trees (Part 2)
-- Information gain: increases with the average purity of the subsets
-- Strategy: choose attribute that gives greatest information gain
How to compute Information Gain:
Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum, because we are maximally uncertain about the outcome.
3. Complex scenarios: the measure should also be applicable to multiclass situations, where a multi-staged decision must be made.
Decision Trees (Part 2) 12
Entropy
• Entropy (aka expected surprisal)
Decision Trees (Part 2) 13
Surprisal: Definition
• Surprisal (aka self-information) is a measure of the information
content associated with an event in a probability space.
• The smaller the probability of an event, the larger the surprisal associated with the information that the event has occurred.
• By definition, the measure of surprisal is positive and additive. If an event C is the intersection of two independent events A and B, then the amount of information gained by knowing that C has happened equals the sum of the amounts of information of events A and B:
I(A ∩ B) = I(A) + I(B)
Decision Trees (Part 2) 14
Surprisal: Formula
• The surprisal “I” of an event is:
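(The formula itself is shown as an image on the original slide; the standard definition of self-information, consistent with the additivity property above, is:)
I(A) = -log2 P(A)
so that, for independent events A and B, I(A ∩ B) = -log2(P(A) · P(B)) = I(A) + I(B).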
Decision Trees (Part 2) 15
16 Decision Trees (Part 2)
Entropy: Formulas
 Formulas for computing entropy:
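(The formulas are shown as images on the original slide; the standard definitions, written in the notation used on the following slides, are:)
entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
info([c1, c2, ..., cn]) = entropy(c1/N, c2/N, ..., cn/N), where N = c1 + c2 + ... + cn is the number of instances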
17 Decision Trees (Part 2)
Entropy: Outlook, sunny
 Computing the entropy of the Outlook = sunny branch (2 yes, 3 no):
entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97095059445
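The same value can be reproduced in Python with a small helper (not part of the original slides); it also illustrates the boundary cases listed earlier: a pure node has entropy 0 and an evenly split two-class node has the maximum entropy of 1 bit.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts, e.g. [2, 3]."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([2, 3]))   # Outlook = sunny: 2 yes, 3 no -> 0.9709505944546686
print(entropy([5, 0]))   # pure node -> 0.0 (zero counts contribute nothing)
print(entropy([7, 7]))   # evenly split -> 1.0, maximum uncertainty for two classes
```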
Measures: Information & Entropy
• Watch out: there are many statements in the literature saying that information is the same as entropy.
• Properly speaking, entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of a reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
Decision Trees (Part 2) 18
19 Decision Trees (Part 2)
Example: Outlook
20 Decision Trees (Part 2)
Computing Information Gain
 Information gain: information before splitting –
information after splitting
 Information gain for attributes from weather data:
gain(Outlook ) = 0.247 bits
gain(Temperature ) = 0.029 bits
gain(Humidity ) = 0.152 bits
gain(Windy ) = 0.048 bits
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
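The gain(Outlook) figure can be reproduced with a short script (a sketch reusing the entropy helper shown earlier; the per-branch counts [2,3], [4,0], [3,2] are the yes/no counts for sunny, overcast and rainy in the weather data):

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def info(branches):
    """Weighted average entropy of a split, e.g. info([2,3], [4,0], [3,2])."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

before = entropy([9, 5])                  # info([9,5])             ~ 0.940
after = info([[2, 3], [4, 0], [3, 2]])    # info([2,3],[4,0],[3,2]) ~ 0.693
print(round(before - after, 3))           # gain(Outlook) = 0.247 bits
```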
21 Decision Trees - Part 2
Information Gain Drawbacks
 Problematic: attributes with a large number
of values (extreme case: ID code)
22 Decision Trees - Part 2
Weather data with ID code
ID code | Outlook  | Temp. | Humidity | Windy | Play
A       | Sunny    | Hot   | High     | False | No
B       | Sunny    | Hot   | High     | True  | No
C       | Overcast | Hot   | High     | False | Yes
D       | Rainy    | Mild  | High     | False | Yes
E       | Rainy    | Cool  | Normal   | False | Yes
F       | Rainy    | Cool  | Normal   | True  | No
G       | Overcast | Cool  | Normal   | True  | Yes
H       | Sunny    | Mild  | High     | False | No
I       | Sunny    | Cool  | Normal   | False | Yes
J       | Rainy    | Mild  | Normal   | False | Yes
K       | Sunny    | Mild  | Normal   | True  | Yes
L       | Overcast | Mild  | High     | True  | Yes
M       | Overcast | Hot   | Normal   | False | Yes
N       | Rainy    | Mild  | High     | True  | No
23 Decision Trees - Part 2
Tree stump for ID code attribute
 Entropy of split (see Weka book 2011: 105-108):
 Information gain is maximal for ID code (namely 0.940
bits)
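As a worked check (the intermediate numbers are not on the slide, but they follow directly from the definitions above): splitting on the ID code produces 14 single-instance, and therefore pure, branches, so the information after the split is 0 and the gain is the full 0.940 bits, while the split info is as large as a 14-way split can make it.

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Fourteen branches with one instance each: every branch is pure, so info after the split is 0.
gain_id_code = entropy([9, 5]) - 0          # ~ 0.940 bits, the maximum possible
split_info = entropy([1] * 14)              # info([1,1,...,1]) = log2(14) ~ 3.807
print(round(gain_id_code, 3), round(split_info, 3))
print(round(gain_id_code / split_info, 3))  # gain ratio ~ 0.247, far less extreme than 0.940
```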
24 Decision Trees - Part 2
Information Gain Limitations
 Problematic: attributes with a large number
of values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
 (Another problem: fragmentation)
25 Decision Trees - Part 2
Gain ratio
 Gain ratio: a modification of the information gain
that reduces its bias
 Gain ratio takes number and size of branches into
account when choosing an attribute
 It corrects the information gain by taking the intrinsic
information of a split into account
 Intrinsic information: the entropy of the split into branches, where information about the class is disregarded.
26 Decision Trees - Part 2
Gain ratios for weather data
Attribute     Info    Gain                    Split info               Gain ratio
Outlook       0.693   0.940 - 0.693 = 0.247   info([5,4,5]) = 1.577    0.247 / 1.577 = 0.157
Temperature   0.911   0.940 - 0.911 = 0.029   info([4,6,4]) = 1.557    0.029 / 1.557 = 0.019
Humidity      0.788   0.940 - 0.788 = 0.152   info([7,7]) = 1.000      0.152 / 1.000 = 0.152
Windy         0.892   0.940 - 0.892 = 0.048   info([8,6]) = 0.985      0.048 / 0.985 = 0.049
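These figures can be reproduced with a gain-ratio helper (a sketch of the criterion described on the previous slide, reusing the entropy and info functions from the information-gain example; the per-value class counts are read off the weather table above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def info(branches):
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

def gain_ratio(branches, class_counts=(9, 5)):
    """Information gain divided by split info (the intrinsic information of the split)."""
    gain = entropy(class_counts) - info(branches)
    split_info = entropy([sum(b) for b in branches])   # branch sizes only, class disregarded
    return gain / split_info

# Outlook: sunny = [2 yes, 3 no], overcast = [4, 0], rainy = [3, 2]
print(round(gain_ratio([[2, 3], [4, 0], [3, 2]]), 3))  # 0.156 (the table's 0.157 divides the rounded 0.247 by 1.577)
# Humidity: high = [3 yes, 4 no], normal = [6 yes, 1 no]
print(round(gain_ratio([[3, 4], [6, 1]]), 3))          # 0.152
```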
27 Decision Trees - Part 2
More on the gain ratio
 “Outlook” still comes out top
 However: “ID code” has greater gain ratio
 Standard fix: ad hoc test to prevent splitting on that
type of attribute
 Problem with gain ratio: it may overcompensate
 May choose an attribute just because its intrinsic
information is very low
 Standard fix: only consider attributes with greater
than average information gain
28 Decision Trees - Part 2
Interim Summary
 Top-down induction of decision trees: ID3,
algorithm developed by Ross Quinlan
 Gain ratio just one modification of this basic
algorithm
 C4.5: deals with numeric attributes, missing
values, noisy data
 Similar approach: CART
 There are many other attribute selection
criteria!
(But little difference in accuracy of result)
29 Decision Trees - Part 2
Pruning
 Prevent overfitting to noise in the data
 “Prune” the decision tree
 Two strategies:
 Postpruning
take a fully-grown decision tree and discard
unreliable parts
 Prepruning
stop growing a branch when information
becomes unreliable
 Postpruning preferred in practice—
prepruning can “stop early”
30 Decision Trees - Part 2
Postpruning
 First, build full tree
 Then, prune it
 Fully-grown tree shows all attribute interactions
 Problem: some subtrees might be due to chance
effects
 Two pruning operations:
 Subtree replacement
 Subtree raising
31 Decision Trees - Part 2
Subtree replacement
 Bottom-up
 Consider replacing a tree only
after considering all its subtrees
32 Decision Trees - Part 2
Subtree raising
 Delete node
 Redistribute instances
 Slower than subtree
replacement
(Worthwhile?)
33 Decision Trees - Part 2
Prepruning
 Based on statistical significance test
 Stop growing the tree when there is no statistically
significant association between any attribute and the
class at a particular node
 Most popular test: chi-squared test
 ID3 used chi-squared test in addition to
information gain
 Only statistically significant attributes were allowed to be
selected by information gain procedure
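As an illustration of this kind of test (a sketch of the mechanics, not the exact procedure used in ID3 or C4.5), one could cross-tabulate attribute values against class counts at a node and apply SciPy's chi-squared test; here using the Outlook counts at the root of the weather data:

```python
from scipy.stats import chi2_contingency

# Rows: Outlook values (sunny, overcast, rainy); columns: class counts (yes, no).
contingency = [
    [2, 3],   # sunny
    [4, 0],   # overcast
    [3, 2],   # rainy
]
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Prepruning idea: keep splitting on this attribute only if its association with
# the class is statistically significant. (With just 14 instances the expected
# counts are small, so the test is weak; this only shows the mechanics.)
if p_value < 0.05:
    print(f"split on this attribute (p = {p_value:.3f})")
else:
    print(f"stop growing this branch (p = {p_value:.3f})")
```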
34 Decision Trees - Part 2
From trees to rules
 Easy: converting a tree into a set of rules
 One rule for each leaf:
 Produces rules that are unambiguous
 Doesn’t matter in which order they are executed
 But: resulting rules are unnecessarily complex
 Pruning to remove redundant tests/rules
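For the toy tree representation sketched earlier in these notes (a hypothetical helper, not Weka's rule generator), the conversion is indeed a few lines, with one rule per leaf:

```python
def tree_to_rules(node, conditions=()):
    """Yield one (conditions, class) rule per leaf of a tree built by build_tree above."""
    if "leaf" in node:
        yield list(conditions), node["leaf"]
        return
    for value, child in node["branches"].items():
        yield from tree_to_rules(child, conditions + ((node["attribute"], value),))

# Each rule reads: IF attr1 = v1 AND attr2 = v2 AND ... THEN class.
```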
35 Decision Trees - Part 2
From rules to trees
 More difficult: transforming a rule set into a tree
 Tree cannot easily express disjunction between rules
36 Decision Trees - Part 2
From rules to trees: Example
 Example: rules which test different attributes
Symmetry needs to be broken
 Corresponding tree contains identical subtrees
(the “replicated subtree problem”)
If a and b then x
If c and d then x
Topic Summary
• Attribute selection
• Entropy
• Surprisal
• Information Gain
• Gain Ratio
• Pruning
• Rules
• The quizzes are only mildly tricky, just to double-check that your attention is still with me.
Decision Trees (Part 2) 37
Quiz 1: Regression and Classification
Which of these statements is correct in the context of machine learning?
1. Classification is the process of computing a model that predicts a numeric quantity.
2. Regression and classification mean the same thing.
3. Regression is the process of computing a model that predicts a numeric quantity.
Decision Trees - Part 2 38
Quiz 2: Information Gain
What is the main drawback of the IG metric in
certain contexts?
1. It is biased towards attributes that have many values.
2. It is based on entropy rather than surprisal.
3. None of the above.
3. None of the above.
Decision Trees - Part 2 39
Quiz 3: Gain Ratio
What is the main difference between IG and
GR?
1. GR disregards the information about the
class, and IG takes the class into account.
2. IG disregards the information about the class
and GR takes the class into account.
3. None of the above.
Decision Trees - Part 2 40
Quiz 4: Pruning
Which pruning strategy is commonly recommended?
1. Prepruning
2. Postpruning
3. Subtree raising
Decision Trees - Part 2 41
The End
Decision Trees - Part 2 42