Module 3-Decision Tree Learning
○ Repeat: test the attribute specified by the node, then move down the tree branch
corresponding to the value of that attribute in the given example
A Decision Tree for the concept PlayTennis.
Training examples table: Day, Outlook, Temperature, Humidity, Wind, PlayTennis.
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
● This instance will be sorted down to the leftmost branch of the decision tree shown on the
previous slide and will be classified as a negative example.
1. Instances are represented by attribute-value pairs: Instances are described by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree
learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild,
Cold).
2. The target function has discrete output values: The decision tree for the concept PlayTennis
assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods can
also handle target functions with more than two possible output values.
3. Disjunctive descriptions may be required: As noted above, decision trees naturally
represent disjunctive expressions.
4. The training data may contain errors: Decision tree learning methods are robust to
errors, both errors in classifications of the training examples and errors in the attribute
values that describe these examples.
5. The training data may contain missing attribute values: Decision tree methods can be
used even when some training examples have unknown values (e.g., if the Humidity of the
day is known for only some of the training examples).
• Many practical problems, such as learning to classify medical patients by their disease,
equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting
on payments, have been found to fit these characteristics.
• Such problems, in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
4. Basic Decision Tree Learning Algorithm
● Decision Tree learning algorithms employ a top-down, greedy search through the space of
possible decision trees.
● The ID3 algorithm is one of the most commonly used Decision Tree learning algorithms, and it
applies this general approach to learning the decision tree.
ID3 Algorithm (Iterative Dichotomiser 3)
• ID3 algorithm uses “Information Gain” to determine how informative an attribute is (i.e.,
how well an attribute classifies the training examples).
• Information Gain is based on a measure that we call Entropy, which characterizes the
impurity of a collection of examples S (i.e., impurity↑ → E(S)↑):
Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖,
where p⊕ and p⊖ are the proportions of positive and negative examples in S, respectively.
• In the case that the target attribute can take n different values:
Entropy(S) ≡ – ∑i pi log2 pi, for i = 1..n
where pi is the proportion of examples in S having the target attribute value i.
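As a quick numeric check of this formula, here is a minimal Python sketch; the helper name entropy and the printed examples are illustrative (the 9+/5– split is the one that appears in the Wind example below):

import math

def entropy(counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:                    # 0 * log2(0) is taken to be 0
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(round(entropy([9, 5]), 3))     # 14 examples, 9 positive / 5 negative -> 0.94
print(entropy([7, 7]))               # maximally impure collection -> 1.0
print(entropy([14, 0]))              # pure collection -> 0.0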
INFORMATION GAIN
Gain(S, A): the expected reduction in entropy caused by partitioning the examples of S according to attribute A:
Gain(S, A) ≡ Entropy(S) – ∑v∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for A, Sv is the subset of S for which A has value v, and |S| is the size of S.
Compute the information gain for the attribute Wind in the PlayTennis data set:
• |S| = 14; attribute Wind has two values: Weak and Strong
• |Sweak| = 8
• |Sstrong| = 6
Now, let us determine Entropy(Sweak):
• Instances = 8, YES = 6, NO = 2, i.e., [6+, 2–]
• Entropy(Sweak) = –(6/8) log2(6/8) – (2/8) log2(2/8) = 0.81
Now, let us determine Entropy(Sstrong):
• Instances = 6, YES = 3, NO = 3, i.e., [3+, 3–]
• Entropy(Sstrong) = –(3/6) log2(3/6) – (3/6) log2(3/6) = 1.0
Finally,
Gain(S, Wind) = Entropy(S) – (8/14) Entropy(Sweak) – (6/14) Entropy(Sstrong)
= 0.940 – (8/14)(0.81) – (6/14)(1.0) = 0.048
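This arithmetic can be reproduced in a few lines of Python. The sketch below reuses an entropy helper like the one above; the class counts come directly from the worked example, and the function names are illustrative:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(total_counts, subset_counts):
    """Gain of a split, given class counts for S and for each subset Sv."""
    n = sum(total_counts)
    remainder = sum(sum(sv) / n * entropy(sv) for sv in subset_counts)
    return entropy(total_counts) - remainder

# S = [9+, 5-]; Wind = Weak -> [6+, 2-], Wind = Strong -> [3+, 3-]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))   # -> 0.048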
Information gain calculations for the remaining attributes:
• Attribute Humidity: Gain(S, humidity) = 0.15
• Attribute Outlook: Gain(S, outlook) = 0.25
• Attribute Temperature: Gain(S, temperature) = 0.03
Summary
• Gain(S, outlook) = 0.25
• Gain(S, temperature) = 0.03
• Gain(S, humidity) = 0.15
• Gain(S, wind) = 0.048
The attribute with the highest information gain is Outlook, so Outlook is used as the root node.
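Selecting the root is then simply an arg-max over the computed gains, e.g. (a short Python sketch using the rounded values from the summary):

gains = {"outlook": 0.25, "temperature": 0.03, "humidity": 0.15, "wind": 0.048}
print(max(gains, key=gains.get))   # -> outlook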
Decision Tree – Next Level
• All leaf nodes are associated with training examples from the same class (entropy=0)
• The attribute temperature is not used
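Putting the pieces together, a simplified recursive sketch in the spirit of ID3, for discrete attributes only, could look as follows. It assumes each training example is a dict of attribute values plus a target label; the helper names (entropy, gain, id3) and the nested-dict tree representation are illustrative choices, not part of the original text:

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # all examples share one class
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Example call (examples as dicts, e.g. {"Outlook": "Sunny", ..., "PlayTennis": "No"}):
# tree = id3(training_examples, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis")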
5. Hypothesis Space Search in Decision Tree Learning
● The hypothesis space searched by ID3 is the set of possible decision trees.
● ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space,
beginning with the empty tree and progressively elaborating the hypothesis until it correctly
classifies the training data.
● Features of the Hypothesis space search in decision tree learning:
• Complete hypothesis space: any finite discrete-valued function can be expressed.
• Incomplete search: searches incompletely through the hypothesis space until the tree is
consistent with the data.
• Single hypothesis: only one current hypothesis (simplest one) is maintained.
• No backtracking: once an attribute is selected, this choice cannot be changed. Problem: the
resulting tree might not be the globally optimal solution.
• Full training set at each step: attributes are selected by computing information gain on the
full training set. Advantage: Robustness to errors. Problem: Non-incremental
6. Inductive Bias in Decision Tree Learning
● Inductive bias is the set of assumptions that, together with the training data, deductively justify
the classifications assigned by the learner to future instances.
● The inductive bias of ID3 is the basis by which it chooses one of the consistent hypotheses
over the others.
● The ID3 search strategy
(a) selects in favor of shorter trees over longer ones
(b) selects trees that place the attributes with highest information gain closest to the root.
● ID3 considers a complete hypothesis space (i.e., one capable of expressing any finite discrete
valued function) but it searches incompletely through this complete hypothesis space, from
simple to complex hypotheses, until its termination condition is met.
● The CANDIDATE-ELIMINATION algorithm considers an incomplete hypothesis space (i.e., one that
can express only a subset of the potentially teachable concepts), but it searches this space
completely, finding every hypothesis consistent with the training data.
Thus,
Preference bias: The inductive bias of ID3 is a preference for certain hypotheses over others (e.g., for
shorter hypotheses), with no hard restriction on the hypotheses that can be represented. This form of
bias is typically called a preference bias (or, alternatively, a search bias).
Restriction bias: The bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical
restriction on the set of hypotheses considered. This form of bias is typically called a restriction
bias (or, alternatively, a language bias).
Typically, a preference bias is more desirable than a restriction bias, because it allows the learner
to work within a complete hypothesis space that is assured to contain the unknown target
function.
b. Why Prefer Short Hypotheses?
William of Occam was one of the first to discuss this question, around the year 1320, so this bias
often goes by the name of Occam's razor.
Argument in favour:
● There are fewer short hypotheses than long hypotheses
○ a short hypothesis that fits the data is unlikely to be a coincidence
○ a long hypothesis that fits the data might be a coincidence
Argument opposed:
● There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of
nodes that use attributes beginning with “Z"
● Occam's razor will produce two different hypotheses from the same training examples when it is
applied by two learners that use different internal representations
Example: two learners, both applying Occam's razor, would generalize in different ways if one used
the XYZ attribute to describe its examples and the other used only the attributes Outlook, Temperature,
Humidity, and Wind.
7. Issues in Decision Tree Learning
Practical issues in learning decision trees include:
○ Determining how deeply to grow the decision tree
○ Handling continuous attributes
○ Choosing an appropriate attribute selection measure
○ Handling training data with missing attribute values
○ Handling attributes with differing costs, and
○ Improving computational efficiency.
Let's discuss how these issues are addressed in the basic ID3 algorithm.
1. Avoiding Overfitting the Data
● Reduced-error pruning: each decision node is considered as a candidate for pruning; pruning a
node means removing the subtree rooted at that node, making it a leaf, and assigning it the most
common classification of the training examples affiliated with that node.
○ Nodes are removed only if the resulting pruned tree performs no worse than
the original over the validation set
○ Nodes are pruned iteratively, always choosing the node whose removal most increases
the accuracy of the decision tree over the validation set
○ Pruning continues until further pruning is harmful (i.e., it decreases accuracy over the
validation set)
○ Here the validation set used for pruning is distinct from both the training and test
sets
○ Disadvantage: Data is limited (withholding part of it for the validation set reduces
even further the number of examples available for training)
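A hedged sketch of this pruning loop, assuming the nested-dict tree representation from the earlier ID3 sketch; the helpers (classify, internal_node_paths, prune_at) are illustrative names, and a full implementation would handle unseen attribute values more carefully:

import copy
from collections import Counter

def classify(tree, example):
    """Walk a nested-dict tree of the form {attribute: {value: subtree_or_label}}."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        branch = tree[attr].get(example.get(attr))
        if branch is None:
            return None                      # unseen value: no prediction
        tree = branch
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_node_paths(tree, path=()):
    """Yield the (attribute, value) path leading to every internal decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from internal_node_paths(subtree, path + ((attr, value),))

def prune_at(tree, path, label):
    """Return a copy of the tree with the node reached via `path` replaced by `label`."""
    if not path:
        return label
    pruned = copy.deepcopy(tree)
    node = pruned
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = label
    return pruned

def majority_at(train, target, path):
    """Most common class among training examples affiliated with the node at `path`."""
    subset = [ex for ex in train if all(ex.get(a) == v for a, v in path)] or train
    return Counter(ex[target] for ex in subset).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Repeatedly prune the node whose removal most improves validation accuracy."""
    while True:
        base = accuracy(tree, validation, target)
        best = None
        for path in internal_node_paths(tree):
            candidate = prune_at(tree, path, majority_at(train, target, path))
            acc = accuracy(candidate, validation, target)
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, candidate)
        if best is None:                     # no non-harmful prune remains
            return tree
        tree = best[1]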
● Rule post-pruning aims to find a high-accuracy hypothesis. It involves the following steps:
○ Infer the decision tree from the training set, growing the tree until the training data are fit
as well as possible and allowing overfitting to occur
○ Convert the learned tree into an equivalent set of rules by creating one rule for each path
from the root to a leaf node
○ Prune each rule by removing any preconditions that result in improving its estimated
accuracy
○ Sort the pruned rules by their estimated accuracy and consider them in this sequence
when classifying subsequent instances
○ One rule is generated for each leaf node in the tree
○ Antecedent: Each attribute test along the path from the root to the leaf
○ Consequent: The classification at the leaf
○ Pruning: remove any antecedent whose removal does not worsen the rule's estimated accuracy
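A minimal sketch of the rule-conversion and antecedent-pruning steps, again assuming the nested-dict tree representation from the earlier sketch; rule_accuracy here is a simple placeholder estimate computed over a held-out example set:

def tree_to_rules(tree, preconditions=()):
    """One (preconditions, classification) rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, preconditions + ((attr, value),))
    return rules

def rule_accuracy(preconditions, label, examples, target):
    """Fraction of covered examples that the rule classifies correctly."""
    covered = [ex for ex in examples if all(ex.get(a) == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(ex[target] == label for ex in covered) / len(covered)

def prune_rule(preconditions, label, examples, target):
    """Drop any antecedent whose removal does not worsen the estimated accuracy."""
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy(preconditions, label, examples, target)
        for pre in list(preconditions):
            candidate = [p for p in preconditions if p != pre]
            if rule_accuracy(candidate, label, examples, target) >= base:
                preconditions, improved = candidate, True
                break
    return preconditions, label

# For instance, the path Outlook=Sunny, Humidity=High -> No becomes the rule
#   IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
# and prune_rule then tries dropping each of the two antecedents in turn.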
2. Incorporating Continuous-Valued Attributes
● Continuous valued attributes can be partitioned into a discrete number of disjoint intervals
and then membership can be tested over these intervals.
Example: If the PlayTennis learning task includes the continuous-valued attribute Temperature, with
values in the range 40–90, then the raw attribute Temperature is a bad choice for classification: it
alone may perfectly classify the training examples, and therefore promise the highest information
gain, while remaining a poor predictor on the test set.
● The solution to this problem is to classify based not on the actual temperature, but on
dynamically determined intervals within which the temperature falls.
● For example, by introducing boolean attributes such as T ≤ a, a < T ≤ b, b < T ≤ c and T > c
(for chosen thresholds a, b and c), instead of the real-valued attribute T.
● In the PlayTennis example, there are two candidate thresholds, corresponding to the values of
Temperature at which the value of PlayTennis changes: (48 + 60)/2, and (80 + 90)/2.
● The information gain can then be computed for each of the candidate attributes,
Temperature > 54 and Temperature > 85, and the best one selected (here Temperature > 54). This
dynamically created boolean attribute can then compete with the other discrete-valued
candidate attributes available for growing the decision tree.
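A sketch of this dynamic threshold selection. The six (Temperature, PlayTennis) pairs below are an assumed reconstruction, chosen only to be consistent with the two candidate thresholds mentioned in the text ((48 + 60)/2 = 54 and (80 + 90)/2 = 85):

import math

def entropy(labels):
    total = len(labels)
    ent = 0.0
    for lab in set(labels):
        p = labels.count(lab) / total
        ent -= p * math.log2(p)
    return ent

def gain_for_threshold(values, labels, threshold):
    """Information gain of the boolean test value > threshold."""
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    n = len(labels)
    remainder = len(above) / n * entropy(above) + len(below) / n * entropy(below)
    return entropy(labels) - remainder

# Assumed six-example fragment consistent with the thresholds in the text:
temperature = [40, 48, 60, 72, 80, 90]
playtennis  = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Candidate thresholds: midpoints where the class label changes.
for t in [54, 85]:
    print(t, round(gain_for_threshold(temperature, playtennis, t), 3))
# Temperature > 54 yields the larger gain and would be selected.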
3. Alternative Measures for Selecting Attributes
● The information gain measure has a natural bias that favors attributes with many values over
those with only a few.
For example: if you imagine an attribute Date with a unique value for each training example,
then Gain(S, Date) will equal Entropy(S), since each subset Sv contains a single example and
therefore has entropy 0.
● Obviously no other attribute can do better. This will result in a very broad tree of depth 1 that is
a very poor predictor over unseen instances.
● To guard against this, GainRatio(S, A) can be used instead of Gain(S, A):
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
where SplitInformation(S, A) ≡ – ∑i (|Si| / |S|) log2(|Si| / |S|), with S1..Sc the subsets of S
produced by partitioning S on the c values of attribute A.
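A sketch of the split-information penalty and the resulting gain ratio, following the formulas above; the subset sizes in the printed examples (14 singleton subsets for a Date-like attribute, an 8/6 split for Wind) are taken from the running example:

import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|Si| / |S|) * log2(|Si| / |S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

# A Date-like attribute that splits 14 examples into 14 singleton subsets incurs
# a large penalty term (log2 14), while Wind's 8/6 split is barely penalized:
print(round(split_information([1] * 14), 2))   # -> 3.81
print(round(split_information([8, 6]), 2))     # -> 0.99
print(round(gain_ratio(0.940, [1] * 14), 3))   # Date-like gain 0.940 shrinks to ~0.247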
4. Handling Training Examples with Missing Attribute Values
● What happens if some of the training examples contain one or more ``?'', meaning ``value not
known'', instead of the actual attribute values?
● Here are some common ad hoc solutions:
• Substitute ``?'' by the most common value in that column.
• Substitute ``?'' by the most common value among all training examples that have been sorted into the
tree at that node.
• Substitute ``?'' by the most common value among all training examples that have been sorted into the
tree at that node with the same classification as the incomplete example.
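A small Python sketch of these three substitution strategies; the record layout and the fill_missing helper are illustrative:

from collections import Counter

def most_common(values):
    """Most common non-missing value in a list (missing values are '?')."""
    known = [v for v in values if v != "?"]
    return Counter(known).most_common(1)[0][0]

def fill_missing(examples, attr, target=None, node_examples=None, same_class_as=None):
    """Return the value to substitute for '?' in attribute `attr`.

    - default:        most common value over all examples ("that column")
    - node_examples:  most common value among examples sorted to the current node
    - same_class_as:  additionally restrict to examples with the given classification
    """
    pool = node_examples if node_examples is not None else examples
    if same_class_as is not None and target is not None:
        pool = [ex for ex in pool if ex[target] == same_class_as]
    return most_common([ex[attr] for ex in pool])

# Illustrative usage:
data = [{"Humidity": "High", "PlayTennis": "No"},
        {"Humidity": "Normal", "PlayTennis": "Yes"},
        {"Humidity": "Normal", "PlayTennis": "Yes"},
        {"Humidity": "?", "PlayTennis": "Yes"}]
print(fill_missing(data, "Humidity"))                                   # -> Normal
print(fill_missing(data, "Humidity", target="PlayTennis",
                   same_class_as="Yes"))                                # -> Normal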
5. Handling attributes with differing costs
● In some learning tasks the instance attributes may have associated costs.
For example, in learning to classify medical diseases, patients can be described in terms of attributes
such as Temperature, Biopsy-Result, Pulse, Blood Test-Results, etc. These attributes vary
significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such
tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-
cost attributes only when needed to produce reliable classifications.
● This can be done by dividing the gain by the cost of the attribute, Cost(A), so that lower-cost
attributes are preferred.
● Use a CostedGain(S, A) which is defined along the lines of:
CostedGain(S, A) ≡ Gain(S, A) / Cost(A)
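A minimal sketch of such a cost-weighted gain. The exact functional form varies in the literature; this version simply divides the gain by the cost, as suggested above, and the gain and cost values shown are illustrative:

def costed_gain(gain, cost):
    """Cost-sensitive attribute quality: information gain scaled down by cost."""
    return gain / cost

# Illustrative costs: a cheap bedside reading vs. an expensive lab result.
attributes = {
    "Temperature":  {"gain": 0.25, "cost": 1.0},
    "BiopsyResult": {"gain": 0.40, "cost": 10.0},
}
best = max(attributes, key=lambda a: costed_gain(attributes[a]["gain"],
                                                 attributes[a]["cost"]))
print(best)   # -> Temperature (more gain per unit cost)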