Machine Learning: MVJ21CS62
MODULE 2
INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree.
Learned trees can also be re-represented as sets of if-then rules to improve human readability.
These learning methods are among the most popular of inductive inference algorithms and have been
successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to
assess credit risk of loan applicants.
DECISION TREE REPRESENTATION
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies a test of some attribute of the
instance, and each branch descending from that node corresponds to one of the possible values for this
attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of the attribute in the given
example. This process is then repeated for the subtree rooted at the new node.
Fig 3.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings
according to whether they are suitable for playing tennis.
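To make this sorting procedure concrete, the following minimal Python sketch stores a tree as nested structures and walks an instance down it. The tree literal is meant to be the PlayTennis tree of Figure 3.1; since the figure is not reproduced in these notes, treat the exact branch structure as an assumption.

# A decision tree is either a class label (a leaf) or a pair
# (attribute, {attribute value: subtree, ...}) for an internal node.
play_tennis_tree = (
    "Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
    },
)

def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(tree, tuple):            # internal node: test an attribute
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the branch for its value
    return tree                               # leaf: the predicted classification

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))   # prints "No"

This is the instance referred to in the first bullet below, which the tree classifies as PlayTennis = no.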
• For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
• This tree and the example used in Table 3.2 to illustrate the ID3 learning algorithm are adapted from
(Quinlan 1986). In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances.
• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to
a disjunction of these conjunctions.
• For example, the decision tree shown in Figure 3.1 corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
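Read off path by path, the same tree can be written directly as executable boolean logic. The short sketch below (again assuming the Figure 3.1 tree) uses one conjunctive clause per root-to-Yes path, joined by disjunction:

def play_tennis(outlook, humidity, wind):
    # One conjunction per path from the root to a "Yes" leaf,
    # combined by disjunction.
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Rain", "High", "Weak"))   # True: the third clause matches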
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
• The training data may contain errors. Decision tree learning methods are robust to errors, both errors
in classifications of the training examples and errors in the attribute values that describe these examples.
• Training data may contain missing attribute values. Decision tree methods can be used even when
some training examples have unknown values (e.g., if the Humidity of the day is known for only some
of the training examples). This issue is discussed in Section 3.7.4.
Many practical problems have been found to fit these characteristics. Decision tree learning has
therefore been applied to problems such as learning to classify medical patients by their disease,
equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on
payments. Such problems, in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
THE BASIC DECISION TREE LEARNING ALGORITHM
• Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. This approach is exemplified by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan
1993), which form the primary focus of our discussion here.
• Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question "Which attribute should be tested at the root of the tree?" To answer this question, each instance
attribute is evaluated using a statistical test to determine how well it alone classifies the training
examples.
• The best attribute is selected and used as the test at the root node of the tree. A descendant of the root
node is then created for each possible value of this attribute, and the training examples are sorted to the
appropriate descendant node (i.e., down the branch corresponding to the example's value for this
attribute).
• The entire process is then repeated using the training examples associated with each descendant node to
select the best attribute to test at that point in the tree.
• This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to
reconsider earlier choices.
• A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept
learning), is described in Table 3.1.
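A compact Python sketch of this top-down, greedy construction is shown below. It is not a transcription of Table 3.1: the attribute-selection criterion is passed in as a function (in ID3 proper it is the information gain measure defined in the next subsection), and the helper names and example encoding are choices made here for illustration.

from collections import Counter

def majority_label(examples):
    """Most common target value among (instance, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes, choose_attribute):
    """Grow a decision tree top-down, never backtracking.

    examples         : list of (instance dict, label) pairs
    attributes       : attributes still available for testing
    choose_attribute : function (examples, attributes) -> attribute to test
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:              # all examples agree: return a leaf
        return labels.pop()
    if not attributes:                # nothing left to test: majority leaf
        return majority_label(examples)

    best = choose_attribute(examples, attributes)   # statistical test of each attribute
    branches = {}
    for value in {x[best] for x, _ in examples}:    # one descendant per observed value
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = id3(subset, remaining, choose_attribute)
    return (best, branches)           # same (attribute, branches) encoding as the earlier sketch

Because each call commits to one attribute and then recurses, the search is greedy and never reconsiders earlier choices, exactly as described above.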
What is a good quantitative measure of the worth of an attribute? We will define a statistical property,
called information gain, that measures how well a given attribute separates the training examples
according to their target classification. ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.
• To define information gain precisely, we first define a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples.
• Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples in S, p- is the proportion of negative examples, and 0 log2 0 is taken to be 0. Notice that the entropy is 0 if all members of S belong to the same class.
• For example, if all members are positive (p+ = 1), then p- is 0, and
Entropy(S) = -1 · log2(1) - 0 · log2(0) = -1 · 0 - 0 = 0.
• Note the entropy is 1 when the collection contains an equal number of positive and negative examples.
• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0
and 1. Figure 3.2 shows the form of the entropy function relative to a boolean classification, as p+ varies between 0 and 1.
• One interpretation of entropy from information theory is that it specifies the minimum number of bits of
information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn
at random with uniform probability).
• For example, if p+ is 1, the receiver knows the drawn example will be positive, so no message need be
sent, and the entropy is zero.
• On the other hand, if p+ is 0.5, one bit is required to indicate whether the drawn example is positive or
negative.
• If p+ is 0.8, then a collection of messages can be encoded using on average less than 1 bit per message
by assigning shorter codes to collections of positive examples and longer codes to less likely negative
examples.
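• As a quick check of the p+ = 0.8 case: Entropy(S) = -0.8 log2(0.8) - 0.2 log2(0.2) ≈ 0.258 + 0.464 ≈ 0.72, so an optimal code needs only about 0.72 bits per example on average.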
• Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) -pi log2(pi)
where pi is the proportion of S belonging to class i.
• Note the logarithm is still base 2 because entropy is a measure of the expected encoding length
measured in bits.
• Note also that if the target attribute can take on c possible values, the entropy can be as large as log2 c.
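The general formula is easy to check numerically. The minimal sketch below computes entropy from per-class example counts (the function name and count-list interface are choices made here, not part of the text):

import math

def entropy(counts):
    """Entropy(S) = sum over classes of p_i * log2(1 / p_i)
    (the same formula as above, written to avoid taking the log of zero)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

print(entropy([7, 7]))    # 1.0    - equal numbers of positive and negative examples
print(entropy([14, 0]))   # 0.0    - all members belong to the same class
print(entropy([9, 5]))    # 0.940... - the [9+, 5-] collection used in the Wind example below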
The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
• The expected entropy described by this second term is simply the sum of the entropies of each subset Sv, weighted by the fraction |Sv| / |S| of examples that belong to Sv. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.
• Put another way, Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
• The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary
member of S, by knowing the value of attribute A.
• For example, suppose S is a collection of training-example days described by attributes including Wind,
which can have the values Weak or Strong.
• As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6
of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind =
Strong.
• The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as
Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
= 0.940 - (8/14)(0.811) - (6/14)(1.00)
= 0.048
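A short sketch confirms this arithmetic from the class counts alone (entropy is re-defined so the snippet runs on its own; the variable names are choices made here):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

# S = [9+, 5-]; Wind = Weak covers [6+, 2-], Wind = Strong covers [3+, 3-]
S, S_weak, S_strong = [9, 5], [6, 2], [3, 3]
n = sum(S)
gain = (entropy(S)
        - (sum(S_weak) / n) * entropy(S_weak)
        - (sum(S_strong) / n) * entropy(S_strong))
print(round(gain, 3))   # 0.048

Computing this quantity for every candidate attribute and keeping the largest plays the choose_attribute role in the earlier ID3 sketch.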
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing
the tree.
The use of information gain to evaluate the relevance of attributes is summarized in Figure 3.3. In this
figure the information gain of two different attributes, Humidity and Wind, is computed in order to
determine which is the better attribute for classifying the training examples shown in Table 3.2.
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
1. As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
2. The hypothesis space searched by ID3 is the set of possible decision trees.
3. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with
the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree
that correctly classifies the training data.
4. The evaluation function that guides this hill-climbing search is the information gain measure. This
search is depicted in Figure 3.5.
5. By viewing ID3 in terms of its search space and search strategy, we can get some insight into its
capabilities and limitations.
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions,
relative to the available attributes. Because every finite discrete-valued function can be represented by
some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not
contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This
contrasts, for example, with the earlier version space candidate elimination method, which maintains the
set of all hypotheses consistent with the available training examples.
• By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly
representing all consistent hypotheses. For example, it does not have the ability to determine how many
alternative decision trees are consistent with the available training data, or to pose new instance queries
that optimally resolve among these competing hypotheses.
• ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a
particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the
usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are
not globally optimal.
• In the case of ID3, a locally optimal solution corresponds to the decision tree it selects along the single
search path it explores. However, this locally optimal solution may be less desirable than trees that would have been encountered along a different branch of the search.
INDUCTIVE BIAS IN DECISION TREE LEARNING
• What is the policy by which ID3 generalizes from observed training examples to classify unseen instances? In other words, what is its inductive bias? Recall from Chapter 2 that inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
• Given a collection of training examples, there are typically many decision trees consistent with these
examples.
• Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of
these consistent hypotheses over the others.
• Which of these decision trees does ID3 choose?
• It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through
the space of possible trees.
• Roughly speaking, then, the ID3 search strategy
• selects in favour of shorter trees over longer ones, and
• selects trees that place the attributes with highest information gain closest to the root.
• Because of the subtle interaction between the attribute selection heuristic used by ID3 and the particular
training examples it encounters, it is difficult to characterize precisely the inductive bias exhibited by
ID3. However, we can approximately characterize its bias as a preference for short decision trees over complex trees.
• In fact, one could imagine an algorithm similar to ID3 that exhibits precisely this inductive bias.
• Consider an algorithm that begins with the empty tree and searches breadth-first through progressively more complex trees, first considering all trees of depth 1, then all trees of depth 2, and so on (a small illustrative sketch of this search appears after this list).
• Once it finds a decision tree consistent with the training data, it returns the smallest consistent tree at
that search depth (e.g., the tree with the fewest nodes).
• Let us call this breadth-first search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus
exhibits precisely the bias "shorter trees are preferred over longer trees."
• ID3 can be viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search to
attempt to find the shortest tree without conducting the entire breadth-first search through the hypothesis
space.
• Because ID3 uses the information gain heuristic and a hill-climbing strategy, it exhibits a more complex
bias than BFS-ID3.
• In particular, it does not always find the shortest consistent tree, and it is biased to favor trees that place
attributes with high information gain closest to the root.
• A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer trees.
Trees that place high information gain attributes close to the root are preferred over those that do not.
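To make the hypothetical BFS-ID3 concrete, the sketch below enumerates trees level by level and returns a smallest consistent one. Everything in it (the function names, the tree encoding, the toy four-example dataset) is invented here purely for illustration, and the enumeration is exponential in the number of attributes and depth, which is exactly why ID3 approximates this search greedily instead.

from itertools import product

def all_trees(attributes, values, depth, labels=("Yes", "No")):
    """Enumerate every decision tree of depth <= depth (illustrative only)."""
    for label in labels:                        # depth-0 trees are bare leaves
        yield label
    if depth == 0 or not attributes:
        return
    for attr in attributes:
        rest = [a for a in attributes if a != attr]
        choices = [list(all_trees(rest, values, depth - 1, labels))
                   for _ in values[attr]]       # one subtree per attribute value
        for combo in product(*choices):
            yield (attr, dict(zip(values[attr], combo)))

def classify(tree, instance):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[instance[attr]]
    return tree

def size(tree):
    """Node count, used to break ties among equally shallow consistent trees."""
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(s) for s in tree[1].values())

def bfs_id3(examples, attributes, values, max_depth=2):
    """Breadth-first by depth: return a smallest tree consistent with the data."""
    for depth in range(max_depth + 1):
        consistent = [t for t in all_trees(attributes, values, depth)
                      if all(classify(t, x) == y for x, y in examples)]
        if consistent:
            return min(consistent, key=size)
    return None

# Toy data invented for this illustration: the target concept is Wind = Weak.
values = {"Wind": ["Weak", "Strong"], "Humidity": ["High", "Normal"]}
examples = [({"Wind": "Weak", "Humidity": "High"}, "Yes"),
            ({"Wind": "Weak", "Humidity": "Normal"}, "Yes"),
            ({"Wind": "Strong", "Humidity": "High"}, "No"),
            ({"Wind": "Strong", "Humidity": "Normal"}, "No")]
print(bfs_id3(examples, ["Wind", "Humidity"], values))
# ('Wind', {'Weak': 'Yes', 'Strong': 'No'}) - the shortest consistent tree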
ISSUES IN DECISION TREE LEARNING
Practical issues in learning decision trees include determining how deeply to grow the decision tree, handling continuous attributes, choosing an appropriate attribute selection measure, handling training data with missing attribute values, handling attributes with differing costs, and improving computational efficiency.
Below we discuss each of these issues and extensions to the basic ID3 algorithm that address them. ID3 has itself been extended to address most of these issues, with the resulting system renamed C4.5 (Quinlan 1993).
AVOIDING OVERFITTING THE DATA
• We will say that a hypothesis overfits the training examples if some other hypothesis that fits the
training examples less well actually performs better over the entire distribution of instances (i.e.,
including instances beyond the training set).
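Written symbolically, in the standard form of this definition: a hypothesis h ∈ H overfits the training data if there exists some alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')  but  error_D(h') < error_D(h),
where error_train is the error measured over the training examples and error_D is the error over the entire distribution D of instances.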