Machine Learning: MVJ21CS62
MODULE 2
INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree.
Learned trees can also be re-represented as sets of if-then rules to improve human readability.
These learning methods are among the most popular of inductive inference algorithms and have been
successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to
assess credit risk of loan applicants.
DECISION TREE REPRESENTATION
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies a test of some attribute of the
instance, and each branch descending from that node corresponds to one of the possible values for this
attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of the attribute in the given
example. This process is then repeated for the subtree rooted at the new node.
Fig 3.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings
according to whether they are suitable for playing tennis.
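To make this sorting procedure concrete, the following minimal Python sketch stores a tree as nested structures and walks an instance down it. The tree literal is meant to be the PlayTennis tree of Figure 3.1; since the figure is not reproduced in these notes, treat the exact branch structure as an assumption.

# A decision tree is either a class label (a leaf) or a pair
# (attribute, {attribute value: subtree, ...}) for an internal node.
play_tennis_tree = (
    "Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
    },
)

def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(tree, tuple):            # internal node: test an attribute
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the branch for its value
    return tree                               # leaf: the predicted classification

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))   # prints "No"

This is the instance referred to in the first bullet below, which the tree classifies as PlayTennis = no.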
• For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
• This tree and the example used in Table 3.2 to illustrate the ID3 learning algorithm are adapted from
(Quinlan 1986). In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances.
• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to
a disjunction of these conjunctions.
• For example, the decision tree shown in Figure 3.1 corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
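Read off path by path, the same tree can be written directly as executable boolean logic. The short sketch below (again assuming the Figure 3.1 tree) uses one conjunctive clause per root-to-Yes path, joined by disjunction:

def play_tennis(outlook, humidity, wind):
    # One conjunction per path from the root to a "Yes" leaf,
    # combined by disjunction.
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Rain", "High", "Weak"))   # True: the third clause matches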
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
• The training data may contain errors. Decision tree learning methods are robust to errors, both errors
in classifications of the training examples and errors in the attribute values that describe these examples.
• Training data may contain missing attribute values. Decision tree methods can be used even when
some training examples have unknown values (e.g., if the Humidity of the day is known for only some
of the training examples). This issue is discussed in Section 3.7.4.
Many practical problems have been found to fit these characteristics. Decision tree learning has
therefore been applied to problems such as learning to classify medical patients by their disease,
equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on
payments. Such problems, in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
THE BASIC DECISION TREE LEARNING ALGORITHM
• Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. This approach is exemplified by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan
1993), which form the primary focus of our discussion here.
• Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question "Which attribute should be tested at the root of the tree?" To answer this question, each instance
attribute is evaluated using a statistical test to determine how well it alone classifies the training
examples.
• The best attribute is selected and used as the test at the root node of the tree. A descendant of the root
node is then created for each possible value of this attribute, and the training examples are sorted to the
appropriate descendant node (i.e., down the branch corresponding to the example's value for this
attribute).
• The entire process is then repeated using the training examples associated with each descendant node to
select the best attribute to test at that point in the tree.
• This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to
reconsider earlier choices.
• A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept
learning), is described in Table 3.1.
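A compact Python sketch of this top-down, greedy construction is shown below. It is not a transcription of Table 3.1: the attribute-selection criterion is passed in as a function (in ID3 proper it is the information gain measure defined in the next subsection), and the helper names and example encoding are choices made here for illustration.

from collections import Counter

def majority_label(examples):
    """Most common target value among (instance, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes, choose_attribute):
    """Grow a decision tree top-down, never backtracking.

    examples         : list of (instance dict, label) pairs
    attributes       : attributes still available for testing
    choose_attribute : function (examples, attributes) -> attribute to test
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:              # all examples agree: return a leaf
        return labels.pop()
    if not attributes:                # nothing left to test: majority leaf
        return majority_label(examples)

    best = choose_attribute(examples, attributes)   # statistical test of each attribute
    branches = {}
    for value in {x[best] for x, _ in examples}:    # one descendant per observed value
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = id3(subset, remaining, choose_attribute)
    return (best, branches)           # same (attribute, branches) encoding as the earlier sketch

Because each call commits to one attribute and then recurses, the search is greedy and never reconsiders earlier choices, exactly as described above.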
What is a good quantitative measure of the worth of an attribute? We will define a statistical property,
called information gain, that measures how well a given attribute separates the training examples
according to their target classification. ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.
• To define information gain precisely, we first define a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples.
• Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples in S, p- is the proportion of negative examples, and 0 log2 0 is taken to be 0. Notice that the entropy is 0 if all members of S belong to the same class.
• For example, if all members are positive (p+ = 1), then p- is 0, and
Entropy(S) = -1 · log2(1) - 0 · log2(0) = -1 · 0 - 0 = 0.
• Note the entropy is 1 when the collection contains an equal number of positive and negative examples.
• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0
and 1. Figure 3.2 shows the form of the entropy function relative to a boolean classification, as p+ varies between 0 and 1.
• One interpretation of entropy from information theory is that it specifies the minimum number of bits of
information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn
at random with uniform probability).
• For example, if p+ is 1, the receiver knows the drawn example will be positive, so no message need be
sent, and the entropy is zero.
• On the other hand, if p+ is 0.5, one bit is required to indicate whether the drawn example is positive or
negative.
• If p+ is 0.8, then a collection of messages can be encoded using on average less than 1 bit per message
by assigning shorter codes to collections of positive examples and longer codes to less likely negative
examples.
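• As a quick check of the p+ = 0.8 case: Entropy(S) = -0.8 log2(0.8) - 0.2 log2(0.2) ≈ 0.258 + 0.464 ≈ 0.72, so an optimal code needs only about 0.72 bits per example on average.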
• Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) -pi log2(pi)
where pi is the proportion of S belonging to class i.
• Note the logarithm is still base 2 because entropy is a measure of the expected encoding length
measured in bits.
• Note also that if the target attribute can take on c possible values, the entropy can be as large as log2 c.
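The general formula is easy to check numerically. The minimal sketch below computes entropy from per-class example counts (the function name and count-list interface are choices made here, not part of the text):

import math

def entropy(counts):
    """Entropy(S) = sum over classes of p_i * log2(1 / p_i)
    (the same formula as above, written to avoid taking the log of zero)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

print(entropy([7, 7]))    # 1.0    - equal numbers of positive and negative examples
print(entropy([14, 0]))   # 0.0    - all members belong to the same class
print(entropy([9, 5]))    # 0.940... - the [9+, 5-] collection used in the Wind example below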
The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
• The expected entropy described by this second term is simply the sum of the entropies of each subset Sv, weighted by the fraction |Sv| / |S| of examples that belong to Sv. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.
• Put another way, Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
• The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary
member of S, by knowing the value of attribute A.
• For example, suppose S is a collection of training-example days described by attributes including Wind,
which can have the values Weak or Strong.
• As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6
of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind =
Strong.
• The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as
Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
= 0.940 - (8/14)(0.811) - (6/14)(1.00)
= 0.048
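A short sketch confirms this arithmetic from the class counts alone (entropy is re-defined so the snippet runs on its own; the variable names are choices made here):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

# S = [9+, 5-]; Wind = Weak covers [6+, 2-], Wind = Strong covers [3+, 3-]
S, S_weak, S_strong = [9, 5], [6, 2], [3, 3]
n = sum(S)
gain = (entropy(S)
        - (sum(S_weak) / n) * entropy(S_weak)
        - (sum(S_strong) / n) * entropy(S_strong))
print(round(gain, 3))   # 0.048

Computing this quantity for every candidate attribute and keeping the largest plays the choose_attribute role in the earlier ID3 sketch.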
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing
the tree.
The use of information gain to evaluate the relevance of attributes is summarized in Figure 3.3. In this
figure the information gain of two different attributes, Humidity and Wind, is computed in order to
determine which is the better attribute for classifying the training examples shown in Table 3.2.
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
1. As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
2. The hypothesis space searched by ID3 is the set of possible decision trees.
3. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with
the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree
that correctly classifies the training data.
4. The evaluation function that guides this hill-climbing search is the information gain measure. This
search is depicted in Figure 3.5.
5. By viewing ID3 in terms of its search space and search strategy, we can get some insight into its
capabilities and limitations.
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions,
relative to the available attributes. Because every finite discrete-valued function can be represented by
some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not
contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This
contrasts, for example, with the earlier version space candidate elimination method, which maintains the
set of all hypotheses consistent with the available training examples.
• By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly
representing all consistent hypotheses. For example, it does not have the ability to determine how many
alternative decision trees are consistent with the available training data, or to pose new instance queries
that optimally resolve among these competing hypotheses.
• ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a
particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the
usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are
not globally optimal.
• In the case of ID3, a locally optimal solution corresponds to the decision tree it selects along the single
search path it explores. However, this locally optimal solution may be less desirable than trees that would have been encountered along a different branch of the search.
INDUCTIVE BIAS IN DECISION TREE LEARNING
• What is the policy by which ID3 generalizes from observed training examples to classify unseen instances? In other words, what is its inductive bias? Recall from Chapter 2 that inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
• Given a collection of training examples, there are typically many decision trees consistent with these
examples.
• Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of
these consistent hypotheses over the others.
• Which of these decision trees does ID3 choose?
• It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through
the space of possible trees.
• Roughly speaking, then, the ID3 search strategy
• selects in favour of shorter trees over longer ones, and
• selects trees that place the attributes with highest information gain closest to the root.
• Because of the subtle interaction between the attribute selection heuristic used by ID3 and the particular
training examples it encounters, it is difficult to characterize precisely the inductive bias exhibited by
ID3. However, we can approximately characterize its bias as a preference for short decision trees over complex trees.
• In fact, one could imagine an algorithm similar to ID3 that exhibits precisely this inductive bias.
• Consider an algorithm that begins with the empty tree and searches breadth-first through progressively more complex trees, first considering all trees of depth 1, then all trees of depth 2, and so on (a small illustrative sketch of this search appears after this list).
• Once it finds a decision tree consistent with the training data, it returns the smallest consistent tree at
that search depth (e.g., the tree with the fewest nodes).
• Let us call this breadth-first search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus
exhibits precisely the bias "shorter trees are preferred over longer trees."
• ID3 can be viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search to
attempt to find the shortest tree without conducting the entire breadth-first search through the hypothesis
space.
• Because ID3 uses the information gain heuristic and a hill-climbing strategy, it exhibits a more complex
bias than BFS-ID3.
• In particular, it does not always find the shortest consistent tree, and it is biased to favor trees that place
attributes with high information gain closest to the root.
• A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer trees.
Trees that place high information gain attributes close to the root are preferred over those that do not.
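To make the hypothetical BFS-ID3 concrete, the sketch below enumerates trees level by level and returns a smallest consistent one. Everything in it (the function names, the tree encoding, the toy four-example dataset) is invented here purely for illustration, and the enumeration is exponential in the number of attributes and depth, which is exactly why ID3 approximates this search greedily instead.

from itertools import product

def all_trees(attributes, values, depth, labels=("Yes", "No")):
    """Enumerate every decision tree of depth <= depth (illustrative only)."""
    for label in labels:                        # depth-0 trees are bare leaves
        yield label
    if depth == 0 or not attributes:
        return
    for attr in attributes:
        rest = [a for a in attributes if a != attr]
        choices = [list(all_trees(rest, values, depth - 1, labels))
                   for _ in values[attr]]       # one subtree per attribute value
        for combo in product(*choices):
            yield (attr, dict(zip(values[attr], combo)))

def classify(tree, instance):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[instance[attr]]
    return tree

def size(tree):
    """Node count, used to break ties among equally shallow consistent trees."""
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(s) for s in tree[1].values())

def bfs_id3(examples, attributes, values, max_depth=2):
    """Breadth-first by depth: return a smallest tree consistent with the data."""
    for depth in range(max_depth + 1):
        consistent = [t for t in all_trees(attributes, values, depth)
                      if all(classify(t, x) == y for x, y in examples)]
        if consistent:
            return min(consistent, key=size)
    return None

# Toy data invented for this illustration: the target concept is Wind = Weak.
values = {"Wind": ["Weak", "Strong"], "Humidity": ["High", "Normal"]}
examples = [({"Wind": "Weak", "Humidity": "High"}, "Yes"),
            ({"Wind": "Weak", "Humidity": "Normal"}, "Yes"),
            ({"Wind": "Strong", "Humidity": "High"}, "No"),
            ({"Wind": "Strong", "Humidity": "Normal"}, "No")]
print(bfs_id3(examples, ["Wind", "Humidity"], values))
# ('Wind', {'Weak': 'Yes', 'Strong': 'No'}) - the shortest consistent tree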
ISSUES IN DECISION TREE LEARNING
Practical issues in learning decision trees include determining how deeply to grow the decision tree, handling continuous attributes, choosing an appropriate attribute selection measure, handling training data with missing attribute values, handling attributes with differing costs, and improving computational efficiency.
Below we discuss each of these issues and extensions to the basic ID3 algorithm that address them. ID3 has itself been extended to address most of these issues, with the resulting system renamed C4.5 (Quinlan 1993).
AVOIDING OVERFITTING THE DATA
• We will say that a hypothesis overfits the training examples if some other hypothesis that fits the
training examples less well actually performs better over the entire distribution of instances (i.e.,
including instances beyond the training set).
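Written symbolically, in the standard form of this definition: a hypothesis h ∈ H overfits the training data if there exists some alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')  but  error_D(h') < error_D(h),
where error_train is the error measured over the training examples and error_D is the error over the entire distribution D of instances.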