Module 3-Decision Tree Learning
○ Repeat: test the attribute specified by the node, then move down the tree branch
corresponding to the value of that attribute in the given example
A Decision Tree for the concept PlayTennis.
Training examples table: Day, Outlook, Temperature, Humidity, Wind, PlayTennis.
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
● This instance will be sorted down to the leftmost branch of the decision tree shown on the
previous slide and will be classified as a negative example.
1. Instances are represented by attribute-value pairs: Instances are described by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree
learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild,
Cold).
2. The target function has discrete output values: The decision tree for the concept PlayTennis
assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods can
also handle target functions with more than two possible output values.
3. Disjunctive descriptions may be required: As noted above, decision trees naturally
represent disjunctive expressions.
4. The training data may contain errors: Decision tree learning methods are robust to
errors, both errors in classifications of the training examples and errors in the attribute
values that describe these examples.
5. The training data may contain missing attribute values: Decision tree methods can be
used even when some training examples have unknown values (e.g., if the Humidity of the
day is known for only some of the training examples).
• Many practical problems, such as learning to classify medical patients by their disease,
equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting
on payments, have been found to fit these characteristics.
• Such problems, in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
4. Basic Decision Tree Learning Algorithm
● Decision Tree learning algorithms employ a top-down, greedy search through the space of
possible decision trees.
● The ID3 algorithm is one of the most commonly used Decision Tree learning algorithms, and it
applies this general approach to learning the decision tree.
ID3 Algorithm (Iterative Dichotomiser 3)
• ID3 algorithm uses “Information Gain” to determine how informative an attribute is (i.e.,
how well an attribute classifies the training examples).
• Information Gain is based on a measure that we call Entropy, which characterizes the
impurity of a collection of examples S (i.e., impurity↑ → E(S)↑):
Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖,
where p⊕ and p⊖ are the proportions of positive and negative examples in S, respectively.
• In the case that the target attribute can take n different values:
Entropy(S) ≡ – ∑i pi log2 pi, for i = 1..n
where pi is the proportion of examples in S having the target attribute value i.
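As a quick numeric check of this formula, here is a minimal Python sketch; the helper name entropy and the printed examples are illustrative (the 9+/5– split is the one that appears in the Wind example below):

import math

def entropy(counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:                    # 0 * log2(0) is taken to be 0
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(round(entropy([9, 5]), 3))     # 14 examples, 9 positive / 5 negative -> 0.94
print(entropy([7, 7]))               # maximally impure collection -> 1.0
print(entropy([14, 0]))              # pure collection -> 0.0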
INFORMATION GAIN
Gain(S, A): the expected reduction in entropy caused by partitioning the examples of S according to attribute A:
Gain(S, A) ≡ Entropy(S) – ∑v∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for A, Sv is the subset of S for which A has value v, and |S| is the size of S.
Compute the information gain for the attribute Wind in the PlayTennis data set:
• |S| = 14; attribute Wind has two values: Weak and Strong
• |Sweak| = 8
• |Sstrong| = 6
Now, let us determine Entropy(Sweak):
• Instances = 8, YES = 6, NO = 2, i.e., [6+, 2–]
• Entropy(Sweak) = –(6/8) log2(6/8) – (2/8) log2(2/8) = 0.81
Now, let us determine Entropy(Sstrong):
• Instances = 6, YES = 3, NO = 3, i.e., [3+, 3–]
• Entropy(Sstrong) = –(3/6) log2(3/6) – (3/6) log2(3/6) = 1.0
Finally,
Gain(S, Wind) = Entropy(S) – (8/14) Entropy(Sweak) – (6/14) Entropy(Sstrong)
= 0.940 – (8/14)(0.81) – (6/14)(1.0) = 0.048
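This arithmetic can be reproduced in a few lines of Python. The sketch below reuses an entropy helper like the one above; the class counts come directly from the worked example, and the function names are illustrative:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(total_counts, subset_counts):
    """Gain of a split, given class counts for S and for each subset Sv."""
    n = sum(total_counts)
    remainder = sum(sum(sv) / n * entropy(sv) for sv in subset_counts)
    return entropy(total_counts) - remainder

# S = [9+, 5-]; Wind = Weak -> [6+, 2-], Wind = Strong -> [3+, 3-]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))   # -> 0.048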
Information gain calculations for the remaining attributes:
• Attribute Humidity: Gain(S, humidity) = 0.15
• Attribute Outlook: Gain(S, outlook) = 0.25
• Attribute Temperature: Gain(S, temperature) = 0.03
Summary
• Gain(S, outlook) = 0.25
• Gain(S, temperature) = 0.03
• Gain(S, humidity) = 0.15
• Gain(S, wind) = 0.048
The attribute with the highest information gain is Outlook, so Outlook is used as the root node.
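Selecting the root is then simply an arg-max over the computed gains, e.g. (a short Python sketch using the rounded values from the summary):

gains = {"outlook": 0.25, "temperature": 0.03, "humidity": 0.15, "wind": 0.048}
print(max(gains, key=gains.get))   # -> outlook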
Decision Tree – Next Level
• All leaf nodes are associated with training examples from the same class (entropy=0)
• The attribute temperature is not used
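Putting the pieces together, a simplified recursive sketch in the spirit of ID3, for discrete attributes only, could look as follows. It assumes each training example is a dict of attribute values plus a target label; the helper names (entropy, gain, id3) and the nested-dict tree representation are illustrative choices, not part of the original text:

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # all examples share one class
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Example call (examples as dicts, e.g. {"Outlook": "Sunny", ..., "PlayTennis": "No"}):
# tree = id3(training_examples, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis")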
5. Hypothesis Space Search in Decision Tree Learning
● The hypothesis space searched by ID3 is the set of possible decision trees.
● ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space,
beginning with the empty tree and progressively elaborating the hypothesis until it correctly
classifies the training data.
● Features of the Hypothesis space search in decision tree learning:
• Complete hypothesis space: any finite discrete-valued function can be expressed.
• Incomplete search: searches incompletely through the hypothesis space until the tree is
consistent with the data.
• Single hypothesis: only one current hypothesis (simplest one) is maintained.
• No backtracking: once an attribute is selected, this choice cannot be changed. Problem: the
resulting tree might not be the globally optimal solution.
• Full training set at each step: attributes are selected by computing information gain on the
full training set. Advantage: Robustness to errors. Problem: Non-incremental
6. Inductive Bias in Decision Tree Learning
● Inductive bias is the set of assumptions that, together with the training data, deductively justify
the classifications assigned by the learner to future instances.
● The inductive bias of ID3 is the basis by which it chooses one of the consistent hypotheses
over the others.
● The ID3 search strategy
(a) selects in favor of shorter trees over longer ones
(b) selects trees that place the attributes with highest information gain closest to the root.
● ID3 considers a complete hypothesis space (i.e., one capable of expressing any finite discrete
valued function) but it searches incompletely through this complete hypothesis space, from
simple to complex hypotheses, until its termination condition is met.
● The CANDIDATE-ELIMINATION algorithm considers an incomplete hypothesis space (i.e., one that
can express only a subset of the potentially teachable concepts), but it searches this space
completely, finding every hypothesis consistent with the training data.
Thus,
Preference bias: The inductive bias of ID3 is a preference for certain hypotheses over others (e.g., for
shorter hypotheses), with no hard restriction on the hypotheses that can be represented. This form of
bias is typically called a preference bias (or, alternatively, a search bias).
Restriction bias: The bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical
restriction on the set of hypotheses considered. This form of bias is typically called a restriction
bias (or, alternatively, a language bias).
Typically, a preference bias is more desirable than a restriction bias, because it allows the learner
to work within a complete hypothesis space that is assured to contain the unknown target
function.
b. Why Prefer Short Hypotheses?
William of Occam was one of the first to discuss this question, around the year 1320, so this bias
often goes by the name of Occam's razor.
Argument in favour:
● There are fewer short hypotheses than long hypotheses
○ a short hypothesis that fits the data is unlikely to be a coincidence
○ a long hypothesis that fits the data might be a coincidence
Argument opposed:
● There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of
nodes that use attributes beginning with “Z"
● Occam's razor will produce two different hypotheses from the same training examples when it is
applied by two learners that use different internal representations
Example: two learners, both applying Occam's razor, would generalize in different ways if one used
the XYZ attribute to describe its examples and the other used only the attributes Outlook, Temperature,
Humidity, and Wind.
7. Issues in Decision Tree Learning
Practical issues in learning decision trees include:
○ Determining how deeply to grow the decision tree
○ Handling continuous attributes
○ Choosing an appropriate attribute selection measure
○ Handling training data with missing attribute values
○ Handling attributes with differing costs, and
○ Improving computational efficiency.
Let's discuss how these issues are addressed in the basic ID3 algorithm.
1. Avoiding Overfitting the Data
● Reduced-error pruning: each decision node is considered as a candidate for pruning; pruning a
node means removing the subtree rooted at that node, making it a leaf, and assigning it the most
common classification of the training examples affiliated with that node.
○ Nodes are removed only if the resulting pruned tree performs no worse than
the original over the validation set
○ Nodes are pruned iteratively, always choosing the node whose removal most increases
the accuracy of the decision tree over the validation set
○ Pruning continues until further pruning is harmful (i.e., it decreases accuracy over the
validation set)
○ Here the validation set used for pruning is distinct from both the training and test
sets
○ Disadvantage: Data is limited (withholding part of it for the validation set reduces
even further the number of examples available for training)
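A hedged sketch of this pruning loop, assuming the nested-dict tree representation from the earlier ID3 sketch; the helpers (classify, internal_node_paths, prune_at) are illustrative names, and a full implementation would handle unseen attribute values more carefully:

import copy
from collections import Counter

def classify(tree, example):
    """Walk a nested-dict tree of the form {attribute: {value: subtree_or_label}}."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        branch = tree[attr].get(example.get(attr))
        if branch is None:
            return None                      # unseen value: no prediction
        tree = branch
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_node_paths(tree, path=()):
    """Yield the (attribute, value) path leading to every internal decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from internal_node_paths(subtree, path + ((attr, value),))

def prune_at(tree, path, label):
    """Return a copy of the tree with the node reached via `path` replaced by `label`."""
    if not path:
        return label
    pruned = copy.deepcopy(tree)
    node = pruned
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = label
    return pruned

def majority_at(train, target, path):
    """Most common class among training examples affiliated with the node at `path`."""
    subset = [ex for ex in train if all(ex.get(a) == v for a, v in path)] or train
    return Counter(ex[target] for ex in subset).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Repeatedly prune the node whose removal most improves validation accuracy."""
    while True:
        base = accuracy(tree, validation, target)
        best = None
        for path in internal_node_paths(tree):
            candidate = prune_at(tree, path, majority_at(train, target, path))
            acc = accuracy(candidate, validation, target)
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, candidate)
        if best is None:                     # no non-harmful prune remains
            return tree
        tree = best[1]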
● Rule post-pruning aims to find a high-accuracy hypothesis. It involves the following steps:
○ Infer the decision tree from the training set, growing the tree until the training data are fit
as well as possible and allowing overfitting to occur
○ Convert the learned tree into an equivalent set of rules by creating one rule for each path
from the root to a leaf node
○ Prune each rule by removing any preconditions that result in improving its estimated
accuracy
○ Sort the pruned rules by their estimated accuracy and consider them in this sequence
when classifying subsequent instances
○ One rule is generated for each leaf node in the tree
○ Antecedent: Each attribute test along the path from the root to the leaf
○ Consequent: The classification at the leaf
○ Pruning: remove any antecedent whose removal does not worsen the rule's estimated accuracy
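A minimal sketch of the rule-conversion and antecedent-pruning steps, again assuming the nested-dict tree representation from the earlier sketch; rule_accuracy here is a simple placeholder estimate computed over a held-out example set:

def tree_to_rules(tree, preconditions=()):
    """One (preconditions, classification) rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, preconditions + ((attr, value),))
    return rules

def rule_accuracy(preconditions, label, examples, target):
    """Fraction of covered examples that the rule classifies correctly."""
    covered = [ex for ex in examples if all(ex.get(a) == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(ex[target] == label for ex in covered) / len(covered)

def prune_rule(preconditions, label, examples, target):
    """Drop any antecedent whose removal does not worsen the estimated accuracy."""
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy(preconditions, label, examples, target)
        for pre in list(preconditions):
            candidate = [p for p in preconditions if p != pre]
            if rule_accuracy(candidate, label, examples, target) >= base:
                preconditions, improved = candidate, True
                break
    return preconditions, label

# For instance, the path Outlook=Sunny, Humidity=High -> No becomes the rule
#   IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
# and prune_rule then tries dropping each of the two antecedents in turn.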
2. Incorporating Continuous-Valued Attributes
● Continuous valued attributes can be partitioned into a discrete number of disjoint intervals
and then membership can be tested over these intervals.
Example: If the PlayTennis learning task includes the continuous-valued attribute Temperature, with
values in the range 40–90, then the raw attribute Temperature is a bad choice for classification: it
alone may perfectly classify the training examples, and therefore promise the highest information
gain, while remaining a poor predictor on the test set.
● The solution to this problem is to classify based not on the actual temperature, but on
dynamically determined intervals within which the temperature falls.
● For example, by introducing boolean attributes such as T ≤ a, a < T ≤ b, b < T ≤ c and T > c
(for chosen thresholds a, b and c), instead of the real-valued attribute T.
● In the PlayTennis example, there are two candidate thresholds, corresponding to the values of
Temperature at which the value of PlayTennis changes: (48 + 60)/2, and (80 + 90)/2.
● The information gain can then be computed for each of the candidate attributes,
Temperature > 54 and Temperature > 85, and the best one selected (here Temperature > 54). This
dynamically created boolean attribute can then compete with the other discrete-valued
candidate attributes available for growing the decision tree.
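A sketch of this dynamic threshold selection. The six (Temperature, PlayTennis) pairs below are an assumed reconstruction, chosen only to be consistent with the two candidate thresholds mentioned in the text ((48 + 60)/2 = 54 and (80 + 90)/2 = 85):

import math

def entropy(labels):
    total = len(labels)
    ent = 0.0
    for lab in set(labels):
        p = labels.count(lab) / total
        ent -= p * math.log2(p)
    return ent

def gain_for_threshold(values, labels, threshold):
    """Information gain of the boolean test value > threshold."""
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    n = len(labels)
    remainder = len(above) / n * entropy(above) + len(below) / n * entropy(below)
    return entropy(labels) - remainder

# Assumed six-example fragment consistent with the thresholds in the text:
temperature = [40, 48, 60, 72, 80, 90]
playtennis  = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Candidate thresholds: midpoints where the class label changes.
for t in [54, 85]:
    print(t, round(gain_for_threshold(temperature, playtennis, t), 3))
# Temperature > 54 yields the larger gain and would be selected.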
3. Alternative Measures for Selecting Attributes
● The information gain measure has a natural bias that favors attributes with many values over
those with only a few.
For example: if you imagine an attribute Date with a unique value for each training example,
then Gain(S, Date) will equal Entropy(S), since each subset Sv contains a single example and
therefore has entropy 0.
● Obviously no other attribute can do better. This will result in a very broad tree of depth 1 that is
a very poor predictor over unseen instances.
● To guard against this, GainRatio(S, A) can be used instead of Gain(S, A):
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
where SplitInformation(S, A) ≡ – ∑i (|Si| / |S|) log2(|Si| / |S|), with S1..Sc the subsets of S
produced by partitioning S on the c values of attribute A.
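A sketch of the split-information penalty and the resulting gain ratio, following the formulas above; the subset sizes in the printed examples (14 singleton subsets for a Date-like attribute, an 8/6 split for Wind) are taken from the running example:

import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|Si| / |S|) * log2(|Si| / |S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

# A Date-like attribute that splits 14 examples into 14 singleton subsets incurs
# a large penalty term (log2 14), while Wind's 8/6 split is barely penalized:
print(round(split_information([1] * 14), 2))   # -> 3.81
print(round(split_information([8, 6]), 2))     # -> 0.99
print(round(gain_ratio(0.940, [1] * 14), 3))   # Date-like gain 0.940 shrinks to ~0.247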
4. Handling Training Examples with Missing Attribute Values
● What happens if some of the training examples contain one or more ``?'', meaning ``value not
known'', instead of the actual attribute values?
● Here are some common ad hoc solutions:
• Substitute ``?'' by the most common value in that column.
• Substitute ``?'' by the most common value among all training examples that have been sorted into the
tree at that node.
• Substitute ``?'' by the most common value among all training examples that have been sorted into the
tree at that node with the same classification as the incomplete example.
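A small Python sketch of these three substitution strategies; the record layout and the fill_missing helper are illustrative:

from collections import Counter

def most_common(values):
    """Most common non-missing value in a list (missing values are '?')."""
    known = [v for v in values if v != "?"]
    return Counter(known).most_common(1)[0][0]

def fill_missing(examples, attr, target=None, node_examples=None, same_class_as=None):
    """Return the value to substitute for '?' in attribute `attr`.

    - default:        most common value over all examples ("that column")
    - node_examples:  most common value among examples sorted to the current node
    - same_class_as:  additionally restrict to examples with the given classification
    """
    pool = node_examples if node_examples is not None else examples
    if same_class_as is not None and target is not None:
        pool = [ex for ex in pool if ex[target] == same_class_as]
    return most_common([ex[attr] for ex in pool])

# Illustrative usage:
data = [{"Humidity": "High", "PlayTennis": "No"},
        {"Humidity": "Normal", "PlayTennis": "Yes"},
        {"Humidity": "Normal", "PlayTennis": "Yes"},
        {"Humidity": "?", "PlayTennis": "Yes"}]
print(fill_missing(data, "Humidity"))                                   # -> Normal
print(fill_missing(data, "Humidity", target="PlayTennis",
                   same_class_as="Yes"))                                # -> Normal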
5. Handling attributes with differing costs
● In some learning tasks the instance attributes may have associated costs.
For example, in learning to classify medical diseases, patients can be described in terms of attributes
such as Temperature, Biopsy-Result, Pulse, Blood Test-Results, etc. These attributes vary
significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such
tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-
cost attributes only when needed to produce reliable classifications.
● This can be done by dividing the gain by the cost of the attribute, Cost(A), so that lower-cost
attributes are preferred.
● Use a CostedGain(S, A) which is defined along the lines of:
CostedGain(S, A) ≡ Gain(S, A) / Cost(A)
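A minimal sketch of such a cost-weighted gain. The exact functional form varies in the literature; this version simply divides the gain by the cost, as suggested above, and the gain and cost values shown are illustrative:

def costed_gain(gain, cost):
    """Cost-sensitive attribute quality: information gain scaled down by cost."""
    return gain / cost

# Illustrative costs: a cheap bedside reading vs. an expensive lab result.
attributes = {
    "Temperature":  {"gain": 0.25, "cost": 1.0},
    "BiopsyResult": {"gain": 0.40, "cost": 10.0},
}
best = max(attributes, key=lambda a: costed_gain(attributes[a]["gain"],
                                                 attributes[a]["cost"]))
print(best)   # -> Temperature (more gain per unit cost)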