Module 3
Part 1
Decision Tree
Definition of Decision Tree
• Decision Tree: Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a decision tree.
• Each node in the tree specifies a test of some attribute of the instance, and each
branch descending from that node corresponds to one of the possible values for
this attribute.
• An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch corresponding
to the value of the attribute in the given example. This process is then repeated
for the subtree rooted at the new node.
Figure: Decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the
appropriate leaf node, then returning the classification associated with the leaf.
• Decision trees represent a disjunction of conjunctions of constraints on the attribute values of
instances.
• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the
tree itself to a disjunction of these conjunctions
• For example, the decision tree shown in the figure above corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Appropriate Problems for Decision Tree Learning
i. Instances are represented by attribute-value pairs – Instances are described by a fixed set of attributes
and their values
ii. The target function has discrete output values – The decision tree assigns a Boolean classification
(e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than
two possible output values.
iii. Disjunctive descriptions may be required – Decision trees naturally represent disjunctive expressions.
iv. The training data may contain errors – Decision tree learning methods are robust to errors, both errors
in the classifications of the training examples and errors in the attribute values that describe these examples.
v. The training data may contain missing attribute values – Decision tree methods can be used even
when some training examples have unknown values.
The Basic Decision Tree Learning Algorithm
The basic algorithm is ID3, which learns decision trees by constructing them top-down.
ID3(Examples, Target_attribute, Attributes)
• Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by
the tree. Attributes is a list of other attributes that may be tested by the learned decision tree.
• Create a Root node for the tree
• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = -
• If Attributes is empty, Return the single-node tree Root, with label = most common value of
Target_attribute in Examples
• Otherwise Begin
o A ← the attribute from Attributes that best* classifies Examples
o The decision attribute for Root ← A
o For each possible value, vi, of A,
- Add a new tree branch below Root, corresponding to the test A = vi
- Let Examples_vi be the subset of Examples that have value vi for A
- If Examples_vi is empty
- Then below this new branch add a leaf node with label = most
common value of Target_attribute in Examples
- Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
• End
• Return Root
* The best attribute is the one with the highest information gain
• Table: Summary of the ID3 algorithm specialized to learning Boolean-valued
functions.
• ID3 is a greedy algorithm that grows the tree top-down, at each node
selecting the attribute that best classifies the local training examples.
• This process continues until the tree perfectly classifies the training examples,
or until all attributes have been used.
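A minimal Python sketch of this greedy, top-down construction, assuming each training example is a dict mapping attribute names to values with class labels in a parallel list (the function and variable names are illustrative, not part of the original notes); the information gain measure it uses is defined in the next subsection.
```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Information gain of splitting `examples` on `attr`."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Top-down greedy construction of a decision tree (ID3)."""
    # All examples share one label: return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: return the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    # Grow one branch per value of `best` observed at this node.
    for value in set(ex[best] for ex in examples):
        subset = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        sub_examples = [ex for ex, _ in subset]
        sub_labels = [lab for _, lab in subset]
        tree[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return tree
```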
Which attribute is the best classifier?
Entropy Measures Homogeneity of Examples
• To define information gain, we begin by defining a measure called entropy.
Entropy measures the impurity of a collection of examples.
• Given a collection S containing positive and negative examples of some target concept, the entropy of S
relative to this Boolean classification is
Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
• where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S.
• For example, for the PlayTennis training set with 9 positive and 5 negative examples,
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• The entropy is 0 if all members of S belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and
negative examples
• The information gain, Gain(S, A), of an attribute A is the expected reduction in entropy caused by
partitioning the examples according to A:
Gain(S, A) = Entropy(S) − Σ v∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of possible values of A and Sv is the subset of S for which A has value v.
• Here the target attribute is PlayTennis, which can have the values yes or no for
different days.
• Consider the first step through the algorithm, in which the topmost node of
the decision tree is created.
Attribute: Outlook
Gain(S, Outlook) = 0.246
Attribute: Temperature
Gain(S, Temperature) = 0.94 − (4/14)(1.0) − (6/14)(0.9183) − (4/14)(0.8113) = 0.0289
Outlook has the highest information gain of the four attributes, so it is selected as the decision attribute
for the root node.
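A short check of the Gain(S, Temperature) calculation in Python; the counts per Temperature value (Hot: 2+, 2−; Mild: 4+, 2−; Cool: 3+, 1−) are those of the standard PlayTennis training set assumed above.
```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            result -= p * math.log2(p)
    return result

# PlayTennis: S has 9 positive and 5 negative examples, Entropy(S) ~= 0.940.
s = entropy(9, 5)

# Temperature partitions S into Hot (2+, 2-), Mild (4+, 2-), Cool (3+, 1-).
gain_temp = s - (4/14) * entropy(2, 2) - (6/14) * entropy(4, 2) - (4/14) * entropy(3, 1)
print(round(gain_temp, 4))   # ~0.029, matching the hand calculation (which uses the rounded 0.94)
```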
Hypothesis Space Search in Decision Tree Learning
1. ID3 can be characterized as searching a space of hypotheses for one that fits
the training examples.
(The hypothesis space contains many hypotheses; ID3 searches this space to find the
one that best fits the training data.)
2. The hypothesis space searched by ID3 is the set of possible decision trees.
3. ID3 performs a simple-to-complex search
(it starts with a simple hypothesis and the complexity increases as the search proceeds):
it first starts with the empty tree and keeps adding nodes.
➔ Every discrete-valued (finite) function can be represented by some decision tree.
-> ID3 can be easily extended to handle noisy training data by modifying its
termination criterion to accept hypotheses that imperfectly fit the training data.
Inductive Bias in Decision Tree Learning
• Given a collection of training examples, there are typically many decision trees
consistent with these examples.
[Sometimes only one tree is consistent with the training data, but in many cases more
than one tree is consistent. When only one tree is consistent there is no choice to make,
but when several trees are consistent the learner must be biased toward some of them.]
• Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses one of these consistent hypotheses over the others.
- Shorter trees are preferred over larger trees, and trees that place high information gain
attributes close to the root are preferred over those that do not.
[If two consistent trees have the same size, the one whose higher information gain
attributes are closer to the root is preferred.]
Why prefer shorter hypotheses (Occam's razor)?
1. Arguments in favor
2. Arguments against
• i.e., a tree that performs well on the training data but poorly on new, real-world data is
said to overfit the training data.
2. Incorporating continuous-valued attributes:
If an attribute takes continuous values, the basic decision tree algorithm cannot be applied
to it directly.
The continuous-valued attribute must first be converted into a discrete-valued one (for
example, by choosing a threshold and testing whether the value is above or below it);
only then can decision tree learning be applied, as in the sketch below.
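A minimal sketch of one common way to discretize a continuous attribute by picking a single threshold: candidate thresholds are the midpoints between adjacent sorted values, and the one giving the highest information gain is chosen. The numeric values below are illustrative.
```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the threshold on a continuous attribute that maximizes information gain.

    Candidate thresholds are midpoints between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        below = [lab for v, lab in pairs if v <= t]
        above = [lab for v, lab in pairs if v > t]
        gain = base - (len(below) / len(pairs)) * entropy(below) \
                    - (len(above) / len(pairs)) * entropy(above)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Example: Temperature values vs. PlayTennis labels (illustrative numbers).
print(best_threshold([40, 48, 60, 72, 80, 90],
                     ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']))
```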
3. Handling training examples with missing attribute values:
• Decision tree learning is robust to some errors in the data, but it cannot use an example
directly when one of its attribute values is simply missing.
• If an example is missing a value for some attribute (say, the value of attribute A is
unknown for the fifth training example), one strategy is to fill in that value with a suitable
estimate, for example the most common value of A among the training examples at that
node (or among those with the same classification), and then apply decision tree learning
as usual; a simple version of this is sketched below.
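A minimal sketch of the simplest strategy: filling a missing value with the most common observed value of that attribute. The attribute names here are illustrative.
```python
from collections import Counter

def fill_missing(examples, attr):
    """Replace missing (None) values of `attr` with the most common observed value.

    This is the simplest strategy; more refined variants use the most common value
    among examples with the same classification, or fractional examples as in C4.5."""
    observed = [ex[attr] for ex in examples if ex[attr] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    for ex in examples:
        if ex[attr] is None:
            ex[attr] = most_common
    return examples

# Example: the Humidity value is unknown (None) for one training example.
data = [{'Humidity': 'High'}, {'Humidity': 'Normal'},
        {'Humidity': 'High'}, {'Humidity': None}]
print(fill_missing(data, 'Humidity'))
```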
4. Handling attributes with differing costs:
• In the basic decision tree algorithm, every attribute in the given problem is treated as
equally important: if there are four attributes, all four are given equal importance when
selecting tests.
• When measuring some attributes is more expensive than others, the selection measure
can be modified so that lower-cost attributes are preferred, for example by dividing the
information gain by the cost of the attribute.
• i. Overfitting can occur when the training examples contain random errors or
noise
• ii. When small numbers of examples are associated with leaf nodes.
Approaches to avoiding overfitting in decision
tree learning
• Pre-pruning (avoidance): Stop growing the tree earlier, before it
reaches the point where it perfectly classifies the training data
• Post-pruning (recovery): Allow the tree to overfit the data, and then
post-prune the tree
Rule Post-Pruning
• Rule post-pruning is a successful method for finding high-accuracy hypotheses. Rule
post-pruning involves the following steps:
• Infer the decision tree from the training set, growing the tree until the training data
is fit as well as possible and allowing overfitting to occur.
• Convert the learned tree into an equivalent set of rules by creating one rule for
each path from the root node to a leaf node (a sketch of this step follows the list).
• Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
• Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
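A minimal sketch of the tree-to-rules conversion step, assuming the nested-dict tree representation used in the ID3 sketch earlier; each rule is a (preconditions, classification) pair for one root-to-leaf path, and the example tree below is hypothetical.
```python
def tree_to_rules(tree, preconditions=None):
    """Convert a nested-dict decision tree into one rule per root-to-leaf path:
    a list of (list of (attribute, value) tests, classification) pairs."""
    preconditions = preconditions or []
    if not isinstance(tree, dict):          # a leaf: the classification
        return [(preconditions, tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules.extend(tree_to_rules(subtree, preconditions + [(attr, value)]))
    return rules

# Example (hypothetical PlayTennis-style tree):
tree = {'Outlook': {'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
                    'Overcast': 'Yes',
                    'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}}}}
for pre, label in tree_to_rules(tree):
    print(' AND '.join(f'{a} = {v}' for a, v in pre), '=>', label)
```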
There are three main advantages to converting the
decision tree to rules before pruning
• Converting to rules allows distinguishing among the different contexts in which a
decision node is used.
• Because each distinct path through the decision tree node produces a distinct
rule, the pruning decision regarding that attribute test can be made differently
for each path.
• Converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves.
• Thus, it avoids messy bookkeeping issues such as how to reorganize the tree if the
root node is pruned while retaining part of the subtree below this test.
• Converting to rules improves readability.
• Rules are often easier for people to understand.
3. Alternative measures for selecting attributes:
• The information gain measure has a natural bias: it favors attributes with many values
over those with few values, so such an attribute may be selected even when it is a poor
predictor (e.g., a Date attribute that perfectly separates the training examples).
• One alternative measure that has been used successfully is the gain ratio, which
penalizes such attributes by dividing the information gain by the split information (the
entropy of S with respect to the values of the attribute); a sketch follows.
ARTIFICIAL NEURAL NETWORKS
• Artificial neural networks (ANNs) provide a general, practical method for
learning real-valued, discrete-valued, and vector-valued target functions.
• Biological Motivation: The study of artificial neural networks (ANNs) has been
inspired by the observation that biological learning systems are built of very
complex webs of interconnected neurons.
• For example, in the ALVINN system for autonomous driving, the input to the neural
network is a 30x32 grid of pixel intensities obtained from a forward-pointing camera
mounted on the vehicle. The network output is the direction in which the vehicle is steered.
PERCEPTRON
• One type of ANN system is based on a unit called a perceptron.
• A perceptron is a single-layer neural network.
Perceptron training rule:
• The rule revises the weight wi associated with input xi according to
wi ← wi + Δwi, where Δwi = η (t − o) xi
• Here t is the target output for the current training example, o is the output generated by
the perceptron, and η is a small positive constant called the learning rate (a small sketch follows).
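A minimal Python sketch of this training rule for a perceptron with a bias weight w0 (handled by a constant input x0 = 1); the AND example and parameter values below are illustrative.
```python
def train_perceptron(examples, eta=0.1, epochs=10):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    `examples` is a list of (inputs, target) pairs with targets in {+1, -1};
    a bias weight w[0] is handled by prepending a constant input x0 = 1."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # w[0] is the bias weight
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)              # prepend x0 = 1
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            for i in range(len(w)):
                w[i] += eta * (t - o) * xs[i]
    return w

# Example: learn the boolean AND function (targets in {+1, -1}).
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(data))
```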
Gradient descent and delta rule
Termination condition
1. A fixed number of epochs may be set (for example, 10 epochs).
2. An acceptable error rate may be set: once the error reaches the acceptable level,
execution of the algorithm stops, and the weights from the final step are taken as
the learned weights.
Issues in the gradient descent algorithm
Difference between gradient descent and stochastic gradient descent
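A minimal sketch contrasting the two schemes for a linear unit trained with the delta rule (error E = ½ Σd (td − od)²): standard (batch) gradient descent sums the weight updates over all training examples before changing the weights, whereas stochastic (incremental) gradient descent updates the weights after each individual example. The toy data and learning rate below are illustrative.
```python
def batch_gradient_descent(examples, eta=0.05, epochs=100):
    """Delta rule, batch version: accumulate Delta w_i over all examples, then update."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))      # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        w = [wi + di for wi, di in zip(w, delta)]         # one update per epoch
    return w

def stochastic_gradient_descent(examples, eta=0.05, epochs=100):
    """Delta rule, stochastic version: update the weights after each example."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]              # one update per example
    return w

# Example: fit t = 2*x1 + 1 using inputs of the form (1, x1); both approach w = [1, 2].
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.0)]
print(batch_gradient_descent(data), stochastic_gradient_descent(data))
```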
MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM