
Module 3



Part 1

Decision Tree
Definition of Decision Tree
• Decision Tree: Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a decision tree.

• (A discrete value is a specific value, e.g., yes or no.)


Decision Tree Representation
• Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance.

• Each node in the tree specifies a test of some attribute of the instance, and each
branch descending from that node corresponds to one of the possible values for
this attribute.

• An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch corresponding
to the value of the attribute in the given example. This process is then repeated
for the subtree rooted at the new node.
Figure: Decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the
appropriate leaf node, then returning the classification associated with the leaf.
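To make this classification procedure concrete, here is a minimal Python sketch (not from the source) that walks an instance down the PlayTennis tree of the figure; storing the tree as nested dictionaries is an assumption made only for illustration.

# Minimal sketch of classifying an instance by sorting it down a decision tree.
# Internal nodes map an attribute name to {value: subtree}; leaves are labels.
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    # A leaf node is just a class label.
    if not isinstance(tree, dict):
        return tree
    # Test the attribute at this node, then follow the branch matching the
    # instance's value for that attribute and repeat on the subtree.
    attribute = next(iter(tree))
    return classify(tree[attribute][instance[attribute]], instance)

print(classify(play_tennis_tree,
               {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes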
• Decision trees represent a disjunction of conjunctions of constraints on the attribute values of
instances.

• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the
tree itself to a disjunction of these conjunctions

• For example, the decision tree shown in the above figure corresponds to the expression

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Appropriate Problems for Decision Tree Learning
i. Instances are represented by attribute-value pairs – Instances are described by a fixed set of attributes
and their values

ii. The target function has discrete output values – The decision tree assigns a Boolean classification
(e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than
two possible output values.

iii. Disjunctive descriptions may be required

iv. The training data may contain errors – Decision tree learning methods are robust to errors, both errors
in classifications of the training examples and errors in the attribute values that describe these examples.

v. The training data may contain missing attribute values – Decision tree methods can be used even
when some training examples have unknown values.
The Basic Decision Tree Learning Algorithm
The basic algorithm is ID3 which learns decision trees by constructing them top-down.

• ID3(Examples, Target_attribute, Attributes)

• Examples are the training examples.

• Target_attribute is the attribute whose value is to be predicted by the tree.

• Attributes is a list of other attributes that may be tested by the learned decision tree.

• Returns a decision tree that correctly classifies the given Examples.


• Create a Root node for the tree

• If all Examples are positive, Return the single-node tree Root, with label = +

• If all Examples are negative, Return the single-node tree Root, with label = -

• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples

• Otherwise Begin
  o A ← the attribute from Attributes that best* classifies Examples
  o The decision attribute for Root ← A
  o For each possible value vi of A,
    - Add a new tree branch below Root, corresponding to the test A = vi
    - Let Examples_vi be the subset of Examples that have value vi for A
    - If Examples_vi is empty
      - Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
      - Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})

• End

• Return Root
* The best attribute is the one with highest information gain
• Table: Summary of the ID3 algorithm specialized to learning Boolean-valued
functions.

• ID3 is a greedy algorithm that grows the tree top-down, at each node
selecting the attribute that best classifies the local training examples.

• This process continues until the tree perfectly classifies the training examples,
or until all attributes have been used.
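As a rough Python sketch of the recursion summarized above (not the textbook's code; the helper functions, the list-of-dicts data layout, and the tiny example dataset are assumptions), with entropy and information gain as defined in the next subsection:

from collections import Counter
import math

def entropy(labels):
    # Impurity of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Expected reduction in entropy from partitioning on `attribute`.
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # Base cases: all examples share one label, or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain for this node.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree

data = [
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "PlayTennis": "No"},
]
print(id3(data, "PlayTennis", ["Outlook", "Wind"]))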
Which attribute is the best classifier?
2.4. Entropy Measures Homogeneity of Examples
• To define information gain, we begin by defining a measure called entropy.
Entropy measures the impurity of a collection of examples.

• Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is

Entropy(S) = -p+ log2(p+) - p- log2(p-)

• Where,

• p+ → proportion of positive examples in S

• p- → proportion of negative examples in S


• Example:
• Suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples. Then the entropy of S relative to this Boolean classification is

Ans: Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
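A quick numerical check of this value (a sketch, not from the source):

import math

p_pos, p_neg = 9 / 14, 5 / 14
entropy_S = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(entropy_S, 3))   # 0.94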
• The entropy is 0 if all members of S belong to the same class

• The entropy is 1 when the collection contains an equal number of positive and
negative examples

• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.

Figure: The entropy relative to a Boolean classification, as the proportion of positive examples varies from 0 to 1.
Information Gain Measures the Expected
Reduction in Entropy
• Information gain, Gain(S, A), is the expected reduction in entropy caused by partitioning the examples of S according to attribute A:

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) Entropy(Sv), where the sum is over v ∈ Values(A)

• Here Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v.
An Illustrative Example
• To illustrate the operation of ID3, consider the learning task represented by
the training examples of below table.

• Here the target attribute is PlayTennis, which can have values yes or no for different days.

• Consider the first step through the algorithm, in which the topmost node of
the decision tree is created.
Attribute: Outlook

Attribute: Temp

Gain(S, Temp) = 0.94 - (4/14)(1.0) - (6/14)(0.9183) - (4/14)(0.8113) = 0.0289
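To verify this arithmetic in Python (a sketch; the per-value counts below are taken to be those of the standard PlayTennis table, where Temp = Hot covers 2 positive and 2 negative examples, Mild 4 and 2, and Cool 3 and 1):

import math

def entropy(pos, neg):
    # Entropy of a subset with `pos` positive and `neg` negative examples.
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / (pos + neg)
            result -= p * math.log2(p)
    return result

S = entropy(9, 5)                       # entropy of the full collection, ≈ 0.940
subsets = [(4, entropy(2, 2)),          # Temp = Hot:  2+ / 2-  -> 1.0
           (6, entropy(4, 2)),          # Temp = Mild: 4+ / 2-  -> 0.9183
           (4, entropy(3, 1))]          # Temp = Cool: 3+ / 1-  -> 0.8113
gain_temp = S - sum(n / 14 * e for n, e in subsets)
print(round(gain_temp, 3))              # 0.029 (the 0.0289 above uses the rounded 0.94)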
Hypothesis space search in decision tree learning
1. ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
(The hypothesis space contains many hypotheses; we search it to find the one that best fits our training examples, i.e., our training data.)
2. ID3 searches the set of possible decision trees that make up this hypothesis space.
3. ID3 performs a simple-to-complex search.
[It starts with a simple tree and then progressively increases the complexity.]

It first starts with the empty tree and keeps adding nodes.

➔ Every discrete-valued (finite) function can be described by some decision tree.

[To describe a function or a training set using decision trees, the data must be in the form of discrete values, and the amount of data must be finite.]

➔ Because its hypothesis space contains every such function, ID3 avoids the major risk of methods that search incomplete hypothesis spaces.

➔ ID3 maintains only a single current hypothesis as it searches.

➔ It therefore cannot determine how many alternative decision trees are consistent with the training data.

➔ Backtracking is not possible (ID3 performs no backtracking in its search).

➔ ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data.
Inductive Bias in Decision Tree Learning
• Given a collection of training examples, there are typically many decision trees
consistent with these examples.

[Sometimes only one decision tree is consistent with the training data, but in other cases more than one tree is consistent with it. When there is only one such tree, no bias is needed to choose it; when there is more than one, the learner must be biased toward one of them.]

• Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses one of these consistent hypotheses over the others.

• Which of these decision tree does ID3 choose?


• It chooses the first acceptable tree it encounters in its simple-to-complex, hill
climbing search through the space of possible trees.

• The inductive bias of ID3 can be stated in two ways:

1. Approximate inductive bias of ID3:

- Shorter trees are preferred over larger trees.

2. A closer approximation to the inductive bias of ID3:

- Shorter trees are preferred over longer trees.

- Trees that place high information gain attributes close to the root are preferred over those that do not.

[If we have two trees of the same size, we then consider the information gain of the attributes near the root.]

Why prefer shorter hypotheses?

1. Arguments in favor

• There are fewer short hypotheses than long ones.

• If a short hypothesis fits the data, it is unlikely to be a coincidence.

2. Argument against

• Not every short hypothesis is a reasonable one.


Issues in decision tree learning
1. Overfitting the data: if we depend too much on the training data while growing the decision tree, there is a possibility that the tree will overfit.

• i.e., a tree that works well on the training data but does not work well on real-world (unseen) data is said to overfit.

• Overfitting can be reduced with two techniques:

a. Reduced-error pruning

b. Rule post-pruning


2. Incorporating continuous-valued attributes: decision tree learning works well on problems where there is a fixed number of attributes and a discrete set of possible values for each attribute.

If an attribute has continuous values, we cannot apply the decision tree to it directly.

We first need to convert such continuous-valued attributes into discrete possibilities (for example, by defining a Boolean attribute that is true when the value exceeds a chosen threshold); only then can decision tree learning be applied, as sketched below.
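As an illustrative sketch of one common discretization scheme (choosing a threshold midway between adjacent sorted values where the class label changes and keeping the purest split; the temperature values and labels below mirror the usual textbook example and are not from this document):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds sit midway between adjacent sorted values whose
    # labels differ; keep the one giving the lowest weighted entropy.
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = t, score
    return best

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play_tennis))  # 54.0 -> Boolean attribute "Temperature > 54"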
3. Handling training examples with missing attribute values:

• The decision tree algorithm works well with data that contain some errors, but missing attribute values need to be handled before decision tree learning can be used.

• If some attribute value is missing, we need to fill in that missing value with a proper value before using decision tree learning. For example, if an attribute has no value for the fifth training example, we estimate a suitable value (such as the most common value of that attribute among the other examples), fill in the fifth example with it, and then apply decision tree learning.
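A minimal sketch of the simplest strategy described above, filling a missing value with the attribute's most common value among the other training examples (the data and the use of None for a missing value are assumptions):

from collections import Counter

def fill_missing(examples, attribute):
    # Replace missing (None) entries for `attribute` with its most common value.
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    for e in examples:
        if e[attribute] is None:
            e[attribute] = most_common
    return examples

data = [{"Outlook": "Sunny"}, {"Outlook": "Rain"},
        {"Outlook": "Sunny"}, {"Outlook": None}]
print(fill_missing(data, "Outlook"))   # the missing value becomes "Sunny"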
4. Handling attributes with different costs: in the core decision tree algorithm, every attribute in the given problem is given equal importance; if there are four attributes, all four are weighted equally.

Sometimes, in a given problem definition, a particular attribute may be more important or should be given more weight than the others. In such cases we cannot use the core decision tree learning algorithm as is; the issue has to be handled with an additional calculation, for example by modifying the attribute-selection measure so that an attribute with more weight (or lower measurement cost) is favored. The core decision tree algorithm does not take this into consideration.
1. Avoiding Overfitting the Data
• Definition - Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to
overfit the training data if there exists some alternative hypothesis h' ∈ H, such
that h has smaller error than h' over the training examples, but h' has a smaller
error than h over the entire distribution of instances.
• The horizontal axis of this plot indicates the total number of nodes in the decision tree as the tree is being constructed. The vertical axis indicates the accuracy of predictions made by the tree.

• The solid line shows the accuracy of the decision tree over the training examples. The broken line shows accuracy measured over an independent set of test examples.

• The accuracy of the tree over the training examples increases monotonically as the tree is grown. The accuracy measured over the independent test examples first increases, then decreases.
• How can it be possible for tree h to fit the training examples better than h', but for
it to perform more poorly over subsequent examples?

• i. Overfitting can occur when the training examples contain random errors or
noise

• ii. When small numbers of examples are associated with leaf nodes.
Approaches to avoiding overfitting in decision
tree learning
• Pre-pruning (avoidance): Stop growing the tree earlier, before it
reaches the point where it perfectly classifies the training data

• Post-pruning (recovery): Allow the tree to overfit the data, and then
post-prune the tree
Rule Post-Pruning
• Rule post-pruning is a successful method for finding high-accuracy hypotheses. Rule post-pruning involves the following steps:
• Infer the decision tree from the training set, growing the tree until the training data
is fit as well as possible and allowing overfitting to occur.
• Convert the learned tree into an equivalent set of rules by creating one rule for
each path from the root node to a leaf node.
• Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
• Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
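As an illustrative sketch of the second step, converting a learned tree into one rule per root-to-leaf path (not the source's code; the nested-dictionary tree format is the same assumption used in the earlier classification sketch):

def tree_to_rules(tree, preconditions=()):
    # A leaf: emit one rule whose preconditions are the attribute tests
    # along the path taken to reach it.
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules

play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}
for conditions, label in tree_to_rules(play_tennis_tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conditions)
          + f" THEN PlayTennis = {label}")
# e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No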
There are three main advantages by converting the
decision tree to rules before pruning
• Converting to rules allows distinguishing among the different contexts in which a
decision node is used.
• Because each distinct path through the decision tree node produces a distinct
rule, the pruning decision regarding that attribute test can be made differently
for each path.
• Converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves.
• Thus, it avoids messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
• Converting to rules improves readability.
• Rules are often easier for people to understand.
3. Alternative Measures for Selecting Attributes

• The problem: if an attribute has many distinct values, information gain will tend to select it, even when it is a poor predictor of the target function. One alternative measure that addresses this is the gain ratio, which penalizes such attributes by dividing the gain by the attribute's split information.
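A small sketch of the gain ratio computation (assuming the 14-example PlayTennis set, where Outlook splits the examples into subsets of sizes 5, 4, and 5 and has a gain of about 0.246):

import math

def split_information(subset_sizes):
    # Entropy of S with respect to the sizes of the partition induced by A.
    total = sum(subset_sizes)
    return -sum(n / total * math.log2(n / total) for n in subset_sizes if n)

def gain_ratio(gain, subset_sizes):
    # Gain ratio penalizes attributes that split the data many ways.
    return gain / split_information(subset_sizes)

print(round(split_information([5, 4, 5]), 3))   # ≈ 1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))   # ≈ 0.156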
5. Handling Attributes with Differing Costs
• In some learning tasks the instance attributes may have associated costs, and we prefer decision trees that use low-cost attributes where possible. This can be handled by modifying the attribute-selection measure to penalize costly attributes, for example by dividing the information gain by the attribute's cost.
ARTIFICIAL NEURAL NETWORKS
• Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued target functions.

• Biological Motivation: The study of artificial neural networks (ANNs) has been inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons.

• The human information-processing system consists of the brain; the neuron is its basic building block, a cell that communicates information to and from various parts of the body.
Facts of Human Neurobiology
• Number of neurons ~ 10^11
• Connections per neuron ~ 10^4 to 10^5
• Neuron switching time ~ 0.001 second (10^-3 s)
• Scene recognition time ~ 0.1 second
• That allows only about 100 serial inference steps, which doesn't seem like enough
• This suggests highly parallel computation based on distributed representations
NEURAL NETWORK REPRESENTATIONS
• A prototypical example of ANN learning is provided by Pomerleau's system
ALVINN, which uses a learned ANN to steer an autonomous vehicle driving at
normal speeds on public highways.

• The input to the neural network is a 30x32 grid of pixel intensities obtained from a forward-pointing camera mounted on the vehicle. The network output is the direction in which the vehicle is steered.
PERCEPTRON
• One type of ANN system is based on a unit called a perceptron.
• A perceptron is a single-layer neural network: it takes a vector of real-valued inputs, calculates a linear combination of these inputs, and outputs 1 if the result is greater than some threshold and -1 otherwise.
Perceptron training rule:
• The rule is

wi ← wi + Δwi, where Δwi = η (t - o) xi

• Here t is the target output for the current training example, o is the output generated by the perceptron, and η is a positive constant called the learning rate.
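A minimal sketch of this training rule on a toy task (not from the source; the AND dataset, learning rate, and epoch count are assumptions):

# Train a perceptron with w_i <- w_i + eta * (t - o) * x_i on the AND function.
# Each input vector starts with a constant 1 so that weights[0] acts as the bias.
def perceptron_output(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

examples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
weights = [0.0, 0.0, 0.0]
eta = 0.1                                    # learning rate (assumed)

for epoch in range(20):                      # fixed number of epochs (assumed)
    for x, t in examples:
        o = perceptron_output(weights, x)
        # Update each weight in proportion to the error (t - o) and input x_i.
        weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]

print([perceptron_output(weights, x) for x, _ in examples])  # [-1, -1, -1, 1]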
Gradient descent and delta rule
• The delta rule uses gradient descent to search the space of possible weight vectors for the weights that best fit the training examples, and it converges toward a best-fit approximation even when the training examples are not linearly separable. The training error is measured as

E(w) = 1/2 Σd (td - od)^2

and gradient descent gives the weight update Δwi = η Σd (td - od) xid, where the sum runs over all training examples d.
Termination condition
1. A fixed number of epochs may be set (for example, 10).
2. An acceptable error rate may be set: once we reach the acceptable error rate, we stop executing the algorithm and take the weights from that final step as the learned weights.
Issues in the gradient descent algorithm
• Converging to a minimum can sometimes be quite slow (it may require many thousands of gradient descent steps).
• If there are multiple local minima in the error surface, there is no guarantee that the procedure will find the global minimum.
Difference between gradient descent and stochastic gradient descent
• In standard (batch) gradient descent, the error is summed over all training examples before the weights are updated; in stochastic gradient descent, the weights are updated incrementally, after each individual training example.
• Summing over all examples requires more computation per weight update, but because it uses the true gradient it is often used with a larger step size; stochastic gradient descent can sometimes help avoid falling into a local minimum.
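A compact sketch contrasting the two update schemes for a linear unit trained with the delta rule (the toy data, learning rate, and epoch count are assumptions, not from the source):

# Batch vs. stochastic gradient descent for a linear unit o = w0 + w1*x,
# minimizing squared error on toy data that follows t = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # (x, t) pairs
eta = 0.05

def batch_gradient_descent(epochs=200):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        # Sum the error over ALL examples, then update the weights once.
        d0 = sum(t - (w0 + w1 * x) for x, t in examples)
        d1 = sum((t - (w0 + w1 * x)) * x for x, t in examples)
        w0, w1 = w0 + eta * d0, w1 + eta * d1
    return w0, w1

def stochastic_gradient_descent(epochs=200):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        # Update the weights after EACH individual example.
        for x, t in examples:
            error = t - (w0 + w1 * x)
            w0, w1 = w0 + eta * error, w1 + eta * error * x
    return w0, w1

print(batch_gradient_descent())        # both approach (w0, w1) ≈ (1, 2)
print(stochastic_gradient_descent())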
MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM

• Multilayer networks learned by the BACKPROPAGATION algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
Backpropagation algorithm
