Lesson 7: Supervised Methods (Decision Trees) Algorithms
Important terminology
Root Node: The attribute at this node is used for dividing the data into two or more
sets. The feature at the root node is selected based on attribute selection
techniques.
Branch or Sub-Tree: A part of the entire decision tree is called a branch or
sub-tree.
Splitting: Dividing a node into two or more sub-nodes based on if-else
conditions.
Decision Node: A sub-node that is split into further sub-nodes is called a
decision node.
Leaf or Terminal Node: A node at the end of the decision tree that cannot be
split into further sub-nodes.
Pruning: Removing a sub-node from the tree is called pruning.
The root node feature is selected based on the results of the Attribute Selection
Measure (ASM).
The ASM is applied repeatedly until a leaf (terminal) node is reached that cannot be
split into further sub-nodes.
What is an Attribute Selection Measure (ASM)?
An Attribute Selection Measure is a technique used in the data mining process for
data reduction. Data reduction is necessary for better analysis and prediction of
the target variable. Two common measures are:
1. Gini index
2. Information Gain (ID3)
Gini index
The Gini index (or Gini impurity) measures the probability of a particular variable
being wrongly classified when it is chosen at random. It ranges from 0, when all
elements in a node belong to a single class, towards higher values as the elements
become more evenly distributed across the classes.
Gini index = 1 - Σ (Pi)^2, where Pi is the probability of an object being classified into a particular class.
When the Gini index is used as the criterion for selecting the feature for the
root node, the feature with the lowest Gini index is chosen.
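
As a minimal sketch (not taken from this lesson), the Gini index of a set of class labels can be computed in Python as follows; the function name and the use of collections.Counter are illustrative choices.

from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class probabilities (Pi)
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

# A pure node has Gini index 0; an evenly mixed two-class node has 0.5.
print(gini_impurity(["yes", "yes", "yes"]))       # 0.0
print(gini_impurity(["yes", "no", "yes", "no"]))  # 0.5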
Information Gain (ID3)
Entropy is the main concept behind this algorithm.
o The measure that helps determine which feature or attribute gives the maximum
information about a class is called Information Gain, and the algorithm built
around it is ID3.
o By using this method, we can reduce the level of entropy from the root node to
the leaf nodes.
Algorithms used while training decision trees:
ID3, C4.5, CART and pruning
1. ID3: Ross Quinlan is credited with the development of ID3, which is shorthand
for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information
gain as metrics to evaluate candidate splits. Quinlan published research on this
algorithm in 1986.
2. C4.5: This algorithm is considered a later iteration of ID3, which was also
developed by Quinlan. It can use either information gain or the gain ratio to
evaluate split points within the decision tree.
It breaks down a dataset into smaller and smaller subsets while at the same time
an associated decision tree is incrementally developed. The final result is a tree
with decision nodes and leaf nodes.
A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast
and Rainy). A leaf node (e.g., Play) represents a classification or decision. The
topmost decision node in a tree, which corresponds to the best predictor, is called
the root node. Decision trees can handle both categorical and numerical data.
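
As a rough, illustrative sketch of how C4.5's gain ratio adjusts information gain (the helper names and example numbers below are assumptions, not taken from this lesson):

import math

def split_information(branch_sizes):
    # Entropy of the split itself; penalises attributes with many small branches.
    total = sum(branch_sizes)
    return -sum((n / total) * math.log2(n / total) for n in branch_sizes if n > 0)

def gain_ratio(information_gain, branch_sizes):
    # C4.5 prefers the attribute with the highest gain ratio.
    si = split_information(branch_sizes)
    return information_gain / si if si > 0 else 0.0

# Illustrative values: a three-way split (e.g. Sunny/Overcast/Rainy) of 14 examples.
print(gain_ratio(0.25, [5, 4, 5]))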
Algorithm
The core algorithm for building decision trees, called ID3 and developed by J. R.
Quinlan, employs a top-down, greedy search through the space of possible branches
with no backtracking. ID3 uses entropy and information gain to construct a decision
tree. In the ZeroR model there is no predictor; in the OneR model we try to find the
single best predictor; naive Bayes includes all predictors using Bayes' rule and the
assumption of independence between predictors; a decision tree, by contrast, includes
all predictors with dependence assumptions between predictors.
Entropy
Entropy is the measure of uncertainty in the data. The aim is to reduce the
entropy and maximize the information gain.
o The feature carrying the most information is considered important by the
algorithm and is used for training the model.
o By using information gain you are actually using entropy.
A decision tree is built top-down from a root node and involves partitioning the
data into subsets that contain instances with similar values (homogeneous).
The ID3 algorithm uses entropy to calculate the homogeneity of a sample.
o NB: If the sample is completely homogeneous the entropy is zero, and if the
sample is equally divided it has an entropy of one.
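
A minimal Python sketch of the entropy calculation, Entropy(S) = -Σ Pi log2(Pi), illustrating the two extremes mentioned above (the function name is an illustrative choice):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels, in bits.
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

# Completely homogeneous sample -> entropy 0; equally divided sample -> entropy 1.
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0
print(entropy(["yes", "yes", "no", "no"]))    # 1.0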
Information Gain
Information gain is based on the decrease in entropy after a dataset is split on
an attribute. Constructing a decision tree is all about finding the attribute that
returns the highest information gain (i.e., the most homogeneous branches).
Step 1: Calculate the entropy of the target attribute for the whole dataset (the
entropy before the split).
Step 2: The dataset is then split on the different attributes. The entropy for each
branch is calculated and then added proportionally (weighted by branch size) to get
the total entropy for the split. The resulting entropy is subtracted from the
entropy before the split. The result is the information gain, or decrease in entropy.
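
An illustrative sketch of Step 2 in Python, using an entropy helper like the one above (the names parent_labels and branches are assumptions, not from the lesson):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy, in bits (as in the sketch above).
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, branches):
    # Entropy before the split minus the weighted entropy of the branches.
    total = len(parent_labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent_labels) - weighted

# Splitting 10 examples (6 yes / 4 no) into two branches:
parent = ["yes"] * 6 + ["no"] * 4
branches = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
print(information_gain(parent, branches))  # about 0.26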
Step 3: Choose the attribute with the largest information gain as the decision node,
divide the dataset by its branches, and repeat the same process on every branch.
Step 4a: A branch with an entropy of 0 is a leaf node.
Step 4b: A branch with entropy greater than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all
data is classified.
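
A compact, illustrative sketch of the recursive ID3 loop described in Steps 2 to 5; the dictionary-based tree, the data layout, and the function names are assumptions rather than the lesson's own code:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def id3(rows, target, attributes):
    # rows: list of dicts; target: class column name; attributes: candidate columns.
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # Step 4a: a pure branch becomes a leaf node
        return labels[0]
    if not attributes:                 # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):                    # Step 2: information gain for one attribute
        total = len(rows)
        split_entropy = 0.0
        for value in set(r[attr] for r in rows):
            branch = [r[target] for r in rows if r[attr] == value]
            split_entropy += len(branch) / total * entropy(branch)
        return entropy(labels) - split_entropy

    best = max(attributes, key=gain)   # Step 3: attribute with the largest gain
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # Steps 4b and 5: recurse on each branch
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Tiny Outlook/Play example in the spirit of the lesson's weather illustration.
data = [
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Overcast", "Play": "Yes"},
    {"Outlook": "Rainy", "Play": "Yes"},
]
print(id3(data, "Play", ["Outlook"]))
# e.g. {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'}}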
A decision tree can easily be transformed into a set of rules by mapping the paths
from the root node to the leaf nodes one by one.
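
For example, a small Outlook tree like the one sketched above maps to a set of if-else rules, one per root-to-leaf path (illustrative only):

def play(outlook):
    # Each root-to-leaf path of the Outlook tree becomes one rule.
    if outlook == "Sunny":
        return "No"
    elif outlook == "Overcast":
        return "Yes"
    else:  # Rainy
        return "Yes"

print(play("Overcast"))  # Yes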
http://youtube.com/watch?v=pRaKQC_DKLM