Decision Trees Iterative Dichotomiser 3 (ID3) For Classification: An ML Algorithm
INTRODUCTION
Decision trees are a type of Supervised Machine Learning (that is, the training data contains both the inputs and their corresponding outputs) in which the data is repeatedly split according to a certain parameter.
A (Decision) Tree
The tree can be explained by two entities, namely decision nodes and leaves.
An example of a decision tree can be illustrated with a simple binary tree. Let's say you want to
predict whether a person is fit given information such as their age, eating habits, and physical activity.
The decision nodes are the questions asked about these features (for example, "Is the person younger
than 30?", "Do they exercise?"), and the leaves are the outcomes: either 'fit' or 'unfit'.
In this case, it is a binary classification problem (a yes/no type of problem).
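To make the structure concrete, here is one plausible shape of such a tree written as nested conditionals in Python. The specific questions (the age threshold, the eating and exercise checks) are illustrative assumptions, not a model learned from data; each `if` plays the role of a decision node and each `return` is a leaf.

```python
def predict_fitness(age, eats_a_lot_of_junk_food, exercises_regularly):
    """A toy decision tree: every `if` is a decision node, every return a leaf."""
    if age < 30:
        if eats_a_lot_of_junk_food:
            return "unfit"
        return "fit"
    if exercises_regularly:
        return "fit"
    return "unfit"

print(predict_fitness(25, eats_a_lot_of_junk_food=False, exercises_regularly=True))  # -> fit
```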
DEFINITIONS
Before discussing the ID3 algorithm, we’ll go through a few definitions.
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is a measure of the amount
of uncertainty or randomness in the data. It is defined as H(S) = − Σ_x p(x) · log₂ p(x), where p(x) is the
proportion of examples in S that belong to class x and the sum runs over the classes.
Example:
Consider a coin toss whose probability of heads is 0.5 and the probability of tails is 0.5. Here the
entropy is the highest possible since there’s no way of determining what the outcome might be.
Now consider a coin that has heads on both sides. The outcome of such a toss can be predicted
perfectly since we know beforehand that it will always be heads. In other words, this event has no
randomness, hence its entropy is zero.
In general, lower entropy values imply less uncertainty while higher values imply more uncertainty.
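As a quick sanity check, the two coin scenarios can be computed directly from the entropy formula; the helper below is a minimal sketch written for this example, not a library call.

```python
import math

def shannon_entropy(probabilities):
    """H = -sum(p * log2(p)), skipping outcomes with probability 0 or 1
    (their contribution to the sum is zero)."""
    return sum(-p * math.log2(p) for p in probabilities if 0 < p < 1)

print(shannon_entropy([0.5, 0.5]))  # fair coin       -> 1.0 (maximum uncertainty)
print(shannon_entropy([1.0, 0.0]))  # two-headed coin -> 0   (no uncertainty)
```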
Information Gain
Information gain, also referred to as Kullback-Leibler divergence and denoted by IG(S,A) for a set S, is the
effective change in entropy after deciding on a particular attribute A. It measures the relative change in
entropy with respect to the independent variables:

IG(S, A) = H(S) − H(S | A)

Alternatively,

IG(S, A) = H(S) − Σ_{x ∈ values(A)} P(x) · H(x)

where IG(S,A) is the information gain from applying feature A, H(S) is the entropy of the entire set, and
the second term is the entropy that remains after applying the feature A, with P(x) being the probability of
event x.
WORKING
Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms
that construct Decision Trees, but one of the best known is ID3, which stands for
Iterative Dichotomiser 3.
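Before working through the example, here is a minimal, self-contained sketch of the idea in Python: compute the information gain of every remaining attribute, split on the best one, and recurse until a branch is pure or no attributes are left. The representation (rows as dicts of attribute → value, the tree as nested dicts) and all names are illustrative choices for this sketch, not part of any standard library.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """H(S) = -sum p(x) * log2 p(x), where p(x) is the fraction of each class in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG(S, A) = H(S) - sum over each value x of A of P(x) * H(S_x)."""
    n = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy_of_labels(subset)
    return entropy_of_labels(labels) - remainder

def id3(rows, labels, attributes):
    """Build a tree as nested dicts: {attribute: {value: subtree_or_leaf}}."""
    if len(set(labels)) == 1:          # every example has the same class -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in keep],
                                [labels[i] for i in keep],
                                [a for a in attributes if a != best])
    return tree
```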
Example 1
Consider a piece of data collected over the course of 14 days, where the features are Outlook,
Temperature, Humidity, and Wind, and the outcome variable is whether Golf was played on the day.
Our job is to build a predictive model that takes in these four features and predicts whether Golf
will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.
Remember that the entropy is 0 if all members belong to the same class, and 1 when half of them
belong to one class and the other half to the other, which is perfect randomness. Here it is
0.94, which means the distribution is fairly random.
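For reference, the 0.94 comes straight from the entropy formula: summing the per-branch counts given below, the 14 examples contain 9 'Yes' and 5 'No' outcomes for Play Golf, so

H(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940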
The next step is to find the attribute that gives us the highest possible Information Gain; that
attribute becomes the root node.
We start with the attribute 'Wind':

IG(S, Wind) = H(S) − Σ_{x ∈ values(Wind)} P(x) · H(x)

where 'x' ranges over the possible values of the attribute. Here, the attribute 'Wind' takes two possible
values in the sample data, hence x = {Weak, Strong}.
Amongst all 14 examples, we have 8 where the wind is Weak and 6 where the wind is Strong.
Wind = Weak    Wind = Strong    Total
     8               6            14
Now, out of the 8 Weak examples, 6 were 'Yes' for Play Golf and 2 were 'No'. So we have

H(S_weak) = −(6/8) log₂(6/8) − (2/8) log₂(2/8) ≈ 0.811
Similarly, out of the 6 Strong examples, we have 3 where the outcome was 'Yes' for Play Golf
and 3 where it was 'No':

H(S_strong) = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1.000

Remember, here half the items belong to one class while the other half belong to the other; hence we
have perfect randomness.
Now we have all the pieces required to calculate the Information Gain:

IG(S, Wind) = H(S) − P(Weak) · H(S_weak) − P(Strong) · H(S_strong)
            = 0.940 − (8/14)(0.811) − (6/14)(1.000)
            ≈ 0.048

So considering 'Wind' as the feature gives us an information gain of 0.048. We now calculate the
Information Gain for all the remaining features in the same way.
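As a quick numerical check, the calculation above can be reproduced in a few lines of Python; the helper `H` below is just a local convenience for two-class entropy, not a library function, and the counts are the ones quoted in the walkthrough.

```python
import math

def H(pos, neg):
    """Two-class entropy from raw counts: -sum (c/n) * log2(c/n)."""
    n = pos + neg
    return -sum((c / n) * math.log2(c / n) for c in (pos, neg) if c)

h_all    = H(9, 5)   # whole set: 9 'Yes', 5 'No'      -> ~0.940
h_weak   = H(6, 2)   # Wind = Weak: 6 'Yes', 2 'No'    -> ~0.811
h_strong = H(3, 3)   # Wind = Strong: 3 'Yes', 3 'No'  ->  1.0

ig_wind = h_all - (8 / 14) * h_weak - (6 / 14) * h_strong
print(round(ig_wind, 3))  # -> 0.048
```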
We can clearly see that IG(S,Outlook) has the highest information gain of 0.246, so we choose the
Outlook attribute as the root node. At this point, the decision tree has Outlook at its root, with one
branch for each of its three values: Sunny, Overcast, and Rain.
Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no
coincidence: the tree turns out this simple precisely because the attribute Outlook gives the highest
information gain.
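In the nested-dict shape produced by the id3() sketch earlier, the partial tree at this stage would look roughly like this (the Sunny and Rain entries are placeholders still to be computed):

```python
partial_tree = {
    "Outlook": {
        "Overcast": "Yes",             # pure branch -> leaf
        "Sunny": "<subtree to grow>",  # computed recursively below
        "Rain": "<subtree to grow>",
    }
}
```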
Now how do we proceed from this point? We can simply apply recursion; you might want to look at
the algorithm steps described earlier.
Now that we've used Outlook, three attributes remain: Humidity, Temperature, and Wind. And Outlook
had three possible values: Sunny, Overcast, and Rain. The Overcast branch already ends in the leaf
'Yes', so we're left with two subtrees to compute: those for Sunny and Rain.
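A sketch of that recursive step, under the assumption that `rows` holds the 14 examples as dicts of attribute → value (the table itself is not reproduced here) and `labels` holds the corresponding Play Golf outcomes:

```python
def outlook_subset(rows, labels, value):
    """Keep only the examples whose Outlook equals `value` (e.g. 'Sunny' or 'Rain')."""
    keep = [i for i, row in enumerate(rows) if row["Outlook"] == value]
    return [rows[i] for i in keep], [labels[i] for i in keep]

# The Overcast branch is already a pure 'Yes' leaf; the Sunny and Rain subsets
# are fed back into id3() with the remaining attributes
# (Temperature, Humidity, Wind) to grow their own subtrees.
```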