ID3 Algorithm
1. Introduction
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take actions. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of decision and its outcome. We demonstrate this on ID3, a well-known and influential algorithm for the task of decision tree learning. We note that extensions of ID3 are widely used in real market applications.

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree top-down in a greedy, recursive fashion, testing at each node the attribute that best classifies the remaining training instances (the construction is described in detail in Section 2.2).

For inductive learning, decision tree learning is attractive for three reasons:

1. A decision tree is a good generalization for unobserved instances, but only if the instances are described in terms of features that are correlated with the target concept.

2. The methods are efficient in computation, proportional to the number of observed training instances.

3. The resulting decision tree provides a representation of the concept that appeals to humans because it renders the classification process self-evident.

1.2 Information Gain

Information gain measures the expected reduction in entropy. As we mentioned before, to minimize the decision tree depth when we traverse the tree path, we need to select the optimal attribute for splitting the tree node, and it follows that the attribute with the largest entropy reduction is the best choice. We define information gain as the expected reduction of entropy obtained by splitting a decision tree node on a specified attribute. If splitting a set of transactions S on an attribute partitions it into subsets S1, ..., Sr, then

Gain(S, S1, ..., Sr) = Entropy(S) − Σ_{j=1}^{r} (|Sj| / |S|) · Entropy(Sj)
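To make these quantities concrete, the following minimal Python sketch (ours, not from the original paper) computes the entropy of a set of class labels and the information gain of a candidate split. It assumes the standard Shannon entropy definition, which the paper uses but does not restate here, and the function names are our own.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels: -sum_i p_i * log2(p_i).
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, partitions):
    # Gain(S, S1..Sr) = Entropy(S) - sum_j (|Sj|/|S|) * Entropy(Sj), where
    # `labels` holds the class values of S and `partitions` holds the label
    # lists S1..Sr produced by splitting S on a candidate attribute.
    total = len(labels)
    remainder = sum((len(part) / total) * entropy(part) for part in partitions)
    return entropy(labels) - remainder

# Example: 9 "yes" / 5 "no" labels split into three subsets.
S = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(information_gain(S, split))  # approximately 0.247

The attribute whose split yields the largest value of this function is the one chosen for the node.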
1.3 Related Work
In this paper, we have focused on the problem of minimizing test cost while maximizing accuracy. In some settings, it is more appropriate to minimize misclassification costs instead of maximizing accuracy. For the two-class problem, Elkan gives a method to minimize misclassification costs given classification probability estimates. Bradford et al. compare pruning algorithms that minimize misclassification costs. As both of these methods act independently of the decision tree growing process, they can be incorporated with our algorithms (although we leave this as future work). Ling et al. propose a cost-sensitive decision tree algorithm that optimizes both accuracy and cost. However, the cost-insensitive version of their algorithm (i.e., the algorithm run when all feature costs are zero) reduces to a splitting criterion that maximizes accuracy, which is well known to be inferior to the information gain and gain ratio criteria. Integrating machine learning with program understanding is an active area of current research. Systems that analyze root-cause errors in distributed systems and systems that find bugs using dynamic predicates may both benefit from cost-sensitive learning to decrease monitoring overhead costs.
2. Classification by Decision Tree Learning
This section briefly describes the machine learning and data mining problem of classification and ID3, a well-known algorithm for it. The presentation here is rather simplistic and very brief, and we refer the reader to Mitchell [12] for an in-depth treatment of the subject. The ID3 algorithm for generating decision trees was first introduced by Quinlan in [15] and has since become a very popular learning tool.
2.1 The Classification Problem

In this problem we are given a database of transactions, each described by a set of attributes that can take different values. One of the attributes in the database is designated as the class attribute; the set of possible values for this attribute being the classes. We wish to predict the class of a transaction by viewing only the non-class attributes. This can then be used to predict the class of new transactions for which the class is unknown. For example, the weather problem is a toy data set which we will use to understand how a decision tree is built. It is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under which some hypothetical outdoor game may be played. In this dataset, there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e., we wish to predict the value of play using outlook, temperature, humidity, and windy. We can think of the attribute we wish to predict, i.e., play, as the output attribute, and the other attributes as input attributes.
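As a hypothetical illustration of this input/output framing (the rows below are placeholders of our own; the actual table from Witten and Frank (1999) is not reproduced in this paper), the transactions and the classification task could be represented in Python as follows:

# Each transaction is described by four input attributes; "play" is the class.
transactions = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high",
     "windy": "false", "play": "no"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "normal",
     "windy": "true", "play": "yes"},
]

input_attributes = ["outlook", "temperature", "humidity", "windy"]
class_attribute = "play"

# The task: predict the value of `class_attribute` for a new transaction
# whose class is unknown, using only the input attributes.
new_transaction = {"outlook": "rainy", "temperature": "cool",
                   "humidity": "high", "windy": "true"}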
2.2 Decision Trees and the ID3 Algorithm

The main ideas behind the ID3 algorithm are:

1. Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node.

2. In a “good” decision tree, each non-leaf node should correspond to the input attribute which is the most informative about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. This is because we would like to predict the output attribute using the smallest possible number of questions on average.

The ID3 algorithm assumes that each attribute is categorical, that is, it contains discrete data only, in contrast to continuous data such as age, height, etc. The principle of the ID3 algorithm is as follows. The tree is constructed top-down in a recursive fashion. At the root, each attribute is tested to determine how well it alone classifies the transactions. The “best” attribute (to be discussed below) is then chosen and the remaining transactions are partitioned by it. ID3 is then recursively called on each partition, which is a smaller database containing only the appropriate transactions and without the splitting attribute. The full procedure is given in Figure 1.
Figure 1: The ID3 Algorithm for Decision Tree Learning

ID3(R, C, T), where R is the set of input attributes, C is the class attribute, and T is the set of transactions:

1. If R is empty, return a leaf-node with the class value assigned to the most transactions in T.

2. If T consists of transactions which all have the same value c for the class attribute, return a leaf-node with the value c (finished classification path).

3. Otherwise,
(a) Determine the attribute that best classifies the transactions in T; let it be A.
(b) Let a1, ..., am be the values of attribute A and let T(a1), ..., T(am) be a partition of T such that every transaction in T(ai) has the attribute value ai.
(c) Return a tree whose root is labeled A, with an edge for each value ai leading to the subtree obtained by calling ID3(R − {A}, C, T(ai)).
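As an illustration of the recursion in Figure 1, the following Python sketch is our own and not code from the paper; it uses the information gain of Section 1.2 as the “best attribute” criterion and assumes transactions are represented as dictionaries, as in the earlier sketch.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def id3(attributes, class_attr, transactions):
    # Returns either a class value (leaf) or a pair
    # (split_attribute, {attribute_value: subtree}), following Figure 1.
    labels = [t[class_attr] for t in transactions]
    # Step 2: every transaction has the same class value c.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 1: no attributes remain, so return the majority class in T.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 3(a): choose the attribute A with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for v in set(t[attr] for t in transactions):
            part = [t[class_attr] for t in transactions if t[attr] == v]
            remainder += (len(part) / len(transactions)) * entropy(part)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)

    # Steps 3(b) and 3(c): partition T by the values of A, remove A from the
    # remaining attributes, and recurse on each partition.
    remaining = [a for a in attributes if a != best]
    branches = {}
    for v in set(t[best] for t in transactions):
        subset = [t for t in transactions if t[best] == v]
        branches[v] = id3(remaining, class_attr, subset)
    return (best, branches)

Calling id3(input_attributes, class_attribute, transactions) on a dataset such as the weather data returns a nested structure that can be walked, attribute by attribute, to classify a new transaction.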
3. Conclusion

This paper concludes that ID3 works fairly well on classification problems whose datasets have nominal attribute values. It also works well in the case of missing attribute values, but the way missing attributes are handled actually governs the performance of the algorithm: neglecting instances with missing values for an attribute leads to a high error rate compared to treating the missing value as a separate value. Decision tree induction is one of the classification techniques used in decision support systems and the machine learning process. With the decision tree technique, the training data set is recursively partitioned using a depth-first (Hunt’s method) or breadth-first greedy technique (Shafer et al., 1996) until each partition is pure or belongs to the same class/leaf node (Hunt et al., 1966 and Shafer et al., 1996). The decision tree model is preferred among other classification algorithms because it is an eager learning algorithm and easy to implement.