
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2015): 6.391

A Survey on Decision Tree Algorithms of Classification in Data Mining
Himani Sharma¹, Sunil Kumar²

¹M.Tech Student, Department of Computer Science, SRM University, Chennai, India
²Assistant Professor, Department of Computer Science, SRM University, Chennai, India

Abstract: As computer and computer network technology develop, the amount of data in the information industry continues to grow. It is necessary to analyze this large amount of data and extract useful knowledge from it. The process of extracting useful knowledge from huge sets of incomplete, noisy, fuzzy and random data is called data mining. Decision tree classification is one of the most popular data mining techniques; decision tree induction uses divide and conquer as its basic learning strategy. A decision tree is a structure that includes a root node, branches, and leaf nodes: each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. This paper focuses on the main decision tree algorithms (ID3, C4.5, CART) and their characteristics, challenges, advantages and disadvantages.

Keywords: Decision Tree Learning, Classification, C4.5, CART, ID3

1. Introduction

In order to discover useful knowledge desired by the decision maker, the data miner applies data mining algorithms to the data obtained from the data collector. The privacy issues that come with data mining operations are twofold. If personal information can be directly observed in the data, the privacy of the original data owner (i.e. the data provider) is compromised. On the other hand, equipped with many powerful data mining techniques, the data miner is able to uncover various kinds of information underlying the data, and the mining results sometimes reveal sensitive information about the data owners. Since the data miner receives already-modified data, the objective here was to compare the performance of the previously used classification method against the newly introduced method; previous studies show that ensemble techniques provide better results than the decision tree method, which motivated this work.

1.1 Decision Tree

A decision tree is a flowchart-like tree structure, where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. Given a tuple X, the attribute values of the tuple are tested against the decision tree: a path is traced from the root to a leaf node, which holds the class prediction for the tuple. It is easy to convert decision trees into classification rules. Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value; it is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees; in this tree structure, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
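To make this structure concrete, here is a minimal sketch (an illustration, not code from the paper) of a decision tree as a recursive data structure, together with the root-to-leaf traversal that classifies a tuple; the attribute names and the toy tree are hypothetical:

```python
# Minimal decision-tree structure: an internal node tests one attribute and
# routes the tuple down the branch matching the outcome; a leaf holds a label.
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this internal node
        self.branches = branches or {}  # outcome value -> child Node
        self.label = label              # class label if this is a leaf

def classify(node, tuple_x):
    """Trace a path from the root to a leaf, which holds the class prediction."""
    while node.label is None:
        node = node.branches[tuple_x[node.attribute]]
    return node.label

# A hand-built toy tree over hypothetical weather attributes.
tree = Node(attribute="outlook", branches={
    "sunny": Node(attribute="windy", branches={
        True: Node(label="no"), False: Node(label="yes")}),
    "overcast": Node(label="yes"),
    "rainy": Node(label="no"),
})
print(classify(tree, {"outlook": "sunny", "windy": False}))  # -> 'yes'
```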
Decision trees can be constructed relatively quickly compared with other classification methods, SQL statements for accessing databases efficiently can be constructed from the tree, and decision tree classifiers obtain similar or better accuracy than other classification methods. A number of data mining techniques have already been applied to educational data to improve the performance of students, such as regression, genetic algorithms, Bayes classification, k-means clustering, association rules and prediction. Data mining techniques can be used in the educational field to enhance our understanding of the learning process, focusing on identifying, extracting and evaluating the variables related to students' learning. Classification is one of the most frequently used among them; the C4.5, ID3 and CART decision tree algorithms have been applied to student data to predict performance. These algorithms are explained below.
2. ID3 Algorithm

Iterative Dichotomiser 3 (ID3) is a simple decision tree learning algorithm introduced in 1986 by Ross Quinlan. It is serially implemented and based on Hunt's algorithm. The basic idea of ID3 is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In the decision tree method, the information gain approach is generally used to determine the property suitable for each node of the generated decision tree: we select the attribute with the highest information gain (the greatest reduction in entropy) as the test attribute of the current node. In this way, the information needed to classify the training sample subsets obtained from subsequent partitioning is smallest; that is, partitioning the sample set contained in the current node on this attribute reduces the mixing of classes across the generated sample subsets to a minimum. Hence, using an information-theoretic approach effectively reduces the number of divisions required to classify the objects.
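The information-gain computation described above can be sketched as follows (a minimal illustration under an assumed dataset layout of dictionaries with a 'class' key; not the authors' implementation):

```python
import math
from collections import Counter

def entropy(rows, target="class"):
    """Shannon entropy of the class-label distribution in `rows`."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target="class"):
    """Entropy reduction achieved by partitioning `rows` on `attribute`."""
    total = len(rows)
    # Partition the sample set by the attribute's values.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row)
    # Weighted average entropy of the resulting partitions.
    remainder = sum((len(part) / total) * entropy(part, target)
                    for part in partitions.values())
    return entropy(rows, target) - remainder

# ID3 picks the attribute with the highest gain at each node, e.g.:
weather = [
    {"outlook": "sunny", "windy": False, "class": "no"},
    {"outlook": "sunny", "windy": True,  "class": "no"},
    {"outlook": "overcast", "windy": False, "class": "yes"},
    {"outlook": "rainy", "windy": False, "class": "yes"},
    {"outlook": "rainy", "windy": True,  "class": "no"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(weather, a))
print(best)  # -> 'outlook'
```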
3. C4.5 Algorithm

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan as an extension of his earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. As its splitting criterion, C4.5 uses the gain ratio, i.e. information gain normalized by split information (see Table 1). It can accept data with categorical or numerical values. To handle continuous values, a threshold is generated, and the attribute's values are divided into those above the threshold and those equal to or below it. C4.5 can easily handle missing values, as missing attribute values are not used in its gain calculations.
3.1 The C4.5 algorithm has the following advantages:

• Handling each attribute with a different cost.
• Handling training data with missing attribute values: C4.5 allows attribute values to be marked as '?' for missing, and missing attribute values are simply not used in gain and entropy calculations.
• Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it (see the sketch after this list).
• Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that are not needed by replacing them with leaf nodes.
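The following is a minimal sketch of the gain ratio criterion and the threshold search for a continuous attribute (an illustration under assumed list-based inputs, not C4.5's actual implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5's criterion for a discrete attribute: information gain divided
    by the split information of the partition induced by the attribute."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum((len(p) / total) * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy(list(values))  # entropy of the attribute's own values
    return gain / split_info if split_info > 0 else 0.0

def best_threshold(values, labels):
    """Continuous-attribute handling: test midpoints between consecutive
    sorted values and keep the binary split with the highest gain."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(pairs)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / n
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

print(best_threshold([64, 65, 68, 69, 70], ["yes", "no", "yes", "yes", "no"]))
```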
3.2 C4.5's tree-construction algorithm differs in several respects from CART, for instance:

• Tests in CART are always binary, whereas C4.5 allows two or more outcomes.
• CART uses the Gini index to rank tests, whereas C4.5 uses information-based criteria.
• CART prunes trees with a cost-complexity model whose parameters are estimated by cross-validation, whereas C4.5 uses a single-pass algorithm derived from binomial confidence limits (a cross-validated choice of the pruning parameter is sketched after this list).
• The algorithms also differ when some of a case's values are unknown: CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, whereas C4.5 apportions the case probabilistically among the outcomes.
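To illustrate the cost-complexity idea, here is a minimal sketch using scikit-learn's CART-style learner (a library chosen purely for illustration and not discussed in the paper): candidate pruning strengths are enumerated along the pruning path, and the one that cross-validates best is kept.

```python
# Cost-complexity pruning strength chosen by cross-validation, in the spirit
# of CART (scikit-learn is used here for illustration only).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Enumerate the candidate pruning strengths along the tree's pruning path...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# ...and keep the alpha whose pruned tree scores best under 5-fold CV.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
print(f"best ccp_alpha: {path.ccp_alphas[int(np.argmax(scores))]:.4f}")
```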
4. CART Algorithm

CART stands for Classification And Regression Trees. It was introduced by Breiman in 1984 and builds both classification and regression trees. Classification tree construction in CART is based on binary splitting of the attributes. CART is likewise based on Hunt's algorithm and can be implemented serially. The Gini index is used as the splitting measure when selecting the splitting attribute. CART differs from other algorithms based on Hunt's algorithm in that it can also be used for regression analysis with the help of regression trees; this regression feature is used in forecasting a dependent variable given a set of predictor variables over a given period of time. CART supports continuous and nominal attribute data and has average processing speed.
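A minimal sketch of the Gini diversity index and how it scores a candidate binary split (illustrative code, not from the paper):

```python
from collections import Counter

def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Weighted Gini impurity of a candidate binary split; CART keeps the
    split that minimizes this value."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

print(gini(["yes", "yes", "no"]))             # ~0.444
print(gini_of_split(["yes", "yes"], ["no"]))  # 0.0 (a pure split)
```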
4.1 CART Advantages

1) Non-parametric (no probabilistic assumptions).
2) Automatically performs variable selection.
3) Can use any combination of continuous and discrete variables, and can automatically bin massively categorical variables (e.g. zip code, business class, make/model) into a few categories.
4) Establishes "interactions" among variables, which is good for "rules" search and for hybrid GLM-CART models.

Table 1: Comparisons between different decision tree algorithms

Feature          ID3                    C4.5                         CART
Type of data     Categorical            Continuous and categorical   Continuous and nominal
Speed            Low                    Faster than ID3              Average
Boosting         Not supported          Not supported                Supported
Pruning          No                     Pre-pruning                  Post-pruning
Missing values   Cannot handle them     Cannot handle them           Can handle them
Formula          Information entropy    Split info and gain ratio    Gini diversity index
                 and information gain

5. Decision Tree Learning Software

Some software packages used for the analysis of data and for decision tree learning on commonly used data sets are discussed below.

WEKA: The WEKA (Waikato Environment for Knowledge Analysis) workbench is a set of data mining tools developed by the machine learning group at the University of Waikato, New Zealand. For easy access to this functionality, it contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces. WEKA runs on the Windows, Linux and Mac operating systems and provides various association, classification and clustering algorithms. All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal). It also provides preprocessors such as attribute selection algorithms and filters. WEKA provides J48, with which we can construct trees with EBP (error-based pruning), REP (reduced-error pruning), or no pruning at all.

GATree: GATree (Genetically Evolved Decision Trees) uses genetic algorithms to directly evolve classification decision trees. Instead of using binary strings, it adopts a natural representation of the problem by using a binary tree structure. An evaluation version of GATree is available on request to the authors. To generate decision trees, we can set various parameters such as the number of generations, population size, and crossover and mutation probabilities.
Alice d'ISoft: The Alice d'ISoft software for data mining by decision tree is a powerful and inviting tool that allows the creation of segmentation models. For the business user, this software makes it possible to explore data online, interactively and directly. Alice d'ISoft runs on the Windows operating system, and an evaluation version is available on request to the authors.

See5/C5.0: See5/C5.0 has been designed to analyze substantial databases containing thousands to millions of records and tens to hundreds of numeric, time, date, or nominal fields. It takes advantage of computers with up to eight cores in one or more CPUs (including Intel Hyper-Threading) to speed up the analysis. See5/C5.0 is easy to use and does not presume any special knowledge of statistics or machine learning. It is available for Windows XP/Vista/7/8 and Linux.
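In addition to the packages above, decision tree learners also ship as program libraries. As a brief sketch (scikit-learn is used for illustration and is not one of the tools surveyed in the paper), a CART-style tree can be trained and inspected in a few lines of Python:

```python
# A CART-style tree trained with scikit-learn, shown as a programmatic
# alternative to the GUI tools discussed above (illustration only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3)  # Gini, as in CART
clf.fit(X_train, y_train)

print(f"accuracy: {clf.score(X_test, y_test):.2f}")
print(export_text(clf))  # human-readable view of the learned tree
```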

6. Applications of Decision Trees in Different Areas of Data Mining

Decision tree algorithms are widely used in many areas of real life. Some application areas are listed below.

Business: Decision trees are used in the visualization of probabilistic business models, in customer relationship management, and for credit scoring of credit card users.

Intrusion Detection: Decision trees have been combined with genetic algorithms to automatically generate rules for an intrusion detection expert system. Abbes et al. proposed protocol analysis in intrusion detection using decision trees.

Energy Modeling: Decision trees are used for energy modeling; energy modeling for buildings is one of the important tasks in building design.

E-Commerce: Decision trees are widely used in e-commerce, for example to generate online catalogs, which are essential for the success of an e-commerce web site.

Image Processing: Decision tree classifiers have been used for perceptual grouping of 3-D features in aerial images.

Medicine: Medical research and practice are important application areas for decision tree techniques. Decision trees are most useful in the diagnosis of various diseases and are also used for heart sound diagnosis.

Industry: Decision tree algorithms are useful in production quality control (fault identification) and in non-destructive testing.

Intelligent Vehicles: Finding the lane boundaries of the road is an important task in the development of intelligent vehicles. Gonzalez and Ozguner proposed lane detection for intelligent vehicles using decision trees.

Remote Sensing: Remote sensing is a strong application area for pattern recognition work with decision trees. Researchers have proposed algorithms for the classification of land cover categories in remote sensing, including a binary tree with a genetic algorithm for land cover classification.

Web Applications: Chen et al. presented a decision tree learning approach to diagnosing failures in large Internet sites. Bonchi et al. proposed decision trees for intelligent web caching.

7. Conclusion

This paper studied various decision tree algorithms. Each algorithm has its own pros and cons, as described above. The efficiency of the various decision tree algorithms can be analyzed based on their accuracy and on the time taken to derive the tree. This paper provides students and researchers some fundamental information about decision tree algorithms, tools and applications.

References

[1] Anju Rathee and Robin Prakash Mathur, "Survey on Decision Tree Classification Algorithm for the Evaluation of Student Performance", International Journal of Computers & Technology, Vol. 4, No. 2, March-April 2013, ISSN 2277-3061.
[2] S. Anupama Kumar and M.N. Vijayalakshmi (2011), "Efficiency of Decision Trees in Predicting Student's Academic Performance", D.C. Wyld et al. (Eds.): CCSEA 2011, CS & IT 02, pp. 335-343.
[3] Devi Prasad Bhukya and S. Ramachandram, "Decision Tree Induction: An Approach for Data Classification Using AVL-Tree", International Journal of Computer and Electrical Engineering, Vol. 2, No. 4, August 2010.
[4] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition.
[5] S. Baik and J. Bala (2004), A Decision Tree Algorithm for Distributed Data Mining.
[6] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 1993.
[7] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining.
[8] Brijain R. Patel and Kushik K. Rana, "A Survey on Decision Tree Algorithm for Classification", IJEDR, Vol. 2, Issue 1, 2014.
[9] Nilima Patil and Rekha Lathi (2012), Comparison of C5.0 and CART Classification Algorithms Using Pruning Technique.
[10] S. Baik and J. Bala (2004), A Decision Tree Algorithm for Distributed Data Mining.
[11] Neha Midha and Vikram Singh, "A Survey on Classification Techniques in Data Mining", IJCSMS (International Journal of Computer Science & Management Studies), Vol. 16, Issue 01, July 2015.
[12] Juan Pablo Gonzalez and U. Ozguner (2000). Lane detection using histogram-based segmentation and decision trees. Proc. of IEEE Intelligent Transportation Systems.

[13] M. Chen, A. Zheng, J. Lloyd, M. Jordan and E. Brewer (2004). Failure diagnosis using decision trees. Proc. of the International Conference on Autonomic Computing.
[14] F. Bonchi, F. Giannotti, G. Manco, C. Renso, M. Nanni, D. Pedreschi and S. Ruggieri (2001). Data mining for intelligent web caching. Proc. of the International Conference on Information Technology: Coding and Computing, pp. 599-603.
[15] Ian H. Witten, Eibe Frank and Mark A. Hall (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition.
[16] A. Papagelis and D. Kalles (2000). GATree: Genetically evolved decision trees. Proc. 12th International Conference on Tools with Artificial Intelligence, pp. 203-206.
[17] T. Elomaa (1996). Tools and Techniques for Decision Tree Learning.
[18] R. Quinlan (2004). Data Mining Tools See5 and C5.0. RuleQuest Research.
[19] S. K. Murthy, S. Salzberg, S. Kasif and R. Beigel (1993). OC1: Randomized induction of oblique decision trees. Proc. Eleventh National Conference on Artificial Intelligence, Washington, DC, July 11-15, 1993. AAAI Press, pp. 322-327.
[20] Dipak V. Patil and R. S. Bichkar (2012). Issues in Optimization of Decision Tree Learning:
