


International Journal of Computer Applications (0975 – 8887)

Volume 55– No.6, October 2012

Performance Analysis of Classification Tree Learning Algorithms

D. L. Gupta, Assistant Professor, Department of CSE, KNIT Sultanpur, India
A. K. Malviya, Associate Professor, Department of CSE, KNIT Sultanpur, India
Satyendra Singh, M.Tech Student, Department of CSE, KNIT Sultanpur, India

ABSTRACT
Classification is a supervised learning approach that maps a data item into predefined classes. Various classification algorithms have been proposed in the literature. In this paper the authors use four classification algorithms, J48, Random Forest (RF), Reduced Error Pruning (REP) and Logistic Model Tree (LMT), to classify the “WEATHER NOMINAL” open source data set. The Waikato Environment for Knowledge Analysis (WEKA) is used for the experiments, and the authors find that the Random Forest algorithm classifies the given data set better than the other algorithms for this specific data set. The performance of the classifier algorithms is evaluated with a 5-fold cross-validation test.

Keywords
Decision Tree, J48, Random Forest, REP, LMT, Cross-Validation, Supervised Learning and Performance Measure.

1. INTRODUCTION
Classification is a data mining (machine learning) technique, studied here in its tree-based form. It is used to predict the class of data instances from their attributes. Classification is a method by which one can assign future data to known classes. In general this approach uses a training data set to build a model and a test data set to validate it. Popular classification techniques include decision trees, Naïve Bayes, logistic regression, etc. Supervised classification is generally more accurate than unsupervised classification, but it depends on prior knowledge in the form of labeled training data. The J48 tree algorithm uses a divide-and-conquer strategy, recursively splitting a node into partitions that become its child nodes [1]. Random Forest is an ensemble classifier that builds many decision trees with the same technique, each on a different random sample of the data and attributes, and combines their votes [2], [6]. Reduced Error Pruning performs as well as most of the other pruning methods in terms of accuracy and better than most in terms of tree size [3].

It is very difficult to select a prediction technique in a practical situation, because prediction depends on many factors such as the nature of the problem, the nature of the data set and the uncertain availability of data. Machine learning algorithms are significant classifiers for solving a variety of problems in software development, mainly in software fault prediction. Many researchers and organizations have worked on the prediction of faulty and non-faulty modules, but there is still no single technique that always outperforms the other methods. Hence more research is required to assess the available techniques. The “WEATHER NOMINAL” data set has been analyzed and statistical calculations have been done on it to illustrate the classification of faulty and non-faulty modules. The main focus of this paper is to compare the classification performance of various machine learning algorithms on an open source data set. The comparison is made by calculating the confusion matrix, accuracy and error rate on the chosen data set.

Classification is a technique that organizes data into given classes. The proposed model architecture is shown in Figure 1. It describes the applied classification algorithms, which finally classify modules as faulty or non-faulty.

[Figure 1: flowchart with the stages Input DataSet, Check Input Data Set (Y/N), Preprocessing, Selection Attributes, Apply Classification Algorithms, Model Classifier Rules OK, Extract DataSet Classes, and Software Quality Prediction Modules]

Fig 1: Proposed Model Architecture

Classification may be used for classifying cancer cells as working or damaged, classifying card transactions as authorized or unauthorized, classifying food items as vitamins, minerals, proteins or carbohydrates, classifying news into sports, weather, stocks, etc.

The rest of the paper is organized as follows. Section 2 describes the basic classification learning algorithms. Section 3 describes the performance measures for classification. Section 4 explains the experimental results and analysis. The conclusion and future work are given in Section 5.

2. CLASSIFICATION LEARNING ALGORITHMS
Classification techniques can be compared on the basis of predictive accuracy, speed, robustness, scalability and interpretability [4]. In data mining, a classification tree is a supervised learning algorithm. The authors therefore prepared four popular classifiers: J48, Random Forest, Reduced Error Pruning, and Logistic Model Tree. For comparison purposes, the authors have also prepared the fault-prone filtering techniques. A good classification model should identify the fault-prone (fp) modules correctly. The algorithm C5.0 is reported to be superior to C4.5. J48, WEKA's implementation of C4.5, is an enhanced version of it, but the two algorithms work in a very similar way. The goal of a decision tree is to predict the response on a categorical dependent variable from the measurements of one or more predictors. The WEATHER NOMINAL data set has 5 attributes and 14 instances, as shown in Table 1.

Table 1. The data set used in our analysis

Weather Nominal Data Set
Attributes        5
Instances         14
Sum of Weights    14

2.1 Decision Trees
A decision tree is a flow-chart-like tree structure. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions [4][9]. The topmost node in a tree, shown by an oval, is the root node. Further internal nodes are represented by rectangles, and leaf nodes are denoted by circles, as depicted in Figure 3.

2.1.1 J48 Algorithm
J48 is a tree-based learning approach. It derives from the work of Ross Quinlan and is based on the Iterative Dichotomiser 3 (ID3) algorithm [1]. J48 uses a divide-and-conquer algorithm to split a root node into partitions until leaf nodes (target nodes) are reached. Given a set T of instances, the following steps are used to construct the tree structure.

Step 1: If all the instances in T belong to the same class, or T has too few instances, then the tree is a leaf labeled with the most frequent class in T.

Step 2: Otherwise, select a test based on a single attribute with two or more possible outcomes. Make this test the root node of the tree with one branch for each outcome of the test, partition T into corresponding subsets T1, T2, T3, ... according to the outcome for each instance, and apply the same procedure recursively to each subset.

Step 3: J48 ranks the candidate tests using two heuristic criteria: information gain and, by default, gain ratio.

2.1.2 Random Forest Algorithm
The Random Forest algorithm was initially developed by Leo Breiman, a statistician at the University of California, Berkeley [2]. Random Forest is a method by which one can often obtain a better accuracy rate. Some attributes of Random Forest are listed below [7].

1) It works efficiently on large (training) data sets.
2) It provides more consistent accuracy than the other algorithms.
3) It can estimate missing data and retains its accuracy even when a large proportion of the data is missing.
4) It also provides an estimate of which attributes are important in the classification.

The strengths of RF are listed below [6][8][10].

1) RF may produce a highly accurate classifier for many data sets.
2) RF is simple.
3) RF provides a fast learning approach.

2.1.3 Reduced Error Pruning
This method was introduced by Quinlan [11]. It is the simplest and most understandable method of decision tree pruning. For every non-leaf subtree of the original decision tree, the change in misclassification over a separate test (pruning) set is examined. The REP tree learner described by Witten and Frank in 1999 is a fast decision and regression tree learner that uses information gain or variance reduction, with the data set split into a training set and a pruning set.

Traversing the tree from bottom to top, the procedure checks each internal node and replaces it with a leaf labeled with the most frequent class, provided that the accuracy of the tree is not reduced. The node is then pruned. This procedure continues until any further pruning would decrease the accuracy.

2.1.4 Logistic Model Tree
The Logistic Model Tree (LMT) algorithm [12] builds a tree that handles binary and multiclass target variables, numeric attributes and missing values. It combines tree induction with logistic regression: the leaves of the tree hold logistic regression models. LMT produces a single tree containing binary splits on numeric attributes.

2.2 Cross-Validation Test
The cross-validation (CV) method is used to validate the predicted model. A CV test divides the training data into a number of partitions or folds. The classifier is evaluated by its accuracy on one fold after being trained on the others. This process is repeated until all partitions have been used for evaluation [13]. The most common types are 10-fold, n-fold and bootstrap, and the results are combined into a single estimate.

3. PERFORMANCE MEASURES FOR CLASSIFICATION
One can use the following performance measures for the classification and prediction of fault-prone modules, according to one's own needs.

3.1 Confusion Matrix
The confusion matrix is used to measure the performance of a two-class problem for the given data set (Table 2). The main diagonal elements TP (true positive) and TN (true negative) count the correctly classified instances, while FP (false positive) and FN (false negative) count the incorrectly classified instances.

[Figure 2: the confusion matrix splits into correctly classified instances (TP+TN) and incorrectly classified instances (FP+FN)]

Fig 2: Confusion Matrix
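As a rough illustration of how the four cells of a two-class confusion matrix are counted, the sketch below tallies TP, FN, FP and TN from a list of actual and predicted labels. The function name `confusion_counts` and the label lists are hypothetical and made up for illustration; they are not taken from the paper's experiments.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            # Actual positive: either found (TP) or missed (FN).
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Actual negative: either wrongly flagged (FP) or rejected (TN).
            counts["FP" if p == positive else "TN"] += 1
    return counts["TP"], counts["FN"], counts["FP"], counts["TN"]


# Hypothetical labels, invented for this example only.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
correct = tp + tn      # correctly classified instances
incorrect = fp + fn    # incorrectly classified instances
print(tp, fn, fp, tn, correct, incorrect)
```

The two sums at the end correspond to the two branches of Fig 2: correctly classified instances (TP+TN) and incorrectly classified instances (FP+FN).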


Table 2. Example of Confusion matrix

                Predicted
Actual      Yes     No
Yes         TP      FN
No          FP      TN

Table 3. Actual vs. Predicted Confusion matrix

                        Predicted
Actual          Faulty-Prone            Non-Faulty Prone
Faulty          True Positive (TP)      False Negative (FN)
Non-faulty      False Positive (FP)     True Negative (TN)

Total number of instances = Correctly classified instances + Incorrectly classified instances

Correctly classified instances = TP + TN

Incorrectly classified instances = FP + FN

3.2 Cost Matrix
A cost matrix is similar to a confusion matrix; the minor difference is that the cost is found through the misclassification error rate.

Misclassification error rate = 1 - Accuracy

3.3 Calculate Value TPR, TNR, FPR, and FNR
One can calculate the values of the true positive rate, true negative rate, false positive rate and false negative rate as shown below.

$$\text{TPR} = \frac{TP}{TP + FN}$$

$$\text{TNR} = \frac{TN}{FP + TN}$$

$$\text{FPR} = \frac{FP}{FP + TN}$$

$$\text{FNR} = \frac{FN}{TP + FN}$$

3.4 Recall
Recall is the ratio of modules correctly classified as fault-prone to the total number of faulty modules.

$$\text{Recall} = \frac{TP}{TP + FN}$$

3.5 Precision
Precision is the ratio of modules correctly classified as fault-prone to the total number of modules classified as fault-prone. It is the proportion of units predicted as faulty that are actually faulty.

$$\text{Precision} = \frac{TP}{TP + FP}$$

3.6 F-Measure
The F-measure (FM) is a combination of recall and precision. It is defined as the harmonic mean of precision and recall.

$$\text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$

3.7 Accuracy
Accuracy is defined as the ratio of correctly classified instances to the total number of instances.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

4. EXPERIMENTAL WORK AND ANALYSIS
The repository data contains 5 attributes and 14 instances. The WEKA tool [5] was applied to the “WEATHER NOMINAL” data set using 5-fold cross-validation for performance evaluation of the different algorithms. Table 4 shows the confusion matrix for the four algorithms, mapping the actual and predicted values for each algorithm. In Table 5 the authors have calculated the Precision, Recall, F-measure and accuracy values, and they have found that the accuracy of RF is 57.14%, which is the best among the J48, RF, REP and LMT methods of classification.

Table 6 shows the instances correctly predicted vs. the instances incorrectly predicted, together with the accuracy and the total execution time taken by each algorithm. The accuracy of RF is greater than that of the other examined techniques, but the time taken to build its model is greater than for J48 and REP. One can also observe that the total time taken to build a model is lowest for the REP model. The authors have also computed the error rate of each examined algorithm, which is given in Table 7. They observed that RF shows the lowest error rate of the techniques and J48 the highest. Figure 4 depicts the accuracy of the classifiers, and the error rate is shown in Figure 5.

Figure 3 depicts the prediction of whether the “Tennis Game” will be played or not. Here the authors have taken an open source data set named “WEATHER NOMINAL”, which contains 5 attributes and 14 instances. The conditions shown in this figure are:

If outlook is sunny and humidity is normal then play the game, otherwise do not play it.

If outlook is overcast then the game will definitely be played.

If outlook is rainy and it is not windy then the game will be held, otherwise the game will not be played.
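The three conditions above can be written as a small function. This is only a sketch of the rules read off Fig 3, not the model WEKA produced; the function name `predict_play` and the string encodings of the attribute values are assumptions, and the rainy rule is read as "play when it is not windy".

```python
def predict_play(outlook, humidity, windy):
    """Apply the three rules of Fig 3; returns "play" or "not-play"."""
    if outlook == "overcast":
        # Overcast days are always played.
        return "play"
    if outlook == "sunny":
        # Sunny days are played only when humidity is normal.
        return "play" if humidity == "normal" else "not-play"
    if outlook == "rainy":
        # Rainy days are played only when it is not windy.
        return "not-play" if windy else "play"
    raise ValueError(f"unknown outlook: {outlook!r}")


print(predict_play("sunny", "high", False))
print(predict_play("overcast", "high", True))
print(predict_play("rainy", "normal", True))
```

Temperature does not appear in any rule, so the function does not take it as an argument.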


Table 4. Classifiers Confusion Matrix

                                        Predicted
              J48               RF                REP               LMT
Actual        Play  Not-Play    Play  Not-Play    Play  Not-Play    Play  Not-Play
Play          4     5           5     4           7     2           6     3
Not-Play      3     2           2     3           5     0           4     1

outlook
  sunny    -> humidity
                high   -> Not-play
                normal -> play
  overcast -> play
  rainy    -> windy
                false  -> play
                true   -> Not-play

Fig 3: Decision tree for the weather data
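Section 2.1.1 names information gain and gain ratio as the criteria J48 uses to choose a test. The sketch below computes plain information gain (not gain ratio, which J48 uses by default) for each attribute, to suggest why outlook sits at the root of the tree in Fig 3. The 14 rows are an assumption: they are the contents of the weather.nominal.arff file distributed with WEKA, which the paper does not list. The helper names `entropy` and `info_gain` are hypothetical.

```python
from collections import Counter
from math import log2

# Assumed contents of WEKA's weather.nominal.arff:
# (outlook, temperature, humidity, windy, play)
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ATTRS = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, col):
    """Entropy of the class minus the weighted entropy after splitting on col."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("best split:", best)
```

Under these assumptions outlook has the largest gain, which is consistent with it being the first test in the tree.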

Table 5. Prediction Performance Measures

                                            Precision         Recall
Algorithms   TPR     FPR    TNR    FNR      PR      NR        PR      NR      F-Measure   Accuracy in %
J48          0.444   0.6    0.4    0.556    0.571   0.286     0.444   0.4     0.5         42.85
RF           0.556   0.4    0.6    0.444    0.714   0.429     0.556   0.6     0.625       57.14
REP          0.778   1      0      0.222    0.583   0         0.778   0       0.667       50.00
LMT          0.667   0.8    0.2    0.333    0.6     0.25      0.667   0.2     0.632       50.00
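The entries of Table 5 can be recomputed from the confusion matrices in Table 4. The sketch below does this with "Play" as the positive class and the standard definitions FPR = FP / (FP + TN) and FNR = FN / (TP + FN), which are the ones that reproduce the FPR and FNR columns. The dictionary `TABLE_4` and the function `measures` are hypothetical names introduced for this example.

```python
# (TP, FN, FP, TN) per classifier, copied from Table 4
# with "Play" taken as the positive class.
TABLE_4 = {
    "J48": (4, 5, 3, 2),
    "RF": (5, 4, 2, 3),
    "REP": (7, 2, 5, 0),
    "LMT": (6, 3, 4, 1),
}


def measures(tp, fn, fp, tn):
    """Compute the Section 3 measures from one confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
        "FNR": fn / (tp + fn),
        "Precision": precision,
        "Recall": recall,
        "F-Measure": 2 * recall * precision / (recall + precision),
        "Accuracy %": 100 * (tp + tn) / (tp + fn + fp + tn),
    }


results = {name: measures(*cm) for name, cm in TABLE_4.items()}
for name, m in results.items():
    print(name, {key: round(value, 3) for key, value in m.items()})
```

The accuracy of J48 comes out as 6/14, about 42.857%, which the tables report truncated as 42.85.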

Table 6. Performance Measure about Confusion matrix

Classifier    Instances Correctly   Instances Incorrectly   Accuracy   Total Time Taken to Build
Algorithms    Predicted             Predicted               in %       Model (in seconds)
J48           6                     8                       42.85      0.02
RF            8                     6                       57.14      0.08
REP           7                     7                       50.00      0.01
LMT           7                     7                       50.00      0.14


[Figure 4: bar chart of the accuracy of each classifier (J48, RF, REP, LMT), vertical axis from 0% to 60%]

Fig 4: Accuracy of Classifiers

Table 7. Error rate of classifiers for 5-fold cross-validation

Classifier         J48      RF       REP      LMT
Error Rate in %    57.14    42.85    50.00    50.00
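To make the 5-fold procedure behind these numbers concrete, the sketch below partitions the 14 instance indices into 5 folds so that every instance is used for testing exactly once. It is a simplified illustration: it neither shuffles nor stratifies the data, so it should not be read as WEKA's own fold assignment. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional instance.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(14, 5))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: train on {len(train)} instances, test on {test}")
```

With 14 instances each test fold holds only two or three instances, so a single misclassification changes the reported error rate noticeably.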

[Figure 5: bar chart of the error rate of each classifier (J48, RF, REP, LMT), vertical axis from 0.00% to 60.00%]

Fig 5: Error Rate

5. CONCLUSION AND FUTURE WORK
In this paper the authors have examined the J48, RF, REP and LMT methods of classification and observed that RF has the highest accuracy and the lowest error rate. On the basis of the accuracy measures of the classifiers, one can provide guidelines regarding fault-prone prediction issues for a given data set in the respective situations.

More similar studies on different data sets for machine learning approaches are needed to confirm the above finding.

6. REFERENCES
[1]. J. R. Quinlan, “C4.5: Programs for Machine Learning”, San Mateo, CA, Morgan Kaufmann Publishers, 1993.
[2]. L. Breiman, “Random Forests”, Machine Learning, Vol. 45(1), pp. 5-32, 2001.
[3]. F. Esposito, D. Malerba, and G. Semeraro, “A Comparative Analysis of Methods for Pruning Decision Trees”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19(5), pp. 476-491, 1997.
[4]. J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2004.
[5]. WEKA: http://www.cs.waikato.ac.nz/ml/weka.
[6]. T. K. Ho, “The Random Subspace Method for Constructing Decision Forests”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20(8), pp. 832-844, 1998.


[7]. Random Forest by Leo Breiman and Adele Cutler: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.
[8]. G. Biau, L. Devroye, G. Lugosi, “Consistency of Random Forests and Other Averaging Classifiers”, Journal of Machine Learning Research, 2008.
[9]. J. R. Quinlan, “Induction of Decision Trees”, Machine Learning, Vol. 1, pp. 81-106, 1986.
[10]. F. Livingston, “Implementation of Breiman’s Random Forest Machine Learning Algorithm”, Machine Learning Journal, 2008.
[11]. J. R. Quinlan, “Simplifying Decision Trees”, International Journal of Human-Computer Studies, Vol. 51, pp. 497-510, 1999.
[12]. N. Landwehr, M. Hall, and E. Frank, “Logistic Model Trees”, Machine Learning, Vol. 59(1-2), pp. 161-205, 2005.
[13]. N. Lavesson and P. Davidsson, “Multi-dimensional measures function for classifier performance”, 2nd IEEE International Conference on Intelligent Systems, pp. 508-513, 2004.
