Practical 5: Introduction To Weka For Classification
Introduction to Weka
4. Have a look at the different attributes. In Current relation, we can see that there are 14
instances and 5 attributes in the dataset. Click on each attribute to see its properties in
Selected attribute and a graph of the distribution of the values of the attribute. The colours
in the graph each correspond to a class. Pay attention to the type of the attributes. In this
dataset all the attributes are Nominal: the values indicate different distinct categories that
describe the attribute. An attribute could also be Numeric: the values are numbers that
measure the attribute. Notice that play has been suggested as the class attribute, that is the
one that is predicted from the other attributes.
5. Have a look at the data (Edit… button). We can see all the data in this window. Each of the
rows corresponds to an instance and the columns are the attributes.
6. We are now going to build a decision tree. On the Classify tab, the default classifier is
ZeroR, so click on Choose and select the Id3 classifier from the trees group.
ID3 is one of the simplest decision tree classifiers. Clicking on the classifier name text box, in
this case Id3, will bring up a window providing a very short description of the classifier. Click
on More for a few more details and on Capabilities to see the kinds of attributes and classes
the classifier can handle.
This information tells us that the ID3 algorithm can only handle nominal attributes and
cannot deal with missing values. All the attributes in our dataset are nominal and, provided
the Selected attribute panel reports no missing values, we can therefore apply it to our data.
Note that classifiers that are not compatible with the data are greyed out and cannot be
selected.
7. The Test options allow us to choose how the classifier is trained and evaluated.
a. Use training set will use all the data for both training and testing, so the classifier
is evaluated on the same instances it was trained on.
b. Supplied test set allows you to provide a separate test set.
c. Cross-validation will perform cross-validation according to the number of folds
provided. This means that the data will be split into k subsets of (approximately)
equal size, where k is the number of folds. For each value of i in {1, 2, … , k} the
classifier will be tested on the ith subset after being trained on all the other data.
The k results are then averaged to describe the performance of the classifier.
d. Percentage split will train the classifier on the indicated percentage of the data and
test it on the rest.
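The cross-validation procedure in option c can be sketched in a few lines of Python. This is only an illustration of the splitting and averaging logic, not Weka's own implementation: the function names k_fold_indices and cross_validate and the evaluate callback are hypothetical, and the sketch neither shuffles nor stratifies the data, which a real tool would normally do.

```python
def k_fold_indices(n_instances, k):
    """Split the indices 0..n_instances-1 into k consecutive folds.

    Fold sizes differ by at most one when n_instances is not a multiple of k.
    """
    base, extra = divmod(n_instances, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_instances, k, evaluate):
    """Average evaluate(train_idx, test_idx) over the k folds.

    `evaluate` is a hypothetical callback: it should train a classifier on
    the training indices and return its score on the test indices.
    """
    folds = k_fold_indices(n_instances, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one, test on the i-th one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


if __name__ == "__main__":
    # 14 instances, as in the dataset used here, split into 10 folds.
    for fold in k_fold_indices(14, 10):
        print(fold)
```

With 14 instances and 10 folds the subsets cannot all have the same size, so the sketch gives the first four folds two instances each and the remaining six folds one instance each.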
8. The dropdown menu allows us to choose the class attribute. Here play has already been
appropriately suggested. Clicking Start will execute the training and evaluation process. For
the moment, let us just try Use training set and click Start.
9. The results are displayed in the Classifier output panel.
a. The decision tree is given in text form:
outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes
and corresponds to the following tree diagram.
[Figure: decision tree with root outlook. The sunny branch leads to humidity (high: no,
normal: yes), the overcast branch leads to yes, and the rainy branch leads to windy
(TRUE: no, FALSE: yes).]
Unfortunately, in Weka, we cannot see a visualisation of a tree produced by ID3.
However, this is possible for the J48 classifier, which is an implementation of the
C4.5 algorithm. To visualise a tree, right-click on the corresponding result in the
Result list and choose Visualize tree.
b. A summary of the evaluation gives information such as the percentage of correctly
and incorrectly classified instances.
c. Detailed accuracy information is given for each class. TP and FP refer to True
Positives and False Positives respectively.
d. The confusion matrix contains information about the prediction in terms of true and
false positives and true and false negatives.
                    Predicted positive    Predicted negative
Actual positive     # true positives      # false negatives
Actual negative     # false positives     # true negatives
You can find lots of information about classifier accuracy and confusion matrices
online, for example
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology
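To make the link between the tree, the TP and FP counts and the confusion matrix concrete, here is a minimal Python sketch. The function predict_play encodes the tree printed in the Classifier output; confusion_matrix and the five test instances are hypothetical and were made up for illustration, they are not taken from the dataset or from Weka's output. The sketch treats yes as the positive class, which is an assumption.

```python
def predict_play(instance):
    """Classify one instance using the tree shown in the Classifier output."""
    if instance["outlook"] == "sunny":
        return "no" if instance["humidity"] == "high" else "yes"
    if instance["outlook"] == "overcast":
        return "yes"
    # Any remaining value is treated as outlook = rainy.
    return "no" if instance["windy"] == "TRUE" else "yes"


def confusion_matrix(actual, predicted, positive="yes"):
    """Return the counts (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical labelled instances (NOT from the dataset), chosen so that
# the tree makes one false negative and one false positive.
examples = [
    ({"outlook": "sunny", "humidity": "high", "windy": "FALSE"}, "no"),
    ({"outlook": "sunny", "humidity": "normal", "windy": "TRUE"}, "yes"),
    ({"outlook": "overcast", "humidity": "high", "windy": "TRUE"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal", "windy": "TRUE"}, "yes"),
    ({"outlook": "rainy", "humidity": "high", "windy": "FALSE"}, "no"),
]

actual = [label for _, label in examples]
predicted = [predict_play(inst) for inst, _ in examples]
tp, fn, fp, tn = confusion_matrix(actual, predicted)

accuracy = (tp + tn) / len(examples)  # fraction of correctly classified instances
tp_rate = tp / (tp + fn)              # true positives over all actual positives
fp_rate = fp / (fp + tn)              # false positives over all actual negatives

if __name__ == "__main__":
    print("TP, FN, FP, TN =", (tp, fn, fp, tn))
    print("accuracy =", accuracy, "TP rate =", tp_rate, "FP rate =", fp_rate)
```

The four counts fill the confusion matrix row by row: the first row holds TP and FN (actual positives), the second holds FP and TN (actual negatives).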
Classification