CHAPTER 9

Guide to Using the System
We must finally descend to the level of nitty-gritty detail to discuss
the way that C4.5 is used. The advice given by the Hitch-Hikers'
Guide
is relevant here: DON'T PANIC! Although there are numerous options
that control how the system behaves, many of these need never concern
the typical user.
In the following sections it is often necessary to denote parts of commands and file names that may or may not be present. We follow the
usual convention of enclosing this optional material in brackets; remember that these brackets should be omitted when typing commands.
9.1 Files
9.1.1 Filestem
As mentioned in Chapter 1, every task needs a short name, referred to
as its filestem, that identifies its files. All files read and written by the
system are of the form filestem.extension, where extension characterizes
the type of information involved. The filestem used in Chapter 1 was
labor-neg; generally, a filestem can be any string of characters that is
acceptable as a file name to your operating system. If your environment
imposes restrictions on the length of file names, the filestem should be
at least nine characters shorter than this limit to allow for the addition
of extensions, and shorter still if you intend to use the cross-validation
shell script.

9.1.2 Names file
The fundamental file for any task is the names file, called filestem.names,
that provides names for classes, attributes,
and attribute values. Each
name consists of a string of characters of any length, with some restrictions:
- A name cannot be the single character "?".
- Although any character can appear in a name, the special characters
  comma (,), colon (:), vertical bar (|), and backslash (\) have particular
  meanings and must be escaped (preceded by a backslash character) if
  they appear in a name.
- A period may appear in a name provided it is not followed by a space.
- Embedded spaces are also permitted in a name, but multiple whitespace
  characters (spaces and tabs) are replaced by a single space.
Otherwise, the form of a name is rather unconstrained: for example,
numbers are perfectly acceptable as names.

The names file consists of a series of entries, each starting on a new
line and ending with a period. Blank lines, spaces, and tabs may be used
to make the file more readable and have no significance (except within
a name). In addition, the vertical bar character (|) appearing anywhere
on a line causes the rest of that line to be ignored, and can be used to
incorporate comments in the file.

The first entry in the names file gives the class names, separated by
commas (and don't forget the period at the end of the entry!). There
must be at least two class names and their order is not important.

The rest of the file consists of a single entry for each attribute. An
attribute entry begins with the attribute name followed by a colon, and
then a specification of the values that the attribute can take. Four
specifications are possible:

- ignore; causes the value of the attribute to be disregarded.
- continuous; indicates that the attribute has numeric values, either
  integer or floating point.
- discrete N, where N is a positive integer; specifies that the attribute
  has discrete values, and that there are no more than N of them.
- A list of names separated by commas; also indicates that the attribute
  has discrete values and specifies them explicitly (preferred to discrete
  since it enables the data to be checked). As with class names, the
  order of attribute values is arbitrary.

Each entry is terminated with a period.

This will probably be clearer if you look again at the sample names
file in Figure 1-1. The first line contains the class names and is followed
by an optional blank line to improve readability. Next come 16 entries,
one for every attribute. Each attribute name is followed by a colon and
then either continuous (if it is a real-valued attribute) or a list of the
possible values it can take if it is a discrete attribute. In these entries,
tabs have been used to make the file more intelligible. Some attribute
names, such as wage increase first year, and some attribute values, such
as below average, contain embedded spaces, which are significant; all
other spaces and tabs are ignored.

9.1.3 Data file
The data file filestem.data is used to describe the training cases from
which decision trees and/or rulesets are to be constructed.
Each line
describes one case, providing the values for all the attributes and then
the case's class, separated by commas and terminated by a period. The
attribute values must appear in the same order that the attributes were
given in the names file. The order of cases themselves does not matter.
Names used as attribute or class values obey the same restrictions as
for the names file; in particular, embedded special characters must be
escaped by preceding them with a backslash. If the value of an attribute,
either discrete or continuous, is not known or is not relevant, that value
is specified by a question mark. Values of continuous attributes may
be given in integer, fixed-point, or floating-point form, so that all the
following are acceptable:
1, 1.2, +1.2, -.2, -12E-3, 0.00012
In fact, any numeric value that is acceptable to your local C compiler
will probably work.
The data file can also contain embedded comments. The appearance
of the character | (unescaped) anywhere on a line causes the remainder
of that line to be ignored. It is unwise to commence a comment in the
middle of a name.
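The value-separation and escaping rules above can be sketched with a small parser. This is an illustrative Python sketch, not code from the C4.5 distribution; the helper name and the sample line are invented:

```python
import re

def parse_case(line):
    """Split one data-file line into (attribute values, class).

    Illustrative only: splits on commas that are not escaped by a
    backslash, strips the terminating period, collapses runs of
    whitespace, and maps "?" (unknown or irrelevant) to None.
    """
    line = line.strip()
    if line.endswith("."):
        line = line[:-1]
    fields = re.split(r"(?<!\\),", line)   # split on unescaped commas only
    values = []
    for field in fields:
        field = re.sub(r"\s+", " ", field.strip())
        field = field.replace("\\,", ",").replace("\\:", ":")
        values.append(None if field == "?" else field)
    return values[:-1], values[-1]

# A hypothetical case with an unknown value and an escaped comma:
attrs, cls = parse_case("below average, ?, 4\\,5, good.")
```

Here attrs comes out as ['below average', None, '4,5'] and cls as 'good', mirroring the rules stated above for unknown values and escaped special characters.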
9.1.4 Test file
For some applications, you may decide to reserve part of the available
data as a test set to evaluate the classifier you have produced. If there
is one, the test set appears in the file filestem.test, in exactly the same
format as the data file.
9.1.5 Files generated by the system
The programs C4.5 and C4.5RULES and the cross-validation script
described later also generate files with the same filestem as above. As a
user, you will not have to worry about these, other than to make sure
that you do not delete or modify them while they are still relevant! The
principal file extensions are:
- unpruned, containing the one or more unpruned trees generated by
  C4.5. Used by C4.5RULES.
- tree, the final pruned tree; if more than one tree is generated via
  windowing, this is the tree chosen as the best. Used by CONSULT.
- rules, the final ruleset generated by C4.5RULES. Used by CONSULTR.
- toi[+identifier] and roi[+identifier], for i = 0, 1, 2, ...; output of
  the decision tree and rules programs on different sections of a
  cross-validation experiment that has an optional identifier.
- tres[+identifier] and rres[+identifier]; summary of results for trees
  and rules on the cross-validation.

9.1.6 Size restrictions

The software does not impose any (practical) limit on the number of
classes, attributes, attribute values, or cases in the data and test files.
The numbers of classes, attributes, and discrete values per attribute
must all be representable by short integers; in most implementations of
C, this allows more than 16,000 of each. The numbers of training and
test cases are represented by full integers, normally permitting about
10^9 of each. However, your particular operating system may limit the
amount of memory that can be allocated to any one process, affecting
the size of problem that can be attempted. If a large training set is
processed on a machine with a small physical memory, the programs
will run very slowly as a result of the need to swap pages in and out.

9.2 Running the programs

Each of the programs allows options to be invoked on the command
line. Although some of them are rarely used, they are all given here for
completeness, together with the default values that are used if the option
is not invoked. Options can appear in any order.

9.2.1 Decision tree induction

The command for invoking the program to build a decision tree is c4.5.
The options that can be used with this command are:

-f filestem  (Default: DF)
    This option is almost always used to specify the filestem of the task
    as above. If no filestem is given, the default filestem DF is assumed.

-u  (Default: no test set)
    This option is invoked when a test file has been prepared.

-s  (Default: no grouping)
    As described in Chapter 7, this option causes the values of discrete
    attributes to be grouped for tests. The default is no grouping, in
    which case tests on discrete attributes have a separate branch for
    every possible value of the attribute.

-m weight  (Default: 2)
    Near-trivial tests in which almost all the training cases have the
    same outcome can lead to odd trees with little predictive power. To
    avoid this undesirable eventuality, C4.5 requires that any test used
    in the tree must have at least two outcomes with a minimum number
    of cases (or, to be more precise, the sum of the weights of the cases
    for at least two of the subsets Ti must attain some minimum). The
    default minimum is 2, but can be changed by this option; a higher
    value may be a good idea for tasks where there is a lot of noisy data.

-c CF  (Default: 25%)
    The CF value affects decision tree pruning, discussed in Chapter 4.
    Small values cause heavier pruning than large values, with a most
    pronounced effect on smaller data sets. The default value seems to
    work reasonably well for many tasks, but you might want to alter it
    to a lower value if you notice that the actual error rate of pruned
    trees on test cases is much higher than the estimated error rate
    (indicative of underpruning).

-v level  (Default: level 0)
    The program contains embedded code to show what is going on
    during the execution of the algorithms. Level 0, the default,
    generates no expository output of this kind, level 1 a little, and so
    on. Levels 3 and above can generate an enormous amount of output
    which is meaningful only to people who have an intimate knowledge
    of the code. The maximum level is 5.

-t trees  (Default: 10)
    This is one of the options that can be used to invoke windowing,
    discussed in Chapter 6, in which trees are grown iteratively. It
    specifies the number of trees to be grown in this manner before one
    is selected as the best. Unless one of the windowing options is
    specified, the program will grow a single tree in the standard way
    as described in Chapter 2.

-w size  (Default: determined by data file)
    This option invokes windowing and specifies the number of cases
    to be included in the initial window. The default is the maximum of
    20% of the training cases and twice the square root of the number
    of training cases.

-i increment  (Default: determined by window size)
    This option activates windowing and gives the maximum number
    of cases that can be added to the window at each iteration. The
    default is 20% of the initial window size. Whatever the value of this
    option, however, at least half of the training cases misclassified by
    the current tree are added to the next window.

-g  (Default: gain ratio criterion)
    The criterion used for assessing possible splits, described in
    Chapter 2, can be changed to the older gain criterion by this option.

-p  (Default: hard thresholds)
    When this option is invoked, subsidiary cutpoints Z+ and Z- are
    determined for each test on a continuous attribute. The subsidiary
    cutpoints affect only the way cases are classified using the
    interactive decision tree interpreter as described in Chapter 8.

For example, the command

    c4.5 -c 10 -u -f mytask -s

would invoke the decision tree program on a task with filestem mytask
(so with names file mytask.names and training cases in mytask.data). A
single tree would be constructed with grouped-value tests for discrete
attributes, pruned using a CF value of 10%, and tested on the unseen
cases in mytask.test.

9.2.2 Rule induction

The rule induction program, invoked by the command c4.5rules, should
only be used after running the decision tree program C4.5, since it reads
the unpruned file containing the unpruned tree(s). Four of the six options
have the same meaning as for the previous program, namely

-f filestem  (Default: DF)
-u  (Default: no test set)
-v level  (Default: level 0)
-c CF  (Default: 25%)

(where the CF value is used to prune rules rather than trees). The last
two options are:
-F confidence (Default: no significance testing)
If this option is used, the significance of each condition
in the left-hand side of a rule is checked using Fisher's
"exact" test; each condition must be judged significant
at the specified confidence level. This will generally have
the effect of producing shorter rules than the standard
pruning method, with some risk of over-generalization for tasks with
little data.
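Fisher's "exact" test on a 2×2 table reduces to summing a hypergeometric tail. The sketch below illustrates the test named by the -F option; it is not C4.5's own code, and the example table (condition satisfied vs. class membership) is invented:

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of a table showing
    an association at least as strong as the observed one (the
    hypergeometric tail for counts >= a).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / total

# Of 10 cases satisfying a rule condition, 8 belong to the rule's class;
# of the 10 cases not satisfying it, only 1 does.
p = fisher_one_tailed(8, 2, 1, 9)
significant = p < 0.05   # keep the condition at the 95% confidence level
```

With these invented counts the tail probability is about 0.003, so the condition would be judged significant and retained.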
-r redundancy (Default: 1.0)
The number of bits required to encode a set of rules, as
presented in Chapter 5, can be increased substantially by
the presence of irrelevant or redundant attributes.
This option can be used to specify an approximate redundancy factor
when the user has reason to believe that there are many redundant
attributes. A redundancy value of 2.5, for instance, implies that there
are 2.5 times more attributes than are likely to be useful. The -r option is
likely to be beneficial only when the user knows the data
well and can estimate an appropriate value.
9.2.3 Interactive classification model interpreters
These programs, CONSULT for decision tree models and CONSULTR for
production rule models, accept the same input and so can be dealt with
together.
Naturally enough, the programs should be used only after
the appropriate models have been generated by the C4.5 and C4.5RULES
programs, and are invoked respectively by
consult [-f filestem] [-t]
consultr [-f filestem] [-t]
where the -f option gives the filestem, as before. If the -t option appears,
the decision tree or ruleset is printed at the start of the consultation
session.

These programs request information from the user by prompting for
the values of attributes. A prompt consists of an attribute name followed
by a colon. If the attribute is discrete, the user can reply with

- "?" indicating the attribute value is not known;
- a single possible value; or
- a set of possible values of the form

      v1:p1, v2:p2, ..., vk:pk

  where the vi's are possible values and the pi's are corresponding value
  probabilities (see Chapter 8). If the vi's do not cover all possible values
  and the sum of the pi's is less than one, the unassigned probability is
  distributed equally over the remaining values.

Similarly, the user can reply to a query concerning a real-valued attribute
with

- "?" indicating the attribute value is not known;
- a single number; or
- an interval consisting of two numbers separated by a hyphen (-). In
  this latter case, the value of the attribute is treated as being uniformly
  distributed in the interval.

Each reply must be followed by a carriage return. Any incorrect input
will cause a brief error message to appear, after which the user will be
prompted again for the same attribute value.

When the classification model has inquired after all relevant information,
the conclusion is presented as described in the previous chapter. The
interpreter then prompts the user with

    Retry, new case or quit [r,n,q]:

The response to this prompt is one of the three characters indicated,
with the following effects:

- q: The program exits.
- n: A new case is classified; the queries recommence as above.
- r: The same case is reexamined. At each query, the reply given last
  time appears in square brackets. This previous information can be left
  unchanged by simply pressing carriage return, or can be altered by
  providing new information in the same manner as before.

9.3 Conducting experiments

Up to this point, the method suggested for estimating the reliability of
a classification model is to divide the data into a training and test set,
build the model using only the training set, and examine its performance
on the unseen test cases. This is quite satisfactory when there is plenty
of data, but in the more common circumstance of having less data than
we would like, two problems arise. First, in order to get a reasonably
accurate fix on error rate, the test set must be large, so the training set
is impoverished. Secondly, when the total amount of data is moderate
(several hundred cases, say), different divisions of the data into training
and test sets can produce surprisingly large variations in error rates on
unseen cases.
A more robust estimate of accuracy on unseen cases can be obtained by
cross-validation.
In this procedure, the available data is divided into N
blocks so as to make each block's number of cases and class distribution
as uniform as possible. N different classification models are then built,
in each of which one block is omitted from the training data, and the
resulting model is tested on the cases in that omitted block. In this
way, each case appears in exactly one test set. Provided that N is not
too small (10 is a common number), the average error rate over the N
unseen test sets is a good predictor of the error rate of a model built
from all the data.8
Cross-validation may seem complex, but two small programs and the
shell script that invokes them have been included to facilitate trials. The
command to execute the script is

    xval.sh filestem N [options] [+identifier]
where
. filestern is the name of the task, as abo~e. The script expects to find
the corresponding names and data files and, if there is also a test file,
the cases in the data and test files are merged before being divided
into blocks.
- N is the number of blocks to be used, and so the number of train/test
  runs to be made. N should never exceed the number of cases available.
- options, if they appear, are any options that are to be applied to C4.5
  and/or C4.5RULES, in any order.
- the optional +identifier is used to label and recognize the output from
  this particular cross-validation run. There should not be any spaces
  between the "+" and the identifier. Since this suffix will be attached
  to every file name generated, it should be short; two or three
  characters is recommended.
The script generates successive data and test files, builds a decision tree
model and a production rule model from the training cases, and uses
the models to classify the unseen test cases. The average figures for
the two model types are written to the files filestem.tres[+identifier]
and filestem.rres[+identifier]. Files with extension toi[+identifier] and
roi[+identifier] contain the output from C4.5 and C4.5RULES for the ith
of the N runs.
8. This gives a slight overestimate of the error rate, since each of the N models is
constructed from a subset of the data.
For example, suppose that we wished to carry out a ten-way
cross-validation of mytask, grouping tests for the construction of trees and
setting a redundancy factor of 2.5 when rules are produced. The command
    xval.sh mytask 10 -s -r 2.5 +me
would invoke the cross-validation script as above. This would divide
any data into ten blocks, numbered 0 through 9, and would produce the
following output files:
- Output from individual runs of C4.5 in files mytask.to0+me to
  mytask.to9+me and a summary in mytask.tres+me.
- Similar output from the ten C4.5RULES runs in files mytask.ro0+me
  to mytask.ro9+me and a summary in mytask.rres+me.
The summary files are quite terse, giving the lines from each individual
run that describe errors on the training and test sets, and similar lines
that show the average errors on training and test. For instance, when
a ten-way cross-validation
was carried out on the labor-neg data from
Chapter 1, the summary file for C4.5RULES (Figure 9-1) indicates that,
on the first block, there were three errors on the training data and one
on the test data, none on each for the second block, and so on. Averaged
over the ten blocks, the rules had a 3.7% error rate on the training data
and a 13.7% error rate on unseen test cases. This last figure would be
the cross-validation estimate of the error rate of a ruleset built from all
the data.
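As a quick check on the arithmetic, the averages in the last two lines of Figure 9-1 can be reproduced from the ten test-set lines. Note that the quoted 13.7% is consistent with the mean of the per-block error rates, not the pooled rate sum(errors)/sum(tested), which would be about 14.0%:

```python
# (cases tested, errors) for the ten test blocks in Figure 9-1
test_blocks = [(6, 1), (6, 0), (6, 2), (6, 0), (6, 3),
               (6, 0), (6, 1), (5, 1), (5, 0), (5, 0)]

avg_tested = sum(t for t, e in test_blocks) / len(test_blocks)   # 5.7
avg_errors = sum(e for t, e in test_blocks) / len(test_blocks)   # 0.8
avg_rate = sum(e / t for t, e in test_blocks) / len(test_blocks) # 13.7%
```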
9.4 Using options: A credit approval example
We close this chapter with a small example that illustrates the use of
some of the options above. The domain for this case study concerns
approval of credit facilities using a dataset provided by a bank. The 690
cases are split 44% to 56% between two classes; the 15 attributes include
6 with numeric values and 9 discrete-valued attributes, the latter having
from 2 to 14 possible values. This dataset is distributed with the system
under the name crx, but the names of classes, attributes, and attribute
values have been disguised by replacing them with symbols to protect
the bank's interest in the data. (Such bowdlerization does not affect the
learning task in any way but makes the resulting classification models
rather uninformative.)
Table 9-1. Results with selected options

                 -------- Decision Trees --------    ---- Rules ----
Options          Size    Unseen Error   Predicted    Unseen Error
                         Rate           Error Rate   Rate
default          58.8    17.5%          12.4%        17.4%
-s               69.5    17.4%          11.6%        15.2%
-m15             15.9    14.5%          14.4%        14.3%
-c10             39.1    15.8%          15.6%        16.1%
-m15 -c10        10.7    14.5%          16.6%        14.2%
-t10             67.1    17.1%          11.8%        16.8%
-t10 -m15        15.9    15.1%          14.1%        14.5%

             train:                           test:
    Tested 51, errors 3 (5.9%)  <<    Tested 6, errors 1 (16.7%)  <<
    Tested 51, errors 0 (0.0%)  <<    Tested 6, errors 0 (0.0%)  <<
    Tested 51, errors 2 (3.9%)  <<    Tested 6, errors 2 (33.3%)  <<
    Tested 51, errors 1 (2.0%)  <<    Tested 6, errors 0 (0.0%)  <<
    Tested 51, errors 3 (5.9%)  <<    Tested 6, errors 3 (50.0%)  <<
    Tested 51, errors 1 (2.0%)  <<    Tested 6, errors 0 (0.0%)  <<
    Tested 51, errors 3 (5.9%)  <<    Tested 6, errors 1 (16.7%)  <<
    Tested 52, errors 1 (1.9%)  <<    Tested 5, errors 1 (20.0%)  <<
    Tested 52, errors 2 (3.8%)  <<    Tested 5, errors 0 (0.0%)  <<
    Tested 52, errors 3 (5.8%)  <<    Tested 5, errors 0 (0.0%)  <<

    Tested 51.3, errors 1.9 (3.7%)  <<    Tested 5.7, errors 0.8 (13.7%)  <<

Figure 9-1. Summary file for rule models, labor-neg data
As a sighting shot, a ten-way cross-validation using all default options
is carried out by the command

    xval.sh crx 10
The key information extracted from the summary files crx.tres and crx.rres
is given in the first line of Table 9-1. For this cross-validation, the average simplified decision tree of 58.8 nodes has an error rate of 17.5% on
unseen cases whereas the pessimistic error rate predicted from the training data is 12.4%. The unseen error rate for production rules, 17.4%, is
quite close to that for the trees.
Inspection of the simplified trees in files crx.to0 to crx.to9 reveals that
the multivalued discrete attributes appear infrequently.
This could be
because they contain little information relevant to classification, or alternatively because they are not being treated equitably by the default
gain ratio selection criterion. To explore this further, the cross-validation
is rerun with
the -s option that enables the formation of value groups. As Table 9-1
shows, this appears to have little effect on the decision trees (in fact,
their average size increases slightly), but results in a lower error rate
for the rule-based classifiers. The lack of significant improvement for
trees suggests that value grouping is not going to be helpful for this task.
The feature that stands out in the first default run is the disparity between the observed and estimated error rates for unseen cases. This sort
of situation can arise when the attributes allow an almost "pure" partition of the training cases into single-class subsets, but where much of the
structure induced by this partition has little predictive power. (Since a
single continuous attribute can give rise to numerous possible divisions,
the phenomenon often occurs when there are many independent continuous attributes.)
There are two ways to address the problem: increasing
the -m value so as to prevent overly fine-grained divisions of the training set, or reducing the -c option with the effect that more structure
is pruned away. When the same cross-validation is repeated with -m15
and with -c10, Table 9-1 shows that the former is more effective here.
The error rates on unseen cases obtained with -m15 are much smaller
for both trees and rules; moreover, the predicted error rate for trees is
almost exactly correct, indicating an appropriate level of pruning. If we
try to gild the lily by using both options together, Table 9-1 shows that
the error rate is still low but the simplified trees are smaller.
So far, we have reduced the error rate from 17.5% to 14.5% for pruned
trees and from 17.4% to 14.2% for rules, an improvement of roughly 20%.
Next, we investigate whether windowing would help in this domain. The
-t10 option causes ten trees to be produced, one of which is selected as
the best tree, but all of which are used when a ruleset is constructed.
The windowing option slows down the cross-validation by a factor of 20
or so and has little apparent benefit, either on its own or when invoked
with the -m15 option.
At this stage it seems sensible to call a halt to further trials and to
recommend using the options -m15 and -c10. The values 15 and 10 were
more or less plucked from the air, so we could perhaps see whether other
values near these give noticeably better results; in my experience, such
fine-tuning is not terribly productive because the system is relatively
insensitive to small changes in parameter values. In an actual application
we would now return to the dataset and generate a single tree or
production rule classifier from all the available cases using these options.