A Genetic-Algorithm For Discovering Small-Disjunct Rules in Data Mining
D.R. Carvalho, A.A. Freitas
a Postgraduate Program in Applied Computer Science, Computer Science Department, Pontifícia Universidade Católica do Paraná (PUCPR), R. Imaculada Conceição 1155, Curitiba PR 80215-901, Brazil
b Computer Science Department, Universidade Tuiuti do Paraná (UTP), Av. Comendador Franco 1860, Curitiba PR 80215-090, Brazil

* Corresponding author.
E-mail addresses: deborah@utp.br (D.R. Carvalho), a.a.freitas@ukc.ac.uk (A.A. Freitas).
URL: http://www.cs.ukc.ac.uk/people/staff/aaf/
Abstract
This paper addresses the well-known classification task of data mining, where the goal is to discover rules predicting the class of examples (records of a given dataset). In the context of data mining, small disjuncts are rules covering a small number of examples. Hence, these rules are usually error-prone, which contributes to a decrease in predictive accuracy. At first glance, this is not a serious problem, since the impact on predictive accuracy should be small. However, although each small disjunct covers few examples, the set of all small disjuncts can cover a large number of examples. This paper presents evidence that this is the case in several datasets. This paper also addresses the problem of small disjuncts by using a hybrid decision-tree/genetic-algorithm approach. In essence, examples belonging to large disjuncts are classified by rules produced by a decision-tree algorithm (C4.5), while examples belonging to small disjuncts are classified by a genetic algorithm specifically designed for discovering small-disjunct rules. We present results comparing the predictive accuracy of this hybrid system with the predictive accuracy of three versions of C4.5 alone on eight public domain datasets. Overall, the results show that our hybrid system achieves better predictive accuracy than all three versions of C4.5 alone.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Data mining; Classification; Genetic-algorithm; Rule discovery; Small disjuncts
1. Introduction
In essence, data mining consists of extracting knowledge from data. The basic idea is that real-world databases contain hidden knowledge useful for decision making, if this hidden knowledge can be discovered. For instance, data about the previous sales of a company might contain hidden, implicit knowledge about which kind of product each kind of customer tends to buy. Hence, by analyzing that data, one can discover knowledge potentially useful for increasing the sales of the company.
Data mining is actually an interdisciplinary field,
since there are many kinds of algorithms, derived from
several different research areas (arguably, mainly machine learning and statistics), which can be used to
extract knowledge from data. In this paper we follow
a machine learning approach (rather than a statistical approach) for data mining [16]. More precisely,
we are interested in discovering knowledge that is not
only accurate, but also comprehensible for the user.
The discovery of comprehensible knowledge is facilitated by the use of a high-level, rule-based knowledge
representation (see later).
In contrast, we use a genetic algorithm that does discover high-level, comprehensible small-disjunct rules, which is important in the context of data mining.
Weiss investigated the interaction of noise with rare cases (true exceptions) and showed that this interaction led to degraded classification accuracy when small-disjunct rules were eliminated [22]. However, these results have limited utility in practice, since the analysis of this interaction was only made possible by using artificially generated datasets. In real-world datasets the correct concept to be discovered is not known a priori, so it is not possible to draw a clear distinction between noise and true rare cases.
Weiss also performed experiments showing that, when noise is added to real-world datasets, small disjuncts contribute disproportionately and significantly to the total number of classification errors made by the discovered rules [23].
3. A hybrid decision-tree/genetic-algorithm
system for rule discovery
As mentioned in Section 1, we present a hybrid
method for rule discovery that combines decision trees
and genetic algorithms. The basic idea is to use a well-known decision-tree algorithm to classify examples
belonging to large disjuncts and use a genetic-algorithm to discover rules classifying examples belonging
to small disjuncts. This approach tries to combine the
best of both worlds. Decision-tree algorithms have a
bias towards generality that is well suited for large
disjuncts, but not for small disjuncts. Indeed, one of
the drawbacks of decision-tree-building algorithms is
the fragmentation problem [8], where the set of examples belonging to a tree node gets smaller and smaller
as the depth of the tree is increased, making it difficult to induce reliable rules from deep levels of the
tree.
On the other hand, genetic algorithms are robust,
flexible algorithms, which tend to cope well with attribute interactions [6,7,18]. Hence, they can be more
easily tailored for coping with small disjuncts, which
are associated with large degrees of attribute interaction [17,20].
The proposed system discovers rules in two training
phases. In the first phase we run C4.5, a well-known
decision-tree induction algorithm [19]. The induced,
$$\text{Fitness} = \frac{TP}{TP + FN} \times \frac{TN}{FP + TN} \qquad (1)$$
where TP (true positives) is the number of "+" examples that were correctly classified as "+" examples; FP (false positives) the number of "−" examples that were wrongly classified as "+" examples; FN (false negatives) the number of "+" examples that were wrongly classified as "−" examples; and TN (true negatives) the number of "−" examples that were correctly classified as "−" examples.
For a comprehensive discussion about this and related rule-quality measures in general, independent of
genetic algorithms, the reader is referred to [10]. Here,
we briefly mention that, in the above formula, the term
(TP/(TP + FN)) is often called sensitivity, whereas the
term (TN/(FP + TN)) is often called specificity. These
two terms are multiplied to force the GA to discover
rules that have both high sensitivity and high specificity, since it would be relatively simple to maximize
one of these terms by reducing the other.
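To make this concrete, here is a minimal Python sketch of the fitness computation of Eq. (1); the rule interface (`covers`, `predicted_class`) and all names are our own illustrative assumptions, not the paper's actual implementation:

```python
def fitness(rule, examples):
    """Sensitivity x specificity fitness, as in Eq. (1).

    `rule` is assumed to expose covers(features) and predicted_class;
    each example is a (features, actual_class) pair. These names are
    illustrative, not the paper's implementation.
    """
    tp = fp = fn = tn = 0
    for features, actual in examples:
        covered = rule.covers(features)
        positive = (actual == rule.predicted_class)
        if covered and positive:
            tp += 1
        elif covered and not positive:
            fp += 1
        elif not covered and positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (fp + tn) if (fp + tn) else 0.0
    return sensitivity * specificity
```

Note how a rule covering everything gets sensitivity 1 but specificity near 0, and vice versa, so only balanced rules score well.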
Note that our fitness function does not take into
account rule comprehensibility. However, our GA has
a rule-pruning operator that fosters the discovery of
shorter, more comprehensible rules, as discussed in
Section 3.5.
3.4. Conventional genetic operators
We use the well-known tournament method for selection, with a tournament size of 2 [14]. We also use standard one-point crossover with a crossover probability of 80%, and a mutation probability of 1%. Furthermore, we use elitism with an elitist factor of 1, i.e. the best individual of each generation is passed unaltered into the next generation [9].
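As an illustration, tournament selection with a tournament size of 2 can be sketched as follows (the function and parameter names are ours, not the paper's):

```python
import random

def tournament_select(population, fitness, size=2, rng=random):
    """Tournament selection: draw `size` individuals at random
    and return the one with the highest fitness."""
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)
```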
The action of one-point crossover in our previously described individual representation is illustrated in
Fig. 3, where the crossover point, denoted by a vertical
line (|), fell between the second and the third genes.
In each gene (corresponding to a rule condition) the
number 1 or 0 between brackets denotes the value of
the active bit flag, as explained in Section 3.2. Note
that crossover points can fall only between genes, and
not inside a gene. Hence, crossover swaps entire rule
conditions between individuals, but it cannot produce
new rule conditions.
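In code, this gene-boundary crossover might look like the following sketch, with an individual simplified to a list of condition genes (our own simplified representation, not the paper's exact encoding):

```python
import random

def one_point_crossover(parent1, parent2, rng=random):
    """Swap entire condition genes after a random cut point.

    The cut point falls only between genes, so whole rule conditions
    are exchanged between individuals and no new condition is created.
    """
    assert len(parent1) == len(parent2) >= 2
    cut = rng.randint(1, len(parent1) - 1)  # between genes only
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]
```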
The creation of new rule conditions is accomplished
by the mutation operator, which replaces the attribute
value of a condition (the element Vij of Fig. 2) with
a new randomly-generated value belonging to the do-
main of the corresponding attribute. For instance, suppose the condition "marital status = single" is encoded into the genotype of an individual. A mutation could modify this condition into the new condition "marital status = divorced".
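Correspondingly, a sketch of the mutation operator, which replaces a condition's value with a random value from the attribute's domain (the gene structure and the domain table below are illustrative assumptions):

```python
import random

# Hypothetical attribute domains; in practice these come from the
# dataset's metadata.
DOMAINS = {"marital_status": ["single", "married", "divorced", "widowed"]}

def mutate_gene(gene, rng=random):
    """Replace the value V_ij of a condition gene with a random value
    from the attribute's domain. A gene is modeled here as the tuple
    (attribute, value, active_bit)."""
    attribute, _old_value, active_bit = gene
    return (attribute, rng.choice(DOMAINS[attribute]), active_bit)

# e.g. ("marital_status", "single", 1) may become
#      ("marital_status", "divorced", 1)
```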
3.5. Rule pruning operator
We have developed an operator especially designed
for improving the comprehensibility of candidate
rules. The basic idea of this operator, called the rule-pruning operator, is to remove several conditions from a rule to make it shorter. At a high level of abstraction,
removing conditions from a rule is a common way
of rendering a rule more comprehensible in the data
mining and machine learning literature (although details of the method vary a lot among algorithms). Our
rule-pruning operator is applied to every individual of
the population, right after the individual is formed as
a result of crossover and mutation operators.
Unlike the usually simple operators of conventional
GAs, our rule-pruning operator is an elaborate procedure based on information theory [3]. This procedure can be regarded as a way of incorporating a
classification-related heuristic into a GA for rule discovery. The heuristic in question is to favor the removal of rule conditions with low information gain, where the information gain of a rule condition $cond_i$ is defined as

$$\text{InfoGain}(cond_i) = \text{Info}(G) - \text{Info}(G|cond_i) \qquad (2)$$

where

$$\text{Info}(G) = -\sum_{j=1}^{c} \frac{|G_j|}{|T|} \log_2 \frac{|G_j|}{|T|} \qquad (3)$$

$$\text{Info}(G|cond_i) = \frac{|V_i|}{|T|} \left( -\sum_{j=1}^{c} \frac{|V_{ij}|}{|V_i|} \log_2 \frac{|V_{ij}|}{|V_i|} \right) + \frac{|\bar{V}_i|}{|T|} \left( -\sum_{j=1}^{c} \frac{|\bar{V}_{ij}|}{|\bar{V}_i|} \log_2 \frac{|\bar{V}_{ij}|}{|\bar{V}_i|} \right) \qquad (4)$$

Here $c$ is the number of classes, $|T|$ the number of examples in the training set, $|G_j|$ the number of examples belonging to class $j$, $|V_i|$ (respectively $|\bar{V}_i|$) the number of examples satisfying (not satisfying) condition $cond_i$, and $|V_{ij}|$ (respectively $|\bar{V}_{ij}|$) the number of those examples that belong to class $j$.
The use of the above rule-pruning procedure combines the stochastic nature of GAs, which is partly
responsible for their robustness, with an information-theoretic heuristic for deciding which conditions compose a rule antecedent, which is one of the strengths
of some well-known data mining algorithms. As a result of the action of this procedure, our GA tends to
produce rules that have both a relatively small number of attributes and high-information-gain attributes,
whose values are estimated to be more relevant for
predicting the class of an example.
A more detailed description of our rule-pruning procedure is shown in Fig. 4. As can be seen in this figure, the above-described iterative mechanism for removing conditions from a rule is implemented by sorting the conditions in increasing order of information gain. From the viewpoint of the GA, this is a logical sort, rather than a physical one. In other words, the sorted conditions are stored in a data structure completely separate from the individuals' data structure, so that there is no modification in the actual order of the conditions in the genome of the individual.
4. Computational results
We have evaluated our system on eight public domain datasets from the well-known dataset repository
at UCI (University of California at Irvine): Adult,
Connect, CRX, Hepatitis, Segmentation, Splice, Voting and Wave. These datasets are available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Table 1 summarizes the datasets used in our experiments. The first column shows the dataset identification, while the remaining columns report the number
of attributes, the number of examples and the number
of classes of the dataset, respectively.
In our experiments, we have used the predefined division of the Adult dataset into a training and a test set.
For the Connect dataset, we have randomly partitioned
this dataset into a training set and a test set. In the
case of these two datasets the use of a single training/
test set partition is acceptable due to the relatively large
size of the data. Since the other six datasets are not as large as the Adult and Connect datasets, to make the results more reliable we have run a 10-fold cross-validation procedure for each of those six datasets. In
other words, the dataset was divided into 10 partitions
and our hybrid decision-tree/genetic-algorithm system
was run 10 times. In each run, one of the 10 partitions
was chosen as the test set and all the other nine partitions were merged to compose the training set. The
results reported below for those six datasets are an
average over the 10 runs.
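For concreteness, the 10-fold procedure described above can be sketched as follows (dataset loading and the training function are placeholders, not the authors' code):

```python
import random

def ten_fold_indices(n_examples, n_folds=10, seed=0):
    """Split example indices into `n_folds` disjoint partitions."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

# Each run holds one partition out as the test set and merges the
# other nine partitions into the training set.
folds = ten_fold_indices(n_examples=155)  # e.g. the Hepatitis dataset
for k, test_fold in enumerate(folds):
    train_fold = [i for f, fold in enumerate(folds) if f != k for i in fold]
    # accuracy = train_and_evaluate(train_fold, test_fold)  # placeholder
```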
Table 1
Main characteristics of the datasets used in our experiments

Dataset        No. of attributes   No. of examples   No. of classes
Wave           21                  5000              3
Hepatitis      19                  155               2
Adult          14                  45222             2
CRX            15                  690               2
Voting         16                  506               2
Connect        42                  67557             3
Splice         60                  3190              3
Segmentation   19                  2310              7
As described in Section 3, our hybrid decision-tree/GA rule discovery system uses a GA to discover rules for classifying small-disjunct examples only (recall that large-disjunct examples are classified by the decision tree). Intuitively, the performance of our system will be significantly dependent on the definition of a small disjunct.
In our experiments, we have used a commonplace definition of small disjunct, based on a fixed threshold on the number of examples covered by the disjunct. The general definition is: a decision-tree leaf is considered a small disjunct if and only if the number of examples belonging to that leaf is smaller than or equal to a fixed size S. We report here experiments with two different values of the parameter S, namely S = 10 and S = 15.
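As a sketch, identifying small-disjunct leaves under this definition might look as follows (the tree interface is a hypothetical one, not C4.5's actual API):

```python
def small_disjunct_leaves(tree, S=10):
    """Return the leaves covering at most S training examples.

    Assumes each node exposes `children` (empty for leaves) and
    `n_covered`, the number of training examples reaching it; this
    interface is illustrative, not C4.5's.
    """
    leaves, stack = [], [tree]
    while stack:
        node = stack.pop()
        if not node.children:          # leaf node
            if node.n_covered <= S:    # small disjunct by the threshold
                leaves.append(node)
        else:
            stack.extend(node.children)
    return leaves
```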
We chose the values of S = 10 and S = 15 for
our experiments for the following reasons. First, it is
desirable that the value of S is not too small, since in
this case there would always be too few examples for
training the GA. Intuitively, values of S considerably
smaller than 10 would have the disadvantage of producing disjuncts that are too small, with too few examples for inducing reliable rules. On the other hand,
a considerably larger value of S is not desirable either, for two reasons. First, the GA was designed to discover rules covering only a few examples, which has simplified its design, since each run of the GA has to discover only one rule per class. As the number of examples belonging to a small disjunct increases, ideally the GA should discover more than one rule per class for each small disjunct. This would require a significant change in the design of the GA. In
the current GA, an individual represents a single rule,
and there is no mechanism to enforce population diversity (such as niching). If one wanted the GA to discover more rules per class, such a mechanism would
have to be added. Second, and perhaps even more important, a large value of S would beg the question of
the meaning of a small disjunct. Our work is based
on the assumption that in general C4.5 can correctly
classify large-disjunct examples. The GA is called to
classify only small-disjunct examples. If a leaf node in
the tree produced by C4.5 has considerably more than
15 examples, intuitively C4.5 has enough examples
to perform a good classification (otherwise it would
probably have split on that node, producing a larger
tree). Hence, in our experiments we used the values S = 10 and S = 15.
Table 2
Accuracy rate of C4.5 alone and of our hybrid C4.5/GA system

Dataset        C4.5     Hybrid C4.5/GA
                        S = 10          S = 15
Wave           75.54    +79.60 (2.0)    +80.30 (2.0)
Hepatitis      83.64    84.97 (6.0)     82.25 (5.0)
Adult          79.21    +79.83 (0.1)    +79.55 (0.1)
CRX            84.53    86.12 (4.0)     86.37 (2.0)
Voting         94.63    92.30 (1.0)     91.72 (2.0)
Connect        72.6     +75.93 (0.8)    +74.62 (0.2)
Splice         45.98    46.45 (0.9)     47.70 (2.0)
Segmentation   97.67    93.62 (0.8)     93.16 (1.0)
results will reflect mainly differences in search strategies. Hence, we can compare the evolutionary search
strategy of the GA against the local, greedy search
strategy of a rule induction or decision-tree algorithm.
In this spirit, we now report the results of another experiment comparing our hybrid C4.5/GA system against a double run of C4.5. The latter is a
new way of using C4.5 to cope with small disjuncts,
as follows. The main idea of our double run of a
decision-tree algorithm is to build a classifier running
the algorithm C4.5 twice. The first run considers all
examples in the original training set, producing a first
decision-tree. Once the examples belonging to small disjuncts (according to the first decision-tree) have been identified, the system groups them all into a single example subset. This can
be thought of as a second training set. Then C4.5 is
run again on this second, reduced training set producing a second decision-tree.
In order to classify a new example, the rules discovered by both runs of C4.5 are used as follows. First,
the system checks whether the new example belongs
to a large disjunct of the first decision-tree. If so, the
class predicted by the corresponding leaf node is assigned to the new example. Otherwise (i.e. the example belongs to one of the small disjuncts of the first
decision-tree), the new example is classified by the
second decision-tree.
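A minimal sketch of this two-tree classification procedure (the tree objects and their methods are illustrative placeholders, not C4.5's actual API):

```python
def classify(example, first_tree, second_tree, S=10):
    """Route an example between the two C4.5 trees, as described above.

    Assumes each tree exposes leaf_for(example) returning a leaf with
    `n_covered` (training examples at that leaf) and `predicted_class`;
    this interface is hypothetical.
    """
    leaf = first_tree.leaf_for(example)
    if leaf.n_covered > S:        # large disjunct: trust the first tree
        return leaf.predicted_class
    # Small disjunct: defer to the tree trained on small-disjunct examples.
    return second_tree.leaf_for(example).predicted_class
```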
The motivation for this more elaborate use of C4.5
was an attempt to create a simple algorithm that is more robust in coping with small disjuncts.
The results comparing the accuracy rate (on the test
set) of our hybrid system against the accuracy rate of
double C4.5, denoted by C4.5(2), are shown in
Table 3.
Analogously to Table 2, the numbers between brackets in the third and fifth columns of Table 3 are standard deviations, the cells where the hybrid system outperforms double C4.5 are shown in bold, and the cells where the hybrid system's results are significantly better (worse) than double C4.5's results contain the symbol "+" ("−"). Again, a result was considered statistically significant when the difference between the accuracy rate of the hybrid system and that of double C4.5 was larger than two standard deviations.
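In code form, this significance criterion amounts to the following trivial check (variable names are ours):

```python
def significantly_different(acc_hybrid, acc_baseline, sd):
    """Two-standard-deviation criterion used in Tables 2 and 3."""
    return abs(acc_hybrid - acc_baseline) > 2 * sd

# e.g. Wave at S = 10: hybrid 79.60 (sd 2.0) vs. double C4.5 69.97;
# the difference 9.63 exceeds 2 * 2.0 = 4.0, so it is significant.
assert significantly_different(79.60, 69.97, 2.0)
```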
As can be observed in Table 3, in all 16 cases the accuracy rate of the hybrid system was better than the accuracy rate of double C4.5. Furthermore, the
Table 3
Accuracy rate of double C4.5 (denoted C4.5(2)) and our hybrid C4.5/GA system

Dataset        S = 10                     S = 15
               C4.5(2)   C4.5/GA          C4.5(2)   C4.5/GA
Wave           69.97     +79.60 (2.0)     71.34     +80.30 (2.0)
Hepatitis      81.90     84.97 (6.0)      82.10     82.25 (5.0)
Adult          79.19     +79.83 (0.1)     78.81     +79.55 (0.1)
CRX            85.50     86.12 (4.0)      86.10     86.37 (2.0)
Voting         89.50     92.30 (1.0)      90.60     91.72 (2.0)
Connect        49.90     +75.93 (0.8)     56.80     +74.62 (0.2)
Splice         46.33     46.45 (0.9)      46.84     47.70 (2.0)
Segmentation   44.10     93.62 (0.8)      55.80     93.16 (1.0)

Table 4
Accuracy rate of our hybrid C4.5/GA system and C4.5 without pruning

Dataset        C4.5 without pruning   S = 10 (C4.5/GA)   S = 15 (C4.5/GA)
Wave           74.31                  +79.60 (2.0)       +80.30 (2.0)
Hepatitis      77.67                  84.97 (6.0)        82.25 (5.0)
Adult          76.50                  +79.83 (0.1)       +79.55 (0.1)
CRX            85.04                  86.12 (4.0)        86.37 (2.0)
Voting         91.04                  92.30 (1.0)        91.72 (2.0)
Connect        69.10                  +75.93 (0.8)       +74.62 (0.2)
Splice         46.30                  46.45 (0.9)        47.70 (2.0)
Segmentation   97.08                  93.62 (0.8)        93.16 (1.0)
5. Conclusions
In this paper we have described a hybrid decision-tree (C4.5)/GA system, where examples belonging to
large disjuncts are classified by rules produced by
a decision-tree algorithm and examples belonging to
small disjuncts are classified by rules produced by a
genetic-algorithm developed specifically for discovering small-disjunct rules.
More precisely, we have compared our hybrid
C4.5/GA system with three algorithms based on the
use of C4.5 alone, namely: (a) the default version
of C4.5; (b) a double run of C4.5, where all the
small-disjunct examples found by the first run of C4.5
are grouped together into a second training set,
References
[1] D.R. Carvalho, A.A. Freitas, A hybrid decision-tree/genetic-algorithm for coping with the problem of small disjuncts in data mining, in: Proceedings of the 2000 Genetic and Evolutionary Computation Conference (GECCO-2000), Las Vegas, NV, USA, July 2000, pp. 1061–1068.
[2] D.R. Carvalho, A.A. Freitas, A genetic algorithm-based solution for the problem of small disjuncts, in: Principles of Data Mining and Knowledge Discovery: Proceedings of the 4th European Conference, PKDD-2000, Lyon, France, Lecture Notes in Artificial Intelligence 1910, Springer, Berlin, 2000, pp. 345–352.
[3] T.M. Cover, J.A. Thomas, Elements of Information Theory,
Wiley, New York, 1991.
[22] G.M. Weiss, Learning with rare cases and small disjuncts, in:
Proceedings of the 12th International Conference on Machine
Learning (ICML-95), Morgan Kaufmann, Los Altos, CA,
1995, pp. 558–565.
[23] G.M. Weiss, The problem with noise and small disjuncts, in:
Proceedings of the International Conference on Machine Learning (ICML-98), Morgan Kaufmann, Los Altos, CA, 1998,
pp. 574–578.