Calibrated Lazy Associative Classification

Adriano Veloso¹, Wagner Meira Jr.¹, Mohammed Zaki²
¹Computer Science Dept., Universidade Federal de Minas Gerais, Brazil
²Computer Science Dept., Rensselaer Polytechnic Institute, USA
Abstract. Classification is an important problem in data mining. Given an example x and a class c, a classifier usually works by estimating the probability of x being a member of c (i.e., the membership probability). Well calibrated classifiers are those able to provide accurate estimates of class membership probabilities, that is, the estimated probability p̂(c|x) is close to p(c|p̂(c|x)), which is the true, empirical probability of x being a member of c given that the probability estimated by the classifier is p̂(c|x). Calibration is not a necessary property for producing accurate classifiers, and thus most research has focused on direct accuracy-maximization strategies (i.e., maximum margin) rather than on calibration. However, non-calibrated classifiers are problematic in applications where the reliability associated with a prediction must be taken into account (i.e., cost-sensitive classification, cautious classification etc.). In these applications, a sensible use of the classifier must be based on the reliability of its predictions, and thus the classifier must be well calibrated.
In this paper we show that lazy associative classifiers (LAC) are accurate, and well calibrated using a well-known, sound, entropy-minimization method. We explore important applications where such characteristics (i.e., accuracy and calibration) are relevant, and we demonstrate empirically that LAC drastically outperforms other classifiers, such as SVMs, Naive Bayes, and Decision Trees (even after these classifiers are calibrated by specific methods). Additional highlights of LAC include the ability to incorporate reliable predictions for improving training, and the ability to refrain from doubtful predictions.
1. Introduction
The classification problem is defined as follows. We have an input data set called the training data (denoted as D) which consists of examples (or instances) composed of a set of m attribute-values (a_1, a_2, ..., a_m) along with a special variable called the class. This class variable draws its value from a discrete set of classes (c_1, c_2, ..., c_n). The training data is used to construct a classifier that relates the features (or attribute-values) in the training data to the class variable. The test set (denoted as T) for the classification problem consists of a set of instances for which only the feature values are known, while the class value is unknown. The classifier, which is a function from {a_1, a_2, ..., a_m} to {c_1, c_2, ..., c_n}, is used to predict the class value for test instances.
*This research was sponsored by UOL (www.uol.com.br) through its UOL Bolsa Pesquisa program, process number 20080131200100, and it was partially supported by CNPq, Capes, Finep and Fapemig.
There are countless paradigms and strategies for devising a classifier. One of these strategies is to explore relationships, dependencies and associations between features and classes. These associations are usually hidden in the examples, and when uncovered, they may reveal important aspects concerning the underlying phenomenon that generated the examples. These aspects can be explored for the sake of prediction. This is the strategy adopted by associative classifiers, where the classification function is built from rules [Liu et al. 1998] of the form X → c_i (where X is a set of features and c_i is a class). Associative classification has shown to be valuable in many applications, including document categorization [Veloso et al. 2006b], ranking [Veloso et al. 2008] and bioinformatics [Cong et al. 2004]. There are other applications, however, where high accuracy is not the only requirement, nor even the most important one, as illustrated by the following examples:
Cautious Classification. Suppose a digital library in which documents must be sorted into pre-specified categories. The administrator may consider using a classifier, as long as it matches a minimum acceptable accuracy (which is specified by the administrator). In order to fulfill this accuracy constraint, the classifier must provide estimates of class membership probabilities for each document x (i.e., p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x)). These estimates are used to select the documents that are safely classified, and the documents that must have their corresponding predictions abstained. The estimates are also used to calculate the accuracy after each prediction. The classifier must stop when the estimated accuracy reaches its minimum acceptable value, and then the administrator performs manual inspection of the remaining documents. In this case, in addition to achieving high accuracy, the classifier must also inform which documents are likely to be correctly classified and which are not (these documents will be manually classified).
Cost-Sensitive Classification. Suppose an organization which requests charitable donations. Each request has a fixed cost z, and the expected amount that an individual x will donate is y(x) dollars. The organization wants to improve its marketing efforts by developing a classifier that will be used to maximize its net revenue. According to [Zadrozny and Elkan 2001], the optimal net maximization strategy is to solicit x if and only if p̂(donate|x) > z/y(x) (i.e., a marketer will include x in the mailing list only if the expected return from this order exceeds the cost invested in generating the order), where p̂(donate|x) is the estimated probability that x will donate. In this case, in addition to achieving high accuracy, the classifier must also take into account the risk of wrongly predicting a donation.
For both examples, it is important to estimate the class membership probabilities, p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x), as reliably as possible, for each x. A calibrated classifier¹ is one which provides reliable estimates of class membership. Intuitively, if p̂(c_i|x) ≈ p(c_i|p̂(c_i|x)) for all x, then the classifier is said to be well calibrated². Accurate classifiers are not necessarily well calibrated. Maximum-margin classifiers, such as SVMs, push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the estimated membership probabilities [Zadrozny and Elkan 2002]. Naive Bayes classifiers, which make unrealistic independence assumptions, push probability estimates closer to 0 and 1 [Zadrozny and Elkan 2001]. Boosting classifiers can be viewed as an additive logistic regression model [Friedman et al. 2000], and thus the predictions performed by these classifiers fit a logit of the true probabilities, as opposed to the true probabilities themselves. Decision Tree classifiers try to make leaves homogeneous, and thus the estimates are systematically moved closer to 0 and 1 [Niculescu-Mizil and Caruana 2005]. In general, classifiers that follow an accuracy-maximization strategy are not useful for estimating class membership probabilities.
¹The notion of calibration (sometimes termed reliability) was introduced in [Dawid 1982] to describe situations where the observed frequencies of events match the probabilities forecasted for them.
²As an illustration, consider all the examples for which a classifier assigns a probability p̂(c_i|x)=0.90. If this classifier is calibrated, then nearly 90% of these examples are correctly classified (i.e., the empirical probability, p(c_i|p̂(c_i|x)), is close to 0.90). In this case, p(c_i|p̂(c_i|x)) ≈ 0.90 ≈ p̂(c_i|x).
The proposed associative classifiers, on the other hand, do not follow any accuracy-maximization strategy (i.e., boosting, maximum margin, homogeneous leaves etc.), nor make unrealistic assumptions (i.e., feature independence). Instead, they simply use the generated rules to describe the training data. This description is then used to estimate class membership probabilities. If the description is accurate, then the estimates are likely to be close to the actual probabilities. However, generating an accurate description of the training data is challenging. This is mainly because practical limitations impose the need for frequency-based pruning methods to avoid rule explosion, and thus important rules may not be included in the description, hurting calibration. An alternate approach to avoid rule explosion is to generate rules on a demand-driven basis, according to the instance being classified. This approach is known as lazy associative classification (LAC), and it has been shown that important rules are more likely to be included in the description of the training data [Veloso et al. 2006a]. To further increase the calibration of lazy associative classifiers, some mechanisms can be used to calibrate the estimates.
Calibration mechanisms have already been studied in weather forecasting, game theory, and more recently in classification [Cohen and Goldszmidt 2004]. Most classifiers demand sophisticated, complex calibration mechanisms, such as Platt Scaling [Platt 1999] (for SVMs), Logistic Correction [Friedman et al. 2000] (for boosting-based classifiers), m-estimation [Cestnik 1990] (for Naive Bayes classifiers), and Curtailment [Zadrozny and Elkan 2001] (for Decision Tree classifiers). In this paper, we propose to calibrate associative classifiers using a well-known entropy-minimization method. To evaluate the effectiveness of the calibrated classifiers, we performed a systematic set of experiments using datasets obtained from actual and challenging applications. Our results suggest that calibrated lazy associative classifiers are superior to SVM, Naive Bayes, and Decision Tree classifiers (even after calibrating these classifiers) in applications where calibration is necessary. The specific contributions of this paper are:
- We propose mechanisms to calibrate LAC, and we evaluate them using complex real-world applications.
- We show that classifiers that maximize accuracy are usually non-calibrated and ill-suited for cost-sensitive and cautious classification problems. We also show that LAC, which simply describes the training data, easily becomes calibrated after an (entropy-minimization) binning mechanism is used.
- We show that calibrated LAC achieves superior performance compared with other classifiers, such as SVMs, Naive Bayes and Decision Trees, even after these classifiers are calibrated using specific calibration mechanisms.
- We show that the proposed classifiers are able to place test instances into two distinct zones: a safe zone (where predictions are very likely to be correct) and a danger zone (where predictions are doubtful). This ability enables the classifiers to (i) use reliable predictions for training, and (ii) abstain from doubtful predictions.
2. Related Work
Several methods for associative classification have already been proposed in the literature, from which we mention CBA [Liu et al. 1998] and CMAR [Li et al. 2001]. An analysis revealed that CBA is poorly calibrated (i.e., probability estimates are far from the actual class probabilities). The main reason is that CBA follows an accuracy-maximization strategy, which selects, from the set of generated rules, the one that presents the highest confidence, and thus probability estimates are shifted towards 1. In contrast, CMAR employs multiple rules while performing the prediction, and thus CMAR is usually better calibrated than CBA. However, the calibration of CMAR is greatly harmed by the pruning method adopted (i.e., important rules may be pruned, and not considered while performing the prediction). The associative classifier proposed in this paper tends to be better calibrated than CBA and CMAR because it employs multiple rules that are generated on a demand-driven basis, according to the instance being classified, reducing the chances of missing important rules due to pruning, and consequently providing a better description of the data.
There are also studies investigating the calibration of other classifiers, such as SVMs, Naive Bayes, and Decision Trees. The calibration of Naive Bayes and Decision Tree classifiers was investigated in [Zadrozny and Elkan 2001], where it was shown that probabilities estimated by these classifiers are usually far from the observed, true probabilities. The same methodology was used to show that SVM classifiers are poorly calibrated [Zadrozny and Elkan 2002], and that the distortion very often forms a sigmoid pattern. Boosting-based classifiers were shown to be poorly calibrated in [Niculescu-Mizil and Caruana 2005]. Although these classifiers are not calibrated, they tend to assign higher probabilities to the correct class, that is, if p̂(c_i|x) > p̂(c_j|x) then it is likely that p(c_i|x) > p(c_j|x). Therefore, these classifiers are usually accurate. The advantages of calibrated classifiers were discussed in [Cohen and Goldszmidt 2004], where it was shown that calibrating a classifier is guaranteed not to decrease its accuracy.
There are several existing mechanisms used to correct the distortion between estimated and observed probabilities. Mechanisms for calibrating SVM classifiers transform the predictions to posterior probabilities by passing them through a sigmoid. The parametric approach proposed in [Platt 1999] consists in finding the parameters a and b for a sigmoid function of the form p_c(c_i|x) = 1/(1 + e^(a·p̂(c_i|x)+b)), which transforms the original estimated probability, p̂(c_i|x), into a calibrated estimate, p_c(c_i|x). These parameters are found by minimizing the negative log-likelihood of the data. Mechanisms for calibrating Decision Tree and Naive Bayes classifiers were proposed in [Zadrozny and Elkan 2001]. These mechanisms are based on smoothing the distribution of the original estimates, and they also rely on finding parameters from the data. Boosting-based classifiers are calibrated using a mechanism called Logistic Correction, proposed in [Friedman et al. 2000]. This mechanism transforms original estimates, p̂(c_i|x), into calibrated estimates, p_c(c_i|x), using the function p_c(c_i|x) = 1/(1 + e^(−2a·p̂(c_i|x))). All the mentioned mechanisms find the corresponding parameters through 10-fold cross-validation using the training data.
In this paper we propose a technique for calibrating the probabilities estimated by an associative classifier. The proposed technique is based on binning the prediction space. The number of bins, and their corresponding sizes, are easily obtained by the sound and well-known entropy-minimization technique [Fayyad and Irani 1993].
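As a small illustration of the shape of these corrections, the following Python sketch applies the Platt-style sigmoid mapping described above; it is not the authors' code, and the parameter values in the usage line are hypothetical (in practice a and b are fit by minimizing the negative log-likelihood on held-out predictions).

import math

def platt_calibrate(p_hat, a, b):
    # Sigmoid mapping from the text: p_c(c_i|x) = 1 / (1 + e^(a*p_hat + b)).
    # a and b are assumed to have been fitted beforehand.
    return 1.0 / (1.0 + math.exp(a * p_hat + b))

# With a < 0, larger raw estimates map to larger calibrated probabilities
print(platt_calibrate(0.8, a=-6.0, b=3.0))   # about 0.858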
3. Associative Classification
Associative classification exploits the fact that, frequently, there are strong associations between features (a_1, a_2, ..., a_m) and classes (c_1, c_2, ..., c_n). The learning strategy adopted by associative classifiers is based on uncovering such associations from the training data, and then building a function {a_1, a_2, ..., a_m} → {c_1, c_2, ..., c_n} using such associations. Typically, these associations are expressed using rules of the form X → c_1, ..., X → c_n, where X ⊆ {a_1, ..., a_m}. In the following discussion we will denote as R an arbitrary rule set. Similarly, we will denote as R_{c_i} the subset of R which is composed of rules of the form X → c_i. A rule X → c_i is said to match instance x if X ⊆ x (x contains all features in X), and these rules form the rule set R^x_{c_i}. That is, R^x_{c_i} is composed of rules predicting class c_i and matching instance x. Obviously, R^x_{c_i} ⊆ R_{c_i} ⊆ R.
Naturally, there is a total ordering among rules, in the sense that some rules show stronger associations than others. A widely used statistic, called confidence and denoted as θ(X → c_i), measures the strength of the association between X and c_i.
The probability (or likelihood) of instance x being a member of class c_i is estimated by combining the rules in R^x_{c_i}. An effective strategy is to interpret R^x as a poll, in which each rule X → c_i ∈ R^x is a vote given by X for class c_i. The weight of a vote X → c_i depends on the strength of the association between X and c_i, which is θ(X → c_i). The process of estimating the probability of x being a member of c_i starts by summing the weighted votes for class c_i and then averaging the obtained value by the total number of votes for c_i, as expressed by the score s(c_i, x) shown in Equation 1 (where r_j ∈ R^x_{c_i} and |R^x_{c_i}| is the cardinality of R^x_{c_i}). Thus, the score s(c_i, x) gives the average confidence of the rules in R^x_{c_i}. The estimated probability of x being a member of c_i, p̂(c_i|x), is simply obtained by normalizing s(c_i, x), as shown in Equation 2. A higher value of p̂(c_i|x) indicates a higher likelihood of x being a member of c_i. The class associated with the highest likelihood is finally predicted.

s(c_i, x) = ( Σ_{j=1}^{|R^x_{c_i}|} θ(r_j) ) / |R^x_{c_i}|        (1)

p̂(c_i|x) = s(c_i, x) / ( Σ_{j=1}^{n} s(c_j, x) )        (2)
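A minimal Python sketch of Equations 1 and 2 follows; the representation of rules as (X, class, confidence) triples and the function name are illustrative, not the paper's implementation.

from collections import defaultdict

def estimate_probabilities(rules, x):
    # Score each class by the average confidence of its matching rules
    # (Equation 1) and normalize the scores into probability estimates
    # (Equation 2). rules is an iterable of (X, c, theta) triples, where
    # X is a set of features, c a class label, and theta the confidence.
    votes = defaultdict(list)
    for X, c, theta in rules:
        if X <= x:                      # rule matches: X is a subset of x
            votes[c].append(theta)
    # Equation 1: s(c, x) is the average confidence of matching rules for c
    s = {c: sum(ts) / len(ts) for c, ts in votes.items()}
    total = sum(s.values())
    # Equation 2: normalize the scores so the estimates sum to 1
    return {c: sc / total for c, sc in s.items()} if total else {}

# Example: two rules vote for c1, one for c2
rules = [({"a1"}, "c1", 0.9), ({"a2"}, "c1", 0.7), ({"a1", "a2"}, "c2", 0.6)]
print(estimate_probabilities(rules, {"a1", "a2", "a3"}))  # c1: ~0.57, c2: ~0.43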
Algorithm 1 Lazy Associative Classification.
Require: Examples in D, σ_min, and test instance t
Ensure: The predicted class for instance t
1: Let L(a_i) be the set of examples in D in which feature a_i has occurred
2: D_t ← ∅
3: for each feature a_i ∈ t do D_t ← D_t ∪ L(a_i)
4: σ_t ← σ_min × |D_t|
5: for each c_i do R^t_{c_i} ← rules X → c_i, such that X ⊆ t and σ(X → c_i) ≥ σ_t
6: Estimate p̂(c_1|t), ..., p̂(c_n|t), and predict the class with the highest probability
Lazy Associative Classification (LAC). Rule extraction is a major issue when devising an associative classifier. Extracting all rules is frequently unfeasible, and thus pruning strategies are employed in order to reduce the number of rules that are processed. The typical pruning strategy is based on a support threshold, σ_min, which separates frequent from infrequent rules. Ideally, infrequent rules are not important. However, it is often the case that infrequent rules indicate true feature–class associations, and are therefore important for the sake of classification. An optimal minimum support threshold is unlikely to exist, and tuning is generally driven by intuition, and consequently prone to error.
Typically, a single σ_min value is employed and a single rule set R is extracted from the training data, D. This rule set is a description of D and is used to classify all test instances. Two problems may arise: (i) a not so frequent but important rule may not be extracted from D (hurting the quality of the description), and (ii) a useless³ rule can be extracted from D (incurring unnecessary overhead). An ideal scenario would be to extract only useful rules from D, without discarding important ones. Test instances carry valuable information that can be used during rule extraction to guide the search for useful and important rules. We propose to achieve this ideal scenario by extracting rules on a demand-driven basis, according to the test instance being considered. Specifically, whenever a test instance t is being considered for classification, that instance is used as a filter to remove irrelevant features (and often entire examples) from D, forming a projected training data, D_t, which contains only features that are included in instance t [Veloso et al. 2006a]. This process reduces the size and dimensionality of the training data, since irrelevant features and examples are not considered while extracting the rules for a specific test instance.
A typical strategy used to prevent support-based over-pruning (i.e., discarding important rules) is to use a different cut-off value, which is calculated depending on the frequency of the classes. More specifically, the cut-off value is higher for rules predicting more frequent classes, and lower for rules predicting less frequent classes. The problem with this strategy is that it does not take into account the frequency of the features composing the rule, and thus, if an important rule is composed of rare features, it will be discarded, especially if this rule predicts a highly frequent class. We propose an alternate strategy that employs multiple cut-off values, which are calculated depending on the frequency of the features composing a test instance. Intuitively, if a test instance t contains frequent features (i.e., these features occur in many examples in D), then the size of the projected training data will be large. Otherwise, if a test instance t contains rare features (i.e., these features occur in only a few examples in D), then the size of the projected training data will be small. For a fixed value of σ_min, the cut-off value for instance t, σ_t, is calculated based on the size of the corresponding projected training data, that is, σ_t = σ_min × |D_t|. The cut-off value applied while considering test instance t thus varies in the range 1 ≤ σ_t ≤ σ_min × |D|, and is bounded by σ_min × |D| (the single cut-off value applied by the typical support-based pruning strategy). Therefore, the chance of discarding important (but less frequent) rules is reduced. The main steps of lazy associative classification are shown in Algorithm 1.
Different test instances may demand different rule sets, but different rule sets often share common rules. In this case, caching is very effective in reducing work replication. An efficient caching approach was proposed in [Veloso et al. 2008], and is also used here.
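As a rough illustration of the demand-driven projection and the instance-specific cut-off σ_t = σ_min × |D_t| described above, consider the following Python sketch; the data representation (feature sets paired with labels) is an assumption, not the paper's data structures.

def project_training_data(D, t, sigma_min):
    # Project the training data D onto the features of test instance t
    # (steps 1-4 of Algorithm 1) and derive the instance-specific support
    # cut-off sigma_t = sigma_min * |D_t|. D is a list of (features, label)
    # pairs; t is a set of features.
    D_t = []
    for features, label in D:
        overlap = features & t          # keep only features occurring in t
        if overlap:                     # drop examples sharing no feature with t
            D_t.append((overlap, label))
    sigma_t = max(1, int(sigma_min * len(D_t)))   # cut-off never drops below 1
    return D_t, sigma_t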
³A rule is useless if it does not match any test instance in T.
4. Calibration
In this section we define calibrated classifiers, and then we propose mechanisms to calibrate the associative classifier described in Section 3. We finalize this section by discussing the advantages of well calibrated classifiers.
4.1. ω-calibrated Classifier
The calibration of a classifier can be visualized using reliability diagrams. Diagrams for two arbitrary classifiers on an arbitrary dataset are depicted in Figure 1 (left). These diagrams are built as follows [DeGroot and Fienberg 1982]. First, the probability space (i.e., the x-axis) is divided into a number of bins, which was chosen to be 10 in our case. Probability estimates (i.e., theoretical probabilities) with value between 0 and 0.1 fall in the first bin, with value between 0.1 and 0.2 in the second bin, and so on. The fraction of correct predictions associated with each bin, which is the true, empirical probability (i.e., p(c|p̂(c|x))), is plotted against the estimated probability (i.e., p̂(c|x)). If the classifier is well calibrated, the points will fall near the diagonal line, indicating that estimated probabilities are close to empirical probabilities. The degree of calibration of a classifier, denoted as ω, is obtained by measuring the discrepancy between observed probabilities (o_i) and estimated probabilities (e_i), as shown in Equation 3 (where n is the number of observations). Values of ω range from 0 to 1. A value of 0 means that there is no relationship between estimated probabilities and true probabilities. A value of 1 means that all points lie exactly on a straight line with no scatter. Classifier 1 is better calibrated than Classifier 2, as shown in Figure 1 (right).
ω = 1 − (1/n) Σ_{i=1}^{n} (o_i − e_i)² / (o_i + e_i)²        (3)
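A direct transcription of Equation 3 into Python follows, assuming the per-bin observed probabilities o_i and estimated probabilities e_i are given as paired lists.

def calibration_degree(observed, estimated):
    # Equation 3: degree of calibration from paired per-bin observed (o_i)
    # and estimated (e_i) probabilities; 1 means all points lie exactly on
    # the reliability diagonal.
    n = len(observed)
    discrepancy = sum((o - e) ** 2 / (o + e) ** 2
                      for o, e in zip(observed, estimated))
    return 1 - discrepancy / n

# Estimates that closely track the observed per-bin accuracies give a
# degree close to 1
print(calibration_degree([0.15, 0.45, 0.85], [0.10, 0.50, 0.90]))  # ~0.985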
Figure 1. Left: Reliability diagram for two classifiers (x-axis: estimated, theoretical probability; y-axis: true, empirical membership probability). Right: The same diagram including the ideal (diagonal) classifier; Classifier 1 has ω=0.97 and Classifier 2 has ω=0.82.
4.2. Calibration based on Entropy Minimization
To transform original probability estimates, p̂(c_i|x), into accurate, well calibrated probabilities, we also use a method based on binning. The method starts by estimating membership probabilities using the training data. A typical strategy is 10-fold cross-validation. In this case, the training data, D, is divided into 10 partitions, and at each trial 9 partitions are used for training the classifier, while the remaining partition is used to simulate a test set. After the 10 trials, the classifier will have stored, in the set O, the membership probability estimates for all examples in D. This process is shown in Algorithm 2.
Algorithm 2 Estimating Membership Probabilities.
Require: Examples in D
Ensure: For each example x in D, the corresponding membership probabilities p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x), along with the correct class
1: O ← ∅
2: Split D into 10 equal-sized partitions, d_1, d_2, ..., d_10
3: for each partition d_i do
4:   for each example x ∈ d_i do
5:     Estimate probabilities, p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x), using {D − d_i} as training
6:     O ← O ∪ {(p̂(c_1|x), v_1)} ∪ ... ∪ {(p̂(c_n|x), v_n)}, where v_i = 1 if c_i is the correct class for example x, and v_i = 0 otherwise
7:   end for
8: end for
9: return O
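A compact Python sketch of Algorithm 2 follows; the callback train_and_estimate is a hypothetical stand-in for training LAC on nine folds and estimating membership probabilities on the tenth.

def collect_estimates(D, train_and_estimate, k=10):
    # Algorithm 2: k-fold collection of membership-probability estimates.
    # D is a list of (features, true_class) pairs; train_and_estimate(train, x)
    # returns a dict {class: p_hat} for example x. Returns the set O of
    # (estimate, v) pairs, with v = 1 iff the class is the correct one.
    folds = [D[i::k] for i in range(k)]
    O = []
    for i, fold in enumerate(folds):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        for features, true_class in fold:
            probs = train_and_estimate(train, features)
            O.extend((p, 1 if c == true_class else 0) for c, p in probs.items())
    return O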
Once the probabilities are estimated, a naive strategy would proceed by first sorting these probabilities in ascending order (i.e., the probability space), and then dividing them into k equal-sized bins, each having pre-specified boundaries. An estimate is placed in a bin according to its value (i.e., values between 0 and 1/k are placed in the first bin, values between 1/k and 2/k in the second, and so on). The probability associated with a bin is given by the fraction of correct predictions that were placed in it. An estimate p̂(c_i|x) is finally calibrated by using the probability associated with the corresponding bin. Specifically, each bin b_lu ∈ B (with l and u being its boundaries) works as a map, relating estimates p̂(c_i|x) (such that l ≤ p̂(c_i|x) < u) to the corresponding calibrated estimates, p_{b_lu}. Thus, this process essentially discretizes the probability space into intervals, so that the accuracy associated with the predictions in each interval is as reliable as possible.
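A sketch of this naive equal-width binning follows, under the assumption (not specified in the paper) that an empty bin falls back to its midpoint.

def naive_bins(O, k=10):
    # Naive strategy: split [0, 1] into k fixed-width bins and map each bin
    # to the fraction of correct predictions falling in it. O is the output
    # of Algorithm 2: (estimate, v) pairs with v = 1 for a correct prediction.
    bins = []
    for j in range(k):
        lo, hi = j / k, (j + 1) / k
        members = [v for p, v in O if lo <= p < hi or (j == k - 1 and p == 1.0)]
        # calibrated probability of the bin: fraction of correct predictions
        calibrated = sum(members) / len(members) if members else (lo + hi) / 2
        bins.append((lo, hi, calibrated))
    return bins

def calibrate(p, bins):
    # Map a raw estimate p to the calibrated probability of its bin b_lu
    for lo, hi, calibrated in bins:
        if lo <= p < hi or p == hi == 1.0:
            return calibrated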
Such a naive strategy, however, can be disastrous, as critical information may be lost due to inappropriate bin boundaries. Instead, we propose to use the information entropy associated with candidate bins to select the boundaries [Fayyad and Irani 1993]. To illustrate the method, suppose we are given a set of pairs (p̂(c_i|x), v)⁴ ∈ O and a threshold f that induces two partitions of O (b_{f≤} and b_{f>}, where b_{f≤} contains pairs (p̂(c_i|x), v) for which p̂(c_i|x) ≤ f, and b_{f>} contains pairs for which p̂(c_i|x) > f). The entropy of this partition is given by Equation 4. The threshold f which minimizes the entropy over all possible partitions is selected. This method is then applied recursively to both of the partitions induced by f, b_{f≤} and b_{f>}, creating multiple intervals until a stopping criterion is achieved. Splitting stops if the information gain (the difference between the entropies before and after the split) is lower than a certain value, and the final set of bins, B, is found.
Finally, for each instance x in the test set, p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x) are estimated using the entire training data, D. Then, the estimated probabilities are calibrated using the accuracy associated with the appropriate bin in B, as shown in Algorithm 3.
E(O, f) = (|b_{f≤}| / |O|) · Entropy(b_{f≤}) + (|b_{f>}| / |O|) · Entropy(b_{f>})        (4)

⁴v can take the values 0 (the prediction is wrong) or 1 (otherwise), as shown in step 6 of Algorithm 2.
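The boundary search can be sketched in Python as follows; the binary correctness labels v play the role of classes in the Fayyad–Irani procedure, and the fixed min_gain value is a placeholder for the stopping criterion, whose exact value the paper leaves unspecified.

import math

def entropy(pairs):
    # Binary entropy of the correctness labels v in a set of (p_hat, v) pairs
    if not pairs:
        return 0.0
    q = sum(v for _, v in pairs) / len(pairs)
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def best_split(pairs):
    # Return the threshold f minimizing the weighted entropy of Equation 4
    pairs = sorted(pairs)
    best_f, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                     # only cut between distinct estimates
        f = (pairs[i - 1][0] + pairs[i][0]) / 2
        left, right = pairs[:i], pairs[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_f, best_e = f, e
    return best_f, best_e

def find_boundaries(pairs, min_gain=0.01):
    # Recursively split O, keeping a cut only while the information gain
    # (parent entropy minus Equation 4) stays above min_gain
    f, e = best_split(pairs)
    if f is None or entropy(pairs) - e < min_gain:
        return []
    left = [pair for pair in pairs if pair[0] <= f]
    right = [pair for pair in pairs if pair[0] > f]
    return find_boundaries(left, min_gain) + [f] + find_boundaries(right, min_gain)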
Algorithm 3 Calibrating the Probabilities.
Require: Examples in D, instances in T, the calibrated probability p_{b_lu} of each bin b_lu
Ensure: For each estimate p̂(c_i|x), the corresponding calibrated estimate p_c(c_i|x)
1: for each instance x ∈ T do
2:   Estimate probabilities, p̂(c_1|x), p̂(c_2|x), ..., p̂(c_n|x), using D as training
3:   for each c_i do output p_c(c_i|x) = p_{b_lu}, such that l ≤ p̂(c_i|x) < u
4: end for
4.3. Advantages of Well Calibrated Associative Classifiers
We now discuss some of the advantages of well calibrated classifiers.
Assessing the Reliability of Predictions. Accurate probability estimates are important in many applications, especially when different classification errors incur different penalties, such as spam detection, net revenue optimization, medical diagnosis etc. In this case, a prediction is performed only if it is likely to be correct. Thus, well calibrated classifiers are important in such applications, since they provide accurate probability estimates.
Safe Zone and Danger Zone. Test instances can be divided into two zones. The safe zone contains instances for which the corresponding predictions are likely to be correct. These test instances can be used to enhance the training data. More specifically, the training data is augmented with a new example, which is composed of the features of the test instance along with its corresponding prediction (which is likely to be correct). Since LAC generates rules at classification time, the next instances to be classified can take advantage of information coming from instances in the safe zone that were inserted into the training data. The danger zone, on the other hand, contains instances for which the predictions are doubtful, and the classifier can abstain from predicting the class of such instances. A threshold r_min (i.e., a minimum reliability) delimits these two zones.
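A minimal sketch of this routing, using the r_min = 0.90 threshold adopted for CaLAC (Ent. + SZ) in Section 5 (the function name and data layout are illustrative):

def route_prediction(x, predicted_class, calibrated_p, training_data, r_min=0.90):
    # Safe-zone/danger-zone routing: r_min is the minimum reliability
    # delimiting the two zones; training_data is the list of (features,
    # class) pairs that LAC projects at classification time.
    if calibrated_p >= r_min:
        # Safe zone: trust the prediction and feed it back as a new training
        # example, so subsequent lazy classifications can exploit it
        training_data.append((x, predicted_class))
        return predicted_class
    return None  # Danger zone: abstain from this doubtful prediction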
Estimating the Accuracy After a Prediction. In several applications it is desirable to have a reliable estimate of classification accuracy. Well calibrated classifiers can estimate the accuracy after each prediction, as shown in Algorithm 4. If the classifier is well calibrated, the estimated accuracy will be close to the actual accuracy.
Producing Accurate Classifiers. An accurate classifier has the following property: p̂(c_i|x) < p̂(c_j|x) iff p(c_i|x) < p(c_j|x). A calibrated classifier has the following property: p̂(c_i|x) ≈ p(c_i|p̂(c_i|x)) and p̂(c_j|x) ≈ p(c_j|p̂(c_j|x)). Thus, calibration is a natural way of producing accurate classifiers, because if p̂(c_i|x) ≈ p(c_i|p̂(c_i|x)) and p̂(c_j|x) ≈ p(c_j|p̂(c_j|x)) (i.e., the classifier is well calibrated), then if p̂(c_i|x) < p̂(c_j|x) it is likely that p(c_i|x) < p(c_j|x) (i.e., the classifier is accurate).
Algorithm 4 Estimating the Accuracy.
Require: P_c, the calibrated probabilities given by Algorithm 3
Ensure: acc, the estimated accuracy
1: For a calibrated probability p, let n[p] be the number of times p appears in P_c
2: acc ← 0
3: for each distinct probability p ∈ P_c do acc ← acc + p × n[p] / |P_c|
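Algorithm 4 amounts to averaging the calibrated probabilities of all predictions made so far, as in this sketch:

from collections import Counter

def estimated_accuracy(P_c):
    # Algorithm 4: P_c is the list of calibrated probabilities returned
    # for each prediction; the estimate is their weighted average.
    counts = Counter(P_c)               # n[p] for each distinct probability p
    return sum(p * n for p, n in counts.items()) / len(P_c)

# A well calibrated classifier expects about 85% of these to be correct
print(estimated_accuracy([0.9, 0.9, 0.8, 0.8]))  # 0.85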
5. Evaluation
In this section we present the experimental results for the evaluation of the proposed calibrated lazy associative classifiers, hereafter referred to as CaLAC (Naive), CaLAC (Entropy) and CaLAC (Ent. + SZ)⁵. Our evaluation is based on a comparison against current state-of-the-art classifiers⁶, which include SVM [Joachims 2006], Naive Bayes [Cussens 1993], and Decision Tree classifiers [Quinlan 1993]. After being calibrated using specific mechanisms [Platt 1999, Cestnik 1990, Zadrozny and Elkan 2001], these classifiers are respectively referred to as CaSVM, CaNB and CaDT. All experiments were run on 1.8 GHz Intel processors with 2GB main memory under Linux. Instead of performing experiments on many datasets that are relatively small and easy to achieve good performance on, we preferred to do a detailed investigation using complex datasets obtained from two real-world applications that are known to be challenging.
5.1. Cautious Classification – ACM Digital Library
For this application we used a dataset called ACM-DL, which was extracted from the ACM Computing Classification System (http://portal.acm.org/dl.cfm/). ACM-DL is a set of 6,682 documents (metadata records) labelled using the 11 first-level categories of ACM (General Literature, Hardware, Computer Systems Organization, Software, Data, Theory of Computation, Mathematics of Computing, Information Systems, Computing Methodologies, Computer Applications, Computing Milieux). Features in ACM-DL include words (in titles and abstracts) and citations to other documents. ACM-DL has a vocabulary of 9,840 unique words, and a total of 11,510 internal citations and 40,387 citations to documents outside of the digital library. The classifier must decide to which category a document belongs. However, the administrator of the digital library imposes an additional minimum accuracy requirement, acc_min, on the classifier. In this case, the classifier must estimate the total accuracy after each prediction is performed, and then it must decide whether to continue classifying documents (if the estimated accuracy will be higher than acc_min) or to stop classification (if the estimated accuracy will be lower than acc_min). We performed 10-fold cross-validation and the final result represents the average of the ten runs (all results to be presented were found statistically significant at the 95% confidence level when tested with the two-tailed paired t-test).
Calibrating the Classifiers. We employed the two mechanisms described in Section 4.2 to calibrate LAC: the Naive Binning mechanism and the Entropy-Minimization (Entropy) mechanism. After trying some possibilities, we decided to set the number of bins for the Naive Binning mechanism to 5. They are shown in Figure 2 (left). Bin boundaries (i.e., original probability estimates) are shown on the x-axis and the corresponding calibrated probabilities are shown on the y-axis. The Entropy-Minimization mechanism generated 5 bins of varying sizes, which are shown in Figure 2 (middle). After the bins are found, we apply LAC to the test set, and we replace the original probability estimate (x-axis) by the calibrated probability associated with the corresponding bin (y-axis). The result of calibration is depicted in Figure 2 (right), which shows ω values for LAC, before and after being calibrated. Other classifiers were also evaluated. The worst classifier in terms of calibration is SVM, with ω=0.69. After calibrating SVM, the corresponding classifier, CaSVM, shows ω=0.75. NB and LAC, with ω=0.76 and ω=0.78, respectively, are already better calibrated than CaSVM. These classifiers, when calibrated, show the best calibration degrees: CaNB with ω=0.91, and CaLAC (Entropy) with ω=0.94. Next, we will evaluate how this difference in calibration affects the effectiveness of the classifiers.
⁵CaLAC (Naive) is the classifier obtained after calibrating LAC using the naive binning mechanism. Similarly, CaLAC (Entropy) is the classifier obtained after calibrating LAC using the entropy-minimization mechanism. CaLAC (Ent. + SZ) is the classifier obtained using the entropy-minimization mechanism, and, in this case, instances in the safe zone are incorporated into the training data, as discussed in Section 4.3.
⁶For CaLAC (Naive), CaLAC (Entropy) and CaLAC (Ent. + SZ), we set σ_min=0.001. There is no pruning based on minimum confidence, since we are not following any accuracy-maximization strategy. For CaLAC (Ent. + SZ) we set r_min=0.90. For CaSVM we used linear kernels and set C=0.90 (in the first application) and C=2.00 (in the second application). These parameters were set according to the LIBSVM tool. For CaNB and CaDT we used the default parameters, which were also used in other works [Cussens 1993].
Figure 2. Left: Naive Binning Mechanism (x-axis: probability estimate; y-axis: calibrated probability; bins 0.00–0.20, 0.20–0.40, 0.40–0.60, 0.60–0.80, 0.80–1.00). Middle: Entropy-Minimization Mechanism (bins 0.00–0.29, 0.29–0.36, 0.36–0.44, 0.44–0.56, 0.56–1.00). Right: Degree of calibration (ω) of the classifiers before and after being calibrated, ordered from best to worst: CaLAC (Ent.), CaNB, CaDT, CaLAC (Naive), LAC, NB, CaSVM, DT, SVM.
Figure 3. Accuracy Estimated by Calibrated Classifiers. Four panels plot estimated and empirical accuracy (y-axis) against the proportion of documents classified (x-axis): the empirical accuracy of CaLAC against the accuracy estimated by CaLAC (Entropy), CaLAC (Ent. + SZ) and CaLAC (Naive); and the empirical accuracy of CaSVM, CaNB and CaDT against their own estimates.
Results. We start our analysis by evaluating each classifier in terms of its ability to estimate the actual accuracy. Figure 3 shows the actual accuracy and the accuracy estimates for each classifier, so that the corresponding values can be directly compared⁷. As expected, CaLAC (Entropy) and CaLAC (Ent. + SZ) prove to be better calibrated than CaLAC (Naive). This is because the bins used by CaLAC (Naive) are produced in an ad-hoc way, while the bins used by CaLAC (Entropy) and CaLAC (Ent. + SZ) are produced following the entropy-minimization strategy. The direct consequence of this strategy is that a bin is likely to contain estimates for which the calibrated probabilities are as similar as possible. While in most of the cases CaNB and CaDT are well calibrated, CaSVM very often underestimates or overestimates the actual accuracy, and is poorly calibrated. The main reason for the poor performance of CaSVM is that Platt Scaling is prone to overfitting, since this calibration mechanism is based on regression. The other calibration mechanisms apparently do not overfit as much. This explanation is supported by the results presented in [Cohen and Goldszmidt 2004] (which show that Naive Bayes classifiers are much better calibrated than SVM classifiers).
If the administrator of the digital library specifies a threshold acc_min (i.e., the minimum acceptable accuracy of the classifier), then the value of the classifier resides in how many documents it is able to classify while respecting acc_min. Figure 4 (left) shows the proportion of documents in the test set each classifier is able to classify for a given acc_min. Clearly, CaLAC (Entropy) and CaLAC (Ent. + SZ) are the best performers, except for acc_min values higher than 0.95, where the best performer is CaLAC (Naive). CaNB and CaDT are in close rivalry, with CaDT being slightly superior. In most cases, both CaNB and CaDT prove to be superior to CaLAC (Naive). CaSVM, as expected, is the worst performer for all values of acc_min. Figure 4 (right) shows an accuracy map for the different classifiers. Dark regions in the map indicate the areas with lower accuracy values. The areas induced by CaLAC (Entropy) and CaLAC (Ent. + SZ) are the clearest ones, while the area induced by CaSVM is the darkest one.
Figure 4. Left: Comparing Calibrated Classifiers – proportion of documents classified (y-axis) as a function of the minimum acceptable accuracy (x-axis), for CaLAC (Naive), CaLAC (Entropy), CaLAC (Ent. + SZ), CaSVM, CaNB and CaDT. Right: Accuracy Map – accuracy as a function of the proportion of documents classified, for each classifier.
5.2. Cost-Sensitive Classification – KDDCUP98
For this application we used a dataset called KDD-98, which was used in the KDDCUP98 contest. This dataset was provided by the Paralyzed Veterans of America (PVA), an organization which devises programs and services for US veterans. With a database of over 13 million donors, PVA is also one of the world's largest direct-mail fund raisers.
⁷For each experiment, predictions were sorted from the most reliable to the least reliable.
The total cost invested in generating a request (including the mail cost) is $0.68 per piece mailed. Thus, PVA wants to maximize net revenue by soliciting only individuals that are likely to respond with a donation. The KDD-98 dataset contains information about individuals that have (or have not) made charitable donations in the past. The provided training data consists of 95,412 examples, and the provided test set consists of 96,367 instances. Each example/instance corresponds to an individual, and is composed of 479 features. The training data has an additional field that indicates the amount of the donation (a value of $0 indicates that the individual has not made a donation). Of the 96,367 individuals in the test set, only 4,872 are donors. If all individuals in the test set were solicited, the total profit would be only $10,547. On the other hand, if only those individuals that are donors were solicited, the total profit would be $72,764. Thus, the classifier must choose which individuals to solicit a new donation from. According to [Zadrozny and Elkan 2001], the optimal net maximization strategy is to solicit an individual x if and only if p̂(donate|x) > 0.68/y(x), where y(x) is the expected amount donated by x⁸. Thus, in addition to calculating p̂(donate|x), the classifier must also estimate y(x).
Estimating the Donation Amount and p(donate|x). For this application, each donation amount (i.e., $200, $199, ..., $1, and $0) is considered as a class. Thus, for an individual x, rules of the form X → y (with X ⊆ x) are generated, and Equation 2 is used to estimate the likelihood of each amount (i.e., p̂(y=$200|x), p̂(y=$199|x), ..., p̂(y=$0|x)). The donation amount, y(x), is finally estimated by a linear combination of the probabilities associated with each amount, as shown in Equation 5. The probability of donation, p̂(donate|x), is simply given by 1 − p̂(y=$0|x). Obviously, for this application, it is crucial that p̂(y=i|x) ≈ p(y=i|x).

y(x) = Σ_{i=$0}^{$200} i · p̂(y = i|x)        (5)
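Putting Equation 5 and the solicitation rule together gives the following Python sketch; the probability tables passed in are hypothetical estimates, not data from the paper.

def expected_donation(p_amount):
    # Equation 5: y(x) is the probability-weighted sum over the possible
    # donation amounts; p_amount maps a dollar amount (0..200) to p̂(y=i|x)
    return sum(i * p for i, p in p_amount.items())

def should_solicit(p_amount, cost=0.68):
    # Solicit x iff p̂(donate|x) > cost / y(x), i.e. iff the expected return
    # p̂(donate|x) * y(x) exceeds the $0.68 mailing cost
    p_donate = 1.0 - p_amount.get(0, 0.0)
    y = expected_donation(p_amount)
    return y > 0 and p_donate > cost / y

# A probable donor is solicited; a probable non-donor is not
print(should_solicit({0: 0.40, 10: 0.50, 20: 0.10}))  # True
print(should_solicit({0: 0.95, 10: 0.04, 20: 0.01}))  # False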
Calibrating the Classifiers. For the Naive Binning mechanism, we evaluated four different configurations, with 5, 8, 10, and 15 fixed-size bins, which are respectively referred to as CaLAC (N5), CaLAC (N8), CaLAC (N10) and CaLAC (N15). We compared the calibration achieved by each configuration against the calibration using the Entropy-Minimization mechanism, CaLAC (Ent.), in terms of ω. Other calibrated classifiers, CaSVM, CaNB and CaDT, are also included in our comparison. The calibration degree, ω, achieved by each calibration mechanism is shown in Figure 5. CaSVM, CaLAC (N15) and CaLAC (N10) achieved the lowest calibration degrees. This is because the corresponding calibration mechanisms overfit the training data (i.e., for Naive Binning, too many bins incur small-sized bins for which the corresponding accuracy may not be reliable, and for CaSVM, the Platt Scaling mechanism is based on regression). CaLAC (Ent.) and CaNB are the best performers, achieving ω values as high as 0.94. From now on, we will use CaLAC (N8) as the representative of the classifier calibrated by the Naive Binning mechanism. Next we will analyze the effectiveness of classifiers with different calibration degrees for net revenue optimization.
⁸The basic idea is to solicit a person x for whom the expected return p̂(donate|x) × y(x) is greater than the cost of mailing the solicitation.
Figure 5. Comparing Calibration Mechanisms in Terms of ω (degree of calibration per classifier, ordered from best to worst: CaLAC (Ent.), CaNB, CaDT, CaLAC (N8), CaLAC (N5), CaSVM, CaLAC (N10), CaLAC (N15)).

Table 1. Comparing Classifiers in Terms of Profit and MSE.

Classifier         Profit    MSE
LAC                $11,097   0.0961
CaLAC (N8)         $12,442   0.0958
CaLAC (Entropy)    $14,818   0.0936
CaLAC (Ent. + SZ)  $14,862   0.0938
CaSVM              $12,969   0.0958
CaNB               $14,682   0.0952
CaDT               $14,190   0.0953
Results. We used profit as the primary metric for assessing the effectiveness of the classifiers for net revenue optimization. For assessing the accuracy of the probability estimates, we used the mean squared error (MSE)⁹. Table 1 shows the effectiveness of each classifier. In all cases, the differences in profit are much more accentuated than the differences in MSE. As can be seen, LAC achieved the lowest profit (only slightly higher than soliciting all individuals), and this is because it was not yet calibrated. For the same reason, LAC was also the worst performer in terms of MSE. The calibrated classifiers CaLAC (N8) and CaSVM showed similar performance in terms of profit. According to [Cohen and Goldszmidt 2004], the poor performance of CaSVM is due to overfitting. CaDT and CaNB are again in close rivalry. Calibrating LAC using the entropy-minimization mechanism is very profitable, and the corresponding classifiers, CaLAC (Entropy) and CaLAC (Ent. + SZ), are the best performers.
6. Conclusions
Calibration is the property of estimating reliable membership probabilities. This property is important in many classification problems. Most of the existing classifiers, although accurate, are not well calibrated, due to the bias imposed by the accuracy-maximization strategy that is typically adopted. We proposed a lazy associative classifier (LAC) which does not follow the accuracy-maximization paradigm (i.e., it simply describes the training data, and then it uses this description to infer membership probabilities that are used for prediction). We show that LAC is usually better calibrated than other classifiers. To further calibrate LAC, some mechanisms become necessary. We have proposed calibration mechanisms based on learning a mapping from original estimates to calibrated estimates. These mechanisms discretize the probability space into a set of bins, and each bin is associated with a calibrated probability. One of the proposed mechanisms greatly differs from other existing approaches because, instead of using bins with pre-specified boundaries, it automatically finds the boundaries that minimize the entropy in each bin. We evaluated the proposed mechanisms by comparing them with other calibration mechanisms. The evaluation was carried out using real-world applications. We showed that the entropy-minimization mechanism, when coupled with LAC, provides the best overall results. The evaluation of this mechanism coupled with other classifiers is the target of future investigation.
⁹The squared error is defined as Σ_{i=1}^{n} (T(c_i|x) − p̂(c_i|x))², where T(c_i|x) is 1 if the class of x is c_i, and 0 otherwise.
References
Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. In Proc. of the ECAI, pages 147–149.
Cohen, I. and Goldszmidt, M. (2004). Properties and benefits of calibrated classifiers. In Proc. of the PKDD Conf., pages 125–136. Springer-Verlag Inc.
Cong, G., Tung, A. H., Xu, X., Pan, F., and Yang, J. (2004). FARMER: Finding interesting rule groups in microarray data. In Proc. of the SIGMOD Conf., pages 143–154. ACM.
Cussens, J. (1993). Bayes and pseudo-Bayes estimates of conditional probabilities and their reliability. In Proc. of the ECML, pages 136–152. Springer-Verlag Inc.
Dawid, A. (1982). The well calibrated Bayesian. J. of the American Statist. Assoc., 77(379):605–613.
DeGroot, M. and Fienberg, S. (1982). The comparison and evaluation of forecasters. Statistician, 32:12–22.
Fayyad, U. and Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the IJCAI, pages 1022–1027.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2).
Joachims, T. (2006). Training linear SVMs in linear time. In Proc. of the KDD Conf., pages 217–226. ACM.
Li, W., Han, J., and Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of the ICDM, pages 369–376. IEEE.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification and association rule mining. In Proc. of the KDD Conf., pages 80–86. ACM.
Niculescu-Mizil, A. and Caruana, R. (2005). Obtaining calibrated probabilities from boosting. In Proc. of the UAI Conf., pages 413–420. AUAI.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers, pages 61–74.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. M. Kaufmann.
Veloso, A., Meira Jr., W., and Zaki, M. J. (2006a). Lazy associative classification. In Proc. of the ICDM, pages 645–654. IEEE.
Veloso, A., Meira, W., Cristo, M., Goncalves, M., and Zaki, M. (2006b). Multi-evidence, multi-criteria, lazy associative document classification. In Proc. of the CIKM, pages 218–227. ACM.
Veloso, A., Mossri, H., Goncalves, M., and Meira, W. (2008). Learning to rank at query-time using association rules. In Proc. of the SIGIR Conf. ACM.
Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proc. of the ICML, pages 609–616.
Zadrozny, B. and Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proc. of the KDD Conf., pages 694–699. ACM.