20 Cost Sensitive Learning
compute estimated probabilities $p'$ that are correct given a different base rate $b'$. From this point of view, the theorem has a remarkable aspect. It lets us use a classifier learned from a training set drawn from one probability distribution on a test set drawn from a different probability distribution. The theorem thus relaxes one of the most fundamental assumptions of almost all research on machine learning, that training and test sets are drawn from the same population.

The insight in the proof is the introduction of the variable $e$ that is the ratio of $P(x \mid j = 0)$ and $P(x \mid j = 1)$. If we try to compute the actual values of these probabilities, we find that we have more variables to solve for than we have simultaneous equations. Fortunately, all we need to know for any particular example $x$ is the ratio $e$.

The special case of Theorem 2 where $p' = 0.5$ was recently worked out independently by Weiss and Provost [2001]. The case where $b = 0.5$ is also interesting. Suppose that we do not know the base rate of positive examples at the time we learn a classifier. Then it is reasonable to use a training set with $b = 0.5$. Theorem 2 says how to compute probabilities $p'$ later that are correct given that the population of test examples has base rate $b'$. Specifically,

$$p' = b' \, \frac{p - p/2}{1/2 - p/2 + b'p - b'/2} = \frac{p}{p + (1 - p)(1 - b')/b'}.$$

This function of $p$ and $b'$ is plotted in Figure 1.

Figure 1: $p'$ as a function of $p$ and $b'$, when $b = 0.5$.

Using Theorem 2 as a lemma, we can now prove Theorem 1 with a slight change of notation.

Theorem 1: To make a target probability threshold $p^*$ correspond to a given probability threshold $p_0$, the number of negative training examples should be multiplied by

$$\frac{p^*}{1 - p^*} \cdot \frac{1 - p_0}{p_0}.$$

Proof: We want to compute an adjusted base rate $b'$ such that for a classifier trained using this base rate, an estimated [...]

$$\frac{p - pb - p'p + p'b - p'b + p'bp}{p - pb - p'p + p'b} \cdot \frac{p - pb - p'p + p'b}{p'b(1 - p)} = \frac{p(1 - b) - p'p(1 - b)}{p'b(1 - p)} = \frac{p(1 - b)(1 - p')}{p'b(1 - p)}.$$

$$\frac{n'}{n} = \frac{p(1 - b)(1 - p')}{p'b(1 - p)} \cdot \frac{b}{1 - b} = \frac{p(1 - p')}{p'(1 - p)}.$$

Note that the effective cardinality of the subset of negative training examples must be changed in a way that does not change the distribution of examples within this subset.

4 Effects of changing base rates

Changing the training set prevalence of positive and negative examples is a common method of making a learning algorithm cost-sensitive. A natural question is what effect such a change has on the behavior of standard learning algorithms. Separately, many researchers have proposed duplicating or discarding examples when one class of examples is rare, on the assumption that standard learning methods perform better when the prevalence of different classes is approximately equal [Kubat and Matwin, 1997; Japkowicz, 2000]. The purpose of this section is to investigate this assumption.

4.1 Changing base rates and Bayesian learning

Given an example $x$, a Bayesian classifier applies Bayes' rule to compute the probability of each class $j$ as $P(j \mid x) = P(x \mid j) P(j) / P(x)$. Typically $P(x \mid j)$ is computed by a function learned from a training set, $P(j)$ is estimated as the training set frequency of class $j$, and $P(x)$ is computed indirectly by solving the equation $\sum_j P(j \mid x) = 1$.

A Bayesian learning method essentially learns a model $P(x \mid j)$ of each class $j$ separately. If the frequency of a class is changed in the training set, the only change is to the estimated base rate $P(j)$ of each class. Therefore there is little reason to expect the accuracy of decision-making with a Bayesian classifier to be higher with any particular base rates.

Naive Bayesian classifiers are the most important special case of Bayesian classification. A naive Bayesian classifier is based on the assumption that within each class, the values of the attributes of examples are independent. It is well-known that these classifiers tend to give inaccurate probability estimates [Domingos and Pazzani, 1996]. Given
an example $x$, suppose that a naive Bayesian classifier computes $N(x)$ as its estimate of $P(j = 1 \mid x)$. Usually $N(x)$ is too extreme: for most $x$, either $N(x)$ is close to 0 and then $N(x) < P(j = 1 \mid x)$ or $N(x)$ is close to 1 and then $N(x) > P(j = 1 \mid x)$.

However, the ranking of examples by naive Bayesian classifiers tends to be correct: if $N(x) < N(y)$ then $P(j = 1 \mid x) < P(j = 1 \mid y)$. This fact suggests that given a cost-sensitive application where optimal decision-making uses the probability threshold $p^*$, one should empirically determine a different threshold $p$ such that $N(x) \ge p$ is equivalent to $P(j = 1 \mid x) \ge p^*$. This procedure is likely to improve the accuracy of decision-making, while changing the proportion of negative examples using Theorem 1 in order to use the threshold 0.5 is not.

4.2 Decision tree growing

We turn our attention now to standard decision tree learning methods, which have two phases. In the first phase a tree is grown top-down, while in the second phase nodes are pruned from the tree. We discuss separately the effect on each phase of changing the proportion of negative and positive training examples.

A splitting criterion is a metric applied to an attribute that measures how homogeneous the induced subsets are, if a training set is partitioned based on the values of this attribute. Consider a discrete attribute $A$ that has values $A = a_1$ through $A = a_m$ for some $m \ge 2$. In the two-class case, standard splitting criteria have the form

$$I(A) = \sum_{k=1}^{m} P(A = a_k) \, f(p_k, 1 - p_k)$$

where $p_k = P(j = 1 \mid A = a_k)$ and all probabilities are frequencies in the training set to be split based on $A$. The function $f(p, 1 - p)$ measures the impurity or heterogeneity of each subset of training examples. All such functions are qualitatively similar, with a unique maximum at $p = 0.5$, and equal minima at $p = 0$ and $p = 1$.

Drummond and Holte [2000] have shown that for two-valued attributes the impurity function $2\sqrt{p(1 - p)}$ suggested by Kearns and Mansour [1996] is invariant to changes in the proportion of different classes in the training data. We prove here a more general result that applies to all discrete-valued attributes and that shows that related impurity functions, including the Gini index [Breiman et al., 1984], are not invariant to base rate changes.

Theorem 3: Suppose $f(p, 1 - p) = \beta (p(1 - p))^{\alpha}$ where $\alpha > 0$ and $\beta > 0$. For any collection of discrete-valued attributes, the attribute that minimizes $I(A)$ using $f$ is the same regardless of changes in the base rate $P(j = 1)$ of the training set if $\alpha = 0.5$, and not otherwise in general.

Proof: For any attribute $A$, by definition

$$I(A) = \beta \sum_{k=1}^{m} P(a_k) \, P(j = 1 \mid a_k)^{\alpha} \, P(j = 0 \mid a_k)^{\alpha}$$

where $a_1$ through $a_m$ are the possible values of $A$. So by Bayes' rule $I(A)/\beta$ is

$$\sum_k P(a_k) \left( \frac{P(a_k \mid j = 1) P(j = 1)}{P(a_k)} \right)^{\alpha} \left( \frac{P(a_k \mid j = 0) P(j = 0)}{P(a_k)} \right)^{\alpha}.$$

Grouping the $P(a_k)$ factors for each $k$ gives that $I(A)/\beta$ is

$$\sum_k P(a_k)^{1 - 2\alpha} \left( P(a_k \mid j = 1) P(j = 1) P(a_k \mid j = 0) P(j = 0) \right)^{\alpha}.$$

Now the base rate factors can be brought outside the sum, so $I(A)$ is $\beta P(j = 1)^{\alpha} P(j = 0)^{\alpha}$ times the sum

$$\sum_k P(a_k)^{1 - 2\alpha} P(a_k \mid j = 1)^{\alpha} P(a_k \mid j = 0)^{\alpha}. \qquad (3)$$

Because $\beta P(j = 1)^{\alpha} P(j = 0)^{\alpha}$ is constant for all attributes, the attribute $A$ for which $I(A)$ is minimum is determined by the minimum of (3). If $2\alpha = 1$ then (3) depends only on $P(a_k \mid j = 1)$ and $P(a_k \mid j = 0)$, which do not depend on the base rates. Otherwise, (3) is different for different base rates because

$$P(a_k) = P(a_k \mid j = 1) P(j = 1) + P(a_k \mid j = 0) P(j = 0)$$

unless the attribute $A$ is independent of the class $j$, that is $P(a_k \mid j = 1) = P(a_k \mid j = 0)$ for $1 \le k \le m$.

The sum (3) has its maximum value 1 if $A$ is independent of $j$. As desired, the sum is smaller otherwise, if $A$ and $j$ are correlated and hence splitting on $A$ is reasonable.

Theorem 3 implies that changing the proportion of positive or negative examples in the training set has no effect on the structure of the tree if the decision tree growing method uses the $2\sqrt{p(1 - p)}$ impurity criterion. If the algorithm uses a different criterion, such as the C4.5 entropy measure, the effect is usually small, because all impurity criteria are similar.

The experimental results of Drummond and Holte [2000] and Dietterich et al. [1996] show that the $2\sqrt{p(1 - p)}$ criterion normally leads to somewhat smaller unpruned decision trees, sometimes leads to more accurate trees, and never leads to much less accurate trees. Therefore we can recommend its use, and we can conclude that regardless of the impurity criterion, applying Theorem 1 is not likely to have much influence on the growing phase of decision tree learning.

4.3 Decision tree pruning

Standard methods for pruning decision trees are highly sensitive to the prevalence of different classes among training examples. If all classes except one are rare, then C4.5 often prunes the decision tree down to a single node that classifies all examples as members of the common class. Such a classifier is useless for decision-making if failing to recognize an example in a rare class is an expensive error.

Several papers have recently examined the issue of how to obtain good probability estimates from decision trees [Bradford et al., 1998; Provost and Domingos, 2000; Zadrozny and Elkan, 2001b]. It is clear that it is necessary to use a smoothing method to adjust the probability estimates at each leaf of a decision tree. It is not so clear what pruning methods are best. The experiments of Bauer and Kohavi [1999] suggest that no pruning is best when using a decision tree with probability
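The formulas in this part of the paper can be checked numerically. The sketch below is only an illustration, not code from the paper: the function names and the two attributes `A` and `B` are hypothetical, and the conditional probabilities are made-up numbers chosen so that the effect is visible. It implements the $b = 0.5$ special case of Theorem 2, the multiplier of Theorem 1, and the splitting criterion $I(A)$ with $f(p, 1 - p) = \beta (p(1 - p))^{\alpha}$ from Theorem 3.

```python
def adjust_probability(p, new_base_rate):
    """Special case of Theorem 2 for a training base rate b = 0.5:
    p' = p / (p + (1 - p) * (1 - b') / b')."""
    odds_against = (1.0 - new_base_rate) / new_base_rate
    return p / (p + (1.0 - p) * odds_against)


def negative_multiplier(target_threshold, given_threshold):
    """Theorem 1 factor p*/(1 - p*) * (1 - p0)/p0 for the negative examples."""
    return (target_threshold / (1.0 - target_threshold)) * (
        (1.0 - given_threshold) / given_threshold
    )


def impurity(cond_pos, cond_neg, base_rate, alpha, beta=2.0):
    """I(A) = sum_k P(a_k) * beta * (p_k * (1 - p_k)) ** alpha.

    cond_pos[k] is P(a_k | j = 1), cond_neg[k] is P(a_k | j = 0),
    and base_rate is P(j = 1).
    """
    total = 0.0
    for cp, cn in zip(cond_pos, cond_neg):
        p_ak = cp * base_rate + cn * (1.0 - base_rate)  # P(a_k)
        if p_ak == 0.0:
            continue  # a value that never occurs contributes nothing
        p_k = cp * base_rate / p_ak  # P(j = 1 | a_k) by Bayes' rule
        total += p_ak * beta * (p_k * (1.0 - p_k)) ** alpha
    return total


def best_attribute(attributes, base_rate, alpha):
    """Name of the attribute with minimum impurity I(A)."""
    return min(
        attributes,
        key=lambda name: impurity(*attributes[name], base_rate, alpha),
    )


# Hypothetical two-valued attributes: (P(a_k | j = 1), P(a_k | j = 0)).
ATTRIBUTES = {
    "A": ((0.9, 0.1), (0.3, 0.7)),
    "B": ((0.7, 0.3), (0.12, 0.88)),
}

if __name__ == "__main__":
    for alpha in (0.5, 1.0):  # 0.5: square-root criterion, 1.0: Gini index
        for base_rate in (0.2, 0.8):
            choice = best_attribute(ATTRIBUTES, base_rate, alpha)
            print(f"alpha={alpha} base_rate={base_rate} -> {choice}")
```

With these illustrative numbers, the criterion with $\alpha = 0.5$ selects `A` at both base rates, while the Gini index ($\alpha = 1$, $\beta = 2$) selects `B` at base rate 0.2 and `A` at base rate 0.8, which is the behavior Theorem 3 describes. The first two functions can be sanity-checked by hand: an estimate of 0.5 from a balanced training set maps to the new base rate itself, and moving from a threshold of 0.5 to a target threshold of 0.25 multiplies the number of negative examples by 1/3.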
