
To appear, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01).

The Foundations of Cost-Sensitive Learning


Charles Elkan
Department of Computer Science and Engineering 0114
University of California, San Diego
La Jolla, California 92093-0114
elkan@cs.ucsd.edu
Abstract

This paper revisits the problem of optimal learning and decision-making when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically incoherent. For the two-class case, we prove a theorem that shows how to change the proportion of negative examples in a training set in order to make optimal cost-sensitive classification decisions using a classifier learned by a standard non-cost-sensitive learning method. However, we then argue that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods. Accordingly, the recommended way of applying one of these methods in a domain with differing misclassification costs is to learn a classifier from the training set as given, and then to compute optimal decisions explicitly using the probability estimates given by the classifier.
1 Making decisions based on a cost matrix

Given a specification of costs for correct and incorrect predictions, an example should be predicted to have the class that leads to the lowest expected cost, where the expectation is computed using the conditional probability of each class given the example. Mathematically, let the $(i, j)$ entry in a cost matrix $C$ be the cost of predicting class $i$ when the true class is $j$. If $i = j$ then the prediction is correct, while if $i \neq j$ the prediction is incorrect. The optimal prediction for an example $x$ is the class $i$ that minimizes

    $L(x, i) = \sum_j P(j \mid x) \, C(i, j).$    (1)

Costs are not necessarily monetary. A cost can also be a waste of time, or the severity of an illness, for example.

For each $i$, $L(x, i)$ is a sum over the alternative possibilities for the true class of $x$. In this framework, the role of a learning algorithm is to produce a classifier that for any example $x$ can estimate the probability $P(j \mid x)$ of each class $j$ being the true class of $x$. For an example $x$, making the prediction $i$ means acting as if $i$ is the true class of $x$. The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, it can be rational not to approve a large credit card transaction even if the transaction is most likely legitimate.
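In code, the decision rule of Equation (1) is a one-line minimization. The sketch below is illustrative rather than canonical: the cost matrix values and class probabilities are hypothetical, and C[i][j] follows the predicted/actual row-column convention used throughout this paper.

```python
def optimal_prediction(probs, C):
    """Return the class i minimizing L(x, i) = sum_j P(j|x) * C[i][j]."""
    classes = range(len(C))
    expected_cost = lambda i: sum(probs[j] * C[i][j] for j in classes)
    return min(classes, key=expected_cost)

# Hypothetical transaction example: class 0 = legitimate, class 1 = fraud.
# A missed fraud (predict 0, actual 1) costs 10; a wrongly refused
# transaction (predict 1, actual 0) costs 1; correct decisions cost 0.
C = [[0.0, 10.0],
     [1.0,  0.0]]
print(optimal_prediction([0.9, 0.1], C))  # prints 1: refuse the transaction
```

Here acting as if the less probable class were true is optimal: the expected cost of approving is 0.1 x 10 = 1.0, while the expected cost of refusing is 0.9 x 1 = 0.9.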
1.1 Cost matrix properties

A cost matrix $C$ always has the following structure when there are only two classes:

                       actual negative      actual positive
    predict negative   $C(0,0) = c_{00}$    $C(0,1) = c_{01}$
    predict positive   $C(1,0) = c_{10}$    $C(1,1) = c_{11}$

Recent papers have followed the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes, i.e. row/column = $i$/$j$ = predicted/actual.

In our notation, the cost of a false positive is $c_{10}$ while the cost of a false negative is $c_{01}$. Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. Mathematically, it should always be the case that $c_{10} > c_{00}$ and $c_{01} > c_{11}$. We call these conditions the "reasonableness" conditions.

Suppose that the first reasonableness condition is violated, so $c_{00} \geq c_{10}$ but still $c_{01} > c_{11}$. In this case the optimal policy is to label all examples positive. Similarly, if $c_{10} > c_{00}$ but $c_{11} \geq c_{01}$ then it is optimal to label all examples negative. We leave the case where both reasonableness conditions are violated for the reader to analyze.

Margineantu [2000] has pointed out that for some cost matrices, some class labels are never predicted by the optimal policy as given by Equation (1). We can state a simple, intuitive criterion for when this happens. Say that row $m$ dominates row $n$ in a cost matrix $C$ if for all $j$, $C(m, j) \geq C(n, j)$. In this case the cost of predicting $n$ is no greater than the cost of predicting $m$, regardless of what the true class $j$ is. So it is optimal never to predict $m$. As a special case, the optimal prediction is always $n$ if row $n$ is dominated by all other rows in a cost matrix. The two reasonableness conditions for a two-class cost matrix imply that neither row in the matrix dominates the other.

Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix. This shifting corresponds to changing the baseline away from which costs are measured. By scaling and shifting entries, any two-class cost matrix that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

    0    $c'_{01}$
    1    $c'_{11}$

where $c'_{01} = (c_{01} - c_{00}) / (c_{10} - c_{00})$ and $c'_{11} = (c_{11} - c_{00}) / (c_{10} - c_{00})$. From a matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.
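These properties can all be checked mechanically. The sketch below is ours, not the paper's; the helper names are hypothetical, and the matrix is indexed as C[i][j] with $i$ the prediction and $j$ the actual class.

```python
def is_reasonable(C):
    """Reasonableness conditions: c10 > c00 and c01 > c11."""
    return C[1][0] > C[0][0] and C[0][1] > C[1][1]

def never_predicted(C):
    """Classes m dominated per the criterion above: some row n has
    C[n][j] <= C[m][j] for all j, so predicting m is never optimal."""
    rows = range(len(C))
    return [m for m in rows if any(
        n != m and all(C[n][j] <= C[m][j] for j in range(len(C[m])))
        for n in rows)]

def simplify(C):
    """Shift entries by -c00 and scale by 1/(c10 - c00), giving the
    simpler matrix with rows (0, c01') and (1, c11') described above."""
    scale = C[1][0] - C[0][0]
    return [[(C[i][j] - C[0][0]) / scale for j in (0, 1)] for i in (0, 1)]
```

For any matrix satisfying the reasonableness conditions, never_predicted returns an empty list, reflecting the fact that neither row dominates the other.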
1.2 Costs versus benefits

Although most recent research in machine learning has used the terminology of costs, doing accounting in terms of benefits is generally preferable, because avoiding mistakes is easier, since there is a natural baseline from which to measure all benefits, whether positive or negative. This baseline is the state of the agent before it takes a decision regarding an example. After the agent has made the decision, if it is better off, its benefit is positive. Otherwise, its benefit is negative.

When thinking in terms of costs, it is easy to posit a cost matrix that is logically contradictory because not all entries in the matrix are measured from the same baseline. For example, consider the so-called German credit dataset that was published as part of the Statlog project [Michie et al., 1994]. The cost matrix given with this dataset is as follows:

                     actual bad    actual good
    predict bad          0              1
    predict good         5              0

Here examples are people who apply for a loan from a bank. "Actual good" means that a customer would repay a loan while "actual bad" means that the customer would default. The action associated with "predict bad" is to deny the loan. Hence, the cashflow relative to any baseline associated with this prediction is the same regardless of whether "actual good" or "actual bad" is true. In every economically reasonable cost matrix for this domain, both entries in the "predict bad" row must be the same.

Costs or benefits can be measured against any baseline, but the baseline must be fixed. An opportunity cost is a foregone benefit, i.e. a missed opportunity rather than an actual penalty. It is easy to make the mistake of measuring different opportunity costs against different baselines. For example, the erroneous cost matrix above can be justified informally as follows: "The cost of approving a good customer is zero, and the cost of rejecting a bad customer is zero, because in both cases the correct decision has been made. If a good customer is rejected, the cost is an opportunity cost, the foregone profit of 1. If a bad customer is approved for a loan, the cost is the lost loan principal of 5."

To see concretely that the reasoning in quotes above is incorrect, suppose that the bank has one customer of each of the four types. Clearly the cost matrix above is intended to imply that the net change in the assets of the bank is then $-4$. Alternatively, suppose that we have four customers who receive loans and repay them. The net change in assets is then $+4$. Regardless of the baseline, any method of accounting should give a difference of 8 between these scenarios. But with the erroneous cost matrix above, the first scenario gives a total cost of 6, while the second scenario gives a total cost of 0.

In general the amount in some cells of a cost or benefit matrix may not be constant, and may be different for different examples. For example, consider the credit card transactions domain. Here the benefit matrix might be

               fraudulent    legitimate
    refuse        $20          $-20$
    approve       $-x$         $0.02x$

where $x$ is the size of the transaction in dollars. Approving a fraudulent transaction costs the amount of the transaction because the bank is liable for the expenses of fraud. Refusing a legitimate transaction has a non-trivial cost because it annoys a customer. Refusing a fraudulent transaction has a non-trivial benefit because it may prevent further fraud and lead to the arrest of a criminal. Research on cost-sensitive learning and decision-making when costs may be example-dependent is only just beginning [Zadrozny and Elkan, 2001a].

1.3 Making optimal decisions

In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

    $P(j=0 \mid x) \, c_{10} + P(j=1 \mid x) \, c_{11} \leq P(j=0 \mid x) \, c_{00} + P(j=1 \mid x) \, c_{01}$

which is equivalent to

    $(1-p) \, c_{10} + p \, c_{11} \leq (1-p) \, c_{00} + p \, c_{01}$

given $p = P(j=1 \mid x)$. If this inequality is in fact an equality, then predicting either class is optimal.

The threshold for making optimal decisions is $p^*$ such that

    $(1-p^*) \, c_{10} + p^* c_{11} = (1-p^*) \, c_{00} + p^* c_{01}.$

Assuming the reasonableness conditions, the optimal prediction is class 1 if and only if $p \geq p^*$. Rearranging the equation for $p^*$ leads to the solution

    $p^* = \frac{c_{10} - c_{00}}{c_{10} - c_{00} + c_{01} - c_{11}}$    (2)

assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for $p^*$ shows that any 2x2 cost matrix has essentially only one degree of freedom from a decision-making perspective, although it has two degrees of freedom from a matrix perspective. The cause of the apparent contradiction is that the optimal decision-making policy is a nonlinear function of the cost matrix.
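Equation (2) transcribes directly into code. The sketch below is ours, with hypothetical numbers chosen only for illustration.

```python
def optimal_threshold(C):
    """p* = (c10 - c00) / (c10 - c00 + c01 - c11), Equation (2).
    The reasonableness conditions keep the denominator positive."""
    return (C[1][0] - C[0][0]) / (C[1][0] - C[0][0] + C[0][1] - C[1][1])

def predict(p, C):
    """Optimal prediction: class 1 iff p = P(j=1|x) >= p*."""
    return int(p >= optimal_threshold(C))

# Hypothetical matrix in which a false negative (predict 0, actual 1)
# costs 5 and a false positive costs 1: the threshold is 1/6, so class 1
# is predicted even at fairly low probabilities.
C = [[0.0, 5.0], [1.0, 0.0]]
print(optimal_threshold(C), predict(0.2, C))  # 0.1666..., 1
```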
2 Achieving cost-sensitivity by rebalancing

In this section we turn to the question of how to obtain a classifier that is useful for cost-sensitive decision-making.

Standard learning algorithms are designed to yield classifiers that maximize accuracy. In the two-class case, these classifiers implicitly make decisions based on the probability threshold 0.5. The conclusion of the previous section was that we need a classifier that, given an example $x$, says whether or not $P(j=1 \mid x) \geq p^*$ for some target threshold $p^*$ that in general is different from 0.5. How can a standard learning algorithm be made to produce a classifier that makes decisions based on a general $p^*$?

The most common method of achieving this objective is to rebalance the training set given to the learning algorithm, i.e. to change the proportion of positive and negative training examples in the training set. Although rebalancing is a common idea, the general formula for how to do it correctly has not been published. The following theorem provides this formula.
Theorem 1: To make a target probability threshold $p^*$ correspond to a given probability threshold $p_0$, the number of negative examples in the training set should be multiplied by

    $\frac{p^*}{1 - p^*} \cdot \frac{1 - p_0}{p_0}.$

While the formula in Theorem 1 is simple, the proof of its correctness is not. We defer the proof until the end of the next section.

In the special case where the threshold used by the learning method is $p_0 = 0.5$ and $c_{00} = c_{11} = 0$, the theorem says that the number of negative training examples should be multiplied by $p^*/(1 - p^*) = c_{10}/c_{01}$. This special case is used by Breiman et al. [1984].
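The multiplier, and the weighting and sampling options discussed in the paragraphs that follow, might be coded as below. This is our sketch under stated assumptions: the function names are hypothetical, and random sampling is only one of the choices available.

```python
import random

def negative_multiplier(p_star, p0=0.5):
    """Theorem 1: multiply the number of negatives by this factor so
    that a learner with implicit threshold p0 acts as if using p*."""
    return (p_star / (1.0 - p_star)) * ((1.0 - p0) / p0)

def negative_weights(labels, p_star, p0=0.5):
    """If the learner accepts example weights, give each negative
    example the multiplier as its weight and each positive weight 1."""
    m = negative_multiplier(p_star, p0)
    return [m if y == 0 else 1.0 for y in labels]

def resample_negatives(negatives, p_star, p0=0.5, rng=random):
    """Otherwise undersample (factor < 1) or oversample (factor > 1)."""
    m = negative_multiplier(p_star, p0)
    whole, frac = int(m), m - int(m)
    return negatives * whole + [x for x in negatives if rng.random() < frac]
```

With $p_0 = 0.5$ and $c_{00} = c_{11} = 0$, negative_multiplier reduces to $p^*/(1-p^*) = c_{10}/c_{01}$, the special case noted above.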
The directionality of Theorem 1 is important to understand. Suppose we have a learning algorithm L that yields classifiers that make predictions based on a probability threshold $p_0$. Given a training set $S$ and a desired probability threshold $p^*$, the theorem says how to create a training set $S'$ by changing the number of negative training examples such that L applied to $S'$ gives the desired classifier.

Theorem 1 does not say in what way the number of negative examples should be changed. If a learning algorithm can use weights on training examples, then the weight of each negative example can be set to the factor given by the theorem. Otherwise, we must do oversampling or undersampling. Oversampling means duplicating examples, and undersampling means deleting examples.

Sampling can be done either randomly or deterministically. While deterministic sampling can reduce variance, it risks introducing bias, if the non-random choice of examples to duplicate or eliminate is correlated with some property of the examples. Undersampling that is deterministic in the sense that the fraction of examples with each value of a certain feature is held constant is often called stratified sampling.

It is possible to change the number of positive examples instead of or as well as changing the number of negative examples. However in many domains one class is rare compared to the other, and it is important to keep all available examples of the rare class. In these cases, if we call the rare class the positive class, Theorem 1 says directly how to change the number of common examples without discarding or duplicating any of the rare examples.

3 New probabilities given a new base rate

In this section we state and prove a theorem of independent interest that happens also to be the tool needed to prove Theorem 1. The new theorem answers the question of how the predicted class membership probability of an example should change in response to a change in base rates. Suppose that $p = P(j=1 \mid x)$ is correct for an example $x$, if $x$ is drawn from a population with base rate $b = P(j=1)$ positive examples. But suppose that in fact $x$ is drawn from a population with base rate $b'$. What is $p' = P'(j=1 \mid x)$?

We make the assumption that the shift in base rate is the only change in the population to which $x$ belongs. Formally, we assume that within the positive and negative subpopulations, example probabilities are unchanged: $P'(x \mid j=1) = P(x \mid j=1)$ and $P'(x \mid j=0) = P(x \mid j=0)$. Given these assumptions, the following theorem shows how to compute $p'$ as a function of $p$, $b$, and $b'$.

Theorem 2: In the context just described,

    $p' = b' \, \frac{p - pb}{b - pb + b'p - bb'}.$

Proof: Using Bayes' rule, $p = P(j=1 \mid x)$ is

    $\frac{P(x \mid j=1) P(j=1)}{P(x)} = \frac{P(x \mid j=1) \, b}{P(x)}.$

Because $j=1$ and $j=0$ are mutually exclusive, $P(x)$ is

    $P(x \mid j=1) P(j=1) + P(x \mid j=0) P(j=0).$

Let $c = P(x \mid j=1)$, let $d = P(x \mid j=0)$, and let $e = d/c$. Then

    $p = \frac{cb}{cb + d(1-b)} = \frac{b}{b + e(1-b)}.$

Similarly,

    $p' = \frac{b'}{b' + e(1-b')}.$

Now we can solve for $e$ as a function of $p$ and $b$. We have $pb + pe(1-b) = b$ so $e = (b - pb)/(p - pb)$. Then the denominator for $p'$ is

    $b' + e(1-b') = b' + \frac{b - pb}{p - pb} \left( 1 - b' \right) = \frac{b - pb + b'p - bb'}{p - pb}.$

Finally we have

    $p' = b' \, \frac{p - pb}{b - pb + b'p - bb'}.$

It is important to note that Theorem 2 is a statement about true probabilities given different base rates. The proof does not rely on how probabilities may be estimated based on some learning process. In particular, the proof does not use any assumptions of independence or conditional independence, as made for example by a naive Bayesian classifier.
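Theorem 2 also transcribes directly into code. The sketch below is ours; the sanity checks use arbitrary numbers.

```python
def adjust_probability(p, b, b_prime):
    """Theorem 2: p' = b'(p - pb) / (b - pb + b'p - bb'), where p is
    correct at base rate b and x is instead drawn from a population
    with base rate b'."""
    return b_prime * (p - p * b) / (b - p * b + b_prime * p - b * b_prime)

# Sanity checks: an unchanged base rate leaves p alone, and certain
# predictions stay certain.
assert abs(adjust_probability(0.3, 0.2, 0.2) - 0.3) < 1e-12
assert adjust_probability(1.0, 0.2, 0.5) == 1.0
assert adjust_probability(0.0, 0.2, 0.5) == 0.0
```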
If a classifier yields estimated probabilities $\hat{p}$ that we assume are correct given a base rate $b$, then Theorem 2 lets us compute estimated probabilities $\hat{p}'$ that are correct given a different base rate $b'$. From this point of view, the theorem has a remarkable aspect. It lets us use a classifier learned from a training set drawn from one probability distribution on a test set drawn from a different probability distribution. The theorem thus relaxes one of the most fundamental assumptions of almost all research on machine learning, that training and test sets are drawn from the same population.

The insight in the proof is the introduction of the variable $e$ that is the ratio of $P(x \mid j=0)$ and $P(x \mid j=1)$. If we try to compute the actual values of these probabilities, we find that we have more variables to solve for than we have simultaneous equations. Fortunately, all we need to know for any particular example $x$ is the ratio $e$.

The special case of Theorem 2 where $p_0 = 0.5$ was recently worked out independently by Weiss and Provost [2001]. The case where $b = 0.5$ is also interesting. Suppose that we do not know the base rate of positive examples at the time we learn a classifier. Then it is reasonable to use a training set with $b = 0.5$. Theorem 2 says how to compute probabilities $p'$ later that are correct given that the population of test examples has base rate $b'$. Specifically,

    $p' = b' \, \frac{p - p/2}{1/2 - p/2 + b'p - b'/2} = \frac{p}{p + (1-p)(1-b')/b'}.$

This function of $p$ and $b'$ is plotted in Figure 1.

[Figure 1: $p'$ as a function of $p$ and $b'$, when $b = 0.5$. Both $p$ and $b'$ range from 0 to 1.]

Using Theorem 2 as a lemma, we can now prove Theorem 1 with a slight change of notation.

Theorem 1: To make a target probability threshold $p^*$ correspond to a given probability threshold $p_0$, the number of negative training examples should be multiplied by

    $\frac{p^*}{1 - p^*} \cdot \frac{1 - p_0}{p_0}.$

Proof: We want to compute an adjusted base rate $b'$ such that for a classifier trained using this base rate, an estimated probability $p'$ corresponds to a probability $p$ for a classifier trained using the base rate $b$.

We need to compute the adjusted $b'$ as a function of $b$, $p$, and $p'$. From the proof of Theorem 2, $p'b - p'pb + p'b'p - p'bb' = b'p - b'pb$. Collecting all the $b'$ terms on the left, we have $b'p - b'pb - b'p'p + b'p'b = p'b - p'pb$, which gives that the adjusted base rate should be

    $b' = \frac{p'b - p'pb}{p - pb - p'p + p'b} = \frac{p'b(1-p)}{p - pb - p'p + p'b}.$

Suppose that $b = 1/(1+n)$ and $b' = 1/(1+n')$, so the number of negative training examples should be multiplied by $n'/n$ to get the adjusted base rate $b'$. We have that $n' = (1-b')/b'$ is

    $\frac{p - pb - p'p + p'b - p'b + p'bp}{p'b(1-p)} = \frac{p(1-b) - p'p(1-b)}{p'b(1-p)} = \frac{p(1-b)(1-p')}{p'b(1-p)}.$

Therefore

    $\frac{n'}{n} = \frac{p(1-b)(1-p')}{p'b(1-p)} \cdot \frac{b}{1-b} = \frac{p(1-p')}{p'(1-p)}.$

With the change of notation $p = p^*$ and $p' = p_0$, this is exactly the multiplier stated in the theorem. Note that the effective cardinality of the subset of negative training examples must be changed in a way that does not change the distribution of examples within this subset.
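The bookkeeping in this proof is easy to check numerically. The sketch below is ours, with arbitrary values of $b$, $p$, and $p'$: it computes the adjusted base rate and confirms that $n'/n$ equals the Theorem 1 factor.

```python
def adjusted_base_rate(b, p, p_prime):
    """b' such that threshold p at base rate b maps to p' at b'."""
    return (p_prime * b * (1 - p)) / (p - p * b - p_prime * p + p_prime * b)

b, p, p_prime = 0.25, 0.7, 0.5            # hypothetical values
b_new = adjusted_base_rate(b, p, p_prime)
n, n_new = (1 - b) / b, (1 - b_new) / b_new
theorem1_factor = (p / (1 - p)) * ((1 - p_prime) / p_prime)
assert abs(n_new / n - theorem1_factor) < 1e-12  # 7/3 in both cases
```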
4 Effects of changing base rates

Changing the training set prevalence of positive and negative examples is a common method of making a learning algorithm cost-sensitive. A natural question is what effect such a change has on the behavior of standard learning algorithms. Separately, many researchers have proposed duplicating or discarding examples when one class of examples is rare, on the assumption that standard learning methods perform better when the prevalence of different classes is approximately equal [Kubat and Matwin, 1997; Japkowicz, 2000]. The purpose of this section is to investigate this assumption.

4.1 Changing base rates and Bayesian learning

Given an example $x$, a Bayesian classifier applies Bayes' rule to compute the probability of each class $j$ as $P(j \mid x) = P(x \mid j) P(j) / P(x)$. Typically $P(x \mid j)$ is computed by a function learned from a training set, $P(j)$ is estimated as the training set frequency of class $j$, and $P(x)$ is computed indirectly by solving the equation $\sum_j P(j \mid x) = 1$.

A Bayesian learning method essentially learns a model $P(x \mid j)$ of each class $j$ separately. If the frequency of a class is changed in the training set, the only change is to the estimated base rate $P(j)$ of each class. Therefore there is little reason to expect the accuracy of decision-making with a Bayesian classifier to be higher with any particular base rates.

Naive Bayesian classifiers are the most important special case of Bayesian classification. A naive Bayesian classifier is based on the assumption that within each class, the values of the attributes of examples are independent. It is well-known that these classifiers tend to give inaccurate probability estimates [Domingos and Pazzani, 1996].
Given an example $x$, suppose that a naive Bayesian classifier computes $N(x)$ as its estimate of $P(j=1 \mid x)$. Usually $N(x)$ is too extreme: for most $x$, either $N(x)$ is close to 0 and then $N(x) < P(j=1 \mid x)$, or $N(x)$ is close to 1 and then $N(x) > P(j=1 \mid x)$.

However, the ranking of examples by naive Bayesian classifiers tends to be correct: if $N(x) < N(y)$ then $P(j=1 \mid x) < P(j=1 \mid y)$. This fact suggests that given a cost-sensitive application where optimal decision-making uses the probability threshold $p^*$, one should empirically determine a different threshold $p'$ such that $N(x) \geq p'$ is equivalent to $P(j=1 \mid x) \geq p^*$. This procedure is likely to improve the accuracy of decision-making, while changing the proportion of negative examples using Theorem 1 in order to use the threshold 0.5 is not.
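A minimal sketch of this empirical procedure, under our own assumptions about the setup: given naive Bayes scores $N(x)$ and labels for a held-out validation set, search for the cutoff on $N(x)$ whose decisions incur the lowest total cost, rather than applying $p^*$ to the miscalibrated scores directly.

```python
def empirical_threshold(scores, labels, C, grid=1000):
    """Return the cutoff t minimizing the total cost, over the
    validation set, of predicting class 1 iff N(x) >= t."""
    def total_cost(t):
        return sum(C[int(s >= t)][y] for s, y in zip(scores, labels))
    return min((k / grid for k in range(grid + 1)), key=total_cost)
```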
4.2 Decision tree growing

We turn our attention now to standard decision tree learning methods, which have two phases. In the first phase a tree is grown top-down, while in the second phase nodes are pruned from the tree. We discuss separately the effect on each phase of changing the proportion of negative and positive training examples.

A splitting criterion is a metric applied to an attribute that measures how homogeneous the induced subsets are, if a training set is partitioned based on the values of this attribute. Consider a discrete attribute $A$ that has values $A = a_1$ through $A = a_m$ for some $m \geq 2$. In the two-class case, standard splitting criteria have the form

    $I(A) = \sum_{k=1}^{m} P(A = a_k) \, f(p_k, 1 - p_k)$

where $p_k = P(j=1 \mid A = a_k)$ and all probabilities are frequencies in the training set to be split based on $A$. The function $f(p, 1-p)$ measures the impurity or heterogeneity of each subset of training examples. All such functions are qualitatively similar, with a unique maximum at $p = 0.5$, and equal minima at $p = 0$ and $p = 1$.

Drummond and Holte [2000] have shown that for two-valued attributes the impurity function $2\sqrt{p(1-p)}$ suggested by Kearns and Mansour [1996] is invariant to changes in the proportion of different classes in the training data. We prove here a more general result that applies to all discrete-valued attributes and that shows that related impurity functions, including the Gini index [Breiman et al., 1984], are not invariant to base rate changes.

Theorem 3: Suppose $f(p, 1-p) = \alpha (p(1-p))^{\gamma}$ where $\alpha > 0$ and $\gamma > 0$. For any collection of discrete-valued attributes, the attribute that minimizes $I(A)$ using $f$ is the same regardless of changes in the base rate $P(j=1)$ of the training set if $\gamma = 0.5$, and not otherwise in general.

Proof: For any attribute $A$, by definition

    $I(A) = \alpha \sum_{k=1}^{m} P(a_k) \, P(j=1 \mid a_k)^{\gamma} \, P(j=0 \mid a_k)^{\gamma}$

where $a_1$ through $a_m$ are the possible values of $A$. So by Bayes' rule, $I(A)/\alpha$ is

    $\sum_k P(a_k) \left( \frac{P(a_k \mid j=1) P(j=1)}{P(a_k)} \right)^{\gamma} \left( \frac{P(a_k \mid j=0) P(j=0)}{P(a_k)} \right)^{\gamma}.$

Grouping the $P(a_k)$ factors for each $k$ gives that $I(A)/\alpha$ is

    $\sum_k P(a_k)^{1 - 2\gamma} \left( P(a_k \mid j=1) P(j=1) P(a_k \mid j=0) P(j=0) \right)^{\gamma}.$

Now the base rate factors can be brought outside the sum, so $I(A)$ is $\alpha \, P(j=1)^{\gamma} P(j=0)^{\gamma}$ times the sum

    $\sum_k P(a_k)^{1 - 2\gamma} \left( P(a_k \mid j=1) \, P(a_k \mid j=0) \right)^{\gamma}.$    (3)

Because $\alpha P(j=1)^{\gamma} P(j=0)^{\gamma}$ is constant for all attributes, the attribute $A$ for which $I(A)$ is minimum is determined by the minimum of (3). If $2\gamma = 1$ then (3) depends only on $P(a_k \mid j=1)$ and $P(a_k \mid j=0)$, which do not depend on the base rates. Otherwise, (3) is different for different base rates because

    $P(a_k) = P(a_k \mid j=1) P(j=1) + P(a_k \mid j=0) P(j=0)$

unless the attribute $A$ is independent of the class $j$, that is $P(a_k \mid j=1) = P(a_k \mid j=0)$ for $1 \leq k \leq m$.

The sum (3) has its maximum value 1 if $A$ is independent of $j$. As desired, the sum is smaller otherwise, if $A$ and $j$ are correlated and hence splitting on $A$ is reasonable.

Theorem 3 implies that changing the proportion of positive or negative examples in the training set has no effect on the structure of the tree if the decision tree growing method uses the $2\sqrt{p(1-p)}$ impurity criterion. If the algorithm uses a different criterion, such as the C4.5 entropy measure, the effect is usually small, because all impurity criteria are similar. The experimental results of Drummond and Holte [2000] and Dietterich et al. [1996] show that the $2\sqrt{p(1-p)}$ criterion normally leads to somewhat smaller unpruned decision trees, sometimes leads to more accurate trees, and never leads to much less accurate trees. Therefore we can recommend its use, and we can conclude that regardless of the impurity criterion, applying Theorem 1 is not likely to have much influence on the growing phase of decision tree learning.
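The invariance claim of Theorem 3 can be seen numerically. In the sketch below (ours; the per-class conditionals $P(a_k \mid j)$ are made up), the ratio of $I(A)$ between two attributes stays fixed as the base rate changes under the $2\sqrt{p(1-p)}$ criterion, but not under the Gini-style criterion $p(1-p)$.

```python
from math import sqrt

def impurity(cond_pos, cond_neg, b, f):
    """I(A) = sum_k P(a_k) f(p_k) with p_k = P(j=1|a_k), computed
    from P(a_k|j=1), P(a_k|j=0) and the base rate b = P(j=1)."""
    total = 0.0
    for q1, q0 in zip(cond_pos, cond_neg):
        p_ak = q1 * b + q0 * (1 - b)        # P(a_k)
        total += p_ak * f(q1 * b / p_ak)    # p_k by Bayes' rule
    return total

km = lambda p: 2 * sqrt(p * (1 - p))        # Kearns-Mansour, gamma = 0.5
gini = lambda p: p * (1 - p)                # gamma = 1

A1 = ([0.9, 0.1], [0.2, 0.8])               # (P(a_k|j=1), P(a_k|j=0))
A2 = ([0.7, 0.3], [0.4, 0.6])
for b in (0.5, 0.1):
    km_ratio = impurity(*A1, b, km) / impurity(*A2, b, km)
    gini_ratio = impurity(*A1, b, gini) / impurity(*A2, b, gini)
    print(b, round(km_ratio, 4), round(gini_ratio, 4))
    # the km ratio is the same for both base rates; the gini ratio is not
```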
4.3 Decision tree pruning

Standard methods for pruning decision trees are highly sensitive to the prevalence of different classes among training examples. If all classes except one are rare, then C4.5 often prunes the decision tree down to a single node that classifies all examples as members of the common class. Such a classifier is useless for decision-making if failing to recognize an example in a rare class is an expensive error.

Several papers have examined recently the issue of how to obtain good probability estimates from decision trees [Bradford et al., 1998; Provost and Domingos, 2000; Zadrozny and Elkan, 2001b]. It is clear that it is necessary to use a smoothing method to adjust the probability estimates at each leaf of a decision tree. It is not so clear what pruning methods are best.
The experiments of Bauer and Kohavi [1999] suggest that no pruning is best when using a decision tree with probability smoothing. The overall conclusion of Bradford et al. [1998] is that the best pruning is either no pruning or what they call "Laplace pruning." The idea of Laplace pruning is as follows (a sketch in code is given after the list):

1. Do Laplace smoothing: If $n$ training examples reach a node, of which $k$ are positive, let the estimate at this node of $P(j=1 \mid x)$ be $(k+1)/(n+2)$.

2. Compute the expected loss at each node using the smoothed probability estimates, the cost matrix, and the training set.

3. If the expected loss at a node is less than the sum of the expected losses at its children, prune the children.
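The sketch below is our rendering of these three steps; the Node structure is a hypothetical stand-in for whatever tree representation a learner actually uses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    count: int                      # n: training examples reaching the node
    positives: int                  # k: positive examples among them
    children: List["Node"] = field(default_factory=list)

def laplace_estimate(node):
    """Step 1: smoothed estimate (k + 1) / (n + 2) of P(j=1|x)."""
    return (node.positives + 1) / (node.count + 2)

def expected_loss(node, C):
    """Step 2: minimum expected cost at the node under the smoothed
    estimate, weighted by the number of training examples reaching it."""
    p = laplace_estimate(node)
    return node.count * min((1 - p) * C[i][0] + p * C[i][1] for i in (0, 1))

def laplace_prune(node, C):
    """Step 3: prune bottom-up wherever the node's own expected loss
    is less than the sum of its children's expected losses."""
    for child in node.children:
        laplace_prune(child, C)
    if node.children and expected_loss(node, C) < sum(
            expected_loss(c, C) for c in node.children):
        node.children = []
```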
We can show intuitively that Laplace pruning is similar to no pruning. In the absence of probability smoothing, the expected loss at a node is always greater than or equal to the sum of the expected losses at its children. Equality holds only if the optimal predicted class at each child is the same as the optimal predicted class at the parent. Therefore, in the absence of smoothing, step (3) cannot change the meaning of a decision tree, i.e. the classes predicted by the tree, so Laplace pruning is equivalent to no pruning.

With probability smoothing, if the expected loss at a node is less than the sum of the expected losses at its children, the difference must be caused by smoothing, so without smoothing there would presumably be equality. So pruning the children is still only a simplification that leaves the meaning of the tree unchanged. Note that the effect of Laplace smoothing is small at internal tree nodes, because at these nodes typically $k \gg 1$ and $n \gg 2$.

In summary, growing a decision tree can be done in a cost-insensitive way. When using a decision tree to estimate probabilities, it is preferable to do no pruning. If costs are example-dependent, then decisions should be made using smoothed probability estimates and Equation (1). If costs are fixed, i.e. there is a single well-defined cost matrix, then each node in the unpruned decision tree can be labeled with the optimal predicted class for that node. If all the leaves under a certain node are labeled with the same class, then the subtree under that node can be eliminated. This simplification makes the tree smaller but does not change its predictions.
5 Conclusions

This paper has reviewed the basic concepts behind optimal learning and decision-making when different misclassification errors cause different losses. For the two-class case, we have shown rigorously how to increase or decrease the proportion of negative examples in a training set in order to make optimal cost-sensitive classification decisions using a classifier learned by a standard non-cost-sensitive learning method. However, we have investigated the behavior of Bayesian and decision tree learning methods, and concluded that changing the balance of negative and positive training examples has little effect on learned classifiers. Accordingly, the recommended way of using one of these methods in a domain with differing misclassification costs is to learn a classifier from the training set as given, and then to use Equation (1) or Equation (2) directly, after smoothing probability estimates and/or adjusting the threshold of Equation (2) empirically if necessary.

References

[Bauer and Kohavi, 1999] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–139, 1999.

[Bradford et al., 1998] J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. Brodley. Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, pages 131–136, 1998.

[Breiman et al., 1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.

[Dietterich et al., 1996] T. G. Dietterich, M. Kearns, and Y. Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 96–104. Morgan Kaufmann, 1996.

[Domingos and Pazzani, 1996] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112. Morgan Kaufmann, 1996.

[Drummond and Holte, 2000] Chris Drummond and Robert C. Holte. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 239–246, 2000.

[Japkowicz, 2000] N. Japkowicz. The class imbalance problem: Significance and strategies. In Proceedings of the International Conference on Artificial Intelligence, Las Vegas, June 2000.

[Kearns and Mansour, 1996] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Annual ACM Symposium on the Theory of Computing, pages 459–468. ACM Press, 1996.

[Kubat and Matwin, 1997] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided sampling. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997.

[Margineantu, 2000] Dragos Margineantu. On class probability estimates and cost-sensitive evaluation of classifiers. In Workshop Notes, Workshop on Cost-Sensitive Learning, International Conference on Machine Learning, June 2000.

[Michie et al., 1994] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[Provost and Domingos, 2000] Foster Provost and Pedro Domingos. Well-trained PETs: Improving probability estimation trees. Technical Report CDER #00-04-IS, Stern School of Business, New York University, 2000.

[Weiss and Provost, 2001] Gary M. Weiss and Foster Provost. The effect of class distribution on classifier learning. Technical Report ML-TR 43, Department of Computer Science, Rutgers University, 2001.

[Zadrozny and Elkan, 2001a] Bianca Zadrozny and Charles Elkan. Learning and making decisions when costs and probabilities are both unknown. Technical Report CS2001-0664, Department of Computer Science and Engineering, University of California, San Diego, January 2001.

[Zadrozny and Elkan, 2001b] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. To appear.
