Lecture 7
$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

$$c^{*} = \operatorname*{argmax}_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \operatorname*{argmax}_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$$
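As a rough sketch (not from the slides), the MAP decision rule above can be written in Python using the naïve factorization of the joint likelihood; the prior and likelihood tables here are hypothetical placeholders that would be estimated from training data:

```python
# Illustrative sketch of the MAP rule with the naive factorization
# P(x1,...,xn | c) = prod_j P(x_j | c). All names here are hypothetical.

def naive_bayes_predict(x, classes, prior, likelihood):
    """x: list of feature values; prior[c] = P(c);
    likelihood[c][j][v] = estimated P(x_j = v | c)."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = prior[c]                           # P(c_j)
        for j, v in enumerate(x):
            score *= likelihood[c][j].get(v, 0.0)  # P(x_j | c_j); 0 if unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```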
The Naïve Bayes Model
$$\hat{P}(x \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}} \exp\!\left(-\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2}\right) = \frac{1}{\sqrt{2\pi}\cdot 7.09} \exp\!\left(-\frac{(x-23.88)^2}{2 \cdot 50.25}\right)$$

with $\hat{\mu} = 23.88$, $\hat{\sigma} = 7.09$ ($\hat{\sigma}^2 \approx 50.25$).
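As an illustration (my own sketch, not part of the slide), this Gaussian class-conditional density can be evaluated in Python with the class-"No" estimates above:

```python
import math

# Sketch: Gaussian (normal) class-conditional density, using the
# mean and standard deviation estimated for the "No" class on the slide.
def gaussian_likelihood(x, mu=23.88, sigma=7.09):
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
```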
Zero conditional probability
• If no training example of a class contains a given feature value
– In this circumstance, we face a zero conditional probability problem at test time
$$\hat{P}(x_1 \mid c_i) \cdots \hat{P}(a_{jk} \mid c_i) \cdots \hat{P}(x_n \mid c_i) = 0 \quad \text{for } x_j = a_{jk},\ \hat{P}(a_{jk} \mid c_i) = 0$$

$$\hat{P}(a_{jk} \mid c_i) = \frac{n_c + m\,p}{n + m} \quad \text{(m-estimate)}$$

$n_c$: number of training examples for which $x_j = a_{jk}$ and $c = c_i$
$n$: number of training examples for which $c = c_i$
$p$: prior estimate (usually, $p = 1/t$ for $t$ possible values of $x_j$)
$m$: weight given to the prior (number of "virtual" examples, $m \ge 1$)
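A minimal sketch of the m-estimate as a Python function (the function name and signature are my own, chosen to mirror the symbols above):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(a_jk | c_i).
    n_c: # training examples with x_j = a_jk and c = c_i
    n:   # training examples with c = c_i
    p:   prior estimate, e.g. 1/t for t possible values of x_j
    m:   weight of the prior (# of "virtual" examples)."""
    return (n_c + m * p) / (n + m)
```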
Zero conditional probability
• Example: P(outlook=overcast|no)=0 in the play-tennis dataset
– Adding m "virtual" examples (m: up to 1% of the number of training examples)
• In this dataset, the number of training examples for the "no" class is 5.
• So we can add only m=1 "virtual" example in our m-estimate remedy.
– The "outlook" feature can take only 3 values, so p=1/3.
– Re-estimate P(outlook|no) with the m-estimate (see the calculation below)
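Plugging the numbers above into the m-estimate (a worked step, assuming $n_c = 0$, $n = 5$, $p = 1/3$, $m = 1$ as stated):

$$\hat{P}(\text{outlook}=\text{overcast} \mid \text{no}) = \frac{n_c + m\,p}{n + m} = \frac{0 + 1 \cdot \frac{1}{3}}{5 + 1} = \frac{1}{18} \approx 0.056$$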
▪ Naïve Bayes
• The performance of naïve Bayes is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
• It has many successful applications, e.g., spam mail filtering