
Lecture 7


Lecture 7: Naïve Bayes

Naïve Bayes Classifier


• It is a classification technique based on Bayes' theorem with an independence assumption among features (predictors).

• The Naïve Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
Bayes Theorem
▪ Given a class C and a feature X which bears on the class:

      P(C | X) = \frac{P(X | C) \, P(C)}{P(X)}

▪ P(C): independent probability of C (hypothesis): prior probability
▪ P(X): independent probability of X (data, predictor)
▪ P(X|C): conditional probability of X given C: likelihood
▪ P(C|X): conditional probability of C given X: posterior probability
Maximum A Posteriori
▪ Based on Bayes' theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data.
▪ We are interested in the best hypothesis from some space C given observed training data X.

      c_{MAP} \equiv \arg\max_{c \in C} P(c | X)
             = \arg\max_{c \in C} \frac{P(X | c) \, P(c)}{P(X)}
             = \arg\max_{c \in C} P(X | c) \, P(c)

C: set of all hypotheses (classes).

Note that we can drop P(X) because the probability of the data is constant (and independent of the hypothesis).
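
To make the MAP rule concrete, here is a minimal sketch (not from the lecture) that scores two hypothetical classes with made-up prior and likelihood values and picks the argmax of P(X|c)P(c); P(X) is never computed, exactly as the note above allows.

```python
# MAP rule sketch with hypothetical numbers (illustration only).
priors = {"spam": 0.4, "ham": 0.6}          # P(c), assumed values
likelihoods = {"spam": 0.05, "ham": 0.01}   # P(X | c) for one observed X, assumed values

# Unnormalized posteriors P(X | c) * P(c); P(X) is dropped because it is
# the same for every class and does not change the argmax.
scores = {c: likelihoods[c] * priors[c] for c in priors}
c_map = max(scores, key=scores.get)

print(scores)  # {'spam': 0.02, 'ham': 0.006}
print(c_map)   # spam
```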
Bayes Classifiers
Assumption: the training set consists of instances of different classes c_j described as conjunctions of attribute values.
Task: classify a new instance d, given as a tuple of attribute values, into one of the classes c_j \in C.
Key idea: assign the most probable class c_{MAP} using Bayes' theorem.

      c_{MAP} = \arg\max_{c_j \in C} P(c_j | x_1, x_2, \ldots, x_n)
              = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n | c_j) \, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
              = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n | c_j) \, P(c_j)
The Naïve Bayes Model

▪ The Naïve Bayes assumption: assume that the effect of the value of a predictor (X) on a given class (C) is independent of the values of the other predictors.

▪ This assumption is called class conditional independence:


      P(x_1, x_2, \ldots, x_n | C) = P(x_1 | C) \times P(x_2 | C) \times \cdots \times P(x_n | C) = \prod_{i=1}^{n} P(x_i | C)
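
As a quick illustration of this factorization (a sketch with made-up per-feature likelihoods, not values from the lecture), the joint class-conditional probability is simply the product of the individual conditionals:

```python
from math import prod

# Hypothetical per-feature conditional probabilities P(x_i | C) for one class.
per_feature_likelihoods = [0.2, 0.5, 0.3, 0.4]

# Under class conditional independence, P(x_1, ..., x_n | C) factorizes into
# the product of the individual P(x_i | C).
joint = prod(per_feature_likelihoods)
print(joint)  # 0.012
```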
Naïve Bayes Algorithm
• The Naïve Bayes algorithm (for discrete input attributes) has two phases:
– 1. Learning Phase: Given a training set S, learning is easy; just create probability tables.

      For each target value c_i (c_i = c_1, \ldots, c_L):
          \hat{P}(C = c_i) \leftarrow estimate P(C = c_i) with examples in S
          For every attribute value x_{jk} of each attribute X_j (j = 1, \ldots, n; k = 1, \ldots, N_j):
              \hat{P}(X_j = x_{jk} | C = c_i) \leftarrow estimate P(X_j = x_{jk} | C = c_i) with examples in S

      Output: conditional probability tables; for each X_j, N_j \times L elements.

– 2. Test Phase: Given an unknown instance X' = (a_1, \ldots, a_n), look up the tables to assign the label c^* to X' if

      [\hat{P}(a_1 | c^*) \cdots \hat{P}(a_n | c^*)] \, \hat{P}(c^*) > [\hat{P}(a_1 | c) \cdots \hat{P}(a_n | c)] \, \hat{P}(c), \quad c \neq c^*, \; c = c_1, \ldots, c_L

Classification is easy; just multiply the probabilities (see the sketch below).
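
Below is a minimal sketch of both phases for discrete attributes (illustrative code, not the lecture's implementation): the learning phase builds the probability tables by counting, and the test phase multiplies the looked-up probabilities by the class prior and takes the argmax.

```python
from collections import Counter, defaultdict

def learn(training_set):
    """Learning phase: estimate P(C = c) and P(X_j = x | C = c) by counting."""
    class_counts = Counter(label for _, label in training_set)
    n_examples = len(training_set)
    priors = {c: count / n_examples for c, count in class_counts.items()}

    # cond_counts[c][j][value] = number of class-c examples with attribute j == value
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in training_set:
        for j, value in enumerate(features):
            cond_counts[label][j][value] += 1

    cond_tables = {
        c: {j: {v: cnt / class_counts[c] for v, cnt in counter.items()}
            for j, counter in attrs.items()}
        for c, attrs in cond_counts.items()
    }
    return priors, cond_tables

def classify(instance, priors, cond_tables):
    """Test phase: multiply the looked-up conditionals and the class prior, take the argmax."""
    scores = {}
    for c in priors:
        score = priors[c]
        for j, value in enumerate(instance):
            # An unseen value yields probability 0 here; see the m-estimate remedy later.
            score *= cond_tables[c].get(j, {}).get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Given the Play Tennis data of the next slides as a list `data` of (feature-tuple, label) pairs, a call like `classify(("Sunny", "Cool", "High", "Strong"), *learn(data))` would reproduce the decision worked out below.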
Example
• Example: Play Tennis
Example
• Learning Phase

  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5            Hot          2/9       2/5
  Overcast   4/9       0/5            Mild         4/9       2/5
  Rain       3/9       2/5            Cool         3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High       3/9       4/5            Strong       3/9       3/5
  Normal     6/9       1/5            Weak         6/9       2/5

  P(Play=Yes) = 9/14                  P(Play=No) = 5/14


Example
• Test Phase

– Given a new instance, predict its label:

  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

– Look up the tables obtained in the learning phase:

  P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
  P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
  P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
  P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
  P(Play=Yes) = 9/14                       P(Play=No) = 5/14

– Decision making with the MAP rule:

  P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
  P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) ≈ 0.0206

Since P(Yes | x') < P(No | x'), we label x' as "No".
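
The two products above can be verified directly with the table values from this example (a small check, not part of the lecture):

```python
from fractions import Fraction as F

# Scores proportional to P(Yes | x') and P(No | x'): product of the
# looked-up conditionals and the class prior.
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))  # ~0.0053
print(float(score_no))   # ~0.0206
print("No" if score_no > score_yes else "Yes")  # No
```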


Naïve Bayes
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take infinitely many values.
– The conditional probability is often modeled with the normal distribution:

      \hat{P}(x_j | c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(x_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)

      \mu_{ji}: mean (average) of the feature values x_j of the examples for which C = c_i
      \sigma_{ji}: standard deviation of the feature values x_j of the examples for which C = c_i

– Learning Phase: for X = (X_1, \ldots, X_n) and C = c_1, \ldots, c_L
  Output: n \times L normal distributions and P(C = c_i), i = 1, \ldots, L

– Test Phase: Given an unknown instance X' = (a_1, \ldots, a_n)
  • Instead of looking up tables, calculate the conditional probabilities with all the normal distributions obtained in the learning phase
  • Apply the MAP rule to assign a label (the same as in the discrete case)
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally a continuous-valued feature.
  Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
  No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:

      \mu = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2

      \mu_{Yes} = 21.64, \; \sigma_{Yes} = 2.35 \qquad \mu_{No} = 23.88, \; \sigma_{No} = 7.09

– Learning Phase: output two Gaussian models for P(temp | C):

      \hat{P}(x | Yes) = \frac{1}{2.35\sqrt{2\pi}} \exp\!\left(-\frac{(x - 21.64)^2}{2 \cdot 2.35^2}\right)

      \hat{P}(x | No) = \frac{1}{7.09\sqrt{2\pi}} \exp\!\left(-\frac{(x - 23.88)^2}{2 \cdot 7.09^2}\right)
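
These numbers can be reproduced with a short sketch (not from the lecture). Note that the slide's 2.35 and 7.09 correspond to the sample standard deviation (dividing by N − 1), which is what the code below uses:

```python
import math

temps_yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
temps_no = [27.3, 30.1, 17.4, 29.5, 15.1]

def mean_std(xs):
    """Mean and sample standard deviation (N - 1 denominator)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

def normal_pdf(x, mu, sigma):
    """Gaussian class-conditional density P(x | c)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_yes, sd_yes = mean_std(temps_yes)  # ~21.64, ~2.35
mu_no, sd_no = mean_std(temps_no)     # ~23.88, ~7.09

# Compare the two class-conditional densities for an example temperature reading.
x = 22.0
print(normal_pdf(x, mu_yes, sd_yes), normal_pdf(x, mu_no, sd_no))
```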
Zero conditional probability
• If no training example contains a given attribute value
– In this circumstance we face a zero conditional probability problem at test time:

      \hat{P}(x_1 | c_i) \cdots \hat{P}(a_{jk} | c_i) \cdots \hat{P}(x_n | c_i) = 0 \quad \text{if } \hat{P}(a_{jk} | c_i) = 0 \text{ for } x_j = a_{jk}

– As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:

      \hat{P}(a_{jk} | c_i) = \frac{n_c + m p}{n + m}

      n_c: number of training examples for which x_j = a_{jk} and c = c_i
      n: number of training examples for which c = c_i
      p: prior estimate (usually p = 1/t for t possible values of x_j)
      m: weight given to the prior (number of "virtual" examples, m \ge 1)
Zero conditional probability
• Example: P(Outlook=Overcast | Play=No) = 0 in the Play Tennis dataset
– Add m "virtual" examples (m: up to 1% of the number of training examples)
  • In this dataset, the number of training examples for the "No" class is 5.
  • So we can only add m = 1 "virtual" example in our m-estimate remedy.
– The "Outlook" feature can take only 3 values (Sunny, Overcast, Rain), so p = 1/3.
– Re-estimate P(Outlook=Overcast | Play=No) with the m-estimate:

      n_c = 0 (number of samples with Outlook=Overcast and Play=No)
      n = 5 (number of samples with Play=No)
      p = 1/3
      m = 1

      \hat{P}(\text{Overcast} | \text{No}) = \frac{0 + 1 \cdot (1/3)}{5 + 1} = \frac{1}{18} \approx 0.056
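
The same re-estimate can be computed with a small helper implementing the m-estimate formula above, applied to the Overcast/No counts from this example:

```python
from fractions import Fraction as F

def m_estimate(n_c, n, p, m):
    """m-estimate of a class-conditional probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Outlook=Overcast within the Play=No class: n_c = 0 of n = 5 examples,
# prior p = 1/3 (Outlook has 3 values), m = 1 virtual example.
p_overcast_no = m_estimate(n_c=0, n=5, p=F(1, 3), m=1)
print(p_overcast_no, float(p_overcast_no))  # 1/18 ≈ 0.056
```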
Conclusion
▪ Naïve Bayes is based on the independence assumption.
▪ Training is very easy and fast; it only requires considering each attribute in each class separately.
▪ Testing is straightforward; it amounts to looking up tables or calculating conditional probabilities with normal distributions.

▪ Naïve Bayes
  • The performance of Naïve Bayes is competitive with most state-of-the-art classifiers, even when the independence assumption is violated.
  • It has many successful applications, e.g., spam mail filtering.
