Classification Bayes
X = (shape=round, color=red): apple or banana?
• The observed attributes (shape=round, color=red) are the evidence X; the class (apple or banana) is the hypothesis H
• Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
• P(H|X) = P(H∩X)/P(X) = (P(X|H)*P(H))/P(X)
  P(H1|X) = P(H1∩X)/P(X) = (P(X|H1)*P(H1))/P(X)
  P(H2|X) = P(H2∩X)/P(X) = (P(X|H2)*P(H2))/P(X)
Example
Fruits: apple or banana
P(H|X) = P(H∩X)/P(X) = (P(X|H)*P(H))/P(X)
X = (shape=round, color=red)
P(H=apple|X) ∝ P((shape=round and color=red)|apple) * P(apple)
P(H=banana|X) ∝ P((shape=round and color=red)|banana) * P(banana)
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci)
• MLE: each factor P(xk|Ci) and the prior P(Ci) are estimated from the training data by maximum likelihood, i.e., as relative frequencies
Assume the attributes are conditionally independent
Fruits: apple or banana
X = (shape=round, color=red)
P(H=apple|X) ∝ P((shape=round and color=red)|apple) * P(apple)
            = P(shape=round|apple) * P(color=red|apple) * P(apple)
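As a minimal sketch of this factorization, the snippet below estimates each factor by counting and then scores both classes. The labeled fruits are hypothetical data made up for illustration, not taken from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical labeled fruits; the counts are made up for illustration only
data = [
    ("round", "red", "apple"), ("round", "green", "apple"), ("round", "red", "apple"),
    ("long", "yellow", "banana"), ("long", "yellow", "banana"), ("long", "green", "banana"),
]

class_counts = Counter(label for _, _, label in data)
shape_counts = defaultdict(Counter)   # shape_counts[label][shape]
color_counts = defaultdict(Counter)   # color_counts[label][color]
for shape, color, label in data:
    shape_counts[label][shape] += 1
    color_counts[label][color] += 1

def score(shape, color, label):
    """P(shape|label) * P(color|label) * P(label): the factorized numerator above."""
    n = class_counts[label]
    return (shape_counts[label][shape] / n) * (color_counts[label][color] / n) * (n / len(data))

x = ("round", "red")
scores = {label: score(*x, label) for label in class_counts}
print(scores)                        # {'apple': 0.333..., 'banana': 0.0}
print(max(scores, key=scores.get))   # -> 'apple'
```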
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
• Decision rule: if P(yes|X) > P(no|X), classify X => yes
Bayesian Theorem: Basics
Recall a few probability basics
• For events a and b:
  p(a|b) = p(a ∧ b) / p(b)
  p(b|a) = p(a ∧ b) / p(a)
  p(a ∧ b) = p(a|b) * p(b) = p(b|a) * p(a)
• Bayes' Rule:
  p(H|X) = p(H ∧ X) / p(X) = p(X|H) * p(H) / p(X)
  (posterior: p(H|X); likelihood: p(X|H); prior: p(H))
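As a quick arithmetic check of this rule, here is a worked instance with hypothetical numbers (not taken from the slides):

```latex
% Hypothetical values chosen only to illustrate the arithmetic of Bayes' rule
P(X \mid H) = 0.6, \qquad P(H) = 0.5, \qquad P(X) = 0.4
\;\Longrightarrow\;
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} = \frac{0.6 \times 0.5}{0.4} = 0.75
```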
Bayesian Theorem
• MLE: in practice, P(X|H) and P(H) are estimated from the training data by maximum likelihood (relative frequencies)
Naïve Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
Naïve Bayesian Classifier: An Example
• X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
• P(yes|X) ∝ P(yes) * P(age<=30|yes) * P(I=m|yes) * P(S=yes|yes) * P(CR=Fair|yes)
• P(no|X) ∝ P(no) * P(age<=30|no) * P(I=m|no) * P(S=yes|no) * P(CR=Fair|no)
Naïve Bayesian Classifier: An Example
• P(yes|X) ∝ P(yes) * P(age<=30|yes) * P(I=m|yes) * P(S=yes|yes) * P(CR=Fair|yes)
P(yes) = 9/14 = 0.643
P(age<=30|yes) = 2/9
P(I=m|yes) = 4/9
P(S=yes|yes) = 6/9
P(CR=Fair|yes) = 6/9
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
• Compute P(X|yes):
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(yes|X) ∝ P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 * 0.643 = 0.028
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|no):
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
P(no|X) ∝ P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 * 0.357 = 0.007
Since 0.028 > 0.007, P(yes|X) > P(no|X)
Therefore, X belongs to class (“buys_computer = yes”)
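The same arithmetic, as a small Python sketch that uses exactly the fractions listed above:

```python
from math import prod

# Priors and per-attribute likelihoods for
# X = (age<=30, income=medium, student=yes, credit_rating=fair),
# copied directly from the counts computed above
params = {
    "yes": {"prior": 9/14, "likelihoods": [2/9, 4/9, 6/9, 6/9]},
    "no":  {"prior": 5/14, "likelihoods": [3/5, 2/5, 1/5, 2/5]},
}

def score(cls):
    """Un-normalized posterior P(X|Ci) * P(Ci) under the independence assumption."""
    return params[cls]["prior"] * prod(params[cls]["likelihoods"])

for cls in params:
    print(cls, round(score(cls), 3))       # yes 0.028, no 0.007

print("prediction:", max(params, key=score))   # buys_computer = yes
```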
Text classification example (Bernoulli model)
      Chinese  Beijing  Shanghai  Macao  Tokyo  Japan  Class label
D1       1        1        0        0      0      0      yes
D2       1        0        1        0      0      0      yes
D3       1        0        0        1      0      0      yes
D4       1        0        0        0      1      1      no
D5       1        0        0        0      1      1      ?
Smoothing to Avoid Overfitting
• Tokyo and Japan never occur in a “yes” training document, so the raw estimates P(Tokyo|yes) and P(Japan|yes) are 0, which forces P(yes|D5) = 0 regardless of the other attributes
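A minimal sketch of add-one (Laplace) smoothing for the Bernoulli estimates above; the slides do not show which smoothing constant is used, so the value 1 here is an assumption:

```python
def bernoulli_estimate(docs_with_word, docs_in_class, smooth=1):
    """P(word present | class) with add-one smoothing over the two outcomes
    (present / absent). With smooth=0 this reduces to the raw MLE."""
    return (docs_with_word + smooth) / (docs_in_class + 2 * smooth)

# Tokyo appears in 0 of the 3 'yes' documents, so the raw MLE gives
# P(Tokyo=1|yes) = 0 and wipes out the whole product for D5. Smoothing avoids that:
print(bernoulli_estimate(0, 3, smooth=0))   # 0.0  (unsmoothed)
print(bernoulli_estimate(0, 3))             # 0.2  (smoothed)
print(bernoulli_estimate(3, 3))             # 0.8  P(Chinese=1|yes), smoothed
```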
Text classification
• Model 2: Multinomial = Class conditional unigram
– One feature Xi for each word position in the document
  • The feature’s values are all words in the dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, a word in one position in the document tells us nothing about words in other positions
• Word appearance does not depend on position
P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, every word w, and every class c
cNB = argmax over cj ∈ C of P(cj) * ∏i P(xi|cj)
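A minimal sketch of this class-conditional unigram model in Python. The tokenized documents below are hypothetical stand-ins (the slides show only the binary Bernoulli table, not the raw token counts), and add-one smoothing is assumed:

```python
import math
from collections import Counter

# Hypothetical tokenized training documents (illustrative only)
train = [
    (["chinese", "beijing", "chinese"], "yes"),
    (["chinese", "chinese", "shanghai"], "yes"),
    (["chinese", "macao"], "yes"),
    (["tokyo", "japan", "chinese"], "no"),
]
vocab = {w for doc, _ in train for w in doc}

class_counts = Counter(c for _, c in train)          # documents per class
word_counts = {c: Counter() for c in class_counts}   # word counts per class
for doc, c in train:
    word_counts[c].update(doc)

def log_posterior(doc, c):
    """log P(c) + sum_i log P(x_i|c) with add-one smoothing; the same P(w|c)
    is used for every position, as the position-independence assumption requires."""
    total = sum(word_counts[c].values())
    logp = math.log(class_counts[c] / len(train))
    for w in doc:
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

test = ["chinese", "chinese", "chinese", "tokyo", "japan"]       # a hypothetical test doc
print(max(class_counts, key=lambda c: log_posterior(test, c)))   # -> 'yes'
```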
(Table: per-class word-count vectors contrasting the Bernoulli and multinomial representations)
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Belief Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
– Represents dependency among the variables
– Gives a specification of joint probability distribution
Nodes: random variables
Links: dependency
Example graph with nodes X, Y, Z, P:
  X and Y are the parents of Z, and Y is the parent of P
  There is no dependency between Z and P
  The graph has no loops or cycles
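A small sketch of the joint-distribution factorization this graph specifies, P(X,Y,Z,P) = P(X) * P(Y) * P(Z|X,Y) * P(P|Y); the conditional probability tables below are made-up numbers for illustration only:

```python
from itertools import product

# Hypothetical CPTs for the network described above: X and Y are parents of Z,
# Y is the parent of P. All numbers are illustrative, not from the slides.
p_x = {True: 0.3, False: 0.7}                       # P(X)
p_y = {True: 0.6, False: 0.4}                       # P(Y)
p_z = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}     # P(Z=True | X, Y)
p_p = {True: 0.7, False: 0.2}                       # P(P=True | Y)

def joint(x, y, z, p):
    """P(X=x, Y=y, Z=z, P=p) = P(x) * P(y) * P(z|x,y) * P(p|y)."""
    pz = p_z[(x, y)] if z else 1 - p_z[(x, y)]
    pp = p_p[y] if p else 1 - p_p[y]
    return p_x[x] * p_y[y] * pz * pp

# The factorized joint is a proper distribution: it sums to 1 over all 16 states
print(sum(joint(*state) for state in product([True, False], repeat=4)))  # 1.0 (up to rounding)
print(joint(True, True, True, False))                                    # one joint probability
```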
Bayesian Belief Network: An Example