
Classification Bayes


BAYESIAN CLASSIFICATION

Bayesian Classification: Why?


• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes’ Theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Example
Fruits: apple or banana
Make observations on two features: shape and color.

Observation (the evidence): (shape=round, color=red)
Question (the hypothesis): apple or banana?

Decide apple if
p(apple | shape=round, color=red) > p(banana | shape=round, color=red)
Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X:

• P(H|X) = P(H∩X)/P(X) = P(X|H)·P(H)/P(X)

Decision rule:
If P(H1|X) > P(H2|X), then X => H1
If P(H1|X) < P(H2|X), then X => H2

where
P(H1|X) = P(H1∩X)/P(X) = P(X|H1)·P(H1)/P(X)
P(H2|X) = P(H2∩X)/P(X) = P(X|H2)·P(H2)/P(X)
Example
Fruits: apple or banana

P(H|X) = P(H∩X)/P(X) = P(X|H)·P(H)/P(X)

X = (shape=round, color=red)
P(H=apple|X) ∝ P(shape=round, color=red | apple) · P(apple)
P(H=banana|X) ∝ P(shape=round, color=red | banana) · P(banana)
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
  P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• The probabilities P(xk|Ci) are estimated by maximum likelihood (relative frequencies in the training data)

Assume independence
Fruits: apple or banana

X = (shape=round, color=red)
P(H=apple|X) ∝ P(shape=round, color=red | apple) · P(apple)

With the independence assumption this factorizes:
P(shape=round, color=red | apple) = P(shape=round | apple) · P(color=red | apple)

so P(H=apple|X) ∝ P(shape=round | apple) · P(color=red | apple) · P(apple)
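To make the factorization concrete, here is a minimal Python sketch that estimates the per-feature conditional probabilities from a tiny, made-up fruit table (the data and counts below are hypothetical, not from the slides):

```python
from collections import Counter, defaultdict

# Tiny, hypothetical training set: (shape, color, label). Counts are made up for illustration.
data = [
    ("round", "red", "apple"), ("round", "green", "apple"), ("round", "red", "apple"),
    ("long", "yellow", "banana"), ("long", "green", "banana"), ("round", "yellow", "banana"),
]

class_counts = Counter(row[-1] for row in data)          # numerators for P(class)
feature_counts = defaultdict(Counter)                    # feature_counts[label][(feature, value)]
for shape, color, label in data:
    feature_counts[label][("shape", shape)] += 1
    feature_counts[label][("color", color)] += 1

def score(label, shape, color):
    """Unnormalized naive Bayes score: P(shape|class) * P(color|class) * P(class)."""
    prior = class_counts[label] / len(data)
    p_shape = feature_counts[label][("shape", shape)] / class_counts[label]
    p_color = feature_counts[label][("color", color)] / class_counts[label]
    return p_shape * p_color * prior

x = ("round", "red")
print(max(class_counts, key=lambda c: score(c, *x)))     # -> "apple" for these made-up counts
```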
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X

  If P(yes|X) > P(no|X), then X => yes
  If P(yes|X) < P(no|X), then X => no
Bayesian Theorem: Basics

• P(H) (prior probability): the initial probability
  – E.g., the probability that X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  – E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Recall a few probability basics
• For events a and b:
  P(a|b) = P(a ∩ b) / P(b)
  P(b|a) = P(a ∩ b) / P(a)
  P(a ∩ b) = P(a|b) P(b) = P(b|a) P(a)

• Bayes’ Rule:
  P(H|X) = P(H ∩ X) / P(X) = P(X|H) P(H) / P(X)
  Posterior: P(H|X)   Likelihood: P(X|H)   Prior: P(H)
Bayesian Theorem

• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as
  posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum a posteriori class, i.e., the class with maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
  P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• The probabilities P(xk|Ci) are estimated by maximum likelihood (relative frequencies in the training data)

Naïve Bayesian Classifier: Training Dataset

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

Data sample
X = (age <= 30,
     income = medium,
     student = yes,
     credit_rating = fair)

Naïve Bayesian Classifier: An Example
• X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

• P(no|X) ∝ P(no) · P(age <= 30 | no) · P(I=m | no) · P(S=yes | no) · P(CR=Fair | no)
Naïve Bayesian Classifier: An Example
• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

  P(yes) = 9/14 = 0.643
  P(age <= 30 | yes) = 2/9
  P(I=m | yes) = 4/9
  P(S=yes | yes) = 6/9
  P(CR=Fair | yes) = 6/9
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
• Compute P(xk | buys_computer = “yes”) for each attribute value in X:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667

X = (age <= 30 , income = medium, student = yes, credit_rating = fair)


P(X|yes) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(yes|X) ∝ P(X|yes) · P(yes):
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028

Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “no”) = 5/14= 0.357
• Compute P(xk | buys_computer = “no”) for each attribute value in X:
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)


P(X|no): P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(no|X) ∝ P(X|no) · P(no):
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007

P(yes|X) = 0.028 > P(no|X) = 0.007
Therefore, X belongs to class “buys_computer = yes”
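The whole comparison can be reproduced in a few lines of Python; the sketch below simply re-multiplies the conditional probabilities given on the slides rather than re-deriving them from the training table (which is not reproduced in this text):

```python
# Conditional probabilities taken from the slides for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair)
p_yes = 9 / 14
p_no = 5 / 14

likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # ≈ 0.044
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # ≈ 0.019

score_yes = likelihood_yes * p_yes   # ≈ 0.028
score_no = likelihood_no * p_no      # ≈ 0.007

print("yes" if score_yes > score_no else "no")  # -> "yes"
```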
Smoothing to Avoid Overfitting

• To eliminate zero probabilities, we use add-one (Laplace) smoothing

• Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
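A minimal sketch of the correction in Python, using the income counts above (the helper function is illustrative, not code from the slides):

```python
def laplace_estimate(count: int, total: int, num_values: int) -> float:
    """Add-one (Laplace) smoothed probability estimate."""
    return (count + 1) / (total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                      # 1000 tuples
for value, count in counts.items():
    # low -> 1/1003, medium -> 991/1003, high -> 11/1003
    print(value, laplace_estimate(count, total, num_values=len(counts)))
```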
Smoothing to Avoid Overfitting
• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

  P(age <= 30 | yes) = (2+1)/(9+3)
  P(I=m | yes) = (4+1)/(9+3)
  P(S=yes | yes) = (6+1)/(9+2)
  P(CR=Fair | yes) = (6+1)/(9+2)
Practice
Student  Gender  Rating     Class label
Yes      M       Excellent  Y
Yes      M       Fair       Y
Yes      F       Fair       Y
Yes      F       Good       N
No       M       Fair       Y
No       M       Excellent  N

<student=yes, gender=F, Rating=Fair>
Class label = ?
Underflow Prevention: log space
• Multiplying lots of probabilities, which are between 0 and 1
by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all
computations by summing logs of probabilities rather than
multiplying probabilities.
• Class with highest final un-normalized log probability score is
still the most probable.
c NB  argmax log P(c j ) 
c jC
 log P( x | c )
i positions
i j

• Note that model is now just max of sum of weights…
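For instance, the buys_computer comparison from the earlier slides can be redone in log space; a minimal Python sketch:

```python
import math

def log_score(prior: float, cond_probs: list[float]) -> float:
    """log P(c) plus the sum of log conditional probabilities (log of the NB product)."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# Numbers from the buys_computer example on the earlier slides.
score_yes = log_score(0.643, [0.222, 0.444, 0.667, 0.667])
score_no = log_score(0.357, [0.6, 0.4, 0.2, 0.4])
print("yes" if score_yes > score_no else "no")  # -> "yes", same decision as before
```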


Text classification
• Model 1: Bernoulli
– One feature Xw for each word in dictionary
– Xw = true in document d if w appears in d
– Naive Bayes assumption:
• Given the document’s topic, appearance of one word in the
document tells us nothing about chances that another word
appears
• This is the model used in the binary independence
model in classic probabilistic relevance feedback
in hand-classified data
Example

     Chinese  Beijing  Shanghai  Macao  Tokyo  Japan  Class label
D1   1        1        0         0      0      0      yes
D2   1        0        1         0      0      0      yes
D3   1        0        0         1      0      0      yes
D4   1        0        0         0      1      1      no
D5   1        0        0         0      1      1      ?
Text classification example (Bernoulli model)
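The worked answer for this slide is not included in the extracted text. A sketch of the usual Bernoulli computation on the table above, assuming add-one smoothing over document counts (so absent words also contribute factors), is shown below; under these assumptions it classifies D5 as “no”:

```python
# Bernoulli NB with add-one smoothing, using document counts from the table above.
# This is a reconstruction of the usual computation, not code from the slides.
vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
docs = {  # word-presence sets for D1..D4 and their class labels
    "D1": ({"Chinese", "Beijing"}, "yes"),
    "D2": ({"Chinese", "Shanghai"}, "yes"),
    "D3": ({"Chinese", "Macao"}, "yes"),
    "D4": ({"Chinese", "Tokyo", "Japan"}, "no"),
}

def bernoulli_score(cls, test_words):
    n_cls = sum(1 for words, label in docs.values() if label == cls)
    score = n_cls / len(docs)                               # prior P(cls)
    for w in vocab:
        n_w = sum(1 for words, label in docs.values() if label == cls and w in words)
        p_w = (n_w + 1) / (n_cls + 2)                       # add-one smoothing over doc counts
        score *= p_w if w in test_words else (1 - p_w)      # absent words count too
    return score

d5 = {"Chinese", "Tokyo", "Japan"}
print(bernoulli_score("yes", d5))   # ≈ 0.005
print(bernoulli_score("no", d5))    # ≈ 0.022  -> D5 classified as "no"
```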
Text classification
• Model 2: Multinomial = Class conditional unigram
– One feature Xi for each word pos in document
• feature’s values are all words in dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, word in one position in the document tells us
nothing about words in other positions
• Word appearance does not depend on position

P( X i  w | c)  P( X j  w | c)
for all positions i,j, word w, and class c

• Just have one multinomial feature predicting all words


Using Multinomial Naive Bayes Classifiers to
Classify Text: Basic method

• Attributes are text positions, values are words.

c NB  argmax P (c j ) P( xi | c j )
c j C i

 argmax P (c j ) P( x1 " our" | c j )  P ( xn " text" | c j )


c j C

 Assume that classification is independent of the positions of the words


 Use same parameters for each position
 Result is bag of words model (over tokens not types)

     Beijing  Chinese  Macao  Japan  Tokyo  Shanghai
yes  1        5        1      0      0      1
no   0        1        0      1      1      0

• P(c) = 3/4, P(~c) = 1/4
• P(Chinese | c) = (5+1)/(8+6) = 3/7      • P(Chinese | ~c) = (1+1)/(3+6) = 2/9
• P(Tokyo | c) = (0+1)/(8+6) = 1/14       • P(Tokyo | ~c) = (1+1)/(3+6) = 2/9
• P(Japan | c) = (0+1)/(8+6) = 1/14       • P(Japan | ~c) = (1+1)/(3+6) = 2/9

d5:
c:  (3/4) · (3/7)^3 · (1/14) · (1/14) ≈ 0.0003
~c: (1/4) · (2/9)^3 · (2/9) · (2/9) ≈ 0.0001
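The same numbers can be checked with a short Python sketch (a reconstruction, not code from the slides; the test document d5 is assumed to contain “Chinese” three times, matching the (3/7)^3 factor above):

```python
# Per-class word counts and total tokens, read off the table above.
counts = {
    "yes": {"Beijing": 1, "Chinese": 5, "Macao": 1, "Japan": 0, "Tokyo": 0, "Shanghai": 1},
    "no":  {"Beijing": 0, "Chinese": 1, "Macao": 0, "Japan": 1, "Tokyo": 1, "Shanghai": 0},
}
priors = {"yes": 3 / 4, "no": 1 / 4}
vocab_size = 6

def multinomial_score(cls, tokens):
    total = sum(counts[cls].values())            # 8 tokens for "yes", 3 for "no"
    score = priors[cls]
    for t in tokens:                             # one factor per token occurrence
        score *= (counts[cls][t] + 1) / (total + vocab_size)   # add-one smoothing
    return score

# Assumed test document: "Chinese" three times, plus "Tokyo" and "Japan".
d5 = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
print(multinomial_score("yes", d5))   # ≈ 0.0003
print(multinomial_score("no", d5))    # ≈ 0.0001 -> d5 classified as "yes"
```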
Practice
              docID  Words in document         Class label
Training set  1      cat cat dog cat cat cat   yes
              2      dog mouse                 yes
              3      ant ant                   yes
              4      dog ant                   no
Test set      5      cat cat cat ant           ?

• Bernoulli
• Multinomial
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., in hospitals, patient records include a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Belief Networks
• A Bayesian belief network allows class-conditional independencies to be defined between subsets of variables
• A graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of the joint probability distribution
• Nodes: random variables
• Links: dependency
• Example: X and Y are the parents of Z, and Y is the parent of P
• There is no dependency between Z and P
• The graph has no loops or cycles
Bayesian Belief Network: An Example

(Network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.)

The conditional probability table (CPT) for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of a node’s parents.

Derivation of the probability of a particular combination of values x1, …, xn of X from the CPTs:

  P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
