
Classification Bayes


BAYESIAN CLASSIFICATION

Bayesian Classification: Why?


• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes’ Theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Example
Fruits: apple or banana
Make observations on two features: shape and color.

Observation (the evidence): (shape=round, color=red)
Question (the hypothesis): apple or banana?

Decide apple if
p(apple | shape=round, color=red) > p(banana | shape=round, color=red)
Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X:

• P(H|X) = P(H∩X)/P(X) = P(X|H)·P(H)/P(X)

Decision rule:
If P(H1|X) > P(H2|X), then X => H1
If P(H1|X) < P(H2|X), then X => H2

where
P(H1|X) = P(H1∩X)/P(X) = P(X|H1)·P(H1)/P(X)
P(H2|X) = P(H2∩X)/P(X) = P(X|H2)·P(H2)/P(X)
Example
Fruits: apple or banana

P(H|X) = P(H∩X)/P(X) = P(X|H)·P(H)/P(X)

X = (shape=round, color=red)
P(H=apple|X) ∝ P(shape=round, color=red | apple) · P(apple)
P(H=banana|X) ∝ P(shape=round, color=red | banana) · P(banana)
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
  P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• The probabilities P(xk|Ci) are estimated by maximum likelihood (relative frequencies in the training data)

Assume independence
Fruits: apple or banana

X = (shape=round, color=red)
P(H=apple|X) ∝ P(shape=round, color=red | apple) · P(apple)

With the independence assumption this factorizes:
P(shape=round, color=red | apple) = P(shape=round | apple) · P(color=red | apple)

so P(H=apple|X) ∝ P(shape=round | apple) · P(color=red | apple) · P(apple)
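To make the factorization concrete, here is a minimal Python sketch that estimates the per-feature conditional probabilities from a tiny, made-up fruit table (the data and counts below are hypothetical, not from the slides):

```python
from collections import Counter, defaultdict

# Tiny, hypothetical training set: (shape, color, label). Counts are made up for illustration.
data = [
    ("round", "red", "apple"), ("round", "green", "apple"), ("round", "red", "apple"),
    ("long", "yellow", "banana"), ("long", "green", "banana"), ("round", "yellow", "banana"),
]

class_counts = Counter(row[-1] for row in data)          # numerators for P(class)
feature_counts = defaultdict(Counter)                    # feature_counts[label][(feature, value)]
for shape, color, label in data:
    feature_counts[label][("shape", shape)] += 1
    feature_counts[label][("color", color)] += 1

def score(label, shape, color):
    """Unnormalized naive Bayes score: P(shape|class) * P(color|class) * P(class)."""
    prior = class_counts[label] / len(data)
    p_shape = feature_counts[label][("shape", shape)] / class_counts[label]
    p_color = feature_counts[label][("color", color)] / class_counts[label]
    return p_shape * p_color * prior

x = ("round", "red")
print(max(class_counts, key=lambda c: score(c, *x)))     # -> "apple" for these made-up counts
```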
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X

  If P(yes|X) > P(no|X), then X => yes
  If P(yes|X) < P(no|X), then X => no
Bayesian Theorem: Basics

• P(H) (prior probability): the initial probability
  – E.g., the probability that X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  – E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Recall a few probability basics
• For events a and b:
  P(a|b) = P(a ∩ b) / P(b)
  P(b|a) = P(a ∩ b) / P(a)
  P(a ∩ b) = P(a|b) P(b) = P(b|a) P(a)

• Bayes’ Rule:
  P(H|X) = P(H ∩ X) / P(X) = P(X|H) P(H) / P(X)
  Posterior: P(H|X)   Likelihood: P(X|H)   Prior: P(H)
Bayesian Theorem

• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as
  posterior = likelihood × prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum a posteriori class, i.e., the class with maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
  P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• The probabilities P(xk|Ci) are estimated by maximum likelihood (relative frequencies in the training data)

Naïve Bayesian Classifier: Training Dataset

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

Data sample
X = (age <= 30,
     income = medium,
     student = yes,
     credit_rating = fair)

Naïve Bayesian Classifier: An Example
• X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

• P(no|X) ∝ P(no) · P(age <= 30 | no) · P(I=m | no) · P(S=yes | no) · P(CR=Fair | no)
Naïve Bayesian Classifier: An Example
• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

  P(yes) = 9/14 = 0.643
  P(age <= 30 | yes) = 2/9
  P(I=m | yes) = 4/9
  P(S=yes | yes) = 6/9
  P(CR=Fair | yes) = 6/9
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
• Compute P(xk | buys_computer = “yes”) for each attribute value in X:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667

X = (age <= 30 , income = medium, student = yes, credit_rating = fair)


P(X|yes) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(yes|X) ∝ P(X|yes) · P(yes):
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028

Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “no”) = 5/14= 0.357
• Compute P(xk | buys_computer = “no”) for each attribute value in X:
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)


P(X|no): P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(no|X) ∝ P(X|no) · P(no):
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007

P(yes|X) = 0.028 > P(no|X) = 0.007
Therefore, X belongs to class “buys_computer = yes”
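The whole comparison can be reproduced in a few lines of Python; the sketch below simply re-multiplies the conditional probabilities given on the slides rather than re-deriving them from the training table (which is not reproduced in this text):

```python
# Conditional probabilities taken from the slides for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair)
p_yes = 9 / 14
p_no = 5 / 14

likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # ≈ 0.044
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # ≈ 0.019

score_yes = likelihood_yes * p_yes   # ≈ 0.028
score_no = likelihood_no * p_no      # ≈ 0.007

print("yes" if score_yes > score_no else "no")  # -> "yes"
```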
Smoothing to Avoid Overfitting

• To eliminate zero probabilities, we use add-one (Laplace) smoothing

• Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
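A minimal sketch of the correction in Python, using the income counts above (the helper function is illustrative, not code from the slides):

```python
def laplace_estimate(count: int, total: int, num_values: int) -> float:
    """Add-one (Laplace) smoothed probability estimate."""
    return (count + 1) / (total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                      # 1000 tuples
for value, count in counts.items():
    # low -> 1/1003, medium -> 991/1003, high -> 11/1003
    print(value, laplace_estimate(count, total, num_values=len(counts)))
```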
Smoothing to Avoid Overfitting
• P(yes|X) ∝ P(yes) · P(age <= 30 | yes) · P(I=m | yes) · P(S=yes | yes) · P(CR=Fair | yes)

  P(age <= 30 | yes) = (2+1)/(9+3)
  P(I=m | yes) = (4+1)/(9+3)
  P(S=yes | yes) = (6+1)/(9+2)
  P(CR=Fair | yes) = (6+1)/(9+2)
Practice
Student  Gender  Rating     Class label
Yes      M       Excellent  Y
Yes      M       Fair       Y
Yes      F       Fair       Y
Yes      F       Good       N
No       M       Fair       Y
No       M       Excellent  N

<student=yes, gender=F, Rating=Fair>
Class label = ?
Underflow Prevention: log space
• Multiplying lots of probabilities, which are between 0 and 1
by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all
computations by summing logs of probabilities rather than
multiplying probabilities.
• Class with highest final un-normalized log probability score is
still the most probable.
c NB  argmax log P(c j ) 
c jC
 log P( x | c )
i positions
i j

• Note that model is now just max of sum of weights…
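For instance, the buys_computer comparison from the earlier slides can be redone in log space; a minimal Python sketch:

```python
import math

def log_score(prior: float, cond_probs: list[float]) -> float:
    """log P(c) plus the sum of log conditional probabilities (log of the NB product)."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# Numbers from the buys_computer example on the earlier slides.
score_yes = log_score(0.643, [0.222, 0.444, 0.667, 0.667])
score_no = log_score(0.357, [0.6, 0.4, 0.2, 0.4])
print("yes" if score_yes > score_no else "no")  # -> "yes", same decision as before
```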


Text classification
• Model 1: Bernoulli
– One feature Xw for each word in dictionary
– Xw = true in document d if w appears in d
– Naive Bayes assumption:
• Given the document’s topic, appearance of one word in the
document tells us nothing about chances that another word
appears
• This is the model used in the binary independence
model in classic probabilistic relevance feedback
in hand-classified data
Example

     Chinese  Beijing  Shanghai  Macao  Tokyo  Japan  Class label
D1   1        1        0         0      0      0      yes
D2   1        0        1         0      0      0      yes
D3   1        0        0         1      0      0      yes
D4   1        0        0         0      1      1      no
D5   1        0        0         0      1      1      ?
Text classification example (Bernoulli model)
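The worked answer for this slide is not included in the extracted text. A sketch of the usual Bernoulli computation on the table above, assuming add-one smoothing over document counts (so absent words also contribute factors), is shown below; under these assumptions it classifies D5 as “no”:

```python
# Bernoulli NB with add-one smoothing, using document counts from the table above.
# This is a reconstruction of the usual computation, not code from the slides.
vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
docs = {  # word-presence sets for D1..D4 and their class labels
    "D1": ({"Chinese", "Beijing"}, "yes"),
    "D2": ({"Chinese", "Shanghai"}, "yes"),
    "D3": ({"Chinese", "Macao"}, "yes"),
    "D4": ({"Chinese", "Tokyo", "Japan"}, "no"),
}

def bernoulli_score(cls, test_words):
    n_cls = sum(1 for words, label in docs.values() if label == cls)
    score = n_cls / len(docs)                               # prior P(cls)
    for w in vocab:
        n_w = sum(1 for words, label in docs.values() if label == cls and w in words)
        p_w = (n_w + 1) / (n_cls + 2)                       # add-one smoothing over doc counts
        score *= p_w if w in test_words else (1 - p_w)      # absent words count too
    return score

d5 = {"Chinese", "Tokyo", "Japan"}
print(bernoulli_score("yes", d5))   # ≈ 0.005
print(bernoulli_score("no", d5))    # ≈ 0.022  -> D5 classified as "no"
```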
Text classification
• Model 2: Multinomial = Class conditional unigram
– One feature Xi for each word pos in document
• feature’s values are all words in dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, word in one position in the document tells us
nothing about words in other positions
• Word appearance does not depend on position

P( X i  w | c)  P( X j  w | c)
for all positions i,j, word w, and class c

• Just have one multinomial feature predicting all words


Using Multinomial Naive Bayes Classifiers to
Classify Text: Basic method

• Attributes are text positions, values are words.

c NB  argmax P (c j ) P( xi | c j )
c j C i

 argmax P (c j ) P( x1 " our" | c j )  P ( xn " text" | c j )


c j C

 Assume that classification is independent of the positions of the words


 Use same parameters for each position
 Result is bag of words model (over tokens not types)

     Beijing  Chinese  Macao  Japan  Tokyo  Shanghai
yes  1        5        1      0      0      1
no   0        1        0      1      1      0

• P(c) = 3/4, P(~c) = 1/4
• P(Chinese | c) = (5+1)/(8+6) = 3/7      • P(Chinese | ~c) = (1+1)/(3+6) = 2/9
• P(Tokyo | c) = (0+1)/(8+6) = 1/14       • P(Tokyo | ~c) = (1+1)/(3+6) = 2/9
• P(Japan | c) = (0+1)/(8+6) = 1/14       • P(Japan | ~c) = (1+1)/(3+6) = 2/9

d5:
c:  (3/4) · (3/7)^3 · (1/14) · (1/14) ≈ 0.0003
~c: (1/4) · (2/9)^3 · (2/9) · (2/9) ≈ 0.0001
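The same numbers can be checked with a short Python sketch (a reconstruction, not code from the slides; the test document d5 is assumed to contain “Chinese” three times, matching the (3/7)^3 factor above):

```python
# Per-class word counts and total tokens, read off the table above.
counts = {
    "yes": {"Beijing": 1, "Chinese": 5, "Macao": 1, "Japan": 0, "Tokyo": 0, "Shanghai": 1},
    "no":  {"Beijing": 0, "Chinese": 1, "Macao": 0, "Japan": 1, "Tokyo": 1, "Shanghai": 0},
}
priors = {"yes": 3 / 4, "no": 1 / 4}
vocab_size = 6

def multinomial_score(cls, tokens):
    total = sum(counts[cls].values())            # 8 tokens for "yes", 3 for "no"
    score = priors[cls]
    for t in tokens:                             # one factor per token occurrence
        score *= (counts[cls][t] + 1) / (total + vocab_size)   # add-one smoothing
    return score

# Assumed test document: "Chinese" three times, plus "Tokyo" and "Japan".
d5 = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
print(multinomial_score("yes", d5))   # ≈ 0.0003
print(multinomial_score("no", d5))    # ≈ 0.0001 -> d5 classified as "yes"
```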
Practice
              docID  Words in document         Class label
Training set  1      cat cat dog cat cat cat   yes
              2      dog mouse                 yes
              3      ant ant                   yes
              4      dog ant                   no
Test set      5      cat cat cat ant           ?

• Bernoulli
• Multinomial
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., in hospitals, patient records include a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Belief Networks
• A Bayesian belief network allows class-conditional independencies to be defined between subsets of variables
• A graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of the joint probability distribution
• Nodes: random variables
• Links: dependency
• Example: X and Y are the parents of Z, and Y is the parent of P
• There is no dependency between Z and P
• The graph has no loops or cycles
Bayesian Belief Network: An Example

(Network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.)

The conditional probability table (CPT) for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of a node’s parents.

Derivation of the probability of a particular combination of values x1, …, xn of X from the CPTs:

  P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Yi))
