3.5 Session 14 - Naive Bayes Classifier
MODULE III
Prepared by
Ms.P.Anantha Prabha/
Ms.S.Soundarya
AP/CSE
SESSION 14
16CS318 DATA ANALYTICS MODULE 3
Course Outcome

Learning Objective
Outline
• Bayesian Classifier
– Principle of Bayesian classifier
– Bayes’ theorem of probability
Bayesian Classifier
• Principle
– If it walks like a duck, quacks like a duck, then it is probably a duck
Bayesian Classifier
• A statistical classifier
– Performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation
– Based on Bayes’ Theorem.
• Assumptions
1. The classes are mutually exclusive and exhaustive.
2. The attributes are independent given the class.
Air-Traffic Data

Day        Season   Fog      Rain     Class
Weekday    Spring   None     None     On Time
Weekday    Winter   None     Slight   On Time
Weekday    Winter   None     None     On Time
Holiday    Winter   High     Slight   Late
Saturday   Summer   Normal   None     On Time
Weekday    Autumn   Normal   None     Very Late
Holiday    Summer   High     Slight   On Time
Sunday     Summer   Normal   None     On Time
Weekday    Winter   High     Heavy    Very Late
Weekday    Summer   None     Slight   On Time
Air-Traffic Data
Contd. from previous slide…
Air-Traffic Data
• In this database, there are four attributes
  A = [Day, Season, Fog, Rain]
  with 20 tuples.
• The categories of classes are:
  C = [On Time, Late, Very Late, Cancelled]
• Given this knowledge of the data and classes, we are to find the most likely
  classification for any unseen instance, for example:
  (Day = Weekday, Season = Winter, Fog = High, Rain = Heavy)
Bayesian Classifier
• In many applications, the relationship between the attribute set and the class
  variable is non-deterministic.
  – In other words, a test record cannot be assigned to a class label with certainty.
• Before discussing the Bayesian classifier, we take a quick look at the
  Theory of Probability and then Bayes’ Theorem.
Simple Probability
• Suppose A and B are any two events, and P(A), P(B) denote the probabilities
  that the events A and B will occur, respectively.
• Two events are mutually exclusive if the occurrence of one precludes the
  occurrence of the other.
Can you give an example where two events are not mutually exclusive?
Hint: Tossing two identical coins, weather (sunny, foggy, warm)
Simple Probability
• Independent events: Two events are independent if the occurrence of one does
  not alter the probability of occurrence of the other.
Can you give an example where an event is dependent on one or more other
event(s)?
Hint: Receiving a message (A) through a communication channel (B)
over a computer (C); rain and dating.
Joint Probability
• For any two events A and B,
  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Conditional Probability
Definition 8.2: Conditional Probability
Suppose A and B are two events associated with a random experiment. The
probability of A under the condition that B has already occurred, where P(B) ≠ 0, is
given by
  P(A|B) = P(A ∩ B) / P(B)
Conditional Probability
Corollary 8.1: Conditional Probability
  P(A ∩ B) = P(A) · P(B|A), if P(A) ≠ 0
  or P(A ∩ B) = P(B) · P(A|B), if P(B) ≠ 0
  P(A ∩ B ∩ C) = P(A) · P(B|A) · P(C|A ∩ B)
For n events A1, A2, …, An, if all the events are mutually independent of each other,
  P(A1 ∩ A2 ∩ … ∩ An) = P(A1) · P(A2) … P(An)
Note:
  P(A|B) = 0 if the events are mutually exclusive
  P(A|B) = P(A) if A and B are independent
  P(A|B) · P(B) = P(B|A) · P(A) otherwise
  P(A ∩ B) = P(B ∩ A)
Conditional Probability
• Generalization of Conditional Probability:
  P(A|B) = P(A ∩ B) / P(B) = P(B ∩ A) / P(B)
         = P(B|A) · P(A) / P(B)    [∵ P(A ∩ B) = P(B|A) · P(A) = P(A|B) · P(B)]
By the law of total probability, P(B) = P((B ∩ A) ∪ (B ∩ Ā)), where Ā denotes the
complement of event A. Thus,
  P(A|B) = P(B|A) · P(A) / P((B ∩ A) ∪ (B ∩ Ā))
         = P(B|A) · P(A) / [P(B|A) · P(A) + P(B|Ā) · P(Ā)]
Conditional Probability
In general, for three mutually exclusive and exhaustive events A, B and C, and any event D,
  P(A|D) = P(A) · P(D|A) / [P(A) · P(D|A) + P(B) · P(D|B) + P(C) · P(D|C)]
Total Probability
Definition 8.3: Total Probability
If E1, E2, …, En are mutually exclusive and exhaustive events, then for any event A,
  P(A) = P(E1) · P(A|E1) + P(E2) · P(A|E2) + … + P(En) · P(A|En)
Reverse Probability
Example 8.3:
A bag (Bag I) contains 4 red and 3 black balls. A second bag (Bag II) contains 2 red and 4
black balls. One bag is chosen at random and one ball is drawn from it; the ball turns out to
be red. What is the probability that the ball was drawn from Bag I?
Here,
  E1 = Selecting Bag I
  E2 = Selecting Bag II
  A = Drawing a red ball
We are to determine P(E1|A). Such a problem can be solved using Bayes’ theorem of
probability.
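Using the definitions of conditional and total probability above, the answer can be checked with a short Python sketch (an illustration added here, not part of the original slides; it assumes each bag is equally likely to be selected):

  # Example 8.3: probability that the red ball came from Bag I
  p_e1, p_e2 = 0.5, 0.5      # P(E1), P(E2): each bag equally likely to be selected
  p_a_e1 = 4 / 7             # P(A|E1): 4 red out of 7 balls in Bag I
  p_a_e2 = 2 / 6             # P(A|E2): 2 red out of 6 balls in Bag II
  p_a = p_e1 * p_a_e1 + p_e2 * p_a_e2          # total probability P(A) of drawing red
  print((p_e1 * p_a_e1) / p_a)                 # P(E1|A) = 12/19 ≈ 0.632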
Bayes’ Theorem
  P(Ei|A) = P(Ei) · P(A|Ei) / Σ (k = 1 to n) P(Ek) · P(A|Ek)
• Consider any two class-conditional probabilities, P(Y = yi | X = x) and
  P(Y = yj | X = x).
• If P(Y = yi | X = x) > P(Y = yj | X = x), then we say that yi is stronger than yj
  for the instance X = x.
Class-conditional probability table (partial):

Attribute            On Time        Late       Very Late    Cancelled
Fog = None           5/14 = 0.36    0/2 = 0    0/3 = 0      0/1 = 0
Case 3: Class = Very Late: 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222

The naïve Bayesian classifier computes, for each class Ci,
  pi = P(Ci) × ∏ (j = 1 to n) P(Aj = aj | Ci)
and assigns the test instance to the class with the maximum value:
  px = max(p1, p2, … , pk)
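The decision rule above can be sketched in a few lines of Python. This is only an illustrative sketch (not part of the original slides): it assumes the training data is supplied as a list of (attribute-tuple, class) records such as the rows of the Air-Traffic table, and it ignores the zero-frequency problem addressed later by the M-estimate.

  from collections import Counter

  def naive_bayes_predict(train, x):
      # train: list of (attribute_tuple, class_label); x: attribute tuple to classify
      class_counts = Counter(c for _, c in train)
      scores = {}
      for c, nc in class_counts.items():
          p = nc / len(train)                       # prior P(Ci)
          for j, aj in enumerate(x):
              # count class-c records whose j-th attribute equals aj
              match = sum(1 for attrs, cls in train if cls == c and attrs[j] == aj)
              p *= match / nc                       # P(Aj = aj | Ci)
          scores[c] = p                             # pi = P(Ci) × Π P(Aj = aj | Ci)
      return max(scores, key=scores.get)            # class with maximum pi

With the 20-tuple Air-Traffic data as train, calling naive_bayes_predict(train, ("Weekday", "Winter", "High", "Heavy")) should reproduce the case-by-case products shown above.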
– In the following, we discuss two schemes to deal with continuous attributes in the Bayesian
  classifier.
  1. We can discretize each continuous attribute and then replace the continuous values
     with their corresponding discrete intervals.
  2. We can assume a certain form of probability distribution for the continuous variable and
     estimate the parameters of the distribution using the training data. A Gaussian distribution
     is usually chosen to represent the class-conditional probabilities for continuous attributes.
     The general form of the Gaussian distribution is
       P(x; μ, σ²) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))
     where μ and σ² denote the mean and variance, respectively.
For each class Ci, the class-conditional probability for a numeric attribute Aj
can be calculated from the Gaussian (normal) distribution as follows:
  P(Aj = aj | Ci) = (1 / (√(2π) · σij)) · e^(−(aj − μij)² / (2σij²))
Here, the parameter μij can be estimated as the sample mean of the values of attribute Aj
over the training records that belong to class Ci.
Similarly, σij² can be estimated as the sample variance of those training records.
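A minimal Python sketch of this estimate (an illustration, not from the slides), assuming the values of a numeric attribute over the records of one class are available as a list:

  import math

  def gaussian_conditional(aj, values_in_class):
      # values_in_class: values of attribute Aj over the training records of class Ci
      mu = sum(values_in_class) / len(values_in_class)        # sample mean, estimates μij
      var = sum((v - mu) ** 2 for v in values_in_class) / (len(values_in_class) - 1)  # sample variance, estimates σij²
      return math.exp(-((aj - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

  # e.g. gaussian_conditional(120.0, [125.0, 100.0, 70.0, 120.0, 95.0])   (hypothetical values)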
• The M-estimate deals with a potential problem of the naïve Bayesian classifier
  when the training data set is too small.
  – If the conditional probability for one of the attribute values is zero, then the overall
    class-conditional probability for the class vanishes.
  – In other words, if the training data do not cover many of the attribute values, then we may
    not be able to classify some of the test records.
M-estimate Approach
• The M-estimate approach can be stated as follows:
    P(Aj = aj | Ci) = (nc + m·p) / (n + m)
  where, n  = total number of training instances from class Ci
         nc = number of training instances from class Ci that take the value Aj = aj
         m  = a parameter known as the equivalent sample size, and
         p  = a user-specified prior estimate of the probability.
Note:
  If n = 0, that is, if no training instances of class Ci are available, then P(Aj = aj | Ci) = p;
  thus p serves as a prior estimate in the absence of sample values.
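A one-line Python version of the M-estimate (an illustrative sketch; the parameter values in the comment are hypothetical):

  def m_estimate(nc, n, m, p):
      # nc: class-Ci records with Aj = aj; n: all class-Ci records
      # m: equivalent sample size; p: user-specified prior estimate
      return (nc + m * p) / (n + m)

  # With nc = 0, n = 3, m = 3, p = 1/3 the estimate is 1/6 instead of 0,
  # so an unseen attribute value no longer wipes out the whole product.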
A Practice Example
Example 8.4
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
Data instance:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
A Practice Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
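To complete the example, the sketch below (illustrative Python, not part of the original slides) computes P(X|Ci) · P(Ci) for both classes from the table above, for X = (age <= 30, income = medium, student = yes, credit_rating = fair):

  from collections import Counter

  # (age, income, student, credit_rating, buys_computer) rows from the table above
  data = [
      ("<=30", "high", "no", "fair", "no"),            ("<=30", "high", "no", "excellent", "no"),
      ("31…40", "high", "no", "fair", "yes"),          (">40", "medium", "no", "fair", "yes"),
      (">40", "low", "yes", "fair", "yes"),            (">40", "low", "yes", "excellent", "no"),
      ("31…40", "low", "yes", "excellent", "yes"),     ("<=30", "medium", "no", "fair", "no"),
      ("<=30", "low", "yes", "fair", "yes"),           (">40", "medium", "yes", "fair", "yes"),
      ("<=30", "medium", "yes", "excellent", "yes"),   ("31…40", "medium", "no", "excellent", "yes"),
      ("31…40", "high", "yes", "fair", "yes"),         (">40", "medium", "no", "excellent", "no"),
  ]
  x = ("<=30", "medium", "yes", "fair")

  priors = Counter(row[-1] for row in data)
  for c, nc in priors.items():
      p = nc / len(data)                                   # prior P(Ci)
      for j, aj in enumerate(x):
          match = sum(1 for row in data if row[-1] == c and row[j] == aj)
          p *= match / nc                                  # P(Aj = aj | Ci)
      print(c, round(p, 4))

  # yes: 0.643 × 2/9 × 4/9 × 6/9 × 6/9 ≈ 0.0282
  # no : 0.357 × 3/5 × 2/5 × 1/5 × 2/5 ≈ 0.0069   → predict buys_computer = ‘yes’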
POINTS TO PONDER
• Naïve Bayes is based on the independence assumption
  – Training is very easy and fast; it only requires considering each
    attribute in each class separately
  – Testing is straightforward; it only requires looking up tables or computing
    conditional probabilities with normal distributions
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers,
    even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering
  – A good candidate as a base learner in ensemble learning
  – Apart from classification, naïve Bayes can do more…
KEY TERMS
• Bayesian Classifier
• Naïve Bayesian Classifier
• M-Estimate
Questions
1. Naive Bayes is a machine learning implementation of __________ theorem.
Answers
1. Bayes
2. False
3. M-estimate approach
LEARNING OUTCOME