
Bayesian Learning Slide 1 of 21

Bayesian Learning

Conditional Probability
Bayesian Reasoning
Naïve Bayes Classifiers
Bayesian Networks

P.D.Scott & L. Citi University of Essex


Bayesian Learning Slide 2 of 21

Filtering SPAM

The Basic Problem

Our electronic mailboxes fill up with vast numbers of unwanted junk mail trying to sell us things - SPAM.

The Basic Solution

A filter that can determine whether an e-mail is SPAM and delete it before delivery.

Subproblem

How does the filter know which e-mails are SPAM?

Solution to Subproblem

Given two sets of example e-mail messages:
Set A comprising only SPAM e-mails
Set B comprising only non-SPAM e-mails
the filter could learn to identify SPAM e-mails.



Bayesian Learning Slide 3 of 21

CONDITIONAL PROBABILITY
The conditional probability of event X occurring given that
event Y has occurred is:
P(X | Y) = P(X ∩ Y) / P(Y)

Do not confuse P(X|Y) with P(Y|X).
The probability that someone is the US President, given that they are American, is about 4×10⁻⁹.
The probability that someone is American, given that they are the US President, is 1.0.

Bayes Theorem
There is a simple relationship between P(X|Y) and P(Y|X). This is Bayes Theorem:

P(X | Y) = P(X ∩ Y) / P(Y)
         = [P(X ∩ Y) / P(X)] × [P(X) / P(Y)]
         = P(Y | X) × P(X) / P(Y)

The relationship is also usefully expressed as:

P(X ∩ Y) = P(X | Y) P(Y) = P(Y | X) P(X)
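The identity can be checked with a minimal Python sketch; the helper function and the illustrative numbers below are my own and are not taken from the slides:

```python
# A minimal check of Bayes' theorem using made-up illustrative numbers.

def conditional(p_joint, p_given):
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_given

p_x_and_y = 0.04   # P(X ∩ Y), chosen arbitrarily
p_x = 0.10         # P(X)
p_y = 0.20         # P(Y)

p_x_given_y = conditional(p_x_and_y, p_y)   # 0.2
p_y_given_x = conditional(p_x_and_y, p_x)   # 0.4

# Bayes' theorem: P(X | Y) = P(Y | X) * P(X) / P(Y)
assert abs(p_x_given_y - p_y_given_x * p_x / p_y) < 1e-12
print(p_x_given_y, p_y_given_x)   # 0.2 0.4 (up to floating point)
```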



Bayesian Learning Slide 4 of 21

Independence
X and Y are said to be independent if
P(X ∩ Y) = P(X) × P(Y)

This means:
• There is no statistical relationship between X and Y.
• Knowing the value of one gives you no help in predicting the value of the other.

Note that if X and Y are independent:

P(X | Y) = P(X)

Conditional Independence

Often we are interested in whether there is a statistical relationship between two variables in the special case that a third variable has a particular value.

X is said to be conditionally independent of Y given Z if

∀ xi, yj, zk : P(X = xi | Y = yj ∧ Z = zk) = P(X = xi | Z = zk)

where xi, yj and zk are possible values of X, Y and Z.

This is usually written as

P(X | Y ∧ Z) = P(X | Z)
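The definition can be verified numerically on a small joint distribution. Below is a minimal Python sketch; the distribution over three binary variables is invented purely for illustration and is constructed so that X and Y really are conditionally independent given Z:

```python
# A minimal numeric check of the definition above.
# The joint distribution over (X, Y, Z) is made up for illustration and is
# built from the factorisation P(Z) P(X|Z) P(Y|Z), which guarantees the
# conditional independence we then verify.
from itertools import product

p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def prob(pred):
    return sum(p for xyz, p in joint.items() if pred(*xyz))

for xi, yj, zk in product([0, 1], repeat=3):
    lhs = prob(lambda x, y, z: x == xi and y == yj and z == zk) / \
          prob(lambda x, y, z: y == yj and z == zk)      # P(X=xi | Y=yj, Z=zk)
    rhs = prob(lambda x, y, z: x == xi and z == zk) / \
          prob(lambda x, y, z: z == zk)                  # P(X=xi | Z=zk)
    assert abs(lhs - rhs) < 1e-12
print("conditional independence holds for every value combination")
```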



Bayesian Learning Slide 5 of 21

BAYESIAN REASONING
Bayes theorem is of immense practical value in drawing
conclusions from evidence.
An Example
Suppose a person is sneezing and you have 3 hypotheses:
A cold Hay fever Healthy
Which is more likely?
Suppose you also know three unconditional probabilities:
Probability that, at a given time, any member of the
population has a cold, P(cold).
Probability that, at a given time, any member of the
population has hay fever, P(hayfever).
P(cold) and P(hayfever) are called prior probabilities.
Probability that any member of the population sneezes
P(sneeze)
And two conditional probabilities:
Probability that someone who has a cold will sneeze,
P(sneeze|cold).
Probability that someone who has hay fever will sneeze,
P(sneeze|hayfever).

Then the posterior probability, P(cold | sneeze), is:

P(cold | sneeze) = P(sneeze | cold) × P(cold) / P(sneeze)

And the posterior probability, P(hayfever | sneeze), is:

P(hayfever | sneeze) = P(sneeze | hayfever) × P(hayfever) / P(sneeze)

What is P(healthy | sneeze)?
Bayesian Learning Slide 6 of 21

Now let’s put in some numbers:

P(cold) = 0.02, P(hayfever) = 0.05, P(sneeze) = 0.08.
P(sneeze|cold) = 0.98, P(sneeze|hayfever) = 0.75.

Then:

P(hayfever | sneeze) = 0.75 × 0.05 / 0.08 = 0.469
P(cold | sneeze) = 0.98 × 0.02 / 0.08 = 0.245
P(healthy | sneeze) = ?

So we can conclude that hay fever is the more likely explanation when we observe a sneeze.
What if we don’t observe a sneeze?

Maximum A Posteriori Hypotheses

More generally, given a set of mutually exclusive hypotheses H and evidence E, the maximum a posteriori (MAP) hypothesis is given by:

h_MAP = argmax_{h ∈ H} P(h | E) = argmax_{h ∈ H} P(E | h) × P(h) / P(E)
      = argmax_{h ∈ H} P(E | h) × P(h)

IF exact conditional probabilities are known, the performance of a complete Bayesian system is optimal: no better classifier could be built using the same evidence.
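A minimal Python sketch of the MAP rule applied to the sneeze example above. Two caveats: P(sneeze | healthy) is not given on the slides, so the value below is an assumed placeholder, and the prior for healthy assumes the three hypotheses are mutually exclusive and exhaustive:

```python
# A minimal sketch of the MAP rule h_MAP = argmax_h P(E|h) * P(h),
# using the numbers from the sneeze example.
# NOTE: P(sneeze | healthy) was not given on the slides; 0.02 is an assumed
# placeholder, and P(healthy) = 0.93 assumes the hypotheses are mutually
# exclusive and exhaustive.

priors = {"cold": 0.02, "hayfever": 0.05, "healthy": 0.93}
likelihood_sneeze = {"cold": 0.98, "hayfever": 0.75, "healthy": 0.02}  # P(sneeze | h)

scores = {h: likelihood_sneeze[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

print(scores)   # roughly {'cold': 0.0196, 'hayfever': 0.0375, 'healthy': 0.0186}
print(h_map)    # 'hayfever' -- the same conclusion as dividing through by P(sneeze)
```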



Bayesian Learning Slide 7 of 21

More Than One Piece of Evidence

Suppose the patient concerned also had a fever. How could this be brought into the diagnosis?

We would need to know:

P(sneeze ∧ fever | cold)
P(sneeze ∧ fever | hayfever)

And the most likely hypothesis would be given by:

h_MAP = argmax_{h ∈ {cold, hayfever, healthy}} P(sneeze ∧ fever | h) × P(h)

Thus if we have n pieces of evidence:

h_MAP = argmax_{h ∈ H} P(E1 ∧ … ∧ Ei ∧ … ∧ En | h) × P(h)

But of course, patients may exhibit different combinations of symptoms. To deal with these we would also need to know:

P(¬sneeze ∧ fever | cold)
P(¬sneeze ∧ fever | hayfever)
P(sneeze ∧ ¬fever | cold)
P(sneeze ∧ ¬fever | hayfever)
P(¬sneeze ∧ ¬fever | cold)
P(¬sneeze ∧ ¬fever | hayfever)



Bayesian Learning Slide 8 of 21

Combinatorial Explosion

This illustrates the major practical limitation of Bayesian methods.

Suppose we were to construct a more realistic diagnostic system using these ideas, and suppose there are n possible symptoms to consider. We would need estimates of all the conditional probabilities P(E1 ∧ … ∧ En | h) in order to determine h_MAP.

How many of these would there be? As many as there are possible combinations of evidence, and this number rises exponentially with n.

Thus, even for simple binary symptoms, we would need 2^n conditional probabilities for each possible hypothesis (i.e. diagnosis): with just 20 symptoms that is already over a million probabilities per diagnosis.



Bayesian Learning Slide 9 of 21

BAYESIAN LEARNING
Estimating Probabilities
Estimating probabilities is essentially a matter of counting the
occurrences of particular combinations of values in the
training data set.
Thus P'(cj), an estimate of P(cj), is given by

P'(cj) = F(cj) / N

where N is the size of the training set and F(cj) is the number of examples of class cj.

Similarly P'(ai ∧ bk | cj) is given by

P'(ai ∧ bk | cj) = F(ai ∧ bk ∧ cj) / F(cj)

where F(ai ∧ bk ∧ cj) is the number of examples that are in class cj and have value ai for attribute A and bk for attribute B.

Dealing with very low counts:

This method works well when the training set contains sufficient examples in the relevant region of example space to enable such estimates to be made. It can run into problems as counts approach and reach zero.

The standard solution is to use a correction that estimates the value that would have been obtained with m additional examples:

m-estimate = (F(… ∧ cj) + m·p) / (F(cj) + m)

For a variable Ai, p is typically 1/k where Ai takes k values.
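A minimal Python sketch of this counting recipe with the m-estimate correction; the function name and the toy counts are illustrative and not taken from the slides:

```python
# A minimal sketch of estimating P(a_i | c_j) by counting, with the m-estimate
# correction for low counts. The toy counts below are invented.

def m_estimate(count_attr_and_class, count_class, m, p):
    """(F(a_i and c_j) + m*p) / (F(c_j) + m)."""
    return (count_attr_and_class + m * p) / (count_class + m)

# Suppose attribute A takes k = 4 values, so the prior guess is p = 1/4,
# and we pretend to have seen m = 4 extra "virtual" examples.
k, m = 4, 4
p = 1 / k

print(m_estimate(0, 12, m, p))   # zero raw count -> 0.0625 instead of 0.0
print(m_estimate(6, 12, m, p))   # plenty of data -> 0.4375, close to 6/12 = 0.5
```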



Bayesian Learning Slide 10 of 21

Complete Bayes Classifiers

Suppose we had a training set of examples in the form of feature vectors A with categorical attributes A1…An and a classification variable C.

In principle we could construct a classifier using the relationship:

c_MAP = argmax_{c ∈ C} P(c | a1 ∧ … ∧ an)
      = argmax_{c ∈ C} P(a1 ∧ … ∧ an | c) × P(c) / P(a1 ∧ … ∧ an)
      = argmax_{c ∈ C} P(a1 ∧ … ∧ an | c) × P(c)

where a1…an are the values taken by attributes A1…An.

We could readily estimate the values of P(c) by counting the occurrence of each value of c in the training set.

Estimating all the conditional probabilities P(a1 ∧ … ∧ an | cj) would only be possible if the data set contained a number of examples (10?) of every feature vector in the example space. So if there were as few as 10 attributes, each taking 4 values, we would need at least 4^10 × 10 ≈ 10^7 training examples!

Thus this approach is not feasible unless the feature space is very small.

The performance of a complete Bayesian system is optimal (no better classifier could be built using the same evidence) only IF exact conditional probabilities are known.
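To make the contrast concrete, a quick back-of-the-envelope count in Python of how many conditional probabilities each approach needs for the 10-attribute, 4-value, 2-class example (the naive Bayes count of 80 anticipates the next slide):

```python
# Rough parameter counts for the 10-attribute, 4-value, 2-class example.
n_attrs, n_values, n_classes = 10, 4, 2

complete = n_values ** n_attrs * n_classes   # one P(a1..an | c) per feature vector per class
naive = n_attrs * n_values * n_classes       # one P(ai | c) per attribute value per class

print(complete)   # 2097152 conditional probabilities
print(naive)      # 80 conditional probabilities
```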



Bayesian Learning Slide 11 of 21

Naive Bayes Classifiers

The complete Bayes classifier is impractical because so much data is required to estimate the conditional probabilities. Can we get round this problem by finding a much more economical way to estimate them?

Assuming Conditional Independence

Suppose we assume that all the attributes are conditionally independent given the classification variable. Then

P(a1 ∧ … ∧ an | cj) = ∏_{i=1}^{n} P(ai | cj)

For our simple diagnostic system, this assumption would be that

P(sneeze ∧ fever | cold) = P(sneeze | cold) × P(fever | cold)

or, equivalently: P(sneeze | fever ∧ cold) = P(sneeze | cold)

The number of conditional probabilities required is thus drastically reduced.
If there are 10 attributes, each taking 4 values, and 2 classes, only 80 conditional probabilities need be estimated.
This is quite feasible using a reasonably sized training set.
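A minimal Python sketch of a categorical Naive Bayes classifier built on these ideas. The tiny data set and attribute names are invented for illustration; a practical version would add m-estimate smoothing and sum log probabilities rather than multiplying raw ones:

```python
# A minimal sketch of training and applying a categorical Naive Bayes classifier.
# The tiny data set below is invented; real code would add m-estimate smoothing
# and work in log space.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_dict, class_label)."""
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for attrs, c in examples:
        for a, v in attrs.items():
            value_counts[(c, a)][v] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def predict(attrs, priors, value_counts, class_counts):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for a, v in attrs.items():
            score *= value_counts[(c, a)][v] / class_counts[c]   # P(a_i = v | c)
        scores[c] = score
    return max(scores, key=scores.get)

data = [({"sneeze": "yes", "fever": "no"}, "cold"),
        ({"sneeze": "yes", "fever": "no"}, "hayfever"),
        ({"sneeze": "no",  "fever": "no"}, "healthy"),
        ({"sneeze": "yes", "fever": "yes"}, "cold")]

model = train(data)
print(predict({"sneeze": "yes", "fever": "yes"}, *model))   # 'cold'
```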



Bayesian Learning Slide 12 of 21

Filtering Spam Using Naïve Bayes

A reasonable simplification is to regard e-mails as text objects. Thus each e-mail can be viewed as a sequence of words.

Any given e-mail will include many words.
Some are much more likely to occur in Spam than in other e-mails, e.g.
P(“Viagra”|Spam) >> P(“Viagra”|Not Spam)
Some are much less likely to occur in Spam than in other e-mails, e.g.
P(“Timetable”|Spam) << P(“Timetable”|Not Spam)
And some are about equally likely to occur in Spam and in other e-mails, e.g.
P(“and”|Spam) ≈ P(“and”|Not Spam)

Suppose we knew all these conditional probabilities. Then we could apply Naïve Bayes to estimate the probability that a given e-mail was or was not Spam.

Given an e-mail comprised of words w1…wn:

P(Spam | w1 ∧ … ∧ wn) = P(w1 ∧ … ∧ wn | Spam) × P(Spam) / P(w1 ∧ … ∧ wn)

P(NotSpam | w1 ∧ … ∧ wn) = P(w1 ∧ … ∧ wn | NotSpam) × P(NotSpam) / P(w1 ∧ … ∧ wn)
Hence, using the Naïve Bayes approximation,

MostLikelyClass = argmax_{c ∈ {Spam, NotSpam}} P(c) × ∏_i P(wi | c)

where the product is taken over all the words in the candidate e-mail.
Bayesian Learning Slide 13 of 21

Estimating the Probabilities

P(Spam) = N_Spam / (N_Spam + N_NotSpam)

P(NotSpam) = N_NotSpam / (N_Spam + N_NotSpam)

where N_Spam is the number of examples of Spam and N_NotSpam is the number of examples that are not Spam.

P(wi | Spam) = (n_i,Spam + 1) / (n_Spam + NumWords)

where
NumWords is the total number of distinct words found in the entire set of examples.
n_Spam is the total number of words found in the set of Spam examples.
n_i,Spam is the number of times the word wi occurs in the set of Spam examples.

Values for P(wi | NotSpam) are obtained similarly.

Note that an m-estimate is used for the conditional probabilities because many words will have very low counts.
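A minimal Python sketch of these estimates and of the resulting classifier. The two tiny corpora are invented; the +1 / +NumWords correction follows the formula above, and log probabilities are used to avoid underflow on long messages:

```python
# A minimal sketch of the word-count estimates above, applied to two tiny
# invented corpora, followed by classification of a new e-mail.
import math
from collections import Counter

spam_docs = ["buy viagra now", "cheap viagra offer now"]
ham_docs  = ["lecture timetable attached", "timetable for the lab session"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts, ham_counts = word_counts(spam_docs), word_counts(ham_docs)
vocab = set(spam_counts) | set(ham_counts)
num_words = len(vocab)                                        # NumWords
n_spam, n_ham = sum(spam_counts.values()), sum(ham_counts.values())

p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))
p_ham  = 1 - p_spam

def log_score(words, prior, counts, total):
    s = math.log(prior)
    for w in words:
        s += math.log((counts[w] + 1) / (total + num_words))  # (n_i,c + 1)/(n_c + NumWords)
    return s

email = "viagra offer".split()
scores = {"Spam":    log_score(email, p_spam, spam_counts, n_spam),
          "NotSpam": log_score(email, p_ham,  ham_counts,  n_ham)}
print(max(scores, key=scores.get))    # 'Spam'
```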



Bayesian Learning Slide 14 of 21

NUMERIC ATTRIBUTES

The examples we have discussed have all involved only categorical attributes. What happens if some (or even all) of the attributes are numeric?

Two types of solution:

1. Discretization
Map the numeric attribute to a set of discrete values and treat the result in the same way as a categorical attribute.
e.g. Temperature (°Celsius) → {cold, cool, normal, warm, hot}
We will discuss this approach in more detail later.

2. Assume a distribution
Assume the numeric attribute has a particular probability distribution. Gaussian (i.e. normal) is the usual assumption.
Parameters of the distribution can be estimated from the training set. One such distribution is needed for each hypothesis for each attribute.
These distributions can then be used to compute the probabilities of a given value of the attribute occurring for each hypothesis.
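A minimal Python sketch of the second option. The temperature readings per class are invented; the Gaussian density fitted from them is used as the per-class likelihood of a numeric attribute value:

```python
# A minimal sketch of the "assume a Gaussian" option: fit a mean and standard
# deviation per class from (invented) training values, then use the normal
# density as P(temperature = x | class) inside a Naive Bayes product.
import math

def fit_gaussian(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Invented body-temperature readings (°C) per hypothesis.
temps = {"cold": [38.2, 38.9, 37.8, 38.5], "healthy": [36.6, 36.9, 36.4, 36.7]}
params = {h: fit_gaussian(v) for h, v in temps.items()}

x = 38.0
for h, (mu, sigma) in params.items():
    print(h, gaussian_pdf(x, mu, sigma))   # likelihood of observing 38.0 under each class
```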



Bayesian Learning Slide 15 of 21

How Well Do Naive Bayes Classifiers Work?

Comparative studies show that Naïve Bayesian classifiers often perform surprisingly well. Results obtained for many data sets are typically comparable to those produced by decision trees and back-propagation.
Why is this surprising?
Because the conditional independence assumption is very
strong.
It is often not a realistic assumption. e.g.
A person’s affluence is likely to depend on both
amount of education and parental affluence.
But these two factors do not have independent effects
since parental affluence is likely to influence the
amount of education.
So why do Naïve Bayesian methods work?
Three hypotheses:
1. In most data sets, the conditional independence
assumption is realistic.
But interactions between the variables are very common in real data sets.
2. Even when the conditional independence
assumptions are invalid, the answers produced
will be right most of the time.
That is, the naive Bayesian model is a good
approximation.
3. Because the naïve Bayes system makes strong
assumptions and builds a simple model, it needs
less data than other methods.
Bayesian Learning Slide 16 of 21

BAYESIAN BELIEF NETWORKS

What do we do when the naive Bayesian approach provides too crude a model? The complete Bayesian model, incorporating all conditional probabilities, is not feasible.

A compromise:
Build a model that
• Specifies which conditional independence assumptions are valid.
• Provides sets of conditional probabilities to specify the joint probability distributions wherever dependencies exist.

Bayesian belief networks achieve these objectives by
• Specifying the conditional independence assumptions in a directed acyclic graph.
Each node denotes a variable.
Each arc indicates a dependency between the nodes at its start and finish.
Consequently a variable X is conditionally independent of a variable Y, given the immediate predecessors (parents) of X, if there is no directed path from X to Y (i.e. Y is not a descendant of X).
• A conditional probability table is provided for each node.
This specifies the probability distribution of the associated variable given the values of the node's immediate predecessors.



Bayesian Learning Slide 17 of 21

An Example

The network has five Boolean variables: BURGLAR (B), EARTHQUAKE (E), ALARM (A), JOHN CALLS (J) and MARY CALLS (M). B and E are the parents of A; A is the parent of both J and M. The conditional probability tables are:

BURGLAR: P(B) = 0.001        EARTHQUAKE: P(E) = 0.002

ALARM:
B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

JOHN CALLS:        MARY CALLS:
A | P(J)           A | P(M)
T | 0.90           T | 0.70
F | 0.05           F | 0.01

There is an alarm which almost always rings if there is a burglary, is sometimes triggered by an earthquake and very occasionally gives a false alarm. Earthquakes and burglaries are infrequent.
John and Mary are neighbours who will phone the owner if they hear the alarm. They won’t always hear it and occasionally imagine they have heard it.
John and Mary’s behaviour is conditionally independent given the alarm.
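Because the graph encodes these independence assumptions, the probability of any complete assignment factorises into one term per node. A minimal Python sketch using the tables above; the particular event evaluated (both neighbours call, the alarm rings, but there is no burglary and no earthquake) is chosen purely as an illustration:

```python
# A minimal sketch: the joint probability of a full assignment factorises as
# P(J, M, A, B, E) = P(J|A) * P(M|A) * P(A|B,E) * P(B) * P(E).
# Evaluated here for: John calls, Mary calls, alarm rings, no burglary, no earthquake.

p_b, p_e = 0.001, 0.002
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
p_j_given = {True: 0.90, False: 0.05}
p_m_given = {True: 0.70, False: 0.01}

b, e, a, j, m = False, False, True, True, True
prob = (p_j_given[a] * p_m_given[a] * p_a_given[(b, e)]
        * (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e))
print(prob)   # ≈ 0.00063 -- a false alarm with both neighbours calling is rare
```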



Bayesian Learning Slide 18 of 21

Constructing Belief Networks

Methods for learning Bayesian networks are an active area of current research.
Two situations:
• The structure of the network is known: only the conditional probability tables must be learned.
• The structure of the network is unknown and must be learned.
Network Structure Known
This is obviously the easier learning task.
If data is available on all the variables:
Very straightforward
Simply compute the entries for the conditional probability tables by counting, just as for the naive Bayes classifier.

If data is absent for some variables:
Considerably more difficult. The space of possible probability assignments must be searched for the one that maximises the likelihood of the available data.
Two established approaches:
Hill climbing along the line of steepest ascent.
The EM algorithm.
(See Mitchell pp 188-196 for details).

Network Structure Unknown
A difficult research problem.



Bayesian Learning Slide 19 of 21

UNDERSTANDING OTHER LEARNING PROCEDURES

This lecture has been concerned with the application of Bayesian inference to develop classifier learning procedures.
Bayesian methods are also important because they provide insight into the behaviour of other approaches to learning.
Much of Mitchell’s discussion of Bayesian learning is concerned with this second role.



Bayesian Learning Slide 20 of 21

Exercise
The Prisoner Paradox
Three prisoners in solitary confinement, A, B and C, have
been sentenced to death on the same day but, because there
is a national holiday, the governor decides that one will be
granted a pardon. The prisoners are informed of this but told
that they will not know which one of them is to be spared until
the day scheduled for the executions.
Prisoner A says to the jailer “I already know that at least one of the other two prisoners will be executed, so if you tell me the name of one who will be executed, you won’t have given me any information about my own execution”.
The jailer accepts this and tells him that C will definitely die.
A then reasons “Before I knew C was to be executed I had a 1 in 3 chance of receiving a pardon. Now that I know that either B or I will be pardoned, the odds have improved to 1 in 2.”
But the jailer points out “You could have reached a similar
conclusion if I had said B will die, and I was bound to answer
either B or C, so why did you need to ask?”.

What are A’s chances of receiving a pardon and why? Construct an explanation that would convince others that you are right.

You could tackle this by Bayes theorem, by drawing a belief network, or by common sense. Whichever approach you choose should deepen your understanding of the deceptively simple concept of conditional probability.



Bayesian Learning Slide 21 of 21

Suggested Readings:
Mitchell, T. M. (1997) “Machine Learning”, McGraw-Hill. Chapter 6.
(Be careful to distinguish procedures of theoretical importance from
those that can actually be used).
Tan, Steinbach & Kumar (2006) “Introduction to Data Mining”.
Section 5.3
Han & Kamber (2006) “Data Mining: Concepts and Techniques”.
Section 6.4
Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (1994) “Machine learning, neural and statistical classification”, Ellis Horwood. (A series of articles on evaluations of various machine learning procedures).

An article describing the application of Naïve Bayes learning to filter spam from e-mail can be found at:
http://www.paulgraham.com/spam.html

Implementations
Several implementations of the Naïve Bayes procedure are available
as part of the WEKA suite of data mining programs.

