Bayesian Learning
Conditional Probability
Bayesian Reasoning
Naïve Bayes Classifiers
Bayesian Networks
Filtering SPAM
Subproblem
Solution to Subproblem
CONDITIONAL PROBABILITY
The conditional probability of event X occurring given that
event Y has occurred is:
$$P(X \mid Y) = \frac{P(X \wedge Y)}{P(Y)}$$
Bayes Theorem
There is a simple relationship between P(X|Y) and P(Y|X).
This is Bayes Theorem.
$$P(X \mid Y) = \frac{P(X \wedge Y)}{P(Y)} = \frac{P(X \wedge Y)}{P(X)} \cdot \frac{P(X)}{P(Y)} = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

The relationship is also usefully expressed as:

$$P(X \wedge Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X)$$
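As a quick illustration, Bayes' theorem can be applied directly in code. The following is a minimal Python sketch; the probability values used are arbitrary illustrative assumptions, not figures from these notes.

```python
# Minimal sketch of Bayes' theorem: P(X | Y) = P(Y | X) * P(X) / P(Y).
# The probability values below are arbitrary illustrative assumptions.

def posterior(p_y_given_x: float, p_x: float, p_y: float) -> float:
    """Return P(X | Y) computed from P(Y | X), P(X) and P(Y)."""
    return p_y_given_x * p_x / p_y

# Hypothetical values: P(X) = 0.01, P(Y | X) = 0.9, P(Y) = 0.05
print(posterior(0.9, 0.01, 0.05))   # P(X | Y) is roughly 0.18
```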
Independence
X and Y are said to be independent if
$$P(X \wedge Y) = P(X)\,P(Y)$$
This means:
There is no statistical relationship between X and Y.
Knowing the value of one gives you no help in predicting
the value of the other.
Equivalently, $P(X \mid Y) = P(X)$.
Conditional Independence
X is said to be conditionally independent of Y given Z if

$$P(X \mid Y \wedge Z) = P(X \mid Z)$$
BAYESIAN REASONING
Bayes theorem is of immense practical value in drawing
conclusions from evidence.
An Example
Suppose a person is sneezing and you have 3 hypotheses:
a cold, hay fever, or healthy.
Which is more likely?
Suppose you also know three unconditional probabilities:
Probability that, at a given time, any member of the
population has a cold, P(cold).
Probability that, at a given time, any member of the
population has hay fever, P(hayfever).
P(cold) and P(hayfever) are called prior probabilities.
Probability that any member of the population sneezes, P(sneeze).
And two conditional probabilities:
Probability that someone who has a cold will sneeze,
P(sneeze|cold).
Probability that someone who has hay fever will sneeze,
P(sneeze|hayfever).
What is $P(\mathit{healthy} \mid \mathit{sneeze})$? Applying Bayes' theorem to these probabilities gives 0.08.
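To make the reasoning concrete, here is a hedged Python sketch of the same calculation. The priors and likelihoods below are invented illustrative values, not the figures that produce the 0.08 result above.

```python
# Bayesian reasoning over the three hypotheses for a sneezing person.
# All numeric values are invented for illustration.

priors = {"cold": 0.05, "hayfever": 0.075, "healthy": 0.875}      # P(h)
likelihoods = {"cold": 0.90, "hayfever": 0.90, "healthy": 0.05}   # P(sneeze | h)

# P(h | sneeze) is proportional to P(sneeze | h) * P(h);
# summing over all hypotheses gives the normalising constant P(sneeze).
unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
p_sneeze = sum(unnormalised.values())

for h, value in unnormalised.items():
    print(f"P({h} | sneeze) = {value / p_sneeze:.3f}")
```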
Combinatorial Explosion
BAYESIAN LEARNING
Estimating Probabilities
Estimating probabilities is essentially a matter of counting the
occurrences of particular combinations of values in the
training data set.
Thus $P'(c_j)$, an estimate of $P(c_j)$, is given by

$$P'(c_j) = \frac{F(c_j)}{N}$$
where N is the size of the training set and
F(cj) is the number of examples of class cj.
Similarly, $P'(a_i \wedge b_k \mid c_j)$ is given by

$$P'(a_i \wedge b_k \mid c_j) = \frac{F(a_i \wedge b_k \wedge c_j)}{F(c_j)}$$

Under the Naïve Bayes assumption of conditional independence, for each class $c_j \in C$:

$$P(a_1 \wedge \dots \wedge a_n \mid c_j) = \prod_{i=1}^{n} P(a_i \mid c_j)$$
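The counting estimates and the Naïve Bayes product can be written out in a few lines. The following sketch uses an invented toy data set; the attribute values and class labels are illustrative assumptions only.

```python
from collections import Counter, defaultdict

# Toy training set of (attribute tuple, class label) pairs -- invented for illustration.
training = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]

N = len(training)
class_counts = Counter(label for _, label in training)   # F(c_j)
attr_counts = defaultdict(Counter)                       # F(a_i ^ c_j), keyed by class
for attrs, label in training:
    for i, value in enumerate(attrs):
        attr_counts[label][(i, value)] += 1

def p_class(c):
    """P'(c_j) = F(c_j) / N"""
    return class_counts[c] / N

def p_attr(i, value, c):
    """P'(a_i | c_j) = F(a_i ^ c_j) / F(c_j)"""
    return attr_counts[c][(i, value)] / class_counts[c]

def nb_score(attrs, c):
    """P(c_j) multiplied by the product of P(a_i | c_j) -- the Naive Bayes assumption."""
    score = p_class(c)
    for i, value in enumerate(attrs):
        score *= p_attr(i, value, c)
    return score

for c in class_counts:
    print(c, nb_score(("rainy", "hot"), c))
```

Note that an attribute value never seen with a class produces a zero estimate; this is the motivation for adding one to each count (Laplace smoothing) in the spam example that follows.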
FILTERING SPAM
For a message containing the words $w_1, \dots, w_n$:

$$P(\mathit{Spam} \mid w_1 \wedge \dots \wedge w_n) = \frac{P(w_1 \wedge \dots \wedge w_n \mid \mathit{Spam})\,P(\mathit{Spam})}{P(w_1 \wedge \dots \wedge w_n)}$$

$$P(\mathit{NotSpam} \mid w_1 \wedge \dots \wedge w_n) = \frac{P(w_1 \wedge \dots \wedge w_n \mid \mathit{NotSpam})\,P(\mathit{NotSpam})}{P(w_1 \wedge \dots \wedge w_n)}$$

Hence, using the Naïve Bayes approximation, the required probabilities are estimated from the training data as:

$$P(\mathit{Spam}) = \frac{N_{\mathit{Spam}}}{N_{\mathit{Spam}} + N_{\mathit{NotSpam}}} \qquad P(\mathit{NotSpam}) = \frac{N_{\mathit{NotSpam}}}{N_{\mathit{Spam}} + N_{\mathit{NotSpam}}}$$

$$P(w_i \mid \mathit{Spam}) = \frac{n_{i,\mathit{Spam}} + 1}{n_{\mathit{Spam}} + \mathit{NumWords}}$$
where
$\mathit{NumWords}$ is the total number of distinct words found in the entire set of examples,
$n_{\mathit{Spam}}$ is the total number of words found in the set of Spam examples,
$n_{i,\mathit{Spam}}$ is the number of times the word $w_i$ occurs in the set of Spam examples, and
$N_{\mathit{Spam}}$ and $N_{\mathit{NotSpam}}$ are the numbers of Spam and non-Spam examples in the training set.
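Putting the pieces together, here is a hedged sketch of a Naïve Bayes spam filter. The word lists form an invented toy corpus, and log-probabilities are used purely to avoid numerical underflow when many word probabilities are multiplied together.

```python
import math
from collections import Counter

# Invented toy corpora of tokenised messages -- illustration only.
spam_docs = [["win", "money", "now"], ["cheap", "money", "offer"]]
ham_docs = [["meeting", "agenda", "now"], ["project", "report"]]

# Priors: N_Spam / (N_Spam + N_NotSpam) and similarly for NotSpam.
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))
p_ham = len(ham_docs) / (len(spam_docs) + len(ham_docs))

spam_counts = Counter(w for doc in spam_docs for w in doc)   # n_i,Spam for each word
ham_counts = Counter(w for doc in ham_docs for w in doc)
num_words = len(set(spam_counts) | set(ham_counts))          # NumWords

def word_prob(word, counts):
    """Laplace-smoothed estimate (n_i + 1) / (n_class + NumWords)."""
    return (counts[word] + 1) / (sum(counts.values()) + num_words)

def log_score(words, prior, counts):
    """log P(class) + sum of log P(w_i | class), using the Naive Bayes assumption."""
    return math.log(prior) + sum(math.log(word_prob(w, counts)) for w in words)

message = ["cheap", "money"]
spam_score = log_score(message, p_spam, spam_counts)
ham_score = log_score(message, p_ham, ham_counts)
print("Spam" if spam_score > ham_score else "NotSpam")
```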
NUMERIC ATTRIBUTES
1. Discretization
Map the numeric attribute to a set of discrete values and treat
the result in the same way as a categorical attribute.
e.g. temperature in °Celsius → {cold, cool, normal, warm, hot}
We will discuss this approach in more detail later.
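As a rough illustration of the idea (the details are discussed later), the sketch below maps a temperature reading to one of the categories; the bin boundaries are invented assumptions, not values from these notes.

```python
# Sketch of discretizing a numeric attribute. The bin boundaries are
# invented illustrative assumptions.

def discretize_temperature(celsius: float) -> str:
    """Map a temperature in degrees Celsius to a categorical label."""
    if celsius < 5:
        return "cold"
    if celsius < 12:
        return "cool"
    if celsius < 20:
        return "normal"
    if celsius < 28:
        return "warm"
    return "hot"

print(discretize_temperature(15.0))   # -> "normal"
```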
2. Assume a distribution
Assume the numeric attribute has a particular probability
distribution. Gaussian (i.e. normal) is the usual assumption.
Parameters of the distribution can be estimated from the
training set.
One such distribution is needed for each hypothesis for each
attribute.
These distributions can then be used to compute the probability of a given value of the attribute occurring for each hypothesis.
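A minimal sketch of the Gaussian approach follows. The attribute values for the class are invented, and the estimated mean and variance are used to give a density value in place of $P(a_i \mid c_j)$.

```python
import math

# Invented training values of a numeric attribute for one hypothesis (class).
values = [18.0, 21.5, 19.0, 22.0, 20.5]

# Estimate the distribution parameters from the training set.
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
std = math.sqrt(variance)

def gaussian_density(x: float) -> float:
    """Normal density N(x; mean, variance), used in place of P(a_i | c_j)."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / (std * math.sqrt(2 * math.pi))

print(gaussian_density(20.0))   # relative likelihood of observing 20.0 under this class
```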
BAYESIAN NETWORKS
A compromise:
Build a model that
Specifies which conditional independence assumptions
are valid.
Provides sets of conditional probabilities to specify the
joint probability distributions wherever dependencies
exist.
An Example
BURGLAR:  P(B) = 0.001          EARTHQUAKE:  P(E) = 0.002

ALARM (parents: Burglar, Earthquake):
  B  E    P(A)
  T  T    0.95
  T  F    0.94
  F  T    0.29
  F  F    0.001

JOHNCALLS (parent: Alarm):
  A    P(J)
  T    0.90
  F    0.05

MARYCALLS (parent: Alarm):
  A    P(M)
  T    0.70
  F    0.01
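The network above can be encoded directly, since the joint probability of a complete assignment is the product of each node's conditional probability given its parents. The sketch below assumes the two bottom tables belong to JohnCalls and MaryCalls, the alarm's child nodes in the classic form of this example.

```python
# Sketch of the burglar/earthquake/alarm network. The node names JohnCalls
# and MaryCalls are assumed from the classic form of this example.

P_B = 0.001                                  # P(Burglar)
P_E = 0.002                                  # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}              # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}              # P(MaryCalls | Alarm)

def prob(event: bool, p_true: float) -> float:
    """Probability of a Boolean event, given the probability that it is true."""
    return p_true if event else 1.0 - p_true

def joint(b: bool, e: bool, a: bool, j: bool, m: bool) -> float:
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (prob(b, P_B) * prob(e, P_E) * prob(a, P_A[(b, e)])
            * prob(j, P_J[a]) * prob(m, P_M[a]))

# e.g. the alarm sounds and both neighbours call, with no burglary or earthquake:
print(joint(False, False, True, True, True))   # about 0.00063
```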
Exercise
The Prisoner Paradox
Three prisoners in solitary confinement, A, B and C, have
been sentenced to death on the same day but, because there
is a national holiday, the governor decides that one will be
granted a pardon. The prisoners are informed of this but told
that they will not know which one of them is to be spared until
the day scheduled for the executions.
Prisoner A says to the jailer “I already know that at least one of
the other two prisoners will be executed, so if you tell me the
name of one who will be executed, you won’t have given me
any information about my own execution”.
The jailer accepts this and tells him that C will definitely die.
A then reasons “Before I knew C was to be executed I had a 1
in 3 chance of receiving a pardon. Now that I know that either B or
I will be pardoned, the odds have improved to 1 in 2.”
But the jailer points out “You could have reached a similar
conclusion if I had said B will die, and I was bound to answer
either B or C, so why did you need to ask?”.
Suggested Readings:
Mitchell, T. M. (1997) “Machine Learning”, McGraw-Hill. Chapter 6.
(Be careful to distinguish procedures of theoretical importance from
those that can actually be used).
Tan, Steinbach & Kumar (2006) “Introduction to Data Mining”.
Section 5.3
Han & Kamber (2006) “Data Mining: Concepts and Techniques”.
Section 6.4
Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (1994) “Machine
learning, neural and statistical classification” Ellis Horwood. (A series
of articles on evaluations of various machine learning procedures).
Implementations
Several implementations of the Naïve Bayes procedure are available
as part of the WEKA suite of data mining programs.