Unit 2 Bayesian Learning
• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the
probabilities of observing various data given the hypothesis, and the observed data itself.
• Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the
posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h).
Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)
• P(h) - the initial probability that hypothesis h holds, before we have observed the training data.
• P(D) - the prior probability that training data D will be observed (i.e., the probability of D given no
knowledge about which hypothesis holds).
• P(D|h) - the probability of observing data D given some world in which hypothesis h holds.
• P(h|D) - the posterior probability of h, because it reflects our confidence that h holds after we have seen
the training data D.
• MAP - In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally
probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
• We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of
each candidate hypothesis.
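• In symbols, hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h); the denominator P(D) can be dropped
because it is constant across hypotheses.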
Example - Does the patient have cancer or not?
• The available data are from a particular laboratory test with two possible outcomes: positive and negative.
• We have prior knowledge that over the entire population of people only .008 have this disease.
• Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present and a correct negative
result in only 97% of the cases in which the disease is not present. In other cases, the test returns the
opposite result.
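A minimal Python sketch of the MAP calculation for this example, assuming (for illustration) that the new
patient's lab test has come back positive; the probabilities are the ones stated above:

# MAP hypothesis for the cancer example, assuming a positive test result is observed
p_cancer = 0.008                    # prior P(cancer)
p_no_cancer = 1 - p_cancer          # prior P(~cancer) = 0.992
p_pos_given_cancer = 0.98           # P(+|cancer): correct positive rate
p_pos_given_no_cancer = 0.03        # P(+|~cancer): 1 - correct negative rate (0.97)

# Unnormalized posteriors P(D|h) * P(h) for D = "test is positive"
score_cancer = p_pos_given_cancer * p_cancer            # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.03 * 0.992 = 0.02976

h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"
print(h_map)  # prints "no cancer": even with a positive test, h_MAP is that the patient is healthy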
First consider the case where h is inconsistent with the training data D. Since we define P(D|h) to be 0 when h
is inconsistent with D, we have
P(h|D) = (0 · P(h)) / P(D) = 0
The posterior probability of a hypothesis inconsistent with D is zero. Now consider the case where h is
consistent with D. Since we define P(D|h) to be 1 when h is consistent with D, and assume a uniform prior
P(h) = 1/|H| (so that P(D) = |VS_H,D| / |H|), we have
P(h|D) = (1 · 1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D is the subset of hypotheses from H that are consistent with D, i.e., VS_H,D is the version space
of H with respect to D.
• To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and
P(D|h) is
P(h|D) = 1 / |VS_H,D| if h is consistent with D, and P(h|D) = 0 otherwise.
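For instance, if exactly 4 hypotheses in H are consistent with D, each of them has posterior probability 1/4 and
every inconsistent hypothesis has posterior probability 0.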
Bayes Optimal Classifier
• So far we have considered the question "What is the most probable hypothesis given
the training data?" In fact, the question that is often of most significance is the
closely related question "What is the most probable classification of the new
instance given the training data?"
• It may seem that this second question can be answered by simply applying the
MAP hypothesis to the new instance, but we need to check whether that is really the
best way and, if not, what the alternative is.
• Consider a hypothesis space containing three hypotheses, h1, h2, and h3. Suppose that
the posterior probabilities of these hypotheses given the training data are .4, .3, and .3
respectively.
P(h1/D)=0.4
P(h2/D)=0.3
P(h3/D)=0.3
From the above information we can identify that
• h1 is the MAP hypothesis
• Suppose a new instance x is encountered, which is classified positive by h1, but
negative by h2 and h3. Taking all hypotheses into account, the probability that x is
positive is .4 (the probability associated with h1), and the probability that it is negative
is therefore .6. The most probable classification (negative) in this case is different
from the classification generated by the MAP hypothesis.
• In general, the most probable classification of the new instance is obtained by
combining the predictions of all hypotheses, weighted by their posterior probabilities.
• If the possible classification of the new example can take on any value vj from some
set V, then the probability P(vj|D) that the correct classification for the new instance is
vj is
P(vj|D) = Σ hi∈H P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is
maximum.
Bayes optimal classification:
argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D)
The Bayes optimal classification method maximizes the probability that the new instance
is classified correctly, given the available data, hypothesis space, and prior
probabilities over the hypotheses.
Example
• Let the set of possible classifications of the new instance be V = {+, -}, and let the posterior
probabilities of the three hypotheses given the training data be .4, .3, and .3 respectively, as above.
• The possible classifications of the new example can then be computed as follows:
P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0
Therefore
Σ hi∈H P(+|hi) P(hi|D) = .4
and
Σ hi∈H P(-|hi) P(hi|D) = .6
so the Bayes optimal classification of the new instance is -.
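A minimal Python sketch of this computation, using the posteriors and per-hypothesis predictions listed above:

# Bayes optimal classification for the three-hypothesis example above
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(hi|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # classification of x by each hypothesis

# P(vj|D) = sum over hypotheses of P(vj|hi) * P(hi|D); here P(vj|hi) is 1 or 0
scores = {"+": 0.0, "-": 0.0}
for h, posterior in posteriors.items():
    scores[predictions[h]] += posterior

v_opt = max(scores, key=scores.get)
print(scores, v_opt)  # {'+': 0.4, '-': 0.6} '-'  -> the Bayes optimal classification is negative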
Naive Bayes Classifier
• One highly practical Bayesian learning method is the naive Bayes learner,
often called the naive Bayes classifier.
• Naive Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• The naive Bayes classifier applies to learning tasks where each instance x is
described by a conjunction of attribute values and where the target function
f (x) can take on any value from some finite set V.
• A set of training examples of the target function is provided, and a new
instance is presented, described by the tuple of attribute values (a1, a2, ..., an).
• The learner is asked to predict the target value, or classification, for this new
instance.
• The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP,
given the attribute values (a1, a2, ..., an) that describe the instance:
vMAP = argmax vj∈V P(vj | a1, a2, ..., an) = argmax vj∈V P(a1, a2, ..., an | vj) P(vj)
• It is easy to estimate each of the P(vj) simply by counting the frequency with which each target value vj
occurs in the training data. However, estimating the different P(a1, a2, ..., an | vj) terms is not feasible unless
we have a very, very large set of training data. The problem is that the number of these terms is equal to the
number of possible instances times the number of possible target values. Therefore, we need to see every
instance in the instance space many times in order to obtain reliable estimates.
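For example, with just 6 boolean attributes and 2 target values there are already 2^6 × 2 = 128 distinct
P(a1, ..., a6 | vj) terms to estimate from the training data.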
Naive Bayes Classifier:
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value.
• In other words, the assumption is that given the target value of the instance, the
probability of observing the conjunction a1, a2, ..., an is just the product of the
probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = Πi P(ai|vj)
Substituting this into the expression for vMAP gives the naive Bayes classifier:
vNB = argmax vj∈V P(vj) Πi P(ai|vj)
where vNB denotes the target value output by the naive Bayes classifier.
To summarize, the naive Bayes learning method involves a learning step in which the
various P(vj) and P(ai|vj) terms are estimated, based on their frequencies over the
training data. The set of these estimates corresponds to the learned hypothesis, which is then used
to classify new instances.
Example: Consider a fictional dataset that describes the weather conditions for
playing tennis. Test it on a new instance today=(sunny, cool, high, strong)
1. Calculate the prior probabilities P(vj) of the target values vj as follows:
• Here V = {Yes, No}
• Total instances in the data set = 14
• No. of times target value is Yes = 9
• No. of times target value is No = 5
P(Yes) = 9/14 ≈ 0.64
P(No) = 5/14 ≈ 0.36
2. Calculate the conditional probabilities P(ai|vj) for each attribute value of the new instance and combine
them with the priors (product rule of probability):
vNB = argmax vj∈{Yes, No} P(vj) · P(sunny|vj) · P(cool|vj) · P(high|vj) · P(strong|vj)
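A runnable Python sketch of these two steps is given below. Because the dataset table itself is not reproduced
in these notes, the rows used are assumed to be the standard 14-example PlayTennis data, which matches the
counts above (9 Yes, 5 No):

# Naive Bayes on the (assumed) standard PlayTennis dataset.
# Attributes: (Outlook, Temperature, Humidity, Wind); target: PlayTennis.
from collections import Counter

data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]

labels = [row[-1] for row in data]
priors = {v: c / len(data) for v, c in Counter(labels).items()}   # P(vj) by relative frequency

def cond_prob(attr_index, value, label):
    # P(ai = value | vj = label), estimated over the training examples with that label
    rows = [r for r in data if r[-1] == label]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

def classify(instance):
    # vNB = argmax over vj of P(vj) * product over i of P(ai|vj)
    scores = {}
    for label, prior in priors.items():
        p = prior
        for i, value in enumerate(instance):
            p *= cond_prob(i, value, label)
        scores[label] = p
    return max(scores, key=scores.get), scores

print(classify(("sunny", "cool", "high", "strong")))
# With this data the scores are roughly 0.0053 for "yes" and 0.0206 for "no",
# so the naive Bayes prediction for today = (sunny, cool, high, strong) is "no".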
Representation of Bayesian Network
• A Bayesian belief network (Bayesian network for short) represents the joint probability distribution for a set
of variables
• For example, the Bayesian network in the given figure represents the joint probability distribution over the
boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.
• In general, a Bayesian network represents the joint probability distribution by specifying a set of conditional
independence assumptions, represented by a directed acyclic graph (DAG), together with sets of local
conditional probabilities (a conditional probability table, CPT, for each node). Each node in the graph
corresponds to a variable and each arc to a direct probabilistic dependence.
• Each variable in the joint space is represented by a node in the Bayesian network. For
each variable two types of information are specified. First, the network arcs represent
the assertion that the variable is conditionally independent of its nondescendants in the
network given its immediate predecessors in the network.
• Second, a conditional probability table is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors.
• The joint probability for any desired assignment of values (y1, ..., yn) to the tuple of
network variables (Y1, ..., Yn) can be computed by the formula
P(y1, ..., yn) = Π i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally
independent of its nondescendants Lightning and Thunder, given its immediate parents Storm and
BusTourGroup. This means that once we know the values of the variables Storm and BusTourGroup, the
variables Lightning and Thunder provide no additional information about Campfire. The right side of the
figure shows the conditional probability table associated with the variable Campfire. The top-left entry of this
table, for example, expresses the assertion that P(Campfire = True | Storm = True, BusTourGroup = True)
equals the value given in that cell.
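As an illustration of the joint probability formula above: if the arcs of the figure (not reproduced here) are
assumed to be Storm → Lightning, Lightning → Thunder, {Storm, BusTourGroup} → Campfire, and
{Storm, Lightning, Campfire} → ForestFire, then the joint distribution factorizes into the product of the local
CPT entries
P(S, L, T, C, B, F) = P(S) · P(B) · P(L|S) · P(T|L) · P(C|S, B) · P(F|S, L, C)
so any entry of the full joint table can be read off from the much smaller per-node tables.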