Bayesian Learning

Bayesian reasoning provides a probabilistic approach to inference.


Bayesian learning methods are relevant to our study of machine learning
for two different reasons:
• Bayesian learning algorithms that calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier, are among the most
practical approaches to certain types of learning problems.
• They provide a useful perspective for understanding many learning
algorithms that do not explicitly manipulate probabilities.
Features of Bayesian Learning
• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to learning
than algorithms that completely eliminate a hypothesis if it is found to be inconsistent
with any single example.
• Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods can
be measured.
• Bayesian statistics is an approach based on Bayes theorem.
Bayes Theorem
• In machine learning we are often interested in determining the best hypothesis from some space H, given the
observed training data D. One way to specify what we mean by the best hypothesis is to say that we demand
the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the
various hypotheses in H.

• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the
probabilities of observing various data given the hypothesis, and the observed data itself.

• Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the
posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h).

Bayes theorem:
  P(h|D) = P(D|h) P(h) / P(D)
where
• P(h) - the initial probability that hypothesis h holds, before we have observed the training data.
• P(D) - the prior probability that training data D will be observed (i.e., the probability of D given no
knowledge about which hypothesis holds).
• P(D|h) - the probability of observing data D given some world in which hypothesis h holds.
• P (h|D) - the posterior probability of h, because it reflects our confidence that h holds after we have seen
the training data D.
• MAP - In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally
probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
• We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of
each candidate hypothesis:
  h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
(P(D) can be dropped from the maximization because it does not depend on h).
Example- Does Patient have Cancer or not
• The available data is from a particular laboratory test with two possible outcomes:

+(positive) and -(negative).

• We have prior knowledge that over the entire population of people only .008 have this disease.

• Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present and a correct negative
result in only 97% of the cases in which the disease is not present. In other cases, the test returns the
opposite result.

• The above situation can be summarized by the following probabilities:

  P(cancer) = 0.008        P(¬cancer) = 0.992
  P(+|cancer) = 0.98       P(−|cancer) = 0.02
  P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97
• Observe a new patient for whom the lab test returns a positive result. Using Bayes rule,
diagnose whether the patient has cancer or not.
• Observe a new patient for whom the lab test returns a negative result. Using Bayes rule,
diagnose whether the patient has cancer or not.
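The following is a minimal Python sketch, added here for illustration (it is not part of the original notes); the variable names are illustrative and the numbers are the ones stated above.

# Bayes rule applied to the cancer / lab-test example.
p_cancer = 0.008                 # P(cancer)
p_not_cancer = 0.992             # P(¬cancer)
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_not_cancer = 0.03    # P(+ | ¬cancer)

# Unnormalized posteriors for a positive test result
post_cancer = p_pos_given_cancer * p_cancer              # 0.00784
post_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# Normalize so the two posteriors sum to 1
evidence = post_cancer + post_not_cancer                 # P(+)
print("P(cancer | +)  =", post_cancer / evidence)        # ~0.21
print("P(¬cancer | +) =", post_not_cancer / evidence)    # ~0.79
# Since P(¬cancer | +) > P(cancer | +), the MAP diagnosis is "no cancer".

For the negative-result question the same sketch applies, with P(−|cancer) = 0.02 and P(−|¬cancer) = 0.97 in place of the positive-result likelihoods.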
Bayes Theorem and Concept Learning
Brute-Force based concept learning
• Consider a concept learning problem:
• X = instance space
• H = hypothesis space defined over the instance space X
• c : X → {0, 1}, the target concept, which is a Boolean-valued function
For this setting we can design a straightforward brute-force concept learning algorithm that outputs h_MAP.
We may choose the probability distributions P(h) and P(D|h) in any way we wish, to
describe our prior knowledge about the learning task. Here let us choose them to be
consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable
than any other.
Given these assumptions, what values should we specify for P(h)? Given no prior
knowledge that one hypothesis is more likely than another, it is reasonable to assign the
same prior probability to every hypothesis h in H:
  P(h) = 1 / |H|   for all h in H
• Next consider P(D|h), the probability of observing the target values D = (d1, ..., dm) for the fixed set of
instances (x1, ..., xm), given a world in which hypothesis h holds (i.e., given a world in which h is the
correct description of the target concept c). Since we assume noise-free training data, the
probability of observing classification di given h is 1 if di = h(xi) and 0 if di ≠ h(xi).
In other words, the probability of data D given hypothesis h is 1 if D is consistent with h,
and 0 otherwise:
  P(D|h) = 1   if di = h(xi) for all di in D
  P(D|h) = 0   otherwise
Let us consider the first step of this algorithm, which uses Bayes theorem to compute the posterior probability
P(h|D) of each hypothesis h given the observed training data D.

First consider the case where h is inconsistent with the training data D. Since P(D|h) is 0 when h is inconsistent
with D, we have
  P(h|D) = (0 · P(h)) / P(D) = 0
The posterior probability of a hypothesis inconsistent with D is zero. Now consider the case where h is
consistent with D. Since P(D|h) is 1 when h is consistent with D, we have
  P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D is the subset of hypotheses from H that are consistent with D, i.e., VS_H,D is the version space of H
with respect to D (here P(D) = |VS_H,D| / |H| follows from the theorem of total probability and the uniform prior).
• To summarize, Bayes theorem implies that under our assumed P(h) and P(D|h) the posterior probability
P(h|D) is
  P(h|D) = 1 / |VS_H,D|   if h is consistent with D
  P(h|D) = 0              otherwise
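As an illustration (not part of the original notes), the sketch below enumerates a tiny, made-up hypothesis space over two Boolean attributes and computes these posteriors by brute force; the hypotheses and training data are chosen only for demonstration.

# Toy brute-force MAP learner: uniform prior, noise-free data,
# P(D|h) = 1 iff h is consistent with every training example.
hypotheses = {                                  # illustrative hypothesis space H
    "always_true": lambda x: 1,
    "x0":          lambda x: x[0],
    "x1":          lambda x: x[1],
    "x0_and_x1":   lambda x: x[0] & x[1],
}
D = [((1, 1), 1), ((0, 1), 0)]                  # (instance, label) training data

prior = 1.0 / len(hypotheses)                   # P(h) = 1/|H|
likelihood = {name: float(all(h(x) == d for x, d in D))   # P(D|h)
              for name, h in hypotheses.items()}

evidence = sum(likelihood[n] * prior for n in hypotheses)  # P(D) = |VS_H,D|/|H|
posterior = {n: likelihood[n] * prior / evidence for n in hypotheses}
print(posterior)   # consistent hypotheses share 1/|VS_H,D|; the rest get 0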
Bayes Optimal Classifier
• So far we have considered the question "what is the most probable hypothesis given
the training data?" In fact, the question that is often of most significance is the
closely related question "what is the most probable classification of the new
instance given the training data?"
• It may seem that this second question can be answered by simply applying the
MAP hypothesis to the new instance, but we need to check whether this is really the
best way and, if not, what the alternative is.
• Consider a hypothesis space containing three hypotheses, h1, h2, and h3. Suppose that
the posterior probabilities of these hypotheses given the training data are .4, .3, and .3
respectively.
P(h1|D) = 0.4
P(h2|D) = 0.3
P(h3|D) = 0.3
From the above information we can identify that
• h1 is the MAP hypothesis.
• Suppose a new instance x is encountered, which is classified positive by h1, but
negative by h2 and h3. Taking all hypotheses into account, the probability that x is
positive is .4 (the probability associated with h1), and the probability that it is negative
is therefore .6. The most probable classification (negative) in this case is different
from the classification generated by the MAP hypothesis.
• In general, the most probable classification of the new instance is obtained by
combining the predictions of all hypotheses, weighted by their posterior probabilities.
• If the possible classification of the new example can take on any value vj from some
set V, then the probability P(vj|D) that the correct classification for the new instance is
vj is
  P(vj|D) = Σ_{hi ∈ H} P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is
maximum.
Bayes optimal classification:
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
The Bayes optimal classification method maximizes the probability that the new instance
is classified correctly, given the available data, hypothesis space, and prior
probabilities over the hypotheses.
Example
• Let the set of possible classifications of the new instance be V = {+, −}, and let the posterior
probabilities of h1, h2, and h3 given the training data be .4, .3, and .3 respectively, with h1
classifying the new instance as + and h2, h3 classifying it as −.
• The possible classifications of the new example can then be computed as follows:
  P(+|h1) = 1, P(−|h1) = 0
  P(+|h2) = 0, P(−|h2) = 1
  P(+|h3) = 0, P(−|h3) = 1
Therefore
  Σ_{hi ∈ H} P(+|hi) P(hi|D) = 1(.4) + 0(.3) + 0(.3) = .4
and
  Σ_{hi ∈ H} P(−|hi) P(hi|D) = 0(.4) + 1(.3) + 1(.3) = .6
so the Bayes optimal classification of the new instance is −.
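A minimal sketch of this weighted vote, added for illustration (not from the original notes):

# Bayes optimal classification for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(hi | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each hi's classification of x

# P(vj | D) = sum over hypotheses of P(vj | hi) * P(hi | D),
# where P(vj | hi) is 1 if hi predicts vj and 0 otherwise.
scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ("+", "-")}

print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' : the Bayes optimal classification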
Naive Bayes Classifier
• One highly practical Bayesian learning method is the naive Bayes learner,
often called the naive Bayes classifier.
• Naive Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• The naive Bayes classifier applies to learning tasks where each instance x is
described by a conjunction of attribute values and where the target function
f (x) can take on any value from some finite set V.
• A set of training examples of the target function is provided, and a new
instance is presented, described by the tuple of attribute values (a1, a2, ..., an).
• The learner is asked to predict the target value, or classification, for this new
instance.
• The Bayesian approach to classifying the new instance is to assign the most probable target value, v_MAP,
given the attribute values (a1, a2, ..., an) that describe the instance:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)
• We can use Bayes theorem to rewrite this expression as
  v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
        = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
• It is easy to estimate each of the P(vj) simply by counting the frequency with which each target value vj
occurs in the training data. However, estimating the different P(a1, a2, ..., an | vj) terms is not feasible unless
we have a very large set of training data. The problem is that the number of these terms is equal to the
number of possible instances times the number of possible target values, so we would need to see every
instance in the instance space many times in order to obtain reliable estimates.
Naive Bayes Classifier:
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value.
• In other words, the assumption is that given the target value of the instance, the
probability of observing the conjunction a1, a2, ..., an is just the product of the
probabilities for the individual attributes:
  P(a1, a2, ..., an | vj) = Π_i P(ai | vj)
• Naive Bayes classification:
  v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)
where v_NB denotes the target value output by the naive Bayes classifier.

To summarize, the naive Bayes learning method involves a learning step in which the
various P(vj) and P(ai|vj) terms are estimated, based on their frequencies over the
training data. The set of these estimates corresponds to the learned hypothesis and is used
to classify new instances.
Example: Consider a fictional dataset that describes the weather conditions for
playing tennis. Test it on a new instance today=(sunny, cool, high, strong)
1. Calculate prior probabilities P(vj) of target values(vj)
• Here V={Yes, No}
• Total instances in a data set =14
• No. of times target value is yes =9
• No. of times target value is No =5

• So, P(Play tennis=Yes) =9/14 = 0.64


P(Play tennis=No) =5/14 = 0.36
2. Calculate the conditional probability of each individual attribute value, P(ai|vj)
• Here we have 4 attributes: a = {Outlook, Temperature, Humidity, Wind}
• So calculate P(Outlook=Sunny | PlayTennis=Yes)
• P(Outlook=Sunny | PlayTennis=No)
• P(Outlook=Overcast | PlayTennis=Yes)
• P(Outlook=Overcast | PlayTennis=No)
• P(Outlook=Rain | PlayTennis=Yes)
• P(Outlook=Rain | PlayTennis=No)
Similarly calculate conditional probability for all attributes
Calculate v_NB for classifying the new instance
New instance: today = (Sunny, Cool, High, Strong)
  v_NB(Yes) = P(Yes) · P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes)
  v_NB(No)  = P(No)  · P(Sunny|No)  · P(Cool|No)  · P(High|No)  · P(Strong|No)
As v_NB(Yes) < v_NB(No), the final classification is "No". (A worked sketch in code follows below.)
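The dataset table itself is not reproduced in these notes, so the following sketch assumes the standard 14-row PlayTennis dataset (which matches the 9 Yes / 5 No counts above); it is added only to illustrate the frequency-counting steps.

from collections import Counter, defaultdict

# Assumed standard PlayTennis dataset: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]

label_counts = Counter(row[-1] for row in data)
prior = {v: c / len(data) for v, c in label_counts.items()}   # P(vj): Yes 9/14, No 5/14

cond = defaultdict(Counter)              # counts for P(ai | vj), per attribute position
for *attrs, v in data:
    for i, a in enumerate(attrs):
        cond[(i, v)][a] += 1

def p(i, a, v):                          # relative-frequency estimate of P(ai | vj)
    return cond[(i, v)][a] / label_counts[v]

new = ("Sunny", "Cool", "High", "Strong")
scores = {v: prior[v] * p(0, new[0], v) * p(1, new[1], v) * p(2, new[2], v) * p(3, new[3], v)
          for v in prior}
print(scores)                        # roughly {'No': 0.0206, 'Yes': 0.0053}
print(max(scores, key=scores.get))   # 'No'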


Bayesian Belief Network
• The naive Bayes classifier assumes that all the variables are conditionally
independent given the value of the target variable.
• This assumption dramatically reduces the complexity of learning the target
function. When it is met, the naive Bayes classifier outputs the optimal Bayes
classification. However, in many cases this conditional independence
assumption is clearly overly restrictive.
• Bayesian belief networks allow conditional independence assumptions that
apply to subsets of the variables. Thus, Bayesian belief networks provide an
intermediate approach that is less constraining than the global assumption of
conditional independence made by the naive Bayes classifier, but more
tractable than avoiding conditional independence assumptions altogether.
• A Bayesian belief network describes the probability distribution over a set of variables.
• Consider an arbitrary set of random variables Y1, ..., Yn, where each variable Yi can take
on the set of possible values V(Yi).
• We define the joint space of the set of variables Y to be the cross product V(Y1) × V(Y2)
× ... × V(Yn).
• The probability distribution over this joint space is called the joint probability
distribution.
• The joint probability distribution specifies the probability for each of the possible
variable bindings for the tuple (Y1, ..., Yn).
Conditional Independence
• Let X, Y, and Z be three discrete-valued random variables. We say that X is conditionally
independent of Y given Z if the probability distribution governing X is independent of
the value of Y given a value for Z; that is, if
  (∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z).

We commonly write the above expression in abbreviated form as P(X | Y, Z) = P(X | Z).

This definition of conditional independence can be extended to sets of variables as well.
We say that the set of variables X1 ... Xl is conditionally independent of the set of
variables Y1 ... Ym given the set of variables Z1 ... Zn if
  P(X1 ... Xl | Y1 ... Ym, Z1 ... Zn) = P(X1 ... Xl | Z1 ... Zn)
• There is a relation between this definition and our use of conditional independence in the
definition of the naive Bayes classifier.
• The naive Bayes classifier assumes that the instance attribute A1 is conditionally
independent of instance attribute A2 given the target value V. This allows the naive Bayes
classifier to calculate P(A1, A2 | V) as follows:
  P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V)    (product rule of probability)
                = P(A1 | V) P(A2 | V)        (by conditional independence)
Representation of Bayesian Network
• A Bayesian belief network (Bayesian network for short) represents the joint probability distribution for a set
of variables
• For example, the Bayesian network in the given figure represents the joint probability distribution over the
boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup. Each variable
corresponds to a node of the network, and the dependencies between variables are shown by directed arcs.
• A Bayesian network represents the joint probability distribution by specifying a set of conditional
independence assumptions, represented by a directed acyclic graph (DAG), together with sets of local
conditional probability tables (CPTs).
• Each variable in the joint space is represented by a node in the Bayesian network, and for
each variable two types of information are specified. First, the network arcs represent
the assertion that the variable is conditionally independent of its non-descendants in the
network given its immediate predecessors in the network.
• Second, a conditional probability table is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors.
• The joint probability for any desired assignment of values (y1, ..., yn) to the tuple of
network variables (Y1, ..., Yn) can be computed by the formula
  P(y1, ..., yn) = Π_{i=1..n} P(yi | Parents(Yi))
• where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.

Note that the values of P(yi | Parents(Yi)) are precisely the values stored in the
conditional probability table associated with node Yi.
Illustration

Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally
independent of its non-descendants Lightning and Thunder, given its immediate parents Storm and
BusTourGroup. This means that once we know the value of the variables Storm and BusTourGroup, the
variables Lightning and Thunder provide no additional information about Campfire. The right side of the
figure shows the conditional probability table associated with the variable Campfire. The top left entry in this
table, for example, expresses the assertion that

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4


Inference
• A Bayesian network is used to infer the value of some target variable (e.g., ForestFire)
given the observed values of the other variables. Since we are dealing with random variables,
the target variable cannot in general be assigned a single determined value.
• What we wish to infer is the probability distribution for the target variable, which
specifies the probability that it will take on each of its possible values given the observed
values of the other variables.
• This inference step can be straightforward if values for all of the other variables in the
network are known exactly.
• In the more general case we may wish to infer the probability distribution for some
variable (e.g., Forest Fire) given observed values for only a subset of the other variables
(e.g., Thunder and Bus Tour Group may be the only observed values available).
• In general, a Bayesian network can be used to compute the probability distribution for any
subset of network variables given the values or distributions for any subset of the
remaining variables.
Example:
Harry installed a new burglar alarm at his home to detect burglary. The
alarm reliably responds to a burglary but also responds to
minor earthquakes. Harry has two neighbors, David and Sophia, who
have taken the responsibility to inform Harry at work when they hear the
alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused by the phone ringing and calls then too.
On the other hand, Sophia likes to listen to loud music, so
sometimes she misses the alarm. Here we would like to compute
probabilities of events in this burglar-alarm network.
Problem: Calculate the probability that alarm has sounded, but
there is neither a burglary, nor an earthquake occurred, and David
and Sophia both called the Harry.
Solution
From the formula for the joint distribution, we can write the problem statement as a joint
probability, where S = Sophia calls, D = David calls, A = alarm sounds, B = burglary, and E = earthquake:

  P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)
                     = 0.75 · 0.91 · 0.001 · 0.998 · 0.999
                     = 0.00068045

(the values 0.75, 0.91, 0.001, 0.998 and 0.999 are taken from the network's conditional probability tables).

• Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
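The arithmetic above is easy to check in code. The sketch below (added for illustration, not part of the notes) hard-codes only the CPT entries quoted in the solution; the variable names are illustrative.

# Joint probability in the burglar-alarm network, using the CPT entries above:
# P(S|A)=0.75, P(D|A)=0.91, P(A|¬B,¬E)=0.001, P(¬B)=0.998, P(¬E)=0.999.
p_not_burglary = 0.998
p_not_earthquake = 0.999
p_alarm_given_not_b_not_e = 0.001
p_david_given_alarm = 0.91
p_sophia_given_alarm = 0.75

# P(y1,...,yn) = product over nodes of P(yi | Parents(Yi))
joint = (p_sophia_given_alarm * p_david_given_alarm *
         p_alarm_given_not_b_not_e * p_not_burglary * p_not_earthquake)
print(round(joint, 8))   # 0.00068045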
Support Vector Machine
• SVM is a well-known supervised machine learning algorithm used for classification
as well as regression tasks, although it is mostly preferred for classification.
It separates the different target classes with a hyperplane in n-dimensional
(multidimensional) space.
• The main motive of the SVM is to create the best decision boundary that can
separate two or more classes(with maximum margin) so that we can correctly put
new data points in the correct class.
• Support vectors are defined as the data points that are closest to the hyperplane
and directly affect its position. Because these vectors "support" the hyperplane,
they are called support vectors.
• Support vectors are the elements of the training set that would change the position
of the dividing hyperplane if removed; they are the critical elements of the training set.
• The problem of finding the optimal hyperplane is an optimization problem and
can be solved by optimization techniques.
SVM Hyperplane
• There may be multiple lines/decision boundaries that segregate the classes in n-dimensional space,
but we want to find the best decision boundary that helps classify the data points.
• This best boundary is the hyperplane of SVM. The dimension of the hyperplane depends on the
number of features present in the dataset: if there are 2 features, the hyperplane will be a line;
if there are 3 features, it will be a 2-dimensional plane.
• We always create the hyperplane that provides the maximum margin, i.e., the maximum distance
between the hyperplane and the nearest data points of each class.
• It is called a Support Vector Machine because it uses the extreme vectors
(the support vectors) to create the hyperplane.
How the SVM Algorithm Works
SVM mainly focuses on creating a hyperplane to separate the target classes.
Basically, a decision boundary has to be drawn to separate the two classes.
There can be multiple lines that separate these two classes;
SVM has to find the best line.
How does the SVM find the best line?
• According to the SVM algorithm, we search for the points nearest the decision
boundary from both the classes. These points are called support vectors.
• Now, we have to calculate the distance between the line and the support vectors.
This distance is known as the margin. Our target is to maximize the margin;
SVM always looks for the optimal hyperplane, i.e., the one for which the margin is maximum.
• Now, you must be wondering why SVM tries to keep the maximum separation gap
between the two classes. It is done so that the model can efficiently predict future
data points.
What if our dataset is a bit complex and cannot be separated by a straight line?
Such a dataset is not linearly separable: just by drawing a line we cannot
classify it, i.e., we cannot classify the dataset with a linear decision boundary.
However, this data can be made linearly separable by moving to higher dimensions.
• Let's create one more dimension and name it z, using the following equation:

  z = x^2 + y^2      (equation 1)
The data plotted in the 3D (x, y, z) space can then be separated by a plane.
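As an illustration (not from the original notes), the sketch below generates made-up circular data, adds the z = x^2 + y^2 dimension, and fits a linear SVM; it assumes NumPy and scikit-learn are available.

import numpy as np
from sklearn.svm import SVC

# Made-up example: an inner cluster (class 0) surrounded by a ring (class 1),
# which no straight line in the (x, y) plane can separate.
rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

# Add the third dimension z = x^2 + y^2 (equation 1 above);
# in this new space the two classes can be split by a plane.
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

clf = SVC(kernel="linear").fit(Z, y)
print("training accuracy:", clf.score(Z, y))   # expected: 1.0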


SVM: Linear Example
• Positively labelled data points: (3, 1), (3, −1), (6, 1), (6, −1)
• Negatively labelled data points: (1, 0), (0, 1), (0, −1), (−1, 0)
• To find the optimal hyperplane, first plot these data points and then
find the data points nearest to the boundary on each side (the support vectors).

Each vector is augmented with a 1 as a bias input, so that the bias is treated as an extra dimension.

• The hyperplane equation is y = w·x + b.
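A sketch of this example using scikit-learn, added for illustration (the original notes solve it by hand):

import numpy as np
from sklearn.svm import SVC

# The linearly separable points from the example above.
X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],    # positive class
              [1, 0], [0, 1], [0, -1], [-1, 0]])   # negative class
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # large C approximates a hard margin

print("support vectors:\n", clf.support_vectors_)  # the points nearest the boundary
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
# Expected separating hyperplane near x1 = 2, i.e. w ≈ (1, 0) and b ≈ -2.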
SVM Kernels
• In SVM we need linear decision boundaries. SVM kernels are mathematical
functions used to create a higher-dimensional space so that a non-linear decision
surface can be transformed into a linear one in that higher-dimensional space.
Data such as the following cannot be classified with a line, so a kernel is used to
convert it into higher dimensions and classify it properly with a linear surface
such as a plane.
Types of SVM kernels
• Linear
• Polynomial
• Gaussian OR Radial Basis Kernels
The formulae of the kernels differ, but the task is the same, i.e., to
convert a non-linear boundary into a linear one.
If 1D data is given, then we cannot separate the classes using a line. Change 1D to 2D by plotting x^2 instead of x:

  Feature x:  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6
  x^2:        36  25  16   9   4   1   0   1   4   9  16  25  36

Now the plotted data can be classified using a line (the hyperplane).


Selection of the kernel depends on the type of data set. First try the
linear kernel; if the accuracy is not good, then try other kernels.

The standard forms of these kernel functions are:
  Linear:          K(x1, x2) = x1^T x2
  Polynomial:      K(x1, x2) = (x1^T x2 + 1)^d
  Gaussian (RBF):  K(x1, x2) = exp(-||x1 - x2||^2 / (2σ^2))
where x1 and x2 are the feature vectors, T is the transpose, d is the degree of the polynomial,
and σ is the width parameter of the Gaussian.
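A small sketch (illustrative only) that evaluates these kernel functions on two example vectors:

import numpy as np

def linear_kernel(x1, x2):
    return x1 @ x2                               # x1^T x2

def polynomial_kernel(x1, x2, d=2):
    return (x1 @ x2 + 1) ** d                    # (x1^T x2 + 1)^d

def rbf_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(linear_kernel(x1, x2))       # 4.0
print(polynomial_kernel(x1, x2))   # 25.0
print(rbf_kernel(x1, x2))          # exp(-6.25/2) ≈ 0.044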
Properties of SVM
1. SVM is a supervised learning algorithm mainly used for
classification.
2. It is used to create the best decision boundary.
3. SVM gives good performance even in case of high dimensional data
and small set of training patterns.
4. It can solve difficult problems that are not linearly separable by
converting the data in higher dimensions to make it linearly separable.
Issues with SVM
1. SVM does not perform very well when the data set has more noise
i.e. target classes are overlapping.
2. In cases where the number of features for each data point exceeds
the number of training data samples, the SVM will underperform.
3. As the support vector classifier works by placing data points above
and below the classifying hyperplane, there is no direct probabilistic
interpretation of the classification.
4. Selection of wrong kernel will lead to poor performance. As an
example, using the linear kernel when the data are not linearly
separable results in the algorithm performing poorly.
5. SVM performs poorly on imbalanced data sets. Data imbalance
usually means an unequal distribution of classes within a dataset.
