
MACHINE

INTELLIGENCE
Ensemble Models and
Bayesian Learning
K.S.Srinivas
Department of Computer Science
and Engineering
Machine Intelligence

Unit III
Ensemble Models and Bayesian
Learning

Srinivas K S
Department of Computer Science
Ensemble Learning

• An ensemble method is a technique that combines the predictions from
multiple machine learning algorithms to make more accurate predictions
than any individual model.
• The learners that we use are usually weak learners.
• Ensemble methods are among the most powerful techniques in machine
learning, often outperforming other methods.
• This comes at the cost of increased algorithmic and model complexity.
Ensemble Learning

• Key idea 1: we have learners whose output is slightly better than
chance, i.e. the accuracy is a little better than 50% but not
significantly higher.
• Multiple learners can be modelled using:
• Different algorithms
• Different hyperparameters of the same algorithm
• Different subsets of the training data
• Different features of the training data
• Key idea 2: they construct multiple, diverse predictive models from
adapted versions of the training data (most often reweighted or
resampled).
• They combine the predictions of these models in some way, often by
simple averaging or voting (possibly weighted).
General Approach

Original training data: D
Step 1: Create multiple data sets D1, D2, ...., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ...., Ct-1, Ct
Step 3: Combine the classifiers into C*
General Approach

• We have seen earlier that decision trees have a tendency to overfit,
i.e. they have high variance.
• We could of course prune trees, but that is often difficult.
• Ensemble learning ensures that the combined output of several weak
learners produces a final model that has low variance.
• Given a set of n independent observations Z1, . . . , Zn, each with
variance σ², the variance of the mean Z̄ of the observations is given
by σ²/n.
• In other words, averaging a set of observations reduces variance (see
the sketch below).
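A minimal numerical check of this claim; the Gaussian data, σ = 2 and
n = 25 below are assumed values for illustration:

```python
# Sketch: the variance of the mean of n independent observations is sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 25, 100_000

# Each row is one experiment: n independent observations with variance sigma^2
Z = rng.normal(0.0, sigma, size=(trials, n))
Z_bar = Z.mean(axis=1)          # the "averaged ensemble" estimate per experiment

print(np.var(Z))                # ~ sigma^2 = 4
print(np.var(Z_bar))            # ~ sigma^2 / n = 4 / 25 = 0.16
```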
Intuition behind ensemble learning

• Let's take the example of the 3 learners on the right.
• Let the box be the instance space, with the errors produced by each of
the learners marked as circles.
Key Concept: The errors made must be independent.
• Let's take an arbitrary point marked in pink and make a prediction.
• Now the Red learner makes an error on the sample, but the Green and
Blue learners get it right.
• The average variance is lower.

1. But that's not always the case. Look at this picture.
2. The intersection areas could actually result in the voting being
wrong for the pink points.
Bias Variance - Recap
Goal: Low Bias and Low Variance

• A model with high bias is too simple and has a low number of
predictors.
• Due to this it is unable to capture the underlying pattern of the
data.
• It pays very little attention to the training data and oversimplifies
the model. This leads to high error on both training and test data.
• Any model which has a very large number of predictors will end up
being a very complex model,
• which will deliver very accurate predictions for the training data
that it has already seen, but this complexity
• makes the generalization of this model to unseen data very difficult,
i.e. a high variance model.

Source: https://towardsdatascience.com/holy-grail-for-bias-variance-tradeoff-overfitting-underfitting-7fad64ab5d76
Bias and Variance

• We have seen in neural networks that if we have a very large number of
epochs we may overfit.
• Low bias – high flexibility (DT/ANN).
• High variance – if we give different subsets of the training data we
get different models.
• More flexible representations (low bias) have high variance.
• More powerful representations have high variance.
• We want to have a low bias and low variance model.

The ensemble itself will produce a model with low bias and low variance,
as illustrated earlier.
In fact, even if the individual learners have high bias, the new
combined learner will have a low bias.
The hypothesis of the new learner may not even be in the hypothesis
space of the individual learners.

Basic models perform not so well by themselves, either because they have
a high bias (low degree of freedom models, for example) or because they
have too much variance to be robust (high degree of freedom models, for
example).
Many weak learners increase our confidence

• Ensembles prevent overfitting.
• We don't need to worry about stopping criteria.
• Let's assume we have n learners in a binary classification problem.
• If all of them have an accuracy of A = 0.7 and predict the same class
for a given instance, what would your confidence be in the prediction?

C = 1 − (1 − A)^n

• In reality not all learners would predict the same class, nor would
they have the same accuracy.
• Let's assume that n1 of those learners predict class 1 and n2 predict
class 2.
• Let's also assume, with no loss of generality, that n1 > n2, meaning
that if we took a vote the prediction would always be class 1.
• The probability of the class really being class 1 is binomial (a
sketch follows):

P = C(n, n1) · A^n1 · (1 − A)^(n − n1)
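A small sketch of the two quantities above; the values A = 0.7 and n = 5
are assumptions chosen for illustration:

```python
# Sketch: confidence when all learners agree, and the binomial vote probability.
from math import comb

def agreement_confidence(A, n):
    # C = 1 - (1 - A)^n : chance at least one of n agreeing learners is right
    return 1 - (1 - A) ** n

def prob_exactly_correct(A, n, n1):
    # P = C(n, n1) * A^n1 * (1 - A)^(n - n1)
    return comb(n, n1) * A**n1 * (1 - A) ** (n - n1)

A, n = 0.7, 5
print(agreement_confidence(A, n))                              # ~0.998
# Probability a majority vote is correct: sum over n1 > n/2
print(sum(prob_exactly_correct(A, n, k) for k in range(3, 6)))  # ~0.837
```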
Combining ensemble learners

• Learners can be unweighted.
• Learners can be weighted: wt ∝ accuracy, or ∝ 1/variance of the
learner.
• For making a prediction, combine the learners' outputs using these
weights.
Challenges in ensemble learners

• The critical point of ensemble learners is that they need to be
independent.
• Averaging a set of observations reduces variance,
• which can be achieved by using either different subsets or different
learners.
• We shall see this in the next session – Bagging and Boosting.
Types of Ensemble Methods

• Manipulate data distribution
• Example: bagging, boosting
• Manipulate input features
• Example: random forests
• Manipulate class labels
• Example: error-correcting output coding
Machine Intelligence

Unit III
Bagging

Srinivas K S
Department of Computer Science
Bagging

• One way of getting different learners to have independent errors is by
splitting the data into subsets and passing them to different learners.
• But since the training instances per subset could be few, we may end
up with overfitting and high variance.
• Instead, we could randomly sample from the data set with replacement,
creating new data sets of the same size as the original (or a very
large fraction of it).
• This method is called bootstrap aggregation, or bagging.
• It can be shown that when we create a data set as described above,
roughly 63% of the data from the original data set is selected, and
about 37% of the original data is not selected, for a sufficiently
large sample size (a quick check follows).

• P(data point not being selected) = (1 − 1/n)^n → 1/e ≈ 0.368 as n
grows large.
Bagging

• Multiple subsets are created from the original dataset, selecting
observations with replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from
all the models (voting or averaging).
Bagging
Bagging Error Calculation

• We can do K-fold cross validation for error calculation.
• Typically about 1/3 of the samples are left out of every bootstrap
subset.
• These are the out-of-bag examples.
• All you have to do is measure the error on each unused sample in those
learners that did not use it.
• The average of the errors from those learners gives us the error for
that sample.
• You can accumulate the error over all data points that are out of bag
and calculate an average (see the sketch below).
• This error estimate is close to the leave-one-out approach.
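A hedged illustration, assuming scikit-learn is available:
BaggingClassifier's oob_score option implements this out-of-bag
estimate (the synthetic dataset below is only for illustration):

```python
# Sketch: bagged trees with the out-of-bag error described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
bag = BaggingClassifier(
    n_estimators=100,   # ~100 learners, as suggested on the next slide
    oob_score=True,     # score each sample only on learners that did not see it
    random_state=0,
).fit(X, y)

print(1 - bag.oob_score_)   # out-of-bag error estimate
```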
How many learners and Advantages

• Most research has shown that about 100 learners are good enough.
• You can also formulate the combined outputs as class probabilities.
Machine Intelligence

Unit III
Boosting

Srinivas K S
Department of Computer Science
Boosting - Preamble

• Let different learners have different weights (as stated earlier, set
by accuracy or by variance).
• Let each learner progressively learn from the previous learner:

1. Let U be the instance space and A be the training data.
2. Let a hypothesis h1 misclassify the instances in yellow.
3. By learning progressively, we mean the 2nd learner h2 may make other
errors, but must ensure that the instances h1 misclassified are learnt
correctly,
4. by giving higher weights to those samples.
5. Ensure the weights of the samples that were got right are reduced.
6. Make sure the sum of all weights adds up to 1.
Boosting vs Bagging

• In the case of bagging, any element has the same probability of
appearing in a new data set.
• However, for boosting the observations are weighted, and therefore
some of them will take part in the new sets more often.
• One of the most popular boosting techniques is AdaBoost.
Boosting Overview

• Most examples on boosting use a weak learner called a decision stump.
• This has one node and, based on the value of a single feature, does a
binary split.
• The boosting approach learns slowly and incrementally.
Boosting Overview

• The red crosses on the right are basically misclassifications.
• Adjust the weights of those points.
• Classify again after assigning weights.
• Ensure the weights of the samples that were got right are reduced.
• The classifier that you use in the end is basically the weighted
summation of all the individual classifiers.
Boosting Overview

• There is a chance of overfitting.
• We have a weighting factor α per learner: α1, α2 and α3 for each of
the learners, used to get your final decision boundary.
• These weights are set by a mechanism, which we will see shortly, that
depends on the number of errors made.
• Some learners are better than others, so give them higher weights.
THANK YOU

Srinivas K S
Department of Computer Science & Engineering
srinivasks@pes.edu
MACHINE
INTELLIGENCE
AdaBoost

K.S.Srinivas
Department of Computer Science
and Engineering
Machine Intelligence

AdaBoost

Srinivas K S
Department of Computer Science
Adaboost – The Algorithm

Source: Peter Flach – Machine Learning – The art and science of algorithms that make sense of data
Schematic illustration of Boosting
Adaboost – Broken down in simple terms

• There is only 1 data set being used.
• Initialize each of the instances to the same normalized weight, i.e.
w = 1/N, N = number of instances.
• Choose the learner with the highest accuracy as your start learner,
say h1(x).
• Run the algorithm and collect the error rate = weighted % of
misclassified examples.
• Ensure that h2(x) does not misclassify the points misclassified by
h1(x).
• This perforce means that h2(x) runs after h1(x).
• Continue and train h3(x), making sure that it gets h2(x)'s errors
right.
• Finally take a (of course weighted) vote of all the hypotheses and
output the combined hypothesis.
• There are 2 kinds of weights – instance weights and hypothesis
weights.
Adaboost

• So let's walk through the algorithm with all its gory details.
• Get your instance set that you will use for training.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

• 1st weight – instance weight.
• When we start the algorithm, all instances have the same weight.
• So assign weights as 1/N = 1/10 (ten samples).
• Since we are using a binary classifier, convert true to +1 and false
to -1.

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

• Create many decision stumps, such as: if x1 < 2.1 predict +1, and if
x1 >= 2.1 predict -1.
• You can create many such stumps and choose the one with the lowest
error rate.
• Assume that the decision stump above is the best one (not actually
true, but for this example let's just assume it).
• Let's make a prediction and calculate the error rate.

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

• Evaluate the error for the mth classifier. Here w_n^(m) is the weight
of the nth data instance in the mth iteration. The indicator function:
I(a, b) = 1 if a != b, and = 0 otherwise.

ε_m = Σ_{n=1}^{N} w_n^(m) · I(y_m(x_n) ≠ t_n) / Σ_{n=1}^{N} w_n^(m)

• Then evaluate the value of the classifier using:

α_m = ½ · ln{ (1 − ε_m) / ε_m }
Adaboost

• The "sum of weight times loss" column stores the total error.
• It is 0.3 in this case.
• Compute the stump weight. We will not prove it here, but alpha is
obtained by minimizing the exponential loss with respect to alpha.

alpha = ln[(1 - epsilon)/epsilon] / 2 = ln[(1 – 0.3)/0.3] / 2
alpha = 0.42

• Remember our final H(x) is a weighted sum of individual hypotheses.
• So alpha_1 is 0.42 for our first decision stump.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

• The other main idea in AdaBoost is that the next learner must learn
from the previous learner.
• The way of doing this is by raising the weights of the instances that
it got wrong and reducing the weights of the instances that it got
right.
• This is done through an update expression: take each instance weight
and multiply it by e raised to –alpha if correctly classified, and e
raised to +alpha if incorrectly classified.
• The term N is a normalizer, because we said that the sum of all
weights must add up to 1.

Let's derive N now.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

Sum of all weights must be 1

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

• We'll use alpha to update the weights in the next round:

w_{i+1} = w_i * exp(-alpha * actual * prediction), where i refers to the
instance number.

• Also, the sum of weights must be equal to 1. That's why we have to
normalize the weight values. Dividing each weight value by the sum of
the weights column performs the normalization (a sketch follows).
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
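A minimal sketch of this update; the labels and stump predictions below
are hypothetical, chosen so that the weighted error is 0.3 as in the
worked example:

```python
# Sketch: one AdaBoost round - weighted error, stump weight alpha, weight update.
import numpy as np

w = np.full(10, 0.1)                                      # instance weights, 1/N
y = np.array([+1, -1, +1, +1, -1, +1, -1, -1, +1, -1])    # true labels (hypothetical)
h = np.array([+1, -1, +1, -1, -1, -1, -1, +1, +1, -1])    # stump output (hypothetical)

eps = w[h != y].sum()                    # weighted error = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)    # stump weight ~ 0.42

w = w * np.exp(-alpha * y * h)           # shrink correct, grow misclassified
w = w / w.sum()                          # normalize so the weights sum to 1
print(eps, round(float(alpha), 2), w.round(3))
```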
Adaboost
• In the next round we choose the stump: x1 < 3.5 is -1 and x1 >= 3.5 is
+1.
• Of course, we use the new weights this time.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
 You can calculate epsilon, alpha and new weights using the same
procedure
 epsilon = 0.21, alpha = 0.65
 And find weights for the next round

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
 At each round I update my final hypothesis

 I have given a table here and the calculations for 4 rounds

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
For example, the prediction for the 1st instance will be

0.42 x 1 + 0.65 x (-1) + 0.38 x 1 + 1.1 x 1 = 1.25

And we will apply the sign function:

Sign(1.25) = +1, aka true, which is correctly classified.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost

Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Alpha vs Error

• A plot of the "value" (alpha) of a classifier vs its error rate will
be as follows.
Alpha vs Error

• We can see that when the error rate is 0, alpha tends to infinity (a
very high value for the classifier).
• If error rate = 0.5, value = 0. This makes sense because a classifier
with error rate = 0.5 is as good as a "coin tosser".
• If error rate = 1, then everything is gotten wrong by the classifier,
and hence its value is –infinity (see the sketch below).
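A short sketch tabulating alpha = ½ ln((1 − ε)/ε) over a few error
rates, matching the three observations above:

```python
# Sketch: the "value" (alpha) of a classifier as a function of its error rate.
import math

for eps in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(f"error={eps:.2f}  alpha={alpha:+.2f}")
# error -> 0  : alpha -> +infinity (a very valuable classifier)
# error = 0.5 : alpha = 0          (no better than a coin tosser)
# error -> 1  : alpha -> -infinity (always wrong; its votes get inverted)
```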
Adaboost toy example
Please try a calculation of this as a home assignment

X 0 1 2 3 4 5 6 7 8 9

Y + + + - - - + + + -
THANK YOU

Srinivas K S
Department of Computer Science & Engineering
srinivasks@pes.edu
MACHINE
INTELLIGENCE
BAYESIAN LEARNING

K.S.Srinivas
Department of Computer Science and Engineering
Probabilistic Learning

• Much of our dear friend the neural network's output is in terms of
probability.
• A sigmoid function is not a probability density function (PDF), as it
integrates to infinity.
• However, it corresponds to the cumulative distribution function of the
logistic distribution.
• Given that sigmoid values lie in the interval [0,1], you can still
interpret them as a confidence index.
• The softmax function outputs a vector that represents a probability
distribution over the possible classes (see the sketch below).
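A brief sketch contrasting the two outputs; the logits are assumed
values:

```python
# Sketch: sigmoid gives a single confidence in [0, 1]; softmax gives a
# probability distribution over classes (non-negative, summing to 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))    # shift for numerical stability
    return e / e.sum()

print(sigmoid(0.8))                          # a confidence index, not a density
print(softmax(np.array([2.0, 1.0, 0.1])))    # sums to 1 across the classes
```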
Probabilistic Learning

• Prof. Patrick Winston often describes Artificial Intelligence as
computational statistics.
• You will see through the course that our intuitive error function for
neural networks is indeed probabilistic.
Probability deals with predicting the likelihood of future events.
• We will see in this class how probability is used for modelling
concepts.
• Bayesian probability is the notion of probability as partial beliefs.
• Bayesian estimation calculates the validity of a proposition.
Probabilistic Learning

• Calculating the validity of a proposition is based on
• prior estimates
• new evidence
• Based on these, the posterior estimation is done.
• This forms the heart of Bayes theorem, but let's step back.
• Psychological research has shown that people can learn concepts from
positive examples alone.
We learn a concept from +ve examples alone.
Probabilistic Learning

• We can think of learning the meaning of a word as equivalent to
concept learning, which in turn is equivalent to binary classification.
• To see this, define f(x) = 1 if x is an example of the concept C, and
f(x) = 0 otherwise.
• Then the goal is to learn the indicator function f, which just defines
which elements are in the set C.
• Suppose I am thinking of some arithmetical concept, such as:
• prime numbers
• numbers between 1 and 10
• even numbers
• I give you a series of randomly chosen positive examples from the
chosen class.
Probabilistic Learning

• Suppose the data set contains all integers from 1 to 100.
• Given some positive examples, we are asked to learn which other
numbers are similar, i.e. belong to the same concept.
Probabilistic Learning

• Thus some numbers are more likely than others.
• We can represent this as a probability distribution, p(x̃|D), which is
the probability that x̃ ∈ C given the data D, for any x̃ ∈ {1, . . . ,
100}.
• This is called the posterior predictive distribution.
Probabilistic Learning

• Now suppose I tell you that 8, 2 and 64 are also positive examples.
• Now you may guess that the hidden concept is "powers of two". This is
an example of induction.
• How can we explain this behavior and emulate it in a machine?
• The classic approach to induction is to suppose we have a hypothesis
space of concepts, H, such as: odd numbers, even numbers, all numbers
between 1 and 100, powers of two.
• The subset of H that is consistent with the data D is called the
version space.
• As we see more examples, the version space shrinks and we become
increasingly certain about the concept.

However, the version space is not the whole story.
After seeing D = {16}, there are many consistent rules; how do you
combine them to predict whether x̃ ∈ C?
Also, after seeing D = {16, 8, 2, 64}, why did you choose the rule
"powers of two" and not, say, "all even numbers", or "powers of two
except for 32", both of which are equally consistent with the evidence?
Likelihood

However, the version space is not the whole story.
After seeing D = {16}, there are many consistent rules; how do you
combine them to predict whether x̃ ∈ C?
Also, after seeing D = {16, 8, 2, 64}, why did you choose the rule
"powers of two" and not, say, "all even numbers", or "powers of two
except for 32", both of which are equally consistent with the evidence?

Why would you say h(powers of 2) instead of h(even numbers)?
• Let's consider the set of all integers from 1 to 100.
• How many powers of 2 do we have? Only 6. And even numbers? 50.
• Then p(D|h_two) = 1/6 per example, since there are only 6 powers of
two less than 100, but p(D|h_even) = 1/50, since there are 50 even
numbers.
• So the likelihood of h = h_two is higher than that of h = h_even.
Likelihood

Likelihood:
• Assume examples are sampled uniformly at random from all numbers that
are consistent with the hypothesis.
• Size principle: favors the smallest consistent hypotheses (a sketch
follows).
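A hedged sketch of the size principle: with examples drawn uniformly
from a hypothesis' extension, p(D|h) = (1/|h|)^|D|, so the smaller
consistent hypothesis wins:

```python
# Sketch: likelihood of D = {16, 8, 2, 64} under two consistent hypotheses.
powers_of_two = {2, 4, 8, 16, 32, 64}        # |h| = 6 within 1..100
even_numbers = set(range(2, 101, 2))         # |h| = 50 within 1..100

def likelihood(D, h):
    # Uniform sampling from the extension of h: (1/|h|)^|D| if D fits h
    return (1.0 / len(h)) ** len(D) if set(D) <= h else 0.0

D = [16, 8, 2, 64]
print(likelihood(D, powers_of_two))   # (1/6)^4  ~ 7.7e-4
print(likelihood(D, even_numbers))    # (1/50)^4 = 1.6e-7, far less likely
```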
Prior

• The prior is the mechanism by which background knowledge can be
brought to bear on a problem.
• Suppose you were told that 1400, 1200, 1600, 1800 are outputs of some
arithmetic process on other numbers. Would you believe 1100 belongs to
that same concept, or would you say 1183 belongs to the same concept?
• Based on prior experience, some hypotheses are more probable (natural)
than others.
Posterior

The posterior is

p(h|D) = p(D|h) p(h) / Σ_{h'∈H} p(D|h') p(h')

where I(D ∈ h) is 1 iff (if and only if) all the data are in the
extension of the hypothesis h.

That's Bayes theorem for you.
Posterior

• We see that the posterior is a combination of prior and likelihood. In
the case of most of the concepts, the prior is uniform, so the
posterior is proportional to the likelihood.
Posterior

• In general, when we have enough data, the posterior p(h|D) becomes
peaked on a single concept, namely the MAP estimate, i.e.,

p(h|D) → δ_ĥMAP(h)

where ĥMAP = argmax_h p(h|D) is the posterior mode, and where δ is the
Dirac measure defined by

δx(A) = 1 if x ∈ A
δx(A) = 0 if x ∉ A
Posterior
MAP ("maximum a posteriori") Learning

Bayes rule: P(h|D) = P(D|h) P(h) / P(D)

Goal of learning: find the maximum a posteriori hypothesis hMAP:

hMAP = argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h) P(h) / P(D)
     = argmax_{h∈H} P(D|h) P(h)

This is the optimal hypothesis in the sense that no other hypothesis is
more likely.

* Note that P(D) can be dropped, because it is a constant independent of
h.


Recap of Bayes theorem

• Bayes theorem provides a principled way of calculating a conditional
probability.
• It provides a way to build a probabilistic model describing the
relationship between data (D) and a hypothesis (h).
Bayes Theorem for Modeling Hypotheses

In the context of classifiers, hypothesis h and training data D are
related as

P(h|D) = P(D|h) P(h) / P(D)

• P(h) – prior probability of hypothesis h
• P(D) – prior probability of training data D
• P(h|D) – posterior probability of h given D
• P(D|h) – likelihood: probability of D given h

Best hypothesis ≈ most probable hypothesis
The goal of Bayesian learning:
to locate a hypothesis that best explains the observed data.
MAP Learning: Applications

• Fitting models like linear regression for predicting a numerical
value, and logistic regression for binary classification, can be framed
and solved under the MAP probabilistic framework.
• This provides an alternative to the more common maximum likelihood
estimation (MLE) framework.
ML Hypothesis

In some cases, every hypothesis in H is equally probable a priori, i.e.
P(hi) = P(hj) for all hi and hj in H. Then

hML = argmax_{h∈H} P(D|h)

hML is called the "maximum likelihood hypothesis".
P(D|h) is the likelihood of the data D given h.
Summary: Classes of Hypotheses

1. MAP hypothesis

hMAP = argmax_{h∈H} P(D|h) P(h)

2. ML hypothesis

hML = argmax_{h∈H} P(D|h)

When is hMAP = hML?
If P(hi) = P(hj) ∀ i,j, then hMAP = hML.
Example: Does the patient have cancer or not?
A patient takes a test for cancer. The test has two outcomes: positive
and negative. It is known that if the patient has cancer, the test is
positive 98% of the time. If the patient does not have cancer, the test
is negative 97% of the time. It is also known that 0.008 of the
population has cancer.

The patient's test is positive.
Which is more likely: should we diagnose a patient whose lab result is
positive as having cancer?
Problem Summary
• Hypothesis space, with outcomes D = {+, -}:
h1 = patient has cancer
h2 = patient does not have cancer
• Prior probability: 0.008 of the population has cancer. Thus
P(cancer) = P(h1) = 0.008
P(¬cancer) = P(h2) = 0.992
• Conditional probability:
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
• Posterior knowledge:
The blood test is + for this patient.
What is the probability that the patient indeed has cancer?

ℎ𝑀𝐴𝑃 ≡ argmax 𝑷 𝒉 𝑫
ℎ𝜖𝐻

≡ argmax 𝑷 𝒉𝟏 + , 𝑷 𝒉𝟐 + }
ℎ𝜖𝐻

≡ argmax{ 𝑷 + 𝒉𝟏 𝑷 𝒉𝟏 , 𝑷 + 𝒉𝟐 𝑷 𝒉𝟐 }
ℎ𝜖𝐻

≡ argmax{ 𝟎. 𝟗𝟖 ∗ 𝟎. 𝟎𝟖 , 𝟎. 𝟎𝟑 ∗ 𝟎. 𝟗𝟗𝟐}
ℎ𝜖𝐻

≡ argmax{ 𝟎. 𝟎𝟎𝟕𝟖, 𝟎. 𝟎𝟐𝟗𝟖}


ℎ𝜖𝐻
𝒉𝑴𝑨𝑷 ≡ 𝒉𝟐(¬ cancer)
The most probable hypothesis is h2(The patient does not have cancer)
Normalization of probabilities
The exact posterior probabilities can be determined by normalizing the
above quantities so that they sum to 1 (a sketch follows):

P(h1|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(h2|+) = 0.0298 / (0.0078 + 0.0298) = 0.79

⇒ The result of Bayesian inference depends strongly on the prior
probabilities, which must be available in order to apply the method
directly.
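A short sketch reproducing these numbers:

```python
# Sketch: unnormalized MAP scores and normalized posteriors for the cancer test.
priors = {"cancer": 0.008, "no_cancer": 0.992}
p_pos  = {"cancer": 0.98,  "no_cancer": 0.03}    # P(+ | h)

unnorm = {h: p_pos[h] * priors[h] for h in priors}   # P(+|h) P(h)
total = sum(unnorm.values())                         # P(+)
posterior = {h: v / total for h, v in unnorm.items()}

print(unnorm)      # {'cancer': 0.00784, 'no_cancer': 0.02976}
print(posterior)   # {'cancer': ~0.21, 'no_cancer': ~0.79}
```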
THANK YOU

Srinivas K.S
Department of Computer Science
srinivasks@pes.edu
Bayes Theorem and Concept Learning
What is the relationship between Bayes theorem and the problem of
concept learning?
It can be used for designing a straightforward learning algorithm called
the Brute-Force MAP Learning algorithm.

Brute-Force MAP Learning Algorithm

• The MAP hypothesis output is used to design a simple learning
algorithm called the "brute-force MAP learning algorithm".
Brute-Force MAP Learning Algorithm

For each hypothesis h ∈ H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D) ………….(1)

Output the hypothesis hMAP with the highest posterior probability:

hMAP = argmax_{h∈H} P(h|D) …………(2)
Brute-Force MAP Learning Algorithm

• This brute-force MAP learning algorithm may be computationally
infeasible, as it requires applying the Bayes theorem to every h ∈ H.
• However, it is still useful as a standard against which other concept
learning approaches may be judged.
Relation to Concept learning

Consider our usual concept learning task:
instance space X, hypothesis space H, training examples D.
Consider the Find-S learning algorithm (outputs the most specific
hypothesis from the version space VS_{H,D}).

What would Bayes rule produce as the MAP hypothesis?
Does Find-S output a MAP hypothesis?
Relation to Concept learning

Assume a fixed set of instances <x1, x2, ………, xm>.
Assume D is the set of classifications D = <c(x1), ……., c(xm)>.

Choose P(D|h):
P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
P(h) = 1/|H| for all h in H
Brute-Force MAP Learning

Therefore, the brute-force algorithm can now proceed in two ways.

If h is inconsistent with the training data D:

P(h|D) = 0 · P(h) / P(D) = 0

If h is consistent with the training data D:

P(h|D) = (1 · 1/|H|) / P(D)
       = (1/|H|) / (|VS_{H,D}| / |H|)
       = 1 / |VS_{H,D}|

Thus, every consistent hypothesis has posterior probability 1/|VS_{H,D}|
and is a MAP hypothesis.
Brute-Force MAP Learning
As data is added, the certainty of hypotheses increases.

Figure (a), Figure (b), Figure (c): evolution of probabilities
(a) all hypotheses have the same probability
(b) + (c) as training data accumulates, the posterior probability of
inconsistent hypotheses becomes zero, while the total probability,
summing to 1, is shared equally among the remaining consistent
hypotheses
Consistent Learners

Every hypothesis consistent with D is a MAP hypothesis, if
• there is a uniform probability over H
• the target function c ∈ H
• the data is deterministic and noise-free

Find-S will output a MAP hypothesis, even though it does not explicitly
use probabilities in learning.

Bayesian interpretation of inductive bias: use Bayes theorem, define
restrictions on P(h) and P(D|h).
Characterizing Learning Algorithms by Equivalent MAP Learners
References

1. “Machine Learning”, Tom Mitchell, McGraw Hill Education (India), 2013.


Notations used

• h (hypothesis): A single hypothesis, e.g. an instance or specific
candidate model that maps inputs to outputs and can be evaluated and
used to make predictions.
• H (hypothesis set): A space of possible hypotheses for mapping inputs
to outputs that can be searched, often constrained by the choice of the
framing of the problem, the choice of model and the choice of model
configuration.
• D (training data): The set of observed training examples.
THANK YOU

Srinivas K.S
Department of Computer Science
srinivasks@pes.edu
MACHINE
INTELLIGENCE
Maximum Likelihood
and Bayes Optimal Classifier
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis

• Let us consider a continuous-valued target function.
• This is the problem faced by many learning approaches, such as neural
networks, linear regression and polynomial curve fitting.
• A straightforward Bayesian analysis will show that, under certain
assumptions, any learning algorithm that minimizes the squared error
between the hypothesis predictions and the training data will output a
maximum likelihood hypothesis (hML).
• We have already witnessed this: in the neural network method we
attempt to minimize the sum of the squared errors over the training
data.
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis

• Learner L : X → R, where R represents the set of real numbers.
• L is to learn an unknown target function f : X → R drawn from H.
• Let the number of training examples be m.
• We will assume our training data to be noisy.
• So each instance is a training example <xi, di>, where di is a noisy
training value such that di = f(xi) + ei, where ei is a random noise
variable drawn from a Gaussian distribution with mean 0.
• The figure shows the target function f and the maximum likelihood
hypothesis hML.
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis

• Let us assume all hypotheses start with equal prior probability.
• Now, hML = argmax_{h∈H} p(D|h), where p is the probability density.
• Given that the noise ei obeys a Normal distribution, di also obeys a
Gaussian distribution.
• According to the Gaussian distribution:

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) · e^(−(di − μ)² / (2σ²))

• where μ = h(xi)
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis

• We now apply a transformation that is common in maximum likelihood
calculations:
• rather than maximizing this complicated expression, we choose to
maximize its (less complicated) logarithm:

hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

• The first term in this expression is a constant, independent of h, and
can therefore be discarded.
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis

• Discarding the constant term yields

hML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))² / (2σ²)

• Maximizing this negative quantity is the same as minimizing the
corresponding positive quantity, yielding

hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))² / (2σ²)

• Finally we can discard the constants that are independent of h, giving
us

hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

• We can also easily show that the maximum likelihood hypothesis, when
predicting probabilities, is the same as the one minimizing the
cross-entropy loss.
MACHINE INTELLIGENCE

Maximum Likelihood and Bayes


Optimal Classifier

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Bayes Optimal Classifier

• We have so far covered: "What is the most probable hypothesis given
the training data?"
• But we can now attempt to answer the question: "What is the most
probable classification of a new instance given the training data?"
• We can answer this by applying the MAP hypothesis to the new instance,
but we can do better.
• Consider a hypothesis space consisting of 3 hypotheses h1, h2, h3.
• Suppose the posterior probabilities of these hypotheses given the
training data are 0.4, 0.3 and 0.3 respectively.
• Suppose a new instance x is encountered, which is classified as +ve by
h1 and -ve by h2 and h3.
• Taking all hypotheses into account, the probability that x is positive
is 0.4 and the probability that x is negative is 0.6.
• The most probable classification is -ve.
• In this case it is different from the classification generated by the
MAP hypothesis.
MACHINE INTELLIGENCE
Bayes Optimal Classifier

• In general, the most probable classification of a new instance is
obtained by combining the predictions of all hypotheses, weighted by
their posterior probabilities.
• If the possible classification of the new instance can take any value
vj from a set V, then the probability P(vj|D) that the correct
classification for the new instance is vj is

P(vj|D) = Σ_{hi∈H} P(vj|hi) · P(hi|D)

• where P(hi|D) is the weight associated with hypothesis hi.
• The optimal classification of the new instance is the value vj for
which P(vj|D) is maximum, i.e.

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) · P(hi|D)
MACHINE INTELLIGENCE
Bayes Optimal Classifier

• To illustrate in terms of the above example: the set of possible
values of the new instance is V = {+ve, -ve}, and h1, h2 and h3 are the
three hypotheses.

P(h1|D) = 0.4   P(-ve|h1) = 0   P(+ve|h1) = 1
P(h2|D) = 0.3   P(-ve|h2) = 1   P(+ve|h2) = 0
P(h3|D) = 0.3   P(-ve|h3) = 1   P(+ve|h3) = 0

Therefore,

Σ_{hi∈H} P(+ve|hi) · P(hi|D) = 1×0.4 + 0×0.3 + 0×0.3 = 0.4
Σ_{hi∈H} P(-ve|hi) · P(hi|D) = 0×0.4 + 1×0.3 + 1×0.3 = 0.6

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) · P(hi|D) = -ve

The equation argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) · P(hi|D) is called the
Bayes Optimal Classifier, or Bayes Optimal Learner (a sketch follows).
MACHINE INTELLIGENCE
Bayes Optimal Classifier

• This method maximizes the probability that the new instance is
classified correctly, given the available data, hypothesis space and
prior probabilities over the hypotheses.
MACHINE INTELLIGENCE
Gibbs Algorithm

• The Bayes optimal classifier obtains the best performance that can be
achieved from the training data, but it is quite costly to apply.
• The expense is due to the fact that it computes the posterior
probability for every hypothesis in H and combines the predictions of
each hypothesis to classify each new instance.
• An alternative, less optimal method is the Gibbs algorithm, defined as
follows:
1. Choose a hypothesis h from H at random, according to the posterior
probability distribution over H.
2. Use h to predict the classification of the next instance x.

Note: surprisingly, it can be shown that under certain conditions the
expected misclassification error of the Gibbs algorithm is at most twice
the expected error of the Bayes optimal classifier.
THANK YOU

K.S.Srinivas
srinivasks@pes.edu
+91 80 2672 1983 Extn 701
MACHINE
INTELLIGENCE
Naïve Bayes and Applications

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Naive Bayes Classifier
• A highly practical Bayesian learning method.
• Comparable performance with neural network and decision tree learning.
• The naive Bayes classifier applies to learning tasks where each
instance x is described by a conjunction of attribute values and where
the target function f(x) can take any value from a finite set V.
• A set of training examples of the target function is provided, and a
new instance is presented as a tuple (a1, a2, a3, a4, ........, an).
• The learner/classifier is asked to predict the target value, or the
classification, of the new instance.

• The Bayesian approach to classifying the new instance is to assign the
most probable value, vMAP, given the attribute values (a1, ....., an)
that describe the instance:

vMAP = argmax_{vj∈V} P(vj | a1, ....., an)
MACHINE INTELLIGENCE
Naive Bayes Classifier

• Let's recall Bayes theorem. Using it,

vMAP = argmax_{vj∈V} P(a1, ....., an | vj) P(vj) / P(a1, ....., an)
     = argmax_{vj∈V} P(a1, ....., an | vj) P(vj)

• It is easy to estimate P(vj):
• count the frequency with which the target value vj occurs in the
training set.
• Estimating P(a1, ....., an | vj) the same way looks easy, until we
realize it would require a very, very large training data set, since
the number of such terms equals the number of possible instances times
the number of target values.
MACHINE INTELLIGENCE
Naive Bayes Classifier

• The naive Bayes classifier is based on the simplifying assumption that
the attribute values are conditionally independent given the target
value,
• i.e. the assumption is that, given the target value of the instance,
the probability of observing the conjunction a1, a2, a3, ....., an is
just the product of the probabilities of the individual attributes.
• Using this in our vMAP equation:

vNB = argmax_{vj∈V} P(vj) ∏_i P(ai | vj)
MACHINE INTELLIGENCE
Naive Bayes Classifier

• Let's note a few things:
• The number of distinct P(ai|vj) terms that must be estimated from the
training set D is equal to the number of distinct attribute values
times the number of distinct target values.
• They are estimated based on their frequencies over the training set.
• Whenever the NB assumption of conditional independence is satisfied,
the NB classification = the MAP classification.
MACHINE INTELLIGENCE
Example - Play Tennis

• Recall the data set that we used to build a decision tree,
• with 14 samples of the target concept PlayTennis with values YES or
NO.
• Each day is described by the attributes
(OUTLOOK, TEMPERATURE, HUMIDITY, WIND).
• Use the NB classifier and the training data from the table to classify
the following novel instance:
x = (Outlook=Sunny, Temp=Cool, Hum=High, Wind=Strong)
MACHINE INTELLIGENCE
Example - Play Tennis

vNB = argmax_{vj∈{YES,NO}} [ P(vj) · P(outlook=sunny|vj) ·
P(temperature=cool|vj) · P(humidity=high|vj) · P(wind=strong|vj) ]

The probabilities of the different target values can easily be estimated
based on their frequencies over the 14 training examples:
P(PlayTennis=YES) = 9/14 = 0.64
P(PlayTennis=NO) = 5/14 = 0.36
MACHINE INTELLIGENCE
Example - Play Tennis

Similarly we estimate the conditional probabilities:
P(wind=strong | play tennis=YES) = 3/9 = 0.333
P(wind=strong | play tennis=NO) = 3/5 = 0.60
P(outlook=sunny | play tennis=YES) = 2/9 = 0.222
P(outlook=sunny | play tennis=NO) = 3/5 = 0.60
P(temp=cool | play tennis=YES) = 3/9 = 0.333
P(temp=cool | play tennis=NO) = 1/5 = 0.20
P(humidity=high | play tennis=YES) = 3/9 = 0.333
P(humidity=high | play tennis=NO) = 4/5 = 0.80
MACHINE INTELLIGENCE
Example - Play Tennis

P(yes)·P(cool|yes)·P(sunny|yes)·P(high|yes)·P(strong|yes)
= 0.64 × 0.333 × 0.222 × 0.333 × 0.333 = 0.0053
P(no)·P(cool|no)·P(sunny|no)·P(high|no)·P(strong|no)
= 0.36 × 0.2 × 0.6 × 0.8 × 0.6 = 0.0206

Therefore PlayTennis(x) = NO (a sketch follows).
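A compact sketch reproducing this computation, with the counts taken
from the 14-example table:

```python
# Sketch: naive Bayes scores for x = (sunny, cool, high, strong).
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {  # P(attribute value | class) for the query instance
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}

scores = {
    v: priors[v] * cond[v]["sunny"] * cond[v]["cool"]
       * cond[v]["high"] * cond[v]["strong"]
    for v in priors
}
print(scores)                        # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))   # 'no' -> PlayTennis(x) = NO
```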
MACHINE INTELLIGENCE
Special Case
MACHINE INTELLIGENCE
Example 2 - Text Classification

• Consider the following data set:

sentence                        class
A great game                    sports
The election was over           not sports
Very clean match                sports
A clean but forgettable game    sports
It was a close election         not sports

• The task is to classify the sentence "A very close game" as sports or
not sports.
• In this data set we do not have numbers; we have only text.
• We need to convert all this text into numbers that we can use for
calculation. HOW?
• One solution is to use the frequency of words:
• ignore word order and sentence construction;
• treat every document as the set of words it contains.
• Now the feature used in this case is the counts of words (word
frequencies).
• It's a simplistic approach, but it works surprisingly well.
MACHINE INTELLIGENCE
Example 2 - Text Classification

• Now we need to transform the probability we want to calculate into
something that can be calculated using word frequencies.
• Bayes theorem, for example:

P(sports | a very close game)
= P(a very close game | sports) × P(sports) / P(a very close game)

• Since in our classifier we are just trying to find out which category
has the bigger probability, we can discard the divisor:
• it is the same for both categories.
• We can compare
P(a very close game | sports) × P(sports)
with
P(a very close game | not sports) × P(not sports)
MACHINE INTELLIGENCE
Example 2 - Text Classification

• The probabilities could in principle be calculated as follows:
1. count how many times the sentence "A very close game" appears in the
sports category;
2. divide by the total;
3. obtain P(a very close game | sports).

• PROBLEM: we do not have the sentence in the training set
=> the probability is zero.
• Unless every sentence we want to classify appears in the training set,
the model won't classify.
MACHINE INTELLIGENCE
Example 2 - Text Classification

So
• we assume that every word in a sentence is independent of the other
ones;
• we no longer look for entire sentences, but only for individual words.
• i.e. the sentence "This was a funny party" is treated the same as
"funny is party was this", and the same as "party funny this was a".
• We can write this as:
P(a very close game) = P(a) × P(very) × P(close) × P(game)
• This enables the model to work well.
• Now let's apply it:
P(a very close game | sports) =
P(a|sports) × P(very|sports) × P(close|sports) × P(game|sports)
• Since all these individual words actually show up several times in our
training set, we can do our calculations.
MACHINE INTELLIGENCE
Example 2 - Text Classification

CALCULATING PROBABILITIES
• The final step is just to calculate every probability and see which
one turns out to be larger.
• First: calculate the a priori probability of each category from the
sentences given in the training set:
P(sports) = 3/5 = 0.6
P(not sports) = 2/5 = 0.4
• Calculate P(game|sports): count the number of times the word "game"
appears in the sports samples, divided by the total number of words in
sports,
i.e. it appears twice among 11 words:
P(game|sports) = 2/11 = 0.1818

• A problem again:
• the word "close" does not appear in any sports sentence, and would
lead to 0 when multiplied with the other probabilities.
MACHINE INTELLIGENCE
Example 2 - Text Classification

To resolve this we do something called Laplace smoothing:
• we add 1 to every count so it is never zero;
• to balance this, we add the number of possible words to the divisor.
• In our case the possible (distinct) words are
{a, great, game, the, election, was, over, very, clean, match, but,
forgettable, it, close} = 14
• Applying smoothing we get:

WORD    P(word|sports)          P(word|not sports)
a       (2+1)/(11+14) = 3/25    (1+1)/(9+14) = 2/23
very    (1+1)/(11+14) = 2/25    (0+1)/(9+14) = 1/23
close   (0+1)/(11+14) = 1/25    (1+1)/(9+14) = 2/23
game    (2+1)/(11+14) = 3/25    (0+1)/(9+14) = 1/23
MACHINE INTELLIGENCE
Example 2 - Text Classification

Now we multiply all the probabilities to see which is bigger:

P(a|sports) × P(very|sports) × P(close|sports) × P(game|sports) ×
P(sports) = (3/25 × 2/25 × 1/25 × 3/25) × 0.6 = 0.000027648

P(a|not sports) × P(very|not sports) × P(close|not sports) ×
P(game|not sports) × P(not sports) = (2/23 × 1/23 × 2/23 × 1/23) × 0.4
= 5.717532 × 10^-6

By this we successfully classify "A very close game" as the sports
category (a sketch of the full pipeline follows).
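A hedged sketch of the whole pipeline above (word counts, Laplace
smoothing, and comparison of the unnormalized posteriors), using the
five training sentences:

```python
# Sketch: naive Bayes text classification with Laplace smoothing.
from collections import Counter

train = [
    ("a great game", "sports"),
    ("the election was over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]

counts = {c: Counter() for _, c in train}
for sent, c in train:
    counts[c].update(sent.split())

vocab = {w for sent, _ in train for w in sent.split()}   # 14 distinct words
priors = {c: sum(1 for _, k in train if k == c) / len(train) for c in counts}

def score(sentence, c):
    total = sum(counts[c].values())    # words in class c: 11 sports, 9 not
    s = priors[c]
    for w in sentence.split():
        s *= (counts[c][w] + 1) / (total + len(vocab))   # Laplace smoothing
    return s

for c in counts:
    print(c, score("a very close game", c))  # sports ~2.76e-5 > not sports ~5.7e-6
```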
MACHINE INTELLIGENCE
Advanced Techniques

• Removing stop words:
example: a, able, the
"a very close game" => "very close game"
• Stemming/lemmatization: words like election/elected are grouped
together and counted as one word.
• Using n-grams:
instead of counting individual words we can count sequences of words,
example: "clean match", "close election".
• TF-IDF:
term frequency–inverse document frequency is a numerical statistic that
is intended to reflect how important a word is to a document in a
collection or corpus.
THANK YOU

K.S.Srinivas
srinivasks@pes.edu
+91 80 2672 1983 Extn 701
MACHINE
INTELLIGENCE
EXPECTATION MAXIMIZATION

K.S.Srinivas
Department of Computer Science and Engineering
Machine Intelligence

Unit III
Expectation Maximization

Srinivas K.S
Department of Computer Science
Expectation Maximization

• Maximum likelihood estimation is an approach to density estimation for
a dataset: searching across probability distributions and their
parameters.
• It is a general and effective approach that underlies many machine
learning algorithms, although it requires that the training dataset is
complete, i.e. all relevant interacting random variables are present.
• Maximum likelihood becomes intractable if there are variables that
interact with those in the dataset but were hidden or not observed,
so-called latent variables.
• The expectation-maximization algorithm is an approach for performing
maximum likelihood estimation in the presence of latent variables.

It does this by first
estimating the values for the latent variables (E),
then optimizing the model (M),
then repeating these two steps until convergence.
Unsupervised Learning and EM

• A central application of unsupervised learning is in the field of
density estimation.
• We will cover unsupervised learning in Unit 4, some 8 hours from now –
but let's understand one of the simplest unsupervised learning
algorithms to set the context for expectation maximization:
• K-means clustering.
K-Means Clustering

• K-means clustering is a simple and elegant approach for partitioning a
data set into K distinct, non-overlapping clusters.
• To perform K-means clustering, we must first specify the desired
number of clusters K.
• The K-means algorithm will assign each observation to exactly one of
the K clusters.
K-Means Clustering

Let C1, . . ., CK denote sets containing the indices of the observations
in each cluster.
They must satisfy 2 properties:
1. C1 ∪ C2 ∪ . . . ∪ CK = {1, . . ., n}. In other words, each
observation belongs to at least one of the K clusters.
2. Ck ∩ Ck' = ∅ for all k ≠ k'. In other words, the clusters are
non-overlapping: no observation belongs to more than one cluster.
K-Means Clustering
• The idea behind K-means clustering is that a good clustering is one
for which the within-cluster variation is as small as possible.

• The within-cluster variation for cluster Ck is a measure W(Ck) of the
amount by which the observations within a cluster differ from each
other.

• In words, we want to partition the observations into K clusters such
that the total within-cluster variation, summed over all K clusters,
Σ_{k=1}^{K} W(Ck), is as small as possible.

• The intra-cluster distance is measured using the Euclidean distance
between pairwise instances in the cluster.
Expectation Maximization of K-Means
• The E-step is assigning the data points to the closest cluster.
• The M-step is computing the centroid of each cluster.

• Let's see why convergence is guaranteed:
• E-step: set wik = 1 for data point xi if it belongs to cluster k
(closest centroid); otherwise wik = 0.
• M-step: recompute each centroid as the mean of its assigned points
(a sketch follows).
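A minimal numpy sketch of these two steps on assumed 1-D data with
K = 2:

```python
# Sketch: K-means as alternating E (assign) and M (re-center) steps.
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
centroids = np.array([0.5, 5.0])             # initial guesses

for _ in range(10):
    # E-step: w_ik = 1 for the closest centroid, 0 otherwise (hard assignment)
    labels = np.argmin(np.abs(X[:, None] - centroids[None, :]), axis=1)
    # M-step: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == k].mean() for k in range(2)])

print(centroids)                             # ~[0, 6]
```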
Closing Notes on K-Means

• Standardize your data before clustering, i.e. transform it so that the
mean μ = 0 and the standard deviation σ = 1.
• The K-means algorithm converges to a local minimum:
○ can try multiple random restarts.
Expectation Maximization

• In the expectation, or E-step, the missing data are estimated given
the
• observed data, and the
• current estimate of the model parameters.
• In the maximization, or M-step, the likelihood function is maximized
• under the assumption that the missing data are known.
• The estimates of the missing data from the E-step are used in lieu of
the actual missing data.
Expectation Maximization

• Assume that we have two coins, C1 and C2.
• Assume the bias of C1 is θ1 (i.e., the probability of getting heads
with C1).
• Assume the bias of C2 is θ2 (i.e., the probability of getting heads
with C2).
• We want to find θ1, θ2 by performing a number of trials (i.e., coin
tosses).
Expectation Maximization

So far we have only sketched EM.
Let's look at a real EM using a binomial model: a coin experiment.

Suppose your friend has posed a challenge:
estimate the bias of two coins in her possession.
They might be fair coins, or be more heavily weighted towards heads; you
don't know.
Here's the clue she's provided: a piece of paper with 5 records of an
experiment where she has:
Expectation Maximization
1. chosen one of the two coins at random;
2. flipped that same coin 10 times.
How can you provide a reasonable estimate of each coin's bias? Let's
refer to these coins as coin A and coin B and their biases as θA and θB.
Expectation Maximization
Expectation Maximization

Thus (if A represents θ1 and B represents θ2):

θ1 = 24/30 = 0.8
θ2 = 9/20 = 0.45
Expectation Maximization

Now assume a more challenging problem: we do not know the identities of
the coins used for each set of tosses (we treat them as hidden
variables).
Expectation Maximization

• This can be modelled as a binomial distribution.
• Each trial belongs to either coin A or coin B.
• We only know that each coin has an equal chance of being chosen each
time.
• In this scenario, the coin is not observed, and can be considered a
hidden or latent variable.
• This is the setup.
Expectation Maximization
Right now we're stuck, because we'd like to count up the number of heads
for each coin, but we don't know which coin is being flipped in each
trial.

It turns out that we can make progress by starting with a guess for the
coin biases,
which will allow us to estimate which coin was chosen in each trial and
come up with an estimate for the expected number of heads and tails for
each coin across the trials (E-step).

We then use these counts to recompute a better guess for each coin bias
(M-step).

By repeating these two steps, we continue to get a better estimate of
the two coin biases and converge at a solution that turns out to be a
local maximum of the problem.
Expectation Maximization
Estimating the likelihood that each coin was chosen:
• estimate the probability that each coin is the true coin given the
flips we see in the trial,
• which will allow us to estimate which coin was chosen in each trial;
• use that to proportionally assign heads and tails counts to each coin.
Let's make this concrete with one of the examples we just mentioned:
• Let's initially guess that our current biases for coins A and B are
0.4 and 0.7.
• We observe the following flips: HHHHHHHHTT.
• What is the probability that these flips came from coin A or coin B?
Let's call this series of flips event E, the event we chose A ZA, and
the event we chose B ZB.
• Both coins are equally likely to be chosen, so P(ZA) = P(ZB) = 0.5.
• Now we need to estimate P(E|ZA) and P(E|ZB), for example
P(HHHHHHHHTT|ZA).
• Recollect that since each flip has only 2 outcomes, we can use the
binomial model, e.g. P(E|ZA) = θA^8 (1 − θA)^2 for 8 heads and 2 tails.
Expectation Maximization

It looks like the first trial came from coin B.
But what we wish to find is P(ZA|E) and P(ZB|E).
Expectation Maximization

P(ZA) = P(ZB) = 0.5, so we can cancel these values in the equation.

Thanks to Bayes' theorem and the law of total probability, we can
partition all of the events in Z (which coin we choose) over ZA and ZB,
as we have to choose one or the other:

P(ZA|E) = P(E|ZA) P(ZA) / (P(E|ZA) P(ZA) + P(E|ZB) P(ZB))
Expectation Maximization

Let's do a full cycle for one set of trials.
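A hedged sketch of such a cycle, iterated to convergence; the five heads
counts and the initial biases 0.4 and 0.7 are assumptions for
illustration:

```python
# Sketch: EM for two coins with hidden coin identities.
import numpy as np

heads = np.array([5, 9, 8, 4, 7])    # heads in each of 5 trials of 10 flips
n = 10
theta_A, theta_B = 0.4, 0.7          # initial guesses for the biases

for _ in range(20):
    # E-step: P(Z_A | E) per trial via Bayes rule, with P(Z_A) = P(Z_B) = 0.5
    like_A = theta_A**heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B**heads * (1 - theta_B) ** (n - heads)
    p_A = like_A / (like_A + like_B)

    # Proportionally credit expected heads/tails counts to each coin
    heads_A, tails_A = (p_A * heads).sum(), (p_A * (n - heads)).sum()
    heads_B, tails_B = ((1 - p_A) * heads).sum(), ((1 - p_A) * (n - heads)).sum()

    # M-step: re-estimate each bias from its expected counts
    theta_A = heads_A / (heads_A + tails_A)
    theta_B = heads_B / (heads_B + tails_B)

print(theta_A, theta_B)              # converges to a local maximum
```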
Expectation Maximization
MACHINE
INTELLIGENCE
EXPECTATION MAXIMIZATION - GMM

K.S.Srinivas
Department of Computer Science
and Engineering
Machine Intelligence

Unit III
Gaussian Mixture Models

Srinivas K.S
Department of Computer Science
Gaussian Distributions

Univariate Gaussian Distribution

Multivariate Gaussian Distribution
A multivariate normal distribution is a distribution over vectors of
multiple normally distributed variables, such that any linear
combination of the variables is also normally distributed.
Gaussian Mixture Models

We need to estimate the parameters of such a distribution.
One method – Maximum Likelihood (ML) estimation.
One method – Maximum Likelihood (ML) Estimation.
Gaussian Mixture Models
Gaussian Mixture Models
Gaussian Mixture Models – The Algorithm

Let's understand this using a 1-D example (a sketch follows).
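A minimal 1-D sketch of the EM iterations for a two-component Gaussian
mixture; the data and initial values are assumptions, and note the soft
responsibilities in the E-step, in contrast to K-means' hard
assignments:

```python
# Sketch: EM for a 1-D, 2-component Gaussian mixture model.
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2); broadcasts over the two components
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1.0, 150), rng.normal(3, 1.5, 150)])

pi = np.array([0.5, 0.5])            # mixing weights
mu = np.array([-1.0, 1.0])           # initial means
sigma = np.array([1.0, 1.0])         # initial standard deviations

for _ in range(50):
    # E-step: responsibility r[i, k] of component k for point i
    dens = pi * gauss_pdf(X[:, None], mu, sigma)       # shape (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimates of pi, mu, sigma
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi.round(2), mu.round(2), sigma.round(2))   # ~[0.5 0.5] [-2 3] [1 1.5]
```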
EM-GMM Example
EM-GMM Example
EM-GMM Example
D>1 Example
Applications

• Estimating the parameters of a Gaussian mixture model.
• The Baum-Welch algorithm in Hidden Markov Models.
• Clustering.
THANK YOU

Srinivas K S
Department of Computer Science & Engineering
srinivasks@pes.edu
MACHINE
INTELLIGENCE
Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Discrete Markov Processes

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Discrete Markov Process

• We have so far commonly dealt with random samples.
• A random sample can be thought of as a set of objects that are chosen
randomly. Or, more formally, it's "a sequence of independent,
identically distributed (IID) random variables".
• Identically distributed means that there are no overall trends – the
distribution doesn't fluctuate, and all items in the sample are taken
from the same probability distribution.
• Independent means that the sample items are all independent events. In
other words, they aren't connected to each other in any way.
MACHINE INTELLIGENCE
Discrete Markov Process

Random variables X and Y on the same probability space are said to be independent
if the events X = a and Y = b are independent for all values a,b. Equivalently, the joint
distribution of independent r.v.’s decomposes as

P(X = a,Y = b) = P(X = a)P(Y = b) ∀a,b.

Examples: Put m balls with numbers written on them in an urn. Draw n balls from the urn
with replacement, and let Xi be the number on the ith ball. Then X1, X2, ..., Xn will be i.i.d.

But if we draw the balls without replacement, X1, X2, ..., Xn will not be i.i.d. - they will all
have the same distribution, but will not be independent.

If we draw the balls with replacement, but let Xi be i times the number on the ith ball, then
X1, X2, ..., Xn will not be i.i.d. - they will be independent, but they will have different
distributions.
MACHINE INTELLIGENCE
Discrete Markov Process

• The IID assumption that we make in Naïve Bayes is a very strong
assumption.
• It does not always hold good in many cases.
MACHINE INTELLIGENCE
Discrete Markov Process

• In each of the cases that we just saw, the occurrence of a value of a
feature affects the value of another feature.
• As an example: if it has been rainy over the last 48 hours, what are
the chances of today being rainy as opposed to sunny?
• In each of these cases we need to consider the probabilistic effect of
one feature on another.
• Modelling of such cases is done along the lines of finite state
machines.
MACHINE INTELLIGENCE
Discrete Markov Process

• We know that state machines usually have a start state, an alphabet
whose symbols take you to new states, and an end state.
• The alphabet symbols can be thought of as the changes in conditions
that cause a state transition.
• We modify the conditions slightly in modelling a Markov process: there
is no single designated start state.
• Transitions can happen between all states (more on this later).
MACHINE INTELLIGENCE
Discrete Markov Process

Random variables
• The possible states of the outcomes are also known as the domain of
the random variable, and the outcome is based on the probability
distribution defined over the domain of the random variable.
• In rolling a six-sided die, the domain of the random variable outcome,
O, is given by domain(O) = (1, 2, 3, 4, 5, 6), and the probability
distribution is the uniform distribution P(o) = 1/6 ∀ o ∈ domain(O).
• When the domain of the random variable is discrete, such random
variables are known as discrete random variables.
• Consider the random variable representing the stock price of Google
tomorrow. The domain of this random variable is all positive real
numbers, with most of the probability mass distributed around ±5% of
today's price. Such random variables are known as continuous random
variables.
MACHINE INTELLIGENCE
Discrete Markov Process

Random processes
Random variables are able to mathematically represent the outcomes of a single random
phenomenon.

What if we want to represent these random events over some period of time or the length
of an experiment?

 let's say we want to represent the stock prices for a whole day at intervals of every
one hour

 we want to represent the height of a ball at intervals of every one second after
being dropped from some height in a vacuum.

For such situations, we would need a set of random variables, each of which represents
the outcome at a given instant of time. Such a set of random variables, representing a
random phenomenon over a period of time, is known as a random process. It is worth
noting that the domains of all these random variables are the same.
MACHINE INTELLIGENCE
Discrete Markov Process

• Such random processes, in which we can deterministically find the state of each
random variable given the initial conditions (in this case, dropping the ball, zero initial
velocity) and the parameters of the system (in this case, the value of gravity), are known
as deterministic random processes (commonly called deterministic processes).

• Random processes, in which we can't determine the state of a process, even if we are
given the initial conditions and all the parameters of the system, are known as
stochastic random processes (commonly called stochastic processes).
MACHINE INTELLIGENCE
Discrete Markov Process

Markov processes
A stochastic process is called a Markov process if the state of the random variable at the next
instance of time depends only on the outcome of the random variable at the current time.

Markov property
This property of a system, such that the future states of the system depend only on the
current state of the system, is also known as the Markov property.

Systems satisfying the Markov property are also known as memoryless systems
MACHINE INTELLIGENCE
Discrete Markov Process

A Discrete Markov process ( Markov chain)


• The start state is not defined

• Is a stochastic process over a discrete state space satisfying the Markov property.

• The probability of moving from the current state to the next state depends only on the
present state and not on any of the previous states.

• It is said to be irreducible if we can reach any state of the given Markov chain from any other
state.

• State j is said to be accessible from state i if an integer nij ≥ 0 exists such that
P(Xnij = j | X0 = i) > 0.
MACHINE INTELLIGENCE
Discrete Markov Process

• The probability of the system being in state Rn+1, given that the machine has been in R1 at t=1,
R2 at t=2, ..., and Rn at t=n, can be represented as
P(qn+1 = Rn+1 | q1 = R1, q2 = R2, ..., qn = Rn)

• Such a process is called an n-th order Markov process, where the machine being in a state at n+1 is
conditioned on all the previous states leading up to n.

• Generally the first-order Markov property applies, under which the above expression reduces to
P(qn+1 = Rn+1 | qn = Rn)
MACHINE INTELLIGENCE
Discrete Markov Process
• A first-order Markov chain, with conditional dependence on
only the previous state, can be represented graphically as shown.

• This is how probabilistic graphical models are generally represented:

• Nodes represent random variables
• Edges represent a conditional probability distribution between the two variables they connect.
• This graphical representation gives us insight into the causal relationships between
random variables.
MACHINE INTELLIGENCE
Discrete Markov Processes

• As an example, this is a 2nd order Markov process,

• where the conditional probability is represented as
P(qn+1 = Rn+1 | qn = Rn, qn-1 = Rn-1)
MACHINE INTELLIGENCE
Discrete Markov Processes

• Let us illustrate this with an example in which the weather


can be sunny, rainy, or cloudy, modelled by a first-order
Markov chain
MACHINE INTELLIGENCE
Discrete Markov Process

• Let's understand the model a little better and introduce some more
terms that we need for the model that we will apply.
• The π values that you see here are the starting
probabilities, since we can start in any
state
• a12 states the transition probability of
moving from state 1 to state 2
• The sum of all a's leaving a state must add up
to one.
MACHINE INTELLIGENCE
Discrete Markov Process
• The π values that you see here are the starting
probabilities, since we can start in any
state
• All of the transition probabilities can be
represented as a matrix called the
transition matrix:

• A = | a11 a12 a13 |
      | a21 a22 a23 |      Σj=1..N aij = 1 ∀ i,   Σi=1..N πi = 1
      | a31 a32 a33 |

a11 represents the probability of moving
from state 1 to state 1.

We can now fully define our model
λ = (π, A) for a discrete Markov process
MACHINE INTELLIGENCE
Example Problem (1)
• Consider a first-order Markov process
with 3 states = { Sunny, Cloudy, Rainy }

• π = (0.3, 0.3, 0.4)
• The starting state is Cloudy, so πC = 0.3

             S    C    R
• A =   S | 0.6  0.2  0.2 |
        C | 0.2  0.5  0.3 |
        R | 0.1  0.4  0.5 |

What is the probability of seeing a
sequence of Cloudy, Sunny, Rainy, Cloudy,
Rainy over the next five days?

The trellis is C S R C R.
So we start with πC = 0.3 and multiply by P(next state | current state)
for each step:
P(CSRCR | π, A) = πC · P(S|C) · P(R|S) · P(C|R) · P(R|C)
               = 0.3 × 0.2 × 0.2 × 0.4 × 0.3
               = 0.00144
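
As a quick sketch of the same computation in Python (the dictionary layout and the
function name sequence_probability are my own choices, not from the slides):

pi = {"S": 0.3, "C": 0.3, "R": 0.4}          # starting probabilities
A = {                                        # A[i][j] = P(next = j | current = i)
    "S": {"S": 0.6, "C": 0.2, "R": 0.2},
    "C": {"S": 0.2, "C": 0.5, "R": 0.3},
    "R": {"S": 0.1, "C": 0.4, "R": 0.5},
}

def sequence_probability(seq):
    # P(seq | pi, A): start probability times one transition factor per step.
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability("CSRCR"))         # 0.3*0.2*0.2*0.4*0.3 = 0.00144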
MACHINE INTELLIGENCE
Example Problem (2)
• Consider the trellis over a 14 day period
Compute the parameters of the Discrete
Markov Model
i.e. compute
π and the transition matrix A
Solution:
MACHINE INTELLIGENCE
Example Problem (2)
• Consider the trellis over a 14 day period
Compute the transition matrix
MACHINE INTELLIGENCE
Hidden State
• Consider the following problem. We have 2 friends, Karan and
Vijay, one in Bangalore and the other in Shimoga.
• They speak to each other every day.
• The only thing Karan states to Vijay on any day is whether he is
happy or angry.
• His anger or happiness is decided by the weather, it being
sunny or rainy.
• Given that he says he is H H S H S, can you guess the weather in
Shimoga?

We have another probability, called the emission probability:
given a state (the weather), what is the probability of being
happy or sad? This is represented by another matrix, called the
emission matrix. See you in the next class.
THANK YOU

K.S.Srinivas
srinivasks@pes.edu
+91 80 2672 1983 Extn 701
MACHINE
INTELLIGENCE
Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Discrete Markov Processes

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
HMM

• Let us recall the problem we


framed
• Vijay tells Karan only his mood,
which is dependent on the
weather

KARAN VIJAY
MACHINE INTELLIGENCE
HMM

• The weather can be sunny or rainy


• If it's sunny Vijay is happy, and if it's
rainy Vijay is grumpy
• So Vijay tells Karan his mood, and
Karan guesses the weather based on
that mood
MACHINE INTELLIGENCE
HMM

• Now let's say Vijay is mostly happy


when it's sunny and mostly
grumpy when it's rainy
• That is, there are some
exceptions, and say we have
probabilities for this:
P(Happy|Sunny) = 0.8, P(Grumpy|Sunny) = 0.2
P(Happy|Rainy) = 0.4, P(Grumpy|Rainy) = 0.6
MACHINE INTELLIGENCE
HMM

• So a conversation between Karan


and Vijay looks something like this
MACHINE INTELLIGENCE
HMM

• So a conversation between Karan


and Vijay looks something like
this
• Consider a week's conversation
between Karan and Vijay
MACHINE INTELLIGENCE
HMM

• But this looks unlikely; it's more


likely that if a day is sunny,
the next day is going to be sunny,
and the same with rainy
MACHINE INTELLIGENCE
HMM

• Let's get back to our model


• Let's say if today is sunny, the
probability that the next day is sunny
is 0.8;
• then the probability of the next day being
rainy is 0.2
• Similarly, if today is rainy, the probability
that the next day is rainy is 0.6 and
that it is sunny is 0.4

Transitions: P(S|S) = 0.8, P(R|S) = 0.2, P(S|R) = 0.4, P(R|R) = 0.6
MACHINE INTELLIGENCE
HMM

• What we have now is our hidden


Markov model
• This HMM has two states
• There are some observations we can see
(Vijay's mood)
• and the weather states are hidden
MACHINE INTELLIGENCE
HMM

• The probabilities between the weather
states (0.8, 0.2, 0.4, 0.6 on the arrows)
are called transition probabilities
MACHINE INTELLIGENCE
HMM

• The probabilities from a weather state
to a mood (0.8, 0.2, 0.4, 0.6 on the
arrows) are called emission
probabilities
MACHINE INTELLIGENCE
HMM

• How did we find these • we will try to answer four


probabilities? question now
• Whats the probability that a
random day is sunny or rainy
• if Vijay is happy today,whats
the probaility ,that its sunny
or rainy?
• if for three days Vijay is
Happy,Grumpy,Happy,what
was the weather
MACHINE INTELLIGENCE
How do we find probabilities?

• Let's say we have some data that we


looked at, i.e. we've been
able to study the past weather
• So we have the following data
MACHINE INTELLIGENCE
How do we find probabilities?

• Let's say we have some data that we


looked at, i.e. we've been
able to study the past weather
• So we have the following data,
• and we are going to infer the
probabilities from it:

8 such transitions, Sunny → Sunny, out of 10 starting sunny: 0.8

2 such transitions, Sunny → Rainy, out of 10: 0.2

2 such transitions, Rainy → Sunny, out of 5 starting rainy: 0.4

3 such transitions, Rainy → Rainy, out of 5: 0.6


MACHINE INTELLIGENCE
How do we find probabilities?

• Now consider Vijay's mood on


these days
• From these we can calculate the
emission probabilities:

8 such days, Sunny and Happy, out of 10 sunny days: 0.8

2 such days, Sunny and Grumpy, out of 10 sunny days: 0.2


MACHINE INTELLIGENCE
How do we find probabilities?

• Now consider Vijay's mood on


these days
• From these we can calculate the
emission probabilities:

2 such days, Rainy and Happy, out of 5 rainy days: 0.4

3 such days, Rainy and Grumpy, out of 5 rainy days: 0.6
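
A counting sketch in Python; the day-by-day record below is hypothetical (the
slide's data table did not survive extraction), so the printed numbers are
illustrative rather than the slide's exact 0.8/0.2/0.4/0.6:

from collections import Counter

weather = list("SSSSRRSSSSSRSRR")            # hypothetical: S = Sunny, R = Rainy
mood    = list("HHGHGGHHHHHGHHG")            # hypothetical: H = Happy, G = Grumpy

trans = Counter(zip(weather, weather[1:]))   # counts of (today, tomorrow) pairs
emit  = Counter(zip(weather, mood))          # counts of (weather, mood) pairs
days  = Counter(weather)

for (i, j), c in sorted(trans.items()):
    n_i = sum(v for (a, _), v in trans.items() if a == i)   # transitions out of i
    print(f"P({j}|{i}) = {c}/{n_i} = {c / n_i:.2f}")

for (s, m), c in sorted(emit.items()):
    print(f"P({m}|{s}) = {c}/{days[s]} = {c / days[s]:.2f}")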


MACHINE INTELLIGENCE
HMM

• How did we find these


probabilities?
• What's the probability that a
random day is sunny or rainy?
• If Vijay is happy today, what's
the probability that it's sunny
or rainy?
• If for three days Vijay is
Happy, Grumpy, Happy, what
was the weather?
MACHINE INTELLIGENCE
What is the probability that a random day is sunny or rainy

• Suppose Vijay didn't want to talk to


Karan; now Karan has to figure out
the weather on his own
• Let us get the data back
• Let us calculate the probability of sunny
or rainy directly from the counts:

10 sunny days out of 15: P(Sunny) = 2/3

5 rainy days out of 15: P(Rainy) = 1/3

• That looks like cheating; let's use some other
method
MACHINE INTELLIGENCE
What is the probability that a random day is sunny or rainy

• If today is sunny, it could be
because yesterday was sunny (probability 0.8), or
because yesterday was rainy (probability 0.4)
• So we can write the following
equation:

S = 0.8S + 0.4R
MACHINE INTELLIGENCE
What is the probability that a random day is sunny or rainy

• Similarly, R = 0.2S + 0.6R
• Now we can solve this system of
equations, but these two
equations are essentially the same; we also
know that S + R = 1

S = 0.8S + 0.4R        R = 0.2S + 0.6R

S + R = 1

S = 2/3   R = 1/3
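
The same answer can be checked numerically. A minimal NumPy sketch using the
left-eigenvector route to the stationary distribution (an equivalent method,
not the one used on the slide):

import numpy as np

A = np.array([[0.8, 0.2],                    # rows: today (S, R); columns: tomorrow
              [0.4, 0.6]])

# The stationary distribution is the left eigenvector of A for eigenvalue 1,
# renormalized to sum to 1 (equivalent to solving S = 0.8S + 0.4R with S + R = 1).
vals, vecs = np.linalg.eig(A.T)
stat = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
stat = stat / stat.sum()
print(stat)                                  # -> [0.6667 0.3333], i.e. S = 2/3, R = 1/3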
MACHINE INTELLIGENCE
HMM

• How did we find these


probabilities?
• What's the probability that a
random day is sunny or rainy?
• If Vijay is happy today, what's
the probability that it's sunny
or rainy?
• If for three days Vijay is
Happy, Grumpy, Happy, what
was the weather?
MACHINE INTELLIGENCE
Question 3

• We will use Bayes' theorem


• Suppose today Vijay is happy
• Then there are two possibilities:
it is sunny (prior 2/3) or rainy (prior 1/3)
• Since we are talking about a
particular day, we need not care
about the transition probabilities;
we only need the emissions:
P(Happy|Sunny) = 0.8, P(Grumpy|Sunny) = 0.2
P(Happy|Rainy) = 0.4, P(Grumpy|Rainy) = 0.6
MACHINE INTELLIGENCE
HMM

Worked on the slides: by Bayes' theorem,
P(Sunny|Happy) = P(Happy|Sunny)P(Sunny) / P(Happy)
             = (0.8 × 2/3) / (0.8 × 2/3 + 0.4 × 1/3) = 0.8
P(Rainy|Happy) = (0.4 × 1/3) / (0.8 × 2/3 + 0.4 × 1/3) = 0.2
MACHINE INTELLIGENCE
HMM

• How did we find these


probabilities?
• What's the probability that a
random day is sunny or rainy?
• If Vijay is happy today, what's
the probability that it's sunny
or rainy?
• If for three days Vijay is
Happy, Grumpy, Happy, what
was the weather?
MACHINE INTELLIGENCE
final question

• Suppose Vijay says Happy, Grumpy,


and Happy for three
consecutive days
• Karan is now supposed to guess the
weather
MACHINE INTELLIGENCE
HMM

• Let's look at a simpler case:


suppose we have two days
• How many possible scenarios do
we have for the weather?
• Four cases to study
MACHINE INTELLIGENCE
HMM

• What we will do is take each one


of them and calculate the probability
that, given that weather sequence, Vijay was
happy and then grumpy
• Then we are going to pick
whichever gives us the highest probability
MACHINE INTELLIGENCE
HMM

• Let's calculate the joint


probability of all of this happening
at the same time
MACHINE INTELLIGENCE
HMM

• What is the probability that Wednesday


was sunny?
• We calculated previously that it is
2/3 ≈ 0.67
MACHINE INTELLIGENCE
HMM

• Now, if Wednesday was sunny, what is the


probability of Vijay being happy?
P(Happy|Sunny) = 0.8
• Given Wednesday was sunny, what was
the probability of Thursday being
rainy? P(Rainy|Sunny) = 0.2
• Now, Thursday being rainy, what is the
probability that it made Vijay grumpy?
P(Grumpy|Rainy) = 0.6
MACHINE INTELLIGENCE
HMM

• Now, by the chain rule of conditional


probability, the probability of all of this
happening is the product of all these:

0.67 × 0.8 × 0.2 × 0.6

= 0.06432
MACHINE INTELLIGENCE
HMM

• Doing the same for all four


cases, we get the results shown
• We pick the one that makes
Happy, Grumpy most likely to
happen
MACHINE INTELLIGENCE
HMM

• Now back to our question: what if


Vijay says Happy, Grumpy, Happy?
MACHINE INTELLIGENCE
HMM

Day 1   Day 2   Day 3
• Same thing: three days, each with
one mood, and two possibilities
for each day
• Hence 8 possible scenarios
MACHINE INTELLIGENCE
HMM

• Let us consider this


case: Sunny, Rainy, Sunny

0.67 × 0.8 × 0.2 × 0.6 × 0.4 × 0.8

= 0.0205824
MACHINE INTELLIGENCE
HMM

• Doing the same process for all eight


cases, the maximum probability turns out
to be for Sunny, Sunny, Sunny
(0.67 × 0.8 × 0.8 × 0.2 × 0.8 × 0.8 ≈ 0.0549);
a brute-force sketch follows below
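
A brute-force sketch in Python (the helper name joint and the dictionary layout
are my own; the probabilities are the ones inferred on the earlier slides):

from itertools import product

pi    = {"S": 2/3, "R": 1/3}
trans = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
emit  = {"S": {"H": 0.8, "G": 0.2}, "R": {"H": 0.4, "G": 0.6}}

def joint(states, obs):
    # P(states, obs): prior * emission for day 1, then transition * emission per day.
    p = pi[states[0]] * emit[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

obs = "HGH"
for states in product("SR", repeat=len(obs)):
    print("".join(states), round(joint(states, obs), 7))
# The largest of the eight values is for S,S,S (about 0.0549).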
MACHINE
INTELLIGENCE
Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Hidden Markov Model - Estimation

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Hidden Markov Model
• Until now we assumed that the instances that constitute a sample are IID

Likelihood(sample)=Π Likelihood(instance)

• But consider the following samples


1. Dependence of letters in a word
2. Dependence of base pairs in a DNA sequence
3. Dependence of successive phonemes in speech
• With HMM, we determine the internal state by making observations
MACHINE INTELLIGENCE
Hidden Markov Model
• An HMM models a process as a first-order Markov process.

• It includes the initial state distribution π (the probability Hidden states – Markov chain:
–Dependent only on the previous state
distribution of the initial state) –“The past is independent of the future
given the present.”
• The transition probabilities A from one state (xt) to another.

• HMM also contains the likelihood B of the observation (yt)


given a hidden state. Matrix B is called the emission
probabilities. It demonstrates the probability of our
observation given a specific internal state.
MACHINE INTELLIGENCE
Hidden Markov Model - Formalism

Hidden states – Markov chain:


–Dependent only on the previous state
–“The past is independent of the future
•Shaded nodes are observed variables given the present.”
•Dependent only on their corresponding hidden state
N and M are defined implicitly

•S : {s1…sN } are the values for the hidden states


•K : {k1…kM } are the values for the observations
MACHINE INTELLIGENCE
Hidden Markov Model - Formalism

Hidden states – Markov chain:


–Dependent only on the previous state
–“The past is independent of the future
given the present.”

• Parameters: {S, K, P, A, B}
• Initial hidden state probabilities: P = {pi}
N and M are defined implicitly

• Transition probabilities. A = {aij} are the state


transition probabilities.
• Emission probabilities. B = {bik} are the
observation state probabilities (HMM can also work
with continuous emission probabilities).
MACHINE INTELLIGENCE
Hidden Markov Model - Formalism

Parameters: {S, K, P, A, B}
• Initial hidden state probabilities: P =
{pi}
• Transition probabilities. A = {aij} are
the state transition probabilities.
• Emission probabilities. B = {bik} are the
observation state probabilities
MACHINE INTELLIGENCE
HMM

The complexity of the problem is that the same observations


may originate from different states (happy or not).
MACHINE INTELLIGENCE
HMM

Two major assumptions are made in an HMM: the next state and
the current observation depend solely on the current state.
MACHINE INTELLIGENCE
HMM

• Given all the observables and the initial state distribution, we


can compute a fairly complex equation for the probability
of the internal state xt, P(xt | y₁, y₂, y₃, …, yt), at time t.
• For simplicity here, we will not include π in our equations. All
equations assume π as a given condition, i.e. P(y) is shorthand for P(y|π).

• The equation uses the transition probability and the


emission probability to compute the probability of the
internal state based on all observations.
MACHINE INTELLIGENCE
HMM
Depending on the situation, we usually ask three different types
of questions regarding an HMM problem.

• Likelihood: How likely are the observations based on the


current model or the probability of being at a state at a specific
time step.

• Decoding: Find the internal state sequence based on the


current model and observations.

• Learning. Learn the HMM model.


MACHINE INTELLIGENCE
HMM

1. Probability of an Observation Sequence:


Given a model μ = (A, B, π) over S, K, how do we (efficiently)
compute the likelihood of a particular sequence,
P(O|μ)?
2. Finding the “Best” State Sequence:
Given an observation sequence and a model, how do we
choose a state sequence (X1, . . . ,XT+1) to best explain the
observation sequence?
3. HMM Parameter Estimation:
Given an observation sequence (or corpus thereof), how
do we acquire a model μ = (A, B, π) that best explains the
data?
MACHINE INTELLIGENCE
Likelihood (likelihood of the observation)

The likelihood task is to find the likelihood of the observation sequence Y.

Computed naively, by summing over all possible state sequences, this is computationally intense.


MACHINE INTELLIGENCE
Probability of an Observation Sequence
MACHINE INTELLIGENCE
Probability of an Observation Sequence

αi(t) = P(O1 O2 . . . Ot, Xt = si | μ)

αi(t) is the probability of the machine being in state si at time t, having produced
the observations O1, O2, O3, ..., Ot.
• The probability of moving from
state si to state sj is given by the
probability in the transition matrix
A, more formally aij.
• Having moved to sj, the probability
of emitting Ot+1 is given by bj(Ot+1).
MACHINE INTELLIGENCE
Likelihood (likelihood of the observation)

In an HMM, we solve the problem at time t by using the result from time t-1.
• A circle below represents an HMM hidden state j at time t. So even though
the number of state sequences increases exponentially with time, we
can solve the problem in linear time if we can express the calculation recursively
over time.

• This is the idea of dynamic programming, which


breaks the exponential curse.
MACHINE INTELLIGENCE
Probability of an Observation Sequence

αi(t) = P(O1O2 . . .Ot,Xt = si|μ).

As we can see from the diagram on the right, and as explained earlier, we can
express this recursively in terms of the earlier α's.

We will prove this in the next slides, present the forward


algorithm, and illustrate it with an example.
MACHINE INTELLIGENCE
Proof of the alpha probability

αi(t) = P(O1O2 . . .Ot,Xt = si|μ).


MACHINE INTELLIGENCE
The forward algorithm

Thus the likelihood of the observations can be calculated recursively for each time step, as below:
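
The recursion itself appears on the slide only as a figure; reconstructed here
from the walkthrough that follows, in the document's own notation:

α1(j) = πj * bj(O1)
αt+1(j) = [ Σi=1..N αt(i) * aij ] * bj(Ot+1)
P(O|μ) = Σi=1..N αT(i)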
MACHINE INTELLIGENCE
Toy Example of Forward Algorithm

At time t, the probability of our observations up to time t is:

• Let’s rename the term underlined in red above as αt(j)


(forward probability) and we can express it recursively.
MACHINE INTELLIGENCE
HMM

• Consider this example in which we start with the initial state distribution on the
left.
• Then we propagate the value of α to the right for each timestep.
• Therefore, we break the curse of exponential complexity.
MACHINE INTELLIGENCE
HMM – Canonical Example Problem

π1 = 0.2, π2 = 0.8

(Model from the slide figure, as used in the walkthrough below:)
A = | a11 a12 | = | 0.4 0.6 |      B = | b1(V1) b1(V2) b1(V3) | = | 0.1 0.4 0.5 |
    | a21 a22 |   | 0.3 0.7 |          | b2(V1) b2(V2) b2(V3) |   | 0.3 0.5 0.2 |
MACHINE INTELLIGENCE
HMM

Alpha rule method to find Probability of an Observation Sequence


To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2

To calculate some cell, take the previous time step's alpha values, multiply each with the transition
probability of the corresponding cells, and add them up (Σαt(i)*aij). Multiply this sum with the observation
probability bj(Ot+1) to get
(Σαt(i)*aij)*bj(Ot+1) = αt+1(j) at this cell.
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2

To calculate a cell we would normally take the previous time step's alpha values, multiply each with the
transition probability of the corresponding cells, and add them up. But this is the first column, so α1(i) =
πi*bi(O1)
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2

πi = π1 = 0.2
bi (O1) = b1(V1) = 0.1
α1(i) = πi*bi(O1) = 0.02
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.02

S2

πi = π2 = 0.8
bi (O1) = b2(V1) = 0.3
α1(i) = πi*bi(O1) = 0.24
MACHINE INTELLIGENCE
HMM

Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.02

S2 0.24

To calculate this cell, take the previous time step's alpha values,
multiply each with the transition probability of the corresponding cells, and
add them up (Σαt(i)*aij). Multiply this sum with the observation
probability bj(Ot+1) to get (Σαt(i)*aij)*bj(Ot+1) = αt+1(j) at this cell.
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)
a11 = 0.4

S1 0.02

a21 = 0.3
S2 0.24

(Σαt(i)*ai1) = (0.02 * 0.4 + 0.24 * 0.3) = 0.08


b1(Ot+1) = b1(V3) = 0.5
αt+1 at this cell = (Σαt(i)*ai1)*b1(Ot+1) = 0.04
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)
a12 = 0.6

S1 0.02

a22 = 0.7
S2 0.24

(Σαt(i)*ai2) = (0.02 * 0.6 + 0.24 * 0.7) = 0.18


b2(Ot+1) = b2(V3) = 0.2
αt+1 at this cell = (Σαt(i)*ai2)*b2(Ot+1) = 0.036
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)
a11 = 0.4

S1 0.02 0.04
a21 = 0.3
S2 0.24 0.036

(Σαt(i)*ai1) = (0.04 * 0.4 + 0.036 * 0.3) = 0.0268


b1(Ot+1) = b1(V2) = 0.4
αt+1 at this cell = (Σαt(i)*ai1)*b1(Ot+1) = 0.01072
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence

To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)
a12 = 0.6
S1 0.02 0.04 0.01072

a22 = 0.7
S2 0.24 0.036

(Σαt(i)*ai2) = (0.04 * 0.6 + 0.036 * 0.7) = 0.0492


b2(Ot+1) = b2(V2) = 0.5
αt+1 at this cell = (Σαt(i)*ai2)*b2(Ot+1) = 0.0246
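
A compact sketch of the whole alpha-table computation in NumPy (the array layout
is my own choice; π, A and B are the canonical example's values used above):

import numpy as np

pi = np.array([0.2, 0.8])
A  = np.array([[0.4, 0.6],
               [0.3, 0.7]])                  # A[i, j] = P(S_j at t+1 | S_i at t)
B  = np.array([[0.1, 0.4, 0.5],
               [0.3, 0.5, 0.2]])             # B[i, k] = P(V_k | S_i)

def forward(obs):
    # Build the alpha table, one column per time step.
    alpha = np.zeros((len(pi), len(obs)))
    alpha[:, 0] = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, len(obs)):
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha

obs = [0, 2, 1]                              # V1, V3, V2 (0-indexed symbols)
alpha = forward(obs)
print(alpha)                                 # columns: [0.02 0.24], [0.04 0.036], [0.01072 0.0246]
print(alpha[:, -1].sum())                    # P(O | λ) = 0.03532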
MACHINE
INTELLIGENCE
Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Decoding – Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Proof of the alpha probability

αi(t) = P(O1O2 . . .Ot,Xt = si|μ).


MACHINE INTELLIGENCE
The forward algorithm

Thus the likelihood of the observations can be calculated recursively for each time step, as below:
MACHINE INTELLIGENCE
HMM

• Consider this example in which we start with the initial state distribution on the
left.
• Then we propagate the value of α to the right for each timestep.
• Therefore, we break the curse of exponential complexity.
MACHINE INTELLIGENCE
Backward Probability

The backward probability β is the probability of seeing the observations from time t
+1 to the end, given that we are in state i at time t (and given the automaton λ):
βt(i) = P(Ot+1 Ot+2 . . . OT | Xt = si, λ)
MACHINE INTELLIGENCE
Backward Probability Proof
MACHINE INTELLIGENCE
Backward Probability Algorithm
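
The algorithm itself appears only as a figure; reconstructed from the beta-table
walkthrough that follows, in the document's own notation:

βT(i) = 1
βt(i) = Σj=1..N aij * bj(Ot+1) * βt+1(j)
P(O|λ) = Σi=1..N πi * bi(O1) * β1(i)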
MACHINE INTELLIGENCE
HMM – Canonical Example Problem with backward probability

π1 = 0.2, π2 = 0.8   (same model as before: A = | 0.4 0.6 ; 0.3 0.7 |, B = | 0.1 0.4 0.5 ; 0.3 0.5 0.2 |)
Beta Table
Observation sequence = O = {V1,V3,V2}

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2

To fill up beta table, initialize last column as 1’s


Beta Table
Observation sequence = O = {V1,V3,V2}

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 1

S2 1

To calculate Beta value at some cell, take beta values at


next column, multiply each with corresponding transition
probabilities to get aij * βt+1(j). Multiply each of these
values with corresponding bj(Ot+1) to get aij * βt+1(j) *
bj(Ot+1). Finally add them up to get Σaij * βt+1(j) * bj(Ot+1) =
Beta value at this cell.
Beta Table
Observation sequence = O = {V1,V3,V2}

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
a11 = 0.4 b1(V2) = 0.4
S1 1

a12 = 0.6 b2(V2) = 0.5


S2 1

b1(V2) = 0.4, b2(V2) = 0.5


corresponding beta at next cell *a11* b1(V2) = 1*0.4*0.4 = 0.16
corresponding beta at next cell *a12* b2(V2) = 1*0.6*0.5 = 0.30
Beta value at this cell = 0.16+0.30 = 0.46
Beta Table
Observation sequence = O = {V1,V3,V2}

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
a21 = 0.3 b1(V2) = 0.4
S1 0.46 1

a22 = 0.7 b2(V2) = 0.5


S2 1

b1(V2) = 0.4, b2(V2) = 0.5


corresponding beta at next cell *a21* b1(V2) = 1*0.3*0.4 = 0.12
corresponding beta at next cell *a22* b2(V2) = 1*0.7*0.5 = 0.35
Beta value at this cell = 0.12+0.35 = 0.47
Beta Table
Observation sequence = O = {V1,V3,V2}

b1(V3) = 0.5

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
a11 = 0.4

S1 0.46 1
a12 = 0.6
S2 0.47 1
b2(V3) = 0.2

b1(V3) = 0.5, b2(V3) = 0.2


corresponding beta at next cell *a11* b1(V3) = 0.46*0.4*0.5 = 0.0920
corresponding beta at next cell *a12* b2(V3) = 0.47*0.6*0.2 = 0.0564
Beta value at this cell = 0.0920+0.0564 = 0.1484
Beta Table
Observation sequence = O = {V1,V3,V2}

b1(V3) = 0.5

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.1484 a21 = 0.3 0.46 1


a22 = 0.7
S2 0.47 1
b2(V3) = 0.2

b1(V3) = 0.5, b2(V3) = 0.2


corresponding beta at next cell *a21* b1(V3) = 0.46*0.3*0.5 = 0.0690
corresponding beta at next cell *a22* b2(V3) = 0.47*0.7*0.2 = 0.0658
Beta value at this cell = 0.0690+0.0658 = 0.1348
Beta Table
Observation sequence = O = {V1,V3,V2}

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.1484 0.46 1

S2 0.1348 0.47 1
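
A matching sketch of the beta-table computation in NumPy (same assumed model
layout as in the forward sketch earlier):

import numpy as np

pi = np.array([0.2, 0.8])
A  = np.array([[0.4, 0.6], [0.3, 0.7]])
B  = np.array([[0.1, 0.4, 0.5], [0.3, 0.5, 0.2]])

def backward(obs):
    # Build the beta table; the last column is all ones by definition.
    beta = np.ones((len(pi), len(obs)))
    for t in range(len(obs) - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[:, t] = A @ (B[:, obs[t + 1]] * beta[:, t + 1])
    return beta

obs = [0, 2, 1]                              # V1, V3, V2
beta = backward(obs)
print(beta)                                  # [[0.1484 0.46 1.] [0.1348 0.47 1.]]
print(pi @ (B[:, obs[0]] * beta[:, 0]))      # P(O | λ) = 0.03532, same as forward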
MACHINE INTELLIGENCE
Forward and Backward Procedure
• To learn the HMM model, we need to know which states best
explain the observations.
• That is the occupation probability γ: the probability of state i
at time t given all the observations.
• With the HMM model parameters fixed, we can apply the forward
and backward algorithms to calculate α and β from the observations. γ
can be calculated by simply multiplying α with β and then renormalizing:

γt(i) = αt(i) · βt(i) / P(O | λ)

Recall: p(A, B) = p(A|B) · p(B),  so  p(A|B) = p(A, B) / p(B)
MACHINE INTELLIGENCE
Decoding – 2 Methods – 1st Method
MACHINE INTELLIGENCE
Posterior Decoding
To Find Probability of an Observation Sequence using both Alpha
and Beta Tables

Observation sequence = O = {V1,V3,V2}

Alpha table:

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.02 0.04 0.01072

S2 0.24 0.036 0.0246

Beta table

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.1484 0.46 1

S2 0.1348 0.47 1
To Find Probability of an Observation Sequence using both Alpha
and Beta Tables

Observation sequence = O = {V1,V3,V2}

Take any column of the alpha table and the


corresponding column of the beta table and take their dot product.
If we consider first column, to find P(O = {V1,V3,V2} | λ):
First column values of Alpha table = {0.02, 0.24}
First column values of Beta table = {0.1484,0.1348}
P(O = {V1,V3,V2} | λ) = 0.02*0.1484+0.24*0.1348
=0.03532
To Find Probability of an Observation Sequence using both Alpha
and Beta Tables

Observation sequence = O = {V1,V3,V2}

Take any column of the alpha table and the


corresponding column of the beta table and take their dot product.
If we consider second column, to find P(O = {V1,V3,V2} |
λ):
Second column values of Alpha table = {0.04, 0.036}
Second column values of Beta table = {0.46,0.47}
P(O = {V1,V3,V2} | λ) = 0.04*0.46+0.036*0.47 =0.03532
To Find Probability of an Observation Sequence using both Alpha
and Beta Tables

Observation sequence = O = {V1,V3,V2}

Take any column of the alpha table and the


corresponding column of the beta table and take their dot product.
If we consider last column, to find P(O = {V1,V3,V2} | λ):
Last column values of Alpha table = {0.01072, 0.0246}
Last column values of Beta table = {1,1}
P(O = {V1,V3,V2} | λ) = 0.01072*1+0.0246*1 =0.03532
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

To find the Gamma value at some cell, multiply the
corresponding alpha and beta values at that cell
position and divide by the probability of the
observation sequence.

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.01072 * 1 / 0.03532 = 0.30351

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1

S2
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.0246 * 1 / 0.03532 = 0.69648

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.30351

S2
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.04 * 0.46 / 0.03532 = 0.52095

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.30351

S2 0.69648
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.036 * 0.47 / 0.03532 = 0.47904

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.52095 0.30351

S2 0.69648
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.02 * 0.1484 / 0.03532 = 0.08403

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.52095 0.30351

S2 0.47904 0.69648
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

gamma at this cell =
0.24 * 0.1348 / 0.03532 = 0.91596

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.08403 0.52095 0.30351

S2 0.47904 0.69648
Gamma Table
t =1 (V1 is t =2 (V3 is t =3(V2 is
observed) observed) observed)

The sum of the gamma values in
one column should be 1 (by definition).

Alpha table
S1 0.02 0.04 0.01072
S2 0.24 0.036 0.0246

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)
Beta Table
S1 0.1484 0.46 1

S2 0.1348 0.47 1

t =1 (V1 is t =2 (V3 is t =3(V2 is


observed) observed) observed)

S1 0.08403 0.52095 0.30351

S2 0.91596 0.47904 0.69648
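
A sketch that reproduces the gamma table from the alpha and beta tables above
(the values are hard-coded from those tables):

import numpy as np

alpha = np.array([[0.02, 0.04,  0.01072],
                  [0.24, 0.036, 0.0246]])
beta  = np.array([[0.1484, 0.46, 1.0],
                  [0.1348, 0.47, 1.0]])

prob_O = (alpha * beta).sum(axis=0)          # the same 0.03532 in every column
gamma  = alpha * beta / prob_O               # gamma_t(i) = alpha_t(i)*beta_t(i)/P(O|λ)

print(prob_O)                                # [0.03532 0.03532 0.03532]
print(gamma)                                 # matches the gamma table above
print(gamma.sum(axis=0))                     # each column sums to 1, as stated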


MACHINE INTELLIGENCE
Decoding-Viterbi algorithm

• The decoding problem is finding the optimal internal state sequence given a sequence of
observations.
• Again, we want to express our components recursively.
• Given the state is j at time t, vt(j) is the joint probability of the observation sequence with
the best state sequence.
• If we examine closely, the resulting equation is close to the forward algorithm except the
summation is replaced by the max function.
MACHINE INTELLIGENCE
Decoding-Viterbi algorithm
MACHINE INTELLIGENCE
Decoding-Viterbi algorithm

• So not only can it be done, the solution is similar to the forward algorithm,
except the summation is replaced by the maximum function.
• Here, instead of summing over all possible state sequences in the forward
algorithm, the Viterbi algorithm finds the most likely path.
MACHINE INTELLIGENCE
HMM

• Finding the internal states that maximize the likelihood of observations


is similar to the likelihood method.
• We just replace the summation with the maximum function.
MACHINE INTELLIGENCE
HMM
In this algorithm, we also record the maximum-probability path leading to each node at time t
(the red arrow above), e.g. we transition from the happy state H at t=1 to the
happy state H at t=2 above since it is the most likely path.
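
A sketch of the Viterbi recursion on the canonical example (the backpointer
bookkeeping is my own layout; π, A, B are the same assumed arrays as in the
earlier sketches):

import numpy as np

pi = np.array([0.2, 0.8])
A  = np.array([[0.4, 0.6], [0.3, 0.7]])
B  = np.array([[0.1, 0.4, 0.5], [0.3, 0.5, 0.2]])

def viterbi(obs):
    n, T = len(pi), len(obs)
    v   = np.zeros((n, T))                   # v[j, t]: best joint prob of paths ending in j
    ptr = np.zeros((n, T), dtype=int)        # ptr[j, t]: best predecessor of j at time t
    v[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[:, t - 1, None] * A       # scores[i, j] = v_{t-1}(i) * a_ij
        ptr[:, t] = scores.argmax(axis=0)
        v[:, t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(v[:, -1].argmax())]          # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[path[-1], t]))
    return v, path[::-1]

v, path = viterbi([0, 2, 1])                 # V1, V3, V2
print(path)                                  # [1, 1, 1] -> S2, S2, S2 for this model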
THANK YOU

K.S.Srinivas
srinivasks@pes.edu
+91 80 2672 1983 Extn 701
MACHINE
INTELLIGENCE
Hidden Markov Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE

Finding the parameters – Hidden Markov


Model

K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Baum-Welch Algorithm

• Besides likelihood and decoding, the last algorithm learns


the HMM model parameters λ given the observation.

• Here, we will use the Baum–Welch algorithm to learn the


transition and the emission probability.

• if we know the state occupation probability (the state


distribution at time t), we can derive the emission
probability and the transition probability.

• If we know these two probabilities, we can derive the state


distribution at time t
MACHINE INTELLIGENCE
Baum-Welch Algorithm

ξ is the probability of transitioning from state i to j at time t given all the
observations. It can be computed from α and β similarly:
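
The formula did not survive extraction; its standard form, consistent with the
α and β defined earlier, is:

ξt(i, j) = αt(i) * aij * bj(Ot+1) * βt+1(j) / P(O|λ)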
MACHINE INTELLIGENCE
Baum-Welch Algorithm

Intuitively, with a fixed HMM model, we refine the state occupation probability (γ)
and the transition probability (ξ) with the given observations.

Here comes the chicken-and-egg part. Once the distributions of γ and ξ (θ₂) are
refined, we can perform a point estimate of the best transition and
emission probabilities (θ₁: a, b).
MACHINE INTELLIGENCE
Baum-Welch Algorithm

πi = γi(1)
Probability of the system being in state i at time t = 1
MACHINE INTELLIGENCE
Baum-Welch Algorithm
MACHINE INTELLIGENCE
HMM

We fix one set of parameters to improve the others and continue the


iteration until the solution converges.

The Expectation-Maximization (EM) algorithm is usually defined as:


MACHINE INTELLIGENCE
HMM
Here, the E-step establishes p(γ, ξ | x, a, b). Then, the M-step finds a, b
that roughly maximize the objective.
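
A single-iteration sketch of these E and M steps in NumPy (the forward and
backward passes are repeated inline so the block is self-contained; this is
illustrative only, not a production implementation):

import numpy as np

pi = np.array([0.2, 0.8])
A  = np.array([[0.4, 0.6], [0.3, 0.7]])
B  = np.array([[0.1, 0.4, 0.5], [0.3, 0.5, 0.2]])
obs = np.array([0, 2, 1])                    # V1, V3, V2
T, N = len(obs), len(pi)

alpha = np.zeros((N, T)); alpha[:, 0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
beta = np.ones((N, T))
for t in range(T - 2, -1, -1):
    beta[:, t] = A @ (B[:, obs[t + 1]] * beta[:, t + 1])
prob_O = alpha[:, -1].sum()

# E-step: occupation (gamma) and transition (xi) probabilities under the model.
gamma = alpha * beta / prob_O
xi = np.array([alpha[:, t, None] * A * B[:, obs[t + 1]] * beta[:, t + 1] / prob_O
               for t in range(T - 1)])

# M-step: point estimates of pi, A, B from gamma and xi.
new_pi = gamma[:, 0]
new_A  = xi.sum(axis=0) / gamma[:, :-1].sum(axis=1)[:, None]
new_B  = np.vstack([gamma[:, obs == k].sum(axis=1) / gamma.sum(axis=1)
                    for k in range(B.shape[1])]).T
print(new_pi); print(new_A); print(new_B)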
THANK YOU

K.S.Srinivas
srinivasks@pes.edu
+91 80 2672 1983 Extn 701
