UE20CS302 Unit3 Slides
INTELLIGENCE
Ensemble Models and
Bayesian Learning
K.S.Srinivas
Department of Computer Science
and Engineering
Machine Intelligence
Unit III
Ensemble Models and Bayesian
Learning
Srinivas K S
Department of Computer Science
Ensemble Learning
An ensemble method is a
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*
General Approach
• We have seen earlier that decision trees have a tendency to overfit, i.e.,
they have high variance.
• We could of course prune the trees, but this is often difficult.
• Ensemble learning combines the outputs of several weak learners to
produce a final model that has low variance.
• Given a set of n independent observations Z1, . . . , Zn, each with
variance σ², the variance of the mean of the observations is
given by σ²/n.
• In other words, averaging a set of observations reduces variance.
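The σ²/n claim can be checked empirically. The sketch below is only an illustration: the function name, the Gaussian noise model, σ = 2 and the number of trials are assumptions chosen for the demonstration, not part of the slides.

```python
import random
import statistics


def variance_of_mean(n, sigma=2.0, trials=5000, seed=0):
    """Empirical variance of the mean of n independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    # Theory says the variance of the mean is sigma^2 / n = 4 / n.
    for n in (1, 5, 25):
        print(n, round(variance_of_mean(n), 3), "theory:", 4.0 / n)
```

The empirical values should track 4/n closely, which is the intuition behind averaging many high-variance learners.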
Intuition behind ensemble learning
complexity
• makes the generalization of this model to
unseen data very difficult i.e a high variance
model.
Bias and Variance
Unit III
Bagging
Bagging
Unit III
Boosting
Boosting - Preamble
Srinivas K S
Department of Computer Science & Engineering
srinivasks@pes.edu
MACHINE
INTELLIGENCE
AdaBoost
Adaboost – The Algorithm
Source: Peter Flach – Machine Learning – The art and science of algorithms that make sense of data
Schematic illustration of Boosting
Adaboost – Broken down in simple terms
• So let's walk through the algorithm with all its gory details
• Get the instance set that you will use for training
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
• Create many decision stumps, for example: predict +1 if x1 < 2.1 and
-1 if x1 >= 2.1
• You can create many such stumps and choose the one with the lowest
error rate
• Assume that the decision stump above is the best one (not true, but
for this example let's just assume it is)
• Let's make a prediction and calculate the error rate
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
• Evaluate the error for the m-th classifier. Here w_n^(m) is
the weight of the n-th data instance in the m-th
iteration. The indicator function:
I(a, b) = 1 if a != b and 0 otherwise.
$$\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} \, I(y_m(x_n), t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$$
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
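As a small illustration of the weighted error, the sketch below sums the weights of the misclassified instances and divides by the total weight. The four-point data set and its weights are hypothetical values made up for the example.

```python
def weighted_error(predictions, targets, weights):
    """Weight on misclassified points divided by the total weight."""
    wrong = sum(w for p, t, w in zip(predictions, targets, weights) if p != t)
    return wrong / sum(weights)


# Hypothetical toy data: only the second instance is misclassified.
targets = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]
weights = [0.1, 0.4, 0.3, 0.2]
eps = weighted_error(predictions, targets, weights)
```

Because the misclassified instance carries weight 0.4 out of a total of 1.0, the weighted error is 0.4 even though only one of four points is wrong.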
Adaboost
We’ll use alpha to update weights in the next round.
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
In the next round I choose x1 < 3.5 as -1 and x1>= 3.5 as +1
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
You can calculate epsilon, alpha and new weights using the same
procedure
epsilon = 0.21, alpha = 0.65
And find weights for the next round
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
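The formulas for alpha and the weight update were on slides that did not survive as text, so the sketch below assumes the standard AdaBoost choices: alpha = 0.5 · ln((1 - epsilon) / epsilon) and w ← w · exp(-alpha · t · y), followed by normalisation. With epsilon = 0.21 this gives alpha ≈ 0.66, close to the 0.65 quoted above (the difference is rounding of epsilon). The function names are illustrative.

```python
import math


def classifier_weight(epsilon):
    """Assumed standard AdaBoost alpha for a weighted error epsilon."""
    return 0.5 * math.log((1.0 - epsilon) / epsilon)


def update_weights(weights, targets, predictions, alpha):
    """Raise the weight of misclassified points, lower the rest, renormalise."""
    raw = [
        w * math.exp(-alpha * t * p)
        for w, t, p in zip(weights, targets, predictions)
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical round: four equally weighted points, one misclassified.
targets = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
alpha = classifier_weight(0.25)
new_weights = update_weights(weights, targets, predictions, alpha)
```

After the update the single misclassified point holds half of the total weight, so the next stump is pushed to classify it correctly.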
Adaboost
At each round I update my final hypothesis
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Adaboost
For example, prediction of the 1st instance will be
Source: https://sefiks.com/2018/11/02/a-step-by-step-adaboost-example/
Alpha vs Error
X 0 1 2 3 4 5 6 7 8 9
Y + + + - - - + + + -
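For the ten-point data set above, the first-round stump can be found by brute force. The sketch below tries every midpoint threshold in both polarities under uniform weights; the helper names and the choice of midpoints as candidate thresholds are assumptions made for illustration.

```python
X = list(range(10))
Y = [+1, +1, +1, -1, -1, -1, +1, +1, +1, -1]


def stump_predict(x, threshold, polarity):
    """polarity=+1 predicts +1 below the threshold; polarity=-1 flips it."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    thresholds = [x + 0.5 for x in xs[:-1]]  # assumes xs is sorted
    for thr in thresholds:
        for pol in (+1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, thr, pol) != y
            )
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


uniform = [1 / len(X)] * len(X)
err, thr, pol = best_stump(X, Y, uniform)
```

With uniform weights the best stump misclassifies three of the ten points, so no single stump separates this data and several boosting rounds are needed.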
MACHINE
INTELLIGENCE
BAYESIAN LEARNING
K.S.Srinivas
Department of Computer Science and Engineering
Probabilistic Learning
• Now suppose I tell you that 8, 2 and 64 are also positive examples.
• Now you may guess that the hidden concept is “powers of two”. This is an example of
induction
• How can we explain this behavior and emulate it in a machine?
• The classic approach to induction is to suppose we have a hypothesis space of
concepts, H, such as: odd numbers, even numbers, all numbers between 1 and 100,
powers of two.
• The subset of H that is consistent with the data D is called the version space.
• As we see more examples, the version space shrinks and we become increasingly
certain about the concept
Likelihood:
• Assume examples are sampled uniformly at random from all
numbers that are consistent with the hypothesis
• Size principle: Favors smallest consistent hypotheses
Prior
where I(D ∈ h) is 1 iff (if and only if) all the data are in the extension of the hypothesis
h
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
2. ML Hypothesis
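$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)$$
$$\equiv \arg\max_{h \in H} \{ P(h_1 \mid +),\ P(h_2 \mid +) \}$$
$$\equiv \arg\max_{h \in H} \{ P(+ \mid h_1)\, P(h_1),\ P(+ \mid h_2)\, P(h_2) \}$$
$$\equiv \arg\max_{h \in H} \{ 0.98 \times 0.008,\ 0.03 \times 0.992 \} = \arg\max_{h \in H} \{ 0.0078,\ 0.0298 \} = h_2$$
P(h1|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(h2|+) = 0.0298 / (0.0078 + 0.0298) = 0.79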
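The same normalisation can be written as a short sketch. The function name is illustrative, and the prior P(h1) = 0.008 is the value implied by the product 0.0078 = 0.98 × P(h1) and by P(h2) = 0.992.

```python
def posteriors(priors, likelihoods):
    """Bayes theorem: normalise likelihood * prior over all hypotheses."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # this is P(D)
    return {h: j / evidence for h, j in joint.items()}


priors = {"h1": 0.008, "h2": 0.992}
likelihood_pos = {"h1": 0.98, "h2": 0.03}  # P(+ | h)
post = posteriors(priors, likelihood_pos)
h_map = max(post, key=post.get)
```

Even though h1 explains a positive result far better than h2, its small prior keeps the posterior near 0.21, so the MAP hypothesis is h2.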
Bayes Theorem and Concept Learning
What is the relationship between Bayes theorem and the problem of
concept learning?
It can be used for designing a straightforward learning algorithm called
Brute-Force MAP LEARNING algorithm
Brute-Force MAP Learning Algorithm
P(D)
Output hypothesis hMAP with the highest posterior probability
Choose P(D|h):
Relation to Concept learning
Choose P(D|h):
P(D|h) = 0 otherwise
P(h) = 1/|H| for all h in H
Brute-Force MAP Learning
Therefore, the brute force algorithm can now proceed in two ways.
$$P(h \mid D) = \frac{0 \cdot P(h)}{P(D)} = 0$$
evolution of probabilities
(a) all hypotheses have the same probability
(b) + (c) as training data accumulates, the posterior
probability of inconsistent hypotheses becomes zero while
the total probability summing to 1 is shared equally
among the remaining consistent hypotheses
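A minimal sketch of brute-force MAP learning under these choices (uniform prior, likelihood 1 for consistent hypotheses and 0 otherwise). The hypothesis space reuses the number concepts mentioned earlier; the predicate encoding and the function name are assumptions, and the sketch assumes at least one hypothesis is consistent with the data.

```python
def brute_force_map(hypotheses, data):
    """hypotheses: name -> predicate(x); data: list of (x, label) pairs."""
    prior = 1.0 / len(hypotheses)
    unnormalised = {}
    for name, h in hypotheses.items():
        consistent = all(h(x) == label for x, label in data)
        unnormalised[name] = (1.0 if consistent else 0.0) * prior
    evidence = sum(unnormalised.values())  # P(D) = |VS| / |H|
    return {name: p / evidence for name, p in unnormalised.items()}


hypotheses = {
    "odd": lambda x: x % 2 == 1,
    "even": lambda x: x % 2 == 0,
    "1..100": lambda x: 1 <= x <= 100,
    "powers of two": lambda x: x > 0 and x & (x - 1) == 0,
}
data = [(8, True), (2, True), (64, True)]
post = brute_force_map(hypotheses, data)
```

The inconsistent hypothesis ("odd") gets posterior 0, and the three consistent ones share the probability equally at 1/3 each, which is the 1/|VS| result.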
Consistent Learners
FindS will output a MAP hypothesis, even though it does not explicitly use
probabilities in learning.
D:
MACHINE
INTELLIGENCE
Maximum Likelihood
and Bayes Optimal Classifier
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis
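$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(d_i - \mu)^2}{2\sigma^2}}$$
• where μ = h(xi)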
MACHINE INTELLIGENCE
Maximum Likelihood and Least Squared Error Hypothesis
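$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2} (d_i - h(x_i))^2$$
$$= \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2$$
$$= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$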
Bayes Optimal Classifier
• We have so far covered: "What is the most probable hypothesis given the training data?"
• But we can now attempt to answer the question, "What is the most probable classification of a new
instance given the training data?"
• We can answer this by applying the MAP hypothesis to the
new instance, but we can do better.
• Consider a hypothesis space consisting of 3
hypotheses h1, h2, h3.
• Suppose the posterior probabilities of these hypotheses
given the training data are 0.4, 0.3 and 0.3
respectively.
• Suppose a new instance x is encountered, which is
classified as +ve by h1 and -ve by h2 and h3.
• Taking all hypotheses into account, the probability
that x is positive is 0.4 and the probability that x is
negative is 0.6.
• The most probable classification is -ve.
• In this case it is different from the classification
generated by the MAP hypothesis.
MACHINE INTELLIGENCE
Bayes Optimal Classifier
• In general, the most probable classification of the new
instance is obtained by combining the predictions
of all hypotheses, weighted by their posterior
probabilities.
• If the possible classification of the new instance
can take any value vj from a set V, then the
probability P(vj|D) that the correct classification
for the new instance is vj is
$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$
• The optimal classification is the value vj that maximizes this probability:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$
MACHINE INTELLIGENCE
Bayes Optimal Classifier
• To illustrate in terms of the above example, the
set of possible classifications of the new instance is V = {+ve, -ve},
and h1, h2 and h3 are the three hypotheses.

Hypothesis   P(hi|D)   P(-ve|hi)   P(+ve|hi)
h1           0.4       0           1
h2           0.3       1           0
h3           0.3       1           0

Therefore,
$$\sum_{h_i \in H} P(\text{+ve} \mid h_i)\, P(h_i \mid D) = 1 \times 0.4 + 0 \times 0.3 + 0 \times 0.3 = 0.4$$
$$\sum_{h_i \in H} P(\text{-ve} \mid h_i)\, P(h_i \mid D) = 0 \times 0.4 + 1 \times 0.3 + 1 \times 0.3 = 0.6$$
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = \text{-ve}$$
This equation is called the Bayes Optimal Classifier or Bayes Optimal Learner.
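The worked example can be reproduced with a few lines. This is a sketch using the numbers from the example; the dictionary layout and the function name are illustrative choices.

```python
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # P(h | D)
p_label_given_h = {                            # P(v | h)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}


def bayes_optimal(posterior, p_label_given_h, labels):
    """Weight each hypothesis's vote by its posterior and pick the best label."""
    scores = {
        v: sum(p_label_given_h[h][v] * posterior[h] for h in posterior)
        for v in labels
    }
    return max(scores, key=scores.get), scores


label, scores = bayes_optimal(posterior, p_label_given_h, ["+", "-"])
h_map = max(posterior, key=posterior.get)
map_label = max(p_label_given_h[h_map], key=p_label_given_h[h_map].get)
```

Here `label` is "-" while the MAP hypothesis h1 alone would predict "+", which is exactly the disagreement the example is meant to show.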
MACHINE INTELLIGENCE
Bayes Optimal Classifier
• This method maximizes the probability that the new
instance is classified correctly, given the available
data, hypothesis space and prior probabilities over
the hypotheses.
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$
MACHINE INTELLIGENCE
Gibbs Algorithm
• Although the Bayes optimal classifier obtains the best performance that can
be achieved from the training data, it is quite costly to apply.
• The expense is due to the fact that it computes the posterior
probability for every hypothesis in H and combines the
predictions of all hypotheses to classify each new instance.
• An alternative, less optimal method is the "GIBBS ALGORITHM",
defined as follows:
1. Choose a hypothesis h from H at random, according to the
posterior probability distribution over H.
2. Use h to predict the classification of the next instance x.
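The two steps translate almost directly into code. The sketch below reuses the three-hypothesis posterior from the Bayes optimal example; the fixed predictions for a single instance and the function name are assumptions made for illustration.

```python
import random


def gibbs_classify(posterior, predict, x, rng):
    """Sample one hypothesis from the posterior and use it to classify x."""
    names = list(posterior)
    weights = [posterior[n] for n in names]
    h = rng.choices(names, weights=weights, k=1)[0]
    return predict[h](x)


posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predict = {
    "h1": lambda x: "+",
    "h2": lambda x: "-",
    "h3": lambda x: "-",
}
rng = random.Random(0)
labels = [gibbs_classify(posterior, predict, None, rng) for _ in range(10000)]
fraction_positive = labels.count("+") / len(labels)
```

Each call costs one sample and one prediction instead of a sum over all of H; over many calls the instance is labelled "+" about 40% of the time, matching P(h1|D).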
MACHINE
INTELLIGENCE
Naïve Bayes and Applications
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Naive Bayes Classifier
• Highly practical Bayesian Learning method
• Comparable performance with neural network and decision tree
learning.
• The Naive Bayes classifier applies to learning tasks where each instance
x is described by a conjunction of attribute values and where
the target function f(x) can take any value from a finite set V.
• A set of training examples of the target function is provided, and
a new instance is presented as a tuple of attribute values (a1, a2, a3, a4, ..., an).
• The learner/classifier is asked to predict the target value, or the
classification, of the new instance.
$$PlayTennis(x) = \arg\max_{v_j \in \{YES, NO\}} P(v_j)\, P(outlook = sunny \mid v_j)\, P(temperature = cool \mid v_j)\, P(humidity = high \mid v_j)\, P(wind = strong \mid v_j)$$
vj can be YES or NO.
The probabilities of the different target values can
easily be estimated based on their frequency over
the 14 training examples.
MACHINE INTELLIGENCE
Example- Play Tennis
Therefore PlayTennis(x)=NO
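The training table did not survive on these slides, so the sketch below assumes the frequencies of the standard 14-example PlayTennis data set from Mitchell's textbook (9 YES and 5 NO examples). Treat the conditional probabilities as assumed inputs, not as values taken from the slides.

```python
# Assumed relative frequencies from the standard PlayTennis table.
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"outlook=sunny": 2 / 9, "temperature=cool": 3 / 9,
            "humidity=high": 3 / 9, "wind=strong": 3 / 9},
    "no": {"outlook=sunny": 3 / 5, "temperature=cool": 1 / 5,
           "humidity=high": 4 / 5, "wind=strong": 3 / 5},
}


def naive_bayes_scores(instance, prior, cond):
    """P(v) times the product of P(attribute | v) for every target value v."""
    scores = {}
    for v in prior:
        p = prior[v]
        for attribute in instance:
            p *= cond[v][attribute]
        scores[v] = p
    return scores


x = ["outlook=sunny", "temperature=cool", "humidity=high", "wind=strong"]
scores = naive_bayes_scores(x, prior, cond)
prediction = max(scores, key=scores.get)
```

Under these assumed frequencies the NO score is roughly four times the YES score, which agrees with the conclusion PlayTennis(x) = NO.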
MACHINE INTELLIGENCE
Special Case
MACHINE INTELLIGENCE
Example2-Text Classification
• Consider the following data set:

sentence                        class
A great game                    sports
The election is over            not sports
very clean match                sports
a clean but forgettable game    sports
it was a close election         not sports

• The task is to classify the sentence "A very close game" as
sports or not sports.
• In this data set we do not have numbers, only text.
• We need to convert all this text into numbers that we can use
for calculation. How?
• One solution is to use the frequency of words.
• Ignore word order and sentence construction.
• Treat every document as a set of the words it contains.
• The feature used in this case is the count of each word,
i.e., the word frequency.
• It's a simplistic approach, but it works surprisingly well.
MACHINE INTELLIGENCE
Example2-Text Classification
• Now we need to transform the probability we want to
calculate into something that can be calculated using word
frequencies.
• Bayes theorem, for example:
$$P(sports \mid a\ very\ close\ game) = \frac{P(a\ very\ close\ game \mid sports) \times P(sports)}{P(a\ very\ close\ game)}$$
• Since in our classifier we are just trying to find out which
category has the bigger probability, we can discard the divisor.
• It is the same for both categories.
• We can compare
P(a very close game | sports) x P(sports)
with
P(a very close game | not sports) x P(not sports)
MACHINE INTELLIGENCE
Example2-Text Classification
• The probabilities can be calculated:
1. Count how many times the sentence
"A very close game" appears in the sports category.
2. Divide by the total.
3. Obtain P(a very close game | sports).
• PROBLEM: we do not have this sentence in the training set
=> the probability is zero.
• Unless every sentence that we want to classify appears in the
training set, the model won't be able to classify it.
MACHINE INTELLIGENCE
Example2-Text Classification
So
• We assume that every word in a sentence is independent of
the other ones.
• We no longer look at entire sentences, but only at
individual words.
• i.e., the sentence "This was a funny party" is treated the same as "funny is
party was this" and as "party funny this was a".
• A problem again:
• The word "close" does not appear in any sports sentence, which would give
us 0 when multiplied with the other probabilities.
MACHINE INTELLIGENCE
Example2-Text Classification
To resolve this we do something called Laplace smoothing:
• i.e., we add 1 to every count so it's never zero.
• To balance this, we add the number of possible words to the
divisor.
• In our case the possible words are:
{a, great, game, the, election, is, over, ......., election} = 14
• Applying smoothing we get

WORD   P(word|sports)          P(word|not sports)
a      (2+1)/(11+14) = 3/25    (1+1)/(9+14) = 2/23

P(a|sports) x P(very|sports) x P(close|sports) x P(game|sports) x P(sports) = 0.000027648
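The whole calculation can be sketched as a small word-count classifier. The vocabulary size is fixed at the value 14 used above (an assumption carried over from the slide: counting distinct words in the sentences as transcribed here would give 15, since both "is" and "was" appear). The function names are illustrative.

```python
from collections import Counter

train = [
    ("a great game", "sports"),
    ("the election is over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]
VOCAB_SIZE = 14  # number of possible words as stated on the slide


def train_counts(data):
    """Per-class word counts and per-class sentence counts."""
    word_counts, doc_counts = {}, Counter()
    for sentence, label in data:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(sentence.lower().split())
    return word_counts, doc_counts


def score(sentence, label, word_counts, doc_counts, vocab_size):
    """P(label) times the Laplace-smoothed P(word | label) for every word."""
    counts = word_counts[label]
    total_words = sum(counts.values())
    p = doc_counts[label] / sum(doc_counts.values())
    for word in sentence.lower().split():
        p *= (counts[word] + 1) / (total_words + vocab_size)
    return p


word_counts, doc_counts = train_counts(train)
query = "a very close game"
p_sports = score(query, "sports", word_counts, doc_counts, VOCAB_SIZE)
p_not = score(query, "not sports", word_counts, doc_counts, VOCAB_SIZE)
```

The sports score includes the class prior P(sports) = 3/5 and comes out larger than the not-sports score, so the sentence is classified as sports.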
MACHINE
INTELLIGENCE
EXPECTATION MAXIMIZATION
K.S.Srinivas
Department of Computer Science and Engineering
Expectation Maximization
• Maximum likelihood becomes intractable if there are variables that interact with
those in the dataset but were hidden or not observed, so-called latent variables.
• The EM algorithm handles this by first
estimating the values for the latent variables (E-step),
then optimizing the model (M-step),
then repeating these two steps until convergence.
Unsupervised Learning and EM
• K-Means clustering
K-Means Clustering
• The within-cluster variation for cluster Ck is a measure W(Ck) of the amount by which
the observations within a cluster differ from each other.
• In words, this formula says that we want to partition the observations into K clusters
such that the total within-cluster variation, summed over all K clusters, is as small as
possible
• The intra-cluster distance is measured using the Euclidean distance between pairwise
instances in the cluster
Expectation Maximization of K-Means
• The E-step is assigning the data points to the closest cluster.
• The M-step is computing the centroid of each cluster.
• M-Step
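The E-step and M-step of K-Means can be sketched in plain Python. The six two-dimensional points and the initial centroids are hypothetical, and the fixed iteration count is an assumption (a convergence check could be used instead).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment (E-step) and centroid update (M-step)."""
    clusters = []
    for _ in range(iterations):
        # E-step: assign every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)),
                    key=lambda j: squared_distance(p, centroids[j]))
            clusters[k].append(p)
        # M-step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

On this toy data the two groups are recovered after the first pass and the centroids settle at the group means; an empty cluster simply keeps its previous centroid.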
Closing Notes on K-Means
A normal distribution such that the mean μ = 0 and
standard deviation σ = 1 for your data
They might be fair coins, be more heavily weighted towards heads; you
don't know.
Here's the clue she's provided: a piece of paper with 5 records of an
experiment where she's:
Expectation Maximization
Chosen one of the two coins at random.
Flipped that same coin 10 times.
How can you provide a reasonable estimate of each coin bias? Let's
refer to these coins as coin A and coin B and their bias as θA and θB.
Expectation Maximization
θ1 = 24/30 = 0.8
θ2 = 9/20 = 0.45
Expectation Maximization
It turns out that we can make progress by starting with a guess for the coin biases
Which will allow us to estimate which coin was chosen in each trial and come up with an
estimate for the expected number of heads and tails for each coin across the trials (E-
step)
We then use these counts to recompute a better guess for each coin bias (M-step)
By repeating these two steps, we continue to get a better estimate of the two coin
biases and converge at a solution that turns out to be a local maximum to the problem.
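A minimal sketch of this loop for the two-coin problem follows. The per-trial head counts (5, 9, 8, 4, 7 out of 10 flips) are an assumption: they are the counts of the commonly used version of this example and are consistent with the labelled estimates 24/30 and 9/20 above. The starting guesses 0.6 and 0.5 are also assumed.

```python
def em_two_coins(heads, flips, theta_a, theta_b, iterations):
    """EM for two coins with unknown biases and unknown coin identity per trial."""
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in heads:
            t = flips - h
            # E-step: responsibility of each coin for this trial. The equal
            # priors P(Z_A) = P(Z_B) = 0.5 and the binomial coefficient cancel.
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += resp_b * h
            tails_b += resp_b * t
        # M-step: re-estimate each bias from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


heads_per_trial = [5, 9, 8, 4, 7]  # assumed data, 10 flips per trial
first = em_two_coins(heads_per_trial, 10, 0.6, 0.5, iterations=1)
final = em_two_coins(heads_per_trial, 10, 0.6, 0.5, iterations=10)
```

Each trial contributes fractional head and tail counts to both coins, in proportion to how likely each coin is to have produced it; the estimates move away from the initial guess and settle at a local maximum.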
Expectation Maximization
Estimating likelihood each coin was chosen
Estimate the probability that each coin is the true coin given the flips we see in the trial
Which will allow us to estimate which coin was chosen in each trial .
Use that to proportionally assign heads and tails counts to each coin.
Let's make this concrete with one of the examples we just mentioned:
P(ZA)=P(ZB)=0.5
We can eliminate the values from
the equation
Gaussian Mixture Models
Gaussian Distributions
Clustering.
MACHINE
INTELLIGENCE
Hidden Markov Model
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Discrete Markov Process
Random variables X and Y on the same probability space are said to be independent
if the events X = a and Y = b are independent for all values a, b. Equivalently, the joint
distribution of independent r.v.'s decomposes as
$$P(X = a, Y = b) = P(X = a)\, P(Y = b)$$
Examples: Put m balls with numbers written on them in an urn. Draw n balls from the urn
with replacement, and let Xi be the number on the ith ball. Then X1, X2, ..., Xn will be i.i.d.
But if we draw the balls without replacement, X1, X2, ..., Xn will not be i.i.d. - they will all
have the same distribution, but will not be independent.
If we draw the balls with replacement, but let Xi be i times the number on the ith ball, then
X1, X2, ..., Xn will not be i.i.d. - they will be independent, but they will have different
distributions.
MACHINE INTELLIGENCE
Discrete Markov Process
Random variables
• The possible states of the outcomes are also known as the domain of the
random variable, and the outcome is based on the probability
distribution defined over the domain of the random variable.
• In rolling a six-sided die, the domain of the random variable outcome, O, is
given by domain(O) = {1, 2, 3, 4, 5, 6}, and the probability distribution is
given by a uniform distribution P(o) = 1/6 ∀ o ∈ domain(O).
• The domain of this random variable consists of discrete values; such random
variables are known as discrete random variables.
• Consider the random variable representing the stock price of
Google tomorrow. The domain of this random variable will be all positive
real numbers with most of the probability mass distributed around ±5% of
today's price. Such random variables are known as continuous random
variables.
MACHINE INTELLIGENCE
Discrete Markov Process
Random processes
Random variables are able to mathematically represent the outcomes of a single random
phenomenon.
What if we want to represent these random events over some period of time or the length
of an experiment?
let's say we want to represent the stock prices for a whole day at intervals of every
one hour
we want to represent the height of a ball at intervals of every one second after
being dropped from some height in a vacuum.
For such situations, we would need a set of random variables, each of which will represent
the outcome at the given instance of time. These sets of random variables that represent
random variables over a period of time are also known as random processes. It is worth
noting that the domains of all these random variables are the same.
MACHINE INTELLIGENCE
Discrete Markov Process
• Such random processes, in which we can deterministically find the state of each
random variable given the initial conditions (in this case, dropping the ball, zero initial
velocity) and the parameters of the system (in this case, the value of gravity), are known
as deterministic random processes (commonly called deterministic processes).
• Random processes, in which we can't determine the state of a process, even if we are
given the initial conditions and all the parameters of the system, are known as
stochastic random processes (commonly called stochastic processes).
MACHINE INTELLIGENCE
Discrete Markov Process
Markov processes
A stochastic process is called a Markov process if the state of the random variable at the next
instance of time depends only on the outcome of the random variable at the current time.
Markov property
This property of a system, such that the future states of the system depend only on the
current state of the system, is also known as the Markov property.
Systems satisfying the Markov property are also known as memoryless systems
MACHINE INTELLIGENCE
Discrete Markov Process
• A discrete Markov process (Markov chain) is a stochastic process over a discrete state space satisfying the Markov property.
• The probability of moving from the current state to the next state depends only on the
present state and not on any of the previous states.
• It is said to be irreducible if we can reach any state of the given Markov chain from any other
state.
• State j is said to be accessible from state i if an integer nij ≥ 0 exists such that the following
condition is met:
MACHINE INTELLIGENCE
Discrete Markov Process
• This is the probability of a system being in state Rn+1 given that the machine has been in R1 at t=1,
R2 at t=2 and Rn at t=n can be represented as follows
• Such a process is called an n-th order Markov process, where the machine being in a state at n+1 is
conditioned on all the previous states leading up to n.
• Let's understand the model a little better and introduce some more
terms that we need for the model that we will apply.
• The π that you see here are the starting
probabilities, since we can start in any
state
• a12 is the transition probability of
moving from state 1 to state 2
• The sum (Σ) of all a's out of a state must add up
to one.
MACHINE INTELLIGENCE
Discrete Markov Process
• The π that you see here are the starting
probabilities, since we can start in any
state
• All of the transition probabilities can be
represented as a matrix called the
transition matrix
• A = | a11 a12 a13 |
      | a21 a22 a23 |
      | a31 a32 a33 |
$$\sum_{j=1}^{N} a_{ij} = 1 \ \ \forall i, \qquad \sum_{i=1}^{N} \pi_i = 1$$
• Starting state is cloudy: πC = 0.3
• A =      S     C     R
      S | 0.6   0.2   0.2 |
      C | 0.2   0.5   0.3 |
      R | 0.1   0.4   0.5 |
• What is the probability of seeing a sequence of Cloudy, Sunny, Rainy, Cloudy,
Rainy over the next five days?
• The trellis is CSRCR. So we start with πC = 0.3 and then multiply by the probability of
each next state given the current state, i.e.
P(CSRCR | π, A) = πC · P(S|C) · P(R|S) · P(C|R) · P(R|C)
= 0.3 × 0.2 × 0.2 × 0.4 × 0.3
= 0.00144
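The same chain-rule product can be computed for any state sequence. This is a short sketch using the transition matrix above; the nested-dictionary layout and the function name are illustrative choices.

```python
A = {
    "S": {"S": 0.6, "C": 0.2, "R": 0.2},
    "C": {"S": 0.2, "C": 0.5, "R": 0.3},
    "R": {"S": 0.1, "C": 0.4, "R": 0.5},
}


def sequence_probability(sequence, start_probability, transitions):
    """Start probability times the product of transitions along the sequence."""
    p = start_probability
    for current, following in zip(sequence, sequence[1:]):
        p *= transitions[current][following]
    return p


p = sequence_probability("CSRCR", 0.3, A)
```

Each row of `A` sums to one, and the result for CSRCR matches the hand calculation of 0.00144.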
MACHINE INTELLIGENCE
Example Problem (2)
• Consider the trellis over a 14 day period
Compute the parameters of the Discrete
Markov Model,
i.e., compute
π and the transition matrix
Solution:
MACHINE INTELLIGENCE
Example Problem (2)
• Consider the trellis over a 14 day period
Compute the transition matrix
MACHINE INTELLIGENCE
Hidden State
• Consider the following problem. We have 2 friends, Karan and
Vijay, one in Bangalore and the other in Shimoga.
• They speak to each other every day.
• The only thing Karan states to Vijay on any day is whether he is
happy or angry.
• His anger or happiness is determined by the weather being
sunny or rainy.
• Given that he says he is HHSHS, can you guess the weather in
Shimoga?
• We have another probability called the emission probability,
i.e., given a state (the weather), what is the probability of being
happy or sad.
• This is represented by another matrix called the
emission matrix. See you in the next class.
MACHINE
INTELLIGENCE
Hidden Markov Model
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
HMM
[Figure: state diagram of the weather HMM for Karan and Vijay, built up over several slides. Surviving labels: probabilities 0.8, 0.2, 0.4 and 0.6; day counts 10 (2/3) and 5 (1/3).]
MACHINE INTELLIGENCE
What is the probability that a random day is sunny or rainy
• If today is sunny, it could be because yesterday was sunny or
yesterday was rainy.
• We can have the following equation:
S = 0.8S + 0.4R
• Similarly,
R = 0.2S + 0.6R
• So now we can solve this system of equations. The two equations are
almost the same, but we also know that S + R = 1.
• Solving gives S = 2/3 and R = 1/3.
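The same answer can be obtained numerically by repeatedly applying the transition probabilities to any starting distribution. This is a sketch; the iteration count and the function name are assumptions.

```python
def stationary_distribution(P, iterations=200):
    """Repeatedly push a distribution through P until it stops changing."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


# Rows: today's state (sunny, rainy); columns: tomorrow's state.
P = [
    [0.8, 0.2],
    [0.4, 0.6],
]
sunny, rainy = stationary_distribution(P)
```

The result converges to S = 2/3 and R = 1/3, the same values found by solving the equations with S + R = 1.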
[Figure sequence: step-by-step probability calculation on the weather HMM. Surviving diagram values: 0.4, 0.6, 0.8, 0.2 and 0.67, with running products 0.06432 and 0.0205824.]
MACHINE INTELLIGENCE
Hidden Markov Model
• Until now we assumed that the instances that constitute a sample are IID:
Likelihood(sample) = Π Likelihood(instance)
• Hidden states form a Markov chain:
– Dependent only on the previous state
– "The past is independent of the future given the present."
• The model includes the initial state distribution π (the probability
distribution of the initial state)
• The transition probabilities A from one state (xt) to another.
• Parameters: {S, K, Π, A, B}
• Initial hidden state probabilities: Π = {πi}
• Transition probabilities: A = {aij} are
the state transition probabilities.
• Emission probabilities: B = {bik} are the
observation probabilities.
N and M are defined implicitly.
MACHINE INTELLIGENCE
HMM
Two major assumptions are made in HMM. The next state and
the current observation solely depend on the current state only.
MACHINE INTELLIGENCE
HMM
In an HMM, we solve the problem at time t by using the result from time t-1.
• A circle below represents an HMM hidden state j at time t. So even though
the number of state sequences increases exponentially with time, we
can solve the problem in linear time if we can express the calculation recursively in
time.
As we can see from the diagram on the right, and as explained earlier, we can
express this recursively in terms of the earlier α's.
We will prove this in the next slide and explain it with an
example.
MACHINE INTELLIGENCE
Probability of an Observation Sequence
As we can see from the diagram on the right, and as explained earlier, we can
express this recursively in terms of the earlier α's.
Thus the likelihood of the observations can be calculated recursively for each time step, as below:
MACHINE INTELLIGENCE
Toy Example of Forward Algorithm
• Consider this example in which we start with the initial state distribution on the
left.
• Then we propagate the value of α to the right for each timestep.
• Therefore, we break the curse of exponential complexity.
MACHINE INTELLIGENCE
HMM – Canonical Example Problem
π2 = 0.8
MACHINE INTELLIGENCE
HMM
S1
S2
To calculate a cell, take the alpha values of the previous time step, multiply each by the transition
probability into this cell's state, and add them up: Σi αt(i)*aij. Multiply this sum by the observation
probability bj(Ot+1) to get
(Σi αt(i)*aij)*bj(Ot+1) = αt+1(j) at this cell.
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence
To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
S1
S2
To calculate a cell we normally take the previous time step's alpha values, multiply each by the
corresponding transition probability and add them up. But this is the first column, so α1(i) =
πi*bi(O1)
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence
To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
S1
S2
πi = π1 = 0.2
bi (O1) = b1(V1) = 0.1
α1(i) = πi*bi(O1) = 0.02
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence
To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
S1 0.02
S2
πi = π2 = 0.8
bi (O1) = b2(V1) = 0.3
α1(i) = πi*bi(O1) = 0.24
MACHINE INTELLIGENCE
HMM
To Find P(O = {V1,V3,V2} | λ), make alpha table, sum up last column
S1 0.02
S2 0.24
To calculate this cell, take the alpha values of the previous time step,
multiply each by the transition probability into this cell's state and
add them up: Σi αt(i)*aij. Multiply this sum by the observation
probability bj(Ot+1) to get (Σi αt(i)*aij)*bj(Ot+1) = αt+1(j) at this cell.
MACHINE INTELLIGENCE
HMM
Alpha rule method to find Probability of an Observation Sequence
To find P(O = {V1,V3,V2} | λ), make the alpha table and sum up the last column.
Transition probabilities used: a11 = 0.4, a21 = 0.3, a12 = 0.6, a22 = 0.7

      t = 1 (V1 is observed)   t = 2 (V3 is observed)   t = 3 (V2 is observed)
S1    0.02                     0.04                     0.01072
S2    0.24                     0.036
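The alpha table can be reproduced with a short forward-algorithm sketch. Most parameters below appear on the slides (π, the transition probabilities, b1(V1), b2(V1), b1(V3), b2(V3)). The emission values b1(V2) = 0.4 and b2(V2) = 0.5 are assumptions inferred from the table entry 0.01072 and from each emission row summing to one.

```python
pi = [0.2, 0.8]                      # initial probabilities for S1, S2
A = [[0.4, 0.6],                     # a11, a12
     [0.3, 0.7]]                     # a21, a22
B = [[0.1, 0.4, 0.5],                # b1(V1), b1(V2), b1(V3)
     [0.3, 0.5, 0.2]]                # b2(V1), b2(V2), b2(V3)
obs = [0, 2, 1]                      # V1, V3, V2 as column indices


def forward(pi, A, B, obs):
    """Return the alpha table as a list of rows, one row per time step."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    return alpha


alpha = forward(pi, A, B, obs)
likelihood = sum(alpha[-1])          # P(O | lambda): sum of the last column
```

Under these assumptions the missing cell α3(S2) comes out as 0.0246, so the sum of the last column is about 0.0353.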
Proof of the alpha probability
Thus the likelihood of the observations can be calculated recursively for each time step below.:
MACHINE INTELLIGENCE
Backward Probability
The backward probability β is the probability of seeing the observations from time t+1
to the end, given that we are in state i at time t (and given the automaton λ):
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid x_t = i, \lambda)$$
MACHINE INTELLIGENCE
Backward Probability Proof
MACHINE INTELLIGENCE
Backward Probability Algorithm
MACHINE INTELLIGENCE
HMM – Canonical Example Problem with backward probability
π2 = 0.8
Observation sequence = O = {V1,V3,V2}
Values used: a12 = 0.6, b1(V3) = 0.5, b2(V3) = 0.2
Beta Table
      t = 1     t = 2   t = 3
S1    0.1484    0.46    1
S2    0.1348    0.47    1
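The beta table can be reproduced by running the recursion from the last time step backwards. In the sketch below the emission values b1(V2) = 0.4 and b2(V2) = 0.5 are assumptions inferred from the table entries 0.46 and 0.47; the other parameters appear on the slides.

```python
A = [[0.4, 0.6],                     # a11, a12
     [0.3, 0.7]]                     # a21, a22
B = [[0.1, 0.4, 0.5],                # b1(V1), b1(V2), b1(V3)
     [0.3, 0.5, 0.2]]                # b2(V1), b2(V2), b2(V3)
obs = [0, 2, 1]                      # V1, V3, V2 as column indices


def backward(A, B, obs):
    """Return the beta table as a list of rows, one row per time step."""
    n = len(A)
    beta = [[1.0] * n]               # last column is all ones
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta


beta = backward(A, B, obs)
```

Each step uses the observation that follows it, which is why the loop walks over the observations from the last one down to the second.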
MACHINE INTELLIGENCE
Forward and Backward Procedure
• To learn the HMM model, we need to know which states best
explain the observations.
• That will be the occupation probability γ: the probability of state i
at time t given all the observations.
• With the HMM model parameters fixed, we can apply the forward
and backward algorithms to calculate α and β from the observations. γ
can be calculated by simply multiplying α with β and then renormalizing.
• This follows from the product rule p(A, B) = p(A|B) · p(B), i.e.
$$p(A \mid B) = \frac{p(A, B)}{p(B)}$$
MACHINE INTELLIGENCE
Decoding- 2 methods -1st Method
MACHINE INTELLIGENCE
Posterior Decoding
To Find Probability of an Observation Sequence using both Alpha
and Beta Tables
Alpha table: as computed by the alpha rule method above.
Beta table
      t = 1     t = 2   t = 3
S1    0.1484    0.46    1
S2    0.1348    0.47    1

Gamma Table
      t = 1 (V1 is observed)   t = 2 (V3 is observed)   t = 3 (V2 is observed)
S1                             0.52095                  0.30351
S2                             0.47904                  0.69648
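The gamma values follow from multiplying the alpha and beta tables cell by cell and normalising each time step. In the sketch below the alpha entry 0.0246 for S2 at t = 3 is an assumption: it is the value the forward recursion gives if b2(V2) = 0.5, and it does not appear on the slides.

```python
def occupation_probabilities(alpha, beta):
    """gamma_t(i) = alpha_t(i) * beta_t(i), renormalised within each time step."""
    gamma = []
    for alpha_t, beta_t in zip(alpha, beta):
        joint = [a * b for a, b in zip(alpha_t, beta_t)]
        total = sum(joint)           # equals P(O | lambda) at every time step
        gamma.append([x / total for x in joint])
    return gamma


# One row per time step, columns are S1 and S2.
alpha = [[0.02, 0.24], [0.04, 0.036], [0.01072, 0.0246]]
beta = [[0.1484, 0.1348], [0.46, 0.47], [1.0, 1.0]]
gamma = occupation_probabilities(alpha, beta)
```

Posterior decoding then picks, at every time step, the state with the largest gamma value.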
• The decoding problem is finding the optimal internal states sequence given a sequence of
observations.
• Again, we want to express our components recursively.
• Given the state is j at time t, vt(j) is the joint probability of the observation sequence with
the best state sequence.
• If we examine closely, the resulting equation is close to the forward algorithm except the
summation is replaced by the max function.
MACHINE INTELLIGENCE
Decoding-Viterbi algorithm
MACHINE INTELLIGENCE
Decoding-Viterbi algorithm
• So not only it can be done, the solution is similar to the forward algorithm
except the summation is replaced by the maximum function.
• Here, instead of summing over all possible state sequences in the forward
algorithm, the Viterbi algorithm finds the most likely path.
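A minimal Viterbi sketch on the canonical two-state example follows. As in the forward calculation, b1(V2) = 0.4 and b2(V2) = 0.5 are assumed values inferred from the alpha and beta tables; the function name and the list-based tables are illustrative.

```python
pi = [0.2, 0.8]
A = [[0.4, 0.6],
     [0.3, 0.7]]
B = [[0.1, 0.4, 0.5],
     [0.3, 0.5, 0.2]]
obs = [0, 2, 1]                      # V1, V3, V2 as column indices


def viterbi(pi, A, B, obs):
    """Return the most likely state path and its joint probability."""
    n = len(pi)
    v = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    backpointers = []
    for o in obs[1:]:
        prev = v[-1]
        row, pointers = [], []
        for j in range(n):
            # Same recursion as the forward algorithm, with max instead of sum.
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            row.append(prev[best_i] * A[best_i][j] * B[j][o])
            pointers.append(best_i)
        v.append(row)
        backpointers.append(pointers)
    last = max(range(n), key=lambda i: v[-1][i])
    path = [last]
    for pointers in reversed(backpointers):
        path.insert(0, pointers[path[0]])
    return path, v[-1][last]


path, probability = viterbi(pi, A, B, obs)
states = ["S%d" % (i + 1) for i in path]
```

Under these assumptions the best path stays in S2 for all three steps, with joint probability 0.8 × 0.3 × 0.7 × 0.2 × 0.7 × 0.5 = 0.01176.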
MACHINE INTELLIGENCE
HMM
MACHINE
INTELLIGENCE
Hidden Markov Model
K.S.Srinivas
Department of Computer Science and Engineering
MACHINE INTELLIGENCE
Baum-Welch Algorithm
ξ is the probability of transitioning from state i to state j after time t, given all the
observations. It can be computed from α and β in a similar way.
MACHINE INTELLIGENCE
Baum-Welch Algorithm
Intuitively, with a fixed HMM model, we refine the state occupation probability (γ)
and the transition (ξ) with the given observations.
Here comes the chicken and egg part. Once the distribution of γ and ξ (θ₂) are
refined, we can perform a point estimate on what will be the best transition and
emission probability (θ₁: a, b).
MACHINE INTELLIGENCE
Baum-Welch Algorithm
$$\pi_i = \gamma_1(i)$$
where γt(i) is the probability of the system being in state i at time t
MACHINE INTELLIGENCE
Baum-Welch Algorithm
MACHINE INTELLIGENCE
HMM