
Lecture 13. EM Algorithm (After-class)

Notes: As we saw before, many estimation problems require maximization of the probability
distribution with respect to an unknown parameter, for example when computing ML estimates of the
parameters or MAP estimates of the hidden random variables. For many interesting problems,
differentiating the probability distribution with respect to the parameter of interest and setting the
derivative to zero results in a nonlinear equation that does not have a closed-form solution. In such
cases, we have to resort to numerical optimization.

Maximum likelihood estimation with unknown parameters


Example: Let w be a Bernoulli r.v. with w = 1 w.p. δ and w = 0 w.p. 1 − δ, and let y (a binary r.v.) be a noisy observation of w:

P_{Y|W}(y|w) = ϵ if y ≠ w, and 1 − ϵ if y = w.

Let w = [w_1, ⋯, w_n]^T be n i.i.d. samples of w, and y = [y_1, ⋯, y_n]^T the corresponding observations.

We do not observe w, but observe y.

Our goal is to estimate δ and ϵ.

ML: P_Y(y; ϵ, δ) = ∏_{i=1}^n P_{Y_i}(y_i; ϵ, δ)
= ∏_{i=1}^n [P_{Y_i|W_i}(y_i | w_i = 0; ϵ, δ) P_{W_i}(0; ϵ, δ) + P_{Y_i|W_i}(y_i | w_i = 1; ϵ, δ) P_{W_i}(1; ϵ, δ)],
where each w_i is a latent r.v.
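For concreteness, each factor of this product works out to a two-term mixture over the hidden value of w_i:

P_{Y_i}(1; ϵ, δ) = (1 − ϵ)δ + ϵ(1 − δ),  P_{Y_i}(0; ϵ, δ) = ϵδ + (1 − ϵ)(1 − δ).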

(ϵ*, δ*) = arg max_{ϵ,δ} P_Y(y; ϵ, δ)

There is no closed-form solution for the optimal ϵ, δ.



An algorithm to compute the optimal parameters in the presence of hidden variables ⇒ the EM Algorithm.

General setup
Assume the complete data z is generated by P_Z(⋅; x), where x is the parameter to estimate.

We can only observe y = g(z), the observed data, where g is a deterministic function.


In our example: z = [w, y], y = y, x = [ϵ, δ].

The goal is to find x* = arg max_x P_Y(y; x) = arg max_x log P_Y(y; x).

Note that P_Z(z; x) = ∑_y P_{Z|Y}(z|y; x) · P_Y(y; x) = P_{Z|Y}(z|g(z); x) · P_Y(g(z); x), since each term of the sum is the joint P_{Z,Y}(z, y; x), which is nonzero only when y = g(z).


⇒ Letting y = g(z): log P_Y(y; x) = log P_Z(z; x) − log P_{Z|Y}(z|y; x)

Take the expectation of both sides with respect to P_{Z|Y}(z|y; x′):


LHS = ∑_{z′} P_{Z|Y}(z′|y; x′) log P_Y(y; x) = (∑_{z′} P_{Z|Y}(z′|y; x′)) log P_Y(y; x) = log P_Y(y; x)

RHS = E[log P_Z(z; x) | Y = y, x = x′] − E[log P_{Z|Y}(z|y; x) | Y = y, x = x′]

where we define
U(x, x′) ≜ ∑_{z′} P_{Z|Y}(z′|y; x′) log P_Z(z′; x) and V(x, x′) ≜ −∑_{z′} P_{Z|Y}(z′|y; x′) log P_{Z|Y}(z′|y; x)

⇒ log P_Y(y; x) = U(x, x′) + V(x, x′), ∀x′.

Lemma: V(x, x′) ≥ V(x′, x′)

Proof: V(x, x′) − V(x′, x′) = ∑_z P_{Z|Y}(z|y; x′) log [P_{Z|Y}(z|y; x′) / P_{Z|Y}(z|y; x)] = D(P_{Z|Y; x′} ∥ P_{Z|Y; x}) ≥ 0

⇒ If we can find x ≠ x′ such that U(x, x′) ≥ U(x′, x′), then
log P_Y(y; x) = U(x, x′) + V(x, x′) ≥ U(x′, x′) + V(x′, x′) = log P_Y(y; x′)



EM Algorithm:

1. Initialization: choose an initial estimate x̂^(0).

2. Repeat until convergence:

   E-step: given the previous estimate x̂^(n), compute
   U(x, x̂^(n)) = E[log P_Z(z; x) | Y = y, x = x̂^(n)]

   M-step: find x̂^(n+1) maximizing U(·, x̂^(n)):
   x̂^(n+1) = arg max_x U(x, x̂^(n))

⇒ U(x̂^(n+1), x̂^(n)) ≥ U(x̂^(n), x̂^(n))

We therefore obtain a sequence x̂^(0), x̂^(1), ⋯ such that

P_Y(y; x̂^(0)) ≤ P_Y(y; x̂^(1)) ≤ ⋯

Since P_Y(y; x) ≤ 1, this is a non-decreasing bounded sequence ⇒ it must converge.

The EM Algorithm converges to a stationary point of the likelihood function, i.e., if x* is the convergence point, then ∂P_Y(y; x)/∂x |_{x=x*} = 0.
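As a concrete illustration, below is a minimal numerical sketch of these two steps applied to the running example (estimating ϵ and δ from the noisy observations y). The E-step and M-step update formulas in the code are derived from the model as stated above but are not spelled out in the notes, so read them as an illustrative sketch rather than the lecture's own derivation.

```python
import numpy as np

def em_bernoulli_noise(y, eps0=0.3, delta0=0.4, n_iter=100):
    """EM sketch for the noisy-Bernoulli example: w_i ~ Bern(delta) is hidden,
    y_i equals w_i with probability 1 - eps and is flipped with probability eps."""
    eps, delta = eps0, delta0
    for _ in range(n_iter):
        # E-step: posterior gamma_i = P(w_i = 1 | y_i; eps, delta)
        p_y_given_w1 = np.where(y == 1, 1 - eps, eps)
        p_y_given_w0 = np.where(y == 0, 1 - eps, eps)
        gamma = delta * p_y_given_w1 / (delta * p_y_given_w1 + (1 - delta) * p_y_given_w0)
        # M-step: maximize the expected complete-data log-likelihood U
        delta = gamma.mean()                          # expected fraction of w_i = 1
        mismatch = gamma * (1 - y) + (1 - gamma) * y  # expected P(w_i != y_i | y_i)
        eps = mismatch.mean()
    return eps, delta

# Toy usage: data generated from the model itself
rng = np.random.default_rng(0)
w = (rng.random(10_000) < 0.7).astype(int)        # hidden w_i, true delta = 0.7
y = np.where(rng.random(10_000) < 0.1, 1 - w, w)  # observed y_i, true eps = 0.1
print(em_bernoulli_noise(y))
```

Note that in this particular toy model the likelihood depends on (ϵ, δ) only through P(y_i = 1) = δ(1 − ϵ) + (1 − δ)ϵ, so the maximizer is not unique; EM still converges to a stationary point, as stated above, but the limit depends on the initialization.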

EM Algorithm for mixture model


Mixture model:
Assume the data are generated by the following process:

1. Sample l_i ∈ {1, ⋯, k}, i = 1, ⋯, m, with l_i ∼ multinomial(ϕ), where ϕ = [ϕ_1, ⋯, ϕ_k] and ∑_{j=1}^k ϕ_j = 1.

2. Sample the observation y_i from some distribution P(l_i, y_i), where P(l_i, y_i) = P(l_i) · P(y_i | l_i).

Mixture Gaussian model: y_i | l_i = j ∼ N(μ_j, Σ_j).
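As a quick illustration of this two-step generative process, the following sketch draws m samples from a k = 2 component Gaussian mixture (the parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (k = 2 components in d = 2 dimensions)
phi = np.array([0.3, 0.7])                      # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # component means mu_j
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances Sigma_j

m = 500
# Step 1: sample labels l_i ~ multinomial(phi)
labels = rng.choice(len(phi), size=m, p=phi)
# Step 2: sample y_i | l_i = j ~ N(mu_j, Sigma_j)
y = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in labels])
```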



In our notation,

x = [ϕ, μ, Σ], where ϕ = [ϕ_1, ⋯, ϕ_k], μ = [μ_1, ⋯, μ_k], Σ = [Σ_1, ⋯, Σ_k]

z = [l_1, ⋯, l_m, y_1, ⋯, y_m]

y = [y_1, ⋯, y_m]

The EM Algorithm:

E-step:

U(x, x̂^(n)) = E_{P_{Z|Y}(·|y; x̂^(n))}[log P_Z(z; x) | Y = y, x = x̂^(n)]
= ∑_{i=1}^m ∑_{l_i=1}^k P(l_i | y_i; x̂^(n)) log P(l_i, y_i; x)

M-step:

x̂^(n+1) = arg max_x U(x, x̂^(n))

The Mixture Gaussian:

E-step:

P(l_i = j | y_i; x̂^(n))
= [P(y_i | l_i = j; x̂^(n)) · P(l_i = j; x̂^(n))] / P(y_i; x̂^(n))
= [P(y_i | l_i = j; x̂^(n)) · P(l_i = j; x̂^(n))] / [∑_{j′=1}^k P(y_i | l_i = j′; x̂^(n)) · P(l_i = j′; x̂^(n))]
= [(2π)^{−d/2} |Σ̂_j^(n)|^{−1/2} exp(−½ (y_i − μ̂_j^(n))^T (Σ̂_j^(n))^{−1} (y_i − μ̂_j^(n))) · ϕ̂_j^(n)] / [∑_{j′=1}^k (2π)^{−d/2} |Σ̂_{j′}^(n)|^{−1/2} exp(−½ (y_i − μ̂_{j′}^(n))^T (Σ̂_{j′}^(n))^{−1} (y_i − μ̂_{j′}^(n))) · ϕ̂_{j′}^(n)]
≜ w_ij
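A direct transcription of this E-step into code might look as follows (a sketch; the function and variable names are mine, and the Gaussian density is evaluated with plain numpy):

```python
import numpy as np

def gaussian_pdf(y, mu, Sigma):
    """Multivariate normal density N(y; mu, Sigma)."""
    d = len(mu)
    diff = y - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def e_step(y, phi, mu, Sigma):
    """Responsibilities w[i, j] = P(l_i = j | y_i; current parameter estimates)."""
    m, k = len(y), len(phi)
    w = np.zeros((m, k))
    for i in range(m):
        for j in range(k):
            w[i, j] = gaussian_pdf(y[i], mu[j], Sigma[j]) * phi[j]
        w[i] /= w[i].sum()   # normalize over j, as in the formula above
    return w
```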

⇒ Compute U(x, x̂^(n)) for all x:

U(x, x̂^(n)) = ∑_{i=1}^m ∑_{j=1}^k w_ij log P(l_i = j, y_i; x)
= ∑_{i=1}^m ∑_{j=1}^k w_ij (log P(l_i = j; x) + log P(y_i | l_i = j; x))
= ∑_{i=1}^m ∑_{j=1}^k w_ij (log ϕ_j − ½ log((2π)^d |Σ_j|) − ½ (y_i − μ_j)^T Σ_j^{−1} (y_i − μ_j))

Find x* = arg max_x U(x, x̂^(n)).
M-step: updating the parameters.

Take the derivative with respect to ϕ_j. Note that ∑_{j=1}^k ϕ_j = 1, so use the Lagrange multiplier method:

∂/∂ϕ_j [U(x, x̂^(n)) − λ(∑_{j=1}^k ϕ_j − 1)] |_{ϕ_j = ϕ̂_j^(n+1)} = (∑_{i=1}^m w_ij) / ϕ̂_j^(n+1) − λ = 0

By ∑_{j=1}^k ϕ_j = 1, λ = ∑_{i=1}^m ∑_{j=1}^k w_ij = m, and then

ϕ̂_j^(n+1) = (1/m) ∑_{i=1}^m w_ij
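For instance, with m = 3 samples and k = 2 components this update is simply the column average of the responsibility matrix (a tiny numeric sketch with made-up responsibilities):

```python
import numpy as np

w = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])   # w[i, j] from the E-step (illustrative values)
phi_new = w.mean(axis=0)     # phi_j^(n+1) = (1/m) * sum_i w_ij
print(phi_new)               # [0.56666667 0.43333333]
```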

Take the derivative with respect to μ_j:

∂U(x, x̂^(n))/∂μ_j |_{μ_j = μ̂_j^(n+1)} = ∑_{i=1}^m w_ij Σ_j^{−1} (y_i − μ̂_j^(n+1)) = 0

Then, we have

μ̂_j^(n+1) = (∑_{i=1}^m w_ij · y_i) / (∑_{i=1}^m w_ij)

Take the derivative with respect to Σ_j:

∂U(x, x̂^(n))/∂Σ_j |_{Σ_j = Σ̂_j^(n+1), μ_j = μ̂_j^(n+1)}
= −½ ∑_{i=1}^m w_ij ((Σ̂_j^(n+1))^{−1} − (Σ̂_j^(n+1))^{−1} (y_i − μ̂_j^(n+1))(y_i − μ̂_j^(n+1))^T (Σ̂_j^(n+1))^{−1}) = 0

⇒ ∑_{i=1}^m w_ij (Σ̂_j^(n+1) − (y_i − μ̂_j^(n+1))(y_i − μ̂_j^(n+1))^T) = 0

Then, we have

Σ̂_j^(n+1) = (∑_{i=1}^m w_ij (y_i − μ̂_j^(n+1))(y_i − μ̂_j^(n+1))^T) / (∑_{i=1}^m w_ij)
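Putting the three updates together, here is a sketch of the M-step and the full EM loop, building on the e_step function sketched above (again, the function names are illustrative, not from the notes):

```python
import numpy as np

def m_step(y, w):
    """Update phi, mu, Sigma from the responsibilities w[i, j]."""
    m, k = w.shape
    Nj = w.sum(axis=0)                # effective sample count per component
    phi = Nj / m                      # phi_j^(n+1)
    mu = (w.T @ y) / Nj[:, None]      # mu_j^(n+1): responsibility-weighted mean
    d = y.shape[1]
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = y - mu[j]
        # Sigma_j^(n+1): responsibility-weighted scatter around the new mean
        Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j]
    return phi, mu, Sigma

def em_gmm(y, phi, mu, Sigma, n_iter=50):
    """Alternate the E-step and M-step for a fixed iteration budget."""
    for _ in range(n_iter):
        w = e_step(y, phi, mu, Sigma)   # e_step from the earlier sketch
        phi, mu, Sigma = m_step(y, w)
    return phi, mu, Sigma
```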

