A Note on Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
August 31, 2017
(I’m not fully confident about this note.)
1 ELBO
In latent LSTM allocation, the topic assignments z_d = \{z_{d,1}, \dots, z_{d,N_d}\} for each document d are drawn from categorical distributions whose parameters are obtained as softmax outputs of an LSTM.
Based on the description of the generative process given in the paper [1], we obtain the full joint distribution as follows:

p(\{w_1, \dots, w_D\}, \{z_1, \dots, z_D\}, \phi; \mathrm{LSTM}, \beta) = p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})    (1)
We maximize the evidence p(\{w_1, \dots, w_D\}; \mathrm{LSTM}, \beta), which is obtained as below.

p(\{w_1, \dots, w_D\}; \mathrm{LSTM}, \beta) = \sum_{\{z_1, \dots, z_D\}} \int p(\{w_1, \dots, w_D\}, \{z_1, \dots, z_D\}, \phi; \mathrm{LSTM}, \beta) \, d\phi
  = \sum_{\{z_1, \dots, z_D\}} \int p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \, d\phi,    (2)
where

p(w_d, z_d \mid \phi; \mathrm{LSTM}) = p(w_d \mid z_d, \phi) \, p(z_d; \mathrm{LSTM}) = \prod_t p(w_{d,t} \mid z_{d,t}, \phi) \, p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})    (3)
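To make the generative process of Eqs. (1)–(3) concrete, here is a minimal sketch in Python. The recurrence lstm_step and the weight matrices are toy stand-ins introduced only for illustration, not the LSTM architecture used in [1].

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, H = 4, 50, 16                            # topics, vocabulary size, hidden size

    # Toy stand-ins for the LSTM and for the topic-word distributions phi_k.
    W_h = rng.normal(scale=0.1, size=(H, H + K))   # state update from (state, one-hot z_{t-1})
    W_o = rng.normal(size=(K, H))                  # maps state to topic logits
    phi = rng.dirichlet(np.full(V, 0.1), size=K)   # phi_k ~ Dirichlet(beta)

    def lstm_step(h, z_prev):
        """Toy recurrence standing in for the LSTM: returns the new state and
        the softmax output p(z_t = . | z_{1:t-1})."""
        z_onehot = np.zeros(K) if z_prev is None else np.eye(K)[z_prev]
        h = np.tanh(W_h @ np.concatenate([h, z_onehot]))
        logits = W_o @ h
        theta = np.exp(logits - logits.max())
        return h, theta / theta.sum()

    def generate_document(N):
        """Eq. (3): z_t ~ Categorical(softmax output), w_t ~ Categorical(phi_{z_t})."""
        h, z_prev, words, topics = np.zeros(H), None, [], []
        for _ in range(N):
            h, theta = lstm_step(h, z_prev)
            z = rng.choice(K, p=theta)
            words.append(rng.choice(V, p=phi[z]))
            topics.append(z)
            z_prev = z
        return words, topics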
Jensen’s inequality gives the following lower bound of the log evidence:

\log p(\{w_1, \dots, w_D\}; \mathrm{LSTM}, \beta) = \log \sum_Z \int p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \, d\phi
  = \log \sum_Z \int q(Z, \phi) \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z, \phi)} \, d\phi
  \geq \sum_Z \int q(Z, \phi) \log \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z, \phi)} \, d\phi \equiv \mathcal{L}    (4)

We denote this lower bound, i.e., the ELBO, by \mathcal{L}.
We assume that the variational posterior q(Z, \phi) factorizes as \prod_k q(\phi_k) \times \prod_d q(z_d). The q(\phi_k) are Dirichlet distributions with parameters \xi_k = \{\xi_{k,1}, \dots, \xi_{k,V}\}.
Then the ELBO \mathcal{L} can be rewritten as below.

\mathcal{L} = \int q(\phi) \log p(\phi; \beta) \, d\phi + \sum_d \sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM}) + \sum_d \sum_{z_d} \int q(z_d) q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
  - \sum_d \sum_{z_d} q(z_d) \log q(z_d) - \int q(\phi) \log q(\phi) \, d\phi    (5)
Further, we assume that q(z_d) factorizes as \prod_t q(z_{d,t}), where the q(z_{d,t}) are categorical distributions satisfying \sum_{k=1}^{K} q(z_{d,t} = k) = 1. We let \gamma_{d,t,k} denote q(z_{d,t} = k).
The second term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM}) = \sum_{z_d} \prod_t q(z_{d,t}) \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})
  = \sum_{z_d} \prod_t q(z_{d,t}) \Big[ \log p(z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,3} \mid z_{d,1}, z_{d,2}; \mathrm{LSTM}) + \cdots + \log p(z_{d,N_d} \mid z_{d,1}, \dots, z_{d,N_d-1}; \mathrm{LSTM}) \Big]
  = \sum_{z_{d,1}=1}^{K} q(z_{d,1}) \log p(z_{d,1}; \mathrm{LSTM}) + \sum_{z_{d,1}=1}^{K} \sum_{z_{d,2}=1}^{K} q(z_{d,1}) q(z_{d,2}) \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM})
  + \cdots + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d-1}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d-1}) \log p(z_{d,N_d-1} \mid z_{d,1}, \dots, z_{d,N_d-2}; \mathrm{LSTM})
  + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d}) \log p(z_{d,N_d} \mid z_{d,1}, \dots, z_{d,N_d-1}; \mathrm{LSTM})    (6)
The evaluation of Eq. (6) is intractable. However, for each t, the z_{d,1:t-1} in p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) can be regarded as free variables whose values are set by some procedure unrelated to the generative model. We obtain the values of the z_{d,1:t-1} by an LSTM forward pass and denote them by \hat{z}_{d,1:t-1}. Then we can simplify Eq. (6) as follows:
\sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM}) \approx \sum_{t=1}^{N_d} \sum_{z_{d,t}=1}^{K} q(z_{d,t}) \log p(z_{d,t} \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})
  = \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})    (7)
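A minimal sketch of Eq. (7) in Python. Here gamma is the (N_d, K) array of variational probabilities \gamma_{d,t,k} and theta is the (N_d, K) array of LSTM softmax outputs p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}); both names are placeholders introduced for illustration.

    import numpy as np

    def second_term(gamma, theta, eps=1e-12):
        """Eq. (7): sum_t sum_k gamma[t, k] * log p(z_t = k | z_hat_{1:t-1}).
        gamma, theta: arrays of shape (N_d, K) whose rows sum to one."""
        return float(np.sum(gamma * np.log(theta + eps)))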
The third term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_d \sum_{z_d} \int q(z_d) q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi = \sum_d \int q(\phi) \sum_{z_d} q(z_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\phi
  = \int q(\phi) \sum_d \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log \phi_{k, w_{d,t}} \, d\phi
  = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \int q(\phi_k) \log \phi_{k, w_{d,t}} \, d\phi_k
  = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\}    (8)
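The expectation \int q(\phi_k) \log \phi_{k,v} \, d\phi_k = \Psi(\xi_{k,v}) - \Psi(\sum_v \xi_{k,v}) is easy to compute with the digamma function. A minimal sketch of Eq. (8), assuming gammas[d] is the (N_d, K) array of \gamma_{d,t,k} for document d and docs[d] its word ids (names introduced here for illustration):

    import numpy as np
    from scipy.special import digamma

    def expected_log_phi(xi):
        """E_q[log phi_{k,v}] = Psi(xi_{k,v}) - Psi(sum_v xi_{k,v}) for a (K, V) parameter array."""
        return digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))

    def third_term(gammas, docs, xi):
        """Eq. (8): sum over documents, positions, and topics of gamma * E_q[log phi_{k, w_{d,t}}]."""
        elog_phi = expected_log_phi(xi)                              # (K, V)
        total = 0.0
        for gamma, words in zip(gammas, docs):
            total += float(np.sum(gamma * elog_phi[:, words].T))    # gamma: (N_d, K)
        return total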
The first term of \mathcal{L} in Eq. (5) can be rewritten as below.

\int q(\phi) \log p(\phi; \beta) \, d\phi = \sum_k \int q(\phi_k) \log p(\phi_k; \beta) \, d\phi_k
  = K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\phi_k) \log \phi_{k,v} \, d\phi_k
  = K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}    (9)
The fourth term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_d \sum_{z_d} q(z_d) \log q(z_d) = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log q(z_{d,t} = k)    (10)
The last term of \mathcal{L} can be rewritten as below.

\int q(\phi) \log q(\phi) \, d\phi = \sum_k \int q(\phi_k) \log q(\phi_k) \, d\phi_k
  = \sum_k \log \Gamma\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v}) + \sum_k \sum_v (\xi_{k,v} - 1) \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}    (11)
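The two \phi-related terms, Eqs. (9) and (11), involve only log-gamma and digamma functions of the variational Dirichlet parameters. A minimal sketch:

    import numpy as np
    from scipy.special import digamma, gammaln

    def dirichlet_terms(xi, beta):
        """Eq. (9) (prior term) and Eq. (11) (negative entropy of q(phi)) for a (K, V) array xi."""
        K, V = xi.shape
        elog_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))
        prior_term = (K * gammaln(V * beta) - K * V * gammaln(beta)
                      + (beta - 1.0) * elog_phi.sum())                    # Eq. (9)
        neg_entropy = (gammaln(xi.sum(axis=1)).sum() - gammaln(xi).sum()
                       + ((xi - 1.0) * elog_phi).sum())                   # Eq. (11)
        return prior_term, neg_entropy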
2 Inference
The partial derivative of \mathcal{L} with respect to \gamma_{d,t,k} is

\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}} = \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}) + \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) - \log \gamma_{d,t,k} + \mathrm{const.}    (12)
By solving \partial \mathcal{L} / \partial \gamma_{d,t,k} = 0 under the constraint \sum_k \gamma_{d,t,k} = 1, we obtain

\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}),    (13)

where \tilde{\phi}_{k, w_{d,t}} \equiv \exp\{\Psi(\xi_{k, w_{d,t}})\} / \exp\{\Psi(\sum_v \xi_{k,v})\}. When t = 1, \gamma_{d,1,k} \propto \tilde{\phi}_{k, w_{d,1}} \, p(z_{d,1} = k; \mathrm{LSTM}). Therefore, q(z_{d,1}) does not depend on the z_{d,t} for t > 1, and we can draw a sample from q(z_{d,1}) without seeing the z_{d,t} for t > 1. When t = 2, \gamma_{d,2,k} \propto \tilde{\phi}_{k, w_{d,2}} \, p(z_{d,2} = k \mid \hat{z}_{d,1}; \mathrm{LSTM}). That is, q(z_{d,2}) depends only on \hat{z}_{d,1}. One possible way to determine \hat{z}_{d,1} is therefore to draw a sample from q(z_{d,1}), because this drawing can be performed without seeing the z_{d,t} for t > 1. For each t > 2, we can repeat a similar argument. However, this procedure for determining the \hat{z}_{d,t} is made possible only by the assumption that led to the approximation in Eq. (7), because without that assumption we cannot obtain the simple update \gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}). Moreover, the assumption says nothing about how the \hat{z}_{d,t} should be sampled. For example, we could draw the \hat{z}_{d,t} simply from the softmax output of the LSTM at each t, without using \phi. In any case, the assumption leading to the approximation in Eq. (7) provides no answer to the question of why \phi should be used when sampling the \hat{z}_{d,t}.
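A minimal sketch of the update in Eq. (13) for one document, reusing the placeholder names from the sketches above: words holds the word ids, theta the (N_d, K) LSTM softmax outputs p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}), and xi the (K, V) Dirichlet parameters.

    import numpy as np
    from scipy.special import digamma

    def update_gamma(words, theta, xi):
        """Eq. (13): gamma[t, k] proportional to exp(E_q[log phi_{k, w_t}]) * theta[t, k]."""
        elog_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))   # (K, V)
        gamma = np.exp(elog_phi[:, words].T) * theta                      # (N_d, K)
        return gamma / gamma.sum(axis=1, keepdims=True)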
For \xi_{k,v}, we obtain the estimate \xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k} as usual.
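A minimal sketch of this update, accumulating the \gamma_{d,t,k} over all documents (gammas and docs are the hypothetical per-document arrays used above):

    import numpy as np

    def update_xi(gammas, docs, beta, K, V):
        """xi_{k,v} = beta + sum of gamma_{d,t,k} over all positions t with w_{d,t} = v."""
        xi = np.full((K, V), beta, dtype=float)
        for gamma, words in zip(gammas, docs):
            for k in range(K):
                # accumulate gamma[t, k] into xi[k, words[t]], summing repeated word ids
                np.add.at(xi[k], words, gamma[:, k])
        return xi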
Let \theta_{d,t,k} denote p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}), which is a softmax output of the LSTM. The partial derivative of \mathcal{L} with respect to any LSTM parameter is

\frac{\partial \mathcal{L}}{\partial \mathrm{LSTM}} = \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \mathrm{LSTM}} \log \theta_{d,t,k} = \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \mathrm{LSTM}},    (14)

where B denotes a minibatch of documents.
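Eq. (14) means the LSTM can be trained by backpropagating through the weighted log-softmax objective \sum_k \gamma_{d,t,k} \log \theta_{d,t,k}. As a concrete special case (not spelled out in the note), the gradient of this objective with respect to the softmax logits at step t reduces to \gamma_{d,t} - \theta_{d,t} when \sum_k \gamma_{d,t,k} = 1; a minimal sketch:

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def logit_gradient(gamma_t, logits_t):
        """Gradient of sum_k gamma_t[k] * log softmax(logits_t)[k] w.r.t. logits_t.
        Equals gamma_t - theta_t when gamma_t sums to one; this is the per-step
        error signal backpropagated into the LSTM parameters."""
        theta_t = softmax(logits_t)
        return gamma_t - theta_t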
References
[1] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering
and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.