Reparameterization of Discrete Variables for Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
September 1, 2017
1 ELBO
In latent LSTM allocation, the topic assignments $\boldsymbol{z}_d = \{z_{d,1}, \ldots, z_{d,N_d}\}$ for each document $d$ are drawn from categorical distributions whose parameters are obtained as softmax outputs of an LSTM.
Based on the description of the generative process given in the paper [2], we obtain the full joint
distribution as follows:
$$
p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}, \{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}; \mathrm{LSTM}, \beta)
= p(\boldsymbol{\phi}; \beta) \prod_{d=1}^{D} p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})
\tag{1}
$$
We maximize the evidence $p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)$, which is obtained as below.
$$
\begin{aligned}
p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)
&= \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}, \{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}; \mathrm{LSTM}, \beta) \, d\boldsymbol{\phi} \\
&= \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM}) \, d\boldsymbol{\phi},
\end{aligned}
\tag{2}
$$
where
$$
\begin{aligned}
p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})
&= p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, p(\boldsymbol{z}_d; \mathrm{LSTM}) \\
&= \prod_t p(w_{d,t} \mid z_{d,t}, \boldsymbol{\phi}) \, p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})
\end{aligned}
\tag{3}
$$
Jensen's inequality gives the following lower bound on the log evidence:
$$
\begin{aligned}
\log p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)
&= \log \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM}) \, d\boldsymbol{\phi} \\
&= \log \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}) \,
\frac{p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})}{q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})} \, d\boldsymbol{\phi} \\
&\ge \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}) \,
\log \frac{p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})}{q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})} \, d\boldsymbol{\phi}
\equiv \mathcal{L}
\end{aligned}
\tag{4}
$$
We denote this lower bound, i.e., the ELBO, by $\mathcal{L}$.
We assume that the variational posterior $q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})$ factorizes as $\prod_k q(\boldsymbol{\phi}_k) \times \prod_d q(\boldsymbol{z}_d)$. Each $q(\boldsymbol{\phi}_k)$ is a Dirichlet distribution with parameters $\boldsymbol{\xi}_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}$.
Then the ELBO $\mathcal{L}$ can be rewritten as below.
$$
\begin{aligned}
\mathcal{L}
&= \int q(\boldsymbol{\phi}) \log p(\boldsymbol{\phi}; \beta) \, d\boldsymbol{\phi}
+ \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log p(\boldsymbol{z}_d; \mathrm{LSTM})
+ \sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi} \\
&\quad - \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log q(\boldsymbol{z}_d)
- \int q(\boldsymbol{\phi}) \log q(\boldsymbol{\phi}) \, d\boldsymbol{\phi}
\end{aligned}
\tag{5}
$$
The second term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log p(\boldsymbol{z}_d; \mathrm{LSTM})
&= \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \\
&= \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \Big[ \log p(z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,3} \mid z_{d,1}, z_{d,2}; \mathrm{LSTM}) \\
&\qquad\qquad\qquad + \cdots + \log p(z_{d,N_d} \mid z_{d,1:N_d-1}; \mathrm{LSTM}) \Big]
\end{aligned}
\tag{6}
$$
The evaluation of Eq. (6) is intractable. However, for each $t$, we can reparameterize $z_{d,t}$ as $z_{d,t} = g_{\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t})$, where $z_{d,t}$ is represented as a one-hot vector [1]. That is, when $z_{d,t} = k$, $g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) = 1$ and $g_{j,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) = 0$ for $j \ne k$.
Then the expectation with respect to the hidden variables can be rewritten as follows.
$$
\mathbb{E}_{q_{\boldsymbol{\zeta}_d}(\boldsymbol{z}_d)}\!\left[ \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \right]
= \mathbb{E}_{p(\boldsymbol{\epsilon}_d)}\!\left[ \sum_t \log p\big(g_{\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \,\big|\, \epsilon_{d,1:t-1}; \boldsymbol{\zeta}_{d,1:t-1}, \mathrm{LSTM}\big) \right]
\tag{7}
$$
We define
$$
g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \equiv 1
\quad \text{if} \quad
\frac{\sum_{j=1}^{k-1} \zeta_{d,t,j}}{\sum_{j=1}^{K} \zeta_{d,t,j}} \le \epsilon_{d,t} < \frac{\sum_{j=1}^{k} \zeta_{d,t,j}}{\sum_{j=1}^{K} \zeta_{d,t,j}},
$$
and $g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \equiv 0$ otherwise.
$$
\log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})
= \log \sum_{k=1}^{K} z_{d,t,k} \, \theta_{d,t,k}
= \log \sum_{k=1}^{K} g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \, \theta_{d,t,k},
\tag{8}
$$
where $\boldsymbol{\theta}_{d,t}$ is the softmax output of the LSTM and thus is a function of $z_{d,1:t-1}$. We assume that $\epsilon_{d,t} \sim U(0, 1)$.
$$
\begin{aligned}
\mathbb{E}_{q_{\boldsymbol{\zeta}_d}(\boldsymbol{z}_d)}\!\left[ \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \right]
&= \mathbb{E}_{q_{\boldsymbol{\zeta}_d^{\setminus t}}(\boldsymbol{z}_d^{\setminus t})}\!\left[ \int \log \sum_{k=1}^{K} g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \, \theta_{d,t,k} \, d\epsilon_{d,t} \right] \\
&= \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \log \theta_{d,t,k}(z_{d,1:t-1})
\,\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\end{aligned}
\tag{9}
$$
Here the superscript $\setminus t$ collects the variables at all positions other than $t$, and $h_{\boldsymbol{\zeta}_d^{\setminus t}}$ maps the noise samples $\boldsymbol{\epsilon}_d^{\setminus t}$ to those topic assignments through the reparameterization.
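The map $g_{k,\boldsymbol{\zeta}_{d,t}}$ used above is inverse-CDF sampling of a categorical variable. As a concrete illustration (a minimal sketch; the function and variable names are mine, not code from [1] or [2]), the following Python snippet draws the one-hot $z_{d,t}$ from $\boldsymbol{\zeta}_{d,t}$ and a uniform noise sample exactly as $g$ is defined.

```python
import numpy as np

def g(zeta_dt, eps_dt):
    """One-hot draw z_{d,t} via the piecewise-constant map g_{zeta_{d,t}} (cf. Eq. (8)).

    zeta_dt : length-K array of positive variational weights zeta_{d,t,1..K}
    eps_dt  : a single uniform noise sample epsilon_{d,t} ~ U(0, 1)
    """
    probs = zeta_dt / zeta_dt.sum()                 # zeta_{d,t,k} / sum_j zeta_{d,t,j}
    cdf = np.cumsum(probs)
    # pick the segment k with cdf[k-1] <= eps_dt < cdf[k]
    k = int(np.searchsorted(cdf, eps_dt, side="right"))
    k = min(k, len(probs) - 1)                      # guard against floating-point round-off
    one_hot = np.zeros_like(probs)
    one_hot[k] = 1.0
    return one_hot

# Toy usage with K = 4 topics and made-up weights
rng = np.random.default_rng(0)
zeta = np.array([0.5, 2.0, 1.0, 0.5])
z_dt = g(zeta, rng.uniform())                       # a one-hot length-4 array
```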
The third term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi}
&= \sum_d \int q(\boldsymbol{\phi}) \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\boldsymbol{\phi} \\
&= \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \int q(\boldsymbol{\phi}) \log \phi_{z_{d,t}, w_{d,t}} \, d\boldsymbol{\phi} \\
&= \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \left\{ \Psi(\xi_{z_{d,t}, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{z_{d,t}, v} \Big) \right\}
\end{aligned}
\tag{10}
$$
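The digamma expression in Eq. (10) is the usual Dirichlet expectation $\mathbb{E}_q[\log \phi_{k,v}] = \Psi(\xi_{k,v}) - \Psi(\sum_v \xi_{k,v})$. A small Python sketch of how one would evaluate it (array shapes and values are made up for illustration):

```python
import numpy as np
from scipy.special import digamma

# E_q[log phi_{k,v}] = digamma(xi_{k,v}) - digamma(sum_v xi_{k,v}), as in Eq. (10).
# xi holds the K x V variational Dirichlet parameters; the numbers are toy values.
K, V = 3, 5
xi = 0.1 + np.random.default_rng(1).random((K, V))

E_log_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))   # shape (K, V)
# For a token w_{d,t} = v assigned to topic z_{d,t} = k, the contribution is E_log_phi[k, v].
```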
By using the reparameterization described above, we obtain the following.
$$
\sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi}
= \sum_d \sum_t \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \left\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{k,v} \Big) \right\}
\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\tag{11}
$$
The first term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\int q(\boldsymbol{\phi}) \log p(\boldsymbol{\phi}; \beta) \, d\boldsymbol{\phi}
&= \sum_k \int q(\boldsymbol{\phi}_k) \log p(\boldsymbol{\phi}_k; \beta) \, d\boldsymbol{\phi}_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\boldsymbol{\phi}_k) \log \phi_{k,v} \, d\boldsymbol{\phi}_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \left\{ \Psi(\xi_{k,v}) - \Psi\!\Big( \sum_{v'} \xi_{k,v'} \Big) \right\}
\end{aligned}
\tag{12}
$$
The fourth term of $\mathcal{L}$ in Eq. (5) can be rewritten as below with the reparameterization described above.
$$
\sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log q(\boldsymbol{z}_d)
= \sum_{d=1}^{D} \sum_t \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \log \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})}
\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\tag{13}
$$
The last term of $\mathcal{L}$ can be rewritten as below.
$$
\begin{aligned}
\int q(\boldsymbol{\phi}) \log q(\boldsymbol{\phi}) \, d\boldsymbol{\phi}
&= \sum_k \int q(\boldsymbol{\phi}_k) \log q(\boldsymbol{\phi}_k) \, d\boldsymbol{\phi}_k \\
&= \sum_k \log \Gamma\!\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v})
+ \sum_k \sum_v (\xi_{k,v} - 1) \left\{ \Psi(\xi_{k,v}) - \Psi\!\Big( \sum_{v'} \xi_{k,v'} \Big) \right\}
\end{aligned}
\tag{14}
$$
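Terms (12) and (14) are closed-form functions of the Dirichlet parameters only. Below is a minimal Python sketch of their evaluation; the function name and array shapes are my own choices for illustration, not taken from the papers.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_elbo_terms(xi, beta):
    """Evaluate Eq. (12), E_q[log p(phi; beta)], and Eq. (14), E_q[log q(phi)].

    xi   : (K, V) array of variational Dirichlet parameters xi_{k,v}
    beta : scalar symmetric Dirichlet hyperparameter
    """
    K, V = xi.shape
    E_log_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))
    # Eq. (12): K log Gamma(V beta) - K V log Gamma(beta) + (beta - 1) sum_{k,v} E_q[log phi_{k,v}]
    term12 = K * gammaln(V * beta) - K * V * gammaln(beta) + (beta - 1.0) * E_log_phi.sum()
    # Eq. (14): sum_k log Gamma(sum_v xi_{k,v}) - sum_{k,v} log Gamma(xi_{k,v})
    #           + sum_{k,v} (xi_{k,v} - 1) E_q[log phi_{k,v}]
    term14 = (gammaln(xi.sum(axis=1)).sum()
              - gammaln(xi).sum()
              + ((xi - 1.0) * E_log_phi).sum())
    return term12, term14
```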
2 Inference
For simplicity, we assume that the $\boldsymbol{\zeta}_{d,t}$ do not depend on $\{\boldsymbol{\zeta}_{d,t'} : t' \ne t\}$. Further, we let $\gamma_{d,t,k}$ denote $\zeta_{d,t,k} \big/ \sum_j \zeta_{d,t,j}$. Then the partial differentiation of $\mathcal{L}$ with respect to $\gamma_{d,t,k}$ is
$$
\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}}
= \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \log \theta_{d,t,k}(z_{d,1:t-1}) \,\Big|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(\boldsymbol{\epsilon}_d^{\setminus t})} \right]
+ \Psi(\xi_{k, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{k,v} \Big) - \log \gamma_{d,t,k} + \mathrm{const.}
\tag{15}
$$
The first term of Eq. (15) can be approximated by drawing samples $\hat{z}_{d,t'} \sim \mathrm{Categorical}(\boldsymbol{\gamma}_{d,t'})$ for $t' < t$ and then estimating $\theta_{d,t,k}(z_{d,1:t-1})$ by an LSTM forward pass. By solving $\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}} = 0$, we obtain
$$
\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, \theta_{d,t,k}(\hat{z}_{d,1:t-1}),
\tag{16}
$$
where $\tilde{\phi}_{k, w_{d,t}} \equiv \dfrac{\exp\!\big( \Psi(\xi_{k, w_{d,t}}) \big)}{\exp\!\big( \Psi(\sum_v \xi_{k,v}) \big)}$.
For $\xi_{k,v}$, we obtain the estimate $\xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k}$ as usual.
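Putting Eq. (16) and the $\xi$ update together, a coordinate-ascent sweep over one document might look like the sketch below. The callable `lstm_theta` is a hypothetical stand-in for the LSTM forward pass producing $\boldsymbol{\theta}_{d,t}$ from the sampled history; all names and shapes here are assumptions for illustration, not an implementation from [2].

```python
import numpy as np
from scipy.special import digamma

def update_gamma(doc_words, xi, lstm_theta, rng):
    """One sweep of the gamma updates (Eq. 16) for a single document, a sketch.

    doc_words  : length-N_d array of word ids w_{d,t}
    xi         : (K, V) variational Dirichlet parameters
    lstm_theta : hypothetical callable mapping the sampled history z_hat[:t] to the
                 length-K softmax output theta_{d,t}; stands in for the LSTM forward pass
    """
    K = xi.shape[0]
    N = len(doc_words)
    # phi_tilde[k, v] = exp(digamma(xi_{k,v})) / exp(digamma(sum_v xi_{k,v}))
    phi_tilde = np.exp(digamma(xi) - digamma(xi.sum(axis=1, keepdims=True)))
    gamma = np.full((N, K), 1.0 / K)
    z_hat = np.zeros(N, dtype=int)
    for t, v in enumerate(doc_words):
        theta_t = lstm_theta(z_hat[:t])          # theta_{d,t,k}(z_hat_{d,1:t-1})
        gamma[t] = phi_tilde[:, v] * theta_t     # unnormalized Eq. (16)
        gamma[t] /= gamma[t].sum()
        z_hat[t] = rng.choice(K, p=gamma[t])     # z_hat_{d,t} ~ Categorical(gamma_{d,t})
    return gamma

def update_xi(all_docs, all_gammas, K, V, beta):
    """xi_{k,v} = beta + sum_d sum_{t : w_{d,t} = v} gamma_{d,t,k}, as in the text."""
    xi = np.full((K, V), beta)
    for doc_words, gamma in zip(all_docs, all_gammas):
        for t, v in enumerate(doc_words):
            xi[:, v] += gamma[t]
    return xi
```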
Let $\theta_{d,t,k}$ denote $p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})$, which is a softmax output of the LSTM. The partial differentiation of $\mathcal{L}$ with respect to any LSTM parameter is
$$
\frac{\partial \mathcal{L}}{\partial \mathrm{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \mathrm{LSTM}} \log \theta_{d,t,k}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \mathrm{LSTM}},
\tag{17}
$$
where $B$ denotes a minibatch of documents.
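The gradient in Eq. (17) is what backpropagation through the softmax layer computes for the soft-label objective $\sum_k \gamma_{d,t,k} \log \theta_{d,t,k}$. As a sanity check with toy numbers of my own (not from the paper), the snippet below verifies numerically that, with respect to the pre-softmax activations, this gradient reduces to $\boldsymbol{\gamma}_{d,t} - \boldsymbol{\theta}_{d,t}$.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = 5
a = rng.normal(size=K)                  # pre-softmax activations for one token (toy values)
gamma = rng.dirichlet(np.ones(K))       # variational weights gamma_{d,t,.} (sum to 1)

theta = softmax(a)
analytic = gamma - theta                # gradient of sum_k gamma_k log theta_k w.r.t. a

# central finite-difference check of the same gradient
numeric = np.zeros(K)
h = 1e-6
for i in range(K):
    ap, am = a.copy(), a.copy()
    ap[i] += h
    am[i] -= h
    numeric[i] = (gamma @ np.log(softmax(ap)) - gamma @ np.log(softmax(am))) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```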
References
[1] Seiya Tokui and Issei Sato. Reparameterization trick for discrete variables. CoRR, abs/1611.01239,
2016.
[2] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.