Reparameterization of Discrete Variables for Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
September 1, 2017
1 ELBO
In latent LSTM allocation, the topic assignments $\boldsymbol{z}_d = \{z_{d,1}, \ldots, z_{d,N_d}\}$ for each document $d$ are drawn from categorical distributions whose parameters are obtained as softmax outputs of an LSTM.
Based on the description of the generative process given in the paper [2], we obtain the full joint
distribution as follows:
$$
p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}, \{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}; \mathrm{LSTM}, \beta)
= p(\boldsymbol{\phi}; \beta) \prod_{d=1}^{D} p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})
\tag{1}
$$
We maximize the evidence $p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)$, which is obtained as below.
$$
\begin{aligned}
p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)
&= \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}, \{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}; \mathrm{LSTM}, \beta) \, d\boldsymbol{\phi} \\
&= \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM}) \, d\boldsymbol{\phi},
\end{aligned}
\tag{2}
$$
where
$$
\begin{aligned}
p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})
&= p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, p(\boldsymbol{z}_d; \mathrm{LSTM}) \\
&= \prod_t p(w_{d,t} \mid z_{d,t}, \boldsymbol{\phi}) \, p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})
\end{aligned}
\tag{3}
$$
Jensen's inequality gives the following lower bound on the log evidence:
$$
\begin{aligned}
\log p(\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}; \mathrm{LSTM}, \beta)
&= \log \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM}) \, d\boldsymbol{\phi} \\
&= \log \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}) \,
\frac{p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})}{q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})} \, d\boldsymbol{\phi} \\
&\ge \sum_{\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}} \int q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi}) \,
\log \frac{p(\boldsymbol{\phi}; \beta) \prod_d p(\boldsymbol{w}_d, \boldsymbol{z}_d \mid \boldsymbol{\phi}; \mathrm{LSTM})}{q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})} \, d\boldsymbol{\phi}
\equiv \mathcal{L}
\end{aligned}
\tag{4}
$$
We denote this lower bound, i.e., the ELBO, by $\mathcal{L}$.
We assume that the variational posterior $q(\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_D\}, \boldsymbol{\phi})$ factorizes as $\prod_k q(\boldsymbol{\phi}_k) \times \prod_d q(\boldsymbol{z}_d)$. Each $q(\boldsymbol{\phi}_k)$ is a Dirichlet distribution with parameters $\boldsymbol{\xi}_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}$.
Then the ELBO $\mathcal{L}$ can be rewritten as below.
$$
\begin{aligned}
\mathcal{L}
&= \int q(\boldsymbol{\phi}) \log p(\boldsymbol{\phi}; \beta) \, d\boldsymbol{\phi}
+ \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log p(\boldsymbol{z}_d; \mathrm{LSTM})
+ \sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi} \\
&\quad - \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log q(\boldsymbol{z}_d)
- \int q(\boldsymbol{\phi}) \log q(\boldsymbol{\phi}) \, d\boldsymbol{\phi}
\end{aligned}
\tag{5}
$$
The second term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log p(\boldsymbol{z}_d; \mathrm{LSTM})
&= \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \\
&= \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \Big[ \log p(z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,3} \mid z_{d,1}, z_{d,2}; \mathrm{LSTM}) \\
&\qquad\qquad\qquad + \cdots + \log p(z_{d,N_d} \mid z_{d,1:N_d-1}; \mathrm{LSTM}) \Big]
\end{aligned}
\tag{6}
$$
The evaluation of Eq. (6) is intractable. However, for each $t$, we can reparameterize $z_{d,t}$ as $z_{d,t} = g_{\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t})$, where $z_{d,t}$ is represented as a one-hot vector [1]. That is, when $z_{d,t} = k$, $g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) = 1$ and $g_{j,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) = 0$ for $j \ne k$.
Then the expectation with respect to the hidden variables can be rewritten as follows.
$$
\mathbb{E}_{q_{\boldsymbol{\zeta}_d}(\boldsymbol{z}_d)}\!\left[ \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \right]
= \mathbb{E}_{p(\boldsymbol{\epsilon}_d)}\!\left[ \sum_t \log p\big(g_{\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \,\big|\, \epsilon_{d,1:t-1}; \boldsymbol{\zeta}_{d,1:t-1}, \mathrm{LSTM}\big) \right]
\tag{7}
$$
We define
$$
g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \equiv 1
\quad \text{if} \quad
\frac{\sum_{j=1}^{k-1} \zeta_{d,t,j}}{\sum_{j=1}^{K} \zeta_{d,t,j}} \le \epsilon_{d,t} < \frac{\sum_{j=1}^{k} \zeta_{d,t,j}}{\sum_{j=1}^{K} \zeta_{d,t,j}},
$$
and $g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \equiv 0$ otherwise.
$$
\log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})
= \log \sum_{k=1}^{K} z_{d,t,k} \, \theta_{d,t,k}
= \log \sum_{k=1}^{K} g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \, \theta_{d,t,k},
\tag{8}
$$
where $\boldsymbol{\theta}_{d,t}$ is the softmax output of the LSTM and thus is a function of $z_{d,1:t-1}$. We assume that $\epsilon_{d,t} \sim U(0, 1)$.
$$
\begin{aligned}
\mathbb{E}_{q_{\boldsymbol{\zeta}_d}(\boldsymbol{z}_d)}\!\left[ \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \right]
&= \mathbb{E}_{q_{\boldsymbol{\zeta}_d^{\setminus t}}(\boldsymbol{z}_d^{\setminus t})}\!\left[ \int \log \sum_{k=1}^{K} g_{k,\boldsymbol{\zeta}_{d,t}}(z_{d,1:t-1}, \epsilon_{d,t}) \, \theta_{d,t,k} \, d\epsilon_{d,t} \right] \\
&= \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \log \theta_{d,t,k}(z_{d,1:t-1})
\,\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\end{aligned}
\tag{9}
$$
Here the superscript $\setminus t$ collects the variables at all positions other than $t$, and $h_{\boldsymbol{\zeta}_d^{\setminus t}}$ maps the noise samples $\boldsymbol{\epsilon}_d^{\setminus t}$ to those topic assignments through the reparameterization.
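The map $g_{k,\boldsymbol{\zeta}_{d,t}}$ used above is inverse-CDF sampling of a categorical variable. As a concrete illustration (a minimal sketch; the function and variable names are mine, not code from [1] or [2]), the following Python snippet draws the one-hot $z_{d,t}$ from $\boldsymbol{\zeta}_{d,t}$ and a uniform noise sample exactly as $g$ is defined.

```python
import numpy as np

def g(zeta_dt, eps_dt):
    """One-hot draw z_{d,t} via the piecewise-constant map g_{zeta_{d,t}} (cf. Eq. (8)).

    zeta_dt : length-K array of positive variational weights zeta_{d,t,1..K}
    eps_dt  : a single uniform noise sample epsilon_{d,t} ~ U(0, 1)
    """
    probs = zeta_dt / zeta_dt.sum()                 # zeta_{d,t,k} / sum_j zeta_{d,t,j}
    cdf = np.cumsum(probs)
    # pick the segment k with cdf[k-1] <= eps_dt < cdf[k]
    k = int(np.searchsorted(cdf, eps_dt, side="right"))
    k = min(k, len(probs) - 1)                      # guard against floating-point round-off
    one_hot = np.zeros_like(probs)
    one_hot[k] = 1.0
    return one_hot

# Toy usage with K = 4 topics and made-up weights
rng = np.random.default_rng(0)
zeta = np.array([0.5, 2.0, 1.0, 0.5])
z_dt = g(zeta, rng.uniform())                       # a one-hot length-4 array
```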
The third term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi}
&= \sum_d \int q(\boldsymbol{\phi}) \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\boldsymbol{\phi} \\
&= \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \int q(\boldsymbol{\phi}) \log \phi_{z_{d,t}, w_{d,t}} \, d\boldsymbol{\phi} \\
&= \sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \sum_t \left\{ \Psi(\xi_{z_{d,t}, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{z_{d,t}, v} \Big) \right\}
\end{aligned}
\tag{10}
$$
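The digamma expression in Eq. (10) is the usual Dirichlet expectation $\mathbb{E}_q[\log \phi_{k,v}] = \Psi(\xi_{k,v}) - \Psi(\sum_v \xi_{k,v})$. A small Python sketch of how one would evaluate it (array shapes and values are made up for illustration):

```python
import numpy as np
from scipy.special import digamma

# E_q[log phi_{k,v}] = digamma(xi_{k,v}) - digamma(sum_v xi_{k,v}), as in Eq. (10).
# xi holds the K x V variational Dirichlet parameters; the numbers are toy values.
K, V = 3, 5
xi = 0.1 + np.random.default_rng(1).random((K, V))

E_log_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))   # shape (K, V)
# For a token w_{d,t} = v assigned to topic z_{d,t} = k, the contribution is E_log_phi[k, v].
```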
By using the reparameterization described above, we obtain the following.
$$
\sum_d \sum_{\boldsymbol{z}_d} \int q(\boldsymbol{z}_d) q(\boldsymbol{\phi}) \log p(\boldsymbol{w}_d \mid \boldsymbol{z}_d, \boldsymbol{\phi}) \, d\boldsymbol{\phi}
= \sum_d \sum_t \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \left\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{k,v} \Big) \right\}
\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\tag{11}
$$
The first term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.
$$
\begin{aligned}
\int q(\boldsymbol{\phi}) \log p(\boldsymbol{\phi}; \beta) \, d\boldsymbol{\phi}
&= \sum_k \int q(\boldsymbol{\phi}_k) \log p(\boldsymbol{\phi}_k; \beta) \, d\boldsymbol{\phi}_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\boldsymbol{\phi}_k) \log \phi_{k,v} \, d\boldsymbol{\phi}_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \left\{ \Psi(\xi_{k,v}) - \Psi\!\Big( \sum_{v'} \xi_{k,v'} \Big) \right\}
\end{aligned}
\tag{12}
$$
The fourth term of $\mathcal{L}$ in Eq. (5) can be rewritten as below with the reparameterization described above.
$$
\sum_d \sum_{\boldsymbol{z}_d} q(\boldsymbol{z}_d) \log q(\boldsymbol{z}_d)
= \sum_{d=1}^{D} \sum_t \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \sum_{k=1}^{K} \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})} \log \frac{\zeta_{d,t,k}(z_{d,1:t-1})}{\sum_j \zeta_{d,t,j}(z_{d,1:t-1})}
\Bigg|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(z_{d,t}, \boldsymbol{\epsilon}_d^{\setminus t})} \right]
\tag{13}
$$
The last term of $\mathcal{L}$ can be rewritten as below.
$$
\begin{aligned}
\int q(\boldsymbol{\phi}) \log q(\boldsymbol{\phi}) \, d\boldsymbol{\phi}
&= \sum_k \int q(\boldsymbol{\phi}_k) \log q(\boldsymbol{\phi}_k) \, d\boldsymbol{\phi}_k \\
&= \sum_k \log \Gamma\!\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v})
+ \sum_k \sum_v (\xi_{k,v} - 1) \left\{ \Psi(\xi_{k,v}) - \Psi\!\Big( \sum_{v'} \xi_{k,v'} \Big) \right\}
\end{aligned}
\tag{14}
$$
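Terms (12) and (14) are closed-form functions of the Dirichlet parameters only. Below is a minimal Python sketch of their evaluation; the function name and array shapes are my own choices for illustration, not taken from the papers.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_elbo_terms(xi, beta):
    """Evaluate Eq. (12), E_q[log p(phi; beta)], and Eq. (14), E_q[log q(phi)].

    xi   : (K, V) array of variational Dirichlet parameters xi_{k,v}
    beta : scalar symmetric Dirichlet hyperparameter
    """
    K, V = xi.shape
    E_log_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))
    # Eq. (12): K log Gamma(V beta) - K V log Gamma(beta) + (beta - 1) sum_{k,v} E_q[log phi_{k,v}]
    term12 = K * gammaln(V * beta) - K * V * gammaln(beta) + (beta - 1.0) * E_log_phi.sum()
    # Eq. (14): sum_k log Gamma(sum_v xi_{k,v}) - sum_{k,v} log Gamma(xi_{k,v})
    #           + sum_{k,v} (xi_{k,v} - 1) E_q[log phi_{k,v}]
    term14 = (gammaln(xi.sum(axis=1)).sum()
              - gammaln(xi).sum()
              + ((xi - 1.0) * E_log_phi).sum())
    return term12, term14
```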
2 Inference
For simplicity, we assume that the $\boldsymbol{\zeta}_{d,t}$ do not depend on $\{\boldsymbol{\zeta}_{d,t'} : t' \ne t\}$. Further, we let $\gamma_{d,t,k}$ denote $\zeta_{d,t,k} \big/ \sum_j \zeta_{d,t,j}$. Then the partial differentiation of $\mathcal{L}$ with respect to $\gamma_{d,t,k}$ is
$$
\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}}
= \mathbb{E}_{p(\boldsymbol{\epsilon}_d^{\setminus t})}\!\left[ \log \theta_{d,t,k}(z_{d,1:t-1}) \,\Big|_{\boldsymbol{z}_d^{\setminus t} = h_{\boldsymbol{\zeta}_d^{\setminus t}}(\boldsymbol{\epsilon}_d^{\setminus t})} \right]
+ \Psi(\xi_{k, w_{d,t}}) - \Psi\!\Big( \sum_v \xi_{k,v} \Big) - \log \gamma_{d,t,k} + \mathrm{const.}
\tag{15}
$$
The first term of Eq. (15) can be approximated by drawing samples $\hat{z}_{d,t'} \sim \mathrm{Categorical}(\boldsymbol{\gamma}_{d,t'})$ for $t' < t$ and then estimating $\theta_{d,t,k}(z_{d,1:t-1})$ by an LSTM forward pass. By solving $\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}} = 0$, we obtain
$$
\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, \theta_{d,t,k}(\hat{z}_{d,1:t-1}),
\tag{16}
$$
where $\tilde{\phi}_{k, w_{d,t}} \equiv \dfrac{\exp\!\big( \Psi(\xi_{k, w_{d,t}}) \big)}{\exp\!\big( \Psi(\sum_v \xi_{k,v}) \big)}$.
For $\xi_{k,v}$, we obtain the estimate $\xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k}$ as usual.
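Putting Eq. (16) and the $\xi$ update together, a coordinate-ascent sweep over one document might look like the sketch below. The callable `lstm_theta` is a hypothetical stand-in for the LSTM forward pass producing $\boldsymbol{\theta}_{d,t}$ from the sampled history; all names and shapes here are assumptions for illustration, not an implementation from [2].

```python
import numpy as np
from scipy.special import digamma

def update_gamma(doc_words, xi, lstm_theta, rng):
    """One sweep of the gamma updates (Eq. 16) for a single document, a sketch.

    doc_words  : length-N_d array of word ids w_{d,t}
    xi         : (K, V) variational Dirichlet parameters
    lstm_theta : hypothetical callable mapping the sampled history z_hat[:t] to the
                 length-K softmax output theta_{d,t}; stands in for the LSTM forward pass
    """
    K = xi.shape[0]
    N = len(doc_words)
    # phi_tilde[k, v] = exp(digamma(xi_{k,v})) / exp(digamma(sum_v xi_{k,v}))
    phi_tilde = np.exp(digamma(xi) - digamma(xi.sum(axis=1, keepdims=True)))
    gamma = np.full((N, K), 1.0 / K)
    z_hat = np.zeros(N, dtype=int)
    for t, v in enumerate(doc_words):
        theta_t = lstm_theta(z_hat[:t])          # theta_{d,t,k}(z_hat_{d,1:t-1})
        gamma[t] = phi_tilde[:, v] * theta_t     # unnormalized Eq. (16)
        gamma[t] /= gamma[t].sum()
        z_hat[t] = rng.choice(K, p=gamma[t])     # z_hat_{d,t} ~ Categorical(gamma_{d,t})
    return gamma

def update_xi(all_docs, all_gammas, K, V, beta):
    """xi_{k,v} = beta + sum_d sum_{t : w_{d,t} = v} gamma_{d,t,k}, as in the text."""
    xi = np.full((K, V), beta)
    for doc_words, gamma in zip(all_docs, all_gammas):
        for t, v in enumerate(doc_words):
            xi[:, v] += gamma[t]
    return xi
```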
Let $\theta_{d,t,k}$ denote $p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})$, which is a softmax output of the LSTM. The partial differentiation of $\mathcal{L}$ with respect to any LSTM parameter is
$$
\frac{\partial \mathcal{L}}{\partial \mathrm{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \mathrm{LSTM}} \log \theta_{d,t,k}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \mathrm{LSTM}},
\tag{17}
$$
where $B$ denotes a minibatch of documents.
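The gradient in Eq. (17) is what backpropagation through the softmax layer computes for the soft-label objective $\sum_k \gamma_{d,t,k} \log \theta_{d,t,k}$. As a sanity check with toy numbers of my own (not from the paper), the snippet below verifies numerically that, with respect to the pre-softmax activations, this gradient reduces to $\boldsymbol{\gamma}_{d,t} - \boldsymbol{\theta}_{d,t}$.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = 5
a = rng.normal(size=K)                  # pre-softmax activations for one token (toy values)
gamma = rng.dirichlet(np.ones(K))       # variational weights gamma_{d,t,.} (sum to 1)

theta = softmax(a)
analytic = gamma - theta                # gradient of sum_k gamma_k log theta_k w.r.t. a

# central finite-difference check of the same gradient
numeric = np.zeros(K)
h = 1e-6
for i in range(K):
    ap, am = a.copy(), a.copy()
    ap[i] += h
    am[i] -= h
    numeric[i] = (gamma @ np.log(softmax(ap)) - gamma @ np.log(softmax(am))) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```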
References
[1] Seiya Tokui and Issei Sato. Reparameterization trick for discrete variables. CoRR, abs/1611.01239,
2016.
[2] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.