My presentation slides for a nice ICLR 2018 paper. (2018/05/02, at IDEA Lab)
Paper Information:
Breaking the Softmax Bottleneck: a high-rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
https://arxiv.org/abs/1711.03953
1. BREAKING THE SOFTMAX BOTTLENECK:
A HIGH-RANK RNN LANGUAGE MODEL
Zhilin Yang∗, Zihang Dai∗, Ruslan Salakhutdinov, William W. Cohen
School of Computer Science
Carnegie Mellon University
{zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu
Published as a conference paper at ICLR 2018
∗Equal contribution. Ordering determined by dice rolling.
Ray
4. Language Model
• A corpus of tokens, split into (context, next token) pairs
• Language model (LM): estimates P(next token | context)
Example (← Sharon's diary): "Sharon likes to play _______"
P("EverWing" | "Sharon likes to play") = 0.01
P("boyfriends" | "Sharon likes to play") = 0.666
P("Instagram" | "Sharon likes to play") = 0.1018
…
A language model can be viewed as a table of conditional probabilities over N contexts and M tokens:
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
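To make this table view concrete, here is a minimal Python sketch; the contexts, tokens, and probability values are all made up for illustration:

```python
import numpy as np

# Hypothetical contexts (rows) and next tokens (columns); numbers are made up.
contexts = ["Sharon likes to play", "I love"]
tokens = ["EverWing", "boyfriends", "Instagram"]

# Row i is the distribution P(token | contexts[i]); each row sums to 1.
P = np.array([
    [0.20, 0.55, 0.25],  # P(. | "Sharon likes to play")
    [0.10, 0.30, 0.60],  # P(. | "I love")
])
assert np.allclose(P.sum(axis=1), 1.0)

# Querying the LM is just indexing this N x M matrix.
print(P[contexts.index("Sharon likes to play"), tokens.index("boyfriends")])
```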
5. N-gram Language Model
Corpus:
<s> I am Sharon </s>
<s> Sharon I am </s>
<s> I love group meeting </s>
2-gram example (window size = 2):
{ (<s>, I), (I, am), (am, Sharon), …, (meeting, </s>) }
P( Sharon | am ) = C( am, Sharon ) / C( am ) = 1 / 2
P( am | I ) = 2 / 3
P( I | <s> ) = 2 / 3
…
An N-gram model stores the whole table of conditional probabilities:
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
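A minimal count-based sketch of this bigram model in Python (maximum-likelihood estimates, no smoothing; my own toy code, not from the paper):

```python
from collections import Counter

corpus = [
    "<s> I am Sharon </s>",
    "<s> Sharon I am </s>",
    "<s> I love group meeting </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks[:-1])           # words that can start a bigram
    bigrams.update(zip(toks, toks[1:]))  # adjacent pairs

def p(nxt, prev):
    """P(nxt | prev) = C(prev, nxt) / C(prev)."""
    return bigrams[(prev, nxt)] / unigrams[prev]

print(p("Sharon", "am"))  # 1/2
print(p("am", "I"))       # 2/3
print(p("I", "<s>"))      # 2/3
```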
6. RNN Language Model (concept)
[Diagram: one RNN cell maps input xt to output gt; unrolled, the cells consume x1, x2, x3, …, xn and emit g1, g2, g3, …, gn]
Input: "Sharon likes to play" (Sharon's diary), i.e. the context
Expected output: the probability vector [P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
e.g. P("EverWing" | context), P("boyfriends" | context), …
The RNN (lots of weights) fits (approximates) this probability vector.
7. RNN Language Model (practical)
[Diagram: RNN unrolled over x1 … xn, emitting g1 … gn]
1. Context encoding: the RNN encodes the context "Sharon likes to play" into a context vector h (dim: 400).
2. Relation to each word: compare h against each word vector in W = (w1, w2, …, wM) (dim: [10,000, 400]):
hT W = [hT w1, hT w2, … hT wM] (dim: 10,000)
Softmax(hT W) = [P(X1|Ct) P(X2|Ct) … P(XM|Ct)] (dim: 10,000)
Setting: # of words M = 10,000; word vector dim d = 400.
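A sketch of this output layer in Python with the slide's setting (M = 10,000, d = 400); the context vector and word embeddings are random placeholders rather than trained weights:

```python
import numpy as np

M, d = 10_000, 400           # vocabulary size, word-vector dimension
rng = np.random.default_rng(0)

h = rng.normal(size=d)       # context vector from the RNN      (dim: 400)
W = rng.normal(size=(M, d))  # one embedding per word           (dim: [10,000, 400])

logits = W @ h               # h^T w_i for every word i         (dim: 10,000)
probs = np.exp(logits - logits.max())
probs /= probs.sum()         # Softmax -> [P(X1|c), ..., P(XM|c)]
assert np.isclose(probs.sum(), 1.0)
```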
9. Back to the RNN LM
[Diagram: RNN emits gn → context vector h (dim: 400)]
hT W = [hT w1, hT w2, … hT wM], with W = (w1, w2, …, wM) (dim: [10,000, 400])
Softmax(hT W) = [P(X1|Ct) P(X2|Ct) … P(XM|Ct)] (dim: 10,000)
Setting: # of words: 10,000; word vector dim: 400.
Is this capable of modeling natural language?
If NO, the performance of our model is limited: the Softmax bottleneck.
10. Contribution
• Identify the Softmax bottleneck, by matrix factorization (in Section 2; e.g. M = LU) and experiments (in Section 3).
• Propose a new method to address this problem, called Mixture of Softmaxes (MoS).
11. Why Softmax?
logits: [ 1, 12, 7, 11 ]
Softmax → [ 0.0, 0.73, 0.01, 0.26 ]
simple normalization → [ 0.03, 0.38, 0.22, 0.36 ] (values too close)
Softmax is more like `argmax` (usually the desired `argmax` / `category label` output would look like [0, 1, 0, 0]).
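Reproducing the two normalizations in Python (printed values are rounded, so they may differ slightly from the slide's):

```python
import numpy as np

logits = np.array([1.0, 12.0, 7.0, 11.0])

softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax.round(2))  # [0.   0.73 0.   0.27] -- peaked, argmax-like

simple = logits / logits.sum()
print(simple.round(2))   # [0.03 0.39 0.23 0.35] -- values too close
```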
13. Flow of Section 2
• Rank & matrix factorization
• Softmax-based RNN LM & its matrix form
• Identify the Softmax bottleneck
• Propose the hypothesis: natural language is high-rank
• Easy fixes?
• Mixture of Softmaxes (MoS) & Mixture of Contexts (MoC)
14. Rank
• We can think of the rank of a matrix as the number of linearly independent features (dimensions).
• If A is an m x n matrix: rank(A) ≤ min(m, n)
• Example (Gender is categorical, so it is not counted as a numeric column):

Age Height Weight Gender
18  183    90     M
22  158    48     F
28  159    52     F
24  175    73     M

dim = 4x3 → rank = 3

Age Height Weight H-W Gender
18  183    90     93  M
22  158    48     110 F
28  159    52     107 F
24  175    73     102 M

dim = 4x4 → rank = 4? NO! rank = 3, because H-W = Height − Weight is a linear combination of existing columns.
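Checking the example with numpy (I encode only the numeric columns; Gender stays out of the matrix):

```python
import numpy as np

# Columns: Age, Height, Weight (4 x 3)
A3 = np.array([[18, 183, 90],
               [22, 158, 48],
               [28, 159, 52],
               [24, 175, 73]], dtype=float)
print(np.linalg.matrix_rank(A3))  # 3

# Add H-W = Height - Weight as a 4th column (4 x 4)
A4 = np.column_stack([A3, A3[:, 1] - A3[:, 2]])
print(np.linalg.matrix_rank(A4))  # still 3: H-W is a linear combination
```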
15. Matrix Factorization
WH = V, where W is 4 x 2 and H is 2 x 6, so V is 4 x 6.
Property: rank(V) = rank(WH) ≤ min(rank(W), rank(H)) ≤ 2
(recall: if A is an m x n matrix, rank(A) ≤ min(m, n))
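The same bound checked numerically, assuming random factors of the slide's sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))      # 4 x 2
H = rng.normal(size=(2, 6))      # 2 x 6
V = W @ H                        # 4 x 6

# rank(V) <= min(rank(W), rank(H)) <= 2, even though V is 4 x 6.
print(np.linalg.matrix_rank(V))  # 2
```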
16. Softmax-based RNN Language Model
• hc: context vector (dim: 400), produced by θ
• wx: word embedding vector (dim: 400), produced by θ
• θ: all the parameters (weights) in our model
• Pθ: our model, whose logit for word x is hcT wx:
Pθ(x|c) = exp(hcT wx) / Σx′ exp(hcT wx′)
• P*: the true data distribution, i.e. the table
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
17. Matrix Form
(N: # of contexts; M: # of tokens; a star (*) marks the true data distribution)
Our model: Softmax(HθWθT) = the N x M matrix [Pθ(Xj|Ci)], where Hθ stacks the N context vectors and Wθ stacks the M word vectors.
Ground truth: Softmax(A) = the N x M matrix [P*(Xj|Ci)], where A is a matrix of logits of the true distribution.
18. Concept of Derivation
If we want Softmax(HθWθT) = Softmax(A),
then HθWθT must land in F(A), the infinite set that contains all possible logit matrices for the true distribution,
and matrix factorization tells us which ranks HθWθT can reach.
19. (Matrix form, revisited; see slide 17.)
20. F(A): Row-wise Shift Operation
F(A) = { A + Λ J(N,M) | Λ is a diagonal matrix }, where J(N,M) is an all-ones matrix; F(A) is an infinite set.
Shifting every entry of a row by the same constant does not change the Softmax output, so all possible logits (for the true distribution) are in F(A).
Max rank difference = 1: for any A′ ∈ F(A), |rank(A′) − rank(A)| ≤ 1.
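A quick numerical check (my own sketch, not the paper's code) that row-wise shifts leave the Softmax output unchanged, so every A′ = A + ΛJ yields the same distribution:

```python
import numpy as np

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 7))            # N=5 contexts, M=7 tokens
Lam = np.diag(rng.normal(size=5))      # arbitrary diagonal matrix
A_shift = A + Lam @ np.ones((5, 7))    # A' = A + Lambda * J(N,M)

assert np.allclose(softmax(A), softmax(A_shift))  # same probabilities
```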
21. (Concept of derivation, revisited; see slide 18.)
22. Matrix Factorization (applied)
Our model's logits HθWθT must equal some A′ ∈ F(A), a matrix related to the ground truth.
Recall WH = V implies rank(V) = rank(WH) ≤ min(rank(W), rank(H)), so:
rank(A′) = rank(HθWθT) ≤ min(rank(Hθ), rank(Wθ)) ≤ d
rank(A′) is upper bounded by d → the Softmax bottleneck.
23. Softmax Bottleneck
If we want Softmax(HθWθT) = Softmax(A), we need rank(A′) ≤ d for some A′ ∈ F(A) (rank(A′) is upper bounded by d).
But if rank(A) − 1 > d, there must exist a context c such that Pθ(X|c) ≠ P*(X|c).
If d is too small, our model (Softmax) does not have the capacity to express the true data distribution.
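A sketch of the bottleneck itself: no matter what θ is, the model's logit matrix HθWθT has rank at most d (here d = 3 while N = M = 10):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 10, 10, 3
H = rng.normal(size=(N, d))            # context vectors
W = rng.normal(size=(M, d))            # word embeddings

logits = H @ W.T                       # N x M, but rank <= d
print(np.linalg.matrix_rank(logits))   # 3

# A generic (full-rank) logit matrix A would have rank 10,
# which this parameterization cannot reach when d = 3.
```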
24. Hypothesis: Natural Language (A) Is High-Rank
3 reasons:
1. Natural language is highly context-dependent (e.g. the token "north" in international news vs. a geography textbook). They hypothesize this should result in a high-rank matrix A.
2. If A were low-rank, we could create all semantic meanings from a few hundred bases, which seems implausible.
3. (Experiment) The high-rank LM outperforms many low-rank LMs (in Section 3).
If true: a Softmax-based model learns a low-rank approximation, rank( Softmax-based model ) ≤ d << rank(A), so the Softmax bottleneck limits the expressiveness of the model.
25. Easy Fixes?
rank( Softmax-based model ) ≤ d << rank(A)
• Increase the dimension d? Higher rank, but it increases the number of parameters too!
→ Curse of dimensionality → overfitting.
Empirical conclusion: increasing d beyond several hundred is not helpful (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017).
• Mixture of Softmaxes (MoS): high-rank, but does not increase the dimension d.
26. Mixture of Softmaxes (a high-rank Language Model)
Pθ(x|c) = Σk πc,k · exp(hc,kT wx) / Σx′ exp(hc,kT wx′), with Σk πc,k = 1
• π: mixture weight, also called the "prior"
• wπ,k & Wh,k are new model parameters (the prior πc,k comes from wπ,k; the k-th context vector is hc,k = tanh(Wh,k gc))
• K is a hyper-parameter (the number of Softmaxes)
27. Mixture of Softmaxes
[Diagram: the RNN output (gt) feeds two branches]
• Prior branch: prior π = Softmax(wπ · gt) = [π1, π2, …, πK]
• Logit branch, K copies: context vectors hk = tanh(Wh,k · gt); logits [h1w, h2w, …, hKw], each hk · wx → Softmax → K probability vectors
• Weighted average ( × → Σ ): sum the K Softmax outputs weighted by the prior (see the sketch below).
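A minimal numpy sketch of the MoS forward pass described above; the shapes follow the paper's PTB setting (d = 280, K = 15), but the random weights and exact wiring here are illustrative, not the authors' code:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

M, d, K = 10_000, 280, 15          # vocab size, embedding dim, # of Softmaxes
rng = np.random.default_rng(0)

g = rng.normal(size=d)             # RNN output g_t
W_e = rng.normal(size=(M, d))      # shared word embeddings w_x
W_h = rng.normal(size=(K, d, d))   # per-component projections W_{h,k}
W_pi = rng.normal(size=(K, d))     # prior weights w_{pi,k}

pi = softmax(W_pi @ g)             # mixture weights [pi_1, ..., pi_K]
h = np.tanh(W_h @ g)               # K context vectors h_k      (K x d)
component_probs = softmax(h @ W_e.T, axis=1)  # K distributions (K x M)

p = pi @ component_probs           # weighted average -> P(x | c)  (M,)
assert np.isclose(p.sum(), 1.0)
```

Because the final probabilities are a weighted average of K Softmaxes, the log-probability matrix is no longer a plain HθWθT product, so its rank is not capped at d.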
30. Main Results
• Evaluate the MoS model based on perplexity (PP; smaller value is better):
- Penn Treebank (PTB)
- WikiText-2 (WT2)
- 1B Word dataset (a larger dataset)
• MoS is a generic model: also experiment on a Seq2Seq model (PP & BLEU; larger BLEU is better):
- Switchboard dataset (a dialog dataset)
In all cases, MoS is applied on top of the state-of-the-art model.
35. Verify the Role of Rank
Rank comparison on PTB: estimate the rank of A and test different numbers of Softmaxes (K).
The embedding sizes (d) of Softmax, MoC and MoS are 400, 280 and 280, respectively.
The vocabulary size (M) is 10,000 for all models.
36. Inverse Experiment
On a character-level language model (small vocabulary), Softmax does not suffer from a rank limitation (no low-rank problem)
→ MoS does not improve the performance
→ MoS's improvement on word-level LMs comes from its high rank.
41. Conclusions
• Softmax-based language models are limited by the dimension of the word embedding
→ the Softmax bottleneck.
• MoS improves the performance over Softmax while avoiding the overfitting that comes with naively increasing the dimension.
• MoS improves the state-of-the-art results by a large margin
→ it is important to have a high-rank model for natural language.