BREAKING THE SOFTMAX BOTTLENECK:
A HIGH-RANK RNN LANGUAGE MODEL
Zhilin Yang∗, Zihang Dai∗, Ruslan Salakhutdinov, William W. Cohen
School of Computer Science
Carnegie Mellon University
{zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu
Published as a conference paper at ICLR 2018
∗Equal contribution. Ordering determined by dice rolling.
Presented by Ray
Outline
• Introduction

• Language Modeling as Matrix Factorization

• Experiments

• Related work

• Conclusions
!2
Introduction
Section 1
• A corpus of tokens:

• Language model:
Language Model
!4
“Sharon likes to play _______”
P(“EverWing” | “Sharon likes to play”) = 0.01
P(“boyfriends”| “Sharon likes to play”) = 0.666
P(“Instagram” | “Sharon likes to play”) = 0.1018
…
context next token
corpus tokens
LM
context → next token
← Sharon’s diary
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
Language Model
!5
<s> I am Sharon </s>
<s> Sharon I am </s>
<s> I love group meeting </s>
2-grams example:
{ (<s>, I), (I, am), (am, Sharon) …, (meeting, </s>) }

P( Sharon | am ) = C( am, Sharon ) / C( am ) = 1 / 2

P( am | I ) = 2 / 3

P( I | am ) = C( am, I ) / C( am ) = 0 / 2 = 0 (the bigram “am I” never occurs)

….
window size
N-gram Language Model
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
store
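To make the bigram arithmetic above concrete, here is a minimal Python sketch; the corpus and counts are the ones from this slide, and the helper names are purely illustrative:

```python
from collections import Counter

# Toy corpus from the slide, with sentence-boundary markers.
corpus = [
    "<s> I am Sharon </s>",
    "<s> Sharon I am </s>",
    "<s> I love group meeting </s>",
]

sentences = [s.split() for s in corpus]
unigram = Counter(w for sent in sentences for w in sent)
bigram = Counter((w1, w2) for sent in sentences for w1, w2 in zip(sent, sent[1:]))

def p(next_word, prev_word):
    """Maximum-likelihood bigram probability P(next_word | prev_word)."""
    return bigram[(prev_word, next_word)] / unigram[prev_word]

print(p("Sharon", "am"))  # C(am, Sharon) / C(am) = 1/2
print(p("am", "I"))       # C(I, am) / C(I)       = 2/3
```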
RNN Language Model
!6
(concept)
RNN
xt
gt
P(“EverWing” | context)
P(“boyfriends”| context)
…
Expected Output:
(lots of weights)
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
=
(unroll)
Input: Sharon likes to play
(Sharon’s diary)
context
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
probability vector
fit
(approximate)
[vector]
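A minimal sketch of the unrolled recurrence described above, assuming a plain Elman-style RNN cell; the weight names and random inputs are illustrative, not the paper's:

```python
import numpy as np

d = 400                                  # hidden / word-vector dimension used in the deck
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d, d)) * 0.01    # input-to-hidden weights (illustrative)
W_rec = rng.normal(size=(d, d)) * 0.01   # hidden-to-hidden weights (illustrative)

def rnn_step(x_t, g_prev):
    """One unrolled step: combine the current input with the previous state."""
    return np.tanh(W_in @ x_t + W_rec @ g_prev)

# "Sharon likes to play" as (already-embedded) input vectors x1..xn.
inputs = [rng.normal(size=d) for _ in range(4)]
g = np.zeros(d)
for x_t in inputs:
    g = rnn_step(x_t, g)                 # g_n summarizes the context seen so far
```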
RNN Language Model
!7
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
(practical)
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
Softmax(hᵀw)
context: Sharon likes to play → [context vector] (h)
hᵀw = [hᵀw1, hᵀw2, …, hᵀwM]
w = (w1, w2, …, wM): the word vectors for tokens X1, X2, …, XM
1. encode the context
2. relate it to each word
setting

# of words: 10,000 

word vector dim: 400
(h: dim 400; each word vector wi: dim 400; w: dim [10,000, 400]; hᵀw and the output probabilities: dim 10,000)
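A minimal sketch of this output layer with the deck's setting (M = 10,000 words, d = 400); the embedding matrix here is random, purely to show the shapes:

```python
import numpy as np

M, d = 10_000, 400
rng = np.random.default_rng(0)

h = rng.normal(size=d)              # context vector, dim 400
W = rng.normal(size=(M, d))         # word-embedding matrix, dim [10,000, 400]

logits = W @ h                      # h^T w_i for every word, dim 10,000
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # Softmax(h^T w), dim 10,000
assert probs.shape == (M,) and abs(probs.sum() - 1.0) < 1e-6
```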
Softmax function
!8
Ct = “Sharon likes to play” ⇒ context vector (h)
X1 = “EverWing” ⇒ w1
X2 = “boyfriends” ⇒ w2
X3 = “Instagram” ⇒ w3
X4 = “MaLuLian” ⇒ w4
Each logit hᵀwi is the dot product of h with the word vector wi:
[hᵀw1, hᵀw2, hᵀw3, hᵀw4] = [ 1, 12, 7, 11 ]
sum = e^1 + e^12 + e^7 + e^11
Softmax(hᵀw) = [ e^1/sum, e^12/sum, e^7/sum, e^11/sum ]
Goal: [P(X1|Ct), P(X2|Ct), P(X3|Ct), P(X4|Ct)] = [ 0.0, 0.73, 0.0, 0.27 ]
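Reproducing the numbers on this slide, a small sketch of the Softmax computation:

```python
import numpy as np

logits = np.array([1.0, 12.0, 7.0, 11.0])      # [h^T w1, h^T w2, h^T w3, h^T w4]
probs = np.exp(logits) / np.exp(logits).sum()  # e^z_i / (e^1 + e^12 + e^7 + e^11)
print(probs.round(2))                          # [0.   0.73 0.   0.27]
```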
back to RNN LM
!9
RNN
gn
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
Softmax(hᵀw)
[context vector] (h)
hᵀw = [hᵀw1, hᵀw2, …, hᵀwM]
(h: dim 400; w = (w1, w2, …, wM): dim [10,000, 400]; hᵀw and the probabilities: dim 10,000)
Is it capable of modeling natural language?
Softmax bottleneck
If NO!
the performance of our model is limited
setting

# of words: 10,000 

word vector dim: 400
Contribution
!10
• Identify this Softmax bottleneck.

• Propose a new method to address this problem 

called Mixture of Softmaxes (MoS)
by matrix factorization (in Section 2) and experiments (in Section 3)
(M = LU)
why Softmax
!11
logits: [ 1, 12, 7, 11 ]
Softmax() → [ 0.00, 0.73, 0.00, 0.27 ]
Simple normalization → [ 0.03, 0.39, 0.23, 0.35 ]: the values are too close together
(the usual target, an `argmax` / one-hot `category label`, would look like: [0, 1, 0, 0])
Softmax is much more like `argmax`!
Language Modeling
as
Matrix Factorization
Section 2
• Rank & matrix factorization

• Softmax-based RNN LM & matrix form

• Identify the Softmax bottleneck

• Propose hypothesis: Natural Language is high-rank

• Easy fixes?

• Mixture of Softmaxes (MoS) & Mixture of Contexts (MoC)
!13
Flow of Section 2
• We can think of the rank as the number of linearly independent features (dimensions)

• If A is an m x n matrix: rank(A) ≤ min(m, n)

• example:
!14
Rank
Age Height Weight Gender
18 183 90 M
22 158 48 F
28 159 52 F
24 175 73 M
numeric part (viewed as a matrix): dim = 4 x 3 → rank = 3

Age Height Weight H-W Gender
18 183 90 93 M
22 158 48 110 F
28 159 52 107 F
24 175 73 102 M
numeric part (viewed as a matrix): dim = 4 x 4 → rank = 4?
NO! The H-W column is just Height − Weight (linearly dependent) → rank = 3
!15
Matrix Factorization
WH = V
(4 x 2)(2 x 6) = (4 x 6)
Property: rank(V) = rank(WH) ≤ min(rank(W), rank(H))
(recall: if A is an m x n matrix, rank(A) ≤ min(m, n))
Here rank(V) ≤ min(rank(W), rank(H)) ≤ 2
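A quick numerical check of the factorization property above, using the same shapes (W: 4 x 2, H: 2 x 6, V = WH: 4 x 6); the matrices are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
H = rng.normal(size=(2, 6))
V = W @ H                         # 4 x 6 product matrix

print(np.linalg.matrix_rank(V))   # 2: rank(V) <= min(rank(W), rank(H)) = 2
```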
• hc: context vector

• wx: word embedding vector

• θ: all the parameters (weights) in our model

• Pθ: our model

• P*: true data distribution
!16
Sb RNN LM
Pθ(x | c) = exp(hcᵀ wx) / Σ_{x′} exp(hcᵀ wx′)
(hc, wx: dim 400, produced by the parameters θ; hcᵀ wx is the “logit”)
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
(Softmax-based RNN Language Model)
!17
Matrix form
(N: # of contexts, M: # of tokens; star (*) means true data distribution)
Hθ: all context vectors stacked into an N x d matrix (our model)
Wθ: all word embeddings stacked into a d x M matrix (our model)
A: the logit matrix of the true data distribution (ground truth)
Softmax(HθWθ) = [N x M matrix of model probabilities Pθ(X | C)]
Softmax(A) = [N x M matrix of true probabilities P*(X | C)]
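To make the matrix form concrete, a toy numpy sketch (random values and small sizes chosen only for illustration):

```python
import numpy as np

N, M, d = 5, 8, 3                        # toy sizes: 5 contexts, 8 tokens, dim 3
rng = np.random.default_rng(0)
H_theta = rng.normal(size=(N, d))        # one context vector per row
W_theta = rng.normal(size=(d, M))        # one word embedding per column

logits = H_theta @ W_theta               # N x M logit matrix
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)    # row-wise Softmax: row i is P(X | C_i)

print(np.linalg.matrix_rank(logits))     # rank(HθWθ) is at most d = 3
```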
!18
Concept of derivation
If we want our model to match the ground truth, Pθ(X | C) = P*(X | C), i.e. Softmax(HθWθ) = Softmax(A),
then HθWθ must lie in F(A), the infinite set that contains all logit matrices producing the true distribution.
Matrix factorization then tells us how large rank(HθWθ) can be.
!19
Matrix form (recap)
Softmax(HθWθ) = [N x M matrix of model probabilities Pθ(X | C)]
Softmax(A) = [N x M matrix of true probabilities P*(X | C)]
(N: # of contexts, M: # of tokens; star (*) means true data distribution)
!20
F(A): the set obtained from A by row-wise shifts, F(A) = { A + Λ J_N,M | Λ is a diagonal matrix }
J_N,M is the all-ones matrix, so F(A) is an infinite set.
All logit matrices whose Softmax equals the true distribution are in F(A).
Any two matrices in F(A) differ in rank by at most 1.
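A quick numerical illustration of the row-wise shift property: adding a per-row constant (A + ΛJ) leaves the row-wise Softmax unchanged, and the ranks of members of F(A) stay within 1 of each other. The matrix below is a random toy example, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))                       # a toy logit matrix
Lambda = np.diag(rng.normal(size=5))              # one arbitrary shift per row
J = np.ones((5, 8))                               # all-ones matrix J_{N,M}

def row_softmax(X):
    E = np.exp(X - X.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

A_shift = A + Lambda @ J                          # a member of F(A)
assert np.allclose(row_softmax(A), row_softmax(A_shift))   # identical distributions
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A_shift))  # differ by at most 1
```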
!21
Concept of derivation (recap)
If we want Softmax(HθWθ) = Softmax(A), then HθWθ ∈ F(A), and matrix factorization bounds its rank.
!22
Matrix Factorization
HθWθ: our model    A′: a matrix in F(A), related to the ground truth
If HθWθ = A′ for some A′ ∈ F(A), then
rank(A′) = rank(HθWθ) ≤ min(rank(Hθ), rank(Wθ)) ≤ d
rank(A′) ≤ d
(is upper bounded by d)
Softmax bottleneck
rank(V) = rank(WH) ≤ min(rank(W), rank(H))
WH = V
!23
Softmax bottleneck
If d is too small, our model (Softmax) does not have the capacity 

to express the true data distribution.
If rank(A) − 1 > d,
there must exist a context c such that Pθ(X | c) ≠ P*(X | c)
(recall: if we want Softmax(HθWθ) = Softmax(A), then rank(A′) = rank(HθWθ) is upper bounded by d)
Hypothesis: NL (A) is High-Rank
!24
3 reasons:

1. Natural language is highly context-dependent
   (e.g., the token “north” in international news vs. a geography textbook).
   They hypothesize this should result in a high-rank matrix A.

2. If A were low-rank, we could create all semantic meanings from only a few hundred bases — implausible.

3. (Experiment) The high-rank LM outperforms many low-rank LMs. (in Section 3)

If true: a Softmax-based model can only learn a low-rank model, rank( Softmax-based model ) ≤ d << rank(A),
so the Softmax bottleneck limits the expressiveness of the model.
!25
Easy Fixes?
(recall: rank( Softmax-based model ) ≤ d << rank(A))
• Increasing the dimension d? → higher rank
But! It also increases the number of parameters → curse of dimensionality → overfitting.
Empirical conclusion: increasing d beyond several hundred is not helpful
(Merity et al., 2017; Melis et al., 2017; Krause et al., 2017)
→ Mixture of Softmaxes (MoS): high-rank, but does not increase the dimension d
!26
Mixture of Softmaxes
(high-rank Language Model)
Pθ(x | c) = Σ_{k=1..K} π_{c,k} · exp(h_{c,k}ᵀ wx) / Σ_{x′} exp(h_{c,k}ᵀ wx′),
with h_{c,k} = tanh(W_{h,k} gc) and π_{c,k} = exp(w_{π,k}ᵀ gc) / Σ_{k′} exp(w_{π,k′}ᵀ gc)

π: mixture weight, also called the “prior”

w_{π,k} & W_{h,k} are new model parameters

K (the number of Softmax components) is a hyper-parameter
!27
Mixture of Softmaxes
From the RNN output gt (the context representation):
1. K priors: [π1, π2, …, πK] = Softmax(w_π · gt)
2. K context vectors: h_k = tanh(W_{h,k} · gt)
3. K logit vectors [h1·w, h2·w, …, hK·w]: h_k · wx for every word x → K separate Softmaxes
4. Weighted average ( × → ∑ ) of the K Softmax distributions using the priors
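A minimal numpy sketch of the MoS forward pass described on these slides; d = 280 and M = 10,000 follow the deck's rank experiment, while K, the size of gt, the weight names, and the random values are only illustrative (a real implementation would be a trained PyTorch module):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

M, d, g_dim, K = 10_000, 280, 960, 5          # g_dim and K are assumptions, not the paper's
rng = np.random.default_rng(0)

W_emb = rng.normal(size=(M, d)) * 0.01         # shared word embeddings w_x
W_h   = rng.normal(size=(K, d, g_dim)) * 0.01  # per-component projections W_{h,k}
W_pi  = rng.normal(size=(K, g_dim)) * 0.01     # prior weights w_{pi,k}

def mos_probs(g):
    """Mixture of Softmaxes: weighted average of K full Softmax distributions."""
    pi = softmax(W_pi @ g)                               # mixture weights, sum to 1
    h = np.tanh(W_h @ g)                                 # K context vectors h_{c,k}, shape (K, d)
    per_component = np.stack([softmax(W_emb @ h_k) for h_k in h])   # (K, M)
    return pi @ per_component                            # P(x | c), shape (M,)

# MoC (the low-rank baseline) would instead mix the contexts first:
#   h_mix = pi @ h; probs = softmax(W_emb @ h_mix)  -> a single Softmax, still rank <= d.

g = rng.normal(size=g_dim)                               # RNN output g_t
p = mos_probs(g)
assert abs(p.sum() - 1.0) < 1e-6
```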
!28
Mixture of Contexts (MoC)
(low-rank baseline)
Mix the K context vectors first (h = Σ_k π_k h_k), then apply a single Softmax —
very similar to the plain Softmax model, so it stays low-rank.
Experiments
Section 3
• Evaluate the MoS model based on perplexity (PP)

- Penn Treebank (PTB)

- WikiText-2 (WT2)

- 1B Word dataset (a larger dataset)

• MoS is a generic method: also experiment on a Seq2Seq model (PP & BLEU)

- Switchboard dataset (a dialog dataset)
!30
Main Results
Adding MoS on top of state-of-the-art models
(perplexity: smaller value is better; BLEU: larger value is better)
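For reference, perplexity is the exponential of the average per-token negative log-likelihood; a tiny sketch with made-up probabilities:

```python
import numpy as np

# token_probs: the model's probability for each actual next token in a test set
token_probs = np.array([0.1, 0.05, 0.3, 0.2])        # illustrative values only
perplexity = np.exp(-np.mean(np.log(token_probs)))    # smaller is better
print(perplexity)
```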
!31
Main results
on Penn Treebank (PTB)
(PP: smaller value is better)
!32
Main results
on WikiText-2 (WT2)
on 1B Word dataset
!33
Main results
on the Seq2Seq model
Switchboard dataset (a dialog dataset)
(BLEU: larger value is better)
!34
Ablation Study
compare MoS & MoC
MoC does not perform better
!35
Verify the Role of rank
Rank comparison on PTB.
Test different numbers of Softmaxes (K)
→ estimate the rank of A
The embedding sizes (d) of
Softmax, MoC and MoS are 400, 280, 280 respectively.
The vocabulary size (M) is 10,000 for all models.
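One way such a rank comparison can be done (a sketch under my own assumptions, not necessarily the paper's exact procedure): collect the model's log-probability vectors for many test contexts and compute the numerical rank of the stacked matrix.

```python
import numpy as np

# log_prob_matrix: one row per test context, one column per vocabulary word,
# filled with the model's log P(x | c).  Random values stand in for real model outputs.
rng = np.random.default_rng(0)
log_prob_matrix = rng.normal(size=(1000, 10_000))

estimated_rank = np.linalg.matrix_rank(log_prob_matrix, tol=1e-3)
print(estimated_rank)   # compare this number across Softmax, MoC, and MoS models
```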
!36
Inverse experiment
on a character-level language model (small vocabulary → Softmax does not suffer from a rank limitation)
→ MoS does not improve the performance there
→ so MoS's gains on word-level LM come from fixing the low-rank problem, i.e., from its high rank
!37
MoS computational time
relation between K & training time
→ the relation is sub-linear

the overhead also depends on the size of the dataset
!38
Qualitative analysis
to see how MoS improves the next-token prediction in detail
!39
Qualitative analysis
to see how MoS improves the next-token prediction in detail
Conclusions
Section 5
• Softmax-based language models 

are limited by the dimension of the word embedding

→ Softmax bottleneck

• MoS

improves the performance over Softmax

avoids overfitting and avoids naively increasing the dimension

• MoS improves the state-of-the-art results by a large margin

→ It is important to have a high-rank model for natural language.
!41
Conclusions
Thanks ✄
