Softmax Approximations
Sebastian Ruder
Softmax
Softmax-based Approaches
Hierarchical Softmax
Differentiated Softmax
CNN-Softmax
Sampling-based Approaches
Margin-based Hinge Loss
Noise Contrastive Estimation
Negative Sampling
Bibliography
Softmax Approximations for Learning Word
Embeddings and Language Modeling
Sebastian Ruder
@seb_ruder
1st NLP Meet-up
03.08.16
Agenda
1 Softmax
2 Softmax-based Approaches
Hierarchical Softmax
Differentiated Softmax
CNN-Softmax
3 Sampling-based Approaches
Margin-based Hinge Loss
Noise Contrastive Estimation
Negative Sampling
Language modeling objective
Goal: Probabilistic model of language
Maximize the probability of a word wt given its n − 1 previous
words, i.e. p(wt | wt−1, · · · , wt−n+1)
N-gram models:
p(wt | wt−1, · · · , wt−n+1) = count(wt−n+1, · · · , wt−1, wt) / count(wt−n+1, · · · , wt−1)
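As a concrete sketch, the count-based estimate above can be computed directly from a corpus; the toy corpus and the trigram order n = 3 below are assumptions for illustration:

```python
from collections import Counter

# Hypothetical toy corpus; maximum-likelihood trigram model (n = 3).
corpus = "the cat sat on the mat the cat sat on the hat".split()
n = 3
ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
contexts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))

def p(word, *context):
    # p(wt | context) = count(context, wt) / count(context), context oldest-first
    return ngrams[context + (word,)] / contexts[context]

print(p("sat", "the", "cat"))  # "the cat" is always followed by "sat" -> 1.0
```

Such models break down for contexts never seen in training, which motivates the neural models on the next slides.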
Softmax objective for language modeling
Figure: Predicting the next word with the softmax
Softmax objective for language modeling
Neural networks with softmax:
p(w | wt−1, · · · , wt−n+1) = exp(hᵀvw) / Σwi∈V exp(hᵀvwi)
where
h is the "hidden" representation of the input (the previous words), of dimensionality d
vwi is the "output" word embedding of word wi
V is the vocabulary
The inner product hᵀvw computes the model's score ("unnormalized probability") for word w given the input
Output word embeddings are stored in a d × |V | matrix
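A minimal numerical sketch of the full softmax, with assumed toy sizes (d = 3, |V| = 5); note that every prediction touches all |V| columns of the output embedding matrix, which is exactly the cost the later approaches attack:

```python
import numpy as np

# Assumed toy dimensions: hidden size d = 3, vocabulary size |V| = 5.
rng = np.random.default_rng(0)
d, vocab_size = 3, 5
V_out = rng.standard_normal((d, vocab_size))  # d x |V| output embedding matrix
h = rng.standard_normal(d)                    # hidden representation of previous words

scores = h @ V_out                      # inner products h^T v_wi: unnormalized scores
probs = np.exp(scores - scores.max())   # subtract max for numerical stability
probs /= probs.sum()                    # divide by the partition function Z

print(round(probs.sum(), 6))  # 1.0: a valid distribution over the vocabulary
```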
Neural language model
Figure: Neural language model [Bengio et al., 2003]
Softmax use cases
Maximum entropy models use the same probability distribution:
Ph(y | x) = exp(h · f (x, y)) / Σy′∈Y exp(h · f (x, y′))
where
h is a weight vector
f (x, y) is a feature vector
Pervasive use in NNs:
Go-to multi-class classification objective
"Soft" selection, e.g. for attention, memory retrieval, etc.
The denominator is called the partition function:
Z = Σwi∈V exp(hᵀvwi)
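The "soft selection" use can be sketched as attention over a few memory slots; the memory contents and query below are made-up values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()          # e.sum() is the partition function Z

# Hypothetical memory of 4 slots with 2-dimensional contents.
memory = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
query = np.array([1., 0.])
weights = softmax(memory @ query)   # higher score -> larger weight, but never hard 0/1
read = weights @ memory             # weighted average instead of a hard argmax
print(weights.round(3))
```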
Softmax-based vs. sampling-based
Softmax-based approaches keep the softmax layer intact but make it more efficient.
Sampling-based approaches optimize a different loss
function that approximates the softmax.
Hierarchical Softmax
Softmax as a binary tree: evaluate at most log2 |V | nodes
instead of all |V | nodes
Figure: Hierarchical softmax [Morin and Bengio, 2005]
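Under the hierarchical softmax, a word's probability is the product of binary (left/right) decisions at the inner nodes on its root-to-leaf path; the tree, node names, and path below are assumptions for a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed tiny tree: two inner nodes, each with its own vector.
rng = np.random.default_rng(1)
d = 4
node_vectors = {"root": rng.standard_normal(d), "inner": rng.standard_normal(d)}
h = rng.standard_normal(d)

# Hypothetical word "cat": left (+1) at the root, right (-1) at "inner".
path = [("root", +1.0), ("inner", -1.0)]
p_cat = 1.0
for node, sign in path:
    p_cat *= sigmoid(sign * node_vectors[node] @ h)  # one sigmoid per node, not |V| terms

print(0.0 < p_cat < 1.0)  # True
```

Because the sigmoids at each node sum to one over the two branches, the leaf probabilities form a valid distribution without ever computing the full partition function.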
Hierarchical Softmax
Structure is important; fastest (and most commonly used)
variant: Huffman tree (short paths for frequent words)
Figure: Hierarchical softmax [Mnih and Hinton, 2008]
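A Huffman tree over word frequencies can be built with a priority queue; the frequencies below are hypothetical:

```python
import heapq

# Hypothetical word frequencies; frequent words should get short codes,
# i.e. short root-to-leaf paths in the hierarchical softmax.
freqs = {"the": 50, "cat": 10, "sat": 10, "mat": 5, "quokka": 1}
heap = [(f, [w]) for w, f in freqs.items()]
heapq.heapify(heap)
codes = {w: "" for w in freqs}
while len(heap) > 1:
    f1, left = heapq.heappop(heap)   # two least frequent groups
    f2, right = heapq.heappop(heap)
    for w in left:
        codes[w] = "0" + codes[w]    # prepend the branch taken at the new node
    for w in right:
        codes[w] = "1" + codes[w]
    heapq.heappush(heap, (f1 + f2, left + right))

print(len(codes["the"]) <= len(codes["quokka"]))  # True: frequent word, shorter path
```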
Differentiated Softmax
Idea: We have more knowledge (co-occurrences, etc.)
about frequent words, less about rare words
→ words that occur more often allow us to fit more
parameters; extremely rare words only allow fitting a few
→ different embedding sizes to represent each output word
Larger embeddings (more parameters) for frequent words,
smaller embeddings for rare words
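A rough sketch of the idea, with assumed band sizes: frequent words get full-size output embeddings, rare words get small ones that connect only to a slice of the hidden state (a simplification of the partitioning in Chen et al.):

```python
import numpy as np

# Assumed sizes: hidden size d = 8; 3 frequent words with full embeddings,
# 4 rare words with small (d_rare = 2) embeddings.
rng = np.random.default_rng(2)
d, n_freq, n_rare, d_rare = 8, 3, 4, 2
W_freq = rng.standard_normal((d, n_freq))        # full-size output embeddings
W_rare = rng.standard_normal((d_rare, n_rare))   # small output embeddings
h = rng.standard_normal(d)

# Rare band only sees a d_rare-dimensional slice of h; one softmax over both bands.
scores = np.concatenate([h @ W_freq, h[:d_rare] @ W_rare])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)  # (7,)
```

The saving comes from the matrix sizes: the rare band stores d_rare parameters per word instead of d.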
Differentiated Softmax
Figure: Differentiated softmax [Chen et al., 2015]
CNN-Softmax
Idea: Instead of learning all output word embeddings
separately, learn function to produce them
Figure: CNN-Softmax [Jozefowicz et al., 2016]
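A toy sketch of the idea: a character-level CNN (all shapes and character ids below are assumptions) maps a word's spelling to its output embedding, so no per-word output matrix needs to be stored:

```python
import numpy as np

# Assumed toy sizes: 30 characters, char embedding dim 5, filter width 3, output dim 8.
rng = np.random.default_rng(3)
n_chars, char_dim, width, d = 30, 5, 3, 8
char_emb = rng.standard_normal((n_chars, char_dim))
filters = rng.standard_normal((d, width * char_dim))   # d convolution filters

def word_embedding(char_ids):
    chars = char_emb[char_ids]                          # (len, char_dim)
    windows = [chars[i:i + width].ravel()               # sliding windows of `width` chars
               for i in range(len(char_ids) - width + 1)]
    conv = np.stack(windows) @ filters.T                # (n_windows, d)
    return conv.max(axis=0)                             # max-pool over positions -> (d,)

v_cat = word_embedding([2, 0, 19])                      # hypothetical ids for 'c','a','t'
print(v_cat.shape)  # (8,)
```

Words with similar spellings get related embeddings for free, and unseen words still receive one.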
Sampling-based approaches
Sampling-based approaches optimize a different loss
function that approximates the softmax.
Margin-based Hinge Loss
Idea: Why do multi-class classification at all? Only one
correct word, many incorrect ones. [Collobert et al., 2011]
Train the model to produce higher scores for correct word
windows than for incorrect ones, i.e. minimize
Σx∈X Σw∈V max{0, 1 − f (x) + f (x(w))}
where
x is a correct window
x(w) is a "corrupted" window (target word replaced by a random word)
f (x) is the score output by the model
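One inner term of the loss can be sketched directly; the scores below are hypothetical model outputs:

```python
def hinge_loss(score_correct, scores_corrupted):
    # sum over corrupted windows of max{0, 1 - f(x) + f(x_corrupted)}
    return sum(max(0.0, 1.0 - score_correct + s) for s in scores_corrupted)

# Hypothetical scores: only the 2.2 corruption violates the margin of 1.
print(round(hinge_loss(2.5, [0.0, 1.0, 2.2]), 3))  # 0.7
```

Corruptions already beaten by at least the margin contribute zero gradient, so training focuses on the hard negatives.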
Noise Contrastive Estimation
Idea: Train model to differentiate target word from noise
Figure: Noise Contrastive Estimation (NCE) [Mnih and Teh, 2012]
Noise Contrastive Estimation
Language modeling reduces to binary classification
Draw k noise samples from a noise distribution (e.g.
unigram) for every word; correct words given their context
are true (y = 1), noise samples are false (y = 0)
Minimize cross-entropy with logistic regression loss
Approximates softmax as number of noise samples k
increases
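A sketch of the per-example NCE loss, using the formulation of Mnih and Teh in which the classifier sees the model score minus log(k · q(w)); all scores and noise probabilities below are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy values: k = 2 noise samples for one (context, word) pair.
k = 2
s_target = 1.8                        # model score for the true word
s_noise = np.array([0.3, -0.5])       # model scores for the noise samples
log_q_target = np.log(0.01)           # noise-distribution log-probabilities
log_q_noise = np.log(np.array([0.05, 0.02]))

# P(y = 1 | w, c) = sigma(s - log(k * q(w))): true word vs. k noise samples
p_true = sigmoid(s_target - np.log(k) - log_q_target)
p_noise = sigmoid(s_noise - np.log(k) - log_q_noise)
loss = -(np.log(p_true) + np.log(1.0 - p_noise).sum())
print(loss > 0.0)  # True
```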
Negative Sampling
Simplification of NCE [Mikolov et al., 2013]
No longer approximates softmax as goal is to learn
high-quality word embeddings (rather than language
modeling)
Makes NCE more efficient by making the most expensive
term constant
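The resulting per-example objective, with toy vectors (all names below are assumptions): maximize σ(vwᵀh) for the true word and σ(−vwiᵀh) for each of the k noise words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: hidden size d = 8, k = 5 sampled noise words.
rng = np.random.default_rng(4)
d, k = 8, 5
h = rng.standard_normal(d)                 # hidden representation of the context
v_target = rng.standard_normal(d)          # output embedding of the true word
v_negatives = rng.standard_normal((k, d))  # output embeddings of k noise words

# J = -log sigma(v_w . h) - sum_i log sigma(-v_wi . h)
loss = -np.log(sigmoid(v_target @ h)) - np.log(sigmoid(-(v_negatives @ h))).sum()
print(loss > 0.0)  # True
```

Compared with NCE, the log(k · q(w)) correction is dropped, which is why the result no longer approximates the softmax distribution.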
Thank you for your attention!
The content of most of these slides is also available as blog
posts at sebastianruder.com.
For more information: sebastian@aylien.com
Bibliography I
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P.,
and Janvin, C. (2003).
A Neural Probabilistic Language Model.
The Journal of Machine Learning Research, 3:1137–1155.
[Chen et al., 2015] Chen, W., Grangier, D., and Auli, M.
(2015).
Strategies for Training Large Vocabulary Neural Language
Models.
[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L.,
Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural Language Processing (almost) from Scratch.
Journal of Machine Learning Research, 12(Aug):2493–2537.
Bibliography II
[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster,
M., Shazeer, N., and Wu, Y. (2016).
Exploring the Limits of Language Modeling.
[Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K.,
Corrado, G., and Dean, J. (2013).
Distributed Representations of Words and Phrases and their
Compositionality.
NIPS, pages 1–9.
[Mnih and Hinton, 2008] Mnih, A. and Hinton, G. E. (2008).
A Scalable Hierarchical Distributed Language Model.
Advances in Neural Information Processing Systems, pages
1–8.
Bibliography III
[Mnih and Teh, 2012] Mnih, A. and Teh, Y. W. (2012).
A Fast and Simple Algorithm for Training Neural
Probabilistic Language Models.
Proceedings of the 29th International Conference on
Machine Learning (ICML’12), pages 1751–1758.
[Morin and Bengio, 2005] Morin, F. and Bengio, Y. (2005).
Hierarchical Probabilistic Neural Network Language Model.
AISTATS, 5.
