4. Neural Word Embedding
● Continuous vector space representation
o Words represented as dense real-valued vectors in R^d
● Distributed word representation ↔ Word Embedding
o Embed an entire vocabulary into a relatively low-dimensional linear
space where dimensions are latent continuous features.
● The classical n-gram model works in terms of discrete units
o Discrete units carry no inherent notion of relationship between words.
● In contrast, word embeddings capture regularities and relationships
between words.
5. Syntactic & Semantic Relationship
Regularities are observed as a constant offset vector between pairs of
words sharing some relationship.
Gender relation:
KING - QUEEN ~ MAN - WOMAN
Singular/plural relation:
KING - KINGS ~ QUEEN - QUEENS
Other relations:
● Language: France - French ~ Spain - Spanish
● Past tense: Go - Went ~ Capture - Captured
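These offset regularities can be verified directly with any pretrained word-vector model. A minimal sketch using gensim; the downloader model name is an assumption (any KeyedVectors-style model exposing most_similar() works the same way):

```python
import gensim.downloader as api

# Load a small pretrained model (assumed name; downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Gender relation: KING - MAN + WOMAN should land near QUEEN
# if the offset vector is (approximately) constant.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Language relation: FRENCH - FRANCE + SPAIN should land near SPANISH.
print(vectors.most_similar(positive=["french", "spain"], negative=["france"], topn=3))
```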
8. Language Model (LM)
● Different models for estimating continuous representations of words:
○ Latent Semantic Analysis (LSA)
○ Latent Dirichlet Allocation (LDA)
○ Neural Network Language Model (NNLM)
9. Feed Forward NNLM
● Consists of input, projection, hidden and output layers.
● The N previous words are encoded using 1-of-V coding, where V is the size of the
vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26 (see the
sketch after this list)
● The NNLM becomes computationally complex between the projection (P) and
hidden (H) layers
○ For N = 10, size of P = 500-2000, size of H = 500-1000
○ The hidden layer is used to compute a probability distribution over all the
words in the vocabulary V
● Hierarchical softmax comes to the rescue.
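A minimal NumPy sketch of the 1-of-V coding above, using the hypothetical 26-word A..Z vocabulary from the example; it also shows why the projection step is implemented as a table lookup in practice:

```python
import numpy as np

V, D = 26, 8                      # toy vocabulary size and embedding dimensionality
rng = np.random.default_rng(0)
P = rng.normal(size=(V, D))       # projection matrix: one D-dimensional row per word

def one_hot(index, size):
    """1-of-V coding: all zeros except a single 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

a = one_hot(0, V)                 # word "A" -> (1, 0, ..., 0) in R^26
# Multiplying a one-hot vector by P simply selects one row, so the projection
# layer is a lookup, not a real matrix product.
assert np.allclose(a @ P, P[0])
```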
10. Recurrent NNLM
● No projection layer; consists of input, hidden and output layers only.
● No need to specify the context length as in the feed-forward NNLM.
● What is special in the RNN model?
○ A recurrent matrix that connects the hidden layer to itself.
○ Allows the network to form a short-term memory.
■ Information from the past is represented by the hidden layer state.
● RNN-based vectors achieved state-of-the-art results on a relational
similarity identification task.
(Figure: RNN model)
11. Recurrent NNLM
w(t): input word at time t
y(t): output layer, producing a probability distribution over words
s(t): hidden layer (state)
U: each column represents a word
(Figure: four-gram neural net language model architecture, Bengio 2001)
● The RNN is trained with SGD and backpropagation to maximize the
log-likelihood (the recurrence is sketched below).
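A minimal NumPy sketch of one step of this recurrence, s(t) = f(U·w(t) + W·s(t-1)) and y(t) = g(V·s(t)) with f = sigmoid and g = softmax as in Mikolov's RNNLM; all sizes and the random initialization are toy assumptions:

```python
import numpy as np

V_size, H = 1000, 50                          # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, size=(H, V_size))      # input -> hidden (each column is a word)
W = rng.normal(0, 0.1, size=(H, H))           # recurrent matrix: hidden -> hidden
V_out = rng.normal(0, 0.1, size=(V_size, H))  # hidden -> output

def step(word_index, s_prev):
    # s(t) = sigmoid(U w(t) + W s(t-1)); w(t) is one-hot, so U w(t) is a column of U.
    s = 1.0 / (1.0 + np.exp(-(U[:, word_index] + W @ s_prev)))
    # y(t) = softmax(V s(t)): probability distribution over the next word.
    z = V_out @ s
    y = np.exp(z - z.max())
    return s, y / y.sum()

s = np.zeros(H)                               # the hidden state is the short-term memory
for w in [3, 17, 42]:                         # arbitrary word indices
    s, y = step(w, s)
print(y.argmax())                             # most probable next word under the toy model
```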
12. Bringing efficiency
● The computational complexity of NNLMs is high.
● We can remove the hidden layer and gain up to a 1000x speed-up:
○ Continuous bag-of-words model
○ Continuous skip-gram model
● The full softmax can be replaced by:
○ Hierarchical softmax (Morin and Bengio)
○ Hinge loss (Collobert and Weston)
○ Noise contrastive estimation (Mnih et al.)
13. Continuous Bag-of-Words Model (CBOW)
● The non-linear hidden layer is removed.
● The projection layer is shared for all words (not just the projection matrix).
● All words get projected into the same position (their vectors are averaged).
● Naming reason: the order of words in the history does not influence the
projection, as in a classical bag-of-words model.
● Best performance is obtained by a log-linear classifier with four future and
four history words at the input.
Predicts the current word based on the context (see the sketch below).
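A minimal NumPy sketch of the CBOW forward pass described above (toy sizes, random weights); note that the full softmax at the end is exactly the per-word cost that hierarchical softmax removes:

```python
import numpy as np

V_size, D = 1000, 50
rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, size=(V_size, D))    # shared projection (input) matrix
O = rng.normal(0, 0.1, size=(V_size, D))    # output word matrix

def cbow_predict(context_indices):
    # All context words project to the same position: their vectors are
    # averaged, so word order in the history is lost ("bag of words").
    h = P[context_indices].mean(axis=0)
    z = O @ h                                # score every word in the vocabulary
    y = np.exp(z - z.max())
    return y / y.sum()                       # full softmax over V words

# Four history and four future words around the (unknown) current word:
probs = cbow_predict([10, 11, 12, 13, 15, 16, 17, 18])
print(probs.argmax())                        # the model's guess for the current word
```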
14. Continuous Skip-gram Model
● Objective: maximize the classification of a word based on another word in
the same sentence, i.e. maximize the average log probability over the
training corpus.
● Defines p(w_{t+j} | w_t) using the softmax function (both formulas are
given below).
Predicts surrounding words given the current word.
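The two formulas referred to above, as given in Mikolov et al. (2013):

```latex
% Average log probability maximized by skip-gram over a training sequence
% w_1, ..., w_T, where c is the size of the training context:
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    \log p(w_{t+j} \mid w_t)

% Basic skip-gram softmax; v_w and v'_w are the "input" and "output" vector
% representations of w, and W is the number of words in the vocabulary:
p(w_O \mid w_I) =
    \frac{\exp\big({v'_{w_O}}^{\top} v_{w_I}\big)}
         {\sum_{w=1}^{W} \exp\big({v'_w}^{\top} v_{w_I}\big)}
```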
16. Hierarchical Softmax for efficient computation
● The basic skip-gram formulation is impractical because the cost of computing
∇ log p(w_O | w_I) is proportional to W, which is often large (10^5-10^7 terms).
● With hierarchical softmax, the cost drops to roughly log2(W) evaluations per word.
17. Hierarchical Softmax
● Uses a binary tree (Huffman coding) representation of the output layer, with the W
words as its leaves.
o Defines a random walk that assigns probabilities to words.
● Instead of evaluating W output nodes, only about log2(W) nodes are evaluated to
obtain the probability distribution.
● Each word w can be reached by an appropriate path from the root of the tree:
● n(w, j): the j-th node on the path from the root to w
● L(w): the length of this path
● n(w, 1) = root and n(w, L(w)) = w
● ch(n): an arbitrary fixed child of an inner node n
● [x] = 1 if x is true and [x] = -1 otherwise
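With these definitions, hierarchical softmax defines p(w | w_I) as a product of sigmoids along the path from the root to w (the formula from the paper):

```latex
% [.] is the +/-1 indicator defined above; \sigma(x) = 1 / (1 + e^{-x}).
p(w \mid w_I) = \prod_{j=1}^{L(w)-1}
    \sigma\Big( \big[\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,\big]
                \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big)
```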
18. Negative Sampling
● Noise Contrastive Estimation (NCE)
o A good model should be able to differentiate data from noise by means of
logistic regression.
o An alternative to the hierarchical softmax.
o Introduced by Gutmann and Hyvärinen and applied to language modeling by
Mnih and Teh.
● NCE approximately maximizes the log probability of the softmax.
● Negative sampling (NEG) is defined by an objective that replaces log p(w_O | w_I)
in the skip-gram objective (given below).
● Task: distinguish the target word w_O from k draws from the noise distribution
P_n(w) by means of logistic regression.
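The negative-sampling objective that replaces log p(w_O | w_I), as given in the paper (in the result tables, Neg-k denotes negative sampling with k negative samples):

```latex
% Distinguish w_O from k samples drawn from the noise distribution P_n(w):
\log \sigma\big({v'_{w_O}}^{\top} v_{w_I}\big)
    + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
      \Big[ \log \sigma\big(-{v'_{w_i}}^{\top} v_{w_I}\big) \Big]
```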
19. Subsampling of Frequent words
● The most frequent words provide less information than rare words.
o The co-occurrence of "France" and "Paris" is informative.
o The co-occurrence of "France" and "the" is far less so.
● A simple subsampling approach counters this imbalance:
o Each word w_i in the training set is discarded with probability
P(w_i) = 1 - sqrt(t / f(w_i)),
where f(w_i) is the frequency of word w_i and t is a chosen threshold,
typically around 10^-5.
● This aggressively subsamples words whose frequency is greater than
t while preserving the ranking of the frequencies (see the sketch below).
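A minimal sketch of this subsampling rule on a toy corpus (production implementations such as the original word2vec code use slight variants of the formula):

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Discard each occurrence of word w with probability 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    total = len(tokens)
    freq = {w: count / total for w, count in Counter(tokens).items()}
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - math.sqrt(t / freq[w]))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

corpus = ["the", "cat", "sat", "on", "the", "mat"] * 1000
print(len(corpus), "->", len(subsample(corpus)))   # frequent words are heavily thinned
```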
21. Automatic learning by skip-gram model
● No supervised information about what a capital city means.
● But the model is still capable of:
o Automatic organization of concepts
o Learning implicit relationships
(Figure: PCA projection of 100-dimensional skip-gram vectors)
23. Learning Phrases
● To learn phrase vectors:
o First find words that appear frequently together, and infrequently in
other contexts.
o Replace them with unique tokens. Ex: "New York Times" ->
New_York_Times
● Phrases are formed based on the unigram and bigram counts; a discounting
coefficient δ prevents forming too many phrases consisting of very infrequent
words (the score is given below).
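The score used to form phrases, from the paper; bigrams whose score exceeds a threshold are merged into single tokens, typically over several passes with a decreasing threshold:

```latex
% \delta is the discounting coefficient mentioned above:
\mathrm{score}(w_i, w_j) =
    \frac{\mathrm{count}(w_i w_j) - \delta}
         {\mathrm{count}(w_i) \times \mathrm{count}(w_j)}
```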
25. Phrase Skip-gram Results
● Accuracies of the skip-gram models on the phrase analogy dataset:
o Using different hyperparameters (Neg-k denotes negative sampling with k
negative samples, HS-Huffman hierarchical softmax with Huffman codes).
o Models trained on approximately one billion words from the news dataset.
● The size of the training data matters:
o HS-Huffman (dimensionality = 1000) trained on 33 billion words reaches
an accuracy of 72%.
26. Additive compositionality
● It is possible to meaningfully combine words by an element-wise addition of their
vector representations.
○ A word vector represents the distribution of the contexts in which the word
appears.
● The vector values are related logarithmically to the probabilities computed by the
output layer.
○ The sum of two word vectors is therefore related to the product of the two
context distributions, which acts as an AND function: words assigned high
probabilities by both word vectors get high probability, and the other words
get low probability.
○ Ex: if "Volga River" appears frequently in the same sentences as "Russian"
and "river", the sum of these two word vectors will be close to the vector of
"Volga River".
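A hedged sketch of this composition with gensim (same assumed downloader model as earlier; whether a token like "volga" surfaces in the results depends on the model's training corpus, and phrase tokens such as "Volga_River" exist only in models trained with phrase detection):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")        # assumed model name
summed = vectors["russian"] + vectors["river"]      # element-wise addition
print(vectors.similar_by_vector(summed, topn=5))    # words near the summed vector
```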
29. Comments
● Reduction of computational complexity is impressive.
● Works with unsupervised/unlabelled data
● The vector representation can be extended to larger pieces of text, e.g. the
Paragraph Vector (Le and Mikolov 2014)
● Applicable to a lot of NLP tasks
o Tagging
o Named Entity Recognition
o Translation
o Paraphrasing