2. Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of skip-gram model / Learning Phrase / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
3. Introduction
• Examples of NLP tasks
• EASY
• Spell Checking
• Keyword Search (Ctrl+F)
• Finding Synonyms
• MEDIUM
• Parsing information from documents, the web, etc.
• HARD
• Machine Translation (e.g. Translate Korean to English)
• Semantic Analysis (e.g. What is the meaning of this query?)
• Co-reference (e.g. What does “it” refer to in this sentence?)
• Question Answering (e.g. IBM Watson)
5. Introduction
• BUT, the most important question is
how we represent the meaning of words
as input for all these NLP tasks.
6. • At first, most NLP treated words as ATOMIC symbols
• They needed a notion of similarity & difference
• So,
• WordNet: a taxonomy with hypernym (is-a)
relationships and synonym sets
* Simple example of WordNet showing synonyms and antonyms
Prev. Methods for Representing Words
- Discrete Representation
7. • COOL! (see also, Semantic Web)
• Great resource, but misses nuances
Expert == Good ? Usually?
Probably NO!
* Synonym set of “good” using the nltk lib (CS224d lecture note)
How about new words?
: Wicked, ace, wizard, genius, ninja
- Discrete Representation
Prev. Methods for Representing Words
8. • COOL! (see also, Semantic Web)
• Great resource, but misses nuances
* Synonym set of “good” using the nltk lib (CS224d lecture note)
Disadvantage
• Hard to keep up to date
• Requires human labor
• Subjective
• Hard to compute accurate word
similarity
- Discrete Representation
Prev. Methods for Representing Words
9. • Another problem with discrete representation
• Can't give similarity
• Too sparse
e.g. Horse = [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ]
Zebra = [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
“one-hot” representation: a typical, simple representation.
All 0s with a single 1
Horse ∩ Zebra
= [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ] ∩ [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
= 0 (nothing) (But we know both are mammals; sketch below)
- Discrete Representation
Prev. Methods for Representing Words
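A minimal Python sketch (not from the slides) of why one-hot vectors carry no similarity; the 14-word vocabulary is hypothetical, chosen only to match the 14-dimensional vectors above.

```python
import numpy as np

# Hypothetical 14-word vocabulary, matching the 14-dim example above.
vocab = ["a", "the", "eats", "runs", "horse", "fast", "grass",
         "is", "animal", "wild", "stripes", "zebra", "hay", "farm"]

def one_hot(word):
    """All zeros with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

horse, zebra = one_hot("horse"), one_hot("zebra")

# The dot product (and cosine similarity) of two distinct one-hot vectors
# is always 0, so the representation expresses no notion of relatedness.
print(horse @ zebra)  # 0.0
```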
10. • Use neighbors to represent words! (Co-occurrence)
• Conjecture: Words that are related will often appear in the same
documents.
A window allows capturing both syntactic and semantic info.
e.g. corpus: “I enjoy baseball”, “I like NLP”, “I like deep learning.”
* Co-occurrence matrix with window size = 1 (CS224d lecture note)
e.g. “like” co-occurs beside “I” 2 times (sketch below)
Prev. Methods for Representing Words
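A small sketch of building that window-1 co-occurrence matrix from the toy corpus; the tokenization (simple whitespace split, with “.” kept as a token) is an assumption.

```python
import numpy as np

# Toy corpus from the slide; "." is treated as its own token.
corpus = ["I enjoy baseball", "I like NLP", "I like deep learning ."]
window = 1

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

# "like" appears right next to "I" twice, as the slide's annotation notes.
print(X[idx["I"], idx["like"]])  # 2.0
```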
11. • Use this matrix for word-embedding (feat. SVD)
• Applying Singular Value Decomposition
For simplicity, SVD: X (co-occurrence matrix) = U·S·Vᵀ
(details can be found in any linear algebra textbook)
• Select the first k columns of U as k-dimensional word vectors (sketch below)
Prev. Methods for Representing Words
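A sketch of the SVD step in NumPy on the same toy corpus; taking the first k columns of U follows the slide, though scaling them by the singular values is another common choice.

```python
import numpy as np

# Window-1 co-occurrence matrix X for the toy corpus, as in the previous snippet.
corpus = ["I enjoy baseball", "I like NLP", "I like deep learning ."]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    ws = s.split()
    for i, w in enumerate(ws):
        for j in (i - 1, i + 1):
            if 0 <= j < len(ws):
                X[idx[w], idx[ws[j]]] += 1

# SVD: X = U * diag(S) * V^T; keep the first k columns of U as word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k]                # k-dimensional embedding per word
# (a common variant also scales by the singular values: U[:, :k] * S[:k])

for w in ("I", "like", "enjoy"):
    print(w, word_vectors[idx[w]])
```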
12. • Result of the SVD-based model
(Plots of the resulting word vectors for K = 2 and K = 3)
Prev. Methods for Representing Words
13. • Disadvantage
• Co-occurrence matrix is extremely sparse
• Very high dimensional
• Quadratic cost to train (i.e. perform SVD)
• Needs hacks for the imbalance in word frequency
(e.g. “it”, “the”, “has”, etc.)
• Some solutions exist for these problems, but they are not intrinsic
Prev. Methods for Representing Words
14. Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of skip-gram model / Learning Phrase / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
15. Word2vec (related papers)
• Then how?
Directly (iteratively) learn low-dimensional word vectors!
Go back to 1986
• Learning representations by back-propagating errors
(Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
• Efficient Estimation of Word Representations in Vector Space
• Distributed Representations of words and phrases and their
compositionality
16. Efficient Estimation of Word
Representations in Vector Space
• Introduces the initial architecture of word2vec (2013)
• Two new models: Continuous Bag-of-Words (CBOW) and Skip-gram
• Empirically shows that these word vectors have better syntactic and semantic
representations than other models
• Comparison of the two models
• Skip-gram works well on semantic tasks, but training is slower.
• CBOW works well on syntactic tasks, and training is faster.
(P)Review
17. Word2vec (profile)
• Distributed Representations of words and phrases
and their compositionality
• NIPS 2013 (Submitted on 16 Oct 2013)
• Tomas Mikolov (Facebook, 2014–) et al.
• Includes additional work on “Efficient Estimation of Word
Representations in Vector Space”.
18. Word2vec (Contents)
• This paper includes,
• Extensions of skip-gram model (fast & accurate)
• Method
• Hierarchical soft-max
• NEG
• Subsampling
• Ability to Learn Phrases
• Find Additive Compositionality
• Conclusion
19. • Skip-gram model
• The objective of the skip-gram model is to “find word representations
useful for predicting context words in a sentence”, i.e. maximize the average log probability
(1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)   (the pairs it sums over are sketched below)
• Softmax function: p(w_{t+j} | w_t) is defined by a softmax over word vectors (next slides)
Extension of Skip-Gram
T: total number of training words (steps)
c: size of the training context window
w_t, w_{t+j}: the current step's word and the j-th word around w_t
BUT, without understanding the
original model, we are
... going to ... fall ... asleep ...
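To make the objective concrete, a tiny illustrative sketch that enumerates the (w_t, w_{t+j}) pairs the sum above runs over, assuming a window size c = 2.

```python
sentence = "I like deep learning".split()
c = 2   # training context window size

# Pairs (center word w_t, context word w_{t+j}) with -c <= j <= c, j != 0.
pairs = [(sentence[t], sentence[t + j])
         for t in range(len(sentence))
         for j in range(-c, c + 1)
         if j != 0 and 0 <= t + j < len(sentence)]
print(pairs)

# The skip-gram objective maximizes (1/T) * sum of log p(context | center)
# over exactly these pairs, with p given by the softmax on the next slides.
```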
21. CBOW (Original)
• Continuous Bag-of-Words model
• Idea: Using the context words, we can predict the center word
i.e. Probability( “It is ( ? ) to finish” → “time” )
• Represent each word as a low-dimensional distributed vector (of probabilities)
• Goal: Train weight matrices (W) that satisfy the below
• Loss function (using cross-entropy):
find W that minimizes the gap between the one-hot vector for “time”
and softmax( pr(“time” | “it”, “is”, “to”, “finish”; W) )
* Softmax(): maps a K-dim vector x ∈ ℝ^K to a K-dim vector with entries in (0, 1)
E = − log p(w_t | w_{t−C} .. w_{t+C})
context words (window_size = 2)
22. CBOW (Original)
• Continuous Bag-of-Words model
• Input
• “one-hot” word vectors
• Remove the nonlinear hidden layer
• Back-propagate the error from the
output layer to the weight matrices
(adjust the Ws)
(Architecture diagram: one-hot vectors for the context words “It”, “is”, “to”, “finish”
→ Win → hidden vector h → WoutT∙h → predicted ŷ, compared against the true one-hot
vector y for “time”; the error is backpropagated to update Win(old), Wout(old) into
Win(new), Wout(new). A NumPy sketch of this forward pass follows below.)
Win, Wout ∈ ℝ^{n×|V|}: input and output weight matrices, n = word-embedding dimension
x_i, y_i: input and output word vectors (one-hot) from the vocabulary V
h: hidden vector, the average of Win∙x over the context words
Dimensions: [n×|V|]∙[|V|×1] → [n×1] and [|V|×n]∙[n×1] → [|V|×1]
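A minimal NumPy sketch of one CBOW forward pass and its cross-entropy loss on the slide's sentence; the 3-dimensional embedding and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["it", "is", "time", "to", "finish"]
V, n = len(vocab), 3                    # |V| = 5, embedding dimension n = 3
W_in = rng.normal(size=(n, V))          # input weight matrix  (n x |V|)
W_out = rng.normal(size=(n, V))         # output weight matrix (n x |V|)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [0, 1, 3, 4]              # "it", "is", "to", "finish"
target_id = 2                           # "time"

# h = average of W_in * x over the context one-hot vectors
# (multiplying by a one-hot vector just selects a column of W_in).
h = W_in[:, context_ids].mean(axis=1)   # (n,)
y_pred = softmax(W_out.T @ h)           # (|V|,) predicted distribution

# Cross-entropy loss E = -log p(w_t | context); its gradient would be
# backpropagated to update W_out and W_in.
loss = -np.log(y_pred[target_id])
print(loss)
```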
23. • Skip-gram model
• Idea: With the center word,
we can predict the context words
• Mirror of CBOW (and vice versa)
i.e. Probability( “time” → “It is ( ? ) to finish” )
• Loss function:
Skip-Gram (Original)
E = − log p(w_{t−C} .. w_{t+C} | w_t)
(Architecture diagram: one-hot vector for the center word “time” → Win∙x → hidden
vector h → Wout → predicted context words “It”, “is”, “to”, “finish”; the error is
backpropagated to update Win(old), Wout(old) into Win(new), Wout(new).
Dimensions: [n×|V|]∙[|V|×1] → [n×1] and [|V|×n]∙[n×1] → [|V|×1].
A mirrored sketch follows below.)
cf. CBOW: E = − log p(w_t | w_{t−C} .. w_{t+C})
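The mirrored sketch for skip-gram: the center word's embedding predicts a softmax distribution that is evaluated at every context word (same illustrative assumptions as above).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["it", "is", "time", "to", "finish"]
V, n = len(vocab), 3
W_in = rng.normal(size=(n, V))          # center-word (input) vectors
W_out = rng.normal(size=(n, V))         # context-word (output) vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center_id = 2                           # "time"
context_ids = [0, 1, 3, 4]              # "it", "is", "to", "finish"

h = W_in[:, center_id]                  # hidden vector = center word's embedding
probs = softmax(W_out.T @ h)            # p(w | center) over the vocabulary

# E = -log p(w_{t-C}..w_{t+C} | w_t): sum the context words' negative log-probs.
loss = -np.log(probs[context_ids]).sum()
print(loss)
```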
24. • Hierarchical Soft-max function
• To train the weight matrices at every step, we need to pass the
computed vector into the loss function
• Soft-max function
• Before calculating the loss function, the
computed vector should be normalized to real numbers in (0, 1)
Extension of Skip-Gram(1)
T: total number of training words (steps)
c: size of the training context window
w_t, w_{t+j}: the current step's word and the j-th word around w_t
(E = − log p(w_{t−C} .. w_{t+C} | w_t))
25. • Hierarchical Soft-max function (cont.)
• Soft-max function
(we have calculated this already; it's getting boring …)
Extension of Skip-Gram(1)
Original soft-max function of the skip-gram model:
p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1..|V|} exp(v′_wᵀ v_{w_I})
where v_w and v′_w are the “input” and “output” vector representations of w
26. • Hierarchical Soft-max function (cont.)
• Since |V| is quite large, computing log p(w_O | w_I) costs too much
• Idea: Construct a binary Huffman tree over the words
Cost: O(|V|) → O(log |V|)
• Can train faster!
• Assigning
• Each word w is the leaf n(w, L(w)); its probability is the product of
choices along a random walk from the root to that leaf (toy sketch below)
(* details in “Hierarchical Probabilistic Neural Network Language Model ")
Extension of Skip-Gram(1)
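A toy sketch of the hierarchical-softmax idea (simplified, not the word2vec implementation): build a Huffman tree from word counts and score a word as a product of sigmoids along its root-to-leaf path. The counts and random vectors are made up for illustration.

```python
import heapq
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy word counts; word2vec builds the Huffman tree from corpus frequencies,
# so frequent words get short paths and the expected cost drops to O(log|V|).
counts = {"the": 50, "time": 10, "finish": 5, "zebra": 1}
word_id = {w: i for i, w in enumerate(counts)}

heap = [(c, word_id[w]) for w, c in counts.items()]
heapq.heapify(heap)
parent, sign, next_id = {}, {}, len(counts)
while len(heap) > 1:
    c1, n1 = heapq.heappop(heap)
    c2, n2 = heapq.heappop(heap)
    parent[n1], sign[n1] = next_id, +1.0     # "go left" branch
    parent[n2], sign[n2] = next_id, -1.0     # "go right" branch
    heapq.heappush(heap, (c1 + c2, next_id))
    next_id += 1

def path(leaf):
    """(inner node, branch sign) pairs from the leaf up to the root."""
    out, n = [], leaf
    while n in parent:
        out.append((parent[n], sign[n]))
        n = parent[n]
    return out

rng = np.random.default_rng(0)
dim = 3
v_in = rng.normal(size=dim)                   # vector of the input word w_I
inner = {n: rng.normal(size=dim)              # one vector per inner node
         for n in set(parent.values())}

# p(w_O | w_I) is a product of sigmoids along w_O's path; because
# sigmoid(x) + sigmoid(-x) = 1, the leaf probabilities sum to 1.
p = np.prod([sigmoid(s * (inner[n] @ v_in)) for n, s in path(word_id["time"])])
print(p)
```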
27. • Negative Sampling (similar to NCE)
• The vocabulary is computationally huge! Slow to train
• Idea: Just sample several negative examples!
• Do not loop over the full vocabulary, only use the negative samples → fast
• Replace the target word with sampled negative examples and learn to
distinguish them → more accuracy
• Objective function (sketch below):
log σ(v′_{w_O}ᵀ v_{w_I}) + Σ_{i=1..k} E_{w_i ∼ P_n(w)} [ log σ(−v′_{w_i}ᵀ v_{w_I}) ]
Extension of Skip-Gram(2)
e.g. “Stock boil fish is toy” → a negative sample
Noise Contrastive Estimation
(cf. E = − log p(w_{t−C} .. w_{t+C} | w_t))
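A sketch of the negative-sampling objective for a single (center, context) pair; the random vectors and k = 5 are illustrative, and word2vec draws the noise words from the unigram distribution raised to the 3/4 power.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, k = 3, 5                            # embedding size, number of negative samples

v_center = rng.normal(size=dim)          # input vector of the center word w_I
v_context = rng.normal(size=dim)         # output vector of the true context word w_O
v_negatives = rng.normal(size=(k, dim))  # output vectors of k sampled noise words

# NEG objective (to maximize):
#   log sigmoid(v_wO . v_wI) + sum_i log sigmoid(-v_wi . v_wI)
# i.e. pull the true pair together and push the k noise words away,
# with no normalization over the full vocabulary.
objective = (np.log(sigmoid(v_context @ v_center))
             + np.log(sigmoid(-v_negatives @ v_center)).sum())
print(objective)
```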
28. • Subsampling
• (“Korea”, ”Seoul”) is helpful, but (“Korea”, ”the”) isn’t helpful
• Idea: The vectors of frequent words (e.g. “the”) should not change
significantly after training on several million examples.
• Each word w_i in the training set is discarded with probability
P(w_i) = 1 − sqrt( t / f(w_i) )   (small sketch after this slide)
• It aggressively subsamples frequent words while preserving the
ranking of the frequencies
• But this formula was chosen heuristically…
Extension of Skip-Gram(3)
f(w_i): frequency of word w_i
t: chosen threshold, around 10^{−5}
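The discard formula in a few lines; the word frequencies below are made-up values, used only to show how aggressively frequent words are dropped while rare ones survive.

```python
import numpy as np

def discard_prob(f, t=1e-5):
    """P(discard w_i) = 1 - sqrt(t / f(w_i)); words with f(w_i) <= t are never discarded."""
    return max(0.0, 1.0 - np.sqrt(t / f))

# Illustrative corpus frequencies (fractions of all tokens), not measured values.
for word, f in [("the", 0.05), ("korea", 1e-4), ("seoul", 2e-5), ("word2vec", 5e-7)]:
    print(f"{word:10s} f={f:<8g} P(discard)={discard_prob(f):.3f}")
# "the" is dropped ~98.6% of the time, rare words are kept,
# and the ordering of the frequencies is preserved.
```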
29. • Evaluation
• Task: Analogical reasoning
• Accuracy is measured with cosine similarity to determine whether the model
answers correctly (sketch below).
i.e. vec(X) = vec(“Berlin”) – vec(“Germany”) + vec(“France”)
Accuracy = cosine_similarity( vec(X), vec(“Paris”) )
• Model: skip-gram model (word-embedding dimension = 300)
• Data set: news articles (Google dataset with 1 billion words)
• Compared methods (w/ or w/o 10^{−5} subsampling)
• NEG(Negative Sampling)-5, 15
• Hierarchical Softmax-Huffman
• NCE-5(Noise Contrastive Estimation)
Extension of Skip-Gram
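A sketch of the analogy check; `vectors` is assumed to be a dict from words to trained 300-d embeddings (e.g. loaded from the published word2vec vectors) and is not defined here.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy_is_correct(vectors, a, b, c, expected):
    """vec(X) = vec(a) - vec(b) + vec(c); the model's answer is the vocabulary
    word (excluding a, b, c) whose vector has the highest cosine similarity to X."""
    x = vectors[a] - vectors[b] + vectors[c]
    answer = max((w for w in vectors if w not in (a, b, c)),
                 key=lambda w: cosine(x, vectors[w]))
    return answer == expected

# e.g. analogy_is_correct(vectors, "Berlin", "Germany", "France", expected="Paris")
```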
30. • Empirical Results
• Model w/ NEG outperforms the HS on the analogical reasoning task
(even slightly better than NCE)
• Subsampling improves the training speed several-fold and
makes the word representations more accurate
Extension of Skip-Gram
31. • Word-based models cannot represent idiomatic phrases
• e.g. “New York Times”, “Larry Page”
• Simple data-driven approach
• Phrases are formed based on unigram and bigram counts (sketch below):
score(w_i, w_j) = ( count(w_i w_j) − δ ) / ( count(w_i) × count(w_j) )
• Word pairs with a high score are treated as meaningful phrases
Learning Phrases
δ: discounting coefficient
(prevents too many phrases consisting of infrequent words)
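A sketch of the bigram phrase score on a made-up token list; the discounting coefficient δ and the merge threshold are assumptions to tune.

```python
from collections import Counter

def phrase_scores(tokens, delta=1.0):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j))."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {(a, b): (c - delta) / (unigram[a] * unigram[b])
            for (a, b), c in bigram.items()}

tokens = "the new york times reported that new york traffic is heavy".split()
scores = phrase_scores(tokens)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
# Bigrams above a chosen threshold (here "new york") are merged into one token,
# and the pass can be repeated to form longer phrases like "new_york_times".
```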
32. • Evaluation
• Task: Analogical reasoning
• Accuracy is measured with cosine similarity to determine whether the model
answers correctly with phrases
• i.e. vec(X) = vec(“Microsoft”) – vec(“Steve Ballmer”) + vec(“Larry Page”)
Accuracy = cosine_similarity( vec(X), vec(“Google”) )
• Model: skip-gram model (word-embedding dimension = 300)
• Data set: news articles (Google dataset with 1 billion words)
• Compared methods (w/ or w/o 10^{−5} subsampling)
• NEG-5
• NEG-15
• HS-Huffman
Learning Phrases
33. • Empirical Results
• NEG-15 achieves better performance than NEG-5
• HS becomes the best performing method when subsampling is used
• This shows that the subsampling can result in faster training and can
also improve accuracy, at least in some cases.
• With a training set of 33 billion words and d = 1000, accuracy reaches 72% (66% with 6B words)
• The amount of training data is crucial!
Learning Phrases
34. • Simple vector addition (on Skip-gram model)
• Previous experiments showed analogical reasoning (A + B − C)
• The vectors' values are related logarithmically to the probabilities,
so the sum of two vectors is related to the product of the two context distributions
(e.g. vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”); sketch below)
• Interesting!
Additive Compositionality
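A short sketch of checking additive compositionality with nearest neighbors; `vectors` is again an assumed dict of pretrained skip-gram embeddings, and the example analogies are the ones reported in the paper.

```python
import numpy as np

def nearest(vectors, query, exclude=()):
    """Vocabulary entry whose vector is closest (by cosine) to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(query, vectors[w]))

# With good skip-gram vectors one expects, as in the paper:
#   nearest(vectors, vectors["Germany"] + vectors["capital"])  -> "Berlin"
#   nearest(vectors, vectors["Russia"] + vectors["river"])     -> "Volga River" (a phrase vector)
```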
35. • Contributions
• Showed the detailed process of training distributed
representations of words and phrases
• Sub-sampling makes the model faster and more accurate than
the previous word2vec model
• Negative Sampling: extremely simple and accurate for
frequent words (for infrequent words such as phrases, HS was better)
• Word vectors can be meaningfully combined by simple vector addition
• Released the code and dataset as an open-source project
Conclusion
36. • Comparison to other neural network models
<Finding the most similar word>
• The skip-gram model trained on a large corpus outperforms
all the other papers' models.
Conclusion
37. • Very Interesting model
• Simple, short paper
• Easy to read
• Hard to understand detail
• In HS, the way the tree is constructed
• Several heuristic methods
• Pre-processing such as eliminating stop-words
Speaker’s Opinion
38. • Papers
• Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint
arXiv:1301.3781 (2013).
• Morin, Frederic, and Yoshua Bengio. "Hierarchical probabilistic neural network language model." Proceedings of
the international workshop on artificial intelligence and statistics. 2005.
• Guthrie, David, et al. "A closer look at skip-gram modelling." Proceedings of the 5th international Conference on
Language Resources and Evaluation (LREC-2006). 2006.
• Rong, Xin. "word2vec Parameter Learning Explained." arXiv preprint arXiv:1411.2731 (2014).
• Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-
embedding method." arXiv preprint arXiv:1402.3722(2014).
• Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The Journal of Machine Learning
Research 12 (2011): 2493-2537.
• Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003):
1137-1155.
• Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating
errors." Cognitive modeling 5 (1988): 3.
• Websites & Courses
• Richard Socher, CS224d: Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/)
• http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
• http://nohhj.blogspot.kr/2015/08/word-embedding.html
• https://yinwenpeng.wordpress.com/category/deep-learning-in-nlp/
• http://rare-technologies.com/word2vec-tutorial/
• https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors
References
It seems to propose a new notion, but it is actually an old one.
Used the chain rule and evaluated errors at every iteration -> showed a probabilistic language model on a large-scale set -> deep learning can be used for various NLP tasks
- Instead of looping over the entire vocabulary, just sample several negative examples! A good model should be able to distinguish the bad samples.
- Build a new objective function that tries to maximize the probability that a (word, context) pair is in the corpus data if it indeed is, and to maximize the probability that it is not in the corpus data if it indeed is not.