2. 01 Introduction
• Research background and objective
§ Existing embedding models assign each unique word its own vector.
§ However, this approach runs into limits as the vocabulary grows or as the number of rare words increases.
§ It is hard to obtain good word representations for such words.
§ In particular, word representation techniques to date do not take the internal structure of words into account.
§ In Spanish or French, however, most verbs have more than forty inflected forms, so the rare word problem becomes even more pronounced in these languages.
§ For such morphologically rich languages, training with subword information should improve the vector representations.
6. 02 General Model
• Row Indexing: resolving the bottleneck of the 𝑋 × 𝑊_input matrix multiplication
§ The matrix multiplication between 𝑋 and 𝑊_input becomes very expensive as the vocabulary grows.
§ To obtain ℎ, the one-hot index of the input vector 𝑋 can instead be used directly as a row index into 𝑊_input.
§ The result is identical, but the computational cost differs greatly (see the sketch after the figure below).
The goal is to select and update only those elements of the weight matrix 𝑊_input that are related to the input vector 𝑋.
[Figure: the one-hot input vector for "quick" multiplied by 𝑊_input; row indexing picks out the matching row of 𝑊_input directly (vocabulary: the, quick, brown, …, dog)]
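A minimal NumPy sketch of this idea (the vocabulary size, hidden size, and word index are illustrative placeholders, not values from the original slides): multiplying a one-hot vector by 𝑊_input gives exactly the same result as indexing the corresponding row.

```python
import numpy as np

vocab_size, hidden_size = 5, 3
rng = np.random.default_rng(0)
W_input = rng.normal(size=(vocab_size, hidden_size))  # input weight matrix

word_index = 1                   # e.g. the index of "quick" in the vocabulary
x = np.zeros(vocab_size)
x[word_index] = 1.0              # one-hot input vector X

h_matmul = x @ W_input           # full matrix multiplication: O(V * H)
h_lookup = W_input[word_index]   # row indexing: O(H)

assert np.allclose(h_matmul, h_lookup)  # same result, far cheaper
```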
7. 02 General Model
• Negative Sampling: resolving the bottleneck of the ℎ × 𝑊_output matrix multiplication and the softmax layer
§ The multiplication between the latent vector ℎ and 𝑊_output likewise becomes expensive as the vocabulary grows.
§ The softmax computation also becomes expensive as the vocabulary grows.
§ However, the context words associated with a single target word are only the few words inside the window.
§ In other words, even though only a handful of words need to be updated for a given input, the ℎ × 𝑊_output multiplication compares the input against every word in the vocabulary, which is inefficient.
[Figure: ℎ multiplied by 𝑊_output yields a score, and after softmax a probability, for every word in the vocabulary (the, quick, brown, dog, …)]
8. 02 General Model
• Negative Sampling: resolving the bottleneck of the ℎ × 𝑊_output matrix multiplication and the softmax layer
§ Negative sampling can be used to resolve this.
§ The key idea of negative sampling is to approximate the multi-class classification with binary classification, as in the sketch below.
• positive example: a context word of the target word
• negative example: a word that is not a context word of the target word
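A hedged sketch of this approximation (the word indices, vocabulary size, and randomly initialized weights are made up for illustration): the hidden vector of the target word is scored only against its positive context words and a few sampled negatives, each treated as an independent binary decision instead of one softmax over the whole vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10_000, 100
W_output = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

h = rng.normal(size=hidden_size)       # hidden vector for the target word
positive_ids = [42, 7]                 # context words (positive examples)
negative_ids = [501, 8123, 999]        # sampled non-context words (negative examples)

# Binary classification: push positives toward 1 and negatives toward 0.
pos_probs = sigmoid(W_output[positive_ids] @ h)
neg_probs = sigmoid(W_output[negative_ids] @ h)

loss = -np.log(pos_probs).sum() - np.log(1.0 - neg_probs).sum()
```

Only the rows of W_output for the positive and sampled negative words are touched, rather than all 10,000 rows.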
10. 02 General Model
• Negative Sampling: resolving the bottleneck of the ℎ × 𝑊_output matrix multiplication and the softmax layer
§ How are negatives sampled?
• Words that appear frequently in the corpus are sampled more often, while rarely occurring words are sampled less often.
§ The probability distribution is derived through the equation below:
P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{W} f(w_j)^{3/4}}, \qquad f(w_i) = \frac{\#\ \text{of}\ w_i\ \text{in corpus}}{\#\ \text{of all words in corpus}}
§ Why use 3/4?
• So that words with a low occurrence probability are sampled a little more readily (see the sketch below).
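A small sketch of building this sampling distribution (the toy word counts are invented). Raising the frequencies to the 3/4 power flattens the distribution, so rare words are drawn somewhat more often than their raw frequency alone would allow.

```python
import numpy as np

# toy corpus frequencies f(w_i)
counts = {"the": 500, "quick": 30, "brown": 20, "dog": 5}
words = list(counts.keys())
freqs = np.array([counts[w] for w in words], dtype=float)
freqs /= freqs.sum()

probs = freqs ** 0.75
probs /= probs.sum()        # P(w_i) = f(w_i)^{3/4} / sum_j f(w_j)^{3/4}

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)  # sampled negative words
```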
11. 02 General Model
• skip-gram
§ We start by briefly reviewing the skip-gram model introduced by Mikolov et al.
§ Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context.
§ It is assumed that there is no relationship between the context words 𝑤_c given the target word 𝑤_t (conditional independence).
P(w_{t-1}, w_{t+1} \mid w_t) = P(w_{t-1} \mid w_t)\, P(w_{t+1} \mid w_t)
§ Given a large training corpus represented as a sequence of words (𝑤_1, … , 𝑤_T), the objective of the skip-gram model is to maximize the following log-likelihood:
\prod_{t=1}^{T} \prod_{c \in C_t} p(w_c \mid w_t) \iff \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)
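A minimal sketch of how the (target, context) pairs behind this objective are formed within a symmetric window; the sentence and window size are arbitrary, and each pair (w_t, w_c) contributes one log p(w_c | w_t) term to the sum above.

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

pairs = []
for t, target in enumerate(sentence):
    # C_t: indices of the words surrounding position t
    for c in range(max(0, t - window), min(len(sentence), t + window + 1)):
        if c != t:
            pairs.append((target, sentence[c]))
# e.g. ("brown", "the"), ("brown", "quick"), ("brown", "fox"), ("brown", "jumps"), ...
```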
Notation
• 𝑊: vocabulary size
• 𝑤 ∈ {1, 2, … , 𝑊}: a word is identified by its index
• 𝑤_c: context word
• 𝑤_t: target word
• 𝐶_t: set of indices of words surrounding the word 𝑤_t
• 𝒩_{t,c}: a set of negative examples sampled from the vocabulary
12. 02 General Model
• skip-gram: Objective
§ One possible choice to define the probability of a context word is the softmax:
P(w_c \mid w_t) = \frac{e^{s(w_t,\, w_c)}}{\sum_{j=1}^{W} e^{s(w_t,\, j)}}
§ The problem of predicting context words can instead be framed as a set of independent binary classification tasks.
§ Then the goal is to independently predict the presence(or absence) of context words.
§ For the word at position 𝑡 we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position 𝑐, using the binary logistic loss, we obtain the following negative log-likelihood:
\log\left(1 + e^{-s(w_t,\, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t,\, n)}\right), \qquad s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}
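A hedged NumPy sketch of this binary logistic loss, with U and V as illustrative target/context embedding matrices and randomly drawn negative indices standing in for 𝒩_{t,c} (all names and dimensions are assumptions, not part of the original model code).

```python
import numpy as np

def log1p_exp(x):
    # numerically stable log(1 + e^x)
    return np.logaddexp(0.0, x)

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors u_w
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors v_w

t, c = 3, 17                                        # target and context word indices
negatives = rng.integers(0, vocab_size, size=5)     # stand-in for N_{t,c}

s_pos = U[t] @ V[c]                                 # s(w_t, w_c)
s_neg = U[t] @ V[negatives].T                       # s(w_t, n) for n in N_{t,c}

loss = log1p_exp(-s_pos) + log1p_exp(s_neg).sum()   # the negative log-likelihood above
```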
13. 02 General Model
• skip-gram: Negative-sampling
§ For the word at position 𝑡 we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position 𝑐, using the binary logistic loss, we obtain the following negative log-likelihood:
\log\left(1 + e^{-s(w_t,\, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t,\, n)}\right), \qquad s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}
§ For all target words, we can re-write the objective as:
\sum_{t=1}^{T} \left[ \sum_{c \in C_t} \log\left(1 + e^{-s(w_t,\, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t,\, n)}\right) \right]
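A short sketch tying the pieces together: loop over every target position, and for each context word accumulate the positive logistic term plus the terms for the sampled negatives. The corpus, window, and uniform negative sampling here are toy placeholders (real training samples negatives from the unigram^{3/4} distribution shown earlier).

```python
import numpy as np

def skipgram_ns_loss(U, V, corpus, window, n_neg, rng):
    """Negative-sampling objective summed over all target positions in the corpus."""
    total = 0.0
    for t, w_t in enumerate(corpus):
        # C_t: context positions around t
        context = [corpus[c] for c in range(max(0, t - window),
                                            min(len(corpus), t + window + 1)) if c != t]
        for w_c in context:
            negatives = rng.integers(0, U.shape[0], size=n_neg)        # N_{t,c}
            total += np.logaddexp(0.0, -(U[w_t] @ V[w_c]))             # positive term
            total += np.logaddexp(0.0, U[w_t] @ V[negatives].T).sum()  # negative terms
    return total

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(100, 16))   # toy target-word vectors
V = rng.normal(scale=0.1, size=(100, 16))   # toy context-word vectors
loss = skipgram_ns_loss(U, V, corpus=[3, 17, 42, 8, 3], window=2, n_neg=5, rng=rng)
```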
Negative sampling results in both faster training and more accurate representations, especially for frequent words (Mikolov et al., 2013).
14. 02 General Model
• skip-gram: Subsampling of Frequent Words
§ In very large corpora, frequent words usually provide less information value than rare words.
§ The vector representations of frequent words do not change significantly after training on several million examples.
§ To counter the imbalance between rare and frequent words, we used a simple subsampling approach:
• each word 𝑤_i in the training set is discarded with a probability computed by the formula below:
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}, \qquad f(w_i) = \frac{\#\ \text{of}\ w_i\ \text{in corpus}}{\#\ \text{of all words in corpus}}, \qquad t = \text{chosen threshold (recommended } 10^{-5})
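A sketch of this subsampling rule, assuming (consistent with Mikolov et al., 2013) that the discard probability is 1 − √(t / f(w_i)). The toy corpus and counts are invented, and because the corpus is tiny every word here looks "frequent" relative to t = 10⁻⁵.

```python
import numpy as np
from collections import Counter

# toy corpus: a real corpus would contain millions of tokens
corpus = ["the"] * 900 + ["quick"] * 60 + ["brown"] * 30 + ["dog"] * 10
counts = Counter(corpus)
total = len(corpus)
t = 1e-5                                         # recommended threshold

for word, count in counts.items():
    f = count / total                            # relative frequency f(w_i)
    p_discard = max(0.0, 1.0 - np.sqrt(t / f))   # P(w_i) = 1 - sqrt(t / f(w_i))
    print(f"{word:>6}: f={f:.3f}  discard prob={p_discard:.3f}")

# During training, each occurrence of w_i is dropped with probability P(w_i)
# before (target, context) pairs are formed.
```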
Subsampling results in both faster training and significantly better representations of uncommon words (Mikolov et al., 2013).
15. 03 Subword Model
• SISG, FastText
§ Each word w is represented as a bag of character n-grams.
§ We add the special boundary symbols '<' and '>' at the beginning and end of words, which allows prefixes and suffixes to be distinguished from other character sequences.
§ We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams).
§ Taking the word where and n = 3 as an example, it will be represented by the character n-grams:
𝒢_where = {<wh, whe, her, ere, re>, <where>}
§ We represent a word by the sum of the vector representations of its n-grams, so we obtain the scoring function:
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c
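A minimal sketch of this subword scoring function: extract the character n-grams of a word with the '<' and '>' boundary symbols, add the whole word itself, and sum the n-gram vectors before taking the dot product with the context vector. The toy dictionary of n-gram vectors is an assumption for illustration; the actual FastText implementation hashes n-grams into a fixed number of buckets instead of storing each one explicitly.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams with boundary symbols, plus the full word itself."""
    token = f"<{word}>"
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    grams.append(token)   # the special sequence <word>
    return grams

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

rng = np.random.default_rng(0)
dim = 50
G_w = char_ngrams("where")
z = {g: rng.normal(scale=0.1, size=dim) for g in G_w}   # n-gram vectors z_g
v_c = rng.normal(scale=0.1, size=dim)                   # context-word vector v_c

s = sum(z[g] @ v_c for g in G_w)                        # s(w, c) = sum_{g in G_w} z_g . v_c
```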
Notation
• 𝑊: vocabulary size
• 𝑤 ∈ {1, 2, … , 𝑊}: a word is identified by its index
• 𝑤_c: context word
• 𝑤_t: target word
• 𝐶_t: set of indices of words surrounding the word 𝑤_t
• 𝒩_{t,c}: a set of negative examples sampled from the vocabulary
• 𝒢_w: the set of subwords (character n-grams) of a word 𝑤