FastText (Enriching Word Vectors with Subword Information) Paper Review
Contents
01 Introduction
02 General Model (skip-gram)
03 Subword Model (SISG)
01 Introduction
• Research background and motivation
§ Existing embedding models assign each unique word to a single vector.
§ However, this approach runs into limits as the vocabulary grows or as the number of rare words increases.
§ It is hard to obtain good word representations for such rare words.
§ In particular, word representation methods to date do not consider the internal structure of words.
§ In languages such as Spanish or French, most verbs have more than forty inflected forms, so the rare-word problem becomes even more pronounced.
For such morphologically rich languages, exploiting subword information during training should improve the vector representations.
02 General Model
• skip-gram
[Figure: a target word in a sentence with its surrounding context words inside the window size]
P(quick, brown | the) = P(quick | the) P(brown | the)
P(the, brown, fox | quick) = P(the | quick) P(brown | quick) P(fox | quick)
P(the, quick, fox, jumps | brown) = P(the | brown) P(quick | brown) P(fox | brown) P(jumps | brown)
P(quick, brown, jumps, over | fox) = P(quick | fox) P(brown | fox) P(jumps | fox) P(over | fox)
Assumption
• Context words are conditionally independent given the target word.
Objective: maximize $\prod_{t=1}^{T} \prod_{c \in C_t} p(w_c \mid w_t)$
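A minimal Python sketch of how these (target, context) pairs are enumerated; the sentence and the window size of 2 are assumptions for illustration, not taken from the slides:

```python
# Enumerate skip-gram (target, context) pairs for an example sentence.
sentence = "the quick brown fox jumps over".split()
window = 2

pairs = []
for t, target in enumerate(sentence):
    # context indices C_t: words within the window, excluding the target itself
    for c in range(max(0, t - window), min(len(sentence), t + window + 1)):
        if c != t:
            pairs.append((target, sentence[c]))

# Under the conditional-independence assumption, the corpus likelihood is the
# product of p(context | target) over these pairs (a sum of logs in practice).
print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```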
02 General Model
• skip-gram
[Architecture: input one-hot $I_{1 \times V}$ → input weights $W_{V \times N}$ → hidden layer $H_{1 \times N}$ → output weights $W'_{N \times V}$ → output $O_{1 \times V}$]
02 General Model
• skip-gram
[Worked example: the one-hot input vector for the target word "quick" is multiplied by $W_{input}$ (rows labeled the, quick, brown, ..., dog) to give the hidden vector $h$; multiplying $h$ by $W'_{output}$ and applying softmax yields $\hat{y}$, one probability per vocabulary word. Comparing $\hat{y}$ with the one-hot labels of the context words "the" and "brown" gives $loss_{(quick,\,the)} + loss_{(quick,\,brown)} = loss_{quick}$.]
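The picture above can be reproduced with a few lines of NumPy. This is only an illustrative sketch: the weights are random stand-ins rather than the slide's numbers, and the vocabulary order (the, quick, brown, ..., dog) is assumed.

```python
import numpy as np

V, N = 5, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # W_{V x N}
W_out = rng.normal(size=(N, V))   # W'_{N x V}

x = np.zeros(V)
x[1] = 1.0                        # one-hot input for the target word "quick"
h = x @ W_in                      # hidden layer H_{1 x N}
scores = h @ W_out                # output layer O_{1 x V}
y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# cross-entropy against the one-hot labels of the context words
# "the" (index 0) and "brown" (index 2); their sum is loss_quick
loss_quick = -np.log(y_hat[0]) - np.log(y_hat[2])
```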
02 General Model
• Row Indexing: resolving the bottleneck of the $X \cdot W_{input}$ matrix product
§ The matrix product of $X$ and $W_{input}$ becomes increasingly expensive as the vocabulary grows.
§ To obtain $h$, we can simply use the one-hot index of the input vector $X$ as a row index into $W_{input}$.
§ The result is identical, but the computational cost differs greatly.
The goal is to select and update only the elements of the weight matrix $W_{input}$ that are actually related to the input vector $X$.
[Figure: the one-hot vector for "quick" selects the corresponding row of $W_{input}$ (rows labeled the, quick, brown, ..., dog), which directly gives the hidden vector $h$.]
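A small NumPy sketch of the row-indexing shortcut (shapes are assumed for illustration): multiplying a one-hot vector by $W_{input}$ gives exactly the same hidden vector as reading one row of $W_{input}$, at a fraction of the cost.

```python
import numpy as np

V, N = 10_000, 300
W_in = np.random.default_rng(0).normal(size=(V, N))

idx = 1                          # one-hot index of the input word, e.g. "quick"
x = np.zeros(V)
x[idx] = 1.0

h_matmul = x @ W_in              # O(V * N) multiply-adds
h_lookup = W_in[idx]             # O(N): just read one row

assert np.allclose(h_matmul, h_lookup)
```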
02 General Model
• Negative Sampling: resolving the bottleneck of the $h \cdot W_{output}$ matrix product and the softmax layer
§ The matrix product of the latent vector $h$ and $W_{output}$ also becomes expensive as the vocabulary grows.
§ The softmax computation likewise becomes expensive as the vocabulary grows.
§ However, the context words associated with a single target word are only the few words inside the window size.
§ In other words, the $h \cdot W_{output}$ product compares the input against every word in the vocabulary even though only a handful of words actually need to be updated, which is inefficient.
[Figure: the hidden vector $h$ of "quick" is multiplied by the full $W'_{output}$ (columns labeled the, quick, brown, dog, ...) to produce a score for every vocabulary word, even though only a few of them matter.]
02 General Model
• Negative Sampling: resolving the bottleneck of the $h \cdot W_{output}$ matrix product and the softmax layer
§ Negative sampling can be used to solve this problem.
§ The key idea of negative sampling is to approximate the multi-class classification with binary classification.
• positive example: a context word of the target word
• negative example: a word that is not a context word of the target word
[Figure: the same full $h \cdot W'_{output}$ product and softmax over the vocabulary as on the previous slide, shown again for reference.]
02 General Model
• Negative Sampling: resolving the bottleneck of the $h \cdot W_{output}$ matrix product and the softmax layer
[Figure: with negative sampling, the hidden vector of "quick" is dotted only with the output vectors of the positive context words "the" (label = 1) and "brown" (label = 1) and of a sampled negative word "dog" (label = 0); each dot product passes through a sigmoid (e.g. 0.59, 0.52, 0.45) and a binary cross-entropy loss, and the three losses are summed into $Loss_{sample}$. This replaces the full $H_{1 \times N} \cdot W'_{N \times V}$ product and softmax over the whole vocabulary shown earlier, where $loss_{(quick,\,the)} + loss_{(quick,\,brown)} = loss_{quick}$.]
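A minimal sketch of the sampled loss in the figure (random vectors, not the slide's numbers): the positives "the" and "brown" get label 1, the sampled negative "dog" gets label 0, and the three binary cross-entropy terms are summed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, label):
    # binary cross-entropy for a single probability p and a 0/1 label
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

rng = np.random.default_rng(0)
h = rng.normal(size=3)                                  # hidden vector of target "quick"
out_vecs = {w: rng.normal(size=3) for w in ["the", "brown", "dog"]}
labels = {"the": 1, "brown": 1, "dog": 0}

loss_sample = sum(bce(sigmoid(h @ out_vecs[w]), labels[w]) for w in out_vecs)
```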
02 General Model
• Negative Sampling: resolving the bottleneck of the $h \cdot W_{output}$ matrix product and the softmax layer
§ How are negatives sampled?
• Words that appear frequently in the corpus are sampled more often, and rare words less often.
§ The sampling probability distribution is given by the equation below:
$P(w_i) = \dfrac{f(w_i)^{3/4}}{\sum_{j=1}^{W} f(w_j)^{3/4}}$, where $f(w_i) = \dfrac{\#\text{ of } w_i \text{ in corpus}}{\#\text{ of all words in corpus}}$
§ Why use 3/4?
• To make low-frequency words slightly easier to sample.
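A toy sketch of this sampling distribution (the word counts are made up for illustration): raising the unigram frequencies to the 3/4 power flattens the distribution slightly, so rare words are sampled a bit more often than their raw frequency would suggest.

```python
import numpy as np

counts = {"the": 1000, "quick": 50, "brown": 30, "fox": 5}   # assumed toy counts
total = sum(counts.values())
freqs = np.array([c / total for c in counts.values()])       # f(w_i)

probs = freqs ** 0.75            # f(w_i)^{3/4}
probs /= probs.sum()             # normalize to get P(w_i)

# draw 5 negative samples from the smoothed unigram distribution
negatives = np.random.default_rng(0).choice(list(counts), size=5, p=probs)
```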
02 General Model
• skip-gram
§ We start by briefly reviewing the skip-gram model introduced by Mikolov et al.
§ Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict the words that appear in their context.
§ It is assumed that there is no relationship among the context words $w_c$ given the target word $w_t$ (conditional independence):
$P(w_{t-1}, w_{t+1} \mid w_t) = P(w_{t-1} \mid w_t)\, P(w_{t+1} \mid w_t)$
§ Given a large training corpus represented as a sequence of words ($w_1, \dots, w_T$), the objective of the skip-gram model is to maximize the following log-likelihood:
$\prod_{t=1}^{T} \prod_{c \in C_t} p(w_c \mid w_t) \;\Longleftrightarrow\; \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)$
Notation
• $W$: vocab size
• $w \in \{1, 2, \dots, W\}$: a word is identified by its index
• $w_c$: context word
• $w_t$: target word
• $C_t$: set of indices of words surrounding word $w_t$
• $\mathcal{N}_{t,c}$: a set of negative examples sampled from the vocabulary
02 General Model
• skip-gram: Objective
§ One possible choice to define the probability of a context word is the softmax:
$P(w_c \mid w_t) = \dfrac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}$
§ The problem of predicting context words can instead be framed as a set of independent binary classification tasks.
§ The goal is then to independently predict the presence (or absence) of context words.
§ For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$\log\!\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\!\left(1 + e^{s(w_t, n)}\right)$, where $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$
Notation
• $W$: vocab size
• $w \in \{1, 2, \dots, W\}$: a word is identified by its index
• $w_c$: context word
• $w_t$: target word
• $C_t$: set of indices of words surrounding word $w_t$
• $\mathcal{N}_{t,c}$: a set of negative examples sampled from the vocabulary
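A tiny sketch of both formulations, with random vectors and assumed indices; only the formulas themselves come from the slide. It contrasts the full softmax over the vocabulary with the binary logistic loss over a few sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 1000, 50
U = rng.normal(size=(V, N), scale=0.1)   # target-word vectors u_w
Vv = rng.normal(size=(V, N), scale=0.1)  # context-word vectors v_w

t, c = 3, 7                              # assumed target and context indices
scores = U[t] @ Vv.T                     # s(w_t, j) for every j in the vocabulary
p_softmax = np.exp(scores[c]) / np.exp(scores).sum()

negatives = [11, 42, 99]                 # assumed sampled negatives N_{t,c}
nll = np.log1p(np.exp(-U[t] @ Vv[c])) \
    + sum(np.log1p(np.exp(U[t] @ Vv[n])) for n in negatives)
```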
02 General Model
• skip-gram: Negative-sampling
§ For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$\log\!\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\!\left(1 + e^{s(w_t, n)}\right)$, where $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$
§ For all target words, we can re-write the objective as:
$\sum_{t=1}^{T} \left[ \sum_{c \in C_t} \log\!\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\!\left(1 + e^{s(w_t, n)}\right) \right]$
Notation
• $W$: vocab size
• $w \in \{1, 2, \dots, W\}$: a word is identified by its index
• $w_c$: context word
• $w_t$: target word
• $C_t$: set of indices of words surrounding word $w_t$
• $\mathcal{N}_{t,c}$: a set of negative examples sampled from the vocabulary
Negative sampling results in both faster training
and more accurate representations, especially for frequent words (Mikolov et al., 2013)
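A sketch of this corpus-level objective, assuming the vector tables and the context sets $C_t$ and negative sets $\mathcal{N}_{t,c}$ are already given:

```python
import numpy as np

def pair_loss(u_t, v_c, neg_vecs):
    # log(1 + e^{-s(w_t, w_c)}) + sum over negatives of log(1 + e^{s(w_t, n)})
    return np.log1p(np.exp(-u_t @ v_c)) + sum(np.log1p(np.exp(u_t @ v_n)) for v_n in neg_vecs)

def corpus_loss(U, V, contexts, negatives):
    # contexts[t]: indices in C_t; negatives[(t, c)]: sampled indices N_{t,c}
    return sum(
        pair_loss(U[t], V[c], [V[n] for n in negatives[(t, c)]])
        for t, C_t in contexts.items()
        for c in C_t
    )
```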
02 General Model
• skip-gram: Subsampling of Frequent Words
§ In very large corpora, frequent words usually provide less information value than rare words.
§ The vector representations of frequent words do not change significantly after training on several million examples.
§ To counter the imbalance between rare and frequent words, a simple subsampling approach is used:
• each word $w_i$ in the training set is discarded with probability computed by the formula:
$P(w_i) = 1 - \sqrt{\dfrac{t}{f(w_i)}}$
$f(w_i) = \dfrac{\#\text{ of } w_i \text{ in corpus}}{\#\text{ of all words in corpus}}$, $\quad t$ = chosen threshold (recommended $10^{-5}$)
Notation
• $W$: vocab size
• $w \in \{1, 2, \dots, W\}$: a word is identified by its index
• $w_c$: context word
• $w_t$: target word
• $C_t$: set of indices of words surrounding word $w_t$
• $\mathcal{N}_{t,c}$: a set of negative examples sampled from the vocabulary
Subsampling results in both faster training
and significantly better representations of uncommon words (Mikolov et al., 2013)
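A toy sketch of the discard rule (the frequencies below are made up for illustration): very frequent words are dropped most of the time, while sufficiently rare words are always kept.

```python
import numpy as np

t = 1e-5                                   # recommended threshold from the slide
freqs = {"the": 0.05, "subword": 1e-6}     # f(w_i): assumed relative frequencies

def discard_prob(f):
    # P(w_i) = 1 - sqrt(t / f(w_i)), clamped at 0 for words rarer than t
    return max(0.0, 1.0 - np.sqrt(t / f))

for word, f in freqs.items():
    print(word, round(discard_prob(f), 4))  # "the" ≈ 0.9859, "subword" -> 0.0
```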
03 Subword Model
• SISG, FastText
§ Each word $w$ is represented as a bag of character n-grams.
§ We add special boundary symbols ‘<’ and ‘>’ at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences.
§ We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams).
§ Taking the word where and $n = 3$ as an example, it is represented by the character n-grams:
$\mathcal{G}_{where}$ = {<wh, whe, her, ere, re>, <where>}
§ We represent a word by the sum of the vector representations of its n-grams, so we obtain the scoring function:
$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c$
Notation
• $W$: vocab size
• $w \in \{1, 2, \dots, W\}$: a word is identified by its index
• $w_c$: context word
• $w_t$: target word
• $C_t$: set of indices of words surrounding word $w_t$
• $\mathcal{N}_{t,c}$: a set of negative examples sampled from the vocabulary
• $\mathcal{G}_w$: the set of subwords of a word $w$
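The n-gram extraction described above is easy to sketch in Python (n = 3 only, as in the example; the paper actually uses a range of n-gram lengths):

```python
def char_ngrams(word, n=3):
    wrapped = f"<{word}>"                  # add boundary symbols < and >
    grams = {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)                     # keep the full word "<where>" as well
    return grams

print(sorted(char_ngrams("where")))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```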
03 Subword Model
• SISG, FastText
Pseudo code for computing the loss
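As a hedged sketch of what that loss computation can look like in Python: the subword-vector table `z`, the output-vector table `v`, and the sampled negatives are assumptions, and only the loss structure follows the equations on the previous slides.

```python
import numpy as np

def char_ngrams(word, n=3):
    wrapped = f"<{word}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)} | {wrapped}

def sisg_pair_loss(target, context, negatives, z, v, n=3):
    """z: subword -> vector (z_g); v: word -> output vector (v_c)."""
    # a word is represented by the sum of its subword vectors
    u_t = sum(z[g] for g in char_ngrams(target, n))
    loss = np.log1p(np.exp(-u_t @ v[context]))                        # positive term
    loss += sum(np.log1p(np.exp(u_t @ v[neg])) for neg in negatives)  # negative terms
    return loss
```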