Paper Reading: Enriching Word Vectors with Subword Information (2016)

Enriching Word Vectors with Subword Information (2016)
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov
@Mikibear_ paper summary 161226

Let's capture the internal structure of words in Word2Vec

'Distributed Representations of Words and Phrases and their Compositionality' (Mikolov et al., 2013)... (the extremely famous one)

Limitations of Mikolov et al. (2013)?
“In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich
languages, such as Turkish or Finnish. These languages contain many words that occur rarely, making it difficult to
learn good word-level representations.”
-> Because the model ignores the internal structure of words, it works relatively poorly on morphologically rich languages.

Solution
Go back to something we take for granted.
Specifically, the score function.

'Distributed Representations of Words and Phrases and their Compositionality' (Mikolov et al., 2013)

Mikolov et al. (2013) vs. Bojanowski et al. (2016)

Bojanowski et al. introduce a score function defined over the set of n-grams that appear in the word w.
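For reference, the two score functions contrasted on this slide can be written roughly as below. The symbols are stated here as assumptions about notation: u_w and v_c are the word and context vectors, G_w is the set of n-grams appearing in word w, and z_g is the vector assigned to n-gram g.

```latex
% Skip-gram score (Mikolov et al., 2013): one vector per word
s(w, c) = \mathbf{u}_w^{\top} \mathbf{v}_c

% Subword score (Bojanowski et al., 2016): sum over the n-grams of w
s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c
```

The only change is on the word side: the single vector u_w is replaced by a sum of n-gram vectors, so words that share n-grams also share parameters.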

Bojanowski et al. (2016)
Subword Model
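A minimal sketch of how the n-gram set of a word could be built, assuming character n-grams with `<` and `>` as word-boundary symbols and the whole word kept as one extra sequence, as described in the paper. The function name `char_ngrams` and its parameters are hypothetical, chosen only for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, plus the whole word.

    Hypothetical helper: boundary symbols '<' and '>' are added so that
    prefixes and suffixes are distinguishable from word-internal n-grams.
    """
    marked = f"<{word}>"
    grams = {marked}  # the full word is kept as a special sequence
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


# With n fixed to 3, "where" yields five trigrams plus the full word.
print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# → ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

Note that the trigram `her` taken from `where` is a different unit from the full word `<her>`, which is why the boundary symbols matter.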

Why?
"This simple model allows sharing the representations across words, thus allowing
to learn reliable representation for rare words."
In other words, it helps place the embeddings of rare words in the right position.
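The sharing described in the quote can be sketched as follows. This is a toy illustration, not the authors' implementation: the vectors are random, the dimensionality is arbitrary, and `char_ngrams`, `z`, `dot`, and `score` are hypothetical names. It only shows that two related words reuse some of the same n-gram vectors, so whatever is learned for one also moves the other.

```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` with boundary symbols, plus the whole word."""
    marked = f"<{word}>"
    grams = {marked}
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


DIM = 4                 # toy dimensionality, chosen only for illustration
rng = random.Random(0)  # fixed seed so the sketch is repeatable
ngram_vectors = {}      # hypothetical lookup table: n-gram -> vector z_g


def z(gram):
    """Return the vector of an n-gram, creating a random one on first use."""
    if gram not in ngram_vectors:
        ngram_vectors[gram] = [rng.uniform(-1, 1) for _ in range(DIM)]
    return ngram_vectors[gram]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(word, context_vector):
    """s(w, c): sum over the n-grams g of w of the dot product z_g . v_c."""
    return sum(dot(z(g), context_vector) for g in char_ngrams(word))


# N-grams shared by the two words are backed by the same vectors.
shared = char_ngrams("eating") & char_ngrams("eats")
print(sorted(shared))
# → ['<ea', '<eat', 'eat']
```

If "eats" were rare and "eating" frequent, the vectors for `<ea`, `<eat`, and `eat` would still be trained by every occurrence of "eating".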

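For example...
Caution! This is a heavily simplified example (w = 3, n = 3). Real Korean text is usually not processed this way.
Sentence: 철수는 밥을 먹었을 것 같다 ("Cheolsu seems to have eaten")
Context window of 먹었을 -> {밥을, 먹었을, 것}
N-grams containing 먹었을 -> {철수는, 밥을, 먹었을}, {밥을, 먹었을, 것}, {먹었을, 것, 같다}
'Taking n-grams into account will help the embedding!'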
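The simplified example above can be reproduced with a short sketch. It assumes that w = 3 means a three-token window centered on the target and that n = 3 means word-level trigrams; both readings are an interpretation of the slide, and the function names are hypothetical. The paper itself works with character n-grams inside a word rather than word-level n-grams.

```python
def window(tokens, target, size=3):
    """Tokens in a window of `size` centered on the first occurrence of `target`."""
    i = tokens.index(target)
    half = size // 2
    return tokens[max(0, i - half): i + half + 1]


def ngrams_containing(tokens, target, n=3):
    """All word-level n-grams of the sentence that include `target`."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [g for g in grams if target in g]


tokens = "철수는 밥을 먹었을 것 같다".split()

# One view of the target word: a single context window.
print(window(tokens, "먹었을"))

# Three views of the same word: every trigram that contains it.
for gram in ngrams_containing(tokens, "먹었을"):
    print(gram)
```

The target word appears in one window but in three n-grams, which is the sense in which the model gets to see it more often.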

Personal opinion:
Isn't this a kind of data augmentation?
"How can we embed rare or morphologically varied words well?"
-> "Don't let the model see each word only once; let it see and learn from the word many more times through its n-grams!"

Experiments
Overall, on the English corpus, accuracy on syntactic relations rises slightly while accuracy on semantic relations drops.
However, on Czech (a morphologically rich language), accuracy on syntactic relations rises dramatically.
