21. What makes this remarkable
• When the word meaning representation vectors learned by skip-gram are visualized in two dimensions, the structure below emerges.
• v(king) - v(man) + v(woman) ≈ v(queen)
[Figure: "Country and Capital Vectors Projected by PCA" — a 2D scatter plot of countries (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal) and their capitals (Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon); both axes run from -2 to 2.]
Figure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during training we did not provide any supervised information about what a capital city means.
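To make the analogy arithmetic concrete, here is a minimal Python sketch (my own illustration, not from the slides): it computes v(king) - v(man) + v(woman) over a tiny made-up embedding table and returns the nearest remaining word by cosine similarity.

```python
import numpy as np

# Toy embedding table; real skip-gram vectors would be learned from a corpus.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(query, exclude):
    """Return the vocabulary word whose vector is most cosine-similar to `query`."""
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in exclude:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best, best_sim

# v(king) - v(man) + v(woman) should land closest to v(queen).
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```

With real skip-gram vectors the same query would be run against the full vocabulary.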
25. Loss function
• The loss is measured with a ranked hinge loss [Collobert + Weston ICML08].
• Using the pivots that appear in a review d, the meaning representations are adjusted so that the prediction scores of the non-pivots contained in d become higher than those of non-pivots that do not appear in d.
boundaries. The notation (c, w) ∈ d denotes the co-occurrence of a pivot c and a non-pivot w in a document d.
We learn domain-specific word representations by maximizing the prediction accuracy of the non-pivots w that occur in the local context of a pivot c. The hinge loss, L(C_S, W_S), associated with predicting a non-pivot w in a source document d ∈ D_S that co-occurs with pivots c is given by

\sum_{d \in D_S} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_S^\top w_S + c_S^\top w^*_S\right).   (1)

Here, w^*_S is the source domain representation of a non-pivot w^* that does not occur in d. The loss function given by Eq. 1 requires that a non-pivot w that co-occurs with a pivot c in the document d is assigned a higher ranking score, as measured by the inner product between c_S and w_S, than a non-pivot w^* that does not occur in d. We randomly sample k non-pivots w^* from the set of all non-pivots that do not occur in d.
(Annotations on the equation: c_S is the representation of the pivot c in the source domain; w_S and w^*_S are the source-domain representations of the non-pivots w and w^*, with w ∈ d and w^* ∉ d.)
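A minimal numpy sketch of this ranked hinge loss for a single (pivot, non-pivot) co-occurrence, assuming the k negative non-pivots have already been sampled; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def ranked_hinge_loss(c_s, w_s, w_neg):
    """Hinge loss of Eq. 1 for one (pivot, non-pivot) pair.

    c_s   : (d,)   source-domain pivot vector
    w_s   : (d,)   source-domain vector of a non-pivot observed in document d
    w_neg : (k, d) vectors of k sampled non-pivots that do NOT occur in d
    """
    pos_score = c_s @ w_s        # score of the observed non-pivot
    neg_scores = w_neg @ c_s     # scores of the sampled negatives
    # Each negative should score at least a margin of 1 below the positive.
    return np.maximum(0.0, 1.0 - pos_score + neg_scores).sum()

rng = np.random.default_rng(0)
d, k = 5, 3
print(ranked_hinge_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```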
26. Overall loss function
L(C_S, W_S) = \sum_{d \in D_S} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_S^\top w_S + c_S^\top w^*_S\right)   (1)

L(C_T, W_T) = \sum_{d \in D_T} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_T^\top w_T + c_T^\top w^*_T\right).   (2)
Here, w^* denotes target domain non-pivots that do not occur in d, and are randomly sampled from p(w) following the same procedure as in the source domain.
The source and target loss functions given respectively by Eqs. 1 and 2 can be used on their own to independently learn source and target domain word representations. However, by definition, pivots are common to both domains. We use this property to relate the source and target word representations via a pivot-regularizer, R(C_S, C_T), defined as

R(C_S, C_T) = \frac{1}{2} \sum_{i=1}^{K} \left\| c_S^{(i)} - c_T^{(i)} \right\|^2.   (3)

Here, ||x|| represents the L2 norm of a vector x, and c^{(i)} is the i-th pivot in a total collection of K pivots. Word representations for non-pivots in the source and target domains are linked via the pivot regularizer because the non-pivots in each domain are predicted using the word representations for the pivots in that domain, which in turn are regularized by Eq. 3.
The overall objective function, L(C_S, W_S, C_T, W_T), that we minimize is the sum of the source and target loss functions, regularized via Eq. 3 with coefficient λ:

L(C_S, W_S) + L(C_T, W_T) + λ R(C_S, C_T).   (4)

Training: mini-batch stochastic gradient descent with mini-batches of 50 instances is used, and adaptive gradient (AdaGrad) (Duchi et al., 2011) is used to schedule the learning rate. Word representations are initialized to random vectors sampled from a zero-mean, unit-variance Gaussian. Although the objective in Eq. 4 is not jointly convex in all representations, it is convex with respect to the representation of a particular feature (pivot or non-pivot) when the representations for all the other features are held fixed; in our experiments the optimization converged in all cases.
The rank-based prediction objective is inspired by prior work on representation learning for a single domain (Collobert et al., 2011); however, unlike that deep neural network, the proposed method uses a single layer, similar to the skip-gram model (Mikolov et al., 2013a), thereby reducing the number of parameters that must be learnt.
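A small numpy sketch (illustrative only) of how the overall objective of Eq. 4 combines the two hinge losses with the pivot regularizer; `lam` stands for the coefficient λ, and the pivot matrices are assumed to be aligned row-wise by pivot index.

```python
import numpy as np

def pivot_regularizer(C_s, C_t):
    """R(C_S, C_T) of Eq. 3: half the sum of squared L2 distances between
    the source and target representations of each of the K pivots."""
    return 0.5 * np.sum((C_s - C_t) ** 2)

def overall_objective(loss_src, loss_tgt, C_s, C_t, lam=1.0):
    """Eq. 4: source hinge loss + target hinge loss + lambda * pivot regularizer."""
    return loss_src + loss_tgt + lam * pivot_regularizer(C_s, C_t)

rng = np.random.default_rng(0)
K, d = 4, 5                                   # 4 pivots, 5-dimensional representations
C_s, C_t = rng.normal(size=(K, d)), rng.normal(size=(K, d))
print(overall_objective(loss_src=2.3, loss_tgt=1.7, C_s=C_s, C_t=C_t, lam=0.5))
```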
27. Cross-domain sentiment classification: results
[Figure 1: Accuracies obtained by different methods for each source-target pair in cross-domain sentiment classification. Panels: E->B, D->B, K->B; B->E, D->E, K->E; B->D, E->D, K->D; B->K, E->K, D->K. Methods: NA, GloVe, SFA, SCL, CS, Proposed. Accuracies range roughly from 50% to 90%.]
The performance differences reported in Figure 1 are directly attributable to the domain adaptation or word-representation learning methods compared: all methods use L2-regularized logistic regression as the binary sentiment classifier, with the regularization coefficients tuned to their optimal values.
Using the right meaning representation for each domain improves sentiment classification performance!
29. Learning method
[Figure 1: A relational graph between three words (ostrich, bird, penguin), whose edges are labelled with lexical patterns and weights: "X is a large Y" [0.8], "X is a Y" [0.7], and "both X and Y are flightless" [0.5].]
Automatically extracted ontologies can be represented as relational graphs. Consider the relational graph shown in Figure 1. For example, let us assume that we observed the context "ostrich is a large bird that lives in Africa" in a corpus. Then, we extract the lexical pattern "X is a large Y" between ostrich and bird from this context and include it in the relational graph by adding two vertices each for ostrich and bird, and an edge from ostrich to bird. Such lexical patterns have been used for related tasks such as measuring semantic similarity between words.
[Slide annotation: each word is assigned a vector (e.g. x_ostrich, x_bird) and each edge label is assigned a matrix (e.g. G_large, G_is-a, G_flightless).]
Repeating this process for the contexts "both ostrich and penguin are flightless birds" and "penguin is a bird" will result in the relational graph shown in Figure 1.
Learning Word Representations
Given a relational graph as the input, we learn d-dimensional vector representations for each vertex in the graph. The dimensionality d of the vector space is a pre-defined parameter of the method, and by adjusting it one can obtain word representations at different granularities. Let us consider two vertices u and v connected by an edge with label l and weight w. We represent the two words u and v respectively by two vectors x(u), x(v) ∈ R^d, and the label l by a matrix G(l) ∈ R^{d×d}. We model the problem of learning optimal word representations x̂(u) and pattern representations Ĝ(l) as the solution to the following squared loss minimisation problem:

\underset{x(u) \in \mathbb{R}^d,\; G(l) \in \mathbb{R}^{d \times d}}{\mathrm{argmin}} \;\; \frac{1}{2} \sum_{(u,v,l,w) \in E} \left( x(u)^\top G(l)\, x(v) - w \right)^2.   (1)

The objective function given by Eq. 1 is jointly non-convex in both the word representations x(u) (or alternatively x(v)) and the pattern representations G(l). However, if G(l) is positive semidefinite and one of the two variables is held fixed, the objective becomes convex in the remaining variable.
(Annotations on Eq. 1: G(l) is the relation matrix, x(u) and x(v) are the word meaning representation vectors, the outer square is the squared error, and w is the co-occurrence strength of the triple u, v, l.)
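A minimal numpy sketch (my own illustration, not the authors' code) of the squared loss in Eq. 1 for a toy relational graph: edges are (u, v, label, weight) tuples and every edge label gets its own d×d matrix.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

# Word vectors x(u) and relation matrices G(l), randomly initialized.
words = {w: rng.normal(scale=0.1, size=d) for w in ["ostrich", "bird", "penguin"]}
labels = {l: rng.normal(scale=0.1, size=(d, d)) for l in ["is-a", "is-a-large", "flightless"]}

# Edges of the relational graph: (u, v, label, co-occurrence strength w).
edges = [
    ("ostrich", "bird", "is-a-large", 0.8),
    ("penguin", "bird", "is-a", 0.7),
    ("ostrich", "penguin", "flightless", 0.5),
]

def objective(words, labels, edges):
    """Eq. 1: 1/2 * sum over edges of (x(u)^T G(l) x(v) - w)^2."""
    total = 0.0
    for u, v, l, w in edges:
        score = words[u] @ labels[l] @ words[v]
        total += (score - w) ** 2
    return 0.5 * total

print(objective(words, labels, edges))
```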
30. Optimization
• The objective function in Eq. 1 (previous slide) is non-convex in the variables x(u), G(l), and x(v) jointly.
• However, if any two of these variables are held fixed, the objective becomes convex in the remaining variable (provided that G(l) is positive semidefinite).
• Therefore, the objective can be optimized by taking partial derivatives with respect to each variable in turn and applying stochastic gradient descent, as sketched below.
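A rough sketch of one such update (assumptions mine: plain SGD with a fixed learning rate, a single edge, and no projection of G(l) onto the positive semidefinite cone). Each partial derivative is taken with the other variables held fixed, as described above.

```python
import numpy as np

def sgd_step(xu, xv, G, w, lr=0.01):
    """One gradient step on 1/2 * (x(u)^T G x(v) - w)^2 for a single edge."""
    r = xu @ G @ xv - w                      # residual of the bilinear score
    xu_new = xu - lr * r * (G @ xv)          # partial derivative w.r.t. x(u)
    xv_new = xv - lr * r * (G.T @ xu)        # partial derivative w.r.t. x(v)
    G_new  = G  - lr * r * np.outer(xu, xv)  # partial derivative w.r.t. G(l)
    return xu_new, xv_new, G_new

rng = np.random.default_rng(1)
d = 4
xu, xv = rng.normal(scale=0.5, size=d), rng.normal(scale=0.5, size=d)
G = np.eye(d)
for _ in range(500):
    xu, xv, G = sgd_step(xu, xv, G, w=0.8)
print(xu @ G @ xv)   # should approach the target weight 0.8
```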
32. Deriving relations from words
• v(king) - v(man) should represent the relation between king and man; if it did not, analogy questions could not be answered (relational similarity could not be measured).
• If so, taking the difference between the meaning representation vectors of word pairs linked by a particular relation should let us build a representation of that relation. [Bollegala+ IJCAI-15]
|R(p)| = \sum_{(u,v) \in R(p)} f(p, u, v)   (3)

We represent a word x using a vector x ∈ R^d. The dimensionality of the representation, d, is a hyperparameter of the proposed method. Prior work on word representation learning has observed that the difference between the vectors that represent two words closely approximates the semantic relations that exist between those two words. For example, the vector v(king) - v(queen) has been shown to be similar to the vector v(man) - v(woman). We use this property to represent a pattern p by a vector p ∈ R^d as the weighted sum of differences between the two words in all word-pairs (u, v) that co-occur with p as follows:

p = \frac{1}{|R(p)|} \sum_{(u,v) \in R(p)} f(p, u, v)\,(u - v).   (4)
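An illustrative numpy sketch of Eq. 4: a pattern vector built as the PPMI-weighted average of word-vector differences over the pairs in R(p). The vectors, pairs, and weights are invented for the example.

```python
import numpy as np

# Toy word vectors (learned in the actual method; fixed here for illustration).
vec = {
    "lion":    np.array([1.0, 0.2, 0.1]),
    "cat":     np.array([0.6, 0.1, 0.1]),
    "ostrich": np.array([0.2, 1.0, 0.3]),
    "bird":    np.array([0.1, 0.6, 0.2]),
}

# R(p) for the pattern "large Ys such as Xs": word pairs with PPMI weights f(p, u, v).
R_p = [(("lion", "cat"), 0.9), (("ostrich", "bird"), 0.4)]

def pattern_vector(R_p, vec):
    """Eq. 4: p = (1/|R(p)|) * sum_{(u,v) in R(p)} f(p,u,v) * (u - v)."""
    norm = sum(f for _, f in R_p)                        # |R(p)| = sum of the weights
    p = sum(f * (vec[u] - vec[v]) for (u, v), f in R_p)
    return p / norm

print(pattern_vector(R_p, vec))
```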
33. Learning the meaning representations
[Figure 1: Computing the similarity between two patterns. The word pairs (lion, cat) and (ostrich, bird), represented by vectors x1, x2, x3, x4, co-occur with the patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y" with weights f(p1, x1, x2) and f(p2, x3, x4); the pattern similarity is scored by σ(p1ᵀp2).]
For example, consider Fig. 1, where the two word-pairs (lion, cat) and (ostrich, bird) co-occur respectively with the two lexical patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y". Assuming that there are no other co-occurrences between word-pairs and patterns in the corpus, the representations of the patterns p1 and p2 are given respectively by p1 = x1 - x2 and p2 = x3 - x4. We measure the relational similarity between (x1, x2) and (x3, x4) using the inner product p1ᵀp2.

(Slide annotations: a relation is represented as a set of lexical patterns, and the relation between u and v is given by the "subtraction" of their meaning representation vectors.)
(i.e., the sequence of tokens that appear in between the two words in a context). Although we use lexical patterns as features for representing semantic relations in this work, our proposed method is not limited to lexical patterns and can in principle be used with any type of features to represent relations. The strength of association between a word pair (u, v) and a pattern p is measured using the positive pointwise mutual information (PPMI), f(p, u, v), which is defined as follows:

f(p, u, v) = \max\left(0,\; \log \frac{g(p, u, v)\, g(*, *, *)}{g(p, *, *)\, g(*, u, v)}\right).   (1)

Here, g(p, u, v) denotes the number of co-occurrences between p and (u, v), and * denotes the summation taken over all words (or patterns) corresponding to the slot variable. We represent a pattern p by the set R(p) of word-pairs (u, v) for which f(p, u, v) > 0. Formally, we define R(p) and its norm |R(p)| as follows:

R(p) = \{(u, v) \mid f(p, u, v) > 0\}   (2)

|R(p)| = \sum_{(u,v) \in R(p)} f(p, u, v)   (3)
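A small sketch of the PPMI weight f(p, u, v) of Eq. 1, assuming the co-occurrence counts g(p, u, v) have already been collected into a dictionary keyed by (pattern, word-pair); the counts are hypothetical.

```python
import math

# Hypothetical co-occurrence counts g(p, u, v) between patterns and word pairs.
counts = {
    ("large Ys such as Xs", ("lion", "cat")): 8,
    ("large Ys such as Xs", ("ostrich", "bird")): 2,
    ("X is a huge Y", ("ostrich", "bird")): 5,
}

def ppmi(counts, p, pair):
    """f(p, u, v) = max(0, log(g(p,u,v) * g(*,*,*) / (g(p,*,*) * g(*,u,v))))."""
    g_puv = counts.get((p, pair), 0)
    if g_puv == 0:
        return 0.0
    g_all = sum(counts.values())                                # g(*,*,*)
    g_p = sum(c for (q, _), c in counts.items() if q == p)      # g(p,*,*)
    g_uv = sum(c for (_, w), c in counts.items() if w == pair)  # g(*,u,v)
    return max(0.0, math.log((g_puv * g_all) / (g_p * g_uv)))

print(ppmi(counts, "large Ys such as Xs", ("lion", "cat")))
```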
We model the problem of learning word representations as a binary classification task, where we learn representations for words such that they can be used to accurately predict whether a given pair of patterns is relationally similar. In our previous example, we would learn representations for the four words lion, cat, ostrich, and bird such that the similarity between the two patterns "large Ys such as Xs" and "X is a huge Y" is maximized. Later in Section 3.1, we propose an unsupervised method for selecting relationally similar (positive) and dissimilar (negative) pairs of patterns as training instances to train a binary classifier.
Let us denote the target label for two patterns p1, p2 by t(p1, p2) ∈ {1, 0}, where the value 1 indicates that p1 and p2 are relationally similar, and 0 otherwise. We compute the prediction loss for a pair of patterns (p1, p2) as the squared loss between the target and the predicted labels as follows:

L(t(p_1, p_2), p_1, p_2) = \frac{1}{2} \left( t(p_1, p_2) - \sigma(p_1^\top p_2) \right)^2.   (5)

Different non-linear functions can be used as the prediction function σ(·), such as the logistic sigmoid, hyperbolic tangent, or rectified linear units. In our preliminary experiments we found the hyperbolic tangent, tanh, given by

\sigma(\theta) = \tanh(\theta) = \frac{\exp(\theta) - \exp(-\theta)}{\exp(\theta) + \exp(-\theta)}   (6)

to work particularly well among those different non-linear functions.
The gradients of the loss with respect to the pattern representations are given by:

\frac{\partial L}{\partial p_1} = \sigma'(p_1^\top p_2)\,\left(\sigma(p_1^\top p_2) - t(p_1, p_2)\right)\, p_2,   (8)

\frac{\partial L}{\partial p_2} = \sigma'(p_1^\top p_2)\,\left(\sigma(p_1^\top p_2) - t(p_1, p_2)\right)\, p_1.   (9)

Here, σ' denotes the first derivative of tanh, which is given by 1 - \sigma(\theta)^2. To simplify the notation we drop the arguments of the loss function.
From Eq. 4 we get:

\frac{\partial p_1}{\partial x} = \frac{1}{|R(p_1)|}\left(h(p_1, u = x, v) - h(p_1, u, v = x)\right),   (10)

\frac{\partial p_2}{\partial x} = \frac{1}{|R(p_2)|}\left(h(p_2, u = x, v) - h(p_2, u, v = x)\right),   (11)

where

h(p, u = x, v) = \sum_{(x,v) \in \{(u,v) \mid (u,v) \in R(p),\, u = x\}} f(p, x, v),

and

h(p, u, v = x) = \sum_{(u,x) \in \{(u,v) \mid (u,v) \in R(p),\, v = x\}} f(p, u, x).

Substituting the partial derivatives given by Eqs. 8-11 in Eq. 7 we get:

\frac{\partial L}{\partial x} = \delta(p_1, p_2)\left[ H(p_1, x) \sum_{(u,v) \in R(p_2)} f(p_2, u, v)(u - v) + H(p_2, x) \sum_{(u,v) \in R(p_1)} f(p_1, u, v)(u - v) \right],   (12)

where the scalar factor δ(p_1, p_2) collects the σ' and (σ - t) terms from Eqs. 8 and 9, and H(p, x) collects the normalized pair-weight coefficients from Eqs. 10 and 11.
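A compact numpy sketch (mine, for illustration) of the squared loss of Eq. 5 with the tanh prediction function of Eq. 6, together with the pattern-level gradients of Eqs. 8 and 9.

```python
import numpy as np

def loss_and_grads(p1, p2, t):
    """Squared loss (Eq. 5) with sigma = tanh (Eq. 6) and its gradients
    with respect to the pattern vectors p1 and p2 (Eqs. 8 and 9)."""
    s = np.tanh(p1 @ p2)                 # predicted relational similarity
    loss = 0.5 * (t - s) ** 2
    d_sigma = 1.0 - s ** 2               # first derivative of tanh
    grad_p1 = d_sigma * (s - t) * p2     # Eq. 8
    grad_p2 = d_sigma * (s - t) * p1     # Eq. 9
    return loss, grad_p1, grad_p2

p1 = np.array([0.4, -0.2, 0.1])          # e.g. x(lion) - x(cat)
p2 = np.array([0.3, -0.1, 0.2])          # e.g. x(ostrich) - x(bird)
print(loss_and_grads(p1, p2, t=1))       # the two patterns are relationally similar
```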
36. JointReps
• Predict the words that co-occur within the same sentence in the corpus, and minimize the resulting error (the objective function).
• Incorporate the semantic relations defined in a lexicon (WordNet) as constraints.
We then extract unigrams from the co-occurrence windows as the corresponding context words. We down-weight distant (and potentially noisy) co-occurrences using the reciprocal 1/l of the distance in tokens l between the two words that co-occur.
A word w_i is assigned two vectors w_i and w̃_i denoting whether w_i is respectively the target of the prediction (corresponding to the rows of X), or in the context of another word (corresponding to the columns of X). The GloVe objective can then be written as:

J_C = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2   (1)

Here, b_i and b̃_j are real-valued scalar bias terms that adjust for the difference between the inner product and the logarithm of the co-occurrence counts. The function f discounts the co-occurrences between frequent words and is given by:

f(t) = \begin{cases} (t / t_{\max})^{\alpha} & \text{if } t < t_{\max} \\ 1 & \text{otherwise} \end{cases}   (2)
We set t_max = 100 in our experiments. The objective function defined by (1) encourages the learning of word representations that demonstrate the desirable property that the vector difference between the word embeddings for two words represents the semantic relations that exist between those two words. For example, Mikolov et al. [2013c] observed that the difference between the word embeddings for the words king and man, when added to the word embedding for the word woman, yields a vector similar to that of queen.
Unfortunately, the objective function given by (1) does not capture the semantic relations that exist between w_i and w_j as specified in the lexicon S. Consequently, it considers all co-occurrences equally and is likely to encounter problems when the co-occurrences are rare. To overcome this problem we propose a regularizer, J_S, by considering the three-way co-occurrence among words w_i, w_j, and a semantic relation R that exists between the target word w_i and one of its context words w_j in the lexicon as follows:

J_S = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} R(i, j) \left\| w_i - \tilde{w}_j \right\|^2   (3)

Here, R(i, j) is a binary function that returns 1 if the semantic relation R exists between the words w_i and w_j in the lexicon, and 0 otherwise.
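A toy numpy sketch (my illustration; the coefficient `lam`, the matrices, and the relation pairs are invented) of how the corpus objective J_C of Eq. 1 and the lexical regularizer J_S of Eq. 3 can be combined into one joint objective.

```python
import numpy as np

def glove_objective(W, W_ctx, b, b_ctx, X, t_max=100.0, alpha=0.75):
    """J_C of Eq. 1 with the frequency discount f of Eq. 2."""
    J = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] == 0:
                continue
            f = (X[i, j] / t_max) ** alpha if X[i, j] < t_max else 1.0
            err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
            J += 0.5 * f * err ** 2
    return J

def lexicon_regularizer(W, W_ctx, R):
    """J_S of Eq. 3: pull together the target/context vectors of lexicon-related words."""
    return 0.5 * sum(np.sum((W[i] - W_ctx[j]) ** 2) for i, j in R)

rng = np.random.default_rng(0)
V, d = 5, 3
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
X = rng.integers(0, 10, size=(V, V)).astype(float)   # toy co-occurrence counts
R = [(0, 1), (2, 3)]                                  # toy synonym pairs from a lexicon
lam = 0.1
print(glove_objective(W, W_ctx, b, b_ctx, X) + lam * lexicon_regularizer(W, W_ctx, R))
```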
37. Measuring semantic similarity between words
Table 1: Performance of the proposed method with different semantic relation types.
Method RG MC RW SCWS MEN sem syn total SemEval
corpus only 0.7523 0.6398 0.2708 0.460 0.6933 61.49 66.00 63.95 37.98
Synonyms 0.7866 0.7019 0.2731 0.4705 0.7090 61.46 69.33 65.76 38.65
Antonyms 0.7694 0.6417 0.2730 0.4644 0.6973 61.64 66.66 64.38 38.01
Hypernyms 0.7759 0.6713 0.2638 0.4554 0.6987 61.22 68.89 65.41 38.21
Hyponyms 0.7660 0.6324 0.2655 0.4570 0.6972 61.38 68.28 65.15 38.30
Member-holonyms 0.7681 0.6321 0.2743 0.4604 0.6952 61.69 66.36 64.24 37.95
Member-meronyms 0.7701 0.6223 0.2739 0.4611 0.6963 61.61 66.31 64.17 37.98
Part-holonyms 0.7852 0.6841 0.2732 0.4650 0.7007 61.44 67.34 64.66 38.07
Part-meronyms 0.7786 0.6691 0.2761 0.4679 0.7005 61.66 67.11 64.63 38.29
The Google analogy dataset contains syntactic (syn) analogies and 8869 semantic analogies (sem). The SemEval dataset contains manually ranked word-pairs describing various semantic relation types, such as defective and agent-goal; in total there are 3218 word-pairs in the SemEval dataset. Given a proportional analogy a : b :: c : d, we compute the cosine similarity between b - a + c and d, where the boldface symbols represent the embeddings of the corresponding words. For the Google dataset we measure the accuracy for predicting the fourth word in each proportional analogy from the entire vocabulary. We use the binomial exact test with Clopper-Pearson confidence intervals to test for the statistical significance of the reported accuracy values. For SemEval we use the official evaluation tool to compute MaxDiff scores.
Table 2: Comparison against prior work.
Method RG MEN sem syn
RCM 0.471 0.501 - 29.9
R-NET - - 32.64 43.46
C-NET - - 37.07 40.06
RC-NET - - 34.36 44.42
Retro (CBOW) 0.577 0.605 36.65 52.5
Retro (SG) 0.745 0.657 45.29 65.65
Retro (corpus only) 0.786 0.673 61.11 68.14
Proposed (synonyms) 0.787 0.709 61.46 69.33
Evaluation uses the Spearman correlation between human-assigned similarity scores and the similarity scores produced by the algorithm.
Various semantic relations can be used as constraints; the synonymy relation is the most effective.
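A short scipy sketch of this evaluation protocol (the word pairs and human ratings are hypothetical): cosine similarities from the embeddings are compared against human ratings using Spearman's rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical embeddings and human similarity ratings for a few word pairs.
emb = {
    "car":   np.array([0.9, 0.1, 0.2]),
    "auto":  np.array([0.8, 0.2, 0.1]),
    "fruit": np.array([0.1, 0.9, 0.3]),
    "apple": np.array([0.2, 0.8, 0.4]),
}
pairs = [("car", "auto", 3.9), ("fruit", "apple", 3.5), ("car", "fruit", 0.5)]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

human = [score for _, _, score in pairs]
model = [cosine(emb[u], emb[v]) for u, v, _ in pairs]
rho, _ = spearmanr(human, model)   # rank correlation between humans and the model
print(rho)
```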