21. What makes this remarkable
• When the word meaning representation vectors learned by skip-gram are visualized in two dimensions, the structure below emerges.
• v(king) - v(man) + v(woman) ≈ v(queen)
[Figure: "Country and Capital Vectors Projected by PCA" — a 2D scatter plot of countries (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal) and their capitals (Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon); both axes run from -2 to 2.]
Figure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during training we did not provide any supervised information about what a capital city means.
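To make the analogy arithmetic concrete, here is a minimal Python sketch (my own illustration, not from the slides): it computes v(king) - v(man) + v(woman) over a tiny made-up embedding table and returns the nearest remaining word by cosine similarity.

```python
import numpy as np

# Toy embedding table; real skip-gram vectors would be learned from a corpus.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(query, exclude):
    """Return the vocabulary word whose vector is most cosine-similar to `query`."""
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in exclude:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best, best_sim

# v(king) - v(man) + v(woman) should land closest to v(queen).
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```

With real skip-gram vectors the same query would be run against the full vocabulary.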
25. Loss function
• The loss is measured with a ranked hinge loss [Collobert + Weston ICML08].
• Using the pivots that appear in a review d, the meaning representations are adjusted so that the prediction scores of the non-pivots contained in d become higher than those of non-pivots that do not appear in d.
boundaries. The notation (c, w) ∈ d denotes the co-occurrence of a pivot c and a non-pivot w in a document d.
We learn domain-specific word representations by maximizing the prediction accuracy of the non-pivots w that occur in the local context of a pivot c. The hinge loss, L(C_S, W_S), associated with predicting a non-pivot w in a source document d ∈ D_S that co-occurs with pivots c is given by

\sum_{d \in D_S} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_S^\top w_S + c_S^\top w^*_S\right).   (1)

Here, w^*_S is the source domain representation of a non-pivot w^* that does not occur in d. The loss function given by Eq. 1 requires that a non-pivot w that co-occurs with a pivot c in the document d is assigned a higher ranking score, as measured by the inner product between c_S and w_S, than a non-pivot w^* that does not occur in d. We randomly sample k non-pivots w^* from the set of all non-pivots that do not occur in d.
(Annotations on the equation: c_S is the representation of the pivot c in the source domain; w_S and w^*_S are the source-domain representations of the non-pivots w and w^*, with w ∈ d and w^* ∉ d.)
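A minimal numpy sketch of this ranked hinge loss for a single (pivot, non-pivot) co-occurrence, assuming the k negative non-pivots have already been sampled; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def ranked_hinge_loss(c_s, w_s, w_neg):
    """Hinge loss of Eq. 1 for one (pivot, non-pivot) pair.

    c_s   : (d,)   source-domain pivot vector
    w_s   : (d,)   source-domain vector of a non-pivot observed in document d
    w_neg : (k, d) vectors of k sampled non-pivots that do NOT occur in d
    """
    pos_score = c_s @ w_s        # score of the observed non-pivot
    neg_scores = w_neg @ c_s     # scores of the sampled negatives
    # Each negative should score at least a margin of 1 below the positive.
    return np.maximum(0.0, 1.0 - pos_score + neg_scores).sum()

rng = np.random.default_rng(0)
d, k = 5, 3
print(ranked_hinge_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```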
26. Overall loss function
L(C_S, W_S) = \sum_{d \in D_S} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_S^\top w_S + c_S^\top w^*_S\right)   (1)

L(C_T, W_T) = \sum_{d \in D_T} \sum_{(c,w) \in d} \sum_{w^* \sim p(w)} \max\left(0,\; 1 - c_T^\top w_T + c_T^\top w^*_T\right).   (2)
Here, w^* denotes target domain non-pivots that do not occur in d, and are randomly sampled from p(w) following the same procedure as in the source domain.
The source and target loss functions given respectively by Eqs. 1 and 2 can be used on their own to independently learn source and target domain word representations. However, by definition, pivots are common to both domains. We use this property to relate the source and target word representations via a pivot-regularizer, R(C_S, C_T), defined as

R(C_S, C_T) = \frac{1}{2} \sum_{i=1}^{K} \left\| c_S^{(i)} - c_T^{(i)} \right\|^2.   (3)

Here, ||x|| represents the L2 norm of a vector x, and c^{(i)} is the i-th pivot in a total collection of K pivots. Word representations for non-pivots in the source and target domains are linked via the pivot regularizer because the non-pivots in each domain are predicted using the word representations for the pivots in that domain, which in turn are regularized by Eq. 3.
The overall objective function, L(C_S, W_S, C_T, W_T), that we minimize is the sum of the source and target loss functions, regularized via Eq. 3 with coefficient λ:

L(C_S, W_S) + L(C_T, W_T) + λ R(C_S, C_T).   (4)

Training: mini-batch stochastic gradient descent with mini-batches of 50 instances is used, and adaptive gradient (AdaGrad) (Duchi et al., 2011) is used to schedule the learning rate. Word representations are initialized to random vectors sampled from a zero-mean, unit-variance Gaussian. Although the objective in Eq. 4 is not jointly convex in all representations, it is convex with respect to the representation of a particular feature (pivot or non-pivot) when the representations for all the other features are held fixed; in our experiments the optimization converged in all cases.
The rank-based prediction objective is inspired by prior work on representation learning for a single domain (Collobert et al., 2011); however, unlike that deep neural network, the proposed method uses a single layer, similar to the skip-gram model (Mikolov et al., 2013a), thereby reducing the number of parameters that must be learnt.
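A small numpy sketch (illustrative only) of how the overall objective of Eq. 4 combines the two hinge losses with the pivot regularizer; `lam` stands for the coefficient λ, and the pivot matrices are assumed to be aligned row-wise by pivot index.

```python
import numpy as np

def pivot_regularizer(C_s, C_t):
    """R(C_S, C_T) of Eq. 3: half the sum of squared L2 distances between
    the source and target representations of each of the K pivots."""
    return 0.5 * np.sum((C_s - C_t) ** 2)

def overall_objective(loss_src, loss_tgt, C_s, C_t, lam=1.0):
    """Eq. 4: source hinge loss + target hinge loss + lambda * pivot regularizer."""
    return loss_src + loss_tgt + lam * pivot_regularizer(C_s, C_t)

rng = np.random.default_rng(0)
K, d = 4, 5                                   # 4 pivots, 5-dimensional representations
C_s, C_t = rng.normal(size=(K, d)), rng.normal(size=(K, d))
print(overall_objective(loss_src=2.3, loss_tgt=1.7, C_s=C_s, C_t=C_t, lam=0.5))
```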
27. Cross-domain sentiment classification: results
[Figure 1: Accuracies obtained by different methods for each source-target pair in cross-domain sentiment classification. Panels: E->B, D->B, K->B; B->E, D->E, K->E; B->D, E->D, K->D; B->K, E->K, D->K. Methods: NA, GloVe, SFA, SCL, CS, Proposed. Accuracies range roughly from 50% to 90%.]
The performance differences reported in Figure 1 are directly attributable to the domain adaptation or word-representation learning methods compared: all methods use L2-regularized logistic regression as the binary sentiment classifier, with the regularization coefficients tuned to their optimal values.
Using the right meaning representation for each domain improves sentiment classification performance!
29. Learning method
[Figure 1: A relational graph between three words (ostrich, bird, penguin), whose edges are labelled with lexical patterns and weights: "X is a large Y" [0.8], "X is a Y" [0.7], and "both X and Y are flightless" [0.5].]
Automatically extracted ontologies can be represented as relational graphs. Consider the relational graph shown in Figure 1. For example, let us assume that we observed the context "ostrich is a large bird that lives in Africa" in a corpus. Then, we extract the lexical pattern "X is a large Y" between ostrich and bird from this context and include it in the relational graph by adding two vertices each for ostrich and bird, and an edge from ostrich to bird. Such lexical patterns have been used for related tasks such as measuring semantic similarity between words.
[Slide annotation: each word is assigned a vector (e.g. x_ostrich, x_bird) and each edge label is assigned a matrix (e.g. G_large, G_is-a, G_flightless).]
Repeating this process for the contexts "both ostrich and penguin are flightless birds" and "penguin is a bird" will result in the relational graph shown in Figure 1.
Learning Word Representations
Given a relational graph as the input, we learn d-dimensional vector representations for each vertex in the graph. The dimensionality d of the vector space is a pre-defined parameter of the method, and by adjusting it one can obtain word representations at different granularities. Let us consider two vertices u and v connected by an edge with label l and weight w. We represent the two words u and v respectively by two vectors x(u), x(v) ∈ R^d, and the label l by a matrix G(l) ∈ R^{d×d}. We model the problem of learning optimal word representations x̂(u) and pattern representations Ĝ(l) as the solution to the following squared loss minimisation problem:

\underset{x(u) \in \mathbb{R}^d,\; G(l) \in \mathbb{R}^{d \times d}}{\mathrm{argmin}} \;\; \frac{1}{2} \sum_{(u,v,l,w) \in E} \left( x(u)^\top G(l)\, x(v) - w \right)^2.   (1)

The objective function given by Eq. 1 is jointly non-convex in both the word representations x(u) (or alternatively x(v)) and the pattern representations G(l). However, if G(l) is positive semidefinite and one of the two variables is held fixed, the objective becomes convex in the remaining variable.
(Annotations on Eq. 1: G(l) is the relation matrix, x(u) and x(v) are the word meaning representation vectors, the outer square is the squared error, and w is the co-occurrence strength of the triple u, v, l.)
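A minimal numpy sketch (my own illustration, not the authors' code) of the squared loss in Eq. 1 for a toy relational graph: edges are (u, v, label, weight) tuples and every edge label gets its own d×d matrix.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

# Word vectors x(u) and relation matrices G(l), randomly initialized.
words = {w: rng.normal(scale=0.1, size=d) for w in ["ostrich", "bird", "penguin"]}
labels = {l: rng.normal(scale=0.1, size=(d, d)) for l in ["is-a", "is-a-large", "flightless"]}

# Edges of the relational graph: (u, v, label, co-occurrence strength w).
edges = [
    ("ostrich", "bird", "is-a-large", 0.8),
    ("penguin", "bird", "is-a", 0.7),
    ("ostrich", "penguin", "flightless", 0.5),
]

def objective(words, labels, edges):
    """Eq. 1: 1/2 * sum over edges of (x(u)^T G(l) x(v) - w)^2."""
    total = 0.0
    for u, v, l, w in edges:
        score = words[u] @ labels[l] @ words[v]
        total += (score - w) ** 2
    return 0.5 * total

print(objective(words, labels, edges))
```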
30. Optimization
• The objective function in Eq. 1 (previous slide) is non-convex in the variables x(u), G(l), and x(v) jointly.
• However, if any two of these variables are held fixed, the objective becomes convex in the remaining variable (provided that G(l) is positive semidefinite).
• Therefore, the objective can be optimized by taking partial derivatives with respect to each variable in turn and applying stochastic gradient descent, as sketched below.
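A rough sketch of one such update (assumptions mine: plain SGD with a fixed learning rate, a single edge, and no projection of G(l) onto the positive semidefinite cone). Each partial derivative is taken with the other variables held fixed, as described above.

```python
import numpy as np

def sgd_step(xu, xv, G, w, lr=0.01):
    """One gradient step on 1/2 * (x(u)^T G x(v) - w)^2 for a single edge."""
    r = xu @ G @ xv - w                      # residual of the bilinear score
    xu_new = xu - lr * r * (G @ xv)          # partial derivative w.r.t. x(u)
    xv_new = xv - lr * r * (G.T @ xu)        # partial derivative w.r.t. x(v)
    G_new  = G  - lr * r * np.outer(xu, xv)  # partial derivative w.r.t. G(l)
    return xu_new, xv_new, G_new

rng = np.random.default_rng(1)
d = 4
xu, xv = rng.normal(scale=0.5, size=d), rng.normal(scale=0.5, size=d)
G = np.eye(d)
for _ in range(500):
    xu, xv, G = sgd_step(xu, xv, G, w=0.8)
print(xu @ G @ xv)   # should approach the target weight 0.8
```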
32. Deriving relations from words
• v(king) - v(man) should represent the relation between king and man; if it did not, analogy questions could not be answered (relational similarity could not be measured).
• If so, taking the difference between the meaning representation vectors of word pairs linked by a particular relation should let us build a representation of that relation. [Bollegala+ IJCAI-15]
|R(p)| = \sum_{(u,v) \in R(p)} f(p, u, v)   (3)

We represent a word x using a vector x ∈ R^d. The dimensionality of the representation, d, is a hyperparameter of the proposed method. Prior work on word representation learning has observed that the difference between the vectors that represent two words closely approximates the semantic relations that exist between those two words. For example, the vector v(king) - v(queen) has been shown to be similar to the vector v(man) - v(woman). We use this property to represent a pattern p by a vector p ∈ R^d as the weighted sum of differences between the two words in all word-pairs (u, v) that co-occur with p as follows:

p = \frac{1}{|R(p)|} \sum_{(u,v) \in R(p)} f(p, u, v)\,(u - v).   (4)
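An illustrative numpy sketch of Eq. 4: a pattern vector built as the PPMI-weighted average of word-vector differences over the pairs in R(p). The vectors, pairs, and weights are invented for the example.

```python
import numpy as np

# Toy word vectors (learned in the actual method; fixed here for illustration).
vec = {
    "lion":    np.array([1.0, 0.2, 0.1]),
    "cat":     np.array([0.6, 0.1, 0.1]),
    "ostrich": np.array([0.2, 1.0, 0.3]),
    "bird":    np.array([0.1, 0.6, 0.2]),
}

# R(p) for the pattern "large Ys such as Xs": word pairs with PPMI weights f(p, u, v).
R_p = [(("lion", "cat"), 0.9), (("ostrich", "bird"), 0.4)]

def pattern_vector(R_p, vec):
    """Eq. 4: p = (1/|R(p)|) * sum_{(u,v) in R(p)} f(p,u,v) * (u - v)."""
    norm = sum(f for _, f in R_p)                        # |R(p)| = sum of the weights
    p = sum(f * (vec[u] - vec[v]) for (u, v), f in R_p)
    return p / norm

print(pattern_vector(R_p, vec))
```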
33. Learning the meaning representations
[Figure 1: Computing the similarity between two patterns. The word pairs (lion, cat) and (ostrich, bird), represented by vectors x1, x2, x3, x4, co-occur with the patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y" with weights f(p1, x1, x2) and f(p2, x3, x4); the pattern similarity is scored by σ(p1ᵀp2).]
For example, consider Fig. 1, where the two word-pairs (lion, cat) and (ostrich, bird) co-occur respectively with the two lexical patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y". Assuming that there are no other co-occurrences between word-pairs and patterns in the corpus, the representations of the patterns p1 and p2 are given respectively by p1 = x1 - x2 and p2 = x3 - x4. We measure the relational similarity between (x1, x2) and (x3, x4) using the inner product p1ᵀp2.

(Slide annotations: a relation is represented as a set of lexical patterns, and the relation between u and v is given by the "subtraction" of their meaning representation vectors.)
(i.e., the sequence of tokens that appear in between the two words in a context). Although we use lexical patterns as features for representing semantic relations in this work, our proposed method is not limited to lexical patterns and can in principle be used with any type of features to represent relations. The strength of association between a word pair (u, v) and a pattern p is measured using the positive pointwise mutual information (PPMI), f(p, u, v), which is defined as follows:

f(p, u, v) = \max\left(0,\; \log \frac{g(p, u, v)\, g(*, *, *)}{g(p, *, *)\, g(*, u, v)}\right).   (1)

Here, g(p, u, v) denotes the number of co-occurrences between p and (u, v), and * denotes the summation taken over all words (or patterns) corresponding to the slot variable. We represent a pattern p by the set R(p) of word-pairs (u, v) for which f(p, u, v) > 0. Formally, we define R(p) and its norm |R(p)| as follows:

R(p) = \{(u, v) \mid f(p, u, v) > 0\}   (2)

|R(p)| = \sum_{(u,v) \in R(p)} f(p, u, v)   (3)
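A small sketch of the PPMI weight f(p, u, v) of Eq. 1, assuming the co-occurrence counts g(p, u, v) have already been collected into a dictionary keyed by (pattern, word-pair); the counts are hypothetical.

```python
import math

# Hypothetical co-occurrence counts g(p, u, v) between patterns and word pairs.
counts = {
    ("large Ys such as Xs", ("lion", "cat")): 8,
    ("large Ys such as Xs", ("ostrich", "bird")): 2,
    ("X is a huge Y", ("ostrich", "bird")): 5,
}

def ppmi(counts, p, pair):
    """f(p, u, v) = max(0, log(g(p,u,v) * g(*,*,*) / (g(p,*,*) * g(*,u,v))))."""
    g_puv = counts.get((p, pair), 0)
    if g_puv == 0:
        return 0.0
    g_all = sum(counts.values())                                # g(*,*,*)
    g_p = sum(c for (q, _), c in counts.items() if q == p)      # g(p,*,*)
    g_uv = sum(c for (_, w), c in counts.items() if w == pair)  # g(*,u,v)
    return max(0.0, math.log((g_puv * g_all) / (g_p * g_uv)))

print(ppmi(counts, "large Ys such as Xs", ("lion", "cat")))
```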
We model the problem of learning word representations as a binary classification task, where we learn representations for words such that they can be used to accurately predict whether a given pair of patterns is relationally similar. In our previous example, we would learn representations for the four words lion, cat, ostrich, and bird such that the similarity between the two patterns "large Ys such as Xs" and "X is a huge Y" is maximized. Later in Section 3.1, we propose an unsupervised method for selecting relationally similar (positive) and dissimilar (negative) pairs of patterns as training instances to train a binary classifier.
Let us denote the target label for two patterns p1, p2 by t(p1, p2) ∈ {1, 0}, where the value 1 indicates that p1 and p2 are relationally similar, and 0 otherwise. We compute the prediction loss for a pair of patterns (p1, p2) as the squared loss between the target and the predicted labels as follows:

L(t(p_1, p_2), p_1, p_2) = \frac{1}{2} \left( t(p_1, p_2) - \sigma(p_1^\top p_2) \right)^2.   (5)

Different non-linear functions can be used as the prediction function σ(·), such as the logistic sigmoid, hyperbolic tangent, or rectified linear units. In our preliminary experiments we found the hyperbolic tangent, tanh, given by

\sigma(\theta) = \tanh(\theta) = \frac{\exp(\theta) - \exp(-\theta)}{\exp(\theta) + \exp(-\theta)}   (6)

to work particularly well among those different non-linear functions.
The gradients of the loss with respect to the pattern representations are given by:

\frac{\partial L}{\partial p_1} = \sigma'(p_1^\top p_2)\,\left(\sigma(p_1^\top p_2) - t(p_1, p_2)\right)\, p_2,   (8)

\frac{\partial L}{\partial p_2} = \sigma'(p_1^\top p_2)\,\left(\sigma(p_1^\top p_2) - t(p_1, p_2)\right)\, p_1.   (9)

Here, σ' denotes the first derivative of tanh, which is given by 1 - \sigma(\theta)^2. To simplify the notation we drop the arguments of the loss function.
From Eq. 4 we get:

\frac{\partial p_1}{\partial x} = \frac{1}{|R(p_1)|}\left(h(p_1, u = x, v) - h(p_1, u, v = x)\right),   (10)

\frac{\partial p_2}{\partial x} = \frac{1}{|R(p_2)|}\left(h(p_2, u = x, v) - h(p_2, u, v = x)\right),   (11)

where

h(p, u = x, v) = \sum_{(x,v) \in \{(u,v) \mid (u,v) \in R(p),\, u = x\}} f(p, x, v),

and

h(p, u, v = x) = \sum_{(u,x) \in \{(u,v) \mid (u,v) \in R(p),\, v = x\}} f(p, u, x).

Substituting the partial derivatives given by Eqs. 8-11 in Eq. 7 we get:

\frac{\partial L}{\partial x} = \delta(p_1, p_2)\left[ H(p_1, x) \sum_{(u,v) \in R(p_2)} f(p_2, u, v)(u - v) + H(p_2, x) \sum_{(u,v) \in R(p_1)} f(p_1, u, v)(u - v) \right],   (12)

where the scalar factor δ(p_1, p_2) collects the σ' and (σ - t) terms from Eqs. 8 and 9, and H(p, x) collects the normalized pair-weight coefficients from Eqs. 10 and 11.
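A compact numpy sketch (mine, for illustration) of the squared loss of Eq. 5 with the tanh prediction function of Eq. 6, together with the pattern-level gradients of Eqs. 8 and 9.

```python
import numpy as np

def loss_and_grads(p1, p2, t):
    """Squared loss (Eq. 5) with sigma = tanh (Eq. 6) and its gradients
    with respect to the pattern vectors p1 and p2 (Eqs. 8 and 9)."""
    s = np.tanh(p1 @ p2)                 # predicted relational similarity
    loss = 0.5 * (t - s) ** 2
    d_sigma = 1.0 - s ** 2               # first derivative of tanh
    grad_p1 = d_sigma * (s - t) * p2     # Eq. 8
    grad_p2 = d_sigma * (s - t) * p1     # Eq. 9
    return loss, grad_p1, grad_p2

p1 = np.array([0.4, -0.2, 0.1])          # e.g. x(lion) - x(cat)
p2 = np.array([0.3, -0.1, 0.2])          # e.g. x(ostrich) - x(bird)
print(loss_and_grads(p1, p2, t=1))       # the two patterns are relationally similar
```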
36. JointReps
• Predict the words that co-occur within the same sentence in the corpus, and minimize the resulting error (the objective function).
• Incorporate the semantic relations defined in a lexicon (WordNet) as constraints.
We then extract unigrams from the co-occurrence windows as the corresponding context words. We down-weight distant (and potentially noisy) co-occurrences using the reciprocal 1/l of the distance in tokens l between the two words that co-occur.
A word w_i is assigned two vectors w_i and w̃_i denoting whether w_i is respectively the target of the prediction (corresponding to the rows of X), or in the context of another word (corresponding to the columns of X). The GloVe objective can then be written as:

J_C = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2   (1)

Here, b_i and b̃_j are real-valued scalar bias terms that adjust for the difference between the inner product and the logarithm of the co-occurrence counts. The function f discounts the co-occurrences between frequent words and is given by:

f(t) = \begin{cases} (t / t_{\max})^{\alpha} & \text{if } t < t_{\max} \\ 1 & \text{otherwise} \end{cases}   (2)
We set t_max = 100 in our experiments. The objective function defined by (1) encourages the learning of word representations that demonstrate the desirable property that the vector difference between the word embeddings for two words represents the semantic relations that exist between those two words. For example, Mikolov et al. [2013c] observed that the difference between the word embeddings for the words king and man, when added to the word embedding for the word woman, yields a vector similar to that of queen.
Unfortunately, the objective function given by (1) does not capture the semantic relations that exist between w_i and w_j as specified in the lexicon S. Consequently, it considers all co-occurrences equally and is likely to encounter problems when the co-occurrences are rare. To overcome this problem we propose a regularizer, J_S, by considering the three-way co-occurrence among words w_i, w_j, and a semantic relation R that exists between the target word w_i and one of its context words w_j in the lexicon as follows:

J_S = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} R(i, j) \left\| w_i - \tilde{w}_j \right\|^2   (3)

Here, R(i, j) is a binary function that returns 1 if the semantic relation R exists between the words w_i and w_j in the lexicon, and 0 otherwise.
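A toy numpy sketch (my illustration; the coefficient `lam`, the matrices, and the relation pairs are invented) of how the corpus objective J_C of Eq. 1 and the lexical regularizer J_S of Eq. 3 can be combined into one joint objective.

```python
import numpy as np

def glove_objective(W, W_ctx, b, b_ctx, X, t_max=100.0, alpha=0.75):
    """J_C of Eq. 1 with the frequency discount f of Eq. 2."""
    J = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] == 0:
                continue
            f = (X[i, j] / t_max) ** alpha if X[i, j] < t_max else 1.0
            err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
            J += 0.5 * f * err ** 2
    return J

def lexicon_regularizer(W, W_ctx, R):
    """J_S of Eq. 3: pull together the target/context vectors of lexicon-related words."""
    return 0.5 * sum(np.sum((W[i] - W_ctx[j]) ** 2) for i, j in R)

rng = np.random.default_rng(0)
V, d = 5, 3
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
X = rng.integers(0, 10, size=(V, V)).astype(float)   # toy co-occurrence counts
R = [(0, 1), (2, 3)]                                  # toy synonym pairs from a lexicon
lam = 0.1
print(glove_objective(W, W_ctx, b, b_ctx, X) + lam * lexicon_regularizer(W, W_ctx, R))
```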
37. Measuring semantic similarity between words
Table 1: Performance of the proposed method with different semantic relation types.
Method RG MC RW SCWS MEN sem syn total SemEval
corpus only 0.7523 0.6398 0.2708 0.460 0.6933 61.49 66.00 63.95 37.98
Synonyms 0.7866 0.7019 0.2731 0.4705 0.7090 61.46 69.33 65.76 38.65
Antonyms 0.7694 0.6417 0.2730 0.4644 0.6973 61.64 66.66 64.38 38.01
Hypernyms 0.7759 0.6713 0.2638 0.4554 0.6987 61.22 68.89 65.41 38.21
Hyponyms 0.7660 0.6324 0.2655 0.4570 0.6972 61.38 68.28 65.15 38.30
Member-holonyms 0.7681 0.6321 0.2743 0.4604 0.6952 61.69 66.36 64.24 37.95
Member-meronyms 0.7701 0.6223 0.2739 0.4611 0.6963 61.61 66.31 64.17 37.98
Part-holonyms 0.7852 0.6841 0.2732 0.4650 0.7007 61.44 67.34 64.66 38.07
Part-meronyms 0.7786 0.6691 0.2761 0.4679 0.7005 61.66 67.11 64.63 38.29
The Google analogy dataset contains syntactic (syn) analogies and 8869 semantic analogies (sem). The SemEval dataset contains manually ranked word-pairs describing various semantic relation types, such as defective and agent-goal; in total there are 3218 word-pairs in the SemEval dataset. Given a proportional analogy a : b :: c : d, we compute the cosine similarity between b - a + c and d, where the boldface symbols represent the embeddings of the corresponding words. For the Google dataset we measure the accuracy for predicting the fourth word in each proportional analogy from the entire vocabulary. We use the binomial exact test with Clopper-Pearson confidence intervals to test for the statistical significance of the reported accuracy values. For SemEval we use the official evaluation tool to compute MaxDiff scores.
Table 2: Comparison against prior work.
Method RG MEN sem syn
RCM 0.471 0.501 - 29.9
R-NET - - 32.64 43.46
C-NET - - 37.07 40.06
RC-NET - - 34.36 44.42
Retro (CBOW) 0.577 0.605 36.65 52.5
Retro (SG) 0.745 0.657 45.29 65.65
Retro (corpus only) 0.786 0.673 61.11 68.14
Proposed (synonyms) 0.787 0.709 61.46 69.33
Evaluation uses the Spearman correlation between human-assigned similarity scores and the similarity scores produced by the algorithm.
Various semantic relations can be used as constraints; the synonymy relation is the most effective.
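A short scipy sketch of this evaluation protocol (the word pairs and human ratings are hypothetical): cosine similarities from the embeddings are compared against human ratings using Spearman's rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical embeddings and human similarity ratings for a few word pairs.
emb = {
    "car":   np.array([0.9, 0.1, 0.2]),
    "auto":  np.array([0.8, 0.2, 0.1]),
    "fruit": np.array([0.1, 0.9, 0.3]),
    "apple": np.array([0.2, 0.8, 0.4]),
}
pairs = [("car", "auto", 3.9), ("fruit", "apple", 3.5), ("car", "fruit", 0.5)]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

human = [score for _, _, score in pairs]
model = [cosine(emb[u], emb[v]) for u, v, _ in pairs]
rho, _ = spearmanr(human, model)   # rank correlation between humans and the model
print(rho)
```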