Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

Mohammad Taher Pilehvar, David Jurgens and Roberto Navigli
ACL 2013

Saizensen NLP Benkyokai (Cutting-Edge NLP Study Group) #5 @ Chiba, 2013/08/31
Presenter: Koji Matsuda
Revised 2013/09/03

13/09/03 snlp#5 matsuda

Semantic Textual Similarity (STS)
Measure the degree of semantic equivalence between two sentences.
NOTE: STS differs from Textual Entailment (TE) and Paraphrase (PARA)
•  TE: STS assumes symmetric and graded equivalence of the pair
•  PARA: STS incorporates graded semantic similarity [Agirre+, SemEval-2012]
→ STS is more directly applicable to a number of NLP tasks: MT, summarization, deep QA, etc.

Example
•  Surface-based approach:
   – The pair is labeled DISSIMILAR due to minimal lexical overlap
•  Sense-representation-based approach:
   – Makes it possible to consider the similarity between the meanings of words (e.g. fire and terminate)
   – But it is difficult to incorporate that information, due to polysemy and the need to represent individual senses

Semantic Similarity at Multiple Levels
[Figure: Sense–Sense, Word–Word and Text–Text pairs, each side represented by a Semantic Signature]
•  Semantic Signature: a unified semantic representation of a lexical item (an arbitrarily sized piece of text, or a sense)
1.  How to create a Semantic Signature?
2.  How to calculate the similarity of Semantic Signatures?

Overview of Proposed Method
•  Random walk over the WordNet graph
•  Compare sense-level Semantic Signatures
   – Cosine
   – Weighted Overlap
   – Top-k Jaccard
Note: figure from slide by authors

Semantic Signatures
•  Multi-seeded random walk over the WordNet graph
[Figure: Sense / Word / Text → set of sense seeds (v(0)) → random walk over the WordNet graph → Semantic Signature]
•  Semantic Signature: a multinomial distribution over senses (WordNet synsets)

Personalized PageRank
[Figure legend]
•  Yellow node: seed node (synset)
•  Red node size: probability of the synset
•  Edge: WordNet relation
Note: figure from slide by authors
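
The multi-seeded random walk can be sketched as a plain power iteration. This is a minimal illustration, not the authors' implementation: the toy graph, the synset labels, the damping value 0.85 and the iteration count are all assumptions made for the example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for v = damping * M v + (1 - damping) * v0.

    graph: dict mapping each node to a non-empty list of neighbours.
    seeds: set of seed nodes; v0 puts uniform mass on them.
    """
    nodes = sorted(graph)
    v0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    v = dict(v0)
    for _ in range(iterations):
        # Restart mass always flows back to the seed nodes.
        nxt = {n: (1.0 - damping) * v0[n] for n in nodes}
        # The remaining mass is spread evenly over outgoing edges.
        for n in nodes:
            share = damping * v[n] / len(graph[n])
            for m in graph[n]:
                nxt[m] += share
        v = nxt
    return v


# Hypothetical toy graph standing in for WordNet (labels are illustrative only).
TOY_GRAPH = {
    "manager#1": ["boss#2", "employee#1"],
    "boss#2": ["manager#1", "employee#1"],
    "employee#1": ["manager#1", "boss#2", "worker#1"],
    "worker#1": ["employee#1", "work#1"],
    "work#1": ["worker#1"],
}

signature = personalized_pagerank(TOY_GRAPH, seeds={"manager#1", "boss#2"})
```

The returned dictionary plays the role of a Semantic Signature: a probability distribution over all nodes, concentrated around the seeds.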

Alignment-Based Disambiguation
•  How to extract the "set of senses" (seeds) from a text or word?
   – Need to solve WSD
•  They proposed alignment-based WSD
   – Maximize the sum of similarities between the two texts/words
   – Can use an arbitrary similarity measure over senses

Alignment-Based Disambiguation (example, built up over several slides)
[Figure: text 1 words: manager, fire, worker; text 2 words: employee, terminate, work, boss; R(x,y) is the relatedness of x and y]
•  Word-level alignment
   – For "manager", compute R(man,emp), R(man,bos), R(man,ter), R(man,wor)
   – ← Maximum relatedness on the word level: R(man,bos)
•  Sense-level alignment
   – For manager#1, manager#2 and boss#1, boss#2, compute R(m#1,b#1), R(m#1,b#2), R(m#2,b#1), R(m#2,b#2)
   – ↑ Maximum relatedness on the sense level: R(m#1,b#2)
•  Result: word-level alignments R(man,bos), R(fir,ter), R(fir,wor), R(wor,emp); sense-level alignment R(m#1,b#2)
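
The selection step in the example above can be sketched as a nested maximization. This is an illustrative sketch only: the sense inventory and the relatedness scores are made up, and `toy_relatedness` is a stand-in for whatever sense similarity measure is plugged in.

```python
def align_and_disambiguate(words_a, words_b, senses, relatedness):
    """For each word in words_a, keep the sense that is most related to
    any sense of any word in words_b.

    Returns {word: (score, chosen_sense, aligned_sense)}.
    """
    chosen = {}
    for word in words_a:
        best = None
        for sense in senses[word]:
            for other in words_b:
                for other_sense in senses[other]:
                    score = relatedness(sense, other_sense)
                    if best is None or score > best[0]:
                        best = (score, sense, other_sense)
        chosen[word] = best
    return chosen


# Hypothetical sense inventory and scores, for illustration only.
SENSES = {
    "manager": ["manager#1", "manager#2"],
    "boss": ["boss#1", "boss#2"],
    "employee": ["employee#1"],
}
TOY_SCORES = {
    frozenset(["manager#1", "boss#1"]): 0.3,
    frozenset(["manager#1", "boss#2"]): 0.9,
    frozenset(["manager#2", "boss#1"]): 0.2,
    frozenset(["manager#2", "boss#2"]): 0.1,
    frozenset(["manager#1", "employee#1"]): 0.5,
    frozenset(["manager#2", "employee#1"]): 0.1,
}


def toy_relatedness(s1, s2):
    # Symmetric lookup; unknown pairs count as unrelated.
    return TOY_SCORES.get(frozenset([s1, s2]), 0.0)


result = align_and_disambiguate(
    ["manager"], ["employee", "boss"], SENSES, toy_relatedness
)
```

With these made-up scores, "manager" is resolved to manager#1 and aligned with boss#2, mirroring the slide's outcome. Running the same function with the two texts swapped would disambiguate the other side.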

Semantic Signature Similarity
•  How to calculate the similarity of Semantic Signatures?
   – Parametric
      •  Cosine
   – Non-parametric (rank-based)
      •  Weighted Overlap
      •  Top-k Jaccard
[Figure: two signatures over senses a, b, c, d, e are compared]

Semantic Signature Similarity
•  Weighted Overlap (ADW_WO)

   Sense       a  b  c  d  e
   Rank (r1)   2  4  1  0  3
   Rank (r2)   4  1  2  5  0

   – Only senses ranked in both signatures contribute; reading rank 0 as "absent from that signature", these are a, b, c:
     R_WO ∝ 1/(2+4) + 1/(4+1) + 1/(1+2)
   – Max when the same sense has the same rank

•  Top-k Jaccard (ADW_Jac), shown for k = 3

   Sense       a  b  c  d  e
   Rank (r1)   2  4  1  5  3
   Rank (r2)   4  1  2  5  3

     R_Jac = |{a,c,e} ∩ {b,c,e}| / |{a,c,e} ∪ {b,c,e}|
   – Max when the top-k sets contain the same senses
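
The three comparison measures can be sketched as follows. This is a minimal illustration under stated assumptions: signatures are plain dictionaries from sense to probability, the probability values are invented so that they reproduce the ranks of the Top-k Jaccard example, and the Weighted Overlap denominator (the value reached when every shared sense has identical ranks) is not shown on the slide and is assumed here.

```python
import math


def cosine(sig1, sig2):
    keys = set(sig1) | set(sig2)
    dot = sum(sig1.get(k, 0.0) * sig2.get(k, 0.0) for k in keys)
    n1 = math.sqrt(sum(x * x for x in sig1.values()))
    n2 = math.sqrt(sum(x * x for x in sig2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def ranks(sig):
    # Rank 1 is the most probable sense; zero-probability senses are unranked.
    ordered = sorted((s for s in sig if sig[s] > 0), key=lambda s: -sig[s])
    return {s: i + 1 for i, s in enumerate(ordered)}


def weighted_overlap(sig1, sig2):
    r1, r2 = ranks(sig1), ranks(sig2)
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    numerator = sum(1.0 / (r1[s] + r2[s]) for s in shared)
    # Assumed normalizer: best case where shared senses hold ranks 1..|shared| in both.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return numerator / denominator


def top_k_jaccard(sig1, sig2, k):
    top1 = {s for s, r in ranks(sig1).items() if r <= k}
    top2 = {s for s, r in ranks(sig2).items() if r <= k}
    union = top1 | top2
    return len(top1 & top2) / len(union) if union else 0.0


# Invented probabilities whose ranks match the slide: r1 = (2,4,1,5,3), r2 = (4,1,2,5,3).
sig1 = {"a": 0.30, "b": 0.10, "c": 0.40, "d": 0.05, "e": 0.15}
sig2 = {"a": 0.10, "b": 0.40, "c": 0.30, "d": 0.05, "e": 0.15}

jac = top_k_jaccard(sig1, sig2, k=3)   # |{c, e}| / |{a, b, c, e}|
wo = weighted_overlap(sig1, sig2)
cos = cosine(sig1, sig2)
```

The rank-based measures depend only on the ordering of senses, so they are insensitive to how peaked the probability values are, unlike cosine.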

Overview of Proposed Method
•  Random walk over the WordNet graph
•  Compare sense-level Semantic Signatures
   – Cosine
   – Weighted Overlap
   – Top-k Jaccard
Note: figure from slide by authors

Experiments
•  Textual similarity
   – SemEval-2012 STS task [Agirre+, SemEval-2012]
•  Word similarity
   – TOEFL dataset
   – RG-65 dataset
•  Sense similarity
   – Sense coarsening (OntoNotes, Senseval-2)

Textual Similarity
•  SemEval-2012 STS task (Task 17)
•  Model
   –  Regression (Gaussian process)
   –  Features
      •  Main: ADW_cos, ADW_WO, ADW_Jac (k = 250, 500, 1000, 2500)
      •  String-based: longest common subsequence (substring), greedy string tiling, character/word n-gram similarity
•  Data: 400 to 750 sentence pairs × 5 sets, each pair scored on a 0-5 scale

Example sentence pairs (the per-pair scores are not recoverable from the extracted table):

   id | Sentence pair
   1  | The bird is bathing in the sink. / Birdie is washing itself in the water basin.
   2  | In May 2010, the troops attempted to invade Kabul. / The US army invaded Kabul on May 7th last year, 2010.
   3  | John said he is considered a witness but not a suspect. / "He is not a suspect anymore." John said.
   4  | They flew out of the nest in groups. / They flew into the nest together.
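
Two of the string-based features named above can be sketched in a few lines. This is an illustration under assumptions: the slide does not say how the features are normalized or tokenized, so whitespace tokenization and a Jaccard-style n-gram overlap are choices made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap of character n-gram sets (an assumed normalization)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


s1 = "They flew out of the nest in groups."
s2 = "They flew into the nest together."

# Word-level LCS: "They flew the nest".
word_lcs = lcs_length(s1.split(), s2.split())
char_sim = ngram_jaccard(s1.lower(), s2.lower())
```

Surface features like these score the pair highly even though the two sentences describe opposite movements, which is one reason to combine them with sense-level features.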

Textual Similarity Performance
[Table 2: Pearson correlation coefficient]
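
For reference, the evaluation measure named in the table caption can be computed as below. The score lists are invented purely to exercise the function and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical gold scores (0-5) and system predictions.
gold = [5.0, 4.0, 3.0, 2.0, 0.5]
predicted = [4.6, 4.1, 2.5, 2.8, 1.0]
r = pearson(gold, predicted)
```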

Textual Similarity (detail)
•  Mpar: MSR Paraphrase Corpus (web news); contains many named entities
•  Mvid: MSR Video Paraphrase Corpus
•  SMTe: French-to-English SMT output paired with the reference translation, from the Europarl corpus [ACL 2007, 2008 SMT Workshop]
•  SMTn: same as SMTe, but the News Conversation corpus is used
•  OnWN: glosses from OntoNotes and WordNet

Textual Similarity (detail)
•  DW: without performing any alignment
•  ADW-MF: main features only (does not use the string-based features)
•  Alignment is helpful
•  On the Mpar dataset (which contains many named entities), the string-based method is a strong baseline
[Figure: per-dataset results, with an annotation marking "improve"]

Word Similarity
•  TOEFL dataset [Landauer and Dumais, 1997]
   – Synonym selection task
   – 80 multiple-choice questions
      •  4 choices per question
•  RG-65 dataset [Rubenstein and Goodenough, 1965]
   – Similarity grading for word pairs
   – 65 word pairs
      •  Judged by 51 human subjects
   – Scale: 0-4
Note: figure from slide by authors
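
A synonym selection question can be answered by picking the choice most similar to the stem word. The sketch below assumes some word-level similarity function is available; the question and the scores are made up and do not come from the TOEFL dataset.

```python
def answer_question(stem, choices, similarity):
    """Return the choice with the highest similarity to the stem."""
    return max(choices, key=lambda choice: similarity(stem, choice))


def accuracy(questions, similarity):
    correct = sum(
        answer_question(q["stem"], q["choices"], similarity) == q["answer"]
        for q in questions
    )
    return correct / len(questions)


# Hypothetical similarity scores standing in for a real measure.
TOY_SIMILARITY = {
    ("fire", "terminate"): 0.8,
    ("fire", "hire"): 0.3,
    ("fire", "build"): 0.1,
    ("fire", "paint"): 0.05,
}


def toy_similarity(w1, w2):
    return TOY_SIMILARITY.get((w1, w2), 0.0)


questions = [
    {
        "stem": "fire",
        "choices": ["hire", "terminate", "build", "paint"],
        "answer": "terminate",
    }
]
picked = answer_question("fire", questions[0]["choices"], toy_similarity)
acc = accuracy(questions, toy_similarity)
```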

Word Similarity (TOEFL)
[Figure: results on the TOEFL dataset]

Word Similarity (RG-65)
[Figure: results on the RG-65 dataset]

Sense Similarity
•  Coarsening the WordNet sense inventory
Note: figure from slide by authors

Sense Coarsening
•  Binary classification (can two senses be merged or not?), evaluated by F-score
•  Onto: OntoNotes [Hovy+, 2006]
•  SE-2: Senseval-2 sense grouping set [Kilgarriff, 2001]
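
The merge decision and its evaluation can be sketched as thresholding a sense-pair similarity and computing the F-score against gold groupings. The threshold rule, the threshold value and all numbers below are assumptions for illustration; the slide does not state how the decision is made.

```python
def predict_merges(similarities, threshold):
    """Predict 'mergeable' for every sense pair whose similarity reaches the threshold."""
    return [sim >= threshold for sim in similarities]


def f_score(predicted, gold):
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical similarities for five sense pairs and their gold merge labels.
pair_similarities = [0.9, 0.4, 0.2, 0.7, 0.8]
gold_labels = [True, True, False, False, True]

predictions = predict_merges(pair_similarities, threshold=0.5)
score = f_score(predictions, gold_labels)
```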

Conclusions
•  Unified approach for computing semantic similarity at multiple lexical levels
   – Based on a random walk over the WordNet graph
   – Alignment-based word sense disambiguation
   – Similarity measure based on the ranking of senses
•  Achieves state-of-the-art performance in three tasks
   – Similarity judgment tasks (sense, word, text)

My Comment
•  ☺ I think this method provides a simple but powerful representation of semantics for relatively long sentences as well as individual words or word senses
   – ☺ As a result, this method expands the range of STS problems that can be solved
   – ☹ But it ignores word order and the parse tree, which I think are important for representing short phrases or compounds
•  Actually, this work is simply a combination of Personalized PageRank-based WSD [Agirre and Soroa, EACL 2009] and word-level alignment for similarity calculation [Corley and Mihalcea, ACL 2005]
•  ☹ Viewed from the perspective of compositional semantics, I think this work makes an incorrect assumption
   –  Let S(x) be the Semantic Signature of x; do they suppose S(xy) ∝ S(x)+S(y)?
      •  e.g. S(red car) ∝ S(red) + S(car)?
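
The question about S(xy) can be made concrete for the random-walk part alone. The following is a sketch in my own notation (damping factor c, transition matrix M, seed distribution v^(0)), assuming the standard Personalized PageRank fixed point; none of these symbols appear on the slide.

```latex
\begin{align*}
v &= c\,M v + (1-c)\,v^{(0)}
  \;\Longrightarrow\;
  v = (1-c)\,(I - cM)^{-1}\, v^{(0)} \\
v^{(0)}_{xy} &= \lambda\, v^{(0)}_{x} + (1-\lambda)\, v^{(0)}_{y}
  \;\Longrightarrow\;
  S(xy) = \lambda\, S(x) + (1-\lambda)\, S(y),
  \qquad 0 \le \lambda \le 1
\end{align*}
```

Because the stationary distribution is linear in the seed vector, additivity holds whenever the seed senses of xy are exactly the seed senses of x together with those of y. Whether the seeds stay the same is a separate matter, since the disambiguation step depends on the text being compared against.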

Toward STS with Various Clues
[Figure: clues for STS (Surface, Syntax, Word Sense, Domain Knowledge) arranged along Explicit–Implicit and Concrete–Abstract axes, with "This Work" marked. Labeled directions: Compositional Semantics; Automatically Extending Lexical Resources; Robust Similarity Measures; Named Entity Linking to Knowledge Base]

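Replies to Comments Received / Other Notes
•  Are all links between synsets used? (Prof. Inui)
   –  The original Personalized PageRank-based WSD paper [Agirre and Soroa, 09] states that all relations were used (this paper follows it)
   –  However, it may well be true that some links, such as antonymy, should not simply be propagated over
•  Is it a general property that blurring the meaning (propagating it to the surrounding synsets) improves WSD performance? (Prof. Inui)
   –  In knowledge-based WSD, the incompleteness of the knowledge base (sparseness, low coverage) is often a problem, and soft information is commonly used to mitigate its effect
•  Is alignment also performed in the word-to-word case? (Matsubara-san)
   –  Yes, in practice the alignment is performed at the sense level (the figure did not explain this well enough)
•  Because the alignment takes the "maximum" (it looks for the most favorable interpretation), it can be said to compute something like a "lower bound" on similarity
   –  When polysemy is an issue, it seems it may overestimate
•  Because the model defines similarity for "pairs" of sentences or words, it is difficult to use the representation on its own
   –  One option is to use a pair formed with the gloss of a WordNet synset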
