The document presents a unified approach for measuring semantic similarity between lexical items at multiple levels (sense, word, text) using semantic signatures. Words and texts are first aligned and disambiguated to extract the sense "seeds"; a multi-seeded random walk over the WordNet graph then turns those seeds into a semantic signature. Signatures are compared with cosine similarity, Weighted Overlap, or top-k Jaccard. The resulting framework can be applied to a range of NLP tasks.
Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
1. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Mohammad Taher Pilehvar, David Jurgens and Roberto Navigli
ACL 2013
State-of-the-Art NLP Study Group (最先端NLP勉強会) #5 @ Chiba, 2013/08/31
Presenter: Koji Matsuda
13/09/03 snlp#5 matsuda
Revised 2013/09/03
2. Semantic Textual Similarity (STS)
Measure the degree of semantic equivalence between two sentences
NOTE: STS differs from Textual Entailment (TE) and Paraphrase detection (PARA)
• vs. TE: STS assumes symmetric, graded equivalence of the pair
• vs. PARA: STS incorporates graded semantic similarity rather than a binary decision [Agirre+, SemEval-2012]
→ STS is more directly applicable to a number of NLP tasks: MT, summarization, deep QA, etc.
3. Example
• Surface-based approach:
  – labels the pair DISSIMILAR because of minimal lexical overlap
• Sense-representation-based approach:
  – can take the similarity between word meanings into account
  – (e.g. fire and terminate)
  – but this information is difficult to incorporate
  – because of polysemy and the need to represent individual senses
4. Semantic Similarity at Multiple Levels
[Figure: similarity is measured between pairs at each level: Sense ↔ Sense, Word ↔ Word, Text ↔ Text]
5. Semantic Similarity at Multiple Levels
[Figure: Sense, Word and Text pairs are each mapped to a pair of Semantic Signatures]
Semantic Signature: a unified semantic representation of a lexical item (an arbitrarily sized piece of text, or a sense)
1. How to create a Semantic Signature?
2. How to calculate the similarity of Semantic Signatures?
6. Overview of Proposed Method
[Figure: random walk over the WordNet graph, then comparison of sense-level Semantic Signatures]
Compare:
- Cosine
- Weighted Overlap
- Top-k Jaccard
Note: figure from slide by authors
7. Semantic Signatures
• Multi-seeded random walk over the WordNet graph
[Figure: Sense / Word / Text → set of senses, the seeds (v(0)) → random walk over the WordNet graph → Semantic Signature (multinomial distribution over senses, i.e. WordNet synsets)]
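The walk that turns the seed vector v(0) into a signature can be written as the usual Personalized PageRank update. This is a sketch of the standard formulation rather than a quotation from the slide: the transition matrix M and the damping factor α are symbols assumed here (α is commonly set around 0.85).

```latex
% Standard Personalized PageRank update (assumed notation):
%   v^{(0)} : seed distribution over synsets (uniform over the seed senses)
%   M       : column-stochastic transition matrix of the WordNet graph
%   \alpha  : damping factor, i.e. probability of following an edge
v^{(t)} = (1 - \alpha)\, v^{(0)} + \alpha\, M\, v^{(t-1)}
```

Iterating until v(t) stops changing gives a probability distribution over all synsets, which is what the slide calls the Semantic Signature.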
8. Personalized PageRank
[Figure legend]
Yellow node: seed node (synset)
Red node size: probability of the synset
Edge: WordNet relation
Note: figure from slide by authors
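A minimal sketch of Personalized PageRank by power iteration, to make the figure legend concrete. The toy graph, the sense labels in it, the damping factor of 0.85 and the fixed 50 iterations are all illustrative assumptions; a real run would use the full WordNet graph.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iterations=50):
    """Power iteration for v(t) = (1 - alpha) * v(0) + alpha * M * v(t-1).

    graph: dict mapping each node to a non-empty list of neighbours.
    seeds: set of seed nodes; v(0) is uniform over them.
    """
    nodes = sorted(graph)
    # v(0): all probability mass sits on the seed senses.
    v0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    v = dict(v0)
    for _ in range(iterations):
        # Restart part: jump back to the seeds with probability 1 - alpha.
        nxt = {n: (1.0 - alpha) * v0[n] for n in nodes}
        # Walk part: each node passes its mass equally to its neighbours.
        for n in nodes:
            share = alpha * v[n] / len(graph[n])
            for neighbour in graph[n]:
                nxt[neighbour] += share
        v = nxt
    return v


# Hypothetical toy graph (undirected adjacency lists), not real WordNet data.
TOY_GRAPH = {
    "manager#1": ["boss#2", "employee#1"],
    "boss#2": ["manager#1", "employee#1"],
    "employee#1": ["manager#1", "boss#2", "worker#1"],
    "worker#1": ["employee#1", "work#1"],
    "work#1": ["worker#1"],
}

signature = personalized_pagerank(TOY_GRAPH, seeds={"manager#1"})
for sense, prob in sorted(signature.items(), key=lambda kv: -kv[1]):
    print(f"{sense:12s} {prob:.3f}")
```

The returned dictionary plays the role of the red node sizes in the figure: nodes close to the yellow seed node receive more probability mass.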
9. Alignment-Based Disambiguation
• How to extract the "set of senses" (seeds) from a text/word?
– Need to solve WSD
• They propose alignment-based WSD
– Maximize the sum of similarities between the two texts/words
– Can use an arbitrary similarity measure over senses
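One way to write the sense selection step down, as a reading of the bullets above rather than a formula taken from the slide: for each word in one text, keep the sense that is most related to any sense of any word in the other text. The symbols T_1, T_2, Senses(·) and P_i are notation assumed here; R is the sense-level relatedness measure.

```latex
% Assumed notation: T_1, T_2 are the two texts, Senses(w) the candidate
% senses of word w, R(s, s') a relatedness score between two senses.
P_i \;=\; \operatorname*{arg\,max}_{s \,\in\, \mathrm{Senses}(w_i)}
      \;\max_{w_j \in T_2}\; \max_{s' \in \mathrm{Senses}(w_j)} R(s, s'),
\qquad w_i \in T_1
```

The chosen senses P_i for all words of a text form its seed set; swapping the roles of T_1 and T_2 yields the seeds for the other text.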
11. Alignment-Based Disambiguation
Text 1: manager, fire, worker
Text 2: employee, terminate, work, boss
Candidate alignments for "manager": R(man,emp), R(man,bos), R(man,ter), R(man,wor)
Word-level alignment ← maximum relatedness at the word level
12. Alignment-Based Disambiguation
Text 1: manager, fire, worker
Text 2: employee, terminate, work, boss
Word-level alignment: R(man,bos)
Sense-level alignment: manager#1, manager#2 vs. boss#1, boss#2
R(m#1,b#1), R(m#1,b#2), R(m#2,b#1), R(m#2,b#2)
13. Alignment-Based Disambiguation
Text 1: manager, fire, worker
Text 2: employee, terminate, work, boss
Word-level alignment: R(man,bos)
Sense-level alignment: manager#1, manager#2 vs. boss#1, boss#2
R(m#1,b#1), R(m#1,b#2), R(m#2,b#1), R(m#2,b#2)
↑ Maximum relatedness at the sense level
14. Alignment-Based Disambiguation
Text 1: manager, fire, worker
Text 2: employee, terminate, work, boss
Word-level alignment: R(man,bos), R(fir,ter), R(fir,wor), R(wor,emp)
Sense-level alignment: manager#1, manager#2 vs. boss#1, boss#2
R(m#1,b#2)
15. Alignment-Based Disambiguation
Text 1: manager, fire, worker
Text 2: employee, terminate, work, boss
Word-level alignment: R(man,bos), R(fir,ter), R(fir,wor), R(wor,emp)
Sense-level alignment: manager#1, manager#2 vs. boss#1, boss#2
R(m#1,b#2)
Result:
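The walkthrough on these slides can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the relatedness scores in `SCORES` are made-up numbers, and the function collapses the word-level and sense-level steps into one search for the best-scoring sense pair.

```python
# Hypothetical sense-level relatedness scores; real scores would come from
# comparing the senses' semantic signatures.
SCORES = {
    ("manager#1", "boss#1"): 0.2,
    ("manager#1", "boss#2"): 0.9,
    ("manager#2", "boss#1"): 0.1,
    ("manager#2", "boss#2"): 0.3,
}


def relatedness(a, b):
    """Symmetric lookup; unknown pairs count as unrelated (0.0)."""
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


def align_and_disambiguate(senses_t1, senses_t2, relatedness):
    """For each word of text 1, keep the sense that is most related to
    any sense of any word in text 2."""
    chosen = {}
    for word, senses in senses_t1.items():
        best_sense, best_score = None, float("-inf")
        for sense in senses:
            for other_senses in senses_t2.values():
                for other in other_senses:
                    score = relatedness(sense, other)
                    if score > best_score:
                        best_sense, best_score = sense, score
        chosen[word] = best_sense
    return chosen


# Candidate senses per word (toy inventory, not real WordNet sense numbers).
TEXT_1 = {"manager": ["manager#1", "manager#2"]}
TEXT_2 = {"boss": ["boss#1", "boss#2"], "employee": ["employee#1"]}

seeds_1 = align_and_disambiguate(TEXT_1, TEXT_2, relatedness)
seeds_2 = align_and_disambiguate(TEXT_2, TEXT_1, relatedness)
print(seeds_1)  # senses chosen for text 1
print(seeds_2)  # senses chosen for text 2
```

With these toy scores the pair (manager#1, boss#2) wins, matching the R(m#1,b#2) alignment shown on the slide. Running the function in both directions gives one seed set per text.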
16. Semantic Signature Similarity
• How to calculate the similarity of Semantic Signatures?
– Parametric
  • Cosine
– Non-parametric (rank-based)
  • Weighted Overlap
  • Top-k Jaccard
[Figure: two signatures over senses a, b, c, d, e are compared]
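The parametric option, cosine similarity, can be sketched directly on signatures stored as sparse dictionaries. The two example signatures below are invented probabilities over the senses a to e from the figure, not values from the paper.

```python
import math


def cosine_similarity(sig1, sig2):
    """Cosine of the angle between two sparse probability vectors.

    sig1, sig2: dicts mapping sense -> probability; missing senses are 0.
    """
    dot = sum(p * sig2.get(sense, 0.0) for sense, p in sig1.items())
    norm1 = math.sqrt(sum(p * p for p in sig1.values()))
    norm2 = math.sqrt(sum(p * p for p in sig2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical signatures over the senses a..e.
SIG_1 = {"a": 0.3, "b": 0.1, "c": 0.4, "e": 0.2}
SIG_2 = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}

print(round(cosine_similarity(SIG_1, SIG_2), 4))
```

Cosine uses the actual probability values, which is why the slide calls it parametric; the rank-based measures only look at the ordering of senses.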
17. Semantic Signature Similarity
• Weighted Overlap (ADWWO)
Sense      a  b  c  d  e
Rank (r1)  2  4  1  0  3
Rank (r2)  4  1  2  5  0
Rwo = 1 / ((2+4) + (4+1) + (1+2))
Max when the same sense has the same rank
• Top-k Jaccard (ADWJac)
Sense      a  b  c  d  e
Rank (r1)  2  4  1  5  3
Rank (r2)  4  1  2  5  3
Rjac = |{a,c,e} ∩ {b,c,e}| / |{a,c,e} ∪ {b,c,e}|
Max when the top-k sets contain the same senses
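Both rank-based measures can be tried on the rank tables from this slide. The Top-k Jaccard function follows the set expression on the slide with k = 3. The Weighted Overlap function is an assumption: it uses the formulation in which inverse rank sums over the shared senses are added up and divided by the best attainable value, which is more detailed than the one-line expression on the slide. A rank of 0 is read as "sense absent from the signature".

```python
def weighted_overlap(rank1, rank2):
    """Sum of 1 / (r1 + r2) over senses present in both signatures,
    normalised by the value reached when shared senses hold ranks
    1..|H| in both signatures. Rank 0 means the sense is absent."""
    shared = [s for s in rank1 if rank1[s] > 0 and rank2.get(s, 0) > 0]
    if not shared:
        return 0.0
    score = sum(1.0 / (rank1[s] + rank2[s]) for s in shared)
    best = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return score / best


def top_k_jaccard(rank1, rank2, k):
    """Jaccard index of the two sets of top-k ranked senses."""
    top1 = {s for s, r in rank1.items() if 0 < r <= k}
    top2 = {s for s, r in rank2.items() if 0 < r <= k}
    union = top1 | top2
    return len(top1 & top2) / len(union) if union else 0.0


# Rank tables copied from the slide.
WO_R1 = {"a": 2, "b": 4, "c": 1, "d": 0, "e": 3}
WO_R2 = {"a": 4, "b": 1, "c": 2, "d": 5, "e": 0}
JAC_R1 = {"a": 2, "b": 4, "c": 1, "d": 5, "e": 3}
JAC_R2 = {"a": 4, "b": 1, "c": 2, "d": 5, "e": 3}

print(weighted_overlap(WO_R1, WO_R2))
print(top_k_jaccard(JAC_R1, JAC_R2, k=3))
```

For the Jaccard table the top-3 sets are {a, c, e} and {b, c, e}, so the score is 2/4. Comparing a rank table with itself gives 1.0 for both measures, in line with the "Max when ..." notes on the slide.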
18. Overview of Proposed Method
[Figure: random walk over the WordNet graph, then comparison of sense-level Semantic Signatures]
Compare:
- Cosine
- Weighted Overlap
- Top-k Jaccard
Note: figure from slide by authors
20. Textual Similarity
• SemEval 2012 STS task (Task 6)
• Model
– Regression (Gaussian Process)
– Features
  • Main: ADWcos, ADWWO, ADWJac (k = 250, 500, 1000, 2500)
  • String-based: longest common subsequence (substring), Greedy String Tiling, character/word n-gram similarity
id  Sentence  Score (0-5)
1  The bird is bathing in the sink.
0  Birdie is washing itself in the water basin.
2  In May 2010, the troops attempted to invade Kabul.
1  The US army invaded Kabul on May 7th last year, 2010.
3  John said he is considered a witness but not a suspect.
2  "He is not a suspect anymore." John said.
4  They flew out of the nest in groups.
3  They flew into the nest together.
400 to 750 pairs × 5 sets
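To give a feel for the string-based features listed above, here is a sketch of two of them: a token-level longest common subsequence ratio and a character n-gram Jaccard score. These are simplified stand-ins written for illustration; the feature names, the normalisation and the trigram size are assumptions, not the extractors used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard index of the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def string_features(s1, s2):
    """Two illustrative surface features for a sentence pair."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    longest = max(len(t1), len(t2))
    return {
        "lcs_tokens": lcs_length(t1, t2) / longest if longest else 0.0,
        "char_trigram_jaccard": ngram_jaccard(s1.lower(), s2.lower(), 3),
    }


features = string_features(
    "They flew out of the nest in groups.",
    "They flew into the nest together.",
)
print(features)
```

In a setup like the one on the slide, such values would be concatenated with the ADW similarity scores and passed to the regressor as one feature vector per sentence pair.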
22. Textual Similarity (detail)
Mpar: MSR Paraphrase Corpus (web news); contains many named entities
Mvid: MSR Video Paraphrase Corpus
SMTe: pairs of French-to-English SMT output and reference translation from the Europarl corpus [ACL 2007, 2008 SMT Workshop]
SMTn: same as SMTe, but a news conversation corpus is used
OnWN: glosses from OntoNotes and WordNet
23. Textual Similarity (detail)
DW: without performing any alignment
ADW-MF: main features only (string-based features are not used)
• Alignment is helpful
• On the Mpar dataset (which contains many named entities), the string-based method is a strong baseline
[Results table annotation: improve]
24. Word Similarity
• TOEFL dataset [Landauer and Dumais, 1997]
– Synonym selection task
– 80 multiple-choice questions
  • 4 choices per question
• RG-65 dataset [Rubenstein and Goodenough, 1965]
– Similarity grading for word pairs
– 65 word pairs
  • Judged by 51 human subjects
– Scale 0-4
Note: figure from slide by authors
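The synonym selection task reduces to picking the choice with the highest similarity to the target word. The sketch below shows that decision rule and the accuracy computation; the question and the similarity scores are invented for illustration and are not taken from the TOEFL dataset.

```python
def choose_synonym(target, choices, similarity):
    """Return the choice that is most similar to the target word."""
    return max(choices, key=lambda choice: similarity(target, choice))


def accuracy(questions, similarity):
    """Fraction of questions answered correctly.

    questions: list of (target, choices, correct_answer) tuples.
    """
    correct = sum(
        1
        for target, choices, answer in questions
        if choose_synonym(target, choices, similarity) == answer
    )
    return correct / len(questions)


# Hypothetical similarity scores standing in for signature similarity.
TOY_SIMILARITY = {
    ("big", "large"): 0.92,
    ("big", "tiny"): 0.35,
    ("big", "green"): 0.05,
    ("big", "fast"): 0.10,
}


def toy_similarity(w1, w2):
    return TOY_SIMILARITY.get((w1, w2), 0.0)


# One made-up question in the 4-choice format described on the slide.
QUESTIONS = [("big", ["large", "tiny", "green", "fast"], "large")]

print(choose_synonym("big", QUESTIONS[0][1], toy_similarity))
print(accuracy(QUESTIONS, toy_similarity))
```

Any word-level similarity function can be plugged in through the `similarity` argument, so the same harness works for each of the three signature comparison measures.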
28. Sense Coarsening
Onto: OntoNotes [Hovy+, 2006]
SE-2: Senseval-2 sense grouping set [Kilgarriff, 2001]
Binary classification (can two senses be merged or not?), evaluated by F-score
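The binary merge decision and its F-score evaluation can be sketched as follows. Deciding by a threshold on the similarity of the two senses' signatures is an assumption made for this illustration, and the similarity values, gold labels and the threshold of 0.5 are invented.

```python
def should_merge(similarity, threshold):
    """Predict 'merge' when two senses are similar enough (assumed rule)."""
    return similarity >= threshold


def f_score(predicted, gold):
    """F1 of the positive ('merge') class."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sense pairs: signature similarity and gold merge label.
SIMILARITIES = [0.9, 0.7, 0.4, 0.2]
GOLD = [True, False, True, False]

predictions = [should_merge(s, threshold=0.5) for s in SIMILARITIES]
print(predictions)
print(f_score(predictions, GOLD))
```

Here one merge is predicted correctly, one is a false alarm and one is missed, so precision and recall are both 0.5.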
29. Conclusions
• Unified approach for computing semantic similarity at multiple lexical levels
– Based on random walks over the WordNet graph
– Alignment-based Word Sense Disambiguation
– Similarity measures based on the ranking of senses
• Achieves state-of-the-art performance in three tasks
– Similarity judgment tasks (sense, word, text)
30. My Comment
• ☺ I think this method provides a simple but powerful semantic representation for relatively long sentences as well as for individual words and word senses
– ☺ As a result, it expands the types of STS problems that can be solved
– ☹ But it ignores word order and parse trees, which I think are important for representing short phrases and compounds
• Actually, this work simply combines Personalized PageRank-based WSD [Agirre and Soroa, EACL 2009] with word-level alignment for similarity calculation [Corley and Mihalcea, ACL 2005]
• ☹ From the perspective of compositional semantics, I think this work makes an incorrect assumption
– Letting S(x) be the Semantic Signature of x, do they assume S(xy) ∝ S(x) + S(y)?
  • e.g. S(red car) ∝ S(red) + S(car)?
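The additivity questioned above follows from the linearity of Personalized PageRank in its seed vector, as the short derivation below shows. It is a sketch under assumptions: M and α denote the transition matrix and damping factor of the walk (notation assumed here), and the seed senses of x and y are taken as fixed, whereas alignment-based disambiguation may pick different senses depending on the text being compared.

```latex
% Stationary solution of v = (1 - \alpha) v^{(0)} + \alpha M v :
S \;=\; (1 - \alpha)\,(I - \alpha M)^{-1}\, v^{(0)}
% This is linear in v^{(0)}. If the seed vector of the phrase xy is the
% average of the seed vectors of x and y,
v^{(0)}_{xy} \;=\; \tfrac{1}{2}\bigl(v^{(0)}_{x} + v^{(0)}_{y}\bigr)
% then the signatures combine in the same way:
S(xy) \;=\; \tfrac{1}{2}\bigl(S(x) + S(y)\bigr)
```

Under these assumptions the signature of a phrase is a weighted sum of the signatures of its parts, so any non-additive interaction between the words is not represented.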
31. Toward STS with Various Clues
[Figure: map of clues for STS. Axes: Explicit ↔ Implicit and Concrete ↔ Abstract. Clues: Surface, Syntax, Word Sense, Domain Knowledge. "This Work" is marked on the map. Other directions: Compositional Semantics; Automatically Extending Lexical Resources; Robust Similarity Measures; Named Entity Linking to Knowledge Base]